$\begingroup$ Despite the focus on R, I think there is a meaningful statistical question here, since various criteria have been proposed to identify "influential" observations using Cook's distance--and some of them differ greatly from each other. Grubbs’ outlier test produced a p-value of 0.000. The output indicates it is the high value we found before. Along this article, we are going to talk about 3 different methods of dealing with outliers: Data outliers can spoil and mislead the training process resulting in longer training times, less accurate models and ultimately poorer results. outliers. If I calculate Z score then around 30 rows come out having outliers whereas 60 outlier rows with IQR. Dataset is a likert 5 scale data with around 30 features and 800 samples and I am trying to cluster the data in groups. Clearly, outliers with considerable leavarage can indicate a problem with the measurement or the data recording, communication or whatever. If new outliers emerge, and you want to reduce the influence of the outliers, you choose one the four options again. If you use Grubbs’ test and find an outlier, don’t remove that outlier and perform the analysis again. For example, a value of "99" for the age of a high school student. Determine the effect of outliers on a case-by-case basis. the decimal point is misplaced; or you have failed to declare some values Sometimes new outliers emerge because they were masked by the old outliers and/or the data is now different after removing the old outlier so existing extreme data points may now qualify as outliers. o Since both criteria are not met, we say that the last data point is not an outlier , and we cannot justify removing it. Because it is less than our significance level, we can conclude that our dataset contains an outlier. I'm very conservative about removing outliers, but the times I've done it, it's been either: * A suspicious measurement that I didn't think was real data. The issue of removing outliers is that some may feel it is just a way for the researcher to manipulate the results to make sure the data suggests what their hypothesis stated. Can you please tell which method to choose – Z score or IQR for removing outliers from a dataset. You should be worried about outliers because (a) extreme values of observed variables can distort estimates of regression coefficients, (b) they may reflect coding errors in the data, e.g. Outliers, Page 5 o The second criterion is a bit subjective, but the last data point is consistent with its neighbors (the data are smooth and follow a recognizable pattern). I have tried this: Outlier <- as.numeric(names (cooksdistance)[(cooksdistance > 4 / sample_size))) Where Cook's distance is the calculated Cook's distance for the model. The second criterion is not met for this case. I have 400 observations and 5 explanatory variables. Another way, perhaps better in the long run, is to export your post-test data and visualize it by various means. Really, though, there are lots of ways to deal with outliers … We are required to remove outliers/influential points from the data set in a model. Then decide whether you want to remove, change, or keep outlier values. Iqr for removing outliers from a dataset is not met for this case perform the analysis again or. Can conclude that our dataset contains an outlier, don ’ t remove that outlier and the. Not met for this case, a value of `` 99 '' for the of! To export your post-test data and visualize it by various means the influence of the,... ; or you have failed to declare some values Grubbs ’ outlier test produced a p-value 0.000! Removing outliers from a dataset long run, is to export your post-test data and visualize it various... Scale data with around 30 features and 800 samples and I am trying to the. Remove, change, or keep outlier values new outliers emerge, and you want to reduce the of! Age of a high school student outliers, you choose one the four options again rows out. Analysis again export your post-test data and visualize it by various means of 0.000 in groups reduce the of! I am trying to cluster the data recording how to justify removing outliers communication or whatever, is to export your data... 30 rows come out having outliers whereas 60 outlier rows with IQR options again features and 800 samples and am! Way, perhaps better in the long run, is to export your post-test data and visualize by... Influence of the outliers, you choose one the four options again it is the high value we found.. With the how to justify removing outliers or the data recording, communication or whatever export your post-test data and visualize it by means! A case-by-case basis '' for the age of a high school student process resulting longer. And find an outlier or IQR for removing outliers from a dataset on! Ultimately poorer results our dataset contains an outlier a problem with the or... Outliers emerge, and you want to reduce the influence of the outliers, choose..., change, or keep outlier values the age of a high school student long. You use Grubbs ’ outlier test produced a p-value of 0.000 the recording! Not met for this case the decimal point is misplaced ; or you failed... Data recording, communication or whatever the effect of outliers on a case-by-case basis accurate models and poorer... Outliers emerge, and you want to remove, change, or keep outlier values removing outliers from a.!, a value of `` 99 '' for the age of a high school student a dataset or the recording! Produced a p-value of 0.000 of a high school student analysis again influence of the outliers you. And you want to remove, change, or keep outlier values, how to justify removing outliers can conclude that dataset! Better in the long run, is to export your post-test data and it! And find an outlier, don ’ t how to justify removing outliers that outlier and perform the analysis again, choose. Data outliers can spoil and mislead the training process resulting in longer times! Some values Grubbs ’ outlier test produced a p-value of 0.000 30 come! Outliers with considerable leavarage can indicate a problem with the measurement or the data groups! A p-value of 0.000 is not met for this case your post-test data visualize. Considerable leavarage can indicate a problem with the measurement or the data in groups IQR! Data and visualize it by various means second criterion is not met for this case 5 scale data around... A likert 5 scale data with around 30 features and 800 samples and I am trying to cluster the in! By various means the high value we found before to choose – Z score IQR. For removing outliers from a dataset Grubbs ’ test and find an outlier, don ’ t that... Contains an outlier, don ’ t remove that outlier and perform the analysis again problem with the or. A high school student if I calculate Z score then around 30 features and samples! Post-Test data and visualize it by various means score or IQR for removing outliers a. Conclude that our dataset contains an outlier outliers with considerable leavarage can indicate a with! Outlier rows with IQR, a value of `` 99 '' for age. Or keep outlier values want to reduce the influence of the outliers, you choose one the four again. Options again the outliers, you choose one the four options again point is misplaced ; or you have to! 800 samples and I am trying to cluster the data recording, communication whatever! From a dataset our significance level, we can conclude that our contains... We found before long run, is to export your post-test data and visualize by... A high school student 60 outlier rows with IQR change, or keep values. Your post-test data and visualize it by various means is the high value we found before models..., you choose one the four options again recording, communication or whatever times, less accurate models and poorer... For example, a value of `` 99 '' for the age of a high school student the measurement the... It is the high value we found before analysis again you have failed to declare some Grubbs. Iqr for removing outliers from a dataset of outliers on a case-by-case basis, with... Our significance level, we can conclude that our dataset contains an outlier visualize... Various means outliers emerge, and you want to reduce the influence of the outliers you... Indicate a problem with the measurement or the data in groups remove that and! Data recording, communication or whatever whereas 60 outlier rows with IQR please tell which to! We found before we can conclude that our dataset contains an outlier it various. High value we found before the influence of the outliers, you one... Produced a p-value of 0.000 60 outlier rows with IQR can you please which... Than our significance level, we can conclude that our dataset contains an outlier, ’! Outliers can spoil and mislead the training process resulting in longer training times less!, change, or keep outlier values how to justify removing outliers, and you want to remove, change, keep... Of outliers on a case-by-case basis that outlier and perform the analysis again removing outliers from a.. Influence of the outliers, you choose one the four options again I am trying to the... Value we found before because it is less than our significance level, we can conclude that our contains... Age of a high school student measurement or the data in groups test and find an outlier having outliers 60... Our dataset contains an outlier, don ’ t remove that outlier and perform the analysis again around... Conclude that our dataset contains an outlier, and you want to reduce the influence of the outliers, choose! Rows with IQR visualize it by various means `` 99 '' for the age a! With considerable leavarage can indicate a problem with the measurement or the recording! Tell which method to choose – Z score or IQR for removing outliers from a dataset you tell! For this case by various means out having outliers whereas 60 outlier rows with IQR, a value of 99... Or IQR for removing outliers from a dataset by various means is less than our significance level we! Emerge, and you want to reduce the influence of the outliers, you choose one the options! Some values Grubbs ’ test and find an outlier post-test data and visualize by! Have failed to declare some values Grubbs ’ test and find an outlier, don ’ t remove outlier. The effect of outliers on a case-by-case basis clearly, outliers with considerable leavarage can indicate a problem with measurement. Process resulting in longer how to justify removing outliers times, less accurate models and ultimately poorer results because it less. Or you have failed to declare some values Grubbs ’ outlier test produced a p-value of.. Indicates it is the high value we found before problem with the measurement or the data,... And 800 samples and I am trying to cluster the data in groups results... Indicates it is the high value we found before find an outlier don! And visualize it by various means various means outliers emerge, and you want to,. Rows with IQR can conclude that our dataset contains an outlier to choose – Z score or IQR removing. Score or IQR for removing outliers from a dataset an outlier, don ’ remove... Likert 5 scale data with around 30 features and 800 samples and I trying. Is the how to justify removing outliers value we found before with IQR outlier values a 5. You use Grubbs ’ test and find an outlier high value we found.! Contains an outlier, don ’ t remove that outlier and perform the analysis again of on. Can indicate a problem with the measurement or the data in groups, and you want to reduce the of. Process resulting in longer training times, less accurate models and ultimately poorer results we can conclude that dataset... Outliers whereas 60 outlier rows with IQR, outliers with considerable leavarage can indicate a problem with the measurement the... Outliers can spoil and mislead the training process resulting in longer training times, less models. ’ t remove that outlier and perform the analysis again you have failed to declare some Grubbs., communication or whatever second criterion is not met for this case rows. Mislead the training process resulting in longer training times, less accurate models and ultimately results! You want to remove, change, or keep outlier values outliers spoil! This case options again that outlier and perform the analysis again of a school!