The problem is that it is hard to know which data points are outliers, as demonstrated in the following simulation.
# load required libraries
library(ggplot2)
library(retimes)
## Reaction Time Analysis (version 0.1-2)
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
# Simulate a population of "good" reaction times:
# generate an ex-Gaussian population:
rt_dist1 <- rexgauss(100000, 300, 100, 200, positive = FALSE)
# keep positive values only:
rt_dist1 <- rt_dist1[rt_dist1 > 0]
# give it a nicer name:
Population_of_Good_Reaction_Times <- rt_dist1
# Mean and Histogram of "good" RT distribution
mean(Population_of_Good_Reaction_Times)
## [1] 500.8441
hist(Population_of_Good_Reaction_Times, xlab = "RT in milliseconds")
# Simulate a distribution of outliers
rt_outliers <- rexgauss(1000, 450, 100, 600, positive = FALSE)
rt_outliers <- rt_outliers[rt_outliers > 0]
# give it a nicer name:
Population_of_Outliers <- rt_outliers
# Mean and Histogram of "outlier" distribution
mean(Population_of_Outliers)
## [1] 1044.028
hist(Population_of_Outliers, xlab = "RT in milliseconds")
Those two distributions look pretty different. But what does it look like when you have a sample (from an experiment) that contains a mixture of “real” RT responses and “outliers”?
Let’s see what a sample of 100 reaction times without outliers looks like:
# Set sample size
sampleN <- 100
# Take a sample of "good" reaction times (without outliers)
Sample_Data <- sample(rt_dist1, sampleN)
mean(Sample_Data)
## [1] 499.7751
hist(Sample_Data, xlab = "RT in milliseconds", main="Sample Data with no Outliers")
Now let’s see what that sample would look like if 10% of the RT responses were replaced by outliers:
# Create a sample of "outliers"
# set the proportion of the sample to replace with outliers
proportion_outliers <- 0.1
# calculate number of outliers to select
nOutliers <- round(proportion_outliers * sampleN)
# select the sample of outliers
Outliers <- sample(rt_outliers, nOutliers)
# Replace part of the sample with outliers
# remove the "good" data that will be replaced by outliers:
GoodData <- sample(Sample_Data, length(Sample_Data) - nOutliers)
s <- data.frame(RT = GoodData, Group = "GoodData")
o <- data.frame(RT = Outliers, Group = "Outliers")
Sample_Data_with_Outliers <- rbind(s, o)
# Mean and Histogram for Sample Data with Outliers
mean(Sample_Data_with_Outliers$RT)
## [1] 519.8143
hist(Sample_Data_with_Outliers$RT, xlab = "RT in milliseconds", main="Sample Data with Outliers")
It does look a little different from the sample without outliers - there are more data points in the right tail, and the tail extends further. Probably some of those points should be excluded. But where would you make your cutoff?
# Dot Plot of the individual data points. Which are outliers?
dotPlot <- ggplot(Sample_Data_with_Outliers, aes(x=RT)) +
geom_dotplot( stackdir = 'center')
dotPlot
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
Here are the actual identities of the outliers in the simulation. One thing to notice is that the good and bad data often overlap - there may be no way to exclude all the outliers without also excluding some genuine data points. Another thing to keep in mind is that in real life we NEVER get to know which points are really outliers.
# Dot Plot of the individual data points with actual outliers shown
dotPlot <- ggplot(Sample_Data_with_Outliers, aes(x=RT, fill=Group)) +
geom_dotplot( stackdir = 'center')
dotPlot
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
Common approaches to choosing a cutoff include:

- Excluding a set percentage of the data (usually 2% - 10%), e.g., trimming the slowest responses
- The Tukey boxplot method, which flags points beyond 1.5 times the interquartile range (IQR) from the quartiles

Whatever approach you end up taking, clearly state what you did and why.
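To make these rules concrete, here is a sketch in base R (simulated as normal + exponential noise so it runs without retimes, standing in for the sample constructed above). The 5% trimming proportion is just an illustrative choice, not a recommendation.

```r
# Sketch: two common cutoff rules applied to an ex-Gaussian-like sample.
# The data here are a base-R stand-in for the retimes-based sample above.
set.seed(1)
rts <- c(rnorm(90, 300, 100) + rexp(90, 1/200),   # 90 "good" RTs
         rnorm(10, 450, 100) + rexp(10, 1/600))   # 10 "outliers"

# 1. Trim a set percentage of the data (here the slowest 5%):
keep_pct <- rts <= quantile(rts, 0.95)

# 2. Tukey boxplot rule: exclude points beyond 1.5 * IQR from the quartiles:
q <- quantile(rts, c(0.25, 0.75))
step <- 1.5 * (q[2] - q[1])
keep_tukey <- rts >= q[1] - step & rts <= q[2] + step

sum(!keep_pct)    # how many responses the percentage rule excludes
sum(!keep_tukey)  # how many responses the Tukey rule excludes
```

Note that neither rule knows which points are "really" outliers - each simply draws a line in the right tail, and they will generally disagree about where that line goes.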
A related issue is the skewed, non-normal shape of RT distributions. Options include:

- Ignore it - ANOVA is relatively robust against violations of normality (Glass et al., 1972; Harwell et al., 1992; Lix et al., 1996).
- Transform the DV to make it more normal before analyzing with ANOVA, LMM, etc. (e.g., the inverse transformation 1 / RT, the Box-Cox procedure to select a power transformation, or many others).
- Use a generalized linear model (GLM) or generalized linear mixed model (GLMM) and specify a non-Gaussian (non-normal) distributional assumption for the DV, such as the inverse Gaussian (e.g., Lo & Andrews, 2015). These generalized models do not require normal (Gaussian) data; you specify the distribution as part of the model.
- In a GLM or GLMM, specify a link function to transform the DV (this may be unnecessary if an appropriate distributional assumption has been specified; Lo & Andrews recommend the identity link function with an inverse Gaussian distributional assumption for additive-factors RT research, for example).
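As a sketch of the GLM route (not Lo & Andrews' exact analysis): base R's glm() supports an inverse Gaussian family with an identity link, which matches their recommendation for additive-factors designs. The data and the two-level condition factor below are simulated stand-ins.

```r
# Sketch: GLM with an inverse Gaussian family and identity link, in the
# spirit of Lo & Andrews (2015). Data are simulated, not a real experiment.
set.seed(2)
condition <- factor(rep(c("easy", "hard"), each = 50))
# ex-Gaussian-like RTs in ms, with a 50 ms additive effect of condition:
rt <- rnorm(100, 300, 50) + rexp(100, 1/200) + 50 * (condition == "hard")

# The identity link keeps the condition effect additive on the millisecond
# scale; the inverse Gaussian family accommodates the positive skew of RTs.
fit <- glm(rt ~ condition, family = inverse.gaussian(link = "identity"))
coef(summary(fit))
```

No transformation of the raw RTs is needed here: the skew is handled by the distributional assumption, so the coefficient for condition is interpretable directly in milliseconds.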
Glass, G. V., Peckham, P. D., & Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying the fixed effects analyses of variance and covariance. Review of Educational Research, 42, 237-288.
Harwell, M. R., Rubinstein, E. N., Hayes, W. S., & Olds, C. C. (1992). Summarizing Monte Carlo results in methodological research: The one- and two-factor fixed effects ANOVA cases. Journal of Educational Statistics, 17, 315-339.
Lix, L. M., Keselman, J. C., & Keselman, H. J. (1996). Consequences of assumption violations revisited: A quantitative review of alternatives to the one-way analysis of variance F test. Review of Educational Research, 66, 579-619.
Lo, S., & Andrews, S. (2015). To transform or not to transform: Using generalized linear mixed models to analyse reaction time data. Frontiers in Psychology, 6, 1171. doi:10.3389/fpsyg.2015.01171
Ratcliff, R. (1993). Methods for dealing with reaction time outliers. Psychological Bulletin, 114(3), 510-532.
Baayen, R. H., & Milin, P. (2010). Analyzing reaction times. International Journal of Psychological Research, 3(2), 12-28.