TUTORIAL: Regression in R

The following is a tutorial on how to conduct a regression in R. Regression is a method used to analyze paired continuous variables, in which a dependent Y variable is predicted as a linear function of an independent X variable.

Before beginning, you should have R, RStudio, and the ggplot2 package downloaded and ready for use. See my Beginning Work in R Tutorial.

Note that the code is included in the gray boxes below to make it easy to cut and paste. The explanations are interspersed in regular html.

To begin, you will need to set the working directory, open the file, and attach the variables. To set the working directory, use the setwd function. Select the location on your computer where you created a folder for the data and output files. R will look for your data file and save output files in this folder. I created a folder specifically for this regression tutorial in my R_work folder.


setwd("C:/1awinz/R_work/regression_tutorial")


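The setwd function does not print anything when it succeeds. To confirm that the working directory was set correctly, run getwd(), which prints the current working directory.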
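
Before reading the data, it can help to confirm that R can actually see the file. The lines below are a minimal sketch that assumes you are using the example file name from this tutorial; substitute your own file name if it differs.

```r
# Print the current working directory.
getwd()

# Check whether the data file is visible from the working directory.
# The file name is the example file used in this tutorial; substitute your own.
file.exists("data_RS_M_SL_HL.txt")
```

If file.exists returns FALSE, the file is not in the working directory or its name is spelled differently.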

Then we will open our data file. Data files can be generated in Excel by saving spreadsheets as tab-delimited text files. My example data file is called "data_RS_M_SL_HL.txt". It includes standard length (SL) and head length (HL) data for male threespine stickleback from an anadromous Alaskan population in Rabbit Slough. The data are from a study published by Aguirre and Akinpelu (2010) on sexual dimorphism of head length in threespine stickleback. We will use the SL and HL data for males from one population in this tutorial for simplicity. The data file is located on GitHub if you want to download it and use it in this tutorial. This data file has headers (variable names).

If you are using your own data file, make sure to indicate whether it has headers, attach the data so that the variables can be referred to by name, and list the variable names.

Use the following commands substituting the name of your data file if necessary:


read.table("data_RS_M_SL_HL.txt", header=T)
data=read.table("data_RS_M_SL_HL.txt", header=T)
attach(data)
names(data)


The first command, "read.table("data_RS_M_SL_HL.txt", header=T)", should result in a listing of your data. The second command, "data=read.table("data_RS_M_SL_HL.txt", header=T)", stores the data table under the name "data"; header=T (T is short for TRUE) indicates that the first row lists the variable names. The third command, "attach(data)", makes the columns of the data table available by their variable names, so you can type SL instead of data$SL. The final command, "names(data)", lists the variable names. You should see "Sex" "Spec" "SL" "HL" as output for the variable names if you are using my example data file.
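
As a hedged illustration of the reading step, the sketch below writes a tiny tab-delimited file to a temporary location and reads it back. The numbers are made up for illustration and are not the stickleback data; only the column names follow the example file. The name example_data is likewise just a placeholder.

```r
# Write a tiny made-up table to a temporary file so the example runs anywhere.
# These numbers are illustrative only; they are NOT the stickleback data.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Sex\tSpec\tSL\tHL",
             "M\t1\t60.1\t18.2",
             "M\t2\t65.4\t19.9",
             "M\t3\t70.3\t20.5"), tmp)

# Read the tab-delimited file; header = TRUE is the long form of header = T.
example_data <- read.table(tmp, header = TRUE, sep = "\t")

str(example_data)     # variable names and types
head(example_data)    # first rows of the table
example_data$SL       # a column can be used without attach() via $
```

Checking the structure with str is a quick way to confirm that SL and HL were read as numbers rather than text.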

Now that we have our data ready, let's talk about regression. Regression is a simple method used with continuous bivariate data in which one variable, Y or the dependent variable, is modelled as a linear function of an X or independent variable. The function follows the form:

Y = a + bX

where a is the Y intercept, the value of Y at which the modelled line crosses the Y axis (that is, the predicted value of Y when X = 0), and b is the slope of the line, the change in Y per unit increase in X.

The predicted values of Y are symbolized as Y^ or "Y hat" and fall exactly on a straight line. The deviations between the observed and predicted values of Y are called residuals and represent the unexplained variation in Y. See the figure below:
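
To make a, b, the predicted values, and the residuals concrete, here is a small sketch that computes them by hand using the standard least-squares formulas (the criterion lm uses to fit the line). The x and y values are made up for illustration and are not the stickleback data.

```r
# Made-up illustrative values; NOT the stickleback data.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares slope and intercept computed from their standard formulas.
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

y_hat <- a + b * x        # predicted values, which fall on the line
residuals_y <- y - y_hat  # observed minus predicted

# The same coefficients from lm(), for comparison.
coef(lm(y ~ x))
```

For these made-up values the slope works out to 1.99 and the intercept to 0.05, and the residuals sum to zero, as they always do for a least-squares line with an intercept.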







In R, a regression can be conducted simply by using the following expression:


model1<-lm(HL~SL)


The function lm is telling R to create a linear model of our dependent Y variable HL as a function of the independent X variable, SL. The "model1<-" is telling R to save the model under the name "model1".
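
As an alternative sketch, lm can be told where to find the variables through its data argument, which avoids relying on attach. The data frame below uses made-up values (not the stickleback data), and the names example_data and example_model are placeholders.

```r
# Made-up illustrative values; NOT the stickleback data.
example_data <- data.frame(SL = c(60, 64, 68, 72, 76),
                           HL = c(18.4, 19.6, 20.1, 21.3, 22.0))

# Passing data = ... tells lm() where to find HL and SL, so attach() is not needed.
example_model <- lm(HL ~ SL, data = example_data)

fitted(example_model)      # predicted HL for each observed SL
residuals(example_model)   # observed HL minus predicted HL

# Predicted HL for a new, hypothetical SL value of 70.
predict(example_model, newdata = data.frame(SL = 70))
```

The saved model object can be passed to other functions, which is why it is convenient to store it under a name.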

To see what this relationship looks like, use the following function:


library(ggplot2)
ggplot(data, aes(x=SL, y=HL)) + geom_point() + geom_smooth(method=lm) + theme_classic()







This code uses the "ggplot" function from the ggplot2 package to create our plot. We define the X (SL) and Y (HL) variables, tell ggplot to create a scatter plot using the "geom_point" function, and fit a straight line through it using the "geom_smooth" function. We specify that a straight line should be fitted by indicating "method=lm". A confidence band (95% by default) is drawn around the regression line. To remove it, specify se=FALSE in the geom_smooth function. Finally, "theme_classic()" makes the background white instead of the default gray.
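
A variation on the plot, sketched here with made-up values rather than the stickleback data, removes the confidence band and adds axis titles. The axis label text and the object names are illustrative choices, not part of the original analysis.

```r
library(ggplot2)

# Made-up illustrative values; NOT the stickleback data.
example_data <- data.frame(SL = c(60, 64, 68, 72, 76),
                           HL = c(18.4, 19.6, 20.1, 21.3, 22.0))

# se = FALSE drops the confidence band and labs() sets the axis titles.
p <- ggplot(example_data, aes(x = SL, y = HL)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Standard length (SL)", y = "Head length (HL)") +
  theme_classic()
p
```

Storing the plot in an object such as p makes it easy to display it again or add further layers later.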

Besides the plot, you can generate statistics for the regression as follows. To list the Y intercept "a" and the slope "b", type the following function:


coefficients(model1)


This gives the following output in which the first number, 5.561..., is the Y intercept (a) and the second, 0.216..., is the slope (b):






You can also use the "summary" function.


summary(model1)






This also gives the Y intercept and slope, their respective standard errors, and t and P values testing whether they differ significantly from 0. The t test of the slope (0.216 ± 0.02986, estimate ± standard error) can be used as a test of the significance of the regression. R² values, the proportion of the variation in Y explained by X, are also given (multiple R² = 0.5222 and adjusted R² = 0.5122), and indicate that about half of the variation in head length is explained by variation in standard length. Finally, the F statistic for an F test of significance of the regression is given with its degrees of freedom (F-statistic: 52.45, DF: 1,48, P: 3.128e-09). The results indicate that the regression is highly significant: variation in the standard length (body length) of stickleback explains a significant component of the variation in head length.
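
The individual pieces of the summary can also be pulled out by name, which is handy when reporting results. The sketch below uses made-up values (not the stickleback data) and placeholder object names; it also checks that, with a single X variable, the squared t value of the slope equals the F statistic.

```r
# Made-up illustrative values; NOT the stickleback data.
example_data <- data.frame(SL = c(60, 64, 68, 72, 76, 80),
                           HL = c(18.4, 19.6, 20.1, 21.3, 22.0, 23.4))
example_model <- lm(HL ~ SL, data = example_data)
s <- summary(example_model)

s$coefficients     # estimates, standard errors, t values, and P values
s$r.squared        # multiple R-squared
s$adj.r.squared    # adjusted R-squared
s$fstatistic       # F value with numerator and denominator degrees of freedom

# 95% confidence intervals for the intercept and slope.
confint(example_model)

# With a single X variable, the slope's t value squared equals the F statistic.
t_slope <- s$coefficients["SL", "t value"]
all.equal(t_slope^2, unname(s$fstatistic["value"]))
```

This equivalence is why the t test of the slope and the F test of the regression give the same P value in a regression with one independent variable.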

Another way to get the results of the significance test for the regression in the format of an ANOVA table is to use the "anova" function as follows:


anova(model1)
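
The ANOVA table reports the sums of squares for the regression and the residuals, and R² can be recovered from them. The sketch below shows this with made-up values (not the stickleback data) and placeholder object names.

```r
# Made-up illustrative values; NOT the stickleback data.
example_data <- data.frame(SL = c(60, 64, 68, 72, 76, 80),
                           HL = c(18.4, 19.6, 20.1, 21.3, 22.0, 23.4))
example_model <- lm(HL ~ SL, data = example_data)

tab <- anova(example_model)
tab

# Sums of squares for the regression (SL row) and the residuals.
ss_regression <- tab["SL", "Sum Sq"]
ss_residual   <- tab["Residuals", "Sum Sq"]

# R-squared is the regression sum of squares as a share of the total.
r2_from_anova <- ss_regression / (ss_regression + ss_residual)
r2_from_anova
summary(example_model)$r.squared
```

The last two lines print the same value, which ties the ANOVA table back to the R² reported by the summary function.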










SUGGESTED READING:

For a general treatment of statistical tests and regression, see:

-Whitlock, M., and D. Schluter. 2015. The analysis of biological data. Roberts and Company Publishers. Greenwood Village.

For an R focused treatment of these topics, see:

-Crawley, M.J. 2015. Statistics, an introduction using R. John Wiley & Sons. West Sussex.

OTHER REFERENCES CITED:

-Aguirre, W.E., and O. Akinpelu. 2010. Sexual dimorphism of head morphology in threespine stickleback. Journal of Fish Biology 77:802-821.



Date last modified: Feb/28/20
Date created: Feb/28/20 (by: Windsor Aguirre)