Working with Quantitative Data

Choosing what software to use, understanding file formats, and organizing your data.
URL: https://libguides.law.ucla.edu/data

What is a regression?

It's almost certain that most of the educators you've had over the years have encouraged you to study, arguing that studying more will help you improve your grades.  When we look at a scatterplot like the one below, our eyes tell us that there seems to be a relationship between these two variables.  As we move from left (less studying) to right (more studying), we see that most of the people who studied more have higher grades than those who studied less.  For something like a first-day-of-lecture presentation slide, a graph like this could suffice to convince students that they'd be well served to study more.

[Figure: Scatterplot showing that an increase in studying is associated with an increase in GPA]

But what if we needed to be more specific?  What if we needed to convey to students or some other audience how much a person's GPA would likely increase if they were to increase their studying by an hour per week?  Educators generally don't have the ability to enforce a recommendation to study more, nor can (or should) they run an experiment that randomly assigns students to study at different rates in order to observe differences in grades.  They also don't have the power to select a person represented by a dot in the graph and place them in an alternate universe to see what would happen to their grades if they had studied more or less.  Instead, they can use the same data presented in the scatterplot and run a regression.

There are many kinds of regression, but the one we'll discuss first is the most commonly employed: Ordinary Least Squares (OLS).  When there is only one independent variable in the model, as in our example of the relationship between studying (the independent variable) and grades (the dependent variable), we have a special case of OLS called "simple linear regression".  It can provide not just statistical evidence that the two variables are related, but also an answer to the question "if we increase a person's studying by one hour per week, then by how much would we expect their GPA to increase?"
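To make this concrete, here's a minimal sketch of a simple linear regression in Python with NumPy.  The hours and GPA figures are invented purely for illustration; they are not the data behind the scatterplot above.

```python
import numpy as np

# Hypothetical data: weekly hours studied and GPA for ten students
# (invented for illustration only).
hours = np.array([2, 4, 5, 7, 8, 10, 12, 14, 15, 18], dtype=float)
gpa = np.array([2.1, 2.4, 2.3, 2.8, 3.0, 2.9, 3.2, 3.4, 3.3, 3.7])

# Closed-form OLS estimates for a simple linear regression:
#   slope     = cov(x, y) / var(x)
#   intercept = mean(y) - slope * mean(x)
slope = np.cov(hours, gpa, ddof=1)[0, 1] / np.var(hours, ddof=1)
intercept = gpa.mean() - slope * hours.mean()

print(f"Predicted GPA = {intercept:.2f} + {slope:.2f} * hours studied per week")
```

The slope printed here is the answer to the "by how much?" question: the expected change in GPA for each additional hour of studying per week.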

OLS answers this question by estimating the equation for a straight line that minimizes the sum of squared residuals.  In plain English, that means that OLS looks at a scatterplot like the one above and draws the most appropriate straight line through it that it can.  For each value we have for "number of hours studied per week", it produces a prediction of that person's GPA.  OLS aims to produce the straight line that results in the smallest possible differences between a person's actual GPA and the GPA predicted by the model.  Consider the graph below:

Each dot on the graph represents a person's actual amount of studying and actual GPA.  The solid, diagonal line represents the OLS regression line.  The vertical, dashed lines between the regression line and the dots represent the errors (the residuals) in the model's predictions.  Some of the residuals are positive (the dot sits above the line, so the model's prediction was too low), while others are negative (the prediction was too high).  In a real case, rather than my poor drawing, the positive and negative residuals would cancel out and total exactly 0, so adding them up as-is tells us nothing about the quality of the line.  That's why we square the residuals before adding them together: squaring makes every term positive, so they can't cancel.  When I say that OLS minimizes the sum of squared residuals, I mean that of all of the lines that it could have drawn, it draws the line that has the smallest possible value for this total of the squared errors.  Once it does so, it presents us with the equation for the straight line that it estimated.
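Both claims in that paragraph are easy to verify in a short sketch (reusing the invented data from the earlier example): the residuals of the fitted line sum to zero, and any other line produces a larger sum of squared residuals.

```python
import numpy as np

# Same invented data as in the earlier sketch.
hours = np.array([2, 4, 5, 7, 8, 10, 12, 14, 15, 18], dtype=float)
gpa = np.array([2.1, 2.4, 2.3, 2.8, 3.0, 2.9, 3.2, 3.4, 3.3, 3.7])

# np.polyfit with degree 1 fits an OLS straight line.
slope, intercept = np.polyfit(hours, gpa, deg=1)
residuals = gpa - (intercept + slope * hours)

# With an intercept in the model, positive and negative residuals cancel.
print(round(residuals.sum(), 10))  # 0.0, up to floating-point rounding

# OLS minimizes the sum of squared residuals: nudging the slope does worse.
ssr = (residuals ** 2).sum()
nudged = gpa - (intercept + (slope + 0.05) * hours)
print(ssr < (nudged ** 2).sum())  # True
```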

From an earlier math class, you may recall that the formula for a straight line is Y = mX + b, where "m" is the slope of the line and "b" is the Y-intercept, meaning the point at which the line crosses the Y-axis of the graph.  While these things get new names and symbols in regression, the idea remains the same.  Just like when you would draw a graph of an equation, we need the slope of the line and where it crosses the Y-axis.  You will most frequently see regression equations written like this:

y = β₀ + β₁x + ε

where the Greek letter beta with a zero subscript (β₀) marks the y-intercept and the beta with a one subscript (β₁) marks the slope.  The only part of this that you wouldn't have seen in your early math classes is the Greek letter epsilon (ε) at the end.  This is called the "error term" and refers to the residuals that I mentioned above.  It captures all of the reasons why the dots on the scatterplot don't lie exactly on the straight line estimated by the regression.
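In practice, nobody solves for β₀ and β₁ by hand.  As a sketch of what this looks like with a standard library, here's the same fit using statsmodels, still on the invented data from the earlier examples; its summary labels β₀ as "const".

```python
import numpy as np
import statsmodels.api as sm

hours = np.array([2, 4, 5, 7, 8, 10, 12, 14, 15, 18], dtype=float)
gpa = np.array([2.1, 2.4, 2.3, 2.8, 3.0, 2.9, 3.2, 3.4, 3.3, 3.7])

# add_constant appends the column of 1s whose coefficient is the intercept β0.
X = sm.add_constant(hours)
model = sm.OLS(gpa, X).fit()

print(model.params)  # first entry is β0 (the constant), second is β1 (the slope)
print(model.resid.sum())  # the residuals (the ε part), again totaling ~0
```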

The Y-intercept, also called "the constant," is the estimated value of the dependent variable when the independent variable equals zero.  That is, if a person did not study at all, this is the GPA we would predict for them.  The slope shows what happens to the predicted value of the dependent variable when we increase the value of the independent variable by one unit.  Note that because we've drawn a straight line, the slope is the same at all values of the independent variable.  That is, this model would have us predict that increasing from one hour per week of studying to two hours per week has the same effect on a person's GPA as increasing from four hours to five, or from forty hours to forty-one.  This may raise some questions in your mind, but that's a good thing, and it's something we'll talk about in future sections.  For now, though, when someone talks about running a regression, this is what they mean.
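As one last sketch with the same invented data, both interpretations can be read straight off the fitted line: the prediction at zero hours is the constant, and a one-hour increase shifts the prediction by exactly the slope no matter where on the line we start.

```python
import numpy as np

hours = np.array([2, 4, 5, 7, 8, 10, 12, 14, 15, 18], dtype=float)
gpa = np.array([2.1, 2.4, 2.3, 2.8, 3.0, 2.9, 3.2, 3.4, 3.3, 3.7])
slope, intercept = np.polyfit(hours, gpa, deg=1)

def predict(h):
    """Predicted GPA for someone who studies h hours per week."""
    return intercept + slope * h

print(predict(0))  # the constant: predicted GPA with no studying at all
print(predict(2) - predict(1))  # equals the slope...
print(predict(41) - predict(40))  # ...the same one-hour effect anywhere on the line
```

Notice that nothing stops this straight line from predicting a GPA above 4.0 for a large enough value of hours; that's exactly the kind of question the constant-slope assumption raises.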