Working with Quantitative Data

Choosing what software to use, understanding file formats, and organizing your data.
URL: https://libguides.law.ucla.edu/data

Regression Results

Whether you are conducting your own analysis or reading results prepared or presented by someone else, you need to understand what story the results are trying to tell.  In a published article or a presentation, the results will typically appear in nicely formatted tables.  However! The contents of a formatted table can depend on the author's field of study and style preferences.  An identical regression with identical data and identical results may be presented in one of several ways.  As such, I'll present a raw version of the results directly from statistical software (Stata, in this instance), as it contains examples of all of the common presentation methods.  This means that the table below contains some redundant information.  It also contains some additional information that most users won't need and that we can ignore for now.

This example uses entirely fake data.  Suppose that we are studying whether wealthy individuals tend to receive lighter prison sentences for the same crime than low-income folks.  To study this, we might collect income and sentence length data from 500 randomly selected individuals who all committed the same offense.  We might then run a simple regression with the sentence length (in days) as the dependent variable and the person's income (in thousands of dollars per year) as the independent variable.  The raw results might look something like the table below. 

This version was created in Stata, but other statistical software would present similar information with exactly the same results.  It presents quite a lot of information and can seem overwhelming at first, but we often don't need most of it.  Below, I've broken the table into three rectangles.  The bottom rectangle is the most important for us, as it contains our estimates and information we can use to determine whether the results are statistically significant or not.  The top-right rectangle contains some information about how well the model performed, only two bits of which we'll need right now.  Finally, the rectangle on the top-left contains some further information about model performance, but is rarely needed.  In my many years of working with data, I don't think that I've encountered a need for these outside of writing exam questions for Econometrics students.  As such, we're safe to ignore the top-left for now.

Starting with the bottom rectangle, we need to define each of the components:

  1. jail_days: On the top left of this section, you will see the name of the dependent variable.  This is the phenomenon that we are trying to explain with our model.  We're studying why it is that some people receive longer sentences than others.  It's helpful to have it written here as a reminder, since we may end up running many regressions in the course of even a single project.
  2. income: This is the name of the independent variable; the one that we are using to explain why some people have longer prison sentences than others.  All of the numbers in the same row as income provide information on the estimated effect that income has on jail_days.
  3. _cons: This is how Stata marks the Y-intercept of the estimated OLS line, also referred to as "the constant".  If the independent variable is equal to 0, then this represents the predicted value of the dependent variable.  All of the numbers in the same row provide information about the constant. 
  4. Coefficient: These are our estimates.  The number next to income is the estimated slope of the line.  It will be referred to as "the estimated coefficient on income".  As described above, the number next to _cons is the constant.  We'll come back to interpretations after we finish the definitions.
  5. Std. err.: This is our measure of uncertainty.  Recall that what we get from our results are estimates.  There may be some "true" relationship between income and jail_days, but we do not know exactly what it is and almost certainly never will.  We only have data on a sample of people who committed the same crime, rather than data on every person in the population who committed the same crime.  If we had a different sample of data, we would likely get at least slightly different results just due to random differences in people.  The standard error helps us understand how much our results might vary from one sample to the next.  More importantly though, we use them in our hypothesis tests.
  6. t: This is a Student's t-test statistic.  When we run a regression, the most common tests we use to evaluate our hypotheses are t-tests.  Stata, and most (if not all) other statistical software, calculates these based on a two-sided test of the null hypothesis that the effect is 0.  We can modify this and run other tests if needed; refer back to the section on hypothesis testing for a refresher.
  7. P > |t|: This is the p-value.  It represents the probability that, if the null hypothesis were true, random chance alone would produce a coefficient at least as large in magnitude as the one we obtained in our sample.  Recall that our standard for statistical significance is that we only reject the null hypothesis if the p-value is smaller than 0.05.  NOTE: A p-value can never be exactly equal to zero.  When software reports 0.000 (as it does in this example), it is telling us that the number was too small for it to write in the available space.
  8. [95% conf. interval]: Finally, this is the 95% confidence interval around the coefficient estimate.  We don't know the true relationship between income and jail_days, but if our model is correct, then we would be 95% sure that the true slope lies between the lower bound (left number) and upper bound (right number).  If we collected twenty different samples, ran the same regression on each, and computed a confidence interval each time, we would expect the true slope to fall inside nineteen of those twenty intervals.

As for the rectangle on the top-right, there are only two numbers that we need to discuss right now:

  1. Number of obs: This is the number of observations used in the model.  This number will be equal to or smaller than the total number of rows in your dataset.  If it equals the total number of rows of your dataset, it means that the model is using all of your data.  If it is smaller, then the likely explanations are that either (1) you deliberately told the software to use only some rows and not others or (2) one or more of your variables has some missing data.  For instance, maybe whoever collected the data wasn't able to get both income and sentence information for everyone.  If the researcher was able to get one and/or the other for 500 people, but only able to get both for 450, then "Number of obs" would be 450.  We can only use a row of the dataset if the row has values for all of the variables we're including in the model.
  2. R-squared: The proportion of variation in the dependent variable explained by the model.  That is, how much of the reason why some people have longer prison sentences than others can be explained by just the income variable.  A value of 0 would mean that income is perfectly useless in predicting sentence length, while a value of 1 would indicate that a person's income can perfectly predict how long they will spend in prison. 
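The row-counting logic behind "Number of obs" can be sketched in a few lines of Python with pandas.  The tiny dataset below is made up purely to illustrate how rows with a missing value in any model variable get dropped.

```python
# Sketch: why "Number of obs" can be smaller than the dataset.
# Rows missing a value for ANY variable in the model are unusable.
# The data here is invented for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income":    [12.0, np.nan, 30.5, 8.0, 22.1],
    "jail_days": [310.0, 295.0, np.nan, 330.0, 305.0],
})

total_rows = len(df)                                    # 5 rows in the dataset
usable = len(df.dropna(subset=["income", "jail_days"])) # 3 complete rows
print(total_rows, usable)
```

Here the software would report "Number of obs" as 3, even though the dataset has 5 rows, because two rows are each missing one of the two variables.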

NOTE: Like with correlation coefficients, there are no such things as "good" or "bad" values of R-squared.  We do prefer higher values over lower values, but it's all relative.  Our example value of 0.026 looks small, but could still be "good" if no other model can do better.  Likewise a value of 0.947 could still be "bad" if it's lower than we could have obtained from other models.
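The connection to correlation coefficients is direct: in a simple one-variable OLS regression, R-squared is exactly the squared correlation between X and Y.  A minimal sketch, again using made-up data of the same shape as the example:

```python
# Sketch: for a one-variable OLS regression, R-squared equals the squared
# correlation between the independent and dependent variables.
# The data-generating numbers are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
income = rng.uniform(0, 25, 500)
jail_days = 327 - 0.8 * income + rng.normal(0, 35, 500)

r = np.corrcoef(income, jail_days)[0, 1]   # correlation coefficient
r_squared = r**2                           # proportion of variation explained
print(round(r_squared, 3))
```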

Interpreting Regression Results

Now that we've discussed what the parts of the table mean, we can move to interpreting the results from our example.  The part that we typically care most about is the coefficient on income and whether or not it is statistically significant.  It is the slope of our estimated regression line and represents the estimated effect that income has on sentence length.  The generic way to interpret a regression coefficient is:

On average, a one unit increase in X is associated with a (coefficient on X) unit increase in Y, all else being equal.

It may seem a bit awkward in this generic form, but will make more sense when we get to our example.  Before that though, we need to pay attention to three key parts of this:

  1. "On average": These are average effects across our sample, rather than exact effects that will happen to each and every person.  People with higher incomes do not always have shorter sentences, they just tend to have shorter sentences.
  2. References to "unit": A regression coefficient represents how much Y is expected to increase when we increase the value of X by 1.  No matter what scale the independent variable is on, whether it's 1 to 100, -50 to 5 million, 0.331 to 1.384, or any other range of numbers, the coefficient always refers to an increase in X of exactly 1.  Further, when our variables have units like dollars, days, miles, or anything else, we're able to use them directly in the interpretation.  That won't always be true for fancier forms of regression and is one of the nice things about OLS regression.
  3. "All else being equal": This translates roughly to "under the assumption that we have done everything correctly" and is often a very bold assumption.  The fact that the assumption is so tough to meet is why Econometrics is an entire field and not just a single course.

Now let's return to interpreting the model in our example:

Coefficient:

  • income: -0.831.  Recall that income is measured in thousands of dollars and our prison sentence variable is measured in days.  The coefficient tells us that on average, a one thousand dollar increase in annual income is associated with a 0.831 day shorter prison sentence, all else being equal.
  • _cons: 327.278.  This is the predicted sentence length for a person with 0 income.  On average, a person with no annual income would be expected to receive a sentence of 327.278 days, all else being equal.
    • Note: The value of the constant doesn't usually matter to us.  In many cases we will get values that are impossible.  We will also have instances where it's not possible for the value of X to be exactly 0.  This is fine.  We're more interested in the coefficient on the independent variable and whether or not it is statistically significant.
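The estimated line can also be used to compute predicted values.  A small sketch using the coefficients reported in the example (slope -0.831, constant 327.278); the function name is mine, invented for illustration:

```python
# Predicted sentence length from the estimated OLS line in the example:
# jail_days = 327.278 - 0.831 * income (income in $1,000s per year).
def predicted_jail_days(income_thousands):
    """Predicted sentence (in days) for a given annual income in $1,000s."""
    return 327.278 - 0.831 * income_thousands

print(predicted_jail_days(0))    # a person with no income: the constant
print(predicted_jail_days(25))   # the highest income in the sample
```

Moving from the first prediction to the second shows the slope at work: each additional thousand dollars of income lowers the predicted sentence by 0.831 days.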

P > |t| (the p-value):

  • income: 0.000. It is important to remember that a p-value cannot be exactly equal to zero.  When software reports 0.000 or similar, it is because it doesn't have space to write a more exact number.  In papers and presentations, report this as p < 0.001.  Under the assumption that we did everything (selecting what variables to use, deciding on how to collect our data, and so forth) correctly, this is the probability that we would get a coefficient at least as large in magnitude as -0.831, if the "true" value of the coefficient were 0.  When we get a p-value smaller than 0.05, we say that our result is statistically significant at the 5% level or better.
  • _cons: 0.000. We get the same result here as we do for the coefficient on income.  However, we usually don't care whether the constant is statistically significant or not.  All that it tells us here is that a person with no income who was convicted of this particular crime would be expected to receive a sentence of significantly more than 0 days.

95% conf. interval (95% confidence interval)

  • income: [-1.277, -0.384].  Under the assumption that we did everything correctly, we are 95% sure that the "true" coefficient is between these two bounds.
  • _cons: [324.089, 330.465].  Under the assumption that we did everything correctly, we are 95% sure that a person with no income would receive a sentence of between roughly 324 and 330 days, on average.

Note that I didn't spend much time explaining the confidence intervals and skipped over the standard errors and t-statistics entirely.  Why?  Because for our purposes they're redundant.  The t-statistic is calculated as (coefficient - 0) / (standard error).  The p-value is calculated using the t-statistic, number of observations, and number of coefficients we're estimating (2 in this example, one for income and one for the constant).  The confidence interval is based on the coefficient, standard error, number of observations, and t-critical value.  Just having the coefficient and one of the other columns would be sufficient for us to detect whether the effect was statistically significant or not.  If one says that the result is significant, the others will too.  If one says that a result is not significant, the others will report the same. 
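The redundancy described above can be sketched directly: starting from just a coefficient and its standard error, the t-statistic, p-value, and 95% confidence interval all follow.  The standard error of 0.227 below is an assumed value, chosen to be roughly consistent with the confidence interval reported in the example; the degrees of freedom are 500 observations minus the 2 coefficients being estimated.

```python
# Sketch: t-statistic, p-value, and 95% CI all derive from the coefficient
# and standard error.  The standard error 0.227 is an assumption for
# illustration, roughly consistent with the example's reported CI.
from scipy import stats

coef, se, df = -0.831, 0.227, 498   # df = 500 obs - 2 estimated coefficients

t_stat = (coef - 0) / se                    # null hypothesis: effect is 0
p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-sided p-value
t_crit = stats.t.ppf(0.975, df)             # critical value for a 95% CI
ci = (coef - t_crit * se, coef + t_crit * se)

print(round(t_stat, 2), round(p_value, 4),
      [round(bound, 3) for bound in ci])
```

All four measures tell the same significance story: the t-statistic is well past 1.96 in absolute value, the p-value is far below 0.05, and the confidence interval excludes 0.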

Our software presents all of these measures for our convenience.  Researchers in different fields (and more specifically their journals) have different preferences over what information should be reported.  And they are just that: preferences.  They are all equivalent and none of the options are definitively better than the others in all situations.  When reading results in a paper, here are some ways you might identify which effects are statistically significant at the 5% level without the author directly telling you (which they usually will):

  • A p-value smaller than 0.05
  • A t-statistic larger than 1.96 in absolute value (for large samples of data)
  • A 95% confidence interval that does not include the value of the null hypothesis (usually 0)
  • A coefficient at least twice the size of its standard error (in absolute value)

Again, in a technical sense, it does not matter which of these versions you use when presenting results.  What does, unfortunately, matter is conforming with the norms of the field in which you work.  To conclude this section on interpretation, I'll remind you that there is a difference between statistical significance and an effect being substantively meaningful (large enough to matter in practice).  Our regression produced a statistically significant coefficient, but how does it translate into actual differences between people?  For that, it can be helpful to plot our regression line over a scatterplot, like so:
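A hedged sketch of such a plot with matplotlib, using invented data of the same shape as the example (income in $1,000s per year versus sentence length in days); the data generation here is an assumption for illustration:

```python
# Sketch: scatterplot of fake income/sentence data with the fitted OLS
# line drawn over it.  Data-generating numbers are made up.
import matplotlib
matplotlib.use("Agg")              # render without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
income = rng.uniform(0, 25, 500)
jail_days = 327.278 - 0.831 * income + rng.normal(0, 35, 500)

slope, intercept = np.polyfit(income, jail_days, 1)   # simple OLS fit

plt.scatter(income, jail_days, s=10, alpha=0.5)
plt.plot(income, intercept + slope * income, color="red")
plt.xlabel("Income ($1,000s per year)")
plt.ylabel("Sentence length (days)")
plt.savefig("regression_plot.png")
```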

This demonstrates that, for a conviction on the same crime, a person earning $25,000 per year (the maximum in this particular sample) would be expected to receive a sentence roughly 21 days shorter, on average, than a person with no income.  Is that a big effect or a small one?  That's up to your ability to argue.