
Working with Quantitative Data

Choosing what software to use, understanding file formats, and organizing your data.
URL: https://libguides.law.ucla.edu/data

Descriptive Statistics

Before we can begin studying the relationships among many variables, it is important to first understand what a single variable "looks" like and then how two variables tend to "move together" or "move apart".  Descriptive statistics illustrate basic information about a variable and how it is measured.  Also referred to as "summary statistics" or "descriptives", these figures can be useful in and of themselves, but can also help us figure out what sorts of analysis are appropriate and if there are any problems that need to be addressed before we start the analysis.  In the boxes below, we'll discuss some descriptive statistics that are calculated for one variable at a time, and others that we can employ to examine the basics of a relationship between two variables.

One Variable

A capitalized "N" is typically used to denote the total number of observations in a dataset, while a lowercase "n" reflects the number of observations for one particular variable.  For instance, if we surveyed 1000 people, N would be 1000.  But just because we survey 1000 people doesn't mean that we necessarily have 1000 observations for each and every variable.  It is common (and good practice) to allow survey respondents to refuse to answer a question.  They might find the question too personal, offensive, embarrassing, or otherwise just not want to answer.  If you force them to answer, then they may be more likely to give a false answer, and you would not be able to distinguish true from false answers later.  In the table below, you will see that while we surveyed 1000 people, only 218 people answered a question about drug use.  This matters for a few reasons:

  1. Sample size is part of the formula for most statistical tests.  If you include the drug use variable in a statistical test, then 218 will be the maximum number of observations that can be used for that test, no matter how many people you surveyed.  In this example, that would mean that of the 1000 people you spent the time, effort, and money to survey, you would only be able to use 21.8% of the data.
  2. Your sample size could decrease further if you are running a test with multiple variables at the same time, where other variables also have missing values.  For instance, if using both drug use and income in the same test, maybe some of the people who answered about income did not answer about drug use, and some of the people who answered about drug use did not answer about income.  The sample size available for your test could drop quickly.  This matters, as the smaller your sample size, the harder it is to detect a relationship between variables.
  3. In the course of your analysis, you would need to consider why the values are missing and if the reason(s) might affect your results.  For instance, even in an anonymous survey, it's common for people to not want to give answers that they think will make the surveyor think less of them.  If only 1.8% of people reported having used illegal drugs, but I see that 78.2% of people didn't answer the question, I might suspect that the rate of drug use would have been higher if everyone had answered.  That sort of deliberate non-response would almost certainly affect the results of any tests we employ that include the drug use variable.
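To make the distinction between N and per-variable n concrete, here is a minimal sketch in pandas. The dataset and its values are invented for illustration; None stands in for a refused answer.

```python
import pandas as pd

# Hypothetical survey responses; None = respondent refused to answer
df = pd.DataFrame({
    "income":   [52000, None, 61000, 70000, None, 45000],
    "drug_use": [0, None, 1, None, None, 0],
})

# N: total respondents surveyed; n: non-missing answers per variable
N = len(df)
n_income = df["income"].notna().sum()
n_drug = df["drug_use"].notna().sum()
print(N, n_income, n_drug)          # 6 4 3

# Listwise deletion: a test using both variables can only use rows
# where BOTH questions were answered
both = df.dropna(subset=["income", "drug_use"])
print(len(both))                    # 3
```

Note how the usable sample for a two-variable test (3) is smaller than the n of either variable alone, exactly as described in point 2 above.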

 

Variable                        n     Mean    Min  Max
Income (USD per year)           867   63129   0    170000
Employed (1=yes, 0=no)          945   0.317   -9   1
Education (years)               1000  12.374  8    16
Illegal Drug Use (1=yes, 0=no)  218   0.018   0    1

 

For any categorical variable, whether dummy, ordinal, or nominal, we can construct a table that shows how many observations we have of each possible value.  This sort of "frequency table" or "one-way tabulation" would let us rank the options in terms of how commonly they occur.  They can also point us to certain data problems that we might need to fix.  Suppose that we asked 1000 people before an election to write whether they prefer Candidate A or Candidate B and that we constructed the following frequency table:

 

Candidate  Frequency  Percent
A          237        23.7
a          123        12.3
B          491        49.1
b          112        11.2
C          37         3.7

The results would tell us that we've experienced a pretty common data problem.  Some people who preferred A wrote "A", while others wrote "a" and likewise for B.  We would know that A and a both refer to the same candidate, as do B and b, but the software wouldn't necessarily know that.  We would need to correct the data ourselves.  Doing so in this particular instance would be fine, as there is no ambiguity as to what the respondent intended.

We also see that some people wrote C, even though they were instructed to choose between A and B.  The solution in this instance would be to replace every answer of "C" with a blank value.  That is, we would treat the data as if the person had not answered the question at all.  We don't know whether a person who wrote "C" would have preferred A to B or B to A.  We also don't know whether those people who followed instructions properly would have preferred C if they had been given the choice. 
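Both fixes can be sketched in pandas. The answers below are invented for illustration: we standardize the case of valid answers, then recode the invalid "C" write-ins as missing.

```python
import pandas as pd

# Hypothetical candidate-preference answers with inconsistent case
# and an invalid write-in "C"
answers = pd.Series(["A", "a", "B", "b", "C", "A", "b"])

# A one-way tabulation reveals the problem
print(answers.value_counts())

# Fix: standardize case, then treat anything other than A or B as missing
upper = answers.str.upper()
cleaned = upper.where(upper.isin(["A", "B"]))
print(cleaned.value_counts(dropna=False))
```

After cleaning, "A"/"a" and "B"/"b" collapse into single categories, and the "C" answer becomes a missing value rather than a third candidate.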

The minimum is the lowest observed value of a variable and the maximum is the highest observed value.  Note that I say "observed" and not "possible".  $170,000 was the top income reported by people in our survey, but is certainly not the highest income that exists.  Likewise, $0 was reported by at least one person, but income could technically be negative if the person was losing money.  To see why looking at the minimum and maximum values is useful, consider the employment variable.  The variable is supposed to be coded 1 if the person is employed or 0 if the person is not employed, and yet we see a minimum value of -9.  This is because it is common for datasets to contain placeholder values like this that have special meaning.  In surveys, a value like -9, -99, 8, or 88 (or really any value other than the ones the variable is supposed to take) may be used to mark people who said "don't know", "not applicable", or otherwise gave an answer other than one of the expected options.  These special values will (or at least should) always be defined in the documentation that comes with a dataset.

The folks who collect data use these "impossible" values because they don't know how you intend to use the data.  If you were looking for differences between people who said "don't know" and those who refused to give any answer at all, you would need a way to distinguish one type from the other.  In your actual analysis, the most common way to handle impossible values is to recode them to be missing, since we don't know for sure whether a particular instance of -9 should be a 0 or a 1.
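A short pandas sketch of that recoding step, with invented values: checking the minimum flags the placeholder, and replacing it with a missing value fixes the summary statistics.

```python
import pandas as pd
import numpy as np

# Hypothetical 0/1 employment variable where -9 marks "don't know"
employed = pd.Series([1, 0, -9, 1, 0, 1, -9])
print(employed.min())        # -9: a red flag for a variable coded 0/1

# Recode the placeholder to missing before any analysis
employed_clean = employed.replace(-9, np.nan)
print(employed_clean.min(), employed_clean.max())   # 0.0 1.0
print(round(employed_clean.mean(), 3))              # 0.6
```

The mean is now computed over the five valid answers only, which is what we want: the -9s no longer drag it down.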

 

Variable                        n     Mean    Min  Max
Income (USD per year)           867   63129   0    170000
Employed (1=yes, 0=no)          945   0.317   -9   1
Education (years)               1000  12.374  8    16
Illegal Drug Use (1=yes, 0=no)  218   0.018   0    1

The mean value of a variable is defined as the sum of all observed values divided by the total number of observations.  It is used as a measure of "central tendency", meaning that it helps give us an idea of what we might consider to be a typical value.  For instance, the value of $63,129 for income would be calculated by taking person 1's income, plus person 2's income, ... , plus person 867's income, all divided by 867.  Thankfully, we rarely, if ever, have to do this sort of thing by hand.

For dummy variables that are coded 1 or 0, the mean represents the probability that a randomly chosen person in the sample has a value of 1.  If we were to randomly select one of the 218 people who answered the drug question, there would be a 0.018 probability (1.8% chance) that the person was one who reported having used illegal drugs. 

Even if we are not especially interested in the mean of a variable in and of itself, the mean can help point us to problems.  For instance, the mean value of the employment variable is 0.317, suggesting that only 31.7% of respondents to the survey were employed.  Depending on context, that number might indicate a problem (it does here, see the tab on min/max) or it might be reasonable.  For instance, if the 1000 people surveyed were high school students over their summer break, that number might be too high.  If they were 1000 randomly chosen heads of household in the US who had just filed their taxes, it would certainly be too low.

 

Variable                        n     Mean    Min  Max
Income (USD per year)           867   63129   0    170000
Employed (1=yes, 0=no)          945   0.317   -9   1
Education (years)               1000  12.374  8    16
Illegal Drug Use (1=yes, 0=no)  218   0.018   0    1

The median is another measure of central tendency.  It represents the "middle value" of the variable in that 50% of the values are lower than the median and 50% are higher than the median.  This is a useful statistic, as it is less sensitive to outliers than the mean.  An outlier is a value that is much higher or much lower than most of the other values.  When calculating the mean, we add up all of the values and divide by the number of observations.  If one of the values is, say, much larger than the rest, then it can result in the mean being considerably higher than it otherwise would have been such that it would no longer be useful in inferring a "typical" value of the variable.  The median could also be affected by the presence of an outlier, but not nearly as much.

As an example, suppose that we look at the annual incomes of nine recent college graduates.  While some earn more and some earn less, the mean of 46,111 and median of 48,000 are pretty close to one another and both decently represent the center of the data.  But suppose that we add a tenth graduate to our sample, one who happens to have recently signed a professional sports contract.  That person's income is so much higher than the others that the mean increases to 841,500, which is higher than the incomes of any of the original nine graduates.  The mean no longer reflects a typical value.  The median increases too, but only to 48,500.  It's still useful in describing the middle of the data.  While this was an extreme example, it is why figures like income, home price, and so forth tend to be reported as medians rather than means.

 

Graduate  Income (before)  Income (after)
1         61,000           61,000
2         40,000           40,000
3         34,000           34,000
4         52,000           52,000
5         48,000           48,000
6         28,000           28,000
7         57,000           57,000
8         46,000           46,000
9         49,000           49,000
10                         8,000,000
Mean      46,111           841,500
Median    48,000           48,500
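The before-and-after figures can be verified with Python's statistics module, using the nine incomes from the example:

```python
from statistics import mean, median

# The nine graduates' incomes from the example
incomes = [61000, 40000, 34000, 52000, 48000, 28000, 57000, 46000, 49000]
print(round(mean(incomes)), median(incomes))      # 46111 48000

# Add the tenth graduate, the outlier with the sports contract
incomes.append(8_000_000)
print(round(mean(incomes)), median(incomes))      # 841500 48500.0
```

One extreme value moves the mean by nearly 800,000 but the median by only 500.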

 

The mode is the third, and final, measure of central tendency that I'll discuss.  It refers to the most commonly observed value in the data.  While we could calculate this for any variable, it's most useful when examining nominal variables, ones that are qualitative in nature.  For this sort of variable, we can't calculate the mean or the median.  If our variable was a list of crime types on which a set of prisoners had been convicted, it wouldn't make sense to ask "what is the mean/median crime type?".  We can count the number of people convicted on a given crime type and we can calculate the proportion or percent of people convicted on that type, but we cannot calculate the mean or median.  In elections, winners are typically chosen based on the mode:

Candidate  n    Percent
A          300  30%
B          435  43.5%  (winner)
C          115  11.5%
D          150  15%
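Finding the mode can be sketched with collections.Counter; here the individual ballots are reconstructed from the tallies above.

```python
from collections import Counter

# Individual ballots reconstructed from the election table
votes = ["B"] * 435 + ["A"] * 300 + ["D"] * 150 + ["C"] * 115

# The mode is the most commonly observed value
counts = Counter(votes)
winner, n_votes = counts.most_common(1)[0]
print(winner, n_votes)       # B 435
```

Counter works on any list of labels, which is exactly why the mode suits nominal variables: no arithmetic on the values is ever needed.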

 

While mean, median, and mode are measures of central tendency, standard deviation is a measure of dispersion.  That is, it tells us how spread out values of the variable tend to be from the mean.  If you have taken some form of statistics course previously, you may remember calculating variance with the formula:

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$

That is, we take one observed value of our variable, subtract the mean of that variable, square that difference, do the same thing for all other observations of the variable, add them all together, and divide by the sample size minus one.  Each observation minus the mean is referred to as a "deviation".  We square the deviations because any number times itself must be a positive number or zero.  If we didn't square them, then they would add up to zero.  By squaring them, we get a statistic that tells us how far from the mean most observations are likely to be.

But there's a problem.  Well, less of a problem and more of an unnecessary complication.  Because we squared the deviations, the unit on variance becomes a squared unit.  If our variable was measured in dollars, then its mean would be measured in dollars, but the variance would be measured in dollars squared.  Variance could also end up being a really big number.  To solve both of these at the same time, we take the square root of variance to get the standard deviation. 

$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$$

Standard deviation is nicer to work with, as it has the same units as the mean and is easier to interpret.  While $63,129 might be the average income in our sample, the standard deviation of $16,536 suggests that most incomes we observe are likely to be within $16,536 of that mean.
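The two formulas can be checked against Python's statistics module. The income sample below is invented for illustration:

```python
from statistics import mean, stdev

# Small hypothetical income sample
x = [50000, 62000, 71000, 58000, 66000]
x_bar = mean(x)

# Sample variance: squared deviations from the mean, summed,
# then divided by n - 1
variance = sum((xi - x_bar) ** 2 for xi in x) / (len(x) - 1)

# Standard deviation: the square root of variance, back in the
# original units (dollars rather than dollars squared)
sd = variance ** 0.5

print(round(sd, 2) == round(stdev(x), 2))   # True: matches the library
```

Note how large the raw variance is (tens of millions of dollars squared) compared to the standard deviation of a few thousand dollars; this is the "unnecessary complication" the square root removes.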

 

Variable                        n     Mean    Std. Dev.
Income (USD per year)           867   63129   16536
Employed (1=yes, 0=no)          945   0.317   0.139
Education (years)               1000  12.374  3.347
Illegal Drug Use (1=yes, 0=no)  218   0.018   0.007

Two Variables

Just as a frequency table shows the number of observations of a variable that take a particular value, a joint frequency table (also called "cross-tabulation" or "cross-tabs") shows the number of observations in your dataset that take particular combinations of values.  Most frequently, you will see these as two-way frequency tables, meaning that it shows all possible combinations of values for two variables.  Versions with three, four, or even more variables are possible, but get increasingly annoying for us to create or for readers to examine.  If looking for a relationship between two categorical variables, a table like this can be a good place to start.

Suppose that we are studying the relationship between level of education and employment status and that we believe that attaining a greater level of education will make a person more likely to work full-time rather than part-time and less likely to be unemployed.  While the table below would not be sufficient evidence with no additional work needed on our part, it would be a piece of evidence in our favor.  We would proceed to more sophisticated testing.  If the table showed that people with lower levels of education were more likely to be employed full-time, we would stop and examine if either there was something wrong with how we collected our data or if our claim about education and employment needs to be revised.

 

Frequency              Employed Full-time  Employed Part-time  Unemployed  Total
Less than High School  55                  30                  10          95
High School Graduate   70                  25                  10          105
Some College           80                  20                  5           105
College Graduate       90                  20                  4           114
Total                  295                 95                  29          419
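A table of this shape can be produced with pandas.crosstab. The handful of respondents below are invented just to show the mechanics:

```python
import pandas as pd

# Hypothetical education/employment pairs for a few respondents
df = pd.DataFrame({
    "education":  ["HS Grad", "College Grad", "HS Grad", "Some College",
                   "College Grad", "Less than HS"],
    "employment": ["Full-time", "Full-time", "Part-time", "Full-time",
                   "Unemployed", "Part-time"],
})

# Two-way frequency table; margins=True adds the row/column totals
table = pd.crosstab(df["education"], df["employment"], margins=True)
print(table)
```

Each cell counts respondents with that particular combination of values, and the "All" row and column give the one-variable totals.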

 

Probability is a tool that we use to explain how likely it is for an event or combination of events to occur.  I'll start with the single event (single variable) version here and then move to the multiple event version in the next tab.  Probability ranges from 0 to 1, with 0 representing a 0% chance that the event occurs and 1 representing a 100% chance that it occurs.  Values less than 0 or greater than 1 are not possible, as an event can't be less likely than impossible or more likely than completely certain.

To calculate how likely it is for each outcome to occur, we need to create a mutually exclusive and exhaustive list of outcomes.  That is, we need to list all possible outcomes, and each event must fall into one and only one of them.  The table below shows four possible levels of education.  Under this definition of the variable, any person in our dataset would have to fit in one and only one of these categories.  If a person had some other type of education (like trade school), the researcher would need to decide whether to place the person into one of the existing categories or to create a new category.

To calculate probability, we need the number of times that an outcome occurred and the total number of outcomes.  The probability of an outcome is calculated by taking the frequency of the outcome and dividing by the total number of outcomes.

Levels of Education for Full-time Workers  Frequency  Formula  Probability  Percent Chance
Less than High School                      55         55/295   0.186        18.6%
High School Graduate                       70         70/295   0.237        23.7%
Some College                               80         80/295   0.271        27.1%
College Graduate                           90         90/295   0.305        30.5%
Total                                      295        295/295  1            100%

If we were to take the exact probability of each outcome and add them all together, we should get a total of precisely 1.  Were we to get a total other than 1, it would suggest that we have either (1) made a calculation mistake, (2) forgotten to include a possible outcome, or (3) placed a person in more than one category (which we're not allowed to do here).  In a table that we present to humans however, a bit of rounding error is fine.  We only need to present a level of precision that is meaningful and plausibly useful to our audience.  People reading the table probably do not care that the exact probability of a person being a college graduate is 0.3050847...  This is also just a sample of 295 people and not an accounting of all humans that exist in the population.  The value of 0.305 might be used as an estimate of the probability in the population, but it's just an estimate and therefore probably at least a little wrong.  Were we to get a different sample of 295 people, we would almost certainly get a different result, especially for any smaller decimal places.
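The probability column above can be reproduced directly from the frequencies:

```python
# Frequencies of each education level among the 295 full-time workers
freq = {"Less than HS": 55, "HS Grad": 70, "Some College": 80, "College Grad": 90}
total = sum(freq.values())
probs = {k: v / total for k, v in freq.items()}

print(total)                                # 295
print(round(probs["College Grad"], 3))      # 0.305

# The exact probabilities sum to 1 (up to floating-point error)
print(sum(probs.values()))
```

If the probabilities summed to anything noticeably different from 1, that would signal one of the three problems listed above.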

Note that "Percent chance" conveys the same information as probability, just on a more reader-friendly scale of 0 to 100.  Using either probability or percent chance in an explanation is fine and makes no difference.  Just don't mix the two and refer to a "30.5% probability", as such a thing doesn't exist.

The idea of joint probability is identical to that of the regular probability explained on the previous tab, but for the fact that it refers to the likelihood of a combination of events, rather than the likelihood of a single event.  It can be generalized to combinations of any number of events, but for simplicity we'll start with two.  In the first tab of this section, we discussed the frequency table below.  In a sample of 419 people, it shows the number of people who had a particular combination of education levels and employment statuses.  Just like before, if we want to calculate the probability that a person has a certain combination, then we need the table to show mutually exclusive and exhaustive sets of options.  That is, no individual person should belong in more than one cell of the table.

Frequency              Employed Full-time  Employed Part-time  Unemployed  Total
Less than High School  55                  30                  10          95
High School Graduate   70                  25                  10          105
Some College           80                  20                  5           105
College Graduate       90                  20                  4           114
Total                  295                 95                  29          419

Calculating joint probability works the same way that it did before, but now we use the total of all frequencies for both variables, rather than the total for a single variable.

Formula                Employed Full-time  Employed Part-time  Unemployed  Total
Less than High School  55/419              30/419              10/419      95/419
High School Graduate   70/419              25/419              10/419      105/419
Some College           80/419              20/419              5/419       105/419
College Graduate       90/419              20/419              4/419       114/419
Total                  295/419             95/419              29/419      419/419

By applying the formula to all cells in the table, we can see that if we were to randomly pick a person from this sample of 419 people, there would be a 16.7% chance that the person selected would be a full-time worker whose highest level of education is a high school diploma.  Looking at the row for "Total", we see that across all levels of education, there is a 70.4% chance that a randomly selected person is employed full-time.  Likewise, the column "Total" shows that there would be an overall 27.2% chance that a randomly selected person has a college degree.  The probabilities in the twelve cells for combinations of education and employment must add to 1, as must the three cells in the bottom row and the four cells in the right-most column.  If you see a reference to "marginal probabilities", it is talking about the totals.  The marginal probability of each category of employment would be the bottom row of the table.   This is identical to if we had just looked at the employment variable by itself, without considering education level at all. 

Joint Probability      Employed Full-time  Employed Part-time  Unemployed  Total
Less than High School  0.131               0.072               0.024       0.227
High School Graduate   0.167               0.060               0.024       0.251
Some College           0.191               0.048               0.012       0.251
College Graduate       0.215               0.048               0.010       0.272
Total                  0.704               0.227               0.069       1
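The joint and marginal probabilities can be reproduced in NumPy by dividing every cell of the frequency table by the grand total:

```python
import numpy as np

# Joint frequencies: rows = education levels, columns = employment statuses
freq = np.array([
    [55, 30, 10],    # Less than High School
    [70, 25, 10],    # High School Graduate
    [80, 20,  5],    # Some College
    [90, 20,  4],    # College Graduate
])

joint = freq / freq.sum()               # divide every cell by 419
row_marginals = joint.sum(axis=1)       # education totals (right column)
col_marginals = joint.sum(axis=0)       # employment totals (bottom row)

print(round(joint[1, 0], 3))            # 0.167: HS graduate AND full-time
print(round(col_marginals[0], 3))       # 0.704: full-time overall
```

Summing along a row or column collapses one variable, which is why the marginals match what we would get from tabulating each variable on its own.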

When we calculate joint probability, we use information about all of the possible outcomes.  But in many cases, we aren't interested in studying some of those outcomes.  For instance, maybe we are only interested in differences between people who are employed full-time versus those who are unemployed.  If we don't want to consider the probability of being employed part-time, then we need to remove that column from the table.

Joint Probability      Employed Full-time  Unemployed
Less than High School  0.131               0.024
High School Graduate   0.167               0.024
Some College           0.191               0.012
College Graduate       0.215               0.010
Total                  0.704               0.069

But we're not finished.  The probabilities of the remaining eight combinations of education and employment status still need to add up to 1.  The probabilities of each level of education for folks employed part-time don't just disappear.  They need to be redistributed proportionally to the full-time and unemployed columns.  To do this, we need to calculate the probability that a person is not employed part-time.  The formula is 1 - Pr(part-time), where "Pr()" stands for "probability of".  In this example, we get 1 - 0.227 = 0.773.  We then take each of the eight remaining combinations of employment status and education and divide them by 0.773.  After rescaling the remaining values like this, we get approximately the results below.  The eight cells with different combinations of education and employment status once again add up to 1 (aside from rounding error).

Formula                Employed Full-time   Unemployed           Total
Less than High School  0.131/0.773 = 0.170  0.024/0.773 = 0.031  0.201
High School Graduate   0.167/0.773 = 0.216  0.024/0.773 = 0.031  0.247
Some College           0.191/0.773 = 0.247  0.012/0.773 = 0.015  0.262
College Graduate       0.215/0.773 = 0.278  0.010/0.773 = 0.012  0.290
Total                  0.911                0.090                1
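The renormalization step can be sketched in NumPy, starting from the joint probabilities computed from the original frequency table:

```python
import numpy as np

# Joint frequencies: rows = education, columns = full-time, part-time, unemployed
freq = np.array([
    [55, 30, 10],
    [70, 25, 10],
    [80, 20,  5],
    [90, 20,  4],
])
joint = freq / freq.sum()

# Drop the part-time column (index 1) and rescale the remaining cells
# by 1 - Pr(part-time) so they once again sum to 1
p_part_time = joint[:, 1].sum()
kept = np.delete(joint, 1, axis=1)
rescaled = kept / (1 - p_part_time)

print(round(p_part_time, 3))        # 0.227
print(round(rescaled[1, 0], 3))     # 0.216: HS graduate, full-time
print(round(rescaled.sum(), 3))     # 1.0
```

Dividing by 1 - Pr(part-time) redistributes the removed column's probability proportionally across the eight remaining cells, which is exactly the hand calculation shown in the table.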


A correlation coefficient measures the strength and direction of a linear relationship between two variables.  That is, if one variable increases, does the other tend to increase, decrease, or stay the same, and how closely do the two variables track one another in the shape of a straight line?  Calculating a correlation coefficient tells us this in the form of a number between -1 and 1.  A value of 1 means perfect positive correlation.  For any increase in one variable, the other increases by a fixed amount.  A value of -1 reflects the reverse, perfect negative correlation, where any increase in one variable is associated with a decrease in the other by a fixed amount.  Note that these do not mean that if x increases by 1, y must increase (or decrease) by exactly 1.  It just means that if x increases by 1, we know exactly how much y will increase or decrease.

For instance, if a cafe charges $5 for a cup of coffee, then each additional cup a person buys will increase their total coffee bill by $5.  If a person buys one cup, their bill will be $5.  If they buy four cups of coffee, their bill will be $20.  This would be perfect positive correlation, as we could pick any number of cups of coffee to buy and know exactly what the bill is going to be.

A correlation of 0 between two variables would mean the complete lack of a linear relationship between the two.  Increasing x by any amount tells us exactly nothing about what value y will take.  Examples of exactly 0 correlation between two variables tend to be a bit silly or abstract, as they're quite rare in practice.  For instance, the number of squirrels on a university's campus is (probably) unrelated to the number of students who will eventually run for elected office.  Each value of one variable would tell us nothing about the value of the other.

The formula to calculate a correlation coefficient (technically Pearson's correlation coefficient) between two variables x and y is:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where $x_i$ indicates the value of x for a particular unit (person, firm, government, etc.), $y_i$ marks the value of y for that same unit, and a bar over a variable indicates its average value in the dataset.

Most correlations we see will not be perfect, but the closer we get to -1 or +1, the more that a scatterplot between the two variables is likely to look like a straight line.  The closer the correlation coefficient is to 0, the more it will usually look like a jumble of dots.  Between 0 and either extreme, we can see all manner of weak or strong relationships.  It will be important to remember, though, that terms like "weak" and "strong" are entirely subjective and contextual.  There is no standard that definitively tells us when a relationship is weak or strong.
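The coffee example gives an easy check that the formula produces exactly 1 for a perfectly linear relationship. A short NumPy sketch:

```python
import numpy as np

# The cafe example: each cup costs $5, so cups bought and total bill
# are perfectly (and positively) linearly related
cups = np.array([1, 2, 3, 4, 5])
bill = 5 * cups

def pearson(x, y):
    # Sum of products of deviations, divided by the product of the
    # square roots of the summed squared deviations
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(pearson(cups, bill), 3))                    # 1.0
# np.corrcoef gives the same answer
print(round(float(np.corrcoef(cups, bill)[0, 1]), 3))   # 1.0
```

Replacing `bill = 5 * cups` with a noisier relationship would push the coefficient away from 1; reversing the sign of the slope would push it toward -1.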