LibGuides: Working with Quantitative Data: Variable Types

Variable Types

The way that we choose to measure our variables can affect the types of analysis that are available to us and what can be done with statistical software. In general, it is best to measure a variable in a way that gives you the greatest possible degree of variation appropriate for what it is that you are studying. The "appropriate" part of the definition is important. For instance, if studying the effect of education on a person's income, there are several ways that you could potentially measure education. Among the possibilities:

1. Did the person graduate from high school? 1 = yes, 0 = no.
2. What level of education did the person receive? 0 = none, 1 = primary only, 2 = some high school, 3 = high school graduate, 4 = some college, 5 = college graduate, 6 = some postgraduate, 7 = graduate degree
3. How many years of education does the person have?
4. How many days of education does the person have?

Option (4) would give us the greatest degree of variation, as we could potentially see values from 0 to thousands. But! One additional day of education is a very small amount, isn't likely to have a strong effect on income, and a person probably doesn't have an accurate memory of exactly how many days they have spent learning. On the other end of the spectrum, (1) has the lowest possible degree of variation to still be considered a variable: two values. However, if your argument is that education levels below a high school degree don't have a meaningful effect on income and that once a person has a high school degree, additional education doesn't have much of an effect either, then this variable could be entirely appropriate. Having a greater level of variation would be useful, but maybe not strictly necessary.
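As a rough illustration of this trade-off, the sketch below collapses a detailed measure (years of education) into the two-value measure from option (1). The data and the 12-year cutoff for a high school diploma are assumptions made up for the example. Note that the conversion only works in one direction: you can always collapse a detailed measure into a coarser one, but you cannot recover years of education from a yes/no answer.

```python
# Hypothetical data: years of completed schooling for five people.
years_of_education = [8, 12, 12, 16, 19]

# Assumption for illustration: 12 years corresponds to a high school diploma.
HIGH_SCHOOL_YEARS = 12

# Collapse the detailed measure into a two-value (yes = 1, no = 0) measure.
graduated_high_school = [
    1 if years >= HIGH_SCHOOL_YEARS else 0 for years in years_of_education
]

# Count how many distinct values each version of the variable takes.
distinct_years = len(set(years_of_education))
distinct_dummy = len(set(graduated_high_school))

print(graduated_high_school)
print(distinct_years, distinct_dummy)
```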

In defining variable types, we will need to distinguish the terms used in statistics/econometrics from those employed by statistical software. The former are of greater consequence, as some econometric techniques require a variable to be of a particular type. Variable type matters in software too, but more in terms of what steps we need to execute before carrying out a statistical test.

Variable Types in Statistics

All forms of statistical analysis that examine relationships between variables use variation in one variable to explain variation in another. That is, each unit (person, firm, country, etc.) in our dataset has a value for the independent variable(s) and the dependent variable. While some units may have the same value for a particular variable, they cannot all have the same value, otherwise it would be a constant and not a variable. When we refer to "variation", we are referring to how many values are observed for a variable and how frequently each value occurs. The type of variable we have, particularly the type of dependent variable we are studying, can push us to one form of analysis or another. For instance, if you have taken a statistics class before, you may be familiar with Ordinary Least Squares regression (often just called "regression") and logistic regression (logit). OLS technically requires that the dependent variable is continuous, while logit requires the dependent variable to be a dummy. If you execute a model with an inappropriate type of dependent variable, then your results can potentially be biased and/or inefficient (more on these terms later).

Dummy variables: These are categorical variables that can take only two values, usually 1 or 0. Values 1 or 2 are also common and any other combination of exactly two values would also qualify as a dummy variable. For the sake of clarity, I will always refer to dummy variables as being coded 1 or 0. If you are working with a dataset where a variable takes some other combination of two values, I would recommend recoding the variable so that the lower value becomes 0 and the higher value becomes 1. It's common for statistical commands to require these two exact values. You will also see these referred to as "binary variables", "dichotomous variables", or "indicator variables".
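The recoding recommended above can be sketched in a few lines. The function name `recode_dummy` and the example data are hypothetical, and the sketch assumes there are no missing values in the variable; real data would need those handled first.

```python
def recode_dummy(values):
    """Recode a two-valued variable so the lower value becomes 0 and the higher becomes 1."""
    distinct = sorted(set(values))
    if len(distinct) != 2:
        # Not a dummy variable: refuse rather than guess at a recoding.
        raise ValueError(f"expected exactly two distinct values, found {len(distinct)}")
    low, high = distinct
    mapping = {low: 0, high: 1}
    return [mapping[value] for value in values]


# Hypothetical variable coded 1 = no, 2 = yes.
employed = [1, 2, 2, 1, 2]
employed_01 = recode_dummy(employed)
print(employed_01)
```

Raising an error when the variable has more or fewer than two values is a deliberate choice here: it catches cases where a supposed dummy contains a stray third code.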

Ordinal variables: Here we have categorical variables that can take more than two possible values and the order of values matters. Consider a variable for level of education coded as 0 = some high school, 1 = graduated high school, 2 = some college, 3 = college graduate. This is a categorical variable, as we have only those four values. In this example, we don't have data on any levels lower or higher, nor do we have values between the categories that might indicate how close the person was to entering the next level. Still, the order of the four categories matters (hence the name "ordinal"), as a value of 1 indicates a higher level of education than a value of 0, a value of 2 means more education than a value of 1, and so forth. The exact numbers used to represent each category don't matter. Values of 1-4 would be just as valid as values 0-3. As with dummy variables, however, I recommend coding ordinal variables so that 0 is the lowest possible value, 1 is the next lowest, and so forth.
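A minimal sketch of this coding scheme, assuming the education variable arrives as text labels (the responses are made up for the example). It assigns codes from an explicitly ordered list and then checks that shifting the codes from 0-3 to 1-4 leaves the ranking of respondents unchanged.

```python
# Categories listed from lowest to highest level of education.
EDUCATION_ORDER = [
    "some high school",
    "graduated high school",
    "some college",
    "college graduate",
]

# Assign 0 to the lowest category, 1 to the next lowest, and so forth.
EDUCATION_CODES = {label: code for code, label in enumerate(EDUCATION_ORDER)}

# Hypothetical survey responses.
responses = ["some college", "some high school", "college graduate", "graduated high school"]
coded = [EDUCATION_CODES[response] for response in responses]

# Using 1-4 instead of 0-3 changes the numbers but not the ordering of people.
shifted = [code + 1 for code in coded]
rank_by_coded = sorted(range(len(coded)), key=coded.__getitem__)
rank_by_shifted = sorted(range(len(shifted)), key=shifted.__getitem__)
same_ranking = rank_by_coded == rank_by_shifted

print(coded, same_ranking)
```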

Nominal variables: These are also categorical variables that can take more than two possible values, but here the order does not matter. We have assigned a different numerical value to each category, but any method of assigning unique values would have been equally valid. For instance, suppose we had a variable for a person's sector of full-time employment coded 1 for government, 2 for nonprofit, or 3 for private sector. We have assigned values to each category, but it would have been perfectly fine to choose 1 for private sector, 2 for government, and 3 for nonprofit, or any other ordering instead. Even if you think that one sector of work is "better" than another on some factor, any such assessment would be entirely subjective. Another analyst could have a different opinion and theirs would be equally valid and useful.
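To see why the choice of codes is arbitrary, the sketch below applies the two codings described above to the same made-up data. The counts per category come out identical under both codings, while the average of the numeric codes does not, which is one way to see that arithmetic on nominal codes has no meaning.

```python
from collections import Counter
from statistics import mean

# Hypothetical data: sector of full-time employment for six people.
sector = ["government", "private", "nonprofit", "private", "government", "private"]

# Two equally valid ways of assigning numeric codes to the same categories.
coding_a = {"government": 1, "nonprofit": 2, "private": 3}
coding_b = {"private": 1, "government": 2, "nonprofit": 3}


def category_counts(labels, coding):
    """Code the labels numerically, count each code, then translate back to labels."""
    decode = {code: label for label, code in coding.items()}
    counts = Counter(coding[label] for label in labels)
    return {decode[code]: n for code, n in counts.items()}


counts_a = category_counts(sector, coding_a)
counts_b = category_counts(sector, coding_b)

# The mean of the codes depends entirely on the arbitrary coding chosen.
mean_a = mean(coding_a[label] for label in sector)
mean_b = mean(coding_b[label] for label in sector)

print(counts_a == counts_b, mean_a == mean_b)
```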

Continuous variables: These are variables that, in theory anyway, can take an infinite number of possible values. For any two values we pick (so long as they're not identical), any number between those two values must be possible. So if one value is 7 and another value is 8, then 7.5, 7.213, and any other number between 7 and 8 must be possible. Continuous variables may be bounded (meaning they have maximum and/or minimum possible values) or unbounded (no max or min). In practice, we do not need to have infinitely many possible values; we just need to have more than a small number. Generally speaking, if the variable we're talking about is not a nominal variable and has more than roughly 7 possible values (there's no fixed number here), then it tends to be treated as being more or less continuous. As an example, a person's annual salary would be a continuous variable. We're more likely to see big round numbers or at least whole dollar amounts, but any number (even including fractions of a penny) is theoretically possible.
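The rule of thumb above could be sketched as a quick check like the one below. This is only an illustration: the function name and data are hypothetical, the threshold of 7 is the rough figure from the text rather than a fixed rule, and the sketch counts the values observed in the data as a stand-in for the values that are possible.

```python
ROUGH_THRESHOLD = 7  # rule of thumb only; there is no fixed number


def treat_as_continuous(values, is_nominal=False, threshold=ROUGH_THRESHOLD):
    """Return True if a variable would commonly be treated as roughly continuous."""
    if is_nominal:
        # Nominal codes are never continuous, however many categories there are.
        return False
    return len(set(values)) > threshold


# Hypothetical annual salaries: many distinct values.
salaries = [41000, 52500.5, 38250, 61000, 47999.99, 55000, 72000, 49500, 66000]

# Hypothetical 1-5 survey scale: only a handful of distinct values.
agreement_scale = [1, 2, 3, 4, 5, 3, 2]

print(treat_as_continuous(salaries))
print(treat_as_continuous(agreement_scale))
```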

Variable Types in Data Analysis Software

Suppose that your dataset is organized such that each column represents a variable and each row contains all of the information about one particular unit (person, firm, country, etc.). In statistical software, "variable type" refers to what each cell within a column contains and how much memory it takes to store the information within each cell. The exact types and subtypes that exist depend on what software you are using, but it is generally possible to convert a variable from one form to another. The type does not limit you in terms of what types of analysis are possible; rather, it limits what commands can be executed on or with a particular variable. In doing any form of analysis based on "real world" data, it is almost certain that you will need to convert at least one variable from one storage type to another.

Numerical variables: As the name would suggest, these are variables that contain only numbers. You will see subtypes such as int, byte, float, double, long, and so forth. They differ in terms of the range and precision of the numbers they can hold (including whether decimals are allowed at all) and how much memory they take to store. For instance, "int" is short for "integer". An int variable can only contain whole numbers and no decimals. A "byte" variable is one where each cell in your dataset requires one byte of memory to store. Unless you are working with a truly massive dataset, storage space is unlikely to be nearly as important a constraint as it would have been in the 1990s or earlier. Still, it is common for statistical software to require that a variable is of a particular type to execute a given command. For instance, if you wanted to execute the logistic regression mentioned earlier, that type of model requires the dependent variable to be a dummy variable. The folks who create statistical software are aware of this and so generally program the software to produce an error message if you try to estimate a logit regression on a variable that isn't a dummy. If your variable is not a dummy, you would need to recode it before you could run the logit regression.
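To make the idea of storage types concrete, here is a small sketch using Python's standard-library `array` module as a stand-in. The type names, sizes, and ranges in your own statistical package may differ, so treat this as an analogy rather than a description of any particular program. It shows that a one-byte integer type uses less memory per cell than a double-precision type, but rejects values that are too large or that contain decimals.

```python
from array import array

# "b" stores signed one-byte integers; "d" stores double-precision decimals.
small_ints = array("b", [0, 1, 1, 0])
decimals = array("d", [0.5, 1.25, 3.0])

# Memory used per cell, in bytes (1 and 8 on typical platforms).
bytes_per_small_int = small_ints.itemsize
bytes_per_decimal = decimals.itemsize

# A one-byte integer cannot hold a value as large as 1000.
try:
    small_ints.append(1000)
    overflowed = False
except OverflowError:
    overflowed = True

# An integer type cannot hold a value with a decimal part.
try:
    small_ints.append(2.5)
    rejected_decimal = False
except TypeError:
    rejected_decimal = True

print(bytes_per_small_int, bytes_per_decimal, overflowed, rejected_decimal)
```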

String variables: Sometimes also called "character" variables, these are variables that can contain letters, numbers, symbols, or any combination thereof, rather than only numbers. These could be variables like a column full of county names or they could be variables that are supposed to contain only numbers, but contain symbols like $, %, and the like. The vast majority of quantitative econometric methods cannot be executed directly on strings. If a variable you need to use is stored as a string, you will need to convert it to a numerical form first or rely on a function in your statistical software that does this conversion automatically. This is because while you may understand what the words, symbols, abbreviations, and so forth in a string variable mean, your statistical software likely does not. These programs do what they are told, nothing more and nothing less. It is up to you to give the software the correct instructions.
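A minimal sketch of this kind of conversion is below, for a column that should be numeric but was stored as text with currency symbols and thousands separators. The function name `parse_number` and the data are hypothetical. The sketch assumes a period is the decimal mark and a comma is the thousands separator, which is not true in every country's number format.

```python
def parse_number(text):
    """Convert text such as '$1,250.50' or '45%' to a float, or None if it is not numeric."""
    # Strip surrounding spaces, then the symbols that are not part of the number.
    # Note: a percent sign is simply dropped, so '45%' becomes 45.0, not 0.45.
    cleaned = text.strip().replace("$", "").replace(",", "").replace("%", "")
    try:
        return float(cleaned)
    except ValueError:
        # Leave anything unparseable as missing rather than guessing a value.
        return None


# Hypothetical income column stored as strings.
raw_income = ["$52,000", "48500", " $1,250.50 ", "n/a"]
income = [parse_number(value) for value in raw_income]
print(income)
```

Returning `None` for entries like "n/a" keeps the problem visible, so you can decide how to treat missing values instead of having them silently turned into zeros.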

Date variables: Somewhere between numerical variables and string variables are dates. Like strings, they can contain numbers, letters, and symbols. But like numerical variables, the order of dates has a specific meaning. We know that 5 January 2024 comes four days after 1 January 2024, 1 February 2024 is one month after 1 January 2024, 1 January 2023 is one year before 1 January 2024, and so forth. But we also have complications like leap years and months that have different numbers of days, not to mention different calendar systems in different cultural contexts and whether or not we need our analysis to account for weekends and holidays. Beyond this, there are many ways in which humans write dates that are readily understood by humans. 1 January 2024, 1 Jan 24, 01/01/2024, 1.1.24, and many others all hold the same meaning. But our software can't automatically understand every possible combination and ordering of day, month, and year symbols that translate to the same date. As such, we generally need to teach the software how to convert a human-readable date into a form that it can understand and/or the reverse.
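As one way of "teaching" software to read dates, the sketch below tries a short list of known formats in turn using Python's standard `datetime` module. The list of formats and the function name `parse_date` are assumptions for the example. In particular, it assumes day-first ordering, so 01/02/2024 would be read as 1 February, and it assumes English month names. Once dates are stored in a proper date type, the software handles month lengths and leap years for us.

```python
from datetime import date, datetime

# Assumed formats, all day-first; month names depend on an English locale.
DATE_FORMATS = [
    "%d %B %Y",  # 1 January 2024
    "%d %b %y",  # 1 Jan 24
    "%d/%m/%Y",  # 01/01/2024
    "%d.%m.%y",  # 1.1.24
]


def parse_date(text):
    """Try each known format in turn and return the first successful match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


# Four human-readable spellings of the same date.
written = ["1 January 2024", "1 Jan 24", "01/01/2024", "1.1.24"]
parsed = [parse_date(text) for text in written]

# Date arithmetic accounts for month lengths and leap years.
days_between = (date(2024, 1, 5) - date(2024, 1, 1)).days
feb_2024_days = (date(2024, 3, 1) - date(2024, 2, 1)).days
feb_2023_days = (date(2023, 3, 1) - date(2023, 2, 1)).days

print(parsed, days_between, feb_2024_days, feb_2023_days)
```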