Statistics
Baseline/Baseline Variables
The term baseline denotes the first valid measurement of a variable. Baseline variables are built from measurements collected at multiple time points and are created using the merge command with the update option in Stata. With each additional wave of data, values that are still missing are filled in with the corresponding valid measurement from that wave.
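The fill-in logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Stata workflow itself; the child IDs, the height values, and the helper name `build_baseline` are all hypothetical.

```python
# Hypothetical data: each wave maps a child ID to a height measurement.
# None marks a missing or invalid value.
waves = [
    {"c1": 101.5, "c2": None, "c3": None},   # wave 1
    {"c1": 104.0, "c2": 98.2, "c3": None},   # wave 2
    {"c1": 107.3, "c2": 101.0, "c3": 95.4},  # wave 3
]


def build_baseline(waves):
    """Return the first valid (non-missing) measurement for each ID.

    Assumes `waves` is ordered from earliest to latest.
    """
    baseline = {}
    for wave in waves:
        for unit_id, value in wave.items():
            # Fill only where no valid value has been recorded yet,
            # so a later wave never overwrites an earlier measurement.
            if baseline.get(unit_id) is None:
                baseline[unit_id] = value
    return baseline


baseline = build_baseline(waves)
print(baseline)
```

Here child c1 keeps the wave 1 value, while c2 and c3 take their first valid values from waves 2 and 3.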
Binary/Dummy/Flag Variable
Binary variables have a value of either 0 or 1, indicating whether a condition is “true” or “false.” For example, a binary variable that indicates whether a school is situated in a city would have two possible values: 1, indicating that the school is located in a city, and 0, indicating that it is not. These variables are also known as indicator, dummy, or flag variables.
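As a small sketch of the city example, the snippet below derives a 0/1 flag from a text field. The school records, the `locale` field, and its labels are assumptions made up for illustration.

```python
# Hypothetical school records; the "locale" field and its labels are assumed.
schools = [
    {"name": "A", "locale": "city"},
    {"name": "B", "locale": "rural"},
    {"name": "C", "locale": "city"},
]

# Create the binary (dummy) variable: 1 if the school is in a city, else 0.
for school in schools:
    school["in_city"] = 1 if school["locale"] == "city" else 0

in_city_flags = [school["in_city"] for school in schools]

# The mean of a 0/1 variable is the proportion of units coded 1.
share_in_city = sum(in_city_flags) / len(in_city_flags)
print(in_city_flags, share_in_city)
```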
Categorical Variable
Categorical variables have character or integer values that denote specific categories. These values may be ordinal, such as a scale rating, or nominal, in which case the numeric codes are arbitrary labels with no inherent order or magnitude. For example, 1 could represent left-handedness, and 2 could signify right-handedness.
Composite Variable
A composite variable is “a variable made up of two or more variables or measures that are highly related to one another conceptually or statistically” [1]. For example, an IQ test produces a single score that is based on a series of responses to various questions. The responses are grouped to create one single measure of intelligence.
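One simple way to form a composite is to add up related item scores. The sketch below assumes four hypothetical test items scored 0 or 1; real instruments often weight or standardize items instead, so treat this as an illustration only.

```python
# Hypothetical item-level responses (1 = correct, 0 = incorrect) per student.
responses = {
    "s1": [1, 0, 1, 1],
    "s2": [0, 0, 1, 0],
}

# Composite variable: a single total score built from the related items.
composite_score = {
    student_id: sum(items) for student_id, items in responses.items()
}
print(composite_score)
```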
Continuous Variable
Continuous variables can take any numeric value, including decimals (e.g. 1.1, 7.0, -30.4), and represent a measurement (e.g. height, math score).
Descriptive Statistics
Descriptive statistics, or descriptives, refer to a summary of data that provides useful insights into the characteristics of the sample. In the case of categorical and discrete variables, descriptives generally include the mode, range, and frequency table of the responses. For continuous variables, descriptives typically include the mode, median, mean, and range, along with the distribution (such as a histogram or density curve) of the responses. By providing an overview of common values, descriptive statistics can help in understanding the data better.
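The summaries listed above can be computed with Python's standard `statistics` module. The absence counts and heights below are made-up values used only to show the calculations.

```python
import statistics
from collections import Counter

# Hypothetical sample data.
absences = [0, 2, 2, 3, 5, 2, 0]               # discrete variable (counts)
heights = [120.5, 118.0, 125.2, 118.0, 130.1]  # continuous variable (cm)

# Discrete variable: mode, range, and frequency table.
discrete_summary = {
    "mode": statistics.mode(absences),
    "range": (min(absences), max(absences)),
    "frequencies": dict(sorted(Counter(absences).items())),
}

# Continuous variable: mode, median, mean, and range.
continuous_summary = {
    "mode": statistics.mode(heights),
    "median": statistics.median(heights),
    "mean": statistics.mean(heights),
    "range": (min(heights), max(heights)),
}

print(discrete_summary)
print(continuous_summary)
```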
Discrete Variables
Discrete variables have integer values. These are often counts, like the number of absences in a year or the number of children in a school.
HLM – Hierarchical Linear Modeling
Hierarchical linear models, also known as multilevel or mixed-effects models, are statistical models that account for hierarchically structured data. In this type of data structure, a base unit of analysis is nested within larger units, with multiple micro-level units sampled for each macro-level unit. For example, a hierarchical linear model of student achievement could have a model for students (taking into account variables such as income or gender), a model for the classrooms that the students attend (factoring in variables like classroom size), and a model for the schools that the classrooms are in (accounting for variables like urban versus rural school sites). For further clarity, please refer to the “Unit of Analysis” section.
Levels
Levels of data/models refer to the different layers of the units. In the example above, the model would have three levels: a student level (level 1), a classroom level (level 2), and a school level (level 3). The levels are linked via IDs, meaning that each student (base unit of analysis) has a student ID, a classroom ID, and a school ID.
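The ID linkage can be pictured with a small student-level table. In this sketch the field names (`student_id`, `classroom_id`, `school_id`) and all values are hypothetical; the point is only that the IDs on each level-1 row are enough to recover the nesting.

```python
from collections import defaultdict

# Hypothetical level-1 data: one row per student, carrying the IDs
# that link the student to a classroom (level 2) and a school (level 3).
students = [
    {"student_id": 1, "classroom_id": "A", "school_id": "X"},
    {"student_id": 2, "classroom_id": "A", "school_id": "X"},
    {"student_id": 3, "classroom_id": "B", "school_id": "X"},
    {"student_id": 4, "classroom_id": "C", "school_id": "Y"},
]

# Level 2: students nested in classrooms.
students_by_classroom = defaultdict(list)
# Level 3: classrooms nested in schools.
classrooms_by_school = defaultdict(set)

for row in students:
    students_by_classroom[row["classroom_id"]].append(row["student_id"])
    classrooms_by_school[row["school_id"]].add(row["classroom_id"])

print(dict(students_by_classroom))
print({school: sorted(rooms) for school, rooms in classrooms_by_school.items()})
```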
Modeling
Modeling is a statistical technique that assesses the relationship between outcome and predictor variables. Models aim to reflect the true data-generating process. Examples of modeling approaches include regression, Bayesian inference, trees, neural networks, and more.
Regression
Regression describes a type of modeling that measures the relationship between the mean value of an outcome (e.g. reading scores) and the values of other variables (e.g. absences, math scores).
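As a minimal sketch, the snippet below fits a simple linear regression with one predictor by ordinary least squares, written out by hand. The absence and reading-score values are invented and lie exactly on a line so that the result is easy to check; `fit_simple_ols` is a hypothetical helper name.

```python
# Hypothetical data: reading score falls by 2 points per absence.
absences = [0, 1, 2, 3, 4]
reading = [90.0, 88.0, 86.0, 84.0, 82.0]


def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_simple_ols(absences, reading)

# Predicted mean reading score for a student with 2.5 absences.
predicted = intercept + slope * 2.5
print(intercept, slope, predicted)
```

The slope is the expected change in the mean outcome for a one-unit change in the predictor, here a change of -2 reading points per additional absence.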
Wide vs. Long Format Data
Wide format data has one observation (row) for each unit of analysis (e.g. child), and multiple variables (columns) for each wave of data collection. Long format data has multiple observations (rows) for each unit of analysis (e.g. child), and a single variable (column) with multiple entries for each wave of data collection. See the example below.
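The following sketch shows the same hypothetical scores in both layouts and converts wide to long. The column naming scheme (`score_w1`, `score_w2`) and the helper `wide_to_long` are assumptions for illustration.

```python
# Wide format: one row per child, one score column per wave.
wide = [
    {"child_id": 1, "score_w1": 10, "score_w2": 12},
    {"child_id": 2, "score_w1": 8, "score_w2": 11},
]


def wide_to_long(rows, id_key="child_id", prefix="score_w"):
    """Reshape wide rows into long rows: one row per child per wave."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key.startswith(prefix):
                # The wave number is the part of the name after the prefix.
                wave = int(key[len(prefix):])
                long_rows.append({id_key: row[id_key], "wave": wave, "score": value})
    return long_rows


long_format = wide_to_long(wide)
for row in long_format:
    print(row)
```

The wide table has two rows and two score columns; the long table has four rows and a single score column, with the wave recorded in its own column.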
Unit of Analysis
In a model or dataset, the unit of analysis is what each observation (row) represents. It can be a time point, an individual, a school, a country, and more. In the example from the Levels entry, the level-1 unit of analysis is the student, meaning that each row of the dataset corresponds to a single student. The level-2 unit of analysis is the classroom, meaning that each row of the dataset corresponds to a classroom in which students are nested. The level-3 unit of analysis is the school, meaning that each row of the dataset corresponds to a school in which classrooms are nested.
[1] Song, M. K., Lin, F. C., Ward, S. E., and Fine, J. P. “Composite Variables: When and How.” Nursing Research 62, no. 1 (2013): 45-49. https://doi.org/10.1097/NNR.0b013e3182741948.