Common Terminology

Coding

.do file 

A .do file is a collection of Stata commands that can be executed either line-by-line or all at once. It is a plain text file, so it can be opened in any text editor, but its commands can only be run in Stata.

.log file 

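In Stata, a .log file preserves a record of the commands executed and the text output they produce, including tables, throughout a given session. Graphs are not stored in the log and must be saved or exported separately. A log saved in Stata's default format (.smcl) is meant to be viewed within Stata, whereas a log saved as plain text (.log) can be opened in any text editor. A log can also be exported as a PDF.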

Codebook

Codebooks provide a comprehensive description of data, detailing its structure, layout, and contents. By outlining variable names, variable labels, question texts, values, and critical notes, a codebook equips you with all the necessary information to understand what each variable measures.

R Program

R is a versatile programming language and software environment for statistical computing and graphics, often used by statisticians and social scientists for data cleaning and analysis. This open-source software allows users to install packages (often loosely called libraries), which contain specialized functions created by others. R is primarily operated through a command line interface, so data manipulations are written as code in a console or script [Resource link].

R Script

An R script file contains a series of R commands that manipulate and/or analyze data. The code can be executed either line-by-line in the console or all at once. The script file can be accessed as a plain text file or opened directly in R.

Recoding 

Recoding a variable means systematically changing its values. This is often necessary when two separate waves of data collection used different coding schemes (e.g. the first wave coded female as 1 and the second wave coded female as 2).
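
The wave example above can be sketched in a few lines of Python. This is only an illustration of the idea, not Stata or R syntax. The variable name `sex`, the records, and the assumption that each wave uses codes 1 and 2 in opposite order are all hypothetical.

```python
# Hypothetical records: wave 1 codes female as 1, wave 2 codes female as 2.
wave1 = [{"id": 1, "sex": 1}, {"id": 2, "sex": 2}]
wave2 = [{"id": 3, "sex": 2}, {"id": 4, "sex": 1}]

# Assumed mapping from the wave 2 scheme onto the wave 1 scheme.
WAVE2_TO_WAVE1 = {1: 2, 2: 1}


def recode(rows, variable, mapping):
    """Return copies of rows with `variable` recoded through `mapping`."""
    return [{**row, variable: mapping[row[variable]]} for row in rows]


# After recoding, both waves share one scheme and can be combined safely.
combined = wave1 + recode(wave2, "sex", WAVE2_TO_WAVE1)
print([row["sex"] for row in combined])
```

Because `recode` returns new rows, the original wave 2 data stays untouched, which makes it easier to check the recoded values against the source.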

Renaming 

Renaming a variable means changing its name (e.g. replacing MATH1 with a more descriptive name) while leaving its values unchanged.

Stata Program 

Stata is a general-purpose statistical software package that is often used by social scientists for data cleaning and analysis. It has a graphical user interface, meaning analyses can be run through built-in menus and dialog boxes, and a command line interface, meaning commands can also be written as code [Reference link].

Transpose / reshape data 

Transposing data means turning the rows of a dataset into its columns, and its columns into its rows. Reshaping data means more generally changing the way the data is organized into rows and columns (e.g. converting from wide format, with one row per respondent, to long format, with one row per respondent per wave, and vice versa).
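
To make the two operations concrete, here is a small Python sketch under stated assumptions. The data, the variable stub `math`, and the helper names `wide_to_long` and `transpose` are hypothetical, and the code illustrates the concepts rather than any Stata or R command.

```python
# Hypothetical wide-format data: one row per student, one column per wave.
wide = [
    {"id": 1, "math1": 50, "math2": 55},
    {"id": 2, "math1": 60, "math2": 58},
]


def wide_to_long(rows, id_var, stub):
    """Reshape wide rows into long rows: one row per id and wave."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            suffix = name[len(stub):]
            # Keep only columns named <stub><number>, e.g. math1, math2.
            if name.startswith(stub) and suffix.isdigit():
                long_rows.append({id_var: row[id_var], "wave": int(suffix), stub: value})
    return long_rows


def transpose(table):
    """Swap the rows and columns of a rectangular list of lists."""
    return [list(column) for column in zip(*table)]


long_data = wide_to_long(wide, "id", "math")
flipped = transpose([[1, 50, 55], [2, 60, 58]])
print(long_data)
print(flipped)
```

Reshaping keeps track of which respondent and wave each value belongs to, while transposing simply flips the grid without regard to what the rows and columns mean.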

Values and Value Labels 

Each variable has a set of values that make up the possible “answers” or “responses” for that measure. Values fill the cells of a dataset and are the codes actually stored for the variable (e.g., 0, 1, 2). Value labels provide the textual descriptions of those codes (e.g., male, female, non-binary).
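
The relationship between stored values and their labels can be illustrated with a short Python sketch. The data are hypothetical, and pairing 0, 1, 2 with male, female, non-binary in that order is an assumption made for the example.

```python
from collections import Counter

# Hypothetical coded values for one variable, as stored in the dataset cells.
gender_values = [0, 1, 2, 1, 0]

# Assumed value labels: the text attached to each code.
gender_labels = {0: "male", 1: "female", 2: "non-binary"}

# Count the stored codes, then report the counts under their labels.
counts = Counter(gender_values)
labelled = {gender_labels[code]: n for code, n in sorted(counts.items())}
print(labelled)
```

The stored data never changes here: the labels are applied only when the results are displayed, which is how value labels typically behave in statistical software.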

Variable Names and Variable Labels 

Variable names are the names or numbers assigned to each variable in the dataset (e.g., GENDER). Variable labels are short descriptions of what the variable measures (e.g., “Gender of the respondent”). They give the researcher enough information to understand what the variable measures without having to consult the codebook.