Codebook

A codebook is a document (usually a table) that describes the variables present in a dataset. Its purpose is to record detailed information on each variable. 

The following information is typically included in a codebook:

  • ID variable(s): which variable(s) contain(s) the unique observation identifier (number or alphanumeric combination)?
  • Data collection variables: which variables contain information about data collection (date of collection, place, researcher, etc.)?
  • Variable name and description: what is the name of the variable in the dataset? What is its full description? Variable names are usually short to facilitate analysis and need to comply with software-specific rules (e.g., do not include special characters or spaces). Full descriptions are useful for identifying the variable in more detail and may include definitions or explanations of acronyms. If the variable is a survey question, the exact wording of the question and instructions may also be indicated here.
  • Variable type: is the variable categorical, ordinal, continuous or textual? It is important to check that the variable is correctly identified as such in the software used for storage or analysis.
  • Variable values: what are the possible values of the variable (categories or numerical range)? If the variable is categorical, what are the labels corresponding to each category? For example, gender may be encoded as 1/2, with 1 corresponding to "Women" and 2 to "Men".
  • Variable unit: what is the unit of the variable (percentage, kilograms, number of people, etc.)?
  • Missing values: how are missing values indicated? It is important to check that the values are identified as such in the software used for storage or analysis. Different types of missing values can be indicated in different ways, for example to distinguish observations for which a specific variable should be empty (for consistency reasons or due to a filter) from observations for which a value was expected but none was coded (data entry error, non-response, etc.).
  • Variable processing: is the variable the result of a data processing step? Is it a score, an index or the result of a calculation? Was it recoded based on other variables? Was it standardised or otherwise transformed?
  • Variable base: which population is the variable based on? Are the data filtered or limited to a subgroup of observations? What is the size of the base?
  • Variable links: is the variable standalone or should it be analysed in conjunction with other variables? For instance, a multiple-choice question in a survey needs to be coded in several related variables and a follow-up question needs to be analysed taking into account the previous answer.
  • Weights: are there any weighting variables? How were they created? When should they be used?
  • Typologies or classifications: is the variable based on an existing classification? What is it and what are the sources or references?
  • Technical information: what is the width of the variable and the specific variable type in the software used for storage/analysis? What are the decimal and thousands separators? What is the number of decimals?
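
Several of the items above (variable type, values and labels, unit, missing-value codes) can also be stored in a machine-readable form and used to check the data. The following is a minimal sketch in Python, not a standard codebook format: the variable names, the missing-value codes -9 and -8, the weight range and the helper `check_row` are all hypothetical and chosen only for illustration.

```python
# Hypothetical codebook: every name, code, label and range below is illustrative.
CODEBOOK = {
    "resp_id": {"description": "Unique respondent identifier", "type": "id"},
    "gender": {
        "description": "Gender of the respondent",
        "type": "categorical",
        "values": {1: "Women", 2: "Men"},
        # Two kinds of missing values, coded differently (assumed codes).
        "missing": {-9: "No response", -8: "Not applicable (filtered)"},
    },
    "weight_kg": {
        "description": "Body weight",
        "type": "continuous",
        "unit": "kilograms",
        "range": (30.0, 250.0),  # assumed plausible range
        "missing": {-9: "No response"},
    },
}


def check_row(row, codebook):
    """Return a list of problems found when comparing one record to the codebook."""
    problems = []
    for name, spec in codebook.items():
        if name not in row:
            problems.append(f"{name}: variable absent from record")
            continue
        value = row[name]
        if value in spec.get("missing", {}):
            continue  # declared missing-value code, not an error
        if "values" in spec and value not in spec["values"]:
            problems.append(f"{name}: undocumented code {value!r}")
        if "range" in spec:
            low, high = spec["range"]
            if not (low <= value <= high):
                problems.append(f"{name}: {value!r} outside {low}-{high}")
    return problems


rows = [
    {"resp_id": "A001", "gender": 1, "weight_kg": 72.5},
    {"resp_id": "A002", "gender": 3, "weight_kg": -9},
    {"resp_id": "A003", "gender": -8, "weight_kg": 400.0},
]

for row in rows:
    print(row["resp_id"], check_row(row, CODEBOOK))
```

In this sketch, the second record is flagged because the code 3 is not documented for `gender`, while its weight of -9 is accepted as a declared missing value. The third record is flagged for a weight outside the assumed range. Keeping the codebook next to the data in this way makes it easier to confirm that the software treats categories and missing values as the codebook describes.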