Practical Predictive Analytics

Data dictionaries

Data dictionaries can be valuable as a source for understanding the types of variables under analysis and how they are measured. Here are some useful metadata items to keep in a data dictionary:

  • Name of the variable: Consistent naming conventions make variables easier to read and understand. Some analysts like to use CamelCase, others like to separate the parts of an object name with periods, as in Variable.data.frame, and others insist on using only lowercase letters.
  • Measurement data: This answers questions such as, "Is the data numeric or categorical?", "How many levels are contained in each category?", and "What is the length of each variable?"
  • Sources of the data: This covers, "Where did the data originally come from?"
  • Transformations: This answers, "How was the data manipulated from its original form to what it is now?"
  • Data quality items: Attaching frequency distributions and summary statistics to each variable will be helpful, along with comments on any questionable data quality.
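
The metadata items above can be sketched as a small record type. This is only an illustration, not a layout prescribed by the text: the class name DictionaryEntry, its field names, and the two sample variables (including their sources) are hypothetical.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class DictionaryEntry:
    """One row of a data dictionary (all field names are illustrative)."""
    name: str                       # variable name, in one consistent convention
    measurement: str                # e.g. "numeric" or "categorical"
    levels: Optional[int] = None    # number of levels, if categorical
    source: str = ""                # where the data originally came from
    transformations: List[str] = field(default_factory=list)
    quality_notes: str = ""         # comments on questionable data quality


# Hypothetical entries, invented for illustration only.
entries = [
    DictionaryEntry(
        name="customer_age",
        measurement="numeric",
        source="hypothetical CRM extract",
        transformations=["capped at 100"],
    ),
    DictionaryEntry(
        name="offer_code",
        measurement="categorical",
        levels=12,
        source="hypothetical campaign table",
        quality_notes="code meanings may have changed over time",
    ),
]

for entry in entries:
    print(asdict(entry))
```

Keeping each entry as a structured record, rather than free text, makes it easy to export the dictionary to a table later and to check that no variable is missing its source or measurement type.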

It is important to keep a data dictionary up to date, because variables or their values can change meaning over time. For example, a marketing offer code of A100 might mean a 15% discount to the customer today, yet closer inspection could reveal that the same A100 code was used five years ago in a different marketing application, where it meant a 10% discount.
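
One way to guard against this kind of drift is to record each code's meaning together with the date it took effect. The following is a minimal sketch under stated assumptions: the effective dates, the CODE_HISTORY table, and the meaning_on function are hypothetical, and only the A100 code and its two discount values come from the example above.

```python
from datetime import date

# Each code maps to a list of (effective_from, meaning) pairs, oldest first.
# The dates are hypothetical placeholders chosen for illustration.
CODE_HISTORY = {
    "A100": [
        (date(2015, 1, 1), "10% discount"),
        (date(2020, 1, 1), "15% discount"),
    ],
}


def meaning_on(code, when):
    """Return the meaning of `code` in effect on `when`, or None if unknown."""
    current = None
    for effective_from, meaning in CODE_HISTORY.get(code, []):
        if effective_from <= when:
            current = meaning  # a later matching entry overrides an earlier one
    return current


print(meaning_on("A100", date(2016, 6, 1)))
print(meaning_on("A100", date(2021, 6, 1)))
```

With a history like this, a record from the older marketing application is interpreted using the meaning that applied when it was created, rather than the current one.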

Here is an example of a simple data dictionary:

As you begin to explore your dataset, you can also add additional columns, such as the percentage of missing values, variable means, extreme values, and so on.
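
As a rough sketch of how such columns might be computed, the following uses only the Python standard library. The sample rows, the variable names, and the profile function are hypothetical, and None is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 51, "income": None},
    {"age": 29, "income": 48000},
]


def profile(rows, variable):
    """Summarize one numeric variable for the data dictionary."""
    values = [row[variable] for row in rows]
    present = [v for v in values if v is not None]
    return {
        "variable": variable,
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
        "mean": mean(present) if present else None,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


summary = [profile(rows, name) for name in ("age", "income")]
for line in summary:
    print(line)
```

Each resulting record can be attached to the matching data dictionary row, so that the percentage of missing values and the extremes sit next to the variable's definition.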