A practical flow for handling missing data
Missing data is one of the most common problems in empirical research, data analysis, and quantitative modeling. In my experience, it often requires substantial time before the research process can begin or before reliable visual tools can be developed.
This blog goes straight to the point, avoiding long theoretical descriptions and focusing on a practical workflow for handling missing data:
Before choosing an imputation method or dropping observations, first understand why the data are missing, what pattern the missingness follows, and how it may affect statistical power, bias, volatility, trends, and final conclusions.
The first step is not to fill the missing values immediately but to extract meaningful insights from the missing data itself.
Missingness can reveal problems in data collection, software systems, reporting processes, survey design, data integration, or the behavior of the units being analyzed.
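As a starting point, a quick summary of how much is missing, and where, already says a lot. The sketch below is a minimal illustration in plain Python, assuming the data is a list of records with None marking a missing value. The records and the helper names missing_share and missing_share_by are hypothetical and exist only for this example.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"source": "web", "income": 52000, "age": 34},
    {"source": "web", "income": 61000, "age": None},
    {"source": "phone", "income": None, "age": 41},
    {"source": "phone", "income": None, "age": 29},
]


def missing_share(rows):
    """Fraction of missing (None) values in each column."""
    return {
        col: sum(row[col] is None for row in rows) / len(rows)
        for col in rows[0]
    }


def missing_share_by(rows, group_col, target_col):
    """Fraction of missing values in target_col within each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_col], []).append(row[target_col] is None)
    return {group: sum(flags) / len(flags) for group, flags in groups.items()}


shares = missing_share(records)
by_source = missing_share_by(records, "source", "income")

for col, share in shares.items():
    print(f"{col}: {share:.0%} missing")
print("income missing by source:", by_source)
```

In this toy data every missing income comes from the phone channel, which points to a collection problem rather than to noise. With pandas, df.isna().mean() gives the same per-column shares.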
Understand the Cause of the Missing Data
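Before applying any treatment method, try to identify why the data are missing. Common causes include data collection failures, software or system errors, gaps in reporting processes, survey design issues, data integration mismatches, and the behavior of the units being analyzed, such as respondents refusing to answer.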
Understanding the cause matters because different causes require different treatments. A missing value generated by a software error is not the same as a missing value generated by a respondent refusing to provide information.
Evaluate the Implications
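Missing data can affect the quality and reliability of the analysis.
The main implications are lower statistical power, biased estimates, distorted volatility and trends, and, as a result, different final conclusions.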
This is especially important in empirical research and quantitative modeling because the treatment of missing data can change the final results.
Identify the Pattern of Missing Data
The pattern describes how missing values appear in the dataset.
Common patterns include:
- Univariate or multivariate: missing values appear in one variable or across several variables.
- Monotone: common in longitudinal studies, where once a unit's value is missing at one point in time, all of its later values are missing as well, as happens when participants drop out.
- Non-monotone: missing values appear irregularly across time or variables.
- Connected or unconnected: missingness may be related across variables or independent across variables.
- Planned: missingness is part of the research or survey design.
- Random: missing values appear without an evident structure.
Identifying the pattern helps to decide whether the missing data problem is simple, systematic, or potentially harmful for the analysis.
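The patterns above can be checked with a small tabulation. The following is a sketch, not a complete diagnostic: it assumes a hypothetical three-wave panel stored as a list of records with None for missing values, and the helpers missing_patterns and is_monotone are names invented for this example.

```python
from collections import Counter


def missing_patterns(rows, columns):
    """Count each observed/missing pattern.

    A pattern is a string with 'O' (observed) or 'M' (missing) per column.
    """
    return Counter(
        "".join("M" if row[col] is None else "O" for col in columns)
        for row in rows
    )


def is_monotone(rows, ordered_columns):
    """True if, in every row, a missing value is never followed by an observed one."""
    for row in rows:
        seen_missing = False
        for col in ordered_columns:
            if row[col] is None:
                seen_missing = True
            elif seen_missing:
                return False
    return True


# Hypothetical panel with three measurement waves.
waves = ["wave1", "wave2", "wave3"]
panel = [
    {"wave1": 3.1, "wave2": 3.4, "wave3": 3.9},
    {"wave1": 2.8, "wave2": 3.0, "wave3": None},
    {"wave1": 4.0, "wave2": None, "wave3": None},
]

patterns = missing_patterns(panel, waves)
monotone = is_monotone(panel, waves)
print(patterns, monotone)
```

A few dominant patterns suggest a systematic problem, while many rare patterns suggest irregular, non-monotone missingness. The monotone check only makes sense when the columns have a natural order, such as time.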
Identify the Type of Missingness
The type of missingness refers to the mechanism behind the missing values.
The three standard types are:
- Missing Completely at Random (MCAR): the probability of missingness is unrelated to observed or unobserved data.
- Missing at Random (MAR): the probability of missingness is related to observed information, but not to the missing value itself.
- Missing Not at Random (MNAR): the probability of missingness is related to the missing value itself or to unobserved factors.
This distinction is important because some methods work well under MCAR or MAR, but can produce biased results under MNAR.
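One simple, partial check is to compare an observed variable between rows where the target is missing and rows where it is observed. The sketch below assumes hypothetical survey rows with None for missing values, and compare_by_missingness is a name made up for this example. A clear difference argues against MCAR, but observed data alone cannot separate MAR from MNAR, so treat the output as a hint rather than a test.

```python
from statistics import mean


def compare_by_missingness(rows, target, covariate):
    """Mean of covariate where target is missing vs. where target is observed."""
    when_missing = [
        row[covariate]
        for row in rows
        if row[target] is None and row[covariate] is not None
    ]
    when_observed = [
        row[covariate]
        for row in rows
        if row[target] is not None and row[covariate] is not None
    ]
    return mean(when_missing), mean(when_observed)


# Hypothetical survey: income is missing only for the older respondents.
survey = [
    {"age": 25, "income": 30000},
    {"age": 30, "income": 42000},
    {"age": 35, "income": 50000},
    {"age": 60, "income": None},
    {"age": 70, "income": None},
]

age_if_missing, age_if_observed = compare_by_missingness(survey, "income", "age")
print(age_if_missing, age_if_observed)
```

Here the respondents with missing income are much older on average, so missingness depends on at least one observed variable and MCAR is not a safe assumption.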
Choose a Handling Approach
After understanding the cause, pattern, and type of missingness, the analyst can choose a treatment method.
Common approaches include:
Imputation
- Single-value imputation, such as mean, median, or fixed-value replacement.
- Multiple imputation, using the distribution and uncertainty of the data.
- Hot-deck imputation, using values from similar non-missing cases.
- Last Observation Carried Forward (LOCF), especially in time-series or longitudinal data.
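Two of the simplest options from this list, median replacement and LOCF, can be sketched in a few lines. This is an illustration in plain Python under the assumption that the series is a list with None for missing values. The function names impute_median and locf are hypothetical.

```python
from statistics import median


def impute_median(values):
    """Replace every missing value with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def locf(values):
    """Carry the last observed value forward; leading gaps stay missing."""
    result, last = [], None
    for v in values:
        if v is not None:
            last = v
        result.append(last)
    return result


series = [None, 10.0, None, None, 14.0, None]

median_filled = impute_median(series)
carried_forward = locf(series)
print(median_filled)
print(carried_forward)
```

Median replacement ignores the ordering of the data, while LOCF depends on it entirely and cannot fill a gap at the start of the series. In pandas, the usual equivalents are Series.fillna with the median and Series.ffill.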
Model-Based Methods
- Interpolation.
- K-Nearest Neighbors.
- Multiple Imputation by Chained Equations (MICE).
- Linear, logistic, or stochastic regression.
- Support Vector Machines.
- Decision Trees.
- Clustering imputation.
- Ensemble methods.
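To make two of these methods concrete, here is a minimal sketch of linear interpolation and a basic K-Nearest Neighbors imputation in plain Python. It assumes None marks missing values, that the features used for distance are fully observed and on comparable scales, and that the data and the names interpolate_linear and knn_impute are hypothetical. A real project would normally rely on a tested library implementation instead.

```python
def interpolate_linear(values):
    """Fill interior gaps on a straight line between the nearest observed points.

    Leading and trailing gaps are left as None.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


def knn_impute(rows, target, features, k=2):
    """Fill a missing target with the mean target of the k nearest complete rows."""
    donors = [row for row in rows if row[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        # Squared Euclidean distance on the (assumed complete) features.
        nearest = sorted(
            donors,
            key=lambda d: sum((d[f] - row[f]) ** 2 for f in features),
        )[:k]
        estimate = sum(d[target] for d in nearest) / len(nearest)
        filled.append({**row, target: estimate})
    return filled


series = [None, 10.0, None, None, 16.0, None]
interpolated = interpolate_linear(series)

people = [
    {"age": 25, "income": 30000.0},
    {"age": 27, "income": 34000.0},
    {"age": 60, "income": 80000.0},
    {"age": 26, "income": None},
]
knn_filled = knn_impute(people, "income", ["age"], k=2)
print(interpolated)
print(knn_filled[-1])
```

Interpolation uses only the position of a value within its own series, while KNN borrows information from similar units. Which one is defensible depends on the cause and type of missingness identified earlier.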
Dropping Observations
In some cases, observations with missing values can be dropped. However, this should be done carefully because it can reduce the sample size, affect representativeness, and introduce bias if the missingness is not random.
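Before dropping rows, it helps to measure what the deletion costs. The sketch below applies listwise deletion to hypothetical survey records (None marks a missing value) and compares the regional mix before and after. The helpers drop_incomplete and group_shares are names chosen for this example only.

```python
from collections import Counter


def drop_incomplete(rows):
    """Listwise deletion: keep only rows with no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row.values())]


def group_shares(rows, group_col):
    """Share of rows that belong to each group."""
    counts = Counter(row[group_col] for row in rows)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


# Hypothetical survey where missing income is concentrated in one region.
survey = [
    {"region": "north", "income": 52000},
    {"region": "north", "income": 61000},
    {"region": "south", "income": None},
    {"region": "south", "income": 48000},
    {"region": "south", "income": None},
]

complete = drop_incomplete(survey)
lost_share = (len(survey) - len(complete)) / len(survey)
shares_before = group_shares(survey, "region")
shares_after = group_shares(complete, "region")

print(f"rows lost: {lost_share:.0%}")
print("before:", shares_before)
print("after:", shares_after)
```

In this toy case the deletion removes 40% of the rows and turns the south from the majority into the minority, which is exactly the kind of shift in representativeness to look for. In pandas, DataFrame.dropna performs the same listwise deletion.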
Validate the Result
Validation is a critical step.
Missing data treatment should be validated to ensure that it does not change the main trend, volatility, relationships, or conclusions of the analysis.
After applying a treatment method, compare the results before and after the adjustment. Check whether the method changes:
- The trend of the series.
- The volatility of the data.
- The distribution of the variables.
- The correlation between variables.
- The main empirical conclusions.
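A before-and-after comparison of basic statistics is a reasonable first validation. The following sketch uses mean replacement on a hypothetical series (None marks a missing value) and a made-up helper called summarize. It covers only the level and the volatility of one variable, so trend, distribution, and correlation checks would need similar comparisons.

```python
from statistics import mean, stdev


def summarize(values):
    """Count, mean, and sample standard deviation of the observed values."""
    observed = [v for v in values if v is not None]
    return {"n": len(observed), "mean": mean(observed), "stdev": stdev(observed)}


original = [10.0, None, 14.0, None, 18.0]

# Mean replacement, used here only to show its side effect on volatility.
fill = mean([v for v in original if v is not None])
imputed = [fill if v is None else v for v in original]

before = summarize(original)
after = summarize(imputed)

print("before:", before)
print("after: ", after)
```

The mean is unchanged, but the standard deviation falls from 4.0 to about 2.83 because the filled values sit exactly at the center of the data. A treatment that shrinks volatility this much would need to be justified or replaced.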
The goal is not only to fill missing values. The goal is to make a defensible analytical decision that preserves the integrity of the research.
Handling missing data is not only a technical step. It is part of the analytical process.
Before imputing, interpolating, modeling, or dropping observations, the analyst should understand the missing data problem. The correct approach depends on the cause, the pattern, the type of missingness, and the potential impact on the conclusions.
A practical rule is simple:
Do not treat missing data mechanically. First understand it, then decide how to handle it.
Reference sources:
Princeton University. (n.d.). In R: Missing data. Princeton University. Retrieved from https://libguides.princeton.edu/R-Missingdata
Alayo, B. (2023, February 12). Missing data: Causes, types, and handling techniques. LinkedIn. Retrieved from https://www.linkedin.com/pulse/missing-data-causes-types-handling-techniques-bilikis-alayo-ho9if/
Masters in Data Science. (n.d.). How to deal with missing data. Retrieved from https://www.mastersindatascience.org/learning/how-to-deal-with-missing-data/
Cook, A. B. (2023, August 12). Missing values. Kaggle. Retrieved from https://www.kaggle.com/code/alexisbcook/missing-values
National Center for Biotechnology Information. (2019). Types of missing data. In NCBI Bookshelf. Retrieved from https://www.ncbi.nlm.nih.gov/books/NBK493614/
