Missing values are unavoidable when working with data. Their occurrence is exacerbated as more data from different sources become available. However, most statistical models and visualization methods require complete data, and improper handling of missing data results in information loss or biased analyses. Since the seminal work of Rubin (1976), a burgeoning literature on missing values has arisen, with heterogeneous aims and motivations. This led to the development of various methods, formalizations, and tools. For practitioners, however, it remains a challenge to decide which method is most appropriate for their problem, in part because this topic is not systematically covered in statistics or data science curricula. To help address this challenge, we have launched the `R-miss-tastic` platform, which aims to provide an overview of standard missing values problems, methods, and relevant implementations of methodologies. Beyond gathering and organizing a large majority of the material on missing data (bibliography, courses, tutorials, implementations), `R-miss-tastic` covers the development of standardized analysis workflows. Indeed, we have developed several pipelines in R and Python to allow for hands-on illustration of, and recommendations on, missing values handling in various statistical tasks such as matrix completion, estimation, and prediction, while ensuring reproducibility of the analyses. Finally, the platform is dedicated to users who analyze incomplete data, researchers who want to compare their methods and search for an up-to-date bibliography, and teachers who are looking for didactic materials (notebooks, recordings, lecture notes).

Missing data are unavoidable as soon as collecting or acquiring data is involved. They occur for many reasons: individuals choose not to answer survey questions, measurement devices fail, or data are simply not recorded. Their presence becomes even more important as data are now obtained at increasing velocity and volume, and from heterogeneous sources not originally designed to be analyzed together. As pointed out by Zhu et al. (2019), “one of the ironies of working with Big Data is that missing data play an ever more significant role, and often present serious difficulties for analysis”. Despite this, the approach most commonly implemented by default in software is to toss out cases with missing values. At best, this is inefficient because it wastes information from the partially observed cases. At worst, it results in biased estimates, particularly when the distributions of the missing values are systematically different from those of the observed values (e.g., Enders 2010).
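How wasteful this default case deletion can be is easy to see on a toy example (synthetic data, a minimal sketch rather than any of the platform's workflows): even a modest fraction of scattered missing entries can eliminate most of the rows.

```python
import numpy as np

# A small matrix in which 3 of 12 entries (25%) are missing,
# scattered across different rows.
X = np.array([
    [1.0,    2.0,    np.nan],
    [np.nan, 5.0,    6.0],
    [7.0,    8.0,    9.0],
    [10.0,   np.nan, 12.0],
])

# Complete-case (listwise deletion) analysis keeps only rows with no NaN.
complete_rows = ~np.isnan(X).any(axis=1)
X_cc = X[complete_rows]            # only 1 of 4 rows survives
lost = 1 - complete_rows.mean()    # 75% of cases discarded for 25% missing entries
```

With only a quarter of the entries missing, three quarters of the cases are thrown away, along with all the information their observed values carried.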

However, handling missing data more efficiently and appropriately than
simply restricting the analysis to the complete cases has attracted
a lot of attention in the literature over the last two decades. In
particular, a number of reference books have been published
(Schafer and Graham 2002; Carpenter and Kenward 2012; Buuren 2018; Little and Rubin 2019)
and the topic is an active field of research (Josse and Reiter 2018). The
diversity of the missing data problems means there is great variety in
the proposed and studied methods. They include model-based approaches
that integrate likelihoods or posterior distributions over the missing
values; imputation, which fills in missing values in a realistic way,
either once (single imputation) or several times (multiple imputation);
and weighting of observations, appealing to ideas from the
design-based literature in survey sampling. The multiplicity of the
available solutions makes sense because there is no single solution or
tool to manage missing data: the appropriate methodology to handle them
depends on many features, such as the objective of the analysis, the
type of data, the type of missing data and their pattern. Some of these
methods are available in various software solutions. As one of the
main software environments for statisticians and data scientists, with
development starting almost three decades ago (Ihaka 1998), R offers
the largest number of implemented approaches. This is also due to the
ease of incorporating new methods into R and to its modular packaging system.
Currently, there are over 270 R packages on CRAN that mention missing
data or imputation in their DESCRIPTION files. These packages serve many
different applications, data types, or types of analysis. For example,
exploratory and visualization tools for missing data are available in
packages like *naniar*, *VIM*, and *MissingDataGUI*
(Cheng et al. 2015; Kowarik and Templ 2016; Tierney and Cook 2018; Tierney et al. 2021). Imputation methods are
included in packages like *mice*, *Amelia*, and *mi*
(Gelman and Hill 2011; Honaker et al. 2011; van Buuren and Groothuis-Oudshoorn 2011). Other packages focus on dealing with complex,
heterogeneous (categorical, quantitative, ordinal variables) data or
with large dimensional multi-level data, such as *missMDA*, and
*MixedDataImpute* (Murray and Reiter 2015; Josse et al. 2016). Besides R, other
languages such as Python (Van Rossum and Drake 2009), which currently
have only a few publicly available implementations of methods for handling
missing values, offer a growing number of solutions. Two prominent examples
are: 1) the *scikit-learn* library (Pedregosa et al. 2011) which has
recently added a module for missing values imputation; and 2) the
*DataWig* library (Biessmann et al. 2018) which provides a framework to learn to
impute incomplete data tables.

Despite the large range of options, missing data are often not handled appropriately by practitioners, for several reasons. First, the plethora of options is a double-edged sword: while it is valuable to have many methods available, identifying the most appropriate one among them is challenging. Second, the topic of missing data is itself often missing from statistics and data science syllabuses, despite its omnipresence in data. So, when faced with missing data, practitioners are left powerless: quite possibly never having been taught about missing data, they do not know how to approach the problem, what the dangers of mishandling it are, or how to navigate the available methods and software to choose the most appropriate workflow.

To help promote better management and understanding of missing data, we
have released `R-miss-tastic`, an open platform for missing values. The
platform takes the form of a reference website^{1}, which collects,
organizes, and produces material on missing data. It has been conceived
by an infrastructure steering committee working group (ISC; its members
are authors of this article), which first provided a CRAN Task View^{2}
on missing data^{3} that lists and organizes existing R packages on the
topic. The `R-miss-tastic` platform extends and builds on the CRAN Task
View by collecting, creating, and organizing articles, tutorials,
documentation, and workflows for analyses with missing data.

This platform is easily extendable and well documented, allowing it to
seamlessly incorporate future work and research on missing values. The
intent of the platform is to foster a welcoming community, within and
beyond the R community. `R-miss-tastic` has been designed to be
accessible to a wide audience with different levels of prior knowledge,
needs, and questions. This includes students, teachers, statisticians,
and researchers. Students can use its content as complementary course
material. Teachers can use it as a reference website for their own
classes. Statisticians and researchers can find example analysis
workflows, or even contribute information for specific areas and find
collaborators.

The platform provides new tutorials, examples, and analysis pipelines for missing data that we have developed. These have been implemented in R and in Python, covering standard methods for generating missing values and for analyzing incomplete data from different perspectives. In addition, we reference publicly available datasets that are commonly used as benchmarks for new missing values methodologies. The developed pipelines cover the entirety of a data analysis: exploratory analyses, establishing statistical and machine learning models, analysis diagnostics, and finally interpreting results obtained from incomplete data. We hope these pipelines also serve as a guide when choosing a method to handle missing values.

The remainder of the article is organized as follows: In the section entitled “Structure and content of the platform” we describe the different components of the platform, the structure that has been chosen, and the target audience. The section is organized like the platform itself, starting with materials for less advanced users, then materials for researchers, and finally resources for practical implementation. We then detail the implementation and use cases of the provided R and Python workflows in the following section, entitled “Details of the missing values workflows”. Finally, in the conclusion, we give an overview of planned future developments for the platform and of interesting areas in missing values research that we would like to bring to a wider audience.

The `R-miss-tastic` platform is released at
https://rmisstastic.netlify.com/. It has been developed using the R
package *blogdown* (Xie et al. 2017), which generates static websites
using Hugo^{4}. Live examples have been included using the tool
https://rdrr.io/snippets/ provided by the `R Package Documentation`
website. The source code and materials of the platform
have been made publicly available on GitHub at
https://github.com/R-miss-tastic, which provides a transparent record
of the platform’s development and facilitates contributions from the
community.

We now discuss the structure of the `R-miss-tastic` platform, the aim
and content of each subsection, and highlight key features of the
platform.

An important contribution and novelty of this work is the proposal of several workflows that allow for a hands-on illustration of classical analyses with missing values, both on simulated data and on publicly available real-world data. These workflows are provided in both R and Python and cover the following topics:

- *How to generate missing values?* Generate missing values under different mechanisms, on complete or incomplete datasets. This is useful when performing simulations to compare methods that impute or handle missing data.
- *How to do statistical inference with missing values?* In particular, we focus on different solutions for estimating linear and logistic regression parameters with missing covariate values (maximum likelihood or multiple imputation).
- *How to impute missing values?* We compare different single imputation/matrix completion methods (e.g., using conditional models, low-rank models, etc.).
- *How to predict with missing values?* We consider building predictive models, e.g., using random forests (Breiman 2001), on data with incomplete predictors. The workflows present different strategies to deal with missing values in the covariates, both in the training set and in the test set.
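As a concrete illustration of the first topic, amputing a complete dataset under the simplest mechanism, missing completely at random (MCAR), takes only a few lines. The sketch below is illustrative; the function name `produce_mcar` and its interface are hypothetical and do not reproduce the platform's actual code.

```python
import numpy as np

def produce_mcar(X, prop=0.2, seed=None):
    """Return a copy of X with entries set to NaN completely at random (MCAR).

    Each entry is masked independently with probability `prop`, so the
    missingness does not depend on the data values -- the defining
    property of the MCAR mechanism.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).copy()
    mask = rng.random(X.shape) < prop
    X[mask] = np.nan
    return X

X = np.arange(20.0).reshape(5, 4)
X_miss = produce_mcar(X, prop=0.3, seed=0)
```

MAR and MNAR mechanisms, where the masking probability depends on observed or unobserved values respectively, require only replacing the uniform mask with one driven by (functions of) the data.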

The aim of these workflows is threefold: 1) they provide a practical implementation of concepts and methods discussed in the lectures and bibliography sections of the platform; 2) they are implemented in a generic way, allowing for re-use on other datasets and for integration of other estimation or imputation methods; 3) the explicit distinction between inference, imputation, and prediction reminds the user that the appropriate solutions for these tasks are not the same.
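To make the imputation task concrete, the simplest single-imputation baseline, column-mean imputation, can be sketched as follows. This is a minimal illustration in plain numpy, not the platform's workflow code, which compares it against richer conditional and low-rank methods.

```python
import numpy as np

def impute_mean(X):
    """Replace each NaN by the observed mean of its column.

    Mean imputation is the most basic single-imputation baseline: it
    preserves column means but shrinks variances and distorts
    correlations, which is why it mainly serves as a point of comparison.
    """
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)       # per-column mean over observed values
    rows, cols = np.where(np.isnan(X))      # positions of the missing entries
    X[rows, cols] = col_means[cols]         # fill each NaN with its column mean
    return X

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])
X_imp = impute_mean(X)  # NaNs become 2.0 (column 0 mean) and 3.0 (column 1 mean)
```

The same interface shape (incomplete matrix in, completed matrix out) applies to the more sophisticated single-imputation and matrix-completion methods the workflows compare.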

Furthermore, the workflows allow for a transparent and open discussion about the proposed implementations, which can be followed on the project GitHub repository, referencing proposals and discussions about practicable extensions of the workflows.

Additionally, a workflow on *How to do causal inference with incomplete
covariates/attributes in R?* demonstrates simple weighting and doubly
robust estimators for treatment effect estimation using R. This workflow
is based on the R implementation of the methodology proposed by
Mayer et al. (2020).
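The simple weighting estimator mentioned above can be illustrated on synthetic data with known propensity scores. This is a minimal sketch of inverse-propensity weighting (IPW) under the assumption of fully observed covariates, not the implementation of Mayer et al. (2020), which additionally handles incomplete confounders.

```python
import numpy as np

def ipw_ate(y, t, e):
    """Inverse-propensity-weighting estimate of the average treatment effect.

    y: outcomes; t: binary treatment indicator; e: propensity scores P(T=1|X).
    This is the simple weighting estimator; the doubly robust variant adds an
    outcome model and remains consistent if either model is well specified.
    """
    y, t, e = map(np.asarray, (y, t, e))
    return np.mean(t * y / e - (1 - t) * y / (1 - e))

# Toy check: randomized treatment (constant propensity 0.5) and a
# constant additive treatment effect of +1 on the outcome.
rng = np.random.default_rng(0)
t = rng.integers(0, 2, size=10_000)
y = t * 1.0 + rng.normal(0, 0.1, size=10_000)
tau_hat = ipw_ate(y, t, np.full(10_000, 0.5))  # close to 1.0
```

With incomplete covariates, the propensity model itself must account for the missingness, which is precisely what the workflow demonstrates.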

We provide a more detailed view of the proposed workflows in a later section, with examples of the tabular and graphical outputs that can be obtained, as well as recommendations on how to interpret and leverage these outputs.

For someone unfamiliar with missing data, it is a challenge to know
where to begin the journey of understanding them, and the methods to
handle them. This challenge is addressed with `R-miss-tastic`, which
makes the material to get started easily accessible.

Teaching and workshop material takes many forms – slides, course notes,
lab workshops, video tutorials, and in-depth seminars. The material is of
high quality and has been generously contributed by numerous renowned
researchers who investigate the problems of missing values, many of whom
are professors who have designed introductory and advanced classes on
statistical analyses with missing data. This makes the material on the
`R-miss-tastic` platform well suited for both beginners and more
experienced users.

These teaching and workshop materials are described as “lectures”, and are organized into five sections:

General lectures: Introduction to statistical analyses with missing values; the role of visualization and exploratory data analysis for understanding missingness and guiding its handling; theory and concepts are covered, such as missing values mechanisms, likelihood methods, and imputation.

Multiple imputation: Introduction to popular methods of multiple imputation (joint modeling and fully conditional), how to correctly perform multiple imputation and limits of imputation methods.

Principal component methods: Introduction to methods exploiting low-rank type structures in the data for visualization, imputation and estimation.

Specific data or application types: Lectures covering in detail various sub-problems, such as missing values in *time series*, in *surveys*, or in treatment effect estimation (*causal inference*). Indeed, certain data types require adaptations of standard missing values methods (e.g., handling time dependence in time series (Moritz and Bartz-Beielstein 2017)) or additional assumptions about the impact of missing values (such as the impact on confounded treatment effects in causal inference (Mayer et al. 2020)). More in-depth material, e.g., video recordings from a virtual workshop on *Missing Data Challenges in Computation, Statistics and Applications*^{5} held in 2020, is also available.

Implementations: A non-exhaustive list of detailed vignettes describing functionalities of R packages and Python modules that implement some of the statistical analysis methods covered in the other lectures. For example, the functionalities and possible applications of the *missMDA* R package are presented in a brief summary, allowing the reader to compare the main differences between this package and the *mice* package, which is summarized using the same format.

Figure 1 illustrates two views of the lectures page:
Figure 1(a) shows a collapsed view presenting the
different topics, while Figure 1(b) shows an example of the
expanded view of one topic (General tutorials), with a detailed
description of one of the lectures (obtained by clicking on its title),
`Analysis of missing values` by Jae-Kwang Kim. Each lecture can contain
several documents (as is the case for this one) and is briefly described
by a header presenting its purpose.

Lectures that we found very complete and thus highly recommend are:

- *Statistical Methods for Analysis with Missing Data* by Mauricio Sadinle (in `General tutorials`);
- *Missing Values in Clinical Research – Multiple Imputation* by Nicole Erler (in `Multiple imputation`);
- *Handling missing values in PCA and MCA* by François Husson (in `Missing values and principal component methods`);
- *Modern use of Shared Parameter Models for Dropout (in longitudinal and time-to-event data)* by Dimitris Rizopoulos (in `Specific data or application types`).