R-miss-tastic: a unified platform for missing values methods and workflows

Missing values are unavoidable when working with data. Their occurrence is exacerbated as more data from different sources become available. However, most statistical models and visualization methods require complete data, and improper handling of missing data results in information loss or biased analyses. Since the seminal work of Rubin (1976), a burgeoning literature on missing values has arisen, with heterogeneous aims and motivations. This led to the development of various methods, formalizations, and tools. For practitioners, it nevertheless remains challenging to decide which method is most suited for their problem, partially due to a lack of systematic coverage of this topic in statistics or data science curricula. To help address this challenge, we have launched the "R-miss-tastic" platform, which aims to provide an overview of standard missing values problems, methods, and relevant implementations of methodologies. Beyond gathering and organizing a large majority of the material on missing data (bibliography, courses, tutorials, implementations), "R-miss-tastic" covers the development of standardized analysis workflows. Indeed, we have developed several pipelines in R and Python to allow for hands-on illustration of and recommendations on missing values handling in various statistical tasks such as matrix completion, estimation and prediction, while ensuring reproducibility of the analyses. Finally, the platform is dedicated to users who analyze incomplete data, researchers who want to compare their methods and search for an up-to-date bibliography, and also teachers who are looking for didactic materials (notebooks, video, slides).

First, while it is great to have many options, finding the most appropriate method is challenging because there are so many. Second, the topic of missing data is often itself missing from many statistics and data science syllabuses, despite its omnipresence in data. So, when faced with missing data, practitioners are left powerless; quite possibly never having been taught about missing data, they do not know how to approach the problem, the dangers of mismanagement, how to navigate the methods and software, or how to choose the most appropriate method or workflow.
To help promote better management and understanding of missing data, we have released 'R-miss-tastic', an open platform for missing values. The platform takes the form of a reference website, which collects, organizes and produces material on missing data. It has been conceived by an infrastructure steering committee working group (ISC; its members are authors of this article), which first provided a CRAN Task View on missing data that lists and organizes existing R packages on the topic. The 'R-miss-tastic' platform extends and builds on the CRAN Task View by collecting, creating and organizing articles, tutorials, documentation, and workflows for analyses with missing data.
This platform is easily extendable and well documented, allowing it to seamlessly incorporate future works and research in missing values. The intent of the platform is to foster a welcoming community, within and beyond the R community. 'R-miss-tastic' has been designed to be accessible for a wide audience with different levels of prior knowledge, needs, and questions. This includes students, teachers, statisticians, and researchers. Students can use it as complementary course materials. Teachers can use it as a reference website for their own classes. Statisticians and researchers can find example analysis workflows, or even contribute information for specific areas and find collaborators.
The platform provides new tutorials, examples and pipelines of analyses with missing data that we have developed, spanning the entirety of an analysis. These have been developed in R and in Python, implementing standard methods for generating missing values, and for analyzing them under different perspectives. In addition, we reference publicly available datasets that are commonly used as benchmarks for new missing values methodologies. The developed pipelines cover the entirety of a data analysis: exploratory analyses, establishing statistical and machine learning models, analysis diagnostics, and finally interpreting results obtained from incomplete data. We hope these pipelines also serve as a guide when choosing a method to handle missing values.
The remainder of the article is organized as follows. In the section entitled "Structure and content of the platform" we describe the different components of the platform, the structure that has been chosen, and the target audience. The section is organized as the platform itself, starting by describing materials for less advanced users, then materials for researchers, and finally resources for practical implementation. We then detail the implementation and use-cases of the provided R and Python workflows in the following section entitled "Details of the missing values workflows". Finally, in the conclusion, we give an overview of planned future developments for the platform and of interesting areas in missing values research that we would like to bring to a wider audience.

Structure and content of the platform
The 'R-miss-tastic' platform is released at https://rmisstastic.netlify.com/. It has been developed using the R package blogdown (Xie et al., 2017), which generates static websites using Hugo. Live examples have been included using the tool https://rdrr.io/snippets/ provided by the website 'R Package Documentation'. The source code and materials of the platform have been made publicly available on GitHub at https://github.com/R-miss-tastic, which provides a transparent record of the platform's development, and facilitates contributions from the community.
We now discuss the structure of the 'R-miss-tastic' platform, the aim and content of each subsection, and highlight key features of the platform.

Missing values workflows
An important contribution and novelty of this work is the proposal of several workflows that allow for a hands-on illustration of classical analyses with missing values, both on simulated data and on publicly available real-world data. These workflows are provided both in R and in Python code and cover the following topics:
• How to generate missing values? Generate missing values under different mechanisms, on complete or incomplete datasets. This is useful when performing simulations to compare methods that impute or handle missing data.
• How to do statistical inference with missing values? In particular, we focus on different solutions for estimating linear and logistic regression parameters with missing covariate values (maximum likelihood or multiple imputation).
• How to impute missing values? We compare different single imputation/matrix completion methods (for instance using conditional models, low-rank models, etc.).
• How to predict with missing values? We consider building predictive models, e.g. using random forests (Breiman, 2001), on data with incomplete predictors. The workflows present different strategies to deal with missing values in the covariates both in the training set and in the test set.
The aim of these workflows is threefold: 1) they provide a practical implementation of concepts and methods discussed in the lectures and bibliography sections of the platform; 2) they are implemented in a generic way, allowing for re-use on other datasets and for integration of other estimation or imputation methods; 3) the distinction between inference, imputation, and prediction reminds the user that the appropriate solutions differ across these tasks.
Furthermore, the workflows allow for a transparent and open discussion about the proposed implementations, which can be followed on the project GitHub repository, referencing proposals and discussions about practicable extensions of the workflows.
Additionally, a workflow on How to do causal inference with incomplete covariates/attributes in R? demonstrates simple weighting and doubly robust estimators for treatment effect estimation using R. This workflow is based on the R implementation of the methodology proposed by Mayer et al. (2020).
We provide a more detailed view on the proposed workflows in a later section, with examples of tabular or graphical outputs that can be obtained as well as recommendations on how to interpret and leverage these outputs.

Missing values lectures
For someone unfamiliar with missing data, it is a challenge to know where to begin the journey of understanding them, and the methods to handle them. This challenge is addressed with 'R-miss-tastic', which makes the material to get started easily accessible.
Teaching and workshop material takes many forms, from slides, course notes and lab workshops to video tutorials and in-depth seminars. The material is of high quality, and has been generously contributed by numerous renowned researchers who investigate the problems of missing values, many of whom are professors having designed introductory and advanced classes for statistical analyses with missing data. This makes the material on the 'R-miss-tastic' platform well suited for both beginners and more experienced users. These teaching and workshop materials are described as 'lectures', and are organized into five sections:
1. General lectures: introduction to statistical analyses with missing values; the role of visualization and exploratory data analysis for understanding missingness and guiding its handling; theory and concepts are covered, such as missing values mechanisms, likelihood methods, and imputation.
2. Multiple imputation: introduction to popular methods of multiple imputation (joint modeling and fully conditional), how to correctly perform multiple imputation and limits of imputation methods.
3. Principal component methods: introduction to methods exploiting low-rank type structures in the data for visualization, imputation and estimation.
4. Specific data or applications types: lectures covering in detail various sub-problems such as missing values in time series, in surveys, or in treatment effect estimation (causal inference). Indeed, certain data types require adaptations of standard missing values methods (for instance handling time dependence in time series (Moritz and Bartz-Beielstein, 2017)) or additional assumptions about the impact of missing values (such as the impact on confounded treatment effects in causal inference (Mayer et al., 2020)). This section also contains more in-depth material, for instance video recordings from a virtual workshop on Missing Data Challenges in Computation, Statistics and Applications held in 2020.
5. Implementations: a non-exhaustive list of detailed vignettes describing functionalities of R packages and of Python modules that implement some of the statistical analysis methods covered in the other lectures. For instance, the functionalities and possible applications of the missMDA R package are presented in a brief summary, allowing the reader to compare the main differences between this package and the mice package, which is summarized using the same format.
Figure 1 illustrates two views of the lectures page: Figure 1A shows a collapsed view presenting the different topics, Figure 1B shows an example of the expanded view of one topic (General tutorials), with a detailed description of one of the lectures (obtained by clicking on its title), 'Analysis of missing values' by Jae-Kwang Kim. Each lecture can contain several documents (as is the case for this one) and is briefly described by a header presenting its purpose.
Lectures that we found very complete and thus highly recommend are:
• Statistical Methods for Analysis with Missing Data by Mauricio Sadinle (in 'General tutorials');
• Modern use of Shared Parameter Models for Dropout (in longitudinal and time-to-event data) by Dimitris Rizopoulos (in 'Specific data or application types').
The purpose of these lectures is to provide either an introduction or a deeper understanding of the statistical problems and proposed solutions in terms of their (mathematical) derivation and theoretical scope. So there is less focus on practical demonstrations with real data, or a systematic comparison of all methods for the same problems. This is covered in the section presenting in detail the developed workflows.

References on missing values
Complementary to the Lectures section, this part of the platform serves as a broad overview of the scientific literature discussing missing values taxonomies and mechanisms, and the statistical and machine learning methods to handle them. This overview covers both classical references to books, articles, etc. such as Schafer and Graham (2002); van Buuren (2018); Carpenter and Kenward (2012); Little and Rubin (2019) and more recent developments such as Josse et al. (2019); Gondara and Wang (2018), which study the consistency of supervised learning with missing values. The entire (non-exhaustive) bibliography can be browsed in two ways: 1) as a complete list, filtered by publication type and year, with a search option for the authors or, 2) as a contextualized version. For 2), we classified the references into several domains of research or application, briefly discussing important aspects of each domain. This dual representation is shown in Figure 2 and allows for an extensive search in the existing literature, while providing some guidance for those focused on a specific topic. All references are also collected in a single BibTeX file made available in the GitHub repository. This shared file allows external users to easily propose additions to the bibliography, which are then reviewed by the platform committee, composed of researchers with different focuses on missing values.

Missing values implementations

R packages
As mentioned in the introduction, the platform development is based on the release of the MissingData CRAN Task View, which currently lists approximately 150 R packages. The CRAN Task View is continuously updated, adding new R packages, and removing obsolete ones. Packages are organized by topic: exploration of missing data, likelihood based approaches, single imputation, multiple imputation, weighting methods, specific types of data, and specific application fields. We selected only sufficiently mature and stable packages already published on CRAN or Bioconductor. This ensures all listed packages can easily be installed and used by practitioners.
Even though the Task View classifies packages into different sub-domains, it may still be a challenge for practitioners and researchers inexperienced with missing values to choose the most relevant package for their application. To address this challenge, we provide a partial and slightly more detailed overview of existing R packages, selecting the most popular and versatile ones. This overview is a blend of the Task View, and of the individual package description pages and vignettes as provided on CRAN or Bioconductor. For each selected package (7 at the date of writing of this article: imputeTS, mice, missForest, missMDA, naniar, simputation and VIM), we provide a category (in the style of the categorization in the Task View), a short description of use-cases, its description (as on CRAN), the usual CRAN statistics (number of monthly downloads, last update), the handled data formats (e.g., data.frame, matrix, vector), a list of implemented algorithms (e.g., k-means, PCA, decision tree), the list of available datasets, some references (such as articles and books), and a small chunk of code, ready for direct execution on the platform via the website 'R Package Documentation'. Figure 3 shows the condensed view of the package page and the expanded description sheet of a given package (here naniar).
We believe shortlisting R packages is highly useful for practitioners new to the field, as it demonstrates data analysis that handles missing values in the data.We are aware that this selection is subjective, and we welcome external suggestions for other packages to add to this shortlist.

Python modules
To the best of our knowledge, very few methods for handling missing data are currently implemented in Python. However, one of the major libraries for machine learning and data analysis, scikit-learn (Pedregosa et al., 2011), has recently proposed a module for simple and multiple imputations, sklearn.impute. As an alternative, the statsmodels library now also has a module for multiple imputation in Python. Additionally, the missingno toolset (Bilogur, 2018) facilitates visualizing missing values for exploratory data analyses. We regularly survey new Python implementations for handling missing values and, if pertinent from a theoretical and practical point of view, reference them on our platform. We expect this to promote their use but also additional assessment by practitioners and researchers from the missing values (statistics/machine learning) community.

Datasets
Especially in methodology research, an important aspect is the comparison of different methods to assess their respective strengths and weaknesses. Several datasets are recurrent in the missing values literature but have not been referenced together yet. We gathered publicly available datasets that have recurrently been used for comparison or illustration purposes in publications, R packages and tutorials. Most of these datasets are already included in R packages but some are available in other data collections. Figure 4 shows how the datasets are presented, with a detailed description shown for one of the datasets ('Ozone', obtained by clicking on its name). The description follows the UCI Machine Learning Repository presentation (Dua and Graff, 2019), including a short description of the dataset, how to obtain it, external references describing the dataset in more detail, and links to tutorials/lectures on our website or to vignettes in R packages that use the dataset.
In addition, the Datasets section also references existing methods for generating missing data, given assumptions on their generation mechanisms (as in the R package mice).
Note, however, that the list of datasets gathered here is short compared to benchmark datasets for full data methods such as the UCI Machine Learning Repository. Therefore, our proposed list also serves as an invitation to tackle this lack of a wider variety of common benchmark datasets in the missing data community.

Additional content
This unified platform collects and edits the contributions of numerous individuals who have investigated missing values problems, and developed methods to handle them. To provide an overview of some of the main actors in this field, the list of all contributors who agreed to appear on this platform is given with links to their personal or to their research lab website.
We also provide links to other interesting websites or working groups, not necessarily working with R and Python (Van Rossum and Drake, 2009) but with other programming languages such as SAS/STAT ® and STATA (StataCorp., 2019).
Two other features are finally provided to engage the community: 1. a regularly updated list of events such as conferences or summer schools with special focus on missing values problems, and 2. a list of recurring questions together with short answers and links for more details for every question (FAQ).

Details of missing values workflows
After this general introduction to the 'R-miss-tastic' project and platform and the overview of its structure, we now turn to a more detailed presentation of the various workflows that we have developed and proposed on this platform.
To allow for both hands-on tutorials illustrating current practices and state-of-the-art and ready-to-use pipelines, we propose the workflows under different formats such as HTML, PDF, R Markdown (for R code) and IPython Notebook (for Python code). We encourage practitioners and researchers to use and adapt these workflows and propose modifications and improvements, in order to increase reproducibility and comparability of their work. Of course, we are aware that these workflows do not cover the entire spectrum of existing methods and data problems. The goal of the proposed workflows is rather to initiate a joint effort to create a larger spectrum of open-source workflows, and to encourage the use of standardized procedures to handle missing values. With an incomplete dataset at hand, prior to embarking on an in-depth statistical analysis, two preliminary steps are essential: (i) a descriptive analysis leveraging visualization packages such as VIM (Kowarik and Templ, 2016) or naniar (Tierney et al., 2021); (ii) the definition of a specific aim, which in turn determines the method to use.
An example of a method whose success crucially depends on the analyst's goal is mean imputation: this approach is strongly discouraged if the aim is to estimate parameters, but it can be consistent if the aim is to predict as well as possible (Josse et al., 2019). Following this observation, our workflows are divided into different parts, defined by the objective of the statistical analysis. We aim to present and compare the main implementations available both in R and Python for each objective. Currently there are seven workflows available on the platform and we briefly present their scope and use below. For details on the implementations we encourage the reader to open the corresponding workflows, all available on the 'R-miss-tastic' platform.
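The following minimal Python sketch makes the warning about mean imputation concrete. It is an illustration under stated assumptions rather than an excerpt from the platform's workflows: the simulated data, the seed and the helper name mean_impute are hypothetical. It shows that filling MCAR-missing entries with the observed mean leaves the estimated mean unchanged but shrinks the estimated variance.

```python
import random
import statistics


def mean_impute(values):
    """Replace None entries by the mean of the observed entries (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    mu = statistics.fmean(observed)
    return [mu if v is None else v for v in values]


rng = random.Random(0)
complete = [rng.gauss(0.0, 1.0) for _ in range(1000)]
# MCAR: each entry is deleted with probability 0.3, independently of its value
incomplete = [None if rng.random() < 0.3 else v for v in complete]

observed = [v for v in incomplete if v is not None]
imputed = mean_impute(incomplete)

var_observed = statistics.pvariance(observed)
var_imputed = statistics.pvariance(imputed)
print(f"variance of observed values:    {var_observed:.3f}")
print(f"variance after mean imputation: {var_imputed:.3f}")
```

In this sketch the variance after imputation equals the variance of the observed values multiplied by the share of observed entries, which is one reason why mean imputation distorts parameter estimates.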

How to generate missing values?
The goal of these workflows is to propose functions to generate missing values under different mechanisms. This code aims to unify classical solutions to do this. Indeed, a usual strategy to compare imputation or estimation strategies is to introduce (additional) missing values in the dataset, and use the ground truth for these missing values to evaluate the strategies (see the following section). Rubin (1976) classifies the cause for a lack of data into three missing data mechanisms. The missing data mechanism is said to be: (i) missing completely at random (MCAR) if the lack of data is totally independent of the data values, (ii) missing at random (MAR) if the process that causes the missing data only depends on the observed values and (iii) missing not at random (MNAR) if the unavailability of the data depends on the missing variables. See Sportisse (2021) for a recent overview on the topic.
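As an illustration of the simplest of these mechanisms, the sketch below deletes cells completely at random in Python. It is not the platform's implementation: the function name add_mcar, the use of None to mark missing cells, and the choice to apply the percentage to the currently observed cells are assumptions made for this example. It returns the incomplete data together with a 0/1 matrix flagging only the newly created missing cells, so that cells that were already missing in the input are kept but not counted as new.

```python
import random


def add_mcar(data, perc_missing, seed=0):
    """Delete a share `perc_missing` of the observed cells, uniformly at random.

    Hypothetical helper: None marks a missing cell; the returned indicator
    matrix flags only the cells made missing by this call.
    """
    rng = random.Random(seed)
    observed_cells = [(i, j) for i, row in enumerate(data)
                      for j, v in enumerate(row) if v is not None]
    n_new = round(perc_missing * len(observed_cells))
    # The choice of cells never looks at the values, hence MCAR
    chosen = set(rng.sample(observed_cells, n_new))
    incomplete = [[None if (i, j) in chosen else v for j, v in enumerate(row)]
                  for i, row in enumerate(data)]
    indicator = [[1 if (i, j) in chosen else 0 for j in range(len(row))]
                 for i, row in enumerate(data)]
    return incomplete, indicator


# A small dataset that already contains one missing value
X = [[1.0, 2.0, 3.0], [4.0, None, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
X_mcar, R_mcar = add_mcar(X, perc_missing=0.2)
for row, flags in zip(X_mcar, R_mcar):
    print(row, flags)
```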
In R
In the R workflow, we have implemented the main function produce_NA, which facilitates generating missing values under the three missing data mechanisms outlined above. This function internally calls the ampute function from the mice package (van Buuren and Groothuis-Oudshoorn, 2011), but we chose to simplify its use while adding some options to specify the missing values generation. In addition, the original ampute function generates missing values only for a complete dataset with quantitative variables. In the main function of our workflow, the user can easily introduce (additional) missing values in a complete or incomplete dataset composed of quantitative, categorical, or mixed variables, by choosing the mechanism and the percentage of missing values to be introduced. The function then returns the data matrix containing the new dataset with missing values, which also includes the missing values already present in the input data, and the indicator matrix (a binary matrix where an entry is equal to 1 if a new missing value has been generated at the same location in the data matrix and 0 otherwise).
The three main arguments are the initial dataset (data) in which the missing values are introduced using a given missing data mechanism (mechanism) and a given percentage of missing values (perc.missing). For example, to introduce 20% of MCAR values in the dataset X, the code is detailed below.

    X.miss.mcar <- produce_NA(data = X, mechanism = "MCAR", perc.missing = 0.2)
    X.mcar <- X.miss.mcar['data.incomp']
    R.mcar <- X.miss.mcar['idx_newNA']

For instance, if X contains three fully observed variables, denoted $X_1$, $X_2$, $X_3$, two options are available to generate MAR values:
1. The first option consists of generating missing values in $X_1$ by using a logistic model depending on $(X_2, X_3)$, which are fully observed, i.e.,
$$\mathbb{P}(X_1 \text{ is missing} \mid X_2, X_3; \phi) = \frac{1}{1 + \exp\left(-(\phi_2 X_2 + \phi_3 X_3)\right)}, \qquad (1)$$
where $\phi = (\phi_2, \phi_3)$ is the parameter of the missing data mechanism. In our function, $\phi$ is chosen such that the given percentage of missing values is achieved. This allows us to obtain missing values in the first variable, $X_1^{NA}$. Then, the same strategy is performed to introduce missing values in $X_2$ and $X_3$, by using a logistic model depending on $(X_1, X_3)$ (fully observed) and $(X_1, X_2)$ (fully observed) respectively. To get the final matrix containing missing values, we concatenate $X_1^{NA}$, $X_2^{NA}$ and $X_3^{NA}$, handling the rows that contain only missing values.
2. The second option consists of generating the missing values by pattern, i.e., by rows.In this case, the combinations of which variables are observed and missing are specified in a pattern matrix.
For the MAR mechanism, in each pattern, at least one variable must be observed. An example (the choice by default) of such a pattern matrix is
$$\begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix},$$
where 0 indicates that the variable should have missing values whereas 1 means that it should be observed. For example, the first pattern means that the process which causes the missingness of the first variable $X_1$ depends on the values of $X_2$ and $X_3$, which are observed.
We also propose several ways to generate missing values under the MNAR mechanism. These include the most general one, where the missingness depends on both the missing variables and the observed variables, and the self-masked mechanism, where the unavailability of the data depends only on the values themselves. For example, it is possible to introduce self-masked missing values using quantile censorship, whose form is specified by the argument self.mask; e.g., if it is set to 'lower', then the values are amputed based on a quantile from the lower tail of the empirical distribution such that the target proportion of missing values is achieved.
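The idea of self-masked quantile censorship can be sketched in a few lines of Python. This is an illustration only and makes no claim about how the self.mask argument is implemented: the function name self_mask_lower and the tie-handling rule are assumptions. Every value at or below the empirical lower quantile is removed, so whether a value is missing depends only on that value itself.

```python
def self_mask_lower(values, perc_missing):
    """Censor the lowest values so that about `perc_missing` of them are missing.

    Hypothetical helper: with ties at the threshold, slightly more than the
    target share can end up missing.
    """
    n_missing = round(perc_missing * len(values))
    if n_missing == 0:
        return list(values)
    threshold = sorted(values)[n_missing - 1]  # empirical lower quantile
    return [None if v <= threshold else v for v in values]


x = [5.1, 0.3, 2.7, 9.4, 1.8, 7.6, 3.3, 6.0, 4.2, 8.5]
x_mnar = self_mask_lower(x, perc_missing=0.3)
print(x_mnar)  # the three smallest values (0.3, 1.8, 2.7) are replaced by None
```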
In Python
To our knowledge, there is no specific module in Python to generate missing values. Consequently, we implemented such functions in a Python workflow which, similarly to its R counterpart, allows us to generate missing values under different mechanisms and with different percentages of missing values. The key difference with the R workflow is that the dataset must be complete and can currently only contain quantitative variables. For the MAR and MNAR mechanisms, only the option not by pattern has been implemented. In this case, for a dataset X with three variables, a variable is chosen to be fully observed (say $X_3$), and the process which causes the missingness of the two other variables ($X_1$ and $X_2$) depends on the values of the fully observed variable, for example with the logistic model given in (1).
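A minimal sketch of a MAR mechanism driven by a logistic model on a fully observed variable is given below. The helper names, the fixed slope, and the use of an intercept fitted by bisection to reach the target percentage are assumptions made for this illustration; they are not taken from the workflow's source code.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_intercept(scores, target, lo=-50.0, hi=50.0, iters=100):
    """Bisection on the intercept so that the mean missingness probability is `target`."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        mean_prob = sum(sigmoid(s + mid) for s in scores) / len(scores)
        if mean_prob < target:
            lo = mid  # probabilities too small: increase the intercept
        else:
            hi = mid
    return (lo + hi) / 2.0


def mar_logistic(x_target, x_observed, perc_missing, slope=1.0, seed=0):
    """Mask entries of x_target with a probability that depends on x_observed only."""
    rng = random.Random(seed)
    scores = [slope * v for v in x_observed]
    intercept = fit_intercept(scores, perc_missing)
    probs = [sigmoid(s + intercept) for s in scores]
    masked = [None if rng.random() < p else v for v, p in zip(x_target, probs)]
    return masked, probs


rng = random.Random(1)
x3 = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # fully observed variable
x1 = [v + rng.gauss(0.0, 1.0) for v in x3]        # variable that receives missing values
x1_mar, probs = mar_logistic(x1, x3, perc_missing=0.2)
share = sum(v is None for v in x1_mar) / len(x1_mar)
print(f"share of missing values in x1: {share:.3f}")
```

Because the masking probability is a function of x3 alone, the missingness of x1 depends only on observed data, which is what characterizes MAR.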

How to impute missing values?
The aim of these workflows (in R and Python) is to compare the most classical imputation methods and to propose a reference pipeline for comparison on simulated and real datasets, which can be easily extended with other imputation methods. Here, the imputation methods are considered as such, i.e., the objective is not to estimate a parameter or to perform a statistical analysis on a completed dataset but to impute missing values to get a complete dataset in the best possible way. Therefore, we evaluate the methods in terms of imputation quality, by using the mean squared error (MSE). More precisely, the procedure is the following: (i) we have access to a complete dataset $X$, (ii) missing values are introduced in $X$ and we get an incomplete dataset $X^{NA}$, (iii) this incomplete dataset is imputed and we obtain an imputed dataset $X^{imp}$, (iv) the MSE, which measures the error committed by the imputation of the missing values, is computed from the $\ell_2$-norm of the difference between the imputed dataset and the complete one. Note that this procedure can also be performed on an incomplete dataset by introducing additional missing values. However, for now, both R and Python workflows only consider complete datasets.
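The four steps of this evaluation procedure can be sketched as follows in Python, with mean imputation standing in for the imputation method. The helper names, the simulated data and the choice to average the squared errors over the amputed cells only are assumptions of this example, not a description of the how_to_impute implementation.

```python
import random
import statistics


def ampute_mcar(X, perc, rng):
    """Step (ii): delete each cell independently with probability `perc`."""
    return [[None if rng.random() < perc else v for v in row] for row in X]


def impute_mean(X_na):
    """Step (iii): column-wise mean imputation, the naive baseline."""
    n_cols = len(X_na[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in X_na if row[j] is not None]
        means.append(statistics.fmean(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in X_na]


def imputation_mse(X, X_na, X_imp):
    """Step (iv): mean squared error over the cells that were made missing."""
    errors = [(X_imp[i][j] - X[i][j]) ** 2
              for i, row in enumerate(X_na)
              for j, v in enumerate(row) if v is None]
    return statistics.fmean(errors)


rng = random.Random(0)
# Step (i): a complete dataset, here 500 rows of 3 independent Gaussian variables
X = [[rng.gauss(1.0, 1.0) for _ in range(3)] for _ in range(500)]
X_na = ampute_mcar(X, perc=0.3, rng=rng)
X_imp = impute_mean(X_na)
mse = imputation_mse(X, X_na, X_imp)
print(f"MSE of mean imputation: {mse:.3f}")
```

Any other imputation method can be compared on the same footing by swapping impute_mean for a function with the same input and output format.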
Different types of imputation methods are included in this workflow:
1. Imputation by the mean, which serves as a naive baseline.
2. Conditional models, where the imputation relies on the conditional expectation or a draw from the conditional distribution of each variable given the others.
   • in R:
      - mice (van Buuren and Groothuis-Oudshoorn, 2011): a multiple imputation method by chained equations. Even though it returns several imputed data sets, they can be aggregated using the mean of the imputations to get a single imputation.
      - missForest (Stekhoven and Bühlmann, 2012): imputes iteratively by training random forests.
   • in Python:
      - IterativeImputer of the scikit-learn library (Pedregosa et al., 2011): this function is inspired by mice, but it uses (iterative) regularized regression, imputing by the conditional expectation, and providing a single imputation. We also use IterativeImputer with the ExtraTreesRegressor estimator, which iteratively trains random forests (similar to missForest in R).
3. Low-rank based models, where the data matrix to impute is assumed to be generated as a low-rank structure plus a noise term.
   • in Python: softImpute (coded for the purpose of this notebook and available online), which minimizes the re-weighted least squares error penalized by the nuclear norm, with an internal cross-validation step to choose the regularization parameter.
4. Machine learning methods (for the Python workflow only) using optimal transport or variational autoencoders.
   • in Python:
      - MIWAE (Mattei and Frellsen, 2019): imputes missing values with a deep latent variable model based on importance weighted variational inference.
      - Sinkhorn (Muzellec et al., 2020): randomly extracts several batches and minimizes optimal transport distances between batches to impute missing values.
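As noted for mice in the list above, several imputed datasets can be aggregated into a single imputation by taking the mean of the imputations. A minimal Python sketch of this cell-wise averaging follows; the function name average_imputations and the toy datasets are hypothetical and only serve the illustration.

```python
def average_imputations(imputed_datasets):
    """Cell-wise mean of several completed datasets of identical shape."""
    m = len(imputed_datasets)
    n_rows = len(imputed_datasets[0])
    n_cols = len(imputed_datasets[0][0])
    return [[sum(d[i][j] for d in imputed_datasets) / m for j in range(n_cols)]
            for i in range(n_rows)]


# Three hypothetical completed versions of the same 2 x 2 dataset; only the
# cell in row 0, column 1 was missing, so it is the only one that varies.
imp1 = [[1.0, 2.0], [3.0, 4.0]]
imp2 = [[1.0, 4.0], [3.0, 4.0]]
imp3 = [[1.0, 6.0], [3.0, 4.0]]
single = average_imputations([imp1, imp2, imp3])
print(single)  # [[1.0, 4.0], [3.0, 4.0]]
```

Averaging is appropriate when the goal is a single completed dataset, as in this workflow; it discards the between-imputation variability that multiple imputation uses for inference.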
For the sake of clarity, a comparison table (Table 1) in the Appendices shows the difference of scope between the R and Python packages used in the R-miss-tastic workflows.

Figure 5(a). Output of the function how_to_impute in R. The results for the MSE are truncated to two digits. Note that the line X.pca is the result for missMDA. For all methods, the default parameter choices are used.

In R
This workflow provides two main functions which compare the imputation methods: (i) on a simulated dataset for different mechanisms and percentages of missing values (how_to_impute) or (ii) on a list of real datasets and a given mechanism and percentage of missing values (how_to_impute_real).
The function how_to_impute takes as input a complete dataset (X), a list of percentages of missing values (perc.list) and a list of missing data mechanisms (mecha.list). The code to use this function is given below.

    perc.list <- c(0.1, 0.3, 0.5)
    mecha.list <- c("MCAR", "MAR", "MNAR")
    res <- how_to_impute(X = X, perc.list = perc.list, mecha.list = mecha.list, nbsim = 10)

The output of how_to_impute is, for each missing values setting, the MSE of each method averaged over several repetitions (the number of repetitions can be specified through the argument nbsim). Figure 5 shows the output of this function and its associated plot, when the simulated dataset is Gaussian with $n = 1000$ observations, $d = 10$ variables, a mean vector such that $\mu_i = 1$ for all $i \in \{1, \dots, d\}$, and a covariance matrix such that $\Sigma_{ij} = 0.5$ if $i \neq j$ and $\Sigma_{ij} = 1$ if $i = j$. First, the mean of the methods' MSEs for the different missing values settings is reported in Figure 5a. We can remark that for the MCAR mechanism, the methods perform well, while for the MNAR mechanism, the results are generally closer to those of the naive imputation by the mean. As expected, most methods give worse results for high percentages of missing values. Besides, Figure 5b shows one of the associated plots, for MCAR data (there is also a plot for MAR data and a plot for MNAR data). In the first part of the appendix, this function is illustrated for a particular dataset and the code to obtain Figure 5 is given.
The second function how_to_impute_real takes as input a list of datasets (datasets_list), a list of missing data mechanisms (mech) and a given percentage of missing values (perc). It returns a table containing the mean of the MSEs for the simulations performed and a table for the summary plot shown in Figure 6. This can be particularly useful for practitioners who would like an indication of which method might be best suited to one or several specific datasets. Here, the real datasets are taken from the UCI repository¹⁶ (Dua and Graff, 2019). An example of how to use this function in practice is detailed below.

    datasets_list <- list(wine_white = wine_white, wine_red = wine_red, slump = slump, movement = movement, decathlon = decathlon)
    names_dataset <- c("winequality-white", "winequality-red", "slump", "movement", "decathlon")
    perc <- 0.2
    mecha <- "MCAR"
    res <- how_to_impute_real(datasets_list = datasets_list, perc = perc, mech = mecha, nbsim = 10, names_dataset = names_dataset)

An additional workflow¹⁷ is available and compares deep-learning imputation strategies with more classical ones on datasets simulated with either linear or nonlinear relationships. The conclusions point to better behavior of the low-rank based imputation methods, even when the deep-learning methods are tuned.
In Python. The Python workflow is very similar to its R counterpart. The same two functions, how_to_impute and how_to_impute_real, have been implemented.
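The comparison loop behind such a benchmark can be illustrated with a minimal sketch. It is not the workflow's implementation: the helper names (simulate_gaussian, add_mcar, impute_mean, compare_mean_imputation) are hypothetical, only the MCAR mechanism and naive mean imputation are covered, and the MSE is assumed to be computed on the artificially removed entries only.

```python
import numpy as np


def simulate_gaussian(n=1000, d=10, rho=0.5, seed=0):
    """Draw n Gaussian observations with mean 1 and equicorrelated covariance."""
    rng = np.random.default_rng(seed)
    cov = np.full((d, d), rho)
    np.fill_diagonal(cov, 1.0)
    return rng.multivariate_normal(np.ones(d), cov, size=n)


def add_mcar(X, perc, rng):
    """Remove each entry independently with probability perc (MCAR)."""
    mask = rng.random(X.shape) < perc
    X_miss = X.copy()
    X_miss[mask] = np.nan
    return X_miss, mask


def impute_mean(X_miss):
    """Replace each missing entry by the observed mean of its column."""
    col_means = np.nanmean(X_miss, axis=0)
    return np.where(np.isnan(X_miss), col_means, X_miss)


def compare_mean_imputation(X, perc_list, nbsim=10, seed=0):
    """Average, over nbsim repetitions, the MSE on the removed entries (assumption)."""
    rng = np.random.default_rng(seed)
    results = {}
    for perc in perc_list:
        errors = []
        for _ in range(nbsim):
            X_miss, mask = add_mcar(X, perc, rng)
            X_imp = impute_mean(X_miss)
            errors.append(np.mean((X[mask] - X_imp[mask]) ** 2))
        results[perc] = float(np.mean(errors))
    return results


X = simulate_gaussian()
res = compare_mean_imputation(X, perc_list=[0.1, 0.3, 0.5])
```

Other imputation methods could be plugged in by replacing impute_mean with any function that maps an incomplete array to a completed one.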

How to estimate parameters with missing values in R?
This R workflow¹⁸ is dedicated to a specific inferential framework in which the aim is to estimate linear and logistic regression parameters for multivariate normal data. It is currently only available in R, as there are no analogous implementations available in Python to our knowledge.
There are two main methods to estimate parameters with missing values: maximum likelihood estimation adapted to missing values, using, e.g., EM-based algorithms, or multiple imputation. In this workflow, we compare two instances of these main methods, using available R implementations: the EM algorithm for logistic and linear regressions with the package misaem (Jiang et al., 2020), which uses the Stochastic Approximation of EM algorithm (SAEM; Delyon et al., 1999), and multiple imputation with the package mice. Both strategies are valid under the MAR missing data mechanism. The workflow performs the estimation on a simulated dataset, but the dataset can be replaced with any custom dataset that the user believes satisfies the assumptions about the missing data mechanism and the distribution of covariates.
The misaem package facilitates estimation of parameters of linear and logistic regression models from incomplete data, and also provides valid estimates of these parameters' variances. The functions miss.lm and miss.glm resemble the standard lm and glm functions both in terms of their signature and output.
The rationale behind the popular multiple imputation approach is to create M > 1 complete datasets by imputing the missing values with 'plausible' values, and then to estimate a parameter of interest θ on each of the imputed datasets. The multiple estimations of θ and their variability reflect the uncertainty due to the unknown missing values. The parameter estimation is performed by applying the analytic method that would have been used had the data been complete. This provides an estimate of the parameter θ and an estimate of the corresponding variance for each imputed dataset. These quantities are finally 'pooled' using specific rules named 'Rubin's rules' (Rubin, 2004), leading to a final point estimate with a corresponding variance estimate that takes into account the uncertainty due to missing values.
In the corresponding workflow, we compare this method to the previous EM algorithm and provide the basic lines of code required to estimate parameters of linear or logistic regression models with incomplete covariate data.
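The pooling step of multiple imputation can be sketched for a scalar parameter as follows. The function name pool_rubin and the numerical inputs are illustrative only and do not come from the workflow; the formulas are the standard Rubin's rules, where the total variance is the within-imputation variance plus (1 + 1/M) times the between-imputation variance.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M per-dataset estimates of a scalar parameter with Rubin's rules.

    estimates: point estimates, one per imputed dataset
    variances: the corresponding estimated variances
    Returns the pooled point estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need M > 1 estimates with matching variances")
    pooled_estimate = mean(estimates)       # average of the M estimates
    within = mean(variances)                # average within-imputation variance
    between = variance(estimates)           # sample variance across imputations
    total = within + (1 + 1 / m) * between  # total variance
    return pooled_estimate, total


# Hypothetical estimates obtained from M = 3 imputed datasets
pooled_estimate, pooled_variance = pool_rubin(
    estimates=[1.0, 1.2, 0.8], variances=[0.04, 0.05, 0.03]
)
```

The between-imputation term is what carries the extra uncertainty due to the missing values: if all imputed datasets gave the same estimate, the total variance would reduce to the average complete-data variance.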
For an additional example of how to estimate regression parameters, we refer to the tutorial¹⁹ on handling missing values in R by Julie Josse: it walks through a complete analysis, covering visualization of missing data patterns, data visualization, dimensionality reduction of incomplete data, and regression in the presence of missing data.

How to predict in the presence of missing values?
As mentioned in the introduction, methods to deal with missing values are not the same when the aim is to estimate parameters or to predict a target variable. Josse et al. (2019) study the problem of supervised learning with missing values, i.e., when the aim is to predict an outcome y from incomplete covariates in X. Note that, contrary to the estimation setting, supervised learning involves training and test sets, and both may have missing values. Josse et al. (2019) recommend imputing the training set and the test set with the same constant, such as the mean, and then applying a universally consistent learner, i.e., a very powerful learner able to learn or fit any function, such as gradient boosting. When forest-based methods are used for prediction, another method is available: the Missing Incorporated in Attributes (MIA) method (Twala et al., 2008). Note that constant imputation and MIA are recommended asymptotically; with limited data in the prediction setting, other imputation methods can outperform these asymptotically consistent methods (Josse et al., 2019). This is explored in the following workflows. The different methods are compared in terms of the quality of the prediction of the outcome (AUC for a binary outcome and MSE for a continuous outcome).
In R. The R workflow²⁰ assesses a popular strategy (the two-step strategy), which involves independently imputing the training and test sets using the same imputation method. These datasets are then treated as complete data, and regular learning algorithms are applied to predict some target variable.²¹ Several imputation methods are compared, such as mice, missForest, softImpute, and mean imputation. Note that, until recently, using the popular mice package for learning predictive models on incomplete data in R was hindered by the fact that it did not allow using the same imputation model for the training and test sets. This has, however, been addressed with the argument ignore of the R function mice; the details of this recent extension can be found on GitHub.²²

In Python. The Python workflow²³ compares two strategies, where the aim is to predict a target variable and the covariates may contain missing values:

1. The two-step strategy consists of imputing the missing values both in the training set and in the test set with a method like mean imputation or IterativeImputer from the scikit-learn library, and then applying the usual learning algorithms (random forests, gradient boosting, linear regression) to the imputed dataset. The learning algorithm can be applied to the imputed dataset X, but also to a new variable made of the combination of the imputed covariates X with the response pattern R.
2. The one-step strategy performs prediction with learning methods adapted to missing data, without necessarily imputing them, such as the MIA method (Twala et al., 2008), which we have implemented in our notebook.
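The two feature constructions mentioned in the strategies above can be sketched as follows. This is only an illustration with hypothetical helper names (add_mask, mia_encode), not the notebook's code. The MIA encoding shown here follows a commonly described way of emulating MIA for tree-based learners: each column is duplicated, with missing entries replaced once by a very small and once by a very large value, so that a split can send the missing values to either side. The constant big is an assumption.

```python
import numpy as np


def add_mask(X_imputed, X_miss):
    """Concatenate imputed covariates with the response pattern (1 = missing)."""
    mask = np.isnan(X_miss).astype(float)
    return np.hstack([X_imputed, mask])


def mia_encode(X_miss, big=1e10):
    """Duplicate each column, filling missing entries with -big, then +big."""
    missing = np.isnan(X_miss)
    X_low = np.where(missing, -big, X_miss)
    X_high = np.where(missing, big, X_miss)
    return np.hstack([X_low, X_high])


# Small incomplete matrix used only for illustration
X_miss = np.array([[1.0, np.nan],
                   [np.nan, 2.0],
                   [3.0, 4.0]])

# Two-step strategy with mask: mean imputation, then append the pattern
col_means = np.nanmean(X_miss, axis=0)
X_imputed = np.where(np.isnan(X_miss), col_means, X_miss)
X_with_mask = add_mask(X_imputed, X_miss)

# One-step strategy: MIA-style encoding, no imputation model involved
X_mia = mia_encode(X_miss)
```

Both constructions double the number of columns; the resulting arrays contain no NaN and can be passed to a standard learner.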
We propose a function, score_pred, which compares these strategies in terms of prediction performance by introducing missing values in complete covariates (x_comp) under a specific missing data mechanism (mecha) and a given percentage of missing values (p). The code for calling this function is given below, when the learning algorithm is gradient boosting and 20% of MCAR values are introduced.

    from sklearn.ensemble import HistGradientBoostingRegressor

    learner = HistGradientBoostingRegressor()
    p = 0.2
    res = score_pred(x_comp=X, y=y, learner=learner, p=p, nbsim=10, mecha="MCAR")

The dataset is then split into a training set and a test set (75% in the training set, 25% in the test set) and the methods presented above are applied with the chosen learning algorithm. The function then returns the prediction error on the test set, obtained by comparing the ground truth (y) and the predicted outcome values on the test set for each simulation (i.e., each run of the generation of missing values). Figure 7 shows the graphical output of this function called for different learning algorithms (linear regression, random forests and gradient boosting) and for different missing data mechanisms (20% MCAR and MNAR, see the section on how to generate missing values). When the learner is linear regression, the two-step methods with added mask perform well, both for the MCAR mechanism and the MNAR mechanism. Since the simulated dataset is generated using a linear regression, linear regression is expected to give better results than the other learners. In addition, for the MNAR mechanism, the one-step strategy MIA (especially with gradient boosting) appears to be a good choice.
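To make the evaluation loop concrete, the following sketch reproduces its general shape under several assumptions that are not taken from the notebook: the function score_mean_imputation is hypothetical, only MCAR values and mean imputation (with or without mask) are covered, the learner is an ordinary least squares fit, the covariates are drawn independently, the imputation means are computed on the training set only, and the score is the coefficient of determination on the test set.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 is a perfect prediction."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


def fit_predict_linear(X_train, y_train, X_test):
    """Least squares linear regression with an intercept."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


def score_mean_imputation(X, y, p=0.2, nbsim=10, add_mask=False, seed=0):
    """Test-set scores over nbsim draws of MCAR values and 75/25 splits."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(nbsim):
        X_miss = X.copy()
        X_miss[rng.random(X.shape) < p] = np.nan      # MCAR missing values
        idx = rng.permutation(len(X))
        n_train = int(0.75 * len(X))
        train, test = idx[:n_train], idx[n_train:]
        means = np.nanmean(X_miss[train], axis=0)     # learned on training set
        X_imp = np.where(np.isnan(X_miss), means, X_miss)
        if add_mask:
            X_imp = np.hstack([X_imp, np.isnan(X_miss).astype(float)])
        y_pred = fit_predict_linear(X_imp[train], y[train], X_imp[test])
        scores.append(r2_score(y[test], y_pred))
    return scores


# Simulated linear model y = X beta + noise (independent covariates here)
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
beta = rng.uniform(size=3)
y = X @ beta + rng.normal(scale=0.1, size=1000)

scores_plain = score_mean_imputation(X, y)
scores_mask = score_mean_imputation(X, y, add_mask=True)
```

Using the training means to impute the test set is one way of applying the same constant to both sets; imputing each set separately, as in the two-step R workflow, would be an alternative design.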
Another function is specifically designed to handle datasets that already contain missing values. The second part of the appendix shows a concrete example of this notebook on a real dataset. This concludes the overview of the workflows developed in this project. We invite other practitioners and researchers to use and extend these methods. Overall, we hope that by creating and sharing methods, new methods can be more easily developed, compared and evaluated.

Perspectives and future extensions
By providing a platform and community to discuss missing data, software, approaches and workflows, the sharing of expertise on missing data can hopefully be improved and extended more easily.

Towards uniformization and reproducibility
One way to support and encourage practitioners and researchers in their work with missing values is to provide benchmarks and workflows around missing data. As has been shown in data competitions, community involvement produces many creative solutions and discussions that move the field forward and challenge existing strategies. We will continue to work on our workflows and related source code. In doing so, we hope to encourage users to continue to test new methods and present results in a clear and reproducible manner. In addition, we plan to propose two types of data challenges: 1) imputation and estimation, and 2) analysis workflows. For the first type of challenge, the objective is to find the best imputation or estimation strategy. The community will be given a dataset with missing values, for which there is a hidden copy of the real values. The community will then be tasked with creating imputed values, which are assessed against the original dataset with complete values to determine which imputation is best. This is similar in spirit to the Netflix prize (Bennett et al., 2007) and the M4 challenge in the time series domain (Makridakis et al., 2018). This benchmarking could be extended to other areas, such as parameter estimation and predictive modeling with missing data. Analysis workflows could form another community challenge, assessed in a similar way to existing 'datathon' events where entries are assessed by an expert panel. Here the challenge could be to develop workflows and data visualizations from complex data. The data could have challenging features and be combined from various data sources with complex structure, such as data with several types of missingness, images, text, data, longitudinal data, and time series.

Future extensions
Possible enhancements that could be added in future releases of the platform, for which we welcome suggestions and contributions, are the following: a workflow with a focus on MNAR data and different solutions that can handle such data (as the diversity of existing solutions is large, such a unified workflow would be a consequential contribution); for more applied users, a

Participation and interaction
This platform aims to offer a venue for the community, in the sense that we welcome every comment and question, and encourage submissions of new works, theoretical or practical, either through the provided contact form or directly via the GitHub project repository. We have already received useful feedback and several external contributions, and have organized several remote calls and working sessions at statistics conferences. We are planning to regularly relaunch calls for new material for the platform, for example through the R consortium blog²⁴, R-bloggers²⁵ and social media platforms. We also intend to use these channels to communicate more generally about the platform and the topic of missing values.
In order for the platform to be a reference for the community, it must provide regularly updated, user-friendly content. To achieve this goal, it is important to propose sustainable and accessible solutions for the maintenance of the 'R-miss-tastic' platform. We hope that the well-documented source code of the platform facilitates external contributions and community feedback on this project.
In conclusion, the aim of this platform is to go beyond mere community participation, namely to seed meaningful community interactions, and to offer a hub of communication among groups that rarely exchange, both within and between academia and industry.

The R Journal Vol. 14/2, June 2022 ISSN 2073-4859
Example of plot for the MNAR mechanism.

Figure 5: Tabular and graphical outputs of the R function how_to_impute. The methods mice, missForest, softImpute and missMDA are compared with the naive imputation by the mean for several percentages of missing values (10%, 30%, 50%). The mean of the MSEs computed over several generations of missing values is given. In the table, the results are shown for several mechanisms (MCAR, MAR, MNAR); the plot corresponds to the MNAR mechanism.

Figure 6: Output of the R function how_to_impute_real. The results for the MSE are truncated to two digits. The methods mice, missForest, softImpute and missMDA are compared on several real datasets in which 20% MCAR missing values have been introduced.

Figure 8: Graphical outputs of the R function how_to_impute. The methods mice, missForest, softImpute and missMDA are compared with the naive imputation by the mean for several percentages of missing values (20%, 50%). The mean of the MSEs computed over several generations of missing values is given. The results are shown for different mechanisms (MCAR, MAR, MNAR).
Los Angeles Ozone Pollution Data, 1976. This data set contains daily measurements of ozone concentration and meteorological quantities. It can be found in R in the mlbench package and is loaded by calling data(Ozone).
Tutorials illustrating methods on this data: Julie Josse's course on missing values imputation using PC methods. Julie Josse's and Nick Tierney's tutorial on handling missing values. Download the data set from this tutorial: ozoneNA.csv. Nick Tierney's naniar vignette for missing data visualization.
Figure 7: Plot of the function score_pred to compare different strategies when the aim is to predict in Python. 20% of missing values are introduced in a simulated dataset using the MCAR mechanism or the MNAR mechanism. The covariates $X \in \mathbb{R}^{1000 \times 3}$ are generated under a multivariate Gaussian distribution, and the regression parameter $\beta \in \mathbb{R}^{3}$ follows a random uniform distribution. The outcome y is generated according to a linear model such that $y = X\beta + \epsilon$, with $\epsilon$ representing Gaussian noise. The two-step strategies (IterativeImputer and mean imputation), with or without adding a mask, and the one-step strategy MIA are compared in terms of prediction error, and several learners are used (linear regression, random forests, gradient boosting). The closer the result is to 1, the more accurate the prediction is (1 corresponds to perfect prediction, 0 to the worst prediction).
comparison of computation times of different methods, benchmarked on various types of data. Another problem that is becoming more common is missing values in data integration. Indeed, questions such as "what do I do when I have clinical data from multiple centers with different mechanisms of missing values or with systematically missing values in certain data?" or "what do I do when I have time series and missing values in one of the groups of variables?" would also be worth addressing in additional workflows.