The imputeTS package specializes on univariate time series imputation. It offers multiple state-of-the-art imputation algorithm implementations along with plotting functions for time series missing data statistics. While imputation in general is a well-known problem and widely covered by R packages, finding packages able to fill missing values in univariate time series is more complicated. The reason for this lies in the fact, that most imputation algorithms rely on inter-attribute correlations, while univariate time series imputation instead needs to employ time dependencies. This paper provides an introduction to the imputeTS package and its provided algorithms and tools. Furthermore, it gives a short overview about univariate time series imputation in R.
In almost every domain from industry (Billinton et al. 1996) to biology (Bar-Joseph et al. 2003), finance (Taylor 2007) up to social science (Gottman 1981) different time series data are measured. While the recorded datasets itself may be different, one common problem are missing values. Many analysis methods require missing values to be replaced with reasonable values up-front. In statistics this process of replacing missing values is called imputation.
Time series imputation thereby is a special sub-field in the imputation
research area. Most popular techniques like Multiple Imputation
(Rubin 1987), Expectation-Maximization (Dempster et al. 1977),
Nearest Neighbor (Vacek and Ashikaga 1980) and Hot Deck
(Ford 1983) rely on inter-attribute correlations to estimate
values for the missing data. Since univariate time series do not possess
more than one attribute, these algorithms cannot be applied directly.
Effective univariate time series imputation algorithms instead need to
employ the inter-time correlations.
On CRAN there are several packages solving the problem of imputation of
multivariate data. Most popular and mature (among others) are
AMELIA (Honaker et al. 2011),
mice (van Buuren and Groothuis-Oudshoorn 2011),
VIM (Kowarik and Templ 2016) and
missMDA (Josse and Husson 2016).
However, since these packages are designed for multivariate data
imputation only they do not work for univariate time series.
At the moment imputeTS
(Moritz 2017a) is the only package on CRAN that is solely dedicated to
univariate time series imputation and includes multiple algorithms.
Nevertheless, there are some other packages that include imputation
functions as addition to their core package functionality. Most
noteworthy being zoo (Zeileis and Grothendieck 2005)
and forecast
(Hyndman 2017). Both packages offer also some advanced time series
imputation functions. The packages
spacetime
(Pebesma 2012),
timeSeries
(Rmetrics Core Team et al. 2015) and xts (Ryan and Ulrich 2014)
should also be mentioned, since they contain some very simple but quick
time series imputation methods. For a broader overview about available
time series imputation packages in R see also (Moritz et al. 2015). In this
technical report we evaluate the performance of several univariate
imputation functions in R on different time series.
This paper is structured as follows:
Section 2 gives an overview, about all
features and functions included in the imputeTS package. This is
followed by 3 of the different provided functions.
The paper ends with a 4 section.
The imputeTS package can be found on CRAN and is an easy to use package that offers several utilities for ‘univariate, equi-spaced, numeric time series’.
Univariate means there is just one attribute that is observed over time. Which leads to a sequence of single observations \(o_{1}\), \(o_{2}\), \(o_{3}\), ... \(o_{n}\) at successive points \(t_{1}\), \(t_{2}\), \(t_{3}\), ... \(t_{n}\) in time. Equi-spaced means, that time increments between successive data points are equal \(|t_{1} - t_{2}| = |t_{2} - t_{3}| = ... = |t_{n-1} - t_{n}|\). Numeric means that the observations are measurable quantities that can be described as a number.
In the first part of this section, a general overview about all
available functions and datasets is given. This is followed by more
detailed overviews about the three areas covered by the package: ‘Plots
& Statistics’, ‘Imputation’ and ‘Datasets’. Information about how to
apply these functions and tools can be found later in the
3 section.
As can be seen in Table 1, beyond several imputation algorithm implementations the package also includes plotting functions and datasets. The imputation algorithms can be divided into rather simple but fast approaches like mean imputation and more advanced algorithms that need more computation time like kalman smoothing on a structural model.
Simple Imputation | Imputation | Plots & Statistics | Datasets |
---|---|---|---|
na.locf | na.interpolation | plotNA.distribution | tsAirgap |
na.mean | na.kalman | plotNA.distributionBar | tsAirgapComplete |
na.random | na.ma | plotNA.gapsize | tsHeating |
na.replace | na.seadec | plotNA.imputations | tsHeatingComplete |
na.remove | na.seasplit | statsNA | tsNH4 |
tsNH4Complete |
As a whole, the package aims to support the user in the complete process
of replacing missing values in time series. This process starts with
analyzing the distribution of the missing values using the statsNA
function and the plots of plotNA.distribution
,
plotNA.distributionBar
, plotNA.gapsize
. In the next step the actual
imputation can take place with one of the several algorithm options.
Finally, the imputation results can be visualized with the
plotNA.imputations
function. Additionally, the package contains three
datasets, each in a version with and without missing values, that can be
used to test imputation algorithms.
An overview about the available plots and statistics functions can be found in Table 2. To get a good impression what the plots look like section 3 is recommended.
Function | Description |
---|---|
plotNA.distribution | Visualize Distribution of Missing Values |
plotNA.distributionBar | Visualize Distribution of Missing Values (Barplot) |
plotNA.gapsize | Visualize Distribution of NA gap sizes |
plotNA.imputations | Visualize Imputed Values |
statsNA | Print Statistics about the Missing Data |
The statsNA
function calculates several missing data statistics of the
input data. This includes overall percentage of missing values, absolute
amount of missing values, amount of missing value in different sections
of the data, longest series of consecutive NAs and occurrence of
consecutive NAs. The plotNA.distribution
function visualizes the
distribution of NAs in a time series. This is done using a standard time
series plot, in which areas with missing data are colored red. This
enables the user to see at first sight where in the series most of the
missing values are located. The plotNA.distributionBar
function
provides the same insights to users, but is designed for very large time
series. This is necessary for time series with 1000 and more
observations, where it is not possible to plot each observation as a
single point. The plotNA.gapsize
function provides information about
consecutive NAs by showing the most common NA gap sizes in the time
series. The plotNA.imputations
function is designated for visual
inspection of the results after applying an imputation algorithm.
Therefore, newly imputed observations are shown in a different color
than the rest of the series.
An overview about all available imputation algorithms can be found in Table 3. Even if these functions are really easy applicable, some examples can be found later in section 3. More detailed information about the theoretical background of the algorithms can be found in the imputeTS manual (Moritz 2017b).
Function | Option | Description |
---|---|---|
na.interpolation | ||
linear | Imputation by Linear Interpolation | |
spline | Imputation by Spline Interpolation | |
stine | Imputation by Stineman Interpolation | |
na.kalman | ||
StructTS | Imputation by Structural Model & Kalman Smoothing | |
auto.arima | Imputation by ARIMA State Space Representation & Kalman Sm. | |
na.locf | ||
locf | Imputation by Last Observation Carried Forward | |
nocb | Imputation by Next Observation Carried Backward | |
na.ma | ||
simple | Missing Value Imputation by Simple Moving Average | |
linear | Missing Value Imputation by Linear Weighted Moving Average | |
exponential | Missing Value Imputation by Exponential Weighted Moving Average | |
na.mean | ||
mean | MissingValue Imputation by Mean Value | |
median | Missing Value Imputation by Median Value | |
mode | Missing Value Imputation by Mode Value | |
na.random | Missing Value Imputation by Random Sample | |
na.replace | Replace Missing Values by a Defined Value | |
na.seadec | Seasonally Decomposed Missing Value Imputation | |
na.seasplit | Seasonally Splitted Missing Value Imputation | |
na.remove | Remove Missing Values |
For convenience similar algorithms are available under one function name
as parameter option. For example linear, spline and stineman
interpolation are all included in the na.interpolation
function. The
na.mean
, na.locf
, na.replace
, na.random
functions are all simple
and fast. In comparison, na.interpolation
, na.kalman
, na.ma
,
na.seasplit
, na.seadec
are more advanced algorithms that need more
computation time. The na.remove
function is a special case, since it
only deletes all missing values. Thus, it is not really an imputation
function. It should be handled with care since removing observations may
corrupt the time information of the series. The na.seasplit
and
na.seadec
functions are as well exceptions. These perform seasonal
split / decomposition operations as a preprocessing step. For the
imputation itself, one out of the other imputation algorithms can be
used (which one can be set as option). Looking at all available
imputation methods, no single overall best method can be pointed out.
Imputation performance is always very dependent on the characteristics
of the input time series. Even imputation with mean values can sometimes
be an appropriate method. For time series with a strong seasonality
usually na.kalman
and na.seadec
/ na.seasplit
perform best. In
general, for most time series one algorithm out of na.kalman
,
na.interpolation
and na.seadec
will yield the best results.
Meanwhile, na.random
, na.mean
, na.locf
will be at the lower end
accuracy wise for the majority of input time series.
As can be seen in Table 4, all three datasets are available in a version with missing data and in a complete version. The provided time series are designated as benchmark datasets for univariate time series imputation. They shall enable users to quickly compare and test imputation algorithms. Without these datasets the process of testing time series imputation algorithms would require to manually delete certain observations. The benchmark data simplifies this: imputation algorithms can directly be applied to the dataset versions with missing values, which then can be compared to the complete dataset versions afterwards. Since the time series are specified, researchers can use these to compare their algorithms against each other.
Reached RMSE or MAPE values on these datasets are easily understandable
results to quote and compare against. Nevertheless, comparing algorithms
using these fixed datasets can only be a first indicator of how well
algorithms perform in general. Especially for the very short tsAirgap
series (with just 13 NA values) random lucky guesses can considerably
influence the results. A complete benchmark would include: ‘Different
missing data percentages’, ‘Different datasets’, ‘Different random seeds
for missing data simulation’.
Overall there is a relatively small time series provided in tsAirgap
,
a medium one in tsNH4
and a large time series in tsHeating
. The
tsHeating
and tsNH4
are both sensor data, while tsAirgap
is count
data.
Dataset | Description |
---|---|
tsAirgap | Time series of monthly airline passengers (with NAs) |
tsAirgapComplete | Time series of monthly airline passengers (complete) |
tsHeating | Time series of a heating systems’ supply temperature (with NAs) |
tsHeatingComplete | Time series of a heating systems’ supply temperature (complete) |
tsNH4 | Time series of NH4 concentration in a waste-water system (with NAs) |
tsNH4Complete | Time series of NH4 concentration in a waste-water system (complete) |
The tsAirgap
time series has 144 rows and the incomplete version
includes 14 NA values. It represents the monthly totals of international
airline passengers from 1949 to 1960. The time series originates from
(Box et al. 2015) and is a commonly used example in time series analysis
literature. Originally known as ‘AirPassengers’ or ‘airpass’ this
version is renamed to ‘tsAirgap’ in order improve differentiation from
the complete series (gap signifies that NAs were introduced). The
characteristics (strong trend, strong seasonal behavior) make the
tsAirgap
series a great example for time series imputation.
As already mentioned in order to use this series for comparing
imputation algorithm results, there are two time series provided. One
series without missing values (tsAirgapComplete
), which can be used as
ground truth. Another series with NAs, on which the imputation
algorithms can be applied (tsAirgap
). While the missing data for
tsNH4
and tsHeating
were each introduced according to patterns
observed in very similar time series from the same source, the missing
observations in tsAirgap
were created based on general missing data
patterns.
The tsNH4
time series has 4552 rows and the incomplete version
includes 883 NA values. It represents the NH4 concentration in a
waste-water system measured from 30.11.2010 - 16:10 to 01.01.2011 - 6:40
in 10 minute steps. The time series is derived from the dataset of the
Genetic and Evolutionary Computation Conference (GECCO) Industrial
Challenge 2014 1.
As already mentioned in order to use this series for comparing
imputation algorithm results, there are two time series provided. One
series without missing values (tsNH4Complete
), which can be used as
ground truth. Another series with NAs (tsNH4
), on which the imputation
algorithms can be applied. The pattern for the NA occurrence was derived
from the same series / sensors, but from an earlier time interval. Thus,
it is a very realistic missing data pattern. Beware, since the time
series has a lot of observations, some of the more complex algorithms
like na.kalman
will need some time till they are finished.
The tsHeating
time series has 606837 rows and the incomplete version
includes 57391 NA values. It represents a heating systems’ supply
temperature measured from 18.11.2013 - 05:12:00 to 13.01.2015 - 15:08:00
in 1 minute steps. The time series originates from the GECCO Industrial
Challenge 2015 2. This was a challenge about ‘Recovering missing
information in heating system operating data’. Goal was to impute
missing values in heating system sensor data as accurate as possible.
As already mentioned in order to use this series for comparing
imputation algorithm results, there are two time series provided. One
series without missing values (tsHeatingComplete
), which can be used
as ground truth. Another series with NAs (tsHeating
), on which the
imputation algorithms can be applied. The NAs thereby were inserted
according to patterns found in similar time series. According to
patterns found / occurring in other heating systems. Beware, since it is
a very large time series, some of the more complex algorithms like
na.kalman
may need up to several days to complete on standard
hardware.
To start working with the imputeTS package, install either the stable version from CRAN or the development version from GitHub (https://github.com/SteffenMoritz/imputeTS). The stable version from CRAN is hereby recommended.
All imputation algorithms are used the same way. Input has to be either
a numeric time series or a numeric vector. As output, a version of the
input data with all missing values replaced by imputed values is
returned. Here is a small example, to show how to use the imputation
algorithms. (all imputation functions start with na.’algorithm name’)
For this we first need to create an example input series with missing
data.
# Create a short example time series with missing values
<- ts(c(1, 2, 3, 4, 5, 6, 7, 8, NA, NA, 11, 12)) x
On this time series we can apply different imputation algorithms. We
start with using na.mean
, which substitutes the NAs with mean values.
# Impute the missing values with na.mean
na.mean(x)
[1] 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 5.9 5.9 11.0 12.0
Most of the functions also have additional options that provide further
algorithms (of the same algorithm category). In the example below it can
be seen that na.mean
can also be called with option="median"
, which
substitutes the NAs with median values.
# Impute the missing values with na.mean using option median
na.mean(x, option="median")
[1] 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 5.5 5.5 11.0 12.0
While na.interpolation
and all other imputation functions are used the
same way, the results produced may be different. As can be seen below,
for this series linear interpolation gives more reasonable results.
# Impute the missing values with na.interpolation
na.interpolation(x)
[1] 1 2 3 4 5 6 7 8 9 10 11 12
For longer and more complex time series (with trend and seasonality)
than in this example it is always a good idea to try na.kalman
and
na.seadec
, since these functions very often produce the best results.
These functions are called the same easy way as all other imputation
functions.
Here is a usage example for the na.kalman
function applied on the
tsAirgap
(described in 2.4) time series. As can
be seen in Figure 1, na.kalman
provides really
good results for this series, which contains a strong seasonality and a
strong trend.
# Impute the missing values with na.kalman
# (tsAirgap is an example time series provided by the imputeTS package)
<- na.kalman(tsAirgap)
imp
#Code for visualization
plotNA.imputations(tsAirgap, x.imp, tsAirgapComplete)
This function visualizes the distribution of missing values within a time series. Therefore, the time series is plotted and whenever a value is NA the background is colored differently. This gives a nice overview, where in the time series most of the missing values occur. An example usage of the function can be seen below (for the plot see Figure 2).
# Example Code 'plotNA.distribution'
# (tsAirgap is an example time series provided by the imputeTS package)
# Visualize the missing values in this time series
plotNA.distribution(tsAirgap)
As can be seen in Figure 2, in areas with missing data the background is colored red. The whole plot is pretty much self-explanatory. The plotting function itself needs no further configuration parameters, nevertheless it allows passing through of plot parameters (via ...).
This function also visualizes the distribution of missing values within a time series. This is done as a barplot, which is especially useful if the time series would otherwise be too large to be plotted. Multiple observations for time intervals are grouped together and represented as bars. For these intervals, information about the amount of missing values are shown. An example usage of the function can be seen below (for the plot see Figure 3).
# Example Code 'plotNA.distributionBar'
# (tsHeating is an example time series provided by the imputeTS package)
# Visualize the missing values in this time series
plotNA.distributionBar(tsHeating, breaks = 20)
As can be seen in the x-axis of Figure 3, the
tsHeating
series is with over 600.000 observations a very large time
series. While the missing values in the tsAirgap
series (144
observations) can be visualized with plotNA.distribution
like in
Figure 2, this would for sure not work out for
tsHeating
. There just isn’t enough space for 600.000 single
consecutive observations/points in the plotting area. The
plotNA.distributionBar
function solves this problem. Multiple
observations are grouped together in intervals. The ‘breaks’ parameter
in the example defines that there should be 20 intervals used. This
means every interval in Figure 3 represents
approximately 30.000 observations. The first five intervals are
completely green, which means there are no missing values present. This
means from observation 1 up to observation 150.000 there are no missing
values in the data. In the middle and at the end of the series there are
several intervals each having around 40% of missing data. This means in
these intervals 12.000 out of 30.000 observation are NA. All in all, the
plot is able to give a nice but rough overview about the NA distribution
in very large time series.
This plotting function can be used to visualize how often different NA gaps (NAs in a row) occur in a time series. The function shows this information as a ranking. This ranking can be ordered by total NAs gap sizes account for (number occurrence gap size * gap length) or just by the number of occurrences of gap sizes. In the end the results can be read like this: In time series x, 3 NAs in a row occur most often with 20 occurrences, 6 NAs in a row occur 2nd most with 5 occurrences, 2 NAs in a row occur 3rd most with 3 occurrences. An example usage of the function can be seen below(for the plot see Figure 4).
# Example Code 'plotNA.gapsize'
# (tsNH4 is an example time series provided by the imputeTS package)
# Visualize the top gap sizes / NAs in a row
plotNA.gapsize(tsNH4)
The example plot (Figure 4) reads the following: In the
time series tsNH4
gap size 157 occurs just 1 time, but makes up for
most NAs of all gap sizes (157 NAs). A gap size of 91 (91 NAs in a row)
also occurs just once, but makes up for 2nd most NAs (91 NAs). A gap
size of 42 occurs two times in the time series, which leads to 3rd most
overall (84 NAs). A gap size of one (no other NAs before or behind the
NA) occurs 68 times, which makes this 4th in overall NAs (68 NAs).
This plot can be used, to visualize the imputed values for a time series. Therefore, the imputed values (filled NA gaps) are shown in a different color than the other values. The function is used as below and Figure 5 shows the output.
# Example Code 'plotNA.imputations'
# (tsAirgap is an example time series provided by the imputeTS package)
# Step 1: Perform imputation for x using na.mean
<- na.mean(tsAirgap)
tsAirgap.imp
# Step 2: Visualize the imputed values in the time series
plotNA.imputations(tsAirgap, tsAirgap.imp)
The visual inspection of Figure 5 indicates, that
the imputed values (red) do not fit very well in the tsAirgap
series.
This is caused by na.mean
being used for imputation of a series with a
strong trend. The plotting function enables users to quickly detect such
problems in the imputation results. If the ground truth is known for the
imputed values, this information can also be added to the plot. The
plotting function itself needs no further configuration parameters.
Nevertheless, it allows passing through of plot parameters (via ...).
The statsNA
function prints summary stats about the distribution of
missing values in univariate time series. Here is a short explanation
about the information it gives:
Length of time series
Number of observations in the time series (including NAs)
Number of Missing Values
Number of missing values in the time series
Percentage of Missing Values Percentage of missing values in the time series
Stats for Bins
Number/percentage of missing values for the split into bins
Longest NA gap
Longest series of consecutive missing values (NAs in a row) in the
time series
Most frequent gap size
Most frequent occurring series of missing values in the time series
Gap size accounting for most NAs
he series of consecutive missing values that accounts for most
missing values overall in the time series
Overview NA series
Overview about how often each series of consecutive missing values
occurs. Series occurring 0 times are skipped
The function is used as below and Figure 6 shows the
output.
# Example Code 'statsNA'
# (tsNH4 is an example time series provided by the imputeTS package)
# Print stats about the missing data
statsNA(tsNH4)
Using the datasets is self-explanatory, after the package is loaded they are directly available and usable under their name. No call of data() is needed. For every dataset there is always a complete version (without NAs) and an incomplete version (containing NAs) available.
# Example Code to use tsAirgap dataset
library("imputeTS")
tsAirgap
Missing data is a very common problem for all kinds of data. However, in
case of univariate time series most standard algorithms and existing
functions within R packages cannot be applied.
This paper presented the imputeTS package that provides a collection
of algorithms and tools especially tailored to this task. Using example
time series, we illustrated the ease of use and the advantages of the
provided functions. Simple algorithms as well as more complicated ones
can be applied in the same simple and user-friendly manner.
The functionality provided makes the imputeTS package a good choice
for preprocessing of time series ahead of further analysis steps that
require complete absence of missing values.
Future research and development plans for forthcoming versions of the
package include adding additional time series algorithm options to
choose from.
Parts of this work have been developed in the project ‘IMProvT: Intelligente Messverfahren zur Prozessoptimierung von Trinkwasserbereitstellung und -verteilung’ (reference number: 03ET1387A). Kindly supported by the Federal Ministry of Economic Affairs and Energy of the Federal Republic of Germany.
AMELIA, mice, VIM, missMDA, imputeTS, zoo, forecast, spacetime, timeSeries, xts
Econometrics, Environmetrics, Finance, MissingData, MixedModels, OfficialStatistics, Psychometrics, Spatial, SpatioTemporal, TimeSeries
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Moritz & Bartz-Beielstein, "imputeTS: Time Series Missing Value Imputation in R", The R Journal, 2017
BibTeX citation
@article{RJ-2017-009, author = {Moritz, Steffen and Bartz-Beielstein, Thomas}, title = {imputeTS: Time Series Missing Value Imputation in R}, journal = {The R Journal}, year = {2017}, note = {https://rjournal.github.io/}, volume = {9}, issue = {1}, issn = {2073-4859}, pages = {207-218} }