This paper presents a computational program named BINCOR (BINned CORrelation) for estimating the correlation between two unevenly spaced time series. This program is also applicable to the situation of two evenly spaced time series not on the same time grid. BINCOR is based on a novel estimation approach proposed by (Mudelsee 2010) for estimating the correlation between two climate time series with different timescales. The idea is that autocorrelation (e.g. an AR1 process) means that memory enables values obtained on different time points to be correlated. Binned correlation is performed by resampling the time series under study into time bins on a regular grid, assigning the mean values of the variable under scrutiny within those bins. We present two examples of our BINCOR package with real data: instrumental and paleoclimatic time series. In both applications BINCOR works properly in detecting well-established relationships between the climate records compared.
There are several approaches for quantifying the potential association between two evenly spaced climate time series, e.g. Pearson’s and Spearman’s correlation or the cross-correlation function (CCF). However, these methods should not be directly applied when the time series are unevenly spaced (“irregular”), particularly when two time series under analysis are not sampled at identical points in time, as is usually the case in climate research, especially in paleoclimate studies (Weedon 2003; Mudelsee 2014; Emile-Geay 2016). The most common way of tackling this problem is to interpolate the original unevenly spaced climate time series in the time domain so as to obtain equidistance and the same times. The series can then be analysed using existing conventional correlation analysis techniques. However, experience shows that interpolation has its drawbacks: depending on the features of the method applied, the interpolated time series may show deviations in terms of variability or noise properties, and additional serial dependence may be introduced (Horowitz 1974; Mudelsee 2014; Olafsdottir and Mudelsee 2014). Thus, interpolation should be avoided as far as possible.
Fortunately, there are some algorithms and software available to carry out this task, at least for unevenly spaced climate time series sampled at identical points in time (Mudelsee 2003; Olafsdottir and Mudelsee 2014). However, there are few statistical techniques for estimating the correlation between two time series not sampled at identical points in time and their corresponding computational implementations. One exception is the Gaussian-Kernel-based cross-correlation (gXCF) method and its associated software named NESTOOLBOX (Rehfeld et al. 2011; Rehfeld and Bedartha 2014; Rehfeld and Kurths 2014) and the extended version (Roberts et al. 2017) that includes a confidence interval obtained by a bootstrapping resampling approach; another exception is binned correlation as proposed by (Mudelsee 2010, 2014). However, the software for this method is not freely available on the Internet.
Binned correlation is a statistical technique developed to estimate the correlation between two unevenly spaced time series sampled at different points in time. It is also applicable to two evenly spaced time series that are not on the same time grid (Mudelsee 2014). It is performed by resampling the time series into time bins on a regular grid, and then assigning the mean values of the variable under scrutiny within those bins. (Mudelsee 2010) proposes a novel approach adapting the binned correlation technique (used mainly with astronomical data) to analyse climate time series taking into account their memory (or persistence), which is a genuine property of climate time series. Autocorrelation, persistence, memory or serial dependence is characteristic of weather and climate fluctuations, and is recorded in climate time series (Mudelsee 2002; Wilks 2011). A simple persistence model used to “represent” climate time series is a first-order autoregressive (AR1) process where a fluctuation depends only on its own immediate past plus a random component (Gilman et al. 1963; Mann and Lees 1996; Mudelsee 2002). However, paleoclimate time series are usually unevenly spaced in time, and it is necessary to use an AR1 version for the case of uneven spacing, such as the method proposed by (Robinson 1977). The technique of (Mudelsee 2010) requires the concept of nonzero persistence times, enabling the mixing information (i.e. covariance) to be recovered, even when the two timescales differ. The BINCOR package presented in this paper is based on a method that is not applicable when one or both of the time series under examination have zero persistence. Similarly, this method is not applicable when the time series are sampled with significantly longer spacing than the persistence time, so that the effectively sampled persistence time is zero. A fundamental condition for using this method is that the time spacing should not be much larger than the persistence times. Enough common data points then fall within a time bin, and knowledge can be acquired on the covariance (Mudelsee 2010, 2014).
In this paper we present a computational package named BINCOR (BINned
CORrelation), which is based on the approach proposed by
(Mudelsee 2010, 2014). The BINCOR package contains (i) a
main function named bin_cor
, which is used to convert the irregular
time series to a binned time series; (ii) two complementary functions
(cor_ts
and ccf_ts
) for computing the correlation between the two
binned climate time series obtained with the bin_cor
function; and
(iii) an additional function (plot_ts
) for plotting the “primary” vs.
the binned time series. This package is programmed in R
language and
is available at the CRAN repository
(https://CRAN.R-project.org/package=BINCOR).
This paper is divided into four sections. The first outlines the method and the computational program. The second presents a Monte Carlo experiment to study the effect of binning size selection. In the Examples section we apply BINCOR to a couple of unevenly spaced real-world climate data sets: instrumental and paleoclimate. Finally, the Summary section presents our main conclusions.
In this section we outline the main mathematical ideas behind the binned correlation technique for unevenly spaced sampled at different points in time, following the methodology introduced by (Mudelsee 2010, 2014). The procedure is described as follows:
Input: two unevenly spaced climate time series
Compute the average spacing between samples
where
Estimate the bin-width (
Estimate the bias-corrected equivalent autocorrelation coefficients
Estimate the bin-width as
FLAGTAU option | Reference | ||
---|---|---|---|
1 | Eq. 7.44 in (Mudelsee 2014) | ||
2 | Eq. 7.45 in (Mudelsee 2014) | ||
3 | Eq. 7.48 in (Mudelsee 2014) |
Determine the number of bins:
Set:
id
id
L
L
if (L
Output: two binned climate time series
Estimate the correlation between the two binned time series. This
can be done through the native R
functions cor
and ccf
or by
means of the BINCOR functions cor_ts
and ccf_ts
.
We conducted Monte Carlo experiments to study how the specific rules (Table 1) chosen for calculating the bin-width based on persistence reduce the error compared to arbitrarily choosing a bin-width. The parameter configuration for the Monte Carlo experiments is presented in Figure 2. To carry out the Monte Carlo simulations, we used the bivariate Gaussian AR1 process for uneven time spacings (Mudelsee 2014), which is given by
where
Control case (equal timescales):
“Well” mixed unequal timescales:
“Wildly” mixed unequal timescales:
|
|
|
|
|
|
The outcome of the Monte Carlo experiments is as follows: 1) For equal
timescales (figures not shown), all three rules behave similarly (as
expected) in terms of RMSE, although the RMSE increases slightly as the
persistence increases. 2) The well mixed case shows that for RMSE the
rules take two different “patterns” with the first two rules (sum and
max) on one hand and the third rule (the default rule option) on the
other. This difference is most noticeable in the first values of the
samples (from 10 to 100) and is most pronounced with high persistence
values (
The BINCOR package developed in R
version 3.1.2 to be run from the
command line runs on all major operating systems and is available from
the CRAN repository (http://CRAN.R-project.org/package=BINCOR). The
BINCOR package contains four functions: 1) bin_cor
(the main
function for building the binned time series); 2) plot_ts
(for
plotting and comparing the “primary” and binned time series); 3)
cor_ts
(for estimating the correlation between the binned time
series); and 4) ccf_ts
(for estimating the cross-correlation between
the binned time series). The graphical outputs can be displayed on the
screen or saved as PNG, JPG, or PDF graphics files. BINCOR depends on
the dplR (Bunn et al. 2015) and
pracma (Borchers 2015)
packages. The dplR package
is used by the function bin_cor
to calculate the persistence for the
climate time series under study, whereas the
pracma package is used by
the functions cor_ts
and ccf_ts
to remove the linear trend before
estimating the correlation.
The first (and main) function, bin_cor
, estimates the binned time
series taking into account the memory or persistence of the unevenly
spaced climate time series to be analysed (Mudelsee 2002). It has the
following syntax:
R> bin_cor(ts1, ts2, FLAGTAU=3, ofilename),
where
ts1
and ts2
are unevenly spaced time series.FLAGTAU
defines the method used to estimate the bin-width
(FLAGTAU = 3
) as the default rule because Monte Carlo simulations
perform better in terms of RMSE than the other rules in estimating
the bin-width and the binned correlations (Mudelsee 2014).ofilename
is the name of the output file (in ASCII format) which
contains the binned time series.bin_cor
returns a list object containing the following outputs:
"Binned_time_series", "Auto._cor._coef._ts1", "Persistence_ts1", "Auto._cor._coef._ts2",
"Persistence_ts2", "bin width", "Number_of_bins", "Average spacing", "VAR. ts1",
"VAR. bin ts1", "VAR. ts2", "VAR. bin ts2", "VAR. ts1 - VAR bints1",
"VAR. ts2 - VAR bints2", "% of VAR. lost ts1", "% of VAR. lost ts2".
The names of the outputs are self-explanatory, but we wish to highlight
that Average spacing
is the mean value of the times for the binned
time series; VAR. ts1
, VAR. bin ts1
, VAR. ts2
and VAR. bin ts2
are the variances for ts1
and ts2
for their respective binned time
series; the next two outputs are the differences between the variances
of ts1
and ts2
and their corresponding binned time series; and the
last two outputs are the percentages of variance lost for ts1
and
ts2
as a result of the binned process.
The second function, called plot_ts
, plots the “primary” (unevenly
spaced) time series and the binned time series. The plot_ts
function
contains the following elements:
R> plot_ts(ts1, ts2, bints1, bints2, varnamets1="", varnamets2="",
colts1=1, colts2=1, colbints1=2, colbints2=2, ltyts1=1,
ltyts2=1, ltybints1=2, ltybints2=2, device="screen", ofilename),
where the input arguments ts1
and ts2
are the unevenly spaced time
series, bints1
and bints2
are the binned time series, varnamets1
and varnamets2
are the names of the variables under study, colts1
,
colts2
(by default both curves are in black) and colbints1
,
colbints2
(by default both curves are in red) are the colours for the
“primary” and binned times series; ltyts1
, ltyts2
, ltybints1
and
ltybints2
are the types of line to be plotted for the “primary” and
binned times series, respectively (1 = solid, 2 = dashed, 3 = dotted, 4
= dot-dashed, 5 = long-dashed, 6 = double-dashed); device
is the type
of output device (“screen” by default, the other options being “jpg,”
“png,” and “pdf”); resfig
is the image resolution in “ppi” (by default
R
does not record a resolution in the image file, except for BMP; 150
ppi could be a suitable value); ofilename
is the output filename; and
finally, Hfig
, WFig
and Hpdf
, Wpdf
are the height and width of
the output for the JPG/PNG and PDF formats, respectively.
The third function, cor_ts
, calculates three types of correlation
coefficient: Pearson’s correlation, Spearman’s and Kendall’s rank
correlations. These correlation coefficients are estimated through the
native R
function cor.test
from the R
package Stats
. The
cor_ts
function has an option to remove the linear trend of the time
series under analysis – other pre-processing methods could be used
before the cor_ts
function is applied. This function has the following
syntax:
R> cor_ts(bints1, bints2, varnamets1="", varnamets2="",
KoCM, rmltrd="N", device="screen", Hfig, Wfig, Hpdf, Wpdf,
resfig, ofilename)
where KoCM
indicates the correlation estimator: pearson
for Pearson
(the option by default), spearman
for Spearman and kendall
for
Kendall; rmltrd
is the option to remove the linear trend in the time
series under study (by default the linear trend is not removed, but the
function can be enabled via the option “Y” or “y”). The other parameters
are described some lines above. cor_ts
has as its output a list object
containing the main information for the estimated correlation
coefficient (e.g. a 95% confidence interval for Pearson and a p-value
for Spearman and Kendall). The cor_ts
function also provides a
scatterplot for the binned time series, which can be plotted on the
screen (by default) or saved in JPG, PNG or PDF formats (the parameter
ofilename
is available to assign a name to this output).
Finally, the fourth function, ccf_ts
, estimates and plots the
cross-correlation between two evenly spaced paleoclimate time series. We
use the native R
function ccf
(R
Stats
package) to estimate the
cross-correlation in our ccf_ts
function. The ccf_ts
function has
the following syntax:
R> ccf_acf <- ccf_ts(bints1, bints2, lagmax=NULL, ylima=-1, ylimb=1,
rmltrd="N", RedL=T, device="screen", Hfig, Wfig,
Hpdf, Wpdf, resfig, ofilename)
All these elements are already defined above except the parameters
lagmax=NULL
, ylima=-1
, ylimb=1
and RedL
. The first parameter
indicates the maximum lag for which the cross-correlation is calculated
(its value depends on the length of the data set), the next two
parameters indicate the extremes of the range in which the CCF will be
plotted and the last parameter (the default option is TRUE) plots a
straight red line to highlight the correlation coefficient at lag 0. The
ccf_ts
function generates as its output the acf
(auto-correlation
function; ACF) R
object, which is a list with the following
parameters: lag
is a three dimensional array containing the lags at
which the ACF is estimated; acf
is an array with the same dimensions
as lag
containing the estimated ACF; type
is the type of correlation
(correlation
(the default), covariance
and partial
); n.used
is
the number of observations in the time series; and snames
provides the
names of the time series (bints1
and bints2
).
We first examine two evenly-spaced annually-resolved instrumental
climate records that cover the time interval from 1850 to 2006
(z
=6.52 and p-value
z
= 10.214 and p-value
A) | C) |
|
|
B) | D) |
|
|
The code used to generate Figure 3 is shown below.
# Load the package
library(BINCOR)
# Load the time series under analysis: Example 1 and Figure 1 (ENSO vs. NHSST)
data(ENSO)
data(NHSST)
# Compute the binned time series though our bin_cor function
bincor.tmp <- bin_cor(ENSO.dat, NHSST.dat, FLAGTAU=3, "output_ENSO_NHSST.tmp")
binnedts <- bincor.tmp$Binned_time_series
# Applying our plot_ts function
# "Screen"
plot_ts(ENSO.dat, NHSST.dat, binnedts[,1:2], binnedts[,c(1,3)], "ENSO-Nino3",
"SST NH Mean", colts1=1, colts2=2, colbints1=3, colbints2=4, device="screen")
Figures 3 A and 3 B show
the binned time series (ENSO in green and NH-SST in red) obtained with
our bin_cor
function. Although we use residuals, they show a relative
high autocorrelation (R
package
TSdist
(Mori et al. 2015; Mori et al. 2016). The dissimilarity metric (DISSIM)
has the following interpretation: a value of zero indicates a perfect
relationship such that the closer DISSIM is to zero, the more similar
are the time series. The DISSIM between the binned and “primary” ENSO
time series and the binned and “primary” NH-SST series are 3.70 and
0.84, respectively. This corroborates the similarity between the
“primary” and binned time series observed visually. Figure
3 also shows a comparison between the “primary”
climate time series (Figure 3 C) and the binned
series (Figure 3 D). Note that this plot shows
that the number of elements (bin_cor
function is
able to tackle time series with different numbers of elements.
The second result obtained from our BINCOR package, and more
specifically from the cor_ts
function, is shown in Figure
4, which shows the scatterplot between the ENSO
(x-axis) and NH-SST (y-axis) binned time series. This scatterplot shows
a moderate increasing trend from left to right, suggesting a potentially
positive relationship between the two binned time series. This pattern
can be confirmed statistically by means of the cor_ts
function output,
which also provides the correlation coefficient between two time series
under analysis. For this case, the Pearson’s correlation (with 95%
confidence interval) obtained is cor_ts
). This value is close to
the Pearson’s correlation estimated for the evenly spaced climate time
series, which is
The code used to generate Figure 2 is shown below.
# Load packages
library(BINCOR)
library(pracma)
# Load the time series under analysis: Example 1 and Figure 2 (ENSO vs. NHSST)
data(ENSO)
data(NHSST)
# Compute the binned time series though our bin_cor function
bincor.tmp <- bin_cor(ENSO.dat, NHSST.dat, FLAGTAU=3, "output_ENSO_NHSST.tmp")
binnedts <- bincor.tmp$Binned_time_series
# Compute the scatterplot by means of our function cor_ts
# PDF format (scatterplot) and Pearson
cor_ts(binnedts[,1:2], binnedts[,c(1,3)], "ENSO-Nino3", "SST NH Mean",
KoCM="pearson", rmltrd="y", device="pdf", Hpdf=6, Wpdf=9, resfig=300,
ofilename="scatterplot_ENSO_SST")
We report an analysis of two temporally unevenly-spaced pollen records
from two marine sediment cores (MD04 and MD95) collected on the
south-western European margin (Figure 5). The aim of
this case study is to show the use of BINCOR to estimate the
correlation between two unevenly spaced paleoclimate time series by
means of the cross-correlation function. The pollen time series analysed
in this example span the interval between 73,000 and 15,000 years before
present (BP), thus covering the last glacial period (LGP). The climate
during the LGP was characterised by millennial variability with “abrupt”
transitions between cold stadials and warm interstadials known as
Dansgaard-Oeschger (D-O) cycles
(Dansgaard et al. 1993; Wolff et al. 2012). The D-O cycles are
characterised by rather fast atmospheric warming events over Greenland
of up to 16
Figure 6 illustrates the variations in the
pollen percentages of the temperate forest, a type of vegetation typical
of moderate, warm, wet climates. Figure 6 A
shows the primary and binned pollen records from site MD04-2845
(Sanchez Goni et al. 2008; Sánchez Goñi et al. 2017). Figure
6 B shows the primary and binned pollen records
from site MD95-2039 (Roucoux et al. 2005; Sánchez Goñi et al. 2017). We
use the pollen time series with a harmonised, consistent chronology
(Sánchez Goñi et al. 2017) to carry out a fair comparison. We apply our
bin_cor
and plot_ts
functions and obtain the binned time series,
which have 27 elements, and a temporal distance between elements of 1220
years. The binned time series show a relatively high level of
autocorrelation,
A) | C) |
|
|
B) | D) |
|
|
The code used to generate Figure 6 is as follows.
# Load the package
library(BINCOR)
library(pracma)
# Load the time series under analysis: Example 2 and Figure 6
data(MD04_2845_siteID31)
data(MD95_2039_siteID32)
# Compute the binned time series though our bin_cor function
bincor.tmp <- bin_cor(ID31.dat, ID32.dat, FLAGTAU=3, "salida_ACER_ABRUPT.tmp")
binnedts <- bincor.tmp$Binned_time_series
# To avoid NA values
bin_ts1 <- na.omit(bincor.tmp$Binned_time_series[,1:2])
bin_ts2 <- na.omit(bincor.tmp$Binned_time_series[,c(1,3)])
# Applying our plot_ts function
# PDF format
plot_ts(ID31.dat, ID32.dat, bin_ts1, bin_ts2, "MD04-2845 (Temp. forest)",
"MD95-2039 (Temp. forest )", colts1=1, colts2=2, colbints1=3, colbints2=4,
device="pdf", Hpdf=6, Wpdf=9, resfig=300, ofilename="ts_ACER_ABRUPT")
The cross-correlation (CCF) analysis obtained with our ccf_ts
function
is shown in Figure 7. Before applying the ccf_ts
function, a linear trend was removed from the binned time series by
enabling the rmltrd
option in ccf_ts
, and then the residuals were
used. The CCF reveals a high correlation (
The code used to generate Figure 7 is the following.
# Load packages
library(BINCOR)
library(pracma)
# Load the time series under analysis: Example 2 and Figure 7 (ID31 vs. ID32)
data(MD04_2845_siteID31)
data(MD95_2039_siteID32)
# Compute the binned time series though our bin_cor function
bincor.tmp <- bin_cor(ID31.dat, ID32.dat, FLAGTAU=3, "salida_ACER_ABRUPT.tmp")
binnedts <- bincor.tmp$Binned_time_series
# To avoid NA values
bin_ts1 <- na.omit(bincor.tmp$Binned_time_series[,1:2])
bin_ts2 <- na.omit(bincor.tmp$Binned_time_series[,c(1,3)])
# Applying our ccf_ts function
# PDF format
ccf_acf <- ccf_ts(bin_ts1, bin_ts2, RedL=TRUE, rmltrd="y", device="pdf", Hpdf=6,
Wpdf=9, resfig=300, ofilename="ccf_ID31_ID32_res")
We present a computational package named BINCOR (BINned CORrelation)
that can be used to estimate the correlation between two unevenly spaced
climate time series which are not necessarily sampled at identical
points in time, and between two evenly spaced time series which are not
on the same time grid. BINCOR is based on a novel estimation approach
proposed by (Mudelsee 2010). This statistical technique requires the
concept of nonzero persistence times, thus enabling mixing information
to be recovered, even when the two timescales examined differ
(Mudelsee 2014). The package contains four functions (bin_cor
,
cor_ts
, ccf_ts
and plot_ts
) with a number of parameters to obtain
a high degree of flexibility in the analysis. BINCOR is programmed in
R
language and is available from the CRAN repository. The results when
BINCOR s applied to real climate data sets suggest that the R
package BINCOR performs and works properly in detecting relationships
between instrumental and paleoclimate records.
JMPM was funded by a Basque Government post-doctoral fellowship. MM’s work was supported by the European Commission via Marie Curie Initial Training Network LINC (project number 289447) under the Seventh Framework Programme. Thanks to Charo Sánchez for help to use the i2BASQUE HPC facilities, to the two anonymous reviewers and Editor (Olivia Lau) for their input and comments that have improved the quality of the manuscript. The authors thank the support of the computing infrastructure of the i2BASQUE (Basque Government) academic network. The persistence time estimation software is freely available via http://www.climate-risk-analysis.com/software/.
DifferentialEquations, NumericalMathematics, TimeSeries
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Polanco-Martinez, et al., "BINCOR: An R package for Estimating the Correlation between Two Unevenly Spaced Time Series", The R Journal, 2019
BibTeX citation
@article{RJ-2019-035, author = {Polanco-Martinez, Josue M. and Medina-Elizalde, Martin A. and Goni, Maria F. Sanchez and Mudelsee, Manfred}, title = {BINCOR: An R package for Estimating the Correlation between Two Unevenly Spaced Time Series}, journal = {The R Journal}, year = {2019}, note = {https://rjournal.github.io/}, volume = {11}, issue = {1}, issn = {2073-4859}, pages = {170-184} }