This paper presents a computational program named *BINCOR* (BINned CORrelation) for estimating the correlation between two unevenly spaced time series. This program is also applicable to the situation of two evenly spaced time series not on the same time grid. *BINCOR* is based on a novel estimation approach proposed by (Mudelsee 2010) for estimating the correlation between two climate time series with different timescales. The idea is that autocorrelation (e.g. an AR1 process) means that memory enables values obtained on different time points to be correlated. Binned correlation is performed by resampling the time series under study into time bins on a regular grid, assigning the mean values of the variable under scrutiny within those bins. We present two examples of our *BINCOR* package with real data: instrumental and paleoclimatic time series. In both applications *BINCOR* works properly in detecting well-established relationships between the climate records compared.

There are several approaches for quantifying the potential association between two evenly spaced climate time series, e.g. Pearson’s and Spearman’s correlation or the cross-correlation function (CCF). However, these methods should not be directly applied when the time series are unevenly spaced (“irregular”), particularly when two time series under analysis are not sampled at identical points in time, as is usually the case in climate research, especially in paleoclimate studies (Weedon 2003; Mudelsee 2014; Emile-Geay 2016). The most common way of tackling this problem is to interpolate the original unevenly spaced climate time series in the time domain so as to obtain equidistance and the same times. The series can then be analysed using existing conventional correlation analysis techniques. However, experience shows that interpolation has its drawbacks: depending on the features of the method applied, the interpolated time series may show deviations in terms of variability or noise properties, and additional serial dependence may be introduced (Horowitz 1974; Mudelsee 2014; Olafsdottir and Mudelsee 2014). Thus, interpolation should be avoided as far as possible.

Fortunately, there are some algorithms and software available to carry out this task, at least for unevenly spaced climate time series sampled at identical points in time (Mudelsee 2003; Olafsdottir and Mudelsee 2014). However, there are few statistical techniques for estimating the correlation between two time series not sampled at identical points in time and their corresponding computational implementations. One exception is the Gaussian-Kernel-based cross-correlation (gXCF) method and its associated software named NESTOOLBOX (Rehfeld et al. 2011; Rehfeld and Bedartha 2014; Rehfeld and Kurths 2014) and the extended version (Roberts et al. 2017) that includes a confidence interval obtained by a bootstrapping resampling approach; another exception is binned correlation as proposed by (Mudelsee 2010, 2014). However, the software for this method is not freely available on the Internet.

Binned correlation is a statistical technique developed to estimate the
correlation between two unevenly spaced time series sampled at different
points in time. It is also applicable to two evenly spaced time series
that are not on the same time grid (Mudelsee 2014). It is performed by
resampling the time series into time bins on a regular grid, and then
assigning the mean values of the variable under scrutiny within those
bins. (Mudelsee 2010) proposes a novel approach adapting the binned
correlation technique (used mainly with astronomical data) to analyse
climate time series taking into account their memory (or persistence),
which is a genuine property of climate time series. Autocorrelation,
persistence, memory or serial dependence is characteristic of weather
and climate fluctuations, and is recorded in climate time series
(Mudelsee 2002; Wilks 2011). A simple persistence model used to
“represent” climate time series is a first-order autoregressive (AR1)
process where a fluctuation depends only on its own immediate past plus
a random component (Gilman et al. 1963; Mann and Lees 1996; Mudelsee 2002).
However, paleoclimate time series are usually unevenly spaced in time,
and it is necessary to use an AR1 version for the case of uneven
spacing, such as the method proposed by (Robinson 1977). The technique
of (Mudelsee 2010) requires the concept of nonzero persistence times,
enabling the mixing information (i.e. covariance) to be recovered, even
when the two timescales differ. The
*BINCOR* package presented
in this paper is based on a method that is not applicable when one or
both of the time series under examination have zero persistence.
Similarly, this method is not applicable when the time series are
sampled with significantly longer spacing than the persistence time, so
that the effectively sampled persistence time is zero. A fundamental
condition for using this method is that the time spacing should not be
much larger than the persistence times. Enough common data points then
fall within a time bin, and knowledge can be acquired on the covariance
(Mudelsee 2010, 2014).

In this paper we present a computational package named *BINCOR* (BINned
CORrelation), which is based on the approach proposed by
(Mudelsee 2010, 2014). The *BINCOR* package contains (i) a
main function named `bin_cor`

, which is used to convert the irregular
time series to a binned time series; (ii) two complementary functions
(`cor_ts`

and `ccf_ts`

) for computing the correlation between the two
binned climate time series obtained with the `bin_cor`

function; and
(iii) an additional function (`plot_ts`

) for plotting the “primary” vs.
the binned time series. This package is programmed in `R`

language and
is available at the CRAN repository
(https://CRAN.R-project.org/package=BINCOR).

This paper is divided into four sections. The first outlines the method
and the computational program. The second presents a Monte Carlo
experiment to study the effect of binning size selection. In the
Examples section we apply *BINCOR* to a couple of unevenly spaced
real-world climate data sets: instrumental and paleoclimate. Finally,
the Summary section presents our main conclusions.

In this section we outline the main mathematical ideas behind the binned correlation technique for unevenly spaced sampled at different points in time, following the methodology introduced by (Mudelsee 2010, 2014). The procedure is described as follows:

Input: two unevenly spaced climate time series \(\{X(i), T_X\}_{i=1}^{N_X}\) and \(\{Y(i), T_Y\}_{i=1}^{N_Y}\), where \(T_X\), \(T_Y\) and \(N_Y\), \(N_Y\) are the time domains and the sample sizes of each series, respectively.

Compute the average spacing between samples

- \(\bar{d}_X = [T_X(N_X) - T_X(1)]/(N_X - 1)\)
- \(\bar{d}_Y = [T_Y(N_X) - T_Y(1)]/(N_Y - 1)\)
- \(\bar{d}_{XY} = [\bar{T}_\mathrm{max} - \bar{T}_\mathrm{min}]/(N_X + N_Y - 1)\)

where \(\bar{T}_\mathrm{max} = \max[T_X(N_X), T_Y(N_Y)]\) and \(\bar{T}_\mathrm{min} = \min[T_X(1), T_Y(1)]\).

Estimate the bin-width (\(\bar{\tau}\)) taking into account the persistence (memory) estimated for each unevenly spaced climate time series, \(X\) and \(Y\) denoted as \(\hat{\tau}_X\) and \(\hat{\tau}_Y\), respectively. To estimate the persistence, an AR1 model (Robinson 1977) is fitted to each unevenly spaced time series (Mudelsee 2002).

*BINCOR*includes three rules for estimating the bin-width (the options are shown in Table 1), but we prefer to use rule number 3 as the default value (FLAGTAU=3) because in terms of the RMSE (Section Monte Carlo experiments) of this rule Monte Carlo simulations are superior to the other rules for estimating the bin-width (Mudelsee 2014).Estimate the bias-corrected equivalent autocorrelation coefficients

- \(\hat{\bar{a}}'_X = \exp (-\bar{d}_X/\hat{\tau}'_X)\), \(\hat{\bar{a}}'_Y = \exp (-\bar{d}_Y/\hat{\tau}'_Y)\), and \(\hat{\bar{a}}'_{XY} = \sqrt{ \hat{\bar{a}}'_X \cdot \hat{\bar{a}}'_Y }\)

Estimate the bin-width as \(\bar{\tau} = -\bar{d}_{XY} / \ln (\hat{\bar{a}}'_{XY})\) (Eq. 7.48 in (Mudelsee 2002)), the default option (FLAGTAU=3) in the BINCOR package, other options are:

Table 1: The FLAGTAU options and its corresponding methods (rules) to estimate the bin-width. \(\bar{\tau}\) rule FLAGTAU option Reference \(\tau_x + \tau_y\) 1 Eq. 7.44 in (Mudelsee 2014) \(\mathrm{max}(\tau_x, \tau_y)\) 2 Eq. 7.45 in (Mudelsee 2014) \(-\bar{d}_{XY} / \ln (\hat{\bar{a}}'_{XY})\) 3 Eq. 7.48 in (Mudelsee 2014)

Determine the number of bins: \(N_b = (\bar{T}_\mathrm{max} - \bar{T}_\mathrm{min}) / \bar{\tau}\)

Set: \(\lim_\mathrm{inf}(n=1) = \bar{T}_\mathrm{min}\). Then, for \(n=1, 2, \dots, N_b\), define (Figure 1):

\(\lim_\mathrm{sup}(n) = \bar{T}_\mathrm{min} + n \cdot \bar{\tau}\)

id\(T_X\) = WHICH \([T_X \geq \lim_\mathrm{inf}(n)\) AND \(T_X \leq \lim_\mathrm{sup}(n)]\)

id\(T_Y\) = WHICH \([T_Y \geq \lim_\mathrm{inf}(n)\) AND \(T_Y \leq \lim_{sup}(n)]\)

L\(T_X\) = LENGTH(id\(T_X\))

L\(T_Y\) = LENGTH(id\(T_Y\))

if (L\(T_X\) \(>\) 0 AND L\(T_Y\) \(>\) 0)

\(F(n)\) = mean of \(X\)(id\(T_X\))

\(G(n)\) = mean of \(Y\)(id\(T_Y\))

\(T(n)\) = [\(\lim_\mathrm{inf}(n)\) + \(\lim_\mathrm{sup}(n)\)] / 2

\(\lim_\mathrm{inf}(n) = \lim_\mathrm{sup}(n)\)

Output: two binned climate time series \(\{T_n,\, F(n)\}_{n=1}^{N_b}\) and \(\{T_n, G(n)\}_{n=1}^{N_b}\), where \(N_b\) is the number of bins.

Estimate the correlation between the two binned time series. This can be done through the native

`R`

functions`cor`

and`ccf`

or by means of the*BINCOR*functions`cor_ts`

and`ccf_ts`

.

We conducted Monte Carlo experiments to study how the specific rules (Table 1) chosen for calculating the bin-width based on persistence reduce the error compared to arbitrarily choosing a bin-width. The parameter configuration for the Monte Carlo experiments is presented in Figure 2. To carry out the Monte Carlo simulations, we used the bivariate Gaussian AR1 process for uneven time spacings (Mudelsee 2014), which is given by

\[\begin{aligned} ~\label{biAR1-1} X(1) = \mu_{N(0,1)}^{X}(1), \nonumber \\ Y(1) = \mu_{N(0,1)}^{Y}(1), \nonumber \\ X(t) = a_X X(t-1) + \mu_{N(0,1-a_X^2)}^{X}(t), \;\; t= 2,...,N,\nonumber \\ Y(t) = a_Y Y(t-1) + \mu_{N(0,1-a_Y^2)}^{Y}(t), \;\; t= 2,...,N, \end{aligned} \tag{1}\]

where \(a_X\) and \(a_Y\), the autoregressive parameters for \(X(t)\) and \(Y(t)\), are defined as (Mudelsee 2014): \(a_X = exp\{-[T_X(t) - T_X(t-1)]/\tau_X\}\) and \(a_Y = exp\{-[T_Y(t) - T_Y(t-1)]/\tau_Y\}\). The correlation (by construction) between \(X(t)\) and \(Y(t)\) is \(\rho_{XY}\) (see Mudelsee 2014 307 for more details about the statistical properties of the bivariate AR1 process for unevenly spaced time series). To generate the uneven timescales for \(X(i)\) and \(Y(j)\), we follow the methodology proposed by (see Mudelsee 2014 299), which consists of producing a number (10 \(N\)) of data pairs on an evenly spaced grid of 1.0, discarding 90% of points and retaining 10% of \(X\) and \(Y\) (\(N_x=N_y=N\)) points. The time points for \(X(i)\) and \(Y(j)\) are subject to the following conditions:

Control case (equal timescales):

- Condition 1: \(N_X=N_Y\)
- Condition 2: \(\{T_X(i)\}_{i=1}^{N_X}=\{T_Y(j)\}_{j=1}^{N_Y}\)

“Well” mixed unequal timescales:

- Condition 1: \(T_X(i) \neq T_Y(j) \; for all \; i \; and \; j\)
- Condition 2: \(T_X(1) < T_Y(1) < T_X(2) < T_Y(2) < T_X(3) < ... < T_X(N_X) < T_Y(N_Y)\)

“Wildly” mixed unequal timescales:

- There are not conditions for this case.

The outcome of the Monte Carlo experiments is as follows: 1) For equal
timescales (figures not shown), all three rules behave similarly (as
expected) in terms of RMSE, although the RMSE increases slightly as the
persistence increases. 2) The well mixed case shows that for RMSE the
rules take two different “patterns” with the first two rules (sum and
max) on one hand and the third rule (the default rule option) on the
other. This difference is most noticeable in the first values of the
samples (from 10 to 100) and is most pronounced with high persistence
values (\(\tau_x\) and \(\tau_y\)). The rule that shows the smallest RMSE is
rule 3 (the default option), though it is important to point out that
for \(\tau_x = \tau_y\) = 50 the RMSE figures are practically
indistinguishable for sample sizes from 200 to 1000. 3) Finally, RMSE in
the wildly mixed case behaves more or less similarly to the well mixed
case, though rule 3 yields the smallest RMSE for all three persistence
values. Bearing in mind that the wildly mixed case does not impose
conditions on generating timescales, and in practice the unevenly spaced
climate time series could contain some degree of randomness in the
sampling times, the best rule in terms of RMSE for estimating bin-width
(\(\bar{\tau}\)) and binned correlation can be said to be number 3, i.e.
the default rule used in *BINCOR* to estimate the bin-width.

The *BINCOR* package developed in `R`

version 3.1.2 to be run from the
command line runs on all major operating systems and is available from
the CRAN repository (http://CRAN.R-project.org/package=BINCOR). The
*BINCOR* package contains four functions: 1) `bin_cor`

(the main
function for building the binned time series); 2) `plot_ts`

(for
plotting and comparing the “primary” and binned time series); 3)
`cor_ts`

(for estimating the correlation between the binned time
series); and 4) `ccf_ts`

(for estimating the cross-correlation between
the binned time series). The graphical outputs can be displayed on the
screen or saved as PNG, JPG, or PDF graphics files. *BINCOR* depends on
the *dplR* (Bunn et al. 2015) and
*pracma* (Borchers 2015)
packages. The *dplR* package
is used by the function `bin_cor`

to calculate the persistence for the
climate time series under study, whereas the
*pracma* package is used by
the functions `cor_ts`

and `ccf_ts`

to remove the linear trend before
estimating the correlation.

The first (and main) function, `bin_cor`

, estimates the binned time
series taking into account the memory or persistence of the unevenly
spaced climate time series to be analysed (Mudelsee 2002). It has the
following syntax:

`> bin_cor(ts1, ts2, FLAGTAU=3, ofilename), R`

where

`ts1`

and`ts2`

are unevenly spaced time series.`FLAGTAU`

defines the method used to estimate the bin-width (\(\bar{\tau}\)). There are three methods included in*BINCOR*for estimating bin-width (Table 1), but we prefer to use (`FLAGTAU = 3`

) as the default rule because Monte Carlo simulations perform better in terms of RMSE than the other rules in estimating the bin-width and the binned correlations (Mudelsee 2014).`ofilename`

is the name of the output file (in ASCII format) which contains the binned time series.

`bin_cor`

returns a list object containing the following outputs:

```
"Binned_time_series", "Auto._cor._coef._ts1", "Persistence_ts1", "Auto._cor._coef._ts2",
"Persistence_ts2", "bin width", "Number_of_bins", "Average spacing", "VAR. ts1",
"VAR. bin ts1", "VAR. ts2", "VAR. bin ts2", "VAR. ts1 - VAR bints1",
"VAR. ts2 - VAR bints2", "% of VAR. lost ts1", "% of VAR. lost ts2".
```

The names of the outputs are self-explanatory, but we wish to highlight
that `Average spacing`

is the mean value of the times for the binned
time series; `VAR. ts1`

, `VAR. bin ts1`

, `VAR. ts2`

and `VAR. bin ts2`

are the variances for `ts1`

and `ts2`

for their respective binned time
series; the next two outputs are the differences between the variances
of `ts1`

and `ts2`

and their corresponding binned time series; and the
last two outputs are the percentages of variance lost for `ts1`

and
`ts2`

as a result of the binned process.

The second function, called `plot_ts`

, plots the “primary” (unevenly
spaced) time series and the binned time series. The `plot_ts`

function
contains the following elements:

```
> plot_ts(ts1, ts2, bints1, bints2, varnamets1="", varnamets2="",
Rcolts1=1, colts2=1, colbints1=2, colbints2=2, ltyts1=1,
ltyts2=1, ltybints1=2, ltybints2=2, device="screen", ofilename),
```

where the input arguments `ts1`

and `ts2`

are the unevenly spaced time
series, `bints1`

and `bints2`

are the binned time series, `varnamets1`

and `varnamets2`

are the names of the variables under study, `colts1`

,
`colts2`

(by default both curves are in black) and `colbints1`

,
`colbints2`

(by default both curves are in red) are the colours for the
“primary” and binned times series; `ltyts1`

, `ltyts2`

, `ltybints1`

and
`ltybints2`

are the types of line to be plotted for the “primary” and
binned times series, respectively (1 = solid, 2 = dashed, 3 = dotted, 4
= dot-dashed, 5 = long-dashed, 6 = double-dashed); `device`

is the type
of output device (“screen” by default, the other options being “jpg,”
“png,” and “pdf”); `resfig`

is the image resolution in “ppi” (by default
`R`

does not record a resolution in the image file, except for BMP; 150
ppi could be a suitable value); `ofilename`

is the output filename; and
finally, `Hfig`

, `WFig`

and `Hpdf`

, `Wpdf`

are the height and width of
the output for the JPG/PNG and PDF formats, respectively.

The third function, `cor_ts`

, calculates three types of correlation
coefficient: Pearson’s correlation, Spearman’s and Kendall’s rank
correlations. These correlation coefficients are estimated through the
native `R`

function `cor.test`

from the `R`

package `Stats`

. The
`cor_ts`

function has an option to remove the linear trend of the time
series under analysis – other pre-processing methods could be used
before the `cor_ts`

function is applied. This function has the following
syntax:

```
> cor_ts(bints1, bints2, varnamets1="", varnamets2="",
Rrmltrd="N", device="screen", Hfig, Wfig, Hpdf, Wpdf,
KoCM, resfig, ofilename)
```

where `KoCM`

indicates the correlation estimator: `pearson`

for Pearson
(the option by default), `spearman`

for Spearman and `kendall`

for
Kendall; `rmltrd`

is the option to remove the linear trend in the time
series under study (by default the linear trend is not removed, but the
function can be enabled via the option “Y” or “y”). The other parameters
are described some lines above. `cor_ts`

has as its output a list object
containing the main information for the estimated correlation
coefficient (e.g. a 95% confidence interval for Pearson and a `p-value`

for Spearman and Kendall). The `cor_ts`

function also provides a
scatterplot for the binned time series, which can be plotted on the
screen (by default) or saved in JPG, PNG or PDF formats (the parameter
`ofilename`

is available to assign a name to this output).

Finally, the fourth function, `ccf_ts`

, estimates and plots the
cross-correlation between two evenly spaced paleoclimate time series. We
use the native `R`

function `ccf`

(`R`

`Stats`

package) to estimate the
cross-correlation in our `ccf_ts`

function. The `ccf_ts`

function has
the following syntax:

```
> ccf_acf <- ccf_ts(bints1, bints2, lagmax=NULL, ylima=-1, ylimb=1,
Rrmltrd="N", RedL=T, device="screen", Hfig, Wfig,
Hpdf, Wpdf, resfig, ofilename)
```

All these elements are already defined above except the parameters
`lagmax=NULL`

, `ylima=-1`

, `ylimb=1`

and `RedL`

. The first parameter
indicates the maximum lag for which the cross-correlation is calculated
(its value depends on the length of the data set), the next two
parameters indicate the extremes of the range in which the CCF will be
plotted and the last parameter (the default option is TRUE) plots a
straight red line to highlight the correlation coefficient at lag 0. The
`ccf_ts`

function generates as its output the `acf`

(auto-correlation
function; ACF) `R`

object, which is a list with the following
parameters: `lag`

is a three dimensional array containing the lags at
which the ACF is estimated; `acf`

is an array with the same dimensions
as `lag`

containing the estimated ACF; `type`

is the type of correlation
(`correlation`

(the default), `covariance`

and `partial`

); `n.used`

is
the number of observations in the time series; and `snames`

provides the
names of the time series (`bints1`

and `bints2`

).

We first examine two evenly-spaced annually-resolved instrumental
climate records that cover the time interval from 1850 to 2006
(\(N = 157\) points). To test our *BINCOR* package we created irregular
time series by randomly removing 20% of the data from the evenly spaced
time series. We note that the new “sampling” times are not necessarily
the same for both irregular series. The new irregular time series
(“primary” hereafter) consist of 125 data points and have an average
temporal spacing \(\bar{d}\) of 1.24 years. Specifically the two time
series used were a record of Northern Hemisphere (NH) sea surface
temperature (SST) anomalies (HadCRUT3, (Brohan et al. 2006)) and a
record of equatorial Pacific SST anomalies from the El Niño 3 region
(2.5\(^\circ\)S to 2.5\(^\circ\)N, 92.5 to 147.5\(^\circ\)W)
(Mann et al. 2009), which is a indicator of El Niño-Southern Oscillation
(ENSO). Both time series, especially the NH-SST data, show strong
autocorrelation (plots not shown) and long-term trends (inspected by
Mann-Kendall test; ENSO, `z`

=6.52 and `p-value`

\(<\) 0.001 and NH-SST,
`z`

= 10.214 and `p-value`

\(<\) 0.001). To generate the sample data, we
fit a linear model to each evenly spaced time series and, after removing
the model fitted to the evenly spaced data, we use the residuals (i.e.
the difference between the observed data and the model fitted) to build
the irregular time series and then create the binned time series.

A) | C) |