Kernel semi-parametric models and their equivalence with linear mixed models provide analysts with the flexibility of machine learning methods and a foundation for inference and hypothesis tests. These models are insensitive to the number of predictor variables, since the kernel trick transforms them into a kernel matrix whose size depends only on the number of subjects. Hence, methods based on this model are appealing and numerous; however, only a few R programs are available, and none includes a complete set of features. Here, we present the KSPM package to fit the kernel semi-parametric model and its extensions in a unified framework. KSPM allows multiple kernels and unlimited interactions in the same model. It also includes predictions, statistical tests, a variable selection procedure, and graphical tools for diagnostics and interpretation of variable effects. Currently, KSPM is implemented for continuous dependent variables but could be extended to binary or survival outcomes.
In recent decades, the popularity and accessibility of machine learning have increased as a result of both the availability of big data and technical progress in computer science. The flexibility of machine learning methods enables avoidance of assumptions about functional relationships, such as strong linear or additive hypotheses, that are often involved in classical statistical models. However, this flexibility and adaptability also limit the capacity to interpret results or make inference. When understanding and inference are required, simpler statistical models are often preferred, as they are easy to understand and implement. Methods from the machine learning field might better model complex relationships, and yet would be more attractive if the results could be made interpretable. With this goal in mind, (Liu et al. 2007) clearly demonstrated the under-appreciated equivalence between a machine learning tool and a classical statistical model through the link between kernel semi-parametric models – developed first in the machine learning field – and more classical linear mixed models. This equivalence allows analysts to take advantage of knowledge advances in both the machine learning domain and the traditional statistical inference domain, including hypothesis testing, when using kernel semi-parametric models (Table 1).
Table 1: Features of kernel models, inherited from traditional statistical models and from machine learning models.
The kernel semi-parametric model assumes that the outcome is related to the set of variables through an unknown smooth function, which is simply approximated by computing the similarity matrix between subjects according to the set of variables and a chosen kernel function (i.e., the kernel trick). Matrix size depends only on the number of subjects, making kernel models particularly suited to datasets where the number of features is very large (\(p >>n\)). The equivalence between kernel semi-parametric models and linear mixed models motivates a score test, which is simple to implement, for the simultaneous effect of all variables on the outcome.
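To make the "size depends only on the number of subjects" point concrete, the following sketch (illustrative only; the data and bandwidth are invented and this is not KSPM code) computes a gaussian Gram matrix for a wide predictor matrix:

# Illustrative only: the Gram matrix stays n x n no matter how many
# predictors enter the kernel. Data and bandwidth are made up.
set.seed(1)
n <- 50; p <- 5000
Z <- matrix(rnorm(n * p), nrow = n)
rho <- p                                # arbitrary bandwidth for this example
K <- exp(-as.matrix(dist(Z))^2 / rho)   # gaussian kernel: exp(-||Zi - Zj||^2 / rho)
dim(K)                                  # 50 x 50, independent of p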
Methods based on the kernel semi-parametric model and its extensions are appealing and numerous, but only a few programs are available, and none includes a complete set of the features that correspond to recent developments. Table 2 gives some examples of existing R packages according to their features of interest.
Package | Adjustment | User’s own kernel | Single kernel test | Test of interaction | Predictions | Interpretation plot | Diagnostic plots | Variable selection
---|---|---|---|---|---|---|---|---
coxme | \(\checkmark\) | \(\checkmark\) | | | | | |
SKAT | \(\checkmark\) | | \(\checkmark\) | | | | |
SPA3G | | | | \(\checkmark\) | | | |
KRLS | | | | | \(\checkmark\) | \(\checkmark\) | |
e1071 | | | | | \(\checkmark\) | | |
KSPM | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\)
Several packages in Table 2 were developed in the
genetics field where interest often lies in testing the contribution of
a group of variables (variants) simultaneously, notably through the
sequence kernel association test (SKAT) for single nucleotide
polymorphisms (Wu et al. 2011; Chen et al. 2016). Extensions from single to
multiple kernel models were motivated by (i) interaction
tests (Li and Cui 2012a; Ge et al. 2015; Marceau et al. 2015; Wang et al. 2017) and
(ii) estimating the conditional effect of one set of variables after
adjusting for another set of variables, or after adjusting for
population structure (Oualkacha et al. 2013). The latter goal may be
achieved easily using the lmekin()
function from the coxme package
under mixed model theory, where the kinship matrix corresponds to the kernel
matrix measuring similarity between subjects, and defines the covariance
matrix of the random effect term.
In the machine learning field, packages like e1071 have been developed to produce accurate predictions, and they focus on a single kernel function. Traditionally, model interpretation has remained an outstanding challenge in the machine learning field. However, it has recently been demonstrated that kernel semi-parametric models can be used to interpret the effects of variables on the outcome through a graphical tool based on derivatives (Hainmueller and Hazlett 2013), and this is implemented in the KRLS package. When the kernel function and the corresponding approximated smooth function are differentiable with respect to the variable of interest, pointwise derivatives capture the effect of this variable on the outcome.
Since researchers may be interested in all of these features, we have consolidated them in the R package KSPM. The package and a vignette including detailed examples are available from the comprehensive R archive network (CRAN) at https://CRAN.R-project.org/package=KSPM. Our package, currently designed for continuous outcomes and normal errors, fits kernel semi-parametric models and their extensions in a unified framework incorporating all the previously described model fitting features and tests. It allows multiple kernels and unlimited interactions in the same model. Although most popular kernel functions are available in the package, the user also has the option of designing and using his/her own kernel functions. Furthermore, whenever interest lies in prediction or making inference from the model, diagnostic assessments of the model may be performed through graphical tools to detect data points with large residuals or high leverage that may greatly influence the results. Finally, we have also included a variable selection procedure based on Akaike’s and Bayesian information criteria (AIC and BIC). These last two options are not included in other software packages.
The KSPM package is a new tool for semi-parametric regression. It does not compete with other R packages for semi- or non-parametric regression models, since either the methods or the objectives differ. Indeed, the estimation method in KSPM is based on regularized least squares in the reproducing kernel Hilbert space and should not be confused with local kernel smoothing based on the Nadaraya-Watson estimator (np package, (Racine and Hayfield 2020)). Similarly, KSPM also differs from other smoothing methods connecting regression segments through knots, among which we may cite splines (mgcv package, (Wood 2020)).
However, since the kernel semi-parametric model is equivalent to a linear mixed effect model, our package could be compared to the lme4 (Bates et al. 2019) or nlme (Pinheiro et al. 2019) packages for linear mixed effect models. Although it is possible to obtain similar inference for any variables involved in the linear part of each model, and to obtain similar predictions from both overall models, the loss functions being optimized differ. Moreover, KSPM has the advantage of being easier to use, as the user needs to compute neither the kernel matrix nor the matrix of interactions, and it provides interpretations of variable effects that cannot be obtained with traditional packages.
Let \(Y = (Y_1, ..., Y_n)^\top\) be a \(n \times 1\) vector of continuous
outcomes where \(Y_i\) is the value for subject \(i\) (\(i = 1, ..., n)\), and
\(X\) and \(Z\) are \(n \times p\) and \(n \times q\) matrices of predictors,
such that \(Y\) depends on \(X\) and
\(Z = (\mathbf{z_{.1}} , ..., \mathbf{z_{.q}})\) as follows:
\[\label{eq:kspm}
Y = X\beta + h(Z) + e \tag{1}\]
where \(\beta\) is a \(p \times 1\) vector of regression coefficients for
the linear parametric part, \(h(.)\) is an unknown smooth function and \(e\)
is an \(n \times 1\) vector of independent and homoscedastic errors such that \(e_i \sim \mathcal{N}(0, \sigma^2)\). Throughout this article,
\(X\beta\) will be referred to as the linear part of the model and we
assume that \(X\) contains an intercept. \(h(Z)\) will be referred to as the
kernel part. The function \(h(.)\) need not be explicitly specified but it
is assumed to lie in \(\mathcal{H}_k\), the function space generated by a
kernel function \(k(.,.)\) which is, by definition, symmetric and positive
definite. We denote by \(K\) the \(n \times n\) Gram matrix of \(k(.,.)\), such that \(K_{ij} = k(\mathbf{z_{i.}}, \mathbf{z_{j.}})\) represents the similarity between subjects \(i\) and \(j\) according to \(Z\). Of note, in
the estimation process, \(h(.)\) will be expressed as a linear combination
of \(k(.,.)\) (see Equation ((7))). See "Fitting the
kernel semi-parametric model with kspm
" for straightforward coding.
Now suppose there are \(L\) matrices \(Z_1, ..., Z_L\) of dimension \(n \times q_1\), ..., \(n \times q_L\) respectively. Keeping a similar notation, \(Y\) can be assumed to depend on \(X\) and \(Z_1\), ..., \(Z_L\) as follows: \[\label{eq:mkspm} Y = X\beta + \sum\limits_{\ell = 1}^L h_{\ell}(Z_{\ell}) + e \tag{2}\] where \(\forall \ell \in \left\lbrace 1, ..., L \right\rbrace\), \(h_{\ell}\) is an unknown smooth function from \(\mathcal{H}_{k_{\ell}}\), the function space generated by a kernel function \(k_{\ell}(.,.)\) whose Gram matrix is noted \(K_{\ell}\). We assume \(\forall \ell \neq m\), \(h_{\ell}(Z_{\ell}) \neq h_{m}(Z_{m})\) either because \(Z_{\ell} \neq Z_{m}\) or \(k_{\ell}(.,.) \neq k_{m}(.,.)\) or both. Note that when \(L = 1\), the model corresponds to Equation ((1)).
Suppose there is an interaction between two sets of variables represented by \(Z_1\) and \(Z_2\), \(n \times q_1\) and \(n \times q_2\) matrices, and \(Y\) depends on \(X\), \(Z_1\) and \(Z_2\) as follows: \[\label{eq:ikspm} Y = X\beta + h_1(Z_1) + h_2(Z_2) + h_{12}(Z_1, Z_2) + e \tag{3}\] where \(h_1(.)\), \(h_2(.)\) and \(h_{12}(.,.)\) are unknown smooth functions from \(\mathcal{H}_{k_1}\), \(\mathcal{H}_{k_2}\) and \(\mathcal{H}_{k_{12}}\) respectively. The \(h_{12}(.,.)\) function represents the interaction term if its associated kernel function \(k_{12}(.,.)\) is defined as the product of kernel functions \(k_1(.,.)\) and \(k_2(.,.)\) as follows: \[\label{eq:inter} k_{12}\left((Z_1, Z_2)_i, (Z_1, Z_2)_j\right) = k_1(\mathbf{z_{1i.}}, \mathbf{z_{1j.}}) \times k_2(\mathbf{z_{2i.}}, \mathbf{z_{2j.}}) \tag{4}\] such that \(K_{12} = K_1 \odot K_2\) where \(\odot\) is the Hadamard product (Ge et al. 2015). Model ((3)) is a particular case of the multiple kernel semi-parametric model ((2)) with \(L=3\), \(Z_3 = (Z_1, Z_2)\), \(q_3 = q_1 + q_2\) and \(h_{3} = h_{12}\). Of note, in KSPM, different kernel choices can be made for \(k_1(.,.)\) and \(k_2(.,.)\). Obviously, this 2-way interaction model could be generalized to higher order interaction terms in a similar manner.
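As a small illustration (not KSPM internals; Z1 and Z2 are placeholder predictor matrices), the interaction Gram matrix can be obtained element-wise in R:

# Illustrative sketch: Hadamard product of two main-effect Gram matrices,
# here built with linear kernels from placeholder matrices Z1 and Z2.
K1  <- Z1 %*% t(Z1)   # n x n Gram matrix for Z1
K2  <- Z2 %*% t(Z2)   # n x n Gram matrix for Z2
K12 <- K1 * K2        # "*" on matrices is element-wise in R, i.e. the Hadamard product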
As shown by (Liu et al. 2007), model ((2)) is equivalent, without additional conditions, to the following linear mixed model: \[\label{eq:lmm} Y = X\beta + \sum\limits_{\ell = 1}^L h_{\ell} + e \tag{5}\] where \(\forall \ell \in \left\lbrace 1,...,L\right\rbrace\), \(h_{\ell} \sim \mathcal{N}(0, \tau_{\ell} K_{\ell})\) with \(K_{\ell}\) the matrix of similarity between subjects as defined in model ((2)). Throughout the paper, we denote the variance parameters as \(\theta = (\tau_{1}, ..., \tau_L, \sigma^2)\) and \(\Sigma_{\theta} = \sum\limits_{\ell = 1}^L \tau_{\ell}K_{\ell} + \sigma^2 I\), the variance-covariance matrix of \(Y\), where \(I\) is the \(n \times n\) identity matrix.
The parameter estimates can be obtained either by maximizing the penalized log-likelihood associated with the kernel semi-parametric model ((2)), or the log-likelihood derived from the corresponding linear mixed model ((5)) (Liu et al. 2007). Even though the latter option may be computationally less time-consuming, we implemented the KSPM package using a penalized log-likelihood associated with the kernel semi-parametric model, because this method leads to an estimation of \(h\) useful for prediction, and also leads to suitable approaches for interpretation through kernel derivatives. Estimation of parameters is obtained by maximizing the penalized log-likelihood function: \[\label{eq:pll} l(\beta, h) = -\frac{1}{2}\sum\limits_{i = 1}^n \left(Y_i - X_i^\top\beta - \sum\limits_{\ell = 1}^L h_{\ell}(Z_{\ell i}) \right)^2 - \frac{1}{2}\sum\limits_{\ell = 1}^L \lambda_{\ell} \parallel h_{\ell}\parallel_{\mathcal{H}_{k_{\ell}}}^2 \tag{6}\] where \(\lambda_1\), ..., \(\lambda_L\) are tuning parameters associated with each smooth function \(h_1(.)\), ..., \(h_L(.)\) and \(\parallel .\parallel^2_{\mathcal{H}_{k_{\ell}}}\) defines the inner product on \(\mathcal{H}_{k_{\ell}}\) space. According to the Representer theorem (Kimeldorf and Wahba 1971), the functions \(h_1\), ..., \(h_L\) satisfying Equation ((6)) can be expressed as: \[\label{eq:representer} \forall \ell \in \left\lbrace 1,...,L \right\rbrace, \quad h_{\ell}(.) = \sum\limits_{i=1}^n \alpha_{\ell i} k(., Z_{\ell i}) \tag{7}\] where \(\forall \ell \in \left\lbrace 1,...,L \right\rbrace\), \(\alpha_{\ell} = (\alpha_{\ell 1}, ..., \alpha_{\ell n})^\top\) is a \(n \times 1\) vector of unknown parameters. Estimating the kernel semi-parametric model parameters consists in estimating \(\alpha_1, ..., \alpha_L\) and \(\beta\). Then, estimators of \(h_1(.)\), ..., \(h_L(.)\) are deduced from \(\hat{\alpha}_1, ..., \hat{\alpha}_L\).
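For intuition, the single-kernel case (\(L = 1\)) can be solved in closed form from the stationarity conditions of Equation ((6)); the sketch below is illustrative only (placeholder objects y, X, K and lambda, not KSPM's internal code):

# Illustrative single-kernel solver (L = 1): not the KSPM implementation.
fit_single_kernel <- function(y, X, K, lambda) {
  n <- length(y)
  Ginv <- solve(K + lambda * diag(n))          # (K + lambda I)^(-1)
  A    <- diag(n) - K %*% Ginv                 # I - K (K + lambda I)^(-1)
  beta  <- solve(t(X) %*% A %*% X, t(X) %*% A %*% y)
  alpha <- Ginv %*% (y - X %*% beta)           # representer coefficients
  list(beta = beta, alpha = alpha,
       fitted = as.vector(X %*% beta + K %*% alpha))  # h(Z_i) is estimated by (K alpha)_i
}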
In KSPM, we estimate penalization parameters by minimizing the mean sum of squares of the leave-one-out errors (LOOE). An advantage of LOOE compared to other cross-validation methods is that we do not need to recompute new model(s), because its value may be derived directly from the primary model parameters: \[\label{eq:looe} \forall i \in \left\lbrace 1, ..., n\right\rbrace, \qquad \text{LOOE}_i(\lambda_1, ...,\lambda_L) = \frac{Y_i - \hat{Y}_i}{1-H_{ii}} \tag{8}\] where \(H_{ii}\) is the \(i^{th}\) diagonal element of the Hat matrix \(H\) such as \(\hat{Y} = H Y\).
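As an illustration of Equation ((8)) (again a hedged sketch with placeholder names, using a kernel-only smoother rather than KSPM's full hat matrix):

# Sketch: leave-one-out errors of a linear smoother y_hat = H y, obtained
# without refitting. H is illustrated here for a kernel-only model,
# H = K (K + lambda I)^(-1); KSPM uses the hat matrix of the full model.
loo_errors <- function(y, K, lambda) {
  n <- length(y)
  H <- K %*% solve(K + lambda * diag(n))
  y_hat <- as.vector(H %*% y)
  (y - y_hat) / (1 - diag(H))
}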
Penalization parameters and tuning parameters, if applicable, are
estimated simultaneously during the optimization algorithm, by
minimizing the LOOE. If only one parameter needs to be estimated, the
convexity of the function to be minimized makes the problem easier and
the convergence faster. In that case, KSPM uses the standard
optimize()
function from base R, based on the golden
section search algorithm (Brent 2013). When several
hyperparameters need to be estimated, the resulting function to be
minimized may not be convex. Hence, more complex optimization algorithms
should be envisaged, and KSPM uses the DEoptim()
function from the
DEoptim
package (Ardia et al. 2020) based on the differential evolution algorithm
(Mullen et al. 2009). Given the random nature of the algorithm, it would
be safe to apply the algorithm several times to ensure convergence
toward the global minimum.
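For a single penalization parameter, the selection criterion can be sketched as follows (illustrative only; y and K are placeholders and loo_errors() is the hypothetical helper from the sketch above, optimized here on the log scale):

# Sketch: choose lambda by minimizing the mean squared LOOE with optimize().
mean_sq_looe <- function(log_lambda, y, K) {
  mean(loo_errors(y, K, exp(log_lambda))^2)
}
opt <- optimize(mean_sq_looe, interval = c(-10, 10), y = y, K = K)
lambda_hat <- exp(opt$minimum)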
For either a single kernel or a multiple kernel semi-parametric model, a standard test of interest is \(H_0: h_{\ell}(.) = 0\), i.e., there is no effect, singly or jointly, of a set of variables on the outcome. Following (Liu et al. 2007), this test is equivalent to \(H_0: \tau_{\ell} = 0\), a test of the variance component in the linear mixed model ((5)) using the REML-based score test. The corresponding test statistic \(Q_{\ell}\) follows a mixture of independent chi-square distributions with one degree of freedom (Zhang and Lin 2003). The KSPM package uses the exact distribution of \(Q_{\ell}\) in the single kernel model, but in multiple kernel models we use Davies’ method to approximate this distribution (Davies 1980). Based on similar methodology, KSPM also provides the global test for multiple kernel semi-parametric models, \(H_0: h_1(.) = ... = h_L(.) = 0\), i.e., \(H_0: \tau_1 = ... = \tau_L = 0\).
A kernel represents similarity between subjects through combining a set of variables in a possibly complex way, that may include their possible interactions. Hence the general effect of a kernel on an outcome is hard to see and difficult to interpret. Nevertheless, the marginal effect of each variable in the kernel may be illustrated using a partial derivative function (Hainmueller and Hazlett 2013). Indeed, the derivative measures how the change in a variable of interest impacts the outcome in an intuitively understandable way. When a variable’s effect is linear, the interpretation is straightforward since the derivative corresponds to standard slope (\(\beta\)) coefficients. In kernel semi-parametric models, we are simultaneously modeling a set of variables. Therefore, when exploring the effect of any one variable of interest, the other variables in the kernel must be taken into account. Thus, plotting pointwise derivatives of the prediction for each subject against the value of the variable of interest may help in interpreting the effect of this variable on the outcome. Although a variable-level summary statistic may be obtained by averaging the pointwise derivatives across subjects (Hainmueller and Hazlett 2013), we did not implement this option in KSPM because the average can mask relevant variability. For example, when positive and negative derivatives occur for the same variable, the average may be zero.
In any kernel semi-parametric model, the smooth unknown function \(h_{\ell}(.)\) is approximated using basis functions from \(\mathcal{H}_{k_{\ell}}\). Since the inner product of the basis functions corresponds to the kernel function \(k_{\ell}(.,.)\), the choice of the kernel determines the function space used to approximate \(h_{\ell}(.)\). The KSPM package includes the most popular kernel functions, described below. The linear kernel function \(k(Z_i, Z_j) = Z_i^\top Z_j\) assumes that variables have a linear effect on outcome. It generates a linear function space so that the kernel semi-parametric model leads to a penalized multiple linear model using an \(L^2\) norm (equivalent to a ridge regression). The polynomial kernel function \(k(Z_i, Z_j) = (\rho \, Z_i^\top Z_j + \gamma)^d\) assumes that \(d^{th}\)-order products of the variables have a linear effect on the outcome, and is equivalent to a \(d^{th}\)-order interaction model. The Gaussian kernel function \(k(Z_i, Z_j) = exp(-\parallel Z_i - Z_j\parallel^2 / \rho)\) generates the infinite function space of radial basis functions. The sigmoid kernel function \(k(Z_i, Z_j) = tanh(\rho \, Z_i^\top Z_j + \gamma)\) and the inverse quadratic kernel \(k(Z_i, Z_j) = (\parallel Z_i - Z_j\parallel^2 + \gamma)^{-1/2}\) are also often cited in the literature. Finally, we propose also the equality kernel \(k(Z_i, Z_j) = 1 \; \text{if} \; Z_i = Z_j \; \text{and} \; 0 \; \text{otherwise}\). Of note, users can define their own kernel function in KSPM by providing the corresponding kernel matrix. Some kernel functions, such as the linear or polynomial kernels, make assumptions about the shape of the effect of the variables on the continuous outcome, whereas other kernels like the gaussian may, in theory, handle all types of effects, regardless of their complexity. Indeed, in contrast to the linear and polynomial kernels, the gaussian kernel function captures a kernel space of infinite size leading to higher flexibility for approximation of \(h(.)\). If true effects are linear, the linear kernel and gaussian kernel should converge toward similar results. However, in practice, if sample size is low or noise is large, the gaussian kernel will tend to retain a sinusoidal shape to the fit even when the truth is linear.
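For reference, the kernel functions listed above can be written as plain R functions operating on two observation vectors zi and zj (illustrative definitions only, not the KSPM internals; the default parameter values are arbitrary):

k_linear     <- function(zi, zj) sum(zi * zj)
k_polynomial <- function(zi, zj, rho = 1, gamma = 1, d = 2) (rho * sum(zi * zj) + gamma)^d
k_gaussian   <- function(zi, zj, rho = 1) exp(-sum((zi - zj)^2) / rho)
k_sigmoid    <- function(zi, zj, rho = 1, gamma = 0) tanh(rho * sum(zi * zj) + gamma)
k_inv_quad   <- function(zi, zj, gamma = 1) 1 / sqrt(sum((zi - zj)^2) + gamma)
k_equality   <- function(zi, zj) as.numeric(all(zi == zj))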
Kernel functions often involve tuning parameters; above, the parameters \(\rho\) and \(\gamma\) were used to indicate these kernel-specific parameters. In practice, these tuning parameters are usually chosen by the user. However, if the user does not provide a value for these parameters, the KSPM package estimates them at the same time as the penalization parameter(s), by minimizing the mean sum of squares of the LOOE. The choice of tuning parameters may strongly impact the results and modify the smoothness of the \(\hat{h}(.)\) function, leading to overfitting or underfitting. For example, with a \(\rho\) value that is too large, the Gaussian kernel tends to lose its non-linear properties, whereas with \(\rho\) too small, it is very sensitive to noise.
In general, the choice of tuning parameters may strongly impact the results. With that in mind, sensitivity analyses may include a comparison of results obtained with different values for these parameters. Also, comparing models based on information criteria such as AIC and BIC may help to choose the kernel function and its tuning parameter(s).
The KSPM package provides an R interface to fit kernel
semi-parametric models and is available from the comprehensive R archive
network (CRAN) at https://CRAN.R-project.org/package=KSPM. The package
is called through the main function kspm()
taking data and model
hyperparameters as inputs, fitting the model, and returning a model fit
object of class "kspm"
.
kspm
The main function of the package, kspm()
, can fit the single or
multiple kernel semi-parametric models described earlier, as
detailed below:
kspm(response, linear, kernel, data, ...)
The argument response
indicates a continuous outcome variable. It can
be specified as a string corresponding to the column name associated
with the response in the dataset provided in the data
argument, as a
vector of length \(n\), or as a \(n \times 1\) matrix containing the
continuous values of the outcomes. Of note, kspm
does not deal with
multivariate outcomes, and if an \(n \times r\) (\(r > 1\)) matrix is provided,
only the first column is used. Argument linear
specifies the linear
part of the model and could be either a formula, a vector of length \(n\)
if only one variable is included in the linear part, or an \(n \times p\)
design matrix containing the \(p\) variables included in the linear part.
By default, an intercept is added. To remove the intercept term, the
user should use the formula specification and add the term -1
, as
usual. kernel
specifies the kernel part of the model. Its argument
should be a formula of Kernel
object(s), described below. For a
multiple kernel semi-parametric model, Kernel
objects are separated by
the usual signs "+"
, "*"
and ":"
to specify addition and
interaction between kernels.
The Kernel
object regroups all information about a kernel including
the choice of kernel function and its parameters. It is specified using
the Kernel
function as follows:
Kernel(x, kernel.function, scale, rho, gamma, d)
Argument x
represents either the variables included in the kernel part
of the model, or a kernel Gram matrix. In the latter case, the user
should specify kernel.function = "gram.matrix"
and all other arguments
are not used. When x
represents the variables included in the kernel
part, it may be specified as a formula, a vector, or a matrix, and
kernel.function
indicates which kernel function should be used (e.g.,
"gaussian"
, "polynomial"
, ...). scale
is a boolean indicating if
variables should be scaled before computing the gram matrix. The need
for other arguments depends on the choice of kernel function: rho
and
gamma
are tuning parameters and d
is the highest order in a
polynomial kernel function; these parameters correspond to the \(\rho\),
\(\gamma\) and \(d\) introduced earlier. It is worth noting that in a
multiple kernel model, KSPM allows kernel objects to follow different
formats.
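As a hypothetical illustration of mixing formats (all object and variable names below are invented), a model could combine a formula-based gaussian kernel with a user-supplied Gram matrix:

# Hypothetical call, for illustration only.
fit <- kspm(response = "y", linear = ~ age + sex,
            kernel = ~ Kernel(~ x1 + x2 + x3, kernel.function = "gaussian") +
                       Kernel(myGram, kernel.function = "gram.matrix"),
            data = mydata)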
Different options were considered for the interface with respect to
specification of the linear and kernel parts of the model. We decided to
use an interface with separate formulae for the two parts. This
structure makes it straightforward to manage variables coming from
different sources or data structures within the package. For example,
genetic data or high dimensional genomic data are often provided in a
matrix format, whereas other variables and phenotypes are saved in
vectors or data frames with meaningful variable names. This diversity of
data source and format cannot be handled by a unique formula. If data
elements are assembled from different sources, they should include the
same individuals or observations (\(i=1,..,n\)), with identical ordering;
if not, kspm
will return an error. It is worth noting that the KSPM
package does tolerate observations containing missing values, although
these observations will be removed prior to model fitting.
The function kspm
returns an object of class "kspm"
with summary()
and plot()
commands available, the first giving estimates and
\(p\)-values and the second displaying diagnostic plots.
"kspm"
An object of class "kspm"
results from a kernel semi-parametric model
fit. It contains, of course, the estimated coefficients, fitted values, and residuals, but also information about the kernels, such as the \(n \times n\) kernel matrices, hyperparameters, and penalization parameters.
The "kspm"
object can be summarized or viewed using commands and
methods very similar to those implemented in "lm"
or "glm"
.
> methods(class = "kspm")
 [1] case.names     coef           confint        cooks.distance deviance
 [6] extractAIC     fitted         logLik         nobs           plot
[11] predict        print          residuals      rstandard      sigma
[16] summary        variable.names
see '?methods' for accessing help and source code
A predict()
command has been implemented for the class "kspm"
.
predict.kspm(object, newdata.linear, newdata.kernel, interval, level)
where object
refers to a "kspm"
object. If a new dataset is not
specified, predict.kspm
will return the fitted values from the
original data. If predict.kspm
is applied to a new dataset, all
variables used in the original model should be provided in the
newdata.linear
and newdata.kernel
arguments. newdata.linear
should
be a data frame, a vector or design matrix of variables used in the
linear part. newdata.kernel
is a list containing data frames, vectors
and/or design matrices for each kernel. Formats depend on the ones
previously used in model specification as shown in
Table 3. For simplicity, users may follow the
list returned by the info.kspm()
function.
kernel specification | new data specification
---|---
formula of \(q\) variables | a data.frame with column names corresponding to the variables included in the formula; number of rows is \(n^{\star}\), number of columns is \(q\)
vector | a vector of length \(n^{\star}\)
design matrix \(n \times q\) | a matrix \(n^{\star} \times q\)
kernel matrix \(n \times n\) | a matrix \(n^{\star} \times n\) where cell \((i,j)\) represents the similarity between new subject \(i\) and the \(j^{th}\) subject included in the model (a sketch follows this table)
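For the Gram-matrix case in the last row, the new-data matrix can be built from the same kernel function; a minimal sketch with a linear kernel and invented matrices Z_train (\(n \times q\)) and Z_new (\(n^{\star} \times q\)) is:

# Sketch: row i, column j holds k(z*_i, z_j) between new subject i and
# training subject j (linear kernel shown for simplicity).
K_new <- Z_new %*% t(Z_train)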
In predict.kspm
, interval
can be either "none"
, "confidence"
, or
"prediction"
according to whether the user wants a confidence or
prediction interval. The level of confidence/prediction interval is
specified using the level
argument. By default, level = 0.95
is
used.
A variable selection procedure algorithm has been implemented for the
class "kspm"
. It is similar to the step()
command existing for other
regression packages.
stepKSPM(object, linear.lower, linear.upper, kernel.lower, kernel.upper, k, direction)
As before, object
refers to a "kspm"
object. However, in contrast to
the generality allowed for fitting a single kernel semi-parametric
model, here the Kernel
object should not be specified with the Gram
matrix option. The arguments of linear.lower
, linear.upper
,
kernel.lower
and kernel.upper
are formulae corresponding to the
desired boundaries for the smallest and largest numbers of variables to
be included. As is standard in many R packages, all variables in the
linear.lower
and kernel.lower
formulae cannot be removed from the
model and variables that are not in the linear.upper
and
kernel.upper
formulae cannot be added to the model. The procedure to
select variables is based on AIC or BIC depending on the value of k
.
If k
is set to \(2\), AIC is used; if k
is set to \(ln(n)\), BIC is used
instead. The direction
argument may be "forward"
for a forward
selection, "backward"
for a backward selection and "both"
for a
stepwise selection. Our package was implemented so that variable
selection for the linear part and the kernel part of the model may be
done simultaneously. However, it is also possible to perform the
variable selection procedure on each part separately by giving the
appropriate formula to linear.lower
, linear.upper
, kernel.lower
and kernel.upper
.
Of note, as for the standard stepAIC()
function, this procedure can
only be used on complete observations. Thus missing data should be
removed by the user prior to calling stepKSPM()
so that the number of
observations is identical at each step.
The plot()
method has been implemented for "kspm"
and
"derivatives.kspm"
objects. The former gives usual diagnostic plots
including the residual distribution, leverage, and Cook’s distance; the latter gives interpretation plots based on pointwise derivatives.
This first example illustrates the functions provided in KSPM, including the model fit, the diagnostic plots, the interpretation plots based on pointwise derivatives, the test of interaction, as well as the variable selection procedure.
The conventional and social media movies (CSM) dataset is available on the UCI machine learning repository (https://archive.ics.uci.edu/ml/index.php) as well as in KSPM and is fully described in (Ahmed et al. 2015). It contains data on 232 movies from 2014 and 2015; the movies are described in terms of conventional features (sequel, budget in USD, gross income in USD, number of screens in USA) as well as social media features (aggregate actor followers on Twitter, number of views, likes, dislikes, comments of the movie trailer on Youtube, sentiment score), with the aim of predicting ratings. In all subsequent analyses, we used only the 187 entries without missing data.
> data("csm")
> head(csm)
  Year Ratings     Gross  Budget Screens Sequel Sentiment   Views Likes
1 2014     6.3      9130 4.0e+06      45      1         0 3280543  4632
2 2014     7.1 192000000 5.0e+07    3306      2         2  583289  3465
3 2014     6.2  30700000 2.8e+07    2872      1         0  304861   328
4 2014     6.3 106000000 1.1e+08    3470      2         0  452917  2429
5 2014     4.7  17300000 3.5e+06    2310      2         0 3145573 12163
7 2014     6.1  42600000 4.0e+07    3158      1         0 3013011  9595
  Dislikes Comments Aggregate.Followers
1      425      636             1120000
2       61      186            12350000
3       34       47              483000
4      132      590              568000
5      610     1082             1923800
7      419     1020             8153000
In our first model, we assume a gaussian kernel function to fit the
joint effect of the conventional features on ratings. Here we do not
provide any value for the \(\rho\) parameter. It will be estimated at the
same time as the penalization parameter by minimizing the LOOE. We also
do not provide a linear
argument, meaning that only an intercept will
be included in the linear part of the model.
> csm.fit1 <- kspm(response = "Ratings", kernel = ~Kernel(~Gross + Budget +
+ Screens + Sequel, kernel.function = "gaussian"), data = csm)
> summary(csm.fit1)
Call:
kspm(response = "Ratings", kernel = ~Kernel(~Gross + Budget +
    Screens + Sequel, kernel.function = "gaussian"), data = csm)

Sample size:
n = 187

Residuals:
    Min      Q1  Median      Q3     Max
-3.0066 -0.4815  0.0109  0.5534  2.1228

Coefficients (linear part):
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 6.297723   1.058707 5.948505 1.427565e-08

Score test for non-parametric kernel:
         lambda      tau      p-value
Ker1 0.04804093 16.13793 5.625602e-06

Residual standard error: 0.88 on 175.82 effective degrees of freedom
Multiple R-squared: 0.2643, Adjusted R-squared: 0.2217
The summary output gives information about the sample size used in the model, the residual distribution, the coefficient for the linear part (similar to other regression R packages), and the penalization parameter and score test associated with the kernel part. The kernel results indicate that the conventional features are strongly associated with ratings. In such a complex model, the number of free parameters – i.e., the standard definition of the degrees of freedom of a model – is undefined, and we use instead the effective degrees of freedom of the model, which is not necessarily a whole number. However, our definition for effective degrees of freedom is similar to the one used in linear regression models and depends on the trace of the hat matrix. The \(\rho\) parameter may be extracted using the following code:
> csm.fit1$kernel.info$Ker1$rho
 par1
61.22
This value alone may not provide much information about linearity of the kernel function. However interpretation is feasible when comparing gaussian kernel functions with different \(\rho\) parameter values, or when comparing the gaussian and linear kernel functions.
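Relatedly, the effective degrees of freedom reported in the summary can be sketched from the hat matrix under one common convention (this is an assumption for illustration; KSPM's exact formula may include additional correction terms):

# Hedged sketch: residual effective degrees of freedom of a linear smoother
# y_hat = H y taken as n - tr(H); not necessarily KSPM's exact definition.
edf_residual <- function(H) nrow(H) - sum(diag(H))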
The plot()
command may be applied to the model to display diagnostic
plots. Figure 1 has been generated using the
following code:
> par(mfrow = c(2,2), mar = c(5, 5, 5, 2))
> plot(csm.fit1, which = c(1, 3, 5), cex.lab = 1.5, cex = 1.3, id.n = 7)
> hist(csm$Ratings, cex.lab = 1.5, main = "Histogram of Ratings", xlab =
+ "Ratings")
Outlier points are annotated and we can see movie \(134\) (Jurassic World) has high leverage. Also of note, the lower tail of the residuals distribution is not as expected for a Normal distribution. The histogram of ratings shows that the deviation could be due to the left skewed distribution of the response variable.
The derivative function may help to interpret the effect of each variable individually on the outcome. Indeed, the sign of the derivatives captures variational changes of the effect of the variable on the outcome. To illustrate this feature, Figure 2 displays the derivatives for Gross income, Budget and Screens. It has been generated with the following code:
> par(mfrow = c(1,2), mar = c(5,5,5,2))
> plot(derivatives(csm.fit1), subset = "Gross", cex.lab = 1.5, cex = 1.3,
+ main = "Pointwise derivatives according to Gross Income")
> plot(derivatives(csm.fit1), subset = "Screens", col = csm$Sequel,
+ cex.lab = 1.5, cex = 1.3, pch = 16, main = "Pointwise derivatives
+ according to Number of Screens \n and Sequel")
> legend("topleft", fill = palette()[1:7], legend = 1:7, title = "Sequel",
+ horiz = TRUE)
By default, the X-axis label gives the name of the variable and, in
brackets, the kernel in which it is included. When only one kernel is included in the model, it is named Ker1. Because the genre and sequel variables, even though numeric, refer to categories, it is easy to highlight patterns of interaction between these variables and the others. Derivatives according to Gross income
are mostly positive meaning that higher Gross income is associated with
higher ratings. However the slope of this effect decreases as the gross
income grows. It is difficult to interpret the derivatives for Gross
income higher than 2.5e+08 since this category includes only a few
movies. Based on the display in the right panel, whether a movie has
sequels seems to interact with the effect of the number of screens on
the ratings. Indeed when the movie is the first or second release
(Sequel = 1 or 2), the derivatives are always negative meaning that as
the number of screens on which the movie is released increases, the
ratings tend to decrease. However, this relationship seems to be
stronger for the first release than the second. No clear pattern can be
observed for subsequent sequels. It is difficult to know if this reveals
an absence of effect or whether there are simply too few movies past
sequel \(2\) in these data.
To help in the choice of the kernel function, we may compare several
models by using information criteria. As an example, we fit a second
model assuming a polynomial kernel function with fixed tuning parameters
(\(\rho=1\), \(\gamma=1\) and \(d=2\)). This model can be compared to the
previous one using information criteria such as the AIC or BIC. By
default the extractAIC()
command gives the AIC.
> csm.fit2 <- kspm(response = "Ratings", kernel = ~Kernel(~Gross + Budget +
+ Screens + Sequel, kernel.function = "polynomial", rho = 1, gamma = 1,
+ d = 2), data = csm, level = 0)
> extractAIC(csm.fit1)
[1] 941.4521
> extractAIC(csm.fit2)
[1] 944.4618
Here, we concluded that the gaussian kernel function fits our data better than the polynomial kernel function, given the tuning parameters we considered.
Now, we assume a model with two kernel parts, one for conventional features and one for social media features, as well as their interaction. We propose to use the gaussian kernel function for each set of features, although different kernels could be used. The hyperparameters we chose are those obtained for each kernel separately.
> csm.fit3 <- kspm(response = "Ratings", linear = NULL, kernel = ~Kernel(~
+ Gross + Budget + Screens + Sequel, kernel.function = "gaussian",
+ rho = 61.22) * Kernel(~ Sentiment + Views + Likes + Dislikes + Comments +
+ Aggregate.Followers, kernel.function = "gaussian", rho = 1.562652),
+ data = csm)
While the model is running, R returns a summary of the kernel part(s) and interaction(s) included in the model.
------------------------------------------------------------------
The model includes the following kernels:
Ker1
Ker2
Ker1:Ker2
------------------------------------------------------------------
Details:
Ker1: ~Kernel(~Genre + Gross + Budget + Screens + Sequel,
    kernel.function = "gaussian", rho = 55.5897)
Ker2: ~Kernel(~Sentiment + Views + Likes + Dislikes + Comments +
    Aggregate.Followers, kernel.function = "gaussian", rho = 1.562652)
------------------------------------------------------------------
As defined by model ((4)), the summary()
command will
return the \(p\) value of tests \(H_0: h_1(.) = 0\), \(H_0: h_2(.) = 0\) and
\(H_0: h_{12}(.) = 0\). By default all tests are performed. However, if
our interest lies only in the test of interaction, the kernel.test
option may be used to choose the test of interest and reduce the
computation time. If interest lies in the global test
\(H_0: h_1(.) = h_2(.) = h_{12}(.) = 0\), the global.test
option should
be set at TRUE
.
> summary(csm.fit3, kernel.test = "Ker1:Ker2", global.test = TRUE)
Call:
kspm(response = "Ratings", linear = NULL, kernel = ~Kernel(~Gross +
    Budget + Screens + Sequel, kernel.function = "gaussian",
    rho = 61.22) * Kernel(~Sentiment + Views + Likes + Dislikes +
    Comments + Aggregate.Followers, kernel.function = "gaussian",
    rho = 1.562652), data = csm)

Sample size:
n = 187

Residuals:
    Min      Q1  Median      Q3     Max
-1.4185 -0.3396  0.0112  0.3291  1.3597

Coefficients (linear part):
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 4.548524   1.256813 3.619095 0.0004324838

Score test for non-parametric kernel:
          lambda        tau   p-value
Ker1:Ker2    187 0.00208359 0.7376828

Global test: p-value = 6e-04

Residual standard error: 0.62 on 121.17 effective degrees of freedom
Multiple R-squared: 0.7452, Adjusted R-squared: 0.6089
Adding social media features to the model improved the predictions as indicated by the adjusted \(R^2\). However the smooth function associated with the kernel interaction does not significantly differ from the null, leading to the conclusion that there is no interaction effect between conventional and social media features on the ratings.
Suppose now, we want to predict the ratings that will be attributed to
the three artificial movies described in Tables 4
and 5 below, according to the model csm.fit3
.
 | Gross | Budget | Screens | Sequel
---|---|---|---|---
Movie 1 | 5.0e+07 | 1.8e+08 | 3600 | 2 |
Movie 2 | 50000 | 5.2e+05 | 210 | 1 |
Movie 3 | 10000 | 1.3e+03 | 5050 | 1 |
 | Sentiment | Views | Likes | Dislikes | Comments | Aggregate.Followers
---|---|---|---|---|---|---
Movie 1 | 1 | 293021 | 3698 | 768 | 336 | 4530000 |
Movie 2 | 2 | 7206 | 2047 | 49 | 70 | 350000 |
Movie 3 | 10 | 5692061 | 5025 | 305 | 150 | 960000 |
> newdata.Ker1 <- data.frame(Gross = c(5.0e+07, 50000, 10000),
+ Budget = c(1.8e+08, 5.2e+05, 1.3e+03), Screens = c(3600, 210, 5050),
+ Sequel = c(2, 1, 1))
> newdata.Ker2 <- data.frame(Sentiment = c(1, 2, 10), Views = c(293021,
+ 7206, 5692061), Likes = c(3698, 2047, 5025), Dislikes = c(768, 49,
+ 305), Comments = c(336, 70, 150), Aggregate.Followers = c(4530000,
+ 350000, 960000))
> predict(csm.fit3, newdata.kernel = list(Ker1 = newdata.Ker1, Ker2 =
+ newdata.Ker2), interval = "prediction")
       fit      lwr      upr
1 4.682560 3.147755 6.217365
2 6.401853 5.100309 7.703396
3 6.128641 4.395417 7.861864
The output of the predict()
function gives the predicted values
(fit
) and the lower (lwr
) and upper (upr
) bounds of prediction
intervals.
We may obtain the predictions for the original data directly from the
model or from the predict()
function. With the latter, confidence
intervals may be additionally obtained.
> pred <- csm.fit3$fitted.value
> pred <- predict(csm.fit3, interval = "confidence")
> plot(csm$Ratings, pred$fit, xlim = c(2, 10), ylim = c(2, 10),
+ xlab = "Observed ratings", ylab = "Predicted ratings", cex.lab = 1.3)
> abline(a = 0, b = 1, col = "red", lty = 2)
Figure 3 shows that for smaller values, the model overestimates the outcome. This is again probably due to the left skewness of the outcome distribution.
Suppose we fit a single kernel semi-parametric model with a gaussian kernel to adjust the social media features in the CSM data. The kernel part contains the set of social media features. We want to select the relevant variables to be included in the kernel. We therefore can perform a stepwise variable selection procedure based on AIC, while letting \(\rho\) vary at each iteration. To do so, we first fit the full model including all features.
> csm.fit4 <- kspm(response = "Ratings", kernel = ~Kernel(~ Sentiment + Views
+ + Likes + Dislikes + Comments + Aggregate.Followers, kernel.function =
+ "gaussian"), data = csm)
Then, we apply the stepKSPM()
command on the full model as follows.
> stepKSPM(csm.fit4, kernel.lower = ~1, kernel.upper = ~ Sentiment + Views
+ + Likes + Dislikes + Comments + Aggregate.Followers, direction = "both",
+ k = 2, kernel.param = "change", data = csm)
At each iteration, R returns the current model, and the list of variables that may be added or removed.
Start: AIC = 913.2
Linear part: ~ 1
Kernel part: ~ Sentiment + Views + Likes + Dislikes + Comments +
    Aggregate.Followers

                        Part      AIC
- Sentiment           kernel 910.8282
<none>                       913.1769
- Views               kernel 913.9753
- Likes               kernel 917.6532
- Comments            kernel 921.2163
- Aggregate.Followers kernel 925.4491
- Dislikes            kernel 969.4304

Step: AIC = 910.8
Linear part: ~ 1
Kernel part: ~ Views + Likes + Dislikes + Comments + Aggregate.Followers

                        Part      AIC
- Views               kernel 905.2908
<none>                       910.8282
+ Sentiment           kernel 913.1769
- Aggregate.Followers kernel 915.5481
- Likes               kernel 916.8125
- Comments            kernel 921.9627
- Dislikes            kernel 970.3804

Step: AIC = 905.3
Linear part: ~ 1
Kernel part: ~ Likes + Dislikes + Comments + Aggregate.Followers

                        Part      AIC
<none>                       905.2908
+ Views               kernel 910.8282
+ Sentiment           kernel 913.9753
- Aggregate.Followers kernel 916.0224
- Comments            kernel 917.8758
- Likes               kernel 925.0230
- Dislikes            kernel 968.2502
The final model includes the variables Likes
, Dislikes
, Comments
and Aggregate.Followers
.
Our second example illustrates how KSPM can efficiently handle complex data such as time series, and shows how the choice of tuning parameters impacts the results.
The energy
data is a set of data on energy consumption each hour on
Ouessant island (France) from September the 13th, 2015 to October the
4th, 2015, that were made publicly available by Electricité de France
at https://iles-ponant-edf-sei.opendatasoft.com. The data set also
contains corresponding meteorological data such as temperature (degrees Celsius), pressure (Pa) and humidity rate (g/m\(^3\)). These measures are
collected by Meteo France every 3 hours and are publicly available on
www.infoclimat.fr. We obtained hourly values by linear interpolation. In
total the data set contains 504 measurements.
> data("energy")
> head(energy)
     power       date        T        P       HR hour hour.num
1 526.1667 2015-09-13 12.43333 1007.933 81.66667  01h        1
2 495.0000 2015-09-13 12.36667 1007.167 82.33333  02h        2
3 446.1667 2015-09-13 12.30000 1006.400 83.00000  03h        3
4 365.8333 2015-09-13 12.30000 1005.833 82.66667  04h        4
5 341.0000 2015-09-13 12.30000 1005.267 82.33333  05h        5
6 352.3333 2015-09-13 12.30000 1004.700 82.00000  06h        6
These data demonstrate strong periodicity depending on time of day (Figure 4). Of note, if data had been collected for a period longer than one year, a second periodicity would be visible corresponding to seasons.
We will consider the first 408 measurements (17 days) as the training set. The other 4 days will be used as a test set, on which we want to predict the energy consumption.
> energy_train_ <- energy[1:408, ]
> energy_test_ <- energy[409:504, ]
We fit a single kernel semi-parametric model to the training data. We
assume that energy depends linearly on temperature (T
), and therefore
this variable is included in the linear part of the model. The other
meteorologic data, as well as hours in 24-hour numeric format, are
included in the kernel part of the model. We used a gaussian kernel and
we left the \(\rho\) parameter free to be estimated by the model.
> energy.fit1 <- kspm(response = "power", linear = ~T, kernel = ~Kernel(~
+ hour.num + P + HR, kernel.function = "gaussian") , data = energy_train_)
> energy.fit1$kernel.info$Ker1$rho
     par1
0.7028723
We recomputed the model using other values for the \(\rho\) parameter to
explore sensitivity of the results to this key kernel parameter. Values
were chosen tenfold larger and smaller than the estimated value of
\(0.70\) in energy.fit1
.
> energy.fit2 <- kspm(response = "power", linear = ~T, kernel = ~Kernel(~
+ hour.num + P + HR, kernel.function = "gaussian", rho = 7) , data =
+ energy_train_)
> energy.fit3 <- kspm(response = "power", linear = ~T, kernel = ~Kernel(~
+ hour.num + P + HR, kernel.function = "gaussian", rho = 0.07) , data =
+ energy_train_)
Figure 5 displays the predictions obtained on
both the training and test data sets, as well as the derivatives, as a
function of the hours variable, for the three models energy.fit1
,
energy.fit2
and energy.fit3
that differ only on the value of the
tuning parameter \(\rho\).
The first row of Figure 5 corresponds to the model with the value of \(\rho\) estimated by the model. Predictions fit well for both the training and the test sets. The derivative graph shows that between 6h00 and 12h00 and between 18h00 and 23h00 the derivatives are positive. This result is coherent with the increase of energy consumption during these times of the day (Figure 4-Right). Inversely between 0h00 and 6h00 and between 12h00 and 18h00, the energy consumption decreases as showed by negative values of derivatives. One might have expected that derivative values should be close between 0h00 and 23h00 but the peak consumption observed at 23h00 everyday (Figure 4) is probably associated with a sudden change in derivative values that is difficult to smooth accurately.
This example shows how a gaussian kernel function may handle complex functional data. However the choice of the \(\rho\) parameter may influence the results. A higher \(\rho\) leads to a smoother approximation of \(h(.)\) and worsens the predictions. Indeed, a higher \(\rho\) may perfectly fit the training data set, but this corresponds to a case of model overfitting, which results in biased predictions in the test set. In contrast, a lower \(\rho\) may lead to underfitting of training and test data sets. The choice of the \(\rho\) parameter also impacts the derivatives; indeed a higher \(\rho\) induces more noise among the derivatives.
This example shows how the interaction test may be applied to gene-gene interaction, using the standard linear kernel on the SNPs within each gene.
We simulated data on 300 subjects and 9 SNPs, 6 belonging to gene A and 3 belonging to gene B, using the glSim() function from the adegenet R package (Jombart and Ahmed 2011). SNPs are coded 0, 1 or 2 according to the number of alleles of each type carried by the subject. A continuous outcome was simulated as a linear combination of the interaction terms between the 6 SNPs of gene A and the 3 SNPs of gene B, where the weight coefficients were randomly and uniformly chosen in the interval
\(\left[-2;0\right]\). We added an intercept of 100 and a normally
distributed error of mean 0 and standard deviation 15. Code details are
available in Supplement S1.
Let y
be the vector of continuous outcomes, geneA
be the
\(300 \times 6\) matrix of SNPs belonging to gene A and geneB
be the
\(300 \times 3\) matrix of SNPs belonging to gene B.
We test the interaction between the genes using the code below. Of note, in this example, the data are specified using vector and design matrices instead of using formulae and data frame as in previous examples.
> gene.fit <- kspm(response = y, kernel = ~ Kernel(geneA, kernel = "linear")
+ * Kernel(geneB, kernel = "linear"))
> summary(gene.fit, kernel.test = "Ker1:Ker2")
Call:
kspm(response = y, kernel = ~Kernel(geneA, kernel = "linear") *
    Kernel(geneB, kernel = "linear"))

Sample size:
n = 300

Residuals:
     Min       Q1   Median       Q3      Max
-34.1111  -8.7300  -1.1659   9.7299  38.6887

Coefficients (linear part):
            Estimate Std. Error  t value      Pr(>|t|)
(Intercept) 88.80183  0.8178659 108.5775 2.531469e-231

Score test for non-parametric kernel:
            lambda       tau     p-value
Ker1:Ker2 199.4587 0.9862838 0.006646781

Residual standard error: 14.03 on 280.74 effective degrees of freedom
Multiple R-squared: 0.2387, Adjusted R-squared: 0.1892
The result suggests that genes A and B impact the continuous outcome y through their interaction.
This new KSPM package provides a flexible implementation of the kernel
semi-parametric model and its extensions in a unified framework, using
nomenclatures similar to other regression R packages. Thanks to the
kernel trick, such a model is useful when interest lies in prediction
and our predict()
command makes this easy. Nevertheless, inference is
also possible through the confidence intervals and tests provided by the
package, such as through the summary()
command. Moreover, we have
provided many options for model diagnostics (residuals, leverage, ...),
model interpretation (derivatives, ...) and model comparisons (AIC,
stepwise, ...).
Model estimation for multiple kernel models is based on an iterative estimation of regression parameters that we develop and prove in Supplement
S2. Penalization and tuning parameter estimation involves optimize()
and DEoptim()
functions. Both integrate C code, ensuring faster
convergence of the algorithms. However, the overall KSPM algorithm
includes inversion of \(n \times n\) matrices, thereby resulting in slower
optimization as \(n\) increases. It would improve KSPM’s performance to
combine matrix inversion code and optimization code in a single
efficient set of C code and we are considering this for a future
implementation. It is worth noting, however, that computation time is
not severely impacted by the number of predictors even if \(n << q\).
In the KRLS package, (Hainmueller and Hazlett 2013) proposed a marginal test for the individual effect of each variable included in the kernel. We decided not to implement this test in our package since the marginal effect may mask interesting but symmetric effects of a variable on the continuous outcome. However, we have implemented a test of hypothesis based on variance components to test the joint effect of a set of variables on the continuous outcome. The distribution of this test statistic follows a mixture of \(\chi^2\) distributions that we approximated using Davies’ method (Davies 1980), available through the CompQuadForm package (Duchesne and de Micheaux 2010; de Micheaux 2017). The distribution can also be approximated by a scaled chi-squared distribution using Satterthwaite’s method where parameters are computed by moment matching. (Schifano et al. 2012) compared the two approximation methods on the type-I error rate in a single kernel semi-parametric model and showed that the rate is inflated with Satterthwaite’s method when the \(\alpha\)-level is low, and therefore we recommend the Davies method. To assess global testing of multiple kernels, we implemented a sum of the L single-kernel test statistics as an overall test statistic. In general, the computation of the \(p\) value of this overall test involves the joint distribution of (\(Q_1\), ... \(Q_L\)). To derive analytical values for the overall \(p\) value, one can use techniques similar to those used in (Sun et al. 2019), which relied on a copula-based model to approximate this joint distribution. This last option is not yet implemented in KSPM.
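For reference, a p value for such a mixture of chi-squared variables can be approximated with CompQuadForm; in the sketch below, Q and eigenvalues are placeholders for the observed statistic and the mixture weights.

library(CompQuadForm)
# P(mixture > Q) by Davies' method; 'eigenvalues' are the chi-squared(1) weights.
p_value <- davies(q = Q, lambda = eigenvalues)$Qq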
For all our tests, the model has to be re-estimated under the null hypothesis for the kernel of interest. If the resulting null model still contains one or more kernel part(s), we made the choice to recompute penalization parameter(s), since this choice ensures an optimized model under the null hypothesis. However, we have also decided to leave the kernel tuning parameters fixed when re-estimating under the null. Indeed, a change in tuning parameters induces a change in the choice of kernel functions - and thus in model assumptions - and hence, keeping the tuning parameters fixed ensures comparability of model assumptions under the null and the alternative hypotheses.
In our KSPM package, we have also included an algorithm for selection of variables based on these information criteria. Backward, forward or stepwise approaches can be chosen for single kernel semi-parametric models. Although these concepts may be easily extended to the case of multiple kernel models, such analyses require large computing times. Parallelization of such a process may greatly increase the appeal of this procedure. For now, we recommend investigating variables within a single kernel before starting to fit multiple kernel models.
In summary, the KSPM package is a flexible comprehensible option for studying the effects of groups of variables on a continuous outcome. These tools may be extended to the case of binary or survival outcomes in future work.
CS was supported by the Institute for Data Valorization (IVADO) fellowship. CG was supported by the Ludmer Centre for Neuroinformatics and Mental Health, the Canadian Institutes for Health Research, as well as the CIHR PJT 148620 grant. SJ is a recipient of a Canada Research Chair in neurodevelopmental disorders, and a chair from the Jeanne et Jean Louis Levesque Foundation. This research was enabled in part by support provided by Calcul Quebec (http://www.calculquebec.ca) and Compute Canada (www.computecanada.ca). This work is supported by a grant from the Brain Canada Multi-Investigator initiative and CIHR grant 159734 (SJ, CMTG). This work was also supported by an NIH award U01 MH119690 granted (SJ) and U01 MH119739.
SUPPLEMENT
We simulated a \(300 \times 9\) matrix representing the number of minor alleles (0, 1 or 2) of 300 subjects for 9 SNPs, 6 belonging to gene A and 3 belonging to gene B, using the glSim() function from the adegenet R package.
> library(adegenet)
> set.seed(78)
> SNPdata <- as.matrix(glSim(n.ind = 300, n.snp.nonstruc = 9, n.snp.struc = 0, k = 1,
+ LD = FALSE, ploidy = 2))
We may deduce two matrices, one with data belonging to gene A and one with data belonging to gene B.
> geneA <- SNPdata[,1:6]
> geneB <- SNPdata[,7:9]
Then we computed the matrix of interactions between SNPs from gene A and SNPs from gene B.
> SNPdataAB <- matrix(NA, nrow = 300, ncol = 6*3)
> k <- 1
> for (i in 1:6) {
+   for (j in 1:3) {
+     SNPdataAB[, k] <- geneA[, i] * geneB[, j]
+     k <- k + 1
+   }
+ }
A continuous outcome was simulated as a linear combination of the interaction terms between the 6 SNPs of gene A and the three SNPs of gene B, where the weight coefficients were randomly and uniformly chosen in the interval \(\left[-2;0\right]\) using the following code.
> beta <- as.matrix(runif(6*3, -2, 0))
> y <- 100 + SNPdataAB %*% beta + rnorm(300, 0, 15)
Of note, we added an intercept of 100 and a normally distributed error of mean 0 and standard deviation 15.
Estimating the kernel semi-parametric model parameters consists in estimating \(\alpha_1, ..., \alpha_L\) and \(\beta\). Then, estimators of \(h_1(.)\), ..., \(h_L(.)\) are deduced from \(\hat{\alpha}_1, ..., \hat{\alpha}_L\). When there are multiple kernels, estimation is iterative and so we proceed as follows. Partial derivatives of \(l(\beta, h)\) according to \(\alpha_1, ..., \alpha_L\) and \(\beta\) lead to the following equations:
\[\label{eq:betaM} \hat{\beta} = \left\lbrace X^\top \left( I - \sum\limits_{\ell = 1}^L K_{\ell} G_{\ell}^{-1} M_{\ell} \right) X \right\rbrace^{-1} X^\top \left(I - \sum\limits_{\ell = 1}^L K_{\ell} G_{\ell}^{-1} M_{\ell}\right) Y \tag{9}\]
\[\label{eq:alphaM} \forall \ell \in \left\lbrace 1,..., L\right\rbrace, \quad \hat{\alpha}_{\ell} = G_{\ell}^{-1} M_{\ell} \left(Y-X\hat{\beta}\right) \tag{10}\] where \(I\) is the \(n \times n\) identity matrix and, \(\forall \ell \in \left\lbrace 1, ..., L \right\rbrace\), \(G_{\ell}\) and \(M_{\ell}\) are computed using the iterative process described in the proof below.
Whereas parameter estimation for the single kernel model has been described extensively, the equations for computing the multiple kernel parameters had not previously been derived. Above, we propose an iterative way to estimate the parameters for models with any number of kernels; below, we give the proof of equations (9) and (10).
Proof:
Setting the partial derivatives of the penalized likelihood (6) to zero leads to the equations below:
\[\label{eq:derivll} \left\lbrace \begin{aligned} X^\top (Y - X\beta - \sum\limits_{\ell = 1}^L K_{\ell} \alpha_{\ell}) & = 0 \\ Y - X\beta - \sum\limits_{m = 1}^L K_{m} \alpha_{m} - \lambda_{\ell} \alpha_{\ell} & = 0 \quad \forall \ell \in \left\lbrace 1,...,L \right\rbrace \end{aligned} \right. \tag{11}\]
Proposition 1:
\(\forall k \in \left\lbrace 1, ..., L-1\right\rbrace, \alpha_{(k)} = \mathcal{G}_{(k)}^{-1} \mathcal{M}_{(k - 1)} \left( Y - X\beta - \sum\limits_{m = k + 1}^L K_{(m)} \alpha_{(m)} \right)\)
Proposition 2: if proposition 1 is true until \(k\), then
\[\sum\limits_{m = 1}^k K_{(m)}\alpha_{(m)} = (I - \mathcal{M}_{(k)})\left( Y - X\beta - \sum\limits_{m = k + 1}^L K_{(m)}\alpha_{(m)} \right)\]
(Proofs of propositions 1 and 2 are given below.)
Now, suppose proposition 1 and proposition 2 are true.
\[\begin{aligned} (\ref{eq:derivll}) \Rightarrow && Y - X\beta - \sum\limits_{m = 1}^L K_{(m)}\alpha_{(m)} - \lambda_{(L)}\alpha_{(L)} &= 0 \\ \Leftrightarrow && Y - X\beta - \sum\limits_{m = 1}^{L-1} K_{(m)}\alpha_{(m)} - K_{(L)}\alpha_{(L)} - \lambda_{(L)}\alpha_{(L)} &= 0 \\ \Leftrightarrow && Y - X\beta - (I - \mathcal{M}_{(L-1)})\left( Y - X\beta - K_{(L)}\alpha_{(L)} \right) - K_{(L)}\alpha_{(L)} - \lambda_{(L)}\alpha_{(L)}& = 0 \\ \Leftrightarrow && \left( \lambda_{(L)}I + \mathcal{M}_{(L-1)} K_{(L)} \right) \alpha_{(L)} - \mathcal{M}_{(L-1)} (Y - X\beta) &= 0 \\ \end{aligned}\]
Then,
\(\alpha_{\ell} = \alpha_{(L)} = \mathcal{G}_{(L)}^{-1} \mathcal{M}_{(L - 1)} \left( Y - X\beta \right)\).
Let \(G_{\ell}\) and \(M_{\ell}\) be the \(n \times n\) matrices defined as \(G_{\ell} = \mathcal{G}_{(L)}\) and \(M_{\ell} = \mathcal{M}_{(L - 1)}\); thus \(\alpha_{\ell} = G_{\ell}^{-1} M_{\ell} \left( Y - X\beta \right)\).
This process is repeated for each \(\ell \in \left\lbrace 1, ..., L\right\rbrace\).
Now, we can derive the estimate for \(\beta\).
\[\begin{aligned}
(\ref{eq:derivll}) \Rightarrow && X^\top \left( Y - X\beta - \sum\limits_{\ell = 1}^L K_{\ell}\alpha_{\ell} \right) & = 0 \\
\Leftrightarrow && X^\top\left( Y - X\beta - \sum\limits_{\ell = 1}^L K_{\ell}G_{\ell}^ {-1}M_{\ell} (Y - X\beta)\right) & = 0 \\
\Leftrightarrow && X^\top\left( I - \sum\limits_{\ell = 1}^L K_{\ell}G_{\ell}^ {-1}M_{\ell} \right) Y & = X^\top\left( I - \sum\limits_{\ell = 1}^L K_{\ell}G_{\ell}^ {-1}M_{\ell} \right) X\beta \\
\end{aligned}\]
Thus
\(\beta = \left\lbrace X^\top\left( I - \sum\limits_{\ell = 1}^L K_{\ell}G_{\ell}^ {-1}M_{\ell} \right) X \right\rbrace^{-1} X^\top\left( I - \sum\limits_{\ell = 1}^L K_{\ell}G_{\ell}^ {-1}M_{\ell} \right) Y\).
Proof of proposition 1:
\(\bullet\) Proposition 1 is true for \(k = 1\):
\[\begin{aligned} \text{(\ref{eq:derivll})} \Rightarrow && Y - X\beta - \sum\limits_{m = 1}^L K_{(m)} \alpha_{(m)} - \lambda_{(1)} \alpha_{(1)} & = 0 \\ \Leftrightarrow && Y - X\beta - \sum\limits_{m = 2}^L K_{(m)} \alpha_{(m)} - K_{(1)} \alpha_{(1)} - \lambda_{(1)} \alpha_{(1)} & = 0 \\ \Leftrightarrow && \mathcal{M}_{(0)} \left( Y - X\beta - \sum\limits_{m = 2}^L K_{(m)} \alpha_{(m)} \right) & = \left( \lambda_{(1)} I + \mathcal{M}_{(0)} K_{(1)} \right) \alpha_{(1)} \\ \Leftrightarrow && \mathcal{G}_{(1)}^{-1} \mathcal{M}_{(0)} \left( Y - X\beta - \sum\limits_{m = 2}^L K_{(m)} \alpha_{(m)} \right) & = \alpha_{(1)} \\ \end{aligned}\]
\(\bullet\) If proposition 1 is true for \(k\), then proposition 1 is true
for \(k + 1\):
According to proposition 2:
\[\begin{aligned} \sum\limits_{m = 1}^k K_{(m)} \alpha_{(m)} &= (I - \mathcal{M}_{(k)}) \left(Y - X\beta - \sum\limits_{m = k + 1}^L K_{(m)}\alpha_{(m)} \right) \\ & = (I - \mathcal{M}_{(k)}) \left(Y - X\beta - \sum\limits_{m = k + 2}^L K_{(m)}\alpha_{(m)} \right) - (I - \mathcal{M}_{(k)}) K_{(k+1)} \alpha_{(k+1)} \end{aligned}\]
\[\begin{aligned} \text{(\ref{eq:derivll})} \Rightarrow && Y - X\beta - \sum\limits_{m = 1}^L K_{(m)} \alpha_{(m)} - \lambda_{(k+1)} \alpha_{(k+1)} & = 0 \\ \Leftrightarrow && Y - X\beta - \sum\limits_{m = 1}^k K_{(m)} \alpha_{(m)} - K_{(k+1)} \alpha_{(k+1)} - \sum\limits_{m = k + 2}^L K_{(m)} \alpha_{(m)} - \lambda_{(k+1)} \alpha_{(k+1)} &= 0 \\ \Leftrightarrow && - \left( (I - \mathcal{M}_{(k)}) \left(Y - X\beta - \sum\limits_{m = k + 2}^L K_{(m)}\alpha_{(m)} \right) - (I - \mathcal{M}_{(k)}) K_{(k+1)} \alpha_{(k+1)} \right) & \\ && + Y - X\beta - K_{(k+1)} \alpha_{(k+1)} - \sum\limits_{m = k + 2}^L K_{(m)} \alpha_{(m)} - \lambda_{(k+1)} \alpha_{(k+1)} & = 0 \\ \Leftrightarrow && \mathcal{M}_{(k)} \left( Y - X\beta - \sum\limits_{m = k + 2}^L K_{(m)} \alpha_{(m)} \right) - \mathcal{M}_{(k)} K_{(k+1)} \alpha_{(k+1)} - \lambda_{(k+1)} \alpha_{(k+1)} & = 0 \\ \Leftrightarrow && \mathcal{M}_{(k)} \left(Y - X\beta - \sum\limits_{m = k + 2}^L K_{(m)} \alpha_{(m)} \right) - \mathcal{G}_{(k+1)} \alpha_{(k+1)} & = 0 \\ \Leftrightarrow && \mathcal{G}_{(k+1)}^{-1} \mathcal{M}_{(k)} \left(Y - X\beta - \sum\limits_{m = k + 2}^L K_{(m)} \alpha_{(m)} \right) & = \alpha_{(k+1)} \\ \end{aligned}\]
Proof of proposition 2:
\(\bullet\) Proposition 2 is true for \(k = 1\):
Suppose
\(\alpha_{(1)} = \mathcal{G}_{(1)}^{-1}\mathcal{M}_{(0)} \left( Y - X\beta - \sum\limits_{m = 2}^L K_{(m)}\alpha_{(m)} \right)\).
\[\begin{aligned} \sum\limits_{m = 1}^1 K_{(m)} \alpha_{(m)} & = K_{(1)}\alpha_{(1)} \\ & = K_{(1)} \mathcal{G}_{(1)}^{-1}\mathcal{M}_{(0)} \left( Y - X\beta - \sum\limits_{m = 2}^L K_{(m)}\alpha_{(m)} \right) \\ & = \left( I - \mathcal{M}_{(0)} + \mathcal{M}_{(0)} K_{(1)} \mathcal{G}_{(1)}^{-1} \mathcal{M}_{(0)} \right) \left( Y - X\beta - \sum\limits_{m = 2}^L K_{(m)}\alpha_{(m)} \right) \\ & = \left( I - (I - \mathcal{M}_{(0)} K_{(1)} \mathcal{G}_{(1)}^{-1}) \mathcal{M}_{(0)}\right) \left( Y - X\beta - \sum\limits_{m = 2}^L K_{(m)}\alpha_{(m)} \right) \\ & = \left(I - \mathcal{M}_{(1)} \right) \left(Y - X\beta - \sum\limits_{m = 2}^L K_{(m)} \alpha_{(m)}\right) \\ \end{aligned}\]
\(\bullet\) If proposition 2 is true for \(k\), then proposition 2 is true
for \(k + 1\):
Suppose proposition 1 is true until \(k+1\), then
\(\forall j \in \left\lbrace 1, ..., k+1 \right\rbrace\),
\(\alpha_{(j)} = \mathcal{G}_{(j)}^{-1}\mathcal{M}_{(j-1)} \left( Y - X\beta - \sum\limits_{m = j+1}^L K_{(m)}\alpha_{(m)} \right)\)
\[\begin{aligned} \sum\limits_{m = 1}^{k+1} K_{(m)} \alpha_{(m)} & = \sum\limits_{m = 1}^{k} K_{(m)} \alpha_{(m)} + K_{(k+1)} \alpha_{(k+1)}\\ & = (I - \mathcal{M}_{(k)}) \left(Y - X\beta - \sum\limits_{m = k + 1}^L K_{(m)}\alpha_{(m)} \right) + K_{(k+1)} \alpha_{(k+1)} \\ & = (I - \mathcal{M}_{(k)}) \left(Y - X\beta - \sum\limits_{m = k + 2}^L K_{(m)}\alpha_{(m)} \right) + \mathcal{M}_{(k)} K_{(k+1)}\alpha_{(k+1)} \\ & = (I - \mathcal{M}_{(k)}) \left(Y - X\beta - \sum\limits_{m = k + 2}^L K_{(m)}\alpha_{(m)} \right) + \mathcal{M}_{(k)} K_{(k+1)} \mathcal{G}_{(k+1)}^{-1} \mathcal{M}_{(k)} \left(Y - X\beta - \sum\limits_{m = k + 2}^L K_{(m)}\alpha_{(m)} \right) \\ & = (I - \mathcal{M}_{(k)} + \mathcal{M}_{(k)} K_{(k+1)} \mathcal{G}_{(k+1)}^{-1} \mathcal{M}_{(k)}) \left(Y - X\beta - \sum\limits_{m = k + 2}^L K_{(m)}\alpha_{(m)} \right) \\ & = (I - (I - \mathcal{M}_{(k)} K_{(k+1)} \mathcal{G}_{(k+1)}^{-1}) \mathcal{M}_{(k)}) \left(Y - X\beta - \sum\limits_{m = k + 2}^L K_{(m)}\alpha_{(m)} \right) \\ & = (I - \mathcal{M}_{(k+1)}) \left(Y - X\beta - \sum\limits_{m = k + 2}^L K_{(m)}\alpha_{(m)} \right) \\ \end{aligned}\]
\(\square\)
Of note, for \(L = 1\), equations (9) and (10) correspond to those provided in (Liu et al. 2007).
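To make this correspondence concrete, the following sketch checks equations (9) and (10) numerically in the single kernel case, assuming (consistent with the base case of the proof above) that \(\mathcal{M}_{(0)}\) is the identity matrix, so that \(M_1 = I\) and \(G_1 = \lambda_1 I + K_1\). The data, the Gaussian kernel and the value of \(\lambda_1\) are arbitrary choices used only for illustration.
> set.seed(1)
> n <- 50
> X <- cbind(1, rnorm(n))                 # intercept and one linear covariate
> Z <- matrix(rnorm(n * 5), nrow = n)     # variables entering the kernel
> Y <- X %*% c(1, 2) + sin(Z[,1]) + rnorm(n, 0, 0.5)
> K <- exp(-as.matrix(dist(Z))^2)         # Gaussian kernel matrix (rho = 1)
> lambda <- 2                             # penalization parameter, fixed here
> G <- lambda * diag(n) + K               # G_1 = lambda_1 I + K_1, since M_1 = I
> A <- diag(n) - K %*% solve(G)           # I - K_1 G_1^{-1}
> beta.hat <- solve(t(X) %*% A %*% X, t(X) %*% A %*% Y)   # equation (9)
> alpha.hat <- solve(G, Y - X %*% beta.hat)               # equation (10)
> h.hat <- K %*% alpha.hat                # estimate of h(.) at the observed points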