Data with multiple responses is ubiquitous in modern applications. However, few tools are available for regression analysis of multivariate counts. The most popular multinomial-logit model has a very restrictive mean-variance structure, limiting its applicability to many data sets. This article introduces an R package MGLM, short for multivariate response generalized linear models, that expands the current tools for regression analysis of polytomous data. Distribution fitting, random number generation, regression, and sparse regression are treated in a unifying framework. The algorithm, usage, and implementation details are discussed.
Multivariate categorical data arises in many fields, including genomics, image analysis, text mining, and sports statistics. The multinomial-logit model (Agresti 2002 7) has been the most popular tool for analyzing such data. However, it is limiting due to its specific mean-variance structure and the strong assumption that the counts are negatively correlated. Models that address over-dispersion relative to a multinomial distribution and incorporate positive and/or negative correlation structures would offer greater flexibility for analysis of polytomous data.
In this article, we introduce an R package MGLM, short for multivariate response generalized linear models. The MGLM package provides a unified framework for random number generation, distribution fitting, regression, hypothesis testing, and variable selection for multivariate response generalized linear models, particularly four models listed in Table 1. These models considerably broaden the class of generalized linear models (GLM) for analysis of multivariate categorical data.
MGLM overlaps little with existing packages in R and other
softwares. The standard multinomial-logit model is implemented in
several R packages (Venables and Ripley 2002) with
VGAM
(Yee 2010, 2015, 2017) being the most
comprehensive. We include the classical multinomial-logit regression
model in MGLM not only for completeness, but also to complement it
with various penalty methods for variable selection and regularization.
If invoked by the group penalty, MGLM is able to perform variable
selection at the predictor level for easier interpretation. This is
different from the elastic net penalized multinomial-logit model
implemented in glmnet
(Friedman et al. 2010), which does selection at the matrix
entry level. Although MGLM focuses on regression, it also provides
distribution fitting and random number generation for the models listed
in Table 1. VGAM and
dirmult
(Tvedebrink 2010) packages can estimate the parameters of the
Dirichlet-multinomial (DM) distribution using Fisher’s scoring and
Newton’s method respectively. As indicated in the manual
(Yee 2017), the convergence of Fisher’s scoring method may be
slow due to the difficulty in evaluating the expected information
matrix. Further the Newton’s method is unstable as the log-likelihood
function may be non-concave. As explained later, MGLM achieves
both stability and efficiency via a careful algorithmic design. In SAS,
PROC LOGISTIC
can fit multinomial-logit model. In Matlab, the mnrfit
function fits multinomial-logit regression. Alternative link functions
(probit, loglog, complementary loglog) are implemented only for ordinal
responses. Other regression models in Table 1 are not
implemented in either SAS or Matlab.
There are some limitations to the MGLM. First, MGLM only handles nominal responses; ordinal responses are not incorporated in current implementation. MGLM also does not allow covariates to take a different value for each category, which can be useful in applications such as modeling consumer choice among a discrete number of products (Yee 2015 14). Lastly, current implementation of MGLM does not permit parameter constraints.
MGLM provides standard errors for all estimates, reports significance of regression covariates based on the Wald test (default) or the likelihood ratio test (LRT), and outputs the AIC (Akaike information criterion) and BIC (Bayesian information criterion) of the fitted model to aid model choice by users. Model selection via regularization is automated by computing the solution path on a grid of tuning parameter values and then reporting the regularized estimate with minimal BIC. With large data sets in mind, we pay particular attention to computational efficiency. For numerical examples in this article, we report the run times whenever possible. The results are obtained on a laptop with Intel Core i7 CPU @ 2.9GHz and 16G memory using MGLM 0.1.0 under R 3.4.3.
Model | No. regress. param. | Correlation | MGLM code |
---|---|---|---|
Multinomial | Negative | dist=’MN’ |
|
Dirichlet-multinomial | Negative | dist=’DM’ |
|
Negative multinomial | Positive | dist=’NegMN’ |
|
Gen. Dirichlet-multinomial | Negative and positive | dist=’GDM’ |
This section details the models implemented in MGLM. Table
1 summarizes the multivariate models implemented in the
R package. They are multivariate analogs of binomial, beta-binomial, and
negative binomial models. For each model, we specify the probability
mass function of the response vector
The probability mass function of a
The parameter
where
Because the log-sum-exponential mapping
The mean-variance structure of the MN model does not allow
over-dispersion, which is common in real data. DM distribution models
the probability parameter
where
relates the parameter
with
The counts
where
to relate the covariates
which in general is not concave in
In some applications the over-dispersion parameter
The model size is
In the previous three models, the multivariate counts have either
pairwise negative correlation (MN and DM) or pairwise positive
correlation (NegMN). It is possible to relax these restrictions by
choosing a more flexible mixing distribution for the probability vector
The probability mass of a count vector
where
We propose the inverse link functions
to relate the covariates
Again the log-likelihood is not concave in general.
When there are no covariates, multinomial model is a special case of the
DM models, which in turn is a sub-model of the GDM model. That is,
MGLMfit
reports the dist="NegMN"
in the distribution fitting function MGLMfit
.
For regression, there is no nesting structure among the models in Table
1. The regression function MGLMreg
outputs AIC and
BIC of the fitted model to aid users in choosing an appropriate
regression model for their data.
Standard errors for both distribution fitting (MGLMfit
) and regression
estimates (MGLMreg
) are calculated based on the observed information
matrix, as it provides a reasonable approximation to the expected
information matrix and is even preferred as argued by
Efron and Hinkley (1978).
Unlike regression for univariate responses, the MGLM regression
parameter is a matrix
By default, MGLMreg
assesses the significance of each predictor by the
Wald test. Let
be the inverse of the observed information matrix at the maximum
likelihood estimate (MLE)
where MGLMreg
. Users can also easily invoke the LRT by calling
MGLMreg
with option LRT=TRUE
. LRT requires more computation than
Wald test as it needs to perform MLE on
The number of parameters, MGLMsparsereg
function solves the regularized problem
where
In elementwise regularization (penalty=’sweep’
),
where
In groupwise regularization (penalty=’group’
),
The group penalty (Yuan and Lin 2006; Meier et al. 2008) achieves variable selection at the predictor level and leads to a more interpretable model.
Shrinkage and sparsity in terms of the rank of penalty=’nuclear’
)
where
The wrapper function MGLMtune
facilitates the tuning procedure by
performing the regularized estimation over a grid of 30 (default) tuning
parameter values using warm start and reports the optimal tuning
parameter according to BIC. Users can also supply their own grid points.
As the DM, NegMN, and GDM distributions do not belong to the exponential family, the usual iteratively reweighted least squares method for maximum likelihood estimation of GLM does not apply. The main issue lies in the difficulty of calculating the expected information matrix, which involves evaluating numerous tail probabilities of the marginal distribution (Paul et al. 2005; Zhou and Lange 2010; Zhou and Zhang 2012). On the other hand, Newton’s method suffers from instability since the log-likelihood functions are non-concave in general.
For distribution fitting, Zhou and Lange (2010) derive stable algorithms based on the minorization-maximization (MM) principle (Lange et al. 2000). Similar to the classical expectation-maximization algorithm, MM algorithm increases the objective value at each iteration and converges to a stationarity point of objective function under mild regularity conditions.
For regression models, Zhang et al. (2017) propose an iteratively reweighted Poisson regression (IRPR) method for maximum likelihood estimation. Their derivation again hinges upon the MM principle, resulting in the much desirable stability of the IRPR algorithm which is critical as the number of parameters is potentially large.
In practice, MM algorithm may suffer from slow convergence especially in the proximity of the optimum. MGLM organically combines the MM algorithm and the Newton’s method. At each iteration, it computes both MM and Newton updates and chooses the one that results in a higher log-likelihood. Thus, stability is ensured as the log-likelihood always increases. When sufficiently close to the optimum, Newton’s method takes over and quickly converges to the MLE in just a few iterations.
An added advantage of the MM algorithm is that it separates parameters
and embraces parallel computing (Zhou et al. 2010). Each
iteration of the IRPR algorithm involves solving MGLMreg
regression function supports multi-threading on a
multi-core machine.
For sparse regression in the MGLMsparsereg
function, we implemented
the accelerated proximal gradient (Nesterov) method
(Zhang et al. 2017). Each iteration only involves evaluation of
the gradient of the negative log-likelihood, followed by the
elementwise, groupwise, or singular value thresholding according to the
regularization being used, and thus scales well with the problem size.
Three main functions of the MGLM package are MGLMfit
for fitting
multivariate distributions, MGLMreg
for fitting multivariate response
regressions, and MGLMsparsereg
for fitting sparse regressions. The
package adopts S4
object system. In this section, we demonstrate their
basic usage using a simulated RNA-seq data set. The R vignette included
in the package provides more extensive examples.
The rnaseq
data that comes with the package is simulated from the
isoform package
(Sun 2014; Sun et al. 2015) and mimics the real counts generated by
the RNA-seq platforms. The simulation mechanism follows the biological
process and has nothing to do with the models in Table
1.
In this example, totalReads
, treatment
, gender
, and age
of
an individual. 200 observations are simulated. In the generative model,
exon expression levels are affected by intercept, number of total reads
(on log scale), and treatment. Age and gender are unrelated to the exon
counts.
R> library("MGLM")
R> data("rnaseq")
R> data <- rnaseq[, 1:6]
R> head(rnaseq, n = 3)
X1 X2 X3 X4 X5 X6 totalReads treatment gender age
1 295 65 19 114 54 20 28317494 0 0 57
2 213 126 12 96 50 4 20015549 0 0 67
3 322 147 23 245 42 19 35318251 0 1 37
R> dim(rnaseq)
[1] 200 10
We first ignore covariates and demonstrate distribution fitting and
model selection with BIC and LRT. The multi-categorical distribution
fitting is performed by the function MGLMfit
.
The primary inputs of the function are the multi-categorical count data
in the format of data frame or matrix and the distribution intended to
fit. The user can also use the weight
argument to specify the weight
of each observation. Initial values of the iterative algorithm can also
be determined by the user using init
argument. The function also has
epsilon
, maxiters
, and display
to control the convergence
threshold, maximum number of iterations to run, and whether to display
the result from each iteration, respectively.
The outputs are returned in a list of class “MGLMfit”, including
parameter estimates, their standard errors, log-likelihood, AIC, BIC,
number of iterations to converge. The inversed information matrices and
gradients are also returned, but are not printed, in order to keep the
printed output concise. When fitting DM and GDM distribution, we also
perform LRT, testing against the null hypothesis of using multinomial
model.
The following snippets fit the DM
R> system.time (
+ dmFit <- MGLMfit(data, dist = "DM")
+ )
user system elapsed
0.255 0.007 0.263
R> dmFit
estimate SE
alpha_X1 6.128117 0.327888
alpha_X2 2.413647 0.139676
alpha_X3 1.625627 0.099424
alpha_X4 6.822929 0.362568
alpha_X5 2.214236 0.129233
alpha_X6 0.784028 0.051369
Distribution: Dirichlet Multinomial
Log-likelihood: -4968.666
BIC: 9969.121
AIC: 9949.331
LRT test p value: <0.0001
Iterations: 6
and the GDM model
R> system.time(
+ gdmFit <- MGLMfit(data, dist = "GDM")
+ )
user system elapsed
0.380 0.019 0.399
R> gdmFit
estimate SE
alpha_X1 3.741846 0.367088
alpha_X2 2.400909 0.815431
alpha_X3 1.558396 0.233136
alpha_X4 6.988354 1.164764
alpha_X5 20.689398 0.149279
beta_X1 8.026379 0.966502
beta_X2 11.038376 0.725978
beta_X3 8.961428 0.264520
beta_X4 2.702723 2.871718
beta_X5 4.854816 0.648271
Distribution: Generalized Dirichlet Multinomial
Log-likelihood: -4841.231
BIC: 9735.446
AIC: 9702.463
LRT test p value: <0.0001
Iterations: 59
Both dmFit
and gdmFit
give the LRT
R> LRT <- -2 * (logLik(dmFit) - logLik(gdmFit))
R> pchisq(LRT, ncol(data) - 2, lower.tail = FALSE)
[1] 5.817352e-54
Both suggest that GDM provides a significantly better fit than DM.
The NegMN model
R> system.time(
+ negmnFit <- MGLMfit(data, dist = "NegMN")
+ )
user system elapsed
0.006 0.002 0.009
R> print(negmnFit)
estimate SE
p_X1 0.311486 0.001362
p_X2 0.106491 0.000850
p_X3 0.098373 0.000819
p_X4 0.350496 0.001425
p_X5 0.094263 0.000803
p_X6 0.021220 0.000389
phi 12.232569 1.229253
Distribution: Negative Multinomial
Log-likelihood: -20673.71
BIC: 41384.52
AIC: 41361.43
LRT test p value:
Iterations: 3
yields a much larger BIC than those of DM and GDM. LRT does not apply here, since NegMN is not a sub-model of the other three.
The more interesting question is whether the covariates have a
significant relationship to the responses. Regressions are performed
using function MGLMreg
. First, the regression formula is required by
the MGLMreg
function. When specifying the regression formula, the
response matrix is on the left hand side and the covariates on the
right, following the convention in lm
and glm
. The model is
specified using the dist
argument. The input argument data
is
optional. When specified, the formula is built based on the supplied
data frame; otherwise, the function searches through the global
environment for the data elements. Similar to the distribution fitting
function, weights of the observations can be specified by the weight
argument, and initial values can be supplied using init
arguments.
Parallel computing is also implemented in the package. Setting
parallel=TRUE
and giving the number of cores using the argument core
invokes multi-threading. The other arguments used to control the
algorithm convergence and display include epsilon
, maxiters
,
display
, and LRT
.
The output of MGLMreg
is a list of class “MGLMreg”, containing the
estimated regression coefficients, their standard errors, Wald test
statistics and their corresponding
We fit the four regression models in Table 1 with all
A few observations can be made from the following output. BIC indicates the GDM model as the best fit, followed by the DM model. This is consistent with the distribution fitting results. Note that the hypothesis testing results in the four models are different. In the MN model, all covariates are significant; however, this is not true because age and gender have no effects in the generative model. DM also wrongly identifies age as a significant predictor. Only the GDM model correctly identifies true significant predictors.
R> system.time(
+ mnreg <- MGLMreg(formula = cbind(X1, X2, X3, X4, X5, X6) ~ log(totalReads) +
+ treatment + age + gender, data = rnaseq, dist = "MN")
+ )
user system elapsed
0.142 0.006 0.149
R> print(mnreg)
Call: MGLMreg(formula = cbind(X1, X2, X3, X4, X5, X6) ~ log(totalReads) +
treatment + age + gender, data = rnaseq, dist = "MN")
Coefficients:
X1 X2 X3 X4 X5
[1,] 4.942732 5.009649 -3.792216 4.435434 4.027689
[2,] -0.112845 -0.170222 0.219277 -0.107260 -0.120928
[3,] -0.022655 -0.043099 2.745277 1.405742 0.092246
[4,] -0.006187 -0.009709 -0.005907 -0.010945 -0.009599
[5,] 0.032676 0.100389 0.020663 0.103859 0.009514
Hypothesis test:
wald value Pr(>wald)
(Intercept) 144.88789 1.634268e-29
log(totalReads) 69.92572 1.061922e-13
treatment 18364.13260 0.000000e+00
age 79.91023 8.762650e-16
gender 52.33670 4.601575e-10
Distribution: Multinomial
Log-likelihood: -7506.393
BIC: 15145.24
AIC: 15062.79
Iterations: 6
R> system.time(
+ dmreg <- MGLMreg(formula = cbind(X1, X2, X3, X4, X5, X6) ~ log(totalReads) +
+ treatment + age + gender, data = rnaseq, dist = "DM")
+ )
user system elapsed
0.182 0.004 0.187
R> print(dmreg)
Call: MGLMreg(formula = cbind(X1, X2, X3, X4, X5, X6) ~ log(totalReads) +
treatment + age + gender, data = rnaseq, dist = "DM")
Coefficients:
X1 X2 X3 X4 X5 X6
[1,] -0.895850 -1.096921 -8.997414 -1.736871 -1.774227 -5.646822
[2,] 0.221988 0.186919 0.536572 0.252679 0.216672 0.347271
[3,] -0.679291 -0.686881 1.835585 0.707954 -0.546469 -0.543134
[4,] 0.010245 0.005227 0.009134 0.004252 0.006090 0.011642
[5,] -0.026177 0.040244 -0.052842 0.023178 -0.058339 -0.039139
Hypothesis test:
wald value Pr(>wald)
(Intercept) 14.579069 0.023796
log(totalReads) 8.502549 0.203547
treatment 1851.437449 0.000000
age 13.131512 0.040994
gender 4.133364 0.658634
Distribution: Dirichlet Multinomial
Log-likelihood: -4386.941
BIC: 8932.831
AIC: 8833.882
Iterations: 9
R> system.time(
+ gdmreg <- MGLMreg(formula = cbind(X1, X2, X3, X4, X5, X6) ~ log(totalReads) +
+ treatment + age + gender, data = rnaseq, dist = "GDM")
+ )
user system elapsed
0.219 0.003 0.224
R> print(gdmreg)
Call: MGLMreg(formula = cbind(X1, X2, X3, X4, X5, X6) ~ log(totalReads) +
treatment + age + gender, data = rnaseq, dist = "GDM")
Coefficients:
alpha_X4 alpha_X1 alpha_X2 alpha_X3 alpha_X5 beta_X4
[1,] 5.987993 -7.056673 0.456088 -10.120738 2.639396 4.661089
[2,] -0.215099 0.555973 0.039553 0.720358 -0.016121 -0.140896
[3,] -0.047691 -0.329320 0.979359 0.099958 0.063393 0.628878
[4,] 0.006661 -0.004343 0.019361 0.008173 0.012397 0.003224
[5,] 0.233006 0.374838 -0.186420 -0.202417 0.144289 0.212071
beta_X1 beta_X2 beta_X3 beta_X5
[1,] -9.789127 7.095061 -9.530008 -1.687615
[2,] 0.713819 -0.222984 0.743146 0.133985
[3,] 0.746198 -1.591630 -0.923712 -0.042441
[4,] 0.000453 0.015945 0.012541 0.019188
[5,] 0.273256 -0.233121 -0.270428 0.122062
Hypothesis test:
wald value Pr(>wald)
(Intercept) 15.40109 0.118109
log(totalReads) 11.04187 0.354265
treatment 2549.23829 0.000000
age 16.42846 0.088007
gender 10.72122 0.379646
Distribution: Generalized Dirichlet Multinomial
Log-likelihood: -4289.281
BIC: 8843.479
AIC: 8678.563
Iterations: 46
There are two variants of NegMN regression, depending on whether to link
over-dispersion parameter to covariates. The default setting
regBeta=FALSE
instructs MGLMreg
to fit the NegMN regression with all
observations sharing the same over-dispersion parameter value. There are
R> system.time(
+ negmnreg2 <- MGLMreg(formula = cbind(X1, X2, X3, X4, X5, X6) ~
+ log(totalReads) + treatment + age + gender,
+ data = rnaseq, dist = "NegMN", regBeta = FALSE)
+ )
user system elapsed
0.196 0.004 0.201
R> print(negmnreg2)
Call: MGLMreg(formula = cbind(X1, X2, X3, X4, X5, X6) ~ log(totalReads) +
treatment + age + gender, data = rnaseq, dist = "NegMN",
regBeta = FALSE)
Coefficients:
$alpha
X1 X2 X3 X4
(Intercept) -13.587385 -13.521818 -22.380101 -14.131348
log(totalReads) 0.907716 0.850412 1.242922 0.915258
treatment -0.753113 -0.773507 2.014641 0.675195
age 0.002583 -0.000938 0.002916 -0.002141
gender -0.060696 0.007022 -0.069502 0.012499
X5 X6
(Intercept) -14.507698 -18.526425
log(totalReads) 0.899918 1.020360
treatment -0.638296 -0.730410
age -0.000824 0.008766
gender -0.083681 -0.093397
$phi
phi
31.6062
Hypothesis test:
wald value Pr(>wald)
(Intercept) 385.75540 3.223413e-80
log(totalReads) 368.08485 2.017976e-76
treatment 18377.52958 0.000000e+00
age 79.70906 4.103065e-15
gender 54.84662 4.978098e-10
Distribution: Negative Multinomial
Log-likelihood: -8746.689
BIC: 17657.63
AIC: 17555.38
Iterations: 35
regBeta=TRUE
instructs MGLMreg
to link over-dispersion parameter to
covariates and there are
R> system.time(
+ negmnreg <- MGLMreg(formula = cbind(X1, X2, X3, X4, X5, X6) ~
+ log(totalReads) + treatment + age + gender,
+ data = rnaseq, dist = "NegMN", regBeta = TRUE)
+ )
user system elapsed
9.866 0.023 9.898
R> print(negmnreg)
Call: MGLMreg(formula = cbind(X1, X2, X3, X4, X5, X6) ~ log(totalReads) +
treatment + age + gender, data = rnaseq, dist = "NegMN",
regBeta = TRUE)
Coefficients:
X1 X2 X3 X4
(Intercept) -17.648355 -17.582555 -26.462109 -18.204282
log(totalReads) 1.192057 1.134742 1.528500 1.200309
treatment -0.877324 -0.897715 1.890375 0.550944
age -0.013397 -0.016918 -0.013072 -0.018127
gender -0.101456 -0.033730 -0.110253 -0.028252
X5 X6 phi
(Intercept) -18.569609 -22.587821 7.543157
log(totalReads) 1.184315 1.304725 -0.285737
treatment -0.762503 -0.854624 0.125412
age -0.016804 -0.007213 0.015871
gender -0.124433 -0.134173 0.041254
Hypothesis test:
wald value Pr(>wald)
(Intercept) 291.48017 3.987668e-59
log(totalReads) 371.39226 3.232075e-76
treatment 18377.18774 0.000000e+00
age 81.70350 6.187095e-15
gender 54.79955 1.633654e-09
Distribution: Negative Multinomial
Log-likelihood: -8745.162
BIC: 17675.77
AIC: 17560.32
Iterations: 140
The function MGLMsparsereg
performs regularized estimation. Similar to
MGLMreg
, the inputs of the sparse regression function include
formula
, data
, dist
, and the convergence controlling arguments.
The function also requires the tuning parameter lambda
, and the
penalty type argument penalty
. The outputs include the coefficient
estimates, log-likelihood, AIC, BIC, degrees of freedom, and the number
of iterations.
We simulate 100 data points from a 5-variate DM model using 10
covariates. Only three of them (1, 3, and 5) have non-zero effects. For
each 5-variate data point, batch size, or the total number of objects
that are put into 5 categories, is drawn from
R> dist <- "DM"
R> n <- 100
R> p <- 10
R> d <- 5
R> set.seed(118)
R> m <- rbinom(n, 200, 0.8)
R> X <- matrix(rnorm(n * p), n, p)
R> alpha <- matrix(0, p, d)
R> alpha[c(1, 3, 5), ] <- 1
R> Alpha <- exp(X %*% alpha)
R> Y <- rdirmn(size = m, alpha = Alpha)
R> length(m)
[1] 100
R> head(m)
[1] 167 160 162 159 156 157
R> head(Y)
[,1] [,2] [,3] [,4] [,5]
[1,] 24 7 112 15 9
[2,] 34 33 31 38 24
[3,] 0 0 0 0 162
[4,] 7 17 84 29 22
[5,] 0 33 0 123 0
[6,] 0 0 3 154 0
R> head(rowSums(Y))
[1] 167 160 162 159 156 157
We demonstrate the regularized estimation by group, nuclear norm, and element-wise penalization.
With input lambda=Inf
, MGLMsparsereg
returns
R> pen <- "group"
R> ngridpt <- 30
R> spmodelfit <- MGLMsparsereg(formula = Y ~ 0 + X, dist = dist,
+ lambda = Inf, penalty = pen)
R> maxlambda <- maxlambda(spmodelfit)
R> lambdas <- exp(seq(from = log(maxlambda), to = log(maxlambda / nrow(Y)),
+ length.out = ngridpt))
Tuning is performed on 30 grid points. The left panel of Figure 1 displays the BIC trace along the solution path.
R> BICs <- rep(0, ngridpt)
R> AICs <- rep(0, ngridpt)
R> LogLs <- rep(0, ngridpt)
R> Dofs <- rep(0, ngridpt)
R> ptm <- proc.time()
R> for (j in 1:ngridpt) {
+ if (j == 1) {
+ B0 <- matrix(0, p, ncol(coef(spmodelfit)))
+ }
+ else B0 <- B_hat
+ select.fit <- MGLMsparsereg(formula = Y ~ 0 + X, dist = dist,
+ lambda = lambdas[j], penalty = pen, init = B0)
+ B_hat <- coef(select.fit)
+ BICs[j] <- BIC(select.fit)
+ LogLs[j] <- logLik(select.fit)
+ AICs[j] <- AIC(select.fit)
+ Dofs[j] <- dof(select.fit)
+ }
R> proc.time() - ptm
user system elapsed
4.469 0.026 4.500
R> pen <- "group"
R> ngridpt <- 30
R> spmodelfit <- MGLMsparsereg(formula = Y ~ 0 + X, dist = dist,
+ lambda = Inf, penalty = pen)
R> maxlambda <- maxlambda(spmodelfit)
R> lambdas <- exp(seq(from = log(maxlambda), to = log(maxlambda / nrow(Y)),
+ length.out = ngridpt))
The right panel of Figure 1 displays the
regularized estimate
R> chosen.lambda <- lambdas[which.min(BICs)]
R> select <- MGLMsparsereg(formula = Y ~ 0 + X, dist = dist,
+ lambda = chosen.lambda, penalty = pen)
Alternatively, the function MGLMtune
automates the tuning procedure
and reports the regularized estimate at the optimal tuning parameter
value according to BIC.
R> selectTune <- MGLMtune(Y ~ 0 + X, dist = dist, penalty = pen, ngridpt = 30,
+ display = FALSE)
The option ngridpt
sets the number of grid points
|
|
Nuclear norm regularization is invoked by setting penalty="nuclear"
.
R> system.time (
+ select <- MGLMtune(Y ~ 0 + X, dist = "DM", penalty = "nuclear",
+ ngridpt = 30, display = FALSE))
user system elapsed
4.475 0.037 4.528
Figure 2 displays the BIC trace plot and the
regularized estimate at optimal
|
|
R> system.time (
+ select <- MGLMtune(Y ~ 0 + X, dist = "DM", penalty = "sweep", ngridpt = 30,
+ display = FALSE))
user system elapsed
4.448 0.038 4.504
Figure 3 displays the BIC trace plot and the
regularized estimate at optimal
|
|
This article introduces the MGLM package for analysis of multivariate categorical data. Distribution fitting, regression, sparse regression, and random number generation are implemented in a simple and unified framework. It timely responds to the current challenge of multivariate categorical data analysis arising from modern technology such as RNA-seq. The R package is available on CRAN.
There are several possible extensions that would be useful to the package. Some of them include:
Alternative parameterization. Some models in Table
1 admits alternative parameterization. For
instance, the DM distribution can be re-parameterized as
Alternative link function. Current version only implements the log
link function for positive distribution parameter and logit
link
for the probability vector. Inclusion of alternative link functions
such as probit
and cloglog
would expand the flexibility of the
models.
Ordinal categorical responses. Multinomial-logit model can be adapted to ordinal categorical responses (Agresti 2002 7). Parallel developments and implementation for the DM, NegMN, and GDM models are worth considering.
Parameter constraints. Current version does not allow constraints
among regression parameters. MGLMreg
calls glm.fit
functions to
fit weighted Poisson regressions in each iteration; we may call
functions from glmc
package (Chaudhuri et al. 2006) instead to incorporate parameter
constraints.
The xij
argument (Yee 2015 14). Current version
assumes the same set of covariates for all categories. Allowing
covariates to take different values for each category can be useful,
e.g., for discrete choice modeling in econometrics. The
corresponding algorithm and implementation are worth exploring.
The MGLM package for R is continually being developed.
The work is partially supported by NSF grants DMS-1127914, DMS-1310319 and NIH grants HG006139, GM105785 and GM53275.
MGLM, VGAM, glmnet, dirmult, parallel, isoform, glmc
Distributions, Econometrics, Environmetrics, ExtremeValue, MachineLearning, Psychometrics, Survival
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Kim, et al., "MGLM: An R Package for Multivariate Categorical Data Analysis", The R Journal, 2018
BibTeX citation
@article{RJ-2018-015, author = {Kim, Juhyun and Zhang, Yiwen and Day, Joshua and Zhou, Hua}, title = {MGLM: An R Package for Multivariate Categorical Data Analysis}, journal = {The R Journal}, year = {2018}, note = {https://rjournal.github.io/}, volume = {10}, issue = {1}, issn = {2073-4859}, pages = {73-90} }