We provide a detailed overview of the updated version of the R package LMest, which offers functionalities for estimating Markov chain and latent or hidden Markov models for time series and longitudinal data. This overview includes a description of the modeling structure, maximum-likelihood estimation based on the Expectation-Maximization algorithm, and related issues. Practical applications of these models are illustrated using real and simulated data with both categorical and continuous responses. The latter are handled under the assumption of the Gaussian distribution given the latent process. When describing the main functions of the package, we refer to potential applicative contexts across various fields. The LMest package introduces several key novelties compared to previous versions. It now handles missing responses under the missing-at-random assumption and provides imputed values. The implemented functions allow users to display and visualize model results. Additionally, the package includes functions to perform parametric bootstrap for inferential procedures and to simulate data with complex structures in longitudinal contexts.
Markov chain (MC) and latent or hidden Markov (HM) models are gaining increasing attention due to the possibility of handling the temporal structure of many observed phenomena . In particular, with only one categorical response, an MC model allows studying the initial distribution of this response variable and its conditional distribution given the previous response. HM models, on the other hand, offer a practical model-based method for dynamic clustering and classification . This class of models finds application in the analysis of time-series and longitudinal data . The model formulation involves a sequence of individual- and time-specific discrete latent (hidden) variables, resulting in a hidden process assumed to follow a first-order Markov chain. The states of this chain correspond to latent clusters or subpopulations of individuals with similar latent characteristics. The approach can handle responses of different types, whether continuous or categorical. In the latter case, the conditional distribution of these variables is freely parameterized on the basis of conditional response probabilities.
In this article we provide a practical introduction to MC and HM models through an overview of the LMest package, focusing in particular on the new features of this package compared to previous versions, such as the one illustrated in . The package allows analyzing time-series and longitudinal data by the models at issue. In the longitudinal context, repeated measurements of possibly multiple characteristics are taken over time on several sampling units (e.g., individuals), so that it is possible to describe the evolution of these characteristics over time.
The package in its updated version 3.2.5 is available on CRAN at LMest and it is also described with detailed vignettes, including applicative examples. Each function is well documented in the help provided within the package, and several datasets from surveys and other sources are included in the package in both wide and long formats.
The current version of the package has the following features, which were also present in the previous versions:
it is designed to estimate different model formulations, such as MC, HM, and mixed HM models, through the Expectation-Maximization (EM) algorithm ;
it allows the estimation of the effect of covariates under different parameterizations and model specifications;
model selection procedures using the Akaike Information Criterion and the Bayesian Information Criterion are included, relying on different parameter initialization strategies;
functions to produce local and global decoding are implemented to perform dynamic clustering;
standard errors for the parameter estimates are obtained either through exact computation or reliable approximations of the observed information matrix; parametric bootstrap procedures are also available for different model specifications;
in order to increase the computational efficiency, Fortran routines are used to perform numerical operations.
Some important extensions have been introduced in the most recent version of the package. The features of these new functions can be summarized as follows:
in addition to categorical data, the package can handle continuous data, assuming a Gaussian distribution for the response variables conditional on the latent process. It also accommodates missing responses, drop-out, and intermittent missingness under the missing-at-random assumption ;
when dealing with continuous outcomes, covariates may be included under different model specifications, similar to the approach used for categorical outcomes;
simulations from almost all the available model specifications may be carried out through a suitable method;
data in long format can be provided as input, and the model specification follows the common style; formulas for the response variables, as well as for the initial and transition probabilities can be specified using the package Formula ;
parameter estimates can be graphically displayed to enhance the interpretation of the results.
Overall, the current version of the package offers great flexibility in model formulation and estimation.
In the following sections, after covering the main theoretical aspects of the models, we provide an overview of the LMest software package through examples using univariate and multivariate data with and without missing values. The package uses the S3 object-oriented system, defining three main classes of objects. We use both synthetic and real data to illustrate the models, where the synthetic data are generated using functions available within the package. These data are similar to those used in previous applied works published in the authors’ research articles or by other researchers in the field. However, we prefer to use simulated versions because the original data are not always publicly available. Additionally, using such data allows us to incorporate specific features that are useful for illustrating certain functions of the package. When describing the data, we will refer to potential real-world applications to characterize the research questions of interest, thereby better motivating the adopted models. Specifically:
the MC model is used to analyze longitudinal data about labor market careers after graduation;
the HM model for categorical longitudinal responses is adopted to examine the customer’s purchase behavior over time;
the HM model for continuous time-series data is used to discover financial market phases;
the HM model for continuous longitudinal responses is used to investigate the progression of illness states in patients after treatment and the progression of students’ performance in different types of schools.
In the examples we use set.seed(x)
both when we generate
the data and fit the models so that all the output can be replicated.
The code for replicability of the simulated data can be provided upon
request.
In the following we introduce the models, examples, and code in an expository style to demonstrate the use of specific functions of LMest. More details about MC and HM models are provided in . The remainder of the paper is structured as follows: the next section presents the MC model, while the third section introduces the HM model for categorical longitudinal data. The fourth section presents the HM model for continuous responses, which allows for missing values and individual covariates. Finally, the fifth section provides a summary, mentions some other similar packages, and discusses additional issues related to future package extensions.
In the following we first illustrate the assumptions of the MC model for categorical responses and then an application based on a longitudinal categorical response and covariates.
In the context of longitudinal data, the MC model is referred to as the transition model since it is of interest to estimate the transition between states of the stochastic process. With only one categorical response variable, let \(\mathbf{Y}_{i} = (Y_{i1},\ldots,Y_{iT})^\prime\) denote the vector of such variables for individual \(i\), with \(i=1,\ldots,n\), where \(Y_{it}\) has \(k\) categories labeled from 1 to \(k\). Let \(\boldsymbol{x}_{it}\) denote the vector of individual covariates available at the \(t\)-th time occasion for individual \(i\), with \(t=1,\ldots,T\).
The main assumption of the model is that, for \(t=3,\ldots,T\), \(Y_{it}\) is conditionally independent of \(Y_{i1}, \ldots, Y_{i,t-2}\) given \(Y_{i,t-1}\) when a first-order dependence structure is formulated. Moreover, in its basic version, the model is characterized by initial and transition probabilities that, for every \(i = 1,\ldots, n\), are denoted by \[ \pi_{u} = p(Y_{i1} = u), \quad u = 1,\ldots,k \] and \[ \pi_{uv} = p(Y_{it} = v | Y_{i,t-1}= u),\quad t=2,\ldots,T,\: u,v=1,\ldots,k, \] respectively.
In the presence of individual time-fixed and time-varying covariates, the initial and transition probabilities are denoted as \(\pi_{i,u}= p(Y_{i1} = u |\boldsymbol{x}_{i1})\), \(u=1,\ldots,k\), and \(\pi_{it,uv} = p(Y_{it} = v | Y_{i,t-1}= u, \boldsymbol{x}_{it})\), \(t=2,\ldots,T,\: u,v=1,\ldots,k\), respectively. Note that these probabilities are now individual-specific, and their dependence on the vector of covariates is formulated on the basis of multinomial logit models . In particular, for the initial probabilities we have \[\begin{equation} \log\frac{\pi_{i,u}}{\pi_{i,1}}= \beta_{0u} + \boldsymbol{x}_{i1}^\prime \boldsymbol{\beta}_{1u},\quad u=2,\ldots,k. \tag{1} \end{equation}\] Note that the multinomial logit model in (1) uses the first category as the reference. For the transition probabilities we assume \[\begin{equation} \log\frac{\pi_{it,uv}}{\pi_{it,uu}}= \gamma_{0uv} + \boldsymbol{x}_{it}^\prime \boldsymbol{\gamma}_{1uv},\quad t=2,\ldots,T, \: u,v = 1,\ldots,k,\: u \neq v, \tag{2} \end{equation}\] where the reference category of each logit is the one corresponding to the row of the transition matrix. In the above expressions, \(\boldsymbol{\beta}_u = (\beta_{0u}, \boldsymbol{\beta}_{1u}^\prime)^\prime\) and \(\boldsymbol{\gamma}_{uv} = (\gamma_{0uv},\boldsymbol{\gamma}_{1uv}^\prime)^\prime\) are parameter vectors to be estimated.
Denoting by \(\boldsymbol{\theta}\) the vector of all model parameters, the log-likelihood is defined as \[ \ell(\boldsymbol{\theta}) = \sum_{i=1}^n \log f(y_{i1},\ldots,y_{iT}|\boldsymbol{x}_{i1},\ldots, \boldsymbol{x}_{iT}), \] where \(f(y_{i1},\ldots,y_{iT}|\boldsymbol{x}_{i1},\ldots, \boldsymbol{x}_{iT})\) is the probability of the observed vector of response variables for unit \(i\) and \(y_{it}\) is a realization of \(Y_{it}\). Under the assumptions formulated for the MC model, and in the presence of individual covariates, this probability may be expressed as
\[ f(y_{i1},\ldots,y_{iT}|\boldsymbol{x}_{i1},\ldots \boldsymbol{x}_{iT}) = \pi_{i,y_{i1}} \prod_{t=2}^{T}\pi_{it,y_{i,t-1}y_{it}}. \]
For the basic MC model without covariates, explicit formulas are available for the maximum likelihood estimates of the parameters. Under more complex models, for example when individual covariates are included as in Equations (1) and (2), an iterative algorithm, such as the Newton-Raphson or the Fisher-scoring algorithm, is required to maximize the above log-likelihood. As is well known, at each step of these algorithms the parameter vector is updated by adding to its current value the product of the inverse of the observed or expected information matrix and the score vector, that is, the vector of first derivatives of \(\ell(\boldsymbol{\theta})\). These steps are performed until a suitable convergence criterion is satisfied. The corresponding asymptotic standard errors can be obtained in the usual way from the diagonal elements of the inverse of the information matrix; see for more details.
Finally note that, although we illustrate the MC model only for the case of balanced longitudinal data, this model can obviously be applied also to unbalanced data, when we have a specific number of observations \(T_i\) for each individual \(i\). However, for this aim, it is convenient to use the function of LMest for HM models under a specific constraint, as will be clarified in the following.
The present example illustrates an application of the MC model with a
univariate categorical response and time-fixed covariates. The data
provided within the package, named data_employment_sim
, are simulated
supposing they come from a survey context, where interviews are
conducted among a nationally representative sample of graduates about
their employment status after graduation; for analyses of similar data
see . In this scenario, the binary response variable
(emp
) is recorded for three time occasions corresponding to one, two,
and three years after graduation. Individual covariates, which are
assumed to be time-fixed, are the geographical location of the
university where individuals graduated, assuming the country is divided
into two main areas (area
: 1 for South, 2 for North), the grade at
graduation categorized into three levels (grade
: 1 for low, 2 for
medium, and 3 for high), and the indicator for parents’ educational
university qualification (edu
: 1 for university degree, 0 otherwise).
In simulating these data, we set the sample size equal to \(n=585\), and
we consider \(T=3\) time occasions.
According to the simulated context, analyzing such data with the MC model allows us to identify the employment patterns immediately after graduation and evaluate trends over time. Additionally, the research question is whether the probability of employment in the first period after graduation depends on the geographical area or on the final degree grade. Then, we can explore if and how family background influences the probability of employment.
The data given as input to the estimation functions must be in long format, that is, a data frame with one row for each combination of unit \(i\) and time occasion \(t\), with \(i=1,\ldots, n\) and \(t = 1,\ldots, T\), so that the total number of rows of this object is \(n T\). All columns should be named; one column must correspond to the unit identifier, typically named id, and another column must correspond to the time occasions, typically named time.
The data stored in the long format can be described by the usual
summary()
method after they have been prepared by function
lmestData()
, as follows:
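A minimal sketch of this step is the following, where the data are loaded from the package and the id and time argument names of lmestData() are assumed from the package documentation:

library(LMest)
data("data_employment_sim")

# Prepare the long-format data and print descriptive summaries
dt <- lmestData(data = data_employment_sim, id = "id", time = "time")
summary(dt)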
In particular, the frequency distribution of the response variable for each wave is reported below:
Data Info:
----------
Observations: 585
Time occasions: 3
Variables: 1
Proportion:
----------
time emp
1:585 0:0.411396
2:585 1:0.588604
3:585
Proportion by year:
----------
Time = 1
time emp
1:585 0:0.4615385
1:0.5384615
Time = 2
time emp
2:585 0:0.4034188
1:0.5965812
Time = 3
time emp
3:585 0:0.3692308
1:0.6307692
To estimate an MC model with covariates we can use the lmestMc()
function. In this example, the interest is in estimating the effect, on
the employment at the first interview, of the area where the students
graduated and of their grade, and in evaluating the impact of the
parents’ educational level only for the periods following the first interview. Accordingly, covariates area and grade are included in the responsesFormula argument so as to affect the initial probabilities,
while edu
is included in the parameterization of the transition
probabilities. The covariates for initial and transition probabilities
are separated by the symbol “|
”. Note that we use the
Formula package for all formula operators
. Moreover, the argument index
is required, which
is a character vector to specify the name of the unit identifier (first
element) and that of the time occasions (second element). Standard
errors for the parameter estimates are provided by setting
out_se = TRUE
as in the following:
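The estimation call, reconstructed from the Call reported in the summary output further below, is:

# MC model with covariates on the initial (area, grade) and transition (edu)
# probabilities; standard errors and additional output are requested
mod1 <- lmestMc(responsesFormula = emp ~ as.factor(area) + as.factor(grade) |
                  as.factor(edu),
                data = dt, index = c("id", "time"),
                out_se = TRUE, output = TRUE, seed = 345)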
Note that, in the above code, the data object dt
returned by the
lmestData()
function can be directly passed to the estimation
function. The tolerance level to check convergence of the EM algorithm
and the maximum number of iterations of this algorithm are set by
default as tol = 10^-8
and maxit = 1000
.
Using option output = TRUE
, the lmestMc()
function also returns the
subject-specific estimated initial probabilities, collected in object
Piv
, and the estimated subject- and time-specific transition
probabilities, collected in PI
. To summarize these results, it is
convenient to calculate suitable averages of these probabilities.
Provided that the estimation output is stored in object mod1
, for the
initial probabilities we have:
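Assuming that Piv is the \(n \times 2\) matrix of estimated initial probabilities, a simple way to average them over individuals is:

# Average the subject-specific initial probabilities over individuals
round(colMeans(mod1$Piv), 3)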
These estimates indicate that about 46% of individuals are unemployed at the beginning of the period of observation, that is, one year after graduation.
The array PI
containing the estimates of the transition probabilities
has dimension \(2 \times 2 \times 585 \times 3\), where 2 is the number of
response categories, 585 is the sample size, and 3 is the number of time
occasions. The averaged transition probabilities, over all individuals
and time occasions, are computed as follows starting from object mod1
where such probabilities are stored:
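A possible way to compute these averages is the following; the indexing over the last dimension (occasions 2 and 3, since transitions are defined from \(t=2\) onwards) is an assumption:

# Average the subject- and time-specific transition probabilities
round(apply(mod1$PI[, , , 2:3], c(1, 2), mean), 3)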
category
category 0 1
0 0.761 0.239
1 0.088 0.912
From these results we observe that unemployed individuals have a probability close to 0.76 of remaining without a job between two successive time occasions. Employed individuals show a higher persistence in the same employment status. The probability of finding a job from one wave to the next is around 0.24.
A summary of the output of the lmestMc()
function including parameter
estimates can be printed using the summary()
method:
summary(mod1)
Call:
lmestMc(responsesFormula = emp ~ as.factor(area) + as.factor(grade) |
as.factor(edu), data = dt, index = c("id", "time"), out_se = TRUE,
output = TRUE, seed = 345)
Coefficients:
Be - Parameters affecting the logit for the initial probabilities:
logit
2
(Intercept) -0.4576
as.factor(area)2 0.4382
as.factor(grade)2 0.1536
as.factor(grade)3 1.6676
Standard errors for Be:
logit
2
(Intercept) 0.1681
as.factor(area)2 0.1847
as.factor(grade)2 0.1988
as.factor(grade)3 0.2529
p-values for Be:
logit
2
(Intercept) 0.006
as.factor(area)2 0.018
as.factor(grade)2 0.440
as.factor(grade)3 0.000
Ga - Parameters affecting the logit for the transition probabilities:
logit
1 2
(Intercept) -1.6015 -2.1789
as.factor(edu)1 2.1834 -2.6251
Standard errors for Ga:
logit
1 2
(Intercept) 0.1257 0.1423
as.factor(edu)1 0.3128 1.0141
p-values for Ga:
logit
1 2
(Intercept) 0 0.00
as.factor(edu)1 0 0.01
Note that the summary()
method also returns the standard errors for
the parameter estimates and the corresponding significance level when
option out_se = TRUE
is included in the main estimation function. The output Be returned by the function provides the estimated regression
parameters of the logistic model for the probability of belonging to
category 1 (employed) versus category 0 (unemployed) at the first
interview. For interpretation, a higher degree grade positively affects
the employment probability one year after graduation. Specifically, the
log-odds ratios are equal to 0.154 for those with a medium grade and
1.668 for those with a high grade, with respect to individuals with a
low grade, all the other covariates held fixed.
Moreover,
the log-odds ratio for area
is positive, indicating that, at the
beginning of the study, individuals who graduated from a university
located in the North of the country are more likely to be employed than
those who graduated from a university located in the South, with other
covariates held constant.
Regarding the estimates referring to the transition probabilities reported in the output Ga, we may conclude that, apart from the intercept, a higher level of parents’ education positively affects the probability of transition from the first to the second category (being employed after the first interview). In terms of odds ratio, we have \(\exp(2.183) \approx 8.9\) compared to individuals with less educated parents. Moreover, having highly educated parents has a negative effect on the transition from the second to the first category, with an estimated odds ratio of \(\exp(-2.625) \approx 0.07\).
We outline the main notation and assumptions of the HM models useful to understand the output of the illustrated estimation functions of the LMest package.
For a sample of \(n\) individuals and \(T_i\) time occasions, let \(\boldsymbol{Y}_{it}\), \(i=1,\ldots,n\), \(t=1,\ldots,T_i\), be the observable vector of response variables with elements \(Y_{ijt}\), \(j=1,\ldots,r\), where \(r\) denotes the number of response variables. Note that the number of time occasions \(T_i\) is specific to each individual \(i\); in this way, we allow for unbalanced panels in which individuals are observed at a different number of time occasions. Let also \(\boldsymbol{U}_i = (U_{i1},\ldots,U_{iT_i})^\prime\) denote the latent process for each individual \(i\) that affects the distribution of the response variables and is assumed to follow a first-order Markov chain with a given number \(k\) of latent (or hidden) states. The Markov properties of \(U_{it}\) are the same as those illustrated in Section 2.1. However, here we use the notation \(U_{it}\), which stands for “unobserved”, to emphasize that the process is latent/hidden, whereas \(Y_{it}\) in Section 2.1 is observable. The response variables are assumed to be conditionally independent given this latent process; this assumption is referred to as local independence. It means that the latent process, together with possible covariates that may be time-fixed or time-varying, fully explains the underlying phenomenon.
The HM model is characterized by two components: the measurement (sub)model, which describes the conditional distribution of the response variables given the latent process, and the structural or latent (sub)model, which describes the distribution of the latent process. When dealing with categorical outcomes, the formulation of the measurement (sub)model without covariates is based on the parameters \[\begin{equation} \phi_{jy|u} = p(Y_{ijt} = y | U_{it} = u),\quad j = 1,\ldots,r, \: t = 1,\ldots,T_i, \: u = 1, \ldots, k,\: y = 0,\ldots, l_j - 1, \tag{3} \end{equation}\] where \(l_j\) corresponds to the number of response categories of the \(j\)-th variable in \(\boldsymbol{Y}_{it}\).
Regarding the latent (sub)model, the typical assumption is that the latent Markov chain is of first order and, for the initial and transition probabilities, we can use the same notation previously adopted for the MC model that now is formulated with reference to the latent variables rather than the observable variables. More precisely, we introduce the initial and transition probabilities \[\begin{eqnarray*} \pi_{u}&=&p(U_{i1}=u),\quad u=1,\ldots,k,\\ \pi_{uv}&=&p(U_{it}=v|U_{i,t-1}=u),\quad t=2,\ldots,T_i,\: u,v=1,\ldots,k. \end{eqnarray*}\] Note that, in the basic version of the HM model, the transition probabilities are assumed to be time homogeneous to reduce the number of free parameters. However, this assumption can be relaxed if it proves to be too restrictive.
When available, individual explanatory variables, collected in the
vectors \(\boldsymbol{x}_{it}\), may be included in the measurement model, so as to
account for the unobserved heterogeneity between units, or in the latent
model, so that they affect the initial and transition probabilities of
the Markov chain. In the latter case, these probabilities may be defined
on the basis of the same multinomial logit parameterizations adopted for
the MC model; see in particular Equations (1) and
(2). The parameterization of the transition
probabilities in Equation (2) is referred to as
paramLatent = "multilogit"
in the corresponding argument of the
estimation function lmest()
and is illustrated with an application to
health related data in .
In order to make the model more parsimonious, an alternative
differential logit parameterization can be used for the transition
probabilities, denoted as paramLatent = "difflogit"
. Such a
parameterization relies on logits based on the difference between two
vectors of parameters, for \(t=2,\ldots,T_i\): \[\begin{equation}
\log\frac{\pi_{it,uv}} {\pi_{it,uu}} =
\gamma_{0uv} + \boldsymbol{x}_{it}^\prime (\boldsymbol{\gamma}_{1v}-\boldsymbol{\gamma}_{1u}),\quad u,v=1,\ldots,k, \: u\neq v,
\tag{4}
\end{equation}\] where \(\boldsymbol{\gamma}_{11}= \boldsymbol{0}\) to ensure model
identifiability; see for an example of
application of this parameterization.
The LMest package also allows including individual covariates
in the measurement model through a suitable parameterization of the
conditional distribution of the response variables given the latent
states; see . This formulation can only be used
for univariate data by setting argument modManifest = "LM"
in the
lmest()
function. It is also possible to indicate an alternative model
specification by setting the option modManifest = "FM"
, which relies on
the assumption that the latent process has a distribution given by a
mixture of AR(1) processes with common variance and specific correlation
coefficients. See for an illustration of this
model; see also .
We illustrate maximum likelihood estimation in the general case in which covariates are available and are included in the distribution of the latent process. In this case, for a sample of \(n\) independent units, the model log-likelihood has the following expression: \[ \ell(\boldsymbol{\theta}) = \sum_{i=1}^n \log f(\boldsymbol{y}_{i1},\ldots,\boldsymbol{y}_{iT_i}|\boldsymbol{x}_{i1},\ldots,\boldsymbol{x}_{iT_i}), \] where \(\boldsymbol{\theta}\) is the vector of all free model parameters and \(f(\boldsymbol{y}_{i1},\ldots,\boldsymbol{y}_{iT_i}|\boldsymbol{x}_{i1},\ldots,\boldsymbol{x}_{iT_i})\) corresponds to the manifest distribution, that is, the probability of the responses provided by subject \(i\) given the covariates. This manifest probability has expression \[\begin{equation} f(\boldsymbol{y}_{i1},\ldots,\boldsymbol{y}_{iT_i}|\boldsymbol{x}_{i1},\ldots,\boldsymbol{x}_{iT_i}) = \sum_{u_1=1}^k\cdots\sum_{u_{T_i}=1}^k \pi_{i,u_1} \prod_{t=2}^{T_i}\pi_{it,u_{t-1}u_t} \prod_{t=1}^{T_i}f(\boldsymbol{y}_{it}|u_t), \tag{5} \end{equation}\] where \(f(\boldsymbol{y}_{it}|u)\) refers to the conditional distribution of the response variables for unit \(i\) at time occasion \(t\) given the latent process, which, in the case of categorical response variables, is freely parameterized by the conditional response probabilities in Equation (3).
In order to compute \(f(\boldsymbol{y}_{i1},\ldots,\boldsymbol{y}_{iT_i}|\boldsymbol{x}_{i1},\ldots,\boldsymbol{x}_{iT_i})\) while avoiding the sum in Equation (5), which has a computational cost that exponentially increases in \(T_i\), we rely on the Baum and Welch recursion that is described in , among others. The above log-likelihood function can be maximized by the EM algorithm based on the complete-data log-likelihood . The EM algorithm is characterized by a series of iterations consisting of two steps, named E- and M-step, which are repeated until convergence. The E-step computes the conditional expected value of the complete-data log-likelihood given the current value of the parameters and the observed data. The M-step consists in maximizing this expected value so as to update the model parameters. The convergence of the EM algorithm is assessed on the basis of a suitable convergence criterion relying on the relative log-likelihood difference between two consecutive iterations. From this iterative algorithm we obtain an estimate of \(\boldsymbol{\theta}\), denoted by \(\boldsymbol{\hat{\theta}}\).
For HM models, initialization of the estimation algorithm is an
important issue as the model log-likelihood is typically multimodal. The
LMest package allows performing a multi-start initialization
strategy, based on both deterministic and random rules, which is aimed
at suitably exploring the parameter space so as to increase the chance
to reach the global maximum of the model log-likelihood. With some
differences, this is done by functions lmest()
and lmestSearch()
;
the latter will be illustrated in the following.
Once the parameter estimates are computed, standard errors may be
obtained in the usual way on the basis of the observed information
matrix. This matrix is provided by LMest using either the
exact computation method proposed in or the
numerical method of , depending on the complexity of
the model of interest. The package also provides the bootstrap()
method to obtain standard errors by a parametric bootstrap procedure,
that is, by drawing samples from the estimated model and computing the
maximum likelihood estimates for every bootstrap sample. The standard
errors are obtained by computing, in a suitable way, the standard
deviation of the empirical distribution so obtained; see among others,
. An alternative nonparametric bootstrap procedure
can also be used, based on drawing a large number of samples with
replacement from the observed data.
Selection of the number of states is performed according to information criteria such as AIC and BIC . They are defined as follows: \[\begin{eqnarray*} AIC &=& -2 {\ell}(\boldsymbol{\hat{\theta}}) + 2 \;\#par,\\ BIC &=& -2 {\ell}(\boldsymbol{\hat{\theta}}) + \log(n)\;\#par, \end{eqnarray*}\] where \({\ell}(\boldsymbol{\hat{\theta}})\) denotes the maximized log-likelihood of the model of interest, \(n\) is the sample size, and \(\#par\) denotes the number of free parameters to be estimated. Other criteria, such as those based on entropy, may be used; see, among others, .
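Once a model has been fitted, these criteria can be computed directly from the maximized log-likelihood, the number of free parameters, and the sample size returned by the estimation functions (slots lk, np, and n, as listed among the available objects of a fitted model; the object name mod is hypothetical), although the aic and bic slots already provide them:

# AIC and BIC from the slots of a fitted LMest object (here called mod)
aic <- -2 * mod$lk + 2 * mod$np
bic <- -2 * mod$lk + log(mod$n) * mod$np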
Once the number of states is selected, dynamic clustering is performed
by assigning every unit to a latent state at each time occasion by means
of the estimated posterior probabilities of the \(U_{it}\). These
probabilities are directly provided by the EM algorithm and are defined
as \[\begin{equation}
\hat{a}_{it,u} = p(U_{it}=u | \boldsymbol{y}_{i1},\ldots, \boldsymbol{y}_{iT_i},\boldsymbol{x}_{i1},\ldots,\boldsymbol{x}_{iT_i}), \quad t=1,\ldots,T_i, \: u = 1,\ldots,k,
\tag{6}
\end{equation}\]
together with the pairwise posterior probabilities
\[
\hat{b}_{it,uv} = p(U_{i,t-1}=u, U_{it}=v | \boldsymbol{y}_{i1},\ldots,\boldsymbol{y}_{iT_i},\boldsymbol{x}_{i1},\ldots,\boldsymbol{x}_{iT_i}), \quad t=2,\ldots,T_i, \: u,v = 1,\ldots,k.
\]
Prediction of the latent states of every unit \(i\) at each time occasion
\(t\) is known as local decoding and it is obtained as the value of \(u\)
that maximizes the posterior probabilities in Equation
(6). As an alternative to the local decoding, the global
decoding may be performed, which is based on the Viterbi algorithm
to obtain the prediction of the latent
trajectories of a unit across time, that is, the a posteriori most
likely sequence of hidden states. The package provides the
lmestDecoding()
method to perform both local and global decoding based
on the different model specifications, as illustrated in the following
sections. Additional details on the model formulations, backward and
forward recursions, and estimation can be found in
.
As an illustration of the LMest package for estimation of HM
models with multivariate responses and covariates, we consider simulated
data provided in the package, named data_market_sim
. These data refer
to a hypothetical marketing context where a sample of \(n=200\) customers
of four different brands is observed along with the price of each
transaction, defined for \(T=5\) occasions, in Euros per purchase within
the following ranges: [0.1, 10], (10, 30], (30, 60], (60, 100], (100,
500]. For a similar application see, among others,
. Accordingly, we consider \(r=2\) response
variables (items), the first variable with \(l_1=4\) categories, one for
each brand, and the second variable with \(l_2=5\) categories, one for
each range of price. The initial part of the dataset, in long format, is
displayed below:
id time brand price age income
1 1 1 2 2 34 25
2 1 2 0 2 35 25
3 1 3 1 0 36 25
4 1 4 2 2 37 25
5 1 5 3 1 38 25
6 2 1 1 3 35 27
The research questions in this context concern the evolution of the customer’s purchase behavior over time by exploring market segments, while also considering the role of socio-demographic characteristics defined by the available individual covariates. In particular, we include age, as a time-varying continuous covariate, and the income of the customer at the time of the first purchase, as a time-fixed continuous covariate.
As already mentioned, the package provides function lmestSearch()
,
which searches for the global maximum of the log-likelihood of different
models and selects the optimal number of hidden states. This function
combines deterministic and random initializations with a rather wide
tolerance level. Moreover, starting from the best solution obtained from
these preliminary runs, a final run is performed, with a smaller
tolerance level, in order to increase the chance of reaching the global
maximum. The argument nrep
can be provided to set the number of random
initializations for each number of hidden states. The range of states to
be considered may be specified in argument k
as a vector of integers.
Generally, model selection is performed by estimating the multivariate
HM model without covariates, which is specified within the model
formula, indicated in the argument responsesFormula
, with ~ NULL
stating that there are no predictors for the two response variables as
follows:
hmm <- lmestSearch(responsesFormula = brand + price ~ NULL,
latentFormula = ~ NULL,
version = "categorical",
index = c("id", "time"),
data = data_market_sim,
k = 1:4, fort = TRUE,
seed = 12345)
By adding the optional fort = TRUE
argument, a faster estimation
procedure is achieved using Fortran routines. The summary()
method
returns the estimation results, displaying the AIC and BIC values for
the sequence of estimated hidden states:
summary(hmm)
Call:
lmestSearch(responsesFormula = brand + price ~ NULL, latentFormula = ~NULL,
data = data_market_sim, index = c("id", "time"), k = 1:4,
version = "categorical", seed = 12345, fort = TRUE)
states lk np AIC BIC
1 -2811.001 7 5636.003 5659.091
2 -2520.131 17 5074.263 5130.334
3 -2445.538 29 4949.075 5044.727
4 -2434.262 43 4954.524 5096.352
In this case, the BIC, used by default for model selection, supports a model with three hidden states. Once the model is selected, we estimate the parameters of the latent model with covariates by fixing the parameter values of the conditional response probabilities obtained under the model chosen above. In this way, the estimated subpopulations are held fixed. These conditional response probabilities are displayed below:
, , item = 1
state
category 1 2 3
0 0.69 0.08 0.10
1 0.10 0.45 0.12
2 0.17 0.40 0.08
3 0.05 0.07 0.70
4 NA NA NA
, , item = 2
state
category 1 2 3
0 0.30 0.04 0.00
1 0.43 0.19 0.05
2 0.10 0.64 0.05
3 0.12 0.11 0.30
4 0.04 0.02 0.60
From these results, we notice that the three states may represent distinct customer segments that can be labeled as follows: low-cost market segment (1st), ordinary segment (2nd), and luxury segment (3rd). These segments are described in more detail in the following.
Function lmest()
estimates the HM model with covariates that can
affect the latent structure in various ways. In the following, we assume
that age and income may influence only the transition probabilities, and
do not affect the composition of the market segments at the beginning of
the survey. This can be set through the latentFormula
argument, which
includes age + income
in the parameterization of the transition
probabilities. For the initial probabilities, we ignore the effects of
covariates and estimate only the intercept of the multinomial logit by
including NULL
in the formula before the symbol “|
”. Note that this
formulation is the same as the one expressed in Equation
(1) but without the vector of covariates, whereas for
the transition probabilities we consider the parametrization defined in
Equation (4) obtained by including the argument
paramLatent = "difflogit"
. Moreover, as an input to the lmest()
function, we also use the optional argument parInit
, which is a list
of initial model parameters that also includes argument fixPsi = TRUE
to avoid estimation of the conditional response probabilities. In this
way, the EM algorithm is performed to estimate the parameters of the
structural model while these probabilities are kept fixed at the value
displayed above. The same arguments can be used to require the
estimation of an MC model for unbalanced data as a constrained version
of an HM model with a number of states equal to the number of response
categories.
The HM model with three hidden states and fixed conditional response probabilities, chosen as described above, is estimated as follows:
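A sketch of the estimation step is reported below. The way the conditional response probabilities of the selected three-state model are extracted from the lmestSearch() output (here via the out.single component) is an assumption, while the lmest() call mirrors the Call shown in the output that follows:

# Conditional response probabilities of the selected model with k = 3
# (extraction via out.single is an assumption about the lmestSearch() output)
Psi <- hmm$out.single[[3]]$Psi

# HM model with covariates affecting only the transition probabilities,
# keeping the conditional response probabilities fixed at Psi
mod2 <- lmest(responsesFormula = brand + price ~ NULL,
              latentFormula = ~ NULL | age + income,
              data = data_market_sim, index = c("id", "time"),
              k = 3, paramLatent = "difflogit",
              parInit = list(Psi = Psi, fixPsi = TRUE),
              seed = 12345)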
The print()
method provides an overview of the estimation results
including the maximum log-likelihood, number of free parameters, and AIC
and BIC values that can be used for model selection.
print(mod2)
Basic Latent Markov model with covariates in the latent model
Call:
lmest(responsesFormula = brand + price ~ NULL, latentFormula = ~NULL |
age + income, data = data_market_sim, index = c("id", "time"),
k = 3, paramLatent = "difflogit", parInit = list(Psi = Psi,
fixPsi = TRUE), seed = 12345)
Available objects:
[1] "lk" "Be" "Ga" "Psi"
[5] "Piv" "PI" "np" "k"
[9] "aic" "bic" "lkv" "n"
[13] "TT" "paramLatent" "ns" "yv"
[17] "Lk" "Bic" "Aic" "call"
[21] "data"
Convergence info:
LogLik np k AIC BIC n TT
-2444.437 12 3 4912.874 4952.454 200 5
The estimated HM model has a maximum log-likelihood of -2,444.437 with 12 parameters so that AIC and BIC indexes are equal to 4,912.874 and 4,952.454, respectively.
The plot()
method associated with lmest()
provides a variety of
displays. For example, a plot of the estimated conditional response
probabilities can be obtained as:
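Assuming "CondProb" is the value of the what argument corresponding to this display, the call would be:

# Plot the estimated conditional response probabilities for each state
plot(mod2, what = "CondProb")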
Figure 1: Plot of the estimated conditional response probabilities: item 1 (brand) on the left and item 2 (price) on the right. The top, middle, and bottom panels correspond to the estimates for the 1st state (low-cost segment), 2nd state (ordinary segment), and 3rd state (luxury segment), respectively.
as shown in Figure 1. According to the simulated context, from this figure we observe that the 1st state (low-cost segment), which includes the highest frequency of customers at the first time occasion (39%), is related to those who primarily purchase products of the first brand (category 0 of item 1) with low prices (categories 0 and 1 of item 2). Customers in the 2nd state (ordinary segment), corresponding to around 30% of all customers, tend to buy products of the second and third brands with medium prices. On the other hand, customers in the 3rd state (luxury segment) tend to purchase products of brand four (category 3 of item 1) with relatively high prices (categories 3 and 4 of item 2). This state includes around 31% of customers at the beginning of the period of observation.
The averaged estimated transition probabilities may be obtained by the following command:
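As for the MC model, these averages can be computed from the array PI returned by lmest(); the indexing over occasions 2 to 5 is an assumption:

# Average the estimated transition probabilities over individuals and occasions
round(apply(mod2$PI[, , , 2:5], c(1, 2), mean), 3)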
state
state 1 2 3
1 0.092 0.486 0.422
2 0.028 0.878 0.093
3 0.000 0.106 0.894
A plot showing the corresponding path diagram of the estimated transition matrix can be obtained as
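Assuming "transitions" is the corresponding value of the what argument:

# Path diagram of the averaged estimated transition probabilities
plot(mod2, what = "transitions")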
Figure 2: Path diagram of averaged estimated transition probabilities: nodes represent hidden states, and arrows are present when the estimated transition probability between two states is not null. Self-arrows indicate persistence in the same state if estimated. The reported numbers refer to the estimated values of the transition probability matrix.
as illustrated in Figure 2. From the results shown in this figure, we observe that customers have a fairly high probability of persisting in the 2nd and 3rd states from one time occasion to the next, as the corresponding estimated diagonal elements of the transition probability matrix are fairly high (0.88 and 0.89, respectively). This suggests that these two market segments are quite stable over time. Conversely, the highest estimated probabilities outside the main diagonal refer to the transitions from the 1st state (low-cost segment) to the 2nd state (ordinary) and from the 1st to the 3rd (luxury). This shows that customers of the 1st segment have a greater tendency to change their behavior over time.
The summary()
method returns the main estimation results:
summary(mod2)
Call:
lmest(responsesFormula = brand + price ~ NULL, latentFormula = ~NULL |
age + income, data = data_market_sim, index = c("id", "time"),
k = 3, paramLatent = "difflogit", parInit = list(Psi = Psi,
fixPsi = TRUE), seed = 12345)
Coefficients:
Be - Parameters affecting the logit for the initial probabilities:
logit
2 3
(Intercept) -0.2604 -0.196
Ga0 - Intercept affecting the logit for the transition probabilities:
logit
(Intercept) 2 3
1 -3.2126 -3.9261
2 1.4353 -2.8156
3 -5.4099 -1.5630
Ga1 - Regression parameters affecting the logit for the transition probabilities:
logit
2 3
age 0.0836 0.0897
income 0.0563 0.0674
Psi - Conditional response probabilities:
, , item = 1
state
category 1 2 3
0 0.6875 0.0816 0.1031
1 0.1000 0.4512 0.1216
2 0.1674 0.4010 0.0759
3 0.0450 0.0663 0.6993
4 NA NA NA
, , item = 2
state
category 1 2 3
0 0.3016 0.0353 0.0000
1 0.4344 0.1895 0.0505
2 0.1033 0.6407 0.0489
3 0.1172 0.1113 0.2981
4 0.0436 0.0232 0.6025
The output Ga
contains the estimated parameters affecting the
distribution of the transition probabilities based on the difflogit
parameterization. More in detail, Ga0
refers to the intercepts, while
Ga1
refers to the regression coefficients. Each column of Ga1
can be
interpreted as a general measure of attraction of the corresponding
state. From these results, we observe that the estimated effects of the
age and income of the customer are relatively small for each logit. Both
covariates have a positive impact, indicating that as age or income
increases, the probability of moving to more valuable segments also
increases, holding the other covariates constant. Thus, older
customers or those with higher income are more likely to move to
higher-value segments.
In this section, we illustrate the main notation used for HM models for continuous outcomes and discuss how to handle missing responses under the MAR assumption . We also illustrate the inclusion of covariates in both the latent and measurement models. Finally, we introduce three examples of the application of these models using the appropriate functions of the LMest package.
When dealing with continuous outcomes, let \(\boldsymbol{Y}_{it} = (Y_{i1t}, \ldots, Y_{irt})^\prime\) denote the vector of \(r\) continuous response variables measured at time \(t\) for subject \(i\). As usual, under the local independence assumption, the response vectors \(\boldsymbol{Y}_{i1}, \ldots, \boldsymbol{Y}_{iT_i}\) are assumed to be conditionally independent given the latent process \(\boldsymbol{U}_i\). Moreover, for the measurement model, we assume a conditional Gaussian distribution, that is, \[\begin{equation} \boldsymbol{Y}_{it}|U_{it}= u \sim N(\boldsymbol{\mu}_u,{\boldsymbol{\Sigma}}_u),\quad u=1,\ldots,k.\tag{7} \end{equation}\]
The parameters of the measurement (sub)model are the conditional means \(\boldsymbol{\mu}_u\), \(u=1,\ldots,k\), which are state-specific, and the variance-covariance matrices \(\boldsymbol{\Sigma}_u\), which may be assumed to be constant across states under the assumption of homoscedasticity, that is, \(\boldsymbol{\Sigma}_1 = \cdots = \boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}\). This assumption, which reduces the number of estimated parameters and may be reasonable in many applied contexts, is adopted in the LMest package.
The parameters of the latent (sub)model are again the initial and transition probabilities of the Markov chain, as specified in the previous sections. As usual, when individual covariates collected in vectors \(\mathbf{x}_{it}\) are available, they may be included in the distribution of the latent variables, so as to affect the initial and transition probabilities. It is also possible to assume that individual covariates affect the distribution of the response variables given the latent process, thereby accounting for unobserved heterogeneity. The latter formulation is based on the assumption that \[ E(Y_{it}| U_{it} = u, \boldsymbol{x}_{it}) = {\alpha}_{u} + \boldsymbol{x}_{it}^\prime \boldsymbol{\beta},\quad u=1,\ldots,k, \] which naturally extends to the multivariate case.
The manifest distribution may be expressed with a formula that closely recalls the one used in Equation (5) for HM models for categorical responses, but with \(f(\boldsymbol{y}_{it}|u)\) that here denotes the density of the multivariate Gaussian distribution based on the assumption in Equation (7), possibly dependent on the covariates. Maximum likelihood estimation is carried out as illustrated in Section 3.2 through the EM algorithm and its extended version when individual covariates are included in the model. The LMest package can also handle missing values in the response variables. When the outcomes are continuous, we consider two different types of missing pattern under the MAR assumption: partially missing outcomes at a given time occasion and completely missing outcomes, that is, when individuals do not respond at one or more time occasions (intermittent pattern). Following , in the presence of missing responses, it is convenient to partition each response vector \(\boldsymbol{Y}_{it}\) as \((\boldsymbol{Y}_{it}^o, \boldsymbol{Y}_{it}^m)^\prime\), where \(\boldsymbol{Y}_{it}^o\) is the (sub)vector of observed variables, and \(\boldsymbol{Y}_{it}^m\) refers to the missing data. The conditional mean vectors and variance-covariance matrices may be decomposed into observed and missing components as follows: \[ \boldsymbol{\mu}_u = \begin{pmatrix} \boldsymbol{\mu}_u^{o} \\ \boldsymbol{\mu}_{u}^m \end{pmatrix}, \quad \boldsymbol{\Sigma}_u = \begin{pmatrix} \boldsymbol{\Sigma}_u^{oo} & \boldsymbol{\Sigma}_u^{om}\\ \boldsymbol{\Sigma}_u^{mo} & \boldsymbol{\Sigma}_u^{mm} \end{pmatrix}, \] where, for instance, \(\boldsymbol{\Sigma}_u^{om}\) is the block of \(\boldsymbol{\Sigma}_u\) collecting the covariances between the observed and missing responses. In this way, for the observed responses, we have that: \[ \boldsymbol{Y}_{it}^o|U_{it}=u\sim N(\boldsymbol{\mu}_u^o, \boldsymbol{\Sigma}_u^{oo}),\quad u=1,\ldots,k. \tag{8} \] The manifest distribution of the responses is now expressed with reference to the observed data. Model inference is carried out using an extended version of the EM algorithm based on suitable recursions, as illustrated in . In particular, at the E-step the algorithm also computes additional expected values arising from the MAR assumption for the missing observations. Under this model formulation, it is also possible to perform a type of multiple imputation, which allows us to predict the missing responses either conditionally or unconditionally on the predicted states. Within the LMest package, when the missing data are dealt with under the MAR assumption, the unconditional prediction of the missing responses is performed by computing \[ \tilde{\boldsymbol{y}}_{it} = \sum_{u=1}^k \hat{a}_{it,u} E(\boldsymbol{Y}_{it}|\boldsymbol{y}_{it}^o, u), \] where \(\hat{a}_{it,u} = p(U_{it}=u|\boldsymbol{y}_{i1}^o,\ldots,\boldsymbol{y}_{iT_i}^o, \boldsymbol{x}_{i1},\ldots,\boldsymbol{x}_{iT_i})\) and \(E(\boldsymbol{Y}_{it}|\boldsymbol{y}_{it}^o, u)\) are posterior expected values computed at the E-step.
Finally, both local and global decoding may be performed as usual. In this context, it is important to note that the prediction of the latent states is also carried out for units with missing responses.
This section illustrates an application of the HM model to multivariate time-series data; for more details, see . Specifically, we use real data from Yahoo Finance, covering S&P 500 Index (SP500TR) and Intel stock prices (INTC). These data are sourced from the package quantmod . The dataset includes the closing prices for each trading day from March to August 2022. For the following application, we then compute the percentage returns based on these closing prices.
The analysis of financial time-series data is typically focused on identifying market phases associated with the volatility of returns. In this context, latent states are referred to as regimes, and it is valuable to detect any abrupt and persistent changes in these regimes.
require(quantmod)
SP500_22 <- getSymbols("^SP500TR",
env = NULL,
from = "2022-03-01",
to = "2022-08-31",
periodicity = "daily")
dim(SP500_22)
[1] 127 6
INCT_22 <- getSymbols("INTC",
env = NULL,
from = "2022-03-01",
to = "2022-08-31",
periodicity = "daily")
sp_sreturns <- dailyReturn(SP500_22)*100
intc_sreturns <- dailyReturn(INCT_22)*100
data_fin <- cbind(1, 1:length(sp_sreturns), sp_sreturns, intc_sreturns)
names(data_fin) <- c("id", "time", "SP", "INCT")
head(round(data_fin, 3))
id time SP INCT
2022-03-01 1 1 -1.304 -1.515
2022-03-02 1 2 1.868 4.378
2022-03-03 1 3 -0.513 -1.923
2022-03-04 1 4 -0.786 0.292
2022-03-07 1 5 -2.951 -0.811
2022-03-08 1 6 -0.721 -0.378
data_fin[which.min(data_fin$SP),]
id time SP INCT
2022-05-18 1 56 -4.016448 -4.617124
data_fin[which.min(data_fin$INCT),]
id time SP INCT
2022-07-29 1 105 1.431582 -8.562069
The two series are displayed in Figures 3 and 4, while descriptive statistics are reported in the following:
Figure 3: Observed percentage returns of Standard and Poor’s 500 Index from March to August 2022 (127 days).
Figure 4: Observed percentage returns of Intel stock from March to August 2022 (127 days).
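The descriptive statistics below can be obtained, for instance, as follows (the column selection is an assumption):

# Descriptive statistics of the two return series
summary(data_fin[, c("SP", "INCT")])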
Index SP INCT
Min. :2022-03-01 Min. :-4.01645 Min. :-8.5621
1st Qu.:2022-04-13 1st Qu.:-0.89097 1st Qu.:-1.8511
Median :2022-05-31 Median :-0.06888 Median :-0.1265
Mean :2022-05-30 Mean :-0.05265 Mean :-0.2769
3rd Qu.:2022-07-16 3rd Qu.: 1.09352 3rd Qu.: 1.1456
Max. :2022-08-30 Max. : 3.05815 Max. : 6.9401
We can easily estimate the HM model for continuous data with three
hidden states, assuming they may represent bear, bull, and intermediate
market regimes. The function of the package that can be used to estimate
this model is lmestCont()
; the basic version of the model, without
covariates, can be specified by setting the argument
responsesFormula = SP + INCT ~ NULL
.
As for the other estimation functions, the argument index
is required
to specify the name of the unit identifier and the indicator of the time occasions. The
model can be fitted with the constraint of time-homogeneity, by setting
the argument modBasic = 1
, which ensures that the transition
probabilities do not depend on \(t\). Argument tol = 10^-6
is used to
set the tolerance level for the convergence of the algorithm; by
default, this is set to tol = 10^-10
. We can also specify maxit
,
which is an integer that sets the maximum number of EM iterations; by
default, this is set to 5,000.
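The model is then fitted as follows; the call mirrors the one reported in the summary output below:

# HM model for continuous responses with k = 3 states and
# time-homogeneous transition probabilities
mod3 <- lmestCont(responsesFormula = SP + INCT ~ NULL,
                  data = data_fin, index = c("id", "time"),
                  k = 3, modBasic = 1, tol = 10^-6)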
A summary of the estimation results is obtained with the summary()
method:
summary(mod3)
Call:
lmestCont(responsesFormula = SP + INCT ~ NULL, data = data_fin,
index = c("id", "time"), k = 3, modBasic = 1, tol = 10^-6)
Coefficients:
Initial probabilities:
est_piv
[1,] 1
[2,] 0
[3,] 0
Transition probabilities:
state
state 1 2 3
1 0.1265 0.2715 0.6020
2 0.0389 0.9610 0.0001
3 0.5113 0.0001 0.4887
Mu - Conditional response means:
state
item 1 2 3
SP -2.5609 0.2353 0.6531
INCT -2.9240 -0.0996 1.1140
Si - Variance-covariance matrix:
[,1] [,2]
[1,] 1.5253 1.7211
[2,] 1.7211 4.3718
This summary output shows the estimated initial and transition probabilities of the three latent states, the estimated cluster means, and the variance-covariance matrix. The three states represent different market regimes that can be interpreted according to the estimated means as follows: the 1st state identifies negative returns, the 2nd state corresponds to positive returns for S&P 500 (SP) and negative for Intel stock (INCT), and the 3rd state identifies positive returns for both. The main transitions are from the 1st to the 2nd state (0.272), from the 1st to the 3rd state (0.602), and from the 3rd to the 1st state (0.511).
The contour plot of the estimated overall density, with weights given by
the estimated marginal probabilities of the latent states, is shown in
Figure 5. It is obtained by the plot()
method using
the argument what = "density"
as follows:
plot(mod3, what = "density")
Figure 5: Estimated density surface of the bivariate distribution of Standard and Poor’s 500 Index and Intel stock prices observed from March to August 2022 (127 days)
The density regions of each estimated component (regime), represented as
a contour plot, can be visualized by setting the argument
components = c(1, 2, 3)
. The resulting plot is shown in Figure
6.
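A call along these lines, combining the what and components arguments described above, produces it:

# Contour plot of the estimated state-specific densities
plot(mod3, what = "density", components = c(1, 2, 3))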
Figure 6: Estimated density surfaces for each hidden state (component) represented as contour levels of Standard and Poor’s 500 Index and Intel stock prices observed from March to August 2022 (127 days).
In the context of financial time-series, a relevant issue is path
prediction, that is, finding the most likely sequence of latent states
for a given unit on the basis of the observed values. As already
mentioned, both local and global decoding are implemented in the
lmestDecoding()
method, which allows us to predict the sequence of
latent states for a sample unit on the basis of the output of the
different estimation functions, as follows:
deco3 <- lmestDecoding(mod3)
Object deco3
contains the local (object Ul
) and global (object Ug
)
decoding (obtained through the Viterbi algorithm). For example, for the
local decoding, we have:
deco3$Ul
[1] 1 3 3 3 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[33] 2 2 2 2 2 1 3 1 3 3 1 3 3 3 1 3 1 3 1 3 3 3 3 1 2 2 2 2 2 2 2 2
[65] 2 2 2 2 2 2 1 1 1 3 3 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[97] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
These sequences can be depicted as follows:
data_fin$Ul <- deco3$Ul
plot(data_fin$Ul, ylab="State")
Figure 7: Predicted sequence of hidden states from March to August 2022 (127 days) under the HM model with \(k=3\) estimated for Standard and Poor’s 500 Index and Intel stock prices.
From Figure 7 we can observe distinct periods characterized by different market regimes, as indicated by the decoded state sequence. Initially, a neutral market regime is observed, followed by alternating bullish and bearish market regimes, and finally, a neutral market regime predominates during the last period.
Data provided in the package, named data_heart_sim
, are simulated from
a hypothetical observational retrospective study designed to assess the
progression of health states of individuals after treatment. The dataset
includes observations on systolic and diastolic blood pressure
(variables sap
and dap
in mmHg) and heart rate (variable hr
in
bpm), as well as covariates such as fluid administration (variable
fluid
in ml/kg/h), gender (variable gender
: 1 for male, 2 for
female), and age (variable age
in years). Note that fluid
is a
time-varying covariate. The initial part of the dataset is as follows:
id time sap dap hr fluid gender age
1 1 1 117 67 75 2800 1 56
2 1 2 111 69 93 1750 1 56
3 1 3 102 60 108 2320 1 56
4 1 4 123 78 105 2600 1 56
5 1 5 102 57 99 1700 1 56
6 1 6 86 60 96 2210 1 56
We assume that these measurements are recorded daily after a particular
surgery for up to six days, so that \(T=6\), and some patients have
missing records for one or more outcomes. The number of patients is
\(n=125\). Missing responses are denoted with NA
in the data provided to
the estimation function. Note that the package does not support missing
values in the covariates, which should be imputed separately. The data
are simulated to include a proportion of missing values of around 15%.
We estimate an HM model with three hidden states, including covariates
that affect both the initial and transition probabilities, and where
missing data are handled under the MAR assumption. This model is
specified using the full set of responses through the function
lmestCont()
as follows:
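The call, mirroring the Call reported in the summary output below, is:

# HM model with k = 3 states, covariates on the initial and transition
# probabilities, and missing responses handled under the MAR assumption
mod4 <- lmestCont(responsesFormula = sap + dap + hr ~ NULL,
                  latentFormula = ~ fluid + as.factor(gender) + age,
                  data = data_heart_sim, index = c("id", "time"),
                  k = 3, miss.imp = FALSE)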
Note that it is possible to specify how to deal with missing responses
by setting the logical argument miss.imp
. The MAR assumption is
adopted if this argument is set to FALSE
(the default value).
Otherwise, missing outcomes are imputed using the imp.mix()
function
from the package mix before starting the
estimation.
The following command provides a summary of the estimation results:
summary(mod4)
Call:
lmestCont(responsesFormula = sap + dap + hr ~ NULL, latentFormula = ~fluid +
as.factor(gender) + age, data = data_heart_sim, index = c("id",
"time"), k = 3, miss.imp = FALSE)
Coefficients:
Be - Parameters affecting the logit for the initial probabilities:
logit
2 3
(Intercept) -0.0348 0.0258
fluid 0.0003 0.0001
as.factor(gender)2 -0.4704 0.3428
age -0.0290 -0.0096
Ga - Parameters affecting the logit for the transition probabilities:
, , logit = 1
logit
2 3
(Intercept) -2.1318 -2.3075
fluid -0.0007 0.0001
as.factor(gender)2 2.1343 0.0536
age -0.0047 0.0076
, , logit = 2
logit
2 3
(Intercept) -0.9958 -1.9856
fluid -0.0025 -0.0001
as.factor(gender)2 4.3148 2.4297
age -0.0493 -0.0641
, , logit = 3
logit
2 3
(Intercept) -2.2754 -2.4416
fluid 0.0008 0.0007
as.factor(gender)2 0.3736 -1.9600
age -0.0700 -0.0342
Mu - Conditional response means:
state
1 2 3
sap 101.0068 104.9304 127.5081
dap 61.9488 66.9759 73.6769
hr 74.2144 102.1859 83.0373
Si - Variance-covariance matrix:
[,1] [,2] [,3]
[1,] 197.0381 55.0938 12.7835
[2,] 55.0938 107.8801 21.0515
[3,] 12.7835 21.0515 123.8274
Considering the estimated conditional means under the HM model with \(k = 3\) latent states, we observe that these states are ordered increasingly according to the values of the estimated means for both systolic (sap) and diastolic (dap) blood pressure. The estimated regression parameters related to the initial probabilities are stored in object Be. The log-odds ratio for gender is positive for the 3rd state and negative for the 2nd, indicating that, compared to males, females are more likely to start in the 3rd state and less likely to start in the 2nd state, relative to the 1st, one day after surgery, holding the other covariates constant.
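As an illustration, the estimated initial probabilities for a given covariate profile can be recovered from these coefficients (a minimal sketch, assuming the estimates are stored in mod4$Be as displayed above and that the 1st state is the reference category of the multinomial logit):
x <- c(1, 2000, 1, 60)               # intercept, fluid = 2000 ml/kg/h, female, age = 60
eta <- c(0, x %*% mod4$Be)           # logits for states 1 (reference), 2, and 3
round(exp(eta) / sum(exp(eta)), 3)   # estimated initial probabilities for this profile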
The output Ga
refers to the estimated regression parameters on the
transition probabilities. For example, the values in the 1st column of
Ga[,,3]
measure the influence of each covariate on the transition
from the 3rd to the 1st state. For females, the probability of this transition is higher than for males, and age has a negative effect on this transition.
Parametric bootstrap standard errors for the parameter estimates can be obtained using the bootstrap() method by specifying, as input argument, the object returned by the call to lmestCont(). The number of bootstrap samples can be specified through the argument B. The call is as follows:
boot <- bootstrap(mod4, B = 20)
The results of function bootstrap() include the average of the bootstrap estimates and the standard errors for the model parameters. For example, the estimated standard errors for the regression parameters affecting the logits of the initial and transition probabilities can be obtained as follows:
round(boot$seBe, 3)
logit
2 3
(Intercept) 0.014 0.031
fluid 0.000 0.000
as.factor(gender)2 0.206 0.402
age 0.012 0.011
round(boot$seGa, 3)
, , logit = 1
logit
2 3
(Intercept) 0.011 0.036
fluid 0.000 0.001
as.factor(gender)2 0.099 0.520
age 0.015 0.026
, , logit = 2
logit
2 3
(Intercept) 4.240 0.705
fluid 0.006 0.004
as.factor(gender)2 4.477 2.196
age 0.159 0.301
, , logit = 3
logit
2 3
(Intercept) 0.031 0.042
fluid 0.004 0.001
as.factor(gender)2 0.245 0.629
age 0.318 0.069
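These standard errors can be used, for instance, to obtain approximate Wald-type statistics for the regression parameters (a minimal sketch, assuming the point estimates are stored in mod4$Be and mod4$Ga with the same layout as the standard errors above):
round(mod4$Be / boot$seBe, 2)   # statistics for the parameters on the initial probabilities
round(mod4$Ga / boot$seGa, 2)   # statistics for the parameters on the transition probabilities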
We examine data collected by the Minnesota Department of Education for all Minnesota schools during the period 2008-2010, as illustrated in . The dataset is available in the GitHub repository at the link provided in the following chunk.
urlfile <- "https://raw.githubusercontent.com/proback/BeyondMLR/master/data/chart_wide_condense.csv"
data <- read.csv(url(urlfile))
n <- nrow(data); TT <- 3   # number of schools and number of time occasions
# Reshape the data from wide to long format: one record per school and year
data_school <- data.frame(id = rep(1:n, each = TT), time = rep(1:TT, n),
                          chart = rep(data$charter, each = TT),
                          sped = rep(data$schPctsped, each = TT),
                          math = c(t(data[, c("MathAvgScore.0",
                                              "MathAvgScore.1", "MathAvgScore.2")])))
The final part of the dataset in long format is displayed as follows:
id time chart sped math
1849 617 1 1 0.105 NA
1850 617 2 1 0.105 NA
1851 617 3 1 0.105 651.4
1852 618 1 1 0.455 NA
1853 618 2 1 0.455 NA
1854 618 3 1 0.455 631.2
The data refer to 618 schools, and the aim is to compare student performance in charter schools versus public schools by analyzing Minnesota Comprehensive Assessment average math scores of 6th to 8th graders in each school. The response variable (math) is recorded at three time occasions corresponding to years 2008, 2009, and 2010. The type of school (chart: coded as 1 for a charter school and 0 for a public school) and the proportion of special education students in a school (sped), based on 2010 figures, are time-fixed covariates. In this example, we investigate the effects of the two covariates on test performance, specifically examining whether there are differences between charter and public schools.
We observe that the school with id 618 has missing records for the math scores at both the first and second occasions. Additionally, there are 121 missing math scores out of a total of 1,854 records.
table(complete.cases(data_school$math))
FALSE TRUE
121 1733
We estimate an HM model with two latent states, handling missing values
through imputation, and accounting for the effect of covariates on the
measurement model. This is done using the lmestCont()
function with
the following options:
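mod5 <- lmestCont(responsesFormula = math ~ chart + sped,
                  data = data_school, index = c("id", "time"),
                  k = 2, miss.imp = TRUE)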
The summary output presents the estimated model parameters, including the following details:
summary(mod5)
Call:
lmestCont(responsesFormula = math ~ chart + sped, data = data_school,
index = c("id", "time"), k = 2, miss.imp = TRUE)
Coefficients:
Initial probabilities:
est_piv
[1,] 0.1652
[2,] 0.8348
Transition probabilities:
, , time = 2
state
state 1 2
1 0.9233 0.0767
2 0.0130 0.9870
, , time = 3
state
state 1 2
1 0.7419 0.2581
2 0.0085 0.9915
Al - Intercepts:
state [,1]
1 644.8250
2 656.8341
Be - Regression parameters:
[,1]
[1,] -2.8784
[2,] -12.8279
Si - Variance-covariance matrix:
[,1]
[1,] 24.0767
The estimated intercepts, collected in object Al, indicate that the 1st state corresponds to lower-performing schools. Additionally, the analysis reveals a negative effect of attending a charter school, rather than a non-charter public school, on the average math score, while controlling for the proportion of special education students. Covariate sped also has a negative effect on the math score: schools with a higher proportion of special education students tend to have lower estimated mean math scores, with a difference of 0.10 in this proportion corresponding to a decrease of about 1.28 points.
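For example, the estimated mean math score for a charter school in the 1st state with a 10% proportion of special education students can be obtained as follows (a minimal sketch, assuming the intercepts and regression parameters are stored in mod5$Al and mod5$Be as displayed above):
mod5$Al[1] + mod5$Be[1] * 1 + mod5$Be[2] * 0.10   # approximately 640.7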
From the estimated initial probabilities, we infer that, in 2008, around 17% of schools belong to the 1st state. The two estimated transition matrices indicate a notable improvement in average math scores from 2009 to 2010, as evidenced by the transition probability from the 1st to the 2nd state, which equals 0.26 (compared to 0.08 for the transition from 2008 to 2009).
The output collected in mod5 also includes an object named Yimp, which is an array of dimension \(n \times T \times r\), where \(r\) denotes the number of response variables, containing the imputed data. For instance, the imputed values for unit 618 at the first and second time occasions are 662.966 and 661.465, respectively.
round(mod5$Yimp[618,,], 3)
[1] 662.966 661.465 631.200
The LMest package is designed for Markov chain (MC) and hidden Markov (HM) models. It includes functions for maximum likelihood estimation of various versions of these models, both with and without covariates, and under different constraints. In particular, the package allows analyzing longitudinal and time-series data through MC models for univariate categorical responses and HM models for both categorical and continuous responses, thus covering a wide range of applications.
The primary features of the models at issue, and of HM models in particular, are their great flexibility and the capability of performing dynamic model-based clustering, with each cluster corresponding to a support point of the Markov chain. HM models are estimated using the Expectation-Maximization (EM) algorithm. Given that the likelihood is often multimodal, the package provides appropriate criteria to assess convergence and to choose different sets of starting values for the model parameters. Moreover, to optimize computational efficiency, the Fortran language is used for many numerical operations. The package also includes functions for simulating data from different versions of the HM models, as well as for displaying and visualizing parameter estimates. Parametric bootstrap can also be applied to provide standard errors for the parameter estimates, offering an alternative to using the information matrix.
In addition to the previous features, the LMest package also
allows estimation of a model for longitudinal categorical data based on
a mixture of latent AR(1) processes to dynamically account for
unobserved heterogeneity. Each mixture component has a specific mean and
correlation coefficient, but these components share a common variance.
Therefore, this model is based on a continuous latent process and is tailored for univariate data in which the response variable has an
ordinal nature. As another extension, it is possible to estimate mixed
HM models through function lmestMixed()
, which is
formulated to account for additional sources of time-fixed dependency in
the data. The parameters of the latent process can vary across different
latent subpopulations defined by an additional latent variable.
It is worth mentioning that, within the package, it is possible to handle intermittent missing values in the responses, either entirely or partially missing, under the missing-at-random assumption, and the EM algorithm for parameter estimation is suitably adapted to also provide an imputation of the missing responses. Weighted maximum likelihood estimation can also be performed, which can be useful in many applied contexts, such as when longitudinal survey data are analyzed. In this case, individual survey weights, which refer to the probability of each unit being sampled in the reference population, are available; see, among others, and . By using suitable weights, it is also possible to carry out causal inference with observational longitudinal data, in a potential outcome framework, on the basis of a propensity score method. For applications of this type see and .
As alternative packages to LMest, we mention march , which provides tools for fitting discrete-time Markov chains, semi-Markov chains, and higher-order models. Moreover, we mention the mstate package designed for modeling and analyzing multi-state models, which are generalizations of Markov chains, useful in the context of survival analysis, and the ctmcd package , which provides methods for parameter estimation, simulation, and detailed analysis of continuous-time Markov models.
For the HM model, it is also worth considering the depmixS4 package . LMest and depmixS4 differ from other packages mainly in how covariates are parameterized in the latent and the manifest models. Additionally, depmixS4 can handle a more general class of distributional assumptions when covariates are included in the measurement model. We also highlight the recent proposal of with the eglhmm package, which allows us to estimate extended generalized linear HM models for data conforming to various distributions. Another available package of interest is msm , which is designed for estimating continuous-time Markov and HM models for longitudinal data. It provides both maximum likelihood estimation and Bayesian inference under various models, including multi-state models. Additionally, the HiddenMarkov package offers tools for modeling time-series data and supports various distributions for the observations, including Gaussian and Poisson distributions. Lastly, the seqHMM package is designed for fitting HM models and mixture HM models specifically for social sequence data and other types of categorical data.
The extensive development and availability of packages for HM models on CRAN highlights the broad applicability of such models across diverse fields. Their flexibility and robustness in handling various types of data, combined with their capability to uncover underlying unobserved processes, demonstrate the popularity of these models and their value among researchers and practitioners. Finally, as a possible extension of the LMest package, we consider the handling of informative missing data, as in the case of dropout from a longitudinal study. As described in , dropout can be accounted for by considering an absorbing state, which allows modeling survival time without specifying a separate survival model component. This extension will be included in future updates of the package. The package will also be extended to allow for variable selection, as recently proposed in , where a greedy search algorithm is implemented to identify the most important response variables. Other forthcoming extensions include functions to estimate models for data with a multilevel structure, as proposed in , additional parameterizations for the covariates affecting the latent (sub)model, as described in , and methods for dealing with compositional data, as proposed in .
Acknowledgments
The authors are grateful for the financial support from the grant “Hidden Markov Models for Early Warning Systems” of Ministero dell’Università e della Ricerca (PRIN 2022TZEXKF) funded by the European Union - Next Generation EU, Mission 4, Component 2, CUP J53D23004990006.
Supplementary materials are available in addition to this article. They can be downloaded at RJ-2024-036.zip
LMest, Formula, quantmod, mix, march, mstate, ctmcd, depmixS4, eglhmm, msm, HiddenMarkov, seqHMM
Distributions, Epidemiology, Finance, MissingData, Survival, TimeSeries
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Pennoni, et al., "LMest: An R Package for Estimating Generalized Latent Markov Models", The R Journal, 2025
BibTeX citation
@article{RJ-2024-036, author = {Pennoni, Fulvia and Pandolfi, Silvia and Bartolucci, Francesco}, title = {LMest: An R Package for Estimating Generalized Latent Markov Models}, journal = {The R Journal}, year = {2025}, note = {https://doi.org/10.32614/RJ-2024-036}, doi = {10.32614/RJ-2024-036}, volume = {16}, issue = {4}, issn = {2073-4859}, pages = {74-101} }