Social scientists have frequently attempted to assess the relative contribution of age, period, and cohort variables to the overall trend in an outcome. We develop an R package APCI (and Stata command apci
) to implement the age-period-cohort-interaction (APC-I) model for estimating and testing age, period, and cohort patterns in various types of outcomes for pooled cross-sectional data and multi-cohort panel data. Package APCI also provides a set of functions for visualizing the data and modeling results. We demonstrate the usage of package APCI with empirical data from the Current Population Survey. We show that package APCI provides useful visualization and analytical tools for understanding age, period, and cohort trends in various types of outcomes.
Researchers across disciplines have long been interested in distinguishing the relative contribution of three time-related variables — namely, age (i.e., how old a person is at the time of data collection), time periods (e.g., the Great Recession 2007-2009 and the COVID-19 pandemic beginning in December 2019), and cohort membership (e.g., the baby boom cohort born in 1945-1964 and the Millennials born in 1981-1996) — to the overall trends in various outcomes (e.g., labor force participation, attitudes, and cognitive functioning) (Clogg 1982; Alwin and McCammon 2003; Pescosolido et al. 2021). Decomposing the overall trends into age, period, and cohort variations provides insight into the ways in which biological and social factors affect these outcomes (Hobcraft et al. 1982; Heckman and Robb 1985; Fosse and Winship 2019).
To quantify the relative contribution of age, period, and cohort, Luo and Hodges (2020b) have recently developed a model called the age-period-cohort-interaction (APC-I) model. The APC-I is qualitatively different from other age-period-cohort (APC) models in that it characterizes cohort effects as a structure of the age-by-period interaction terms to acknowledge the interdependence of age, period, and cohort effects, whereas prior methods attempt to recover the independent and additive effects of the three variables. The APC-I model has been used to understand the unique contribution of cohort membership in various outcomes including crime involvement, substance use, and cultural taste (Verdery et al. 2020; Lu and Luo 2020; Ma 2020). However, the authors of the APC-I model focused on the conceptual motivation of the method and offered relatively few technical details for implementing the method. Estimating and testing cohort effects in the APC-I model may be be challenging for interested readers.
We developed an R package
APCI (Xu and Luo 2021) and a Stata
command apci
for implementing the APC-I model in empirical research
using pooled cross-sectional data (e.g., the General Social Survey and
the Current Population Survey) and importantly, extend the APC-I method
for analyzing multi-cohort longitudinal or panel data (e.g., data from
the Health and Retirement Study
(HRS) and the National Longitudinal
Study of Youth (NLSY)). The purpose of this
paper is three folded. First, we describe the R functions in the
APCI package and Stata
command to estimate and test age, period, and cohort effects in the
APC-I model. The core function can be used for analyzing pooled
cross-sectional data and multi-cohort longitudinal data. Second, we
introduce a set of visualization tools to help researchers motivate an
APC analysis and interpret age, period, and cohort effects from the
APC-I model. Third, we clarify several important issues about
characterizing cohort effects as a set of age-by-period interaction
terms. We pay particular attention to the implications of coding schemes
and how to interpret the between-cohort average deviations and
within-cohort life-course variations.
This paper is organized as follows. Following a description of traditional APC models and the identification problem, we introduce the APC-I model and the estimation and testing procedures. We explain how and why the age-by-period interaction terms can be used to characterize cohort effects with particular attention to the implications of coding schemes for estimating and testing interactions. Next, we describe the visualization tools and functions in the R package APCI. We then demonstrate how to use the package with the empirical example of men’s and women’s labor force participation from 1990 to 2018 in the United States using data from the Current Population Survey (CPS, Flood et al. 2021).
To formally estimate and infer the independent age, period, and cohort effects, Mason et al. (1973) specified an analysis of variance (ANOVA) model that they labeled the age-period-cohort (APC) accounting model:
\[\label{eq:1} g\left(E\left(Y_{i j}\right)\right)=\mu+\alpha_{i}+\beta_{j}+\gamma_{k} \tag{1}\] for age groups \(i=1, 2,\ldots, A\), periods \(j=1, 2,\ldots, P\), and cohorts \(k=1, 2,\ldots, (A+P-1)\), where \(\sum_{i=1}^{A} \alpha_{i}=\sum_{j=1}^{P}\beta_{j}=\sum_{k=1}^{A+P-1} \gamma_{k}=0\). \(E\left(Y_{i j}\right)\)denotes the expected value of the outcome \(Y\) for the \(i\)th age group in the \(j\)th time period; \(g\) is the “link function”; \(\alpha_i\) denotes the mean difference from the global mean \(\mu\) associated with the \(i\)th age category; \(\beta_j\) denotes the mean difference from \(\mu\) associated with the \(j\)th period; \(\gamma_k\) denotes the mean difference from \(\mu\) associated with membership in the \(k\)th cohort.
Unfortunately, the APC accounting model ((1)) is not identified even when a coding scheme (e.g., dummy coding where one group is set as the reference group or effect coding where the sum of the coefficients for each effect is set to 0) is applied. This is because age, period, and cohort are exactly linearly related (see Fienberg and Mason 1979; Luo et al. 2016 for detailed discussions; Fosse and Winship 2019). As a result, the design matrix of model ((1)) has rank one less than full, so an infinite number of solutions (i.e., estimates) for the parameters fit the data equally well. That is, the data cannot distinguish different estimation results, so an additional constraint — in addition to the usual reference group or sum-to-zero constraint — must be imposed in order to choose one set of estimates. Moreover, interpreting the results is difficult because the standard interpretation of regression coefficients — that is, the conditional effect of one variable after accounting for other covariates — cannot apply due to the lack of variation in the third variable (e.g., cohort) after considering the other two (e.g., age and period).
The theoretical root of the identification problem in traditional APC models is the problematic assumption that age, period, and cohort effects operate independently of each other. It implies that the identification challenge is inherent in any APC model that attempts to separate independent and additive effects of age, period, and cohort and thus cannot be solved by changing the model setup [e.g., using random effects for period and cohort as in (Yang and Land 2006); see Luo and Hodges (2020a) for a critique] or by variable manipulation [e.g., using unequal interval widths for age, period, and cohort groups as in (Robertson and Boyle 1986; Sarma et al. 2012); see Luo et al. (2016) for a detailed discussion]. The identification problem is well recognized, and its consequences have been discussed extensively (Kupper et al. 1983, 1985; Fienberg and Mason 1985; Luo 2013; Grotenhuis et al. 2016; Luo et al. 2016; Fosse and Winship 2019; O’Brien 2020; Luo and Hodges 2020a; Morgan and Lee 2021). In essence, internal information derived from the data cannot help because the problem is circular: researchers do the analysis to learn precisely the kind of information needed to justify any such constraint.
Luo and Hodges (2020b) proposed a new APC model called the age-period-cohort-interaction (APC-I) model. The APC-I model is qualitatively different from all estimators developed under the traditional framework in that it explicitly specifies cohort effects as a structure of the age-by-period interactions. A life-course dynamics hypothesis that concerns about whether and how cohort effects may change as cohorts age thus corresponds to a specific structure of the age-by-period interactions. This specification is motivated by the theoretical account that “The minimal basis for expecting interdependence between inter-cohort differentiation and social change is that change has variant import for persons of unlike age” (Ryder 1965). That is, a basic notion on which cohort analysis rests is that “transformations of the social world modify people of different ages in different ways.” (Ryder 1965)
The APC-I model is fully identified in the sense that it does not require additional constraints other than a regular coding scheme. It is also flexible enough to test various hypotheses about life-course changes within cohorts as cohort members age. We first describe the model specification and estimation and testing techniques. The next section demonstrates the procedure using empirical examples.
The general form of the APC-I model can be written as: \[\label{eq:2} g\left(E\left(Y_{i j}\right)\right)=\mu+\alpha_{i}+\beta_{j}+\alpha \beta_{i j(k)} \tag{2}\] where \(g\), \(Y_{ij}\), \(\mu\), \(\alpha_{i}\) and \(\beta_{j}\) are defined as in model ((1)) and \(\alpha \beta_{i j(k)}\) denotes the interaction of the \(i\)th age group and \(j\)th period group, corresponding to the effect of the \(k\)th cohort. Note that except for the oldest and youngest cohorts, the effect of one cohort includes multiple age-by-period interaction terms \(\alpha \beta_{i j(k)}\) that lie on the same diagonal in a table with ages in rows and periods in columns.
Model ((2)) differs from model ((1)) in the way that cohort effects are modeled. In model ((2)), cohort effects are considered as a specific form of the age-by-period interaction. In statistics, the interaction between two variables describes the differential effects of one variable depending on the level of the other (Scheffé 1999). In APC research, this means that if part of the overall pattern of interest can be attributed to cohort differences, significant age-by-period interactions should be present. When cohort membership is not associated with the outcome — that is, when the effects of historical or social shifts (period effects) are uniform across age groups — then age-by-period interactions should not be present.
Luo and Hodges (2020b) described a procedure for investigating age and period main effects and inter-cohort deviations and intra-cohort dynamics. They recommended beginning with a deviance test about whether the effects of time periods vary among age groups, which is called “a global F test”. A non-significant global F statistic indicates that there are few age-by-period interaction effects and thus little cohort variation. If the model suggests significant age-by-period interaction effects, one may proceed to examine inter- and intra-cohort differences1. Inter-cohort average deviations are quantified based on the arithmetic mean of the age-by-period interaction terms contained in each cohort and a t test can be used to examine the average of that cohort-specific deviation. To investigate intra-cohort dynamics over the life course (e.g., the cumulative (dis)advantage hypothesis in Dannefer 1987; Chauvel et al. 2016; Ferraro and Morton 2018; O’Brien 2020), one may use a t-test of the linear orthogonal polynomial contrast in each cohort’s age-by-period interaction effects. This intra-cohort life-course dynamics test is helpful for investigating whether the average (dis)advantages of the members of that cohort accumulate, remain stable, or diminish as they age.
The APC-I model has three advantages. First, it is identified in that it does not require additional constraints other than the usual coding scheme. That is, it avoids the identification problem of the APC accounting model based on the theoretical account of cohort effects and allows inclusion of other important predictors such as education, sex, and employment status. Second, the interpretation of the coefficient estimates of the APC-I model is meaningful and straightforward. This is because the APC-I model recognizes the dependence of age, period, and cohort so the dilemma that analysts face using traditional APC models does not apply. Third, the APC-I model permits investigating life-course dynamics as a cohort ages, whereas extant methods usually assume that cohort effects do not change.
It is important to note that the APC-I model is never intended to "solve" the identification problem in traditional APC accounting models because it is a false problem to begin with. Given the near monopoly of the accounting model, it may be challenging not to see the APC-I method through the lens of the traditional APC accounting framework. For example, because the APC-I model quantifies cohort effects as a structure of the age-by-period interactions, some readers may take it to mean that the APC-I model cannot estimate "linear cohort main effects". However, the APC-I method, by design, does not intend to estimate any kind of "linear cohort main effects" precisely because the traditional model’s assumption that there is a linear cohort effect that is additive or independent of age and period effects lacks theoretical grounding and is thus arbitrary and questionable. Please see Luo and Hodges (2020b) for a more thorough discussion about the theoretical motivation of the APC-I model.
Because the APC-I mode is relatively new, below we make additional remarks about interaction effects and coding schemes to help readers better understand and use the model.
In some cases, interaction terms may be difficult to interpret besides suggesting that the effect of one variable may depend on the values of the other. However, as explained by Luo and Hodges (2020b), the age-by-period interaction terms correspond to the conceptual definition of cohort effects and thus can be modeled and interpreted in a meaningful way. Specifically, cohort effects are expected when the influence of social events and changes differ by age groups. This conceptualization of cohort effects implies that the age-by-period interactions, which represent the differential effects of time periods depending on age, can be used to measure cohort effects.
Technically, because of the linear dependency among age, period, and cohort, the effects of the third variable can be expressed as the interaction between the other two variables. The APC-I model considers age and period as main effects and cohort their interactions, which may give the impression that it privileges age and period effects and “discriminates” against cohort effects. The theoretical motivation for this choice is that it is often desirable to estimate a general age pattern that individuals follow as they get older. Period main effects are used to represent the kind of impacts of social environment that everyone in the society is exposed to. The decision to quantify cohort effects as a specific form of age-by-period interaction is informed by the demographic literature on how cohort effects are conceptualized in relation to age and period effects. Empirically, as the analysis of women’s labor force participation in section Examples will illustrate, the size of the cohort effects, characterized as the age-by-period interactions, is not necessarily smaller—in fact may be larger—than some of the main effects.
For the unidentified APC accounting model, some estimation methods including the intrinsic estimator yield effect estimates that are dependent on the choice of coding schemes in that estimates under different coding schemes are not equivalent (Grotenhuis et al. 2016 for a more detailed discussion; see Luo et al. 2016). The APC-I model does not have a rank deficiency problem in the sense that it does not require more constraints than a usual ANOVA model with main effects and their interactions. For any identified model including the APC-I model, the estimates are equivalent; that is, the estimated cell means are the same for all coding schemes.
Although this equivalence holds for both main effects and interaction estimates in the APC-I model, it is less obvious for interaction terms because the interpretation of the specific parameter estimates do change with coding schemes. To illustrate, consider an example of applying the APC-I model to health data with three age categories and three periods, shown in 1 below. Under dummy coding—for example, the youngest age group 20-24 and the beginning survey period of 2000 are set to zero or omitted as the referent—the interaction for ages 25-29 and period 2005 in cell Y represents the difference in a health outcome between periods 2000 and 2005 for age 25-29 or equivalently, health difference between ages 25-29 and 20-24 for the period 2005. That is, interactions under dummy coding represents a directional difference from a particular reference group.
2000 | 2005 | 2010 | ||
20-24 | ||||
25-29 | \(\mathrm{Y}\) | |||
30-34 |
By contrast, under effect coding (i.e., sum-to-zero coding), the same interaction term in cell \(Y\) represents the deviation in the health outcome from the age main effect plus period main effect for the group of individuals who were age 25-29 and surveyed in period 2005.
The estimated interaction terms under these two types of coding in cell \(Y\) thus necessarily have different numerical values. However, this difference does not arise from an identification problem but rather from a shift in what these quantities represent. That is, the two interaction terms can be transformed to be equivalent so that the means in \(Y\) after considering age and period main effects are the same under the two coding schemes.
We recommend using effect or sum-to-zero coding for estimating the APC-I model for the following reasons: when characterizing cohort effects as a set of age-by-period interactions, we are less concerned about any direction of the interactions; that is, we are not particularly interested in the difference between two cohorts at a particular age or time period. Rather, we focus on particular structures of these interactions that may represent theoretically interesting patterns during a cohort’s life span. Effect coding is helpful because they all have the same referent group — the next lower level in the hierarchy of main effects and interactions. That is, we choose effect coding for the purpose of easy interpretation. This is also consistent with the recommendation of coding schemes in the presence of interactions (Aiken et al. 1991; Jaccard and Turrisi 2003).
The R package APCI 2 can be installed and loaded using the following R code3:
# install R package APCI
> install.packages("APCI")
# load R package APCI to the current working environment
> library(APCI)
The main routines to implement the APC-I model using package
APCI are apci.plot.raw
,
apci
, apci.plot
(or apci.plot.hexagram
, apci.plot.heatmap
). A
summary of these functions and input arguments used in the routines are
described below.
The R package APCI contains the following functions for estimating the APC-I model and visualizing the data and the model results:
apci
: to estimate the age, period, and cohort effects using the
APC-I model.temp_model
: an internal function that estimates a generalized
linear model.tests
: to conduct the global F test.maineffect
: an internal function to extract age and period main
effects.cohortdeviation
: an internal function to extract between-cohort
average deviations and within-cohort life-course dynamics.ageperiod_group
: to return a cohort index based on how age and
period are grouped.apci.plot.raw
: to visualize the mean values of the outcome across
age and period groups, respectively.apci.plot.hexagram
: to visualize the estimated cohort effects in a
hexagram style.apci.plot.heatmap
: to visualize the estimated cohort effects in a
heatmap style.apci.plot
: to visualize the estimated age, period, and cohorts in
conventional figures.A summary of input arguments required in these functions will be given one by one 4 in the next section. Package APCI also contains three empirical datasets women9017, cpsmen, cpswomen, and one simulated dataset simulation. Dataset women9017 was used and described in Luo and Hodges (2020b). Applications of the APC-I model to the other two empirical datasets are given in section Examples.
Function apci
is the core function in the R package
APCI. It fits an APC-I
model with or without covariates and returns a list of results including
coefficients and standard error estimates for age main effects, period
main effects, inter-cohort average deviations, and intra-cohort
life-course trends, and covariate coefficients if any. Both pooled
cross-sectional data and multi-cohort longitudinal/panel data are
supported. Specifically, function apci
is used as
apci(data, outcome, age, period, cohort, weight, covariate, family,
dev.test=TRUE, print, gee, id, corstr,...)
and takes the following arguments:
data
: a data frame containing an outcome variable, age group
indicators, period group indicators, and covariates to be used in
the model. If a variable is not found in data, there will be an
error message reminding users to check the input data again.
Supported data structures include pooled cross-sectional data and
multi-cohort longitudinal/panel data.outcome
: an object of class character containing the name of the
outcome variable. The outcome variable can be a continuous,
categorical, or count variable.age
: an object of class character indicating the age group index
taking on the number of distinct values in the data (e.g., six age
groups: 20-24, 25-29, 30-34, 35-39, 40-44, and 45-49). The vector
should be a factor (or “category”, or “enumerated type”).period
: an object of class character indicating the time period
index in the data.cohort
: an optional object of class character indicating cohort
membership index in the data. The cohort index can be generated from
the age group index and time period index in the data because of the
exact linear relationship among these three time-related indices.weight
: an optional vector of sample weights to be used in the
model fitting process. If non-NULL
, user-supplied weights will be
used in the first step to estimate the model. Observations with
negative weights will be dropped in modeling.covariates
: an optional vector of characters containing the names
of user-specified covariate(s) to be used in the model. If the
variables are not found in data, there will be an error message
reminding the users to check the data again.dev.test
: logical, specifying if the global F test (step 1) should
be implemented before fitting the APC-I model. If TRUE
, apci
will first run the global F test and report the test results;
otherwise, apci
will skip this step and return NULL
. The default
setting is TRUE
. However, users should be aware that the algorithm
will not automatically stop even if there is no significant
age-by-period interactions based on the global F test.5family
: a character string specifying the link function to be used
in the model. The value can be “binomial”, “multinomial”, or
“gaussian”. See R function glm
for more details about link
functions.print
: logical, specifying if the intermediate results should be
displayed in the console when fitting the model. The default setting
is TRUE
to display the results of each procedure.gee
: logical, indicating if the data is cross-sectional data or
longitudinal/panel data. If TRUE
, the generalized estimating
equation will be used to correct the standard error estimates. The
default is FALSE
, indicating that the data are cross-sectional.id
: a character vector specifying the cluster index in
longitudinal data. It is required when gee
is TRUE
. The length
of the vector should be the same as the number of observations.corstr
: a character string specifying a possible correlation
structure in the error terms when gee
is TRUE
. The following are
allowed: independence
, fixed
, stat_M_dep
, non_stat_M_dep
,
exchangeable
, AR-M
and unstructured
. The default value is
exchangeable
.unequal_interval
: logical, indicating if age and period groups are
of the same interval width. The default is set as TRUE
.age_range, period_range
: numeric vectors indicating the actual age
or period range (e.g., 10 to 59 years old or from 2000 to 2019).age_interval, period_interval, age_group, period_group
: numeric
values or character vectors indicating how age and period are
grouped. age_interval
and period_interval
indicate the width of
age and period intervals, respectively. age_group
and
period_group
are character vectors listing possible age and period
groups. There are two ways to define age and period groups with
unequal intervals: 1) defining age_interval
and period_interval
,
or 2) defining age_group
and period_group
. Users must define age
and period groups using one of the two options when
unequal_interval
is TRUE
....
: further optional arguments to be passed to the model.As mentioned in section Coding scheme and contrast, we use
effect coding to estimate the APC-I model. The age
and period
arguments in function apci
accept categorical variables. Different
from the common approach of dummy coding or simple coding, where an
effect is defined as the difference of each group from the reference
group, function apci
uses effect coding (i.e., the sum-to-zero coding,
deviation coding, or the ANOVA coding) as the default coding. Under this
coding scheme, the effect of the omitted category equals the negative
sum of the effects of all other categories. Computation wise, the effect
coefficient of the omitted category is redundant because of the coding
scheme. However, for the purpose of quantifying cohort effects as
deviations from the main effects of age and period, it is useful to
compute estimates for all age-by-period cross-classifications and their
standard errors. For data with A age groups and P periods,
therefore, function apci
returns \(A\) and \(P\) number of main effect
estimates and \(A*P\) number of interaction estimates along with their
standard error estimates. For the main effects of age or period, the
estimate can be interpreted as the deviation associated with each age or
period group from the global mean. The age-by-period interactions
represent the deviation from the expectation determined by the
corresponding age and period main effects.
Also note that when age and period groups have unequal interval widths
in an age-period classification table, the age-by-period interactions
contained in a cohort no longer lie on the same diagonals. Because
cohort effects are conceptualized and estimated as a structure of the
age-by-period interactions in the APC-I model, it is technically
possible to use the argument unequal_interval
in package
APCI to extract interaction
coefficient estimates that lie on different diagonals. However, unequal
age and period group intervals may complicate the issue of cohort
overlapping noted by Kupper et al. (1985). For this reason, we do
not recommend using unequal interval widths for age and period groups if
possible.
After fitting the APC-I model, function apci
will store the following
components as a list for further usage:
model
: a summary of the fitted generalized linear regression
model.6 It displays the standard regression output including
coefficient and standard errors estimates.dev_global
: the global F test results. It examines if the
interaction terms are significant in a generalized linear regression
model that contains age and period main effects and their
interactions.intercept
: the overall intercept (\(\mu\) in equation (2)).age_effect
: a vector containing the estimated effect for each age
group.period_effect
: a vector containing the estimated effect for each
time period.cohort_average
: a vector containing the inter-cohort average
deviations for comparing differences between cohorts.cohort_slope
: a vector containing intra-cohort life-course trends.As mentioned earlier, the first step of implementing the APC-I model is
to conduct a global F test of the age-by-period interactions. This step
is a routine in function apci
, but the procedure does not stop even if
there is no statistically significant deviation from the age main effect
and period main effect. Therefore, we recommend separately conducting
the global F test. In R package
APCI, the function tests
can be used for this purpose. It can be used as follows:
tests(model, A, P, C, ...)
and takes the following arguments:
model
: a generalized linear regression model generated from the
internal function temp_model
. 7A, P, C
: numbers of age groups, period groups, or cohort groups.
If age and period groups are of different widths, the values of will
be automatically generated by the function.Function tests
will return a standard F test result including the
value of the F test statistic and the associated p-value.
In package APCI, we provide
four functions to facilitate visualizing the data and model results,
namely apci.plot.raw
, apci.heatmap
, apci.plot.hexagram
, and
apci.plot
, in different stages of a research project. They take
similar input arguments. A summary of these arguments is given below.
Function apci.plot.raw
is designed to plot the outcome variable
aggregated by age or period groups. This function may be used in the
stage of data exploration. Functions apci.heatmap
and
apci.plot.hexagram
are designed to plot the age-by-period interactions
from the APC-I model. Both functions generate heatmaps, where one axis
represents age groups, and the other period groups. The cells in a
diagonal represent one cohort. The difference between the two functions
is the layout of the heatmap; one is a rectangular graph and the other a
hexagram. Function apci.plot
can be used to visualize both raw data
and model results. It divides the canvas into four (\(2 \times 2\))
panels. Three of the four panels can be used to visualize the three
effect estimates in the APC-I model and the left panel to add notes. For
data exploration, users can visualize the mean values of the outcome
variable across age, period, and cohort groups on the same canvas. Users
can the same functions to visualize the estimated age, period, and
cohort effects from the modeling results.
The visualization functions in package APCI include:
apci.plot.raw(data, outcome_var, age, period, ...)
apci.plot.heatmap(model, age, period, color_map = NULL, color_scale = NULL,
quantile = NULL, ...)
apci.plot.hexagram(model, age, period, first_age, first_period, interval,
color_scale = NULL, color_map = NULL, quantile = NULL, …)
apci.plot(model, age, period, outcome_var, type = "model", quantile = NULL,
...)
and takes the following arguments:
model
: a list recording the results from function apci
.outcome_var
: an object of class character indicating the name of
the outcome variable used in the model. The outcome variable can be
a continuous, binary, categorical, or count variable.age
: a vector indicating the age group. The vector should be
converted to a factor (or the terms of “category” and “enumerated
type”).period
: a vector indicating the time period. The vector should be
converted to a factor (or “category”, “enumerated type”).color_map
: a vector representing a color palette to be used in the
figure. The default setting is greys if color_map
is NULL
.
Alternatives, for example, can be c(“blue”, “yellow”), “blues”, etc.cohort_scale
: a vector containing two values to indicate the
minimum and maximum values, respectively, of the estimated cohort
effects to be displayed. If NULL
, the function will use the range
from the estimation results.quantile
: a number valued between 0 and 1, representing the
desirable percentiles to be used in visualizing the data or model.
If NULL
, the original scale of the outcome variable will be used.We now illustrate how to use package APCI’s visualization and analytical functions. We describe and analyze two empirical datasets to demonstrate how this package may be used to analyze pooled cross-sectional data. We later briefly describe how to fit an APC-I model for multi-cohort longitudinal data.
Temporal trends in men’s and women’s labor force participation (LFP) in United States have gathered much scholarly attention (Farkas 1977; Treas 1987; Wilkie 1991; see e.g. Connelly 1992; Macunovich 2012; Hollister and Smith 2014). For example, whereas men’s LFP steadily declined in the past decades (Wilkie 1991), women’s LFP continued to rise until the 1990s and the 2000s. Female LFP has since then reached a plateau and even begun to decline. Researchers have debated about the causes of this leveling off or decline. Some studies attributed the observed trends to period-specific factors such as labor demand (Erceg and Levin 2014), the economic shocks of the Great Recession (Boushey 2005; Hoffman 2009), social welfare and disability insurance (Duggan and Imberman 2009), and gender role attitudes (Fortin 2015).
However, the temporal trends in LFP are unlikely due to a pure period process. For example, because individuals may begin to leave the labor force in age 50, a decline in LFP should be expected if the proportion of the population age 50 or older has increased. That is, the recent trends may reflect a change in the age composition of the US population (Aaronson et al. 2014). The cohort succession may also contribute to the trend, a process in which older cohorts with higher or lower LFP rates begin to decease and younger cohorts with lower or higher LFP enter the labor force Lee (2014). At the same time, critical social and demographic changes in education level, fertility, and attitudes about women working outside the home may be more of cohort-specific than period-specific processes because these forces only affect individuals of certain ages (Goldin 2006; Farré and Vella 2013; Fernández 2013; Balleer et al. 2014).
Given that the observed trends in LFP are likely a mix of age, period, and/or cohort patterns, an APC analysis that decomposes the temporal trends into age-, period-, and cohort-related variations is thus helpful for revealing the demographic, social, and economic changes that have underlined the temporal trends in American’s labor force participation. In this following section, we demonstrate how to use the functions in APCI to undertake an APC analysis of men’s and women’s LFP using a cross-sectional dataset.
The Current Population Survey (CPS, Flood et al. 2021) is the primary source of labor force statistics in the United States (Flood et al. 2021). Beginning in the 1960s, the CPS has been collecting data on key demographic, economic, and education topics. We subset the 1990-2019 CPS data by gender, resulting in two datasets, namely cpsmen and cpswomen, to show how to conduct an APC-I analysis of men’s and women’s LFP in the United States using package APCI.
Datasets cpsmen and cpswomen contain a subset of men and women age 20-64 who participated in the 1990 to 2019 CPS. The following code is used to load the data into the working environment:
> data(cpsmen)
> data(cpswomen)
The first five rows of the datasets of cpsmen and cpswomen are:
> head(cpsmen, n = 5)
asecwt year age labforce educc2854.84 3 5 0 1
1576.54 4 4 1 2
2340.55 2 6 1 3
158.44 3 5 0 0
347.09 6 6 1 3
> head(cpswomen, n = 5)
asecwt year age labforce educc2415.67 2 3 1 1
663.89 2 5 1 3
1653.01 6 4 1 2
1613.31 6 4 0 2
177.23 4 3 1 2
where labforce
indicates the respondent’s labor force participation
status (1=in the labor force, 0=not in the labor force). asecwt
is the
person-level weight that the CPS recommends to be used in
individual-level data analyses. year
indicates the survey year when
respondent was interviewed, grouped into 6 period groups (1=1990-94,
2=1995-99, …, 6=2015-19). age
indicates the respondent’s age
categories (1=20-24, 2=25-29, …, 9=60-64). educc
is a three-level
categorical education measure (1=less than high school, 2=high school
graduate, 3=college degree or above).
For data exploratory purpose, function apci.plot.raw
visualizes the
outcome variable in the following way:
> apci.plot.raw(data, outcome_var, age, period)
Figure 1 shows LFP rates by age groups (top panel) and period groups (bottom panel), respectively, for male (left panel) and female CPS respondents (right panel) age 20 to 64 from 1990 to 2019. Figure 1’s top panel suggests similar age patterns in LFP across time periods. The bottom panel shows distinct period trends depending on age groups. For example, the LFP rates among women in the 55-59 and 60-64 age groups seem to have gone up whereas other age groups show a relatively flat trend. Such distinct period patterns in LFP by age groups suggest potential cohort variations in women’s labor force participation. For men’s LFP, however, the visualization results suggest that a simpler model with age and period main effects may suffice for summarizing their LFP patterns.
Function apci
can be used to fit an APC-I model for pooled
cross-sectional data or multi-cohort longitudinal/panel data. In the
simplest form of an APC-I model without covariates for pooled
cross-sectional data, function apci
is called as follows:
> no_cov <- APCI::apci(outcome = "labforce",
+ age = "age",
+ period = "year",
+ weight = "asecwt",
+ data = cpswomen,
+ dev.test = FALSE,
+ family = "binomial")
It is often desirable to add covariates in the model, which can be done
by calling the covariate
argument. For example, suppose one would like
to add education levels (“educc”) as a covariate in the model, function
apci
can be used as:
> with_cov <- APCI::apci(outcome = "labforce",
+ age = "age",
+ period = "year",
+ covariate = c("educc"),
+ weight = "asecwt",
+ data = cpswomen,
+ print=F,
+ dev.test=FALSE,
+ family = "binomial")
Below is a summary of the results from an APC-I model that includes education levels as a covariate:
> summary(with_cov)
Length Class Mode 33 svyglm list
model 0 -none- NULL
dev_global 4 -none- character
intercept 45 -none- character
age_effect 30 -none- character
period_effect 6 data.frame list
cohort_average 6 data.frame list
cohort_slope 5 data.frame list
int_matrix 54 -none- numeric
cohort_index 7 data.frame list data
The returned value is a list of objects. model
contains the results
from a logistic regression model with age and period main effects and
the unstructured interactions. dev_global
displays the global F test
result. A significant F statistic suggests that there may exist cohort
effects. intercept
is the overall intercept (\(\mu\) in Equation
(2)). age_effect
gives estimated age main effect.
period_effect
is the estimated period main effect. cohort_average
gives inter-cohort average deviations from age and period main effects.
cohort_slope
gives intra-cohort life-course linear slopes, which can
be used for testing intra-cohort life-course dynamics. int_matrix
displays a matrix that contains the estimated coefficients for
age-by-period interactions. Note that there are A*P interactions in
int_matrix
because effect coding is used to compute the A+P-1
interaction estimates based on the (A-1)*(P-1) freely varying
interaction parameters. Such interaction estimates are used to generate
heatmaps similar to Figure 2. data
stores the data fed into apci
function. Users may call an object to obtain detailed results. For
example, by calling with_cov$cohort_average
and
with_cov$cohort_slope
, users can obtain estimated inter-cohort average
deviations and intra-cohort life-course slopes.
The output below shows education-adjusted inter-cohort average
deviations in women’s LFP from analyzing the cpswomen data using
function apci
.
> with_cov$cohort_average
c_avg_group c_avg_est c_avg_se c_avg_t c_avg_p c_avg_sig1 1 -0.329 0.193 -1.709 0.088
2 2 -0.155 0.142 -1.091 0.275
3 3 -0.162 0.114 -1.422 0.155
4 4 0.047 0.097 0.481 0.631
5 5 0.096 0.085 1.139 0.255
6 6 0.174 0.076 2.288 0.022 *
7 7 0.034 0.074 0.457 0.648
8 8 -0.036 0.074 -0.493 0.622
9 9 0.003 0.073 0.047 0.963
10 10 -0.072 0.081 -0.894 0.371
11 11 0.030 0.085 0.353 0.724
12 12 -0.029 0.102 -0.288 0.774
13 13 -0.080 0.131 -0.609 0.543
14 14 -0.103 0.170 -0.608 0.543
where c_avg_group
indicates cohort membership (e.g., cohort 1=the 1930
birth cohort, cohort 2=the 1935 birth cohort,...,cohort 14=the 1995
birth cohort), c_avg_est
is inter-cohort average deviation, c_avg_se
is the standard error estimate for the average deviation, c_avg_t
is
the t test statistic for the average deviation, and c_avg_p
and
c_avg_sig
are the p values and alpha levels (*: p < .05, **: p <
.01, and ***: p < .001), respectively.
The results from with_cov$cohort_average
imply that on average, the
LFP rates among cohort 6’s–the 1955 birth cohort – significantly
differ from the expected rates based on age and period main effects.
Specifically, the 1955 cohort shows a .19 (exp(.174)-1, p < .05) higher
participation rate than the expectation based on the age and period main
effects.
The output below shows education-adjusted intra-cohort life-course
dynamics in women’s LFP from analyzing the cpswomen data using
function apci
.
> with_cov$cohort_slope
c_slp_group c_slp_est c_slp_se c_slp_t c_slp_p c_slp_sig1 1 NA NA NA NA <NA>
2 2 0.165 0.195 0.849 0.396
3 3 -0.215 0.187 -1.148 0.251
4 4 0.163 0.189 0.866 0.386
5 5 0.093 0.184 0.508 0.611
6 6 0.007 0.169 0.039 0.969
7 7 0.047 0.172 0.277 0.782
8 8 -0.096 0.181 -0.530 0.596
9 9 -0.187 0.173 -1.076 0.282
10 10 -0.106 0.176 -0.602 0.547
11 11 -0.279 0.159 -1.750 0.080
12 12 0.353 0.160 2.207 0.027 *
13 13 -0.047 0.180 -0.262 0.793
14 14 NA NA NA NA <NA>
where c_slp_group
indicates cohort membership (e.g., cohort 1=the 1930
birth cohort, cohort 2=the 1935 birth cohort,..., cohort 14=the 1995
birth cohort), c_slp_est
is intra-cohort life-course slopes,
c_slp_se
is the standard error estimate for the life-course slope,
c_slp_t
is the t test statistic for the life-course slope, and
c_slp_p
and c_slp_sig
are the p values and alpha levels (*: p <
.05, **: p < .01, and ***: p < .001), respectively. NA
s are
generated for the youngest and oldest cohort because there is only one
age-by-period combination observed for the two cohorts and thus
intra-cohort life-course dynamics cannot be accessed.
For example, for cohort 12 (the 1985 birth cohort), the estimated intra-cohort slope is 0.353 (p < .05), meaning that this cohort’s LFP is lower than expected when they were young but higher than expected in older ages. Interestingly, for cohort 12 (the 1985 birth cohort), their average cohort deviation is not statistically significant. Such an insignificant inter-cohort average deviation and a significantly negative intra-cohort slope indicate a compensation life-course pattern; that is, this cohort’s lower-than-expected LFP in younger ages seems to be compensated by their higher LFP when they were older.
The intra-cohort life-course dynamics are based on the age-by-period interactions as follows:
# the first six rows of the life-course dynamics
> with_cov$int_matrix
iaesti iase iap iasig cohortindex1 0.166 0.169 0.327 9
2 -0.048 0.207 0.818 8
3 0.068 0.164 0.678 7
4 0.095 0.172 0.581 6
5 0.205 0.191 0.283 5
6 0.227 0.193 0.239 4
# [there are 48 rows compressed]
where “iaesti” is the age-by-period interaction estimates, “iase” is the standard error estimate for the interaction term, “iap” and “iasig” are the p value and alpha level (*: p < .05, **: p < .01, and ***: p < .001),respectively, and “cohortindex” indicates cohort membership.
The following code can be used to organize the intra-cohort life-course estimates in a matrix form:
> matrix(with_cov$int_matrix, A, P)[A:1,]
# A is the number of age groups and P is the number of period groups.
#1 period #2 period #3 period #4 period #5 period #6
period #9 -0.329 -0.038 -0.416* 0.334 0.267 0.182
age #8 -0.272 0.043 0.017 0.125 0.001 0.086
age #7 -0.112 -0.392* -0.070 0.257 0.278 0.040
age #6 0.227 -0.046 0.443* -0.333* -0.258 -0.033
age #5 0.205 0.066 0.061 -0.150 -0.074 -0.107
age #4 0.095 0.044 0.139 0.065 -0.109 -0.234
age #3 0.068 0.059 -0.359* -0.147 0.096 0.283
age #2 -0.048 0.256 -0.006 0.067 -0.156 -0.113
age #1 0.166 0.009 0.191 -0.216 -0.046 -0.103 age
Based on the R package ggplot2 (Wickham 2016), heatmaps can be generated to visualize inter- and intra-cohort patterns and motivate a subsequent formal APC analysis. For example, for dataset whitemen, both inter-cohort average deviations and intra-cohort life-course dynamics may be visualized in a heatmap as follows:
> apci.plot.heatmap(model = with_cov, age = "age",period = 'year',
color_map = c('blue','yellow'))
Figure 2 is a heatmap of the estimated age-by-period interactions, with rows defined by age groups and columns by time periods. Each square represents an age-by-period interaction. Yellow squares indicate lower participation rates than the expectation determined by the age and period main effects. Blue squares indicate higher-than-expected rates. Each diagonal that runs from the lower left to the upper right represents one birth cohort. The dotted line in Figure 2 indicates a significant inter-cohort average deviation but an insignificant intra-cohort slope in their LFP for the 1955 cohort. Cohorts that on average significantly deviate from the expected rates based on age and period main effects are indicated by solid, dashed, or dotted lines. Solid lines indicate significantly (p<.05) positive intra-cohort life-course slopes, dashed lines significantly negative intra-cohort slopes, and dotted lines significant average inter-cohort deviations but insignificant intra-cohort slopes. Users have the options to customize the elements of these figures to suit their research or teaching purposes.
Figure 2 indicates that although some cohorts had LFP rates that differ
from the expected rates based on age and period main effects, only the
1955 birth cohort (marked by a dotted line) had, on average,
higher-than-expected LFP rates (inter-cohort average deviation = .174, p
< .05), and this higher LFP seems to persist over their life course
(intra-cohort slope = .007, p = .969). This visualization results are
consistent with the results from with_cov$cohort_average
and
with_cov$cohort_slope
.
Users can also use bar plots to visualize inter-cohort average
deviations. Function apci.bar
can be used as:
> apci.bar(model = with_cov, age = "age",period = "year",
cohort_label = seq(1930,1995,5))
Figure 3 illustrates inter-cohort deviations based on the average of the age-by-period interaction estimates contained in each cohort. The horizontal line at y=0 indicates expectations for all cohort based on the main effects of age and period when they are observed. The bars indicate the estimated average deviation for each cohort from the age and period main effects. Bars above the horizontal line indicate positive average deviations, and bars below the line indicate negative average deviations. The asterisk sign indicates that a cohort’s average deviation is significantly different from the expectation determined by age and period main effects. Figure 3 suggests some cohort variation in women’s LFP using the 1990-2019 CPS data, but only the 1955 birth cohort had significantly higher-than-expected LFP rates.
For longitudinal data (i.e., panel data, repeated measure data) that
include multiple cohorts, users may set the argument gee
to TRUE
to
estimate an APC-I model using the generalized estimating equation (GEE)
technique (Liang and Zeger 1986; Carey et al. 2019). When gee
is TRUE
,
users will also need to specify arguments id
and corstr
accordingly.
> model_gee <- apci(outcome = "y",
+ age = "age",
+ period = "period",
+ cohort = NULL,
+ weight = NULL,
+ covariate = NULL,
+ data=simulation_gee,
+ family ="gaussian",
+ dev.test = FALSE,
+ print = TRUE,
+ gee = TRUE,
+ id = "id",
+ corstr = "exchangeable")
> summary(model_gee)
The list of output results is similar to that for pooled cross-sectional
data, but the standard errors are corrected using the GEE’s sandwich
estimator.
We also designed a Stata command apci
based on the Stata command
rcall
(Haghish 2019) to help implement APC-I models in
Stata. The command is used as:
apci
depvar [indepvars], outcome(depvar) age(age)
period(period) family("gaussian") [if] [in] [weight]
Stata users can use the above command to fit APC-I models and obtain all the results as in R. A Stata ado file for installing this command can be downloaded at https://sites.psu.edu/liyingluo/software/.
In this article, we introduced an R package
APCI and Stata command
apci
for implementing the age-period-cohort-interaction (APC-I) model
developed by Luo and Hodges (2020b). In addition to
pooled cross-sectional data analysis, we extended the package to permit
multi-cohort longitudinal or panel data analysis. This package also
contains a set of visualization tools to help researchers motivate an
APC analysis and interpret the results. We clarify the implications of
coding schemes for estimating and testing main effects and interaction
effects in the APC-I model. We illustrate how to use this package using
the empirical examples of labor force participation using the 1990-2019
data from the Current Population Survey.
Luo and Hodges (2020b) described a local F test for
testing the variation associated with the multiple age-by-period
interactions contained in each cohort. Because the parameterization of
the local F test is more intricate than it appears, the R package
APCI and the Stata command
apci
currently do not support such tests as of version 1.0.5 but may
be available in later versions.
Moreover, it may be of interest to examine the interaction effects of cohort and other explanatory variables such as education and geographic areas. Because cohort effects are conceptualized and operationalized as a two-way interaction term of age and period effects, an interaction term between cohort and another variable is equivalent to a three-way interaction among age, period, and another variable. Future development may consider creating functions to facilitate summarizing and interpreting the more complex three-way interaction terms in the APC-I model.
Phylogenetics, Spatial, TeachingStatistics
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Xu & Luo, "APCI: An R and Stata Package for Visualizing and Analyzing Age-Period-Cohort Data", The R Journal, 2022
BibTeX citation
@article{RJ-2022-026, author = {Xu, Jiahui and Luo, Liying}, title = {APCI: An R and Stata Package for Visualizing and Analyzing Age-Period-Cohort Data}, journal = {The R Journal}, year = {2022}, note = {https://rjournal.github.io/}, volume = {14}, issue = {2}, issn = {2073-4859}, pages = {77-95} }