Changepoint hazard models have several practical applications, including modeling processes such as cancer mortality rates and disease progression. While the inverse cumulative distribution function (CDF) method is commonly used for simulating data, we demonstrate the shortcomings of this approach when simulating data from changepoint hazard distributions with more than a scale parameter. We propose an alternative method of simulating this data that takes advantage of the memoryless property of survival data and introduce the R package cpsurvsim which implements both simulation methods. The functions of cpsurvsim are discussed, demonstrated, and compared.
When modeling timetoevent processes, especially over long periods of time, it is often unreasonable to assume a constant hazard rate. In these cases, changepoint hazard models are applicable. The majority of research surrounding changepoint hazard models focuses on the Cox proportional hazards and piecewise exponential models with one changepoint (Yao 1986; Gijbels and Gurler 2003; Wu et al. 2003; Dupuy 2006; Rojas et al. 2011), likely due to the straightforward extension for including fixed and timevarying covariates (Zhou 2001; Hendry 2014; MontezRath et al. 2017a; Wong et al. 2018). Research on hazard models with multiple changepoints is also expanding as these models have a wide range of applications in fields such as medicine, public health, and economics (Liu et al. 2008; Goodman et al. 2011; He et al. 2013; Han et al. 2014; Qian and Zhang 2014; Cai et al. 2017). In the interest of simulating timetoevent data featuring trends with multiple changepoints, (Walke 2010) presents an algorithm for simulating data from the piecewise exponential distribution with fixed type I censoring using the location of the changepoints, the baseline hazard, and the relative hazard for each time interval in between changepoints. As the research surrounding parametric changepoint hazard models with multiple changepoints continues to grow, likewise does the need to simulate data from these distributions. Simulation is also a powerful and popular tool for assessing the appropriateness of a model for one’s data or conducting a power analysis.
Several R packages available from the Comprehensive R Archive Network (CRAN) provide functions for simulating timetoevent data in general, with a heavy focus on the Cox model. Some of the more popular packages are provided in Table 1, which expands on the METACRAN compilation (Allignol and Latouche 2020). Although considerably smaller in scope, a few R packages provide functions for simulating data with changepoints. CPsurv has functionality for simulating both nonparametric survival data and parametric survival data from the Weibull changepoint distribution but requires existing data as an argument and only allows for one changepoint (Krügel et al. 2017). SimSCRPiecewise simulates data using the piecewise exponential hazard model within the Bayesian framework, however, this method requires at least one covariate as an argument (Chapple 2016).
Package Title  Brief Description 

coxed  Simulates data for the Cox model using the flexiblehazard 
method and allows for the inclusion of timevarying covariates  
(Kropko and Harden 2019)  
CPsurv  Simulates one changepoint for nonparametric survival analysis 
or parametric survival analysis using the Weibull distribution  
(Krügel et al. 2017)  
cpsurvsim  Simulates data with multiple changepoints from the exponential 
and Weibull distributions (Hochheimer 2021)  
discSurv  Simulates survival data from discrete competing risk models 
(Welchowski and Schmid 2019)  
gems  Simulates data from multistate models and allows for 
nonMarkov models that account for previous events  
(Blaser et al. 2015)  
genSurv  Gives users the option to generate data with a binary, 
timedependent covariate  
(MeiraMachado and Faria 2014; Araújo et al. 2015)  
ipred  Provides a function for simulating survival data for tree 
structured survival analysis (Peters and Hothorn 2019)  
MicSim  Performs continuous time miscrosimulations to simulate life 
courses (Zinn 2018)  
PermAlgo  Uses a permutational algorithm to generate timetoevent data 
allowing for the inclusion of several timedependent covariates  
(Sylvestre et al. 2010)  
prodlim  Has functions for simulating right censored nonparametric 
survival data with two covariates and with or without competing  
risks (Gerds 2018)  
simMSM  Uses inversion sampling to simulate data from multistate 
models allowing for nonlinear baseline hazards, timevarying  
covariates, and dependence on past events (Reulen 2015)  
simPH  Simulates data from Cox proportional hazards models 
(Gandrud 2015)  
simsurv  Simulates data from various parametric survival distributions, 
2component mixture distributions, and userdefined hazards  
(Brilleman 2019)  
SimSCRPiecewise  Uses Bayesian estimation to simulate data from the piecewise 
exponential hazard model allowing for the inclusion of covariates  
(Chapple 2016)  
SimulateCER  While not a formal R package, this package extends the methods 
found in PermAlgo and can be downloaded from GitHub  
(MontezRath et al. 2017b)  
survsim  Allows users to simulate timetoevent, competing risks, multiple 
event, and recurrent event data (Moriña and Navarro 2014) 
Our package cpsurvsim allows users to simulate data from a both the exponential and Weibull hazard models with type I right censoring allowing for multiple changepoints (Hochheimer 2021). cpsurvsim provides two methods for simulating data, which are introduced in the following section. The first method draws on (Walke 2010), using the inverse hazard function to simulate data. The second employs the memoryless simulation method, the details of which are also discussed in the next section. We then demonstrate how to simulate data using cpsurvsim and compare the performance of these methods through a simulation study with the motivation of enabling users to determine which method is best for their data.
The piecewise exponential model with multiple changepoints \((\tau_k, k=1,\ldots,K)\) can be expressed as \[\begin{aligned} f(t)=\begin{cases} \theta_1\exp\{\theta_1t\} & 0\leq t<\tau_1\\ \theta_2\exp\{\theta_1\tau_1\theta_2(t\tau_1)\} & \tau_1\leq t <\tau_2\\ \vdots\\ \theta_{K+1}\exp\{\theta_1\tau_1\theta_2(\tau_2\tau_1)\ldots\theta_{K+1}(t\tau_K) \} & t\geq\tau_K \end{cases}\label{exponential pdf} \end{aligned} \tag{1}\] with corresponding hazard function \[\begin{aligned} h(t)=\begin{cases} \theta_1 & 0\leq t<\tau_1\\ \theta_2 & \tau_1\leq t <\tau_2\\ \vdots & \vdots \\ \theta_{K+1} & t\geq\tau_K. \end{cases} \label{exponential hazard} \end{aligned} \tag{2}\] We draw on the work of (Walke 2010) in that we use the inverse hazard function to simulate survival time \(t\). (Walke 2010) uses a baseline hazard and relative hazards to simulate each time interval between changepoints, whereas our simulation is based on the value of the scale parameter (\(\theta_i, i = 1,\ldots, K+1\)) corresponding to each interval as specified by the user. Starting with the relationship between the cumulative density function (CDF) and the cumulative hazard function (\(F(t)=1\exp(H(t))\) where \(H(t)=\int h(t)dt\)) and noting that \(F(t)=U\) where \(U\) is a uniform random variable on (0,1), we derive \(t=H^{1}(\log(1U))\). Seeing as \(x=\log(1U)\sim Exp(1)\), we can simulate random variables from the exponential distribution and plug them into the inverse hazard function to get simulated event time \(t\). With this in mind, the inverse cumulative hazard function for the exponential changepoint hazard model with four changepoints is \[\begin{aligned} H^{1}(x)=\begin{cases} \frac{x}{\theta_1} & 0\leq x<A\\ \frac{xA}{\theta_2}+\tau_1 & A\leq x<A+B\\ \frac{xAB}{\theta_3}+\tau_2 & A+B\leq x <A+B+C\\ \frac{xABC}{\theta_4}+\tau_3 & A+B+C\leq x<A+B+C+D\\ \frac{xABCD}{\theta_5}+\tau_4 & x\geq A+B+C+D \end{cases} \end{aligned}\] where \(A=\theta_1\tau_1\), \(B=\theta_2(\tau_2\tau_1)\), \(C=\theta_3(\tau_3\tau_2)\), and \(D=\theta_4(\tau_4\tau_3)\). In cpsurvsim, this method of simulating timetoevent data is considered the CDF method. An endofstudy time horizon (or maximum measurement time) is specified by the user and all simulated event times with values greater than the end time are censored at that point (type I right censoring).
The Weibull distribution is another popular parametric model for survival data due to its flexibility to fit a variety of hazard shapes while still satisfying the proportional hazards assumption. Note that when \(\gamma=1\), it is identical to the exponential distribution. The Weibull changepoint model has the probability density function \[\begin{aligned} f(t)=\begin{cases} \theta_1 t^{\gamma1}\exp\{\frac{\theta_1}{\gamma}t^\gamma\} & 0\leq t<\tau_1\\ \theta_2 t^{\gamma1}\exp\{\frac{\theta_2}{\gamma}(t^\gamma\tau_1^{\gamma})\frac{\theta_1}{\gamma}\tau_1^{\gamma}\} & \tau_1\leq t<\tau_2\\ \vdots \\ \theta_{K+1} t^{\gamma1}\exp\{\frac{\theta_{K+1}}{\gamma}(t^\gamma\tau_K^{\gamma})\frac{\theta_K}{\gamma}(\tau_K^{\gamma}\tau_{K1}^{\gamma})\ldots\frac{\theta_1}{\gamma}\tau_1^\gamma\} & t\geq\tau_K \end{cases} \label{weibull pdf} \end{aligned} \tag{3}\] with corresponding hazard function \[\begin{aligned} h(t)=\begin{cases} \theta_1 t^{\gamma1} & 0\leq t<\tau_1\\ \theta_2 t^{\gamma1} & \tau_1\leq t<\tau_2 \\ \vdots & \vdots \\ \theta_{K+1} t^{\gamma1} & t\geq\tau_K. \end{cases} \end{aligned}\] As with the exponential model, event times can be simulated using the inverse hazard function (shown here with four changepoints) \[\begin{aligned} H^{1}(x)=\begin{cases} (\frac{\gamma}{\theta_1}x)^{1/\gamma} & 0\leq x<A\\ [\frac{\gamma}{\theta_2}(xA)+\tau_1^{\gamma}]^{1/\gamma} & A\leq x<A+B\\ [\frac{\gamma}{\theta_3}(xAB)+\tau_2^\gamma]^{1/\gamma} & A+B\leq x<A+B+C\\ [\frac{\gamma}{\theta_4}(xABC)+\tau_3^\gamma]^{1/\gamma} & A+B+C\leq x<A+B+C+D\\ [\frac{\gamma}{\theta_5}(xABCD)+\tau_4^\gamma]^{1/\gamma} & x\geq A+B+C+D \end{cases} \end{aligned}\] where \(A=\frac{\theta_1}{\gamma}\tau_1^{\gamma}\), \(B=\frac{\theta_2}{\gamma}(\tau_2^\gamma\tau_1^\gamma)\), \(C=\frac{\theta_3}{\gamma}(\tau_3^\gamma\tau_2^\gamma)\), and \(D=\frac{\theta_4}{\gamma}(\tau_4^\gamma\tau_3^\gamma)\).
(Zhou 2001) touches on the idea of the memoryless property as a means of interpreting the piecewise exponential model, however, we take this one step further by using this property to simulate data from changepoint hazard models. In survival analysis, the memoryless property states that the probability of an individual experiencing an event at time \(t\) is independent of the probability of experiencing an event up to that point. Likewise, the probability of an event occurring after a changepoint is independent of the probability that the event occurs before the changepoint.
Our memoryless simulation method uses this extension of the memoryless property in that data between changepoints are simulated from independent exponential or Weibull hazard distributions with scale parameters \(\theta_i\) corresponding to each time interval. Participants with simulated survival times past the next changepoint are considered surviving at least to that changepoint and then an additional survival time is simulated for them in the next time interval. Total time to event is calculated as the sum of time in each interval between changepoints, with those surviving past the study end time censored at that point. Survival times within each interval are calculated using the inverse hazard of the independent exponential or Weibull function representing that time period. In this way, the inverse hazard and memoryless methods are equivalent when there are no changepoints.
The cpsurvsim package can be installed from CRAN. Functions for simulating data are summarized in Table 2
Function  Hazard model  Simulation method 

exp_cdfsim 
Piecewise constant  Inverse hazard function 
exp_memsim 
Piecewise constant  Memoryless 
weib_cdfsim 
Weibull changepoint  Inverse hazard function 
weib_memsim 
Weibull changepoint  Memoryless 
As an example of the functions exp_cdfsim
and weib_cdfsim
, which
simulate data using the inverse hazard method from the exponential and
Weibull distributions, respectively, consider the following:
library(cpsurvsim)
< exp_cdfsim(n = 50, endtime = 100, theta = c(0.005, 0.01, 0.05),
dta1 + tau = c(33, 66))
head(dta1)
time censor1 100.00000 0
2 85.99736 1
3 78.21772 1
4 71.03138 1
5 100.00000 0
6 82.71520 1
< weib_cdfsim(n = 50, endtime = 100, gamma = 2,
dta2 + theta = c(0.0001, 0.0002, 0.0001), tau = c(33, 66))
head(dta2)
time censor1 11.36844 1
2 100.00000 0
3 81.04904 1
4 100.00000 0
5 71.93590 1
6 56.40275 1
When simulating using the memoryless method, we use the following calls from cpsurvsim:
< exp_memsim(n = 50, endtime = 100, theta = c(0.005, 0.01, 0.05),
dta3 + tau = c(33, 66))
head(dta3)
time censor1 93.64262 1
2 63.47413 1
3 84.54253 1
4 89.01574 1
5 73.92685 1
6 23.67631 1
< weib_memsim(n = 50, endtime = 100, gamma = 2,
dta4 + theta = c(0.0001, 0.0002, 0.0001), tau = c(33, 66))
head(dta4)
time censor1 59.47848 1
2 100.00000 0
3 62.08739 1
4 100.00000 0
5 100.00000 0
6 100.00000 0
As seen in these examples, all four functions return a dataset with the survival times and a censoring indicator.
To compare the performance of the inverse hazard method with the memoryless method under different settings, we conducted a simulation study using cpsurvsim. We simulated data with one, two, three, and four changepoints using both the exponential and Weibull distributions. In our simulation, time \(t\) ranged from 0100 and changepoints occurred at various times within that range. Sample sizes of 50, 100, and 500 were tested and values of \(\theta\) were chosen to demonstrate differences between the simulation methods when the hazard rate changes (e.g., smaller to larger hazard versus larger to smaller hazard). For the Weibull simulations, we set \(\gamma=2\). We conducted 10,000 simulations of each setting. We are primarily interested in comparing the ability of these two simulation methods to simulate data with the correct changepoints \(\tau_i\). Therefore, we compared how often the estimated value (\(\hat{\tau_i}\)) was within a 10% range, in this case \([\tau_i5, \tau_i+5]\) based on our time range. We also evaluated whether the known values of \(\tau_i\) fell within the 95% confidence interval of the average simulated values for both methods as well as discuss bias in the model parameters. This simulation study was conducted in R 3.6.1.
When simulating from the exponential distribution, these two simulation methods had comparable accuracy in terms of the location of the changepoints (see Figure 1). Sample size, however, had a large impact on the accuracy regardless of the simulation method. When estimating one changepoint with a sample size of 50, there were a few simulation templates where less than a third of estimates \(\hat{\tau}\) were within range of the known changepoint (Figure 1a). In general, accuracy improved as the sample size increased. For every simulation scenario using the exponential distribution, the 95% confidence interval for the mean estimate of \(\tau_i\) included the known value.




Simulations for the Weibull distribution, however, revealed important differences in accuracy between the two methods (see Figure 2). Although accuracy of the two methods was similar when simulating one changepoint, there were many cases where accuracy was very low, even with a sample size of 500 (Figure 2a). In almost one quarter (\(4/18\)) of the simulation scenarios, the true value of \(\tau\) was not within the 95% confidence interval of the average estimate for either method. In three of those scenarios, both simulation methods severely underestimated \(\tau\) when the true value was 80. When simulating two changepoints (Figure 2b), accuracy of the first changepoint was often much lower when using the memoryless method, especially with a larger sample size. All except one of the simulation scenarios where the known value of \(\tau_1\) did not fall within the 95% confidence interval of the average estimate for the memoryless method had a sample size of 500. Accuracy was lower for all changepoints in the three changepoint simulations when using the memoryless method and the discrepancies between the two methods were larger for larger sample sizes (Figure 2c). In almost half of the simulation scenarios at least one value of \(\tau_i\) was not within the 95% confidence interval of the mean estimate for the memoryless method and all except one of those scenarios had a sample size of 500. When simulating four changepoints (Figure 2d), for most scenarios there was a large drop in accuracy at the first and third changepoint for the memoryless method compared to the inverse hazard method. At the second and fourth changepoints, however, there was a drop in accuracy for sample sizes of 50 and 100 whereas most simulation scenarios with a sample size of 500 had similar accuracy between the two methods. Every simulation scenario for four changepoints with a sample size of 500 using the memoryless method had at least one changepoint where the 95% confidence interval did not include the known value of \(\tau_i\). The known value of \(\tau_4\) was, however, included in the 95% confidence interval for every scenario with a sample size of 500.

