Research on climate change impacts can require extensive processing of climate model output, especially when using ensemble techniques to incorporate output from multiple climate models and multiple simulations of each model. This processing can be particularly extensive when identifying and characterizing multi-day extreme events like heat waves and frost day spells, as these must be processed from model output with daily time steps. Further, climate model output is in a format and follows standards that may be unfamiliar to most R users. Here, we provide an overview of working with daily climate model output data in R. We then present the futureheatwaves package, which we developed to ease the process of identifying, characterizing, and exploring multi-day extreme events in climate model output. This package can input a directory of climate model output files, identify all extreme events using customizable event definitions, and summarize the output using user-specified functions.
Research on climate change impacts can require extensive processing of climate model output data. This is not only because output files for a single climate model can be large, but also because of the rising popularity of ensemble techniques (Cubasch et al. 2013) in which, to better characterize uncertainty in projections, impacts are assessed for multiple climate models, multiple simulations of each climate model, and multiple climate experiments. Such ensemble techniques help characterize uncertainty in projections of regional climate change over the next century due to three distinct sources: (1) internal climate variability, i.e., climate noise, (2) climate model uncertainty, i.e. the same forcing can produce a different response in different models and (3) scenario uncertainty, i.e., uncertainty in future climate forcings (e.g. Hawkins and Sutton (2009)).
A key source of data for ensemble techniques is the Coupled Model Intercomparison Project, Phase 5 (CMIP5; Taylor et al. (2012)). This project brought together major climate modeling groups around the world to simulate the same future radiative forcing scenarios, but with their own models. This created an ensemble of state-of-the-art climate model projections that allows researchers to study projections and their uncertainties. Most of these modeling groups additionally performed more than one simulation for each scenario and model (i.e. multiple ensemble members), perturbing the initial conditions by a very tiny amount to quantify uncertainties due to internal climate variability.
We begin this article with an overview of CMIP5 climate model output data for R users, focusing on output with a daily time step. We outline where data from CMIP5 can be obtained as well as how to work with the file format (netCDF) from R. We overview some R packages that can be useful when working with this data, as well as aspects of the data (e.g., non-standard calendars) of which users should be aware when working with daily climate model output in R.
After this overview, we present the futureheatwaves package, which we created to aid in identifying and characterizing any type of multi-day extreme event from daily climate model output (Table 1). The impacts of multi-day extreme events must be assessed using output in daily time step, unlike other climate impacts that can be assessed using climate model output at monthly, seasonal, or yearly time steps. Further, extreme events are identified based on conditions that are rare for a certain location (e.g., 98th percentile of local temperature distribution for identifying heat waves) (Cubasch et al. 2013). In this case, the event definition must be determined at each study location from climate model output before events can be identified. Finally, it is often of interest to create summaries of multiple characteristics of these extreme events. For example, one may be interested in determining whether the frequency or characteristics (e.g., length, intensity) of heat waves or warm spells will change under certain climate change scenarios (Cubasch et al. 2013).
The futureheatwaves package handles these challenges and can be used to identify and characterize a variety of multi-day extreme events across different ensemble members of one or more climate models. It also provides some functionality specifically useful in identifying and characterizing heat waves. Quantification of the impacts of heat waves on human health suffers from additional sources of uncertainty beyond those inherent in projections of regional changes in surface temperature. These include: (1) uncertainty in the definition of a heat wave itself and (2) uncertainty in the ability of communities to adapt to changing temperatures (the adaptation scenario). This package therefore allows the user to create and use a custom extreme event definition to identify events in the climate model output, as well as providing options to explore different scenarios of adaptation to heat.
Design goals of the futureheatwaves package | |
---|---|
1 | Make processing of large ensembles of climate simulations more practical for researchers exploring the potential impacts of heat waves and other multi-day extreme events. |
2 | Speed up processing time by incorporating C++ in event identification. |
3 | Keep track of the names of climate models and number of ensemble members processed for each. |
4 | Not only identify, but also characterize, all extreme events within each climate simulation, to allow the exploration of patterns in these characteristics across different simulations and also to allow the use of more complex impacts models, including models that incorporate event characteristics (e.g., event length, event intensity). For example, this package allows the user to apply a health-effects model where risk of mortality is not the same for every heat wave, but rather is modified by heat wave length, intensity, or other measured characteristics. |
5 | Give users extensive power in customizing the process, including allowing custom extreme event definitions. |
6 | Allow users to easily explore the extreme events identified within all climate simulations. |
7 | Create output that is in a “tidy" data format, allowing it to work well with ggplot2 for visualization. |
For climate impact studies, a main source of data is the Coupled Model Intercomparison Project, which is currently in its fifth phase (CMIP5). Over 20 climate modeling groups created one or more climate models which, for this project, were run using standardized scenarios (Taylor et al. 2012). The resulting output is uniform across modeling groups and has a consistent structure, which allows comparison of simulations from different models (Flato et al. 2013). CMIP5 climate model output is archived at a number of different time steps (e.g., daily, monthly, seasonal, yearly) (Taylor and Doutriaux 2010), and some variables are reported at multiple levels in the ocean or atmosphere (e.g., ocean temperature is reported at different ocean depths). Here, we will focus on data with a daily time step for variables reported at a single level (e.g., near-surface air temperature).
Each modeling group ran simulations for CMIP5 under several experiments, with experiments varying in terms of radiative forcing scenarios through the use of different scenarios of time-varying model inputs (greenhouse gas emissions or concentrations, land use changes, etc.) (Taylor et al. 2012; Flato et al. 2013). Experiments include historical experiments (run using radiative forcing consistent with observed and reconstructed data for 1850–2005), pre-industrial control experiments, and experiments of future scenarios of radiative forcing over the 21st century or longer (e.g., RCP4.5, RCP8.5) (Taylor et al. 2012). Some modeling groups created ensembles of output for a specific model and experiment, in which they ran the experiment multiple times with very small changes to the initial conditions.
The CMIP5 climate model output data are distributed across data nodes at different climate modeling centers (Taylor et al. 2012), but can be accessed centrally at the World Climate Research Programme CMIP5 data portal at https://pcmdi.llnl.gov/search/cmip5/. Users must register before downloading data, and some data are restricted to non-commercial use. There is a separate file for each combination of climate model, experiment, modeling realm (e.g., atmosphere, ocean), variable, time step, and ensemble member (Taylor and Doutriaux 2010; Taylor et al. 2012). For finer time scales, the output is further split across multiple files for specific year ranges (e.g., 5 years of output for each file) (Taylor and Doutriaux 2010). Each file’s name includes the output variable, climate model, experiment, and ensemble member for the simulation (Taylor and Doutriaux 2010).
CMIP5 files can be searched and downloaded through a point-and-click web
interface. They can also be downloaded in bulk to computers with Unix or
Mac operating systems using the wget
file downloading utility.
Appropriate wget
scripts can be created either through the World
Climate Research Programme CMIP5 data portal or through the Earth System
Grid Federation’s Search RESTful API. Tips on efficiently searching and
downloading the data, including through the use of wget scripts and the
search API, are available as user tutorials through the website of the
University of Colorado Boulder’s Earth System CoG (e.g.,
https://www.earthsystemcog.org/projects/cog/doc/wget for a tutorial on
downloading files using wget).
CMIP5 files are saved in Network Common Data Format (netCDF), a binary file format that allows storage of data representing a regular array. For climate model output at a single level (e.g., near-surface air temperature), the data is a 3-dimensional array, with dimensions representing time and two coordinates of location (e.g., latitude and longitude). Figure 1 provides a sketch of the structure of netCDF files for single-level climate model output.
Each data point in the netCDF array gives the modeled value of the variable (e.g., surface temperature) for a single time point and location. Global climate models generate output at regularly-spaced time steps, typically at regularly-spaced grid points around the world. The latitude and longitude spacing of grid points vary by climate model, but are typically 1–2 degrees for atmospheric variables in CMIP5 models (Flato et al. 2013). For CMIP5 climate model output, the location units are in degrees east and degrees north for longitude and latitude, respectively. For daily output files, the time unit is in days since a specified origin date-time (e.g., days since 1850-01-01 00:00:00) (Taylor and Doutriaux 2010).
All CMIP5 output files are required to include certain metadata (or “attributes”) (Taylor and Doutriaux 2010), including the experiment, forcing agents input to the model to create the simulation, time step, institution and institutional contact information, climate model, and modeling realm (Taylor and Doutriaux 2010). The metadata also must include units for all of the dimension variables (e.g., longitude, latitude, time). The netCDF format allows one to access metadata and variables describing the dimensions of the data without reading the full file into memory.
To find out more about the CMIP climate model output data, Taylor et al. (2012) and Meehl et al. (2007) are excellent resources.
When working with daily climate model output data, challenges to R users include: (1) the file format, (2) use of non-Gregorian calendars, and (3) large file sizes. This section explains these challenges and offers some strategies for dealing with them.
CMIP5 data are available in the netCDF file format. Free specialty software exists to work with climate model output files in this format, including a collection of command line tools developed by the Max Planck Institute called Climate Model Operators (CDO) (Schulzweida 2017) and an interpreted language developed by the National Center for Atmospheric Research (NCAR) called the NCAR Command Language (NCL) (UCAR/NCAR Computational and Information Systems Laboratory 2017). Although such software can be used to quickly process netCDF climate model output files, they require learning a new language syntax, and so for R users may not be worth the computational speed gain compared to alternative solutions that can be scripted in the R language. While base R import functions do not exist for netCDF files, there are a few R package extensions that allow R users to work with the netCDF file format used for CMIP5 files directly from R. Older packages include ncdf and ncvar, but these do not work with the newer netCDF version 4 released in 2008 and are no longer available through CRAN. More recent packages, including ncdf4 (Pierce 2015) and RNetCDF (Michna and Woods 2013, 2016), work with both version 4 and netCDF’s older version 3. While climate model output data for CMIP5 are required to conform with the earlier version (version 3) (Taylor and Doutriaux 2010), it is safer to write code using functions that can be used with either version, in case future phases of CMIP do not require files to conform with netCDF version 3.
You can do a number of things with netCDF files in R using these
packages. For example, ncdf4’s nc_open
function can be used to open
a connection to a netCDF file; the object returned by the function
includes the file’s attribute data. Once a file connection is open,
variables can be read in using the ncvar_get
function. For example,
the variables defining the dimensions of the sketched netCDF file in
Figure 1 could be read into R with ncvar_get
with the varid
parameter set to “lat”, “lon”, or “time”.
The climate output variable (e.g., near-surface air temperature) can
similarly be read in using ncvar_get
. In this case, the varid
parameter should be set using the appropriate CMIP5 variable name (e.g.,
“tas” for near-surface air temperature); these variable names can be
found in the CMIP requested output tables (Taylor and Doutriaux 2010). If only a
subset of the full file is needed, the dimensional time and location
data can be used to identify the location of the needed data in the
netCDF array and this information can then be used to read a portion of
data into memory (e.g., with the nc.get.var.subset.by.axes
function in
ncdf4.helpers).
Once the user is done reading in data from the file, the connection to
the netCDF should be closed (e.g., with the nc_close
function from
ncdf4).
A second challenge when working with climate model output data in R is
that some climate models output to non-Gregorian calendars. Since the
late 1500s, Western dates have been set using the Gregorian calendar,
which has 365.2425-day years. Some climate models, however, are run
using different calendars, including the Julian calendar (365.25-day
years), a calendar where there are no leap years (365-day years), a
calendar where every year is a leap year (366-day years), and a calendar
of twelve 30-day months (360-day years) (Eaton et al. 2011). With these
non-Gregorian calendars, R’s base functions for converting a vector to a
Date class based on the number of days since an origin date (as.Date
,
as.POSIXct
) do not return the desired values.
Two R packages provide help with non-Gregorian calendars:
PCICt (Bronaugh and Drepper 2013) and
ncdf4.helpers
(Bronaugh 2014). The nc.get.time.series
function in ncdf4.helpers
pulls and uses metadata on the calendar stored in the CMIP5 netCDF
file’s attributes to convert the “time” variable in the file to an
object of the PCICt
class. This class is defined in the PCICt
package and provides date-like functionality for 360- and 365-day
calendars (Bronaugh and Drepper 2013). However, while these functions will help with
handling most CMIP5 files, the CMIP5 standards allows use of other
calendars which may not be successfully handled by these functions, so
it is important to assess whether the time variable range in the PCICt
object correctly matches the expected date ranges for a file when
processing CMIP5 data in R. While most CMIP5 climate models use the same
calendar for all experiments, a few do not; a full table of the
calendars used for each climate model and experiment, pulled from netCDF
metadata, is available at
https://www.earthsystemcog.org/projects/cog/faq_data.
Finally, the size of CMIP5 files can make them difficult to work with in R. CMIP5 climate model output files can be as large as several gigabytes. The size of the files can therefore be large enough that it may make more sense to work with smaller chunks of the data in R, rather than reading all data into memory and working with the data all at once (Todd-Brown and Bond-Lamberty 2016). This problem aggregates when working with multiple climate models and more than one ensemble member for each of those climate models.
In addition to these general packages for working with netCDF files, there are several R packages specifically for working with climate model output data, including RCMIP5 (Todd-Brown and Bond-Lamberty 2016) and wux (Mendlik et al. 2016). However, these packages are more useful for working with data output at time steps of a month or higher and have limited utility with the daily climate model output data required for studies of multi-day extreme events.
The RCMIP5 package includes functions to read in CMIP5 data from
netCDF files, scan a directory of CMIP5 files and determine models with
continuous available data, create objects of a special cmip5data
class
to work with CMIP5 data within R, and parse the file names for all files
in a directory to extract information within the file name. For this
package, most functions only work with monthly or less frequent data
(Todd-Brown and Bond-Lamberty 2016). While the loadCMIP5
function does successfully load daily
data as a cmip5data
object, most of the methods for this object type
do not do anything meaningful for daily data. The package’s
getFileInfo
function, however, will work with CMIP5 files of any time
step; this function identifies all CMIP5 files in a directory and
creates a data frame with information parsed from the file name. The
get.split.filename.cmip5
function in the ncdf4.helper package
similarly can be used to parse information contained in CMIP5 file names
(Bronaugh 2014).
The wux package (Mendlik et al. 2016) includes functions that allow the user to
download CMIP5 output at a monthly time step directly from R with the
CMIP5fromESGF
function. The package then uses the models2wux
function to read climate model output netCDF files and convert it to
“WUX” data frames, which can be used by other functions in the package.
While this function can input climate model output with daily time steps
(the “what.timesteps” element of the modelinput
list input must be set
to “daily”), the function aggregates this data to a monthly or less
frequent (e.g., seasonal) aggregation when creating the WUX data frame.
Therefore, while this package provides very useful functionality for
working with averaged output of daily climate model output data or with
output data at a larger time step, it cannot easily be used to identify
and characterize multi-day extreme events like heat waves.
The functions and packages described in this section can be used with CMIP5 netCDF files to do things in R like map near-surface air temperatures from a single climate model on a specific day (Figure 2) or pull a time series of daily near-surface air temperature simulations at a specific climate model grid point (Figure 3). The futureheatwaves’ “Starting from netCDF files” vignette (https://cran.r-project.org/web/packages/futureheatwaves/vignettes/starting_from_netcdf.html) provides all code required to create these figures, as well as more details and code examples on working with CMIP5 netCDF files in R.
We created the futureheatwaves package to aid in identifying, characterizing, and exploring multi-day extreme events in daily climate model output data. While most of the discrete tasks involved in identifying and characterizing multi-day extreme events are fairly straightforward, the full process can be code-intensive, especially for multi-city studies, studies that test sensitivity to how an event is defined, or studies that incorporate different scenarios of adaptation in the case of events defined using a threshold relative to community climate. Our aim in developing this package was therefore to make the full process of identifying and characterizing these extreme events much more convenient and so facilitate the use of multi-model, multi-ensemble member analyses in climate impact studies conducted by non-climate scientists.
Figure 4 gives an overview of the two primary functions
of the futureheatwaves package. First, the gen_hw_set
function
processes a directory of climate projection files that are stored
locally on the user’s computer (Figure 4, “Climate
projections”), to generate a list of all extreme events in each
projection, as well as over a dozen characteristics of each identified
extreme event (Table 2). This package start
from files rather than R objects to avoid loading data from all climate
model ensembles at once; instead, the function loads, processes, and
saves output for a single climate model ensemble member at a time. The
extreme events are identified and characterized at one or more study
locations (e.g., cities), which the user specifies in an input file. The
extreme events identified for each ensemble member are output as
separate files in a directory specified by the user (Figure
4, “Extreme events datasets”).
Once the user creates these data frames of location-specific extreme
events, the apply_all_models
function can be used to apply custom
functions across all the extreme event data frames. This functionality
allows users to create summaries of extreme events across all climate
models and ensemble members (Figure 4, right). The
function can be used to generate summary statistics (e.g., determine
average heat wave length or total frost days) or to apply more complex
functions (e.g., apply epidemiologic effect estimates across the heat
waves to generate health impact estimates).
When using this package, CMIP5 climate model output data require some pre-processing. Data will need to be saved in a specific format, with files stored in a specific directory structure. Full details of the required file and directory structure are provided in the package’s main vignette (https://cran.r-project.org/web/packages/futureheatwaves/vignettes/futureheatwaves.html), while tips and an R script for conducting this processing starting from CMIP5 netCDF files are given in the “Starting from netCDF files” vignette.
This package can be used for study locations worldwide. The “Starting from netCDF files” package vignette provides an example of using this package to identify and explore future heat waves in several Chinese cities.
Column name | Description of characteristic |
---|---|
mean.var |
Average daily value of the variable across all days in the extreme event, in the units in which the variable is expressed in input files (e.g., average daily mean temperature during the heat wave in degrees Kelvin) |
max.var |
Highest daily value of the variable across all days in the extreme event, in the units in which the variable is expressed in input files |
min.var |
Lowest daily value of the variable across all days in the extreme event, in the units in which the variable is expressed in input files |
length |
Number of days in the event |
start.date |
Date of the first day of the event |
end.date |
Date of the last day of the event |
start.doy |
Day of the year of the first day of the event (1 = Jan. 1, etc.) |
start.month |
Month in which the event started (1 = January) |
days.above.abs.thresh.1 |
Number of days in the event above a specified absolute threshold (default is the number of days in the event above 80oF / 26.7oC, but this and the following three absolute thresholds can be changed with the absolute_thresholds argument in gen_hw_set ) |
days.above.abs.thresh.2 |
Number of days in the event above a specified absolute threshold (default is the number of days in the event above 85oF / 29.4oC) |
days.above.abs.thresh.3 |
Number of days in the event above a specified absolute threshold (default is the number of days in the event above 90oF / 32.3oC) |
days.above.abs.thresh.4 |
Number of days in the event above a specified absolute threshold (default is the number of days in the event above 95oF / 35.0oC) |
days.above.99th |
Number of days in the event above the 99th percentile of the variable for the location, using the period specified with the referenceBoundaries argument in gen_hw_set as a reference for determining these percentiles |
days.above.99.5th |
Number of days in the event above the 99.5th percentile of the variable for the location, using the period specified with the referenceBoundaries argument in gen_hw_set as a reference for determining these percentiles |
first.in.year |
Whether the event was the first to occur in its calendar year in the location |
mean.var.quantile |
The percentile of the average variable value during the event compared to the location’s year-round distribution of the variable, based on the variable distribution for the location during the period specified by the referenceBoundaries argument in gen_hw_set |
max.var.quantile |
The percentile of the maximum variable value during the event compared to the location’s year-round distribution of the variable, based on the variable distribution for the location during the period specified by the referenceBoundaries argument in gen_hw_set |
min.var.quantile |
The percentile of the minimum variable value during the event compared to the location’s year-round distribution of the variable, based on the variable distribution for the location during the period specified by the referenceBoundaries argument in gen_hw_set |
mean.seasonal.var |
The location’s average seasonal value of the variable (by default, season is set to May–September, but this can be changed with the seasonal_months argument in gen_hw_set ), based on the variable values for the location during the years specified by the referenceBoundaries argument in gen_hw_set |
mean.yearround.var |
The location’s average year-round value of the variable, based on the variable values for the location during the years specified by the referenceBoundaries argument in gen_hw_set |
We have included data files in the package to serve as example files so
that users can try this package before applying it to their own
directory of climate projection files. These example data come from two
climate models that are a part of CMIP5: (1) the model of the Beijing
Climate Center, China Meteorological Administration (BCC)
(Xin et al. 2013) and (2) the National Center for Atmospheric
Research’s (NCAR’s) Community Climate System Model, version 4 (CCSM4)
(Gent et al. 2011). We include one ensemble member from BCC (r1i1p1)
and two from CCSM (r1i1p1 and r2i1p1). Once the futureheatwaves
package is installed and loaded, the user can find the location of these
files on his or her computer using R’s system.file
function.
To ensure that the size of this example data is reasonably small, we have only included projection data for grid points from these climate models that are near five U.S. east coast cities: New York, NY; Philadelphia, PA; Newark, NJ; Baltimore, MD; and Providence, RI. Further, to keep the file sizes reasonably small, the historical projections range over the years 1990 to 1999, while the future projections are limited to 2060 to 2079. Users’ applications of this package will likely use directories with many more climate model ensemble members and more locations; however, the operation of the package is the same for this smaller example application, as it would be for a much larger application.
Once climate model output files are set up, as specified in the
“futureheatwaves” package vignette, the package can process them to
identify and characterize heat waves in each ensemble member’s
projection for each location using the gen_hw_set
function. For
example, to process the example climate model output data included with
the package, the user can run:
library(futureheatwaves)
<- system.file("extdata/cmip5",
projection_dir_location package = "futureheatwaves")
<- system.file("extdata/cities.csv",
city_file_location package = "futureheatwaves")
gen_hw_set(out = "example_results",
dataFolder = projection_dir_location ,
dataDirectories = list("historical" = c(1990, 1999),
"rcp85" = c(2060, 2079)),
citycsv = city_file_location,
coordinateFilenames = "latitude_longitude_NorthAmerica_12mo.csv",
tasFilenames = "tas_NorthAmerica_12mo.csv",
timeFilenames = "time_NorthAmerica_12mo.csv")
This code first identifies and saves as objects the path names on the
user’s computer of the example climate projections directory
(projection_dir_location
) and the file of study locations
(city_file_location
). The gen_hw_set
function processes the example
input and creates a new directory, example_results
, with files of
identified and characterized heat waves, in the user’s current working
directory. In this example code, the processing is done using default
values for the event definition, adaptation scenario, etc. How and why
to customize these choices are explained later in the text. Function
arguments (e.g., dataDirectories
, tasFilenames
) are used to specify
the format of the data and the directory structure.
Once the function has completed running, results will be written locally
to the directory specified by the out
argument of gen_hw_set
. This
directory will include files with some basic information about the
climate models and the closest grid points of each climate model to each
location, as well as a directory with files of identified and classified
extreme events for each ensemble member, including all characteristics
in Table 2. See the package’s vignettes for
more details on the content and structure of this output.
Once uses have created a directory of characterized event files for each
ensemble member (“Extreme event data sets”, Figure 4),
they can explore the results using the apply_all_models
function. This
function allows the user to apply custom R functions across all extreme
event data frames created by the gen_hw_sets
call. The user can apply
any R function that follows certain standards in accepting input and
returning output. Full details on these standards are given in the main
package vignette.
As an example, if the user wanted to calculate the average temperature of the heat waves identified for each ensemble member in the output generated by the code above, he or she could write a simple function:
<- function(hw_datafr){
average_mean_temp <- mean(hw_datafr$mean.var)
out return(out)
}
This function can then be applied across all extreme event data sets
output by gen_hw_set
using the apply_all_models
function. For
example, to apply this function to all the example output results that
come with the package, the user could run:
<- system.file("extdata/example_results", package = "futureheatwaves")
out apply_all_models(out = out, FUN = average_mean_temp)
#> model ensemble value
#> 1 bcc1 1 302.3745
#> 2 ccsm 1 302.4458
#> 3 ccsm 2 302.3428
This output gives the results (value
column) of running the custom
function for each ensemble member of each climate model. Note that the
location of the directory with the heat wave data frames must be
specified using the out
argument when calling apply_all_models
.
Typically, this will be the directory path for the directory specified
with the out
argument in gen_hw_set
.
Location-specific results can be generated using the city_specific
argument in apply_all_models
:
apply_all_models(out = out, FUN = average_mean_temp, city_specific = TRUE)
#> model ensemble city value
#> 1 bcc1 1 balt 305.1816
#> 2 bcc1 1 nwk 300.3367
#> 3 bcc1 1 ny 300.3367
#> 4 bcc1 1 phil 305.1816
#> 5 bcc1 1 prov 298.0402
#> 6 ccsm 1 balt 303.1277
#> 7 ccsm 1 nwk 302.4053
#> 8 ccsm 1 ny 302.4053
#> 9 ccsm 1 phil 302.3425
#> 10 ccsm 1 prov 301.8895
#> 11 ccsm 2 balt 302.9373
#> 12 ccsm 2 nwk 302.2748
#> 13 ccsm 2 ny 302.2748
#> 14 ccsm 2 phil 302.2858
#> 15 ccsm 2 prov 301.9520
The same process can be used to create a number of other summaries of
the identified extreme events. For example, it could be used to
determine average length of extreme events or estimate how much earlier
in the year events are expected to start across an ensemble of climate
model simulations. The functionality can also be used for more complex
analysis of extreme event files. For example, it can be used to apply
epidemiological models of heat wave to estimate excess heat-related
mortality under different future scenarios; an example of this
application is provided in the main futureheatwaves vignette. The
output from apply_all_models
is structured as “tidy" data
(Wickham 2014), allowing it to be used easily with the graphing
package ggplot2
(Wickham 2009) and other packages in the tidyverse.
By default, the package identifies extreme events in climate model output data using a specific definition for heat waves that has been used in some epidemiological and climate impact research (e.g., Anderson and Bell (2009)):
A heat wave is two or more days at or above a city-specific threshold temperature, with the threshold determined as the 98th percentile of year-round temperature in the city during some reference period (by default, 1990–1999).
However, this is not the only accepted heat wave definition in the scientific literature. A variety of different heat wave definitions have been used to identify heat waves in a time series of temperature data (Anderson and Bell 2009; Smith et al. 2013; Kent et al. 2014; Chen et al. 2015), and the choice of heat wave definitions can influence both projected heat wave trends (Smith et al. 2013) and estimates of health risks during events (Anderson and Bell 2009; Kent et al. 2014; Chen et al. 2015). Further, other types of extreme events will be defined differently than heat waves (for example, frost day spells may be defined as one or more days with temperature at or below 32oF / 0oC).
Therefore, this package allows the user to extensively customize the definition used to identify extreme events. Users can write a custom R function with either a different heat wave definition (see Smith et al. (2013) and Kent et al. (2014) for listings of some of the definitions used in scientific studies) or with a definition appropriate for a different type of extreme event (e.g., one or more days at or below 32oF / 0oC for frost day spells). For heat wave identification, researchers might want to use a different event definition because, for example, it matches the definition used by local health officials to declare heat wave warnings or, in the case of health impact assessments, to match with a definition used in an epidemiological study. For studies of other extreme events, a heat wave definition likely will not be applicable and so a customized definition is necessary.
Three components of the extreme event definition can be easily
customized in the gen_hw_set
function call, without creating a new R
function to use to identify heat waves. First, many extreme event
definitions are based on conditions that are rare in the study location
(Cubasch et al. 2013), but definitions may vary in how rare conditions must be. For
example, some of the different definitions used to identify heat waves
vary only in the percentile temperature used for a threshold (e.g., one
definition is \(\ge2\) days at or above the 98th percentile temperature
at a location while another is \(\ge2\) days at or above the 99th
percentile temperature; (Smith et al. 2013; Kent et al.; 2014)). Therefore, the
futureheatwaves package allows users to change the percentile of the
variable of interest required for an extreme event using the
probThreshold
option in gen_hw_set
. Other heat wave definitions vary
only in the number of consecutive days that must be over the threshold
for a period to quality as an extreme event (e.g., one definition is
\(\ge2\) days at or above the 98th percentile temperature at a location
while another is \(\ge4\) days at or above the 98th percentile
temperature; Anderson and Bell (2009)). Therefore, the package allows the
user to change the number of days used in the heat wave definition using
the numDays
argument in the gen_hw_set
function. Combined, these two
customization choices allow the user to identify heat waves using many
of the heat wave definitions used in previous climate and health
research– for example, 9 of 16 heat wave definitions outlined in
Kent et al. (2014) could be fit using different combinations of these two
options for specifying threshold percentile and number of days. Third,
some extreme events like cold waves and frost day spells are defined as
a certain number of days below, rather than above, a threshold. While
the default is to identify events by searching for days above a
threshold, this behavior can be changed with the
above_threshold = FALSE
argument in the gen_hw_set function
.
Beyond these simpler options, the customization of the event definition
is even more extensive as one has the option of writing and using a
custom R function to identify extreme events. This functionality allows
the user to use definitions that either require a number of days above
or below an absolute threshold (e.g., maximum temperature of \(\ge\)
95oF for \(\ge1\) day (Tan et al. 2007; Kent et al.; 2014); minimum temperature
\(\le\) 0oC for \(\ge1\) day for frost day spells) or that require a
combination of thresholds to be met (e.g., maximum daily temperature
above a lower threshold every day of the heat wave and above a higher
threshold for a certain number of days within the heat wave;
(Peng et al. 2011; Kent et al.; 2014)). To use a customized event definition,
the user must write and load an R function that implements the
definition. This custom function is passed to the gen_hw_set
function
using the IDheatwavesFunction
argument. To work correctly, this custom
function must allow only specific inputs and generate only specific
outputs; details about the required structure are provided in the main
futureheatwaves package vignette. To increase processing speed when
identifying extreme events, we coded parts of the default event
definition function in C++ and synced it with R using the
Rcpp package (Eddelbuettel and Francois 2011). Users
should consider a similar strategy for custom heat wave definitions,
especially when processing a large number of climate projection files.
Extreme events tend to be identified based on conditions that are rare
for a specific location using location-specific relative thresholds.
These thresholds are often defined for climate impact studies based on a
variable’s distribution at that location in present-day or historical
data. However, for some extreme events, impacts are associated with how
rare the conditions during the event are compared to current norms in
the location (Anderson and Bell 2009), which suggests some capacity for
adaptation to heat and raises the question of whether extreme events
should be defined using a percentile threshold based on present-day
variable distributions or based on distributions in the time period
being projected. Therefore, it can be interesting to explore trends in
extreme events under climate change if extreme events are identified
based on variable distributions during the projection period or another
future period. The futureheatwaves package allows users to specify the
time period to use when determining a location-specific relative
threshold for an event definition using the thresholdBoundaries
argument in the function gen_hw_set
. This feature allows users to
explore how sensitive projections of impacts are to this choice of the
time period to use when determining relative variable measures,
including thresholds used for percentile-based event definitions.
Similarly, some of the event characteristics (e.g.,
mean.temp.quantile
, Table 2) are also
calculated by the package based on relative temperature, providing
measures of how the value of the variable of interest during an extreme
event compares to the typical distribution of that variable at that
location (e.g., the “mean.var.quantile”, “min.var.quantile”, and
“max.var.quantile” characteristics, Table 2)
or how long conditions of a certain rarity persisted during the event
(e.g., the “days.above.99th” and “days.above.99.5th” characteristics,
Table 2). These characteristics are measured
for each of the extreme events identified by the gen_hw_set
function
by taking the absolute value of the variable during the event (e.g.,
average temperature during the heat wave is 90oF, 32.2oC) and
comparing it to the location’s typical variable distribution. This
process generates relative measures of how intense the event is compared
to what is normal in that location (e.g., 90oF, 32.2oC is in the
99th percentile of year-round temperatures in the location).
These relative event characteristics will vary depending on whether you
calculate them based on a location’s present-day variable distribution
or on the location’s variable distribution in the future, since the
distributions of many relevant variables (e.g., temperature,
precipitation) are expected to change in many locations with climate
change. The package therefore allows the user to specify date ranges of
the temperature distributions to be used in calculating these relative
temperature metrics in each location, which can be done using the
referenceBoundaries
option of gen_hw_set
.
Finally, it can be useful to explore the location of the climate model
grid point used to pull climate model output for each study location
with a given climate model. Therefore, the package has a function called
map_grid_leaflet
that plots the locations of grid points used for each
location from each climate model. This function is built using the
htmlWidget leaflet
package (Cheng and Xie 2016). The following code illustrates the use of this
function with the example data to create Figure 5, which
plots the grid points used in the example data from the BCC climate
model in the example data:
<- system.file("extdata/example_results", package = "futureheatwaves")
out map_grid_leaflet(plot_model = "bcc1", out = out)
This interactive map can be panned and zoomed to explore the locations of climate model grid points used to represent each study location. This mapping function works for study locations worldwide.
While this package was created to be used for research on extreme events in climate change projections, it can be used more broadly. For example, there are other episodes like wildfires and air pollution where it may be interesting to identify extended periods of high exposures in projection time series. The futureheatwaves package is not exclusive to CMIP5 model output data, and so could be applied to gridded air pollution model output to explore these exposures.
Research that assesses the potential impacts of climate change is
critical in informing current policy choices, and R is an important tool
for many researchers performing such assessments. While the
futureheatwaves
package described here takes steps to facilitate the
assessment of impacts related to sustained, multi-day events, a number
of challenges remain in working with climate model output data in R, and
future R software development offers the potential to further address
the challenges of working with this data.
One important step in future development of R software to work with climate model output could be the development of R wrappers for some of the existing command line tools available through the Climate Data Operators (CDO) software (Schulzweida 2017). Libraries already exist for Python and Ruby that allow the functionality of CDO tools to be used within these scripting languages (available from the Max Planck Institute at https://code.zmaw.de/projects/cdo/wiki/Cdo%7Brbpy%7D). While one R package (ncdf4.helpers; Bronaugh (2014)) already provides R wrappers for a few CDO operators, such functionality could be extended through future R software to capture more of the full functionality of the CDO toolkit.
Another important path for development could be through approaches that
allow researchers to take advantage of the statistical tools offered by
R while maintaining large climate model output files in a netCDF format.
For example, Goncalves and coauthors recently described a “round table"
approach of connecting as-needed data access from netCDF climate data
files through to functionality available in R and CDO through the
intermediary of a MonetDB database system (Goncalves et al. 2015). In
other topical areas, R programmers are also improving the efficiency of
working with data in large netCDF files through approaches that avoid
loading all data in-memory. For example, the Bioconductor package
ncdfFlow
enables R users to conduct analyses of hundreds of large flow cytometry
output files through the creation and use of an ncdfFlowSet
class that
stores the data in an HDF5 format rather than in-memory
(Finak et al. 2014; Jiang et al. 2017). Such an approach could be promising for
future R software development for working with large climate model
files.
This work was supported by grants from the National Institute of Environmental Health Sciences (R00ES022631), the National Science Foundation (1331399), and the Colorado State University Vice President for Research. Thank you to early testers of the package: Julia Bromberek, Wande Benka-Coker, Josh Ferreri, Ryan Gan, Molly Gutilla, Mike Lyons, Casey Quinn, Rachel Severson, and Meilin Yan.
futureheatwaves, ggplot2, ncdf4, RNetCDF, ncdf4.helpers, PCICt, RCMIP5, wux, Rcpp, leaflet
HighPerformanceComputing, NumericalMathematics, Phylogenetics, Spatial, SpatioTemporal, TeachingStatistics
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Anderson, et al., "Working with Daily Climate Model Output Data in R and the futureheatwaves Package", The R Journal, 2017
BibTeX citation
@article{RJ-2017-032, author = {Anderson, G. Brooke and Eason, Colin and Barnes, Elizabeth A.}, title = {Working with Daily Climate Model Output Data in R and the futureheatwaves Package}, journal = {The R Journal}, year = {2017}, note = {https://rjournal.github.io/}, volume = {9}, issue = {1}, issn = {2073-4859}, pages = {124-137} }