Reproducible Summary Tables with the gtsummary Package

The gtsummary package provides an elegant and flexible way to create publication-ready summary tables in R. A critical part of the work of statisticians, data scientists, and analysts is summarizing data sets and regression models in R and publishing or sharing polished summary tables. The gtsummary package was created to streamline these everyday analysis tasks by allowing users to easily create reproducible summaries of data sets, regression models, survey data, and survival data with a simple interface and very little code. The package follows a tidy framework, making it easy to integrate with standard data workflows, and offers many table customization features through function arguments, helper functions, and custom themes.

Daniel D. Sjoberg (Memorial Sloan Kettering Cancer Center) , Karissa Whiting (Memorial Sloan Kettering Cancer Center) , Michael Curry (Memorial Sloan Kettering Cancer Center) , Jessica A. Lavery (Memorial Sloan Kettering Cancer Center) , Joseph Larmarange (Centre Population & Développement, IRD, Université de Paris, Inserm)
2021-06-22

1 Introduction

Table summaries are a fundamental tool in an analyst’s toolbox that help us understand and communicate patterns in our data. The ability to easily create and export polished and reproducible tables is essential. The gtsummary (Sjoberg et al. 2020) package provides an elegant and flexible framework to create publication-ready analytical and summary tables in R. This package works to close the gap between a reproducible RMarkdown report and the final report. Specifically, gtsummary allows the user to fully customize and format summary tables with code, eliminating the need to modify any tables by hand after the table has been exported. Removing the need to modify tables after the table has been created eliminates an error-prone step in our workflow and increases the reproducibility of our analyses and reports.

Using gtsummary, analysts can easily summarize data frames, present and compare descriptive statistics between groups, summarize regression models, and report statistics inline in RMarkdown reports. After identifying these basic structures of most tables presented in the medical literature (and other fields), we wrote gtsummary to ease the creation of fully-formatted, ready-to-publish tables.

Additionally, gtsummary leverages other analysis and tidying R packages to create a complete analysis and reporting framework. For example, we take advantage of the existing broom (Robinson et al. 2020) tidiers to prepare regression results for tbl_regression() and use gt (Iannone et al. 2020) to print gtsummary tables to various output formats (e.g., HTML, PDF, Word, or RTF). Furthermore, gtsummary functions are designed to work within a "tidy" framework, utilizing the magrittr (Bache and Wickham 2020) pipe operator and tidyselect (Henry and Wickham 2020) functions used throughout the tidyverse (Wickham et al. 2019).

While other R packages are available to present data and regression model summary tables, such as skimr, stargazer, finalfit, and tableone, gtsummary is unique in that it is a one-stop-shop for most types of statistical tables and offers diverse features to customize the content of tables to a high degree. The default gtsummary table is suitable to be published in a scientific journal with little or no additional formatting. For example, gtsummary has specific internal algorithms to identify variable data types, so there is no need for users to specify whether a variable should be displayed with categorical or continuous summaries, which yields summary tables with minimal code.

Along with descriptive summaries, gtsummary summarizes statistical models, survey data, survival data and builds cross-tabulations. After data are summarized in a table, gtsummary allows users to combine tables, either side-by-side (with tbl_merge()) , or on top of each other (with tbl_stack()). The table merging and stacking abilities allows analysts to easily synthesize and compare output from several tables and share information in a compact format. All tables in this manuscript were created using gtsummary v1.4.1.

2 Data Summaries

To showcase gtsummary functions, we will use a simulated clinical trial data set containing baseline characteristics of 200 patients who received Drug A or Drug B, as well as the outcomes of tumor response and death. Each variable in the data frame has been assigned an attribute label with the labelled package (Larmarange 2020), e.g., trial %>% set_variable_labels(age = "Age"), that will be shown in the summary tables. These labels are displayed in the gtsummary tables by default, and had labels not been assigned, the variable name would have been shown.

Table 1: Example data frame, trial
colname label class values
trt Chemotherapy Treatment character Drug A, Drug B
age Age numeric 6, 9, 10, 17, ...
marker Marker Level (ng/mL) numeric 0.003, 0.005, 0.013, 0.015, ...
stage T Stage factor T1, T2, T3, T4
grade Grade factor I, II, III
response Tumor Response integer 0, 1
death Patient Died integer 0, 1
ttdeath Months to Death/Censor numeric 3.53, 5.33, 6.32, 7.27, ...

tbl_summary()

The default output from tbl_summary() is meant to be publication-ready. The tbl_summary() function can take, at minimum, a data frame as the only input, and returns descriptive statistics for each column in the data frame. This is often the first table of clinical manuscripts and describes the characteristics of the study cohort. A simple example is shown below. Notably, by specifying the by= argument, you can stratify the summary table. In the example below, we have split the table by the treatment a patient received.

trial %>%
  select(age, grade, response, trt) %>%
  tbl_summary(by = trt)

graphic without alt text The function is highly customizable, and it is initiated with sensible default settings. Specifically, tbl_summary() detects variable types of input data and calculates descriptive statistics accordingly. For example, variables coded as 0/1, TRUE/FALSE, and Yes/No are presented dichotomously. Additionally, NA values are recognized as missing and listed as unknown, and if a data set is labeled, the label attributes are utilized.

Default settings may be customized using the tbl_summary() function arguments.

Table 2: tbl_summary() function arguments
Argument Description
label= specify the variable labels printed in table
type= specify the variable type (e.g., continuous, categorical, etc.)
statistic= change the summary statistics presented
digits= number of digits the summary statistics will be rounded to
missing= whether to display a row with the number of missing observations
missing_text= text label for the missing number row
sort= change the sorting of categorical levels by frequency
percent= print column, row, or cell percentages
include= list of variables to include in summary table

For continuous variables, tables display one row of statistics per variable by default. This can be customized, and in the example below, the age variable is cast to "continuous2" type, meaning the continuous summary statistics will appear on two or more rows in the table. This allows the number of non-missing observations and the mean to be displayed on separate lines.

In the example below, the "age" variable’s label is updated to "Patient Age". Default summary statistics for both continuous and categorical variables are updated using the statistic= argument. gtsummary uses glue (Hester 2020) syntax to construct the statistics displayed in the table. Function names appearing in curly brackets will be replaced by the evaluated value. The digits= argument is used to increase the number of decimal places to which the statistics are rounded, and the missing row is omitted with missing = "no".

trial %>%
  select(age, grade, response, trt) %>%
  tbl_summary(
    by = trt,
    type = age ~ "continuous2",
    label = age ~ "Patient Age",
    statistic = list(age ~ c("{N_nonmiss}", "{mean} ({sd})"),
                     c(grade, response) ~ "{n} / {N} ({p}%)"),
    digits = c(grade, response) ~ c(0, 0, 1),
    missing = "no"
  )

graphic without alt text A note about notation: Throughout the gtsummary package, you will find function arguments that accept a list of formulas (or a single formula) as the input. In the example above, the label for the age variable was updated using label = age\(\sim\)"Patient Age"—equivalently, label = list(age\(\sim\)"Patient Age"). To select groups of variables, utilize the select helpers from the tidyselect and gtsummary packages. The all_continuous() selector is a convenient way to select all continuous variables. In the example above, it could have been used to change the summary statistics for all continuous variables—all_continuous()\(\sim\)c("{N_nonmiss}", "{mean} ({sd})"). Similarly, users may utilize all_categorical() (from gtsummary) or any of the tidyselect helpers used throughout the tidyverse packages, such as starts_with(), contains(), etc.

In addition to summary statistics, the gtsummary package has several functions to add additional information or statistics to tbl_summary() tables.

Table 3: tbl_summary() functions to add information
Function Description
add_p() add p-values to the output comparing values across groups
add_overall() add a column with overall summary statistics
add_n() add a column with N (or N missing) for each variable
add_difference() add column for difference between two group, confidence interval, and p-value
add_stat_label() add label for the summary statistics shown in each row
add_stat() generic function to add a column with user-defined values
add_q() add a column of q-values to control for multiple comparisons

In the example below, descriptive statistics are shown by the treatment received and overall, as well as a p-value comparing the values between the treatments. Default statistical tests are chosen based on data type, and the statistical test performed can be customized in the add_p() function. p-value formatting can be adjusted using the pvalue_fun= argument, which accepts both a proper function, as well the formula shortcut notation used throughout the tidyverse packages.

trial %>%
  select(age, grade, response, trt) %>%
  tbl_summary(by = trt) %>%
  add_overall() %>%
  add_p(test = all_continuous() ~ "t.test",
        pvalue_fun = ~style_pvalue(., digits = 2))

graphic without alt text

tbl_svysummary()

The tbl_svysummary() function is analogous to tbl_summary(), except a survey (Lumley 2020) object is supplied rather than a data frame. The summary statistics presented take into account the survey weights, as do any p-values presented.

# convert trial data frame to survey object
svy_trial <- survey::svydesign(data = trial, ids = ~ 1, weights = ~ 1)

tbl_svysummary_1 <-
  svy_trial %>%
  tbl_svysummary(by = trt, include = c(trt, age, grade)) %>%
  add_p()

graphic without alt text

tbl_cross()

The tbl_cross() function is a wrapper for tbl_summary() and creates a simple, publication-ready cross tabulation.

trial %>%
  tbl_cross(row = stage, col = trt, percent = "cell") %>%
  add_p(source_note = TRUE)

graphic without alt text

tbl_survfit()

The tbl_survfit() function parses and tabulates survival::survfit() objects presenting survival percentile estimates and survival probabilities at specified times.

library(survival)

list(survfit(Surv(ttdeath, death) ~ trt, trial),
     survfit(Surv(ttdeath, death) ~ grade, trial)) %>%
  tbl_survfit(times = c(12, 24),
              label_header = "**{time} Month**") %>%
  add_p()

graphic without alt text

Customization

The gtsummary package includes functions specifically made to modify and format the summary tables. These functions work with any table constructed with gtsummary. The most common uses are changing the column headers and footnotes or modifying the look of tables through bolding and italicization.

Table 4: Functions to style and modify gtsummary tables
Function Description
modify_header() update column headers
modify_footnote() update column footnote
modify_spanning_header() update spanning headers
modify_caption() update table caption/title
bold_labels() bold variable labels
bold_levels() bold variable levels
italicize_labels() italicize variable labels
italicize_levels() italicize variable levels
bold_p() bold significant p-values

The gtsummary package utilizes the gt package to print the summary tables. The gt package exports approximately one hundred functions to customize and style tables. When you need to add additional details or styling not available within gtsummary, use the as_gt() function to convert the gtsummary object to gt and continue customization.

The example below is a common table reported in clinical trials and observational research where two treatments are compared. The treatment differences were added with the add_difference() function. The table includes customization using both gtsummary and gt functions. The gtsummary functions are utilized to bold the variable labels, update the column headers, and add a spanning header. Additional gt customization was utilized to add table captions and source notes.

trial %>%
  select(marker, response, trt) %>%
  tbl_summary(by = trt,
              missing = "no",
              statistic = marker ~ "{mean} ({sd})") %>%
  add_difference() %>%
  add_n() %>%
  add_stat_label() %>%
  bold_labels() %>%
  modify_header(list(label ~ "**Variable**", all_stat_cols() ~ "**{level}**")) %>%
  modify_spanning_header(all_stat_cols() ~ "**Randomization Assignment**") %>%
  as_gt() %>%
  gt::tab_header(
    title = gt::md("**Table 1. Treatment Differences**"),
    subtitle = gt::md("_Highly Confidential_")
  ) %>%
  gt::tab_source_note("Data updated June 26, 2015")

graphic without alt text

3 Model Summaries

Regression modeling is one of the most common tools of medical research. The gtsummary package has two functions to help analysts prepare tabular summaries of regression models: tbl_regression() and tbl_uvregression().

tbl_regression()

The tbl_regression() function takes a regression model object in R and returns a formatted table of regression model results. Like tbl_summary(), tbl_regression() creates highly customizable analytic tables with sensible defaults. Common regression models, such as logistic regression and Cox proportional hazards regression, are automatically identified, and the tables headers are pre-filled with appropriate column headers (i.e., Odds Ratio and Hazard Ratio).

In the example below, the logistic regression model is summarized with tbl_regression(). Note that a reference row for grade has been added, and the variable labels have been carried through into the table. Using exponentiate = TRUE, we exponentiate the regression coefficients, yielding the odds ratios. The helper function add_global_p() was used to replace the p-values for each term with the global p-value for grade.

glm(response ~ age + grade, trial, family = binomial) %>%
  tbl_regression(exponentiate = TRUE) %>%
  add_global_p()

graphic without alt text The tbl_regression() function leverages the huge effort behind the broom, parameters (Lüdecke et al. 2020), and broom.helpers (Larmarange and Sjoberg 2021) packages to perform the initial formatting of the regression object. Because tbl_regression() utilizes these packages, there are many model types that are supported out of the box, such as lm(), glm(), lme4::lmer(), lme4::glmer(), geepack::geeglm(), survival::coxph(), survival::survreg(), survival::clogit(), nnet::multinom(), rstanarm::stan_glm(), models built with the mice package (van Buuren and Groothuis-Oudshoorn 2011), and many more. A custom tidier may be specified as well, which is helpful when you need to present non-standard modifications to your model results such as Wald confidence intervals or results with modified variance-covariance standard errors.

tbl_uvregression()

The tbl_uvregression() function is a wrapper for tbl_regression() that is useful when you need a series of univariate regression models. The user passes a data frame to tbl_uvregression(), indicates what the outcome is, what regression model to run, and the function will return a formatted table of stacked univariate regression models.

trial %>%
  select(response, age, grade) %>%
  tbl_uvregression(
    y = response,
    method = glm,
    method.args = list(family = binomial),
    exponentiate = TRUE,
    pvalue_fun = ~style_pvalue(., digits = 2)
  ) %>%
  add_nevent() %>%
  add_global_p()

graphic without alt text

4 Inline Reporting

Reproducible reports are an important part of good analytic practices. We often need to report the results from a table in the text of an R markdown report. The inline_text() function reports statistics from gtsummary tables inline in an R markdown document.

Imagine you need to report the results for age from the univariate table above. Typically, the odds ratio, confidence interval, and p-value would be hard-coded into a report, which can lead to reproducibility issues if the data is updated and the hard-coded statistics are not amended. A simple call to the inline_text() function will dynamically add the model results to an RMarkdown report.

The odds ratio for age was r inline_text(uvreg, variable = age).

Here is how the line will appear in your report.

The odds ratio for age was 1.02 (95% CI 1.00, 1.04; p=0.091).

The default pattern to display for a regression table is "{estimate} ({conf.level*100}% CI {conf.low}, {conf.high}; {p.value})" (again using glue syntax), and can be modified with the inline_text(pattern=) argument.

5 Merging and Stacking

The gtsummary tables shown above are often ready for publication as they are; however, it is common that more complex tables need to be constructed. This can be achieved by merging or stacking gtsummary tables using the tbl_merge() and tbl_stack() functions. For example, in cancer research we often report models predicting a tumor’s response to treatment and risk of death side-by-side in publications. This type of table is simple to construct using tbl_merge(). First, build a table for each regression model using tbl_regression(), then merge the two tables with tbl_merge(). Any number of gtsummary tables can be merged with this function.

tbl1 <-
  glm(response ~ age + grade, trial, family = binomial) %>%
  tbl_regression(exponentiate = TRUE)

tbl2 <-
  coxph(Surv(ttdeath, death) ~ age + grade, trial) %>%
  tbl_regression(exponentiate = TRUE)

tbl_merge_1 <-
  tbl_merge(
    tbls = list(tbl1, tbl2),
    tab_spanner = c("**Tumor Response**", "**Time to Death**")
  )

graphic without alt text Similarly, any number of gtsummary tables may be stacked using the tbl_stack() function.

6 Themes

We love themes. The default styling (e.g., statistics displayed in tbl_summary(), how p-values are rounded, decimal separator, and more) follow the reporting guidelines from European Urology, The Journal of Urology, Urology, and the British Journal of Urology International (Assel et al. 2019). However, you will likely submit to another journal, or your personal preferences differ from the defaults. The gtsummary package is unique from other table building packages with the ability to set fine-grained customization defaults with themes. Themes were created to make these customizations easy to navigate and reuse across documents or projects. With themes, users can control default settings for existing functions (e.g., always present means instead of medians in tbl_summary()), as well as other changes that are not modifiable with function arguments. Several themes are available to follow various journals’ reporting guidelines, reduce cell padding and font size, and language themes to translate gtsummary tables to more than 14 languages.

For example, using the theme for The Journal of the American Medical Association (JAMA), large p-values are rounded to two decimal places, confidence intervals are shown as "lb to ub" instead of "lb, ub", and the confidence interval is displayed in the same column as the model coefficients.

theme_gtsummary_journal("jama")

glm(response ~ age + grade, trial, family = binomial) %>%
  tbl_regression(exponentiate = TRUE)

graphic without alt text The language theme can be used to translate the table to another language and allows users to specify the decimal and big mark symbols. For example, theme_gtsummary_language(language = "es", decimal.mark = ",", big.mark = ".") will translate the output to Spanish and format numeric results as 1.000,00 instead of 1,000.00 (the default formatting).

A custom theme was used to construct the gtsummary tables shown in this manuscript to match the R Journal font and reduce the default cell padding. Themes are an evolving feature, and we welcome additions of new journals’ reporting guidelines or other themes useful to users. A full glossary of customizable theme elements is available in the package’s themes vignette (http://www.danieldsjoberg.com/gtsummary/articles/themes.html).

7 Print Engines

Tables printed with gtsummary can be seamlessly integrated into RMarkdown documents and knitted into various output types using a number of print engines. The package was written to be a companion to the gt package from RStudio and is optimized to leverage the advanced customization features of this print engine, but offers compatibility with a variety of popular printing methods, including knitr::kable() (Xie 2020), flextable (Gohel 2020), huxtable (Hugh-Jones 2020), and kableExtra (Zhu 2020). While gt is used as the default for most outputs, you can easily use your print engine of choice with the conversion helper functions provided in the package (e.g., as_flex_table()). It is possible to get results in HTML, PDF (via LaTeX), RTF, Microsoft Word, PowerPoint, Excel, and others, utilizing the various print engines. The package is designed to interact with these print engines behind the scenes to reduce the burden on users, and you generally only need to be aware of them if you want to add advanced customizations.

8 Summary

The functions in the gtsummary package were designed to reduce the burden of reporting and to work together to easily construct both simple and complex tables. It is our hope that the user-friendly syntax and publication-ready tables will aid analysts in preparing reproducible and high-quality findings.

CRAN packages used

gtsummary, broom, gt, magrittr, tidyselect, tidyverse, skimr, stargazer, finalfit, tableone, glue, survey, parameters, broom.helpers, flextable, huxtable, kableExtra

CRAN Task Views implied by cited packages

CausalInference, Distributions, MixedModels, OfficialStatistics, ReproducibleResearch, Survival

Note

This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.

M. Assel, D. Sjoberg, A. Elders, X. Wang, D. Huo, A. Botchway, K. Delfino, Y. Fan, Z. Zhao, T. Koyama, et al. Guidelines for reporting of statistics for clinical research in urology. European urology, 75(3): 358, 2019.
S. M. Bache and H. Wickham. Magrittr: A forward-pipe operator for r. 2020. URL https://CRAN.R-project.org/package=magrittr. R package version 2.0.1.
D. Gohel. Flextable: Functions for tabular reporting. 2020. URL https://CRAN.R-project.org/package=flextable. R package version 0.6.1.
L. Henry and H. Wickham. Tidyselect: Select from a set of strings. 2020. URL https://CRAN.R-project.org/package=tidyselect. R package version 1.1.0.
J. Hester. Glue: Interpreted string literals. 2020. URL https://CRAN.R-project.org/package=glue. R package version 1.4.2.
D. Hugh-Jones. Huxtable: Easily create and style tables for LaTeX, HTML and other formats. 2020. URL https://CRAN.R-project.org/package=huxtable. R package version 5.1.1.
R. Iannone, J. Cheng and B. Schloerke. Gt: Easily create presentation-ready display tables. 2020. https://gt.rstudio.com/, https://github.com/rstudio/gt.
J. Larmarange. Labelled: Manipulating labelled data. 2020. URL https://CRAN.R-project.org/package=labelled. R package version 2.7.0.
J. Larmarange and D. D. Sjoberg. Broom.helpers: Helpers for model coefficients tibbles. 2021. URL https://CRAN.R-project.org/package=broom.helpers. R package version 1.3.0.
D. Lüdecke, M. S. Ben-Shachar, I. Patil and D. Makowski. Extracting, computing and exploring the parameters of statistical models using R. Journal of Open Source Software, 5(53): 2445, 2020. DOI 10.21105/joss.02445.
T. Lumley. Survey: Analysis of complex survey samples. 2020. URL https://CRAN.R-project.org/package=survey. R package version 4.0.
D. Robinson, A. Hayes and S. Couch. Broom: Convert statistical objects into tidy tibbles. 2020. URL https://CRAN.R-project.org/package=broom. R package version 0.7.2.
D. D. Sjoberg, M. Curry, M. Hannum, K. Whiting and E. C. Zabor. Gtsummary: Presentation-ready data summary and analytic result tables. 2020. URL https://CRAN.R-project.org/package=gtsummary. R package version 1.3.6, https://github.com/ddsjoberg/gtsummary, http://www.danieldsjoberg.com/gtsummary/.
S. van Buuren and K. Groothuis-Oudshoorn. mice: Multivariate imputation by chained equations in r. Journal of Statistical Software, 45(3): 1–67, 2011. URL https://www.jstatsoft.org/v45/i03/.
H. Wickham, M. Averick, J. Bryan, W. Chang, L. D. McGowan, R. François, G. Grolemund, A. Hayes, L. Henry, J. Hester, et al. Welcome to the tidyverse. Journal of Open Source Software, 4(43): 1686, 2019. DOI 10.21105/joss.01686.
Y. Xie. Knitr: A general-purpose package for dynamic report generation in r. 2020. URL https://CRAN.R-project.org/package=knitr. R package version 1.30.
H. Zhu. kableExtra: Construct complex table with ’kable’ and pipe syntax. 2020. URL https://CRAN.R-project.org/package=kableExtra. R package version 1.3.1.

References

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Sjoberg, et al., "Reproducible Summary Tables with the gtsummary Package", The R Journal, 2021

BibTeX citation

@article{RJ-2021-053,
  author = {Sjoberg, Daniel D. and Whiting, Karissa and Curry, Michael and Lavery, Jessica A. and Larmarange, Joseph},
  title = {Reproducible Summary Tables with the gtsummary Package},
  journal = {The R Journal},
  year = {2021},
  note = {https://rjournal.github.io/},
  volume = {13},
  issue = {1},
  issn = {2073-4859},
  pages = {570-580}
}