Making Provenance Work for You

To be useful, scientific results must be reproducible and trustworthy. Data provenance—the history of data and how it was computed—underlies reproducibility of, and trust in, data analyses. Our work focuses on collecting data provenance from R scripts and providing tools that use the provenance to increase the reproducibility of and trust in analyses done in R. Specifically, our “End-to-end provenance tools” (“E2ETools”) use data provenance to: document the computing environment and inputs and outputs of a script’s execution; support script debugging and exploration; and explain differences in behavior across repeated executions of the same script. Use of these tools can help both the original author and later users of a script reproduce and trust its results.

Barbara Lerner (Mount Holyoke College), Emery Boose (Harvard University), Orenna Brand (Columbia University), Aaron M. Ellison (Sound Solutions for Sustainable Science), Elizabeth Fong (Mount Holyoke College), Matthew Lau (University of Hawaii West Oahu), Khanh Ngo (Mount Holyoke College), Thomas Pasquier (University of British Columbia), Luis A. Perez (Harvard College), Margo Seltzer (University of British Columbia), Rose Sheehan (Mount Holyoke College), Joseph Wonsil (University of British Columbia)
2023-02-10

1 Introduction

In today’s data-driven world, an increasing number of people are finding themselves needing to analyze data in the course of their work. Often these people have little or no background or formal coursework in programming and may think of it solely as a tedious means to an interesting end. Writing scripts to work with data in this way is often exploratory. The researcher may be writing a script to produce a plot that enables visual understanding of the data. This understanding might then lead to a realization that the data need to be cleaned to remove bad values, and statistical tests need to be performed to determine the strength or trends of relationships. Examining these results may raise more questions and lead to more code. This type of exploratory programming can easily lead to scripts that grow over time to include both useful and irrelevant code that is difficult to understand, debug, and modify.

Creating a script and successfully running it once to analyze a dataset is one thing. Reproducing it later is another thing entirely. We might expect that reproducing a data analysis should be a simple matter of rerunning a script on the same data, but it is rarely that simple. Anyone who has tried to retrieve the version of the data and scripts used to produce the results presented in a paper will likely appreciate how difficult this can be. Data and scripts can be modified or lost. But even if care is taken to save the scripts and data, new versions of programming languages, libraries, and operating systems may make scripts behave differently or be unable to run at all. In an ideal world, everything would be backwards-compatible, but in reality, what ran last week may not run next week. It can be difficult to determine what went wrong, especially if programming is an occasional activity. The National Academies report on Reproducibility and Replicability in Science (National Academies of Sciences, Engineering, and Medicine 2019) describes at length the challenges associated with computational reproducibility of scientific results.

Motivated by an interest in supporting reproducibility of R scripts, we developed a package called rdtLite to collect data provenance containing a record of a script’s execution and the environment in which it was executed (Lerner et al. 2018). Having done that, we then realized that the wealth of information contained in the data provenance could serve other purposes as well. This led to the development of End-to-End Provenance Tools (“E2ETools”): an evolving set of R packages that use data provenance to help users save workable copies of their data and scripts, debug them, understand how data and results of analyses were derived, discover what has changed when a script stops working, and reproduce prior results.

2 What is data provenance?

Provenance is the history of creation, ownership, chain-of-custody, and location of an object. In its original and still most-frequently used sense, provenance is used to authenticate and trace the legitimate ownership of a work of art; it confers, creates, or adds value to the work itself. But provenance can be constructed, identified, or traced for any object, including data (Becker and Chambers 1988). Data provenance is analogous to provenance of a work of art in that it includes the history of a datum or entire dataset from the point at which it was collected (by a person or sensor), created (by a computational process), or derived (from other data). Data provenance also confers or adds value—as trustworthiness—to data, but data provenance can do more: it can be used to reproduce computational analyses and validate scientific conclusions.

More precisely, data provenance is the history of a data item (“datum”) or a dataset (“data”); it describes how the datum or data came to be in its present state. Our E2ETools focus on language-level provenance: how data are created and manipulated by a programming language such as R during the execution of a script or program. Provenance is also referred to in other computing contexts. For example, data provenance can be used to understand the results of database queries or the processes that were used to create or modify a file. In the remainder of this paper, however, when we say “provenance” or “data provenance”, we specifically mean language-level provenance.

We associate three types of information with provenance: environment information, coarse-grained information, and fine-grained information. Environment information includes information about the computing environment in which the script was executed. This includes information such as the operating system version, the R version, and the versions of the R libraries used, as each of these may play a role in understanding the details of how a script behaves. Coarse-grained information includes the source code of the script(s), the data input to the script, the data output by the script, and plots produced by the script. Fine-grained information includes an execution trace. Specifically, for each line of the script that is executed, fine-grained information includes the data used on that line and any data computed by, or object created by, that line. Our E2ETools can use this fine-grained information to help a user understand exactly how any data value or object in the script was computed or derived.
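To make the idea of environment information concrete, the following minimal sketch gathers a few of the same kinds of facts using only base R. The helper collect_environment is hypothetical and is not part of rdtLite; it only illustrates the sort of details that a provenance collector might record.

```r
# Sketch: gather environment information with base R only.
# collect_environment() is a hypothetical helper, not an rdtLite function.
collect_environment <- function() {
  pkgs <- loadedNamespaces()
  list(
    r.version = R.version.string,          # e.g. the "R version ..." string
    platform  = R.version$platform,        # processor / OS triple
    os        = Sys.info()[["sysname"]],   # operating system name
    # version of every package whose namespace is currently loaded
    packages  = vapply(
      pkgs,
      function(p) as.character(utils::packageVersion(p)),
      character(1)
    )
  )
}

env <- collect_environment()
str(env)
```

Recording package versions at execution time matters because a later run may silently pick up newer versions of the same packages.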

3 A first example

Consider this simple example, mtcars_example.R, which loads the mtcars dataset and plots miles per gallon (mpg) as a function of the number of cylinders (cylinders) (Figure 1).

# Load the mtcars data set that comes with R
data(mtcars)

# All the cars
allCars.df <- mtcars

# Create separate data frames for each number of cylinders
cars4Cyl.df <- allCars.df[allCars.df$cyl == 4, ]
cars6Cyl.df <- allCars.df[allCars.df$cyl == 6, ]
cars8Cyl.df <- allCars.df[allCars.df$cyl == 8, ]

# Create a table with the average mpg for each # cylinders
cylinders = c(4, 6, 8)
mpg = c(mean(cars4Cyl.df$mpg), mean(cars6Cyl.df$mpg), mean(cars8Cyl.df$mpg))
cyl.vs.mpg.df <- data.frame (cylinders, mpg)

# Plot it
plot(cylinders, mpg)
Figure 1: Source code for mtcars_example.R. This code is used to demonstrate the lineage traces provided by the debug.lineage function as described in the text.

The following commands run the script, collect its provenance, and produce a textual summary of the provenance.

library(rdtLite)
prov.run("mtcars_example.R")
prov.summarize()
PROVENANCE SUMMARY for mtcars_example.R

ENVIRONMENT:
Executed at 2022-07-28T13.52.25EDT
Total execution time was 1.516 seconds
Script last modified at 2022-07-22T10.41.25EDT
Executed with R version 4.2.1 (2022-06-23)
Platform was x86_64, darwin17.0
Operating system was macOS Catalina 10.15.7
User interface was 2022.02.3+492 Prairie Trillium (desktop)
Document converter was 2.2.1 @ /usr/local/bin/pandoc
Provenance was collected with rdtLite1.4
Provenance is stored in /Users/blerner/tmp/prov/prov_mtcars_example
Hash algorithm is md5

LIBRARIES (loaded by script):
None (see notes below)

SCRIPTS:
1[:] /Users/blerner/Documents/Process/DataProvenance/Papers/RJournal/scripts/
    examples/mtcars_example.R

PRE-EXISTING:
None

INPUTS:
1[:] /Library/Frameworks/R.framework/Versions/4.2/Resources/library/datasets/
    data/Rdata.rds

OUTPUTS:
1[-] /Users/blerner/Documents/Process/DataProvenance/Papers/RJournal/scripts/
    dev.off.11.pdf

CONSOLE:
None

ERRORS & WARNINGS:
None

NOTES: Files are listed in the order of execution (script 1 = main script).
The status of each file in its original location is marked as follows:
File unchanged [:], File changed [+], File missing [-], Not checked [ ].
Copies of original files are available on the provenance directory.

Libraries loaded by the user's script at the time of execution are displayed.
Note that some libraries may have been loaded before execution. Use details = 
TRUE to see all loaded libraries along with script, file, and message details.
Figure 2: Provenance summary for mtcars_example.R, showing the environment in which the script was executed, identifying the script, input and output files, and any errors or warnings encountered when the script was executed.

The provenance summary is shown in Figure 2. The environment information (lines 3–18) reports details of the computing environment in which the script was executed, such as the processor and operating system on which it ran and the version of R and R libraries used. The coarse-grained information (lines 20–36) identifies the location in the file system of the script, the input dataset, and the plot produced. The fine-grained information, which is not displayed by prov.summarize() but is accessible via other tools, indicates the input and output data for each line of code executed, linking them together so that one can see how the values computed in one statement are used in later statements. For example, the provenance debugger can use fine-grained information to display everything that is derived from a variable.

library(provDebugR)
prov.debug()
debug.lineage("cars4Cyl.df", forward = TRUE)

The resulting output displays the line numbers and code for everything computed, either directly or indirectly, from cars4Cyl.df.

Var cars4Cyl.df 
    8:    cars4Cyl.df <- allCars.df[allCars.df$cyl == 4, ] 
    14:      mpg = c(mean(cars4Cyl.df$mpg), mean(cars6Cyl.df ...
    15:      cyl.vs.mpg.df <- data.frame (cylinders, mpg) 
    18:      plot(cylinders, mpg) 
    NA:      mtcars_example.R 

Alternatively, a modified version of the same command

debug.lineage("cars4Cyl.df")

shows the lines of code that lead to the value for cars4Cyl.df being computed.

Var cars4Cyl.df 
    2:   data(mtcars) 
    5:   allCars.df <- mtcars 
    8:   cars4Cyl.df <- allCars.df[allCars.df$cyl == 4, ] 
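
The two lineage queries can be understood as simple traversals over a record of which variables each statement sets and uses. The sketch below is not how provDebugR is implemented; it is a hand-written illustration in which the trace list, forward_lineage, and backward_lineage are all hypothetical, and it assumes that no variable is assigned more than once (which holds for mtcars_example.R).

```r
# Hand-written record of the variables each line of mtcars_example.R
# sets and uses. rdtLite collects this kind of information automatically;
# here it is typed in by hand purely for illustration.
trace <- list(
  list(line = 2,  sets = "mtcars",        uses = character(0)),
  list(line = 5,  sets = "allCars.df",    uses = "mtcars"),
  list(line = 8,  sets = "cars4Cyl.df",   uses = "allCars.df"),
  list(line = 9,  sets = "cars6Cyl.df",   uses = "allCars.df"),
  list(line = 10, sets = "cars8Cyl.df",   uses = "allCars.df"),
  list(line = 13, sets = "cylinders",     uses = character(0)),
  list(line = 14, sets = "mpg",
       uses = c("cars4Cyl.df", "cars6Cyl.df", "cars8Cyl.df")),
  list(line = 15, sets = "cyl.vs.mpg.df", uses = c("cylinders", "mpg")),
  list(line = 18, sets = character(0),    uses = c("cylinders", "mpg"))
)

# Forward lineage: every line computed directly or indirectly from var.
forward_lineage <- function(trace, var) {
  derived <- character(0)   # variables known to depend on var
  lines   <- numeric(0)
  for (step in trace) {
    if (var %in% step$sets || any(step$uses %in% derived)) {
      lines   <- c(lines, step$line)
      derived <- union(derived, step$sets)
    }
  }
  lines
}

# Backward lineage: every line that contributed to the value of var.
backward_lineage <- function(trace, var) {
  needed <- var             # variables whose origin we still need
  lines  <- numeric(0)
  for (step in rev(trace)) {
    if (any(step$sets %in% needed)) {
      lines  <- c(lines, step$line)
      needed <- union(needed, step$uses)
    }
  }
  rev(lines)
}

forward_lineage(trace, "cars4Cyl.df")
backward_lineage(trace, "cars4Cyl.df")
```

For this script the two calls select the same line numbers as the debug.lineage output above. A script that reassigns a variable would need each assignment tracked as a separate version, which this name-based sketch does not do.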

Having seen an introductory example of some things the E2ETools can do, we now turn to a more detailed discussion of each tool.

4 The end-to-end provenance tools

The E2ETools consist of three types of packages:

We describe each of these packages, beginning with provenance collection. All the tools described are available on CRAN.

Collecting provenance with rdtLite

The rdtLite package collects provenance from R scripts as they execute.¹ rdtLite captures provenance data from both scripts and interactive console sessions. To capture provenance for a script, the user runs the script using the prov.run function.

library(rdtLite)
prov.run("script.R")

To collect provenance for an interactive session, the user begins the session with the prov.init function and concludes it with prov.quit.

library(rdtLite)
prov.init()
data <- read.csv("mydata.csv")
plot(data$x, data$y)
prov.quit()

rdtLite collects information about each file or URL read by the script, each file written by the script, and each plot created by the script. In addition, it records an execution trace of the top-level R statements. This trace identifies the statement executed. It records any variables set or used by the statement. When a variable is set, it records the type of the value, including its container (such as vector, data frame, etc.), dimensions, and class (e.g., character, numeric). If the container is a vector of length 1, rdtLite records its data value, embedded in the provenance (which is stored in a JSON file). rdtLite can save the values of larger containers in separate snapshot files. The user controls how much data to save using the snapshot.size parameter in prov.init and prov.run. The default is to not save snapshots. rdtLite also records any warning or error messages generated when the statement is executed. To capture similar information about scripts that are included using the source function, calls to source must be replaced with calls to prov.source.
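
As a rough illustration of the per-variable details described above, the following sketch summarizes a value's container, dimensions, and class using base R, and embeds the value itself only for a vector of length 1. The function describe_value and the field names it returns are hypothetical; they are not rdtLite's internal representation.

```r
# Sketch: summarize a value the way a provenance collector might.
# describe_value() and its field names are illustrative assumptions.
describe_value <- function(x) {
  if (is.data.frame(x)) {
    container <- "data_frame"
  } else if (is.matrix(x)) {
    container <- "matrix"
  } else if (is.list(x)) {
    container <- "list"
  } else if (is.atomic(x)) {
    container <- "vector"
  } else {
    container <- "other"
  }

  # Use dim() when it is defined, otherwise fall back to length()
  dimension <- if (is.null(dim(x))) length(x) else dim(x)

  # Only a length-1 vector is small enough to embed directly
  value <- if (container == "vector" && length(x) == 1) x else NULL

  list(container = container, dimension = dimension,
       type = class(x), value = value)
}

str(describe_value(mtcars))
str(describe_value(mean(mtcars$mpg)))
```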

The provenance is stored in a JSON file using a format that extends the PROV-JSON standard (W3C 2014).² The extended format provides structured information about fine-grained provenance, such as a list of libraries used, a mapping from functions called to the libraries from which they came, script line numbers, and data values and their types. More information about the extended JSON format is provided in the Appendix.

The JSON file is stored in a provenance directory that also contains copies of all input and output files and the R scripts executed. By default, the provenance data is stored in the R session temporary directory, but the user can change this location either at the time that prov.run or prov.init is called or by setting the prov.dir option, for example, in the .Rprofile file.

Upon completion of a script called with prov.run, or after a call to prov.quit, rdtLite creates and populates a directory named either prov_script, where script is the name of the script file, or prov_console for an interactive session. The directory will contain:

The rdtLite default is to overwrite this information if the same script is executed again or if prov.init is used again in a console session. However, if the overwrite parameter is set to FALSE, the provenance is stored in a unique, time-stamped directory, allowing provenance from multiple executions to be analyzed and compared.
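
The following sketch shows one way the prov.dir option and the overwrite parameter described above might be combined. It assumes that rdtLite is installed and skips the provenance run otherwise; the script contents and the directory names are made up for the example.

```r
# Write a tiny script to a temporary file so the example is self-contained.
script <- file.path(tempdir(), "script.R")
writeLines(c("x <- 1:10", "y <- mean(x)"), script)

# Choose where provenance directories are created. A real project would
# use a persistent location; tempdir() is used here only so that the
# sketch leaves nothing behind.
prov.root <- file.path(tempdir(), "prov")
dir.create(prov.root, showWarnings = FALSE)
options(prov.dir = prov.root)

if (requireNamespace("rdtLite", quietly = TRUE)) {
  # overwrite = FALSE asks for a time-stamped directory for this run
  # instead of replacing the provenance from an earlier run.
  rdtLite::prov.run(script, overwrite = FALSE)
  print(list.files(prov.root))
}
```

Keeping one directory per execution is what makes it possible to compare runs later.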

Using provenance

Having the provenance is extremely valuable, but it is not particularly usable without tools that read the provenance and provide information or enable reproducibility. We next describe four tools that use provenance to help R programmers understand executions of their script. The provSummarizeR package provides a concise textual summary of an execution. The provViz package provides a graphical visualization of the provenance. The provDebugR package uses collected provenance to help programmers debug their code. The provExplainR package compares provenance from two executions to help the programmer understand changes between them. These applications exist in packages separate from rdtLite and would work equally well with provenance collected by other tools that produce the same JSON format.

provSummarizeR

The purpose of provSummarizeR is to produce a concise record of the environment in which a script was executed. This information could be particularly valuable when including a script and its results in a paper, or when sharing a script with a colleague. For an example, see Figure 2 above. The summary includes the following information:

In our own day-to-day work, we use provSummarizeR to document the processing of real-time meteorological and hydrological data at Harvard Forest. Data and plots of data captured in the past 30 days, including air temperature, precipitation, stream discharge, and water temperature, are updated and posted every 15 minutes.³ Also posted at the same site are provenance summaries for the script execution that creates the plots.

There are three functions provided to generate summaries:

prov.summarize(details = FALSE)
prov.summarize.file(prov.file, details = FALSE)
prov.summarize.run(r.script, details = FALSE)

By passing TRUE for the details parameter, the user can see more detail about some aspects of the provenance. In particular,

The provViz and provDebugR tools described below provide a similar set of three functions: one to use the last provenance collected, one to use a specific JSON file, and one to run a script and use its provenance.

provViz

Figure 3: A provenance graph as displayed using provViz. Yellow nodes represent statements in the code, blue nodes represent variables, orange nodes represent files and green nodes mark the start and end of the script.

The provViz package allows visual exploration of script execution as shown in Figure 3. There are two types of nodes: data nodes and procedure nodes. Data nodes represent things such as variables, files, plots, and URLs. Procedure nodes represent executed R statements. An edge from a data node to a procedure node indicates that the statement represented by the procedure node uses the data represented by the data node. For example, the edge from the data node 7-mpg to the procedure node 9-plot(cylinders,mpg) indicates that mpg was used in the call to the plot function. Conversely, an edge from a procedure node to a data node indicates that the procedure produced the data, for example, by assigning to a variable or writing to a file. An edge between two procedure nodes represents control flow, indicating the order in which the statements were executed.
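
The three kinds of edges can be told apart purely from the types of the nodes they connect. The sketch below encodes a tiny fragment of such a graph as two data frames and labels each edge accordingly. The node labels 7-mpg and 9-plot(cylinders,mpg) come from the example above; the label for the statement that assigns mpg, the direct control-flow edge, and the table layout are simplifying assumptions, not the format used by provViz.

```r
# Sketch: a fragment of a provenance graph as node and edge tables.
nodes <- data.frame(
  label = c("mpg = c(...)", "7-mpg", "9-plot(cylinders,mpg)"),
  kind  = c("procedure", "data", "procedure"),
  stringsAsFactors = FALSE
)

edges <- data.frame(
  from = c("mpg = c(...)", "7-mpg", "mpg = c(...)"),
  to   = c("7-mpg", "9-plot(cylinders,mpg)", "9-plot(cylinders,mpg)"),
  stringsAsFactors = FALSE
)

# Look up whether a node is a data node or a procedure node
kind_of <- function(label) nodes$kind[match(label, nodes$label)]

# Interpret each edge from the kinds of its endpoints
edges$meaning <- ifelse(
  kind_of(edges$from) == "data" & kind_of(edges$to) == "procedure",
  "uses",
  ifelse(
    kind_of(edges$from) == "procedure" & kind_of(edges$to) == "data",
    "produces",
    "control flow"
  )
)

print(edges)
```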

provViz also allows the user to view the graph and explore it to examine intermediate data values or input and output files and to perform lineage queries. The node colors indicate node type. Data nodes representing variables are purple. Files are tan. Orange nodes represent standard output, while red data nodes represent warnings and errors. Yellow nodes represent R statements. Green nodes come in pairs and represent the start and end of a group of R statements. Clicking on a green node reduces the set of statements between the matching Start and Finish nodes into a single node, which is useful for making large graphs more manageable.

Figure 4: Displaying the Lineage of 3-cars4Cyl.df

To see everything that depends on the value of a variable at a particular point in the execution of the script, the user can right-click on the data node and select Show what is computed using this value. This will display a subgraph containing just the data and procedure nodes that are in the lineage of the data node, as shown in Figure 4, which shows the lineage of 3-cars4Cyl.df. Notice that statements that do not use the value of cars4Cyl.df, either directly or indirectly, are not shown.

In addition to examining data values and tracing lineages as in this example, provViz supports the following ways of exploring the provenance:

provViz itself is a small R program that connects to a Java program called DDG Explorer (Lerner and Boose 2014a), which does the actual work of creating and managing the display.

provDebugR

The provDebugR package provides debugging support by using the provenance to help users understand the state of their script at any point during execution. It provides command-line debugging capabilities, but one could imagine building a GUI on top of these functions to produce a friendly interactive debugging environment. By using provenance, provDebugR provides insight into the entire execution and creates a rich debugging environment that provides execution context not typically available in debuggers.

For example, consider a simple but buggy script.

w <- 4:6
x <- 1:3
y <- 1:10
z <- w + y
y <- c('a', 'b', 'c')
xyz <- data.frame (x, y, z)

Running this script produces a warning and an error.

Error in data.frame(x, y, z) : 
  arguments imply differing number of rows: 3, 10
In addition: Warning message:
In w + y : longer object length is not a multiple of shorter object length

Of course, with a short script like this, a user could simply step through the script one line at a time and examine the results, but for the purposes of demonstrating the debugger, imagine that this code is buried within a large script. The lines of code might not be consecutive as shown here, and it may even be difficult to determine what lines caused the reported errors.
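
Both messages trace back to R's vector recycling rule. The following sketch, which uses only base R, reproduces the warning and the error separately and captures their text so that the cause of each is visible; the variable names warn and err are introduced here for illustration.

```r
w <- 4:6
x <- 1:3
y <- 1:10

# Recycling fits evenly only when the longer length is a multiple of
# the shorter one; 10 is not a multiple of 3.
length(y) %% length(w) == 0

# Capture the warning raised by the addition
warn <- tryCatch(w + y, warning = function(cond) conditionMessage(cond))

# The addition still yields a result, as long as the longer operand
z <- suppressWarnings(w + y)
length(z)

# y is then replaced by a length-3 vector, so x and y have 3 elements
# while z has 10, and data.frame() refuses to combine them.
y <- c("a", "b", "c")
err <- tryCatch(data.frame(x, y, z),
                error = function(cond) conditionMessage(cond))

warn
err
```

The error is reported at the call to data.frame, but its root cause is the earlier addition that produced a z of unexpected length, which is why tracing lineage is helpful here.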

The debugger provides some functions that are particularly helpful for understanding warning and error messages. For example, if the user needs help understanding where a warning came from, calling debug.warning with no arguments lists all the warnings; when called with a warning number, it displays the lines of code leading up to the warning.

> debug.warning()
Possible results: 
                                                                              
1 In  w + y :  longer object length is not a multiple of shorter object length

Pass the corresponding numeric value to the function for info on that warning
> debug.warning(1)
Warning: In  w + y :  longer object length is not a multiple of shorter object length 
    1:   w <- 4:6 
    3:   y <- 1:10 
    4:   z <- w + y 

By omitting lines that do not contribute to the computations that lead to the warning, the R programmer should be able to find the problem more easily.

Similarly, the user can get information about what led up to an error using debug.error.

> debug.error()
Your Error: Error in data.frame(x, y, z): arguments imply differing number of rows: 3, 10

Code that led to error message:
    1:   w <- 4:6 
    2:   x <- 1:3 
    3:   y <- 1:10 
    4:   z <- w + y 
    5:   y <- c('a', 'b', 'c') 
    6:   xyz <- data.frame (x, y, z) 

The debug.error function has an optional logical parameter, stack.overflow. When set to TRUE, debug.error uses the Stack Exchange API to search Stack Overflow for posts about similar error messages. It lists the questions asked in the top six posts. The user can select one and a tab will open in the user’s browser displaying the selected post.
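
To give a sense of what such a search involves, the sketch below builds a query URL for the public Stack Exchange API from an error message. It is an assumption-laden illustration rather than the code used by provDebugR: the helpers generalize_error and build_so_query are hypothetical, the choice of the search endpoint and its parameters is ours, and no network request is made.

```r
# Drop trailing specifics such as ": 3, 10" so the search matches
# other people's versions of the same error. (Hypothetical helper.)
generalize_error <- function(msg) {
  sub(": [0-9, ]+$", "", msg)
}

# Build a Stack Exchange API search URL. (Hypothetical helper; the
# endpoint and parameter choices are assumptions for this sketch.)
build_so_query <- function(error.message, n = 6) {
  paste0(
    "https://api.stackexchange.com/2.3/search",
    "?order=desc&sort=relevance",
    "&pagesize=", n,
    "&intitle=", utils::URLencode(error.message, reserved = TRUE),
    "&site=stackoverflow"
  )
}

msg <- "arguments imply differing number of rows: 3, 10"
query <- build_so_query(generalize_error(msg))
query
```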

Figure 5 shows a sample dialog using debug.error. Selecting 1 results in the user’s browser going to the page displayed in Figure 6.⁵ By scrolling down through answers to this question (not shown here), users will ideally obtain helpful information allowing them to solve their problem quickly.

> debug.error(stack.overflow=TRUE)
Your Error: Error in data.frame(x, y, z): arguments imply differing number 
of rows: 3, 10

Code that led to error message:
1:    w <- 4:6 
2:    x <- 1:3 
3:    y <- 1:10 
4:    z <- w + y 
5:      y <- c('a', 'b', 'c') 
6:      xyz <- data.frame (x, y, z) 

Results from StackOverflow:
[1] "What does the error \"arguments imply differing number of rows: x, y\" 
    mean?"                                                 
[2] "ggplot gives \"arguments imply differing number of rows\" error in 
    geom_point while it isn't true - how to debug?"            
[3] "Checkpoint function error in R- arguments imply differing number of rows: 
    1, 38, 37"                                          
[4] "qdap check_spelling Error in checkForRemoteErrors(val) : one node 
    produced an error: arguments imply differing number of rows"
[5] "Creating and appending to data frame in R (Error: arguments imply 
    differing number of rows: 0, 1)"                            
[6] "Caret and GBM: task 1 failed - \"arguments imply differing number of rows\""                                                  

Choose a numeric value that matches your error the best or q to quit: 
Figure 5: The output of a call to debug.error, showing the titles of posts on Stack Overflow related to the error encountered in the script. The user can select an option to be taken to the corresponding Stack Overflow page.