To be useful, scientific results must be reproducible and trustworthy. Data provenance—the history of data and how it was computed—underlies reproducibility of, and trust in, data analyses. Our work focuses on collecting data provenance from R scripts and providing tools that use the provenance to increase the reproducibility of and trust in analyses done in R. Specifically, our “End-to-end provenance tools” (“E2ETools”) use data provenance to: document the computing environment and inputs and outputs of a script’s execution; support script debugging and exploration; and explain differences in behavior across repeated executions of the same script. Use of these tools can help both the original author and later users of a script reproduce and trust its results.
In today’s data-driven world, an increasing number of people are finding themselves needing to analyze data in the course of their work. Often these people have little or no background or formal coursework in programming and may think of it solely as a tedious means to an interesting end. Writing scripts to work with data in this way is often exploratory. The researcher may be writing a script to produce a plot that enables visual understanding of the data. This understanding might then lead to a realization that the data need to be cleaned to remove bad values, and statistical tests need to be performed to determine the strength or trends of relationships. Examining these results may raise more questions and lead to more code. This type of exploratory programming can easily lead to scripts that grow over time to include both useful and irrelevant code that is difficult to understand, debug, and modify.
Creating a script and successfully running it once to analyze a dataset is one thing. Reproducing it later is another thing entirely. We might expect that re-running a script and reproducing a data analysis should be a simple matter of rerunning a program or script on the same data, but it is rarely that simple. Anyone who has tried to retrieve the version of the data and scripts used to produce the results presented in a paper will likely appreciate how difficult this can be. Data and scripts can be modified or lost. But even if care is taken to save the scripts and data, new versions of programming languages, libraries and operating systems may make scripts behave differently or be unable to run at all. In an ideal world, everything would be backwards-compatible, but in reality, what ran last week often doesn’t run next week. It can be difficult to determine what went wrong, especially if programming is an occasional activity. The National Academy of Sciences report on Reproducibility and Replicability in Science (National Academies of Sciences, Engineering, and Medicine 2019) describes at length the challenges associated with computational reproducibility of scientific results.
Motivated by an interest in supporting reproducibility of R scripts, we developed a package called rdtLite to collect data provenance containing a record of a script’s execution and the environment in which it was executed (Lerner et al. 2018). Having done that, we then realized that the wealth of information contained in the data provenance could serve other purposes as well. This led to the development of End-to-End Provenance Tools (“E2ETools”): an evolving set of R packages that use data provenance to help users save workable copies of their data and scripts, debug them, understand how data and results of analyses were derived, discover what has changed when a script stops working, and reproduce prior results.
Provenance is the history of creation, ownership, chain-of-custody, and location of an object. In its original and still most-frequently used sense, provenance is used to authenticate and trace the legitimate ownership of a work of art; it confers, creates, or adds value to the work itself. But provenance can be constructed, identified, or traced for any object, including data (Becker and Chambers 1988). Data provenance is analogous to provenance of a work of art in that it includes the history of a datum or entire dataset from the point at which it was collected (by a person or sensor), created (by a computational process), or derived (from other data). Data provenance also confers or adds value—as trustworthiness—to data, but data provenance can do more: it can be used to reproduce computational analyses and validate scientific conclusions.
More precisely, data provenance is the history of a data item (“datum”) or a dataset (“data”); it describes how the datum or data came to be in its present state. Our E2ETools focus on language-level provenance: how data are created and manipulated by a programming language such as R during the execution of a script or program. Provenance also arises in other computing contexts; for example, data provenance can be used to understand the results of queries to a database or the processes that were used to create or modify a file. In the remainder of this paper, however, when we say “provenance” or “data provenance”, we specifically mean language-level provenance.
We associate three types of information with provenance: environment information, coarse-grained information, and fine-grained information. Environment information includes information about the computing environment in which the script was executed. This includes information such as the operating system version, the R version, and the versions of the R libraries used, as each of these may play a role in understanding the details of how a script behaves. Coarse-grained information includes the source code of the script(s), the data input to the script, the data output by the script, and plots produced by the script. Fine-grained information includes an execution trace. Specifically, for each line of the script that is executed, fine-grained information includes the data used on that line and any data computed by, or object created by, that line. Our E2ETools can use this fine-grained information to help a user understand exactly how any data value or object in the script was computed or derived.
Consider this simple example, mtcars_example.R, which loads the mtcars dataset and plots miles per gallon (mpg) as a function of the number of cylinders (cylinders) (Figure 1).
# Load the mtcars data set that comes with R
data(mtcars)

# All the cars
allCars.df <- mtcars

# Create separate data frames for each number of cylinders
cars4Cyl.df <- allCars.df[allCars.df$cyl == 4, ]
cars6Cyl.df <- allCars.df[allCars.df$cyl == 6, ]
cars8Cyl.df <- allCars.df[allCars.df$cyl == 8, ]

# Create a table with the average mpg for each # cylinders
cylinders = c(4, 6, 8)
mpg = c(mean(cars4Cyl.df$mpg), mean(cars6Cyl.df$mpg), mean(cars8Cyl.df$mpg))
cyl.vs.mpg.df <- data.frame (cylinders, mpg)

# Plot it
plot(cylinders, mpg)
The following commands run the script, collect its provenance, and produce a textual summary of the provenance.
library(rdtLite)
prov.run("mtcars_example.R")
prov.summarize()
PROVENANCE SUMMARY for mtcars_example.R

ENVIRONMENT:
Executed at 2022-07-28T13.52.25EDT
Total execution time was 1.516 seconds
Script last modified at 2022-07-22T10.41.25EDT
Executed with R version 4.2.1 (2022-06-23)
Platform was x86_64, darwin17.0
Operating system was macOS Catalina 10.15.7
User interface was 2022.02.3+492 Prairie Trillium (desktop)
Document converter was 2.2.1 @ /usr/local/bin/pandoc
Provenance was collected with rdtLite 1.4
Provenance is stored in /Users/blerner/tmp/prov/prov_mtcars_example
Hash algorithm is md5

LIBRARIES (loaded by script):
None (see notes below)

SCRIPTS:
1[:] /Users/blerner/Documents/Process/DataProvenance/Papers/RJournal/scripts/
     examples/mtcars_example.R

PRE-EXISTING:
None

INPUTS:
1[:] /Library/Frameworks/R.framework/Versions/4.2/Resources/library/datasets/
     data/Rdata.rds

OUTPUTS:
1[-] /Users/blerner/Documents/Process/DataProvenance/Papers/RJournal/scripts/
     dev.off.11.pdf

CONSOLE:
None

ERRORS & WARNINGS:
None

NOTES: Files are listed in the order of execution (script 1 = main script).
The status of each file in its original location is marked as follows:
File unchanged [:], File changed [+], File missing [-], Not checked [ ].
Copies of original files are available on the provenance directory.

Libraries loaded by the user's script at the time of execution are displayed.
Note that some libraries may have been loaded before execution. Use details =
TRUE to see all loaded libraries along with script, file, and message details.
The provenance summary is shown in Figure 2.
The environment information (lines 3–18) reports details of the
computing environment in which the script was executed, such as the
processor and operating system on which it ran and the version of R and
R libraries used. The coarse-grained information (lines 20–36)
identifies the location in the file system of the script, the input
dataset, and the plot produced. The fine-grained information, which is
not displayed by prov.summarize()
but is accessible via other tools,
indicates the input and output data for each line of code executed,
linking them together so that one can see how the values computed in one
statement are used in later statements. For example, the provenance
debugger can use fine-grained information to display everything that is
derived from a variable.
library(provDebugR)
prov.debug()
debug.lineage("cars4Cyl.df", forward = TRUE)
The resulting output displays the line numbers and code for everything
computed, either directly or indirectly, from cars4Cyl.df.
Var cars4Cyl.df
 8: cars4Cyl.df <- allCars.df[allCars.df$cyl == 4, ]
14: mpg = c(mean(cars4Cyl.df$mpg), mean(cars6Cyl.df ...
15: cyl.vs.mpg.df <- data.frame (cylinders, mpg)
18: plot(cylinders, mpg)
NA: mtcars_example.R
Alternatively, a modified version of the same command
debug.lineage("cars4Cyl.df")
shows the lines of code that lead to the value for cars4Cyl.df being computed.
Var cars4Cyl.df
 2: data(mtcars)
5: allCars.df <- mtcars
8: cars4Cyl.df <- allCars.df[allCars.df$cyl == 4, ]
Having seen an introductory example of some things the E2ETools can do, we now turn to a more detailed discussion of each tool.
The E2ETools consist of three types of packages: a package that collects provenance as a script runs (rdtLite); application packages that use the stored provenance (provSummarizeR, provViz, provDebugR, and provExplainR); and utility packages that the applications use to read and query the provenance.
We describe each of these packages, beginning with provenance collection. All the tools described are available on CRAN.
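Since all of the packages are on CRAN, they can be installed in the usual way:

install.packages(c("rdtLite", "provSummarizeR", "provViz",
                   "provDebugR", "provExplainR"))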
The rdtLite package collects provenance from R scripts as they
execute. rdtLite captures provenance data from both scripts and
interactive console sessions. To capture provenance for a script, the
user runs the script using the prov.run
function.
library(rdtLite)
prov.run("script.R")
To collect provenance for an interactive session, the user begins the
session with the prov.init function and concludes it with prov.quit.
library(rdtLite)
prov.init()
data <- read.csv("mydata.csv")
plot(data$x, data$y)
prov.quit()
rdtLite collects information about each file or URL read by the script,
each file written by the script, and each plot created by the script. In
addition, it records an execution trace of the top-level R statements.
This trace identifies the statement executed. It records any variables
set or used by the statement. When a variable is set, it records the
type of the value, including its container (such as vector, data frame,
etc.), dimensions, and class (e.g., character, numeric). If the
container is a vector of length 1, rdtLite records its data value,
embedded in the provenance (which is stored in a JSON file). rdtLite can
save the values of larger containers in separate snapshot files. The
user controls how much data to save using the snapshot.size parameter in prov.init and prov.run. The default is to not save snapshots.
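For example, a minimal sketch, assuming (as described above) that larger snapshot.size values allow larger snapshots, with Inf placing no limit:

library(rdtLite)
# Save complete snapshots of the values computed by each statement;
# a finite snapshot.size would instead cap the size of each snapshot file.
prov.run("mtcars_example.R", snapshot.size = Inf)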
rdtLite also records any warning or error messages generated when the statement is executed. To capture similar information about scripts that are included using the source function, calls to source must be replaced with calls to prov.source.
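For instance (helpers.R is a hypothetical included script):

# Instead of:  source("helpers.R")
# call the provenance-aware equivalent:
prov.source("helpers.R")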
The provenance is stored in a JSON file using a format that extends the PROV-JSON standard (W3C 2014). The extended format provides structured information about fine-grained provenance, such as a list of libraries used, a mapping from functions called to the libraries from which they came, script line numbers, and data values and their types. More information about the extended JSON format is provided in the Appendix.
The JSON file is stored in a provenance directory that also contains
copies of all input and output files and the R scripts executed. By
default, the provenance data is stored in the R session temporary
directory, but the user can change this location either at the time that
prov.run or prov.init is called or by setting the prov.dir option, for example, in the .Rprofile file.
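A sketch of both approaches (the prov.dir argument to prov.run is assumed here to mirror the option name; the paths are illustrative):

# In .Rprofile: set a default provenance directory for every run
options(prov.dir = "~/prov")

# Or choose a location for a single run
prov.run("mtcars_example.R", prov.dir = "~/prov/mtcars")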
Upon completion of a script called with prov.run, or after a call to prov.quit, rdtLite creates and populates a directory named either prov_script, where script is the name of the script file, or prov_console for an interactive session. The directory will contain:
- prov.json - the JSON file containing the fine-grained provenance
- data - a directory containing copies of input and output files, URLs, plots created, and snapshot files
- scripts - a directory containing a copy of the scripts for which provenance was collected

The rdtLite default is to overwrite this information if the same script is executed again or if prov.init is used again in a console session. However, if the overwrite parameter is set to FALSE, the provenance is stored in a unique, time-stamped directory, allowing provenance from multiple executions to be analyzed and compared.
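For example, assuming overwrite is passed to prov.run:

# Keep the provenance of each execution in its own time-stamped directory
prov.run("mtcars_example.R", overwrite = FALSE)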
Having the provenance is extremely valuable, but it is not particularly usable without tools that read the provenance and provide information or enable reproducibility. We next describe four tools that use provenance to help R programmers understand executions of their script. The provSummarizeR package provides a concise textual summary of an execution. The provViz package provides a graphical visualization of the provenance. The provDebugR package uses collected provenance to help programmers debug their code. The provExplainR package compares provenance from two executions to help the programmer understand changes between them. These applications exist in packages separate from rdtLite and would work equally well with provenance collected by other tools that produce the same JSON format.
The purpose of provSummarizeR is to produce a concise record of the environment in which a script was executed. This information could be particularly valuable when including a script and its results in a paper, or when sharing a script with a colleague. For an example, please see Figure 2 above. The summary includes the following information:

- the environment in which the script was executed, including when it ran and the versions of R and of the libraries used;
- the main script and any scripts included with the source or prov.source functions;
- pre-existing variables, input and output files and URLs, and any plots produced;
- console output; and
- error and warning messages.

In our own day-to-day work, we use provSummarizeR to document the processing of real-time meteorological and hydrological data at Harvard Forest. Data and plots of data captured in the past 30 days, including air temperature, precipitation, stream discharge, and water temperature, are updated and posted every 15 minutes. Also posted at the same site are provenance summaries for the script execution that creates the plots.
There are three functions provided to generate summaries:
prov.summarize(details = FALSE)
prov.summarize.file(prov.file, details = FALSE)
prov.summarize.run(r.script, details = FALSE)
- prov.summarize produces a summary for the last provenance collected in the current R session.
- prov.summarize.file takes the name of a JSON file containing provenance and produces a summary from it.
- prov.summarize.run takes the name of a file containing an R script. It runs the script, collects its provenance, and produces a summary.

By passing TRUE for the details parameter, the user can see more detail about some aspects of the provenance. In particular, details = TRUE reports all loaded libraries, not just those loaded by the script, along with script, file, and message details.
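For example, a brief sketch of the second form (the JSON path is illustrative):

library(provSummarizeR)
# Summarize provenance that rdtLite previously saved to a JSON file
prov.summarize.file("prov_mtcars_example/prov.json", details = TRUE)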
The provViz and provDebugR tools described below provide a similar set of three functions: one to use the last provenance collected, one to use a specific JSON file, and one to run a script and use its provenance.
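For example, assuming provDebugR's function names parallel provSummarizeR's:

library(provDebugR)
prov.debug()                         # use the last provenance collected
prov.debug.file("prov.json")         # use a specific provenance JSON file
prov.debug.run("mtcars_example.R")   # run a script, then debug it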
The provViz package
allows visual exploration of script execution, as shown in Figure 3. There are two types of nodes: data nodes and
procedure nodes. Data nodes represent things such as variables, files,
plots, and URLs. Procedure nodes represent executed R statements. An
edge from a data node to a procedure node indicates that the statement
represented by the procedure node uses the data represented by the data
node. For example, the edge from data item 7-mpg to procedure node 9-plot(cylinders,mpg) indicates that mpg was used in the call to the plot function. Conversely, an edge from a procedure node to a data
node indicates that the procedure produced the data, for example, by
assigning to a variable or writing to a file. An edge between two
procedure nodes represents control flow, indicating the order in which
the statements were executed.
provViz also allows the
user to view the graph and explore it to examine intermediate data
values or input and output files and to perform lineage queries. The
node colors indicate node type. Data nodes representing variables are
purple. Files are tan. Orange nodes represent standard output, while red
data nodes represent warnings and errors. Yellow nodes represent R
statements. Green nodes come in pairs and represent the start and end of
a group of R statements. Clicking on a green node reduces the set of
statements between the matching Start and Finish nodes into a single node, which is useful for making large graphs more manageable.
To see everything that depends on the value of a variable at a
particular point in the execution of the script, the user can
right-click on the data node and select "Show what is computed using this value". This will display a subgraph containing just the data and procedure nodes that are in the lineage of the data node, as shown in Figure 4 for the lineage of 3-cars4Cyl.df. Notice that statements that do not use the value of cars4Cyl.df, either directly or indirectly, are not shown.
In addition to examining data values and tracing lineages as in this example, provViz supports several other ways of exploring the provenance.
provViz itself is a small R program that connects to a Java program called DDG Explorer (Lerner and Boose 2014a), which does the actual work of creating and managing the display.
The provDebugR package provides debugging support by using the provenance to help users understand the state of their script at any point during execution. It provides command-line debugging capabilities, but one could imagine building a GUI on top of these functions to produce a friendly interactive debugging environment. By using provenance, provDebugR provides insight into the entire execution and creates a rich debugging environment that provides execution context not typically available in debuggers.
For example, consider a simple but buggy script.
w <- 4:6
x <- 1:3
y <- 1:10
z <- w + y
y <- c('a', 'b', 'c')
xyz <- data.frame (x, y, z)
Running this script produces a warning and an error.
Error in data.frame(x, y, z) :
  arguments imply differing number of rows: 3, 10
In addition: Warning message:
In w + y : longer object length is not a multiple of shorter object length
Of course, with a short script like this, a user could simply step through the script one line at a time and examine the results, but for the purposes of demonstrating the debugger, imagine that this code is buried within a large script. The lines of code might not be consecutive as shown here, and it may even be difficult to determine what lines caused the reported errors.
The debugger provides some functions that are particularly helpful for
understanding warning and error messages. For example, if the user needs
help understanding where a warning came from, calling debug.warning
with no arguments lists all the warnings; when called with a warning
number, it displays the lines of code leading up to the warning.
> debug.warning()
Possible results:
1 In w + y : longer object length is not a multiple of shorter object length

Pass the corresponding numeric value to the function for info on that warning

> debug.warning(1)
Warning: In w + y : longer object length is not a multiple of shorter object length
1: w <- 4:6
3: y <- 1:10
4: z <- w + y
By omitting lines that do not contribute to the computations that lead to the warning, the R programmer should be able to find the problem more easily.
Similarly, the user can get information about what led up to an error
using debug.error.
> debug.error()
Your Error: Error in data.frame(x, y, z): arguments imply differing number of rows: 3, 10

Code that led to error message:
1: w <- 4:6
2: x <- 1:3
3: y <- 1:10
4: z <- w + y
5: y <- c('a', 'b', 'c')
6: xyz <- data.frame (x, y, z)
The debug.error function has an optional logical parameter, stack.overflow. When set to TRUE, debug.error uses the Stack Exchange API to search Stack Overflow for posts about similar error
messages. It lists the questions asked in the top six posts. The user
can select one and a tab will open in the user’s browser displaying the
selected post.
Figure 5 shows a sample dialog using debug.error. Selecting 1 results in the user’s browser going to the page displayed in Figure 6. By scrolling down through answers to this question (not shown here), users will ideally obtain helpful information allowing them to solve their problem quickly.
> debug.error(stack.overflow=TRUE)
Your Error: Error in data.frame(x, y, z): arguments imply differing number
of rows: 3, 10

Code that led to error message:
1: w <- 4:6
2: x <- 1:3
3: y <- 1:10
4: z <- w + y
5: y <- c('a', 'b', 'c')
6: xyz <- data.frame (x, y, z)

Results from StackOverflow:
[1] "What does the error \"arguments imply differing number of rows: x, y\"
    mean?"
[2] "ggplot gives \"arguments imply differing number of rows\" error in
    geom_point while it isn't true - how to debug?"
[3] "Checkpoint function error in R- arguments imply differing number of rows:
    1, 38, 37"
[4] "qdap check_spelling Error in checkForRemoteErrors(val) : one node
    produced an error: arguments imply differing number of rows"
[5] "Creating and appending to data frame in R (Error: arguments imply
    differing number of rows: 0, 1)"
[6] "Caret and GBM: task 1 failed - \"arguments imply differing number of rows\""

Choose a numeric value that matches your error the best or q to quit: