PDFEstimator: An R Package for Density Estimation and Analysis

This article presents PDFEstimator, an R package for nonparametric probability density estimation and analysis, as both a practical enhancement and alternative to kernel-based estimators. PDFEstimator creates fast, highly accurate, data-driven probability density estimates for continuous random data through an intuitive interface. Excellent results are obtained for a diverse set of data distributions ranging from 10 to \(10^6\) samples when invoked with default parameter definitions in the absence of user directives. Additionally, the package contains methods for assessing the quality of any estimate, including robust plotting functions for detailed visualization and trouble-shooting. Usage of PDFEstimator is illustrated through a variety of examples, including comparisons to several kernel density methods.

Jenny Farmer (Department of Bioinformatics, University of North Carolina at Charlotte) , Donald Jacobs (Department of Physics and Optical Science, University of North Carolina at Charlotte)
2022-10-11

1 Introduction

The ability to estimate a probability distribution from a single data sample is critical across diverse fields of science and finance (Tang et al. 2016; Cavuoti et al. 2017; Munkhammar et al. 2017; Sidibé et al. 2018). Estimating the parameters of the underlying density function becomes increasingly difficult when there is no prior information about the number of parameters, as the shape and complexity must also be inferred from that data. Although there are many nonparametric methods, kernel density estimation (KDE) is among the most popular.

Variants of KDE differ based on how they implement the selections of the bandwidth and the kernel function, which are nontrivial decisions that can have a significant impact on the quality and performance of the estimate. Most KDE implementations allow the user to manipulate some parameters manually, allowing for an experienced user to fine-tune the default behavior for improved results. Unfortunately, user directives introduce unavoidable subjectivity to the estimate. More advanced implementations include intelligent and adaptive bandwidth selection optimized according to the characteristics of the data (Botev et al. 2010; Chen 2017). However, there is inevitably a trade-off between computational performance and accuracy (Gray and Moore 2003). There are many R packages available that can estimate density nonparametrically, typically based on KDE (Hayfield and Racine 2008; Konietschke et al. 2015; Calonico et al. 2019; Moss and Tveten 2019; Sebastian et al. 2019; Nagler and Vatter 2020; Wand 2021; Wand and Ripley 2021). The core R function density implements a straightforward kernel density method.

Presented in this article is the package PDFEstimator for nonparametric density estimation and analysis (Farmer and Jacobs 2018). The features of PDFEstimator can be separated into two categories. The first is a novel estimation method based on the principle of maximum entropy, available through the estimatePDF function. A primary advantage of estimatePDF is the automated interface, requiring nothing from the user other than a data sample. Range, multi-scale resolution, outliers, and boundaries are determined within the algorithm to achieve optimized data-driven estimates appropriate to the given sample. Although these defaults can be overridden by a sophisticated user, overrides generally do not improve the results. Additionally, multiple acceptable solutions can be returned, which is particularly useful in the case of low sample sizes where there is more statistical uncertainty.

The second category of features included in PDFEstimator is a unique set of assessment utilities for evaluating the accuracy of the solution and highlighting areas of uncertainty within a density estimate. Furthermore, a user-defined threshold can be specified to identify data points that fall outside of an expected confidence level. These diagnostics tools are visualized through a variety of customized plotting options, thus aiding in the evaluation of difficult distributions. Most importantly, all of these features can be applied towards any estimation function, such as density, as they are universal measurements independent of the method used, allowing for an integrated comparison between alternative models.

The remainder of this paper is organized as follows.  2 describes the functions available in this package, including their usage and underlying methods.  3 provides additional and advanced examples for identifying and troubleshooting problems with any density estimation method.  4 compares estimatePDF with three popular kernel-based estimation packages using a diverse set of known distributions. Finally,  5 demonstrates the use of PDFEstimator for a well-known real data set from a rare stamp collection.

2 Available functions and usage

An overview of the functions included in PDFEstimator is listed in Table 1. A critical component of the tools in PDFEstimator is a PDFe object, which encapsulates all information necessary for plotting and assessing the quality of any density estimate. The member variables for the PDFe class definition are listed in Table 2. The methods for calculating the necessary components of the PDFe and how they are employed in each of the functions in Table 1 are described in this section.

Table 1: Overview of functions in the PDFEstimator package.
Function Description
getTarget Returns upper and lower limits of SQR for a given target level
plotBeta Plots a shaded region outlining expected range of SQR values by position
estimatePDF Estimates a density from a data sample. Returns a PDFe estimation object.
convertToPDFe Converts any PDF to a PDFe estimation object for diagnostic purposes.
approximatePoints Returns approximated PDF for an existing PDFe estimation object at a given set of data points.
plot Main plotting function for PDFe estimation objects.
lines Plots the density for a PDFe estimation object as a connecting line segment to an existing plot.
summary Prints a summary of the PDFe object.
print Prints the probability density and cumulative density for each estimation point in the PDFe object.

The PDFe class

The first four member variables listed in Table 2 are commonly understood values for an estimated density, beginning with the random data sample that is the basis for the estimate. The probability density function (PDF) is estimated for the range of points defined in x. Similarly, the cumulative distribution function (CDF) can be calculated representing the cumulative probability of the PDF for each x. PDFEstimator defines additional descriptions for an estimate that measure the quality of its fit to the sample data. Central to the quality of these estimates is a scoring mechanism to rate the overall fit of an estimate to the sample. A single average score is calculated, as well as individual confidence levels for each data point within the sample. These scores are based on order statistics (Wilks 1948).

If the CDF, defined on the range (0 1), is an accurate representation of the data, then \(CDF(x)\) will represent uniform random data. The general problem then becomes assessing if \(r_k=CDF(x_k)\) represents uniform data for a given sample. Although there are many methods to test for uniform data (Farmer et al. 2019), PDFEstimator employs an average quadratic z-score, defined as \[\label{eq:zscore} z^2=\frac{-1}{N} \sum_{k=1}^{N}\frac{\left(r_k-\mu_k\right)^2}{\sigma_k^2}, \tag{1}\] where \(k\) is the sort ordered position in a sample size \(N\), and \(\mu_k\) and \(\sigma_k\) are the mean and standard deviation from single order statistics known to be \(\mu_k=-\frac{k}{N+1}\) and \(\sigma_k=\frac{\mu_k\left(\mu_k-1\right)}{\sqrt{N+2}}\). Perfectly uniform data would yield a score of exactly zero.

To study typical z-scores for uniform random data, extensive numerical experiments were generated with a random number generator on the range (0, 1) for many different sample sizes. The typical distribution of scores according to Equation (1) is represented in the left plot for Figure 1. The peak density of the PDF corresponds to the most likely z-score and occurs at a value of approximately -0.5, with a sharp drop-off in the density as scores approach zero. The threshold value of the PDFe object reports the empirical cumulative probability for the z-score, as a percentage, shown on the right-hand plot of Figure 1. The cumulative probability for the peak z-score is calculated to be a little over 0.7, thus a threshold value near or greater than 70% can be considered a highly probable fit. A threshold of less than 5%, by contrast, is low probability and therefore likely an underfit for the data. In this event, the PDFe member variable failedSolution is set to TRUE. A score with a threshold of 95%, however, is similarly unlikely and can be interpreted as overfitting the data. The software does not rigidly enforce a particular score upon an accepted solution, but rather uses the numerically calculated density shown in Figure 1 as a guide to iteratively move towards increasingly probable solutions.

The threshold provides an average score for the estimate, but to assess the estimate per position and identify the locations of potential errors, a scaled quantile residual (SQR) is defined as \[\label{eq:sqr} SQR_k=\sqrt{N+2} \left(r_k-\mu_k\right). \tag{2}\] The scaling factor of \(\sqrt{N+2}\) creates a sample-size-invariant metric for each position \(k\). The sqr member variable of the PDFe class contains a vector of SQR values according to Equation (2) and their ability to diagnose problems in a density estimate will be demonstrated with the plot function. Note that the PDFe object will also report the size of sqr, which will be the number of samples less the number of any outliers detected.

getTarget and plotBeta

It has been shown that, when plotted against position, \(SQR_k\) for uniform random data falls approximately within an oval shaped region (Farmer et al. 2019). The reason for this can be understood by examining the beta distributions that govern order statistics for sort ordered random uniform data (Wilks 1948). The probability of \(u\) for position \(k\) in uniform random data with \(N\) samples is as follows. \[\label{eq:beta} p_k(u)=\frac{N!}{\left(k-1\right)!\left(N-k\right)!}u^{k-1}\left(1-u\right)^{N-k} \tag{3}\] By integrating Equation (3) for each position \(k\), confidence levels for SQR values of sample size \(N\) are calculated in the getTarget function. Figure 2 demonstrates confidence levels for three different target percentages, plotted as contour lines. The background shading in grey represents typical ranges of SQR values, with darker shading corresponding to higher probability areas. This shading is independent of sample size, and is calculated according to Equations (2) and  (3) in the plotBeta function.

Table 2: Member variables of the PDFe class.
Member Variable Description
sample Sample data used to estimate the density.
x Points where the density is estimated.
pdf Estimated density values for x.
cdf Cumulative density values for x.
threshold Threshold score, expressed as a percentage, measuring the uniformity of the CDF of sample data.
failedSolution If true, indicates that the estimate returned does not meet at least a 5% threshold.
sqr Scaled quantile residual values for sample data points.
sqrSize Size of sqr.
graphic without alt text graphic without alt text
Figure 1: These plots demonstrate the distribution of scores as defined by Equation  and are used to assess the probability of a trial solution. The density was estimated empirically by generating 1000 trials of 10,000 uniform random data samples. The probability density function is shown on the left and the cumulative density function is on the right.
graphic without alt text graphic without alt text
Figure 2: Confidence levels by position for a scaled quantile residual plot based on beta distribution probabilities according to single order statistics for a sample size of 50 (left) and 1000 (right). The colored contour lines define regions within an oval shape that give target levels for observing data points deviating away from perfect uniform spacing as expected for true uniform random data. Note that the skewness in the contour lines are noticeable for small sample sizes compared to large sample sizes.

estimatePDF

estimatePDF provides an R interface for nonparametric density estimation based on a novel method providing an alternative to traditional KDE implementations. Details of this approach, based on the principle of maximum entropy (Wu 1997), were published previously and have been shown to produce more accurate estimates than KDE in most cases (Farmer and Jacobs 2018; Farmer et al. 2019; Puchert et al. 2021; Farmer and Jacobs 2022). For optimal performance and flexibility with other applications, this functionality is performed within a set of C++ classes and is not the focus of this paper, but a brief summary is provided for insight into the estimatePDF interface.

In a traditional maximum entropy method, moments for a set of characteristic functions are introduced, and the coefficients to these functions are optimized to match the predicted moments with the empirical moments. As a particular choice of moments that exist for any probability density, and to form a systematic truncated expansion over a complete set of orthogonal functions, the analytical form for the density function from a data sample is expressed as \[\label{eq:entropy} p(\nu)=\sum_{j=1}^{D}\exp\left(\lambda_jg_j(\nu)\right), \tag{4}\] where \(g_j(\nu)\) are bounded level functions and \(\lambda_j\) are Lagrange coefficients controlling the shape of the density function (Jacobs 2008). For a fixed number of coefficients, \(D\), this method is parametric in form. Although solving for the coefficients analytically is increasingly impractical for high dimensionality, a random search method is employed that provides very efficient numerical optimization. An expansion of orthogonal functions is constructed in the form of Equation (4), without specifying D in advance, where higher mode orthogonal functions are successively added as needed. The algorithm iteratively explores possible density functions by perturbing the Lagrange coefficients that are currently present, while D is slowly increased, testing each possibility according to the scoring function in Equation (1) to converge towards an accurate estimate.

The arguments for estimatePDF are listed in Table 3. If no other parameters are specified, the range of the returned estimate is calculated automatically. By default, left and right boundaries are presumed theoretically infinite and allowed to extend beyond the range of the data sample. The finite numerical bounds are calculated according to the density near the tails, with longer tails receiving more padding than shorter tails. Similarly, extreme outliers are detected and removed from the sample as appropriate according to the outlierCutoff argument. Alternatively, the lowerBound, upperBound, and outlierCutoff parameters can be independently specified to provide the user complete control over the range of the estimate. Setting outlierCutoff to zero turns off outlier detection and includes all the data.

Table 3: Overview of arguments in the estimatePDF function.
Arguments Description
sample A vector containing the data sample to estimate.
pdfLength Specifies the desired length of the estimate returned. By default, this length is calculated based on the length of the sample.
estimationPoints An optional vector containing specific points to estimate.
lowerBound Sets the finite lower bound for the sample, if it exists.
upperBound Sets the finite upper bound for the sample, if it exists.
lagrangeMin Specifies the minimum allowed dimension, \(D\), in Equation (4).
lagrangeMax Specifies the maximum allowed dimension, \(D\), in Equation (4).
debug If TRUE, detailed progress will be printed to the console.
outlierCutoff If greater than 0, specifies the range of included sample data, according to the formula [(Q1 - outlierCutoff x IQR), (Q3 + outlierCutoff x IQR)], where Q1, Q3, and IQR represent the first quartile, third quartile, and inter-quartile range, respectively.
target Sets a target percentage threshold between 0 and 100. The default is 70, the minimum accepted is 5.
smooth If TRUE (default), preference is given towards smooth density estimates.

The number of expansions in Equation (4) begins at lagrangeMin and is capped at lagrangeMax, with default values of 1 and 200, respectively. The maximum of 200 provides a generous realistic upper limit to the complexity of the estimate and the computational time required but can be altered to either increase accuracy or decrease compute time. Another reason to override these limits is to create a semi-parametric estimate by narrowing the range between minimum and maximum. A strictly parametric approach can be achieved by setting the two limits to the same value. For example, setting both langrangeMin and lagrangeMax to 1 forces a uniform fit. Similarly, setting them both to either 2 or 3 respectively yields exponential and Gaussian distributions.

The target argument refers to the cumulative probability of the z-score, as previously discussed. Note that target is a user-defined argument, whereas threshold in the PDFe class is the actual threshold achieved. These values may be different for several reasons. For example, the search for the target threshold may abort prematurely if the lagrangeMax has been exceeded or if progress has been stalled. Additionally, a small penalty is added to the score if the estimate becomes exceptionally noisy in areas of low density where sharp features are not justified. Therefore, a smoother curve may be favored over a higher score. The smoothing penalty is constructed according to a Taylor expansion error estimate as described previously (Jacobs 2008). The original model calculated a second order expansion, but a first order approximation was found to be sufficient and implemented in this version. This behavior can be circumvented when intentionally searching for small peaks by setting the smooth argument to FALSE.

convertToPDFe and approximatePoints

The estimatePDF function performs the density estimation based on the input parameters in Table 3 and returns a PDFe object for plotting and additional analysis. Alternatively, the convertToPDFe function will create a PDFe object for an estimate calculated using any other method for a given data sample. convertToPDFe requires a data sample and the (x, y) values for the estimate, and calculates the score, threshold, and SQR values for each point in the sample.

The approximatePoints function operates on a PDFe object to approximate the PDF for additional data points after the estimate has already been calculated using either estimatePDF or convertToPDFe. This functionality is similar to specifying estimationPoints in the estimatePDF function, but is provided for the convenience of approximating different points without having to recalculate the estimate. The following example will create a PDFe object for a KDE estimate of a random sample from a standard normal distribution, and then return additional density approximations at points -3, 0, and 1.

    sample = rnorm(1000)
    kde = density(sample)
    pdfe = convertToPDFe(sample, kde$x, kde$y)
    approximatePoints(pdfe, c(-3, 0, 1))

plot

The PDFEstimator::plot.PDFe function extends the generic plot function in R supporting all existing graphical parameters, with additional options summarized in Table 4. The first argument listed in the table is the PDFe object returned by estimatePDF or convertToPDFe and is required for all plots. The plotPDF and plotSQR arguments can be independently set to TRUE or FALSE and collectively control the plot type. The plotShading and showOutlierPercent values invoke the plotBeta and getTarget functions from Table 1 and provide optional diagnostics to highlight specified uncertainties within the estimate through the use of the SQR plot. The remaining arguments listed in Table 4 control minor graphical features for customized aesthetics. The following examples will demonstrate the variety of plots that can be created with combinations of these options.

Table 4: Overview of arguments in the plot function.
Arguments Description
x A PDFe estimation object. Returned from estimatePDF.
plotPDF Plots the probability density for x if TRUE.
plotSQR Plots the scaled quantile residual (SQR) for x if TRUE. If plotPDF is also TRUE, the SQR will be scaled to the range of the density.
plotShading Plots gray background shading indicating approximate confidence levels for the SQR, where darker shades indicate higher confidence. Setting this to TRUE only has meaning when plotSQR = TRUE.
shadeResolution Specifies the number of data points plotted in the background shading when plotShading = TRUE. Increasing this resolution will create sharper and more accurate approximations for the confidence levels, but will take more time to plot.
showOutlierPercent Specifies the threshold to define outliers for SQR. Must be a number between 1 and 100.
outlierColor Specifies the color for outliers when showOutlierPercent is defined.
sqrPlotThreshold Magnitude of y-axis for SQR plot.
sqrColor Specifies the SQR color for non-outliers
type Specifies the line type of the density curve if plotPDF = TRUE. If plotPDF = FALSE and plotSQR = TRUE, the SQR plot uses this type. The default is lines type.
lwd Specifies the line width of the density curve if plotPDF = TRUE. If plotPDF = FALSE and plotSQR = true, the SQR plot uses this width. The default is 2.
xlab x-axis label for pdf. If plotPDF = FALSE and plotSQR = TRUE, then the sqr plot uses this label.
ylab y-axis label for pdf. If plotPDF = FALSE and plotSQR = TRUE, then the sqr plot uses this label.
legendcex expansion factor for legend point size with sqr plot type, for plotPDF = FALSE and plotSQR = TRUE.
... Inherits all other plotting arguments from plot function

3 Examples and illustrations

Plotting estimates

Figure 3 contains two separate examples of the Maxwell distribution. The left panel demonstrates the plot function using all the default parameters and simply plots the density estimate. On the right, the plotSQR parameter creates the SQR plot, as described in Equation (2), overlaying the density. A shaded background, scaled according to the data sample, is plotted with the addition of the plotShading parameter set to TRUE, providing a visual approximation of the most probable range of scaled quantile residual values. Finally, the showOutlierPercent parameter flags scaled quantile residual values that are outside of the 99% target interval. By default, these outliers are plotted in red.

graphic without alt text graphic without alt text
Figure 3: (left) Density estimates for the Maxwell distribution with 100,000 samples using default parameters. (right) In addition to the density estimate, the scaled quantile residual mapped to the original variable x is shown along with a shaded region indicating 99% target. The scaled quantile residuals outside of the 99% target interval are highlighted in red.

Figure  4 contains additional examples for visual assessment of estimates. Each of these plots is based on the sawtooth distribution, defined by ten identical isosceles triangles at equal intervals. The sawtooth distribution is designed to be challenging to estimate due to these extremely sharp peaks. The exact distribution is plotted in gray in the top left panel of Figure 4, with the estimatePDF estimate for 100,000 samples shown in black. The estimate captures the high peaks well but falls short of reaching the lowest points of each triangle. The top right plot demonstrates how the showOutlierPercent parameter can identify specific areas of lower confidence in the estimate when the exact distribution is not known. In this example, the estimate is plotted in gray with green highlighting the sample points outside of an 80% target level. Although the estimate closely approximates the distribution, the low peaks are identified as less accurate.

graphic without alt text graphic without alt text
graphic without alt text graphic without alt text
Figure 4: Density estimates for the sawtooth distribution with 100,000 samples. The top left plot shows the exact known distribution in gray and the estimate in black. The top right plot shows the estimate in gray, with green highlights indicating areas outside of the 80% target threshold. The bottom row demonstrates scaled quantile residual plots for the same estimate, for 80% and 99% target levels.

The bottom row of figure 4 shows an alternative diagnostic visualization of the same estimate. These plots do not include the density estimate, but show only the SQR plot. Note that when the SQR is plotted alone, each sample point is spaced equally according to sort ordered position. Additionally, dotted lines represent the confidence interval, if showOutlierPercent is specified, and the calculated percentage of points lying outside this threshold are printed on the bottom right. These examples show targeted thresholds of 80% and 99% with outliers representing 8% and 1% of the points, respectively. If the sum of the threshold and outlier percentages fall far below or above 100%, this indicates a poor estimate. The figures in this section collectively demonstrate a wide range of visual assessment and color choices available with this customized plot function.

Advanced diagnostics

The results returned from estimatePDF provide additional information for further analysis when needed. For example, the first plot in Figure 5 is the SQR result for 10,000 data samples generated from a normal-cubed distribution. In this case, the failedSolution return value is TRUE, indicating that the estimate does not meet the required threshold for the scoring function. Rather than return a NULL solution in the event of an unacceptable score, the best scoring estimate is always returned. However, in addition to the poor average score, the SQR plot also shows large variations outside of a 99% threshold.

The second plot in Figure 5 shows the SQR result for the same sample data estimated with the core R function density. The density estimate is first converted to a PDFe object via the convertToPDFe function. The failedSolution return variable indicates this estimate also does not meet an acceptable fit, but the SQR plot further suggests an extreme underfit of the estimate near the midpoint of the sample. This is a common weakness of KDE estimates for data with sharp peaks and long tails. In fact, the normal-cubed distribution is undefined at zero and diverges as it approaches this singularity from both directions, presenting a challenge for any density estimator. In either method, however, the information available in PDFe alerts the user of problems that may require intervention.

A manual inspection of the sample data, perhaps in the form of a quantile or histogram plot, confirms an approximately symmetric distribution of data with an extreme variation in density near the center. A reasonable course of action in this event is to attempt to fit the low-density tails separately from the high-density peak. Figure 6 demonstrates this approach, modeling the distribution with three separate calls to estimatePDF (top) and density (bottom) for ranges delineated by -0.01 and 0.01 around the peak. The tails are bounded according to the range returned from the respective original estimates for each method.

For estimatePDF, this produces individual estimates mostly within the 99% target range shown in the three SQR plots in Figure 6. This example suggests a general divide and conquer method for addressing extreme distributions that is part of an automated procedure (to be published elsewhere). For density, however, the estimates remain poor in comparison even when fitting the regions separately. This is due, in part, because KDE methods do not incorporate automatic outlier detection. Additionally, KDE tends to perform poorly at boundaries, causing them to be less amenable to piecing together estimates in this way.

graphic without alt text graphic without alt text
Figure 5: Scaled quantile residual plots for estimates of the normal-cubed distribution with 10,000 samples using estimatePDF (left) and density (right).
graphic without alt text