This article presents PDFEstimator, an R package for nonparametric probability density estimation and analysis, as both a practical enhancement and alternative to kernelbased estimators. PDFEstimator creates fast, highly accurate, datadriven probability density estimates for continuous random data through an intuitive interface. Excellent results are obtained for a diverse set of data distributions ranging from 10 to \(10^6\) samples when invoked with default parameter definitions in the absence of user directives. Additionally, the package contains methods for assessing the quality of any estimate, including robust plotting functions for detailed visualization and troubleshooting. Usage of PDFEstimator is illustrated through a variety of examples, including comparisons to several kernel density methods.
The ability to estimate a probability distribution from a single data sample is critical across diverse fields of science and finance (Tang et al. 2016; Cavuoti et al. 2017; Munkhammar et al. 2017; SidibĂ© et al. 2018). Estimating the parameters of the underlying density function becomes increasingly difficult when there is no prior information about the number of parameters, as the shape and complexity must also be inferred from that data. Although there are many nonparametric methods, kernel density estimation (KDE) is among the most popular.
Variants of KDE differ based on how they implement the selections of the
bandwidth and the kernel function, which are nontrivial decisions that
can have a significant impact on the quality and performance of the
estimate. Most KDE implementations allow the user to manipulate some
parameters manually, allowing for an experienced user to finetune the
default behavior for improved results. Unfortunately, user directives
introduce unavoidable subjectivity to the estimate. More advanced
implementations include intelligent and adaptive bandwidth selection
optimized according to the characteristics of the data
(Botev et al. 2010; Chen 2017). However, there is inevitably a tradeoff
between computational performance and accuracy (Gray and Moore 2003). There are
many R packages available that can estimate density nonparametrically,
typically based on KDE
(Hayfield and Racine 2008; Konietschke et al. 2015; Calonico et al. 2019; Moss and Tveten 2019; Sebastian et al. 2019; Nagler and Vatter 2020; Wand 2021; Wand and Ripley 2021).
The core R function density
implements a straightforward kernel
density method.
Presented in this article is the package
PDFEstimator for
nonparametric density estimation and analysis (Farmer and Jacobs 2018). The features of
PDFEstimator can be separated into two categories. The first is a
novel estimation method based on the principle of maximum entropy,
available through the estimatePDF
function. A primary advantage of
estimatePDF
is the automated interface, requiring nothing from the
user other than a data sample. Range, multiscale resolution, outliers,
and boundaries are determined within the algorithm to achieve optimized
datadriven estimates appropriate to the given sample. Although these
defaults can be overridden by a sophisticated user, overrides generally
do not improve the results. Additionally, multiple acceptable solutions
can be returned, which is particularly useful in the case of low sample
sizes where there is more statistical uncertainty.
The second category of features included in PDFEstimator is a unique
set of assessment utilities for evaluating the accuracy of the solution
and highlighting areas of uncertainty within a density estimate.
Furthermore, a userdefined threshold can be specified to identify data
points that fall outside of an expected confidence level. These
diagnostics tools are visualized through a variety of customized
plotting options, thus aiding in the evaluation of difficult
distributions. Most importantly, all of these features can be applied
towards any estimation function, such as density
, as they are
universal measurements independent of the method used, allowing for an
integrated comparison between alternative models.
The remainder of this paper is organized as follows. Â 2
describes the functions available in this package, including their usage
and underlying methods. Â 3 provides additional and
advanced examples for identifying and troubleshooting problems with any
density estimation method. Â 4 compares estimatePDF
with three popular kernelbased estimation packages using a diverse set
of known distributions. Finally, Â 5 demonstrates the use
of PDFEstimator for a wellknown real data set from a rare stamp
collection.
An overview of the functions included in PDFEstimator is listed in
TableÂ 1. A critical component of the tools in
PDFEstimator is a PDFe
object, which encapsulates all information
necessary for plotting and assessing the quality of any density
estimate. The member variables for the PDFe
class definition are
listed in TableÂ 2. The methods for calculating the
necessary components of the PDFe
and how they are employed in each of
the functions in TableÂ 1 are described in this
section.
Function  Description 

getTarget  Returns upper and lower limits of SQR for a given target level 
plotBeta  Plots a shaded region outlining expected range of SQR values by position 
estimatePDF  Estimates a density from a data sample. Returns a PDFe estimation object. 
convertToPDFe  Converts any PDF to a PDFe estimation object for diagnostic purposes. 
approximatePoints  Returns approximated PDF for an existing PDFe estimation object at a given set of data points. 
plot  Main plotting function for PDFe estimation objects. 
lines  Plots the density for a PDFe estimation object as a connecting line segment to an existing plot. 
summary  Prints a summary of the PDFe object. 
Prints the probability density and cumulative density for each estimation point in the PDFe object. 
The first four member variables listed in TableÂ 2 are
commonly understood values for an estimated density, beginning with the
random data sample that is the basis for the estimate. The probability
density function (PDF) is estimated for the range of points defined in
x
. Similarly, the cumulative distribution function (CDF) can be
calculated representing the cumulative probability of the PDF for each
x
. PDFEstimator defines additional descriptions for an estimate that
measure the quality of its fit to the sample data. Central to the
quality of these estimates is a scoring mechanism to rate the overall
fit of an estimate to the sample. A single average score is calculated,
as well as individual confidence levels for each data point within the
sample. These scores are based on order statistics (Wilks 1948).
If the CDF, defined on the range (0 1), is an accurate representation of the data, then \(CDF(x)\) will represent uniform random data. The general problem then becomes assessing if \(r_k=CDF(x_k)\) represents uniform data for a given sample. Although there are many methods to test for uniform data (Farmer et al. 2019), PDFEstimator employs an average quadratic zscore, defined as \[\label{eq:zscore} z^2=\frac{1}{N} \sum_{k=1}^{N}\frac{\left(r_k\mu_k\right)^2}{\sigma_k^2}, \tag{1}\] where \(k\) is the sort ordered position in a sample size \(N\), and \(\mu_k\) and \(\sigma_k\) are the mean and standard deviation from single order statistics known to be \(\mu_k=\frac{k}{N+1}\) and \(\sigma_k=\frac{\mu_k\left(\mu_k1\right)}{\sqrt{N+2}}\). Perfectly uniform data would yield a score of exactly zero.
To study typical zscores for uniform random data, extensive numerical
experiments were generated with a random number generator on the range
(0, 1) for many different sample sizes. The typical distribution of
scores according to EquationÂ (1) is represented in the left
plot for FigureÂ 1. The peak density of the PDF
corresponds to the most likely zscore and occurs at a value of
approximately 0.5, with a sharp dropoff in the density as scores
approach zero. The threshold
value of the PDFe
object reports the
empirical cumulative probability for the zscore, as a percentage, shown
on the righthand plot of FigureÂ 1. The cumulative
probability for the peak zscore is calculated to be a little over 0.7,
thus a threshold value near or greater than 70% can be considered a
highly probable fit. A threshold of less than 5%, by contrast, is low
probability and therefore likely an underfit for the data. In this
event, the PDFe
member variable failedSolution
is set to TRUE
. A
score with a threshold of 95%, however, is similarly unlikely and can be
interpreted as overfitting the data. The software does not rigidly
enforce a particular score upon an accepted solution, but rather uses
the numerically calculated density shown in FigureÂ 1 as a
guide to iteratively move towards increasingly probable solutions.
The threshold provides an average score for the estimate, but to assess
the estimate per position and identify the locations of potential
errors, a scaled quantile residual (SQR) is defined as
\[\label{eq:sqr}
SQR_k=\sqrt{N+2} \left(r_k\mu_k\right). \tag{2}\]
The scaling factor of \(\sqrt{N+2}\) creates a samplesizeinvariant
metric for each position \(k\). The sqr
member variable of the PDFe
class contains a vector of SQR values according to
EquationÂ (2) and their ability to diagnose problems in a
density estimate will be demonstrated with the plot
function. Note
that the PDFe
object will also report the size of sqr
, which will be
the number of samples less the number of any outliers detected.
It has been shown that, when plotted against position, \(SQR_k\) for
uniform random data falls approximately within an oval shaped region
(Farmer et al. 2019). The reason for this can be understood by examining the beta
distributions that govern order statistics for sort ordered random
uniform data (Wilks 1948). The probability of \(u\) for position \(k\) in
uniform random data with \(N\) samples is as follows.
\[\label{eq:beta}
p_k(u)=\frac{N!}{\left(k1\right)!\left(Nk\right)!}u^{k1}\left(1u\right)^{Nk} \tag{3}\]
By integrating EquationÂ (3) for each position \(k\), confidence
levels for SQR values of sample size \(N\) are calculated in the
getTarget
function. FigureÂ 2 demonstrates confidence
levels for three different target percentages, plotted as contour lines.
The background shading in grey represents typical ranges of SQR values,
with darker shading corresponding to higher probability areas. This
shading is independent of sample size, and is calculated according to
EquationsÂ (2) and Â (3) in the plotBeta
function.
Member Variable  Description 

sample  Sample data used to estimate the density. 
x  Points where the density is estimated. 
Estimated density values for x . 

cdf  Cumulative density values for x . 
threshold  Threshold score, expressed as a percentage, measuring the uniformity of the CDF of sample data. 
failedSolution  If true, indicates that the estimate returned does not meet at least a 5% threshold. 
sqr  Scaled quantile residual values for sample data points. 
sqrSize  Size of sqr. 
estimatePDF
provides an R interface for nonparametric density
estimation based on a novel method providing an alternative to
traditional KDE implementations. Details of this approach, based on the
principle of maximum entropy (Wu 1997), were published previously and have
been shown to produce more accurate estimates than KDE in most cases
(Farmer and Jacobs 2018; Farmer et al. 2019; Puchert et al. 2021; Farmer and Jacobs 2022). For optimal performance and
flexibility with other applications, this functionality is performed
within a set of C++ classes and is not the focus of this paper, but a
brief summary is provided for insight into the estimatePDF
interface.
In a traditional maximum entropy method, moments for a set of characteristic functions are introduced, and the coefficients to these functions are optimized to match the predicted moments with the empirical moments. As a particular choice of moments that exist for any probability density, and to form a systematic truncated expansion over a complete set of orthogonal functions, the analytical form for the density function from a data sample is expressed as \[\label{eq:entropy} p(\nu)=\sum_{j=1}^{D}\exp\left(\lambda_jg_j(\nu)\right), \tag{4}\] where \(g_j(\nu)\) are bounded level functions and \(\lambda_j\) are Lagrange coefficients controlling the shape of the density function (Jacobs 2008). For a fixed number of coefficients, \(D\), this method is parametric in form. Although solving for the coefficients analytically is increasingly impractical for high dimensionality, a random search method is employed that provides very efficient numerical optimization. An expansion of orthogonal functions is constructed in the form of EquationÂ (4), without specifying D in advance, where higher mode orthogonal functions are successively added as needed. The algorithm iteratively explores possible density functions by perturbing the Lagrange coefficients that are currently present, while D is slowly increased, testing each possibility according to the scoring function in EquationÂ (1) to converge towards an accurate estimate.
The arguments for estimatePDF
are listed in
TableÂ 3. If no other parameters are specified, the
range of the returned estimate is calculated automatically. By default,
left and right boundaries are presumed theoretically infinite and
allowed to extend beyond the range of the data sample. The finite
numerical bounds are calculated according to the density near the tails,
with longer tails receiving more padding than shorter tails. Similarly,
extreme outliers are detected and removed from the sample as appropriate
according to the outlierCutoff
argument. Alternatively, the
lowerBound
, upperBound
, and outlierCutoff
parameters can be
independently specified to provide the user complete control over the
range of the estimate. Setting outlierCutoff
to zero turns off outlier
detection and includes all the data.
Arguments  Description 

sample  A vector containing the data sample to estimate. 
pdfLength  Specifies the desired length of the estimate returned. By default, this length is calculated based on the length of the sample. 
estimationPoints  An optional vector containing specific points to estimate. 
lowerBound  Sets the finite lower bound for the sample, if it exists. 
upperBound  Sets the finite upper bound for the sample, if it exists. 
lagrangeMin  Specifies the minimum allowed dimension, \(D\), in EquationÂ (4). 
lagrangeMax  Specifies the maximum allowed dimension, \(D\), in EquationÂ (4). 
debug  If TRUE, detailed progress will be printed to the console. 
outlierCutoff  If greater than 0, specifies the range of included sample data, according to the formula [(Q1  outlierCutoff x IQR), (Q3 + outlierCutoff x IQR)], where Q1, Q3, and IQR represent the first quartile, third quartile, and interquartile range, respectively. 
target  Sets a target percentage threshold between 0 and 100. The default is 70, the minimum accepted is 5. 
smooth  If TRUE (default), preference is given towards smooth density estimates. 
The number of expansions in EquationÂ (4) begins at
lagrangeMin
and is capped at lagrangeMax
, with default values of 1
and 200, respectively. The maximum of 200 provides a generous realistic
upper limit to the complexity of the estimate and the computational time
required but can be altered to either increase accuracy or decrease
compute time. Another reason to override these limits is to create a
semiparametric estimate by narrowing the range between minimum and
maximum. A strictly parametric approach can be achieved by setting the
two limits to the same value. For example, setting both langrangeMin
and lagrangeMax
to 1 forces a uniform fit. Similarly, setting them
both to either 2 or 3 respectively yields exponential and Gaussian
distributions.
The target
argument refers to the cumulative probability of the
zscore, as previously discussed. Note that target
is a userdefined
argument, whereas threshold
in the PDFe
class is the actual
threshold achieved. These values may be different for several reasons.
For example, the search for the target threshold may abort prematurely
if the lagrangeMax
has been exceeded or if progress has been stalled.
Additionally, a small penalty is added to the score if the estimate
becomes exceptionally noisy in areas of low density where sharp features
are not justified. Therefore, a smoother curve may be favored over a
higher score. The smoothing penalty is constructed according to a Taylor
expansion error estimate as described previously (Jacobs 2008). The original
model calculated a second order expansion, but a first order
approximation was found to be sufficient and implemented in this
version. This behavior can be circumvented when intentionally searching
for small peaks by setting the smooth
argument to FALSE.
The estimatePDF
function performs the density estimation based on the
input parameters in TableÂ 3 and returns a PDFe
object for plotting and additional analysis. Alternatively, the
convertToPDFe
function will create a PDFe
object for an estimate
calculated using any other method for a given data sample.
convertToPDFe
requires a data sample and the (x, y)
values for the
estimate, and calculates the score, threshold, and SQR values for each
point in the sample.
The approximatePoints
function operates on a PDFe
object to
approximate the PDF for additional data points after the estimate has
already been calculated using either estimatePDF
or convertToPDFe
.
This functionality is similar to specifying estimationPoints
in the
estimatePDF
function, but is provided for the convenience of
approximating different points without having to recalculate the
estimate. The following example will create a PDFe
object for a KDE
estimate of a random sample from a standard normal distribution, and
then return additional density approximations at points 3, 0, and 1.
= rnorm(1000)
sample = density(sample)
kde = convertToPDFe(sample, kde$x, kde$y)
pdfe approximatePoints(pdfe, c(3, 0, 1))
The PDFEstimator::plot.PDFe
function extends the generic plot
function in R supporting all existing graphical parameters, with
additional options summarized in TableÂ 4. The first
argument listed in the table is the PDFe
object returned by
estimatePDF
or convertToPDFe
and is required for all plots. The
plotPDF
and plotSQR
arguments can be independently set to TRUE or
FALSE and collectively control the plot type. The plotShading
and
showOutlierPercent
values invoke the plotBeta
and getTarget
functions from TableÂ 1 and provide optional
diagnostics to highlight specified uncertainties within the estimate
through the use of the SQR plot. The remaining arguments listed in
TableÂ 4 control minor graphical features for
customized aesthetics. The following examples will demonstrate the
variety of plots that can be created with combinations of these options.
Arguments  Description 

x  A PDFe estimation object. Returned from estimatePDF . 
plotPDF  Plots the probability density for x if TRUE. 
plotSQR  Plots the scaled quantile residual (SQR) for x if TRUE. If plotPDF is also TRUE, the SQR will be scaled to the range of the density. 
plotShading  Plots gray background shading indicating approximate confidence levels for the SQR, where darker shades indicate higher confidence. Setting this to TRUE only has meaning when plotSQR = TRUE. 
shadeResolution  Specifies the number of data points plotted in the background shading when plotShading = TRUE. Increasing this resolution will create sharper and more accurate approximations for the confidence levels, but will take more time to plot. 
showOutlierPercent  Specifies the threshold to define outliers for SQR. Must be a number between 1 and 100. 
outlierColor  Specifies the color for outliers when showOutlierPercent is defined. 
sqrPlotThreshold  Magnitude of yaxis for SQR plot. 
sqrColor  Specifies the SQR color for nonoutliers 
type  Specifies the line type of the density curve if plotPDF = TRUE. If plotPDF = FALSE and plotSQR = TRUE, the SQR plot uses this type. The default is lines type. 
lwd  Specifies the line width of the density curve if plotPDF = TRUE. If plotPDF = FALSE and plotSQR = true, the SQR plot uses this width. The default is 2. 
xlab  xaxis label for pdf. If plotPDF = FALSE and plotSQR = TRUE, then the sqr plot uses this label. 
ylab  yaxis label for pdf. If plotPDF = FALSE and plotSQR = TRUE, then the sqr plot uses this label. 
legendcex  expansion factor for legend point size with sqr plot type, for plotPDF = FALSE and plotSQR = TRUE. 
...  Inherits all other plotting arguments from plot function 
FigureÂ 3 contains two separate examples of the Maxwell
distribution. The left panel demonstrates the plot function using all
the default parameters and simply plots
the density estimate. On the
right, the plotSQR
parameter creates the SQR plot, as described in
EquationÂ (2), overlaying the density. A shaded background,
scaled according to the data sample, is plotted with the addition of the
plotShading
parameter set to TRUE, providing a visual approximation of
the most probable range of scaled quantile residual values. Finally, the
showOutlierPercent
parameter flags scaled quantile residual values
that are outside of the 99% target interval. By default, these outliers
are plotted in red.
Figure Â 4 contains additional examples for visual
assessment of estimates. Each of these plots is based on the sawtooth
distribution, defined by ten identical isosceles triangles at equal
intervals. The sawtooth distribution is designed to be challenging to
estimate due to these extremely sharp peaks. The exact distribution is
plotted in gray in the top left panel of FigureÂ 4, with
the estimatePDF
estimate for 100,000 samples shown in black. The
estimate captures the high peaks well but falls short of reaching the
lowest points of each triangle. The top right plot demonstrates how the
showOutlierPercent
parameter can identify specific areas of lower
confidence in the estimate when the exact distribution is not known. In
this example, the estimate is plotted in gray with green highlighting
the sample points outside of an 80% target level. Although the estimate
closely approximates the distribution, the low peaks are identified as
less accurate.
The bottom row of figureÂ 4 shows an alternative
diagnostic visualization of the same estimate. These plots do not
include the density estimate, but show only the SQR plot. Note that when
the SQR is plotted alone, each sample point is spaced equally according
to sort ordered position. Additionally, dotted lines represent the
confidence interval, if showOutlierPercent
is specified, and the
calculated percentage of points lying outside this threshold are printed
on the bottom right. These examples show targeted thresholds of 80% and
99% with outliers representing 8% and 1% of the points, respectively. If
the sum of the threshold and outlier percentages fall far below or above
100%, this indicates a poor estimate. The figures in this section
collectively demonstrate a wide range of visual assessment and color
choices available with this customized plot
function.
The results returned from estimatePDF
provide additional information
for further analysis when needed. For example, the first plot in
FigureÂ 5 is the SQR result for 10,000 data samples
generated from a normalcubed distribution. In this case, the
failedSolution
return value is TRUE, indicating that the estimate does
not meet the required threshold for the scoring function. Rather than
return a NULL solution in the event of an unacceptable score, the best
scoring estimate is always returned. However, in addition to the poor
average score, the SQR plot also shows large variations outside of a 99%
threshold.
The second plot in FigureÂ 5 shows the SQR result
for the same sample data estimated with the core R function density
.
The density
estimate is first converted to a PDFe
object via the
convertToPDFe
function. The failedSolution
return variable indicates
this estimate also does not meet an acceptable fit, but the SQR plot
further suggests an extreme underfit of the estimate near the midpoint
of the sample. This is a common weakness of KDE estimates for data with
sharp peaks and long tails. In fact, the normalcubed distribution is
undefined at zero and diverges as it approaches this singularity from
both directions, presenting a challenge for any density estimator. In
either method, however, the information available in PDFe
alerts the
user of problems that may require intervention.
A manual inspection of the sample data, perhaps in the form of a
quantile or histogram plot, confirms an approximately symmetric
distribution of data with an extreme variation in density near the
center. A reasonable course of action in this event is to attempt to fit
the lowdensity tails separately from the highdensity peak.
FigureÂ 6 demonstrates this approach, modeling the
distribution with three separate calls to estimatePDF
(top) and
density
(bottom) for ranges delineated by 0.01 and 0.01 around the
peak. The tails are bounded according to the range returned from the
respective original estimates for each method.
For estimatePDF
, this produces individual estimates mostly within the
99% target range shown in the three SQR plots in
FigureÂ 6. This example suggests a general divide
and conquer method for addressing extreme distributions that is part of
an automated procedure (to be published elsewhere). For density
,
however, the estimates remain poor in comparison even when fitting the
regions separately. This is due, in part, because KDE methods do not
incorporate automatic outlier detection. Additionally, KDE tends to
perform poorly at boundaries, causing them to be less amenable to
piecing together estimates in this way.
estimatePDF
(left) and density
(right).