distr6 is an object-oriented (OO) probability distributions interface leveraging the extensibility and scalability of R6 and the speed and efficiency of Rcpp. Over 50 probability distributions are currently implemented in the package with ‘core’ methods, including density, distribution, and generating functions, and more ‘exotic’ ones, including hazards and distribution function anti-derivatives. In addition to simple distributions, distr6 supports compositions such as truncation, mixtures, and product distributions. This paper presents the core functionality of the package and demonstrates examples for key use-cases. In addition, this paper provides a critical review of the object-oriented programming paradigms in R and describes some novel implementations for design patterns and core object-oriented features introduced by the package for supporting distr6 components.
Probability distributions are an essential part of data science, underpinning models, simulations, and inference. Hence, they are central to computational data science. With the advent of modern machine learning and AI, it has become increasingly common to adopt a conceptual model where distributions are considered objects in their own right, as opposed to primarily represented through distribution-defining functions (e.g., cdf, pdf) or random samples.
An important distinction to keep in mind is between random variables (that can be sampled from) and probability distributions. distr6 is an interface for probability distributions and supports construction, manipulation, composition, and querying of parameterized simple and composite distributions. distr6 is not an interface for random variables, and therefore, procedures such as sampling and inference are out of scope.
We continue by explaining our conceptual model of probability distributions underpinning the design of distr6 and delineate it from the common conceptualization of random variables. A full mathematical definition of the conceptual model is given in the next section. This section contains an intuitive introduction.
First, we invite the reader to recall some common mathematical objects and recognize that these are related but conceptually distinct:
Critically, we highlight that random variables and distributions are neither identical objects nor concepts. A random variable \(X\) has distribution \(d\), and multiple random variables may be distributed according to \(d\). Further, random variables are sampled from, while the distribution is only a description of probabilities for \(X\). Thus, \(X\) and \(d\) are not identical objects. Figure 1 visually summarizes these differences.
As a possible logical consequence of the above, we adopt the conceptual model that a distribution is an abstract object, which:
Abstracting distributions as objects from multiple, non-identical representations (random variables) has major consequences for the conceptual model:
It lends itself naturally to a class-object representation in the computer scientific sense of object-oriented programming. Abstract distributions become classes, concrete distributions are objects, and distribution defining functions are methods of these classes. Random variables are a separate type of object.
It strongly suggests adoption of mathematical conceptualization and notation which cleanly separates distributions from random variables and distribution defining functions - in contrast to common convention, where random variables or random sampling takes conceptual primacy above all.
It allows clean formulation of algorithmic manipulations involving distributions, especially higher-order constructs (truncation, huberization, etc.), as well as clean mathematical definitions.
In distr6, distributions are first-class objects subject to an
object-oriented class-object representation. For example, a discrete
uniform distribution (fig. 1b) is a
‘class’ with traits such as type (Naturals) and variate form
(univariate). With a given parameterization, this becomes an ‘object’
with properties including symmetry and support. An alternative
definition to the conceptual model of distributions is now provided.
On the mathematical level, we again consider distributions as objects in
their own right, not being identical with a cdf, pdf, or measure, but
instead ‘having’ these as properties.
For a set \(\mathcal{Y}\) (endowed with suitable topology), we define Distr\((\mathcal{Y})\) as a set containing formal objects \(d\) which are in bijection to (but not identical with) probability measures over \(\mathcal{Y}\). Elements of Distr\((\mathcal{Y})\) are called distributions over \(\mathcal{Y}\). We further define formal symbols which, in case of existence, denote ‘aspects’ that such elements have, in the following way: the symbol \(d.F\), for example, denotes the cdf of \(d\), which is to be read as the ‘\(F\)’ of \(d\), with \(F\), in this case, to be read as a modifier to a standard symbol \(d\), rather than a fixed, bound, or variable symbol. In this way, we can define:
\(d.F\) for the cdf of \(d\). This typically exists if \(\mathcal{Y}\subseteq \mathbb{R}^n\) for some \(n\), in which case \(d.F\) is a function of type \(d.F: \mathbb{R}^n \rightarrow [0,1]\).
\(d.f\) for the pdf of \(d\). This exists if \(\mathcal{Y}\subseteq \mathbb{R}^n\), and the distribution \(d\) is absolutely continuous over \(\mathcal{Y}\). In this case, \(d.f\) is a function of type \(d.f: \mathbb{R}^n \rightarrow [0,\infty)\).
\(d.P\) for the probability measure that is in bijection with \(d\). This is a function \(d.P: \mathcal{F} \rightarrow [0,1]\), where \(\mathcal{F}\) is the set of measurable sub-sets of \(\mathcal{Y}\).
We would like to point out that the above is indeed a full formal mathematical definition of our notion of distribution. While distributions, defined this way, are not identical with any of the conventional mathematical objects that define them (cdf, pdf, measures), they are conceptually, formally, and notationally well-defined. Similarly, the aspects (\(d.F\), \(d.f\), etc.) are also well-defined since they refer to one of the conventional mathematical objects which are well-specified in the dependence of the distribution (in case of existence).
This notation provides a more natural and clearer separation of distribution and random variables and allows us to talk about and denote concepts such as ‘the cdf of any random variable following the distribution \(d\)’ with ease (\(d.F\)), unlike classical notation that would see one define \(X\sim d\) and then write \(F_X\). Our notation more clearly follows the software implementation of distributions.
For example, in distr6, the code counterpart to defining a distribution \(d\) which is Gaussian with mean \(1\) and variance \(2\) is
> d <- Normal$new(1, 2)
The pdf and cdf of this Gaussian distribution evaluated at \(2\) are obtained in code as
> d$pdf(2)
> d$cdf(2)
which evaluate to ‘numerics’ that represent the real numbers \(d.f(2)\)
and \(d.F(2)\).
The consideration of distributions as objects, and their conceptual distinction from random variables as objects, notably differs from the conceptualization in R stats, which implements both distribution and random variable methods by the ‘dpqr’ functions. Whilst this may allow a very fast generation of probabilities and values, there is no support for querying and inspection of distributions as objects. By instead treating the dpqr functions as methods that belong to a distribution object, distr6 encapsulates all the information in R stats as well as distribution properties, traits, and other important mathematical methods. The object orientation principle that defines the architecture of distr6 is further discussed throughout this manuscript.
Treating distributions as objects is not unique to this package. Possibly the first instance of the object-oriented conceptualization is the distr (Ruckdeschel et al. 2006) family of packages. distr6 was designed alongside the authors of distr in order to port some of their functionality from S4 to R6.
distr6 is the first such package to use the ‘class’ object-oriented paradigm R6 (Chang 2018), with other distribution related packages using S3 or S4. The choice of R6 over S3 and S4 is discussed in detail in section 5.1. This choice allows distr6 to fully leverage the conceptual model and make use of core R6 functionality, as well as to introduce fundamental object-oriented programming (OOP) principles such as abstract classes and tried and tested design patterns (Gamma et al. 1996), including decorators, wrappers, and compositors (see section 5.3).
Besides an overview of distr6’s novel approach to probability
distributions in R, this paper also presents a formal comparison of the
different OOP paradigms while detailing the use of design patterns
relevant to the package.
The strength of the object-oriented approach, both on the algorithmic and mathematical side, lies in its ability to efficiently express higher-order constructs and operations: actions between distributions, resulting in new distributions. One such example is mixture distributions (also known as spliced distributions). In the distr6 software interface, a MixtureDistribution is a higher-order distribution depending on two or more other distributions. For example, take a uniform mixture of two distributions distr1 and distr2:
> my_mixt <- MixtureDistribution$new(list(distr1, distr2))
Internally, the dependency of the constructs on the components is remembered so that my_mixt is not only evaluable for cdf (and other methods), but also carries a symbolic representation of its construction and definition history in terms of distr1 and distr2.
On the mathematical side, the object-oriented formalism allows clean
definitions of otherwise more obscure concepts. For example, the mixture
distribution is now defined as follows:
For distributions \(d_1,\dots,d_m\) over \(\mathbb{R}^n\) and non-negative weights \(w_1,\dots, w_m\) with \(\sum_{i=1}^m w_i = 1\), we define the mixture of \(d_1,\dots, d_m\) with weights \(w_1,\dots, w_m\) to be the unique distribution \(\tilde{d}\) such that \(\tilde{d}.F(x) = \sum_{i=1}^m w_i\cdot d_i.F(x)\) for any \(x\in \mathbb{R}^n\). Note the added clarity gained by defining the mixture directly on the distributions \(d_i\), i.e., as a first-order concept in terms of distributions.
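The defining identity for the mixture cdf can be checked numerically with base R alone. The sketch below is only an illustration of the formula, not the distr6 implementation: the helper `mixture_cdf` and the two Normal components are hypothetical choices made for this example.

```r
# Mixture cdf as a weighted sum of component cdfs (base R only).
# Components d1 = Normal(0, sd 1) and d2 = Normal(3, sd 2) are illustrative.
component_cdfs <- list(
  function(x) pnorm(x, mean = 0, sd = 1),
  function(x) pnorm(x, mean = 3, sd = 2)
)
weights <- c(0.5, 0.5)  # uniform mixture; weights must sum to one

# mixture_cdf is a hypothetical helper, not a distr6 function
mixture_cdf <- function(x, cdfs, w) {
  stopifnot(length(cdfs) == length(w), all(w >= 0),
            isTRUE(all.equal(sum(w), 1)))
  # evaluate every component cdf at x (one column per component)
  vals <- vapply(cdfs, function(cdf_i) cdf_i(x), numeric(length(x)))
  # weighted sum across components gives the mixture cdf at each x
  drop(matrix(vals, nrow = length(x)) %*% w)
}

mixture_cdf(c(-1, 0, 3), component_cdfs, weights)
```

Because the mixture is defined through its cdf, any other aspect (for example the pdf, where it exists) follows from the same weighted sum applied to the components.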
This section provides a review of other related software that implement probability distributions. This is focused on, but not limited to, software in R.
The core R programming language consists of packages for basic coding and maths as well as the stats package for statistical functions. stats contains 17 common probability distributions and four lesser-known distributions. Each distribution consists of (at most) four functions: dX, pX, qX, rX, where X represents the distribution name. These correspond to the probability density/mass, cumulative distribution, quantile (inverse cumulative distribution), and simulation functions, respectively. Each is implemented as a separate function, written in C, with both inputs and outputs as numerics. The strength of these functions lies in their speed and efficiency. There is no quicker way to find, say, the pdf of a Normal distribution than to run the dnorm function from stats. However, this is the limit of the package in terms of probability distributions. As there is no designated distribution object, there is no way to query results from the distributions outside of the ‘dpqr’ functions.
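As a brief illustration of the four-function convention, the following base R snippet calls the functions for X = norm. The parameter values are arbitrary and chosen only for the example.

```r
# The four stats functions for X = norm, for a Normal with mean 1 and sd 2
dens  <- dnorm(2, mean = 1, sd = 2)      # density at 2
prob  <- pnorm(2, mean = 1, sd = 2)      # P(X <= 2)
quant <- qnorm(prob, mean = 1, sd = 2)   # inverse cdf, recovers 2
set.seed(1)                              # seed only for reproducibility
draws <- rnorm(5, mean = 1, sd = 2)      # five simulated values

# Inputs and outputs are plain numerics; nothing records which
# distribution (or which parameters) produced them
c(density = dens, cdf = prob, quantile = quant)
class(draws)
```

Note that the parameters have to be repeated in every call, since there is no object that holds them.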
Several R packages implement dpqr functions for extra probability distributions. Of particular note are the extraDistr (Wolodzko 2019) and actuar (Dutang et al. 2008) packages that add over 60 distributions between them. Both of these packages are limited to dpqr functions and therefore have the same limits as R stats.
The distr package was the first package in R to implement an object-oriented interface for distributions, using the S4 object-oriented paradigm. distr tackles the two fundamental problems of stats by introducing distributions as objects that can be stored and queried. These objects include important statistical results, for example, the expectation, variance, and moment generating functions of a distribution. The distr family of packages includes a total of five packages for object-oriented distributions in R. distr has two weaknesses that were caused by using the S4 paradigm. Firstly, the package relies on inheritance, which means that large class trees exist for every object, and extensibility is therefore non-trivial. The second weakness is that S4 objects are not referred to by ‘pointers’ but instead by copies. This means that a simple mixture of two distributions is just under 0.5Mb in size (relatively quite large).
The distributions3 package (Hayes and Moller-Trane 2019) defines distributions as objects using the S3 paradigm. However, whilst distributions3 treats probability distributions as S3 objects, it does not add any properties, traits, or methods and instead uses the objects solely for dpqr dispatch. In comparison to distr, the distributions3 package provides fewer features for inspection or composition. More recently, distributional (O’Hara-Wild and Hayes 2020) builds on the distributions3 framework (common authors exist between the two) to focus on the vectorization of probability distributions coded as S3 objects. Similarly to distr6, the primary use-case of this package is for predictive modeling of distributions as objects.
The mistr package (Sablica and Hornik 2020) is another recent distributions package, which is also influenced by distr. The sole focus of mistr is to add a comprehensive and flexible framework for composite models and mixed distributions. Similarly to the previous two packages, mistr implements distributions as S3 objects.
Despite not being a package written in R, the Julia Distributions.jl (Lin et al. 2019) package provided inspiration for distr6. Distributions.jl implements distributions as objects with statistical properties including expectation, variance, moment generating and characteristic functions, and many more. This package uses multiple inheritance for ‘valueSupport’ (discrete/continuous) and ‘variateForm’ (univariate/multivariate/matrixvariate). Every distribution inherits from both of these, e.g., a distribution can be ‘discrete-univariate’, ‘continuous-multivariate’, ‘continuous-matrixvariate’, etc. The package provides a unified and user-friendly interface, which was a helpful starting point for distr6.
distr6 was designed and built around the following principles.
D1) Unified interface. The package is designed such that all distributions, no matter how complex, have an identical user-facing interface. This helps make the package easy to navigate and the documentation simple to read. Moreover, it minimizes any confusion resulting from using multiple distributions. A clear inheritance structure also allows wrappers and decorators to have the same methods as distributions, which means even complex composite distributions should be intuitive to use. Whether a user constructs a simple Uniform distribution or a mixture of 100 Normal distributions, the same methods and fields are seen in both objects.
D2) Separation of core/exotic and numerical/analytic. Via abstraction and encapsulation, core statistical results (common methods such as mean and variance) are separated from ‘exotic’ ones (less common methods such as anti-derivatives and p-norms). Similarly, implemented distributions only contain analytic results; users can impute numerical results using decorators. This separation has several benefits, including: 1) for predictive modeling with/of distributions, numerical results can take longer to compute than analytical results, and the difference in precision between analytical and numerical results can be substantial in the context of automated modeling, hence separation allows these differences to be highlighted and controlled; 2) separating numerical results allows an expanded interface for users to fine-tune and set their own parameters for how numerical results are computed; 3) a less-technical user can guarantee the precision of results as they are unlikely to use numerical decorators; 4) a user has access to the most important distribution methods immediately after construction but is not overwhelmed by many ‘exotic’ methods that they may never use. Use of decorators and wrappers allows the user to manually expand the interface at any time. For example, a user can choose between an undecorated Binomial distribution, with common methods such as mean and variance, or they can decorate the distribution to additionally gain access to survival and hazard functions.
D3) Inheritance without over-inheritance. The class structure stems from a series of a few abstract classes with concrete child classes, which allows for a sensible, but not over-complicated, inheritance structure. For example, all implemented distributions inherit from a single parent class, so common methods can be unified and only coded once; note there is no separation of distributions into ‘continuous’ and ‘discrete’ classes. By allowing the extension of classes by decorators and wrappers, and not solely inheritance, the interface is highly scalable and extensible. By ‘scalability’, we refer to the interface’s ability to grow to a large scale without additional overheads. The decorator and wrapper patterns on top of the R6 paradigm allow a (theoretically) unlimited number of distributions, wrappers, and methods without computational difficulty. By ‘extensibility’, we refer to the ability to extend the interface. Again, this is made possible by clean abstraction of distributions, wrappers, core methods, and extra methods in decorators. All decorators and wrappers in distr6 stem from abstract classes, which in turn inherit from the Distribution super-class. In doing so, any method of expanding an object’s interface in distr6 (i.e., via decorators, wrappers, or inheritance) will automatically lead to an interface that inherits from the top-level class, maintaining the principle of a unified interface (D1).
D4) Inspection and manipulation of multiple parameterizations. The design process identified that use of distributions in R stats is inflexible in that in the majority of cases, only one parameterization of each distribution is allowed. This can lead to isolating users who may be very familiar with one parameterization but completely unaware of another. For example, the use of the precision parameter in the Normal distribution is typically more common in Bayesian statistics, whereas using the variance or standard deviation parameters is more common in frequentist statistics. distr6 allows the user to choose from multiple parameterizations for all distributions (where more than one parameterization is possible/known). Furthermore, querying and updating of any parameter in the distribution is allowed, even if it was not specified in construction (section 4). This allows for a flexible parameter interface that can be fully queried and modified at any time.
D5) Flexible interfacing for technical and non-technical users. Throughout the design process, it was required that distr6 be accessible to all R users. This was a challenge as R6 is a very different paradigm from S3 and S4. To reduce the learning curve, the interface is designed to be as user-friendly and flexible as possible. This includes: 1) a ‘sensible default principle’ such that all distributions have justified default values; 2) an ‘inspection principle’ with functions to list all distributions, wrappers, and decorators. As discussed in (D2), abstraction and encapsulation allow technical users to expand any distribution’s interface to be as arbitrarily complex as they like, whilst maintaining a minimal representation by default. Where possible, defaults are ‘standard’ distributions, i.e., with location \(0\) and scale \(1\); otherwise, sensible defaults are identified as realistic scenarios, for example Binomial(n = 10, p = 0.5).
D6) Flexible OO paradigms. Following from (D5), we identified that R6 is still relatively new in R with only \(314\) out of \(16{,}050\) packages depending on it (as of July 2020). Therefore this was acknowledged and taken into account when building the package. R6 is also the first paradigm in R with the dollar-sign notation (though S4 uses ‘@’ notation) and with a proper construction method. Whilst new users are advised to learn the basics of R6, S3 compatibility is available for all common methods via R62S3 (Sonabend 2019). Users can therefore decide on calling a method via dollar-sign notation or dispatch. The example below demonstrates ‘piping’ and S3. As the core package is built on R6, the thin-wrappers provided by R62S3 do not compromise the above design principles.
> library(magrittr)
> N <- Normal$new(mean = 2)
> N %>%
+   setParameterValue(mean = 1) %>%
+   getParameterValue("mean")
[1] 1
> pdf(N, 1:4)
[1] 0.398942280 0.241970725 0.053990967 0.004431848
distr6 1.4.3 implements 56 probability distributions, including 11 probability kernels. Individual distributions are modeled via classes that inherit from a common interface, implemented in the abstract Distribution parent class. The Distribution class specifies the abstract distribution interface for parameter access, properties, traits, and methods, such as a distribution’s pdf or cdf. The most important interface points are described in Section 4.1.
Concrete distributions, kernels, and wrappers are the grandchildren of Distribution, and children of one of the mid-layer abstract classes:
SDistribution, which models abstract, generic distributions. Concrete distributions, such as Normal, which models the normal distribution, inherit from SDistribution.
Kernel, which models probability kernels, such as Triangular and Epanechnikov. Probability kernels are absolutely continuous distributions over the Reals, with assumed mean 0 and variance 1.
DistributionWrapper, which is an abstract parent for higher-order operations on distributions, including compositions, that is, operations that create distributions from other distributions, such as truncation or mixture.
DistributionDecorator, whose purpose is supplementing methods to distributions in the form of a decorator design pattern. This includes methods such as integrated cdf or squared integrals of distribution defining functions.
Figure 3 visualizes the key class structure of distr6, including the concrete Distribution parent class, from which all other classes in the package inherit (with the exception of the ParameterSet). These abstract classes allow simple extensibility for concrete sub-classes.
The base, or top-level, class in distr6 is the Distribution class. Its primary function is to act as a parent class for the implemented probability distributions and higher-order compositions. It is also utilized for the creation of custom distributions. By design, any distribution already implemented in distr6 will have the same interface as a user-specified custom distribution, ensuring (D1) is upheld. The most important methods for a distribution are shown in Table 1 alongside their meaning and definitions (mathematical if possible). The two use-cases for the Distribution class are discussed separately.
Method | Description/Definition |
---|---|
pdf/cdf/quantile/rand | dpqr functions. |
mean | \(d.\mu = \mathbb{E}[X]\) |
variance | \(d.\sigma^2 = \mathbb{E}[(X - d.\mu)^2]\) |
traits | List including value support (discrete/continuous/mixed); variate form (uni-/multi-/matrixvariate); type (mathematical domain). |
properties | List including skewness (\(\mathbb{E}[((X - d.\mu)/d.\sigma)^3]\)) and symmetry (boolean). |
get/setParameterValue | Getters and setters for parameter values. |
parameters | Returns the internal parameterization set. |
print/summary | Representation functions, summary includes distribution properties and traits. |
It is anticipated that the majority of distr6 users will be using the package for the implemented distributions and kernels. With this in mind, the Distribution class defines all variables and methods common to all child classes. The most important of these are the common analytical expressions and the dpqr public methods. Every concrete implemented distribution/kernel has identical public dpqr methods that internally call private dpqr methods. This accounts for inconsistencies occurring from packages returning functions in different formats and handling errors differently, a problem most prominent in multivariate distributions. Another example is the handling of non-integer values for discrete distributions. In some packages, this returns \(0\), or the value is rounded down, or an error is returned. The dpqr functions for all distributions have unified validation checks and return types (numeric or data.table). In line with base R and other distribution packages, distr6 implements a single pdf function to cover both probability mass and probability density functions.
> Normal$new()$pdf(1:2)
[1] 0.24197072 0.05399097
> Binomial$new()$cdf(1:2, lower.tail = FALSE, log.p = TRUE, simplify = FALSE)
         Binom
1: -0.01080030
2: -0.05623972
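The upper-tail log-probabilities printed above can be cross-checked against base R. The snippet assumes that the default Binomial in distr6 has size 10 and probability 0.5, as suggested by the default Binomial(n = 10, p = 0.5) mentioned in the design principles; it only illustrates what the lower.tail and log.p arguments compute.

```r
# Upper-tail log-probabilities log P(X > x) for Binomial(size = 10, prob = 0.5).
# size and prob are assumed to match the distr6 defaults.
x <- 1:2
log_upper <- pbinom(x, size = 10, prob = 0.5, lower.tail = FALSE, log.p = TRUE)

# The same quantity computed from the definition, as a cross-check
by_hand <- log(1 - pbinom(x, size = 10, prob = 0.5))

log_upper
all.equal(log_upper, by_hand)
```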
A key design principle in the package is the separation of analytical and numerical results (D2), which is ensured by only including analytical results in implemented distributions. Missing methods in a distribution, therefore, signify that no closed-form expression for the method is available. However, all can be numerically estimated with the CoreStatistics decorator (see section 4.2). Ideally, all distributions will include analytical methods for the following: probability density/mass function (pdf), cumulative distribution function (cdf), inverse cumulative distribution function/quantile function (quantile), simulation function (rand), mean, variance, skewness, (excess) kurtosis, and entropy of the distribution (mean, variance, skewness, kurtosis, entropy), as well as the moment generating function (mgf), characteristic function (cf), and probability generating function (pgf). Speed is currently a limitation in distr6, but the use of Rcpp (Eddelbuettel and Francois 2011) in all dpqr functions helps mitigate this.
The fourth design principle of distr6 ensures that multiple parameterizations of a given distribution can be both provided and inspected at all times. For example, the Normal distribution can be parameterized in terms of variance, standard deviation, or precision. Any of these can be specified at construction, with the other parameters updated accordingly. If conflicting parameterizations are provided, then an error is returned. For example,
# set precision, others updated automatically
> Normal$new(prec = 4)
Norm(mean = 0, var = 0.25, sd = 0.5, prec = 4)
# try and set both precision and variance, results in error
> Normal$new(var = 1, prec = 2)
Error in FUN(X[[i]], ...) :
  Conflicting parametrisations detected. Only one of {var, sd, prec} should be given.
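The relationship between the three scale parameterizations of the Normal distribution (sd is the square root of var, and prec is the reciprocal of var) can be sketched in base R. The helper `resolve_normal_scale` below is hypothetical and is not how distr6 stores or updates parameters; it only mirrors the behaviour shown above, including the error on conflicting inputs, and the unit-variance default is an assumption.

```r
# Hypothetical helper (not part of distr6): resolve the scale of a Normal
# distribution from at most one of variance, standard deviation, or precision.
resolve_normal_scale <- function(var = NULL, sd = NULL, prec = NULL) {
  given <- c(var = !is.null(var), sd = !is.null(sd), prec = !is.null(prec))
  if (sum(given) > 1) {
    stop("Conflicting parameterizations: only one of {var, sd, prec} should be given.")
  }
  if (sum(given) == 0) var <- 1            # assumed default: unit variance
  if (given[["sd"]])   var <- sd^2         # var = sd^2
  if (given[["prec"]]) var <- 1 / prec     # var = 1 / prec
  list(var = var, sd = sqrt(var), prec = 1 / var)
}

resolve_normal_scale(prec = 4)   # var = 0.25, sd = 0.5, prec = 4

# Supplying two scale parameters is rejected
conflict <- tryCatch(resolve_normal_scale(var = 1, prec = 2),
                     error = function(e) conditionMessage(e))
conflict
```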
The same principle is used for parameter setting. The methods getParameterValue and setParameterValue are utilized for getting and setting parameter values, respectively. The former takes a single argument, the parameter name, and the latter a named list of arguments corresponding to the parameter name and the value to set. The example below demonstrates this for a Gamma distribution. Here, the distribution is constructed, the shape parameter is queried, both shape and rate parameters are updated, and the rate is queried; finally, the scale parameter is set, which auto-updates the rate parameter.
> G <- Gamma$new(shape = 1, rate = 1)
> G$getParameterValue("shape")
[1] 1
> G$setParameterValue(shape = 1, rate = 2)
> G$getParameterValue("rate")
[1] 2
> G$setParameterValue(scale = 2)
> G$getParameterValue("rate")
[1] 0.5
Distribution and parameter domains and types are represented by mathematical sets implemented in set6 (Sonabend and Kiraly 2020). This allows for a clear representation of infinite sets and, most importantly, for internal containedness checks. For example, all public dpqr methods first call the $contains method in their respective type and return an error if any points are outside the distribution’s domain. As set6 uses Rcpp for this method, these come at minimal cost to speed.
> B <- Binomial$new()
> B$pdf(-1)
Error in B$pdf(-1) :
  Not all points in {-1} lie in the distribution domain (N0).
These domains and types are returned along with other important properties and traits in a call to properties and traits, respectively. This is demonstrated below for the Arcsine distribution.
> A <- Arcsine$new()
> A$properties
$support
[0,1]

$symmetry
[1] "symmetric"

> A$traits
$valueSupport
[1] "continuous"

$variateForm
[1] "univariate"

$type
R
Users of distr6 can create temporary custom distributions using the constructor of the Distribution class directly. Permanent extensions, e.g., as part of an R package, should create a new concrete distribution as a child of the SDistribution class.
The Distribution constructor is given by
Distribution$new(name = NULL, short_name = NULL, type = NULL, support = NULL,
+   symmetric = FALSE, pdf = NULL, cdf = NULL, quantile = NULL, rand = NULL,
+   parameters = NULL, decorators = NULL, valueSupport = NULL, variateForm = NULL,
+   description = NULL)
The name and short_name arguments are identification for the custom distribution used for printing. type is a trait corresponding to scientific type (e.g., Reals, Integers,...), and support is the property of the distribution support. Distribution parameters are passed as a ParameterSet object. This defines each parameter in the distribution, including the parameter default value and support. The pdf/cdf/quantile/rand arguments define the corresponding methods and are passed to the private .pdf/.cdf/.quantile/.rand methods. As above, the public methods are already defined and ensure consistency in each function. At a minimum, users have to supply the distribution name, type, and either pdf or cdf. All other information can be numerically estimated with decorators (see section 4.2).
> d <- Distribution$new(name = "Custom Distribution", type = Integers$new(),
+ support = Set$new(1:10),
+ pdf = function(x) rep(1/10, length(x)))
> d$pdf(1:3)
[1] 0.1 0.1 0.1
Decorators add functionality to classes in object-oriented programming.
These are not natively implemented in R6, and this novel implementation
is therefore discussed further in section
5.3. Decorators in distr6 are only
‘allowed’ if they have at least three methods and cover a clear use
case. This prevents too many decorators from bloating the interface.
However, by their nature, they are lightweight classes that will only
increase the methods in a distribution if explicitly requested by a
user. Decorators can be applied to a distribution in one of three ways.
In construction:
> N <- Normal$new(decorators = c("CoreStatistics", "ExoticStatistics"))
Using the decorate() function:
> N <- Normal$new()
> decorate(N, c("CoreStatistics", "ExoticStatistics"))
Using the $decorate method inherited from the DistributionDecorator super-class:
> N <- Normal$new()
> ExoticStatistics$new()$decorate(N)
The first option is the quickest if decorators are required immediately.
The second is the most efficient once a distribution is already
constructed. The third is the closest method to true OOP but does not
allow adding multiple decorators simultaneously.
Three decorators are currently implemented in distr6. These are
briefly described.
The CoreStatistics decorator imputes numerical functions for common statistical results that could be considered core to a distribution, e.g., the mean or variance. The decorator additionally adds generalized expectation (genExp) and moments (kthmoment) functions, which allow numerical results for functions of the form \(\mathbb{E}[f(X)]\) and for crude/raw/central \(K\) moments. The example below demonstrates how the decorate function exposes methods from the CoreStatistics decorator to the Normal distribution object.
> n <- Normal$new(mean = 2, var = 4)
> n$kthmoment(3, type = "raw")
Error: attempt to apply non-function
> decorate(n, CoreStatistics)
> n$kthmoment(3, type = "raw")
[1] 32
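The value 32 can be checked with a short base R sketch of numerical expectation. The helpers `gen_exp` and `kth_moment` are hypothetical stand-ins written for this illustration; they are not the distr6 methods and make no claim about how the package computes its results.

```r
# E[f(X)] for a continuous distribution with density `pdf`, by numerical integration.
# `gen_exp` and `kth_moment` are illustrative names, not distr6 functions.
gen_exp <- function(pdf, f = function(x) x, lower = -Inf, upper = Inf) {
  integrate(function(x) f(x) * pdf(x), lower, upper)$value
}

kth_moment <- function(pdf, k, type = c("raw", "central")) {
  type <- match.arg(type)
  centre <- if (type == "central") gen_exp(pdf) else 0
  gen_exp(pdf, function(x) (x - centre)^k)
}

# Normal(mean = 2, var = 4), i.e. standard deviation 2
pdf_n <- function(x) dnorm(x, mean = 2, sd = 2)
raw3 <- kth_moment(pdf_n, 3, "raw")      # analytic: mu^3 + 3 * mu * sigma^2 = 32
var2 <- kth_moment(pdf_n, 2, "central")  # analytic: sigma^2 = 4
```

The raw third moment agrees with the analytic expression \(\mu^3 + 3\mu\sigma^2\) up to integration error.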
The ExoticStatistics decorator adds more ‘exotic’ methods to distributions, i.e., those that are unlikely to be called by the majority of users. For example, this includes methods for the p-norm of survival and cdf functions, as well as anti-derivatives for these functions. Where possible, analytic results are exploited. For example, this decorator can implement the survival function in one of two ways: either as i) \(1\) minus the distribution cdf if an analytic expression for the cdf is available, or ii) via numerical integration of the distribution.
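The two routes to a survival function can be compared directly in base R. This is only a sketch for a standard Normal, assuming `pnorm` plays the role of the analytic cdf and `integrate` the role of the numerical fallback; the function names are chosen for this illustration.

```r
# Survival function S(x) = P(X > x) for a standard Normal, computed two ways.
surv_analytic <- function(x) 1 - pnorm(x)   # i) one minus an analytic cdf
surv_numeric <- function(x) {               # ii) integrate the density over (x, Inf)
  vapply(x, function(xi) integrate(dnorm, lower = xi, upper = Inf)$value, numeric(1))
}

x <- c(-1, 0, 1.5)
max_diff <- max(abs(surv_analytic(x) - surv_numeric(x)))
```

The two results differ only by the error of the numerical integration, which is why the analytic route is preferable whenever it is available.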
The FunctionImputation decorator imputes numerical expressions for the dpqr methods. This is most useful for custom distributions in which only the pdf or cdf is provided. Numerical imputation is implemented via Rcpp.
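As a rough illustration of what imputing the remaining dpqr functions from a pdf involves, the base R sketch below derives a cdf, quantile, and sampling function for a discrete uniform distribution on 1 to 10. It is written in plain R rather than Rcpp, and the names `pdf_u`, `cdf_u`, `quantile_u`, and `rand_u` are hypothetical.

```r
support <- 1:10
pdf_u <- function(x) ifelse(x %in% support, 1 / 10, 0)

# cdf: sum the pdf over all support points not exceeding x
cdf_u <- function(x) {
  vapply(x, function(xi) as.numeric(sum(pdf_u(support[support <= xi]))), numeric(1))
}

# quantile: smallest support point whose cdf reaches p (small tolerance for rounding)
quantile_u <- function(p) {
  cum <- cdf_u(support)
  vapply(p, function(pi) as.numeric(support[which(cum >= pi - 1e-12)[1]]), numeric(1))
}

# rand: inverse-transform sampling through the imputed quantile function
rand_u <- function(n) quantile_u(runif(n))
```

Each function is built only from the one before it, which mirrors the idea that a single supplied pdf is enough to recover the rest.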
Composite distributions - that is, distributions created from other distributions - are common in advanced usage. Examples for composites are truncation, mixture, or transformation of domain. In distr6, a number of such composites are supported. Implementation-wise, this uses the wrapper OOP pattern, which is not native to R6 but part of our extensions to R6, discussed in section 5.3.
As discussed above, wrapped distributions inherit from Distribution and thus have an identical interface to any child of SDistribution, with the following minor differences: 1) the wrappedModels method provides a unified interface to access any component distribution; 2) parameters are held in a ParameterSetCollection object instead of a ParameterSet, thus allowing efficient representation of composite and nested parameter sets.
The composition can be iterated and nested any number of times. Consider the following example, where a mixture distribution is created from two distributions that are in turn composites (a truncated Student T and a Huberized exponential). Note the parameter inspection and the automatic prefixing of distribution ‘short names’ to the parameters for identification.
> M <- MixtureDistribution$new(list(
+ truncate(StudentT$new(), lower = -1, upper = 1),
+ huberize(Exponential$new(), upper = 4)
+ ))
> M$parameters()
                id   value            support                  description
1:        mix_T_df       1                 R+           Degrees of Freedom
2: mix_trunc_lower      -1   R U {-Inf, +Inf}    Lower limit of truncation
3: mix_trunc_upper       1   R U {-Inf, +Inf}    Upper limit of truncation
4:    mix_Exp_rate       1                 R+                 Arrival Rate
5:   mix_Exp_scale       1                 R+                        Scale
6:   mix_hub_lower       0   R U {-Inf, +Inf}  Lower limit of huberization
7:   mix_hub_upper       4   R U {-Inf, +Inf}  Upper limit of huberization
8:     mix_weights uniform  {uniform} U [0,1]              Mixture weights
We summarize some important implemented compositors (tables 2 and 3) to illustrate the way composition is handled and implemented.
Class | Parameters | Type of \(d\) | Components of \(d\) |
---|---|---|---|
TruncatedDistribution | \(a,b\in \mathbb{R}\) | \(\mathbb{R}\) | \(d'\), type \(\mathbb{R}\) |
HuberizedDistribution | \(a,b\in \mathbb{R}\) | \(\mathbb{R},\) mixed | \(d'\), type \(\mathbb{R}\) |
MixtureDistribution | \(w_i \in \mathbb{R}, \sum^n_{i=1} w_i = 1\) | \(\mathbb{R}^n\) | \(d'_i\), type \(\mathbb{R}^n\) |
ProductDistribution | - | \(\mathbb{R}^N, N = \sum_{i=1}^n n_i\) | \(d'_i\), type \(\mathbb{R}^{n_i}\) |
Class | \(d.F(x)\) | \(d.f(x)\) |
---|---|---|
TruncatedDistribution | \(\frac{d'.F(x) - d'.F(a)}{d'.F(b) - d'.F(a)}\) | \(\frac{d'.f(x)}{d'.P([a,b])}\) |
HuberizedDistribution | \(d'.F(x) + {\mathbf{\mathbb{1}}}[x = b]\cdot d'.P(b)\) | (no pdf, since mixed) |
MixtureDistribution | \(\sum_{i=1}^N w_i\cdot d'_i.F(x)\) | \(\sum_{i=1}^N w_i\cdot d'_i.f(x)\) (if exists) |
ProductDistribution | \(\prod_{i=1}^N d'_i.F(x)\) | \(\prod_{i=1}^N d'_i.f(x)\) (if exists) |
Example code to obtain a truncated or Huberized distribution is below. Here, we construct a truncated normal with truncation parameters -1 and 1, and a Huberized Binomial with bounding parameters 2 and 5.
> TN <- truncate(Normal$new(), lower = -1, upper = 1)
> TN$cdf(-2:2)
[1] 0.0 0.0 0.5 1.0 1.0
> class(TN)
[1] "TruncatedDistribution" "DistributionWrapper" "Distribution" "R6"

> HB <- huberize(Binomial$new(), lower = 2, upper = 5)
> HB$cdf(1:6)
[1] 0.0000000 0.0546875 0.1718750 0.3769531 1.0000000 1.0000000
> HB$median()
[1] 5
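The printed cdf values can be reproduced in base R from the parent cdfs alone. The sketch below is an illustration, not the distr6 implementation: `truncated_cdf` and `huberized_cdf` are hypothetical helpers, and the Binomial is assumed to have size 10 and probability 0.5, which is what the printed value 0.0546875 at \(x = 2\) implies.

```r
# Truncation to (a, b]: rescale the parent cdf between the two limits
truncated_cdf <- function(cdf, a, b) {
  function(x) {
    out <- (cdf(x) - cdf(a)) / (cdf(b) - cdf(a))
    pmin(pmax(out, 0), 1)   # 0 below a, 1 above b
  }
}

# Huberization: mass below a is moved onto a, mass above b is moved onto b
huberized_cdf <- function(cdf, a, b) {
  function(x) ifelse(x < a, 0, ifelse(x >= b, 1, cdf(x)))
}

tn <- truncated_cdf(pnorm, -1, 1)
tn_values <- tn(-2:2)

hb <- huberized_cdf(function(x) pbinom(x, size = 10, prob = 0.5), 2, 5)
hb_values <- hb(1:6)
```

Truncation renormalizes the distribution inside the limits, whereas Huberization leaves the interior untouched and places point masses at the limits, which is why the Huberized cdf jumps to 1 at the upper bound.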
A special feature of distr6 is that it allows vectorization of distributions, i.e., vectorized representation of multiple distributions in an array-like structure. This is primarily done for computational efficiency, in line with the general best R practice of vectorization. Vectorization of distr6 distributions is implemented via the VectorDistribution, which is logically treated as a compositor. Mathematically, a VectorDistribution is simply a vector of component distributions \(d_1,\dots, d_N\) that allows vectorized evaluation. Two kinds of vectorized evaluation are supported: paired and product-wise vectorization, which we illustrate in the case of cdfs. Both are available through the cdf method.
In practical terms, paired evaluation is the evaluation of \(N\) distributions at \(N\) points (which may coincide or differ). So, by example, for three distributions \(d_1,d_2,d_3\), paired evaluation of their cdfs at \((x_1, x_2, x_3) = (4,5,6)\) respectively results in \((d_1.F(x_1), d_2.F(x_2), d_3.F(x_3)) = (d_1.F(4), d_2.F(5), d_3.F(6))\). In distr6:
> V <- VectorDistribution$new(distribution = "Normal", params = data.frame(mean = 1:3))
> V$cdf(4, 5, 6)
       Norm1     Norm2     Norm3
1: 0.9986501 0.9986501 0.9986501
In contrast, product-wise evaluation evaluates \(N\) distributions at the same \(M\) points. Product-wise evaluation of the cdfs of \(d_1,d_2,d_3\) at \((x_1, x_2, x_3) = (4,5,6)\) results in \[\begin{pmatrix} d_1.F(x_1) & d_1.F(x_2) & d_1.F(x_3) \\ d_2.F(x_1) & d_2.F(x_2) & d_2.F(x_3) \\ d_3.F(x_1) & d_3.F(x_2) & d_3.F(x_3) \end{pmatrix} = \begin{pmatrix} d_1.F(4) & d_1.F(5) & d_1.F(6) \\ d_2.F(4) & d_2.F(5) & d_2.F(6) \\ d_3.F(4) & d_3.F(5) & d_3.F(6) \end{pmatrix}\]
In distr6:
> V <- VectorDistribution$new(distribution = "Normal", params = data.frame(mean = 1:3))
> V$cdf(4:6)
       Norm1     Norm2     Norm3
1: 0.9986501 0.9772499 0.8413447
2: 0.9999683 0.9986501 0.9772499
3: 0.9999997 0.9999683 0.9986501
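The difference between the two modes can be mimicked with base R vectorization. This sketch assumes unit-variance Normal distributions with means 1, 2, and 3, matching the example; it illustrates the evaluation pattern only and does not describe how VectorDistribution dispatches internally.

```r
means <- 1:3
x <- c(4, 5, 6)

# Paired: the i-th distribution is evaluated at the i-th point only
paired <- pnorm(x, mean = means)

# Product-wise: every distribution is evaluated at every point.
# Rows index points and columns index distributions, as in the printed output.
product <- outer(x, means, function(xj, mi) pnorm(xj, mean = mi))
dimnames(product) <- list(NULL, paste0("Norm", means))
```

The paired result is the diagonal of the product-wise table, so the paired mode avoids computing the off-diagonal entries when they are not needed.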
The VectorDistribution wrapper allows for efficient vectorization across both the distributions and points to evaluate, which we believe is a feature unique to distr6 among distribution frameworks in R. By combining product and paired modes, users can evaluate any distribution in the vector at any point. In the following example, Normal(1, 1) is evaluated at (1,2), and Normal(2, 1) is evaluated at (3,4):
> V <- VectorDistribution$new(distribution = "Normal", params = data.frame(mean = 1:2))
> V$pdf(1:2, 3:4)
       Norm1      Norm2
1: 0.3989423 0.24197072
2: 0.2419707 0.05399097
Further, common composites such as ProductDistribution and MixtureDistribution inherit from VectorDistribution, allowing for efficient vector dispatch of pdf and cdf methods. Inheriting from VectorDistribution results in identical constructors and methods. Thus, a minor caveat is that users could evaluate a product or mixture at different points for each distribution, which is not a usual use case in practice.
Two different choices of constructors are provided. The first, ‘distlist’, constructor passes distribution objects into the constructor, whereas the second passes a reference to the distribution class along with the parameterizations. Therefore, the first allows different types of distributions but is considerably slower, as the various methods have to be calculated individually, whereas the second only allows a single class of distribution at a time but is much quicker in evaluation. In the example below, the mixture uses the second constructor, and the product uses the first.
> M <- MixtureDistribution$new(distribution = "Degenerate",
+ params = data.frame(mean = 1:10))
> M$cdf(1:5)
[1] 0.1 0.2 0.3 0.4 0.5
> class(M)
[1] "MixtureDistribution" "VectorDistribution" "DistributionWrapper" "Distribution"
[5] "R6"

> P <- ProductDistribution$new(list(Normal$new(), Exponential$new(), Gamma$new()))
> P$cdf(1:5)
[1] 0.3361815 0.7306360 0.9016858 0.9636737 0.9865692
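Both printed results follow from the cdf formulas for mixtures and products, and can be checked in base R. The sketch assumes uniform mixture weights and that the default Normal, Exponential, and Gamma distributions are Normal(0, 1), Exponential(1), and Gamma(shape = 1, rate = 1); these defaults are inferred from the printed values rather than taken from the package documentation.

```r
# Uniform mixture of Degenerate(1), ..., Degenerate(10): average of the step-function cdfs
degenerate_cdf <- function(x, at) as.numeric(x >= at)
mix_cdf <- function(x) {
  vapply(x, function(xi) mean(degenerate_cdf(xi, 1:10)), numeric(1))
}

# Product of three independent components, all evaluated at the same point
prod_cdf <- function(x) pnorm(x) * pexp(x, rate = 1) * pgamma(x, shape = 1, rate = 1)

mix_values <- mix_cdf(1:5)
prod_values <- prod_cdf(1:5)
```

Evaluating every component at the same point is the usual use of a product or mixture, which is the case the single-argument call to `$cdf` covers.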
This paper has so far discussed the API and functionality in distr6. This section discusses object-oriented programming (OOP). It first gives a brief introduction to OOP and to OOP in R, and then describes the package’s contributions to the field.
R has four major paradigms for object-oriented programming: S3, S4, reference classes (R5), and most recently, R6. S3 and S4 are known as functional object-oriented programming (FOOP) paradigms, whereas R5 and R6 move towards class object-oriented programming (COOP) paradigms (R6) (Chambers 2014). One of the main differences (from a user perspective) is that methods in COOP are associated with a class, whereas in FOOP, methods are associated with generic functions. In the first case, methods are called by first specifying the object, and in the second, a dispatch registry is utilized to find the correct method to associate with a given object.
S3 introduces objects as named structures, which in other languages are often referred to as ‘typed lists’. These can hold objects of any type and can include meta-information about the object itself. S3 is the dominant paradigm in R for its flexibility, speed, and efficiency. As such, it is embedded deep in the infrastructure of R, and single dispatch is behind a vast majority of the base functionality, which is a key part of making R easily readable. S3 is a FOOP paradigm in which functions are part of a dispatch system and consist of a generic function that is external to any object and a specific method registered to a ‘class’. However, the term ‘class’ is slightly misleading as no formal class structure exists (and by consequence, no formal construction or inheritance) and as such, S3 is not a formal OOP language.
S4 formalizes S3 by introducing class-object separation, a clear notion of construction, and multiple inheritance (Chambers 2014). S4 has more syntax for the user to learn and a few more steps in class and method definitions. As a result, S4 syntax is not overly user-friendly, and S3 is used vastly more than S4 (Chambers 2014).
There is a big jump from S3 and S4 to R6, as it marks the transition from functional to class object-oriented programming. This means new notation, semantics, syntax, and conventions. The key changes are: 1) introducing methods and fields that are associated with classes, not functions; 2) mutable objects with reference semantics in place of copy-on-modify; and 3) new dollar-sign notation. In the first case, this means that when a class is defined, all the methods are defined as existing within the class, and these can be accessed at any time after construction. Methods are further split into public and private, as well as active bindings, which incorporate the abstraction part of OOP. The mutability of objects and the change away from copy-on-modify mean that to create an independent copy of an object, the new method clone(deep = TRUE) has to be used, which will be familiar to users who know more classical OOP but is very different from what most R users are used to. Finally, methods are accessed via the dollar-sign, and not by calling a function on an object.
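The consequence of mutability for copying is easiest to see in a small example. The `Counter` class below is a hypothetical toy class written for this illustration; it uses only the R6 package.

```r
library(R6)

Counter <- R6Class("Counter", public = list(
  n = 0,
  add = function() {
    self$n <- self$n + 1
    invisible(self)
  }
))

a <- Counter$new()
b <- a                      # same object: assignment does not copy
d <- a$clone(deep = TRUE)   # independent copy

a$add()
counts <- c(a = a$n, b = b$n, d = d$n)   # a and b see the change, d does not
```

With an ordinary R list, `b` would have kept its original value after the modification of `a`; with R6 only the explicit clone does.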
Below, the three paradigms are contrasted semantically with a toy example to create a ‘duck’ class with a method ‘quack’.
S3
> quack <- function(x) UseMethod("quack", x)
> duck <- function(name) return(structure(list(name = name), class = "duck"))
> quack.duck <- function(x) cat(x$name, "QUACK!")
> quack(duck("Arthur"))
Arthur QUACK!
S4
> setClass("duck", slots = c(name = "character"))
> setGeneric("quack", function(x) {
+ standardGeneric("quack")
+ })
> setGeneric("duck", function(name) {
+ standardGeneric("duck")
+ })
> setMethod("duck", signature(name = "character"),
+ definition = function(name){
+ new("duck", name = name)
+ })
> setMethod("quack", signature(x = "duck"),
+ definition = function(x) {
+ cat(x@name, "QUACK!")
+ })
> quack(duck("Ford"))
Ford QUACK!
R6
> duck <- R6::R6Class("duck", public = list(
+ initialize = function(name) private$.name = name,
+ quack = function() cat(private$.name, "QUACK!")),
+ private = list(.name = character(0)))
> duck$new("Zaphod")$quack()
Zaphod QUACK!
The example clearly highlights the extra code introduced by S4 and the difference between the S3 dispatch and R6 method system.
There is no doubt that R6 is the furthest paradigm from conventional R usage, and as such, there is a steep learning curve for the majority of R users. However, R6 will be most natural for users coming to R from more traditional OOP languages. In contrast, S3 is a natural FOOP paradigm that will be familiar to all R users (even if they are not aware that S3 is being used). S4 is an unfortunate midpoint between the two, which whilst being very useful, is not particularly user-friendly in terms of programming classes and objects.
distr was developed soon after S4 was released and is arguably one of the best case studies for how well S4 performs. Whilst S4 formalizes S3 to allow for a fully OO interface to be developed, its dependence on inheritance forces design decisions that quickly become problematic. This is seen in the large inheritance trees in distr in which one implemented distribution can be nested five child classes deep. This is compounded by the fact that S4 does not use pointer objects but instead nests objects internally. Therefore, distr has problems with composite distributions in that they quickly become very large in size. For example, a mixture of two distributions can easily be around 0.5Mb, which is relatively large. In contrast, R6 introduces pointers, which means that a wrapped object simply points to its wrapped component and does not copy it needlessly.
Whilst a fully object-oriented interface can be developed in S3 and S4, they do not have the flexibility of R6, which means that in the long run, extensibility and scalability can be problematic. R6 forces R users to learn a paradigm that they may not be familiar with, but packages like R62S3 allow users to become acquainted with R6 on a slightly shallower learning curve. Speed differences for the three paradigms are formally compared on the example above using microbenchmark (Mersmann 2019); the results are in table 4. The R6 example is compared both including the construction of the class, duck$new("Zaphod")$quack() (R6 in table 4), and without construction, d$quack() (R6* in table 4), where d is the object constructed before comparison. A significant ‘bottleneck’ is noted when construction is included in the comparison. However, despite this, S4 is still significantly the slowest.
Paradigm | mean \((\mu s)\) | cld |
---|---|---|
S3 | 73.44 | a |
S4 | 276.17 | c |
R6 | 187.70 | b |
R6* | 38.32 | a |
In the simplest definition, ‘design patterns’ are abstract solutions to common coding problems. They are probably most widely known due to the book ‘Design Patterns: Elements of Reusable Object-Oriented Software’ (Design Patterns) (Gamma et al. 1996). distr6 primarily makes use of the following design patterns.
The strategy pattern is common in modeling toolboxes in which multiple algorithms can be used to solve a problem. This pattern defines an abstract class for a given problem and concrete classes that each implement different strategies, or algorithms, to solve the problem. For example, in the context of mathematical integration (a common problem in R), one could use Simpson’s rule, Kronrod’s, or many others. These can be specified by an integrate abstract class with concrete sub-classes simpson and kronrod (figure 4).
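A minimal R6 sketch of the strategy pattern for integration follows. The class names are hypothetical, and the trapezoidal rule is used as the second strategy in place of Kronrod's method purely to keep the example short.

```r
library(R6)

# Abstract strategy: fixes the interface but provides no algorithm
Integrator <- R6Class("Integrator", public = list(
  integrate = function(f, lower, upper, n = 100) stop("abstract method")
))

Trapezoid <- R6Class("Trapezoid", inherit = Integrator, public = list(
  integrate = function(f, lower, upper, n = 100) {
    x <- seq(lower, upper, length.out = n + 1)
    h <- (upper - lower) / n
    h * (sum(f(x)) - (f(lower) + f(upper)) / 2)
  }
))

Simpson <- R6Class("Simpson", inherit = Integrator, public = list(
  integrate = function(f, lower, upper, n = 100) {
    # composite Simpson's rule; n must be even
    x <- seq(lower, upper, length.out = n + 1)
    h <- (upper - lower) / n
    w <- c(1, rep(c(4, 2), length.out = n - 1), 1)
    h / 3 * sum(w * f(x))
  }
))

# Client code depends only on the abstract interface
normal_prob <- function(strategy, a, b) strategy$integrate(dnorm, a, b)

p_simpson <- normal_prob(Simpson$new(), -1, 1)
p_trapezoid <- normal_prob(Trapezoid$new(), -1, 1)
```

Because `normal_prob` only calls `$integrate`, a new algorithm can be added as another sub-class without touching the client code.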
The composite pattern defines a collection of classes with an identical interface when treated independently or when composed into a single class with constituent parts. To the user, this means that only one interface needs to be learned in order to interact with composite or individual classes. A well-built composite pattern allows users to construct complex classes with several layers of composition and yet still be able to make use of a single interface. By inheriting from a parent class, each class and composite share a common interface. Composition is a powerful design principle that allows both modification of existing classes and reduction of multiple classes (Király et al. 2021).
Decorators add additional responsibilities to an object without making any other changes to the interface. An object that has been decorated will be identical to its un-decorated counterpart except with additional methods. This provides a useful alternative to inheritance. Whereas inheritance can lead to large tree structures in which each sub-class inherits from the previous and contains all previous methods, decorators allow the user to pick and choose which responsibilities to add. Figure 5 demonstrates how this is useful in a shopping cart example. The top of the figure demonstrates using inheritance in which each sub-class adds methods to the Cart parent class. By the Tax child class, there are a total of five methods in the interface. At the bottom of the figure, the decorator pattern demonstrates how the functionality for adding items and tax is separated and can be added separately.
In order to implement distr6, several contributions were made to the R6 paradigm to extend its abilities and to implement the design patterns discussed above.
R6 did not have a concept of abstract classes, which meant that patterns such as adapters, composites, and decorators could not be directly implemented without problems. This is provided in distr6 with the abstract function, which is placed in the first line of all abstract classes. In the example below, obj expects the self argument from R6 classes, class is the name of the class, and getR6Class is a custom function for returning the name of the class of the given object.
abstract <- function(obj, class) {
  if (getR6Class(obj) == class) {
    stop(sprintf("%s is an abstract class that can't be initialized.", class))
  }
}
For example, in decorators, the following line is placed at the top of the initialize function:
abstract(self, "DistributionDecorator")
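The following self-contained sketch shows the effect of such a guard on a toy class hierarchy. `get_class` is a hypothetical stand-in for getR6Class that simply returns the first element of the class vector, and `Shape` and `Circle` are illustrative classes that do not exist in distr6.

```r
library(R6)

# Hypothetical stand-in for getR6Class(): the most specific class name of the object
get_class <- function(obj) class(obj)[1]

abstract <- function(obj, class) {
  if (get_class(obj) == class) {
    stop(sprintf("%s is an abstract class that can't be initialized.", class))
  }
}

Shape <- R6Class("Shape", public = list(
  initialize = function() abstract(self, "Shape")
))

Circle <- R6Class("Circle", inherit = Shape, public = list(
  r = NULL,
  initialize = function(r = 1) {
    super$initialize()   # passes: the object's class is "Circle", not "Shape"
    self$r <- r
  },
  area = function() pi * self$r^2
))

circle <- Circle$new(2)
shape_error <- tryCatch(Shape$new(), error = function(e) conditionMessage(e))
```

Constructing the concrete sub-class succeeds, whereas constructing the abstract class directly fails with the guard's error message.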
The typical implementation of decorators is to have an abstract decorator class with concrete decorators inheriting from this, each with their own added responsibilities. In distr6, this is made possible by defining the DistributionDecorator abstract class (see above) with a public decorate method. Concrete decorators are simply R6 classes where the public methods are the ones to ‘copy’ to the decorated object.
> DistributionDecorator
<DistributionDecorator> object generator
  Public:
    packages: NULL
    initialize: function ()
    decorate: function (distribution, ...)
    clone: function (deep = FALSE)

> CoreStatistics
<CoreStatistics> object generator
  Inherits from: <DistributionDecorator>
  Public:
    mgf: function (t)
    cf: function (t)
    pgf: function (z)
When the $decorate method from a constructed decorator object is called, the methods are simply copied from the decorator environment to the object environment. The decorate() function simplifies this for the user.
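The idea of copying methods into an existing object can be sketched with a toy R6 class. This is an illustration under simplifying assumptions, not the distr6 mechanism: the decorator is represented as a plain list of functions, `decorate_sketch` is a hypothetical helper, and the class is created with `lock_objects = FALSE` so that members can be added after construction.

```r
library(R6)

# A minimal undecorated object; lock_objects = FALSE permits adding members later
Dist <- R6Class("Dist", lock_objects = FALSE, public = list(
  pdf = function(x) dnorm(x)
))

# The 'decorator' is a named list of functions written in terms of `self`
core_statistics <- list(
  mean = function() integrate(function(x) x * self$pdf(x), -Inf, Inf)$value,
  variance = function() {
    m <- self$mean()
    integrate(function(x) (x - m)^2 * self$pdf(x), -Inf, Inf)$value
  }
)

decorate_sketch <- function(obj, methods) {
  enclos <- new.env(parent = globalenv())   # gives the copied functions access to `self`
  enclos$self <- obj
  for (name in names(methods)) {
    f <- methods[[name]]
    environment(f) <- enclos
    assign(name, f, envir = obj)            # copy the method into the object
  }
  invisible(obj)
}

d <- Dist$new()
had_mean_before <- exists("mean", envir = d, inherits = FALSE)
decorate_sketch(d, core_statistics)
v <- d$variance()
```

The object keeps its class and existing methods; it only gains the members that were explicitly requested.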
The composite pattern is made use of in what distr6 calls ‘wrappers’. Again, this is implemented via an abstract class (DistributionWrapper) with concrete sub-classes.
> DistributionWrapper
<DistributionWrapper> object generator
  Inherits from: <Distribution>
  Public:
    initialize: function (distlist = NULL, name, short_name, description, support,
    wrappedModels: function (model = NULL)
    setParameterValue: function (..., lst = NULL, error = "warn")
  Private:
    .wrappedModels: list

> TruncatedDistribution
<TruncatedDistribution> object generator
  Inherits from: <DistributionWrapper>
  Public:
    initialize: function (distribution, lower = NULL, upper = NULL)
    setParameterValue: function (..., lst = NULL, error = "warn")
  Private:
    .pdf: function (x, log = FALSE)
    .cdf: function (x, lower.tail = TRUE, log.p = FALSE)
    .quantile: function (p, lower.tail = TRUE, log.p = FALSE)
    .rand: function (n)
Wrappers in distr6 alter objects by modifying either their public or private methods. Therefore, an ‘unwrapped’ distribution looks identical to a ‘wrapped’ one, despite inheriting from different classes. This is possible via two key implementation strategies: 1) on the construction of a wrapper, parameters are prefixed with a unique ID, meaning that all parameters can be accessed at any time; 2) the wrappedModels public method allows access to the original wrapped distributions. These two factors allow any new method to be called either by reference to wrappedModels or by using $getParameterValue with the newly prefixed parameter ID. This is demonstrated in the .pdf private method of the TruncatedDistribution wrapper (slightly abridged).
.pdf = function(x, log = FALSE) {
  dist <- self$wrappedModels()[[1]]
  lower <- self$getParameterValue("trunc_lower")
  upper <- self$getParameterValue("trunc_upper")

  pdf <- numeric(length(x))
  pdf[x > lower & x <= upper] <- dist$pdf(x[x > lower & x <= upper]) /
    (dist$cdf(upper) - dist$cdf(lower))

  return(pdf)
}
As the public pdf is the same for all distributions, and this is inherited by wrappers, only the private .pdf method needs to be altered.
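The split between a shared public method and an overridden private one can be sketched with toy R6 classes. `Dist`, `StdNormal`, and `Truncated` are hypothetical classes for this illustration, and the normalizing constant is found by numerical integration here rather than through a cdf method.

```r
library(R6)

Dist <- R6Class("Dist",
  public = list(
    # shared public interface: handles the `log` argument for every sub-class
    pdf = function(x, log = FALSE) {
      out <- private$.pdf(x)
      if (log) log(out) else out
    }
  ),
  private = list(.pdf = function(x) stop("abstract"))
)

StdNormal <- R6Class("StdNormal", inherit = Dist,
  private = list(.pdf = function(x) dnorm(x))
)

Truncated <- R6Class("Truncated", inherit = Dist,
  public = list(
    initialize = function(dist, lower, upper) {
      private$.dist <- dist
      private$.lower <- lower
      private$.upper <- upper
      # normalizing constant P(lower < X <= upper), computed numerically in this sketch
      private$.mass <- integrate(dist$pdf, lower, upper)$value
    }
  ),
  private = list(
    .dist = NULL, .lower = NULL, .upper = NULL, .mass = NULL,
    # only the private method changes; the public pdf is inherited unchanged
    .pdf = function(x) {
      inside <- x > private$.lower & x <= private$.upper
      out <- numeric(length(x))
      out[inside] <- private$.dist$pdf(x[inside]) / private$.mass
      out
    }
  )
)

tn <- Truncated$new(StdNormal$new(), -1, 1)
tn_pdf <- tn$pdf(c(-2, 0, 2))
```

Callers use `tn$pdf()` exactly as they would for the unwrapped distribution, including the `log` argument, although `Truncated` never defines a public pdf of its own.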
This final section looks at concrete short examples for four key use cases.
The primary use case for the majority of users will be in constructing
distributions in order to query their results and visualize their
shape.
Below, a distribution (Binomial) is constructed and queried for its
distribution-specific traits and parameterization-specific properties.
> b <- Binomial$new(prob = 0.1, size = 5)
> b$setParameterValue(size = 6)
> b$getParameterValue("size")
> b$parameters()
> b$properties
> b$traits
Specific methods from the distribution are queried as well.
> b$mean()
> b$entropy()
> b$skewness()
> b$kurtosis()
> b$cdf(1:5)
The distribution is visualized by plotting its density, distribution, inverse distribution, hazard, cumulative hazard, and survival function; the output is in figure 6.
> plot(b, fun = "all")
distr6 can also serve as a toolbox for analysis of empirical data by making use of the three ‘empirical’ distributions: Empirical, EmpiricalMV, and WeightedDiscrete.
First, an empirical distribution is constructed with samples from a
standard exponential distribution.
> E <- Empirical$new(samples = rexp(10000))
The summary function is used to quickly obtain key information about the empirical distribution.
> summary(E)
Empirical Probability Distribution.
Quick Statistics
  Mean:          0.105954
  Variance:      1.140673
  Skewness:      0.05808027
  Ex. Kurtosis:  -0.473978
Support: (-2.50, -2.19,...,2.27, 2.66)   Scientific Type: R

Traits:      discrete; univariate
Properties:  asymmetric; platykurtic; positive skew
The distribution is compared to a (standard) Normal distribution and then (standard) Exponential distribution; output in figure 7.
> qqplot(E, Normal$new(), xlab = "Empirical", ylab = "Normal")
> qqplot(E, Exponential$new(), xlab = "Empirical", ylab = "Exponential")
The CDF of a bivariate empirical distribution is visualized; output in figure 8.
> plot(EmpiricalMV$new(data.frame(rnorm(100, mean = 3), rnorm(100))), fun = "cdf")
Whilst empirical distributions are useful when data samples have been generated, custom distributions can be used to build an entirely new probability distribution. For simplicity, a discrete uniform distribution is used here. This example highlights the power of decorators to estimate distribution results without manual computation of every possible method. The output demonstrates the precision and accuracy of these results.
Below, a custom distribution is created, and by including the decorators argument, all further methods are imputed numerically. The distribution is summarized for properties, traits, and common results (this is possible with the ‘CoreStatistics’ decorator). The summary is identical to the analytic DiscreteUniform distribution.
> U <- Distribution$new(
+ name = "Discrete Uniform",
+ type = set6::Integers$new(), support = set6::Set$new(1:10),
+ pdf = function(x) ifelse(x < 1 | x > 10, 0, rep(1/10,length(x))),
+ decorators = c("CoreStatistics", "ExoticStatistics", "FunctionImputation"))
> summary(U)
Discrete Uniform
Quick Statistics
  Mean:          5.5
  Variance:      8.25
  Skewness:      0
  Ex. Kurtosis:  -1.224242
Support: {1, 2,...,9, 10}   Scientific Type: Z

Traits:      discrete; univariate
Properties:  asymmetric; platykurtic; no skew

Decorated with: CoreStatistics, ExoticStatistics, FunctionImputation
The CDF and simulation function are called (numerically imputed with the FunctionImputation decorator), the hazard function from the ExoticStatistics decorator, and the kthmoment function from the CoreStatistics decorator.
> U$cdf(1:10)
[1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> U$rand(10)
[1] 8 10 5 8 5 10 6 7 1 4
> U$hazard(2)
[1] 0.125
> U$kthmoment(2)
[1] 8.25
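The hazard and moment values can be verified by hand from the probability mass function. The base R sketch below assumes the hazard is defined as \(h(x) = f(x) / (1 - F(x))\) and that kthmoment(2) returns the second central moment, which is what the printed results imply; the helper names are hypothetical.

```r
support <- 1:10
pmf <- rep(1 / 10, 10)

cdf_at <- function(x) sum(pmf[support <= x])
hazard_at <- function(x) pmf[support == x] / (1 - cdf_at(x))   # h(x) = f(x) / S(x)

mu <- sum(support * pmf)
central2 <- sum((support - mu)^2 * pmf)
excess_kurtosis <- sum((support - mu)^4 * pmf) / central2^2 - 3

h2 <- hazard_at(2)
```

These agree with the decorated distribution: a hazard of 0.125 at \(x = 2\), a second central moment of 8.25, and an excess kurtosis of about -1.224242.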
Composite distributions are an essential part of any distribution
software. The following example demonstrates two types of composites:
composition via distribution transformation (truncation) and composition
via mixtures and vectors.
First, a Binomial distribution is constructed and truncated between \(1\) and \(5\); the CDF of the new distribution is queried.
> TB <- truncate(
+ Binomial$new(size = 20, prob = 0.5),
+ lower = 1,
+ upper = 5
+ )
> round(TB$cdf(0:6), 4)
[1] 0.0000 0.0000 0.0088 0.0613 0.2848 1.0000 1.0000
Next, a vector distribution is constructed of two Normal distributions, with respective means \(1\) and \(2\) and unit standard deviation. The parameters are queried (some columns suppressed).
> V <- VectorDistribution$new(distribution = "Normal",
+ params = data.frame(mean = 1:2))
> V$parameters()
           id value support
1: Norm1_mean     1       R
2:  Norm1_var     1      R+
3:   Norm1_sd     1      R+
4: Norm1_prec     1      R+
5: Norm2_mean     2       R
6:  Norm2_var     1      R+
7:   Norm2_sd     1      R+
8: Norm2_prec     1      R+
Vectorization is possible across distributions, samples, and both. In the example below, the first call to $pdf evaluates both distributions at (1, 2), the second call evaluates the first at (1) and the second at (2), and the third call evaluates the first at (1, 2) and the second at (3, 4).
> V$pdf(1:2)
       Norm1     Norm2
1: 0.3989423 0.2419707
2: 0.2419707 0.3989423
> V$pdf(1, 2)
       Norm1     Norm2
1: 0.3989423 0.3989423
> V$pdf(1:2, 3:4)
       Norm1      Norm2
1: 0.3989423 0.24197072
2: 0.2419707 0.05399097
Finally, a mixture distribution with uniform weights is constructed from a \(Normal(2, 1)\) distribution and an \(Exponential(1)\).
> MD <- MixtureDistribution$new(
+ list(Normal$new(mean = 2, sd = 1),
+ Exponential$new(rate = 1)
+ )
+ )
> MD$pdf(1:5)
[1] 0.304925083 0.267138782 0.145878896 0.036153303 0.005584898
> MD$cdf(1:5)
[1] 0.3953879 0.6823324 0.8957788 0.9794671 0.9959561
> MD$rand(5)
[1] 3.6664473 0.1055126 0.6092939 0.8880799 3.4517465
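The pdf and cdf values are weighted sums of the component functions, and sampling amounts to first picking a component and then drawing from it. The base R sketch below assumes weights of one half each; the helper names are hypothetical, and the sampler illustrates the principle rather than the distr6 implementation.

```r
w <- c(0.5, 0.5)

mix_pdf <- function(x) w[1] * dnorm(x, mean = 2, sd = 1) + w[2] * dexp(x, rate = 1)
mix_cdf <- function(x) w[1] * pnorm(x, mean = 2, sd = 1) + w[2] * pexp(x, rate = 1)

# Sampling: choose a component with probability w, then draw from that component
mix_rand <- function(n) {
  comp <- sample(2, n, replace = TRUE, prob = w)
  ifelse(comp == 1, rnorm(n, mean = 2, sd = 1), rexp(n, rate = 1))
}

pdf_values <- mix_pdf(1:5)
cdf_values <- mix_cdf(1:5)
draws <- mix_rand(1000)
```

The deterministic values match the printed pdf and cdf output; the random draws naturally differ from run to run.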
Whilst distr6 fulfils its primary purpose as an R6 interface for probability distributions with basic features, it is not considered ‘feature-complete’, as it currently lacks many of the important features included in distr and other related software. distr6 is in constant development and has an active GitHub with open issues and projects. Some concrete short-term goals include:
Extending the FunctionImputation decorator to work on higher-order distributions, as well as improving its speed and accuracy.
distr6 introduces a robust and scalable object-oriented interface for probability distributions to R and aims to be the first stop for class object-oriented probability distributions in R. By making use of R6, every implemented distribution is clearly defined with properties, traits, and analytic results. Whilst R stats is limited to very basic dpqr functions for representing evaluated distributions, distr6 ensures that probability distributions are treated as complex mathematical objects.
Future updates of the package will include adding further numerical approximation strategies in the decorators to allow users to choose different methods (instead of being forced to use one). Additionally, the extensions to R6 could be abstracted into an independent package in order to better benefit the R community.
distr6 is released under the MIT license on GitHub and CRAN. Extended documentation, tutorials, and examples are available at https://alan-turing-institute.github.io/distr6/. Code quality is monitored and maintained by an extensive suite of unit tests on multiple continuous integration systems.
We would like to thank and acknowledge Prof. Dr. Peter Ruckdeschel and Prof. Dr. Matthias Kohl for their work on the distr package and for extensive discussions, planning, and design decisions that were utilised in the development of distr6. RS receives a PhD stipend from EPSRC (EP/R513143/1).
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Sonabend & Király, "distr6: R6 Object-Oriented Probability Distributions Interface in R", The R Journal, 2021
BibTeX citation
@article{RJ-2021-055, author = {Sonabend, Raphael and Király, Franz J.}, title = {distr6: R6 Object-Oriented Probability Distributions Interface in R}, journal = {The R Journal}, year = {2021}, note = {https://rjournal.github.io/}, volume = {13}, issue = {1}, issn = {2073-4859}, pages = {470-492} }