The purpose of this paper is to introduce the R package InfoTrad for estimating the probability of informed trading (PIN) initially proposed by Easley et al. (1996). PIN is a popular information asymmetry measure that proxies the proportion of informed traders in the market. This study provides a short survey on alternative estimation techniques for the PIN. There are many problems documented in the existing literature in estimating PIN. InfoTrad package aims to address two problems. First, the sequential trading structure proposed by Easley et al. (1996) and later extended by Easley et al. (2002) is prone to sample selection bias for stocks with large trading volumes, due to floating point exception. This problem is solved by different factorizations provided by Easley et al. (2010) (EHO factorization) and Lin and Ke (2011) (LK factorization). Second, the estimates are prone to bias due to boundary solutions. A grid-search algorithm (YZ algorithm) is proposed by Yan and Zhang (2012) to overcome the bias introduced due to boundary estimates. In recent years, clustering algorithms have become popular due to their flexibility in quickly handling large data sets. Gan et al. (2015) propose an algorithm (GAN algorithm) to estimate PIN using hierarchical agglomerative clustering which is later extended by Ersan and Alici (2016) (EA algorithm). The package InfoTrad offers LK and EHO factorizations given an input matrix and initial parameter vector. In addition, these factorizations can be used to estimate PIN through YZ algorithm, GAN algorithm and EA algorithm.
The main aim of this paper is to present the InfoTrad package that estimates the probability of informed trading (PIN) initially proposed by Easley et al. (1996). PIN is one of the primary measures of proxy information asymmetry in the market. The structural model is driven from maximum likelihood estimation (MLE). Wide range of studies use PIN to answer questions in different fields of finance1.
Although it is a heavily used measure in the finance literature, the
development of applications that calculate PIN are quite slow. An
initial attempt for R community is made by (Zagaglia 2012).
FinAsym package of
Zagaglia (2012) and the
PIN package of
Zagaglia (2013) provide the trade classification algorithm of
Lee and Ready (1991) which is an important tool for studies that use the TAQ
database. Both packages also provide PIN estimates through
pin_likelihood()
functions. However, those estimates are prone to bias
due to misspecification and other limitations. InfoTrad package
aims to overcome such limitations and provide users with a wide range of
options when estimating PIN.
Due to the popularity of the measure, problems in estimating PIN recently gained attention in the finance literature. Easley et al. (2010) indicate that for stocks with a large trading volume, it is not possible to estimate PIN due to floating-point-exception (FPE). Two different numerical factorizations are provided by Easley et al. (2010) and Lin and Ke (2011) to overcome the bias created due to FPE.
In addition, boundary solutions in estimating PIN are also shown to create bias in empirical studies. Yan and Zhang (2012) show that, independent of the type of factorization, the likelihood function can stuck at local optimum and provide biased PIN estimates. They propose an algorithm (YZ algorithm) that spans the parameter space by using 125 different initial values for the MLE problem and obtain the PIN estimate that gives the highest likelihood value with non-boundary solutions. Although YZ algorithm provides estimates with higher likelihood and guarantees obtain non-boundary solutions, the iterative structure makes this algorithm time-consuming especially for studies that use large datasets.
Considering the fact that recent studies that estimate PIN use large datasets, the effectiveness of the YZ algorithm is questioned. In recent years, clustering algorithms have become popular due to their efficiency in processing large sets of data. Gan et al. (2015) propose an algorithm that use hierarchical agglomerative clustering to estimate PIN. Ersan and Alici (2016) later extends this framework.
FPE and boundary solutions are not the only problems of PIN model. Duarte and Young (2009) indicate that the structural model of Easley et al. (1996) enforces a negative contemporaneous covariance between intraday buy and sell orders, which is contrary to the empirical evidence for symmetric order shocks. In addition,they show that the PIN model fails to capture the volatility of buy and sell orders,through simulations. Moreover, Duarte and Young (2009) adjust PIN to take into account the liquidity impact and show that liquidity is more prominent on stock returns compared to information asymmetry. Finally, it is important to note that PIN does not consider any strategic behaviour of investors such as order splitting. Order splitting can be more evident when a stock is jointly trading on multiple venues (Menkveld 2008). Even for a stock that is traded on a single market, an informed investor may want to split her order in order avoid revealing her private information too quickly (Foucault et al. 2013). PIN model, by construction, fails to attach multiple small orders to a single informed investor.
This paper introduces and discusses the R (R Core Team 2016)
InfoTrad package for estimating PIN. InfoTrad provides users
with the necessary methods to solely adress the problems of FPE and
boundary solutions. The package contains the likelihood factorizations
of EHO and LK as separate functions (EHO()
and LK()
, respectively)
which provide likelihood specifications to avoid FPE. In addition,
through YZ()
, GAN()
and EA()
functions, PIN estimates can be
obtained using the grid-search algorithm of Yan and Zhang (2012) and
clustering algorithms of Gan et al. (2015) and Ersan and Alici (2016). For all of
the algorithms, likelihood specification can be set to EHO or LK.
The paper is organized as follows; Section 2 provides a brief description of PIN. Specifically, section 2.1 discusses the problem of FPE and the alternative factorizations EHO and LK. Section 2.2 reviews the problem of boundary solutions and the YZ algorithm. Section 2.3 describes the clustering algorithms of Gan et al. (2015) and Ersan and Alici (2016). Section 3 introduces the package InfoTrad along with examples. Section 4 evaluates the performance of each method through simulations. Section 5 provides concluding remarks.
The structural model of Easley et al. (1996) and Easley et al. (2002) consists of three types of agents; informed traders, uninformed traders and market makers. On a trading day \(t\), one risky asset is continuously traded. Market maker sets the price for a given stock by observing the buy orders \((B_t)\) and sell orders \((S_t)\). For that stock, an information event is assumed to follow a Bernoulli distribution with success probability \(\alpha\). This event reveals either a high or a low signal for the stock value. The event is assumed to provide a low signal with probability \(\delta\). When informed traders observe a high (low) signal, they are assumed to place buy (sell) orders at a rate of \(\mu\). Uninformed traders are assumed to place orders, independent of the information event and the signal. They arrive to market to place a buy (sell) order at a rate of \(\epsilon_b\) (\(\epsilon_s\)). Orders of both informed and uninformed investors are assumed to follow independent Poisson processes.
The joint probability distribution with respect to the parameter vector \(\Theta \equiv \{ \alpha, \delta, \mu, \epsilon_b, \epsilon_s \}\) and the number of buys and sells \((B_t,S_t)\), is specified by \[\label{eq:1} \begin{split} f(B_t,S_t|\Theta) \equiv \alpha\delta exp(-\epsilon_b) \frac{\epsilon_b^{B_t}}{{B_t}!} exp[-(\epsilon_s+\mu)] \frac{(\epsilon_s+\mu)^{S_t}}{{S_t}!} \hspace{0.9cm} \\ + \alpha (1-\delta) exp[-(\epsilon_b+\mu)] \frac{(\epsilon_b +\mu)^{B_t}}{{B_t}!} exp(-\epsilon_s) \frac{\epsilon_s^{S_t}}{{S_t}!} \hspace{0.1cm}\\ + (1-\alpha) exp(-\epsilon_b) \frac{\epsilon_b^{B_t}}{{B_t}!} exp(-\epsilon_s) \frac{\epsilon_s^{S_t}}{{S_t}!} \hspace{1.6cm} \end{split} \tag{1}\]
The estimates of arrival rates (\(\hat{\mu}, \hat{\epsilon_s}\) and \(\hat{\epsilon_b}\)), along with estimates of the probabilities (\(\hat{\alpha}\) and \(\hat{\delta}\)) can be obtained by maximizing the joint log-likelihood function given the order input matrix \((B_t,S_t)\) over \(T\) trading days. The non-linear objective function of this problem can be written as; \[\label{eq:2} L(\Theta|T)\equiv \sum_{t=1}^{T}L(\Theta|(B_t,S_t))= \sum_{t=1}^Tlog[f(B_t,S_t|\Theta)] \tag{2}\] The maximization problem is subject to the boundary constraints \(\alpha,\delta \in [0,1]\) and \(\mu,\epsilon_b,\epsilon_s \in [0,\infty)\)2. The PIN estimate is then given by; \[\label{eq:3} \widehat{PIN}=\frac{\hat{\alpha}\hat{\mu}}{\hat{\alpha}\hat{\mu}+\hat{\epsilon_b}+\hat{\epsilon_s}} \tag{3}\]
PIN estimates are prone to selection bias, especially for stocks for which the number of buy and sell orders are large3. Lin and Ke (2011) show that the increase in the number of buy and sell orders for a given stock, significantly shrinks the feasible solution set for the maximization of the log likelihood function in equation (2). To maximize the non-linear function (1), the optimization software introduces initial values for the parameters in \(\Theta\). The numerical optimization method is applied after those initial parameters are introduced. Therefore, for large enough \(B_t\) and \(S_t\) whose factorials cannot be calculated by mainstream computers (i.e. FPE), the optimal value for equation (2) becomes undefined. The FPE problem is therefore, more pronounced in active stocks.
To avoid the bias created due to FPE, one factorization of the equation (2) is provided by Easley et al. (2010) as \(L_{EHO}(\Theta|T)\equiv \sum_{t=1}^T L_{EHO}(\Theta|B_t,S_t)\) where \[\label{eq:4} \begin{split} L_{EHO}(\Theta|B_t,S_t) = log[\alpha \delta exp(-\mu)x_b^{B_t-M_t}x_s^{-M_t}+\alpha(1-\delta)exp(-\mu)x_b^{-M_t}x_s^{S_t-M_t}+(1-\alpha)x_b^{B_t-M_t}x_s^{S_t-M_t}] \\ + B_t log(\epsilon_b + \mu)+S_t log (\epsilon_s + \mu)-(\epsilon_b+\epsilon_s)+ M_t[log(x_b)+log(x_s)]-log(S_t!B_t!),\hspace{1.1cm} \end{split} \tag{4}\] where \(M_t=min(B_t,S_t)+max(B_t,S_t)/2\), \(x_b=\epsilon_b/(\mu+\epsilon_b)\) and \(x_s=\epsilon_s/(\mu+\epsilon_s)\).
Lin and Ke (2011) introduce another algebraically equivalent factorization
of the equation (2),
\(L_{LK}(\Theta|T)\equiv \sum_{t=1}^T L_{LK}(\Theta|B_t,S_t)\) where
\[\label{eq:5}
\begin{split}
L_{LK}(\Theta|B_t,S_t) = log[\alpha \delta exp(e_{1 t}-e_{max t}) +\alpha (1-\delta)exp(e_{2t}-e_{max t})+(1-\alpha)exp(e_{3 t}-e_{max t})] \\
+ B_tlog(\epsilon_b+\mu) + S_tlog(\epsilon_s+\mu)-(\epsilon_b+\epsilon_s)+e_{max t} -log(S_t!B_t!), \hspace{2.3cm}
\end{split} \tag{5}\]
where \(e_{1 t}=-\mu-B_tlog(1+\mu/\epsilon_b)\),
\(e_{2 t}=-\mu-S_tlog(1+\mu/\epsilon_s)\),
\(e_{3 t}=-B_tlog(1+\mu/\epsilon_b)-S_tlog(1+\mu/\epsilon_s)\) and
\(e_{max t} = max(e_{1t},e_{2t},e_{3t})\). The last term \(log(S_t!B_t!)\)
is constant with respect to the parameter vector \(\Theta\), and is,
therefore, dropped in the MLE for both factorizations.
Another source of bias in estimating PIN arises from boundary solutions. Yan and Zhang (2012) indicate that in calculating PIN, parameter estimates \(\hat{\alpha}\) and \(\hat{\delta}\) usually fall onto the boundaries of the parameter space, that is, they are equal to zero or one. PIN estimate presented in equation ((3)) is directly related to the estimate of \(\hat{\alpha}\). Letting \(\hat{\alpha}\) equal to zero will make sure that PIN is zero as well. This can create a sample selection bias in portfolio formation, especially for quarterly estimations4. Yan and Zhang (2012) show that;
\[E(B)=\alpha(1-\delta)\mu+\epsilon_b\]
\[E(S)=\alpha\delta\mu+\epsilon_s\]
Then, they propose the following algorithm to overcome the bias created due to boundary solutions. Let \((\alpha^0,\delta^0,\epsilon_b^0,\epsilon_s^0,\mu^0)\) be the initial parameter function to be placed in the non-linear program presented in equation ((4)). In addition, let \(\bar{B}\) and \(\bar{S}\) be the average number of buy and sell orders.
\[\alpha^0=\alpha_i, \hspace{0.5cm} \delta^0=\delta_j, \hspace{0.5cm} \epsilon_b^0=\gamma_k\bar{B}, \hspace{0.5cm} \mu^0=\frac{\bar{B}-\epsilon_b^0}{\alpha^0(1-\delta^0)} \hspace{0.2cm} \text{and} \hspace{0.2cm} \epsilon_s^0=\bar{S}-\alpha^0\delta^0\mu^0\] where \(\alpha_i,\delta_j, \gamma_k \in \{0.1,0.3,0.5,0.7,0.9\}\). This will yield 125 different PIN estimates along with their likelihood values. In line with Yan and Zhang (2012), we drop any initial parameter vector having negative values for \(\epsilon_s^0\). In addition, following Ersan and Alici (2016), we also drop any initial parameter vector with \(\mu^0>max(B_t,S_t)\). Yan and Zhang (2012) then select the estimate with non-boundary parameters yielding highest likelihood value. This method, by construction, spans the parameter space and tries to avoid local optima and provides non-boundary estimates for \(\alpha\).
In recent years, clustering algorithms are increasingly becoming popular
in estimating the probability of informed trading due to efficiency
concerns. Gan et al. (2015) and Ersan and Alici (2016) use clustering algorithms
to estimate PIN. Gan et al. (2015) introduce a method that clusters the data
into three groups (good news, bad news, no news) based on the mean
absolute difference in order imbalance. Let \(X_t=B_t-S_t\) be the order
imbalance on day \(t\) computed as the difference between buy orders and
sell orders. The clustering is then based on the distance function
defined as \(D(I,J)=|X_i-X_j|, \quad 1\leq i,j \leq T\) where \(i \neq j\).
They use hierarchical agglomerative clustering (HAC) to group the data
elements based on the distance matrix. Specifically, they use hclust()
function of Müllner (2013) in R5. The algorithm sequentially clusters,
in a bottom-up fashion, each observation into groups based on \(X_t\) and
stops when it reaches three clusters. The theoretical framework of
Easley et al. (1996) indicates that a stock has high (low) \(X_t\) on good
(bad) days. Therefore, the cluster which has the highest (lowest) mean
\(X_t\) is labelled as good (bad) news. The remaining cluster is then
labelled as no news. Once each observation is grouped into their
respective clusters (good news, bad news, no news), \(c \in \{G,B,N\}\),
the parameter estimates for
\(\Theta \equiv \{\alpha,\delta,\mu,\epsilon_b,\epsilon_s\}\) are
calculated simply by counting. Let \(\omega_c\) be the proportion of
cluster \(c\) occupying the total number of days \(T\), such that
\(\sum_{c=1}^3\omega_c=1\). Similarly, let \(\bar{B_c}\) and \(\bar{S_c}\) be
the average number of buys and sells on cluster \(c\), respectively.
Then, the probability of an information event is given by \(\hat{\alpha}=\omega_B+\omega_G\). Moreover, the estimate for the probability of information event releasing bad news is given by \(\hat{\delta}=\omega_B/\hat{\alpha}\). The estimate for the arrival rate of buy orders of uninformed traders represented by \(\hat{\epsilon_b}=\frac{\omega_B}{\omega_B+\omega_N}\bar{B_B}+\frac{\omega_N}{\omega_B+\omega_N}\bar{B_N}\). Similarly, the estimate for the arrival rate of sell orders of uninformed traders represented by \(\hat{\epsilon_s}=\frac{\omega_G}{\omega_G+\omega_N}\bar{S_G}+\frac{\omega_N}{\omega_G+\omega_N}\bar{B_N}\). Finally, the arrival rate for the informed investors is calculated as \(\hat{\mu}=\frac{\omega_G}{\omega_B+\omega_G}(\bar{B_G}-\hat{\epsilon_b})+\frac{\omega_B}{\omega_B+\omega_G}(\bar{S_B}-\hat{\epsilon_s})\) where \((\bar{B_G}-\hat{\epsilon_b})\) corresponds to the buy rate of informed investors \(\hat{\mu_b}\) and \((\bar{S_B}-\hat{\epsilon_s})\) corresponds to the sell rate of informed investors \(\hat{\mu_s}\)6.
Through simulations, Gan et al. (2015) show that estimates calculated as above are proper candidates for the initial parameter values to be used in MLE process. Ersan and Alici (2016) argue that the estimates for the informed arrival rate, \(\mu\), contains a downward bias with GAN algorithm7. This is what we observe in this study as well. In addition, they state that GAN algorithm provides inaccurate estimates for \(\delta\). In order to overcome these issues, instead of using \(X_t\), Ersan and Alici (2016) use absolute daily order imbalance, \(|X_t|\), to cluster the data. They initially cluster, \(|X_t|\) into two, again by using hclust(). The cluster with the lower mean daily absolute order imbalance is labelled as "no event" cluster and the remaining as "event" cluster. Then, the formation of "good" and "bad" event day clusters are obtained through separating the days in the "event" cluster into two with respect to the sign of the daily order imbalances. The parameter estimates are then computed with the same procedure presented above8.
The R package InfoTrad provides five different functions
EHO()
,LK()
,YZ()
,GAN()
and EA()
. The first two functions
provide likelihood specifications whereas the last three functions can
be used to obtain parameter estimates for \(\Theta\) to calculate PIN in
equation (3). All five functions require a data frame that
contains \(B_t\) in the first column, and \(S_t\) in the second column. We
create \(B_t\) and \(S_t\) for ten hypothetical trading days9. EHO()
and LK()
read \((B_t,S_t)\) and return the related functional form of
the negative log likelihood. These objects can be used in any
optimization procedure such as optim()
to obtain the parameter
estimates
\(\hat{\Theta}\equiv\{\hat{\alpha},\hat{\delta},\hat{\mu},\hat{\epsilon_b},\hat{\epsilon_s}\}\),
the likelihood value and other specifications, in one iteration with a
pre-specified initial value vector, \(\Theta_0\), for parameters. We
define EHO()
and LK()
as simple likelihood specifications rather
than functions that execute the MLE procedure. This is due to the fact
that MLE estimators vary depending on the optimization procedure. Users
who wish to develop alternative estimation techniques, based on the
proposed likelihood factorization, can use EHO()
and LK()
. This is
the underlying reason why those functions do not have built-in
optimization procedures. By specifying EHO()
and LK()
as simple
likelihood functions, we give developers the flexibility to select the
most suitable optimization procedure for their application.
For researchers who want to calculate an estimate of PIN, YZ()
,
GAN()
and EA()
functions have built-in optimization procedures.
Those functions read a likelihood specification value along with data.
Likelihood specification can be set either to “LK" or to”EHO" with
“LK" being the default. All estimation functions use neldermead()
function of nloptr
package to conduct MLE with the specified factorization. GAN and EA
functions also use hclust()
function of Müllner (2013) to conduct
clustering. The output of these three functions is an object that
provides
\(\{\hat{\alpha},\hat{\delta}, \hat{\mu},\hat{\epsilon_b},\hat{\epsilon_s},f(\hat{\Theta}),\widehat{PIN}\}\),
where \(f(\hat{\Theta})\) represents the optimal likelihood value given
the parameter estimates \(\hat{\Theta}\).
An example is provided below for EHO()
with a sample data and initial
parameter values. Notice that the first column of sample data is for
\(B_t\) and second column is for \(S_t\). Similarly, the initial parameter
values are constructed as; \(\Theta_0\) =
\(\{\alpha,\delta,\mu,\epsilon_b,\epsilon_s\}\). We use optim()
with
‘Nelder-Mead’ method to execute MLE, however developer is flexible to
use other methods as well.
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
<-c(350,250,500,552,163,345,847,923,123,349)
Buy<-c(382,500,463,550,200,323,456,342,578,455)
Sell=cbind(Buy,Sell)
data
# Initial parameter values
# par0 = (alpha, delta, mu, epsilon_b, epsilon_s)
= c(0.5,0.5,300,400,500)
par0
# Call EHO function
= EHO(data)
EHO_out = optim(par0, EHO_out, gr = NULL, method = c("Nelder-Mead"), hessian = FALSE)
model
## Parameter Estimates
$par[1] # Estimate for alpha
model# [1] 0.9111102
$par[2] # Estimate for delta
model#[1] 0.0001231429
$par[3] # Estimate for mu
model# [1] 417.1497
$par[4] # Estimate for eb
model# [1] 336.075
$par[5] # Estimate for es
model# [1] 466.2539
## Estimate for PIN
$par[1]*model$par[3])/((model$par[1]*model$par[3])+model$par[4]+model$par[5])
(model# [1] 0.3214394
####
In this example, \(B_t\) and \(S_t\) vectors are selected so that the likelihood function cannot be represented as in equation (1). We set the initial parameters to be \(\Theta_0\)=(0.5,0.5,300,400,500). For the given \(B_t\), \(S_t\) and \(\Theta_0\) vectors, PIN measure is calculated as 0.32 with EHO factorization.
An example is provided below for LK()
function with a sample data and
initial parameter values. Notice that the first column of sample data is
for \(B_t\) and second column is for \(S_t\). Similarly, the initial
parameter values are constructed as; \(\Theta_0\) =
\(\{\alpha,\delta,\mu,\epsilon_b,\epsilon_s\}\). We use optim()
with
‘Nelder-Mead’ method to execute MLE, however developer is flexible to
use other methods as well.
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
<-c(350,250,500,552,163,345,847,923,123,349)
Buy<-c(382,500,463,550,200,323,456,342,578,455)
Sell=cbind(Buy,Sell)
data
# Initial parameter values
# par0 = (alpha, delta, mu, epsilon_b, epsilon_s)
= c(0.5,0.5,300,400,500)
par0
# Call LK function
= LK(data)
LK_out = optim(par0, LK_out, gr = NULL, method = c("Nelder-Mead"), hessian = FALSE)
model
## The structure of the model output ##
model
#$par
#[1] 0.480277 0.830850 315.259805 296.862318 400.490830
#$value
#[1] -44343.21
#$counts
#function gradient
# 502 NA
#$convergence
#[1] 1
#$message
#NULL
## Parameter Estimates
$par[1] # Estimate for alpha
model# [1] 0.480277
$par[2] # Estimate for delta
model# [1] 0.830850
$par[3] # Estimate for mu
model# [1] 315.259805
$par[4] # Estimate for eb
model# [1] 296.862318
$par[5] # Estimate for es
model# [1] 400.4908
## Estimate for PIN
$par[1]*model$par[3])/((model$par[1]*model$par[3])+model$par[4]+model$par[5])
(model# [1] 0.178391
####
For the given \(B_t\), \(S_t\) and \(\Theta_0\) vectors, PIN measure is calculated as 0.18 with LK factorization.
An example is provided below for YZ()
function with a sample data.
Notice that the first column of sample data is for \(B_t\) and second
column is for \(S_t\). In addition, the first example is with default
likelihood specification LK and the second one is with EHO. Notice that
YZ()
function do not require any initial parameter vector \(\Theta_0\).
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
<-c(350,250,500,552,163,345,847,923,123,349)
Buy<-c(382,500,463,550,200,323,456,342,578,455)
Sell<-cbind(Buy,Sell)
data
# Parameter estimates using the LK factorization of Lin and Ke (2011)
# with the algorithm of Yan and Zhang (2012).
# Default factorization is set to be "LK"
=YZ(data)
resultprint(result)
# Alpha: 0.3999999
# Delta: 0
# Mu: 442.1667
# Epsilon_b: 263.3333
# Epsilon_s: 424.9
# Likelihood Value: 44371.84
# PIN: 0.2004457
# Parameter estimates using the EHO factorization of Easley et. al. (2010)
# with the algorithm of Yan and Zhang (2012).
=YZ(data,likelihood="EHO")
resultprint(result)
# Alpha: 0.9000001
# Delta: 0.9000001
# Mu: 489.1111
# Epsilon_b: 396.1803
# Epsilon_s: 28.72002
# Likelihood Value: Inf
# PIN: 0.3321033
For the given \(B_t\) and \(S_t\) vectors, PIN measure is calculated as 0.20 with YZ algorithm along with LK factorization. Moreover, PIN measure is calculated as 0.33 with YZ algorithm along with EHO factorization.
An example is provided below for GAN()
function with a sample data.
Notice that the first column of sample data is for \(B_t\) and second
column is for \(S_t\). In addition, the first example is with default
likelihood specification LK and the second one is with EHO. Notice that
GAN()
function do not require any initial parameter vector \(\Theta_0\).
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
<-c(350,250,500,552,163,345,847,923,123,349)
Buy<-c(382,500,463,550,200,323,456,342,578,455)
Sell<-cbind(Buy,Sell)
data
# Parameter estimates using the LK factorization of Lin and Ke (2011)
# with the algorithm of Gan et. al. (2015).
# Default factorization is set to be "LK"
=GAN(data)
resultprint(result)
# Alpha: 0.3999998
# Delta: 0
# Mu: 442.1667
# Epsilon_b: 263.3333
# Epsilon_s: 424.9
# Likelihood Value: 44371.84
# PIN: 0.2044464
# Parameter estimates using the EHO factorization of Easley et. al. (2010)
# with the algorithm of Gan et. al. (2015)
=GAN(data, likelihood="EHO")
resultprint(result)
# Alpha: 0.3230001
# Delta: 0.4780001
# Mu: 481.3526
# Epsilon_b: 356.6359
# Epsilon_s: 313.136
# Likelihood Value: Inf
# PIN: 0.1884001
For the given \(B_t\) and \(S_t\) vectors, PIN measure is calculated as 0.20 with GAN algorithm along with LK factorization. Moreover, PIN measure is calculated as 0.19 with GAN algorithm along with EHO factorization.
An example is provided below for EA()
function with a sample data.
Notice that the first column of sample data is for \(B_t\) and second
column is for \(S_t\). In addition, the first example is with default
likelihood specification LK and the second one is with EHO. Notice that
EA()
function do not require any initial parameter vector \(\Theta_0\).
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
=c(350,250,500,552,163,345,847,923,123,349)
Buy=c(382,500,463,550,200,323,456,342,578,455)
Sell=cbind(Buy,Sell)
data
# Parameter estimates using the LK factorization of Lin and Ke (2011)
# with the modified clustering algorithm of Ersan and Alici (2016).
# Default factorization is set to be "LK"
=EA(data)
resultprint(result)
# Alpha: 0.9511418
# Delta: 0.2694005
# Mu: 76.7224
# Epsilon_b: 493.7045
# Epsilon_s: 377.4877
# Likelihood Value: 43973.71
# PIN: 0.07728924
# Parameter estimates using the EHO factorization of Easley et. al. (2010)
# with the modified clustering algorithm of Ersan and Alici (2016).
=EA(data,likelihood="EHO")
resultprint(result)
# Alpha: 0.9511418
# Delta: 0.2694005
# Mu: 76.7224
# Epsilon_b: 493.7045
# Epsilon_s: 377.4877
# Likelihood Value: 43973.71
# PIN: 0.07728924
For the given \(B_t\) and \(S_t\) vectors, PIN measure is calculated as 0.08 with EA algorithm along with LK factorization. Moreover, PIN measure is calculated, again, as 0.08 with EA algorithm along with EHO factorization.
In this section, we investigate the performance of the estimates obtained for \(\Theta\) and PIN using the existing methods. We evaluate the methods based on their accuracy proxied by mean absolute errors (MAE)10. We first examine how the estimates vary in different trade intensity levels. To this end, we follow the methodology in Gan et al. (2015). Let \(I\) be the the set of trade intensity levels ranging from 50 to 5000 at step size of 50, that is, I=\(\{50,100,150,\dots,5000\}\). We first set our parameters as \(\Theta= \{\alpha=0.5,\delta=0.5,\mu=0.2i,\epsilon_b=0.4i,\epsilon_s=0.4i \}\), where \(i \in I\). For each trade intensity level, we generate \(N\)=50 random samples of \(\tilde{\alpha}\) and \(\tilde{\delta}\) that are binomially distributed with parameters \(\alpha\) and \(\delta\) respectively. \(\tilde{\alpha}\) and \(\tilde{\delta}\) proxy the content of the information event. For each pair of \(\tilde{\alpha}\), \(\tilde{\delta}\) values, we generate buy and sell values \((B_t,S_t)\) for hypothetical \(T\)=60 days in the following manner;
if \(\tilde{\alpha}\) = 0, then there is no information event, therefore, generate \(B_t \sim Pois(\epsilon_b)\) and \(S_t \sim Pois(\epsilon_s)\).
if \(\tilde{\alpha}\) = 1, and \(\tilde{\delta}\) =1, then there is bad news, therefore generate \(B_t \sim Pois(\epsilon_b)\) and \(S_t \sim Pois(\epsilon_s+\mu)\)
if \(\tilde{\alpha}\) = 1, and \(\tilde{\delta}\) =0, then there is good news, therefore generate \(B_t \sim Pois(\epsilon_b+\mu)\) and \(S_t \sim Pois(\epsilon_s)\)
We then form the joint likelihood function represented by equation
(4) in EHO form or by equation (5) in LK form and obtain
the estimates using YZ()
, GAN()
or EA()
methods.
The results are presented in Table 1 which indicates that
YZ()
method with LK()
factorization provides the PIN estimates with
lowest MAE. Although the clustering algorithms, especially GAN()
method, provide powerful estimates of
\(\hat{\alpha},\hat{\delta},\hat{\epsilon_b},\hat{\epsilon_s}\), they fail
to estimate the arrival rate of informed investors
\(\hat{\mu}\),accurately. This is in line with Ersan and Alici (2016). On the
contrary, YZ()
method with EHO()
factorization provides the best
estimates for \(\hat{\mu}\), but fails to provide good estimates for other
parameters.
Method | Factorization | \(\widehat{PIN}\) | \(\hat{\alpha}\) | \(\hat{\delta}\) | \(\hat{\mu}\) | \(\hat{\epsilon_b}\) | \(\hat{\epsilon_s}\) | |
---|---|---|---|---|---|---|---|---|
YZ | LK | 0.075 | 0.199 | 0.059 | 415.2 | 104.3 | 109.0 | |
YZ | EHO | 0.134 | 0.428 | 0.310 | 154.6 | 288.3 | 247.4 | |
GAN | EHO | 0.101 | 0.087 | 0.083 | 479.4 | 124.1 | 117.3 | |
GAN | LK | 0.101 | 0.087 | 0.083 | 479.5 | 123.8 | 118.1 | |
EA | LK | 0.102 | 0.268 | 0.274 | 484.6 | 128.7 | 119.3 | |
EA | EHO | 0.102 | 0.270 | 0.275 | 483.1 | 128.5 | 107.8 |
A more general way of examining the accuracy of PIN estimates is proposed in several studies (e.g, Lin and Ke (2011), Gan et al. (2015), Ersan and Alici (2016)). In this setting, we fix the trade intensity, I=2500. The total trade intensity represents the overall presence of informed and uninformed traders, that is, I=(\(\mu\), \(\epsilon_b\), \(\epsilon_s\)). We then generate three probability terms \(p_1,p_2,p_3\) with \(N\)=5000 random observations that are distributed uniformly between 0 and 1. \(p_1\) represents the fraction of informed investors in total trade intensity, that is, \(\mu\)=\(p_1*I\). The rest of the trade intensity is distributed equally to buy and sell orders of uninformed investors, that is, \(e_b=e_s=(1-p_1)*I/2\). \(p_2\) represents the true parameter for the probability of news arrival, \(\alpha\), and \(p_3\) is the true parameter for the content of the news, \(\delta\). We generate observations for \(\tilde{\alpha}\) and \(\tilde{\delta}\), as described earlier. For each pair of \(\tilde{\alpha}\) and \(\tilde{\delta}\), we generate buy and sell values \((B_t,S_t)\) for hypothetical \(T\)=60 days, again, in the manner presented above; form the likelihood and obtain the parameter estimates.
The results are presented in Table 2. Similar to first
simulation, GAN()
captures the true nature of \(\hat{\alpha}\) and
\(\hat{\delta}\) better than any other method with both factorizations.
YZ()
method with EHO()
factorization performs best when estimating
the arrival of informed traders, \(\hat{\mu}\). The importance of
estimating \(\hat{\mu}\) becomes quite evident in Table 2.
Although other methods outperform YZ()
method with EHO()
factorization in estimating \(\alpha,\epsilon_b\) and \(\epsilon_s\), it
provides the best estimate for PIN due to it’s performance on estimating
\(\hat{\mu}\).
Method | Factorization | \(\widehat{PIN}\) | \(\hat{\alpha}\) | \(\hat{\delta}\) | \(\hat{\mu}\) | \(\hat{\epsilon_b}\) | \(\hat{\epsilon_s}\) | |
---|---|---|---|---|---|---|---|---|
YZ | LK | 0.323 | 0.428 | 0.432 | 1,212.0 | 303.4 | 325.0 | |
YZ | EHO | 0.237 | 0.437 | 0.357 | 942.9 | 386.0 | 470.2 | |
GAN | LK | 0.348 | 0.380 | 0.410 | 1,218.7 | 314.5 | 323.3 | |
GAN | EHO | 0.347 | 0.357 | 0.397 | 1,216.2 | 328.5 | 339.5 | |
EA | LK | 0.348 | 0.437 | 0.421 | 1,224.0 | 325.1 | 336.3 | |
EA | EHO | 0.347 | 0.428 | 0.413 | 1,222.0 | 331.3 | 345.9 |
This paper provides a short survey on five most widely used estimation
techniques for the probability of informed trading (PIN) measure. In
this paper, we introduce the R package InfoTrad, covering
estimation procedures for PIN using EHO, LK factorizations along with
YZ, GAN and EA algorithms (EHO()
,LK()
, YZ()
, GAN()
EA()
). The
functions EHO()
and LK()
read a (Tx2) matrix where the rows of the
first column contains total number of buy orders on a given trading day
t, \(B_t\), and the rows of the second column contains the total number of
sell orders on a given trading day t, \(S_t\), where t \(\in\)
\(\{1,2,\dots,T\}\). In addition, they also require an initial parameter
vector in the form of, \(\Theta_0\) =
\(\{\alpha,\delta,\mu,\epsilon_b,\epsilon_s\}\). Both functions produce
the respective log-likelihood functions.
The functions YZ()
, GAN()
and EA()
read \((B_t,S_t)\) as an input
along with a likelihood specification that is set to LK
by default.
These functions do not require initial parameter matrix to obtain the
parameter estimates when calculating PIN. All three functions use
neldermead()
method of nloptr as built-in optimization procedure
for MLE. YZ()
GAN()
and EA()
produce an object that gives the
parameter estimates \(\hat{\Theta}\) along with likelihood value and
\(\widehat{PIN}\).
This research is supported by the Scientific and Technological Research Council of Turkey (TUBITAK), Grant Number: 116K335.
InfoTrad, FinAsym, PIN, nloptr
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Çelik & Tiniç, "InfoTrad: An R package for estimating the probability of informed trading", The R Journal, 2018
BibTeX citation
@article{RJ-2018-013, author = {Çelik, Duygu and Tiniç, Murat}, title = {InfoTrad: An R package for estimating the probability of informed trading}, journal = {The R Journal}, year = {2018}, note = {https://rjournal.github.io/}, volume = {10}, issue = {1}, issn = {2073-4859}, pages = {31-42} }