The purpose of this paper is to introduce the R package InfoTrad for estimating the probability of informed trading (PIN) initially proposed by Easley et al. (1996). PIN is a popular information asymmetry measure that proxies the proportion of informed traders in the market. This study provides a short survey of alternative estimation techniques for PIN. The existing literature documents many problems in estimating PIN; the InfoTrad package aims to address two of them. First, the sequential trading structure proposed by Easley et al. (1996) and later extended by Easley et al. (2002) is prone to sample selection bias for stocks with large trading volumes, due to floating-point exception. This problem is solved by the different factorizations provided by Easley et al. (2010) (EHO factorization) and Lin and Ke (2011) (LK factorization). Second, the estimates are prone to bias due to boundary solutions. A grid-search algorithm (YZ algorithm) is proposed by Yan and Zhang (2012) to overcome the bias introduced by boundary estimates. In recent years, clustering algorithms have become popular due to their flexibility in quickly handling large data sets. Gan et al. (2015) propose an algorithm (GAN algorithm) to estimate PIN using hierarchical agglomerative clustering, which is later extended by Ersan and Alici (2016) (EA algorithm). The package InfoTrad offers the LK and EHO factorizations given an input matrix and an initial parameter vector. In addition, these factorizations can be used to estimate PIN through the YZ, GAN and EA algorithms.
The main aim of this paper is to present the InfoTrad package, which estimates the probability of informed trading (PIN) initially proposed by Easley et al. (1996). PIN is one of the primary measures used to proxy information asymmetry in the market. The structural model is estimated by maximum likelihood estimation (MLE). A wide range of studies use PIN to answer questions in different fields of finance.
Although it is a heavily used measure in the finance literature, the development of applications that calculate PIN has been quite slow. An initial attempt for the R community is made by Zagaglia (2012). The FinAsym package of Zagaglia (2012) and the PIN package of Zagaglia (2013) provide the trade classification algorithm of Lee and Ready (1991), which is an important tool for studies that use the TAQ database. Both packages also provide PIN estimates through pin_likelihood() functions. However, those estimates are prone to bias due to misspecification and other limitations. The InfoTrad package aims to overcome such limitations and to provide users with a wide range of options when estimating PIN.
Due to the popularity of the measure, problems in estimating PIN have recently gained attention in the finance literature. Easley et al. (2010) indicate that, for stocks with a large trading volume, it is not possible to estimate PIN due to floating-point exception (FPE). Two different numerical factorizations are provided by Easley et al. (2010) and Lin and Ke (2011) to overcome the bias created by FPE.
In addition, boundary solutions in estimating PIN are also shown to create bias in empirical studies. Yan and Zhang (2012) show that, independent of the type of factorization, the likelihood function can get stuck at a local optimum and provide biased PIN estimates. They propose an algorithm (YZ algorithm) that spans the parameter space by using 125 different initial values for the MLE problem and selects the PIN estimate that gives the highest likelihood value among non-boundary solutions. Although the YZ algorithm provides estimates with higher likelihood and guarantees non-boundary solutions, its iterative structure makes it time-consuming, especially for studies that use large datasets.
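As a rough illustration of how such a grid of starting values can be built, the sketch below forms 125 candidate vectors from three five-point grids. The moment-matching formulas for the initial $\mu$ and $\varepsilon_s$, and the rule of discarding candidates with a negative sell rate, reflect our reading of Yan and Zhang (2012) and should be treated as assumptions; yz_initial_grid() is a hypothetical helper and not a function of InfoTrad.

```r
# Hypothetical helper (not part of InfoTrad): candidate starting values
# for the MLE, one row per (alpha, delta, gamma) combination.
yz_initial_grid <- function(data, grid = c(0.1, 0.3, 0.5, 0.7, 0.9)) {
  B_bar <- mean(data[, 1])   # average number of daily buys
  S_bar <- mean(data[, 2])   # average number of daily sells
  g <- expand.grid(alpha = grid, delta = grid, gamma = grid)
  g$eb <- g$gamma * B_bar
  # Assumed moment conditions:
  #   E[B] = eb + alpha * (1 - delta) * mu
  #   E[S] = es + alpha * delta * mu
  g$mu <- (B_bar - g$eb) / (g$alpha * (1 - g$delta))
  g$es <- S_bar - g$alpha * g$delta * g$mu
  # Candidates that imply a negative sell rate are dropped (assumption)
  g[g$es >= 0, c("alpha", "delta", "mu", "eb", "es")]
}

Buy  <- c(350, 250, 500, 552, 163, 345, 847, 923, 123, 349)
Sell <- c(382, 500, 463, 550, 200, 323, 456, 342, 578, 455)
starts <- yz_initial_grid(cbind(Buy, Sell))
nrow(starts)   # number of feasible starting vectors (at most 125)
```

Each row of starts can then be handed to an optimizer as an initial parameter vector, keeping the non-boundary solution with the highest likelihood.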
Considering that recent studies that estimate PIN use large datasets, the effectiveness of the YZ algorithm is questioned. In recent years, clustering algorithms have become popular due to their efficiency in processing large sets of data. Gan et al. (2015) propose an algorithm that uses hierarchical agglomerative clustering to estimate PIN. Ersan and Alici (2016) later extend this framework.
FPE and boundary solutions are not the only problems of the PIN model. Duarte and Young (2009) indicate that the structural model of Easley et al. (1996) enforces a negative contemporaneous covariance between intraday buy and sell orders, which is contrary to the empirical evidence for symmetric order shocks. In addition, they show through simulations that the PIN model fails to capture the volatility of buy and sell orders. Moreover, Duarte and Young (2009) adjust PIN to take into account the liquidity impact and show that liquidity has a more prominent effect on stock returns than information asymmetry. Finally, it is important to note that PIN does not consider any strategic behaviour of investors, such as order splitting. Order splitting can be more evident when a stock is traded jointly on multiple venues (Menkveld 2008). Even for a stock that is traded on a single market, an informed investor may want to split her order in order to avoid revealing her private information too quickly (Foucault et al. 2013). The PIN model, by construction, fails to attach multiple small orders to a single informed investor.
This paper introduces and discusses the R (R Core Team 2016) InfoTrad package for estimating PIN. InfoTrad provides users with the methods needed to address two of these problems only: FPE and boundary solutions. The package contains the likelihood factorizations of EHO and LK as separate functions (EHO() and LK(), respectively), which provide likelihood specifications that avoid FPE. In addition, through the YZ(), GAN() and EA() functions, PIN estimates can be obtained using the grid-search algorithm of Yan and Zhang (2012) and the clustering algorithms of Gan et al. (2015) and Ersan and Alici (2016). For all of the algorithms, the likelihood specification can be set to EHO or LK.
The paper is organized as follows. Section 2 provides a brief description of PIN. Specifically, Section 2.1 discusses the problem of FPE and the alternative factorizations EHO and LK. Section 2.2 reviews the problem of boundary solutions and the YZ algorithm. Section 2.3 describes the clustering algorithms of Gan et al. (2015) and Ersan and Alici (2016). Section 3 introduces the package InfoTrad along with examples. Section 4 evaluates the performance of each method through simulations. Section 5 provides concluding remarks.
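The structural model of Easley et al. (1996) and Easley et al. (2002) consists of three types of agents: informed traders, uninformed traders and market makers. On a trading day $t$, an information event occurs with probability $\alpha$; the event is bad news with probability $\delta$ and good news with probability $(1-\delta)$. Orders of informed traders arrive at rate $\mu$ on event days only (sell orders on bad-news days, buy orders on good-news days), whereas uninformed buy and sell orders arrive every day at rates $\varepsilon_b$ and $\varepsilon_s$, respectively. All arrivals follow independent Poisson processes.

The joint probability distribution with respect to the parameter vector $\Theta = (\alpha, \delta, \mu, \varepsilon_b, \varepsilon_s)$ and the daily numbers of buys and sells $(B_t, S_t)$ is

$$f(B_t, S_t \mid \Theta) = (1-\alpha)\, e^{-\varepsilon_b}\frac{\varepsilon_b^{B_t}}{B_t!}\, e^{-\varepsilon_s}\frac{\varepsilon_s^{S_t}}{S_t!} + \alpha\delta\, e^{-\varepsilon_b}\frac{\varepsilon_b^{B_t}}{B_t!}\, e^{-(\mu+\varepsilon_s)}\frac{(\mu+\varepsilon_s)^{S_t}}{S_t!} + \alpha(1-\delta)\, e^{-(\mu+\varepsilon_b)}\frac{(\mu+\varepsilon_b)^{B_t}}{B_t!}\, e^{-\varepsilon_s}\frac{\varepsilon_s^{S_t}}{S_t!} \qquad (1)$$

and the joint log-likelihood over $T$ trading days is

$$L(\Theta \mid T) = \sum_{t=1}^{T} \ln f(B_t, S_t \mid \Theta). \qquad (2)$$

The estimates of arrival rates ($\hat{\mu}$, $\hat{\varepsilon}_b$, $\hat{\varepsilon}_s$) and of the event probability $\hat{\alpha}$ are then used to calculate PIN as

$$PIN = \frac{\hat{\alpha}\hat{\mu}}{\hat{\alpha}\hat{\mu} + \hat{\varepsilon}_b + \hat{\varepsilon}_s}. \qquad (3)$$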
PIN estimates are prone to selection bias, especially for stocks for which the numbers of buy and sell orders are large. The pin_likelihood() function of the FinAsym package, for example, fails to provide results with the sample initial parameter vector.
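The source of the problem can be seen with a single Poisson term. The minimal sketch below evaluates the probability of observing 923 buy orders (the largest value in the sample data used later) at an arrival rate of 400, once naively and once on the log scale. The arrival rate is chosen only for illustration.

```r
B   <- 923   # one high-volume trading day
eps <- 400   # illustrative arrival rate

# Naive evaluation: eps^B and factorial(B) both overflow to Inf,
# so the ratio is not a number.
naive <- suppressWarnings(exp(-eps) * eps^B / factorial(B))

# The same probability evaluated on the log scale remains finite.
stable <- dpois(B, lambda = eps, log = TRUE)

is.nan(naive)      # TRUE
is.finite(stable)  # TRUE
```

Any likelihood that multiplies such terms directly breaks down for actively traded stocks, which is what the factorizations discussed next are designed to avoid.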
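To avoid the bias created by FPE, one factorization of equation (2) is provided by Easley et al. (2010) as

$$L(\Theta \mid T) = \sum_{t=1}^{T}\left[-\varepsilon_b - \varepsilon_s + M_t(\ln x_b + \ln x_s) + B_t \ln(\mu+\varepsilon_b) + S_t \ln(\mu+\varepsilon_s)\right] + \sum_{t=1}^{T} \ln\left[\alpha(1-\delta)\, e^{-\mu} x_s^{S_t-M_t} x_b^{-M_t} + \alpha\delta\, e^{-\mu} x_b^{B_t-M_t} x_s^{-M_t} + (1-\alpha)\, x_s^{S_t-M_t} x_b^{B_t-M_t}\right], \qquad (4)$$

where $M_t = \min(B_t, S_t) + \max(B_t, S_t)/2$, $x_s = \varepsilon_s/(\mu+\varepsilon_s)$ and $x_b = \varepsilon_b/(\mu+\varepsilon_b)$. The constant term $-\sum_{t}\ln(B_t!\,S_t!)$ is dropped because it does not affect the estimates.

Lin and Ke (2011) introduce another algebraically equivalent factorization of equation (2),

$$L(\Theta \mid T) = \sum_{t=1}^{T}\left[-\varepsilon_b - \varepsilon_s + B_t \ln(\mu+\varepsilon_b) + S_t \ln(\mu+\varepsilon_s) + e_{\max,t}\right] + \sum_{t=1}^{T} \ln\left[\alpha\delta\, e^{e_{1t}-e_{\max,t}} + \alpha(1-\delta)\, e^{e_{2t}-e_{\max,t}} + (1-\alpha)\, e^{e_{3t}-e_{\max,t}}\right], \qquad (5)$$

where $e_{1t} = -\mu - B_t \ln(1+\mu/\varepsilon_b)$, $e_{2t} = -\mu - S_t \ln(1+\mu/\varepsilon_s)$, $e_{3t} = -B_t \ln(1+\mu/\varepsilon_b) - S_t \ln(1+\mu/\varepsilon_s)$ and $e_{\max,t} = \max(e_{1t}, e_{2t}, e_{3t})$.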
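To make the two factorizations concrete, the sketch below writes both in base R and checks that they agree on the sample data used later in the paper. The helpers eho_loglik() and lk_loglik() are hypothetical and are not the InfoTrad source; they return the log-likelihood without the constant factorial term, and the exact form of each factorization is stated here as we understand it from Easley et al. (2010) and Lin and Ke (2011).

```r
# Hypothetical helpers (not the InfoTrad source). Both return the
# log-likelihood without the constant -sum(lfactorial(B) + lfactorial(S)).
eho_loglik <- function(par, data) {
  alpha <- par[1]; delta <- par[2]; mu <- par[3]; eb <- par[4]; es <- par[5]
  B <- data[, 1]; S <- data[, 2]
  M  <- pmin(B, S) + pmax(B, S) / 2
  xb <- eb / (mu + eb)
  xs <- es / (mu + es)
  sum(-eb - es + M * (log(xb) + log(xs)) +
        B * log(mu + eb) + S * log(mu + es)) +
    sum(log(alpha * (1 - delta) * exp(-mu) * xs^(S - M) * xb^(-M) +
              alpha * delta * exp(-mu) * xb^(B - M) * xs^(-M) +
              (1 - alpha) * xs^(S - M) * xb^(B - M)))
}

lk_loglik <- function(par, data) {
  alpha <- par[1]; delta <- par[2]; mu <- par[3]; eb <- par[4]; es <- par[5]
  B <- data[, 1]; S <- data[, 2]
  e1 <- -mu - B * log1p(mu / eb)                   # bad-news days
  e2 <- -mu - S * log1p(mu / es)                   # good-news days
  e3 <- -B * log1p(mu / eb) - S * log1p(mu / es)   # no-news days
  emax <- pmax(e1, e2, e3)
  sum(-eb - es + B * log(mu + eb) + S * log(mu + es) + emax) +
    sum(log(alpha * delta * exp(e1 - emax) +
              alpha * (1 - delta) * exp(e2 - emax) +
              (1 - alpha) * exp(e3 - emax)))
}

Buy  <- c(350, 250, 500, 552, 163, 345, 847, 923, 123, 349)
Sell <- c(382, 500, 463, 550, 200, 323, 456, 342, 578, 455)
data <- cbind(Buy, Sell)
par0 <- c(0.5, 0.5, 300, 400, 500)   # (alpha, delta, mu, eb, es)

eho_val <- eho_loglik(par0, data)
lk_val  <- lk_loglik(par0, data)
isTRUE(all.equal(eho_val, lk_val))   # the two forms agree numerically
```

The LK form keeps every exponent at or below zero by subtracting the daily maximum, whereas the EHO form still raises ratios to large powers and can overflow for sufficiently large order counts.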
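Another source of bias in estimating PIN arises from boundary solutions. Yan and Zhang (2012) indicate that, in calculating PIN, the parameter estimates can fall onto the boundaries of the parameter space. They then propose an algorithm to overcome the bias created by boundary solutions: the MLE is run from 125 different initial parameter vectors that span the parameter space, and the non-boundary solution with the highest likelihood value is retained.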
In recent years, clustering algorithms have become increasingly popular in estimating the probability of informed trading due to efficiency concerns. Gan et al. (2015) and Ersan and Alici (2016) use clustering algorithms to estimate PIN. Gan et al. (2015) introduce a method that clusters the data into three groups (good news, bad news, no news) based on the mean absolute difference in order imbalance. Let $X_t = B_t - S_t$ denote the order imbalance on day $t$. The clustering is performed with the hclust() function of Müllner (2013) in R; hclust() is used at its default setting in line with Gan et al. (2015).
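The clustering step can be sketched in a few lines of base R. The example below uses stats::hclust() with its default linkage rather than the implementation of Müllner (2013), labels the three clusters by their mean order imbalance, and derives simple cluster-based values for $\alpha$, $\delta$, $\varepsilon_b$ and $\varepsilon_s$. The labelling rule and the formulas for the arrival rates are assumptions made for illustration, and the informed arrival rate is left out.

```r
Buy  <- c(350, 250, 500, 552, 163, 345, 847, 923, 123, 349)
Sell <- c(382, 500, 463, 550, 200, 323, 456, 342, 578, 455)
X <- Buy - Sell                      # daily order imbalance

# Assumption: stats::hclust with its default (complete) linkage
hc <- hclust(dist(X))
cl <- cutree(hc, k = 3)              # three clusters of trading days

# Label clusters by mean imbalance: lowest = bad news, highest = good news
rank_of_cluster <- order(tapply(X, cl, mean))
state <- c("bad", "none", "good")[match(cl, rank_of_cluster)]

w_bad  <- mean(state == "bad")       # share of bad-news days
w_good <- mean(state == "good")      # share of good-news days
alpha0 <- w_bad + w_good             # probability of an information event
delta0 <- w_bad / alpha0             # probability of bad news given an event

# Assumed estimates of the uninformed rates: average buys on days without
# informed buying, and average sells on days without informed selling.
eb0 <- mean(Buy[state != "good"])
es0 <- mean(Sell[state != "bad"])
```

For the sample data, days 2 and 9 form the bad-news cluster and days 7 and 8 the good-news cluster, so that alpha0 is 0.4 and delta0 is 0.5.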
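Then, the probability of an information event is given by the proportion of days that fall into the good-news and bad-news clusters, and the remaining parameters are obtained in a similar way from the cluster proportions and the average numbers of buys and sells within each cluster. Through simulations, Gan et al. (2015) show that estimates calculated as above are proper candidates for the initial parameter values to be used in the MLE process. Ersan and Alici (2016) argue that the GAN estimates for the informed arrival rate, $\mu$, are biased, and they modify the clustering procedure accordingly.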
The R package InfoTrad provides five different functions: EHO(), LK(), YZ(), GAN() and EA(). The first two functions provide likelihood specifications, whereas the last three functions can be used to obtain parameter estimates for PIN. EHO() and LK() read a (T×2) data matrix and return the corresponding likelihood specification, which can be passed to an optimizer such as optim() to obtain the parameter estimates. We specify EHO() and LK() as simple likelihood specifications rather than functions that execute the MLE procedure. This is because ML estimates vary depending on the optimization procedure. Users who wish to develop alternative estimation techniques, based on the proposed likelihood factorizations, can use EHO() and LK(). This is the underlying reason why those functions do not have built-in optimization procedures. By specifying EHO() and LK() as simple likelihood functions, we give developers the flexibility to select the most suitable optimization procedure for their application.
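To illustrate this flexibility without relying on the package itself, the sketch below minimizes a stand-in negative log-likelihood with the box-constrained "L-BFGS-B" method of optim(). The function negll() is hypothetical and is not InfoTrad's EHO() or LK(); the bounds and the parameter scaling are arbitrary choices made for the example.

```r
# Stand-in negative log-likelihood (hypothetical; not InfoTrad's EHO()/LK()),
# evaluated on the log scale with a log-sum-exp over the three day types.
negll <- function(par, data) {
  alpha <- par[1]; delta <- par[2]; mu <- par[3]; eb <- par[4]; es <- par[5]
  B <- data[, 1]; S <- data[, 2]
  l_none <- log(1 - alpha) +
    dpois(B, eb, log = TRUE) + dpois(S, es, log = TRUE)
  l_bad  <- log(alpha) + log(delta) +
    dpois(B, eb, log = TRUE) + dpois(S, mu + es, log = TRUE)
  l_good <- log(alpha) + log(1 - delta) +
    dpois(B, mu + eb, log = TRUE) + dpois(S, es, log = TRUE)
  m <- pmax(l_none, l_bad, l_good)
  -sum(m + log(exp(l_none - m) + exp(l_bad - m) + exp(l_good - m)))
}

Buy  <- c(350, 250, 500, 552, 163, 345, 847, 923, 123, 349)
Sell <- c(382, 500, 463, 550, 200, 323, 456, 342, 578, 455)
data <- cbind(Buy, Sell)
par0 <- c(0.5, 0.5, 300, 400, 500)   # (alpha, delta, mu, eb, es)

# Box constraints keep the probabilities inside (0, 1) and the rates positive
fit <- optim(par0, negll, data = data, method = "L-BFGS-B",
             lower = c(0.01, 0.01, 1, 1, 1),
             upper = c(0.99, 0.99, 2000, 2000, 2000),
             control = list(parscale = c(1, 1, 100, 100, 100)))

pin <- fit$par[1] * fit$par[3] /
  (fit$par[1] * fit$par[3] + fit$par[4] + fit$par[5])
```

The same pattern applies to any other optimizer that accepts a function of the parameter vector; only the call to optim() changes.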
For researchers who want to calculate an estimate of PIN, the YZ(), GAN() and EA() functions have built-in optimization procedures. Those functions read a likelihood specification value along with the data. The likelihood specification can be set either to "LK" or to "EHO", with "LK" being the default. All estimation functions use the neldermead() function of the nloptr package to conduct MLE with the specified factorization. The GAN() and EA() functions also use the hclust() function of Müllner (2013) to conduct clustering. The output of these three functions is an object that provides the parameter estimates ($\hat{\alpha}$, $\hat{\delta}$, $\hat{\mu}$, $\hat{\varepsilon}_b$, $\hat{\varepsilon}_s$), the likelihood value and the PIN estimate.
An example is provided below for EHO() with sample data and initial parameter values. Notice that the first column of the sample data holds the number of buy orders and the second column holds the number of sell orders. We use optim() with the 'Nelder-Mead' method to execute MLE; however, developers are free to use other methods as well.
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
Buy<-c(350,250,500,552,163,345,847,923,123,349)
Sell<-c(382,500,463,550,200,323,456,342,578,455)
data=cbind(Buy,Sell)
# Initial parameter values
# par0 = (alpha, delta, mu, epsilon_b, epsilon_s)
par0 = c(0.5,0.5,300,400,500)
# Call EHO function
EHO_out = EHO(data)
model = optim(par0, EHO_out, gr = NULL, method = c("Nelder-Mead"), hessian = FALSE)
## Parameter Estimates
model$par[1] # Estimate for alpha
# [1] 0.9111102
model$par[2] # Estimate for delta
#[1] 0.0001231429
model$par[3] # Estimate for mu
# [1] 417.1497
model$par[4] # Estimate for eb
# [1] 336.075
model$par[5] # Estimate for es
# [1] 466.2539
## Estimate for PIN
(model$par[1]*model$par[3])/((model$par[1]*model$par[3])+model$par[4]+model$par[5])
# [1] 0.3214394
####
In this example, the PIN estimate is approximately 0.32.
An example is provided below for the LK() function with sample data and initial parameter values. Notice that the first column of the sample data holds the number of buy orders and the second column holds the number of sell orders. We use optim() with the 'Nelder-Mead' method to execute MLE; however, developers are free to use other methods as well.
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
Buy<-c(350,250,500,552,163,345,847,923,123,349)
Sell<-c(382,500,463,550,200,323,456,342,578,455)
data=cbind(Buy,Sell)
# Initial parameter values
# par0 = (alpha, delta, mu, epsilon_b, epsilon_s)
par0 = c(0.5,0.5,300,400,500)
# Call LK function
LK_out = LK(data)
model = optim(par0, LK_out, gr = NULL, method = c("Nelder-Mead"), hessian = FALSE)
## The structure of the model output ##
model
#$par
#[1] 0.480277 0.830850 315.259805 296.862318 400.490830
#$value
#[1] -44343.21
#$counts
#function gradient
# 502 NA
#$convergence
#[1] 1
#$message
#NULL
## Parameter Estimates
model$par[1] # Estimate for alpha
# [1] 0.480277
model$par[2] # Estimate for delta
# [1] 0.830850
model$par[3] # Estimate for mu
# [1] 315.259805
model$par[4] # Estimate for eb
# [1] 296.862318
model$par[5] # Estimate for es
# [1] 400.4908
## Estimate for PIN
(model$par[1]*model$par[3])/((model$par[1]*model$par[3])+model$par[4]+model$par[5])
# [1] 0.178391
####
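For the given sample data and initial parameter vector, the PIN estimate is approximately 0.18. Note that the reported convergence code of 1 indicates that optim() reached its iteration limit before converging.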
An example is provided below for the YZ() function with sample data. Notice that the first column of the sample data holds the number of buy orders and the second column holds the number of sell orders. The YZ() function does not require an initial parameter vector.
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
Buy<-c(350,250,500,552,163,345,847,923,123,349)
Sell<-c(382,500,463,550,200,323,456,342,578,455)
data<-cbind(Buy,Sell)
# Parameter estimates using the LK factorization of Lin and Ke (2011)
# with the algorithm of Yan and Zhang (2012).
# Default factorization is set to be "LK"
result=YZ(data)
print(result)
# Alpha: 0.3999999
# Delta: 0
# Mu: 442.1667
# Epsilon_b: 263.3333
# Epsilon_s: 424.9
# Likelihood Value: 44371.84
# PIN: 0.2004457
# Parameter estimates using the EHO factorization of Easley et al. (2010)
# with the algorithm of Yan and Zhang (2012).
result=YZ(data,likelihood="EHO")
print(result)
# Alpha: 0.9000001
# Delta: 0.9000001
# Mu: 489.1111
# Epsilon_b: 396.1803
# Epsilon_s: 28.72002
# Likelihood Value: Inf
# PIN: 0.3321033
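For the given sample data, the reported PIN estimate is approximately 0.20 with the LK factorization and 0.33 with the EHO factorization.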
An example is provided below for the GAN() function with sample data. Notice that the first column of the sample data holds the number of buy orders and the second column holds the number of sell orders. The GAN() function does not require an initial parameter vector.
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
Buy<-c(350,250,500,552,163,345,847,923,123,349)
Sell<-c(382,500,463,550,200,323,456,342,578,455)
data<-cbind(Buy,Sell)
# Parameter estimates using the LK factorization of Lin and Ke (2011)
# with the algorithm of Gan et al. (2015).
# Default factorization is set to be "LK"
result=GAN(data)
print(result)
# Alpha: 0.3999998
# Delta: 0
# Mu: 442.1667
# Epsilon_b: 263.3333
# Epsilon_s: 424.9
# Likelihood Value: 44371.84
# PIN: 0.2044464
# Parameter estimates using the EHO factorization of Easley et al. (2010)
# with the algorithm of Gan et al. (2015)
result=GAN(data, likelihood="EHO")
print(result)
# Alpha: 0.3230001
# Delta: 0.4780001
# Mu: 481.3526
# Epsilon_b: 356.6359
# Epsilon_s: 313.136
# Likelihood Value: Inf
# PIN: 0.1884001
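For the given sample data, the reported PIN estimate is approximately 0.20 with the LK factorization and 0.19 with the EHO factorization.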
An example is provided below for the EA() function with sample data. Notice that the first column of the sample data holds the number of buy orders and the second column holds the number of sell orders. The EA() function does not require an initial parameter vector.
library(InfoTrad)
# Sample Data
# Buy Sell
#1 350 382
#2 250 500
#3 500 463
#4 552 550
#5 163 200
#6 345 323
#7 847 456
#8 923 342
#9 123 578
#10 349 455
Buy=c(350,250,500,552,163,345,847,923,123,349)
Sell=c(382,500,463,550,200,323,456,342,578,455)
data=cbind(Buy,Sell)
# Parameter estimates using the LK factorization of Lin and Ke (2011)
# with the modified clustering algorithm of Ersan and Alici (2016).
# Default factorization is set to be "LK"
result=EA(data)
print(result)
# Alpha: 0.9511418
# Delta: 0.2694005
# Mu: 76.7224
# Epsilon_b: 493.7045
# Epsilon_s: 377.4877
# Likelihood Value: 43973.71
# PIN: 0.07728924
# Parameter estimates using the EHO factorization of Easley et al. (2010)
# with the modified clustering algorithm of Ersan and Alici (2016).
result=EA(data,likelihood="EHO")
print(result)
# Alpha: 0.9511418
# Delta: 0.2694005
# Mu: 76.7224
# Epsilon_b: 493.7045
# Epsilon_s: 377.4877
# Likelihood Value: 43973.71
# PIN: 0.07728924
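For the given sample data, the reported PIN estimate is approximately 0.08 with both the LK and the EHO factorization.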
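In this section, we investigate the performance of the estimates obtained for PIN and the model parameters through simulations. Given a parameter vector $\Theta$, the daily numbers of buy and sell orders are generated from the structural model as

$$(B_t, S_t) \sim \begin{cases} \left(\text{Pois}(\varepsilon_b),\ \text{Pois}(\varepsilon_s)\right) & \text{if no news}, \\ \left(\text{Pois}(\varepsilon_b),\ \text{Pois}(\mu+\varepsilon_s)\right) & \text{if bad news}, \\ \left(\text{Pois}(\mu+\varepsilon_b),\ \text{Pois}(\varepsilon_s)\right) & \text{if good news}. \end{cases}$$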
We then form the joint likelihood function represented by equation (4) in EHO form or by equation (5) in LK form and obtain the estimates using the YZ(), GAN() or EA() methods.
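The data-generating step of such a simulation can be sketched as follows. The helper simulate_pin_data() is hypothetical, and the seed, sample length and parameter values are illustrative only; they are not the values used to produce the tables in this section.

```r
# Hypothetical helper: simulate daily buys and sells from the structural model
simulate_pin_data <- function(n_days, alpha, delta, mu, eb, es) {
  event <- rbinom(n_days, 1, alpha) == 1            # information event today?
  bad   <- event & (rbinom(n_days, 1, delta) == 1)  # bad news given an event
  good  <- event & !bad                             # good news given an event
  Buy  <- rpois(n_days, eb + mu * good)  # informed buying on good-news days
  Sell <- rpois(n_days, es + mu * bad)   # informed selling on bad-news days
  cbind(Buy = Buy, Sell = Sell)
}

set.seed(1)                              # illustrative seed
true_par <- c(alpha = 0.4, delta = 0.5, mu = 300, eb = 400, es = 500)
sim <- simulate_pin_data(60, true_par["alpha"], true_par["delta"],
                         true_par["mu"], true_par["eb"], true_par["es"])

# PIN implied by the illustrative parameter vector
true_pin <- unname(true_par["alpha"] * true_par["mu"] /
  (true_par["alpha"] * true_par["mu"] + true_par["eb"] + true_par["es"]))
```

The simulated matrix has the same (T×2) layout as the sample data, so it can be passed directly to an estimation routine and the resulting estimates compared with true_par and true_pin.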
The results are presented in Table 1, which indicates that the YZ() method with LK() factorization provides the PIN estimates with the lowest MAE. Although the clustering algorithms, especially the GAN() method, provide powerful estimates of some parameters, the YZ() method with EHO() factorization provides the best estimates for the arrival rate of informed traders, $\mu$.
Method | Factorization | |||||||
---|---|---|---|---|---|---|---|---|
YZ | LK | 0.075 | 0.199 | 0.059 | 415.2 | 104.3 | 109.0 | |
YZ | EHO | 0.134 | 0.428 | 0.310 | 154.6 | 288.3 | 247.4 | |
GAN | EHO | 0.101 | 0.087 | 0.083 | 479.4 | 124.1 | 117.3 | |
GAN | LK | 0.101 | 0.087 | 0.083 | 479.5 | 123.8 | 118.1 | |
EA | LK | 0.102 | 0.268 | 0.274 | 484.6 | 128.7 | 119.3 | |
EA | EHO | 0.102 | 0.270 | 0.275 | 483.1 | 128.5 | 107.8 |
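For reference, a mean absolute error of this kind can be computed as in the sketch below, where each row is one simulated sample and each column is one estimated quantity. The helper mae() and the numbers in the example are made up for illustration and are unrelated to the values in the table.

```r
# Hypothetical helper: column-wise mean absolute error.
# estimates, truth: matrices with one row per simulated sample
# and one column per quantity (e.g. PIN, mu).
mae <- function(estimates, truth) {
  colMeans(abs(estimates - truth))
}

# Made-up example with three simulated samples
est   <- cbind(pin = c(0.20, 0.16, 0.18), mu = c(310, 280, 305))
truth <- cbind(pin = c(0.18, 0.18, 0.18), mu = c(300, 300, 300))
errors <- mae(est, truth)
```

A lower value in a column means that the corresponding method recovers that quantity more accurately on average.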
A more general way of examining the accuracy of PIN estimates is proposed in several studies (e.g., Lin and Ke (2011), Gan et al. (2015), Ersan and Alici (2016)). In this setting, we fix the trade intensity, I = 2500. The total trade intensity represents the overall presence of informed and uninformed traders, that is, $I = \alpha\mu + \varepsilon_b + \varepsilon_s$.
The results are presented in Table 2. Similar to the first simulation, GAN() captures the true nature of some parameters well, while the YZ() method with EHO() factorization performs best when estimating the arrival of informed traders, $\mu$.
Method | Factorization | |||||||
---|---|---|---|---|---|---|---|---|
YZ | LK | 0.323 | 0.428 | 0.432 | 1,212.0 | 303.4 | 325.0 | |
YZ | EHO | 0.237 | 0.437 | 0.357 | 942.9 | 386.0 | 470.2 | |
GAN | LK | 0.348 | 0.380 | 0.410 | 1,218.7 | 314.5 | 323.3 | |
GAN | EHO | 0.347 | 0.357 | 0.397 | 1,216.2 | 328.5 | 339.5 | |
EA | LK | 0.348 | 0.437 | 0.421 | 1,224.0 | 325.1 | 336.3 | |
EA | EHO | 0.347 | 0.428 | 0.413 | 1,222.0 | 331.3 | 345.9 |
This paper provides a short survey of the five most widely used estimation techniques for the probability of informed trading (PIN) measure. We introduce the R package InfoTrad, covering estimation procedures for PIN using the EHO and LK factorizations along with the YZ, GAN and EA algorithms (EHO(), LK(), YZ(), GAN() and EA()). The functions EHO() and LK() read a (T×2) matrix in which the first column contains the total number of buy orders and the second column contains the total number of sell orders on a given trading day t. The functions YZ(), GAN() and EA() read the same data matrix together with a likelihood specification, which is set to LK by default. These functions do not require an initial parameter vector to obtain the parameter estimates when calculating PIN. All three functions use the neldermead() method of nloptr as the built-in optimization procedure for MLE. YZ(), GAN() and EA() produce an object that gives the parameter estimates, the likelihood value and the PIN estimate.
This research is supported by the Scientific and Technological Research Council of Turkey (TUBITAK), Grant Number: 116K335.
CRAN packages used: InfoTrad, FinAsym, PIN, nloptr
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Çelik & Tiniç, "InfoTrad: An R package for estimating the probability of informed trading", The R Journal, 2018
BibTeX citation
@article{RJ-2018-013, author = {Çelik, Duygu and Tiniç, Murat}, title = {InfoTrad: An R package for estimating the probability of informed trading}, journal = {The R Journal}, year = {2018}, note = {https://rjournal.github.io/}, volume = {10}, issue = {1}, issn = {2073-4859}, pages = {31-42} }