PINstimation: An R Package for Estimating Probability of Informed Trading Models

The purpose of this paper is to introduce the R package PINstimation. The package is designed for fast and accurate estimation of the probability of informed trading models through the implementation of well-established estimation methods. The models covered are the original PIN model (Easley and O’Hara 1992; Easley et al. 1996), the multilayer PIN model (Ersan 2016), the adjusted PIN model (Duarte and Young 2009), and the volume-synchronized PIN (Easley, López de Prado, and O’Hara 2011, 2012). These core functionalities are supplemented with utilities for data simulation, trade classification, and trade aggregation. In addition to a detailed overview of the package functions, we provide a brief theoretical review of the main methods implemented in the package. Further, we provide examples of use of the package on trade-level data for 58 Swedish stocks, and report straightforward, comparative, and intriguing findings on informed trading. These examples aim to highlight the capabilities of the package in tackling relevant research questions and to illustrate the wide range of uses of PINstimation for both academics and practitioners.


Introduction
Informed trading indicates the presence of information asymmetry in a given market, and is usually attributed to trading with better-quality information and/or more sophisticated tools for analyzing available information (see Ahn et al., 2008). Given its impact on prices and liquidity, researchers have dedicated considerable effort to the measurement of informed trading, and to the characterization of its relevant aspects (Berkman et al., 2014; Chang et al., 2014; Bongaerts et al., 2014; Hsieh and He, 2014; Yin and Zhao, 2015; Guo and Qiu, 2016). The growth in informed trading measures has been made possible by the availability of rich datasets, and has come as a response to the continuously evolving nature of trading in financial markets.
Despite the plethora of alternative and more recent measures, "fundamental" measures developed by the pioneering works are still widely used in academic research. Some prominent measures include the relative trade informativeness measure (Hasbrouck, 1991), the percentage-price-impact measure (Huang and Stoll, 1996), the adverse selection component (Huang and Stoll, 1997), and the adverse information parameter (Madhavan et al., 1997). Above the rest, the probability of informed trading (PIN; Easley and O'Hara, 1992; Easley et al., 1996) has probably been the most widely used measure of informed trading in the literature. Easley and O'Hara, beginning with their foundational work in 1987 and continuing through subsequent studies in the 1990s and 2000s, developed, tested, and refined the PIN measure to quantify informed trading in financial markets. A major factor behind the persistent prominence of the PIN model lies in the branch of studies addressing the limitations of the model, remedying the challenges of its estimation, and extending the original model.
Due to the rapid evolution of trading in financial markets, the estimation of the original PIN model has become vulnerable to errors, and the model, together with its assumptions as first suggested, has faced difficulties in matching the real world. Over the years, many extensions and improvements to the PIN model have been developed, addressing various shortcomings of the original model and its estimation challenges. However, because of their more complex theoretical underpinnings and implementation details, most of these models have not been adopted by the wider academic and practitioner audience. To address these issues, the PINstimation package seeks to provide easy and convenient access to these extensions of the PIN model. To this end, the package is designed in a compact structure that allows users to obtain informed-trading estimates directly, using only intraday trading data. The package includes easy-to-use functions that accurately implement preexisting and novel remedial solutions to estimation challenges as suggested in the literature. Besides, it provides a rich toolbox for simulating datasets, which can help researchers conduct robust and reliable comparative analyses. By introducing the package, we hope to contribute to further use of the PIN models in academic research, to improve the validity and quality of scientific findings within the field, and eventually to heighten the interest of practitioners in these models.
The R Journal Vol. 15/2, June 2023, ISSN 2073-4859
To our knowledge, there are two packages available for the estimation of PIN models: pinbasic (Recktenwald, 2018, 2019) and InfoTrad (Celik and Tiniç, 2017, 2018). Both packages have limited scope as they focus solely on the original PIN model (Easley et al., 1996). In addition to scope differences, other motivations for users to shift to PINstimation are that (1) the package pinbasic has recently been placed in the archive by CRAN, and (2) the package InfoTrad in its current version (V.1.2) is not error-free.¹
PINstimation contains functions to estimate the probability of informed trading (PIN) as introduced by Easley and O'Hara (1992) and Easley et al. (1996). The estimation procedures implemented in these functions help to avoid floating-point errors, boundary solutions, and convergence to local maxima. Besides, the package provides a comprehensive treatment of two important extensions of the PIN model. The first is the multilayer probability of informed trading model (MPIN model; Ersan, 2016), which, in contrast to the original PIN model, allows for multiple information types, and assumes that information events cluster in layers with uniform informed trading intensity. Relaxing the assumption of a unique information type allows for a more realistic and accurate treatment of informed trading. However, it poses at least two additional challenges: (1) the larger parameter space of the MPIN model makes it more likely that the maximum likelihood estimate lies on the parameter boundary, and (2) an accurate determination of the number of information layers is crucial to produce reliable estimates of the probability of informed trading. PINstimation tackles these two issues by including a function to generate strategic² initial parameter sets, and three functions for estimating the number of information layers in datasets.
The second extension is the adjusted probability of informed trading model (AdjPIN model; Duarte and Young, 2009). This model challenges the assumption that trading is only performed by uninformed liquidity traders and informed traders, and accounts for the possibility of liquidity shocks to both the buy and sell sides. PINstimation provides functions to estimate the AdjPIN measure and the PSOS (probability of a symmetric order flow shock), as well as three functions to generate initial sets of parameters for maximum-likelihood estimation.
In addition to the standard maximum-likelihood method, the package provides a novel implementation of the estimation of PIN models via the expectation-conditional maximization algorithm. The speed and accuracy of this algorithm have recently been documented in Ghachem and Ersan (2022).
As for informed trading in high-frequency settings, PINstimation enables users to estimate the volume-synchronized probability of informed trading (VPIN; Easley et al., 2011, 2012). This measure is an adaptation of the PIN measure to high-frequency trading, and aims to capture the order flow toxicity in trading data.
Finally, the package offers two supporting utilities: (1) a rich simulation toolbox to simulate data according to the assumptions of the different PIN models and, thereby, test the accuracy of estimation algorithms, and (2) a fast implementation of the prominent trade classification algorithms, which allow users to generate daily sequences of buyer-initiated and seller-initiated trades from raw trade-level data. Such sequences are then used as inputs for the estimation of the PIN, MPIN, and AdjPIN models.
The remainder of this paper is organized as follows. The next section provides a brief introduction to the theoretical background of PIN models. The third section presents a detailed description of the package and illustrates its applications through several examples. The fourth section reports and discusses the results of two empirical investigations conducted using the package. The last section concludes with a summary of the package capabilities and a discussion of its potential extensions.

Theoretical background
PIN model
Easley and O'Hara (1992) developed a model where the change in the order imbalance is associated with the presence of informed trading. The information can be either positive, leading to excess trading on the buy side, or negative, leading to excess trading on the sell side. On days with no information event, there are only uninformed traders in the market. On days with a good-information (bad-information) event, informed buyers (sellers) join uninformed buyers and sellers to trade on the information. Statistically, Easley et al. (1996) model total trades by a finite Poisson mixture model, where the numbers of buyer-initiated and seller-initiated trades each follow a Poisson distribution.
The likelihood of observing B buyer-initiated trades (or buys) and S seller-initiated trades (or sells) on a trading day is stated as:
$$L(\Theta \mid B, S) = \alpha\delta\, e^{-\varepsilon_b}\frac{\varepsilon_b^{B}}{B!}\, e^{-(\mu+\varepsilon_s)}\frac{(\mu+\varepsilon_s)^{S}}{S!} + \alpha(1-\delta)\, e^{-(\mu+\varepsilon_b)}\frac{(\mu+\varepsilon_b)^{B}}{B!}\, e^{-\varepsilon_s}\frac{\varepsilon_s^{S}}{S!} + (1-\alpha)\, e^{-\varepsilon_b}\frac{\varepsilon_b^{B}}{B!}\, e^{-\varepsilon_s}\frac{\varepsilon_s^{S}}{S!} \qquad (1)$$
where Θ = (α, δ, µ, ε_b, ε_s) is the set of parameters to be estimated: α is the probability of occurrence of information events, δ is the conditional probability that the information event is a bad event, µ is the informed trading intensity, and ε_b and ε_s are the uninformed trading intensities on the buy and sell sides, respectively. For a time period of N days, the joint likelihood of observing a set of daily buys and sells, M = (B_i, S_i), i = 1, ..., N, is presented as:
$$L(\Theta \mid M) = \prod_{i=1}^{N} L(\Theta \mid B_i, S_i) \qquad (2)$$
Typically, the estimation of the five parameters is performed via maximum likelihood estimation (MLE). Once the parameter set Θ is estimated, the probability of informed trading (PIN) is calculated as:
$$\mathrm{PIN} = \frac{\alpha\mu}{\alpha\mu + \varepsilon_b + \varepsilon_s} \qquad (3)$$
The PIN model relies on several assumptions. First, trading days are assumed to be independent of each other, an assumption that leads to the joint likelihood in Eq. (2). Tests of the independence assumption provide supportive evidence, and sample results are reported in Easley et al. (1997). Second, information events are assumed to occur outside trading hours. Third, at most one information event can occur on any given trading day. Finally, information events are assumed to be of a single type, i.e., leading to the same magnitude of informed trading µ whenever they occur.
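As a minimal sketch of how the daily mixture likelihood and the PIN ratio can be evaluated, the base R code below computes the log of the joint likelihood with Poisson log-densities and a log-sum-exp step. The function names pin_loglik() and pin_value() are illustrative and are not part of PINstimation; the parameter values are arbitrary.

```r
# Log-likelihood of the PIN model for vectors of daily buys (B) and sells (S).
# theta = c(alpha, delta, mu, eps_b, eps_s); all names here are illustrative.
pin_loglik <- function(theta, B, S) {
  alpha <- theta[1]; delta <- theta[2]; mu <- theta[3]
  eb <- theta[4]; es <- theta[5]
  # Log of each mixture component: no-information, bad-news, good-news days
  l_none <- log(1 - alpha) + dpois(B, eb, log = TRUE) + dpois(S, es, log = TRUE)
  l_bad  <- log(alpha) + log(delta) +
    dpois(B, eb, log = TRUE) + dpois(S, es + mu, log = TRUE)
  l_good <- log(alpha) + log(1 - delta) +
    dpois(B, eb + mu, log = TRUE) + dpois(S, es, log = TRUE)
  comps <- cbind(l_none, l_bad, l_good)
  m <- apply(comps, 1, max)                 # log-sum-exp for numerical stability
  sum(m + log(rowSums(exp(comps - m))))
}

# PIN as the ratio of informed to total expected trading intensity
pin_value <- function(theta) {
  unname(theta[1] * theta[3] / (theta[1] * theta[3] + theta[4] + theta[5]))
}

theta <- c(alpha = 0.4, delta = 0.5, mu = 200, eps_b = 300, eps_s = 300)
pin_value(theta)                            # 0.4 * 200 / (0.4 * 200 + 600)
pin_loglik(theta, B = c(300, 500), S = c(310, 290))
```

Working on the log scale is a design choice of this sketch; the factorizations cited in the paper address the same numerical concern in their own ways.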

MPIN model
The MPIN model (Ersan, 2016) is a generalization of the PIN model that allows for multiple information event types (information layers). When the number of layers J equals 1, the model reduces to the original PIN model. The model relaxes several assumptions of the PIN model. First, information events can be of different types, i.e., generate different magnitudes of informed trading. Second, more than one information event can occur on any given day. Third, the model allows for the occurrence of information events within trading hours. The model's ability to handle multiple information types enables these last two features: it can aggregate the effects of multiple events, or identify instances of partially disseminated informed trading on any given day, by introducing an additional layer.
The parameter set of an MPIN model with J layers, Θ_m = (α_1, ..., α_J, δ_1, ..., δ_J, µ_1, ..., µ_J, ε_b, ε_s), has length 3J + 2, where α_j (j = 1, ..., J) is the probability of occurrence of an information event in layer j, δ_j is the (conditional) probability that the event in layer j is a bad-information event, µ_j is the informed trading intensity in layer j, and ε_b and ε_s are the uninformed trading intensities. Similar to the PIN model, the multilayer probability of informed trading (MPIN) is the ratio of the expected informed trading intensity to the expected total trading intensity:
$$\mathrm{MPIN} = \frac{\sum_{j=1}^{J}\alpha_j\mu_j}{\sum_{j=1}^{J}\alpha_j\mu_j + \varepsilon_b + \varepsilon_s} \qquad (4)$$
The estimation of the MPIN model using the standard maximum-likelihood estimation requires a prior estimation of the number of information layers in the data. An algorithm for detecting the number of layers in a dataset has already been suggested by Ersan (2016). Ersan and Ghachem (2022a) improved this algorithm by refining the correction for the order imbalance.

AdjPIN model
Duarte and Young (2009) suggest an alternative, extended informed trading model to address two main concerns. First, for many stocks, there is a well-documented positive correlation between the numbers of buyer- and seller-initiated trades (Duarte and Young, 2009). This fact cannot be modelled by the original PIN model. Second, it is difficult to capture the large variance of buys and sells with the PIN model if investors are restricted to be of two types: informed and liquidity traders. Accordingly, the authors introduce an extended model that includes a symmetric order-flow shock to both the buy and sell sides. On any given day, in addition to information events, a positive liquidity shock, symmetric in buys and sells, can occur. In addition to the adjusted PIN measure (AdjPIN) capturing the probability of informed trading, the model introduces the probability of symmetric order flow shock (PSOS), which measures the probability that a trade occurs due to a symmetric liquidity shock.
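The parameter set of the original AdjPIN model Θ_a = (α, δ, θ, θ′, µ_b, µ_s, ε_b, ε_s, ∆_b, ∆_s) has 10 elements: α is the probability of occurrence of an information event; δ is the probability that the information event is a bad event; µ_b (µ_s) is the informed trading intensity on the buy (sell) side; ε_b (ε_s) is the uninformed trading intensity on the buy (sell) side; θ (θ′) is the probability of a symmetric order flow shock in the absence (presence) of an information event; and ∆_b (∆_s) is the additional arrival rate of buys (sells) caused by symmetric liquidity shocks. Once the parameter set Θ_a is estimated, typically through MLE, AdjPIN and PSOS are calculated as follows:
$$\mathrm{AdjPIN} = \frac{\alpha\left((1-\delta)\mu_b + \delta\mu_s\right)}{\alpha\left((1-\delta)\mu_b + \delta\mu_s\right) + (\Delta_b + \Delta_s)\left(\alpha\theta' + (1-\alpha)\theta\right) + \varepsilon_b + \varepsilon_s} \qquad (5)$$
$$\mathrm{PSOS} = \frac{(\Delta_b + \Delta_s)\left(\alpha\theta' + (1-\alpha)\theta\right)}{\alpha\left((1-\delta)\mu_b + \delta\mu_s\right) + (\Delta_b + \Delta_s)\left(\alpha\theta' + (1-\alpha)\theta\right) + \varepsilon_b + \varepsilon_s} \qquad (6)$$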
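The two ratios of Duarte and Young (2009) can be sketched in a few lines of base R. The helper adjpin_measures() and its argument names are hypothetical (not a PINstimation function), and the parameter values are arbitrary; the point is only to show how the informed, shock-driven, and uninformed intensities share the same denominator.

```r
# AdjPIN and PSOS from a given parameter set (illustrative helper)
adjpin_measures <- function(alpha, delta, theta, thetap,
                            mu_b, mu_s, eps_b, eps_s, d_b, d_s) {
  informed <- alpha * ((1 - delta) * mu_b + delta * mu_s)          # informed intensity
  shock    <- (d_b + d_s) * (alpha * thetap + (1 - alpha) * theta) # shock intensity
  total    <- informed + shock + eps_b + eps_s                     # total intensity
  c(adjpin = informed / total, psos = shock / total)
}

m <- adjpin_measures(alpha = 0.4, delta = 0.5, theta = 0.2, thetap = 0.3,
                     mu_b = 200, mu_s = 200, eps_b = 300, eps_s = 300,
                     d_b = 100, d_s = 100)
m   # adjpin = 80 / 728, psos = 48 / 728
```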

Computation issues for PIN, MPIN, and AdjPIN estimations
PIN estimation is prone to two main sources of numerical errors. First, large numbers of trades (buys and sells) in the power terms (Eq. 2) can lead to floating-point exceptions.³ While this was not a problem in the 1990s, most stocks in developed markets today are traded tens of thousands of times a day, rendering the likelihood function in Eq. (2) numerically intractable. Consequently, several logarithmic transformations (factorizations) of the likelihood function have been suggested to address this problem. Easley et al. (2010) were the first authors to suggest a factorization of the likelihood function; however, their transformation has been shown to generate non-negligible biases (Lin and Ke, 2011; Yan and Zhang, 2012). Lin and Ke (2011) provide another factorization leading to more accurate estimates. Finally, Ersan (2016) suggests a similar, yet simpler, factorization that leads to the same results as that of Lin and Ke (2011), but in shorter estimation times. More importantly, the Ersan (2016) factorization is easily generalized to the MPIN model. In line with previous efforts, Ersan and Ghachem (2022b) have suggested a factorization of the likelihood function of the AdjPIN model.
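The floating-point problem can be reproduced in base R with a single Poisson term. The sketch below uses arbitrary illustrative values for the number of buys and the uninformed buy rate; it does not implement any of the cited factorizations, and only shows why the likelihood has to be evaluated on the log scale.

```r
B     <- 20000     # buys in one day (illustrative)
eps_b <- 19000     # assumed uninformed buy rate (illustrative)

# Naive evaluation: exp(-eps_b) underflows to 0 and eps_b^B overflows to Inf
naive <- exp(-eps_b) * eps_b^B
is.nan(naive)      # TRUE, since 0 * Inf is NaN

# The same Poisson term on the log scale stays finite
log_term <- -eps_b + B * log(eps_b) - lgamma(B + 1)
is.finite(log_term)
all.equal(log_term, dpois(B, eps_b, log = TRUE))
```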
The second issue related to the estimation of the PIN, MPIN, and AdjPIN models is that the estimation procedure may not reach the global maximum of the (factorized) likelihood function. Several papers document that the ML estimation of the PIN model frequently yields boundary solutions rather than the global maximum (Yan and Zhang, 2012; Gan et al., 2015; Ersan and Alıcı, 2016). As a remedial solution, Gan et al. (2015) suggest the use of a single strategic parameter set generated by their hierarchical clustering algorithm. In contrast, Yan and Zhang (2012) recommend that the MLE procedure be started up to 125 (5 × 5 × 5) times using the initial parameter sets from their grid search algorithm, and that the estimates with the highest likelihood be picked. Similarly, Ersan and Alıcı (2016) settle for multiple MLE runs, but recommend five sets of parameters determined by their clustering algorithm, and show them to be sufficient to reach the global maximum.
Compared to the PIN model, achieving the global maximum in the MPIN model is harder given the larger dimension of the parameter set (3J + 2 parameters). The generalization of the grid search algorithm of Yan and Zhang (2012) would require up to 5^9 runs of MLE. In contrast, the clustering algorithm of Ersan and Alıcı (2016) is easily generalized and, in its basic setting, produces $\binom{J+5}{J}$ initial parameter sets.
As for the AdjPIN model, generating initial parameter sets has turned out to be challenging, given its large parameter set and the fact that preexisting generation algorithms do not allow a straightforward adaptation to the model. Therefore, a large number of studies have relied on a limited number of randomly generated initial parameter sets (see e.g. Duarte and Young, 2009). Recently, Cheng and Lai (2021) suggested an extension of the grid-search algorithm of Yan and Zhang (2012), while Ersan and Ghachem (2022b) suggested a novel method loosely based on the algorithm of Ersan and Alıcı (2016).
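The idea of running the maximization from several initial parameter sets and keeping the highest-likelihood solution can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated with arbitrary parameter values, the starting points are drawn at random (the cited algorithms choose them strategically), and the reparameterization via plogis() and exp() is a convenience of this sketch, not the approach of the package.

```r
set.seed(1)
# Simulate 60 days from a PIN model (illustrative true values)
n <- 60; alpha <- 0.4; delta <- 0.5; mu <- 200; eb <- 300; es <- 320
info <- rbinom(n, 1, alpha); bad <- rbinom(n, 1, delta)
B <- rpois(n, eb + mu * info * (1 - bad))
S <- rpois(n, es + mu * info * bad)

# Negative log-likelihood on an unconstrained scale
negll <- function(par) {
  a <- plogis(par[1]); d <- plogis(par[2]); r <- exp(par[3:5])  # mu, eps_b, eps_s
  comps <- cbind(
    log(1 - a) + dpois(B, r[2], log = TRUE) + dpois(S, r[3], log = TRUE),
    log(a) + log(d) + dpois(B, r[2], log = TRUE) + dpois(S, r[3] + r[1], log = TRUE),
    log(a) + log(1 - d) + dpois(B, r[2] + r[1], log = TRUE) + dpois(S, r[3], log = TRUE))
  m <- apply(comps, 1, max)
  -sum(m + log(rowSums(exp(comps - m))))
}

# Five random starting points
starts <- replicate(5, c(qlogis(runif(2, 0.1, 0.9)), log(runif(1, 50, 400)),
                         log(mean(B) * runif(1, 0.5, 1)),
                         log(mean(S) * runif(1, 0.5, 1))), simplify = FALSE)
fits <- lapply(starts, function(s)
  optim(s, negll, method = "Nelder-Mead", control = list(maxit = 2000)))

# Keep the run with the highest likelihood (lowest negative log-likelihood)
best <- fits[[which.min(sapply(fits, function(f) f$value))]]
a_hat <- plogis(best$par[1]); r_hat <- exp(best$par[3:5])
pin_hat <- a_hat * r_hat[1] / (a_hat * r_hat[1] + r_hat[2] + r_hat[3])
pin_hat
```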

The expectation-maximization algorithm
The estimation of PIN models has typically been performed through a direct maximization of the corresponding factorization of the likelihood function. The use of alternative estimation methods such as the Gibbs sampler has also been recently suggested (Griffin et al., 2021). More recently, Ghachem and Ersan (2022) have suggested the use of a variant of the expectation-maximization (EM) algorithm to estimate PIN models. In statistics, the EM algorithm is an iterative method for finding maximum likelihood estimates of parameters in finite-mixture models, where the model may depend on unobserved latent variables (Ng et al., 2012). In finite mixture models, each data observation is associated with an unobserved cluster label, i.e., a reference to the cluster it belongs to. In this respect, PIN models can be considered as Poisson mixture models with a finite number of clusters (Lin and Lee, 2015; Ghachem and Ersan, 2022). Ghachem and Ersan (2022) considered a variant of the EM algorithm, the expectation-conditional maximization (ECM) algorithm, for the estimation of the PIN models, and provided a detailed implementation and an empirical assessment of it. They show that the ECM algorithm yields faster and more accurate estimates than alternative methods.
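To make the mixture view concrete, the following base R sketch computes the expectation step for the PIN model, i.e., the posterior probability that each day belongs to the no-information, bad-news, or good-news cluster, followed by the resulting updates of the mixing probabilities. It is a generic EM illustration with arbitrary parameter values and a hypothetical function name, not the ECM implementation of Ghachem and Ersan (2022).

```r
# E-step: posterior cluster probabilities for each day (illustrative helper)
posterior_pin <- function(B, S, alpha, delta, mu, eps_b, eps_s) {
  logw <- cbind(
    none = log(1 - alpha) +
      dpois(B, eps_b, log = TRUE) + dpois(S, eps_s, log = TRUE),
    bad  = log(alpha * delta) +
      dpois(B, eps_b, log = TRUE) + dpois(S, eps_s + mu, log = TRUE),
    good = log(alpha * (1 - delta)) +
      dpois(B, eps_b + mu, log = TRUE) + dpois(S, eps_s, log = TRUE))
  w <- exp(logw - apply(logw, 1, max))   # subtract row maxima before exponentiating
  w / rowSums(w)
}

# Three days: balanced flow, excess buys, excess sells
post <- posterior_pin(B = c(300, 510, 295), S = c(305, 298, 520),
                      alpha = 0.4, delta = 0.5, mu = 200, eps_b = 300, eps_s = 300)
round(post, 3)

# M-step updates of the mixing probabilities implied by the posteriors
alpha_new <- mean(post[, "bad"] + post[, "good"])
delta_new <- sum(post[, "bad"]) / sum(post[, "bad"] + post[, "good"])
```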

VPIN measure
The volume-synchronized probability of informed trading (VPIN) metric was introduced by Easley et al. (2011) and Easley et al. (2012). VPIN aims at detecting order flow toxicity in high-frequency financial markets. As Easley et al. (2012) define it, "order flow is toxic when it adversely selects market makers, who may be unaware they are providing liquidity at a loss". It is shown that order flow becomes toxic prior to intraday shocks, such as the 2010 Flash Crash (Easley et al., 2011). The VPIN metric is based on the volume of trades arriving at the market, rather than on the number of trades. In a high-frequency framework, VPIN uses a volume clock rather than a time clock, forming equal-sized intraday volume buckets. The authors suggest a new trade classification algorithm, bulk volume classification. Accordingly, trades are aggregated in short time intervals (e.g., 1 minute), and standardized price changes are used to distribute the traded volume into buys and sells. As shown in Easley et al. (2008), the informed trading probability from the PIN model can be proxied by the ratio of the expected trade imbalance to the expected total volume of trades. In line with that, VPIN is calculated as follows:
$$\mathrm{VPIN} = \frac{\sum_{\tau=1}^{n} \left|V_\tau^{B} - V_\tau^{S}\right|}{nV} = \frac{\sum_{\tau=1}^{n} OI_\tau}{nV} \qquad (7)$$
where V is the predetermined volume bucket size, equal to the sum of the buy and sell volumes (V_τ^B + V_τ^S) in bucket τ, and OI_τ = |V_τ^B − V_τ^S| is the order imbalance in that bucket. In Easley et al. (2012), the volume bucket size is determined by dividing the average daily volume by 50. Each volume bucket is filled by aggregating the short time bars. In addition to the time bar (t) and the volume bucket size (V), the third parameter in the VPIN calculation is the sample length (n), which determines how many volume buckets are included. Thus, VPIN at any time is calculated based on the last n volume buckets, and is updated with each new volume bucket in a rolling-window process.
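The rolling-window computation can be sketched in base R, assuming that the buy and sell volumes of each bucket are already available (i.e., the bulk volume classification step has been done beforehand). The function vpin_rolling() and the volumes below are illustrative and are not part of the package.

```r
# VPIN over a rolling window of n buckets, given per-bucket buy and sell volumes
vpin_rolling <- function(vbuy, vsell, n) {
  stopifnot(length(vbuy) == length(vsell), n <= length(vbuy))
  V  <- vbuy + vsell            # bucket size (constant by construction)
  oi <- abs(vbuy - vsell)       # order imbalance per bucket
  idx <- n:length(oi)           # last bucket of each window
  sapply(idx, function(k) sum(oi[(k - n + 1):k]) / sum(V[(k - n + 1):k]))
}

vbuy  <- c(600, 500, 900, 300, 550)   # buy volume per bucket (illustrative)
vsell <- 1000 - vbuy                  # bucket size V = 1000
vpin_rolling(vbuy, vsell, n = 3)
```

With five buckets and n = 3, the function returns three values, one for each window ending at buckets 3, 4, and 5.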

The PINstimation package
The R package PINstimation provides utilities for the estimation of PIN models, partitioned into six categories:
• The standard PIN model (Easley and O'Hara, 1992; Easley et al., 1996), including various tools that remedy floating-point exceptions, provide efficient algorithms for initial parameter sets, and treat boundary solutions (Lin and Ke, 2011; Yan and Zhang, 2012; Gan et al., 2015; Ersan and Alıcı, 2016; Ke et al., 2019);
• Multilayer probability of informed trading or MPIN (Ersan, 2016) and tools for respective computational issues;
• Adjusted probability of informed trading or AdjPIN (Duarte and Young, 2009) and tools for respective computational issues;
• Volume-synchronized probability of informed trading or VPIN (Easley et al., 2011, 2012);
• Simulation utilities that generate datasets for testing and benchmarking the different PIN model estimation methods;
• Trade classification via commonly used algorithms and daily aggregation of buyer-initiated and seller-initiated trades.

Standard PIN model functions
The different factorizations of the likelihood function can be specified using the family of functions of the form fact_pin_*, where the suffix (*) can be one of ("eho","lk","e"), corresponding to the factorization of Easley et al. (2010), Lin and Ke (2011), and Ersan (2016), respectively. The different algorithms for the generation of initial parameter sets are implemented in the family of functions of the form initials_pin_*, where the suffix (*) can be one of ("yz","gwj","ea"), corresponding to the algorithm of Yan and Zhang (2012), Gan et al. (2015), and Ersan and Alıcı (2016), respectively.
The family of functions of the form pin_* allows the estimation of the PIN model using the aforementioned algorithms for the generation of initial parameter sets, where the suffix (*) can be one of ("yz","gwj","ea"). The function pin() estimates the PIN model using custom initial parameter sets.
These functions take two arguments: data and factorization. The data argument is a data frame that contains the daily number of buyer-initiated trades (buys) in the first column, and of seller-initiated trades (sells) in the second column. It is usually a dataset with around 60 (250) rows, representative of quarterly (yearly) data, although the user can choose any length. The factorization argument refers to the factorization of the likelihood function used in the maximum likelihood estimation. It can be one of ("none","EHO","LK","E").

Estimation output
The output of the estimation functions pin(), pin_yz(), pin_gwj() and pin_ea() is an S4 object of class estimate.pin. The slots of this object are presented in Table S1.

Examples
We estimate the PIN model using a preloaded dataset called dailytrades.
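A hedged sketch of such a call is given below. It assumes that the package is installed, that pin_ea() accepts a verbose argument, and that the returned object exposes the slots @pin and @parameters; these names follow our reading of the package documentation and should be checked against it. The code is guarded so that it does nothing when PINstimation is not available.

```r
pin_est <- NA_real_
if (requireNamespace("PINstimation", quietly = TRUE)) {
  # Preloaded dataset of daily buys and sells
  data("dailytrades", package = "PINstimation")
  # Initial parameter sets of Ersan and Alici (2016), default factorization
  fit <- PINstimation::pin_ea(dailytrades, verbose = FALSE)
  pin_est <- fit@pin            # slot name assumed; see Table S1
  print(fit@parameters)         # slot name assumed; see Table S1
}
pin_est
```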

MPIN model functions
The factorization of the likelihood function of the MPIN model can be evaluated using the function fact_mpin(). The initial sets of parameters can be obtained using a generalization of the clustering algorithm developed by Ersan (2016) via the function initials_mpin(). The number of layers in datasets can be detected using the family of functions of the form detectlayers_*, where the suffix (*) can be one of ("e","eg","ecm"), corresponding to the layer detection algorithm of Ersan (2016), Ersan and Ghachem (2022a), and Ghachem and Ersan (2022), respectively.
The function mpin_ml() estimates the MPIN model using the standard maximum likelihood estimation method, the factorization of Ersan (2016), and the initial parameter sets in Ersan and Alıcı (2016). It takes an argument layers that specifies the number of information layers assumed to be present in the data. If the user omits this argument, the number of layers is detected using the algorithm referred to in the argument detectlayers. This number is then used to generate the initial parameter sets, before proceeding to compute the maximum likelihood estimates of the MPIN model.
The function mpin_ecm() estimates the MPIN model via the ECM algorithm. It also takes the argument layers, the number of information layers assumed to be present in the data. If this number is provided by the user, the function finds the optimal estimates for each of the initial parameter sets, and then selects the parameter estimates that give the highest likelihood. If the argument layers is omitted, the function performs the aforementioned estimation for each number of layers in the integer set from 1 to 8, and then selects the optimal model as the one with the lowest value of the model selection criterion. The default criterion is the Bayesian Information Criterion (BIC). The function selectModel() allows the user to change the selection criterion.
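The selection step can be illustrated with the textbook definition of the BIC, using the fact that an MPIN model with J layers has 3J + 2 parameters. The log-likelihood values below are hypothetical placeholders, not estimation results, and the package's exact computation may differ from this sketch.

```r
# Textbook BIC for an MPIN model with a given number of layers
bic_mpin <- function(loglik, layers, ndays) {
  -2 * loglik + (3 * layers + 2) * log(ndays)
}

ndays   <- 60
logliks <- c(-4000, -3700, -3698)      # hypothetical maximized log-likelihoods, J = 1, 2, 3
bics    <- bic_mpin(logliks, layers = 1:3, ndays = ndays)
best_layers <- which.min(bics)         # lowest BIC wins
best_layers
```

In this made-up case the third layer improves the log-likelihood only marginally, so the penalty for its three extra parameters leads to the two-layer model being selected.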

Estimation output
The outputs of the functions mpin_ml() and mpin_ecm() are S4 objects of class estimate.mpin and estimate.mpin.ecm, respectively. The latter object inherits all slots of the former, with a few additional slots: three slots for information criteria (@AIC, @BIC, and @AWE), one slot for the hyperparameters (@hyperparams), one slot stating whether the information criterion is used (@optimal), and one slot for the active information criterion (@criterion). Common slots of both objects are presented in Table S2. Additional slots of estimate.mpin.ecm objects are described in Table S3.

AdjPIN model functions
The factorization of the likelihood function of the AdjPIN model can be specified using the function fact_adjpin(). Three functions are provided to generate initial parameter sets for the estimation of the AdjPIN model. First, initials_adjpin() implements the algorithm suggested in Ersan and Ghachem (2022b). Second, initials_adjpin_rnd() randomly generates initial parameter sets as follows: the buy rate parameters {ε_b, µ_b, ∆_b} are randomly generated from the interval (minB, maxB), where minB (maxB) is the smallest (largest) value of buys in the dataset, under the condition that ε_b + µ_b + ∆_b < maxB. Analogously, the sell rate parameters {ε_s, µ_s, ∆_s} are randomly generated from the interval (minS, maxS), where minS (maxS) is the smallest (largest) value of sells in the dataset, under the condition that ε_s + µ_s + ∆_s < maxS. Third, initials_adjpin_cl() generates initial parameter sets using an extension of the algorithm derived in Cheng and Lai (2021). In their paper, the authors assume that the probability of a liquidity shock is the same on no-information and information days, i.e., θ = θ′, and use a procedure similar to that of Yan and Zhang (2012).
The estimation of the AdjPIN model is performed using the function adjpin(). The argument method specifies the estimation method used: "ML" for the standard maximum-likelihood estimation, and "ECM" for the ECM algorithm. The standard maximum-likelihood method uses a factorization of the likelihood function and finds its maximum using the Nelder-Mead method. The expectation-conditional maximization (ECM) algorithm is suggested and detailed in Ghachem and Ersan (2022). The function allows for the estimation of the AdjPIN model (Duarte and Young, 2009), as well as related restricted models. Restricted models are models where pairs of parameters are assumed to be equal. The choice of a restricted model can be specified via the argument restricted. For instance, calling the function adjpin() with the argument restricted = list(mu = TRUE) corresponds to the estimation of the restricted AdjPIN model where µ_b = µ_s.

Estimation output
The output of the estimation function adjpin() is an S4 object of class estimate.adjpin. The slots of the estimate.adjpin object are presented in Table S4.

Examples
We estimate unrestricted, and restricted AdjPIN models using a preloaded dataset called dailytrades.
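A possible form of these calls is sketched below. It assumes that the package is installed, that adjpin() accepts the arguments method, restricted, and verbose as described in the package documentation, and that the estimates are stored in the slot @adjpin (slot name assumed; see Table S4). The code is guarded so that it is skipped when PINstimation is not available.

```r
adj <- NULL
if (requireNamespace("PINstimation", quietly = TRUE)) {
  data("dailytrades", package = "PINstimation")
  # Unrestricted AdjPIN model, estimated with the ECM algorithm
  fit_u <- PINstimation::adjpin(dailytrades, method = "ECM", verbose = FALSE)
  # Restricted model with mu_b = mu_s, standard maximum likelihood
  fit_r <- PINstimation::adjpin(dailytrades, method = "ML",
                                restricted = list(mu = TRUE), verbose = FALSE)
  adj <- c(unrestricted = fit_u@adjpin, restricted = fit_r@adjpin)
}
adj
```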

Volume-synchronized probability of informed trading (VPIN)
The Volume-Synchronized Probability of Informed Trading or VPIN is developed by Easley et al. (2011) and Easley et al. (2012), and refers to the adaptation of the original PIN model to the high frequency environment.

The function vpin()
The package provides the function vpin(), which computes VPIN using a dataset of high-frequency transactions containing three variables: timestamp, price, and volume. The three essential arguments of the function are: (1) timebarsize, the size of the time bars in seconds, with a default value of 60; (2) buckets, the number of buckets per average daily volume, used to determine the volume bucket size (VBS), with a default value of 50; and (3) samplength, the sample length or window of buckets used to calculate VPIN, with a default value of 50. Following Easley et al. (2011, 2012), the default value for the argument timebarsize is 1 minute (60 seconds). Since the argument timebarsize is expressed in seconds, the user can specify shorter time bar sizes as well.

Estimation output
The output of the estimation function vpin() is an S4 object of class estimate.vpin. The slots of the estimate.vpin object are presented in Table S5.

Examples
We use a dataset called hfdata included in the package, which is a simulated dataset containing sample timestamp, price, volume, bid, and ask for 100,000 high-frequency transactions. The function automatically selects the first three columns of the provided data, and thus ignores the last two columns (bid and ask). When the function vpin() is run with only the data argument, it uses the default parameters: a time bar size of 60 seconds, 50 buckets per average daily volume, and a sample length of 50 buckets.
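A hedged sketch of this call follows. The argument names timebarsize, buckets, and samplength are those described in the text; the verbose argument and the slot @vpin are assumptions based on our reading of the package documentation (see Table S5). The code is guarded so that it is skipped when PINstimation is not installed.

```r
vp <- NULL
if (requireNamespace("PINstimation", quietly = TRUE)) {
  data("hfdata", package = "PINstimation")
  # Default settings written out explicitly
  est <- PINstimation::vpin(hfdata, timebarsize = 60, buckets = 50,
                            samplength = 50, verbose = FALSE)
  vp <- est@vpin                # slot name assumed; vector of VPIN values
  print(summary(vp))
}
```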

Data simulation functions
We provide utilities to generate simulated data for the PIN, MPIN and AdjPIN models, via the functions generatedata_mpin() and generatedata_adjpin().
The function generatedata_mpin() generates datasets according to the assumptions of the generalized PIN model of Easley and O'Hara (1992) and Easley et al. (1996), as derived by Ersan (2016). The main arguments of the function are as follows: series, which represents the number of datasets to be generated; days, specifying the number of days in each dataset; layers, denoting the number of information layers to be generated in the data; parameters, defining the parameters Θ = (α, δ, µ, ε_b, ε_s) used in data generation; ranges, a list containing the ranges for some or all parameters; and maxlayers, representing the maximum number of layers in the generated datasets. If the user omits the argument parameters, the function checks the ranges of simulation parameters provided in the argument ranges. If the user provides a range for a given parameter, it is used in simulating the parameter value. Otherwise, a default range is used. The function generatedata_mpin() has three additional arguments that control the relationship between the theoretical values of the simulation parameters: eps_ratio, mu_ratio, and confidence. For more information about these arguments and the default parameter ranges, we refer the reader to the package documentation.
The function generatedata_adjpin() generates datasets according to the assumptions of the adjusted PIN model (Duarte and Young, 2009). The arguments of the function are as follows: series, representing the number of datasets; days, specifying the number of days in each dataset; parameters, defining the parameters Θ = (α, δ, θ, θ′, ε_b, ε_s, µ_b, µ_s, ∆_b, ∆_s) used in data generation; restricted, a list of binary variables specifying whether two analogous model parameters are assumed equal; and ranges, an alternative to parameters, determining the range for each parameter. The argument restricted can be specified as a list with four elements: (theta, eps, mu, d). Each of the four elements, when set to TRUE, corresponds to a given restriction on the AdjPIN model. For instance, theta = TRUE corresponds to the AdjPIN model where θ = θ′. If the user omits the argument restricted, no restrictions are applied, and the simulated data is generated to fit the unrestricted model. If the user omits the argument parameters, the function checks the ranges of the different simulation parameters contained in the argument ranges. If the user provides a range for a given parameter, it is used in simulating the value of that parameter. Otherwise, a default range is used. For more information about the function and the default parameter ranges, we refer the reader to the package documentation.

Simulation output
The output of the data generation functions generatedata_*(), where the suffix (*) can be one of ("mpin", "adjpin"), depends on the value of the argument series. If series = 1, the output is of class dataset; otherwise, the output is of class dataseries. The slot @datasets of the latter object contains the simulated data in the form of a list of dataset objects. The slots of the objects dataset and dataseries are presented in Table S6 and Table S7, respectively.

Examples
We generate several data series using the functions generatedata_mpin() and generatedata_adjpin() with different values for the arguments. Note that your results might differ from ours, as the data is randomly generated.

# (2) Using the argument 'ranges'
sdata <- generatedata_mpin(layers = 1, ranges = list(alpha = 0.3, delta = 0.7, eps.b = 1500, eps.s = 1800, mu = 8000))

# (3) Generate a series of 500 datasets with 2 layers where each layer has a minimum
# share of 0.1, eps.b is equal to 5000; and mu is between 5000 and 25000.

Trade aggregation function
The PIN model and its extensions use daily numbers of buyer-initiated and seller-initiated trades. Thus, the estimation of the probability of informed trading requires two initial tasks: first, the determination of the trade initiator in each trade (trade classification), and second, the aggregation of buys and sells on a daily basis.⁴ The function aggregate_trades() performs both tasks. PINstimation implements four trade classification algorithms: "Tick", "Quote", "LR", and "EMO".

⁴ In case the data already attaches buy and sell labels to the individual trades, there is no need to use the algorithms. Besides, when the detailed order book reflecting the arrival times of each electronic message is accessible, the high-precision chronological method of Odders-White (2000) is preferred. For other kinds of data, trade classification algorithms remain in use, despite non-negligible errors (see e.g., Lee and Ready, 1991; Piwowar and Wei, 2006; Aktas and Kryzanowski, 2014).
The trade classification algorithms are implemented in a single function aggregate_trades() that takes four main arguments: (1) data, a dataframe with four variables in the following order (timestamp, price, bid, ask); (2) algorithm, specifying the algorithm used to determine the trade initiator, accepting one of four possible values: ("Tick", "Quote", "LR", "EMO"); (3) timelag, representing the time lag in milliseconds used to calculate the lagged mid-quote for the methods "Quote", "EMO", and "LR", with a default value of 0 milliseconds; and (4) fullreport, determining whether the day variable is returned, with a default value of FALSE. The default time lag of 0 is chosen mainly for speed considerations. Some studies suggest better performance with a 5-second time lag (Lee and Ready, 1991) or a 1-second time lag (Piwowar and Wei, 2006; Aktas and Kryzanowski, 2014). Given today's high-speed financial markets, a much shorter time lag of, for example, 100 milliseconds can also be considered.
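The role of the time lag can be illustrated with a short base-R sketch that assigns to each trade the mid-quote prevailing a given number of milliseconds earlier. This is only an illustration of the idea, not the implementation used in aggregate_trades(); the helper lagged_midquote() and the sample values are hypothetical, and timestamps are assumed to be sorted and expressed in milliseconds.

```r
# Hypothetical helper: for each trade, the mid-quote prevailing `timelag_ms`
# milliseconds earlier. Assumes `timestamp_ms` is sorted in increasing order.
lagged_midquote <- function(timestamp_ms, bid, ask, timelag_ms = 0) {
  mid <- (bid + ask) / 2
  # Index of the last observation at or before (timestamp - timelag)
  idx <- findInterval(timestamp_ms - timelag_ms, timestamp_ms)
  idx[idx == 0] <- NA  # no quote is old enough
  mid[idx]
}

ts_ms <- c(0, 400, 900, 1600, 2100)
bid   <- c(1000, 1010, 1010, 1020, 1030)
ask   <- c(1020, 1030, 1030, 1040, 1050)
lag0    <- lagged_midquote(ts_ms, bid, ask, timelag_ms = 0)
lag1000 <- lagged_midquote(ts_ms, bid, ask, timelag_ms = 1000)
```

With a zero lag, each trade is compared with its own mid-quote; with a lag of 1000 milliseconds, the first three trades have no sufficiently old quote and are left missing in this sketch.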

Tick
Classifies a trade as a buy (sell) if its price is above (below) the closest different price of a previous trade.

Quote
Classifies a trade as a buy (sell) if its price is above (below) the mid-point of the bid-ask spread. Trades executed at the mid-spread are not classified.

LR
Classifies a trade as a buy (sell) if its price is above (below) the mid-spread (quote algorithm), and uses the tick algorithm if the trade price is at the mid-spread.

EMO
Classifies trades at the bid (ask) as sells (buys) and uses the tick algorithm to classify trades within the then prevailing bid-ask spread.
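The four rules above can be sketched in base R as follows. These helpers are hypothetical and written for readability, not the package's implementation; prices are expressed in whole minor currency units so that equality comparisons are exact, and no time lag is applied.

```r
# Hypothetical helpers illustrating the four rules; +1 = buy, -1 = sell,
# NA = unclassified. Inputs are vectors ordered by time.
classify_tick <- function(price) {
  change <- sign(c(NA, diff(price)))
  change[which(change == 0)] <- NA       # ignore unchanged prices
  # Carry the last non-zero price change forward
  for (i in seq_along(change)[-1]) {
    if (is.na(change[i])) change[i] <- change[i - 1]
  }
  change
}

classify_quote <- function(price, bid, ask) {
  side <- sign(price - (bid + ask) / 2)
  side[which(side == 0)] <- NA           # trades at the mid-spread
  side
}

classify_lr <- function(price, bid, ask) {
  side <- classify_quote(price, bid, ask)
  ifelse(is.na(side), classify_tick(price), side)  # tick rule at mid-spread
}

classify_emo <- function(price, bid, ask) {
  ifelse(price == ask, 1, ifelse(price == bid, -1, classify_tick(price)))
}

price <- c(1000, 1020, 1020, 1010, 1010)
bid   <- c( 990, 1000, 1010, 1000, 1010)
ask   <- c(1010, 1020, 1030, 1020, 1030)
cbind(tick = classify_tick(price), quote = classify_quote(price, bid, ask),
      lr = classify_lr(price, bid, ask), emo = classify_emo(price, bid, ask))
```

In this small example, the first trade stays unclassified under every rule because it has no preceding price and is neither at a quote nor away from the mid-spread, while the quote rule alone leaves the third and fourth trades unclassified.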

Estimation output
The output of the function aggregate_trades() is a dataframe of two (or three) variables. If the argument fullreport is omitted or set to FALSE, the output is a dataframe composed of two variables (b, s). Otherwise, the dataframe consists of three variables (day, b, s).
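The shape of this output can be reproduced with a short base-R sketch that counts classified trades per day. The input data frame and its column names are hypothetical, and the trades are assumed to be already classified (+1 for buys, -1 for sells).

```r
# Hypothetical input: one row per trade, already classified (+1 buy, -1 sell)
trades <- data.frame(
  timestamp = as.POSIXct(c("2020-10-01 09:00:01", "2020-10-01 09:00:05",
                           "2020-10-01 15:10:00", "2020-10-02 10:30:00",
                           "2020-10-02 11:45:10"), tz = "UTC"),
  side = c(1, -1, 1, -1, -1)
)
trades$day <- as.Date(trades$timestamp)

# Count buyer- and seller-initiated trades per day
b <- tapply(trades$side == 1, trades$day, sum)
s <- tapply(trades$side == -1, trades$day, sum)
full <- data.frame(day = as.Date(names(b)), b = as.integer(b),
                   s = as.integer(s))  # analogous to fullreport = TRUE
short <- full[, c("b", "s")]           # analogous to the default output
```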

Examples
We use the preloaded dataset hfdata to create a raw high-frequency dataset to aggregate.

More on the PINstimation package
Optimization algorithms: The maximum-likelihood estimation relies on the maximization of the factorized likelihood function over a feasible parameter space. For all instances of MLE throughout the package, this constrained maximization is performed using the Nelder-Mead simplex algorithm (Nelder and Mead, 1965), as implemented in the function neldermead() of the package nloptr (Johnson, 2022). In contrast, the expectation-conditional maximization (ECM) algorithm does not require multi-dimensional non-linear optimization. Thanks to the use of conditional maximization, the search for the optimal parameters in the maximization step of the complete-data log-likelihood is reduced to the search for the roots of polynomials using the algorithm of Jenkins and Traub (1972), which can be implemented, for example, via the function polyroot(). The documentation of this function states that "numerical stability may be an issue for all but low-degree polynomials." Luckily, the highest degree of the maximands (polynomials) for the AdjPIN (MPIN) model estimation via the ECM algorithm is 4 (J + 1), where J, the number of information layers in the MPIN model, often takes a low value, usually less than 5 (Ersan, 2016).
The R Journal Vol. 15/2, June 2023 ISSN 2073-4859
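As a reminder of how polyroot() is called, the sketch below finds the roots of two low-degree polynomials. The polynomials are hypothetical examples chosen for their known roots; they are not the maximands that arise in the ECM estimation.

```r
# Roots of p(x) = 6 - 5x + x^2; coefficients are given in increasing order
roots <- polyroot(c(6, -5, 1))
real_roots <- sort(Re(roots[abs(Im(roots)) < 1e-8]))
real_roots

# A degree-4 example, the highest degree mentioned for the AdjPIN model:
# p(x) = (x - 1)(x - 2)(x - 3)(x - 4) = 24 - 50x + 35x^2 - 10x^3 + x^4
roots4 <- polyroot(c(24, -50, 35, -10, 1))
sort(Re(roots4))
```

polyroot() returns complex numbers, so the real roots are recovered by discarding values with a non-negligible imaginary part.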
Parallel processing: The search for the global maximum of the log-likelihood function, either through standard MLE or via the ECM algorithm, is performed by running the method for several initial parameter sets to obtain one estimate per initial set; the estimate with the highest log-likelihood is then selected. Since the search for a local optimum from any given initial set is independent of the search from the other initial sets, parallel processing can be used to speed up the execution. Similarly, the trade aggregation, as implemented in the function aggregate_trades(), takes an argument timelag and, if this argument is positive, assigns to each high-frequency trade a lagged mid-quote computed using the bid and ask registered a timelag earlier. The computation of the lagged mid-quote can be performed independently for all trades, and can therefore be parallelized. Consequently, the package supports parallel processing for these two main tasks, in particular when they take a considerably long time, namely: (1) estimation of the MPIN model when the number of initial parameter sets is large, and (2) aggregation of high-frequency data when a time lag is used. Parallel processing is enabled using the argument is_parallel, available for the functions mpin_ml(), mpin_ecm(), and aggregate_trades(). The default value for this argument is TRUE for the data aggregation, and FALSE for the MPIN model estimation. Parallel processing depends on two additional options: (1) the number of cores used by the functions, and (2) the threshold of initial parameter sets needed to activate parallel processing for MPIN estimations. By default, the number of CPU cores used in parallel processing is 2. This option is stored in, and accessed through, the R option pinstimation.parallel.cores. As for the MPIN estimation, parallel processing is not activated unless the number of initial sets exceeds a threshold, by default 100 sets. This option is stored in, and accessed through, the R option pinstimation.parallel.threshold. Information on how to change these options is available on the package website or in the package vignette "parallel processing". The parallel processing feature in the package relies on the future framework available through the R package future (Bengtsson, 2021). The actual mapping of functions via futures is performed through the function future_map() of the package furrr (Vaughan and Dancho, 2022).
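The multi-start logic described above can be sketched sequentially in base R. This is a simplified stand-in, not the package's code: it uses stats::optim() rather than nloptr::neldermead(), an unconstrained reparameterization of the original PIN likelihood, hypothetical simulated data, and three hypothetical initial sets. The two options are set with the standard options() mechanism, assuming they are ordinary R options as stated in the text.

```r
# Log-likelihood of the original PIN model, summed over days. `par` holds
# unconstrained values mapped to (alpha, delta, mu, eps_b, eps_s).
pin_loglik <- function(par, b, s) {
  alpha <- plogis(par[1]); delta <- plogis(par[2])
  mu <- exp(par[3]); eb <- exp(par[4]); es <- exp(par[5])
  # Log-probability of each day under the three scenarios
  l_none <- log(1 - alpha) + dpois(b, eb, log = TRUE) + dpois(s, es, log = TRUE)
  l_bad  <- log(alpha * delta) + dpois(b, eb, log = TRUE) +
    dpois(s, es + mu, log = TRUE)
  l_good <- log(alpha * (1 - delta)) + dpois(b, eb + mu, log = TRUE) +
    dpois(s, es, log = TRUE)
  m <- pmax(l_none, l_bad, l_good)  # log-sum-exp for numerical stability
  sum(m + log(exp(l_none - m) + exp(l_bad - m) + exp(l_good - m)))
}

set.seed(3)
event <- runif(60) < 0.4
bad   <- event & (runif(60) < 0.5)
b <- rpois(60, 300 + 500 * (event & !bad))
s <- rpois(60, 300 + 500 * bad)

# Three hypothetical initial sets (alpha, delta, mu, eps_b, eps_s)
inits <- list(c(0.2, 0.5, 300, 250, 250), c(0.5, 0.3, 600, 350, 300),
              c(0.8, 0.7, 100, 400, 400))
fits <- lapply(inits, function(x0) {
  start <- c(qlogis(x0[1:2]), log(x0[3:5]))
  optim(start, function(p) -pin_loglik(p, b, s), method = "Nelder-Mead",
        control = list(maxit = 5000))
})
loglik <- -vapply(fits, function(f) f$value, numeric(1))
best <- fits[[which.max(loglik)]]   # keep the run with the highest likelihood

# Changing the two options named in the text (values are illustrative)
options(pinstimation.parallel.cores = 4)
options(pinstimation.parallel.threshold = 50)
```

Because each call to optim() depends only on its own initial set, the lapply() call is the natural place to substitute a parallel map.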

Empirical time complexity
We have performed an empirical investigation into the time complexity of the algorithms associated with the PIN, MPIN, and ADJPIN models, but chose not to report the results. This decision is motivated by theoretical considerations, as these algorithms are designed to be used with small datasets, typically consisting of 60 to 250 observations.⁵ For such small datasets, the algorithms typically execute quite efficiently on a fairly average computer. In contrast, the package contains two functions that can be used with larger datasets, namely the data aggregation function aggregate_trades() and the function vpin(). To inspect the empirical time complexity of these functions, we obtain a real dataset containing two million high-frequency trades (sampledata), run the functions on subsets of increasing size, and inspect the rate at which the execution time grows with the size.
For the function aggregate_trades(), we perform the procedure for both sequential and parallel processing. For each value of size in the set (100000, 200000, ..., 2000000), we run the following lines of code: (1) aggregate_trades(sampledata[1:size,], algorithm = "LR", timelag = 1000), (2) aggregate_trades(sampledata[1:size,], algorithm = "LR", timelag = 1000, is_parallel = FALSE), and (3) vpin(sampledata[1:size,]). For each run, we record the pair consisting of the dataset size and the execution time. Figure S1 displays the execution time as a function of the number of high-frequency trades in the dataset for the functions aggregate_trades() and vpin(), respectively. Figure S1(a) shows clearly that the function aggregate_trades() displays a linear time complexity, both for sequential and parallel processing. Similarly, Figure S1(b) shows that the function vpin() also has a linear time complexity.
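The timing procedure can be sketched in base R as follows. The function work() is a hypothetical stand-in for the function being profiled, and the sizes are kept small so that the sketch runs quickly; it only illustrates how the (size, time) pairs are collected.

```r
# Hypothetical stand-in for a function whose running time we want to profile
work <- function(x) cumsum(sort(x))

set.seed(4)
x_all <- runif(200000)
sizes <- seq(50000, 200000, by = 50000)

# Record the elapsed time for subsets of increasing size
timings <- data.frame(
  size = sizes,
  seconds = vapply(sizes, function(n) {
    system.time(work(x_all[1:n]))[["elapsed"]]
  }, numeric(1))
)
timings
```

Plotting seconds against size then shows how the execution time grows; very short timings are noisy, so in practice each run would be repeated or applied to larger subsets.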
Convergence of the ECM algorithm: In theory, the ECM algorithm may fail to converge, and if it does converge, it may do so slowly (large number of iterations), or converge to a local optimum. To avoid long running times due to non-convergence, Ghachem and Ersan (2022) set an upper bound of 100 iterations per initial set, and report that between 93% and 99% of initial sets lead to convergence in fewer than 100 iterations. To avoid local optima, they use a limited number of strategic initial sets, and show that the average bias of AdjPIN (PSOS) is as low as 0.07% (0.101%), while it is roughly 0.01% for MPIN. Raising the bound on iterations and/or the number of initial sets may further enhance convergence and reduce estimation bias, while keeping running times reasonably low thanks to the fast ECM estimation. Users may adjust these parameters using the arguments hyperparams and xtraclusters of mpin_ecm(), or hyperparams and num_init of adjpin(..., method = "ECM").
Sample datasets: The functions included in the package accept datasets in two different formats. Therefore, and for the sake of compactness, we have only included two sample datasets. This is justified by the fact that the package enables users to easily generate simulated datasets that fit their preferences and needs (e.g., number of days, any feasible combination of model parameters) using the functions generatedata_mpin() and generatedata_adjpin(). More information on these functions and their arguments can be found in the package documentation.
Clustering algorithm: A large number of algorithms implemented in the package, namely those for layer detection (Ersan, 2016; Ersan and Ghachem, 2022a), or for generating initial parameter sets (Gan et al., 2015; Ersan and Alıcı, 2016; Ersan, 2016; Ersan and Ghachem, 2022b), rely on hierarchical agglomerative clustering (HAC) in one or more of their steps. The function used in the implementation of HAC throughout the package is hclust().
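A minimal base-R illustration of HAC with hclust() is given below. The data are hypothetical, and clustering days on their absolute order imbalance is an assumption made for the sake of the example; the variables and linkage used by each algorithm in the package are described in the cited papers.

```r
# Hypothetical daily buys and sells: the last three days have large imbalances
b <- c(310, 295, 305, 300, 820, 790, 300)
s <- c(300, 305, 298, 310, 305, 300, 810)
aoib <- abs(b - s)                       # absolute order imbalance per day

tree <- hclust(dist(aoib), method = "complete")
cluster <- cutree(tree, k = 2)           # split days into two clusters
split(aoib, cluster)
```

Days with a small imbalance end up in one cluster and days with a large imbalance in the other, which is the kind of grouping that layer detection and initial-set generation build upon.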
Custom initial parameter sets: The package provides several functions for generating initial parameter sets for the different PIN models, to be fed into the different estimation functions. These latter functions also allow for the use of custom initial parameter sets. This enables researchers to develop, and experiment with, potentially more efficient algorithms for generating initial parameter sets. To this end, an argument initialsets is included in the estimation functions of the PIN models (pin(), mpin_ml(), mpin_ecm(), and adjpin()), which allows users to apply the estimation method while providing their own initial parameter sets.

Applications
In this section, we showcase the different capabilities of the package by describing in sufficient detail two usage examples analyzing real-world datasets. The purpose of these examples is to show that the package can be used to answer typical research questions, and also to serve as a complementary check: our empirical results corroborate well-established findings in the literature, mainly that small stocks have higher informed trading than large stocks, and that VPIN values vary around firm-specific announcements.
In the first example, we use different measures of informed trading (implemented in the package) to conduct descriptive and comparative analyses of informed trading activity in large and small stocks. More specifically, we collect and compare the probability of informed trading obtained by estimating the three major models using a sequence of daily buyer-initiated and seller-initiated trades. These models are PIN (Easley and O'Hara, 1992; Easley et al., 1996), MPIN (Ersan, 2016), and AdjPIN (Duarte and Young, 2009). The research strategy consists of three steps. First, we aggregate the high-frequency transaction datasets into datasets of daily trades using the function aggregate_trades() with the Lee and Ready (1991) algorithm (algorithm = "LR") and a zero-second time lag (timelag = 0). Second, we estimate each of the three models with various methods and specifications suggested in the literature. Finally, we compare the estimates of informed trading in large and small cap stocks, and test the well-established hypothesis that small stocks experience a larger probability of informed trading (see e.g. Easley et al., 2002; Aslan et al., 2011).
In the second example, we conduct an intraday analysis of informed trading, using the same dataset, but different variations of the volume-synchronized probability of informed trading or VPIN (Easley et al., 2011, 2012). First, we provide summary statistics for the different VPIN estimates. Next, we provide modified versions of the two tables in Easley et al. (2011) showing the distribution of VPIN and absolute post-returns conditional on each other. Additionally, we investigate the distributions of positive and negative returns separately. Finally, we examine whether order-flow toxicity changes around firm-specific announcements for the examined stocks and during the study period.⁶

Data
Our main dataset is a stock-level intraday dataset, consisting of all trading transactions for 58 Swedish stocks listed on NASDAQ Stockholm that took place within the last quarter of 2020 (59 trading days). The data is a collection of reconstructed order books, based on the NASDAQ OMX Historical ITCH files, and obtained from the website of the Swedish House of Finance, National Research Data Center. Reconstructed order books contain extensive information about the different order book entries, such as the instrument symbol, date and timestamp in nanoseconds, first and second-best prices and associated volume at both bid and ask sides, and transaction price and volume. The main motivation behind the selection of 58 stocks in the sample is to conduct comparative analyses of informed trading between large and small stocks. The 29 large-cap stocks are selected in a straightforward manner from among the 30 large-cap stocks listed in the OMX Stockholm 30 Index (OMXS30). Of these 30 stocks, one stock (ATCO A) is excluded because of data unavailability. As for the small-cap stocks, we consider the stocks listed in NASDAQ OMX Small Cap Sweden GI (NOMXSCSEGI) that are listed in neither the mid-cap nor the large-cap indices (OMXSMCGI, OMXSLCGI). At the time of the study, 219 stocks are listed in the Small Cap index, among which 39 are listed in neither of the aforementioned indices. We select the first 30 of these 39 stocks, chronologically. Of these 30 stocks, one small stock (EGTX) is excluded as it only has six days with any trading records. In sum, we have 29 large and 29 small stocks with 5,410,411 associated transactions.
Our second dataset consists of firm-level announcements pertaining to the selected 58 stocks and occurring within the 59 trading days of the first dataset. The announcement data were manually collected from company news, available on the website of NASDAQ NORDIC, and amount to a total of 546 announcements. We apply several filtering steps to the collected raw data before obtaining the final sample of announcements. For instance, we exclude 353 announcements occurring outside of the trading hours, or within the first and last 10 minutes of the trading day. To avoid ambiguity from combined effects of multiple announcements, we exclude all announcements for any stock-day pair having more than one announcement. The final sample consists of 96 announcements, out of which 41 concern large stocks and 55 concern small stocks.

Example 1 -PIN estimation
We estimate the standard PIN model (Easley et al., 1996), the MPIN model (Ersan, 2016), and the AdjPIN model (Duarte and Young, 2009) using a dataset of high-frequency trades on a sample of 58 stocks (29 large and 29 small stocks) during the last quarter of 2020. We perform a comparative study of the estimates of different specifications for each of these models, and provide evidence for the existence of significant differences in informed trading between small and large stocks. Technically, we estimate the original PIN model using 8 different specifications, the MPIN model using 5 specifications, and the ADJPIN model and its restricted versions using 7 specifications. We, however, report only a selection of these specifications. Unreported specifications are variations of the reported models with different factorizations, initial sets, and/or restrictions on parameters. Table 2 defines the ten specifications we report and provides the corresponding code to implement each of them.

Table 2: Definition, and implementation code for a selection of model specifications
The factorization and initial sets for the MPIN and AdjPIN models are presented in Ersan (2016) and Ersan and Ghachem (2022b), respectively. Estimations using the ECM algorithm are detailed in Ghachem and Ersan (2022).
Table 3 presents the mean estimates of the probability of informed trading (PIN) as well as five parameters for the 58 examined stocks. In summary, Table 3 suggests that the variation of estimates from different specifications of the same model is of limited scope, while the variation of estimates across models might be quite significant. The MPIN model yields the highest PIN estimates, mainly due to a higher probability of information events. Interestingly, the PIN and ADJPIN models produce very similar PIN estimates, even though all their model parameters differ significantly from each other. These results are in line with the assumptions of the different models.
Table 4 reports the mean estimates of the probability of informed trading for large and small stocks separately, their difference, and its statistical significance. For all specifications, the mean PIN estimate is significantly larger for small stocks than for large stocks. For instance, the PIN model mean estimate is around 8.7% for large stocks, while it is almost 18% for small stocks. Similarly, mean MPIN model estimates are larger than those of the PIN model, both for large and small stocks, and can reach up to 30% for small stocks. This finding is in line with previous findings in the market microstructure literature that document larger probabilities of informed trading for small stocks (Easley et al., 2002; Aslan et al., 2011; Chen and Zhao, 2012). The bottom row of Table 4 reports the mean number of layers detected using the different specifications of the MPIN model. The average number of layers for large stocks is consistently higher than that for small stocks. For instance, for MPIN_ML_EG, the mean number of layers detected in the 2020 last-quarter datasets of large stocks is 4.172. This number is significantly higher than its counterpart for small stocks (around 2.9) for the same period, suggesting that large stocks are more likely to witness different types of information events.

Table 4: Mean PIN estimates and number of layers for large and small stocks
***, **, and * represent significance from a one-sided t-test at the 1%, 5% and 10% levels, respectively. PIN values and their differences are in percentages.

Next, we focus on one selected implementation for each of the three models (PIN_EA, MPIN_ML_EG, ADJPIN_GE). Figure 1 shows stock-level PIN and alpha estimates for each of the three selected specifications. The left (right) hand side of each panel reports estimates for the 29 large (small) stocks. Figure 1a displays the PIN estimates for each of the PIN, MPIN and ADJPIN models. While the PIN and ADJPIN models produce relatively close PIN estimates, the MPIN model estimates are consistently higher. In particular, the difference between the MPIN and PIN estimates is positive, and can reach up to 25%. In contrast, the difference between the estimates from the ADJPIN and PIN models does not have a stable sign, and tends to fluctuate around 0. Figure 1a also shows relatively higher estimates on the right side of the panel (small stocks), as well as high stock-based variations; e.g., for the MPIN model, PIN estimates range from 10% to around 40% for the examined stocks. Figure 1b replicates Figure 1a for the alpha parameter estimates (information event occurrence probability). It shows that the PIN model consistently has lower alpha estimates than the MPIN and ADJPIN models, but with higher variability, ranging from 2% to 42%. Significant differences in alpha estimates are observed among the three models and across stocks. Therefore, a careful analysis of each model's assumptions is necessary to draw any conclusions.
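The link between the probability of information events and the PIN estimate can be made explicit with the standard formulas PIN = αµ / (αµ + ε_b + ε_s) and, for the multilayer model, MPIN = Σ_j α_j µ_j / (Σ_j α_j µ_j + ε_b + ε_s). The base-R sketch below evaluates both; the helper pin_value() and the parameter values are hypothetical and chosen only to show the direction of the effect.

```r
# PIN for the single-layer model and its multilayer generalisation
pin_value <- function(alpha, mu, eps_b, eps_s) {
  informed <- sum(alpha * mu)   # expected informed order flow per day
  informed / (informed + eps_b + eps_s)
}

# One layer with alpha = 0.2 versus two layers with a total alpha of 0.4
pin_one <- pin_value(alpha = 0.2, mu = 500, eps_b = 400, eps_s = 400)
pin_two <- pin_value(alpha = c(0.2, 0.2), mu = c(500, 500),
                     eps_b = 400, eps_s = 400)
c(pin_one = pin_one, pin_two = pin_two)
```

Holding the trading rates fixed, a higher total probability of information events raises the share of informed order flow and hence the estimated probability of informed trading.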
In each parameter set 'a-b-c', a represents the length of the time bars in minutes, b stands for the number of buckets per day with average trading volume, and c is the number of previous buckets used in the calculation of VPIN at any bucket. In line with Easley et al. (2011, 2012), we select the parameter set 1-50-50 as our main setting.
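The role of the third element of a parameter set can be illustrated with a simplified base-R sketch of the rolling VPIN computation, where VPIN at a bucket is the sum of absolute order imbalances over the last n buckets divided by n times the bucket volume. The sketch assumes that the volume buckets have already been built and split into buy and sell volume; the numbers are hypothetical, and the construction of time bars and buckets performed by vpin() is not reproduced here.

```r
# Hypothetical bucket-level data: each bucket holds the same volume `vbs`
vbs  <- 100
buy  <- c(60, 50, 80, 20)
sell <- vbs - buy
n    <- 2                                  # buckets in the rolling window

imbalance <- abs(buy - sell)               # order imbalance per bucket
rolling   <- stats::filter(imbalance, rep(1, n), sides = 1)
vpin      <- as.numeric(rolling) / (n * vbs)
vpin
```

The first n - 1 buckets have no VPIN value, since the rolling window is not yet full.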
Table 5 presents the summary statistics of the VPIN estimates for the three settings, both for the whole sample and for the large and small stocks separately. Mean (median) VPIN with 1-50-50 is 27.6% (25.3%) for the whole sample. The number of VPIN observations is 166,875, almost equally composed of observations on small and large cap stocks. Mean VPIN is slightly larger for small stocks than for large stocks (28.1% and 27.2%, respectively).
Under the basic setting, the difference between the mean VPIN measures of small and large stocks, while in line with our expectations, is not as large as previous studies suggest. For instance, Abad and Yagüe (2012) report mean VPIN values of 25% and 53% for Spanish large and small stocks, respectively. We too obtain a positive difference between the mean VPIN values for small and large stocks for all parameter sets. The VPIN value for small stocks is substantially larger than for large stocks (almost twofold) for settings in which an average trading day contains a single bucket, and five buckets are used in calculating the VPIN (parameter sets 1-1-5 and 5-1-5). The excess informed trading of small stocks is not restricted to average values. For instance, under our basic setting, the first and third quartiles of VPIN for the whole sample are around 20% and 34%. The interquartile range as well as the standard deviation are larger for small stocks than for large stocks.

⁷ The first parameter set 1-50-50 is the main setting used in several studies (see e.g. Easley et al., 2011, 2012; Abad and Yagüe, 2012). The parameter sets 1-1-5 and 5-1-5 are two of the several sets previously used for comparative purposes (see e.g. Abad and Yagüe, 2012).
We turn now to investigate whether the correlation observed between the VPIN distribution and the absolute post-returns distribution for the S&P 500 E-mini index, as reported by Easley et al. (2011), can be generalized to (1) individual stocks, (2) another (non-US) market, i.e. NASDAQ Stockholm, (3) more recent data, and (4) positive and negative post-returns. To do this, we replicate the two tables (Exhibits 7 and 8) as they appear in Easley et al. (2011) for individual stocks, initially for absolute post-returns, before differentiating between positive and negative post-returns. Table 6 reports, in Panel A, the distribution of the absolute post-returns conditional on VPIN. Each of the 3 rows represents the distribution in percentage for the 0-5th, 45th-50th, and 95th-100th quantiles of the VPIN values. The respective quantile values are given in the first column (e.g., 0.062 is the 5th quantile of VPINs in our data).
The results in Table 6 (Panel A) are remarkably similar to the results in Easley et al. (2011), both qualitatively and even quantitatively. For instance, the share of large absolute post-returns is highest in the highest VPIN quantile, and substantially higher than the same share in other quantiles. The share of large absolute post-returns (exceeding 2%) associated with the highest VPIN quantile is 2.16%, while it is below 0.44% for the 45th to 50th VPIN quantiles. The highest levels of VPIN (in the highest quantile) have a 4.5 times higher likelihood of being followed by large absolute post-returns than intermediate levels of VPIN (in the median quantile) (2.16% and 0.44%). This ratio is strikingly similar to the one found in the referenced paper (0.22% and 0.05%). However, the likelihood of large absolute post-returns is higher in our study (2.16% vs 0.22%), which is likely due to our use of individual stocks rather than an index. For each of the absolute return intervals larger than 0.5%, the share of VPIN values in the highest quantile is at least twice as large as the ones in lower quantiles. The share of VPINs within the highest quantile (last row of Table 6, Panel B) is noteworthy: absolute returns larger than 1% are highly likely to be preceded by a high VPIN value. In our unreported results, for over 40% of intraday periods with absolute returns larger than 2%, the (preceding) VPIN is in its highest quantile.
We now turn to investigate whether the distribution patterns for absolute returns hold true when returns are split into positive and negative and analyzed separately. Table S8 summarizes the results across four panels showing only the lowest, median, and highest 5% VPIN quantiles. The distribution patterns of preceding VPINs remain consistent for positive and negative returns, except for the return interval (−0.5%, 0.5%). When VPIN values are within the highest quantile and the post-return is positive (negative), the likelihood of the return in the next volume bucket exceeding 2% (−2%) is as high as 3.87% (4.31%). These probabilities are more than seven times that of the median quantile. Note that we excluded zero-return observations before analyzing positive and negative returns separately. This might explain why the findings in Table S8 are more pronounced than those in Table 6.
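The construction of such conditional distributions can be sketched in base R with cut(), table(), and prop.table(). The data below are simulated and the return intervals are hypothetical, so the resulting numbers have no empirical meaning; the sketch only shows how the two panels are obtained from the same cross-tabulation.

```r
set.seed(5)
# Hypothetical data: one VPIN value and one absolute post-return per bucket
vpin   <- runif(2000)
absret <- abs(rnorm(2000, sd = 0.5 + vpin))   # in percent

# Twenty VPIN groups of 5% each, and absolute-return intervals
vpin_group <- cut(vpin, quantile(vpin, seq(0, 1, length.out = 21)),
                  include.lowest = TRUE, labels = FALSE)
ret_group  <- cut(absret, c(0, 0.25, 0.5, 1, 2, Inf), include.lowest = TRUE)

counts  <- table(vpin_group, ret_group)
panel_a <- 100 * prop.table(counts, margin = 1)  # returns conditional on VPIN
panel_b <- 100 * prop.table(counts, margin = 2)  # VPIN conditional on returns
round(panel_a[c(1, 10, 20), ], 2)                # lowest, median, highest group
```

Each row of panel_a and each column of panel_b sums to 100, mirroring the row-wise and column-wise reading of the two panels.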
Finally, we investigate VPIN around firm-specific announcements. Using 96 firm-specific announcements taking place within the last quarter of 2020, and pertaining to the selected stocks, we investigate whether VPIN values change prior to, and following, the announcements, and whether the behavior of VPIN around announcements is similar for large and small stocks. Figure 2 plots the mean VPIN for the (−100, +100) volume buckets, where 0 refers to the announcement bucket, i.e., the bucket during which the announcement took place. It represents VPIN values around announcements for the whole sample, and for large and small stocks separately. The main finding of our analysis on the whole sample is that mean VPIN starts to increase shortly prior to announcements, continues to increase post-announcement, and reaches a maximum before starting to decrease to pre-announcement levels.
As shown in Figure 2a, mean VPIN starts to increase at bucket (−13) from a level of 25.7%, increases monotonically for around 50 buckets, reaching a level of 30.81%, before reverting gradually to around its pre-announcement level. Mean VPIN of small stocks, in Figure 2b, starts rising at bucket (−13) from a level of 25.6%, and keeps increasing until bucket (+29), reaching a level of 32.4% before starting to decrease gradually. It then reaches its lowest post-announcement level at bucket (+81), before starting to rise again. As for large stocks in Figure 2c, mean VPIN starts rising at bucket (−7) from a level of 25.2%, and keeps increasing until reaching a level of 30.3% at bucket (+50), before gradually decreasing afterwards. Interestingly, VPIN starts to react relatively earlier for small stocks than for large stocks. Nevertheless, the early warning property of VPIN is evident for both small and large Swedish stocks. This corroborates the findings of Easley et al. (2011, 2012), who suggest VPIN as a metric providing an early warning signal for intraday events, such as crashes. Bjursell et al. (2017) document an increase in VPIN prior to news events and price jumps in the crude oil market. Similarly, Bugeja et al. (2015), examining takeover announcements in the Australian markets, find that VPIN significantly increases for target firms in the four days prior to the takeover announcements. Our findings suggest that the potential of VPIN as an early warning signal might well extend to regular firm-specific events. These VPIN patterns could be further investigated in light of recent findings on price discovery around announcements in today's financial markets with large HFT prevalence (Beschwitz et al., 2020; Ersan et al., 2021).
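The averaging behind such event-window plots can be sketched in base R as follows. The VPIN series and the announcement buckets are simulated and hypothetical; the sketch only shows how series are aligned on the announcement bucket and averaged across events.

```r
set.seed(6)
# Hypothetical VPIN series (one value per volume bucket) for three events,
# each with the bucket index at which the announcement occurred
series <- replicate(3, runif(400, 0.2, 0.35), simplify = FALSE)
event_bucket <- c(150, 200, 250)
window <- -100:100

# Rows = events, columns = relative bucket (-100, ..., +100)
aligned <- t(mapply(function(v, e) v[e + window], series, event_bucket))
mean_vpin <- colMeans(aligned, na.rm = TRUE)
names(mean_vpin) <- window
```

Plotting mean_vpin against window gives the mean VPIN path around the announcement bucket, which is bucket 0 in this alignment.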

Conclusion
PINstimation is an attempt to centralize, and implement in a rigorous manner, the main estimation methods suggested in the literature. In addition to efficiency, we aim for PINstimation to be (1) all-encompassing, i.e., it includes the main model treating the probability of informed trading and its most relevant extensions, (2) complete, i.e., it includes not only the tools required to estimate PIN models, but also algorithms to generate initial parameter sets, tools to simulate datasets, and algorithms to aggregate high-frequency trades into daily trading data, and (3) up-to-date, as the current version of the PINstimation package includes several methods suggested in 2020-2022.
Future work on the package aims at its continuous extension with the most up-to-date estimation methods available. For instance, we have recently added the function pin_bayes(), which implements a Bayesian approach for the estimation of the original PIN model as suggested by Griffin et al. (2021). Even though the PINstimation package aims to be all-encompassing, it remains primarily dedicated to the estimation of probability of informed trading (PIN) models. Thus, other informed trading measures suggested in the literature are, and shall remain, beyond the scope of the package. By introducing the package, we hope to help widen the user base of PIN models both in academic circles and among practitioners, as well as improve the validity and the comparability of scientific findings within the field.

Figure 2: Average VPIN around announcements for small, large, and all stocks

# Display the optimal parameter estimates and the PIN value
# Display probability estimates, trading intensity estimates, adjpin, and psos.
# Estimate a restricted AdjPIN model where the liquidity shock rates are assumed equal on
# the buy and sell side, i.e., d.b = d.s.

Table 3: Mean estimates of PIN and five parameters in PIN, MPIN, and ADJPIN models
Probability terms, PIN, α, and δ are in percentage. The average running time (Time) is in seconds.

Table 5: Descriptive statistics for three settings of VPIN, for large, small, and all stocks
N refers to the number of observations; min and max refer to the minimum and maximum values, respectively. SD corresponds to the standard deviation, while Qx is the x-th quantile.

Table 6: Conditional distributions of VPIN and absolute post-returns
Panel A provides the distribution of absolute post-returns (leading VPIN bucket return) conditional on VPIN values, while Panel B provides the distribution of VPIN values conditional on the absolute post-returns. For brevity, only the 5th, 50th and 100th quantiles are reported in each panel. Numbers are given in percentages.