Disparities in economic welfare, inequality and poverty across and within countries are of great interest to sociologists, economists, researchers, social organizations and political scientists. Information about these topics is commonly based on surveys. We present a package called rtip that implements techniques based on stochastic dominance to make unambiguous comparisons, in terms of welfare, poverty and inequality, among income distributions. Besides providing point estimates and confidence intervals for the most commonly used indicators of these characteristics, the package rtip estimates the usual Lorenz curve, the generalized Lorenz curve, the TIP (Three I’s of Poverty) curve and allows to test statistically whether one curve is dominated by another.
Surveys of income provide policy makers, researchers, social organizations and general public with a rich source of data to address and understand the topics of welfare, income inequality and poverty. In one approach, income differences among states, populations or groups are summarised by univariate indices. The most popular aggregate inequality and poverty indices (such as the Gini index and the at-risk-of-poverty rate) summarize inequality or poverty by a univariate index. Such an index is easy to compare over time, across regions or countries but may be insufficient for more detailed research and refined decision making. Increasing availability of data sets from a variety of statistical sources (including national and international statistical offices) allows scientists and policy-makers to carry out detailed comparisons by developing and applying more sophisticated tools and graphical representations from real data sets.
This paper presents an R package, called rtip, that implements methods and techniques based on stochastic dominance to make unambiguous comparisons, in terms of welfare, poverty and inequality, of income distributions. Stochastic dominance requires unanimity in rankings for large classes of indices (rather than simpler rankings based on a single index). We regard one income distribution as dominating another only if the same ranking is obtained for them with an entire family of indices. Stochastic dominance rules for income distributions can be easily implemented by seeking a dominance relation between three graphical representations: the Lorenz curve, the generalized Lorenz curve and the TIP (Three I’s of Poverty) curve (also called the cumulative poverty gap (CPG) curve or the poverty profile curve, see Barrett et al. (2016) for other names). Two income distributions can be unambiguously ranked in terms of inequality if their Lorenz curves do not intersect (see Atkinson 1970). Similarly, two income distributions can be unambiguously ranked with respect to a wide class of reasonable social welfare functions (see Shorrocks 1983 for details) if their respective generalized Lorenz curves do not intersect. The literature also contains results (see Jenkins and Lambert 1998a,b; Sordo and Ramos 2011) connecting unambiguous poverty rankings by general classes of poverty measures with non-intersections of TIP curves. The package rtip estimates these curves directly from data sets and implements some tests to assess any statistical significance.
We can find some functions, in different software languages, addressing
the analysis of income distributions via generalized Lorenz curves and
their associated indices of inequality. For example, the Stata
commands
Some strengths of using the package rtip presented in this paper compared with others are the following:
rtip provides functions to load microdata having EU-SILC format (see Datasets section).
rtip evaluates income distributions on several equivalence scales and with adjustment for differences in household composition. The choice of the equivalence scale is important and can considerably affect the results (see, for example, (Buhmann et al. 1988; Jenkins and Cowell 1994; De Vos and Zaidi 1997), who showed that the choice of the equivalence scale is not innocuous). Although the package rtip employs by default the OECD-modified scale used by Eurostat, it also allows to use the parametric scale in Buhmann et al. (1988), which covers almost all equivalence scales used in practice.
Besides providing point estimates and confidence intervals for the most commonly used indicators for inequality, welfare and poverty, the package rtip implements estimation of the usual Lorenz curve, the generalized Lorenz curve and the TIP curve and provides a way, based on the distribution-free test for generalized Lorenz and deprivation dominance suggested by Xu (1997) and Xu and Osberg (1998), respectively, to test statistically whether a curve is dominated by another.
The remainder of this paper is organized as follows. First, we explain the processes of loading data and setting up the surveys. Then, we briefly describe the indices and curves included in the package, explaining how to estimate them with rtip. Description of the statistical procedures implemented to test dominance of these curves is also presented and illustrated with examples. Finally we provide conclusions as well as a section to explain to users how to load their own datasets.
The rtip package offers the possibility to load data extracted
from EU-SILC using the loadEUSILC
function. Users can modify
loadEUSILC
in order to load their own datasets. For instance,
loadLCS
function is a modification of loadEUSILC
to load data from
the Spanish Living Conditions Survey (see the Appendix). Note that
rtip is not restricted to datasets with EU-SILC format and users
can load their own datasets.
The rtip package contains two datasets: eusilc2
and LCS2014
.
library(rtip)
data(eusilc2)
data(LCS2014)
The eusilc2
dataset is a modification of the synthetic eusilc
dataset in the package laeken (see Alfons and Templ 2013). The aim of this
modificationeusilc
dataset can be found at
https://github.com/AngelBerihuete/rtip/blob/master/data-raw/eusilc_eusilc2.R.eusilc
dataset is related to Austrian EU-SILC data from 2006, and was
generated to comply with EU-SILC confidentiality rules. The LCS2014
dataset
Both datasets, eusilc2
and LCS2014
, have seven variables which are
briefly described in Table 1 (for further
information see Eurostat 2007). The records in these datasets are
households, not individuals. Thus the 14,827 individual-level records in
the eusilc
dataset are condensed to 6,000 household-level records in
the eusilc2
dataset.
Name of the variable | Meaning |
---|---|
DB010 | Year of the survey |
DB020 | Country |
DB040 | Region |
DB090 | Household cross-sectional weight |
HX040 | Household size |
HX050 | Equivalised household size |
HX090 | Equivalised disposable income |
The income variable most commonly studied for the inequality and poverty
assessment is the disposable household income. Most official statistics
(and rtip) use the equivalised disposable household income, which
is the total disposable income of the household adjusted by taking into
account its composition (number of adults and children). The adjustment
is made by dividing the disposable household income by the equivalised
household size, which is defined as the number of household members
converted into equivalent adults by using a specific equivalence scale.
The equivalisation factor employed by Eurostat is the OECD-modified
scale, which gives a weight of 1.0 to the first person aged 14 or more,
a weight of 0.5 to other persons aged 14 or more and a weight of 0.3 to
persons aged 0-13. rtip uses this equivalisation factor by
default, but can also use the parametric scale of Buhmann et al. (1988). In this
case, the equivalised household size is given by
The information is held in two files: basic household register (H-file)
and household data (D-file). These files are loaded by functions
loadEUSILC
or loadLCS
from EU-SILC or INE surveys, respectively.
Next, the data is set up by using the function setupDataset
which has
one mandatory argument (dataset
) and five optional arguments: the
country to be analysed (country
), the region of the country
(region
), the equivalence scale (s
), a deflator (deflator
) and the
purchasing power parity rate (pppr
). All optional arguments have
NULL
default value, except for country = ’ES’
.
The country
and the region
are expressed using the nomenclature of
territorial units for statisticsregion = NULL
) but it is possible to select one
or more regions using a character string. Income is expressed in current
monetary units. If deflator
is not NULL
the value assigned will be
used as a deflator. Finally, by setting up the ratio of the purchasing
power parity conversion factor to market exchange rate (pppr
) we can
compare the income across countries.
dataset <- setupDataset(eusilc2, country = "AT", region = NULL,
s = NULL, deflator = NULL, pppr = NULL)
Next, we form a data frame with a new variable ipuc
. The ipuc
variable is the income per unit of consumptiondeflator = NULL
, pppr = NULL
and
s = NULL
, ipuc
is set to HX090
(see Table
1).
In this section we briefly describe some indicators and curves that are
widely used in the study of poverty, inequality and welfare. The
eusilc2
dataset contains all the data necessary for estimating them.
Let
Poverty curves and indicators are based on the poverty threshold that
distinguishes between poor and non-poor households. The median income,
using the function arpt()
. The default setting of pz
. Optional variable names can be set up for
the income per unit of consumption (ipuc
), the household
cross-sectional weight (hhcsw
) and the household size (hhsize
).
Default values correspond to EU-SILC names. For the sake of clarity we
use the default variable names in the following examples. A sample call
to arpt
would look like:
arpt(dataset, ipuc = "ipuc", hhcsw = "DB090", hhsize = "HX040", pz = 0.6)
The indicator called at-risk-of-poverty rate, defined as the
proportion of persons with an equivalised disposable income below the
at-risk-of-poverty threshold, is estimated using the function arpr()
(see Table 2). A sample call to arpr
would look
like:
arpr(dataset, arpt.value = arpt(dataset, pz = 0.6))
Using the function arpr()
and changing the argument pz
to 40%, 50%
and 70% in arpt()
, we obtain the dispersion around the
at-risk-of-poverty threshold (percentage of persons with an equivalised
disposable income below
arpr(dataset, arpt.value = arpt(dataset, pz = 0.4))
Table 2 contains a brief description of some other
well-known poverty indicators such as the Foster, Greer and Thorbecke
(FGTs1()
. Its normalized version, also called the poverty gap
ratio, is obtained by setting norm = TRUE
. In this case, the index
provides the average of the ratio of the poverty gaps
s1(dataset, arpt.value = arpt(dataset), norm = TRUE)
The rtip package also provides the function s2()
to calculate
the Sen-Shorrocks-Thon (SST) poverty index (Shorrocks 1995; Zheng 1997).
It is estimated as twice the area below the TIP curve (see TIP curve in
next section).
For income inequality, we calculate the most commonly used indices of
inequality, the Gini index, and the quintile share ratio (see
Table 2). The mean income per person, per
household and per unit of consumption are estimated by the respective
functions mip()
, mih()
and miuc()
.
Description |
Formula |
|
---|---|---|
At-risk-of-poverty rate, | ||
arpr() |
Proportion of persons with an equivalised disposable income below the at-risk-of-poverty threshold |
|
Relative median at-risk | ||
of-poverty gap,
rmpg() |
Difference between the at-risk-of-poverty threshold and the median equivalised disposable income of people below the at-risk-of-poverty threshold, expressed as a percentage of the threshold |
|
Foster, Greer and | ||
Thorbecke, s1() |
Average of the absolute poverty gaps (difference between the at-risk-of-poverty threshold and the equivalised disposable income, with the non-poor being given a difference of zero) |
|
Gini, gini() |
Relationship of cumulative proportions of the population arranged according to the level of equivalised disposable income, to the cumulative proportions of the equivalised total disposable income they receive | |
Quintile share ratio, | ||
qsr() |
The ratio of total income received by the 20 percent of the population with the highest income to that received by the 20 percent of the population with the lowest income |
|
Confidence intervals can be obtained for all the indicators in
rtip package by a bootstrap method. We use the
boot package
(Canty and Ripley 2016) to generate bootstrap replicates. The user can change the
number of replicates with the parameter rep
which, by default, is set
to 500. If verbose = TRUE
we obtain a plot showing the histogram of
the indicator estimations (replicates) and the Standard Normal
quantile-quantile plot of bootstrap estimates. For instance, in case of
the arpt
function and 98% confidence interval, a typical call would
look like:
arpt(dataset, pz = 0.6, ci = 0.98, rep = 500, verbose = TRUE)
The Lorenz curve (Gastwirth 1971) is defined as
and the generalized Lorenz (GL) curve is given by
where the sample counterpart of
and lc()
is implemented to estimate both Lorenz and GL curve
ordinates. This function calculates the Lorenz curve for the number of
abscissae samplesize
. Following the
examples in Beach and Davidson (1983), Beach and Kaliski (1986) and Xu (1997), we set samplesize = 10
by default. If samplesize = complete
, ordinates are computed in each
value along the whole distribution. By setting generalized = TRUE
, the
GL curve is calculated. The left-hand panel of Figure
1 shows the Lorenz curve for income distribution of
the Burgenland region. It is produced with:
Burgenland <- setupDataset(eusilc2, country = "AT", region = "Burgenland")
lorenz_curve <- lc(Burgenland, samplesize = 10, generalized = FALSE, plot = FALSE)
p1 <- ggplot(lorenz_curve, aes(x.lg, y.lg)) + geom_line() +
geom_segment(aes(x = 0, y = 0, xend = 1, yend = 1),
linetype = "dotted", color = "grey") +
scale_x_continuous(expression(p)) +
scale_y_continuous(expression(L(p))) + theme_bw()
print(p1)
The variable lorenz_curve
contains the abscissae plot=TRUE
in lc
function.
For an individual (or household-level) measure of deprivation
Let
For its estimation, let the observations of poverty gaps
where the sample counterpart of
and tip()
estimates both the unnormalized and normalized TIP curve ordinates.
Normalization is established by setting norm = TRUE
and the estimated
poverty threshold, arpt
. The
number of ordinates computed is given by the the argument samplesize
,
and following the example in Xu and Osberg (1998), we set samplesize = 50
by
default. If samplesize = complete
, tip ordinates are computed in each
value along the whole distribution. The right-hand panel of Figure
1 shows the normalized TIP curve for income
distribution of Burgenland region. It is produced with:
tip_curve <- tip(Burgenland, arpt(Burgenland), samplesize = 50, norm = TRUE)
p2 <- ggplot(tip_curve, aes(x.tip, y.tip)) + geom_line() +
scale_x_continuous(expression(p)) +
scale_y_continuous(expression(TIP(p, z))) +
theme_bw()
print(p2)
As in the previous example, a less elaborate plot is obtained by setting
plot=TRUE
in the call to the tip
function.
Given two income distributions
Since the initial papers by (Beach and Davidson 1983) and (Bishop et al. 1989) focussing on statistical tests for Lorenz dominance and generalized Lorenz dominance, respectively, many studies have been conducted to implement these ranking criteria empirically (see Chapter 17 in (Duclos and Araar 2006) for a review). Some tests in the literature are based on a two-stage testing strategy including multiple pairwise sub-tests. Xu (1997) offers an alternative to this approach by providing a joint and simpler procedure to test for generalized Lorenz dominance directly, which is adapted to testing for deprivation dominance in Xu and Osberg (1998). We have implemented Xu (1997) and Xu and Osberg (1998) procedures in rtip.
To make statistical inference about GL dominance from sample GL curve estimates we have implemented the asymptotically distribution-free statistical inference procedure in Xu (1997). This article provides one test based on Theorem 1 in Beach and Davidson (1983) which derives the asymptotic joint variance-covariance matrix of GL curve ordinates. EU-SILC surveys involve sampling weights. We implemented an extension of the methodology in Beach and Davidson (1983) to samples which involve weighted observations (see Beach and Kaliski 1986).
Given two income distributions,
where
with OmegaGL()
computes the empirical unrestricted
vector of GL curve ordinates, testGL()
. For both functions the number of ordinates, samplesize
. Following the example in Xu (1997) the default
value is samplesize=10
. The upper-and lower-bounds of critical values
(at the Tvalue
) falls into an inconclusive region (between the lower-
and upper-bounds) the simulated p-value is estimated following
Wolak (1989), otherwise the NA
. For instance, to
test the null hypothesis that the income distribution of Burgenland
dominates the income distribution of Carinthia in the generalized Lorenz
sense we use the following procedure:
Burgenland <- setupDataset(eusilc2, country = "AT", region = "Burgenland")
Carinthia <- setupDataset(eusilc2, country = "AT", region = "Carinthia")
testGL(Burgenland, Carinthia, generalized = TRUE, samplesize = 10, alpha = 0.05)
The output produced is:
$Tvalue
[,1]
[1,] 7723.701
$p.value
[1] NA
$decision
[1] "Reject null hypothesis"
In this example, the null hypothesis of dominance is rejected with a
significance level of 5
To test the null hypothesis that the income distribution of Carinthia
dominates the income distribution of Burgenland in the Lorenz sense we
use the function testGL()
as follows:
testGL(Carinthia, Burgenland, generalized = FALSE)
This results in:
$Tvalue
[,1]
[1,] 4.049037e-12
$p.value
[1] NA
$decision
[1] "Do not reject null hypothesis"
In this case we do not have evidence to reject the null hypothesis with
a significance level of 5
To make statistical inference about TIP dominance we have implemented the asymptotically distribution-free statistical procedure in Xu and Osberg (1998). As in the case of GL dominance, we have followed the methodology suggested by Beach and Davidson (1983) and Beach and Kaliski (1986).
Let’s consider two poverty thresholds
where
The matrix
with OmegaTIP()
computes the empirical unrestricted vector of TIP curve
ordinates, testTIP()
. Following the practical example
in Xu and Osberg (1998) the default value is samplesize=50
. The rules of
rejection and non-rejection based on the value of the statistic Tvalue
) are in Xu and Osberg (1998). By setting the argument norm
equal to TRUE
, the function testTIP()
uses normalized TIP curves.
For example, to test the null hypothesis that the normalized TIP curve
of Carinthia dominates the normalized TIP curve of Burgenland we use the
following procedure:
testTIP(Carinthia, Burgenland, norm = TRUE, samplesize = 50, alpha = 0.05)
This yields:
$Tvalue
[,1]
[1,] 11939.99
$p.value
[1] NA
$decision
[1] "Reject null hypothesis"
The null hypothesis of dominance is rejected with a significance level
of 5
The package rtip presented in this paper compares income
distributions in terms of welfare, poverty and inequality. Besides
providing point estimates and confidence intervals for some commonly
used indicators, rtip implements the methodology and techniques
based on the stochastic dominance to compare income distributions. In
particular, we can estimate with rtip the usual Lorenz curve, the
generalized Lorenz curve, the TIP curve of income distributions and test
statistically whether one curve is dominated by another. Although
potential users of rtip may have their own data, the package
allows to load microdata from EU-SILC survey and Spanish Living
Conditions survey and offers different equivalence scales to adjust
measures of income for differences in household composition. A
development version of the rtip package can be found at
GitHub
The authors thank two anonymous referees and the Editor of the journal for their detailed comments which have led to significant improvements in both the content and presentation of the paper. Miguel A. Sordo and Carmen D. Ramos acknowledge the support received from Ministerio de Economía y Competitividad (Spain) under grant MTM2014-57559-P.
Users can create their own load function using the loadLCS
function as
a template (see the example below). The load function must produce a
data frame with the following variable names: DB010, DB020, DB040,
DB090, HX040, HX050, HX090 (see Table 1 to set
properly the variables). In the case of the Spanish Living Conditions
Survey, for instance, the Spanish Statistical Office delivers four files
with many variables. We only require some of them to work with
rtip:
lcs_d_file.csv
:
DB010: a numeric vector containing the year of the survey.
DB020: a factor with one level which is the country considered.
DB030: a numeric vector containing the household ID.
DB040: a factor with as many levels as there are regions in the country.
DB090: a numeric vector containing information about household cross-sectional weight.
lcs_hfile.csv
:
HB010: a numeric vector containing the year of the survey.
HB030: a numeric vector containing the household ID.
HX040: an integer vector containing information about the household size.
HX240: a numeric vector containing information about the equivalised household size. The scale employed is the modified OECD scale.
vhRentaa: a numeric vector containing the total disposable household income.
A function to load these files and variables is the following:
loadLCS <- function(lcs_d_file, lcs_h_file){
dataset1 <- read.table(lcs_d_file, header=TRUE, sep= ',')
dataset2 <- read.table(lcs_h_file, header=TRUE, sep= ',')
check1 <- identical(dataset1$DB010, dataset2$HB010)
check2 <- identical(dataset1$DB030, dataset2$HB030)
if (!check1) {
stop('Different years!')
} else if (!check2) {
stop('You do not have the same identification for homes')
} else {
subdataset1 <- subset(dataset1, select = c("DB010", "DB020","DB030",
"DB040", "DB090"))
subdataset2 <- subset(dataset2, select = c("HB010", "HB030", "HX040",
"HX240", "vhRentaa"))
subdataset2$HX050 <- subdataset2$HX240
subdataset2$HX090 <- subdataset2$vhRentaa/subdataset2$HX240
dataset <- cbind(subdataset1, subdataset2)
dataset <- subset(dataset, select = c("DB010", "DB020","DB040",
"DB090", "HX040", "HX050",
"HX090"))
return(dataset)
}
}
The code required to reproduce the examples in this paper can be downloaded from: https://gist.github.com/AngelBerihuete/7e88d55845044ce04a9e61edcd5954f2.
Econometrics, OfficialStatistics, Optimization, Survival, TimeSeries
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
eusilc
dataset can be found at
https://github.com/AngelBerihuete/rtip/blob/master/data-raw/eusilc_eusilc2.R.[↩]Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Berihuete, et al., "Welfare, Inequality and Poverty Analysis with rtip: An Approach Based on Stochastic Dominance", The R Journal, 2018
BibTeX citation
@article{RJ-2018-029, author = {Berihuete, Angel and Ramos, Carmen Dolores and Sordo, Miguel Angel}, title = {Welfare, Inequality and Poverty Analysis with rtip: An Approach Based on Stochastic Dominance}, journal = {The R Journal}, year = {2018}, note = {https://rjournal.github.io/}, volume = {10}, issue = {1}, issn = {2073-4859}, pages = {328-341} }