In retail location analysis, marketing research and spatial planning, the market areas of stores and/or locations are a frequent subject. Market area analyses consist of empirical observations and modeling via theoretical and/or econometric models such as the Huff Model or the Multiplicative Competitive Interaction Model. The authors’ package MCI implements the steps of market area analysis into R with a focus on fitting the models and data preparation and processing.
A market area (also called trading area, service area or catchment area) is a part of the earth’s surface where the actual or potential customers of a supply location come from. This supply location can be any kind of location which provides goods and/or services and generates a geographically segmented market. In retailing, a supply location can be a single store, a retail agglomeration (planned or unplanned shopping centre) or even an entire city (Löffler 1998; Berman and Evans 2013).
The market area of a store/location can be regarded as the spatial equivalent to its sum of all customers and/or sales. The total customers/sales of a supply location can be determined by summing the customer flows/expenditures from each geographic region in its market area (Rodrigue et al. 2006; Huff and McCallum 2008). For practical reasons, a market area can also be divided by zones of market penetration and/or distance/travel time (Berman and Evans 2013).
Market areas of retail locations result from the consumer spatial shopping behaviour, more precisely, their store choice. Thus, a market area is influenced by many factors such as the transport costs (e.g. distance, travel time) between customers and locations and, of course, the characteristics of the competitors (e.g. perceived "attraction", pricing, image or the opportunity for multi-purpose and comparison shopping). Mostly, the market areas of competing supply locations overlap, which means that they are in spatial competition (Rodrigue et al. 2006; Wieland 2015b).
Traditionally, market area analysis includes the delineation and segmentation of market areas and can be divided into inductive-empirical and deductive-theoretical approaches. The first type consists of constructing market areas based on empirical observations such as point-of-sale (POS) surveys (customer spotting), while in the latter approach this work is done by using mathematical market area models (Löffler 1998). Modern market area analyses are mostly a combination of both, especially when using econometric market area models which are fitted by empirical data (Wieland 2015b).
Market area models can be used in retail location analysis to find new locations, to evaluate the existing outlets or to assess the impact of changes in the competitive landscape (Huff and McCallum 2008; Berman and Evans 2013). Market area analyses are also subject of governmental spatial planning in Germany (Wolf 2012). The econometric models can also be utilized to identify variables influencing consumer decisions and market areas and to check hypotheses about these relations, which means to find out what affects store choice (Wieland 2015b). The usage of market area models can also be transferred to other service locations, such as health services (Jia et al. 2015).
This paper presents two market area models, the Huff Model and the Multiplicative Competitive Interaction (MCI) Model, and their implementation into R by the authors’ package MCI (Wieland 2016). Yet, only the basic Huff Model formula is integrated in R by the packages SpatialPosition (Giraud and Commenges 2016), which can be used especially for graphical visualization, and huff-tools (Pavlis et al. 2014), which combines the basic formulation with GIS functions. In contrast, the emphases of the MCI package lie on 1) fitting the MCI Model and the Huff Model via OLS (ordinary least squares) regression and nonlinear techniques, respectively, and 2) the steps of processing and transforming empirical data to be usable in the models (especially working with interaction matrices).
Almost all market area models have in common that they are based on an attraction/utility function including transport costs and further characteristics of the supply locations which are subjects of a trade-off by the consumers. The dependent variable of these models is the store choice and/or the choice probability and/or the market shares of the stores/locations (Wieland 2015b).
The first retail market area models (Reilly 1929; Converse 1949) were deterministic, which means that each customer origin is assigned completely to one supply location while overlapping market areas are not envisaged. Furthermore only two supply locations can be processed and the explaining variables of the models are not founded theoretically. These gaps were filled by Huff (1962, 1963, 1964) by introducing his probabilistic market area model. An important content enhancement and, at the same time, an econometric transformation of the Huff Model was introduced in the form of the Multiplicative Competitive Interaction Model by Nakanishi and Cooper (1974, 1982).
The basis of the Huff Model is the following multiplicative utility
function with two explanatory variables representing two determinants of
store choice (Huff and Batsell 1975):
The attraction is translated as the size of the location due to the
increasing probability for a "successful" shopping trip on condition
of consumer uncertainty (the greater the locations’ offer, the more
likely is to get the desired goods). The size is operationalized by the
sales area of the locations. But, as the consumers’ decision costs
normally rise with an increasing number of offered goods, the marginal
utility of the locations’ offer decreases which is reflected by a
degressive weighting of size (
The parameter
Derived from the behavioral scientific Luce choice axiom (Luce 1959),
the consumer decision in the Huff Model is regarded as probabilistic.
The probability to choose the alternative
The expected customer/expenditure flows from
The complete market area of location
While working with the Huff Model, the first step is to delineate the
study area itself and to record the focused origins and locations. The
explanatory variables must be observed empirically by mapping the
regarded stores/locations (
All data is stored into a special kind of linear table that is called
interaction matrix and is the basis for all further calculation steps
(see Table 1). The interaction matrix contains
one row for each (potential) spatial interaction between the origins,
Many approaches were developed to estimate the weighting coefficients by
nonlinear iterative techniques. Huff (1962) showed a way to estimate
In many cases, no empirical market shares are available but only
observed total sales of stores or locations,
Consider a location system with
Thus, the expected sales using non-adjusted attractions in the Huff
Model,
Since the algorithm by Güssefeldt (2002) addresses the local level,
there is no fit measure for the global level (which means the model fit
for the complete location system). The global model fit can be
evaluated in different ways, where three fit measures were already used
in the context of optimizing the Huff Model: the mean absolute
percentage error (
Since the real sales are known and on condition that the location
attraction and the sales are related to each other, the attraction
On condition that an interval is known, this function can be
parametrized by calculating the slope using the difference quotient and
the intercept, thereafter (Güssefeldt 2002). Unlike in the mentioned
approach, this function must pass the origin because in the Huff Model
with its multiplicative utility function (Formula 1), an attraction
equal to zero (
Thus, formula 9 has no intercept (
The relations mentioned above can be brought together in an optimization
algorithm for the Huff Model with respect to a location system with
Set a tolerance value,
Calculate the market areas for the location system and the total
sales of the
Calculate the absolute percentage error between the expected and the
observed total sales of location
Calculate the slope of the attraction function by formula 10. Calculate the adjusted attraction via formula 11.
Save the adjusted attraction of location
Repeat steps 2 to 5 for all locations
After the last location
Repeat steps 2 to 7 for the complete location system until the local
optima and/or the global optimum is sufficiently approximated. The
former can be evaluated by the tolerance value (step 3) for every
The Multiplicative Competitive Interaction Model (in short: MCI Model)
is based on the Huff Model but also belongs to the model family of
market share models which were developed in marketing science. Thus, it
can be regarded as a crossover of these two model families
(Cliquet 2013). The fundamental theorem behind market share models is
the following simple relationship between the competitors’
characteristics and their market shares (Cooper and Nakanishi 2010):
Derived from the Huff Model, the MCI Model is explicitly formulated to
regard a market which is segmented into
The market can be subdivided in any kind of submarkets (e.g. customer groups, time periods, geographic areas). When the market is segmented geographically, it is a matter of a Spatial MCI Model, especially when integrating transport costs as an explanatory variable (Huff and McCallum 2008; Cliquet 2013). The MCI Model can also be regarded as a generalization of the Huff Model, while the Huff Model can be considered as a special case of multiplicative competitive interaction model.
The models mentioned above are deterministic (no random variation) and nonlinear models which cannot be estimated directly by common econometric techniques but by iterative algorithms (see the former section) which do not allow statements about the statistical significance of the explanatory variables and other inference statistics. The main breakthrough of the MCI Model is the transformation of the nonlinear structure into a linear stochastic model which can be estimated via OLS (Ordinary Least Squares) regression (Huff and McCallum 2008).
This requires a re-arrangement of the model to be linear in parameters
which is achieved by a multi-step transformation of the variables using
geometric means and logarithms for standardization and linearization
called the log-centering transformation (Nakanishi and Cooper 1974):
Once the variables are transformed, the model function is linear in its parameters and can be processed as a multiple linear regression model to be estimated by OLS regression (Rawlings et al. 1998). Thus, the model allows to test hypotheses about the influence of store/location characteristics (such as sales area, pricing) and transport costs (such as travel time, distance) on the local market shares in the submarkets (customer origins) by interpreting the regression coefficients and their inference statistics (Nakanishi and Cooper 1974, 1982; Huff and McCallum 2008).
After an estimation of the parameters, they can be included as exponents
in the original nonlinear model (Formula 13) to be utilized for market
share/market area predictions (Nakanishi and Cooper 1982; Huff and McCallum 2008). It is also
possible to integrate dummy variables reflecting qualitative information
(such as brands or store chains) or an intercept (if necessary). Since
the former causes problems in the multiplicative attraction/utility
function (multiplication by zero) and the latter is contrary to the
logical consistency requirement, a different retransformation of the
model called the inverse log-centering transformation is required
(Nakanishi and Cooper 1982):
In the first step, a MCI Model analysis requires the formulation of
hypotheses and/or research questions addressing the influence of the
Since the delineation of the study area and the identification of the relevant competing locations have an enormous impact on the results, these definitions should be made corresponding to the LIFO (little in from outside) and LOFI (little out from inside) principles (Huff and McCallum 2008). This means that the majority of shopping interactions should take place within the study area.
The supplier characteristics can be obtained by mapping the relevant
stores/locations and additional research. The market shares,
To calculate the local market shares of shopping trips and/or
expenditures, the individual data must be aggregated on the submarket
level:
Mostly, the observed variables cannot be processed directly: the
log-centering transformation requires every variable to be ratio-scaled,
non-negative and greater than zero. Thus, also market shares equal to
zero,
Interval scaled variables (e.g. consumer judgements in rating scales)
which may contain negative values can be transformed by the zeta-squared
transformation (Cooper and Nakanishi 1983):
Nominal scaled variables (e.g. store chains, brands) cannot be processed directly in the MCI Model but can be transformed into dummy variables which are ignored in the log-centering transformation.
All in all, an econometric market share/market area analysis using the MCI Model on condition of existing research questions and hypotheses contains the following 8 steps:
Define the study area and divide it into
Obtain the relevant variables: shopping trips and/or expenditures on
the individual/household level (
If necessary, correct or transform the variables to match the
requirements of the log-centering transformation: if some
Calculate the market shares,
Store the submarkets, suppliers, local market shares and the
Apply the log-centering transformation (formula 14) to the previously created interaction matrix.
Apply the OLS regression to the log-centering transformed
interaction matrix with
Interpret the model results. For market share predictions, insert the estimated parameters in the nonlinear model using formula 13 or formulae 15 and 16.
As outlined in the former section, market area analysis for retail and service locations requires a mixture of descriptive and inference statistics, iterative optimization and, of course, a lot of processing and preparation of empirical data which may be even more complex (or, at least, more time-consuming) than the models themselves. Except the Huff Model basis formulation which is implemented in R by the packages SpatialPosition and huff-tools none of the mentioned models were integrated in R yet. The former mentioned package is focused on graphical visualization and other spatial interaction models while the latter combines the basic Huff formula with GIS-related functions. The Multiplicative Competitive Interaction Model and the related procedures of data handling are just as little integrated in R as the OLS and nonlinear fitting procedures for the Huff Model and the MCI Model, respectively. The motivation of the presented package MCI is to fill this gap.
The Huff Model and the Multiplicative Competitive Interaction (MCI) Model are implemented in the MCI package which is focused on:
Fitting the mentioned models by empirically observed data using the
MCI linearization (log-centering transformation) combined with OLS
regression (functions mci.fit()
, mci.transmat()
and
mci.transvar()
) and the Huff Model optimization algorithm
described above (functions huff.attrac()
and huff.fit()
).
The steps of data preparation and processing to make empirical data
usable in these models, especially the creation and processing of
interaction matrices used in MCI analyses which is subject of the
function ijmatrix.create()
.
The MCI package also provides tools that can be used for descriptive
analyses of empirical or estimated market areas, such as zoning (see the
function shares.segm()
) and, of course, the basic Huff and MCI
formulations (see the functions mci.shares()
and huff.shares()
,
respectively). The correction of variables to match the MCI requirements
can be done by the functions var.correct()
and var.asdummy()
.
The input of the most functions is required to be a "data.frame"
where
the first function argument is the dataset name, followed by a set of
variable names (columns) each one in double quotation marks and further
function arguments. Also, the output of nearly every function is a
"data.frame"
, except the mci.fit()
function which returns an object
of type "lm"
and model.fit()
which returns a "list"
. Three
functions (huff.decay()
, model.fit()
and shares.total()
) also
provide an optional graphical output.
The package does not import any other packages (except some already implemented functions from stats and graphics, of course) and does not contain any non-R scripts.
The data used in the following examples is distributed over several datasets to demonstrate the several data sources and the components of a market area analysis, respectively. In fact, there is no need to split the working data like this, but, of course, it is recommended since the voluminous data may lead to confusion.
The first example deals with a more descriptive analysis of empirical
market areas obtained by a POS survey (customer spotting technique). The
package-included example dataset shopping1
is a survey conducted at
two supply locations in the east of Karlsruhe, Germany, in May 2016.
The dataset contains 434 surveyed individuals at both locations,
including 410 cases from the main survey and 24 cases from the pretest.
Amongst other things, the respondents were asked about their place of
residence, their shopping preferences in general (last shopping trip
with respect to different goods) and their on-site shopping behaviour
(duration of stay, expenditures). The customers’ origin is stored in the
column resid_code
. The variable POS
indicates the location: "POS1"
is a town centre and thus, an evolved retail agglomeration, while
"POS2"
is an out-of-town planned shopping centre.
It should be noted that a POS survey mostly does not fulfill the requirements of statistical representativity: besides that shopping behaviour differs by weekdays, time periods and, of course, weather, independent from the size and heterogeneity of the sample, the statistical population is unknown.
library(MCI)
data(shopping1)
# The survey dataset
data(shopping2)
# Dataset with distances and travel times
The first step is to filter the dataset since only the interviews are
needed which were conducted at both supply locations simultaneously. We
subset the relevant data and store it into a new working data frame
(shopping1_adj
) which we use to create an interaction matrix of type
"data.frame"
called ijmatrix
using the function ijmatrix.create()
:
shopping1_adj <- shopping1[(shopping1$weekday != 3) & (shopping1$holiday != 1)
& (shopping1$survey != "pretest"),]
# Removing every case from tuesday, holidays and the ones belonging to the pretest
ijmatrix <- ijmatrix.create(shopping1_adj, "resid_code", "POS", "POS_expen")
# Creates an interaction matrix based on the observed frequencies (automatically)
# and the POS expenditures (Variable "POS_expen" separately stated)
ijmatrix
interaction resid_code POS freq_ij_abs freq_i_total p_ij_obs
1 resid1-POS1 resid1 POS1 91 113 0.8053097
2 resid1-POS2 resid1 POS2 22 113 0.1946903
3 resid10-POS1 resid10 POS1 3 7 0.4285714
...
freq_ij_abs_POS_expen freq_i_total_POS_expen p_ij_obs_POS_expen
1 2318.25 3245.25 0.71435174
2 927.00 3245.25 0.28564826
3 30.00 328.00 0.09146341
...
The resulting data frame (rows 1-3 are displayed) contains the
interaction (from each origin to each destination) in the first column
(interaction
) and each origin and destination in the columns
resid_code
and POS
, respectively, both adopted from the column names
in the input dataset. The absolute number of respondents and the total
value of expenditures from the origins, freq_ij_abs
and freq_ij_abs_POS_expen
, respectively.
The freq_i_total*
columns contain the total number/sum of observed
customers/expenditures from each origin, while the p_ij_obs*
are the
local market shares of customers and expenditures, POS_expen
).
The next step is the zoning of the empirical market areas by driving
time using the function shares.segm()
and the travel time stored in
the shopping2
dataset. The routes were calculated in R using the
package ggmap
(Kahle and Wickham 2013). We want to know how much of the customers and
expenditures come from origins with a maximal travel time of 10, 20, 30
and more than 30 minutes:
ijmatrix_dist <- merge(ijmatrix, shopping2, by.x = "interaction", by.y = "route",
all.x = TRUE)
# Adding the distances and travel times
visit <- shares.segm(ijmatrix_dist, "resid_code", "POS", "d_time", "freq_ij_abs",
0, 10, 20, 30)
# Segmentation by travel time using the number of customers/visitors
# Parameters: interaction matrix (data frame), columns with origins and destinations,
# variable to divide in classes, absolute frequencies/expenditures, class segments
expen <- shares.segm(ijmatrix_dist, "resid_code", "POS", "d_time",
"freq_ij_abs_POS_expen", 0, 10, 20, 30)
# Segmentation by travel time using the POS expenditures
visit
d_time_class POS1_abs POS1_rel POS2_abs POS2_rel
1 0-10 108 72.483221 58 40.277778
2 10-20 24 16.107383 62 43.055556
3 20-30 10 6.711409 12 8.333333
4 Other 7 4.697987 12 8.333333
expen
d_time_class POS1_abs POS1_rel POS2_abs POS2_rel
1 0-10 2858.25 76.418689 2900.0 34.354491
2 10-20 541.00 14.464274 4052.4 48.006255
3 20-30 283.00 7.566339 478.0 5.662568
4 Other 58.00 1.550698 1011.0 11.976686
The "data.frame"
output of the used function contains the segment
classes, named based on the input variable name (d_time_class
),
e.g. d_time_class="0-10"
represents the zone up to 10 minutes of
travel time. The further columns contain the sums and the percentage
shares for each location, respectively, both named based on the original
values (e.g. POS1_abs
, POS1_rel
).
We see that, at "POS1"
(town centre) POS2
(out-of-town planned shopping
centre), only
Since the survey was only conducted at these two supply locations but not at other possible competitive locations, it would not make any sense to use these empirical market areas in an econometric market area model. But, of course, it is possible to analyze the distance decay on the level of single locations.
Before fitting and plotting distance decay functions, some corrections of the data have to be made and, since we don’t have local market shares, the dependent variable reflecting the intensity of interaction (surveyed visitors per 1,000 inhabitants) has to be calculated:
ijmatrix_dist$freq_ij_abs_cor <- var.correct(ijmatrix_dist$freq_ij_abs,
corr.mode = "inc", incby = 0.1)
# Correcting the absolute values (frequencies) by increasing by 0.1
data(shopping3)
ijmatrix_alldata <- merge(ijmatrix_dist, shopping3)
# Adding the information about the origins (places of residence) stored in shopping3
ijmatrix_alldata$visitper1000 <- (ijmatrix_alldata$freq_ij_abs_cor /
ijmatrix_alldata$resid_pop2015) * 1000
# Calculating the dependent variable
# visitper1000: surveyed customers per 1,000 inhabitants of the origin
ijmatrix_alldata <- ijmatrix_alldata[(!is.na(ijmatrix_alldata$visitper1000))
& (!is.na(ijmatrix_alldata$d_time)),]
# Removing NAs (data for some outlier origins and routes not available)
POS1 <- ijmatrix_alldata[ijmatrix_alldata$POS == "POS1",]
# Dataset for POS1 (town centre)
POS2 <- ijmatrix_alldata[ijmatrix_alldata$POS == "POS2",]
# Dataset for POS2 (out-of-town shopping centre)
A fit of distance decay functions can be done with the package function
huff.decay()
which compares four types of possible distance decay
functions (linear, power, exponential, logistic). For both locations,
"POS1"
and "POS2"
, we test the influence of distance in km (d_km
)
and travel time in minutes (d_time
) on the dependent variable
(visitper1000
):
huff.decay(POS1, "d_km", "visitper1000")
Model type Intercept p Intercept Slope p Slope R-Squared Adj. R-squared
1 Linear 0.7354 2e-04 -0.0455 0.0038 0.216 0.1936
2 Power 1.9121 0.4434 -1.5267 2e-04 0.333 0.3139
3 Exponential 0.2788 0.0222 -0.1353 0.0044 0.2092 0.1866
4 Logistic 1.6999 0.0164 0.1823 0.0026 0.2311 0.2092
huff.decay(POS1, "d_time", "visitper1000")
Model type Intercept p Intercept Slope p Slope R-Squared Adj. R-squared
1 Linear 1.2112 0 -0.0585 1e-04 0.3516 0.3331
2 Power 34.6019 0.0289 -2.2968 3e-04 0.3211 0.3017
3 Exponential 0.691 0.6322 -0.1432 0.0027 0.23 0.208
4 Logistic 0.2659 0.7795 0.2058 6e-04 0.2893 0.2689
huff.decay(POS2, "d_km", "visitper1000")
Model type Intercept p Intercept Slope p Slope R-Squared Adj. R-squared
1 Linear 0.6734 0 -0.0316 7e-04 0.2812 0.2606
2 Power 5.7372 0.0348 -1.5978 0 0.3863 0.3688
3 Exponential 0.9161 0.8318 -0.1605 0 0.4086 0.3917
4 Logistic -0.4868 0.4212 0.2043 1e-04 0.3444 0.3256
huff.decay(POS2, "d_time", "visitper1000")
Model type Intercept p Intercept Slope p Slope R-Squared Adj. R-squared
1 Linear 0.9353 0 -0.0411 1e-04 0.3706 0.3526
2 Power 213.9932 7e-04 -2.7363 0 0.4213 0.4048
3 Exponential 2.3946 0.14 -0.184 0 0.4191 0.4025
4 Logistic -1.8572 0.031 0.2441 0 0.3835 0.3659
The huff.decay()
function returns a "data.frame"
containing a
summary of the regression results and a plot of the four estimated
functions and the observed values (see Figure 1). The
output shows the model estimators (Intercept
and Slope
), their
p-values (p Intercept
, p Slope
) and the goodness of fit measures
(R-Squared
, Adj. R-Squared
) for every model type.
Note that the nonlinear models are not fitted via nonlinear regression
but by linearization and retransformation, thereafter, since the usually
used distance decay functions mentioned above are intrinsically linear
(Rawlings et al. 1998). Internally, the linearized models are fitted via
lm()
. It is important to say that the usage of logarithmic
transformations has, apart from many advantages of linearization, also
serious drawbacks: with respect to gravity models, Silva and Tenreyro (2006) point
out that OLS fitting of log-transformed data produces significant biases
depending on heteroscedasticity. The same issue is also addressed by
Manning and Mullahy (2001).
Huff (1962) assumed that a power function fits best to describe the
distance decay. Since other types of distance decay functions are
possible, Kanhäußer (2007) uses the explained variance as a criterion
for choosing the best function type. Note that, strictly speaking, the
As expected, the distance/travel time influences the interaction
intensity significantly and, as Huff (1962) states, mostly exponentially.
In the second case, the regression regarding POS1 with the travel time,
surprisingly, the best fit can be achieved by a linear function. But,
all in all, the model fits can be regarded as rather poor, since the
highest
Since, in the Huff Model, the store choices are explained by sales area and travel time, we want to test these hypotheses empirically with respect to grocery shopping trips. First, we have to calculate the dependent variable (market shares) and to link it to the explanatory variables in an interaction matrix, then we need to apply the log-centering transformation and, finally, we have to fit the MCI Model.
As in the former example, we use the shopping1
dataset which also
contains questions about the general shopping behaviour. The respondents
were asked about the destination of their last grocery shopping trip
(gro_purchase_code
) and the related expenditures
(gro_purchase_expen
).
We have to clean our working dataset from respondents living outside our study area, since we want to analyze the shopping patterns of the grocery shoppers in the eastern districts of Karlsruhe:
data(shopping1)
# Survey dataset
data(shopping3)
# Dataset containing information about the city districts
data(shopping4)
# Dataset containing the grocery stores
shopping1_KAeast <- shopping1[shopping1$resid_code %in%
shopping3$resid_code[shopping3$KA_east == 1],]
# Extracting only inhabitants of the eastern districts of Karlsruhe
From the adjusted dataset, we create an interaction matrix using the
function ijmatrix.create()
with default values (no further adjusting
except ignoring NA
values). The calculation includes the shopping trip
frequency that is counted automatically and the expenditures that must
be specified separately by stating the variable containing the
individual trip expenditures, gro_purchase_expen
. The interaction
matrix is stored in a new "data.frame"
, ijmatrix
.
ijmatrix <- ijmatrix.create(shopping1_KAeast, "resid_code", "gro_purchase_code",
"gro_purchase_expen")
ijmatrix
interaction resid_code gro_purchase_code freq_ij_abs freq_i_total p_ij_obs
1 resid1-ALDI1 resid1 ALDI1 10 186 0.053763441
2 resid1-ALDI11 resid1 ALDI11 0 186 0.000000000
3 resid1-ALDI2 resid1 ALDI2 0 186 0.000000000
...
freq_ij_abs_gro_purchase_expen freq_i_total_gro_purchase_expen
1 420.0 5270.0
2 0.0 5270.0
3 0.0 5270.0
...
p_ij_obs_gro_purchase_expen
1 0.0796963947
2 0.0000000000
3 0.0000000000
...
The interaction matrix (rows 1-3 are displayed) shows the interaction
(origin-destination) in the first column interaction
and the origins
and destinations in the columns resid_code
and gro_purchase_code
,
respectively, adopted from the column names in the input dataset
shopping1_KAeast
. The absolute number of respondents and the total
value of expenditures from the origins, freq_ij_abs
and freq_ij_abs_POS_expen
, respectively.
The names are set automatically based on the original variable names
(e.g. POS_expen
). The freq_i_total*
columns contain the total
number/sum of observed customers/expenditures from each origin, while
the p_ij_obs*
are the local market shares of customers and
expenditures,
Next, we calculate the total customers and expenditures and the
corresponding over-all market shares, respectively, using the function
shares.total()
:
shares.total(ijmatrix, "resid_code", "gro_purchase_code", "p_ij_obs",
"freq_i_total")
# Total values for the shopping trips
suppliers_single sum_E_j share_j
1 ALDI1 11 0.039426523
2 ALDI11 1 0.003584229
3 ALDI2 3 0.010752688
4 ALDI4 1 0.003584229
...
shares.total(ijmatrix, "resid_code", "gro_purchase_code",
"p_ij_obs_gro_purchase_expen", "freq_i_total_gro_purchase_expen")
# Total values for the shopping expenditures
suppliers_single sum_E_j share_j
1 ALDI1 470.0 0.0492378608
2 ALDI11 60.0 0.0062856844
3 ALDI2 87.0 0.0091142423
4 ALDI4 80.0 0.0083809125
...
The resulting tables (rows 1-4 are displayed) contain 42 suppliers: in
the first row, the store "ALDI1"
has a share of about "ALDI11"
).
Thus, this interaction matrix cannot be processed in the MCI Model since it would not pass the log-centering transformation. Consequently, the data has to be “cleaned” distinctly by creating a corrected and simplified interaction matrix. Only stores obtained more than twice are incorporated and the absolute values are increased by a small constant of 0.1 before calculating the shares.
ijmatrix_adj <- ijmatrix.create(shopping1_KAeast, "resid_code",
"gro_purchase_code", "gro_purchase_expen",
remSing = TRUE, remSing.val = 1,
remSingSupp.val = 2, correctVar = TRUE,
correctVar.val = 0.1)
# Removing singular instances/outliers (remSing = TRUE) incorporating
# only suppliers which are at least obtained three times (remSingSupp.val = 2)
# Correcting the values (correctVar = TRUE)
# by adding 0.1 to the absolute values (correctVar.val = 0.1)
There are still some observations that have to be excluded because, in
the survey, any kind of grocery shopping trip was inquired, including
non-relevant stores and shopping channels (such as bakeries, health food
shops and even the local weekly market). There has been some incomplete
answers as well (gro_purchase_code="X_INCOMPLETE_STORE"
.) Thus, the
interaction matrix has to be adjusted again:
ijmatrix_adj <- ijmatrix_adj[(ijmatrix_adj$gro_purchase_code != "REFORMHAUSBOESER") &
(ijmatrix_adj$gro_purchase_code != "WMARKT_DURLACH") &
(ijmatrix_adj$gro_purchase_code != "X_INCOMPLETE_STORE"),]
# Remove non-regarded observations
Now, the interaction matrix (rows 1-3 are displayed) consists of 11
customer origins and 11 suppliers (121 rows). Structured as mentioned
above, there are no zero values anymore and the absolute values
(freq_ij_abs
and freq_ij_abs_POS_expen
) are increased by 0.1:
ijmatrix_adj
interaction resid_code gro_purchase_code freq_ij_abs freq_i_total
1 resid1-ALDI1 resid1 ALDI1 10.1 172.4
2 resid1-ALDI2 resid1 ALDI2 0.1 172.4
3 resid1-CAP1 resid1 CAP1 18.1 172.4
...
p_ij_obs freq_ij_abs_gro_purchase_expen freq_i_total_gro_purchase_expen
1 0.0585846868 420.1 4745.4
2 0.0005800464 0.1 4745.4
3 0.1049883991 224.1 4745.4
...
p_ij_obs_gro_purchase_expen
1 8.852784e-02
2 2.107304e-05
3 4.722468e-02
...
In the next step, we have to add the travel times stored in the dataset
shopping2
(the route calculation was made in R using the ggmap
package) and the grocery store characteristics (sales area in sqm and
store chain, collected in June 2016 subsequent to the survey) from the
shopping4
dataset:
ijmatrix_dist <- merge (ijmatrix_adj, shopping2, by.x="interaction", by.y="route")
# Include the distances and travel times (shopping2)
ijmatrix_alldata <- merge (ijmatrix_dist, shopping4, by.x = "gro_purchase_code",
by.y = "location_code")
# Adding the store information (shopping4)
The next step is to apply the necessary log-centering transformation to
the interaction matrix. The function mci.transmat()
processes this
transformation with a given number of MCI variables. The functions
output is a "data.frame"
containing the transformed variables (ready
for OLS regression):
ijmatrix_transf <- mci.transmat(ijmatrix_alldata, "resid_code", "gro_purchase_code",
"p_ij_obs", "d_time", "salesarea_qm")
ijmatrix_transf
resid_code gro_purchase_code p_ij_obs_t d_time_t salesarea_qm_t
1 resid1 ALDI1 0.5586409 0.060847455 -0.193400118
14 resid1 ALDI2 -1.4456805 -0.008788473 -0.193400118
23 resid1 CAP1 0.8119981 -0.091762709 -0.545582636
...
109 resid29 REWE1 -0.1893441 -0.060611802 -0.001514591
119 resid29 TREFF1 -0.1893441 -0.027694956 -0.169919022
6 resid3 ALDI1 -0.2768831 0.058442499 -0.193400118
...
In the transformed interaction matrix (2 x 3 rows are displayed), the
column names of the origins (resid_code
) and destinations
(gro_purchase_code
) are adopted from the columns in the input dataset.
The metric MCI variables are marked with a "_t"
to indicate that they
were transformed (e.g. d_time_t
is the log-centering transformation of
d_time
which contains the travel time,
To combine transformation and fitting in one step, we use the function
mci.fit()
whose parameters are equal to those in mci.transmat()
except that the column containing the market shares must be stated as
the first variable. The default is a no-intercept model (to be set by
the logical argument origin
with default TRUE
). The function returns
an object of type "lm"
which can be accessed via summary()
:
mci_trips <- mci.fit(ijmatrix_alldata, "resid_code", "gro_purchase_code", "p_ij_obs",
"d_time", "salesarea_qm")
# shares: "p_ij_obs", explanatory variables: "d_time", "salesarea_qm"
summary(mci_trips)
Call:
lm(formula = mci_formula, data = mciworkfile)
Residuals:
Min 1Q Median 3Q Max
-1.27457 -0.28725 -0.02391 0.32163 1.29351
Coefficients:
Estimate Std. Error t value Pr(>|t|)
d_time_t -1.2443 0.2319 -5.367 4.02e-07 ***
salesarea_qm_t 0.9413 0.1158 8.132 4.59e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4458 on 119 degrees of freedom
Multiple R-squared: 0.4603, Adjusted R-squared: 0.4512
F-statistic: 50.74 on 2 and 119 DF, p-value: < 2.2e-16
The output can be interpreted like any other linear model fitted by
lm()
. The results show a strong significant influence of the
explanatory variables (travel time and sales area) on the observed local
market shares (both
For the interpretation of the explained variance it is noteworthy that
the common formulation of summary()
returns a modified
But the market shares with respect to customer flows are not to be
confused with market shares of expenditures: especially in grocery
shopping, there are different kinds of shopping trips, such as less
frequent major trips with high efforts and expenditures and more
frequent and fast fill-in trips with low expenditures (Reutterer and Teller 2009).
Thus, the MCI analysis is repeated with the shares of expenditures
(p_ij_obs_gro_purchase_expen
) as dependent variable:
mci_expen <- mci.fit(ijmatrix_alldata, "resid_code", "gro_purchase_code",
"p_ij_obs_gro_purchase_expen", "d_time", "salesarea_qm")
summary(mci_expen)
Call:
lm(formula = mci_formula, data = mciworkfile)
Residuals:
Min 1Q Median 3Q Max
-2.07495 -0.61794 -0.07452 0.70263 2.60085
Coefficients:
Estimate Std. Error t value Pr(>|t|)
d_time_t -2.3788 0.4517 -5.267 6.26e-07 ***
salesarea_qm_t 2.0409 0.2255 9.051 3.30e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8683 on 119 degrees of freedom
Multiple R-squared: 0.4954, Adjusted R-squared: 0.487
F-statistic: 58.43 on 2 and 119 DF, p-value: < 2.2e-16
As in the first model, both predictors are highly significant, but, in contrast, the impact of size (sales area) is also superlinear. The model fit is a little better than in the first model, what may also be explained by a smaller bias based on the variable correction above: due to their dimensions, the increase of the absolute values by 0.1 has a smaller impact on the expenditures than on the shopping trips.
The utility/attraction function (or, more precisely, its deterministic
component) with respect to the shopping trips (first model) can be
written as:
In consequence, the utility function is included in the Huff/MCI probability function. Thus, the local market shares of shopping trips can be estimated by:
Analogously, the utility/attraction function and the market
share/probability function of the expenditures (second model) are:
A market share prediction can be done using the functions mci.shares()
and huff.shares()
. These functions can be used similarly but differ in
the formulation of the utility/attraction function: according to the
Huff Model, huff.shares()
allows only two explanatory variables
(attraction/size and transport costs) but three types of weighting
function for each variable (power, exponential, logistic). The function
mci.shares()
is able to process any number of variables but only using
the power function from the MCI Model either retransformed or
transformed (inverse log-centering transformation).
We predict market shares using the function mci.shares()
and the
estimations from the second model (expenditures). The variables and
their weightings are function arguments which have to be stated one
after another (variable1, weighting1, variable2, weighting2
, ...).
The estimated interaction matrix containing the predicted shares is
stored in a new dataset, expen
:
expen <- mci.shares(ijmatrix_alldata, "resid_code", "gro_purchase_code",
"salesarea_qm", 2.0409, "d_time", -2.3788)
# MCI market share prediction with two variables
# salesarea_qm (weighting power function with exponent equal to 2.0409)
# d_time (weighting power function with exponent equal to -2.3788)
expen
gro_purchase_code interaction resid_code
1 ALDI1 resid1-ALDI1 resid1
14 ALDI2 resid1-ALDI2 resid1
23 CAP1 resid1-CAP1 resid1
...
d_time salesarea_qm storetype_dc store_chain
1 5.4 900 1 Aldi
14 4.6 900 1 Aldi
23 3.8 400 0 Edeka
...
U_ij sum_U_ij p_ij
1 1.936847e+04 9682538 2.000350e-03
14 2.836255e+04 9682538 2.929247e-03
23 8.537978e+03 9682538 8.817913e-04
...
The interaction matrix (rows 1-3 are displayed) contains the submarkets
(resid_code
), the suppliers (gro_purchase_code
), the interaction
between them (interaction
), the explanatory variables (d_time
,
salesarea_qm
) and the steps of calculation, named according to the
Huff Model (U_ij
, sum_U_ij
) and resulting in the local market shares
stored in the last column (p_ij
). The sum of p_ij
is equal to the
sum of submarkets, since, on the submarket level, the local market
shares sum up to one.
Before using the MCI Model for further market share predictions, the validity of the model should be improved since store choices are, of course, not only influenced by size and transport costs. Especially in grocery retailing, there is a great heterogeneity between the store formats (supermarket, discounter etc.) and the store chains, such as concerning the image or price level of a chain. These more qualitative differences should be reflected in a grocery store market area model. According to Wieland (2015b), the model above is extended with dummy variables reflecting the grocery store chain.
The chains are already stored in the interaction matrix (column
store_chain
, adopted from shopping4
). As they are nominal scaled
variables in character
format, they have to be converted to dummy
variables using the function var.asdummy()
which returns a new
"data.frame"
containing corresponding dummy variables (0
, 1
) named
automatically based on the original characteristics and marked with
"_DUMMY"
. Since they are in the same order, they can be directly
attached:
chain <- var.asdummy(ijmatrix_alldata$store_chain)
# Converting the character vector (column store_chain) to dummy variables
# and storing in a new data frame
chain
Aldi_DUMMY Edeka_DUMMY Lidl_DUMMY Netto_DUMMY Real_DUMMY Rewe_DUMMY Treff 3000_DUMMY
1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 0
...
66 0 1 0 0 0 0 0
67 0 0 1 0 0 0 0
...
ijmatrix_alldata <- cbind(ijmatrix_alldata, chain)
# Add dummy dataset to interaction matrix
In the next step, we repeat the fitting of the second MCI Model (expenditures) including the new dummy variables. Since one dummy is explained by the content of all the others, the last dummy is not used:
mci_expen2 <- mci.fit(ijmatrix_alldata, "resid_code", "gro_purchase_code",
"p_ij_obs_gro_purchase_expen", "d_time", "salesarea_qm",
"Aldi_DUMMY", "Edeka_DUMMY", "Lidl_DUMMY", "Netto_DUMMY",
"Real_DUMMY", "Rewe_DUMMY")
# Same model as above with additional dummy variables
summary(mci_expen2)
Call:
lm(formula = mci_formula, data = mciworkfile)
Residuals:
Min 1Q Median 3Q Max
-2.1601 -0.4338 -0.1041 0.2561 2.5342
Coefficients:
Estimate Std. Error t value Pr(>|t|)
d_time_t -2.56243 0.41753 -6.137 1.28e-08 ***
salesarea_qm_t 1.31622 0.30754 4.280 3.94e-05 ***
Aldi_DUMMY -0.05658 0.17995 -0.314 0.753763
Edeka_DUMMY 0.11873 0.12051 0.985 0.326637
Lidl_DUMMY -0.59177 0.24441 -2.421 0.017060 *
Netto_DUMMY -0.19785 0.25719 -0.769 0.443330
Real_DUMMY 1.32882 0.33093 4.015 0.000107 ***
Rewe_DUMMY -0.38429 0.24012 -1.600 0.112299
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7955 on 113 degrees of freedom
Multiple R-squared: 0.5978, Adjusted R-squared: 0.5694
F-statistic: 21 on 8 and 113 DF, p-value: < 2.2e-16
The independent variables d_time_t
and salesarea_qm_t
are still
statistically significant (both Lidl_DUMMY
(Real_DUMMY
(
As this new model includes dummies, it would not make any sense to
insert the estimated parameters in the multiplicative MCI function: if
one dummy variable is equal to zero, the complete term is equal to zero.
Thus, in this case, it is necessary to use the inverse log-centering
transformation which leads to the following transformed
utility/attraction function:
We repeat the market shares prediction including the dummies and their
weights. The inverse log-centering transformation, which is required
here, can be used in the function mci.shares()
by stating the function
parameter mcitrans
to "ilc"
(default: "lc"
). In the next step, the
total expenditures for each grocery store and the corresponding over-all
shares are computed by the function shares.total()
:
expen2 <- mci.shares(ijmatrix_alldata, "resid_code", "gro_purchase_code",
"salesarea_qm", 1.31622, "d_time", -2.56243, "Lidl_DUMMY", -0.59177,
"Real_DUMMY", 1.32882, mcitrans = "ilc")
# MCI market share prediction with four variables
# (ratio-scaled variables are log-centering transformed)
# salesarea_qm (multiplicative weighting with factor equal to 1.31622)
# d_time (multiplicative weighting with factor equal to -2.56243)
# Lidl_DUMMY (multiplicative weighting with factor equal to -0.59177)
# Real_DUMMY (multiplicative weighting with factor equal to 1.32882)
shares.total(expen2, "resid_code", "gro_purchase_code", "p_ij",
"freq_ij_abs_gro_purchase_expen")
# Expected total sales and shares based on the observed local
# market potential (sum of all obtained expenditures for each origin)
suppliers_single sum_E_j share_j
1 ALDI1 16.029089 0.008058517
2 ALDI2 2.157771 0.001084805
3 CAP1 7.101650 0.003570307
4 EDEKA1 790.385662 0.397361072
5 EDEKA2 34.432270 0.017310592
6 EDEKA3 25.098495 0.012618099
7 LIDL1 40.925354 0.020574946
8 NETTO1 2.063759 0.001037541
9 REAL1 1019.935563 0.512765741
10 REWE1 35.451053 0.017822778
11 TREFF1 15.506130 0.007795602
The result of shares.total
is a "data.frame"
containing the
suppliers (suppliers_single
), the total values (sum_E_j
) and the
corresponding over-all market shares share_j
, where the sum of all
share_j
is equal to one. Here, the stores with the biggest shares of
customer expenditures are "REAL1"
("EDEKA1"
(
Now, we analyze the impact of a change in the competitive environment:
the biggest grocery store in the study area is Real
(gro_purchase_code = "REAL1"
). On condition that the sales area of
this store is increased by 10 % (because of an increase of offered
goods), how will, ceteris paribus, the purchasing power flows with
respect to this supplier and to all other stores change? First, we
update the regarded variable in the existing interaction matrix:
ijmatrix_alldata[ijmatrix_alldata$gro_purchase_code == "REAL1",]$salesarea_qm <- 8525
# Replacing the sales area of REAL1 with a new value: 8525 sqm (increase of 10 %)
Now, we only have to repeat the MCI market share prediction above:
expen2_new <- mci.shares(ijmatrix_alldata, "resid_code",
"gro_purchase_code", "salesarea_qm", 1.31622,
"d_time", -2.56243,
"Lidl_DUMMY", -0.59177, "Real_DUMMY", 1.32882,
mcitrans = "ilc")
shares.total(expen2_new, "resid_code", "gro_purchase_code", "p_ij",
"freq_ij_abs_gro_purchase_expen")
suppliers_single sum_E_j share_j
1 ALDI1 15.787692 0.007886640
2 ALDI2 2.097968 0.001048026
3 CAP1 6.995287 0.003494451
4 EDEKA1 778.019361 0.388654556
5 EDEKA2 33.873811 0.016921444
6 EDEKA3 24.365708 0.012171732
7 LIDL1 40.262537 0.020112891
8 NETTO1 2.032720 0.001015432
9 REAL1 1048.206517 0.523624808
10 REWE1 34.910285 0.017439208
11 TREFF1 15.275568 0.007630811
The results in the total sales/shares table show an impact of the
trading-up in the regarded hypermarket: the over-all share of Real
increases from EDEKA1
decreases from
It should be noticed that the survey in the shopping1
dataset is not
statistically representative, since it is a POS survey not regarding
customers shopping at other supply locations. Thus, the analyses and
results shown here should not be overinterpreted but regarded as an
example how to use the MCI-related functions in the MCI package.
The final example deals with the problem of fitting the Huff Model on condition that no empirically observed store choices, market shares and market areas, respectively, are available. The aim of the analysis is to estimate the market areas of the grocery stores in Freiburg, Germany, based on their total annual sales and their "attraction" (size).
The dataset Freiburg1
contains the preliminary stage of a Huff
interaction matrix, containing the origins (statistical districts of
Freiburg, column district
), codes representing the grocery stores
(store
), the sales area of these stores in sqm (salesarea
) and the
street distances between the origins and the stores (distance
). The
grocery stores and their characteristics were obtained in spring 2015
(Wieland 2015a), while the street distances were calculated via network
analysis in GRASS (GRASS Development Team 2015) using OpenStreetMap vector data.
data(Freiburg1)
# Distance matrix and sales area
First, we try to approximate the market areas using a conventional Huff
Model calculation using the function huff.shares()
. This function is
similar to the mci.shares()
function and allows two explanatory
variables (size, transport costs) which can be weighted by given
parameters in a power, exponential or logistic function. The function
type (power) and the exponents are set corresponding to the default
parameters of the Huff Model:
huff_mat <- huff.shares (Freiburg1, "district", "store", "salesarea", "distance")
# Market area estimation using the Huff Model with standard parameters
# (gamma = 1, lambda = -2)
In the next step, the total annual sales of the grocery stores are
computed based on the estimated interaction probabilities/local market
shares and the grocery purchasing power potential on the district level
in EUR (calculated based on the local population size and the national
average expenditures for groceries), stored in the dataset Freiburg2
:
data(Freiburg2)
# Grocery purchasing power on the city district level
huff_mat_pp <- merge (huff_mat, Freiburg2)
# Adding the purchasing power data for the city districts
huff_total <- shares.total (huff_mat_pp, "district", "store", "p_ij", "ppower")
# Total expected sales and shares
huff_total
suppliers_single sum_E_j share_j
1 1 4057591 0.010759819
2 10 5809861 0.015406444
3 11 1289847 0.003420383
4 12 7103210 0.018836115
5 13 3476313 0.009218400
...
The new dataset huff_total
contains the expected total annual sales in
EUR (sum_E_j
) and the corresponding over-all market shares
(share_j
). Since the “real” annual sales are known (calculated by
chain-based average retail space productivity, stored in the dataset
Freiburg3
), we compare the expected values to the observed values
using the help function model.fit()
:
data(Freiburg3)
# Annual sales of the grocery stores
huff_total_control <- merge(huff_total, Freiburg3, by.x = "suppliers_single",
by.y = "store")
model.fit(huff_total_control$annualsales, huff_total_control$sum_E_j, plotVal = TRUE)
$resids_sq_sum
[1] 2.125162e+15
$pseudorsq
[1] 0.5128422
$globerr
[1] 0.5210329
$mape
[1] 0.6383766
The function model.fit()
returns a "list"
with four entries
containing the following goodness-of-fit measures: the sum of the
squared residuals (resids_sq_sum
), a Pseudo-pseudorsq
), the global error (globerr
) and the MAPE (mape
), as
described in the Huff Model section. Note that these fit measures are
closely related to each other. Optionally, the function returns a plot
to compare the observed and expected values graphically (see the plot in
Figure 2, top left).
Obviously, the fit in this Huff Model market area estimation using the default values is quite poor as can be seen from the fit measures and, of course, from the plot. E.g. the sales of the most high-selling grocery store is extremely underestimated. From this, one can conclude that also the estimated market shares/market areas will not reflect the reality to some extent.
Thus, we have to use the optimization algorithm as discussed in the Huff
Model section. The function huff.attrac()
provides one iteration of
this algorithm, requiring the interaction matrix, the local market
potential and the total values (e.g. sales) of the suppliers. The
tolerance value to accept a difference of individual total sales or not
is set equal to 5, while the function output contains the estimated
total values (Internally, huff.attrac()
uses the function
shares.total()
). We apply the function to the three datasets
(Freiburg1
, Freiburg2
and Freiburg3
):
huff_total_opt1 <- huff.attrac(Freiburg1, "district", "store", "salesarea", "distance",
lambda = -2, dtype= "pow", lambda2 = NULL,
Freiburg2, "district", "ppower",
Freiburg3, "store", "annualsales",
tolerance = 5, output = "total")
# One-time optimization (one iteration) with an accepted difference of +/- 5 %
# Output of total sales/shares
Note that this calculation includes a great many calculation steps while working with 63 stores and 42 statistical districts. In a test environment (computer with CPU: Intel Core i3-2100, RAM: 4.00 GB, OS: Windows 7 64bit), the command above takes about 43 seconds.
huff_total_opt1
suppliers_single sum_E_j share_j total_obs diff attrac_new_opt
1 1 7097226 0.018820246 7210720 113494.39 1329.26622
2 10 5043058 0.013373057 5600000 556941.50 1400.00000
3 11 1136322 0.003013270 1000000 -136321.73 193.82148
4 12 2849360 0.007555862 2776000 -73360.04 271.22161
...
Since the option output
was set to "total"
, the huff.attrac()
function returns a "data.frame"
(rows 1-4 are displayed) that contains
a comparison between the observed (total_obs
) and the expected total
values (sum_E_j
) as well as the corresponding difference (diff
) and
the estimated new attractions (attrac_new_opt
). If output
is set to
"matrix"
, the function returns an interaction matrix equal to the
output of mci.shares()
or huff.shares()
.
Next, the validity analysis of the new model using model.fit()
is
repeated as described above:
model.fit(huff_total_opt1$total_obs, huff_total_opt1$sum_E_j, plotVal = TRUE)
# total_obs = observed total values, originally from dataset Freiburg3
# sum_E_j = expected total values
$resids_sq_sum
[1] 2.901841e+14
$pseudorsq
[1] 0.9334801
$globerr
[1] 0.1564878
$mape
[1] 0.1620126
The goodness-of-fit measures and the plot (see the plot in Figure
2, top right) reveal a much better fit: the
Pseudo-
To extend this optimization algorithm to a given number of iterations,
the MCI package provides the function huff.fit()
. First, we run two
iterations decreasing the tolerance value equal to one which means a
more strict check. Since this optimization takes some time, we enable
the printing of status messages. The output of the function huff.fit()
is equal to the output of huff.attrac()
and can be processed in the
same way. As above, the estimated total sales are compared to the
observed sales using model.fit()
:
huff_total_opt2 <- huff.fit(Freiburg1, "district", "store", "salesarea", "distance",
lambda = -2, dtype= "pow", lambda2 = NULL,
Freiburg2, "district", "ppower",
Freiburg3, "store", "annualsales",
tolerance = 1, iterations = 2, output = "total",
show_proc = TRUE)
# 2 iterations of the optimization algorithm with an accepted difference of +/- 1 %
# Output of total sales/shares, stored in dataset huff_total_opt2
# printing of status messages:
Iteration 1 of 2 ...
Processing location 1 ...
...
model.fit(huff_total_opt2$total_obs, huff_total_opt2$sum_E_j, plotVal = TRUE)
# total_obs = observed total values, originally from dataset Freiburg3
# sum_E_j = expected total values
$resids_sq_sum
[1] 4.806282e+13
$pseudorsq
[1] 0.9889824
$globerr
[1] 0.05946104
$mape
[1] 0.0618133
To run the algorithm twice with a smaller tolerance leads to a much
better fit: while Pseudo-
In the test environment mentioned above, this huff.fit
operation takes
about one and a half minute (89 seconds). Consequently, to extend the
procedure to more iterations means also an increase of calculating time.
We repeat the algorithm with 10 iterations and compare the results once
more:
huff_total_opt10 <- huff.fit(Freiburg1, "district", "store", "salesarea", "distance",
lambda = -2, dtype= "pow", lambda2 = NULL,
Freiburg2, "district", "ppower",
Freiburg3, "store", "annualsales",
tolerance = 1, iterations = 10, output = "total",
show_proc = TRUE)
# 10 iterations of the optimization algorithm with an accepted difference of +/- 1 %
# Output of total sales/shares, stored in dataset huff_total_opt10
# with printing of status messages
model.fit(huff_total_opt10$total_obs, huff_total_opt10$sum_E_j, plotVal = TRUE)
$resids_sq_sum
[1] 185646134781
$pseudorsq
[1] 0.9999574
$globerr
[1] 0.004508996
$mape
[1] 0.004405252
This operation is, of course, very time-consuming: in the test
environment mentioned above, ten iterations of the algorithm took about
five minutes (308 seconds), while the time of an iteration decreases
since the local fits (for each store/locations) also gets better with
each iteration. The final result is a nearly perfect fit with a MAPE of
about
The function huff.fit()
allows an evaluation of the iterations by
returning the corresponding global fit measures for each step
(Internally, huff.fit()
uses the model.fit()
function). We set the
function parameter output = "diag"
to make the function return a
"data.frame"
containing the iteration statistics:
huff.fit(Freiburg1, "district", "store", "salesarea", "distance", lambda = -2,
dtype= "pow", lambda2 = NULL,
Freiburg2, "district", "ppower",
Freiburg3, "store", "annualsales",
tolerance = 1, iterations = 10, output = "diag", show_proc = TRUE)
iterations_count resids_sq_sum pseudorsq globerr mape
1 1 2.886177e+14 0.9338392 0.155826392 0.162228371
2 2 4.806282e+13 0.9889824 0.059461039 0.061813302
3 3 9.936652e+12 0.9977222 0.031229899 0.032170472
4 4 2.318130e+12 0.9994686 0.018162793 0.019737125
5 5 6.378060e+11 0.9998538 0.010563610 0.011844831
6 6 5.567110e+11 0.9998724 0.007654992 0.007507305
7 7 1.776295e+11 0.9999593 0.005357527 0.005444095
8 8 1.778885e+11 0.9999592 0.004689937 0.004669445
9 9 1.828774e+11 0.9999581 0.004587815 0.004541543
10 10 1.856461e+11 0.9999574 0.004508996 0.004405252
Obviously, all goodness-of-fit measures show an improvement of the model
fit with each iteration. But there is no noticeable improvement after
the eighth iteration. All in all, the accuracy of the results are
determined by the error tolerance (parameter tolerance
) and the number
of iterations (parameter iterations
) which are stated by the user who
has to trade off accuracy against computing time.
Model-based market area analysis for retail and service locations includes 1) theoretical considerations, 2) collection of empirical data regarding the locations and the customers, 3) data processing and, in some cases, data transformation, 4) fitting of the used models and 5) the market area prediction itself. While the first two aspects can not be offered by a statistical software, the presented package MCI provides several functions to process the other mentioned steps by implementing the Huff Model and the Multiplicative Competitive Interaction (MCI) Model into R.
The focus of the package is 1) on model fitting via OLS regression (MCI Model) or iterative optimization (Huff Model) and 2) on data preparation, while both aspects are closely related (especially in the MCI Model). As a consequence, some other related substeps are not considered.
Though the package is designed to allow a data exchange with a GIS
(e.g. ijmatrix.crosstab()
to prepare an interaction matrix for a map
visualization), it does not provide any GIS functions. Thus, the
GIS-related steps of market area analysis have to be borrowed from other
packages: especially distance or travel time calculations are needed
what can be done using GIS-orientated packages like ggmap which
utilizes the Google API for routing (and was used in the example dataset
shopping2
), osmar
(Eugster and Schlesinger 2013) and osrm (Giraud 2016)
providing similar functions with respect to OpenStreetMap, or the
huff-tools package mentioned above.
In econometric market area analyses using the MCI Model, also further
diagnostics of the model are recommended, especially with respect to the
possible violations of the OLS-related assumptions, such as
heteroscedasticity or multicollinearity. A discussion of specific
problems and opportunities regarding logarithmic transformations can be
found in Silva and Tenreyro (2006) and Manning and Mullahy (2001). The violation of OLS
assumptions and possible solutions have beed addressed in MCI studies
several times (Nakanishi and Cooper 1974; Kubis and Hartmann 2007; Tihi and Oruc 2012; Wieland 2015b),
but there are no corresponding extensions implemented in MCI yet.
Other packages may be helpful for this analyses: since the implemented
functions return "data.frame"
(mci.transmat()
) and "lm"
objects
(mci.fit()
), respectively, this data can be processed e.g. in
car (Fox and Weisberg 2011) for further
diagnostics.
Another remaining step not yet provided by the presented package is the combination of the MCI Model with the Geographically Weighted Regression (GWR) which has been already used to identify spatial nonstationarity in the estimated parameters (Suárez-Vega et al. 2015; Wieland 2015b). The GWR is implemented in R by the package spgwr (Bivand and Yu 2015).
The author would like to thank the two anonymous reviewers for providing useful feedback that helped to improve the paper.
MCI, SpatialPosition, ggmap, osmar, osrm, car, spgwr
Econometrics, Finance, MixedModels, Spatial, TeachingStatistics, WebTechnologies
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Wieland, "Market Area Analysis for Retail and Service Locations with MCI", The R Journal, 2017
BibTeX citation
@article{RJ-2017-020, author = {Wieland, Thomas}, title = {Market Area Analysis for Retail and Service Locations with MCI}, journal = {The R Journal}, year = {2017}, note = {https://rjournal.github.io/}, volume = {9}, issue = {1}, issn = {2073-4859}, pages = {298-323} }