XML2R is a framework that reduces the effort required to transform XML content into tables in a way that preserves parent to child relationships. pitchRx applies XML2R’s grammar for XML manipulation to Major League Baseball Advanced Media (MLBAM)’s Gameday data. With pitchRx, one can easily obtain and store Gameday data in a remote database. The Gameday website hosts a wealth of XML data, but perhaps most interesting is PITCHf/x. Among other things, PITCHf/x data can be used to recreate a baseball’s flight path from a pitcher’s hand to home plate. With pitchRx, one can easily create animations and interactive 3D scatterplots of the baseball’s flight path. PITCHf/x data is also commonly used to generate static plots of baseball locations at the moment they cross home plate. The term strike-zone plot, often used for such plots, can also refer to a plot of event probabilities over the same region. pitchRx provides an easy and robust way to generate strike-zone plots using the ggplot2 package.
PITCHf/x is a general term for a system that generates a series of 3D measurements of a baseball’s path from a pitcher’s hand to home plate (Alt and White 2008). In an attempt to estimate the location of the ball at any time point, a quadratic regression model with nine parameters (defined by the equations of motion for constant linear acceleration) is fit to each pitch. Studies with access to the actual measurements suggest that this model is quite reasonable – especially for non-knuckleball pitches (Nathan 2008). That is, the fitted path often provides a reasonable estimate (within a couple of inches) of the actual locations. Unfortunately, only the parameter estimates are made available to the public. The website that provides these estimates is maintained by MLBAM and hosts a wealth of other baseball related data used to inform MLB’s Gameday webcast service in near real time.
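As a rough illustration of the constant-acceleration model described above, the following base R sketch evaluates a fitted path at a given time. The helper name pitch_location is hypothetical (it is not a pitchRx function), the nine parameter values are those of the single pitch printed later in this article, and the units (feet and seconds) are an assumption.

```r
# Location of the ball t seconds after the initial point, under constant
# acceleration in each dimension: p(t) = p0 + v0 * t + 0.5 * a * t^2.
# `pitch_location` is a hypothetical helper, not part of pitchRx.
pitch_location <- function(t, p0, v0, a) {
  p0 + v0 * t + 0.5 * a * t^2
}

# The nine parameter estimates for one pitch:
# initial position (x0, y0, z0), velocity (vx0, vy0, vz0), acceleration (ax, ay, az)
p0 <- c(x = -1.612, y = 50.0, z = 5.633)
v0 <- c(x = 5.808, y = -128.728, z = -2.903)
a  <- c(x = -11.406, y = 22.954, z = -24.681)

start <- pitch_location(0, p0, v0, a)    # identical to the initial position
later <- pitch_location(0.2, p0, v0, a)  # estimated location 0.2 seconds later
```

Evaluating the function over a sequence of times gives the full estimated path, which is the idea behind recreating flight trajectories from the published parameters.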
On the business side of baseball, using statistical analysis to scout and evaluate players has become mainstream. When PITCHf/x was first introduced, DiMeo (2007) proclaimed it as,
“The new technology that will change statistical analysis [of baseball] forever.”
PITCHf/x has yet to fully deliver on this claim, partially due to the difficulty of accessing and deriving insight from the large amount of complex information. By providing better tools to collect and visualize this data, pitchRx makes PITCHf/x analysis more accessible to the general public.
PITCHf/x data is used in many different projects. It can also complement other baseball data sources, which poses an interesting database management problem. Statistical analysis of PITCHf/x data and baseball in general has become so popular that it has helped expose statistical ideas and practice to the general public. If you have witnessed television broadcasts of MLB games, you know one obvious application of PITCHf/x is locating pitches in the strike-zone as well as recreating flight trajectories, tracking pitch speed, etc. Some ongoing statistical research related to PITCHf/x includes classifying pitch types, predicting pitch sequences, and clustering pitchers with similar tendencies (Pane et al. 2013).
pitchRx has two main focuses (Sievert 2014a). The first focus is to provide easy access to Gameday data. Not only is pitchRx helpful for collecting this data in bulk, but it has served as a helpful teaching and research aid (http://baseballwithr.wordpress.com/ is one such example). Methods for collecting Gameday data existed prior to pitchRx; however, these methods are not easily extensible and require juggling technologies that may not be familiar or accessible (Fast 2007). Moreover, these working environments are less desirable than R for data analysis and visualization. Since pitchRx is built upon XML2R’s united framework, it can be easily modified and/or extended (Sievert 2014b). For this same reason, pitchRx serves as a model for building customized XML data collection tools with XML2R.
The other main focus of pitchRx is to simplify the process of creating popular PITCHf/x graphics. Strike-zone plots and animations made via pitchRx utilize the extensible ggplot2 framework as well as various customized options (Wickham 2009). ggplot2 is a convenient framework for generating strike-zone plots primarily because of its facet schema, which allows one to make visual comparisons across any combination of discrete variable(s). Interactive 3D scatterplots are based on the rgl package and are useful for gaining a new perspective on flight trajectories (Adler et al.).
Gameday data is hosted and made available for free thanks to MLBAM via http://gd2.mlb.com/components/game/mlb/. From this website, one can obtain many different types of data besides PITCHf/x. For example, one can obtain everything from structured media metadata to insider tweets. In fact, this website’s purpose is to serve data to various http://mlb.com web pages and applications. As a result, some data is redundant and the format may not be optimal for statistical analysis. For these reasons, the scrape function is focused on retrieving data that is useful for PITCHf/x analysis and providing it in a convenient format for data analysis.
Navigating through the MLBAM website can be overwhelming, but it helps to recognize that a homepage exists for nearly every day and every game. For example, http://gd2.mlb.com/components/game/mlb/year_2011/month_02/day_26/ displays numerous hyperlinks to various files specific to February 26th, 2011. On this page is a hyperlink to miniscoreboard.xml which contains information on every game played on that date. This page also has numerous hyperlinks to game specific pages. For example, gid_2011_02_26_phimlb_nyamlb_1/ points to the homepage for that day’s game between the NY Yankees and Philadelphia Phillies. On this page is a hyperlink to the players.xml file which contains information about the players, umpires, and coaches (positions, names, batting average, etc.) coming into that game.
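The directory layout described above can be sketched in base R. The helper day_url below is hypothetical and written only for illustration; within pitchRx, the makeUrls function is the tool meant for this job.

```r
# Build the URL of a day's homepage on the Gameday site from a date.
# `day_url` is a hypothetical helper written for illustration only.
root <- "http://gd2.mlb.com/components/game/mlb/"
day_url <- function(date) {
  d <- as.Date(date)
  paste0(root, format(d, "year_%Y/month_%m/day_%d/"))
}

day <- day_url("2011-02-26")
# A game's homepage is the day's homepage followed by the game identifier
game <- paste0(day, "gid_2011_02_26_phimlb_nyamlb_1/")
# Files specific to that game live under the game's homepage
players <- paste0(game, "players.xml")
```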
Starting from a particular game’s homepage and clicking on the inning/ directory, we should see another page with links to the inning_all.xml file and the inning_hit.xml file. If it is available, the inning_all.xml file contains the PITCHf/x data for that game. It’s important to note that this file won’t exist for some games, because some games are played in venues that do not have a working PITCHf/x system in place. This is especially true for preseason games and games played prior to the 2008 season when the PITCHf/x system became widely adopted. The inning_hit.xml files have manually recorded spatial coordinates of where a home run landed or where the baseball made initial contact with a defender after it was hit into play.
The relationship between these XML files and the tables returned by scrape is presented in Table 1. The pitch table is extracted from files whose name ends in inning_all.xml. This is the only table returned by scrape that contains data on the pitch-by-pitch level. The atbat, runner, and action tables from this same file, together with the hip table from inning_hit.xml, are commonly labeled somewhat ambiguously as play-by-play data. The player, coach, and umpire tables are extracted from players.xml and are classified as game-by-game since there is one record per person per game. Figure 1 shows how these tables can be connected to one another in a database setting. The direction of each arrow represents a one-to-possibly-many relationship. For example, at least one pitch is thrown for each at bat (that is, each bout between pitcher and batter) and there are numerous at bats within each game.
In a rough sense, one can relate tables returned by scrape back to XML nodes in the XML files. For convenience, some information in certain XML nodes is combined into one table. For example, information gleaned from the ‘top’, ‘bottom’, and ‘inning’ XML nodes within inning_all.xml is included as inning and inning_side fields in the pitch, po, atbat, runner, and action tables. This helps reduce the burden of merging many tables together in order to have inning information on the play-by-play and/or pitch-by-pitch level. Other information is ignored simply because it is redundant. For example, the ‘game’ node within the players.xml file contains information that can be recovered from the game table extracted from the miniscoreboard.xml file. If the reader wants a more detailed explanation of fields in these tables, Marchi and Albert (2013) provide a nice overview.
Source file suffix | Information level | XML nodes | Tables returned by scrape |
---|---|---|---|
miniscoreboard.xml | game-by-game | games, game, game_media, media | game, media |
players.xml | game-by-game | game, team, player, coach, umpire | player, coach, umpire |
inning_all.xml | play-by-play, pitch-by-pitch | game, inning, bottom, top, atbat, po, pitch, runner, action | atbat, po, pitch, runner, action |
inning_hit.xml | play-by-play | hitchart, hip | hip |
XML2R adds to the CRAN Task View on Web Technologies and Services by focusing on the transformation of XML content into a collection of tables. Compared to a lower-level API like the XML package, it can significantly reduce the amount of coding and cognitive effort required to perform such a task. In contrast to most higher-level APIs, it does not make assumptions about the XML structure or its source. Although XML2R works on any structure, performance and user experience are enhanced if the content has an inherent relational structure. XML2R’s novel approach to XML data collection breaks down the transformation process into a few simple steps and allows the user to decide how to apply those steps.
The next few sections demonstrate how pitchRx leverages XML2R in order to produce a collection of tables from inning_all.xml files. A similar approach is used by pitchRx::scrape to construct tables from the other Gameday files in Table 1. In fact, XML2R has also been used in the R package bbscrapeR which collects data from nba.com and wnba.com.
Sometimes the most frustrating part of obtaining data in bulk off of the web is finding the proper collection of URLs or file names of interest. Since files on the Gameday website are fairly well organized, the makeUrls function can be used to construct URLs that point to every game’s homepage within a window of dates.
library(pitchRx)
urls <- makeUrls(start = "2011-06-01", end = "2011-06-01")
sub("http://gd2.mlb.com/components/game/mlb/", "", head(urls))
#> [1] "year_2011/month_06/day_01/gid_2011_06_01_anamlb_kcamlb_1"
#> [2] "year_2011/month_06/day_01/gid_2011_06_01_balmlb_seamlb_1"
#> [3] "year_2011/month_06/day_01/gid_2011_06_01_chamlb_bosmlb_1"
#> [4] "year_2011/month_06/day_01/gid_2011_06_01_clemlb_tormlb_1"
#> [5] "year_2011/month_06/day_01/gid_2011_06_01_colmlb_lanmlb_1"
#> [6] "year_2011/month_06/day_01/gid_2011_06_01_flomlb_arimlb_1"
Once we have a collection of XML files, the next step is to parse the content into a list of observations. An observation is technically defined as a matrix with one row and some number of columns. The columns are defined by XML attributes and the XML value (if any) for a particular XML lineage. The name of each observation tracks the XML hierarchy so observations can be grouped together in a sensible fashion at a later point.
library(XML2R)
files <- paste0(urls, "/inning/inning_all.xml")
obs <- XML2Obs(files, url.map = TRUE, quiet = TRUE)
table(names(obs))
#>
#> game game//inning
#> 2 18
#> game//inning//bottom//action game//inning//bottom//atbat
#> 13 69
#> game//inning//bottom//atbat//pitch game//inning//bottom//atbat//po
#> 247 4
#> game//inning//bottom//atbat//runner game//inning//top//action
#> 50 20
#> game//inning//top//atbat game//inning//top//atbat//pitch
#> 79 278
#> game//inning//top//atbat//po game//inning//top//atbat//runner
#> 17 89
#> url_map
#> 1
This output tells us that 247 pitches were thrown in the bottom half of innings and 278 were thrown in the top half on June 1st, 2011. Also, there are 12 different levels of observations. The list element named url_map is not considered an observation and was included since url.map = TRUE. This helps avoid repeating long file names in the url_key column, which tracks the mapping between observations and file names.
obs[1]
#> $`game//inning//top//atbat//pitch`
#> des id type tfs tfs_zulu x y
#> [1,] "Called Strike" "3" "S" "191018" "2011-06-01T23:10:18Z" "109.87" "145.06"
#> sv_id start_speed end_speed sz_top sz_bot pfx_x pfx_z px
#> [1,] "110601_191020" "87.9" "82.1" "3.6" "1.65" "-6.7" "4.36" "-0.213"
#> pz x0 y0 z0 vx0 vy0 vz0 ax
#> [1,] "2.611" "-1.612" "50.0" "5.633" "5.808" "-128.728" "-2.903" "-11.406"
#> ay az break_y break_angle break_length pitch_type
#> [1,] "22.954" "-24.681" "23.9" "22.5" "6.5" "SI"
#> type_confidence zone nasty spin_dir spin_rate cc mt url_key
#> [1,] ".798" "5" "39" "236.697" "1538.041" "" "" "url1"
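To make the notion of an observation concrete without touching the network, the sketch below builds two toy observations by hand in base R. The object toy_obs and its attribute values are illustrative stand-ins, not output of XML2Obs.

```r
# Two toy observations: each is a one-row character matrix whose columns are
# XML attributes, and whose list name records the XML lineage.
# The values are illustrative and were not produced by XML2Obs.
toy_obs <- list(
  "game//inning//top//atbat//pitch" = matrix(
    c("Called Strike", "S", "url1"), nrow = 1,
    dimnames = list(NULL, c("des", "type", "url_key"))
  ),
  "game//inning//bottom//atbat//pitch" = matrix(
    c("Ball", "B", "url1"), nrow = 1,
    dimnames = list(NULL, c("des", "type", "url_key"))
  )
)

# Counting names shows how many observations exist at each level
counts <- table(names(toy_obs))
```

Because the lineage lives in the list names rather than in the data, grouping observations later only requires operating on names.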
Before grouping observations into a collection of tables based on their names, one may want to re_name observations. Observations with names ’game//inning//bottom//atbat’ and ’game//inning//top//atbat’ should be included in the same table since they share XML attributes (in other words, the observations share variables).
tmp <- re_name(obs, equiv = c("game//inning//top//atbat",
  "game//inning//bottom//atbat"), diff.name = "inning_side")
By passing these names to the equiv argument, re_name determines the difference in the naming scheme and suppresses that difference. In other words, observation names that match the equiv argument will be renamed to ’game//inning//atbat’. The information removed from the name is not lost, however, as a new column is appended to the end of each relevant observation. For example, notice how the inning_side column contains the part of the name we just removed:
tmp[grep("game//inning//atbat", names(tmp))][1:2]
#> $`game//inning//atbat`
#> num b s o start_tfs start_tfs_zulu batter stand b_height
#> [1,] "1" "0" "1" "0" "190935" "2011-06-01T23:09:35Z" "430001" "R" "5-10"
#> pitcher p_throws
#> [1,] "502190" "R"
#> des event score
#> [1,] "Rickie Weeks homers (10) on a fly ball to left field. " "Home Run" "T"
#> home_team_runs away_team_runs url_key inning_side
#> [1,] "0" "1" "url1" "top"
#>
#> $`game//inning//atbat`
#> num b s o start_tfs start_tfs_zulu batter stand b_height
#> [1,] "2" "0" "0" "0" "191105" "2011-06-01T23:11:05Z" "460579" "L" "6-0"
#> pitcher p_throws
#> [1,] "502190" "R"
#> des
#> [1,] "Nyjer Morgan triples (3) on a line drive to right fielder Jay Bruce. "
#> event url_key inning_side
#> [1,] "Triple" "url1" "top"
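The idea behind this renaming step can be sketched in base R. This is a conceptual illustration only and not XML2R’s implementation: the function rename_sketch and its arguments new.name and diff.values are hypothetical, and unlike re_name it is told the differing name parts explicitly rather than working them out.

```r
# Conceptual sketch of the renaming step (not XML2R's implementation):
# give matching observations a common name and keep the removed part of the
# name as a new column. `rename_sketch` and the toy data are hypothetical.
rename_sketch <- function(obs, equiv, new.name, diff.values, diff.name) {
  for (i in seq_along(obs)) {
    hit <- match(names(obs)[i], equiv)
    if (!is.na(hit)) {
      extra <- matrix(diff.values[hit], nrow = 1,
                      dimnames = list(NULL, diff.name))
      obs[[i]] <- cbind(obs[[i]], extra)  # append the new column
      names(obs)[i] <- new.name           # suppress the naming difference
    }
  }
  obs
}

toy <- list(
  "game//inning//top//atbat"    = matrix("1", 1, 1, dimnames = list(NULL, "num")),
  "game//inning//bottom//atbat" = matrix("9", 1, 1, dimnames = list(NULL, "num"))
)
renamed <- rename_sketch(
  toy,
  equiv = c("game//inning//top//atbat", "game//inning//bottom//atbat"),
  new.name = "game//inning//atbat",
  diff.values = c("top", "bottom"),
  diff.name = "inning_side"
)
```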
For the same reasons, other pairs of observation names are renamed in a similar fashion.
tmp <- re_name(tmp, equiv = c("game//inning//top//atbat//runner",
  "game//inning//bottom//atbat//runner"), diff.name = "inning_side")
tmp <- re_name(tmp, equiv = c("game//inning//top//action",
  "game//inning//bottom//action"), diff.name = "inning_side")
tmp <- re_name(tmp, equiv = c("game//inning//top//atbat//po",
  "game//inning//bottom//atbat//po"), diff.name = "inning_side")
obs2 <- re_name(tmp, equiv = c("game//inning//top//atbat//pitch",
  "game//inning//bottom//atbat//pitch"), diff.name = "inning_side")
table(names(obs2))
#>
#> game game//inning
#> 2 18
#> game//inning//action game//inning//atbat
#> 33 148
#> game//inning//atbat//pitch game//inning//atbat//po
#> 525 21
#> game//inning//atbat//runner url_map
#> 139 1
After all that renaming, we now have 7 different levels of observations. Let’s examine the first three observations on the game//inning level:
obs2[grep("^game//inning$", names(obs2))][1:3]
#> $`game//inning`
#> num away_team home_team next url_key
#> [1,] "1" "mil" "cin" "Y" "url1"
#>
#> $`game//inning`
#> num away_team home_team next url_key
#> [1,] "2" "mil" "cin" "Y" "url1"
#>
#> $`game//inning`
#> num away_team home_team next url_key
#> [1,] "3" "mil" "cin" "Y" "url1"
Before grouping observations into tables, it is usually important to preserve the parent-to-child relationships in the XML lineage. For example, one may want to map a particular pitch back to the inning in which it was thrown. Using the add_key function, the relevant value of num for game//inning observations can be recycled to its XML descendants.
obswkey <- add_key(obs2, parent = "game//inning", recycle = "num", key.name = "inning")
#> A key for the following children will be generated for the game//inning node:
#> game//inning//atbat//pitch
#> game//inning//atbat//runner
#> game//inning//atbat
#> game//inning//atbat//po
#> game//inning//action
As it turns out, the away_team and home_team columns are redundant as this information is embedded in the url column. Thus, there is only one other informative attribute on this level, which is next. By recycling this value among its descendants, we remove any need to retain a game//inning table.
obswkey <- add_key(obswkey, parent = "game//inning", recycle = "next")
#> A key for the following children will be generated for the game//inning node:
#> game//inning//atbat//pitch
#> game//inning//atbat//runner
#> game//inning//atbat
#> game//inning//atbat//po
#> game//inning//action
It is also imperative that we can link a pitch, runner, or po back to a particular atbat. This can be done as follows:
obswkey <- add_key(obswkey, parent = "game//inning//atbat", recycle = "num")
#> A key for the following children will be generated for the game//inning//atbat node:
#> game//inning//atbat//pitch
#> game//inning//atbat//runner
#> game//inning//atbat//po
Finally, we are in a position to pool together observations that have a common name. The collapse_obs function achieves this by row binding observations with the same name together and returning a list of matrices. Note that collapse_obs does not require observations from the same level to have the same set of variables in order to be bound into a common table. In the case where variables are missing, NAs will be inserted as values.
tables <- collapse_obs(obswkey)
# As mentioned before, we do not need the 'inning' table
tables <- tables[!grepl("^game//inning$", names(tables))]
table.names <- c("game", "action", "atbat", "pitch", "po", "runner")
tables <- setNames(tables, table.names)
head(tables[["runner"]])
#> id start end event score rbi earned url_key inning_side
#> [1,] "430001" "" "" "Home Run" "T" "T" "T" "url1" "top"
#> [2,] "460579" "" "3B" "Triple" NA NA NA "url1" "top"
#> [3,] "460579" "3B" "" "Groundout" "T" "T" "T" "url1" "top"
#> [4,] "425902" "" "1B" "Single" NA NA NA "url1" "top"
#> [5,] "425902" "1B" "" "Pop Out" NA NA NA "url1" "top"
#> [6,] "458015" "" "1B" "Single" NA NA NA "url1" "bottom"
#> inning next num
#> [1,] "1" "Y" "1"
#> [2,] "1" "Y" "2"
#> [3,] "1" "Y" "3"
#> [4,] "1" "Y" "4"
#> [5,] "1" "Y" "6"
#> [6,] "1" "Y" "9"
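The NA-filling behaviour described above can be illustrated with a small base R sketch. It is a conceptual stand-in and not the code behind collapse_obs; the helper bind_fill and the toy observations are hypothetical.

```r
# Conceptual sketch of pooling observations (not XML2R's implementation):
# row bind one-row matrices that share a name, filling columns that an
# observation lacks with NA. `bind_fill` and the toy data are hypothetical.
bind_fill <- function(mats) {
  cols <- unique(unlist(lapply(mats, colnames)))
  rows <- lapply(mats, function(m) {
    out <- matrix(NA_character_, nrow = nrow(m), ncol = length(cols),
                  dimnames = list(NULL, cols))
    out[, colnames(m)] <- m  # copy over the columns this observation has
    out
  })
  do.call(rbind, unname(rows))
}

toy <- list(
  runner = matrix(c("430001", "Home Run", "T"), nrow = 1,
                  dimnames = list(NULL, c("id", "event", "score"))),
  runner = matrix(c("460579", "Triple"), nrow = 1,
                  dimnames = list(NULL, c("id", "event")))
)
# Split observations by name, then bind each group into one matrix
tables_sketch <- lapply(split(toy, names(toy)), bind_fill)
```

The second toy observation has no score attribute, so its row receives NA in that column, mirroring the NA entries in the runner table above.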
The main scraping function in pitchRx, scrape, can be used to easily obtain data from the files listed in Table 1. In fact, any combination of these files can be queried using the suffix argument. In the example below, the start and end arguments are also used so that all available file types for June 1st, 2011 are queried.
library(pitchRx)
files <- c("inning/inning_all.xml", "inning/inning_hit.xml",
  "miniscoreboard.xml", "players.xml")
dat <- scrape(start = "2011-06-01", end = "2011-06-01", suffix = files)
The game.ids option can be used instead of start and end to obtain an equivalent dat object. This option can be useful if the user wants to query specific games rather than all games played over a particular time span. When using this game.ids option, the built-in gids object is quite convenient.
data(gids, package = "pitchRx")
gids11 <- gids[grep("2011_06_01", gids)]
head(gids11)
#> [1] "gid_2011_06_01_anamlb_kcamlb_1" "gid_2011_06_01_balmlb_seamlb_1"
#> [3] "gid_2011_06_01_chamlb_bosmlb_1" "gid_2011_06_01_clemlb_tormlb_1"
#> [5] "gid_2011_06_01_colmlb_lanmlb_1" "gid_2011_06_01_flomlb_arimlb_1"
dat <- scrape(game.ids = gids11, suffix = files)
The object dat is a list of data frames containing all data available for June 1st, 2011 using scrape. The list names match the table names provided in Table 1. For example, dat$atbat is a data frame with every at bat on June 1st, 2011 and dat$pitch has information related to the outcome of each pitch (including PITCHf/x parameters). The object.size of dat is nearly 300MB. Multiplying this number by 100 days exceeds the memory of most machines. Thus, if a large amount of data is required, the user should exploit the R database interface (R Special Interest Group on Databases 2013).
Since PITCHf/x data can easily exhaust memory, one should consider establishing a database instance before using scrape. By passing a database connection to the connect argument, scrape will try to create (and/or append to existing) tables using that connection. If the connection fails for some reason, tables will be written as csv files in the current working directory. The benefits of using the connect argument include improved memory management, which can greatly reduce run time. connect will support a MySQL connection, but creating a SQLite database is quite easy with dplyr (Wickham and Francois 2014).
library(dplyr)
db <- src_sqlite("GamedayDB.sqlite3", create = TRUE)
# Collect and store all PITCHf/x data from 2008 to now
scrape(start = "2008-01-01", end = Sys.Date(),
  suffix = "inning/inning_all.xml", connect = db$con)
In the later sections, animations of four-seam and cut fastballs thrown by Mariano Rivera and Phil Hughes during the 2011 season are created. In order to obtain the data for those animations, one could query db which now has PITCHf/x data from 2008 to date. This query requires criteria on: the pitcher_name field (in the atbat table), the pitch_type field (in the pitch table), and the date field (in both tables). To reduce the time required to search those records, one should create an index on each of these three fields.
library(DBI)
dbSendQuery(db$con, "CREATE INDEX pitcher_index ON atbat(pitcher_name)")
dbSendQuery(db$con, "CREATE INDEX type_index ON pitch(pitch_type)")
dbSendQuery(db$con, "CREATE INDEX date_atbat ON atbat(date)")
As a part of our query, we’ll have to join the atbat table together with the pitch table. For this task, the gameday_link and num fields are helpful since together they provide a way to match pitches with at bats. For this reason, a multi-column index on the gameday_link and num fields will further reduce run time of the query.
dbSendQuery(db$con, 'CREATE INDEX pitch_join ON pitch(gameday_link, num)')
dbSendQuery(db$con, 'CREATE INDEX atbat_join ON atbat(gameday_link, num)')
Although the query itself could be expressed entirely in SQL, dplyr’s grammar for data manipulation (which is database agnostic) can help to simplify the task. In this case, at.bat is a tabular representation of the remote atbat table restricted to cases where Rivera or Hughes was the pitcher. That is, at.bat does not contain the actual data, but it does contain the information necessary to retrieve it from the database.
at.bat <- tbl(db, "atbat") %>%
  filter(pitcher_name %in% c("Mariano Rivera", "Phil Hughes"))
Similarly, fbs is a tabular representation of the pitch table restricted to four-seam (FF) and cut fastballs (FC).
fbs <- tbl(db, "pitch") %>%
  filter(pitch_type %in% c("FF", "FC"))
An inner_join of these two filtered tables returns a tabular representation of all four-seam and cut fastballs thrown by Rivera and Hughes. Before collect actually performs the database query and brings the relevant data into the R session, another restriction is added so that only pitches from 2011 are included.
pitches <- inner_join(fbs, at.bat) %>%
  filter(date >= "2011_01_01" & date <= "2012_01_01") %>%
  collect()
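For readers without a populated database at hand, the role of gameday_link and num as join keys can be seen with base R’s merge on two toy data frames. The objects atbat_toy and pitch_toy and all of their values are made up for illustration; this is an in-memory analogue of the join, not the database query itself.

```r
# Toy stand-ins for the atbat and pitch tables; all values are illustrative.
atbat_toy <- data.frame(
  gameday_link = c("gid_a", "gid_a", "gid_b"),
  num = c(1, 2, 1),
  pitcher_name = c("Mariano Rivera", "Phil Hughes", "Mariano Rivera"),
  stringsAsFactors = FALSE
)
pitch_toy <- data.frame(
  gameday_link = c("gid_a", "gid_a", "gid_a", "gid_b"),
  num = c(1, 1, 2, 1),
  pitch_type = c("FC", "FF", "FF", "FC"),
  stringsAsFactors = FALSE
)

# An inner join on the shared key columns attaches at-bat fields to each pitch.
# Neither key alone is enough: num restarts at 1 in every game.
joined <- merge(pitch_toy, atbat_toy, by = c("gameday_link", "num"))
rivera <- subset(joined, pitcher_name == "Mariano Rivera")
```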
Amongst the most common PITCHf/x graphics are strike-zone plots. Such a plot has two axes and the coordinates represent the location of baseballs as they cross home plate. The term strike-zone plot can refer to either density or probabilistic plots. Density plots are useful for exploring what actually occurred, but probabilistic plots can help address much more interesting questions using statistical inference. Although probabilistic plots can be used to visually track any event probability across the strike-zone, their most popular use is for addressing umpire bias in a strike versus ball decision (Green and Daniels 2014). The probabilistic plots section demonstrates how pitchRx simplifies the process behind creating such plots via a case study on the impact of home field advantage on umpire decisions.
In the world of sports, it is a common belief that umpires (or referees) have a tendency to favor the home team. PITCHf/x provides a unique opportunity to add to this discussion by modeling the probability of a called strike at home games versus away games. Specifically, conditioned upon the umpire making a decision at a specific location in the strike-zone, if the probability that a home pitcher receives a called strike is higher than the probability that an away pitcher receives a called strike, then there is evidence to support umpire bias towards a home pitcher.
There are many different possible outcomes of each pitch, but we can condition on the umpire making a decision by limiting to the following two cases. A called strike is an outcome of a pitch where the batter does not swing and the umpire declares the pitch a strike (which is a favorable outcome for the pitcher). A ball is another outcome where the batter does not swing and the umpire declares the pitch a ball (which is a favorable outcome for the batter). All decisions made between 2008 and 2013 can be obtained from db with the following query using dplyr.
# First, add an index on the pitch description to speed up run-time
dbSendQuery(db$con, "CREATE INDEX des_index ON pitch(des)")
pitch <- tbl(db, "pitch") %>%
  filter(des %in% c("Called Strike", "Ball")) %>%
  # Keep pitch location, descriptions
  select(px, pz, des, gameday_link, num) %>%
  # 0-1 indicator of strike/ball
  mutate(strike = as.numeric(des == "Called Strike"))
atbat <- tbl(db, "atbat") %>%
  # Select variables to be used later as covariates in probabilistic models
  select(b_height, p_throws, stand, inning_side, date, gameday_link, num)
decisions <- inner_join(pitch, atbat) %>%
  filter(date <= "2014_01_01") %>%
  collect()
The decisions data frame contains data on over 2.5 million pitches thrown from 2008 to 2013. About a third of them are called strikes and two-thirds balls. Figure 2 shows the density of all called strikes. Clearly, most called strikes occur on the outer region of the strike-zone. Many factors could contribute to this phenomenon, which we will not investigate here.
# strikeFX uses the stand variable to calculate strike-zones
# Here is a slick way to create better facet titles without changing data values
relabel <- function(variable, value) {
  value <- sub("^R$", "Right-Handed Batter", value)
  sub("^L$", "Left-Handed Batter", value)
}
strikes <- subset(decisions, strike == 1)
strikeFX(strikes, geom = "raster", layer = facet_grid(. ~ stand, labeller = relabel))
Figure 2 shows one static rectangle (or strike-zone) per plot automatically generated by strikeFX. The definition of the strike-zone is notoriously ambiguous. As a result, the boundaries of the strike-zone may be noticeably different in some situations. However, we can achieve a fairly accurate representation of strike-zones using a rectangle defined by batters’ average height and stance (Fast 2011). As Figure 4 reinforces, batter stance makes an important difference since the strike-zone seems to be horizontally shifted away from the batter. The batter’s height is also important since the strike-zone is classically defined as approximately between the batter’s knees and armpits.
Figure 2 has one strike-zone per plot since the layer option contains a ggplot2 argument that facets according to batter stance. Facet layers are a powerful tool for analyzing PITCHf/x data because they help produce quick and insightful comparisons. In addition to using the layer option, one can add layers to a graphic returned by strikeFX using ggplot2 arithmetic. It is also worth pointing out that Figure 2 could have been created without introducing the strikes data frame by using the density1 and density2 options.
strikeFX(decisions, geom = "raster", density1 = list(des = "Called Strike"),
density2 = list(des = "Called Strike")) + facet_grid(. ~ stand, labeller = relabel)
In general, when density1 and density2 are identical, the result is equivalent to subsetting the data frame appropriately beforehand. More importantly, by specifying different values for density1 and density2, differenced densities are easily generated. In this case, a grid of density estimates for density2 is subtracted from the corresponding grid of density estimates for density1. Note that the default NULL value for either density option implies that the entire data set defines the relevant density. Thus, if density2 was NULL (when density1 = list(des = "Called Strike")), we would obtain the density of called strikes minus the density of both called strikes and balls. In Figure 3, we define density1 as called strikes and define density2 as balls. As expected, we see positive density values (in blue) inside the strike-zone and negative density values (in red) outside of the strike-zone.
strikeFX(decisions, geom = "raster", density1 = list(des = "Called Strike"),
density2 = list(des = "Ball"), layer = facet_grid(. ~ stand, labeller = relabel))
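The arithmetic behind a differenced density can be sketched in base R. The sketch below is an assumption-laden simplification: it uses simulated locations rather than PITCHf/x data, and binned proportions on a shared grid rather than whatever density estimator strikeFX uses internally. The names strikes_sim, balls_sim, and grid_density are hypothetical.

```r
set.seed(1)
# Simulated locations (not real PITCHf/x data): "strikes" cluster near the
# centre of the region, "balls" are spread more widely.
strikes_sim <- data.frame(px = rnorm(500, 0, 0.5), pz = rnorm(500, 2.5, 0.5))
balls_sim   <- data.frame(px = rnorm(500, 0, 1.5), pz = rnorm(500, 2.5, 1.5))

# Both estimates must live on the same grid so they can be subtracted cell by cell
breaks_x <- seq(-3, 3, length.out = 13)  # 12 horizontal bins
breaks_z <- seq(0, 5, length.out = 11)   # 10 vertical bins
grid_density <- function(d) {
  counts <- table(cut(d$px, breaks_x), cut(d$pz, breaks_z))
  counts / sum(counts)  # proportion of points falling in each cell
}

# Analogue of density1 minus density2: positive where strikes dominate
diff_density <- grid_density(strikes_sim) - grid_density(balls_sim)
```

Cells near the centre of the grid come out positive and cells near the edges negative, which is the same qualitative pattern as the blue and red regions described above.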
These density plots are helpful for visualizing the observed frequency of events; however, they are not very useful for addressing our umpire bias hypothesis. Instead of looking simply at the density, we want to model the probability of a strike called at each coordinate given the umpire has to make a decision.
There are many approaches to probabilistic modeling over a two dimensional spatial region. Since our response is often categorical, generalized additive models (GAMs) are a popular and desirable approach to modeling events over the strike-zone (Mills 2010). There are numerous R package implementations of GAMs, but the bam function from the mgcv package has several desirable properties (Wood 2006). Most importantly, the smoothing parameter can be estimated using several different methods. In order to have a reasonable estimate of the smooth 2D surface, GAMs require a fairly large number of observations. As a result, run time can be an issue – especially when modeling 2.5 million observations! Thankfully, the bam function has a cluster argument which allows one to distribute computations across multiple cores using the built-in parallel package.
library(parallel)
cl <- makeCluster(detectCores() - 1)
library(mgcv)
m <- bam(strike ~ interaction(stand, p_throws, inning_side) +
  s(px, pz, by = interaction(stand, p_throws, inning_side)),
  data = decisions, family = binomial(link = 'logit'), cluster = cl)
This formula models the probability of a strike as a function of the baseball’s spatial location, the batter’s stance, the pitcher’s throwing arm, and the side of the inning. Since home pitchers always pitch during the top of the inning, inning_side also serves as an indication of whether a pitch is thrown by a home pitcher. In this case, the interaction function creates a factor with eight different levels since every input factor has two levels. Consequently, there are 8 different levels of smooth surfaces over the spatial region defined by px and pz.
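The count of eight levels is easy to check in base R without fitting anything. The data frame combos below is a hypothetical stand-in that simply enumerates the factor values used in this article.

```r
# Every combination of the three two-level factors used in the model formula
combos <- expand.grid(
  stand = c("L", "R"),
  p_throws = c("L", "R"),
  inning_side = c("top", "bottom")
)
groups <- interaction(combos$stand, combos$p_throws, combos$inning_side)

# One smooth surface is estimated per level of this factor
n_surfaces <- nlevels(groups)
```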
The fitted model m
contains a lot of information which strikeFX
uses
in conjunction with any ggplot2 facet commands to infer which and how
surfaces should be plotted. In particular, the var.summary
is used to
identify model covariates, as well their default conditioning values. In
our case, the majority of decisions
are from right-handed pitchers and
the top of the inning. Thus, the default conditioning values are "top"
for inning_side
and "R"
for p_throws
. If different conditioning
values are desired, var.summary
can be modified accordingly. To
demonstrate, Figure 4 shows 2 of the 8 possible
surfaces that correspond to a right-handed away pitcher.
away <- list(inning_side = factor("bottom", levels = c("top", "bottom")))
m$var.summary <- modifyList(m$var.summary, away)
strikeFX(decisions, model = m, layer = facet_grid(. ~ stand, labeller = relabel))
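The update of var.summary above relies on how modifyList merges two named lists. The following sketch shows that behavior on a small stand-in list; var_summary is a hypothetical substitute for m$var.summary, which holds more entries in a real fitted model.

```r
# Hypothetical stand-in for m$var.summary (a real fitted model stores more)
var_summary <- list(
  inning_side = factor("top", levels = c("top", "bottom")),
  p_throws = factor("R", levels = c("L", "R")),
  px = 0
)

# Only the named element is replaced; all other entries are kept as they are
away <- list(inning_side = factor("bottom", levels = c("top", "bottom")))
updated <- modifyList(var_summary, away)

as.character(updated$inning_side)  # "bottom"
as.character(updated$p_throws)     # "R"
```

Because only inning_side is overwritten, the remaining conditioning values keep their defaults.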
Using the same intuition exploited earlier to obtain differenced density
plots, we can easily obtain differenced probability plots. To obtain
Figure 5, we simply add p_throws
as another facet
variable and inning_side
as a differencing variable. In this case,
conditioning values do not matter since all 8 surfaces are
required in order to produce Figure 5.
# Function to create better labels for both stand and p_throws
relabel2 <- function(variable, value) {
  if (variable %in% "stand")
    return(sub("^L$", "Left-Handed Batter",
               sub("^R$", "Right-Handed Batter", value)))
  if (variable %in% "p_throws")
    return(sub("^L$", "Left-Handed Pitcher",
               sub("^R$", "Right-Handed Pitcher", value)))
}
strikeFX(decisions, model = m, layer = facet_grid(p_throws ~ stand, labeller = relabel2),
         density1 = list(inning_side = "top"), density2 = list(inning_side = "bottom"))
The four different plots in Figure 5 represent the
four different combinations of values of p_throws and stand. In
general, provided that a pitcher throws to a batter in the blue region,
the pitch is more likely to be called a strike if the pitcher is on
their home turf. Interestingly, there is a well-defined blue elliptical
band around the boundaries of the typical strike-zone. Thus, home
pitchers are more likely to receive a favorable call – especially when
the classification of the pitch is in question. In some areas, the home
pitcher has up to a 6 percent higher probability of receiving a called
strike than an away pitcher. The subtle differences in spatial patterns
across the different values of p_throws
and stand
are interesting as
well. For instance, pitching at home has a large positive impact for a
left-handed pitcher throwing in the lower inside portion of the
strike-zone to a right-handed batter, but the impact seems negligible in
the mirror opposite case. Differenced probabilistic densities are
clearly an interesting visual tool for analyzing PITCHf/x data. With
strikeFX, one can quickly and easily make all sorts of visual
comparisons for various situations. In fact, one can explore and compare
the probabilistic structure of any well-defined event over a strike-zone
region (for example, the probability a batter reaches base) using a
similar approach.
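To make the idea of a differenced probability surface concrete, the sketch below fits a simple logistic regression to simulated data and subtracts the predicted probabilities for the two values of inning_side over a grid. This is only an illustration under stated assumptions: the data are simulated, the model is a plain glm with quadratic terms rather than the bam fit used above, and none of it reflects how strikeFX is implemented internally.

```r
set.seed(1)
n <- 2000
sides <- c("top", "bottom")

# Simulated pitch locations and inning sides (hypothetical values)
sim <- data.frame(
  px = runif(n, -2, 2),
  pz = runif(n, 0.5, 4.5),
  inning_side = factor(sample(sides, n, replace = TRUE), levels = sides)
)

# Simulated truth: strikes are likelier near the center, with a small
# bump when inning_side is "top"
eta <- 2 - 1.5 * sim$px^2 - 1.5 * (sim$pz - 2.5)^2 +
  0.3 * (sim$inning_side == "top")
sim$strike <- rbinom(n, 1, plogis(eta))

fit <- glm(strike ~ I(px^2) + I((pz - 2.5)^2) + inning_side,
           data = sim, family = binomial(link = "logit"))

# Predict over the same spatial grid for each value of inning_side
grid <- expand.grid(px = seq(-2, 2, by = 0.5), pz = seq(0.5, 4.5, by = 0.5))
predict_side <- function(side) {
  newdata <- grid
  newdata$inning_side <- factor(side, levels = sides)
  predict(fit, newdata = newdata, type = "response")
}

# Differenced surface: P(strike | top) - P(strike | bottom) at each grid point
grid$diff <- predict_side("top") - predict_side("bottom")
range(grid$diff)
```

Plotting grid$diff over px and pz would give a single-panel analogue of the differenced plots described above.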
animateFX
provides convenient and flexible functionality for animating
the trajectory of any desired set of pitches. For demonstration
purposes, this section animates every four-seam and cut fastball thrown
by Mariano Rivera and Phil Hughes during the 2011 season. These pitches
provide a good example of how facets play an important role in
extracting new insights. Similar methods can be used to analyze any MLB
player (or combination of players) in greater detail.
animateFX
tracks three dimensional pitch locations over a sequence of
two dimensional plots. The animation takes on the viewpoint of the
umpire; that is, each time the plot refreshes, the balls are getting
closer to the viewer. This is reflected with the increase in size of the
points as the animation progresses. Obviously, some pitches travel
faster than others, which explains the different sizes within a
particular frame. Animations revert to the initial point of release once
all of the baseballs have reached home plate. During an interactive
session, animateFX produces a series of plots that may not be viewed
easily. One option available to the user is to wrap
animation::saveHTML around animateFX to view the animation in a
browser with proper looping controls (Xie 2013a).
To reduce the time and thinking required to produce these animations,
animateFX
has default settings for the geometry, color, opacity and
size associated with each plot. Any of these defaults can be
altered, except for the point geometry. In order for animations to
work, a data frame with the appropriately named PITCHf/x parameters
(that is, x0, y0, z0, vx0, vy0, vz0, ax0, ay0 and az0) is required. In
Figure 6, every four-seam and cut fastball thrown by
Rivera and Hughes during the 2011 season is visualized using the
pitches
data frame obtained earlier (the animation is available at
http://cpsievert.github.io/pitchRx/ani1).
animateFX(pitches, layer=list(theme_bw(), coord_equal(),
facet_grid(pitcher_name~stand, labeller = relabel)))
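The nine parameters determine a ball's location at any time through the equations of motion for constant acceleration. The sketch below evaluates those equations for a single pitch. The parameter values are made up, the units (feet and seconds) are an assumption, and locate is a hypothetical helper that is not part of pitchRx.

```r
# Hypothetical parameter values for one pitch (assumed units: feet, seconds)
pitch <- c(x0 = -1.5, y0 = 50, z0 = 6,
           vx0 = 5, vy0 = -130, vz0 = -5,
           ax0 = -10, ay0 = 30, az0 = -20)

# Position under constant acceleration: p(t) = p0 + v0 * t + 0.5 * a * t^2
locate <- function(p, t) {
  cbind(
    t = t,
    x = p[["x0"]] + p[["vx0"]] * t + 0.5 * p[["ax0"]] * t^2,
    y = p[["y0"]] + p[["vy0"]] * t + 0.5 * p[["ay0"]] * t^2,
    z = p[["z0"]] + p[["vz0"]] * t + 0.5 * p[["az0"]] * t^2
  )
}

# One row per time point; each animation frame corresponds to one such time
path <- locate(pitch, t = seq(0, 0.3, by = 0.1))
path
```

Evaluating the same function over a finer sequence of times yields the locations that successive frames of an animation would display.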
In the animation corresponding to Figure 6, the upper right-hand portion (Rivera throwing to right-handed batters) reveals the clearest pattern in flight trajectories. Around the point of release, Rivera’s two pitch types are hard to distinguish. However, after a certain point, the flight paths of the two pitch types diverge sharply. Specifically, the drastic left-to-right movement of the cut fastball is noticeably different from the slight right-to-left movement of the four-seam fastball. In recent years, cut fastballs have gained notoriety among the baseball community as a coveted pitch for pitchers to have at their disposal. This is largely due to the difficulty that a batter has in distinguishing the cut fastball from another fastball as the ball travels toward home plate. Clearly, this presents an advantage for the pitcher since they can use deception to reduce a batter’s ability to predict where the ball will cross home plate. This deception factor, combined with Rivera’s ability to locate his pitches, helps explain his accolades as one of the greatest pitchers of all time (Traub 2010).
Although we see a clear pattern in Rivera’s pitches, MLB pitchers are
hardly ever that predictable. Animating that many pitches for another
pitcher can produce a very cluttered graphic which is hard to interpret
(especially when many pitch types are considered). However, we may still
want to obtain an indication of pitch trajectory over a set of many
pitches. A way to achieve this is to average over the PITCHf/x
parameters to produce an overall sense of pitch type behavior (via the
avg.by
option). Note that the facet variables are automatically
considered indexing variables. That is, in Figure 7,
there are eight ‘average’ pitches since there are two pitch types, two
pitchers, and two types of batting stance (the animation is available at
http://cpsievert.github.io/pitchRx/ani2).
animateFX(pitches, avg.by = "pitch_types", layer = list(coord_equal(), theme_bw(),
facet_grid(pitcher_name~stand, labeller = relabel)))
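The averaging idea can be sketched with base R's aggregate. The toy data frame below is hypothetical and carries only two of the nine parameters for brevity. It illustrates averaging parameters within groups and is not a description of how animateFX computes its averages internally.

```r
# Hypothetical pitches: two pitch types, all thrown to right-handed batters
toy <- data.frame(
  pitch_types = c("FF", "FF", "FC", "FC"),
  stand = c("R", "R", "R", "R"),
  x0 = c(-1.0, -2.0, -1.5, -2.5),
  vx0 = c(4, 6, 1, 3)
)

# Average each parameter within every pitch type and stance combination
avg <- aggregate(cbind(x0, vx0) ~ pitch_types + stand, data = toy, FUN = mean)
avg
```

Each row of avg then plays the role of one 'average' pitch, with one row per combination of the indexing variables.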
rgl is an R package that utilizes OpenGL for graphics rendering.
interactiveFX
utilizes rgl functionality to reproduce flight paths
on an interactive 3D platform. Figure 8 has two static
pictures of Mariano Rivera’s 2011 fastballs on this interactive
platform. This is great for gaining new perspectives on a certain set of
pitches, since the trajectories can be viewed from any angle.
Figure 8 showcases the difference in trajectory between
Rivera’s pitch types.
Rivera <- subset(pitches, pitcher_name == "Mariano Rivera")
interactiveFX(Rivera, avg.by = "pitch_types")
pitchRx utilizes XML2R’s convenient framework for manipulating XML content in order to provide easy access to PITCHf/x and related Gameday data. pitchRx removes access barriers, which allows the average R user and baseball fan to spend their valuable time analyzing Gameday’s enormous source of baseball information. pitchRx also provides a suite of functions that greatly reduce the amount of work involved in creating popular PITCHf/x graphics. For those interested in obtaining other XML data, pitchRx serves as a nice example of leveraging XML2R to quickly assemble custom XML data collection mechanisms.
Many thanks to my major professor, Dr. Heike Hofmann, for her direction and support throughout this project. Thanks also to the anonymous reviewers for helpful feedback. This document was created using the knitr package (Xie 2013b). The source files can be found here.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Sievert, "Taming PITCHf/x Data with XML2R and pitchRx", The R Journal, 2014
BibTeX citation
@article{RJ-2014-001, author = {Sievert, Carson}, title = {Taming PITCHf/x Data with XML2R and pitchRx}, journal = {The R Journal}, year = {2014}, note = {https://rjournal.github.io/}, volume = {6}, issue = {1}, issn = {2073-4859}, pages = {5-19} }