TIGER/Line shapefiles from the United States Census Bureau are commonly used for the mapping and analysis of US demographic trends. The tigris package provides a uniform interface for R users to download and work with these shapefiles. Functions in tigris allow R users to request Census geographic datasets using familiar geographic identifiers and return those datasets as objects of class "Spatial*DataFrame". In turn, tigris ensures consistent and high-quality spatial data for R users’ cartographic and spatial analysis projects that involve US Census data. This article provides an overview of the functionality of the tigris package, and concludes with an applied example of a geospatial workflow using data retrieved with tigris.
Analysis and visualization of geographic data are often core components of the analytical workflow for researchers and data scientists; as such, access to open and reliable geographic datasets is of paramount importance. The United States Census Bureau provides access to such data in the form of its TIGER/Line shapefile products (United States Census Bureau 2016a). The files are extracts from the Census Bureau’s Master Address File/Topologically Integrated Geographic Encoding and Referencing (TIGER) database, which in turn are released to the public as shapefiles, a common format for encoding geographic data as vectors (e.g. points, lines, and polygons). Available TIGER/Line shapefiles include all of the Census Bureau’s areal enumeration units, such as states, counties, Census tracts, and Census blocks; transportation data such as roads and railways; and both linear and areal hydrography. The TIGER/Line files are updated annually, and include attributes that allow them to be joined with other tabular data, including demographic data products released by the Census Bureau.
The tigris package aims
to simplify the process of working with these datasets for R users
(Walker 2016). With functions in tigris, R users can specify the data
type and geography for which they would like to obtain geographic data,
and return the corresponding TIGER/Line data as an R object of class
"Spatial*DataFrame". This article provides an overview of the tigris
package, and gives examples that show how it can contribute to common
geographic visualization and spatial analysis workflows in R. Examples
in the article include a discussion of how tigris helps R users
retrieve and work with data from the US Census Bureau, as well as an
applied example of how tigris can fit within a common geospatial
workflow in R, in which data from the United States Internal Revenue
Service are visualized with both static and interactive cartography.
The TIGER/Line files were first released by the US Census Bureau in ASCII format in 1989, and represented street centerline data for the entire United States. Since then, the Census Bureau has expanded the coverage of the TIGER/Line data, and transitioned the core format of the publicly-available files to the shapefile in 2007. TIGER/Line shapefiles include boundary files, which encompass the boundaries of governmental units or other areal units for which the Census Bureau tabulates data. This includes the core Census hierarchy of areal units from the Census block (analogous to a city block) to the entire United States, as well as common geographic entities such as city boundaries. The Census Bureau distinguishes between legal entities, which have official government standing, and statistical entities, which have no legal definition but are used for the tabulation of data. The Census Bureau also makes available shapefiles of geographic features, which include entities such as roads, rivers, and railroads. All TIGER/Line shapefiles are distributed in a geographic coordinate system using the North American Datum of 1983 (NAD83) (United States Census Bureau 2014).
As the TIGER/Line datasets are available in shapefile format, they can be read into R and translated to R objects by the rgdal package (Bivand, Keitt, and Rowlingson 2015). rgdal is an R interface to the Geospatial Data Abstraction Library, or GDAL, an open-source translator that can convert between numerous common vector and raster spatial data formats (GDAL Development Team 2015). When loaded into R, shapefiles are represented as objects of class "Spatial*" from the sp package (Bivand et al. 2013). Most Census datasets obtained by tigris will be loaded as objects of class "SpatialPolygonsDataFrame", given that they represent Census areal entities; selected geographic features, such as roads, linear water features, and landmarks, may be represented as objects of class "SpatialLinesDataFrame" or "SpatialPointsDataFrame". "Spatial*DataFrame" objects are R objects that represent spatial data as closely as possible to regular R data frames, yet also contain information about the feature geometry and coordinate system of the data (see Bivand et al. 2013 for more information).
Several R packages provide access to Census geographic and demographic data. The UScensus2000 (no longer on CRAN) and UScensus2010 packages by Zack Almquist allow for access to several geographic datasets for the 2000 and 2010 Censuses, including blocks, Census tracts, and counties (Almquist 2010). These datasets can also be linked to demographic data from the 2000 and 2010 Censuses, which are stored in related, external packages. The USABoundaries package similarly provides access to some Census geographic boundary files such as zip code tabulation areas (ZCTAs) and counties; it also makes available historical boundary files dating back to 1629 (Mullen 2015). Another R package, choroplethr, wraps ggplot2 (Wickham 2009) to map data from the US Census Bureau’s American Community Survey aggregated to common Census geographies (Lamstein and Johnson 2015). The purpose of the tigris package is to help R users work with US Census Bureau geographic data by granting direct access to the Census shapefiles via a simple, uniform interface. Further, as tigris interfaces directly with Census Bureau data stores, it ensures access to high-quality and up-to-date geographic data for R projects.
The core functionality of tigris consists of a series of functions, each corresponding to a single Census Bureau geography of interest, that grant access to geographic data from the US Census Bureau. tigris allows R users to obtain both the core TIGER/Line shapefiles and the Census Bureau’s Cartographic Boundary Files. Cartographic Boundary Files, following the United States Census Bureau (2015), "are simplified representations of selected geographic areas from the Census Bureau’s MAF/TIGER geographic database".
To download geographic data using tigris, the R user calls the function corresponding to the desired geography. For example, to obtain a "SpatialPolygonsDataFrame" of US states from the TIGER/Line dataset, the user calls the states() function in tigris. The result can then be plotted with the plot() function from the sp package, which is loaded by tigris automatically:
library(tigris)
us_states <- states()
plot(us_states)
The states() function call instructs tigris to fetch a TIGER/Line shapefile from the US Census Bureau that represents the boundaries of the 50 US states, the District of Columbia, and US territories. tigris then uses rgdal to load the data into the user’s R session as an object of class "Spatial*DataFrame". Many functions in tigris have a parameter, cb, that if set to TRUE will direct tigris to load a cartographic boundary file instead. Cartographic boundary files default to a simplified resolution of 1:500,000; in some cases, as with states, resolutions of 1:5 million and 1:20 million are available. For example, an R user could specify the following modifications to the states() function, and retrieve a simplified dataset.
us_states_20m <- states(cb = TRUE, resolution = "20m")
ri <- us_states[us_states$NAME == "Rhode Island", ]
ri_20m <- us_states_20m[us_states_20m$NAME == "Rhode Island", ]
plot(ri)
plot(ri_20m, border = "red", add = TRUE)
The plot illustrates some of the differences between the TIGER/Line shapefiles and the cartographic boundary files, in this instance for the state of Rhode Island. The TIGER/Line shapefiles are the most detailed datasets in interior areas, and represent the legal boundaries for coastal areas which extend three miles beyond the shoreline. The Cartographic Boundary Files have less detail in interior areas, but are clipped to the shoreline of the United States, which may be preferable for thematic mapping but can introduce additional detail for coastal features. A full list of the geographic datasets available through tigris is found in Table 1; datasets with an asterisk are available as both TIGER/Line and cartographic boundary files.
Family | Functions |
---|---|
General area functions | nation *; divisions *; regions *; states *; counties *; tracts *; block_groups *; blocks ; places *; pumas *; school_districts ; county_subdivisions *; zctas * |
Legislative district functions | congressional_districts *; state_legislative_districts *; voting_districts (2012 only) |
Water functions | area_water ; linear_water ; coastline |
Metro area functions | core_based_statistical_areas *; combined_statistical_areas *; metro_divisions ; new_england *; urban_areas * |
Transportation functions | primary_roads ; primary_secondary_roads ; roads ; rails |
Native/tribal geometries functions | native_areas *; alaska_native_regional_corporations *; tribal_block_groups ; tribal_census_tracts ; tribal_subdivisions_national |
Other | landmarks ; military |
When Census data are available for download at sub-national levels, they are referenced by their Federal Information Processing Standard (FIPS) codes, which are codes that uniquely identify geographic entities in the Census database. When applicable, tigris uses smart state and county lookup to simplify the process of data acquisition for R users. This allows users to obtain data by supplying the name or postal code of the desired state – along with the name of the desired county, when applicable – rather than their FIPS codes. In the following example, the R user fetches roads data for Kalawao County, Hawaii, the smallest county in the United States by area, located on the northern coast of the island of Moloka’i.
kw_roads <- roads("HI", "Kalawao")
plot(kw_roads)
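The idea behind this kind of lookup can be sketched in base R. The table and the lookup_fips() helper below are hypothetical and only illustrate the matching logic; they are not the implementation used inside tigris, which relies on its own, much larger table.

```r
# Hypothetical miniature lookup table (illustration only).
fips_lookup <- data.frame(
  state       = c("HI", "HI", "TX"),
  state_name  = c("Hawaii", "Hawaii", "Texas"),
  state_code  = c("15", "15", "48"),
  county      = c("Kalawao County", "Maui County", "Tarrant County"),
  county_code = c("005", "009", "439"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: accept a state as postal code or full name, and a
# county as the beginning of its name; return the five-digit county FIPS code.
lookup_fips <- function(state, county) {
  state_match <- toupper(fips_lookup$state) == toupper(state) |
    tolower(fips_lookup$state_name) == tolower(state)
  county_match <- grepl(paste0("^", county), fips_lookup$county,
                        ignore.case = TRUE)
  hit <- fips_lookup[state_match & county_match, ]
  if (nrow(hit) != 1) {
    stop("Expected exactly one match, found ", nrow(hit))
  }
  paste0(hit$state_code, hit$county_code)
}

kalawao_fips <- lookup_fips("HI", "Kalawao")
maui_fips <- lookup_fips("hawaii", "maui")
print(kalawao_fips)
```

Stopping when the match is not unique is a design choice of this sketch: it avoids silently fetching data for the wrong county when a name fragment is ambiguous.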
While many Census shapefiles correspond to these common geographic identifiers in the United States, not all datasets are identifiable in this way. A good example is the Zip Code Tabulation Area (ZCTA), a geographic dataset developed by the US Census Bureau to approximate zip codes, postal codes used by the United States Postal Service. Social data in the United States are commonly distributed at the zip code level, including an example later in this article; however, zip codes themselves are not coherent geographic entities, and change frequently. ZCTAs, then, function as proxies for zip codes, and are built from Census blocks in which a plurality of addresses on a given block have a given zip code (United States Census Bureau 2016b).
ZCTAs commonly cross county lines and even cross state lines in certain instances; as such, the US Census Bureau only makes the entire ZCTA dataset of over 33,000 ZCTAs available for download. Often, analysts will not need all of these ZCTAs for a given project. tigris allows users to subset ZCTAs on load with the starts_with parameter, which accepts a vector of strings that contains the beginning digits of the ZCTAs that the analyst wants to load into R. The example below retrieves ZCTAs in the area around Fort Worth, Texas.
fw_zips <- zctas(cb = TRUE, starts_with = "761")
plot(fw_zips)
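Prefix-based subsetting of this kind amounts to a regular-expression filter on the ZCTA identifiers. The sketch below uses a hypothetical vector of codes and a hypothetical filter_by_prefix() helper to show the logic in base R; it does not reproduce how tigris performs the subset internally.

```r
# Hypothetical ZCTA codes, stored as character strings to keep leading zeros.
zcta_ids <- c("76101", "76102", "76201", "75201", "02138", "76109")

# Keep the codes that begin with any of the supplied prefixes.
filter_by_prefix <- function(ids, starts_with) {
  pattern <- paste0("^(", paste(starts_with, collapse = "|"), ")")
  ids[grepl(pattern, ids)]
}

fw_ids <- filter_by_prefix(zcta_ids, "761")
multi_ids <- filter_by_prefix(zcta_ids, c("761", "752"))
print(fw_ids)
```

Passing a vector of prefixes, as in the second call, collects several neighboring groups of codes in one step.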
When tigris downloads Census shapefiles to the R user’s computer, it uses the rappdirs package to cache the downloads for future access (Ratnakumar et al. 2014). In turn, once the R user has downloaded the Census geographic data, tigris will know where to look for it and will not need to re-download. To turn off this behavior, a tigris user can set options(tigris_use_cache = FALSE) after loading the package; this will direct tigris to download shapefiles to a temporary directory on the user’s computer instead, and load data into R from there.
The Census Bureau releases updated TIGER/Line shapefiles every year, and these yearly updates are available to tigris users. tigris defaults to the 2015 shapefiles, which at the time of this writing are the most recent available. However, tigris users can supply a different year to a tigris function as a named argument to obtain data for that year; for example, year = 2014 in the function call will fetch TIGER/Line shapefiles or cartographic boundary files from 2014. Additionally, R users can set this as a global option in their R session by entering the command options(tigris_year = 2014).
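The precedence between a named argument, a session option, and a package default can be sketched with base R's options() and getOption(). The resolve_year() helper is hypothetical and written only for illustration; the option name tigris_year and the 2015 default are taken from the text above.

```r
# Hypothetical helper: an explicit argument wins, then the session option,
# then a fallback default (2015 in this article).
resolve_year <- function(year = NULL) {
  if (!is.null(year)) {
    return(year)
  }
  getOption("tigris_year", default = 2015)
}

old_opts <- options(tigris_year = NULL)  # start from a clean state
default_year <- resolve_year()           # no option set: falls back to 2015

options(tigris_year = 2014)
session_year <- resolve_year()           # picks up the session option
explicit_year <- resolve_year(2013)      # explicit argument overrides it

options(old_opts)                        # restore the previous setting
```

Saving and restoring the previous option value keeps the sketch from changing the behavior of later code in the same session.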
The primary utility of the tigris package is for consistent and quick data access for R users with a minimum of code. As tigris loads objects of class "Spatial*" from the sp package, data analysis and visualization using the acquired data can be handled with R’s suite of cartographic and spatial analysis packages, which will be addressed later in the article. However, tigris does include two functions, rbind_tigris() and geo_join(), to assist with common operations when working with Census Bureau geographic data: combining datasets with one another, or merging them to tabular data.
Some Census shapefiles, like roads, are only available by county; however, an R user may want a roads dataset that represents multiple counties. tigris has built-in functionality to handle these circumstances. Data loaded into R by tigris functions are assigned a special "tigris" attribute that identifies the type of geographic data represented by the object. This attribute can be checked with the function tigris_type():
> tigris_type(kw_roads)
[1] "road"
Objects with the same "tigris" attributes can then be combined into a single object using the function rbind_tigris(). In the example below, the user loads data for Maui County, which comprises the remainder of the island of Moloka’i as well as the islands of Maui and Lana’i. As the roads data do not explicitly contain information about counties, an identifying "county" column is specified in the example below. The data are then plotted and colored by county; Kalawao County is colored red in the figure.
maui_roads <- roads("HI", "Maui")
kw_roads$county <- "Kalawao"
maui_roads$county <- "Maui"
maui_kw_roads <- rbind_tigris(kw_roads, maui_roads)
plot(maui_kw_roads, col = c("red", "black")[as.factor(maui_kw_roads$county)])
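The attribute check that guards such a combination can be illustrated with plain data frames standing in for spatial objects. The tag_type() and rbind_checked() helpers and the feature names below are hypothetical; this is a sketch of the idea, not the code of rbind_tigris().

```r
# Tag a data frame with a "tigris" attribute, standing in for a spatial object.
tag_type <- function(df, type) {
  attr(df, "tigris") <- type
  df
}

# Hypothetical feature tables (names are placeholders, not real features).
kw <- tag_type(data.frame(FULLNAME = "Road A", county = "Kalawao"), "road")
maui <- tag_type(data.frame(FULLNAME = c("Road B", "Road C"),
                            county = "Maui"), "road")
water <- tag_type(data.frame(FULLNAME = "Stream D", county = "Maui"),
                  "linear_water")

# Hypothetical combiner: refuse to stack objects whose type tags differ.
rbind_checked <- function(...) {
  parts <- list(...)
  types <- unique(vapply(parts, function(p) attr(p, "tigris"), character(1)))
  if (length(types) != 1) {
    stop("Objects have different types: ", paste(types, collapse = ", "))
  }
  combined <- do.call(rbind, parts)
  attr(combined, "tigris") <- types  # carry the tag over to the result
  combined
}

roads_both <- rbind_checked(kw, maui)
mixed <- tryCatch(rbind_checked(kw, water), error = function(e) "rejected")
```

Checking the tag before binding prevents, for example, road segments and water features from being stacked into one object by accident.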
The rbind_tigris() function also accepts a list of sp objects with the same "tigris" attributes. This is particularly useful in the event that an analyst needs a dataset that covers the United States, but is only available at sub-national levels. An example of this is the Public Use Microdata Area (PUMA), the Census geography at which individual-level microdata samples are associated. PUMAs are available by state in tigris via the pumas() function; however, an analyst will commonly want PUMA geography for the entire United States to facilitate national-level analyses. In this example, rbind_tigris() can be used with lapply() to fetch PUMA datasets for each state and then combine them into a dataset covering the continental United States.
tigris includes a built-in data frame named fips_codes that is used to match state and county names with Census FIPS codes in the various functions in the package. It can also be used to generate vectors of codes to be passed to tigris functions as in this example, so that the analyst does not have to generate the full list of state codes by hand.
us_states <- unique(fips_codes$state)[1:51]
continental_states <- us_states[!us_states %in% c("AK", "HI")]
us_pumas <- rbind_tigris(
  lapply(
    continental_states, function(x) {
      pumas(state = x, cb = TRUE)
    }
  )
)
plot(us_pumas)
The above code directs R to iterate through the state codes for the continental United States, fetching PUMA geography for each state and storing it in a list which rbind_tigris() then combines into a continental PUMA dataset.
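The same iterate-then-combine pattern can be tried without any downloads. In the sketch below, base R's state.abb vector stands in for the state codes from fips_codes, and fetch_stub() is a hypothetical placeholder for a per-state download such as pumas().

```r
# Base R's state.abb holds the 50 state postal codes; DC is added by hand.
all_states <- c(state.abb, "DC")
continental <- all_states[!all_states %in% c("AK", "HI")]

# Hypothetical stand-in for a per-state download: one row per state.
fetch_stub <- function(st) {
  data.frame(state = st, n_chars = nchar(st), stringsAsFactors = FALSE)
}

# Fetch each piece into a list, then bind the list into a single object.
pieces <- lapply(continental, fetch_stub)
combined <- do.call(rbind, pieces)
```

Collecting the pieces in a list first and binding once at the end avoids growing the combined object inside the loop.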
The other data manipulation function in tigris, geo_join(), is designed to assist with the common but sometimes-messy process of merging tabular data to US Census Bureau shapefiles. Such joined data can then be used for statistical mapping, such as a choropleth map that shows variation in an attribute by the shading of polygons.
In the example below, an analyst uses functions in tigris to help create a choropleth map that shows how the areas represented by legislators in the Texas State House of Representatives, the lower house of the Texas state legislature, vary by political party. By convention in the United States, areas represented by members of the Republican Party are shaded in red, and areas represented by members of the Democratic Party are shaded in blue.
To accomplish this, the analyst loads in a CSV containing information on party representation in Texas by legislative district, and uses the state_legislative_districts() function to retrieve boundaries for the legislative districts. The two datasets can then be joined with geo_join(). The first argument in the geo_join() call represents the object of class "Spatial*DataFrame"; the second argument represents a regular R data frame. The third and fourth arguments specify the columns in the spatial data frame and regular data frame, respectively, to be used to match the rows; if the names of these columns are the same, that name can be passed as a named argument to the by parameter, which is unused here. Once the two datasets are joined, they can be visualized with sp plotting functions.
df <- read.csv("http://personal.tcu.edu/kylewalker/data/txlege.csv",
               stringsAsFactors = FALSE)
districts <- state_legislative_districts("TX", house = "lower", cb = TRUE)
txlege <- geo_join(districts, df, "NAME", "District")
txlege$color <- ifelse(txlege$Party == "R", "red", "blue")
plot(txlege, col = txlege$color)
legend("topright", legend = c("Republican", "Democrat"),
       fill = c("red", "blue"))
As the plot illustrates, Democrats in Texas tend to represent areas in and around major cities such as Dallas, Houston, and Austin, as well as areas along the United States-Mexico border. Republicans, on the other hand, tend to represent rural areas in addition to suburban areas on the edges of metropolitan areas in the state.
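The join underlying this map behaves like a left join on the attribute table. The base R sketch below uses merge() on two small hypothetical tables to show what happens to districts without a matching record; it is an analogy for the logic and not the code of geo_join().

```r
# Hypothetical district attribute table and party table.
districts_tbl <- data.frame(NAME = c("1", "2", "3"), stringsAsFactors = FALSE)
party_tbl <- data.frame(District = c("1", "3"), Party = c("R", "D"),
                        stringsAsFactors = FALSE)

# Left join: keep every district, even those without a matching record.
joined <- merge(districts_tbl, party_tbl, by.x = "NAME", by.y = "District",
                all.x = TRUE)

# Unmatched districts carry NA for Party, and ifelse() propagates that NA.
joined$color <- ifelse(joined$Party == "R", "red", "blue")
print(joined)
```

Checking for NA values after a join like this is a quick way to spot identifiers that failed to match, for example because of differing formats in the two key columns.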
To this point, this paper has employed simplified examples to demonstrate the functionality of tigris; the following scenario combines these examples into an applied analytic and visualization workflow. The goal here is to show how tigris fits in with a broader spatial analysis workflow in R. R has a plethora of packages available for geographic visualization and spatial analysis; for analysts working with United States geographies, tigris can contribute to the analytic process by providing ample data access with a minimum of code, and without having to retrieve datasets outside of R.
This example demonstrates how to create metropolitan area maps of taxation data from the United States Internal Revenue Service (IRS), which are made available at the zip code level (United States Internal Revenue Service 2015). As discussed earlier, zip codes are not physical areas but rather designations given by the United States Postal Service (USPS) to guide mail routes; as such, ZCTAs will be used instead, and accessed with the zctas() function.
As ZCTAs do not have a clear correspondence between their boundaries and those of other Census units, ZCTA boundaries will commonly cross those of metropolitan areas, which are county-based. However, tigris provides programmatic access to metropolitan area boundaries as well, which in turn can be used to identify intersecting ZCTAs through spatial overlay with the sp package. The resultant spatial data can then be merged to data from the IRS and visualized.
Such a workflow could resemble the following. An analyst reads in raw data from the IRS website as an R data frame, and uses the dplyr package (Wickham and Francois 2015) to subset the data frame and compute the average total income reported to the IRS by zip code in thousands of dollars in 2013, assigning the result to the variable df. In the original IRS dataset, A02650 represents the aggregate total income reported to the IRS by zip code in thousands of dollars, and N02650 represents the number of tax returns that reported total income in that zip code.
library(dplyr)
library(stringr)
library(readr)
# Read in the IRS data
zip_data <- "https://www.irs.gov/pub/irs-soi/13zpallnoagi.csv"
df <- read_csv(zip_data) %>%
  mutate(zip_str = str_pad(as.character(ZIPCODE), width = 5,
                           side = "left", pad = "0"),
         incpr = A02650 / N02650) %>%
  select(zip_str, incpr)
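The two derived columns can be checked on a handful of rows without downloading the IRS file. The values below are hypothetical, and formatC() is used here as a base R alternative to str_pad() for the zero-padding step.

```r
# Hypothetical rows mimicking the IRS columns used above.
irs <- data.frame(
  ZIPCODE = c(2138, 76109, 501),
  A02650  = c(500000, 240000, 9000),  # aggregate total income, in $1000s
  N02650  = c(5000, 4000, 100)        # returns reporting total income
)

# Pad to five characters so that codes read as numbers (and thus stripped of
# leading zeros) match the character ZCTA identifiers.
irs$zip_str <- formatC(irs$ZIPCODE, width = 5, format = "d", flag = "0")

# Average total income per return, in thousands of dollars.
irs$incpr <- irs$A02650 / irs$N02650
print(irs[, c("zip_str", "incpr")])
```

Without the padding step, a code such as 2138 would not match its five-character ZCTA identifier, and the corresponding polygon would be left without data after the join.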
The analyst then defines a function that will leverage tigris to read in Census ZCTA and metropolitan area datasets as objects of class "SpatialPolygonsDataFrame", and return the ZCTAs that intersect a given metropolitan area as defined by the analyst.
library(tigris)
library(sp)
# Write function to get ZCTAs for a given metro
get_zips <- function(metro_name) {
  zips <- zctas(cb = TRUE)
  metros <- core_based_statistical_areas(cb = TRUE)
  # Subset for specific metro area
  # (be careful with duplicate cities like "Washington")
  my_metro <- metros[grepl(sprintf("^%s", metro_name),
                           metros$NAME, ignore.case = TRUE), ]
  # Find all ZCTAs that intersect the metro boundary
  metro_zips <- over(my_metro, zips, returnList = TRUE)[[1]]
  my_zips <- zips[zips$ZCTA5CE10 %in% metro_zips$ZCTA5CE10, ]
  # Return those ZCTAs
  return(my_zips)
}
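The caution about duplicate city names in the comment above can be made concrete. The sketch below applies the same anchored, case-insensitive pattern to a short vector of metro names chosen for illustration; match_metro() is a hypothetical helper, and a real metropolitan area file contains many more entries.

```r
# Illustrative metro names only; not a complete or authoritative list.
metro_names <- c(
  "Washington-Arlington-Alexandria, DC-VA-MD-WV",
  "Washington, NC",
  "Washington Court House, OH",
  "Dallas-Fort Worth-Arlington, TX"
)

# Same matching rule as in get_zips(): anchored at the start, case-insensitive.
match_metro <- function(prefix, names) {
  names[grepl(sprintf("^%s", prefix), names, ignore.case = TRUE)]
}

loose <- match_metro("Washington", metro_names)            # ambiguous
tight <- match_metro("Washington-Arlington", metro_names)  # unique
dallas <- match_metro("dallas", metro_names)               # unique
print(loose)
```

Because get_zips() only uses the first element returned by over(), an ambiguous prefix would quietly select whichever matching area comes first; supplying a longer prefix avoids this.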
The analyst can then fetch ZCTA geography for a given metropolitan area, which in this example will be Dallas-Fort Worth, Texas, and merge the IRS income data to it with geo_join(). For visualization, this example uses the tmap package (Tennekes 2015), an excellent option for creating high-quality cartographic products in R, to create a choropleth map. To provide spatial reference to the ZCTAs on the map, major roads obtained with the primary_roads() function in tigris are added to the map as well.
library(tmap)
rds <- primary_roads()
dfw <- get_zips("Dallas")
dfw_merged <- geo_join(dfw, df, "ZCTA5CE10", "zip_str")
tm_shape(dfw_merged, projection = "+init=epsg:26914") +
  tm_fill("incpr", style = "quantile", n = 7, palette = "Greens", title = "") +
  tm_shape(rds, projection = "+init=epsg:26914") +
  tm_lines(col = "darkgrey") +
  tm_layout(bg.color = "ivory",
            title = "Average income by zip code \n(in $1000s US), Dallas-Fort Worth",
            title.position = c("right", "top"), title.size = 1.1,
            legend.position = c(0.85, 0), legend.text.size = 0.75,
            legend.width = 0.2) +
  tm_credits("Data source: US Internal Revenue Service",
             position = c(0.002, 0.002))
Given that both geographic data obtained through tigris and the IRS data are available for the entire country, an R developer could extend this example and create a web application that generates interactive income maps based on user input with the shiny package (Chang, Cheng, Allaire, Xie, and McPherson 2016). Below is an example of such an application, which is viewable online at http://walkerke.shinyapps.io/tigris-zip-income; the code for the application can be viewed at http://github.com/walkerke/tigris-zip-income.
In the application, the user selects a metropolitan area from the drop-down menu, instructing the Shiny server to generate an interactive choropleth map of average reported total income by zip code from the IRS, as in the above example, but in this instance using the leaflet package (Cheng and Xie 2015). The application uses the same process described above for the static map to subset the data; in this instance, however, the Shiny server makes these computations on-the-fly in response to user input. In the figure, the Los Angeles, California metropolitan area is selected; however, all metropolitan areas in the United States are available to users of the application. While both of these cartographic examples require considerable R infrastructure to process the data and ultimately create the visualizations, tigris plays a key role in each by providing direct access to reliable spatial data programmatically from R.
This paper has summarized the functionality of the tigris package for retrieving and working with shapefiles from the United States Census Bureau. High-quality spatial data are essential for the geospatial analyst, but can be difficult to obtain. For R users working on projects that can benefit from United States Census Bureau data, tigris provides direct access to the Census Bureau’s TIGER/Line and cartographic boundary files using a simple and consistent API. In turn, tasks such as looking up FIPS codes to identify the correct datasets to download or combining several Census datasets are reduced to a few lines of R code in tigris.
More significantly, the utility of tigris is exemplified when included in a larger geospatial project that incorporates US Census data, such as the static and interactive maps of IRS data included in this article. These examples illustrate some clear advantages that R has over traditional desktop Geographic Information Systems software for geographic analysis and visualization. To create an interactive application showing IRS data by zip code as in Figure 9, a GIS analyst would traditionally have to search out and download the data from the web; load it into a desktop GIS for merging and calculating new columns; publish the data to a server; and build the application with a web mapping client and web application framework, often requiring several software applications. As shown in this article, this entire process now can take place inside of an R script, helping ensure quality and reproducibility. For projects that require US Census Bureau geographic data, tigris aims to fit well within these types of workflows.
I am indebted to Bob Rudis, whose key contributions to the tigris package were essential in improving its functionality. I would also like to thank Eli Knaap, who encouraged me to write the package; Hadley Wickham for writing R Packages and the devtools package (Wickham 2015; Wickham and Chang 2015), which helped me significantly as I developed tigris; and Roger Bivand as well as the three anonymous reviewers for providing useful feedback on the manuscript.
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Walker, "tigris: An R Package to Access and Work with Geographic Data from the US Census Bureau", The R Journal, 2016
BibTeX citation
@article{RJ-2016-043, author = {Walker, Kyle}, title = {tigris: An R Package to Access and Work with Geographic Data from the US Census Bureau}, journal = {The R Journal}, year = {2016}, note = {https://rjournal.github.io/}, volume = {8}, issue = {2}, issn = {2073-4859}, pages = {231-242} }