In 1935, Edgar Anderson collected size measurements for 150 flowers from three species of Iris on the Gaspé Peninsula in Quebec, Canada. Since then, Anderson’s Iris observations have become a classic dataset in statistics, machine learning, and data science teaching materials. It is included in the base R datasets package as `iris`, making it easy for users to access without knowing much about it. However, the lack of data documentation, presence of non-intuitive variables (e.g. “sepal width”), and perfectly balanced groups with zero missing values make `iris` an inadequate and stale dataset for teaching and learning modern data science skills. Users would benefit from working with a more representative, real-world environmental dataset with a clear link to current scientific research. Importantly, Anderson’s Iris data appeared in a 1936 publication by R. A. Fisher in the Annals of Eugenics (which is often the first-listed citation for the dataset), inextricably linking `iris` to eugenics research. Thus, a modern alternative to `iris` is needed. In this paper, we introduce the palmerpenguins R package (Horst et al. 2020), which includes body size measurements collected from 2007 - 2009 for three species of Pygoscelis penguins that breed on islands throughout the Palmer Archipelago, Antarctica. The `penguins` dataset in palmerpenguins provides an approachable, charismatic, and near drop-in replacement for `iris` with topical relevance for polar climate change and environmental impacts on marine predators. Since the release on CRAN in July 2020, the palmerpenguins package has been downloaded over 462,000 times, highlighting the demand and widespread adoption of this viable `iris` alternative. We directly compare the `iris` and `penguins` datasets for selected analyses to demonstrate that R users, in particular teachers and learners currently using `iris`, can switch to the Palmer Archipelago penguins for many use cases including data wrangling, visualization, linear modeling, multivariate analysis (e.g., PCA), cluster analysis and classification (e.g., by k-means).
In 1935, American botanist Edgar Anderson measured petal and sepal structural dimensions (length and width) for 50 flowers from each of three Iris species: Iris setosa, Iris versicolor, and Iris virginica (Anderson 1935). The manageable but non-trivial size (5 variables and 150 total observations) and characteristics of Anderson’s Iris dataset, including linear relationships and multivariate normality, have made it amenable for introducing a wide range of statistical methods including data wrangling, visualization, linear modeling, multivariate analyses, and machine learning. The Iris dataset is built into a number of software packages including the auto-installed datasets package in R (as `iris`, R Core Team 2021), Python’s scikit-learn machine learning library (Pedregosa et al. 2011), and the SAS Sashelp library (SAS Institute, Cary NC), which has facilitated its widespread use. As a result, eighty-six years after the data were initially published, the Iris dataset remains ubiquitous in statistics, computational methods, software documentation, and data science courses and materials.
There are a number of reasons that modern data science practitioners and educators may want to move on from `iris`. First, the dataset lacks metadata (Anderson 1935), which does not reinforce best practices and limits meaningful interpretation and discussion of research methods, analyses, and outcomes. Of the five variables in `iris`, two (`Sepal.Width` and `Sepal.Length`) are not intuitive for most non-botanists. Even with explanation, the difference between petal and sepal dimensions is not obvious. Second, `iris` contains equal sample sizes for each of the three species (n = 50) with no missing values, which is cleaner than most real-world data that learners are likely to encounter. Third, the single factor (`Species`) in `iris` limits options for analyses. Finally, due to its publication in the Annals of Eugenics by statistician R.A. Fisher (Fisher 1936), `iris` is burdened by a history in eugenics research, which we are committed to addressing through the development of new data science education products as described below.
Given the growing need for fresh data science-ready datasets, we sought to identify an alternative dataset that could be made easily accessible for a broad audience. After evaluating the positive and negative features of `iris` in data science and statistics materials, we established the following criteria for a suitable alternative:
`iris` for most use cases
Here, we describe an alternative to `iris` that largely satisfies these criteria: a refreshing, approachable, and charismatic dataset containing real-world body size measurements for three Pygoscelis penguin species that breed throughout the Western Antarctic Peninsula region, made available through the United States Long-Term Ecological Research (US LTER) Network. By comparing data structure, size, and a range of analyses side-by-side for the two datasets, we demonstrate that the Palmer Archipelago penguin data are an ideal substitute for `iris` for many use cases in statistics and data science education.
Body size measurements (bill length and depth, flipper length, and body mass; flippers are the modified “wings” that penguins use for maneuvering in water), clutch (i.e., egg laying) observations (e.g., date of first egg laid and clutch completion), and carbon (\(^{13}\)C/\(^{12}\)C, \(\delta^{13}\)C) and nitrogen (\(^{15}\)N/\(^{14}\)N, \(\delta^{15}\)N) stable isotope values of red blood cells were collected from 2007 - 2009 for adult male and female Adélie (P. adeliae), chinstrap (P. antarcticus), and gentoo (P. papua) penguins on three islands (Biscoe, Dream, and Torgersen) within the Palmer Archipelago by Dr. Kristen Gorman in collaboration with the Palmer Station LTER, part of the US LTER Network. For complete data collection methods and published analyses, see Gorman et al. (2014). Throughout this paper, penguin species are referred to as “Adélie”, “Chinstrap”, and “Gentoo”.
The data in the palmerpenguins R package are available for use by CC0 license (“No Rights Reserved”) in accordance with the Palmer Station LTER Data Policy and the LTER Data Access Policy, and were imported from the Environmental Data Initiative (EDI) Data Portal at the links below:
R users can install the palmerpenguins package from CRAN:
install.packages("palmerpenguins")
Information, examples, and links to community-contributed materials are available on the palmerpenguins package website: allisonhorst.github.io/palmerpenguins/. See the Appendix for how Python and Julia users can access the same data.
The palmerpenguins R package contains two data objects: `penguins_raw` and `penguins`. The `penguins_raw` data consist of all raw data for 17 variables, recorded completely or in part for 344 individual penguins, accessed directly from EDI (`penguins_raw` properties are summarized in Appendix B). We generally recommend using the curated data in `penguins`, which is a subset of `penguins_raw` retaining all 344 observations, minimally updated (Appendix A) and reduced to the following eight variables:
- `species`: a factor denoting the penguin species (Adélie, Chinstrap, or Gentoo)
- `island`: a factor denoting the Palmer Archipelago island in Antarctica where each penguin was observed (Biscoe Point, Dream Island, or Torgersen Island)
- `bill_length_mm`: a number denoting the length of the dorsal ridge of a penguin bill (millimeters)
- `bill_depth_mm`: a number denoting the depth of a penguin bill (millimeters)
- `flipper_length_mm`: an integer denoting the length of a penguin flipper (millimeters)
- `body_mass_g`: an integer denoting the weight of a penguin’s body (grams)
- `sex`: a factor denoting the sex of a penguin (male, female) based on molecular data
- `year`: an integer denoting the year of study (2007, 2008, or 2009)
The same data exist as comma-separated value (CSV) files in the package (“penguins_raw.csv” and “penguins.csv”), and can be read in using the built-in `path_to_file()` function in palmerpenguins. For example,
library(palmerpenguins)
df <- read.csv(path_to_file("penguins.csv"))
will read in “penguins.csv” as if from an external file, thus automatically parsing the `species`, `island`, and `sex` variables as characters instead of factors. This option allows users opportunities to practice or demonstrate reading in data from a CSV, then updating variable class (e.g., characters to factors).
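The read-then-convert exercise described above might look like the following minimal sketch. It assumes R 4.0.0 or later (where `read.csv()` no longer converts strings to factors by default); the helper name `factor_cols` is introduced here only for illustration.

```r
library(palmerpenguins)

# Read the bundled CSV as if it were an external file
df <- read.csv(path_to_file("penguins.csv"))

# With R >= 4.0.0 defaults, text columns arrive as character vectors
sapply(df[c("species", "island", "sex")], class)

# Convert the three categorical columns to factors
# (factor_cols is an illustrative name, not part of the package)
factor_cols <- c("species", "island", "sex")
df[factor_cols] <- lapply(df[factor_cols], factor)

# Confirm the updated classes
str(df[factor_cols])
```

The same conversion could be done column by column with `as.factor()`; `lapply()` is used here only to keep the sketch short.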
`iris` and `penguins`
The `penguins` data in palmerpenguins is useful and approachable for data science and statistics education, and is uniquely well-suited to replace the `iris` dataset. Comparisons presented are selected examples for common `iris` uses, and are not exhaustive.
Feature | iris | penguins |
---|---|---|
Year(s) collected | 1935 | 2007 - 2009 |
Dimensions (col x row) | 5 x 150 | 8 x 344 |
Documentation | minimal | complete metadata |
Variable classes | double (4), factor (1) | double (2), int (3), factor (3) |
Missing values? | no (n = 0; 0.0%) | yes (n = 19; 0.7%) |
Both `iris` and `penguins` are in tidy format (Wickham 2014) with each column denoting a single variable and each row containing measurements for a single iris flower or penguin, respectively. The two datasets are comparable in size: dimensions (columns × rows) are 5 × 150 and 8 × 344 for `iris` and `penguins`, respectively, and sample sizes within species are similar (Tables 1 & 2).
Notably, while sample sizes in `iris` are the same for all species, sample sizes in `penguins` differ across the three species. The inclusion of three factor variables in `penguins` (`species`, `island`, and `sex`), along with `year`, creates additional opportunities for grouping, faceting, and analysis compared to the single factor (`Species`) in `iris`.
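As a rough sketch of the structural comparison above, the dimensions and group sizes can be checked with base R. Note that `dim()` reports rows first and then columns, the reverse of the "col x row" order used in Table 1.

```r
library(palmerpenguins)

# dim() returns rows, then columns
dim(iris)
dim(penguins)

# Group sizes: balanced in iris, unbalanced in penguins
table(iris$Species)
table(penguins$species)

# The extra factors in penguins allow cross-tabulated groupings;
# useNA = "ifany" also shows penguins with unrecorded sex
table(penguins$species, penguins$sex, useNA = "ifany")
table(penguins$species, penguins$island)
```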
Unlike `iris`, which contains only complete cases, the `penguins` dataset contains a small number of missing values (n = 19 missing, out of 2,752 total values). Missing values and unequal sample sizes are common in real-world data, and add learning opportunities to the `penguins` dataset.
Iris species | Sample size | Penguin species | Female | Male | NA |
---|---|---|---|---|---|
setosa | 50 | Adélie | 73 | 73 | 6 |
versicolor | 50 | Chinstrap | 34 | 34 | 0 |
virginica | 50 | Gentoo | 58 | 61 | 5 |
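The missing values noted above can be located with a few base R calls. This is a minimal sketch; the variable name `n_missing` is introduced here for illustration.

```r
library(palmerpenguins)

# Total number of missing cells, and their share of all cells
n_missing <- sum(is.na(penguins))
n_missing
n_missing / prod(dim(penguins))

# Missing values per column
colSums(is.na(penguins))

# Many summary functions return NA unless missing values are handled explicitly
mean(penguins$body_mass_g)
mean(penguins$body_mass_g, na.rm = TRUE)
```

The last two lines illustrate a teaching point that `iris` cannot provide: learners must decide how to handle `NA` before a summary is meaningful.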
Distributions, relationships between variables, and clustering can be visually explored between species for the four structural size measurements in `penguins` (flipper length, body mass, bill length and depth; Figure 2) and `iris` (sepal width and length, petal width and length; Figure 3).
Both `penguins` and `iris` offer numerous opportunities to explore linear relationships and correlations, within and across species (Figures 2 & 3). A bivariate scatterplot made with the `iris` dataset reveals a clear linear relationship between petal length and petal width. Using `penguins` (Figure 4), we can create a similar scatterplot with flipper length and body mass. The overall trend across all three species is approximately linear for both `iris` and `penguins`. Teachers may encourage students to explore how simple linear regression results and predictions differ when the species variable is omitted, compared to, for example, multiple linear regression with species included (Figure 4).
Notably, distinctions between species are clearer for iris petals - particularly, the much smaller petals for Iris setosa - compared to penguins, in which Adélie and Chinstrap penguins are largely overlapping in body size (body mass and flipper length), and are both generally smaller than Gentoo penguins.
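One way students might explore the regression comparison suggested above is sketched below with base R's `lm()`. The model formulas and the 200 mm example flipper length are illustrative assumptions, not the exact analysis behind Figure 4.

```r
library(palmerpenguins)

# Simple linear regression: species omitted
# (lm() drops rows with missing values by default)
fit_simple <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

# Multiple linear regression: species included as a predictor
fit_species <- lm(body_mass_g ~ flipper_length_mm + species, data = penguins)

coef(fit_simple)
coef(fit_species)

# Compare predictions for a hypothetical 200 mm flipper, one row per species
new_penguins <- data.frame(
  flipper_length_mm = 200,
  species = factor(levels(penguins$species), levels = levels(penguins$species))
)
predict(fit_simple, newdata = new_penguins)   # same value for every species
predict(fit_species, newdata = new_penguins)  # differs by species
```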
Simpson’s Paradox is a data phenomenon in which a trend observed between variables is reversed when data are pooled, omitting a meaningful variable. While often taught and discussed in statistics courses, finding a real-world and approachable example of Simpson’s Paradox can be a challenge. Here, we show one (of several possible - see Figure 2) Simpson’s Paradox example in `penguins`: exploring bill dimensions with and without species included (Figure 5). When penguin species is omitted (Figure 5A), bill length and depth appear negatively correlated overall. The trend is reversed when species is included, revealing an obviously positive correlation between bill length and bill depth within species (Figure 5B).
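The reversal described above can be checked numerically with correlations rather than plots. This is a minimal sketch; the names `bills`, `cor_pooled`, and `cor_by_species` are introduced here for illustration.

```r
library(palmerpenguins)

# Keep rows with both bill measurements recorded
bill_vars <- c("bill_length_mm", "bill_depth_mm")
bills <- penguins[complete.cases(penguins[bill_vars]), ]

# Pooled correlation, ignoring species
cor_pooled <- cor(bills$bill_length_mm, bills$bill_depth_mm)

# Correlation computed separately within each species
cor_by_species <- sapply(
  split(bills, bills$species),
  function(d) cor(d$bill_length_mm, d$bill_depth_mm)
)

cor_pooled      # negative when species is ignored
cor_by_species  # positive within each species
```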
Principal component analysis (PCA) is a dimensionality reduction method commonly used to explore patterns in multivariate data. The `iris` dataset frequently appears in PCA tutorials due to multivariate normality and clear interpretation of variable loadings and clustering.
A comparison of PCA with the four variables of structural size measurements in `penguins` and `iris` (both normalized prior to PCA) reveals highly similar results (Figure 6). For both datasets, one species is distinct (Gentoo penguins, and setosa irises) while the other two species (Chinstrap/Adélie and versicolor/virginica) appear somewhat overlapping in the first two principal components (Figure 6 A,B). Screeplots reveal that the variance explained by each principal component (PC) is very similar across the two datasets, particularly for PC1 and PC2: for `penguins`, 88.15% of total variance is captured by the first two PCs, compared to 95.81% for `iris`, with a similarly large percentage of variance captured by PC1 and PC2 in each (Figure 6 C,D).
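A comparable PCA can be sketched with base R's `prcomp()`, centering and scaling each variable first. This is an assumption about tooling made for illustration (the paper's own code is in its Supplemental Material), and rows with missing measurements are dropped here with `na.omit()`. The cumulative proportions printed for PC2 should be close to the percentages reported above.

```r
library(palmerpenguins)

size_vars <- c("bill_length_mm", "bill_depth_mm",
               "flipper_length_mm", "body_mass_g")

# Drop penguins with missing size measurements before PCA
penguin_sizes <- na.omit(as.data.frame(penguins[size_vars]))

# Center and scale (normalize) each variable, then run PCA
pca_penguins <- prcomp(penguin_sizes, center = TRUE, scale. = TRUE)
pca_iris <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each principal component
var_explained <- function(pca) pca$sdev^2 / sum(pca$sdev^2)

round(cumsum(var_explained(pca_penguins)), 4)
round(cumsum(var_explained(pca_iris)), 4)
```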
Unsupervised clustering by k-means is a common and popular entryway to machine learning and classification, and again, the `iris` dataset is frequently used in introductory examples. The `penguins` data provides similar opportunities for introducing k-means clustering. For simplicity, we compare k-means clustering using only two variables for each dataset: for `iris`, petal width and petal length, and for `penguins`, bill length and bill depth. All variables are scaled prior to k-means. Three clusters (k = 3) are specified for each, since there are three species of irises (Iris setosa, Iris versicolor, and Iris virginica) and penguins (Adélie, Chinstrap and Gentoo).
K-means clustering with penguin bill dimensions and iris petal dimensions yields largely distinct clusters, each dominated by one species (Figure 7). For iris petal dimensions, k-means yields a perfectly separated cluster (Cluster 3) containing all 50 Iris setosa observations and zero misclassified Iris virginica or Iris versicolor (Table 3). While clustering is not perfectly distinct for any penguin species, each species is largely contained within a single cluster, with little overlap from the other two species. For example, considering Adélie penguins (orange observations in Figure 7A): 147 (out of 151) Adélie penguins are assigned to Cluster 3, zero are assigned to Cluster 1, and 4 are assigned to the Chinstrap-dominated Cluster 2 (Table 3). Only 5 (of 68) Chinstrap penguins and 1 (of 123) Gentoo penguins are assigned to the Adélie-dominated Cluster 3 (Table 3).
Cluster | Adélie | Chinstrap | Gentoo | Cluster | setosa | versicolor | virginica |
---|---|---|---|---|---|---|---|
1 | 0 | 9 | 116 | 1 | 0 | 2 | 46 |
2 | 4 | 54 | 6 | 2 | 0 | 48 | 4 |
3 | 147 | 5 | 1 | 3 | 50 | 0 | 0 |
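A cross-tabulation like Table 3 can be approximated with base R's `kmeans()`, as in the sketch below. The seed and `nstart = 25` are arbitrary choices made here for reproducibility, cluster numbers are assigned arbitrarily by the algorithm, and exact counts may differ slightly from Table 3 depending on initialization.

```r
library(palmerpenguins)

# Penguins: bill length and depth, dropping rows with missing values
bills <- na.omit(as.data.frame(
  penguins[c("species", "bill_length_mm", "bill_depth_mm")]
))
bills_scaled <- scale(bills[c("bill_length_mm", "bill_depth_mm")])

set.seed(2020)  # arbitrary seed, chosen only for reproducibility
km_penguins <- kmeans(bills_scaled, centers = 3, nstart = 25)
table(cluster = km_penguins$cluster, species = bills$species)

# Iris: petal length and width
petals_scaled <- scale(iris[c("Petal.Length", "Petal.Width")])
km_iris <- kmeans(petals_scaled, centers = 3, nstart = 25)
table(cluster = km_iris$cluster, species = iris$Species)
```

Because k-means labels are arbitrary, "Cluster 3" in one run may be "Cluster 1" in another; comparing the species composition of each cluster is more informative than the labels themselves.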
Here, we have shown that structural size measurements for Palmer Archipelago Pygoscelis penguins, available as `penguins` in the palmerpenguins R package, offer a near drop-in replacement for `iris` in a number of common use cases for data science and statistics education including exploratory data visualization, linear correlation and regression, PCA, and clustering by k-means. In addition, teaching and learning opportunities in `penguins` are increased due to a greater number of variables, missing values, unequal sample sizes, and Simpson’s Paradox examples. Importantly, the `penguins` dataset encompasses real-world information derived from several charismatic marine predator species with regional breeding populations notably responding to environmental change occurring throughout the Western Antarctic Peninsula region of the Southern Ocean (see Bestelmeyer et al. (2011), Gorman et al. (2014), Gorman et al. (2017), Gorman et al. (2021)). Thus, the `penguins` dataset can facilitate discussions more broadly on biodiversity responses to global change - a contemporary and critical topic in ecology, evolution, and the environmental sciences.
Data in the `penguins` object have been minimally updated from `penguins_raw` as follows:
- Variable names are updated (e.g. `Flipper Length (mm)` to `flipper_length_mm`)
- Values of `species` are truncated to only include the common name (e.g. “Gentoo”, instead of “gentoo penguin (Pygoscelis papua)”)
- `NA`
- `culmen_length_mm` and `culmen_depth_mm` variable names are updated to `bill_length_mm` and `bill_depth_mm`, respectively
- The class of categorical variables (`species`, `island`, `sex`) is updated to factor
- `year` was pulled from clutch observations
`penguins_raw` dataset
Feature | penguins_raw |
---|---|
Year(s) collected | 2007 - 2009 |
Dimensions (col x row) | 17 x 344 |
Documentation | complete metadata |
Variable classes | character (9), Date (1), numeric (7) |
Missing values? | yes (n = 336; 5.7%) |
Python: Python users can load the palmerpenguins datasets into their Python environment using the following code to install and access data in the palmerpenguins Python package:
pip install palmerpenguins
from palmerpenguins import load_penguins
penguins = load_penguins()
Julia: Julia users can access the penguins data in the PalmerPenguins.jl package. Example code to import the penguins data through PalmerPenguins.jl (more information on PalmerPenguins.jl from David Widmann can be found here):
julia> using PalmerPenguins
julia> table = PalmerPenguins.load()
TensorFlow: TensorFlow users can access the penguins data in TensorFlow Datasets. Information and examples for penguins data in TensorFlow can be found here.
All analyses were performed in the R language environment using version 4.1.2 (R Core Team 2021). Complete code for this paper is shared in the Supplemental Material. We acknowledge the following R packages used in analyses, with gratitude to developers and contributors:
Supplementary materials are available in addition to this article. They can be downloaded at RJ-2022-020.zip
datasets, palmerpenguins, GGally, ggiraph, ggplot2, kableExtra, paletteer, colorblindr, patchwork, plotly, recipes, broom, shadowtext, tidyverse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Horst, et al., "Palmer Archipelago Penguins Data in the palmerpenguins R Package - An Alternative to Anderson's Irises", The R Journal, 2022
BibTeX citation
@article{RJ-2022-020, author = {Horst, Allison M. and Hill, Alison Presmanes and Gorman, Kristen B.}, title = {Palmer Archipelago Penguins Data in the palmerpenguins R Package - An Alternative to Anderson's Irises}, journal = {The R Journal}, year = {2022}, note = {https://doi.org/10.32614/RJ-2022-020}, doi = {10.32614/RJ-2022-020}, volume = {14}, issue = {1}, issn = {2073-4859}, pages = {244-254} }