maxent is a package with tools for data classification using multinomial logistic regression, also known as maximum entropy. The focus of this maximum entropy classifier is to minimize memory consumption on very large datasets, particularly sparse document-term matrices represented by the tm text mining package.
The information era has provided political scientists with a wealth of data, yet along with this data has come the need to make sense of it all. Researchers have spent the past two decades manually classifying, or "coding," data according to their specifications: a monotonous task often assigned to undergraduate research assistants.
In recent years, supervised machine learning has become a boon in the social sciences, supplementing assistants with a computer that can classify documents with comparable accuracy. One of the main programming languages used for supervised learning in political science is R, which contains a plethora of machine learning add-ons via its package repository, CRAN.
Multinomial logistic regression, or maximum entropy, has historically been a strong contender for text classification via supervised learning. When compared to the naive Bayes algorithm, a common benchmark for text classification, maximum entropy generally classifies documents with higher accuracy (Nigam, Lafferty, and McCallum 1999).
Although packages for multinomial logistic regression exist on CRAN (e.g., multinom in package nnet, or package mlogit), none of them handle the data using compressed sparse matrices, and most use a formula interface to define the predictor variables, approaches which tend to be inefficient. For most text classification applications, these approaches do not provide the level of optimization necessary to train and classify maximum entropy models using large corpora. Typically, this problem manifests as the error "cannot allocate vector of size N" that terminates execution of the script.
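The scale of the problem is easy to quantify. As a back-of-the-envelope sketch (the corpus dimensions here are hypothetical, chosen only for illustration), a dense document-term matrix stores eight bytes per cell, including the overwhelming majority of cells that are zero:

```r
# Hypothetical corpus: 100,000 documents by 50,000 unique terms.
# A dense double-precision matrix stores every cell at 8 bytes,
# zeros included.
100000 * 50000 * 8 / 2^30
#> [1] 37.2529   # gigabytes, far beyond a typical R session's limit
```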
maxent seeks to address this issue by using an efficient C++ library based on an implementation written by Yoshimasa Tsuruoka at the University of Tokyo (Tsuruoka 2011). Tsuruoka reduces memory consumption by using three efficient algorithms for parameter estimation: the limited-memory Broyden-Fletcher-Goldfarb-Shanno method (L-BFGS), orthant-wise limited-memory quasi-Newton optimization (OWLQN), and stochastic gradient descent optimization (SGD). Typically, maximum entropy models grow rapidly in size when trained on large text corpora, and minimizing the number of parameters in the model is key to reducing memory consumption. In this respect, these algorithms outperform alternative optimization techniques such as the conjugate gradient method (Nocedal 1990). Furthermore, the SGD algorithm provides particularly efficient parameter estimation when dealing with large-scale learning problems (Bottou and Bousquet 2008).
After modifying the C++ implementation to make it work with R and creating an interface to its functions using the Rcpp package (Eddelbuettel and François 2011), the maxent package yielded impressively efficient memory usage on large corpora created by tm (Feinerer, Hornik, and Meyer 2008), as can be seen in Figure 1. Additionally, maxent includes the as.compressed.matrix() function for converting various matrix representations, including "DocumentTermMatrix" found in the tm package, "Matrix" found in the Matrix package, and "simple_triplet_matrix" found in package slam, into the "matrix.csr" representation from package SparseM (Koenker and Ng 2011). In the case of document-term matrices generated by tm, this function eliminates the intermediate step of converting the "DocumentTermMatrix" class to a standard "matrix", which consumes significant amounts of memory when representing large datasets. Memory consumption was measured using a shell script that tracked the peak memory usage reported by the UNIX top command during execution of the maxent() and predict() functions in a sterile R environment.
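The savings from skipping the dense intermediate are easy to see on the bundled NYTimes dataset, introduced in the usage example below. A minimal sketch comparing the in-memory footprint of the two representations with the base R object.size() function:

```r
library(tm)
library(maxent)

# Build a document-term matrix from the bundled NYTimes headlines.
data <- read.csv(system.file("data/NYTimes.csv.gz", package = "maxent"))
corpus <- Corpus(VectorSource(data$Title))
dtm <- DocumentTermMatrix(corpus)

# Dense coercion materializes every zero cell; the compressed
# "matrix.csr" form stores only the nonzero entries.
object.size(as.matrix(dtm))
object.size(as.compressed.matrix(dtm))
```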
Furthermore, the standard "DocumentTermMatrix" represents i indices as repeating values, which needlessly consumes memory. The i indices link a document to its features via an intermediate j index. In the example below, an eight-digit series of 1's signifies that the first eight j indices correspond to the first document in the "DocumentTermMatrix".
> matrix$i
1 1 1 1 1 1 1 1 2 2 2 2 2 2 3
3 3 3 3 3 3 3 3 4 4 4 4 4 4 4
4 4 4 5 5 5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 7 7 7 7 7 8 8 8 8
8 8
Conversely, the SparseM package used in maxent consolidates repeating values into a range of values. In the next example, j indices 1 through 8 correspond to the first document, 9 through 14 to the second document, and so forth. This removes the excess integers found in the "DocumentTermMatrix" triplet representation, further reducing the memory footprint of the sparse matrix. The index vector below has been reduced from 62 integers to 9, approximately one seventh of the original size.
> sparse@ia
1 9 15 24 34 46 52 57 63
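The two representations are straightforward to reconcile. As a short sketch using the matrix and sparse objects from the examples above, tabulating the repeated i indices of the triplet form recovers the same per-document term counts as differencing the compressed row pointers:

```r
# Terms per document from the triplet form: count occurrences of
# each document index.
tabulate(matrix$i)
#> [1]  8  6  9 10 12  6  5  6

# The same counts from the compressed form: each document's count
# is the gap between consecutive row pointers.
diff(sparse@ia)
#> [1]  8  6  9 10 12  6  5  6
```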
We read in our dataset using input/output functions, in this case read.csv(). The NYTimes.csv.gz dataset is bundled with the maxent package, and contains 4,000 manually coded headlines from The New York Times (Boydstun).
data <- read.csv(
  system.file("data/NYTimes.csv.gz",
              package = "maxent"))
Next, we use tm to transform our data into a corpus, and subsequently a "DocumentTermMatrix" or its transpose, the "TermDocumentMatrix". We then convert our "DocumentTermMatrix" into a compressed "matrix.csr" representation using the as.compressed.matrix() function.
corpus <- Corpus(VectorSource(data$Title))
matrix <- DocumentTermMatrix(corpus)
sparse <- as.compressed.matrix(matrix)
We are now able to specify a training set and a classification set of data, and start the supervised learning process using the maxent() and predict() functions. In this case we have specified a training set of 1,000 documents and a classification set of 500 documents.
model <- maxent(sparse[1:1000,],
                data$Topic.Code[1:1000])
results <- predict(model, sparse[1001:1500,])
Although maxent's primary appeal is to users who need low-memory multinomial logistic regression, the process can also be run using standard matrices. However, the input matrix must be an as.matrix() representation of a "DocumentTermMatrix"; class "TermDocumentMatrix" is not currently supported.
model <- maxent(as.matrix(matrix)[1:1000,],
                as.factor(data$Topic.Code)[1:1000])
results <- predict(model, as.matrix(matrix)[1001:1500,])
To save time during the training process, we can save our model to a file using the save.model() function. The saved maximum entropy model is accessible using the load.model() function.
model <- maxent(sparse[1:1000,],
                data$Topic.Code[1:1000])
save.model(model, "myModel")
saved_model <- load.model("myModel")
results <- predict(saved_model, sparse[1001:1500,])
In political science, supervised learning is heavily used to categorize textual data such as legislation, television transcripts, and survey responses. To demonstrate the accuracy of the maxent package in a research setting, we will train the multinomial logistic regression classifier using 4,000 summaries of legislation proposed in the United States Congress and then test accuracy on 400 new summaries. The summaries have been manually partitioned into 20 topic codes corresponding to policy issues such as health, agriculture, education, and defense (Adler and Wilkerson 2004).
We begin by loading the maxent package and reading in the bundled USCongress.csv.gz dataset.
data <- read.csv(
  system.file("data/USCongress.csv.gz",
              package = "maxent"))
Using the built-in sample() function in R and the tm text mining package, we then randomly sample 4,400 summaries and generate a "TermDocumentMatrix". To reduce noise in the data, we convert all characters to lowercase and remove punctuation, numbers, stop words, and whitespace.
indices <- sample(1:4449, 4400, replace = FALSE)
sample <- data[indices,]
vector <- VectorSource(as.vector(sample$text))
corpus <- Corpus(vector)
matrix <- TermDocumentMatrix(corpus,
  control = list(weighting = weightTfIdf,
                 language = "english",
                 tolower = TRUE,
                 stopwords = TRUE,
                 removeNumbers = TRUE,
                 removePunctuation = TRUE,
                 stripWhitespace = TRUE))
Next, we pass this matrix into the maxent() function to train the classifier, partitioning the first 4,000 summaries as the training data. Similarly, we partition the last 400 summaries as the new data to be classified. In this case, we use stochastic gradient descent for parameter estimation and set a held-out sample of 400 summaries to account for model overfitting.
model <- maxent(matrix[,1:4000],
                sample$major[1:4000],
                use_sgd = TRUE,
                set_heldout = 400)
results <- predict(model, matrix[,4001:4400])
When we compare the predicted value of the results to the true value assigned by a human coder, we find that 80% of the summaries are accurately classified. Considering we only used 4,000 training samples and 20 labels, this result is impressive, and the classifier could certainly be used to supplement human coders. Furthermore, the algorithm is very fast; maxent uses only 135.4 megabytes of RAM and finishes in 53.3 seconds.
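That accuracy figure can be reproduced with a few lines of R. This is a sketch; it assumes that the first column of the matrix returned by predict() holds the predicted labels, with the per-category probabilities in the remaining columns:

```r
# Compare predicted labels against the human-assigned topic codes
# for the 400 test summaries.
true <- as.character(sample$major[4001:4400])
pred <- as.character(results[, 1])
mean(pred == true)   # proportion classified correctly, roughly 0.80
```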
The maxent package also includes the tune.maxent() function to determine parameters that will maximize the recall of the results during the training process. The function tests the algorithm across a sample space of 18 parameter configurations, varying the regularization terms for L1 and L2 regularization, toggling SGD optimization, and setting a held-out portion of the data to detect overfitting.
We can use the tuning function to optimize a model that predicts the species of an iris given the dimensions of its petals and sepals. Using Anderson's Iris dataset, which is bundled with R, we apply n-fold cross-validation to test the model under several parameter configurations.
data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species

f <- tune.maxent(x, y, nfold = 3, showall = TRUE)
Table 1: Accuracy of the 18 parameter configurations tested by tune.maxent() on the iris data.

| L1 | L2 | SGD | Held-out | Accuracy | Pct. Fit |
|-----|-----|-----|----------|----------|----------|
| 0 | 0 | 0 | 0 | 0.948 | 0.975 |
| 0.2 | 0 | 0 | 0 | 0.666 | 0.685 |
| 0.4 | 0 | 0 | 0 | 0.666 | 0.685 |
| 0.6 | 0 | 0 | 0 | 0.666 | 0.685 |
| 0.8 | 0 | 0 | 0 | 0.666 | 0.685 |
| 1 | 0 | 0 | 0 | 0.666 | 0.685 |
| 0 | 0 | 0 | 25 | 0.948 | 0.975 |
| 0 | 0.2 | 0 | 0 | 0.966 | 0.993 |
| 0 | 0.4 | 0 | 0 | 0.966 | 0.993 |
| 0 | 0.6 | 0 | 0 | 0.972 | 1 |
| 0 | 0.8 | 0 | 0 | 0.972 | 1 |
| 0 | 1 | 0 | 0 | 0.966 | 0.993 |
| 0 | 0 | 1 | 0 | 0.966 | 0.993 |
| 0.2 | 0 | 1 | 0 | 0.666 | 0.685 |
| 0.4 | 0 | 1 | 0 | 0.666 | 0.685 |
| 0.6 | 0 | 1 | 0 | 0.666 | 0.685 |
| 0.8 | 0 | 1 | 0 | 0.666 | 0.685 |
| 1 | 0 | 1 | 0 | 0.666 | 0.685 |
As can be seen in Table 1, the tuning results show that L2 regularization provides the best accuracy, while stochastic gradient descent without any regularization performs nearly as well. We can also see that L1 regularization should be avoided for this application, as it drastically reduces accuracy. The held-out parameter is tested without L1/L2 regularization and SGD because tests showed a decrease in accuracy when combining them. However, when L1/L2 regularization and SGD do not yield an increase in accuracy, using held-out data can improve performance. Using this information, we can improve the accuracy of our classifier without adding or manipulating the training data.
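Applying the winning configuration is then a one-liner. A sketch, assuming the l1_regularizer and l2_regularizer arguments of maxent() mirror the L1 and L2 columns of the tuning table:

```r
# Refit on the full iris data with the best configuration from
# Table 1 (L2 regularizer of 0.6, no L1, no SGD, no held-out data).
model <- maxent(as.matrix(x), y, l2_regularizer = 0.6)
pred <- predict(model, as.matrix(x))
```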
The maxent package provides a fast, low-memory maximum entropy classifier for a variety of classification tasks including text classification and natural language processing. The package integrates directly into the popular tm text mining package, and provides sample datasets and code to get started quickly. Additionally, a number of advanced features are available to prevent model overfitting and provide more accurate results.
This project was made possible through financial support from the University of California at Davis, University of Antwerp, and Sciences Po Bordeaux. I am indebted to Professor Yoshimasa Tsuruoka for taking the time to answer questions about his excellent C++ maximum entropy implementation, as well as Professor Amber E. Boydstun for her New York Times dataset and comments on this paper. This paper also benefited from the feedback of two anonymous reviewers.