LISTSERV - POLMETH Archives - LISTSERV.WUSTL.EDU

Title:      Extracting Systematic Social Science Meaning from Text

Authors:    Daniel Hopkins, Gary King

Entrydate:  2007-07-12 07:18:35

Keywords:   automated content analysis, machine learning, simulated
extrapolation, non-parametric estimation, internet, 2008 U.S.
Presidential election

Abstract:   We develop two methods of automated content analysis that
give approximately unbiased estimates of quantities of theoretical
interest to social scientists.  With a small sample of documents hand
coded into investigator-chosen categories, our methods can give
accurate estimates of the proportion of text documents in each
category in a larger population. Existing methods that succeed at
maximizing the percentage of documents correctly classified can still
produce substantially biased estimates of the category proportions of
interest.  Our first approach corrects this bias for
any existing classifier, with no additional assumptions.  Our second
method estimates the proportions without the intermediate step of
individual document classification, and thereby greatly reduces the
required assumptions.  For both methods, we also correct
statistically, apparently for the first time, for the far
less-than-perfect levels of inter-coder reliability that typically
characterize human attempts to classify documents, an approach that
will normally outperform even hand coding of the entire population
when that is feasible.  We illustrate these methods by tracking the daily opinions
of millions of people about candidates for the 2008 presidential
nominations in online blogs, data we introduce and make available
with this article, and through evaluations in available corpora from
other areas, including movie reviews, university web sites, and Enron
emails.  We also offer easy-to-use software that implements all
methods described.
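
To make the bias in aggregate proportions concrete, here is a minimal
sketch for the two-category case. It is not taken from the paper or its
software: it assumes the classifier's error rates (sensitivity and
specificity, as measured on the hand-coded sample) carry over to the
larger population, and it applies a generic misclassification
correction. The function name and all numbers are hypothetical.

```python
def corrected_proportion(observed_pos_rate, sensitivity, specificity):
    """Correct a classifier's aggregate positive rate for misclassification.

    Assumes P(pred=1) = sens * p + (1 - spec) * (1 - p) and solves for p,
    the true share of documents in the positive category.
    """
    denom = sensitivity + specificity - 1.0
    if abs(denom) < 1e-12:
        raise ValueError("classifier is uninformative; proportion not identified")
    p = (observed_pos_rate - (1.0 - specificity)) / denom
    return min(1.0, max(0.0, p))  # clip to the valid range [0, 1]


# Hypothetical numbers, for illustration only.
true_p = 0.20            # true share of documents in the category
sens, spec = 0.80, 0.90  # error rates estimated from the hand-coded sample

# Share the classifier would label positive in the population.
raw = sens * true_p + (1 - spec) * (1 - true_p)

est = corrected_proportion(raw, sens, spec)
print(round(raw, 2), round(est, 2))
```

With these illustrative numbers the classifier is fairly accurate on
individual documents, yet its raw share (0.24) overstates the true
proportion (0.20); the correction recovers the true value only to the
extent that the assumed error rates hold in the population.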

http://polmeth.wustl.edu/retrieve.php?id=701