POLMETH Archives

Political Methodology Society

POLMETH@LISTSERV.WUSTL.EDU

Subject:
From:
Kim Hill <[log in to unmask]>
Reply To:
Political Methodology Society <[log in to unmask]>
Date:
Mon, 2 Jun 2014 10:23:41 -0500
Colleagues,

 

Several readers urged me to summarize the replies to my earlier query to the
list about software for web scraping.  So here is a tip-of-the-iceberg
version.  That is, it summarizes the major recommendations without the
particular details (which were often very particular).  I assume, though,
that this summary can point readers down the general paths they might prefer
and indicate which named software websites they should investigate to
assemble a complete web scraping software package.

 

The thrust of most of the suggestions, which would seem to comport with the
ambitions of most readers of this list, is that one should choose a base
software such as R, Perl, or Python and then explore the web scraping tools
that are compatible with it.

 

Relatedly, several folks observed that the character of the web site or
sites you wish to scrape is important: the complexity of their code and
structure, and whether they are relatively static or dynamic.  If the sites
are very complex, I infer you'll need more complex, programmable software.
Similarly, it was observed that the complexity of your own goals (likely in
terms of how frequently you wish to scrape given sites, how fully you wish
to scrape them, and so on) can affect the choice of software.
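
To make the static case concrete, here is a minimal sketch in Python.  It is
an illustration only and is not drawn from any reply to the list.  It uses
just the html.parser module from the Python standard library rather than any
of the tools named in this message, and the page text, the class name
LinkCollector, and the function name extract_links are all hypothetical.  A
real scraper would first download the page, and a dynamic site would likely
need a browser-driving tool instead.

```python
from html.parser import HTMLParser

# Hypothetical static page; a real scraper would download this text first.
SAMPLE_HTML = """
<html><body>
  <h1>Agency Reports</h1>
  <a href="/reports/2013.html">2013 report</a>
  <a href="/reports/2014.html">2014 report</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the anchor currently open, if any
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return every (href, text) pair found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

Because the whole page arrives as one block of HTML, a single pass over the
text is enough here.  That is the sense in which static sites are the easier
case.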

 

For R, readers suggested the tools RCurl, XML, and RSelenium.

 

For Perl, the suggested tool was WWW::Mechanize.

 

For Python, the suggested tools were Scrapy, Portia, Beautiful Soup,
PyQuery, and Selenium.

 

I am working with our recent Ph.D., Grant Ferguson, and my current ABD
student, Soren Jordan, to develop software for scraping government web
sites.  Grant and Soren are putting together a package based on Python.  But
as Grant told me after some work on that effort, these tools "don't do the
same kind of scraping in different ways, but do different kinds of scraping
in different ways (some scrape text, some visual images, some tables, and so
on)."
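
As a small illustration of why table scraping is its own kind of task, here
is a hedged sketch in Python.  It is not Grant and Soren's package and does
not use any of the tools named above; it relies only on the standard
library's html.parser module, and the sample table, the class name
TableCollector, and the function name extract_table are hypothetical.  It
assumes a simple table without nested tables or merged cells.

```python
from html.parser import HTMLParser

# Hypothetical table markup; the agency names and years are made up.
SAMPLE_TABLE = """
<table>
  <tr><th>Agency</th><th>Year</th></tr>
  <tr><td>Agency A</td><td>2013</td></tr>
  <tr><td>Agency B</td><td>2014</td></tr>
</table>
"""


class TableCollector(HTMLParser):
    """Collect table rows as lists of cell strings (no nested tables)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row currently open, if any
        self._cell = None  # text fragments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text between cells (such as whitespace) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_table(html):
    """Return the rows of a simple HTML table as lists of strings."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for row in extract_table(SAMPLE_TABLE):
        print(row)
```

Unlike plain text extraction, the code has to track which row and which cell
it is inside, and that bookkeeping is what a dedicated table tool handles for
you.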

 

Also, Simon Munzert directed me to a forthcoming book on web scraping of
which he is a co-author and which uses R as its base software.  A blurb for
that book can be found at 

http://eu.wiley.com/WileyCDA/WileyTitle/productCd-111883481X.html

 

Finally, Bear Braumoeller put in a vote for Outwit Hub, which is a
lower-tech tool that he especially recommended for use in undergraduate
classes.  (There is a free, limited-function "light" version and a more
powerful "pro" version available for $90.)

 

In our research, Grant, Soren, and I will work with both the Python package
they are developing and Outwit Hub, so that we can compare a high-end and a
low-end approach.  If we learn anything interesting in our venture, one of
us will report it to the list later this summer.

 

kqh

 

Kim Quaile Hill

Cullen-McFadden Professor of Political Science, 

Presidential Professor for Teaching Excellence, and

Eppright Professor in Undergraduate Teaching Excellence

Department of Political Science

4348 TAMU

Texas A&M University

College Station, TX 77843-4348

ph. 979/845-8235

fax 979/847-8924

e-mail: [log in to unmask]

 


**********************************************************
             Political Methodology E-Mail List
   Editors: Ethan Porter        <[log in to unmask]>
            Gregory Whitfield   <[log in to unmask]>
**********************************************************
        Send messages to [log in to unmask]
  To join the list, cancel your subscription, or modify
           your subscription settings visit:

          http://polmeth.wustl.edu/polmeth.php

**********************************************************
