POLMETH Archives

Political Methodology Society

POLMETH@LISTSERV.WUSTL.EDU

Subject:
From:
John Nelson <[log in to unmask]>
Reply To:
Political Methodology Society <[log in to unmask]>
Date:
Sun, 18 May 2014 13:01:31 -0400
One of the tutorials you linked uses BeautifulSoup and Python. An
alternative -- one that I favor -- is PyQuery
<http://pythonhosted.org/pyquery/>. It's increasing in popularity because
it's modeled after jQuery, which has a really nice (read: simple) API for
parsing HTML documents.
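
As a rough illustration of that jQuery-style API, here is a minimal sketch
that pulls link text and targets out of a page with PyQuery. The sample
HTML, the function name extract_links, and the "#reports" selector are all
made up for the example; a real scraper would download the page first.

```python
# Hypothetical sample page, standing in for HTML fetched from a real site.
SAMPLE_HTML = """<html><body>
  <h1>Agency reports</h1>
  <ul id="reports">
    <li><a href="/reports/2013.pdf">Annual report 2013</a></li>
    <li><a href="/reports/2012.pdf">Annual report 2012</a></li>
    <li>No link here</li>
  </ul>
</body></html>"""


def extract_links(html, selector="#reports a[href]"):
    """Return (link text, href) pairs for elements matching a CSS selector.

    The function name and default selector are illustrative assumptions.
    """
    # Third-party dependency (pip install pyquery), imported inside the
    # function so the requirement is visible right where it is used.
    from pyquery import PyQuery

    doc = PyQuery(html)  # parse the HTML string
    # doc(selector) selects elements the way jQuery's $() does;
    # .items() yields each match wrapped as its own PyQuery object.
    return [(a.text(), a.attr("href")) for a in doc(selector).items()]
```

Calling extract_links(SAMPLE_HTML) should give one pair per link in the
list, and the list item without a link is skipped because the selector
requires an href attribute.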

However, I'd suggest you look at Selenium <http://docs.seleniumhq.org/> and
the Python bindings for Selenium <https://pypi.python.org/pypi/selenium>.
Selenium is heavily used by professional web developers for automated
browser testing. (That's what it was designed for.) It's also
exceptionally good for academic scraping. Basically, you can mechanize a
browser instance. As such, you're working with the actual DOM (rather than
parsing raw HTML), so for JavaScript-heavy sites, what you see is what you
get (to parse). Also -- and, I think, more importantly -- it fails
obviously. That is, if your scraper breaks on some page, the browser window
is still open, so you can manually inspect the source or, occasionally,
read the message "you've been banned."
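
A minimal sketch of that "fails obviously" workflow, assuming the current
Selenium 4 Python bindings and a locally installed Firefox. The function
names and the choice to raise on an empty match are illustrative, not part
of Selenium itself.

```python
def collect_texts(driver, css_selector):
    """Return the visible text of every element matching css_selector.

    Raises RuntimeError when nothing matches, so a changed layout or a
    block page is noticed at once instead of yielding an empty dataset.
    """
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    elements = driver.find_elements("css selector", css_selector)
    if not elements:
        raise RuntimeError(
            "no elements match %r at %s" % (css_selector, driver.current_url)
        )
    return [element.text for element in elements]


def scrape_once(url, css_selector):
    """Open a visible Firefox window, load url, and collect matching text."""
    # Third-party dependency (pip install selenium); needs Firefox installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get(url)  # the browser runs the page's JavaScript before we look
    # If this raises, quit() below is never reached, so the window stays
    # open and the rendered page can be inspected by hand.
    texts = collect_texts(driver, css_selector)
    driver.quit()  # close the browser only after a successful scrape
    return texts
```

The split into two functions is only so that the matching logic can be
exercised with a stand-in driver object, without launching a browser.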

If you do not have to parse JavaScript-generated elements and the target is
straightforward, I'd use PyQuery. Otherwise, I'd use Selenium through
Python. (It has bindings for your favorite language. Like I said, it's
widely used by web developers.) Even if you want continuous scraping of a
large website -- as opposed to a single snapshot for a study -- you can run
Selenium on a headless browser (see: PhantomJS <http://phantomjs.org/>) and
not be bothered by the persistently open window.
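
A hedged sketch of the headless variant. Recent Selenium releases no longer
ship a PhantomJS driver, so this example swaps in headless Chrome, which
plays the same role. The "--headless=new" flag assumes a reasonably recent
Chrome, and the function names and the five-second pause are illustrative
choices.

```python
import time


def make_headless_driver():
    """Start Chrome without a visible window (stand-in for PhantomJS)."""
    # Third-party dependency (pip install selenium); needs Chrome installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: recent Chrome
    return webdriver.Chrome(options=options)


def scrape_pages(driver, urls, css_selector, pause_seconds=5.0,
                 sleep=time.sleep):
    """Visit each URL in turn and map it to the text of matching elements.

    Pauses between requests so the target server is not hammered.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(pause_seconds)  # be polite: wait before the next request
        driver.get(url)
        # "css selector" is the string value of selenium's By.CSS_SELECTOR.
        elements = driver.find_elements("css selector", css_selector)
        results[url] = [element.text for element in elements]
    return results
```

Typical use would be to create the driver once with make_headless_driver(),
pass it to scrape_pages() with a list of URLs, and call driver.quit() when
the run is finished.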

Good luck,
John B Nelson


On Thu, May 15, 2014 at 4:26 PM, Kim Hill <[log in to unmask]> wrote:

> Colleagues,
>
>
>
> Might one, two, or more of you have advice on software for web scraping?
> My particular and immediate interest is in one-time scraping of modest
> amounts of text data from government websites in the United States.  Of
> course, if I like this hammer...
>
>
>
> I have found only brief anecdotal comments about how to do it in political
> science methods papers, and even in papers that obviously have carried out
> such a task.
>
>
>
> Much software of this ilk is touted on the web, but rare is the online
> paper that reviews more than one piece of software.  And the most useful,
> if still limited, papers of the latter sort that I have found are at:
>
>
>
> http://lethain.com/an-introduction-to-compassionate-screenscraping/
>
>
>
> and
>
>
>
>
> http://www.notprovided.eu/six-tools-web-scraping-use-data-journalism-creating-insightful-content/
>
>
>
> From having read those sources and the text on a number of different
> software options, I am led to believe that such software varies by how
> "user friendly" it is, how technically sophisticated it can be, and how
> much user-specified programming is necessary to employ a given software
> package.  Thus I, and perhaps other readers of this list, might appreciate
> advice from those of you who have used such software about several options
> across the range of those attributes.
>
>
>
> Kim
>
>
>
> Kim Quaile Hill
>
> Cullen-McFadden Professor of Political Science,
>
> Presidential Professor for Teaching Excellence, and
>
> Eppright Professor in Undergraduate Teaching Excellence
>
> Department of Political Science
>
> 4348 TAMU
>
> Texas A&M University
>
> College Station, TX 77843-4348
>
> ph. 979/845-8235
>
> fax 979/847-8924
>
> e-mail:   <mailto:[log in to unmask]> [log in to unmask]
>
>
>
>
> **********************************************************
>              Political Methodology E-Mail List
>    Editors: Ethan Porter        <[log in to unmask]>
>             Gregory Whitfield   <[log in to unmask]>
> **********************************************************
>         Send messages to [log in to unmask]
>   To join the list, cancel your subscription, or modify
>            your subscription settings visit:
>
>           http://polmeth.wustl.edu/polmeth.php
>
> **********************************************************
>
