POLMETH Archives

Political Methodology Society

POLMETH@LISTSERV.WUSTL.EDU

Subject:
From:
"Thomas J. Leeper" <[log in to unmask]>
Reply To:
Political Methodology Society <[log in to unmask]>
Date:
Sun, 18 May 2014 22:51:09 +0200
Content-Type:
text/plain
Just to chip in, there are analogous tools for R. The readily
available packages RCurl and XML provide pretty much everything you'll
need to scrape static webpages. There are various tutorials online,
but of course anything you code up will be highly application-specific,
depending on the structure of the data you're trying to
access.
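
To make the static-page case concrete, here is a minimal sketch in Python (the language used elsewhere in this thread) rather than R. It uses only the standard library as a stand-in for the RCurl/XML workflow and is not the API of either package. The names LinkExtractor, extract_links and SAMPLE_HTML, and the example.org URLs, are hypothetical and chosen only for illustration; a real scraper would fetch the page first and would need selectors matched to that site's markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute URL, link text) pairs from <a href=...> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []     # finished (url, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's base URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html, base_url):
    """Parse an HTML string and return its links as (url, text) pairs."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content; in practice this string would come from
# something like urllib.request.urlopen(url).read().decode("utf-8").
SAMPLE_HTML = """
<html><body>
  <h1>Press releases</h1>
  <ul>
    <li><a href="/news/2014/05/budget.html">Budget   statement</a></li>
    <li><a href="https://example.org/report.pdf">Annual <b>report</b></a></li>
    <li><a name="top">no href here</a></li>
  </ul>
</body></html>
"""

if __name__ == "__main__":
    for url, text in extract_links(SAMPLE_HTML, "https://www.example.org/"):
        print(url, "|", text)
```

The parsing logic is the application-specific part: a different site would need different tags and attributes, which is why no single script transfers cleanly from one scraping job to the next.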

For dynamic pages, bindings for Selenium are provided via RSelenium:
https://github.com/johndharrison/RSelenium. Conveniently, there is a
webinar on the package this week. Here's a link:
http://www.r-bloggers.com/the-rselenium-r-package-free-webinar/

Best,
-Thomas

Thomas J. Leeper
http://www.thomasleeper.com


On Sun, May 18, 2014 at 7:01 PM, John Nelson <[log in to unmask]> wrote:
> The one tutorial you linked used BeautifulSoup and Python. An alternative
> -- one that I favor -- is PyQuery <http://pythonhosted.org/pyquery/>. It's
> increasing in popularity because it's modeled after jQuery, which has a
> really nice (read: simple) API for parsing HTML documents.
>
> However, I'd suggest you look at Selenium <http://docs.seleniumhq.org/> and
> the Python bindings for Selenium <https://pypi.python.org/pypi/selenium>.
> Selenium is heavily used by professional web developers as part of their
> unit-testing code. (That's what it was designed for.) It's also
> exceptionally good for academic scraping. Basically, you can mechanize a
> browser instance. As such, you're working with the actual DOM (rather than
> parsing HTML), so for JavaScript-heavy sites, what you see is what you get
> (to parse). Also -- and, I think, more importantly -- it fails obviously.
> That is, if your scraper breaks on some page, the browser window is still
> open; you can manually inspect the source; or, occasionally, you can read
> the message "you've been banned."
>
> If you do not have to parse JavaScript-generated elements and the target is
> straightforward, I'd use PyQuery. Otherwise, I'd use Selenium through
> Python. (It has bindings to your favorite language. Like I said, it's much
> used by web developers.) Even if you want continuous scraping of a large
> website -- as opposed to a single snapshot for a study -- you can run
> Selenium on a headless browser (see: PhantomJS <http://phantomjs.org/>) and
> not be bothered by the persistently open window.
>
> Good luck,
> John B Nelson
>
>
> On Thu, May 15, 2014 at 4:26 PM, Kim Hill <[log in to unmask]> wrote:
>
>> Colleagues,
>>
>>
>>
>> Might one, two, or more of you have advice on software for web scraping?
>> My particular and immediate interest is in one-time scraping of modest
>> amounts of text data from government websites in the United States. Of
>> course, if I like this hammer...
>>
>>
>>
>> I have found only brief anecdotal comments about how to do it in political
>> science methods papers, and even in papers that obviously have carried out
>> such a task.
>>
>>
>>
>> Much software of this ilk is touted on the web, but rare is the online
>> paper that reviews more than one piece of software. And the most useful,
>> if still limited, papers of the latter sort that I have found are at:
>>
>>
>>
>> http://lethain.com/an-introduction-to-compassionate-screenscraping/
>>
>>
>>
>> and
>>
>>
>>
>>
>> http://www.notprovided.eu/six-tools-web-scraping-use-data-journalism-creating-insightful-content/
>>
>>
>>
>> From having read those sources and the text on a number of different
>> software options, I am led to believe that such software varies by how
>> "user friendly" it is, how technically sophisticated it can be, and how
>> much user-specified programming is necessary to employ a given software
>> package. Thus I, and perhaps other readers of this list, might appreciate
>> advice from those of you who have used such software about several
>> options across the range of those attributes.
>>
>>
>>
>> Kim
>>
>>
>>
>> Kim Quaile Hill
>>
>> Cullen-McFadden Professor of Political Science,
>>
>> Presidential Professor for Teaching Excellence, and
>>
>> Eppright Professor in Undergraduate Teaching Excellence
>>
>> Department of Political Science
>>
>> 4348 TAMU
>>
>> Texas A&M University
>>
>> College Station, TX 77843-4348
>>
>> ph. 979/845-8235
>>
>> fax 979/847-8924
>>
>> e-mail: [log in to unmask]
>>
>>
>>
>>
>> **********************************************************
>>              Political Methodology E-Mail List
>>    Editors: Ethan Porter        <[log in to unmask]>
>>             Gregory Whitfield   <[log in to unmask]>
>> **********************************************************
>>         Send messages to [log in to unmask]
>>   To join the list, cancel your subscription, or modify
>>            your subscription settings visit:
>>
>>           http://polmeth.wustl.edu/polmeth.php
>>
>> **********************************************************
>>
>
