POLMETH Archives

Political Methodology Society

POLMETH@LISTSERV.WUSTL.EDU

Subject:
From: Andrew Heiss <[log in to unmask]>
Reply To: Political Methodology Society <[log in to unmask]>
Date: Sun, 18 May 2014 20:59:17 -0400
There’s also a new tool for Scrapy (a web crawler built in Python: http://scrapy.org/) that lets you graphically train the crawler on which sections of a page to select and save, which can be a lot easier than walking through the DOM. Check out Portia at http://blog.scrapinghub.com/2014/04/01/announcing-portia/ (there’s a video demonstration; I’ve also tried it on a small site, just for testing purposes, and it worked well).

Andrew Heiss  
Ph.D. Student, Public Policy and Political Science  
Sanford School of Public Policy | Duke University  
[log in to unmask] | www.andrewheiss.com

On May 18, 2014 at 20:53:20 PM, Thomas J. Leeper ([log in to unmask]) wrote:

Just to chip in, there are analogous tools for R. The readily  
available packages RCurl and XML provide pretty much everything you'll  
need to scrape static webpages. There are various tutorials online,  
but of course anything you code up will be highly application-specific,  
depending on the structure of the data you're trying to  
access.  

For dynamic pages, bindings for Selenium are provided via RSelenium  
https://github.com/johndharrison/RSelenium. Conveniently, there is a  
webinar on the package this week. Here's a link:  
http://www.r-bloggers.com/the-rselenium-r-package-free-webinar/  

Best,  
-Thomas  

Thomas J. Leeper  
http://www.thomasleeper.com  


On Sun, May 18, 2014 at 7:01 PM, John Nelson <[log in to unmask]> wrote:  
> The one tutorial you linked used BeautifulSoup and Python. An alternative  
> -- one that I favor -- is PyQuery <http://pythonhosted.org/pyquery/>. It's  
> increasing in popularity because it's modeled after jQuery, which has a  
> really nice (read: simple) API for parsing HTML documents.  
>  
> However, I'd suggest you look at Selenium <http://docs.seleniumhq.org/> and  
> the Python bindings for Selenium <https://pypi.python.org/pypi/selenium>.  
> Selenium is heavily used by professional web developers as part of their  
> unit-testing code. (That's what it was designed for.) It's also  
> exceptionally good for academic scraping. Basically, you can mechanize a  
> browser instance. As such, you're working with the actual DOM (rather than  
> parsing HTML), so for JavaScript-heavy sites, what you see is what you get  
> (to parse). Also -- and, I think, more importantly -- it fails obviously.  
> That is, if your scraper breaks on some page, the browser window is still  
> open, so you can manually inspect the source or, occasionally, read  
> the message "you've been banned."  
>  
> If you do not have to parse JavaScript-generated elements and the target is  
> straightforward, I'd use PyQuery. Otherwise, I'd use Selenium through  
> Python. (It has bindings to your favorite language. Like I said, it's widely  
> used by web developers.) Even if you want continuous scraping of a large  
> website -- as opposed to a single snapshot for a study -- you can run  
> Selenium on a headless browser (see: PhantomJS <http://phantomjs.org/>) and  
> not be bothered by the persistently open window.  
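
To make the distinction between parsing served HTML and working with the live DOM concrete, here is a minimal sketch. It uses Python's standard-library html.parser rather than PyQuery or BeautifulSoup, purely so that it runs without extra installs; the sample page and the names SAMPLE_PAGE, ListItemParser, and extract_items are hypothetical and invented for illustration. A static parser only sees the markup the server sent, so the list item that the script would add in a real browser never shows up.

```python
from html.parser import HTMLParser

# Hypothetical page: one <li> is in the served HTML, a second one would
# only exist after a browser runs the script.
SAMPLE_PAGE = """
<html><body>
<ul id="items">
  <li>Item in the served HTML</li>
</ul>
<script>
  var li = document.createElement("li");
  li.textContent = "Item added by JavaScript";
  document.getElementById("items").appendChild(li);
</script>
</body></html>
"""


class ListItemParser(HTMLParser):
    """Collect the text of every <li> element found in the markup."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        # Script contents also arrive here, but they are ignored because
        # the parser is not inside an <li> at that point.
        if self.in_li:
            self.items[-1] += data


def extract_items(html):
    """Return the stripped text of each <li> in the given HTML string."""
    parser = ListItemParser()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items]


print(extract_items(SAMPLE_PAGE))  # ['Item in the served HTML']
```

Only the served item is returned. Picking up the second item would need something that actually executes the script, which is the case for driving a real browser with Selenium.
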
>  
> Good luck,  
> John B Nelson  
>  
>  
> On Thu, May 15, 2014 at 4:26 PM, Kim Hill <[log in to unmask]> wrote:  
>  
>> Colleagues,  
>>  
>>  
>>  
>> Might one, two, or more of you have advice on software for web scraping? My  
>> particular and immediate interest is in one-time scraping of modest amounts  
>> of text data from government websites in the United States. Of course, if I  
>> like this hammer...  
>>  
>>  
>>  
>> I have found only brief anecdotal comments about how to do it in political  
>> science methods papers, and even in papers that obviously have carried out  
>> such a task.  
>>  
>>  
>>  
>> Much software of this ilk is touted on the web, but rare is the online paper  
>> that reviews more than one piece of software. And the most useful, if still  
>> limited, papers of the latter sort that I have found are at:  
>>  
>>  
>>  
>> http://lethain.com/an-introduction-to-compassionate-screenscraping/  
>>  
>>  
>>  
>> and  
>>  
>>  
>>  
>>  
>> http://www.notprovided.eu/six-tools-web-scraping-use-data-journalism-creating-insightful-content/  
>>  
>>  
>>  
>> From having read those sources and the text on a number of different  
>> software options, I am led to believe that such software varies by how  
>> "user friendly" it is, how technically sophisticated it can be, and how  
>> much user-specified programming is necessary to employ a given software  
>> package. Thus I, and perhaps other readers of this list, might appreciate  
>> advice from those of you who have used such software about several options  
>> across the range of those attributes.  
>>  
>>  
>>  
>> Kim  
>>  
>>  
>>  
>> Kim Quaile Hill  
>>  
>> Cullen-McFadden Professor of Political Science,  
>>  
>> Presidential Professor for Teaching Excellence, and  
>>  
>> Eppright Professor in Undergraduate Teaching Excellence  
>>  
>> Department of Political Science  
>>  
>> 4348 TAMU  
>>  
>> Texas A&M University  
>>  
>> College Station, TX 77843-4348  
>>  
>> ph. 979/845-8235  
>>  
>> fax 979/847-8924  
>>  
>> e-mail: [log in to unmask]  
>>  
>>  
>>  
>>  
>> **********************************************************  
>> Political Methodology E-Mail List  
>> Editors: Ethan Porter <[log in to unmask]>  
>> Gregory Whitfield <[log in to unmask]>  
>> **********************************************************  
>> Send messages to [log in to unmask]  
>> To join the list, cancel your subscription, or modify  
>> your subscription settings visit:  
>>  
>> http://polmeth.wustl.edu/polmeth.php  
>>  
>> **********************************************************  
>>  
>  
