POLMETH Archives

Political Methodology Society

POLMETH@LISTSERV.WUSTL.EDU

Subject:	Re: A QUERY FOR THE POLMETH LISTSERV: In search of advice about web scraping software
From:	"Braumoeller, Bear" <[log in to unmask]>
Reply To:	Political Methodology Society <[log in to unmask]>
Date:	Mon, 19 May 2014 12:36:12 +0000
I'd put in a vote for Outwit Hub. I've found it to be surprisingly versatile and very easy to use—so much so that I get a lot of mileage out of it in my undergraduate methods class.

> On May 19, 2014, at 1:22 AM, "Andrew Heiss" <[log in to unmask]> wrote:
> 
> There’s also a new tool for Scrapy (a web crawler built in Python: http://scrapy.org/) that lets you graphically train the crawler on which sections of a page to select and save, which can be a lot easier than walking through the DOM by hand. Check out Portia at http://blog.scrapinghub.com/2014/04/01/announcing-portia/ (there’s a video demonstration; I’ve also tried it on a small test site and it worked well).
> 
> Andrew Heiss  
> Ph.D. Student, Public Policy and Political Science  
> Sanford School of Public Policy | Duke University  
> [log in to unmask] | www.andrewheiss.com
> 
> On May 18, 2014 at 20:53:20 PM, Thomas J. Leeper ([log in to unmask]) wrote:
> 
> Just to chip in, there are analogous tools for R. The readily
> available packages RCurl and XML provide pretty much everything you'll
> need to scrape static webpages. There are various tutorials online,
> but of course anything you code up will be highly
> application-specific, depending on the structure of the data you're
> trying to access.
> 
> For dynamic pages, bindings for Selenium are provided via the RSelenium
> package (https://github.com/johndharrison/RSelenium). Conveniently, there is a
> webinar on the package this week. Here's a link:
> http://www.r-bloggers.com/the-rselenium-r-package-free-webinar/
> 
> Best,  
> -Thomas  
> 
> Thomas J. Leeper  
> http://www.thomasleeper.com  
> 
> 
>> On Sun, May 18, 2014 at 7:01 PM, John Nelson <[log in to unmask]> wrote:  
>> The one tutorial you linked used BeautifulSoup and Python. An alternative
>> -- one that I favor -- is PyQuery <http://pythonhosted.org/pyquery/>. It's
>> increasing in popularity because it's modeled after jQuery, which has a
>> really nice (read: simple) API for parsing HTML documents.
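
To make the static-page case concrete, here is a minimal sketch of pulling links out of HTML that is already in hand. It deliberately uses Python's standard-library html.parser rather than PyQuery or BeautifulSoup, so it runs without installing anything; with PyQuery the same selection would be written as a jQuery-style selector instead of a parser class. The sample HTML, the LinkCollector class, and the extract_links function are all made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical stand-in for a downloaded government web page.
SAMPLE_HTML = """
<html><body>
  <ul id="reports">
    <li><a href="/reports/2013.html">Annual Report 2013</a></li>
    <li><a href="/reports/2014.html">Annual Report 2014</a></li>
  </ul>
</body></html>
"""


def extract_links(html):
    """Return a list of (href, text) tuples found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_links(SAMPLE_HTML)
for href, text in links:
    print(href, text)
```

This only works when the content is present in the HTML the server sends; anything filled in later by JavaScript will not be seen by a parser like this.
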
>> 
>> However, I'd suggest you look at Selenium <http://docs.seleniumhq.org/> and
>> the Python bindings for Selenium <https://pypi.python.org/pypi/selenium>.
>> Selenium is heavily used by professional web developers as part of their
>> automated testing code. (That's what it was designed for.) It's also
>> exceptionally good for academic scraping. Basically, you can mechanize a
>> browser instance. As such, you're working with the actual DOM (rather than
>> parsing raw HTML), so for JavaScript-heavy sites, what you see is what you
>> get (to parse). Also -- and, I think, more importantly -- it fails obviously.
>> That is, if your scraper breaks on some page, the browser window is still
>> open, you can manually inspect the source, or, occasionally, you can read
>> the message "you've been banned."
>> 
>> If you do not have to parse JavaScript-generated elements and the target is
>> straightforward, I'd use PyQuery. Otherwise, I'd use Selenium through
>> Python. (Selenium has bindings for several other languages as well; as I said,
>> it's widely used by web developers.) Even if you want continuous scraping of a
>> large website -- as opposed to a single snapshot for a study -- you can run
>> Selenium on a headless browser (see: PhantomJS <http://phantomjs.org/>) and
>> not be bothered by the persistently open window.
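
For the dynamic-page case, a rough sketch of the Selenium route might look like the following. It assumes the third-party selenium package (version 4 style API) and a local Firefox installation are available, and it uses headless Firefox as a stand-in for PhantomJS, which is a substitution on my part. The function names, the example URL in the usage note, and the list of "blocked" phrases are hypothetical; real sites word their block pages differently.

```python
# Phrases that might indicate the site has blocked the scraper.
# This list is an illustrative guess, not taken from any particular site.
BLOCK_MARKERS = ("you've been banned", "access denied", "too many requests")


def looks_blocked(html, markers=BLOCK_MARKERS):
    """Return True if the page source contains one of the block phrases."""
    # Lowercase and normalize curly apostrophes before matching.
    text = html.lower().replace("\u2019", "'")
    return any(marker in text for marker in markers)


def fetch_rendered_html(url, timeout=30):
    """Load a page in headless Firefox and return the DOM as HTML.

    Requires the third-party `selenium` package and Firefox; the imports
    are kept inside the function so the helper above works without them.
    """
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # no visible browser window
    driver = webdriver.Firefox(options=options)
    try:
        driver.set_page_load_timeout(timeout)
        driver.get(url)
        # page_source reflects the DOM after the browser has loaded the page.
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

A call such as fetch_rendered_html("http://example.gov/reports") followed by looks_blocked(...) on the result would then be one way to notice a ban programmatically when no browser window is open to show it. Dropping the "-headless" argument brings the visible window back for debugging.
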
>> 
>> Good luck,  
>> John B Nelson  
>> 
>> 
>>> On Thu, May 15, 2014 at 4:26 PM, Kim Hill <[log in to unmask]> wrote:  
>>> 
>>> Colleagues,  
>>> 
>>> Might one, two, or more of you have advice on software for web scraping?
>>> My particular and immediate interest is in one-time scraping of modest
>>> amounts of text data from government websites in the United States. Of
>>> course, if I like this hammer...
>>> 
>>> I have found only brief anecdotal comments about how to do it in political  
>>> science methods papers, and even in papers that obviously have carried out  
>>> such a task.  
>>> 
>>> Much software of this ilk is touted on the web, but rare is the online
>>> paper that reviews more than one piece of software. And the most useful,
>>> if still limited, papers of the latter sort that I have found are at:
>>> 
>>> http://lethain.com/an-introduction-to-compassionate-screenscraping/
>>> 
>>> and
>>> 
>>> http://www.notprovided.eu/six-tools-web-scraping-use-data-journalism-creating-insightful-content/
>>> 
>>> From having read those sources and the text on a number of different
>>> software options, I am led to believe that such software varies by how
>>> "user friendly" it is, how technically sophisticated it can be, and how
>>> much user-specified programming is necessary to employ a given software
>>> package. Thus I, and perhaps other readers of this list, might appreciate
>>> advice from those of you who have used such software about several options
>>> across the range of those attributes.
>>> 
>>> Kim
>>> 
>>> Kim Quaile Hill
>>> Cullen-McFadden Professor of Political Science,
>>> Presidential Professor for Teaching Excellence, and
>>> Eppright Professor in Undergraduate Teaching Excellence
>>> Department of Political Science
>>> 4348 TAMU
>>> Texas A&M University
>>> College Station, TX 77843-4348
>>> ph. 979/845-8235
>>> fax 979/847-8924
>>> e-mail: [log in to unmask]
>>> 
>>> **********************************************************  
>>> Political Methodology E-Mail List  
>>> Editors: Ethan Porter <[log in to unmask]>  
>>> Gregory Whitfield <[log in to unmask]>  
>>> **********************************************************  
>>> Send messages to [log in to unmask]  
>>> To join the list, cancel your subscription, or modify  
>>> your subscription settings visit:  
>>> 
>>> http://polmeth.wustl.edu/polmeth.php  
>>> 
>>> **********************************************************  