I'd put in a vote for Outwit Hub. I've found it to be surprisingly versatile and very easy to use—so much so that I get a lot of mileage out of it in my undergraduate methods class.
> On May 19, 2014, at 1:22 AM, "Andrew Heiss" <[log in to unmask]> wrote:
>
> There’s also a new tool for Scrapy (a web crawler built in Python: http://scrapy.org/) that lets you graphically train the crawler which sections of a page to select and save, which can be a lot easier than walking through the DOM. Check out Portia at http://blog.scrapinghub.com/2014/04/01/announcing-portia/ (there’s a video demonstration; I’ve also tried it on a small site just for testing purposes and it worked well).
>
> Andrew Heiss
> Ph.D. Student, Public Policy and Political Science
> Sanford School of Public Policy | Duke University
> [log in to unmask] | www.andrewheiss.com
>
> On May 18, 2014 at 20:53:20 PM, Thomas J. Leeper ([log in to unmask]) wrote:
>
> Just to chip in, there are analogous tools for R. The readily
> available packages RCurl and XML provide pretty much everything you'll
> need to scrape static webpages. There are various tutorials online,
> but of course anything you'll code up will be highly application
> specific depending on the structure of the data you're trying to
> access.
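For comparison, the same parse step for a static page can be sketched in Python with only the standard library. This is a minimal illustration, not a recommendation of a particular package: the sample HTML, the `press-release` class name, and the helper names `ClassTextExtractor` and `extract_text` are all made up for the example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0        # greater than 0 while inside a matching element
        self.results = []     # one cleaned-up string per matching element
        self._buffer = []     # text fragments of the element being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1   # nested tag inside a matching element
        elif self.target_class in classes:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Join the fragments and collapse runs of whitespace.
            self.results.append(" ".join(" ".join(self._buffer).split()))

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


def extract_text(html, target_class):
    """Return the text of each element whose class list includes target_class."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical page content standing in for a downloaded government web page.
SAMPLE_HTML = """
<html><body>
  <h1>Agency news</h1>
  <div class="press-release"><h2>Budget update</h2><p>Funding rose in 2014.</p></div>
  <div class="sidebar">Unrelated links</div>
  <div class="press-release"><p>New <b>rule</b> announced.</p></div>
</body></html>
"""

releases = extract_text(SAMPLE_HTML, "press-release")
print(releases)
```

For a live page, the string passed to `extract_text` would come from a download step such as `urllib.request.urlopen(url).read().decode("utf-8")`. As noted above, which tags or classes to target depends entirely on how the page in question is structured.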
>
> For dynamic pages, bindings for Selenium are provided via RSelenium
> https://github.com/johndharrison/RSelenium. Conveniently, there is a
> webinar on the package this week. Here's a link:
> http://www.r-bloggers.com/the-rselenium-r-package-free-webinar/
>
> Best,
> -Thomas
>
> Thomas J. Leeper
> http://www.thomasleeper.com
>
>
>> On Sun, May 18, 2014 at 7:01 PM, John Nelson <[log in to unmask]> wrote:
>> The one tutorial you linked used BeautifulSoup and Python. An alternative
>> -- one that I favor -- is PyQuery <http://pythonhosted.org/pyquery/>. It's
>> increasing in popularity because it's modeled after jQuery, which has a
>> really nice (read: simple) API for parsing HTML documents.
>>
>> However, I'd suggest you look at Selenium <http://docs.seleniumhq.org/> and
>> the Python bindings for Selenium <https://pypi.python.org/pypi/selenium>.
>> Selenium is heavily used by professional web developers as part of their
>> unit-testing code. (That's what it was designed for.) It's also
>> exceptionally good for academic scraping. Basically, you can mechanize a
>> browser instance. As such, you're working with the actual DOM (rather than
>> parsing HTML), so for JavaScript-heavy sites, what-you-see-is-what-you-get
>> (to parse). Also -- and, I think, more importantly -- it fails obviously.
>> That is, if your scraper breaks on some page: the browser window is still
>> open; you can manually inspect the source; or, occasionally, you can read
>> the message "you've been banned."
>>
>> If you do not have to parse JavaScript-generated elements and the target is
>> straightforward, I'd use PyQuery. Otherwise, I'd use Selenium through
>> Python. (It has bindings to your favorite language. Like I said, it's much
>> used by web developers.) Even if you want continuous scraping of a large
>> website -- as opposed to a single snapshot for a study -- you can run
>> Selenium on a headless browser (see: PhantomJS <http://phantomjs.org/>) and
>> not be bothered by the persistently open window.
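A minimal sketch of the Selenium route described above, using the Python bindings. It assumes Selenium 4 plus Firefox and its driver are installed. It uses Firefox's headless mode rather than PhantomJS, on the understanding that recent Selenium releases no longer ship a PhantomJS driver. The function name `scrape_rendered_text` and its parameters are hypothetical.

```python
def scrape_rendered_text(url, css_selector):
    """Load url in a headless Firefox and return the text of matching elements."""
    # Imported inside the function so this file still loads on a machine
    # where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options

    options = Options()
    # Run without a visible window; remove this line to watch the browser
    # and inspect the page by hand when something goes wrong.
    options.add_argument("-headless")

    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # The query runs against the DOM as the browser currently has it,
        # so elements already generated by JavaScript are included.
        elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
        return [element.text for element in elements]
    finally:
        # Always close the browser, even if the page or selector fails.
        driver.quit()
```

A call would look like `scrape_rendered_text(some_url, "div.press-release")`, where both arguments depend on the site being scraped. Pages that fill in content some time after loading may also need an explicit wait before the elements are queried; that step is left out here.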
>>
>> Good luck,
>> John B Nelson
>>
>>
>>> On Thu, May 15, 2014 at 4:26 PM, Kim Hill <[log in to unmask]> wrote:
>>>
>>> Colleagues,
>>>
>>>
>>>
>>> Might one, two, or more of you have advice on software for web scraping?
>>> My particular and immediate interest is in one-time scraping of modest
>>> amounts of text data from government websites in the United States. Of
>>> course, if I like this hammer...
>>>
>>>
>>>
>>> I have found only brief anecdotal comments about how to do it in political
>>> science methods papers, and even in papers that obviously have carried out
>>> such a task.
>>>
>>>
>>>
>>> Much software of this ilk is touted on the web, but rare is the online
>>> paper that reviews more than one piece of software. And the most useful,
>>> if still limited, papers of the latter sort that I have found are at:
>>>
>>>
>>>
>>> http://lethain.com/an-introduction-to-compassionate-screenscraping/
>>>
>>>
>>>
>>> and
>>>
>>>
>>>
>>>
>>> http://www.notprovided.eu/six-tools-web-scraping-use-data-journalism-creating-insightful-content/
>>>
>>>
>>>
>>> From having read those sources and the text on a number of different
>>> software options, I am led to believe that such software varies by how
>>> "user friendly" it is, how technically sophisticated it can be, and how
>>> much user-specified programming is necessary to employ a given software
>>> package. Thus I, and perhaps other readers of this list, might appreciate
>>> advice from those of you who have used such software about several
>>> options across the range of those attributes.
>>>
>>>
>>>
>>> Kim
>>>
>>>
>>>
>>> Kim Quaile Hill
>>>
>>> Cullen-McFadden Professor of Political Science,
>>>
>>> Presidential Professor for Teaching Excellence, and
>>>
>>> Eppright Professor in Undergraduate Teaching Excellence
>>>
>>> Department of Political Science
>>>
>>> 4348 TAMU
>>>
>>> Texas A&M University
>>>
>>> College Station, TX 77843-4348
>>>
>>> ph. 979/845-8235
>>>
>>> fax 979/847-8924
>>>
>>> e-mail: [log in to unmask]
>>>
>>>
>>>
>>>
>>> **********************************************************
>>> Political Methodology E-Mail List
>>> Editors: Ethan Porter <[log in to unmask]>
>>> Gregory Whitfield <[log in to unmask]>
>>> **********************************************************
>>> Send messages to [log in to unmask]
>>> To join the list, cancel your subscription, or modify
>>> your subscription settings visit:
>>>
>>> http://polmeth.wustl.edu/polmeth.php
>>>
>>> **********************************************************