POLMETH Archives

Political Methodology Society

POLMETH@LISTSERV.WUSTL.EDU

Subject:
From:
Simon Munzert <[log in to unmask]>
Reply To:
Political Methodology Society <[log in to unmask]>
Date:
Mon, 19 May 2014 18:58:09 +0200
I'd like to take the opportunity to point to a book that colleagues of
mine and I have just written. It is forthcoming with Wiley in a few months:
http://eu.wiley.com/WileyCDA/WileyTitle/productCd-111883481X.html
In line with Thomas' suggestions, we take an almost purely R-based
approach to web scraping, because we find it very attractive to use R
for the entire workflow from start to finish (data collection,
analysis, publication). We introduce R-to-Selenium bindings as well. If
anybody is interested in a look at the manuscript, I'll be pleased to
provide a preliminary version.

With regard to your problem, if the text data are embedded in static
HTML, I would pick an XPath-based extraction approach using R (with the
XML package). With XPath you can easily retrieve content from specific
nodes in the HTML code, and R helps you document every single step and
construct reproducible code. If you do not want to put any effort into
learning XPath, the needed XPath expressions can easily be constructed
with the SelectorGadget tool, available at http://selectorgadget.com/.
Finally, the RCurl package helps you stay identifiable on the Web, and
cleaning of the raw text data is best done with the convenient
string manipulation functions from the stringr package. I'll be happy to
help if you have any further questions.
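
As a rough illustration of the XPath idea described above (select
specific nodes, then clean the text), here is a minimal sketch. It is
written with Python's standard library rather than R's XML and stringr
packages, so it is a stand-in for the approach, not the R code itself.
The page fragment, the class name "press-release", and the helper
extract_paragraphs are all made up for illustration. Note also that
ElementTree supports only a subset of XPath and needs well-formed
markup; real pages often require a more lenient HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a downloaded page.
PAGE = """
<html>
  <body>
    <div class="nav"><p>Home | About</p></div>
    <div class="press-release">
      <p>  The agency   announced
         a new rule today. </p>
      <p>Comments are due in <b>60</b> days.</p>
    </div>
  </body>
</html>
"""


def extract_paragraphs(page, xpath):
    """Return the cleaned text of every node matched by the XPath expression."""
    root = ET.fromstring(page)
    texts = []
    for node in root.findall(xpath):
        # Join the text of the node and of all its child elements.
        raw = "".join(node.itertext())
        # Collapse runs of whitespace and trim the ends.
        texts.append(re.sub(r"\s+", " ", raw).strip())
    return texts


# Select only the paragraphs inside the (assumed) press-release container.
paragraphs = extract_paragraphs(PAGE, ".//div[@class='press-release']/p")
for p in paragraphs:
    print(p)
```

The navigation paragraph is skipped because the expression matches only
paragraphs under the press-release container. Keeping the expression
and the cleaning step in one small function keeps each step documented
and repeatable.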

Best,
Simon


On 19.05.2014 at 14:36, Braumoeller, Bear wrote:
> I'd put in a vote for Outwit Hub. I've found it to be surprisingly versatile and very easy to use—so much so that I get a lot of mileage out of it in my undergraduate methods class.
>
>> On May 19, 2014, at 1:22 AM, "Andrew Heiss" <[log in to unmask]> wrote:
>>
>> There’s also a new tool for Scrapy (a web crawler built in Python: http://scrapy.org/) that lets you graphically train the crawler which sections of a page to select and save, which can be a lot easier than walking through the DOM. Check out Portia at http://blog.scrapinghub.com/2014/04/01/announcing-portia/ (there’s a video demonstration; I’ve also tried it on a small site just for testing purposes and it worked well).
>>
>> Andrew Heiss
>> Ph.D. Student, Public Policy and Political Science
>> Sanford School of Public Policy | Duke University
>> [log in to unmask] | www.andrewheiss.com
>>
>> On May 18, 2014 at 20:53:20 PM, Thomas J. Leeper ([log in to unmask]) wrote:
>>
>> Just to chip in, there are analogous tools for R. The readily
>> available packages RCurl and XML provide pretty much everything you'll
>> need to scrape static webpages. There are various tutorials online,
>> but of course anything you code up will be highly
>> application-specific, depending on the structure of the data you're
>> trying to access.
>>
>> For dynamic pages, bindings for Selenium are provided via RSelenium
>> https://github.com/johndharrison/RSelenium. Conveniently, there is a
>> webinar on the package this week. Here's a link:
>> http://www.r-bloggers.com/the-rselenium-r-package-free-webinar/
>>
>> Best,
>> -Thomas
>>
>> Thomas J. Leeper
>> http://www.thomasleeper.com
>>
>>
>>> On Sun, May 18, 2014 at 7:01 PM, John Nelson <[log in to unmask]> wrote:
>>> The one tutorial you linked used BeautifulSoup and Python. An alternative
>>> -- one that I favor -- is PyQuery <http://pythonhosted.org/pyquery/>. It's
>>> increasing in popularity because it's modeled after jQuery, which has a
>>> really nice (read: simple) API for parsing HTML documents.
>>>
>>> However, I'd suggest you look at Selenium <http://docs.seleniumhq.org/> and
>>> the Python bindings for Selenium <https://pypi.python.org/pypi/selenium>.
>>> Selenium is heavily used by professional web developers as part of their
>>> unit-testing code. (That's what it was designed for.) It's also
>>> exceptionally good for academic scraping. Basically, you can mechanize a
>>> browser instance. As such, you're working with the actual DOM (rather than
>>> parsing HTML), so for JavaScript-heavy sites, what-you-see-is-what-you-get
>>> (to parse). Also -- and, I think, more importantly -- it fails obviously.
>>> That is, if your scraper breaks on some page: the browser window is still
>>> open; you can manually inspect the source; or, occasionally, you can read
>>> the message "you've been banned."
>>>
>>> If you do not have to parse JavaScript-generated elements and the target is
>>> straightforward, I'd use PyQuery. Otherwise, I'd use Selenium through
>>> Python. (It has bindings to your favorite language. Like I said, it's much
>>> used by web developers.) Even if you want continuous scraping of a large
>>> website -- as opposed to a single snapshot for a study -- you can run
>>> Selenium on a headless browser (see: PhantomJS <http://phantomjs.org/>) and
>>> not be bothered by the persistently open window.
>>>
>>> Good luck,
>>> John B Nelson
>>>
>>>
>>>> On Thu, May 15, 2014 at 4:26 PM, Kim Hill <[log in to unmask]> wrote:
>>>>
>>>> Colleagues,
>>>>
>>>>
>>>>
>>>> Might one, two, or more of you have advice on software for web scraping?
>>>> My particular and immediate interest is in one-time scraping of modest
>>>> amounts of text data from government websites in the United States. Of
>>>> course, if I like this hammer..
>>>>
>>>>
>>>>
>>>> I have found only brief anecdotal comments about how to do it in political
>>>> science methods papers, and even in papers that obviously have carried out
>>>> such a task.
>>>>
>>>>
>>>>
>>>> Much software of this ilk is touted on the web, but rare is the online
>>>> paper that reviews more than one piece of software. And the most useful,
>>>> if still limited, papers of the latter sort that I have found are at:
>>>>
>>>>
>>>>
>>>> http://lethain.com/an-introduction-to-compassionate-screenscraping/
>>>>
>>>>
>>>>
>>>> and
>>>>
>>>>
>>>>
>>>>
>>>> http://www.notprovided.eu/six-tools-web-scraping-use-data-journalism-creating-insightful-content/
>>>>
>>>>
>>>>
>>>> From having read those sources and the text on a number of different
>>>> software options, I am led to believe that such software varies by how
>>>> "user friendly" it is, how technically sophisticated it can be, and how
>>>> much user-specified programming is necessary to employ a given software
>>>> package. Thus I, and perhaps other readers of this list, might appreciate
>>>> advice from those of you who have used such software about several
>>>> options across the range of those attributes.
>>>>
>>>>
>>>>
>>>> Kim
>>>>
>>>>
>>>>
>>>> Kim Quaile Hill
>>>>
>>>> Cullen-McFadden Professor of Political Science,
>>>>
>>>> Presidential Professor for Teaching Excellence, and
>>>>
>>>> Eppright Professor in Undergraduate Teaching Excellence
>>>>
>>>> Department of Political Science
>>>>
>>>> 4348 TAMU
>>>>
>>>> Texas A&M University
>>>>
>>>> College Station, TX 77843-4348
>>>>
>>>> ph. 979/845-8235
>>>>
>>>> fax 979/847-8924
>>>>
>>>> e-mail: [log in to unmask]
>>>>
>>>>
>>>>
>>>>
>>>> **********************************************************
>>>> Political Methodology E-Mail List
>>>> Editors: Ethan Porter <[log in to unmask]>
>>>> Gregory Whitfield <[log in to unmask]>
>>>> **********************************************************
>>>> Send messages to [log in to unmask]
>>>> To join the list, cancel your subscription, or modify
>>>> your subscription settings visit:
>>>>
>>>> http://polmeth.wustl.edu/polmeth.php
>>>>
>>>> **********************************************************


-- 
Simon Munzert
Research Assistant
University of Konstanz
Department of Politics and Public Administration
Chair for Survey Research
P.o. Box D85
78457 Konstanz

phone :: +49.(0)7531.883087
mail to :: [log in to unmask]
room :: 7.13 (Hochhaus Moltkestraße), d307 (University Campus)
