POLMETH Archives

Political Methodology Society

POLMETH@LISTSERV.WUSTL.EDU

Subject:
From:
Tom Nicholls <[log in to unmask]>
Reply To:
Political Methodology Society <[log in to unmask]>
Date:
Mon, 19 May 2014 10:34:16 +0100
It depends quite a lot on the scale and completeness that you're aiming
for. For very precise crawls of relatively small sites, a browser
automation approach based on Selenium or the like makes a lot of sense.
If your ambitions are broader, then something on a more industrial scale
may be appropriate.

The Oxford Internet Institute is currently conducting a crawl of the
whole of .gov.uk, using heritrix, the Internet Archive's web crawler. It
handles robots.txt files, redirects, extraction of onward links from
CSS, Flash, HTML etc., defaults to being polite, is scalable, and can
crawl very fast (a gigabyte of pages in a few minutes on our machine
when there are plenty of different sites to crawl from). It is highly
configurable.
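The onward-link extraction mentioned above is the core step any crawler
performs on each fetched page. As a toy illustration only (this is not
heritrix, and the example page and URLs are made up), it can be sketched
with the Python standard library:

```python
# Toy sketch of onward-link extraction, the step a crawler such as
# heritrix performs on every fetched page. Stdlib only; the example
# page and URLs below are invented for illustration.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href=...> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

page = ('<html><body><a href="/foi/requests">FOI</a> '
        '<a href="http://example.gov.uk/">Home</a></body></html>')
parser = LinkExtractor("http://www.gov.uk/guidance")
parser.feed(page)
print(parser.links)
# ['http://www.gov.uk/foi/requests', 'http://example.gov.uk/']
```

A production crawler adds the pieces this sketch omits: deduplication of
already-seen URLs, politeness delays per host, and link extraction from
non-HTML formats.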

Potential issues: by default it saves its pages in .warc files, which
are an archival standard (though it's possible to configure it to save
in directories à la wget, I think, and there are lots of tools available
to unpack and process WARC files, such as
https://github.com/internetarchive/warctools). Although heritrix has
pretty good detection algorithms for crawler traps/infinite loops on the
web, it will (like other crawlers) still get hung up sometimes, so it
needs monitoring once a day or so. Finally, heritrix is optimised for
get-everything-once archive crawls. It is possible to make it fetch
repeatedly over a period of time, but it takes a bit of setting up.
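To give a feel for what tools like warctools are unpacking: a WARC
record is a version line, a set of named headers, a blank line, and a
payload. The sketch below parses one hand-built record with the standard
library (the record contents are invented; real crawl output is gzipped
and concatenated, which is exactly what the dedicated libraries handle):

```python
# Minimal sketch of the WARC record layout that heritrix writes.
# One hand-built example record, not real crawl output; a real WARC
# file concatenates many such records, usually gzip-compressed.
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://www.gov.uk/\r\n"
    b"Content-Length: 12\r\n"
    b"\r\n"
    b"Hello, WARC!"
)

def parse_warc_record(raw):
    """Split one WARC record into (version, headers dict, payload bytes)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]                                    # e.g. "WARC/1.0"
    headers = dict(line.split(": ", 1) for line in lines[1:])
    length = int(headers["Content-Length"])               # payload size in bytes
    return version, headers, rest[:length]

version, headers, payload = parse_warc_record(record)
print(headers["WARC-Target-URI"], payload.decode())
# http://www.gov.uk/ Hello, WARC!
```

In practice you'd use one of the existing WARC libraries rather than
parsing by hand; the point is only that the format is simple and the
crawled pages are recoverable from it.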

In general, we are interested in both the link graph and the textual
content of the web pages (to add categorical data using automated
classification), so we've written a setup to process PDFs/word files
etc. through a text extraction library automatically:
https://github.com/pmyteh/warctika.
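The dispatch step in a post-processing setup like that can be sketched
as routing each document to an extractor by MIME type. The extractor
functions below are stand-ins I've invented for illustration (a real
pipeline would delegate PDF/Word handling to a text extraction library
such as Apache Tika):

```python
# Hedged sketch of MIME-type dispatch in a text-extraction pipeline.
# The extractors are illustrative stand-ins, not the warctika code.
import re

def extract_html(body):
    # Stand-in: crude tag-stripping; a real pipeline parses properly
    return re.sub(r"<[^>]+>", "", body).strip()

def extract_plain(body):
    return body.strip()

EXTRACTORS = {
    "text/html": extract_html,
    "text/plain": extract_plain,
    # "application/pdf": would delegate to e.g. Apache Tika
}

def extract_text(mime_type, body):
    """Return extracted text, or None for unsupported types."""
    extractor = EXTRACTORS.get(mime_type)
    return extractor(body) if extractor else None

print(extract_text("text/html", "<p>Freedom of Information</p>"))
# Freedom of Information
```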

I've been thinking about writing some of this stuff up for publication;
if people think it might be useful then do let me know. In the meantime,
do drop me a line if you have any questions.

Tom
-- 
Tom Nicholls
Oxford Internet Institute, University of Oxford
[log in to unmask]

On 15/05/14 21:26, Kim Hill wrote:
> Colleagues,
> 
> Might one, two, or more of you have advice on software for web scraping?  My
> particular and immediate interest is in one-time scraping of modest amounts
> of text data from government websites in the United States.  Of course, if I
> like this hammer..
> 
> I have found only brief anecdotal comments about how to do it in political
> science methods papers, and even in papers that obviously have carried out
> such a task.  
> 
> Much software of this ilk is touted on the web, but rare is the online paper
> that reviews more than one piece of software.  And the most useful, if still
> limited, papers of the latter sort that I have found are at:
> 
> http://lethain.com/an-introduction-to-compassionate-screenscraping/
> 
> and
> 
> http://www.notprovided.eu/six-tools-web-scraping-use-data-journalism-creating-insightful-content/
> 
> From having read those sources and the text on a number of different
> software options, I am led to believe that such software varies by how "user
> friendly" it is, how technically sophisticated it can be, and how much
> user-specified programming is necessary to employ a given software package.
> Thus I, and perhaps other readers of this list might appreciate advice from
> those of you who have used such software about several options across the
> range of those attributes.

**********************************************************
             Political Methodology E-Mail List
   Editors: Ethan Porter        <[log in to unmask]>
            Gregory Whitfield   <[log in to unmask]>
**********************************************************
        Send messages to [log in to unmask]
  To join the list, cancel your subscription, or modify
           your subscription settings visit:

          http://polmeth.wustl.edu/polmeth.php

**********************************************************
