I'd put in a vote for OutWit Hub. I've found it to be surprisingly versatile and very easy to use—so much so that I get a lot of mileage out of it in my undergraduate methods class.

> On May 19, 2014, at 1:22 AM, "Andrew Heiss" <[log in to unmask]> wrote:
>
> There's also a new tool for Scrapy (a web crawler built in Python: http://scrapy.org/) that lets you graphically train the crawler on which sections of a page to select and save, which can be a lot easier than walking through the DOM. Check out Portia at http://blog.scrapinghub.com/2014/04/01/announcing-portia/ (there's a video demonstration; I've also tried it on a small site just for testing purposes and it worked well).
>
> Andrew Heiss
> Ph.D. Student, Public Policy and Political Science
> Sanford School of Public Policy | Duke University
> [log in to unmask] | www.andrewheiss.com
>
> On May 18, 2014 at 20:53:20, Thomas J. Leeper ([log in to unmask]) wrote:
>
> Just to chip in, there are analogous tools for R. The readily
> available packages RCurl and XML provide pretty much everything you'll
> need to scrape static webpages. There are various tutorials online,
> but of course anything you code up will be highly application-specific,
> depending on the structure of the data you're trying to access.
>
> For dynamic pages, bindings for Selenium are provided via RSelenium:
> https://github.com/johndharrison/RSelenium. Conveniently, there is a
> webinar on the package this week. Here's a link:
> http://www.r-bloggers.com/the-rselenium-r-package-free-webinar/
>
> Best,
> -Thomas
>
> Thomas J. Leeper
> http://www.thomasleeper.com
>
>> On Sun, May 18, 2014 at 7:01 PM, John Nelson <[log in to unmask]> wrote:
>> The one tutorial you linked used BeautifulSoup and Python. An alternative
>> -- one that I favor -- is PyQuery <http://pythonhosted.org/pyquery/>. It's
>> increasing in popularity because it's modeled after jQuery, which has a
>> really nice (read: simple) API for parsing HTML documents.
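[Ed.: For readers new to static scraping, here is a minimal sketch of the fetch-and-parse pattern the thread describes. It uses only Python's standard-library html.parser so it runs with no installation; PyQuery and BeautifulSoup wrap the same idea in a friendlier CSS-selector API. The HTML snippet and the /press/ paths are invented for illustration.]

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every anchor tag in a static HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Stand-in for a page you would fetch with urllib.request.urlopen(url).read()
html = """
<html><body>
  <h1>Press Releases</h1>
  <a href="/press/2014-05-01.html">May 1 release</a>
  <a href="/press/2014-05-08.html">May 8 release</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # the two /press/... paths, in document order
```

The same extraction in PyQuery would be roughly `[a.attrib["href"] for a in pq(html)("a")]`, which is why its jQuery-style API is praised below.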
>>
>> However, I'd suggest you look at Selenium <http://docs.seleniumhq.org/> and
>> the Python bindings for Selenium <https://pypi.python.org/pypi/selenium>.
>> Selenium is heavily used by professional web developers as part of their
>> unit-testing code. (That's what it was designed for.) It's also
>> exceptionally good for academic scraping. Basically, you can mechanize a
>> browser instance. As such, you're working with the actual DOM (rather than
>> parsing HTML), so for JavaScript-heavy sites, what you see is what you get
>> (to parse). Also -- and, I think, more importantly -- it fails obviously.
>> That is, if your scraper breaks on some page, the browser window is still
>> open: you can manually inspect the source or, occasionally, read the
>> message "you've been banned."
>>
>> If you do not have to parse JavaScript-generated elements and the target is
>> straightforward, I'd use PyQuery. Otherwise, I'd use Selenium through
>> Python. (It has bindings for your favorite language; like I said, it's
>> widely used by web developers.) Even if you want continuous scraping of a
>> large website -- as opposed to a single snapshot for a study -- you can run
>> Selenium on a headless browser (see PhantomJS <http://phantomjs.org/>) and
>> not be bothered by the persistently open window.
>>
>> Good luck,
>> John B Nelson
>>
>>
>>> On Thu, May 15, 2014 at 4:26 PM, Kim Hill <[log in to unmask]> wrote:
>>>
>>> Colleagues,
>>>
>>> Might one, two, or more of you have advice on software for web scraping?
>>> My particular and immediate interest is in one-time scraping of modest
>>> amounts of text data from government websites in the United States. Of
>>> course, if I like this hammer...
>>>
>>> I have found only brief anecdotal comments about how to do it in
>>> political science methods papers, and even in papers that obviously have
>>> carried out such a task.
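[Ed.: John's Selenium workflow can be sketched in a few lines of Python. This is a hedged sketch, not a drop-in script: it assumes the selenium package is installed (`pip install selenium`) with a browser driver on your PATH, and the URL and CSS selector are invented stand-ins for your actual target.]

```python
# Mechanize a real browser and query the rendered DOM.
from selenium import webdriver

driver = webdriver.Firefox()       # visible browser: if the scraper breaks,
                                   # the window stays open for inspection
# driver = webdriver.PhantomJS()   # headless alternative for long-running jobs
try:
    # Hypothetical target page
    driver.get("http://www.example.gov/press-releases")
    # Selectors run against the rendered DOM, so JavaScript-generated
    # elements are available here even if absent from the raw page source.
    for element in driver.find_elements_by_css_selector("div.release h2"):
        print(element.text)
finally:
    driver.quit()                  # always release the browser instance
```

The try/finally matters for the continuous-scraping case John mentions: a headless PhantomJS instance that is never quit will quietly accumulate in the background.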
>>>
>>> Much software of this ilk is touted on the web, but rare is the online
>>> paper that reviews more than one piece of software. The most useful, if
>>> still limited, papers of the latter sort that I have found are at:
>>>
>>> http://lethain.com/an-introduction-to-compassionate-screenscraping/
>>>
>>> and
>>>
>>> http://www.notprovided.eu/six-tools-web-scraping-use-data-journalism-creating-insightful-content/
>>>
>>> From having read those sources and the text on a number of different
>>> software options, I am led to believe that such software varies in how
>>> "user friendly" it is, how technically sophisticated it can be, and how
>>> much user-specified programming is necessary to employ a given package.
>>> Thus I, and perhaps other readers of this list, might appreciate advice
>>> from those of you who have used such software about several options
>>> across the range of those attributes.
>>>
>>> Kim
>>>
>>> Kim Quaile Hill
>>> Cullen-McFadden Professor of Political Science,
>>> Presidential Professor for Teaching Excellence, and
>>> Eppright Professor in Undergraduate Teaching Excellence
>>> Department of Political Science
>>> 4348 TAMU
>>> Texas A&M University
>>> College Station, TX 77843-4348
>>> ph.
979/845-8235
>>> fax 979/847-8924
>>> e-mail: [log in to unmask]
>>>
>>> **********************************************************
>>> Political Methodology E-Mail List
>>> Editors: Ethan Porter <[log in to unmask]>
>>> Gregory Whitfield <[log in to unmask]>
>>> **********************************************************
>>> Send messages to [log in to unmask]
>>> To join the list, cancel your subscription, or modify
>>> your subscription settings visit:
>>>
>>> http://polmeth.wustl.edu/polmeth.php
>>>
>>> **********************************************************