POLMETH Archives

Political Methodology Society

POLMETH@LISTSERV.WUSTL.EDU

Subject:
From:
Mark Manger <[log in to unmask]>
Reply To:
Political Methodology Society <[log in to unmask]>
Date:
Fri, 19 Apr 2013 13:55:35 +0000
I have on occasion worked with more than a million observations, although (caveat) not directly in the analysis, only in data preparation. Neither R nor Stata really cuts it at that scale. But SQL, which Manoel mentioned, really isn't hard at all (in fact much easier than any real programming language, including R), and for many people it is all they ever need for preprocessing data. sqldf is a useful R package for that: it lets you run SQL queries against R data frames.
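
As a rough illustration of the kind of SQL preprocessing meant here, the sketch below collapses event-level rows into country-year totals. It uses Python's standard-library sqlite3 module rather than sqldf, and the table name `events`, its columns, and the toy rows are all hypothetical, so treat it as a sketch under those assumptions rather than a recipe.

```python
import sqlite3


def aggregate_country_year(rows):
    """Collapse event-level rows to one row per country-year using SQL.

    `rows` is an iterable of (country, year, value) tuples; this layout
    is a hypothetical example, not taken from any real dataset.
    """
    # In-memory database for the demo; pass a file path instead when the
    # data should live on disk rather than in RAM.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (country TEXT, year INTEGER, value REAL)")
    con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    result = con.execute(
        """
        SELECT country, year, COUNT(*) AS n_events, SUM(value) AS total_value
        FROM events
        GROUP BY country, year
        ORDER BY country, year
        """
    ).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    toy = [("CAN", 2010, 1.0), ("CAN", 2010, 2.5), ("KOR", 2011, 4.0)]
    for row in aggregate_country_year(toy):
        print(row)
```

The GROUP BY query is the whole preprocessing step; the database engine does the aggregation, and only the much smaller collapsed table needs to go to the statistics package.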

--

Mark S. Manger
Assistant Professor
Munk School of Global Affairs | University of Toronto
Observatory Site | 315 Bloor Street West | Room 212
Toronto, ON   M5S 0A7
Phone: 416-946-8927 | Fax: 416-946-8877
[log in to unmask]
www.munkschool.utoronto.ca/mga

JOIN THE GLOBAL CONVERSATION





On 2013-04-19, at 6:02 AM, Richard Sherman <[log in to unmask]> wrote:

Manoel,

I agree with you on most points.

I think of "big" as more than a million observations. And yes, usually I have far, far, far fewer than that.

It's just that big data is the subject here, and as you (and I) say, R is not the best tool for that.

Anyway, I doubt that the list has much interest in pursuing this discussion. Those who wish, please just write me.

-Richard

---
Prof. Richard Sherman
Division of International Studies
Korea University

On Apr 19, 2013, at 8:29 AM, Manoel Galdino <[log in to unmask]> wrote:

I'm curious. How big is your big? Does your data fit in memory?

In any case, you're right that R has a steeper learning curve. I guess it's
a trade-off between low productivity in the beginning and the freedom to
implement things your own way. But I doubt that Stata is better than R
if you know the R way of doing things. For instance, why use melt and cast
when you have the data.table package? data.table is faster than SQL if the
data fits in memory and you use it the right way! It also saves a lot of
memory by avoiding the unnecessary copying of objects that is the default
in base R.

I'm not saying R is the best tool for analysis of big data. It's not
(yet?). But my feeling is that in academia, big data isn't really big data
and R is better than most options.

Best,
Manoel





On Thu, Apr 18, 2013 at 6:27 PM, Richard Sherman <[log in to unmask]> wrote:

Hello Patrick,

The likelihood that my weeks-long R episode is due to human (Sherman)
error cannot be overlooked.

I suppose this is partly a question of cultural/linguistic preferences
over software. Still, with big data in Stata you can write

use bigdata
reshape long y, i(x) j(z)
reg y v*

and expect results before the day ends.

To do the same thing in R, you need to -melt- and -cast-, which can take
days, then Google all over to find the right "big" package, and wait until
next Thursday to get what you need.
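
For reference, the wide-to-long step that -reshape long y, i(x) j(z)- performs can be sketched as a streaming transformation that never holds more than one wide row at a time. The sketch below is plain Python, not Stata or R; it assumes the wide columns are named with the stub plus a numeric suffix (y1, y2, ...), and the function and argument names are hypothetical.

```python
def reshape_long(rows, stub="y", j_var="z"):
    """Yield long-format records from wide-format records.

    `rows` is an iterable of dicts such as {"x": 1, "y1": 10, "y2": 20}.
    Every column named <stub><digits> becomes one output record, with
    the digits stored under `j_var` and the cell value under `stub`.
    All other columns are copied unchanged into each output record.
    """
    for row in rows:
        # Only stub + numeric suffix counts as a wide column, so a column
        # like "year" is not mistaken for part of the "y" stub.
        wide_cols = [
            k for k in row
            if k.startswith(stub) and k[len(stub):].isdigit()
        ]
        fixed = {k: v for k, v in row.items() if k not in wide_cols}
        for col in wide_cols:
            long_row = dict(fixed)
            long_row[j_var] = int(col[len(stub):])
            long_row[stub] = row[col]
            yield long_row


if __name__ == "__main__":
    wide = [{"x": 1, "y1": 10, "y2": 20, "v1": 5}]
    for record in reshape_long(wide):
        print(record)
```

Because the function is a generator, it can be fed from a file reader and written straight back out, which is one way to keep memory use flat on large inputs. It says nothing about how Stata or R implement the operation internally.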

I like R for many reasons, but the analysis of big data is not one of them.

-Richard

---
Prof. Richard Sherman
Division of International Studies
Korea University

On Apr 18, 2013, at 5:25 AM, Patrick Lam <[log in to unmask]> wrote:

Hi Richard,

That is interesting.  My experience is that on the surface, Stata handles
bigger datasets more smoothly due to the way R handles and processes its
data.  But there are almost always packages that allow R to process big
data in a way that is as efficient as Stata, although one has to look for
these packages.  See, for example, a recent piece in TPM about the
bigmemory package:

http://polmeth.wustl.edu/methodologist/tpm_v20_n1.pdf

The difference between weeks and half an hour seems so drastic to me
that it may be a matter of coding.




On Wed, Apr 17, 2013 at 3:52 PM, Richard Sherman <[log in to unmask]> wrote:

OK, interesting, but:

I've waited weeks for R to do what Stata can do in half an hour. R is
not suited to big data.

-Richard

---
Prof. Richard Sherman
Division of International Studies
Korea University

On Apr 18, 2013, at 3:18 AM, "Mihas, Paul" <[log in to unmask]> wrote:

Practical "Big Data": Separating the Hope from the Hype

https://apps.research.unc.edu/events/index.cfm?event=events.eventDetails&event_key=51FE2C8EA7C615597B4111E3B07B274D8C2578E5


Two-Day Short Course

May 20-21: 10 a.m.-4 p.m.
Odum Institute for Research in Social Science (http://www.odum.unc.edu)
14 Manning Hall
University of North Carolina, Chapel Hill

Philip A. Schrodt, Pennsylvania State University

Overview: The phrase "Big Data" has come to designate a network of
relatively new computationally intensive methods that merge machine
learning and statistical methods for the analysis of very large data
sets derived from secondary sources, usually the Web. This two-day
short course will provide an overview of the most commonly used
approaches, and how these do -- and sometimes do not -- differ from
conventional social science statistical approaches. The lectures
emphasize approaches and resources for gaining further knowledge and
technical proficiency, rather than going into depth on any single
method; with very few exceptions, all of the software illustrated will
be open source.

*   Module 1: Big Data: sources and practical implementation.
Web-scraping. Hadoop and other distributed databases, "cloud" computing,
and the "map-reduce" approach. Resources in R and Python. Ethical
considerations: privacy, intellectual property
*   Module 2: Working with unstructured text: regular expressions,
natural language processing suites for pre-processing text; named entity
and feature extraction
*   Module 3: Working with unstructured text: supervised text
classification and unsupervised topic models.
*   Module 4: Working with large-scale semi-structured data:
clustering, decision-trees, ensemble methods, and visualization
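
The "map-reduce" approach named in Module 1 can be sketched on a single machine as a word count: each document is mapped to partial counts independently, and the partial counts are then merged pairwise. This is only a toy illustration of the pattern in plain Python, not Hadoop and not course material; the function names are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_phase(document):
    """Map step: turn one document into its own partial word counts."""
    return Counter(document.lower().split())


def reduce_phase(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(documents):
    """Apply the map step to every document, then fold the results."""
    return reduce(reduce_phase, map(map_phase, documents), Counter())


if __name__ == "__main__":
    docs = ["big data", "Big hype"]
    print(word_count(docs))
```

Because each map call touches only its own document and the merge is associative, the same two functions could in principle be distributed across many machines, which is the point of the pattern.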

Pre-requisites: The course assumes a general familiarity with social
science data analysis and its mathematical conventions (for example the
equations for regression analysis). Knowledge of some computer
programming and the R statistical system will be very helpful but not
required.

Fee: $420

To register, visit:

https://apps.research.unc.edu/events/index.cfm?event=events.eventDetails&event_key=51FE2C8EA7C615597B4111E3B07B274D8C2578E5


Odum Institute
The University of North Carolina at Chapel Hill
Manning Hall CB# 3355
Chapel Hill, NC 27599-3355
www.odum.unc.edu
Telephone: 919.962.3061
Fax: 919.962.4777









**********************************************************
         Political Methodology E-Mail List
Editors: Ethan Porter        <[log in to unmask]>
        Gregory Whitfield   <[log in to unmask]>
**********************************************************
    Send messages to [log in to unmask]
To join the list, cancel your subscription, or modify
       your subscription settings visit:

      http://polmeth.wustl.edu/polmeth.php

**********************************************************





--
Patrick Lam
Department of Government and Institute for Quantitative Social Science,
Harvard University
http://www.patricklam.org





--
Manoel Galdino
https://sites.google.com/site/galdinomcz/
