Statistics With R – Why? and Why Not?
When doing math or numerical analysis, knowledge of a technique is far too often tied to the tool that performs the calculation. Consider an engineer whose understanding of the fast Fourier transform is inseparably tied to the fft function in Matlab. Of course, this hypothetical engineer understands what the results mean (more or less), but he may not be able to duplicate his analysis if Matlab were taken away.
In most cases, it is likely that no deeper understanding will be required. But what happens if the computer makes a mistake? Or the program becomes unavailable? Both situations are entirely possible. Computer algorithms aren’t perfect and occasionally arrive at results that make little sense, and hardware has been known to fail.
When the engineer understands how the computer arrived at the answer, however, he can recognize, understand, and ultimately correct those cases where the results are unexpected. This is an important reality check that can prevent costly disasters later down the line. Or, if the usual hardware or software is unavailable, he can use an alternative tool or software package to duplicate the analysis.
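As a rough illustration of that kind of reality check, the transform behind an fft routine can be written out directly from its textbook definition. The sketch below uses Python rather than Matlab, and the name naive_dft is purely illustrative (it is not a library function). It is the slow O(n^2) form of the discrete Fourier transform, not the fast algorithm, but on small inputs it should agree with whatever a package returns.

```python
import cmath
import math


def naive_dft(samples):
    """Discrete Fourier transform computed straight from the definition.

    Runs in O(n^2) time, so it is only meant for checking small inputs
    against the output of a real FFT routine.
    """
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(samples))
        for k in range(n)
    ]


# One full cycle of a cosine, sampled at 8 evenly spaced points.
signal = [math.cos(2 * math.pi * t / 8) for t in range(8)]

spectrum = naive_dft(signal)
magnitudes = [round(abs(c), 6) for c in spectrum]
print(magnitudes)
```

For this input, all of the energy lands in bins 1 and 7 (each with magnitude 4, half the number of samples), and every other bin is zero. If a packaged fft gave something different for the same signal, that would be the cue to start asking questions.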
But while such a situation can arise with any type of numerical software, it’s most likely to happen to users of a statistical package. I find this extremely ironic, since a proper understanding of statistics is essential to living in the modern world. (Much more so than an understanding of the fast Fourier transform, at any rate.) The rules of probability, the normal curve, correlation, and multivariate statistics can have a direct impact on how we live our lives. They are used in making important decisions in finance, medicine, science, and government. A misunderstanding of stats and the methods of science (from which statistics is inseparable) underlies the most divisive issues of our day: abortion, stem cell research, and global warming.
Moreover, neither side has a monopoly on ignorance or misunderstanding. People fail to distinguish between correlation and causality, or insist on using the word “average” as a slur. Nearly as bad are those who, like the hypothetical engineer described above, only understand statistics within the narrow context of their stats package. Casual statisticians are nearly as dangerous as the wholly uninformed.
The Statistical Package for the Social Sciences (SPSS) is one of the biggest perpetrators of this crisis. Which is hugely ironic, because I happen to love SPSS. SPSS is probably the first statistical package to place advanced statistical methods within the grasp of the novice user. I’ve been a happy user for nearly a decade (ever since I was introduced to the program in high school). But there is no doubt that I’ve come to understand statistics within the context of SPSS and its GUI.
Please don’t misunderstand me: I have a pretty good grasp of basic statistics. I can sling probability with the best of them and take relish in describing when to use the Fisher exact test instead of a chi-square test. But advanced statistics are a completely different matter. Advanced stats scare me. I can certainly use these more complicated methods. I’ve analyzed and written about multivariate models and even ventured into Analysis of Variance (ANOVA). But I have to rely on SPSS and the aid of my institution’s biostatistician to help me recognize when there is a problem.
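To make the Fisher-versus-chi-square choice concrete, here is a minimal sketch in plain Python, assuming a 2x2 table of counts. It applies a common textbook rule of thumb (prefer the exact test when any expected cell count falls below 5), which is an assumption of this example rather than a universal rule, and then computes the two-sided Fisher p-value from the hypergeometric distribution. The function names are made up for illustration.

```python
from math import comb


def expected_counts(table):
    """Expected cell counts for a 2x2 table under independence."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[r * col / n for col in cols] for r in rows]


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value for a 2x2 table.

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # The small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )


# A small table: 8 observations, so every expected count is only 2.
table = [[3, 1], [1, 3]]

# Rule of thumb (assumed here): small expected counts call for Fisher.
use_fisher = any(e < 5 for row in expected_counts(table) for e in row)
p_value = fisher_exact_two_sided(table)

print(use_fisher)         # True
print(round(p_value, 4))  # 0.4857
```

With only eight observations, the chi-square approximation is on shaky ground, which is exactly the situation where the exact test earns its keep.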
Which is why, in a time of tight budgets, losing the institution’s SPSS license has been a crushing blow to my productivity. (Whoever made that decision should be hauled out and shot!) Because I don’t have my statistics software any more, there are certain aspects of my job that are much more difficult to do. And unfortunately, there is only one logical conclusion to draw: I’ve become a victim of the statistical ease of SPSS.
Open Source Alternatives
I went through a similar experience about a year ago. At the time, I had become increasingly frustrated with the restrictions, licensing fees, and limitations of the Matlab technical computing language. After one particularly infuriating meeting, I decided that I had had enough and was going to do something about it. In the months that followed, I spoke with friends and colleagues, and experimented with every alternative I could get my hands on. I looked at Octave (the “Open Source Matlab”) and Ruby, before eventually settling on a combination of Python and PyQt to meet my needs. The result of these changes has been tremendously positive. Python is both easier to use and far more powerful than Matlab could ever hope to be. Not only am I happier and more productive, but so are those who work with me.
It is, therefore, logical that when I lost my statistical language of choice, I would look to open source to provide an alternative. Fortunately, the open source community delivers not one alternative to SPSS, but two: GNU PSPP and R.
GNU PSPP
As the name implies, PSPP has one simple goal: to clone SPSS in every way that matters. It can perform descriptive statistics, t-tests, linear regression, and non-parametric tests. It has an easy-to-use and relatively intuitive GUI. It can use SPSS syntax and read SPSS data files. It supports an obscene number of variables and cases (about a billion). It interoperates with Gnumeric and OpenOffice. Finally, it’s fast.
Aside from its horribly ugly icon, PSPP would appear to deliver exactly what I want and need. Except, you might have noticed that this article is titled “Statistics with R”, not “Statistics with PSPP”. Obviously, I chose to go with the second alternative. But why?
PSPP works as advertised. I found it able to deal with nearly all of the old SPSS data files and syntax that I threw its way. But the program suffers from the problem of all clones everywhere: its greatest aspiration is to be a copy of something else. That is to say, it seeks to be “Good Enough”, and therein lies the problem: I don’t want a tool that is good enough. I want to use excellent software, even if it’s different or requires me to learn new things. Even if I have to pay for it.
I’m not trying to pick on or be unfair to PSPP. It meets an important need in the free software landscape. It just doesn’t fit in with my desires or preferences very well.
The R Statistical Project
This is where R steps into the picture. Whereas PSPP is “aimed at statisticians, social scientists and students requiring fast convenient analysis of sampled data (emphasis added)”, R is the software that most statisticians actually use. I contacted the statistician at my institution and asked, “What statistical software should I use? I’m looking at R and PSPP.”
He responded, “Oh that’s easy. Use R. There will be a learning curve, but it’s much more powerful and capable than even SPSS or SAS.”
As I’ve started to explore the feature set and available modules, it has readily become apparent why. R is a huge language. There are thousands of packages that cover every type of statistics I’ve ever heard of, and many more I haven’t.
Even better, people have gone to great lengths to incorporate R into other tools. It has a set of excellent Python bindings and interoperates very well with LyX and LaTeX. As just a single example, using Sweave, you can embed R code directly in reports and other documents that need to be updated on a very frequent basis. This allows these publications to be generated on demand with the most recent data. The only other place I’ve seen the equal of this feature is within the proprietary universe of Microsoft Office and SQL Server.
[Credit: Oak-Tree.us]

