R For Dummies®, 2nd Edition
Published by: John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030-5774, www.wiley.com
Copyright © 2015 by John Wiley & Sons, Inc., Hoboken, New Jersey
Media and software compilation copyright © 2015 by John Wiley & Sons, Inc. All rights reserved.
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Trademarks: Wiley, For Dummies, the Dummies Man logo, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and may not be used without written permission. All trademarks are the property of their respective owners. John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNET WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
For general information on our other products and services, please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993, or fax 317-572-4002. For technical support, please visit www.wiley.com/techsupport.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2015941928
ISBN 978-1-119-05580-8 (pbk); ISBN 978-1-119-05583-9 (epub); 978-1-119-05585-3 (epdf)
Welcome to R For Dummies, the book that helps you learn the statistical programming language R quickly and easily.
We can’t guarantee that you’ll be a guru if you read this book, but you should be able to
knowledge <- apply(theory, 1, sum)
R For Dummies is an introduction to the statistical programming language known as R. We start by introducing the interface and work our way from the very basic concepts of the language through more sophisticated data manipulation and analysis.
We illustrate every step with easy-to-follow examples. This book contains numerous code snippets, several write-it-yourself functions you can use later on, and complete analysis scripts. All these are for you to try out yourself.
We don’t attempt to give a technical description of how R is programmed internally, but we do focus as much on the why as on the how. R has many features that may seem surprising at first, so we believe it’s important to explain both how you should talk to R, and how the R engine interprets what you say. After reading this book, you should be able to manipulate your data in the form you want and understand how to use functions we didn’t cover in the book (as well as the ones we do cover).
This book is a reference. You don’t have to read it from beginning to end. Instead, you can use the table of contents and index to find the information you need. We cross-reference other chapters where you can find more information.
Since the publication of the first edition, R has kept evolving and improving. To keep the book accurate, we updated the code to reflect any changes in the latest version of R (version 3.2.0). Thanks to feedback from readers, students, and colleagues, we reworked some sections to clarify issues and correct inaccuracies. For example, we modified the code to use double quotes instead of single quotes when using text strings. We also refer to the fundamental units of lists as components, rather than elements.
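Both conventions are easy to see in a few lines of code. The following is only a small sketch, and the object names greeting and shopping are made up for illustration:

```r
# Double and single quotes create the same character string;
# the book simply standardizes on double quotes
greeting <- "Hello"
identical(greeting, 'Hello')   # TRUE

# A list holds named components, which may have different types
shopping <- list(item = "apples", quantity = 3)
shopping$item       # the component called item
shopping$quantity   # the component called quantity
```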
The new rfordummies package contains the code examples from the book. Read all about it in Appendix B.
Code snippets appear like this example, where we simulate 1 million throws of two six-sided dice:
> set.seed(42)
> throws <- 1e6
> dice <- replicate(2,
+ sample(1:6, throws, replace = TRUE)
+ )
> table(rowSums(dice))
     2      3      4      5      6      7      8
 28007  55443  83382 110359 138801 167130 138808
     9     10     11     12
110920  83389  55816  27945
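Each line of R code in this example is preceded by one of two symbols:
>: The prompt symbol, which shows that R is ready for a new command.
+: The continuation symbol, which shows that the line continues the command started on the previous line.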
Lines that start without either the prompt or the continuation symbol are output produced by R. In this case, you get the total number of throws where the dice added up to the numbers 2 through 12. For example, out of 1 million throws of the dice, on 28,007 occasions the numbers on the dice added to 2.
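As a rough check on the simulated counts, you can compare them with the counts you'd expect in theory. The following sketch is our own illustration (the names sums and expected are made up): it lists all 36 equally likely combinations of two dice and scales the frequency of each total to 1 million throws.

```r
# All 36 possible totals of two six-sided dice
sums <- outer(1:6, 1:6, "+")

# Each combination has probability 1/36, so scale the
# frequency of each total to 1 million throws
expected <- table(sums) / 36 * 1e6
round(expected)
```

A total of 2 can occur in only one of the 36 combinations, so you'd expect about 27,778 such throws, which is close to the 28,007 in the simulation.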
You can copy these code snippets and run them in R, but you have to type them exactly as shown. There are only three exceptions:
Instructions to type code into the R console have the > symbol to the left:
> print("Hello world!")
If you type this into a console and press Enter, R responds with:
[1] "Hello world!"
For convenience, we collapse these two events into a single block, like this:
> print("Hello world!")
[1] "Hello world!"
Functions, arguments, and other R keywords appear in monofont. For example, to create a plot, you use the plot() function. Function names are followed by parentheses — for example, plot(). We don't add arguments to the function names mentioned in the text, unless it’s really important.
On some occasions we talk about menu commands, such as File⇒Save. This just means that you open the File menu and choose the Save option.
You can use this book however works best for you, but if you’re pressed for time (or just not interested in the nitty-gritty details), you can safely skip anything marked with a Technical Stuff icon. You also can skip sidebars (text in gray boxes); they contain interesting information, but nothing critical to your understanding of the subject at hand.
This book makes the following assumptions about you and your computer:
The book is organized in six parts. Here’s what each of the six parts covers.
In this part, you write your first script. You use the powerful concept of vectors to make simultaneous calculations on many variables at once. You work with the R workspace (in other words, how to create, modify, or remove variables). You find out how to save your work and retrieve and modify script files that you wrote in previous sessions. We also introduce some fundamentals of R (for example, how to install packages).
In this part, we fill you in on the three R’s: reading, ’riting, and ’rithmetic — in other words, working with text and numbers (and dates for good measure). You also get to use the very important data structures of lists and data frames.
R is a programming language, so you need to know how to write and understand functions. In this part, we show you how to do this, as well as how to control the logic flow of your scripts by making choices using if statements, as well as looping through your code to perform repetitive actions. We explain how to make sense of and deal with warnings and errors that you may experience in your code. Finally, we show you some tools to debug any issues that you may experience.
In this part, we introduce the different data structures that you can use in R, such as lists and data frames. You find out how to get your data in and out of R (for example, by reading data from files or the Clipboard). You also see how to interact with other applications, such as Microsoft Excel.
Then you discover how easy it is to do some advanced data reshaping and manipulation in R. We show you how to select a subset of your data and how to sort and order it. We explain how to merge different datasets based on columns they may have in common. Finally, we show you a very powerful generic strategy of splitting and combining data and applying functions over subsets of your data. When you understand this strategy, you can use it over and over again to do sophisticated data analyses in only a few small steps.
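To give a flavor of the split, apply, and combine strategy described above, here is a minimal sketch that uses the mtcars dataset that ships with R. The choice of dataset and the names pieces and by_cyl are ours, purely for illustration:

```r
# Split the miles-per-gallon values by number of cylinders,
# apply mean() to each piece, and combine the results
pieces <- split(mtcars$mpg, mtcars$cyl)
by_cyl <- sapply(pieces, mean)
by_cyl

# tapply() performs the same three steps in a single call
tapply(mtcars$mpg, mtcars$cyl, mean)
```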
After reading this part, you’ll know how to describe and summarize your variables and data using R. You’ll be able to do some classical tests (for example, calculating a t-test). And you’ll know how to use random numbers to simulate some distributions.
Finally, we show you some of the basics of using linear models (for example, linear regression and analysis of variance). We also show you how to use R to predict the values of new data using models that you’ve fitted to your data.
They say that a picture is worth a thousand words. This is certainly the case when you want to share your results with other people. In this part, you discover how to create basic and more sophisticated plots to visualize your data. We move on from bar charts and line charts, and show you how to present cuts of your data using facets.
In this part, we show you how to do ten things in R that you probably use Microsoft Excel for at the moment (for example, how to do the equivalent of pivot tables and lookup tables). We also give you ten tips for working with packages that are not part of base R.
As you read this book, you’ll find little pictures in the margins. These pictures, or icons, mark certain types of text:
R For Dummies includes the following goodies online for easy download:
www.dummies.com/cheatsheet/r
www.dummies.com/extras/r
www.dummies.com/extras/r
If we have updates to the content of the book, you can find them here:
www.dummies.com/extras/r
There’s only one way to learn R: Use it! In this book, we try to make you familiar with the usage of R, but you’ll have to sit down at your PC and start playing around with it yourself. Crack the book open so the pages don’t flip by themselves, and start hitting the keyboard!
Part I
In this part …
Introducing R programming concepts.
Creating your first script.
Making clear, legible code.
Visit www.dummies.com for great Dummies content online.
Chapter 1
In This Chapter
Discovering the benefits of R
Identifying some programming concepts that make R special
With an estimated worldwide user base of more than 2 million people, the R language has rapidly grown and extended since its origin as an academic demonstration language in the 1990s.
Some people would argue — and we think they’re right — that R is much more than a statistical programming language. It’s also
In this chapter, we fill you in on the benefits of R, as well as its unique features and quirks.
Of the many attractive benefits of R, a few stand out: It’s actively maintained, it has good connectivity to various types of data and other systems, and it’s versatile enough to solve problems in many domains. Possibly best of all, it’s available for free, in more than one sense of the word.
R is available under an open-source license, which means that anyone can download and modify the code. This freedom is often referred to as “free as in speech.” R is also available free of charge — a second kind of freedom, sometimes referred to as “free as in beer.” In practical terms, this means that you can download and use R free of charge.
As a result of this freedom, many excellent programmers have contributed improvements and fixes to the R code. For this reason, R is very stable and reliable.
The R Core Team has put a lot of effort into making R available for different types of hardware and software. This means that R is available for Windows, Unix systems (such as Linux), and the Mac.
R itself is a powerful language that performs a wide variety of functions, such as data manipulation, statistical modeling, and graphics. One really big advantage of R, however, is its extensibility. Developers can easily write their own software and distribute it in the form of add-on packages. Because of the relative ease of creating and using these packages, literally thousands of packages exist. In fact, many new (and not-so-new) statistical methods are published with an R package attached.
The R user base keeps growing. Many people who use R eventually start helping new users and advocating the use of R in their workplaces and professional circles. Sometimes they also become active on
The R mailing lists (http://www.r-project.org/mail.html)
Q&A websites, such as Stack Overflow (www.stackoverflow.com/questions/tagged/r) and Cross Validated (http://stats.stackexchange.com/questions/tagged/r)
In addition to these mailing lists and Q&A websites, R users may
Blog about R (www.r-bloggers.com).
Discuss R on Twitter (www.twitter.com/search/rstats).
See Chapter 11 for more information on R communities.
As more and more people moved to R for their analyses, they started trying to incorporate R in their previous workflows. This led to a whole set of packages for linking R to file systems, databases, and other applications. Many of these packages have since been incorporated into the base installation of R.
For example, the R package foreign (http://cran.r-project.org/web/packages/foreign/index.html) forms part of the recommended packages of R and enables you to read data from the statistical packages SPSS, SAS, Stata, and others (see Chapter 12).
Several add-on packages exist to connect R to database systems, such as RODBC (http://cran.r-project.org/web/packages/RODBC/index.html) and ROracle (http://cran.r-project.org/web/packages/ROracle/index.html).
As more data analysts started using R, the developers of commercial data software no longer could ignore the new kid on the block. Many of the big commercial packages have add-ons to connect with R. Notably, both IBM’s SPSS and SAS Institute’s SAS allow you to move data and graphics between the two packages, and also call R functions directly from within these packages.
Other third-party developers also have contributed to better connectivity between different data analysis tools. For example, Statconn developed RExcel, an Excel add-on that allows users to work with R from within Excel (http://www.statconn.com/products.html).
R is more than just a domain-specific programming language aimed at data analysis. It has some unique features that make it very powerful, the most important one arguably being the notion of vectors. These vectors allow you to perform sometimes complex operations on a set of values in a single command.
R is a vector-based language. You can think of a vector as a row or column of numbers or text. The list of numbers {1,2,3,4,5}, for example, could be a vector. Unlike most other programming languages, R allows you to apply functions to the whole vector in a single operation without the need for an explicit loop.
It is time to illustrate vectors with some real R code. First, assign the values 1:5 to a vector called x:
> x <- 1:5
> x
[1] 1 2 3 4 5
Next, add the value 2 to each element in the vector x:
> x + 2
[1] 3 4 5 6 7
You can also add one vector to another. To add the values 6:10 element-wise to x, you do the following:
> x + 6:10
[1] 7 9 11 13 15
To do this in most other programming languages would require an explicit loop to run through each value of x. However, R is designed to perform many operations in a single step. This functionality is one of the features that make R so useful (and powerful) for data analysis.
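To make the comparison concrete, here is a sketch of what the explicit loop might look like if you wrote it in R itself, next to the vectorized version. The names y and result are made up for illustration:

```r
x <- 1:5
y <- 6:10

# Explicit loop: visit each position and add the matching values
result <- numeric(length(x))
for (i in seq_along(x)) {
  result[i] <- x[i] + y[i]
}
result

# Vectorized: R adds the matching elements in one step
x + y
```

Both approaches return the values 7, 9, 11, 13, and 15, but the vectorized version needs only one line.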
We introduce the concept of vectors in Chapter 2 and expand on vectors and vectorization in much more depth in Chapter 4.
R was developed by statisticians to make statistical data analysis easier. This heritage continues, making R a very powerful tool for performing virtually any statistical computation.
As R started to expand away from its origins in statistics, many people who would describe themselves as programmers rather than statisticians have become involved with R. The result is that R is now eminently suitable for a wide variety of nonstatistical tasks, including data processing, graphical visualization, and analysis of all sorts. R is being used in the fields of finance, natural language processing, genetics, biology, and market research, to name just a few.
In this book, we assume that you want to find out about R programming, not statistics, although we provide an introduction to statistics with R in Part IV.
R is an interpreted language, which means that — contrary to compiled languages like C and Java — you don’t need a compiler to first create a program from your code before you can use it. R interprets the code you provide directly and converts it into lower-level calls to pre-compiled code/functions.
In practice, it means that you simply write your code and send it to R, and the code runs, which makes the development cycle easy. This ease of development comes at the cost of speed of code execution, however. The downside of an interpreted language is that the code usually runs slower than the equivalent compiled code.
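One way to get a feel for this trade-off is to time the same calculation written two ways. The sketch below is only an illustration, and its variable names are made up. The timings depend on your machine and your version of R, so we don't show any output here:

```r
n <- 1e6
values <- as.numeric(seq_len(n))

# Interpreted loop: R evaluates the loop body one iteration at a time
loop_total <- 0
loop_time <- system.time(
  for (v in values) loop_total <- loop_total + v
)

# Vectorized call: sum() hands the whole vector to pre-compiled code
vec_time <- system.time(vec_total <- sum(values))

loop_total == vec_total   # both approaches give the same answer
loop_time["elapsed"]      # seconds taken by the loop
vec_time["elapsed"]       # seconds taken by sum()
```

On most systems the call to sum() finishes noticeably sooner, because the repetitive work happens in pre-compiled code rather than in interpreted R code.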
Chapter 2
In This Chapter
Looking at your R editing options
Starting R
Writing your first R script
Finding your way around the R environment
In order to start working in R, you need two things. First, you need a tool to easily write and edit code (an editor). You also need an interface, so you can send that code to R. Which tools you use depends to some extent on your operating system. The basic R install gives you these options:
At a practical level, this difference between operating systems and interfaces doesn’t matter very much. R is a programming language, and you can be sure that R interprets your code identically across operating systems.
Still, we want to show you how to use a standard R interface, so in this chapter we briefly illustrate how to use R with the Windows RGui. Our advice also works on the Mac R.app.
Fortunately, there is an alternative, third-party interface called RStudio that provides a consistent user interface regardless of operating system. RStudio increasingly is the standard editing tool for R, so we also illustrate how to use RStudio.
In this chapter, after opening an R console, you flex your R muscles and write some scripts. You do some calculations, create some numeric and text objects, take a look at the built-in help, and save your work.
R is many things: a programming language, a statistical processing environment, a way to solve problems, and a collection of helpful tools to make your life easier. The one thing that R is not is an application, which means that you have the freedom of selecting your own editing tools to interact with R.
In this section we discuss the Windows R interface, RGui (short for R graphical user interface). This interface also includes a very basic editor for your code. Since this standard editor is so, well, basic, we also introduce you to RStudio. RStudio offers a richer editing environment than RGui and many handy shortcuts for common tasks in R.
As part of the process of downloading and installing R, you get the standard graphical user interface (GUI), called RGui. RGui gives you some tools to manage your R environment — most important, a console window. The console is where you type instructions and generally get R to do useful things for you.
The standard installation process creates useful menu shortcuts (although this may not be true if you use Linux, because there is no standard GUI for Linux). In the menu system, look for a folder called R, and then find an icon called R followed by a version number (for example, R 3.2.0, as shown in Figure 2-1).
When you open RGui for the first time, you see the R Console screen (shown in Figure 2-2), which lists some basic information such as your version of R and the licensing conditions.
Below all this information is the R prompt, denoted by a > symbol. The prompt indicates where you type your commands to R; you see a blinking cursor to the right of the prompt.
We explore the R console in more depth in “Navigating the Environment,” later in this chapter.
Use the console to issue a very simple command to R. Type the following to calculate the sum of some numbers, directly after the prompt:
> 24 + 7 + 11
R responds immediately to your command, calculating the total and displaying it in the console:
> 24 + 7 + 11
[1] 42
The answer is 42. R gives you one other piece of information: The [1] preceding 42 indicates that the value 42 is the first element in your answer. It is, in fact, the only element in your answer! One of the clever things about R is that it can deal with calculating many values at the same time, which is called vector operations. We talk about vectors later in this chapter — for now, all you need to know is that R can handle more than one value at a time.
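The index in square brackets becomes more useful when the answer contains many values. In the following sketch, we temporarily narrow the console so that a short vector wraps onto several lines. The exact line breaks depend on the console width, so the width of 40 is just an assumption for illustration:

```r
# Narrow the console so that a short vector wraps onto several lines
old <- options(width = 40)

x <- 101:130
x   # each output line starts with the index of its first element

options(old)   # restore the previous console width
```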
To quit your R session, type the following code in the console, after the command prompt (>):
> quit()
R asks you a question to make sure that you meant to quit, as shown in Figure 2-3. Click No, because you have nothing to save. This action closes your R session (as well as RGui, if you’ve been using RGui as your code editor). In fact, saving a workspace image rarely is useful.
RStudio is a code editor and development environment with some very nice features that make code development in R easy and fun:
Because RStudio is available free of charge for Linux, Windows, and Apple OS X, we think it’s a good option to use with R. In fact, we like RStudio so much that we use it to illustrate the examples in this book. Throughout the book, you find some tips and tricks on how things can be done in RStudio. If you decide to use a different code editor, you can still use all the code examples and you’ll get identical results.
To open RStudio, click the RStudio icon in your menu system or on your desktop. (You can find installation instructions in this book’s appendix.)
Once RStudio starts, choose File⇒New⇒R Script to open a new script file.
Your screen should look like Figure 2-4. You have four work areas (also called panes):
Packages: You can view a list of all installed packages.
A package is a self-contained set of code that adds functionality to R, similar to the way that add-ins add functionality to Microsoft Excel.
By now, you probably are itching to get started on some real code. In this section, you get to do exactly that. Get ready to get your hands dirty!
Programming books typically start with a very simple program. Often, this first program creates the message "Hello world!". In R, the hello world program consists of one line of code.
Start a new R session, type the following in your console, and press Enter:
> print("Hello world!")
R responds immediately with this output:
[1] "Hello world!"
Type the following in your console to calculate the sum of five numbers:
> 1 + 2 + 3 + 4 + 5
[1] 15
The answer is 15, which you can easily verify for yourself. You may think that there’s an easier way to calculate this value, though — and you’d be right. We explain how in the following section.
A vector is the simplest type of data structure in R. The R manual defines a vector as “a single entity consisting of a collection of things”. A collection of numbers, for example, is a numeric vector — the first five integer numbers form a numeric vector of length 5.
To construct a vector, type into the console:
> c(1, 2, 3, 4, 5)
[1] 1 2 3 4 5
In constructing your vector, you have successfully used a function in R. In programming terms, a function is a piece of code that takes some inputs and does something specific with them. In constructing a vector, you tell the c() function to construct a vector with the first five integers. The entries inside the parentheses are referred to as arguments.
You also can construct a vector by using operators. An operator is a symbol you stick between two values to make a calculation. The symbols +, -, *, and / are all operators, and they have the same meaning they do in mathematics. Thus, 1+2 in R returns the value 3, just as you’d expect.
One very handy operator is called sequence, and it looks like a colon (:). Type the following in your console:
> 1:5
[1] 1 2 3 4 5
That’s more like it. With three keystrokes, you’ve generated a vector with the values 1 through 5. To calculate the sum of this vector, type into your console:
> sum(1:5)
[1] 15
While quite basic, this example shows you that using vectors allows you to do complex operations with a small amount of code. As vectors are the smallest possible unit of data in R, you get to work with vectors extensively in later chapters.
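The same idea extends to other functions that take a whole vector as their argument. Here is a short sketch (the names n and first_n are made up) that also checks the total against the well-known formula n(n + 1)/2 for the sum of the first n integers:

```r
n <- 5
first_n <- 1:n        # the sequence operator builds the vector

sum(first_n)          # 15: adds up all the elements
n * (n + 1) / 2       # 15: closed-form check of the same total
length(first_n)       # 5: the number of elements in the vector
mean(first_n)         # 3: the sum divided by the length
```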
Using R as a calculator is very interesting but perhaps not all that useful. A much more useful capability is storing values and then doing calculations on these stored values. Try this:
> x <- 1:5
> x
[1] 1 2 3 4 5
In these two lines of code, you first assign the sequence 1:5 to an object called x. Then you ask R to print the value of x by typing x in the console and pressing Enter.
In addition to retrieving the value of a variable, you can do calculations on that value. Create a second variable called y, and assign it the value 10. Then add the values of x and y, as follows:
> y <- 10
> x + y
[1] 11 12 13 14 15
The values of the two variables themselves don’t change unless you assign a new value to either of them. You can check this by typing the following:
> x
[1] 1 2 3 4 5
> y
[1] 10
Now create a new variable z, assign it the value of x + y, and print its value:
> z <- x + y
> z
[1] 11 12 13 14 15
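The addition of x and y works even though x holds five values and y holds only one, because R reuses (recycles) the shorter vector until it matches the length of the longer one. The sketch below illustrates this with a made-up vector called nums that has six elements, so that a vector of length 2 fits into it exactly three times:

```r
nums <- 1:6

# A single value is reused for every element of nums
nums + 10          # 11 12 13 14 15 16

# A length-2 vector is reused in turn: 10, 20, 10, 20, 10, 20
nums + c(10, 20)   # 11 22 13 24 15 26
```

If the length of the longer vector isn't a multiple of the length of the shorter one, R still recycles but gives you a warning.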
Variables also can take on text values. You can assign the value "Hello" to a variable called h, for example, by presenting the text to R inside quotation marks, like this:
> h <- "Hello"
> h
[1] "Hello"
In “Using vectors,” earlier in this chapter, you use the c() function to combine numeric values into vectors. This technique also works for text:
> hw <- c("Hello", "world!")
> hw
[1] "Hello" "world!"
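Text vectors behave like numeric vectors in that functions work on every element at once. A brief sketch, using a few base R functions that this section hasn't introduced yet:

```r
hw <- c("Hello", "world!")

length(hw)    # 2: the vector has two elements
nchar(hw)     # 5 6: the number of characters in each element
toupper(hw)   # "HELLO" "WORLD!"
```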
You use the paste() function to concatenate multiple text elements. By default, paste() puts a space between the different elements, like this:
> paste("Hello", "world!")
[1] "Hello world!"
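If a space isn't what you want, paste() lets you change the separator. The following sketch shows the sep and collapse arguments, as well as the related paste0() function:

```r
# sep sets the text placed between the arguments
paste("Hello", "world!", sep = ", ")          # "Hello, world!"

# paste0() is shorthand for an empty separator
paste0("Hello", "world!")                     # "Helloworld!"

# collapse joins the elements of a single vector into one string
paste(c("Hello", "world!"), collapse = " ")   # "Hello world!"
```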
You can write R scripts that have some interaction with a user. To ask the user questions, you can use the readline() function. In the following code snippet, you read a value from the keyboard and assign it to the variable yourname:
> h <- "Hello"
> yourname <- readline("What is your name? ")
What is your name? Andrie
> paste(h, yourname)
[1] "Hello Andrie"
This code seems to be a bit cumbersome, however. Clearly, it would be much better to send these three lines of code simultaneously to R and get them evaluated in one go. In the next section, we show you how.
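One thing to keep in mind when you run these lines together: readline() waits for input only in an interactive session. Outside of one, it returns an empty string right away. The sketch below guards against that with interactive(); the fallback name "stranger" is a made-up default, used only for illustration:

```r
h <- "Hello"

# readline() waits for input only in an interactive session;
# otherwise this sketch falls back to a placeholder name
if (interactive()) {
  yourname <- readline("What is your name? ")
} else {
  yourname <- "stranger"   # hypothetical default, for illustration only
}

paste(h, yourname)
```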