Tutorial

Getting started with R is simple - at least when you are already familiar with any other sort of programming language. In this short tutorial, you will see a few examples of how working with R can be fun ;)

This tutorial is a short synopsis of the much more extensive "An Introduction to R", which is offered on the R Project's pages.

= The first steps =

R is a scripting language that comes with a command-line, shell-like interpreter for exploratory data programming. Using cRunch, your studio workspace provides a comfortable interface to this interpreter. In the picture below, you see this command-line interface in action on the left-hand side: the user has entered a new variable assignment on the prompt (">") in the first line and then inspected it in the second.



Before we get into things, one note on where best to find help. R has a built-in help system, which is very good! Most questions you might have can be answered by reading the help pages, which often already provide an example of what you want to do. The simplest way to call the help system is with the question mark operator:

> ?read.csv

This calls the help page for the function 'read.csv'. Don't forget to scroll down to 'Details' and 'Examples'.

If you do not know the name of the function you are looking for, you can use the search command with good keywords:

> help.search("mysql")

On the results page you will see all help pages that might be relevant. For this example, you will quickly spot that "RMySQL::MySQL" is amongst the hits. The double colon separates the package name (before it) from the function name (after it). As you can see for this search, there are quite a lot of functions in the "RMySQL" package, so it might be a good idea to call:

> library(RMySQL) # load the package
> help(package="RMySQL")

This provides you with an overview on all functions and data bundled together in the RMySQL package.

Don't worry about this for now: there is a separate tutorial on how to use MySQL on cRunch.

When this does not help, your problem has most often already been dealt with on one of the various mailing lists and community sites available: try searching these for your problem.
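Besides the help pages, a few helper functions can narrow things down without leaving the prompt. The following is a minimal sketch using only functions from the standard R installation; the search pattern is just an illustration.

```r
# List functions on the search path whose names start with "read.csv".
hits <- apropos("^read\\.csv")
print(hits)

# Show the argument list of a function without opening its help page.
args(read.csv)

# help() is the function form of the question mark operator.
# These open a help viewer, so they are commented out here:
# help("read.csv")      # same page as ?read.csv
# example(read.csv)     # runs the 'Examples' section of that page
```

apropos() matches function names only, whereas help.search() also looks through titles and keywords of the help pages.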

= A note on data types =

R handles data types quite intelligently, which does not mean that it never gets a data type wrong. There are only a few fundamental data types, for example "character" (= strings), "integer", and "double". This is pretty much standard, and R most often gets it right when input and output operations are performed. Should a conversion be required on occasion, however, the basic type casting functions come into play:

> a = "12" # warning: this is a character string
> a
[1] "12"
> a * 3
Error in a * 3 : non-numeric argument to binary operator
> as.integer(a) * 3
[1] 36
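To see what a conversion actually did, it helps to check the type before and after. This is a small sketch with made-up values; it also shows what happens when the text is not a number at all.

```r
a <- "12"
class(a)                  # "character"

n <- as.integer(a)
class(n)                  # "integer"
n * 3                     # 36

# Text that cannot be read as a number becomes NA (R also issues a warning,
# which is suppressed here to keep the output clean).
bad <- suppressWarnings(as.integer("twelve"))
is.na(bad)                # TRUE

# as.integer() drops the fractional part, as.numeric() keeps it.
as.integer("3.7")         # 3
as.numeric("3.7")         # 3.7

# And back to text again.
as.character(36L)         # "36"
```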

One thing may be a bit different from other languages (well, maybe not from other Lisp derivatives ;): there are important compound data types, which in learning analytics will quite surely become your daily bread: vectors, lists, matrices, and data frames.

Vectors
Let's start with the simplest of these: vectors. A vector is an ordered sequence of elements of one fundamental data type. For example, the following command combines three numbers into a vector:

> b = c( 1, 3, 10 )
> b
[1]  1  3 10

The fun thing about vectors is that you can access their data in a very structured way. We will see later that accessing data in R like this is almost like a mini SQL ;) For example, to get the second element out of b, we just do

> b[2]
[1] 3

Now, let's do something more complex: let's get two elements back:

> b[ c(1,3) ]
[1]  1 10

And something even more complex: let's calculate the index position:

> b[ 1+2 ] # same as b[3]
[1] 10

You see where this is going? Yes, exactly: filtering, reordering, selecting...
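Filtering, reordering, and selecting can each be written as an index expression. The sketch below reuses the vector b from above; the conditions are arbitrary examples.

```r
b <- c(1, 3, 10)

# Filtering: a logical condition inside the brackets keeps matching elements.
b[b > 2]                         # 3 10

# Selecting by exclusion: a negative index drops that position.
b[-1]                            # 3 10

# Reordering: order() returns index positions, which are then used to select.
b[order(b, decreasing = TRUE)]   # 10 3 1

# Looking up positions instead of values.
which(b == 10)                   # 3
```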

Matrices
The other compound data types are even more powerful. Let's go ahead with matrices. A matrix is a data structure with rows and columns. For example, we create a new matrix with

> m = matrix(nrow = 5, ncol=7)
> m
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]   NA   NA   NA   NA   NA   NA   NA
[2,]   NA   NA   NA   NA   NA   NA   NA
[3,]   NA   NA   NA   NA   NA   NA   NA
[4,]   NA   NA   NA   NA   NA   NA   NA
[5,]   NA   NA   NA   NA   NA   NA   NA

Imagine that the columns are week days. Ooh, well, why imagine?

> colnames(m) = c( "Mon","Tue","Wed","Thu","Fri","Sat","Sun")
> m
     Mon Tue Wed Thu Fri Sat Sun
[1,]  NA  NA  NA  NA  NA  NA  NA
[2,]  NA  NA  NA  NA  NA  NA  NA
[3,]  NA  NA  NA  NA  NA  NA  NA
[4,]  NA  NA  NA  NA  NA  NA  NA
[5,]  NA  NA  NA  NA  NA  NA  NA

And imagine the rows are users:

> rownames(m) = c( "Fridolin", "Simon", "George", "Erik", "Shane")
> m
         Mon Tue Wed Thu Fri Sat Sun
Fridolin  NA  NA  NA  NA  NA  NA  NA
Simon     NA  NA  NA  NA  NA  NA  NA
George    NA  NA  NA  NA  NA  NA  NA
Erik      NA  NA  NA  NA  NA  NA  NA
Shane     NA  NA  NA  NA  NA  NA  NA

Then I can quite easily fill this with data, conveniently accessing cells not only by index position, but also by 'labels':

> m["Fridolin", "Tue"] = 5
> m["Simon", 1:7] = 1
> m[ c("Erik","Shane","George"), c("Mon","Wed","Fri") ] = 3
> m
         Mon Tue Wed Thu Fri Sat Sun
Fridolin  NA   5  NA  NA  NA  NA  NA
Simon      1   1   1   1   1   1   1
George     3  NA   3  NA   3  NA  NA
Erik       3  NA   3  NA   3  NA  NA
Shane      3  NA   3  NA   3  NA  NA

By now you have probably guessed that "NA" is used for missing values ("not available"). So to round this one off, we can replace all missing values as follows:

> m[ which(is.na(m)) ] = 0
> m
         Mon Tue Wed Thu Fri Sat Sun
Fridolin   0   5   0   0   0   0   0
Simon      1   1   1   1   1   1   1
George     3   0   3   0   3   0   0
Erik       3   0   3   0   3   0   0
Shane      3   0   3   0   3   0   0

You may wonder why there is no row + column index for this last replacement. Well, it's because - secretly - matrices are nothing other than special vectors for which a 'dimension' is defined:

> dim(m) [1] 5 7

So every 'element' of the matrix also has a one-dimensional index, through which it can be accessed.

Here are a few more fun things to do with (not only) matrices. Why not try them out yourself?

> rowSums(m)
> colSums(m)
> max(m)
> mean(m)
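The one-dimensional index runs down the first column, then down the second, and so on. The sketch below illustrates this on a smaller matrix; the name m2 and its values are made up for the example.

```r
# A 2 x 3 matrix; the values are filled in column by column.
m2 <- matrix(c(1, 3, 1, 0, 1, 3), nrow = 2,
             dimnames = list(c("Simon", "Erik"), c("Mon", "Tue", "Wed")))
m2

dim(m2)              # 2 3
length(m2)           # 6 elements in the underlying vector

# Element 4 of the underlying vector is row 2 of column 2.
m2[4]                # 0
m2["Erik", "Tue"]    # 0, the same cell

# which() reports one-dimensional positions unless asked for row/column pairs.
which(m2 == 0)                    # 4
which(m2 == 0, arr.ind = TRUE)    # row 2, column 2

rowSums(m2)          # Simon 3, Erik 6
```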

Lists
Lists are then not much more surprising: they are just a named sequence of variables, created for example like this:

> li = list( users=rownames(m), weekdays=colnames(m), otherstuff=5:19)
> li
$users
[1] "Fridolin" "Simon"    "George"   "Erik"     "Shane"

$weekdays
[1] "Mon" "Tue" "Wed" "Thu" "Fri" "Sat" "Sun"

$otherstuff
 [1]  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19

Nothing really special, but it comes in handy, for example when you need all user names

> li$users
[1] "Fridolin" "Simon"    "George"   "Erik"     "Shane"

or in need of the first user's name:

> li$users[1]
[1] "Fridolin"

Here are a few more useful things you can do with (not only) lists:

> names(li)
> sd(li$otherstuff)
> which(li$users == "George")
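Two details about list access tend to trip people up: the difference between single and double brackets, and the fact that elements can be added after creation. A short sketch with a cut-down version of the list above (the comment element is made up):

```r
li <- list(users = c("Fridolin", "Simon"), otherstuff = 5:19)

# Double brackets return the element itself, just like the $ operator.
li[["users"]]            # "Fridolin" "Simon"

# Single brackets return a shorter list that still wraps the element.
is.list(li["users"])     # TRUE

# New elements can be added by simple assignment.
li$comment <- "added later"
names(li)                # "users" "otherstuff" "comment"

# Apply a function to every element, here the length of each.
sapply(li, length)       # users 2, otherstuff 15, comment 1
```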

Data Frames
Data frames are like matrices that can have different data types for each column. A data frame is more like an Excel spreadsheet:

> df = data.frame( users=li$users, hits=as.integer(rowSums(m)),
+   peakDay=colnames(m)[apply(m, 1, which.max)], stringsAsFactors=TRUE )
> df
     users hits peakDay
1 Fridolin    5     Tue
2    Simon    7     Mon
3   George    9     Mon
4     Erik    9     Mon
5    Shane    9     Mon

Here which.max returns the position of the (first) largest value in each row, which is then used to look up the day name. The argument stringsAsFactors=TRUE asks for text columns to be stored as factors; since R 4.0.0 this is no longer the default.

Each column in this data frame can have a different data type:

> is.integer(df$hits)
[1] TRUE
> is.factor(df$peakDay)
[1] TRUE

You can see that there is a so far unknown data type: factor. Factors are special vectors. They are special in that their distinct values are 'normalised' and kept in a separate list of levels, while the vector itself stores only the indices into that list. Sounds complex? It's actually not:

> df$peakDay
[1] Tue Mon Mon Mon Mon
Levels: Mon Tue

There are two distinct 'values' in that list of peak days (Mon, Tue), as the other 'possible' week days do not empirically appear as a peak day in the data set provided. We can get the levels directly with

> levels(df$peakDay)
[1] "Mon" "Tue"

and we can get the index values of each df$peakDay via:

> as.integer(df$peakDay)
[1] 2 1 1 1 1
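Because levels are derived from the values that actually occur, days that never show up are missing from the level list. If all seven week days should be possible values, the levels can be given explicitly. The following is a sketch with a made-up vector of peak days:

```r
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Declare all seven days as levels, in calendar order.
peak <- factor(c("Tue", "Mon", "Mon"), levels = days)

levels(peak)         # all seven days, even those without data
as.integer(peak)     # 2 1 1, positions in the level list

# table() counts per level and reports zero for unused levels.
table(peak)

# droplevels() removes the unused levels again.
levels(droplevels(peak))   # "Mon" "Tue"
```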

= Getting data in =

All roads lead to Rome (hah, notice the capital "R" in Rome?): there are many different ways to import your data into R.

If you're just exploring and if your data are not massive, then the simplest options are:


 * read.csv: for importing comma separated value files (which you can save from e.g. Excel)


 * the routines of the packages tm and lsa for handling textual data (using document-term matrices)


 * database connectors such as RMySQL


 * load: for reading in a native R data file


 * url: for fetching data from a URL (as is often available from an online survey system)
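As a minimal sketch of the first option, the following writes a tiny CSV file to a temporary location and reads it back with read.csv. The file contents and the names csv_file and survey are made up for the example.

```r
# Create a small CSV file to have something to import.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("user,hits", "Fridolin,5", "Simon,7"), csv_file)

# The first line is used as column names by default.
survey <- read.csv(csv_file, stringsAsFactors = FALSE)
str(survey)          # 2 observations of 2 variables

# Column types are guessed on import: hits comes back as integer.
survey$hits * 2      # 10 14

# read.csv() also accepts a URL in place of a file name, for example
# read.csv("https://example.org/survey.csv")   (hypothetical address)
```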

= Getting data out =


 * html reports with knitr (on cRunch you can make them available as a public report service!)


 * PDF reports using Sweave


 * save(myvariable, file="public/data/mydata.rda") for sharing native R data files (.rda)


 * write.csv for storing comma separated values


 * png, jpeg, or pdf for creating images
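The export functions in this list follow a common pattern: pass the object and a file name. The sketch below writes to a temporary directory rather than the public/data folder mentioned above; the data and file names are made up.

```r
out_dir <- tempdir()
hits_df <- data.frame(users = c("Fridolin", "Simon"), hits = c(5L, 7L))

# Comma separated values, without the extra column of row names.
csv_path <- file.path(out_dir, "hits.csv")
write.csv(hits_df, csv_path, row.names = FALSE)

# Native R data file: load() restores the object under its original name.
rda_path <- file.path(out_dir, "hits.rda")
save(hits_df, file = rda_path)
rm(hits_df)
load(rda_path)

# Images: open a graphics device, draw, then close the device again.
pdf_path <- file.path(out_dir, "hits.pdf")
pdf(pdf_path)
barplot(hits_df$hits, names.arg = hits_df$users)
invisible(dev.off())
```

png() and jpeg() are used in the same way as pdf() here, provided the R installation supports those devices.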