R: for your plotting and statistical needs

Bas Bossink
2-2-2016

Agenda

Introduction
- What is R?
- History
The R Language
Getting Data
Getting to Know Data

Agenda (II)

Plots
Reproducible Research
The R Ecosystem
Resources

Introduction

License: Creative Commons Attribution 3
available at: http://basbossink.github.io/presentations/
sources available at: github
built using:
- R
- RStudio
- RMarkdown

What is R?

R is a free software environment for statistical computing and graphics.

What is R?

From: Coursera: R-Programming
- License: These course materials are available under the Creative Commons Attribution NonCommercial ShareAlike (CC-NC-SA) license (http://www.tldrlegal.com/l/CC-NC-SA).

What is R?

R is a dialect of the S language.

What is S?

S is a language that was developed by John Chambers and others at Bell Labs.
S was initiated in 1976 as an internal statistical analysis environment, originally implemented as Fortran libraries.
Early versions of the language did not contain functions for statistical modeling.

What is S?

In 1988 the system was rewritten in C and began to resemble the system that we have today (this was Version 3 of the language). The book Statistical Models in S by Chambers and Hastie (the white book) documents the statistical analysis functionality.
Version 4 of the S language was released in 1998 and is the version we use today. The book Programming with Data by John Chambers (the green book) documents this version of the language.

Historical Notes

In 1993 Bell Labs gave StatSci (now Insightful Corp.) an exclusive license to develop and sell the S language.
In 2004 Insightful purchased the S language from Lucent for $2 million and is the current owner.
In 2006, Alcatel purchased Lucent Technologies and is now called Alcatel-Lucent.

Historical Notes

Insightful sells its implementation of the S language under the product name S-PLUS and has built a number of fancy features (GUIs, mostly) on top of it—hence the “PLUS”.
In 2008 Insightful was acquired by TIBCO for $25 million.
The fundamentals of the S language itself have not changed dramatically since 1998.
In 1998, S won the Association for Computing Machinery’s Software System Award.

S Philosophy

In “Stages in the Evolution of S”, John Chambers writes:

“[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.”

http://www.stat.bell-labs.com/S/history.html

Back to R

1991: Created in New Zealand by Ross Ihaka and Robert Gentleman. Their experience developing R is documented in a 1996 JCGS paper.
1993: First announcement of R to the public.
1995: Martin Mächler convinces Ross and Robert to use the GNU General Public License to make R free software.
1996: A public mailing list is created (R-help and R-devel)

Back to R

1997: The R Core Group is formed (containing some people associated with S-PLUS). The core group controls the source code for R.
2000: R version 1.0.0 is released.
2015: R version 3.2.3 is released in December 2015.

Features of R

Syntax is very similar to S, making it easy for S-PLUS users to switch over.
Semantics are superficially similar to S, but in reality are quite different (more on that later).
Runs on almost any standard computing platform/OS (even on the PlayStation 3)
Frequent releases (annual + bugfix releases); active development.

Features of R (cont'd)

Quite lean, as far as software goes; functionality is divided into modular packages
Graphics capabilities very sophisticated and better than most stat packages.
Useful for interactive work, but contains a powerful programming language for developing new tools (user -> programmer)
Very active and vibrant user community; R-help and R-devel mailing lists and Stack Overflow

Features of R (cont'd)

It's free! (Both in the sense of beer and in the sense of speech.)
End of content from Coursera: R-Programming

Demo

The R Language

The R Language: Help

# is the comment character
use the help function for help

help("plot") 
?plot
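
Besides help() and ?, R ships a few other lookup tools. A short sketch using only functions from the base and utils packages; the search terms are arbitrary examples, and plot.like is a name chosen for illustration:

```r
# Find objects on the search path whose name contains "plot"
plot.like <- apropos("plot")
head(plot.like)

# Show the arguments of a function without opening its help page
args(seq)

# Full-text search across installed help pages (same as ??"linear model")
# help.search("linear model")

# Run the examples from a help page
# example(mean)
```

The last two calls are commented out because they open a viewer or produce a lot of output; they are meant to be typed at the console.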

Logicals

TRUE

[1] TRUE

FALSE

[1] FALSE

Characters

"Hello, world!"

[1] "Hello, world!"

Numeric Values

13.37

[1] 13.4

[1] 37
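
The output 13.4 for the input 13.37 suggests that the session printed with fewer significant digits than R's default of 7. A minimal sketch, assuming the slides were rendered with a setting such as options(digits = 3), which the slides themselves do not show:

```r
x <- 13.37

# Default printing uses 7 significant digits
print(x)              # [1] 13.37

# Limit the number of significant digits for a single call
print(x, digits = 3)  # [1] 13.4

# format() returns the rounded representation as a string;
# the stored value itself is unchanged
shown <- format(x, digits = 3)
shown                 # [1] "13.4"
x == 13.37            # [1] TRUE
```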

Vectors

c(), 'concatenate', creates a vector
a vector can hold values of a single type only

c(1,2,3)

[1] 1 2 3

Vectors (II)

mixing types triggers coercion to the most general type, here character

c(1, TRUE, "Flintstone")

[1] "1"          "TRUE"       "Flintstone"
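
The coercion rule can be checked with class(). A small sketch; the values and the names flags and nums are made up for illustration:

```r
# Coercion picks the most general type present:
# logical < integer < numeric (double) < character
class(c(TRUE, 2L))       # [1] "integer"
class(c(TRUE, 2L, 3.5))  # [1] "numeric"
class(c(TRUE, 2L, "a"))  # [1] "character"

# Logicals become 1 and 0, which makes counting TRUE values easy
flags <- c(TRUE, FALSE, TRUE)
sum(flags)               # [1] 2

# Explicit conversion back; a string that is not a number becomes NA
# (with a warning, suppressed here)
nums <- suppressWarnings(as.numeric(c("1", "TRUE", "Fred")))
nums                     # [1]  1 NA NA
```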

Indexing Vectors

c(1,2)[1]

[1] 1

c(3,4)[2]

[1] 4

c()[1]

NULL
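
Indices start at 1, and the brackets accept more than a single position. A short sketch with made-up values (v and ages are illustrative names):

```r
v <- c(10, 20, 30, 40)

v[c(1, 3)]   # several positions at once:            [1] 10 30
v[-1]        # a negative index drops an element:    [1] 20 30 40
v[v > 15]    # a logical index keeps matching items: [1] 20 30 40
v[5]         # past the end gives NA, not an error:  [1] NA

# Elements can carry names and be looked up by name
ages <- c(fred = 35, wilma = 33)
ages["wilma"]
```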

Lists

can hold values of different types

list(1,TRUE)

[[1]]
[1] 1

[[2]]
[1] TRUE

Lists (II)

str(list(1,TRUE))

List of 2
 $ : num 1
 $ : logi TRUE

Indexing Lists

list(1,TRUE)[[2]]

[1] TRUE
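
The difference between single and double brackets is easy to miss. A minimal sketch; the person list is a made-up example:

```r
l <- list(1, TRUE)

class(l[2])    # single brackets return a list of length 1: [1] "list"
class(l[[2]])  # double brackets return the element itself: [1] "logical"

# Named elements can also be reached with $
person <- list(name = "Fred", married = TRUE)
person$name          # [1] "Fred"
person[["married"]]  # [1] TRUE
```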

Ranges

: range operator

1:4

[1] 1 2 3 4

range operator also works in reverse order

6:3

[1] 6 5 4 3

Sequences

generate sequences with a certain step size

seq(0, 1, 0.21)

[1] 0.00 0.21 0.42 0.63 0.84

Vector Operations

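arithmetic operators work element-wise on whole vectors
values of the shorter vector are 'recycled'; R warns when the longer length is not a multiple of the shorter one, as with the lengths 3 and 4 here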

fred <- 1:3
wilma <- 3:6
fred * wilma

[1]  3  8 15  6
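
Recycling is silent when the lengths fit and produces a warning when they do not. A sketch that reuses fred and wilma from the slide and captures the warning text in msg (an illustrative name):

```r
# Lengths 6 and 2: the shorter vector is recycled three times, no warning
1:6 * 1:2   # [1]  1  4  3  8  5 12

# A single number is a vector of length 1 and is recycled as well
1:3 + 10    # [1] 11 12 13

# Lengths 3 and 4: the result is still computed, but R warns
fred <- 1:3
wilma <- 3:6
msg <- tryCatch(fred * wilma, warning = function(w) conditionMessage(w))
msg
```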

Vector Dimensions

a <- 1:4
dim(a)

NULL

dim(a) <- c(2,2)
a

     [,1] [,2]
[1,]    1    3
[2,]    2    4
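
Setting dim() on a vector is one way to get a matrix; matrix() does the same in one step. A small sketch (m and by.row are illustrative names):

```r
a <- 1:4
dim(a) <- c(2, 2)

# matrix() builds the same 2 x 2 matrix directly
m <- matrix(1:4, nrow = 2)
identical(a, m)   # [1] TRUE

# Values fill column by column unless byrow = TRUE
by.row <- matrix(1:4, nrow = 2, byrow = TRUE)
by.row[1, ]       # first row:    [1] 1 2
by.row[, 1]       # first column: [1] 1 3
```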

Not Available

NA, R's representation of missing data

NA

[1] NA

Not Available

can be used in calculations, but the result is then NA as well

v <- c(1, NA, 3)
sum(v)

[1] NA

Not Available

many functions support an na.rm parameter

v <- c(1, NA, 3)
sum(v, na.rm = TRUE)

[1] 4
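
Missing values are located with is.na(). A short sketch using the same vector as the slides:

```r
v <- c(1, NA, 3)

is.na(v)               # [1] FALSE  TRUE FALSE
sum(is.na(v))          # number of missing values: [1] 1
mean(v, na.rm = TRUE)  # [1] 2

# Comparing with NA yields NA, so test with is.na() rather than ==
NA == 1                # [1] NA
v[!is.na(v)]           # [1] 1 3
```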

Factors

factors, R's representation of categorical data

factor(c("Medium", "High"), c("Low", "Medium", "High"), ordered= TRUE)

[1] Medium High  
Levels: Low < Medium < High
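
An ordered factor stores integer codes plus the level labels, which is what makes comparisons possible. A minimal sketch built on the factor from the slide (f is an illustrative name):

```r
f <- factor(c("Medium", "High"),
            levels = c("Low", "Medium", "High"),
            ordered = TRUE)

levels(f)      # the labels, in order: "Low", "Medium", "High"
as.integer(f)  # position of each value in the levels: [1] 2 3
f < "High"     # comparisons follow the level order:   [1]  TRUE FALSE
table(f)       # counts per level, including the unused level "Low"
```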

Data Frames

data frames, a type that represents tabular data

c <- data.frame(married=c(TRUE,NA), medication=c("a", "b"), satisfaction=c(1,2))
c

  married medication satisfaction
1    TRUE          a            1
2      NA          b            2

Indexing Data Frames

can be indexed using numbers

c[,1]

[1] TRUE   NA

can be indexed using column names

c[, "married"]

[1] TRUE   NA

Indexing Data Frames (II)

can be indexed using the $ operator

c$medication

[1] a b
Levels: a b

can be indexed using vectors

c[1, 2:3]

  medication satisfaction
1          a            1
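
Rows are often selected by a condition instead of a position. A sketch on a frame shaped like the one on the slides; the name survey and the third row are made up for illustration:

```r
survey <- data.frame(married      = c(TRUE, NA, FALSE),
                     medication   = c("a", "b", "a"),
                     satisfaction = c(1, 2, 3))

# Rows where a condition holds, all columns
survey[survey$satisfaction >= 2, ]

# Rows and a subset of columns at once
high <- survey[survey$satisfaction >= 2, c("medication", "satisfaction")]
nrow(high)   # [1] 2

# subset() expresses the same selection with less repetition
same <- subset(survey, satisfaction >= 2, select = c(medication, satisfaction))
nrow(same)   # [1] 2
```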

Data Frames: Column Names

names(c)

[1] "married"      "medication"   "satisfaction"

names(c)[1] <- "marital.status"
names(c)

[1] "marital.status" "medication"     "satisfaction"

Getting Data

files, CSV, delimited
databases, SQL, MongoDB
webpages
JSON
Excel, SAS, SPSS, Stata, Systat, Minitab

Reading CSV Files

git clone git://git.kernel.org/.../linux-stable.git
cd linux-stable
git checkout -b stable-v4.3.3 v4.3.3 
cloc --csv --out=linux-4.3.3.csv .

freebsd <- read.csv("freebsd-10.2.csv")
linux <- read.csv("linux-4.3.3.csv")
openbsd <- read.csv("openbsd-5.8.csv")
minix <- read.csv("minix-3.3.csv")
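
To try read.csv() without the cloc reports at hand, a tiny file can be written first. A sketch under stated assumptions: the file content is invented, only the column names follow the cloc output shown on the later slides, and csv.path and sample.counts are illustrative names:

```r
# Write a tiny stand-in for a cloc report to a temporary file
csv.path <- tempfile(fileext = ".csv")
writeLines(c("files,language,blank,comment,code",
             "3,C,10,20,300",
             "1,make,2,1,15"),
           csv.path)

# stringsAsFactors controls whether text columns become factors;
# its default changed from TRUE to FALSE in R 4.0.0
sample.counts <- read.csv(csv.path, stringsAsFactors = FALSE)

str(sample.counts)
sum(sample.counts$code)   # [1] 315
```

With the R version used for these slides the default was stringsAsFactors = TRUE, which is why language shows up as a Factor in the str() output of the real data.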

Getting to Know Data

Data structure

str(linux)

'data.frame':   23 obs. of  5 variables:
 $ files   : int  21978 16944 1406 176 2052 45 162 39 8 8 ...
 $ language: Factor w/ 23 levels "ASP.Net","Assembly",..: 6 8 2 21 13 17 5 18 23 10 ...
 $ blank   : int  2113094 410387 47343 3468 7791 4495 1629 1233 639 292 ...
 $ comment : int  1978872 703038 110270 242 7590 3557 3054 1261 355 289 ...
 $ code    : int  10809997 2610820 242570 51490 32681 23636 8878 7316 4311 1815 ...

Looking at Data

head(linux, 3)

  files     language   blank comment     code
1 21978            C 2113094 1978872 10809997
2 16944 C/C++ Header  410387  703038  2610820
3  1406     Assembly   47343  110270   242570

Looking at Data

tail(linux, 3)

   files                  language blank comment code
21     6                      XSLT    13      27   71
22     1                vim script     3      12   27
23     1 Windows Module Definition     0       0    8

Summarizing Data

cbind(summary(linux$files))

           [,1]
Min.        1.0
1st Qu.     1.5
Median      8.0
Mean     1870.0
3rd Qu.   104.0
Max.    22000.0

Summarizing Data

library(Hmisc)
desc <- data.frame(describe(linux$files)$counts)
names(desc) <- "Value"

Summarizing Data

desc

          Value
n            23
missing       0
unique       16
Info       0.98
Mean       1865
.05         1.0
.10         1.0
.25         1.5
.50         8.0
.75       104.0
.90      1922.8
.95     15454.8

Summarizing Data

library(pastecs)
stats <- data.frame(stat.desc(linux$files, basic=FALSE))
names(stats) <- "Value"

Summarizing Data

stats

                Value
median       8.00e+00
mean         1.87e+03
SE.mean      1.17e+03
CI.mean.0.95 2.43e+03
var          3.17e+07
std.dev      5.63e+03
coef.var     3.02e+00

Summarizing Data

stats <- data.frame(stat.desc(linux$files, basic=FALSE, desc=FALSE,norm=TRUE))
names(stats) <- "Value"

Summarizing Data

stats

              Value
skewness   2.79e+00
skew.2SE   2.90e+00
kurtosis   6.35e+00
kurt.2SE   3.40e+00
normtest.W 3.77e-01
normtest.p 6.90e-09

Summarizing Data

colSums(linux[,c(1,3:5)])

   files    blank  comment     code 
   42905  2591394  2808985 13799561

Summarizing Data

lines.per.file <- rowSums(linux[,3:5])/linux[,1]
mean(lines.per.file)

[1] 297

Extending Your Data

total.lines <- rowSums(linux[,3:5])
linux$total.lines <- total.lines
head(linux[,3:6], 3)

    blank comment     code total.lines
1 2113094 1978872 10809997    14901963
2  410387  703038  2610820     3724245
3   47343  110270   242570      400183

Sorting Data

head(linux[order(linux$total.lines, decreasing=TRUE),
           c(2,1,6)])

      language files total.lines
1            C 21978    14901963
2 C/C++ Header 16944     3724245
3     Assembly  1406      400183
4          XML   176       55200
5         make  2052       48062
6         Perl    45       31688
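
order() does not sort the data itself; it returns the row positions in sorted order, which are then used as a row index. A sketch with hypothetical counts (not taken from the cloc reports) that also shows a second sort key:

```r
counts <- data.frame(language = c("C", "Perl", "make", "XML"),
                     files    = c(5, 2, 5, 1),
                     stringsAsFactors = FALSE)

# Row positions in ascending order of files
order(counts$files)   # [1] 4 2 1 3

# Sort by files descending, then by language to break ties
sorted <- counts[order(-counts$files, counts$language), ]
sorted$language       # C, make, Perl, XML
```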

Merging Data

linux$project <- c("linux 4.3.3")
head(linux[,c(1,7)],4)

  files     project
1 21978 linux 4.3.3
2 16944 linux 4.3.3
3  1406 linux 4.3.3
4   176 linux 4.3.3

Merging Data

all <- rbind(freebsd, minix, linux, openbsd)
total.lines <- rowSums(all[,3:5])
all$total.lines <- total.lines
head(all[,5:7], 3)

     code      project total.lines
1 3436448 FreeBSD 10.2     4747900
2 1543770 FreeBSD 10.2     2111553
3   45515 FreeBSD 10.2       72528

Plots

totalsPerProject <- with(all, aggregate(total.lines, list(project), sum))
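
The list interface of aggregate() names its result columns Group.1 and x, which is why the next slide refers to them. A sketch with hypothetical totals showing that interface next to the formula interface, which keeps the original column names (counts, by.list and by.formula are illustrative names):

```r
counts <- data.frame(project     = c("A", "A", "B"),
                     total.lines = c(10, 5, 7))

# List interface, as on the slide
by.list <- with(counts, aggregate(total.lines, list(project), sum))
names(by.list)           # [1] "Group.1" "x"

# Formula interface: columns keep their original names
by.formula <- aggregate(total.lines ~ project, data = counts, FUN = sum)
names(by.formula)        # [1] "project"     "total.lines"
by.formula$total.lines   # [1] 15  7
```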

Plots

barplot(totalsPerProject$x, names.arg=totalsPerProject$Group.1, las=1)

Plots

barplot(sort(table(all$language), decreasing = TRUE), las=2)

Plots

with(all[all$project=="linux 4.3.3",], barplot(total.lines, names.arg=language, las=2, log="y"))

Plots

with(all, boxplot(total.lines ~ project, log="y",col=rainbow(4)))

Plots

grouped <- aggregate(all[,c(3,4,5)], by=list(all$project), sum)
names(grouped)[1] <- "project"

Plots

bars <- data.frame(t(grouped[,2:4]))
names(bars) <- grouped$project
totals <- colSums(bars)
relative <- t(t(bars)/totals)
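
The double transpose is needed because dividing a matrix by a vector runs down the columns. A sketch with hypothetical counts (rows are line types, columns are projects) that compares the slide's approach with prop.table(); shares is an illustrative name:

```r
bars <- data.frame(A = c(10, 30, 60), B = c(5, 5, 10),
                   row.names = c("blank", "comment", "code"))
totals <- colSums(bars)

# Transpose so each project is a row, divide each row by its total,
# then transpose back
relative <- t(t(bars) / totals)

# prop.table() with margin = 2 computes the same column shares
shares <- prop.table(as.matrix(bars), margin = 2)
all.equal(relative, shares, check.attributes = FALSE)   # [1] TRUE

# Every column now sums to 1
colSums(relative)
```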

Plots

barplot(as.matrix(relative), col=rainbow(3), legend.text = row.names(relative))

Plots

barplot(sort(relative[2,], decreasing =TRUE))

Plots

set.seed(1234)
x <- seq(-10,10, 0.1)
y <- -2 + 3*rnorm(201)

Plots

plot(x,y)
abline(lm(y~x))
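
The line drawn by abline() comes from the coefficients of the fitted model, which can be inspected directly. A sketch that repeats the data generation from the previous slide so it runs on its own; fit is an illustrative name:

```r
set.seed(1234)
x <- seq(-10, 10, 0.1)
y <- -2 + 3 * rnorm(201)

fit <- lm(y ~ x)

# Intercept and slope of the fitted line
coef(fit)

# y was generated without any dependence on x, so the slope should be
# close to 0, the intercept close to -2, and R squared close to 0
summary(fit)$r.squared
```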

Plots

hist(y)
rug(y)

Plots

plot(density(y))

Plots

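# scatter plot in the left 80% of the device; no new=TRUE here,
# because there is no existing plot to add to yet
par(fig=c(0,0.8,0,0.8))
plot(x,y)
# box plot of y in the right margin, added to the same page
par(fig=c(0.65,1,0,0.8), new=TRUE)
boxplot(y, axes=FALSE)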
