R: for your plotting and statistical needs

Bas Bossink
2-2-2016

Agenda

  • Introduction
    • What is R?
    • History
  • The R Language
  • Getting Data
  • Getting to Know Data

Agenda (II)

  • Plots
  • Reproducible Research
  • The R Ecosystem
  • Resources

Introduction

What is R?

R is a free software environment for statistical computing and graphics.

What is R?

  • R is a dialect of the S language.

What is S?

  • S is a language that was developed by John Chambers and others at Bell Labs.
  • S was initiated in 1976 as an internal statistical analysis environment, originally implemented as Fortran libraries.
  • Early versions of the language did not contain functions for statistical modeling.

What is S?

  • In 1988 the system was rewritten in C and began to resemble the system that we have today (this was Version 3 of the language). The book Statistical Models in S by Chambers and Hastie (the white book) documents the statistical analysis functionality.
  • Version 4 of the S language was released in 1998 and is the version we use today. The book Programming with Data by John Chambers (the green book) documents this version of the language.

Historical Notes

  • In 1993 Bell Labs gave StatSci (now Insightful Corp.) an exclusive license to develop and sell the S language.
  • In 2004 Insightful purchased the S language from Lucent for $2 million.
  • In 2006, Alcatel purchased Lucent Technologies; the combined company is now called Alcatel-Lucent.

Historical Notes

  • Insightful sells its implementation of the S language under the product name S-PLUS and has built a number of fancy features (GUIs, mostly) on top of it—hence the “PLUS”.
  • In 2008 Insightful was acquired by TIBCO for $25 million.
  • The fundamentals of the S language itself have not changed dramatically since 1998.
  • In 1998, S won the Association for Computing Machinery’s Software System Award.

S Philosophy

In “Stages in the Evolution of S”, John Chambers writes:

“[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.”

http://www.stat.bell-labs.com/S/history.html

Back to R

  • 1991: Created in New Zealand by Ross Ihaka and Robert Gentleman. Their experience developing R is documented in a 1996 JCGS paper.
  • 1993: First announcement of R to the public.
  • 1995: Martin Mächler convinces Ross and Robert to use the GNU General Public License to make R free software.
  • 1996: Public mailing lists are created (R-help and R-devel).

Back to R

  • 1997: The R Core Group is formed (containing some people associated with S-PLUS). The core group controls the source code for R.
  • 2000: R version 1.0.0 is released.
  • 2015: R version 3.2.3 is released in December 2015.

Features of R

  • Syntax is very similar to S, making it easy for S-PLUS users to switch over.
  • Semantics are superficially similar to S, but in reality are quite different (more on that later).
  • Runs on almost any standard computing platform/OS (even on the PlayStation 3)
  • Frequent releases (annual + bugfix releases); active development.

Features of R (cont'd)

  • Quite lean, as far as software goes; functionality is divided into modular packages
  • Graphics capabilities very sophisticated and better than most stat packages.
  • Useful for interactive work, but contains a powerful programming language for developing new tools (user -> programmer)
  • Very active and vibrant user community; R-help and R-devel mailing lists and Stack Overflow

Features of R (cont'd)

Demo

The R Language

The R Language: Help

  • # is the comment character
  • use the help function for help
help("plot") 
?plot
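
Besides help() and ?, base R has a few other lookup helpers. A minimal sketch; the search pattern is only an illustration:

```r
# Find functions whose names match a regular expression.
hits <- apropos("^read\\.csv")
print(hits)

# Other helpers, shown as comments because they open an interactive viewer:
#   help.search("regression")   # keyword search through help pages (also ??regression)
#   example(mean)               # run the examples from a help page
```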

Logicals

TRUE
[1] TRUE
FALSE
[1] FALSE

Characters

"Hello, world!" 
[1] "Hello, world!"

Numeric Values

13.37
[1] 13.37
37
[1] 37

Vectors

  • c(), 'concatenate', creates a vector
  • can hold values of a single type
c(1,2,3) 
[1] 1 2 3

Vectors (II)

  • mixed types are coerced to a common type
c(1, TRUE, "Flintstone") 
[1] "1"          "TRUE"       "Flintstone"
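
Coercion follows a fixed order, from the most specific type to the most general one. A small sketch of that order (the variable names are made up for illustration):

```r
# Coercion picks the most general type: logical < integer < double < character.
a <- c(TRUE, 2L)     # logical + integer   -> integer
b <- c(2L, 3.5)      # integer + double    -> double
d <- c(3.5, "x")     # double  + character -> character

print(typeof(a))
print(typeof(b))
print(typeof(d))

# Explicit conversion; values that cannot be converted become NA (with a warning).
n <- suppressWarnings(as.numeric(c("1", "two")))
print(n)
```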

Indexing Vectors

c(1,2)[1]
[1] 1
c(3,4)[2]
[1] 4
c()[1]
NULL
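
Indexing accepts more than a single position. A short sketch of the common forms, using a made-up named vector:

```r
v <- c(first = 10, second = 20, third = 30)

v[2]          # by position (indices start at 1)
v[c(1, 3)]    # several positions at once
v[-1]         # a negative index drops that element
v[v > 15]     # a logical vector keeps the matching elements
v["third"]    # by name
v[5]          # out of range gives NA rather than an error
```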

Lists

  • can hold values of different types
list(1,TRUE)
[[1]]
[1] 1

[[2]]
[1] TRUE

Lists (II)

str(list(1,TRUE))
List of 2
 $ : num 1
 $ : logi TRUE

Indexing Lists

list(1,TRUE)[[2]]
[1] TRUE
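
Single and double brackets behave differently on lists, which is a common source of confusion. A sketch with a hypothetical named list:

```r
p <- list(name = "Fred", age = 35, married = TRUE)

single  <- p["age"]     # single bracket: a list of length 1
element <- p[["age"]]   # double bracket: the element itself
same    <- p$age        # $ is shorthand for [[ ]] with a name

print(class(single))
print(class(element))
print(identical(element, same))
```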

Ranges

  • : range operator
1:4
[1] 1 2 3 4
  • range operator also works in reverse order
6:3
[1] 6 5 4 3

Sequences

  • generate sequences with a certain step size
seq(0, 1, 0.21) 
[1] 0.00 0.21 0.42 0.63 0.84
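
seq() can also be told how many points to produce instead of the step size, and seq_len() / seq_along() are safer than 1:n when n may be zero. A brief sketch:

```r
s1 <- seq(0, 1, by = 0.25)        # fixed step size
s2 <- seq(0, 1, length.out = 5)   # fixed number of points
s3 <- seq_len(3)                  # 1, 2, 3
s4 <- seq_along(c("a", "b"))      # one index per element: 1, 2
empty <- seq_len(0)               # zero-length, unlike 1:0 which is c(1, 0)

print(s1)
print(s2)
print(length(empty))
```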

Vector Operations

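  • arithmetic operators work element by element on vectors
  • values of the shorter vector are 'recycled'; R warns when the longer length is not a multiple of the shorter one, as it does here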
fred <- 1:3
wilma <- 3:6
fred * wilma
[1]  3  8 15  6
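
Recycling is easiest to see when the longer length is a multiple of the shorter one, in which case there is no warning. A sketch with made-up vectors:

```r
short <- c(1, 10)
long  <- 1:6

# The shorter vector is repeated: 1*1, 2*10, 3*1, 4*10, 5*1, 6*10
product <- long * short
print(product)

# A single value is recycled across the whole vector.
doubled <- long * 2
print(doubled)
```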

Vector Dimensions

a <- 1:4
dim(a)
NULL
dim(a) <- c(2,2)
a
     [,1] [,2]
[1,]    1    3
[2,]    2    4
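
Setting dim() fills the matrix column by column. matrix() does the same by default but can also fill row by row. A small sketch:

```r
m <- matrix(1:6, nrow = 2)                 # filled column by column
r <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled row by row

print(dim(m))    # 2 rows, 3 columns
print(m[2, 3])   # row 2, column 3
print(r[2, 1])
print(t(m))      # transpose: 3 rows, 2 columns
```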

Not Available

  • NA, R's representation of missing data
NA
[1] NA

Not Available

  • can be used in calculations, where it propagates to the result
v <- c(1, NA, 3)
sum(v)
[1] NA

Not Available

  • many functions support na.rm parameter
v <- c(1, NA, 3)
sum(v, na.rm = TRUE)
[1] 4
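
Missing values are detected with is.na(), since comparing with == NA itself yields NA. A short sketch:

```r
v <- c(1, NA, 3)

missing <- is.na(v)            # TRUE where a value is missing
print(missing)
print(sum(missing))            # number of missing values
print(mean(v, na.rm = TRUE))   # mean of the values that are present
print(v[!is.na(v)])            # drop the missing values
print(v == NA)                 # all NA: not a usable test
```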

Factors

  • factors, R's representation of categorical data
factor(c("Medium", "High"), c("Low", "Medium", "High"), ordered= TRUE)
[1] Medium High  
Levels: Low < Medium < High
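
A factor stores integer codes plus a table of levels, and an ordered factor also supports comparisons. A sketch with a slightly longer, made-up vector:

```r
sizes <- factor(c("Medium", "High", "Low", "High"),
                levels = c("Low", "Medium", "High"), ordered = TRUE)

print(levels(sizes))       # the allowed categories, in order
print(as.integer(sizes))   # underlying integer codes
print(table(sizes))        # counts per level
print(sizes >= "Medium")   # comparisons work because the factor is ordered
```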

Data Frames

  • data frames, a type that represents tabular data
c <- data.frame(married=c(TRUE,NA), medication=c("a", "b"), satisfaction=c(1,2))
c
  married medication satisfaction
1    TRUE          a            1
2      NA          b            2

Indexing Data Frames

  • can be indexed using numbers
c[,1] 
[1] TRUE   NA
  • can be indexed using column names
c[, "married"] 
[1] TRUE   NA

Indexing Data Frames (II)

  • can be indexed using the $ operator
c$medication 
[1] a b
Levels: a b
  • can be indexed using vectors
c[1, 2:3] 
  medication satisfaction
1          a            1
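
Rows are often selected with a logical condition rather than a number. A sketch on a hypothetical data frame (named people here to avoid masking the c() function):

```r
people <- data.frame(married = c(TRUE, NA, FALSE),
                     medication = c("a", "b", "a"),
                     satisfaction = c(1, 2, 5))

# Keep rows where a condition holds, and only some columns.
happy <- people[people$satisfaction > 1, c("medication", "satisfaction")]
print(happy)

# subset() expresses the same idea with column names used directly.
on_a <- subset(people, medication == "a", select = satisfaction)
print(on_a)
```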

Data Frames: Column Names

names(c)
[1] "married"      "medication"   "satisfaction"
names(c)[1] <- "marital.status"
names(c)
[1] "marital.status" "medication"     "satisfaction"  

Getting Data

Getting Data

  • files, CSV, delimited
  • databases, SQL, MongoDB
  • webpages
  • JSON
  • Excel, SAS, SPSS, Stata, Systat, Minitab

Reading CSV Files

  • count lines of code per language with cloc (shell):
git clone git://git.kernel.org/.../linux-stable.git
cd linux-stable
git checkout -b stable-v4.3.3 v4.3.3
cloc --csv --out=linux-4.3.3.csv .
  • read the resulting CSV files into R:
freebsd <- read.csv("freebsd-10.2.csv")
linux <- read.csv("linux-4.3.3.csv")
openbsd <- read.csv("openbsd-5.8.csv")
minix <- read.csv("minix-3.3.csv")
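
To try read.csv() without the files above, the text argument accepts CSV content directly. A minimal sketch; the numbers are invented and only mimic the column layout of the cloc output:

```r
# Hypothetical stand-in for a cloc CSV file, so the example runs without files.
csv_text <- "files,language,blank,comment,code
3,C,10,20,300
1,make,2,0,15"

loc <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(loc)

# write.csv() goes the other way; row.names = FALSE omits the row-number column.
out <- tempfile(fileext = ".csv")
write.csv(loc, out, row.names = FALSE)
again <- read.csv(out, stringsAsFactors = FALSE)
print(all.equal(loc, again))
```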

Getting to Know Data

Data structure

str(linux)
'data.frame':   23 obs. of  5 variables:
 $ files   : int  21978 16944 1406 176 2052 45 162 39 8 8 ...
 $ language: Factor w/ 23 levels "ASP.Net","Assembly",..: 6 8 2 21 13 17 5 18 23 10 ...
 $ blank   : int  2113094 410387 47343 3468 7791 4495 1629 1233 639 292 ...
 $ comment : int  1978872 703038 110270 242 7590 3557 3054 1261 355 289 ...
 $ code    : int  10809997 2610820 242570 51490 32681 23636 8878 7316 4311 1815 ...
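
A few other base functions answer the same questions as str() one at a time. A sketch on a tiny, made-up data frame:

```r
loc <- data.frame(files = c(3L, 1L), language = c("C", "make"),
                  code = c(300L, 15L))

print(dim(loc))             # rows, columns
print(nrow(loc))
print(names(loc))
print(sapply(loc, class))   # class of each column
```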

Looking at Data

head(linux, 3)
  files     language   blank comment     code
1 21978            C 2113094 1978872 10809997
2 16944 C/C++ Header  410387  703038  2610820
3  1406     Assembly   47343  110270   242570

Looking at Data

tail(linux, 3)
   files                  language blank comment code
21     6                      XSLT    13      27   71
22     1                vim script     3      12   27
23     1 Windows Module Definition     0       0    8

Summarizing Data

cbind(summary(linux$files))
           [,1]
Min.        1.0
1st Qu.     1.5
Median      8.0
Mean     1870.0
3rd Qu.   104.0
Max.    22000.0
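
The large gap between median and mean above is typical of skewed data. A sketch with invented numbers that shows the same effect, plus quantile() for arbitrary cut points:

```r
x <- c(1, 1, 2, 8, 100, 2000)

print(summary(x))                        # min, quartiles, mean, max
print(quantile(x, c(0.05, 0.5, 0.95)))   # arbitrary quantiles
print(median(x))                         # barely affected by the large values
print(mean(x))                           # pulled up by the large values
```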

Summarizing Data

library(Hmisc)
desc <- data.frame(describe(linux$files)$counts)
names(desc) <- "Value"

Summarizing Data

desc
          Value
n            23
missing       0
unique       16
Info       0.98
Mean       1865
.05         1.0
.10         1.0
.25         1.5
.50         8.0
.75       104.0
.90      1922.8
.95     15454.8

Summarizing Data

library(pastecs)
stats <- data.frame(stat.desc(linux$files, basic=FALSE))
names(stats) <- "Value"

Summarizing Data

stats
                Value
median       8.00e+00
mean         1.87e+03
SE.mean      1.17e+03
CI.mean.0.95 2.43e+03
var          3.17e+07
std.dev      5.63e+03
coef.var     3.02e+00

Summarizing Data

stats <- data.frame(stat.desc(linux$files, basic=FALSE, desc=FALSE,norm=TRUE))
names(stats) <- "Value"

Summarizing Data

stats
              Value
skewness   2.79e+00
skew.2SE   2.90e+00
kurtosis   6.35e+00
kurt.2SE   3.40e+00
normtest.W 3.77e-01
normtest.p 6.90e-09

Summarizing Data

colSums(linux[,c(1,3:5)])
   files    blank  comment     code 
   42905  2591394  2808985 13799561 

Summarizing Data

lines.per.file <- rowSums(linux[,3:5])/linux[,1]
mean(lines.per.file)
[1] 297
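
Note that this is the mean of the per-language ratios, which weights every language equally. Dividing total lines by total files gives a different number (from the column sums on the previous slide, roughly 19.2 million lines over 42905 files, so around 447). A sketch of the difference with invented counts:

```r
# Hypothetical counts: one language with many files, one with a single big file.
loc <- data.frame(files = c(100, 1), total.lines = c(1000, 500))

per.language <- loc$total.lines / loc$files   # 10 and 500 lines per file
print(mean(per.language))                     # average of the per-language ratios

overall <- sum(loc$total.lines) / sum(loc$files)
print(overall)                                # lines per file across all files
```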

Extending Your Data

total.lines <- rowSums(linux[,3:5])
linux$total.lines <- total.lines
head(linux[,3:6], 3)
    blank comment     code total.lines
1 2113094 1978872 10809997    14901963
2  410387  703038  2610820     3724245
3   47343  110270   242570      400183

Sorting Data

head(linux[
  order(linux$total.lines, 
      decreasing=TRUE),
    c(2,1,6)])
      language files total.lines
1            C 21978    14901963
2 C/C++ Header 16944     3724245
3     Assembly  1406      400183
4          XML   176       55200
5         make  2052       48062
6         Perl    45       31688

Merging Data

linux$project <- c("linux 4.3.3")
head(linux[,c(1,7)],4)
  files     project
1 21978 linux 4.3.3
2 16944 linux 4.3.3
3  1406 linux 4.3.3
4   176 linux 4.3.3

Merging Data

all <- rbind(freebsd, minix, linux, openbsd)
total.lines <- rowSums(all[,3:5])
all$total.lines <- total.lines
head(all[,5:7], 3)
     code      project total.lines
1 3436448 FreeBSD 10.2     4747900
2 1543770 FreeBSD 10.2     2111553
3   45515 FreeBSD 10.2       72528
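
rbind() stacks data frames and expects them to have the same columns, so each one needs its project column before stacking. A sketch with two invented data frames, with merge() shown for contrast:

```r
a <- data.frame(language = c("C", "make"), code = c(300, 15))
b <- data.frame(language = "C", code = 120)

# Tag each data frame with its origin before stacking.
a$project <- "project A"
b$project <- "project B"

both <- rbind(a, b)   # columns are matched by name
print(both)

# merge() joins side by side on a key column instead of stacking rows.
wide <- merge(a, b, by = "language", suffixes = c(".a", ".b"))
print(wide)
```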

Plots

Plots

totalsPerProject <- with(all, aggregate(total.lines, list(project), sum))
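
The call above produces columns named Group.1 and x. As an alternative sketch, the formula interface of aggregate() keeps the original column names, and tapply() returns a named result. The data here is invented:

```r
all.loc <- data.frame(project = c("A", "A", "B"),
                      total.lines = c(10, 20, 5))

# Formula interface: the result keeps the original column names.
by.formula <- aggregate(total.lines ~ project, data = all.loc, FUN = sum)
print(by.formula)

# tapply() returns a named array with one entry per project.
by.tapply <- tapply(all.loc$total.lines, all.loc$project, sum)
print(by.tapply)
```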

Plots

barplot(totalsPerProject$x, names=totalsPerProject$Group.1, las=1)

Plots

barplot(sort(table(all$language), decreasing = TRUE), las=2)

Plots

with(all[all$project=="linux 4.3.3",], barplot(total.lines, names=language, las=2, log ="y"))

Plots

with(all, boxplot(total.lines ~ project, log="y",col=rainbow(4)))

Plots

grouped <- aggregate(all[,c(3,4,5)], by=list(all$project), sum)
names(grouped)[1] <- "project"

Plots

bars <- data.frame(t(grouped[,2:4]))
names(bars) <- grouped$project
totals <- colSums(bars)
relative <- t(t(bars)/totals)
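
The double transpose is needed because dividing a matrix by a vector recycles the vector down the columns, while the totals belong to the columns. A sketch, on an invented matrix, showing that prop.table() and sweep() give the same result:

```r
bars <- matrix(c(10, 30, 60,
                  5,  5, 10), nrow = 3,
               dimnames = list(c("blank", "comment", "code"), c("A", "B")))

manual   <- t(t(bars) / colSums(bars))           # the double-transpose idiom
viaprop  <- prop.table(bars, margin = 2)         # margin = 2 means "per column"
viasweep <- sweep(bars, 2, colSums(bars), "/")   # divide each column by its total

print(viaprop)
print(colSums(viaprop))                          # each column sums to 1
```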

Plots

barplot(as.matrix(relative), col=rainbow(3), legend.text = row.names(relative))

Plots

barplot(sort(relative[2,], decreasing =TRUE))

Plots

set.seed(1234)
x <- seq(-10,10, 0.1)
y <- -2 + 3*rnorm(201)

Plots

plot(x,y)
abline(lm(y~x))
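
The fitted line can also be inspected numerically. Since y is generated without any dependence on x, the slope should come out close to 0 and the intercept close to -2. A sketch that repeats the data generation so it runs on its own:

```r
set.seed(1234)
x <- seq(-10, 10, 0.1)
y <- -2 + 3 * rnorm(201)

fit <- lm(y ~ x)
print(coef(fit))                # intercept and slope of the fitted line
print(summary(fit)$r.squared)   # share of the variance explained by x
```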

Plots

hist(y)
rug(y) 

Plots

plot(density(y))

Plots

par(fig=c(0,0.8,0,0.8))
plot(x,y)
par(fig=c(0.65,1,0,0.8),new=TRUE)
boxplot(y, axes=FALSE)

Reproducible Research

Reproducible Research

  • make research reproducible:
    • independently repeated
    • validated
  • 'open-source' research:
    • provide all code
    • provide all data

Reproducible Research Resources

Reproducible Research Tools

The R Ecosystem

Resources

Learning R: Free Texts

Learning R: Learning by doing

Learning R: Online resources

Learning R: Books

Podcasts

Questions?

Thank you