Introduction to R

Notes from the Johns Hopkins Coursera Data Science Specialisation

R Basics

Assignment

x <- 1

Printing to screen

print(x)
# or just
x

Creation of an integer sequence

Vector of values 1 to 20

x <- 1:20

Objects / Data Types

R has five basic atomic classes of objects

  • Character
  • Numeric (real numbers)
    • NaN - Not a number, e.g. 0 / 0
    • Inf - Infinity, e.g. 1 / 0
  • Integer
    • Integers must be requested explicitly because R creates a numeric by default. Use the suffix L, for example x <- 1L
  • Complex (1 + 4i)
  • Logical (TRUE, FALSE)

The basic object is a vector. A vector can only contain objects of a single type. Lists can contain objects of different types.
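One way to check which atomic type R has chosen is class(). The sketch below is illustrative only; the names pos_inf and not_num are introduced here and are not part of the original notes.

```r
# Inspect the class of each basic atomic type
class("a")      # "character"
class(1)        # "numeric"
class(1L)       # "integer"
class(1 + 4i)   # "complex"
class(TRUE)     # "logical"

# Special numeric values
pos_inf <- 1 / 0   # Inf
not_num <- 0 / 0   # NaN
is.nan(not_num)    # TRUE
```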

Vectors and Lists

# Example of using c() to perform concatenations on different types
x <- c(0.1, 0.3)    # Numeric
x <- c(TRUE, FALSE) # Logical

# Create an empty vector of length 10 - this will initialise the vector with default values (0 for numeric)
x <- vector("numeric", length = 10)

# Vectors cannot hold mixed types, but mixing types does not raise an error.
# Instead R coerces every element to a common type.
# The example below becomes a character vector
x <- c(1.7, "a")

# Explicit Coercion / Casting
x <- 0:6        # create integer sequence
as.numeric(x)   # convert to a numeric sequence
as.character(x) # convert to a character sequence

# Lists example - note that a list can contain mixed types
x <- list(1, "a", TRUE, 1 + 4i)
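Returning to coercion, the following sketch shows how the implicit and explicit rules play out in a few more cases. The variable y is purely illustrative, and suppressWarnings is used only to keep the output quiet.

```r
# Implicit coercion picks the most general type present
class(c(1.7, "a"))   # "character"
class(c(TRUE, 2))    # "numeric" (TRUE becomes 1)
class(c("a", TRUE))  # "character"

# Explicit coercion that cannot succeed yields NA (normally with a warning)
y <- suppressWarnings(as.numeric(c("1", "b")))
y                    # [1]  1 NA
```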

Matrices

A matrix is a vector with a dimension attribute. The dimension attribute is itself an integer vector of length two (number of rows, number of columns)

# create a new matrix with two rows and three columns
x <- matrix(nrow = 2, ncol = 3)

# dim() returns the dimension attribute (rows, then columns)
dim(x)    # [1] 2 3

# create a matrix populated with a sequence one to six
# Note: matrices are filled column by column (down each column), not row by row.
x <- matrix(1:6, nrow = 2, ncol = 3)

# A vector can be transformed into a matrix by adding a dimension to its attributes
x <- 1:10
dim(x) <- c(2,5)  # two rows and five columns

# They can also be created by performing column-binding or row-binding
x <- 1:3
y <- 10:12
cbind(x, y) # take these two vectors and bind them as two separate columns
rbind(x, y) # take these two vectors and bind them as two separate rows
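The column-first fill order mentioned above can be changed with the byrow argument of matrix(). A small sketch comparing the two (by_col and by_row are names chosen for this example):

```r
# Default: fill down each column
by_col <- matrix(1:6, nrow = 2, ncol = 3)
# byrow = TRUE: fill across each row instead
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by_col[1, ]  # [1] 1 3 5
by_row[1, ]  # [1] 1 2 3
```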

Factors

A factor is a vector representing categorical data

  • This data can be ordered or unordered
  • Can be thought of as an integer vector where each integer has a label
  • Are self-describing, so generally better than using integers. Male and Female make more sense than 1 and 2 for gender data.

# creation of a new factor
x <- factor(c("yes", "yes", "no"))

# frequency of occurrence
table(x)

# Strip the class to reveal the underlying integer codes (the levels are kept as an attribute)
unclass(x)

# Note that R sets the baseline (first) level to whatever comes first alphabetically.
# In the example below that would be no. To force R to use yes as the baseline
# you can specify the order through the levels argument (important in linear modelling)
x <- factor(
  c("yes", "yes", "no"),
  levels = c("yes", "no")
)
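To see the effect of the levels argument, the sketch below compares the default ordering with an explicit one. The names f_default and f_custom are introduced here for illustration.

```r
f_default <- factor(c("yes", "yes", "no"))
f_custom  <- factor(c("yes", "yes", "no"), levels = c("yes", "no"))

levels(f_default)      # [1] "no"  "yes"
levels(f_custom)       # [1] "yes" "no"

# The underlying integer codes follow the level order
as.integer(f_default)  # [1] 2 2 1
as.integer(f_custom)   # [1] 1 1 2
```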

Missing Values

  • NA - Not set / missing. NA values also have a class, so there are integer NAs, character NAs and so on.
  • NaN - Not a number. A NaN value is also NA, but an NA value is not NaN

# Logical tests: is.na() and is.nan()
x <- c(1, 2, NA, NaN)
is.na(x)  # [1] FALSE FALSE  TRUE  TRUE
is.nan(x) # [1] FALSE FALSE FALSE  TRUE

Data Frames

  • Data frames are used to store tabular data.
  • They are stored as a special type of list with each column being the same length.
  • Each column can have different data types.
  • Every row of a data frame has a name (the row.names attribute).

# Create a data frame with two columns, foo and bar
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

# list the number of rows
nrow(x)
# list the number of columns
ncol(x)

Names Attribute

x <- 1:3
# By default the integer sequence will not have any names associated with the values
names(x)  # NULL

# Elements can be named though
names(x) <- c("foo", "bar", "something")
names(x) # [1] "foo" "bar" "something"

# Lists can also have names
x <- list(a = 1, b = 2)

# So can matrices - set them with dimnames(), which takes list(vector of row names, vector of column names)
m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))

Reading Data

Common reading methods in R

  • read.table and read.csv - used for reading in tabular data
  • readLines - for reading in lines from a text file
  • source - reading in R code (inverse of dump)
  • dget - reading in R code files (inverse of dput)
  • unserialize - reading in R objects in a binary form

# Commonly used arguments (the values shown are examples, not the defaults)
read.table(
  file              = "",   # name of the file or a connection
  header            = TRUE, # does the first line hold the column names
  sep               = ",",  # the field separator, e.g. a comma for csv
  colClasses        = c(),  # character vector giving the class of each column
  nrows             = 5,    # maximum number of rows to read
  comment.char      = "",   # the comment character ("" turns comment handling off)
  skip              = 0,    # number of lines to skip at the start of the file
  stringsAsFactors  = TRUE  # convert character columns to factors (default is FALSE since R 4.0.0)
)

# "Generally" you can call read.table with only the file param
read.table("test.txt")

# read.csv defaults to a comma separator and header = TRUE
read.csv("test.csv")

Large Datasets

  • Set comment.char = "" if the file contains no comment lines
  • Set nrows if possible - this helps R with memory management (a mild overestimate is fine)
  • Set colClasses to the expected data types so that R does not have to infer them. You can also sample the data and then set the class types before performing a large read

initial <- read.table("data.txt", nrows = 100)
classes <- sapply(initial, class)
full <- read.table("data.txt", colClasses = classes)
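The same sampling approach can be tried end to end on a throwaway file. This is a minimal sketch under stated assumptions: the file is generated in a temporary location, and the column names id, score and label as well as the variables path, sample_rows, col_classes and full_data are all invented for the example.

```r
# Write a small sample file so the example is self-contained
path <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:200, score = runif(200), label = "a"),
            path, row.names = FALSE)

# Infer the column classes from the first few rows only
sample_rows <- read.table(path, header = TRUE, nrows = 10,
                          stringsAsFactors = FALSE)
col_classes <- sapply(sample_rows, class)

# Reuse the inferred classes for the full read
full_data <- read.table(path, header = TRUE, colClasses = col_classes)

col_classes      # id: "integer", score: "numeric", label: "character"
nrow(full_data)  # 200
unlink(path)     # remove the temporary file
```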

Textual Formats

  • Data formats that store metadata, such as each object's data type, alongside the data.
  • Two examples are dump / source and dput / dget

Generally

  • Not very space efficient
  • Work nicely with version control

# dput
y <- data.frame(a = 1, b = "a")
dput(y) # prints the R code needed to rebuild y to the console

# dump / source
x <- "foo"
y <- data.frame(a = 1, b = "a")
dump(c("x", "y"), file = "data.R")
rm(x, y) # remove the variables that were created
source("data.R") # load in the dumped data
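dput can also write to a file, which dget then reads back. A minimal round-trip sketch, assuming a temporary file (path and y_restored are names introduced here):

```r
path <- tempfile(fileext = ".R")

y <- data.frame(a = 1, b = "a")
dput(y, file = path)      # write the deparsed object to a file
y_restored <- dget(path)  # read it back in

all.equal(y, y_restored)  # TRUE
unlink(path)              # remove the temporary file
```

Unlike dump / source, dget returns the object rather than recreating the variable, so the result has to be assigned to a name.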

Connecting to external data

  • file - open a file
  • gzfile, bzfile - opens a compressed gzip / bzip2 file
  • url - opens a connection to a webpage

# read in a csv file
con <- file("data.txt", "r")
data <- read.csv(con)
close(con)

# read the first 10 lines of a gzip-compressed file
con <- gzfile("words.gz")
x <- readLines(con, 10)
close(con)

# read a webpage
con <- url("http://www.google.com.au", "r")
x <- readLines(con)
close(con)
head(x)

Subsetting

# Vectors
# Single brackets always return an object of the same class as the original.
# For example, subsetting the vector below with single brackets returns another vector
x <- c("a", "b", "c", "d", "e")
x[1]          # [1] "a"
x[1:2]        # [1] "a" "b"
x[x > "d"]    # [1] "e"
u <- x > "d"  # [1] FALSE FALSE FALSE FALSE TRUE
x[u]          # [1] "e"


# Lists
x <- list(foo = 1:4, bar = 0.6, baz = "hello")
x[1]          # $foo [1] 1 2 3 4
x[[1]]        # [1] 1 2 3 4

x$bar         # [1] 0.6
x[["bar"]]    # [1] 0.6
x["bar"]      # $bar [1] 0.6

x[c(1, 3)]    # $foo [1] 1 2 3 4
              # $baz [1] "hello"

# Double brackets must be used instead of $ when the name is computed
name <- "foo"
x[[name]]     # [1] 1 2 3 4
x$name        # NULL (looks for an element literally called name)


# Matrices
x <- matrix(1:6, 2, 3)
# By default, selecting a single element, row or column drops the matrix structure and returns a vector
x[1, 2]       # [1] 3
x[1, ]        # [1] 1 3 5
x[ ,2]        # [1] 3 4

# This can be turned off with drop = FALSE, which keeps the result as a matrix
x[1, 2, drop = FALSE]  # a 1 x 1 matrix


# Partial Matching
x <- list(foo = 1:4, bar = 0.6, baz = "hello")
# find foo with a partial match
x$f           # [1] 1 2 3 4

# Note that the double bracket operator by default looks for exact matches
x[["f"]]      # NULL
x[["f", exact = FALSE]] # [1] 1 2 3 4

Removing NA values

x <- c(1, 2, NA)
bad <- is.na(x)
x[!bad]       # [1] 1 2
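When several vectors need to be filtered together, complete.cases() returns TRUE only for positions where none of them is missing. A short sketch with made-up data (good is a name chosen for the example):

```r
x <- c(1, 2, NA, 4)
y <- c("a", NA, "c", "d")

# TRUE where both x and y are present
good <- complete.cases(x, y)
good      # [1]  TRUE FALSE FALSE  TRUE

x[good]   # [1] 1 4
y[good]   # [1] "a" "d"
```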

for

x <- c("a", "b", "c", "d")

for(i in 1:4) {
  print(x[i])
}

for(i in seq_along(x)) {
  # Example of next - skip the rest of the current iteration
  if(x[i] == "b") {
    next
  }

  print(x[i])
}

for(letter in x) {
  print(letter)
}

for(i in 1:4) print(x[i])

while

count <- 0

while(count < 10) {
  print(count)
  count <- count + 1
}

repeat

count <- 0
repeat {
  if(count < 10) {
    count <- count + 1
  } else {
    break
  }
}

Functions

# Basic function
add2 <- function(x, y) {
  x + y
}

# Returns the elements of x that are greater than n; the argument n has a default value of 10
above <- function(x, n = 10) {
  use <- x > n
  x[use]
}

# Calculate the mean of each column of a matrix (returns a vector of column means)
columnmean <- function(y, removeNA = TRUE) {
  nc <- ncol(y)
  means <- numeric(nc)

  for(i in seq_len(nc)) {
    # mean() takes na.rm (not na.remove) to drop missing values
    means[i] <- mean(y[, i], na.rm = removeNA)
  }

  means
}
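Base R also ships colMeans(), which gives a handy cross-check for a hand-written column mean. A small sketch with an invented matrix m that contains one missing value:

```r
m <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 3, ncol = 2)

# na.rm = TRUE drops the NA before averaging each column
col_avg <- colMeans(m, na.rm = TRUE)
col_avg  # [1] 1.5 5.0

# Without na.rm the first column's mean is NA
colMeans(m)
```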

Handy methods

# Create the sequence 1..n from an integer n. Similar to 1:5 but safe when n is 0; can be paired with nrow
x <- seq_len(5)
x # [1] 1 2 3 4 5

Loop Functions

  • lapply - loop over a list and evaluate a function on each of the elements
  • sapply - same as lapply but try to simplify the result
  • apply - apply a function over the margins of an array
  • tapply - (table apply) apply a function over subsets of a vector
  • mapply - multivariate version of lapply
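tapply and mapply have no section of their own in these notes, so here is a brief sketch of each. The data and the names values, groups, group_means and reps are invented for illustration.

```r
# tapply: apply a function to subsets of a vector defined by a grouping vector
values <- c(1, 2, 3, 10, 20, 30)
groups <- c("a", "a", "a", "b", "b", "b")
group_means <- tapply(values, groups, mean)
group_means
#  a  b
#  2 20

# mapply: call a function with the i-th element of each argument in turn
# Equivalent to list(rep(1, 3), rep(2, 2), rep(3, 1))
reps <- mapply(rep, 1:3, 3:1)
reps
```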

lapply

lapply takes a list (or will attempt to coerce its input to a list) and always returns a list. The example below takes a list with elements a and b and returns the mean of each element.

x <- list(a = 1:5, b = rnorm(10))

lapply(x, mean)

# $a
# [1] 3
#
# $b
# [1] 0.0296824 (this value changes on every run)

x <- 1:4
# The first argument of runif is how many random numbers to generate.
# Here lapply passes 1 through 4 to runif, so the first element of the result
# is a vector of length 1, the second a vector of length 2 and so on.
# Note that the arguments after the function name (min and max) are passed straight through to runif
lapply(x, runif, min = 0, max = 10)

sapply

sapply will try to simplify the result of lapply, for example by returning a vector when every element of the result has length 1
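A small sketch of the difference, using an invented list (as_list and as_vector are names chosen for the example):

```r
x <- list(a = 1:4, b = c(10, 20, 30))

as_list   <- lapply(x, mean)  # a list with elements $a and $b
as_vector <- sapply(x, mean)  # simplified to a named numeric vector

as_vector
#    a    b
#  2.5 20.0
```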

apply

apply is used to evaluate a function over the margins of an array (for a matrix, margin 1 is the rows and margin 2 is the columns)
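As an illustration, the sketch below applies a function over each margin of a small matrix (m, row_sums and col_means are names introduced here):

```r
m <- matrix(1:6, nrow = 2, ncol = 3)

row_sums  <- apply(m, 1, sum)   # margin 1: one result per row
col_means <- apply(m, 2, mean)  # margin 2: one result per column

row_sums   # [1]  9 12
col_means  # [1] 1.5 3.5 5.5
```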

Debugging

  • traceback - prints the call stack after an error occurs
  • debug - flags a function for debug mode
  • browser - suspends the execution of a function at the point where it is called

Generating random numbers

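  • rnorm - generate random normal variates with a given mean and standard deviation
  • dnorm - evaluate the normal probability density at a point (or vector of points)
  • pnorm - evaluate the cumulative distribution function for a normal distribution
  • rpois - generate random Poisson variates with a given rate

Probability distributions normally have four functions associated with them, prefixed with d (for density), r (for random number generation), p (for cumulative distribution) and q (for quantile function)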
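For the standard normal distribution the d, p and q functions can be checked against values that are known exactly. A brief sketch (the variable names are illustrative only):

```r
density_at_0 <- dnorm(0)     # 1 / sqrt(2 * pi), roughly 0.3989
prob_below_0 <- pnorm(0)     # P(X <= 0) = 0.5
median_value <- qnorm(0.5)   # the 50% quantile is 0

# q is the inverse of p
round_trip <- qnorm(pnorm(1.5))  # 1.5
```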

set.seed(1)
rnorm(5) # returns 5 random values

rnorm(5) # returns 5 different values

set.seed(1)
rnorm(5) # returns the same 5 values that were generated the first time

# draw a random sample
sample(1:10, 4) # pick 4 entries from the vector 1-10 without replacement
sample(letters, 4) # pick 4 random letters from the alphabet
sample(1:10, replace = TRUE) # sample with replacement, so values can repeat (e.g. two 1s)

Basic functions

List what is in the current working directory

dir()
# R's current working directory
getwd()

# set the working directory
setwd("path/to/the/wd")

Load in an R file

source("mycode.R")