Notes From Advanced R Part Two

Liang / 2018-12-06

1. OO (Object-oriented) field guide

R has four object oriented systems:

base type: are mostly manipulated using C code
S3 type: S3 is a very casual system. It has no formal definition of classes.
S4 type: S4 has formal class definitions and multiple dispatch
RC (Reference Classes) type

A class defines the behaviour of objects by describing their attributes and their relationship to other classes.

pryr provides tools to pry back the surface of R and dig into the details.

Base types

new base types are added very rarely. determine the object’s base type with typeof().

# The type of a function is "closure"
f <- function() {}
typeof(f)

## [1] "closure"

is.function(f)

## [1] TRUE

# The type of a primitive function is "builtin"
typeof(sum)

## [1] "builtin"

is.primitive(sum)

## [1] TRUE

Functions for different base types are almost always written in C. S3 objects can be built on top of any base type, S4 objects use a special base type, and RC objects are a combination of S4 and environments.

To see if an object is a pure base type, check that is.object(x) returns FALSE.

if the object x has the R internal OBJECT, the is.object() return True

is.object(f)

## [1] FALSE

is.object(1) # FALSE

## [1] FALSE

is.object(as.factor(1:3)) # TRUE

## [1] TRUE

S3 type

S3 is the only OO system used in the base and stats packages.

pryr::otype(): can be used to test if an object is an S3 object in base R.

library(pryr)

## 
## Attaching package: 'pryr'

## The following object is masked _by_ '.GlobalEnv':
## 
##     f

## The following objects are masked from 'package:purrr':
## 
##     compose, partial

df <- data.frame(x = 1:10, y = letters[1:10])
otype(df)    # A data frame is an S3 class

## [1] "S3"

otype(df$x)  # A numeric vector isn't

## [1] "base"

otype(df$y)  # A factor is

## [1] "S3"

In S3, methods belong to functions, called generic functions. S3 methods do not belong to objects or classes.

ftype() can describes the object system and to determinate if a function is an S3 method or generic:

ftype(mean)

## [1] "s3"      "generic"

ftype(t.data.frame) # data frame method for t()

## [1] "s3"     "method"

ftype(t.test)       # generic function for t tests

## [1] "s3"      "generic"

Defining classes and creating objects

# Create and assign class in one step
foo <- structure(list(), class = "foo")
# ftype(foo)
# Create, then set class
foo <- list()
class(foo) <- "foo"
foo

## list()
## attr(,"class")
## [1] "foo"

determine the class of any object using class(x), and see if an object inherits from a specific class using inherits(x, "classname").

class(foo)

## [1] "foo"

inherits(foo, "foo")

## [1] TRUE

we can change the class of existing objects:

Creating new methods and generics

To add a new generic, just need to create a function that calls UseMethod().

f <- function(x) UseMethod("f")
f.a <- function(x) "Class a"
f.b <- function(x) "Class b"
a <- structure(list(), class = "a")
class(a)

## [1] "a"

f(a)

## [1] "Class a"

f.a(a)

## [1] "Class a"

f.b(a)

## [1] "Class b"

b <- structure(list(), class = "b")
class(b)

## [1] "b"

f(b)

## [1] "Class b"

f.a(b)

## [1] "Class a"

f.b(b)

## [1] "Class b"

f <- function(x) UseMethod("f")
f.a <- function(x) "Class a"
f.default <- function(x) "Unknown class"

# Method for a class
f(structure(list(), class = "a"))

## [1] "Class a"

# No method for b class, so uses method for a class
f(structure(list(), class = c("b", "a")))

## [1] "Class b"

# No method for c class, so falls back to default
f(structure(list(), class = "c"))

## [1] "Unknown class"

# Force R to call the wrong method, f.a should only use for a class
f.a(structure(list(), class = "c"))

## [1] "Class a"

S4 type

Classes have formal definitions which describe their fields and inheritance structures (parent classes).
Method dispatch can be based on multiple arguments to a generic function
There is a special operator, @, for extracting slots from an S4 object.

Recognising objects, generic functions, and methods

No S4 classes in the base packages

library(stats4)

# From example(mle)
y <- c(26, 17, 13, 12, 20, 5, 9, 8, 5, 4, 8)
nLL <- function(lambda) - sum(dpois(y, lambda, log = TRUE))
fit <- mle(nLL, start = list(lambda = 5), nobs = length(y))

# An S4 object
isS4(fit)

## [1] TRUE

otype(fit)

## [1] "S4"

ftype(nobs)

## [1] "s4"      "generic"

# nobs is a function to extract the number of observations from a Fit.

Defining classes and creating objects

An S4 class has three key properties:

A name: an alpha-numeric class identifier.
A named list of slots (fields), which defines slot names and permitted classes.
A string giving the class it inherits from, or, in S4 terminology, that it contains.

setClass("Person",
  slots = list(name = "character", age = "numeric"))
setClass("Employee",
  slots = list(boss = "Person"),
  contains = "Person")

alice <- new("Person", name = "Alice", age = 40)
john <- new("Employee", name = "John", age = 20, boss = alice)

alice

## An object of class "Person"
## Slot "name":
## [1] "Alice"
## 
## Slot "age":
## [1] 40

alice@name

## [1] "Alice"

john@boss; slot(john, "boss")

## An object of class "Person"
## Slot "name":
## [1] "Alice"
## 
## Slot "age":
## [1] 40

## An object of class "Person"
## Slot "name":
## [1] "Alice"
## 
## Slot "age":
## [1] 40

(@ is equivalent to $, and slot() to [[.) but the $ and slot() doesn't work for S4

# john$boss

Creating new methods and generics

we could take union(), which usually just works on vectors, and make it work with data frames:

alist <- c("a", "b", "a")
blist <- c("a", "c", "d")
union(alist, blist)

## [1] "a" "b" "c" "d"

x <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora"))
y <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John","Alice"))
union(x, y)

## Warning: Column `Name` joining factors with different levels, coercing to
## character vector

##   SN Age  Name
## 1  2  15 Alice
## 2  2  15  Dora
## 3  1  21  John

setGeneric("union")

## [1] "union"

setMethod("union",
  c(x = "data.frame", y = "data.frame"),
  function(x, y) {
    unique(rbind(x, y))
  }
)

union(x, y)

##   SN Age  Name
## 1  1  21  John
## 2  2  15  Dora
## 4  2  15 Alice

2. Environments

Environment basics

an environment serve as a bag of names: Each name points to an object stored in memory:

e <- new.env()
e$a <- FALSE
e$b <- "a"
e$c <- 2.3
e$d <- 1:3
ls(e) # show all object in the environment

## [1] "a" "b" "c" "d"

multiple names can point to the same object, Every environment has a parent, another environment.

an environment is similar to a list, with four important exceptions:

Every name in an environment is unique.
The names in an environment are not ordered (i.e., it doesn’t make sense to ask what the first element of an environment is).
An environment has a parent.
Environments have reference semantics.

four special environments:

globalenv(), or global environment, is the interactive workspace. This is the environment in which you normally work.
baseenv(), or base environment, is the environment of the base package. Its parent is the empty environment.
emptyenv(), or empty environment, is the ultimate ancestor of all environments
environment() is the current environment.

search() lists all parents of the global environment.

search()

##  [1] ".GlobalEnv"        "package:stats4"    "package:pryr"     
##  [4] "package:stats"     "package:graphics"  "package:grDevices"
##  [7] "package:utils"     "package:datasets"  "package:magrittr" 
## [10] "package:forcats"   "package:stringr"   "package:dplyr"    
## [13] "package:purrr"     "package:readr"     "package:tidyr"    
## [16] "package:tibble"    "package:ggplot2"   "package:tidyverse"
## [19] "package:methods"   "Autoloads"         "package:base"

Autoloads is used to save memory by only loading package objects

use new.env() to create an environment, use parent.env() to see its parent

e <- new.env()
# the default parent provided by new.env() is environment
parent.env(e)

## <environment: R_GlobalEnv>

e$a <- 1
e$b <- 2
ls(e)

## [1] "a" "b"

e$a

## [1] 1

useful way to view an environment is ls.str()

ls.str(e)

## a :  num 1
## b :  num 2

extract the value from a name

e$a; e[["a"]]; get("a", envir = e)

## [1] 1

## [1] 1

## [1] 1

use rm() to remove the binding.

rm("a", envir = e)
ls.str(e)

## b :  num 2

use exists() to determine if a variable in an environment

x <- 10
exists("x", envir = e)

## [1] TRUE

identical(globalenv(), environment())

## [1] TRUE

Recursing over environments

where() finds the environment where that name is defined, using R’s regular scoping rules:

library(pryr)
x <- 5
where("x")

## <environment: R_GlobalEnv>

where("mean")

## <environment: base>

definition of where()

where

## function (name, env = parent.frame()) 
## {
##     stopifnot(is.character(name), length(name) == 1)
##     env <- to_env(env)
##     if (identical(env, emptyenv())) {
##         stop("Can't find ", name, call. = FALSE)
##     }
##     if (exists(name, env, inherits = FALSE)) {
##         env
##     }
##     else {
##         where(name, parent.env(env))
##     }
## }
## <bytecode: 0x7ff93ebb6e40>
## <environment: namespace:pryr>

Not Interesting

Explicit environments

the original list is not changed because modifying a list actually creates and modifies a copy.

modify <- function(x) {
  x$a <- 2
  invisible()
}

x_l <- list()
x_l$a <- 1
modify(x_l)
x_l$a

## [1] 1

apply it to an environment, the original environment is modified:

x_e <- new.env()
x_e$a <- 1
modify(x_e)
x_e$a

## [1] 2

3. Functional programming

R is a functional programming (FP) language.

three building blocks of functional programming

anonymous functions
closures (functions written by functions)
lists of functions.

# Generate a sample dataset
set.seed(1014)
df <- data.frame(replicate(6, sample(c(1:10, -99), 6, rep = TRUE)))
names(df) <- letters[1:6]
df

##    a  b c   d   e f
## 1  1  6 1   5 -99 1
## 2 10  4 4 -99   9 3
## 3  7  9 5   4   1 4
## 4  2  9 3   8   6 8
## 5  1 10 5   9   8 6
## 6  6  2 1   3   8 5

fix_missing <- function(x) {
  x[x == -99] <- NA
  x
}
df[] <- lapply(df, fix_missing)
df

##    a  b c  d  e f
## 1  1  6 1  5 NA 1
## 2 10  4 4 NA  9 3
## 3  7  9 5  4  1 4
## 4  2  9 3  8  6 8
## 5  1 10 5  9  8 6
## 6  6  2 1  3  8 5

code have more advantages

Anonymous functions

# the anonymous function is same
(function(x) x + 3)(10)

## [1] 13

f <- function(x) x + 3
f(10)

## [1] 13

Closures

closure have two

# two variable inside
power <- function(exponent) {
  function(x) {
    x ^ exponent
  }
}

square <- power(2) # exponent is 2 
square(3) # x is 3

## [1] 9

print a closure, can see anything useful:

square

## function(x) {
##     x ^ exponent
##   }
## <environment: 0x7ff93f8d1b70>

the function itself doesn’t change, the difference is the enclosing environment

environment(square) %>% as.list()

## $exponent
## [1] 2

another way to see the structure

library(pryr)
unenclose(square)

## function (x) 
## {
##     x^2
## }

<<- the double arrow operator will keep looking up the chain of parent environments until it finds a matching name

new_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1
    i
  }
}

counter_one <- new_counter()
unenclose(counter_one)

## function () 
## {
##     0 <<- 0 + 1
##     0
## }

counter_one()

## [1] 1

unenclose(counter_one)

## function () 
## {
##     1 <<- 1 + 1
##     1
## }

counter_one()

## [1] 2

Lists of functions

functions can be stored in lists.

compute_mean <- list(
  base = function(x) mean(x),
  sum = function(x) sum(x) / length(x),
  manual = function(x) {
    total <- 0
    n <- length(x)
    for (i in seq_along(x)) {
      total <- total + x[i] / n
    }
    total
  }
)

to call each function use lapply() and need an anonymous function or a new named function,

x <- runif(1e5) # return a uniform distribution data
lapply(compute_mean, function(f) f(x))

## $base
## [1] 0.4994771
## 
## $sum
## [1] 0.4994771
## 
## $manual
## [1] 0.4994771

lapply(compute_mean, function(f) system.time(f(x)))

## $base
##    user  system elapsed 
##   0.001   0.000   0.001 
## 
## $sum
##    user  system elapsed 
##   0.002   0.000   0.002 
## 
## $manual
##    user  system elapsed 
##   0.006   0.000   0.006

x <- 1:10
funs <- list(
  sum = sum,
  mean = mean,
  median = median
)
lapply(funs, function(f) f(x))

## $sum
## [1] 55
## 
## $mean
## [1] 5.5
## 
## $median
## [1] 5.5

# put a function in the aggrement of a function
midpoint <- function(f, a, b) {
  (b - a) * f((a + b) / 2)
}

trapezoid <- function(f, a, b) {
  (b - a) / 2 * (f(a) + f(b))
}

midpoint(sin, 0, pi)

## [1] 3.141593

trapezoid(sin, 0, pi)

## [1] 1.923671e-16

4. Functionals

A higher-order function is a function that takes a function as an input or returns a function as output. The complement to a closure is a functional

randomise <- function(f) f(runif(1e3)) # f is a function as input
randomise(mean)

## [1] 0.5078925

lapply() takes a function, applies it to each element in a list, and returns the results in the form of a list.

lapply2 <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}

# Create some random data
l <- replicate(20, runif(sample(1:10, 1)), simplify = FALSE)

# With lapply
unlist(lapply(l, length))

##  [1]  8  3  7  4  2  2  8  4  7  6  1  6  7  8  2 10  8  6  8  3

# What class is each column?
unlist(lapply(mtcars, class))

##       mpg       cyl      disp        hp      drat        wt      qsec 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##        vs        am      gear      carb 
## "numeric" "numeric" "numeric" "numeric"

# rcauchy return a cauchy distribution
trims <- c(0, 0.1, 0.2, 0.5)
x <- rcauchy(1000)
unlist(lapply(trims, function(trim) mean(x, trim = trim)))

## [1] 0.31527829 0.07795881 0.04608098 0.04652580

sapply and vapply

vapply() takes an additional argument specifying the output type.

sapply(mtcars, is.numeric)

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

args(vapply)

## function (X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE) 
## NULL

vapply(mtcars, is.numeric, FUN.VALUE = logical(1))

##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

df <- data.frame(x = 1:10, y = letters[1:10])
sapply(df, class)

##         x         y 
## "integer"  "factor"

vapply(df, class, character(1))

##         x         y 
## "integer"  "factor"

df2 <- data.frame(x = 1:10, y = Sys.time() + 1:10)
sapply(df2, class)

## $x
## [1] "integer"
## 
## $y
## [1] "POSIXct" "POSIXt"

Multiple inputs: Map (and mapply)

# Generate some sample data
xs <- replicate(5, runif(10), simplify = FALSE)
ws <- replicate(5, rpois(10, 5) + 1, simplify = FALSE)

compute the unweighted means:

sapply(xs, mean)

## [1] 0.6038653 0.5604356 0.5082535 0.3454798 0.4212558

calculate weighted.mean()

unlist(lapply(seq_along(xs), function(i) {
  weighted.mean(xs[[i]], ws[[i]])
}))

## [1] 0.5771461 0.5824162 0.4680431 0.3401923 0.4564215

# or 
unlist(lapply(1:5, function(i) {
  weighted.mean(xs[[i]], ws[[i]])
}))

## [1] 0.5771461 0.5824162 0.4680431 0.3401923 0.4564215

Map, a variant of lapply()

Map(weighted.mean, xs, ws)  %>% unlist() # Map(f, ...)

## [1] 0.5771461 0.5824162 0.4680431 0.3401923 0.4564215

a <- matrix(1:20, nrow = 5)
apply(a, 1, mean)

## [1]  8.5  9.5 10.5 11.5 12.5

Function operators

chatty <- function(f) {
  function(x, ...) {
    res <- f(x, ...)
    cat("Processing ", x, "\n", sep = "")
    res
  }
}
f <- function(x) x ^ 2
s <- c(3, 2, 1)
chatty(f)(1)

## Processing 1

## [1] 1

vapply(s, chatty(f), numeric(1))

## Processing 3
## Processing 2
## Processing 1

## [1] 9 4 1

Last modified on 2018-12-06