Notes From Advanced R Part Two

Liang / 2018-12-06


1. OO (Object-oriented) field guide

R has four object oriented systems:

  • base type: are mostly manipulated using C code
  • S3 type: S3 is a very casual system. It has no formal definition of classes.
  • S4 type: S4 has formal class definitions and multiple dispatch
  • RC (Reference Classes) type

A class defines the behaviour of objects by describing their attributes and their relationship to other classes.

pryr provides tools to pry back the surface of R and dig into the details.

Base types

new base types are added very rarely. determine the object’s base type with typeof().

# The type of a function is "closure"
f <- function() {}
typeof(f)
## [1] "closure"
is.function(f)
## [1] TRUE
# The type of a primitive function is "builtin"
typeof(sum)
## [1] "builtin"
is.primitive(sum)
## [1] TRUE

Functions for different base types are almost always written in C. S3 objects can be built on top of any base type, S4 objects use a special base type, and RC objects are a combination of S4 and environments.

To see if an object is a pure base type, check that is.object(x) returns FALSE.

if the object x has the R internal OBJECT, the is.object() return True

is.object(f)
## [1] FALSE
is.object(1) # FALSE
## [1] FALSE
is.object(as.factor(1:3)) # TRUE
## [1] TRUE

S3 type

S3 is the only OO system used in the base and stats packages.

pryr::otype(): can be used to test if an object is an S3 object in base R.

library(pryr)
## 
## Attaching package: 'pryr'
## The following object is masked _by_ '.GlobalEnv':
## 
##     f
## The following objects are masked from 'package:purrr':
## 
##     compose, partial
df <- data.frame(x = 1:10, y = letters[1:10])
otype(df)    # A data frame is an S3 class
## [1] "S3"
otype(df$x)  # A numeric vector isn't
## [1] "base"
otype(df$y)  # A factor is
## [1] "S3"

In S3, methods belong to functions, called generic functions. S3 methods do not belong to objects or classes.

ftype() can describes the object system and to determinate if a function is an S3 method or generic:

ftype(mean)
## [1] "s3"      "generic"
ftype(t.data.frame) # data frame method for t()
## [1] "s3"     "method"
ftype(t.test)       # generic function for t tests
## [1] "s3"      "generic"

Defining classes and creating objects

# Create and assign class in one step
foo <- structure(list(), class = "foo")
# ftype(foo)
# Create, then set class
foo <- list()
class(foo) <- "foo"
foo
## list()
## attr(,"class")
## [1] "foo"

determine the class of any object using class(x), and see if an object inherits from a specific class using inherits(x, "classname").

class(foo)
## [1] "foo"
inherits(foo, "foo")
## [1] TRUE

we can change the class of existing objects:

Creating new methods and generics

To add a new generic, just need to create a function that calls UseMethod().

f <- function(x) UseMethod("f")
f.a <- function(x) "Class a"
f.b <- function(x) "Class b"
a <- structure(list(), class = "a")
class(a)
## [1] "a"
f(a)
## [1] "Class a"
f.a(a)
## [1] "Class a"
f.b(a)
## [1] "Class b"
b <- structure(list(), class = "b")
class(b)
## [1] "b"
f(b)
## [1] "Class b"
f.a(b)
## [1] "Class a"
f.b(b)
## [1] "Class b"
f <- function(x) UseMethod("f")
f.a <- function(x) "Class a"
f.default <- function(x) "Unknown class"

# Method for a class
f(structure(list(), class = "a"))
## [1] "Class a"
# No method for b class, so uses method for a class
f(structure(list(), class = c("b", "a")))
## [1] "Class b"
# No method for c class, so falls back to default
f(structure(list(), class = "c"))
## [1] "Unknown class"
# Force R to call the wrong method, f.a should only use for a class
f.a(structure(list(), class = "c"))
## [1] "Class a"

S4 type

  • Classes have formal definitions which describe their fields and inheritance structures (parent classes).

  • Method dispatch can be based on multiple arguments to a generic function

  • There is a special operator, @, for extracting slots from an S4 object.

Recognising objects, generic functions, and methods

No S4 classes in the base packages

library(stats4)

# From example(mle)
y <- c(26, 17, 13, 12, 20, 5, 9, 8, 5, 4, 8)
nLL <- function(lambda) - sum(dpois(y, lambda, log = TRUE))
fit <- mle(nLL, start = list(lambda = 5), nobs = length(y))

# An S4 object
isS4(fit)
## [1] TRUE
otype(fit)
## [1] "S4"
ftype(nobs)
## [1] "s4"      "generic"
# nobs is a function to extract the number of observations from a Fit.

Defining classes and creating objects

An S4 class has three key properties:

  • A name: an alpha-numeric class identifier.
  • A named list of slots (fields), which defines slot names and permitted classes.
  • A string giving the class it inherits from, or, in S4 terminology, that it contains.
setClass("Person",
  slots = list(name = "character", age = "numeric"))
setClass("Employee",
  slots = list(boss = "Person"),
  contains = "Person")

alice <- new("Person", name = "Alice", age = 40)
john <- new("Employee", name = "John", age = 20, boss = alice)
alice
## An object of class "Person"
## Slot "name":
## [1] "Alice"
## 
## Slot "age":
## [1] 40
alice@name
## [1] "Alice"
john@boss; slot(john, "boss")
## An object of class "Person"
## Slot "name":
## [1] "Alice"
## 
## Slot "age":
## [1] 40
## An object of class "Person"
## Slot "name":
## [1] "Alice"
## 
## Slot "age":
## [1] 40

(@ is equivalent to $, and slot() to [[.) but the $ and slot() doesn't work for S4

# john$boss

Creating new methods and generics

we could take union(), which usually just works on vectors, and make it work with data frames:

alist <- c("a", "b", "a")
blist <- c("a", "c", "d")
union(alist, blist)
## [1] "a" "b" "c" "d"
x <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora"))
y <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John","Alice"))
union(x, y)
## Warning: Column `Name` joining factors with different levels, coercing to
## character vector
##   SN Age  Name
## 1  2  15 Alice
## 2  2  15  Dora
## 3  1  21  John
setGeneric("union")
## [1] "union"
setMethod("union",
  c(x = "data.frame", y = "data.frame"),
  function(x, y) {
    unique(rbind(x, y))
  }
)
union(x, y)
##   SN Age  Name
## 1  1  21  John
## 2  2  15  Dora
## 4  2  15 Alice

2. Environments

Environment basics

an environment serve as a bag of names: Each name points to an object stored in memory:

e <- new.env()
e$a <- FALSE
e$b <- "a"
e$c <- 2.3
e$d <- 1:3
ls(e) # show all object in the environment
## [1] "a" "b" "c" "d"

multiple names can point to the same object, Every environment has a parent, another environment.

an environment is similar to a list, with four important exceptions:

  • Every name in an environment is unique.

  • The names in an environment are not ordered (i.e., it doesn’t make sense to ask what the first element of an environment is).

  • An environment has a parent.

  • Environments have reference semantics.

four special environments:

  • globalenv(), or global environment, is the interactive workspace. This is the environment in which you normally work.

  • baseenv(), or base environment, is the environment of the base package. Its parent is the empty environment.

  • emptyenv(), or empty environment, is the ultimate ancestor of all environments

  • environment() is the current environment.

search() lists all parents of the global environment.

search()
##  [1] ".GlobalEnv"        "package:stats4"    "package:pryr"     
##  [4] "package:stats"     "package:graphics"  "package:grDevices"
##  [7] "package:utils"     "package:datasets"  "package:magrittr" 
## [10] "package:forcats"   "package:stringr"   "package:dplyr"    
## [13] "package:purrr"     "package:readr"     "package:tidyr"    
## [16] "package:tibble"    "package:ggplot2"   "package:tidyverse"
## [19] "package:methods"   "Autoloads"         "package:base"

Autoloads is used to save memory by only loading package objects

use new.env() to create an environment, use parent.env() to see its parent

e <- new.env()
# the default parent provided by new.env() is environment
parent.env(e)
## <environment: R_GlobalEnv>
e$a <- 1
e$b <- 2
ls(e)
## [1] "a" "b"
e$a
## [1] 1

useful way to view an environment is ls.str()

ls.str(e)
## a :  num 1
## b :  num 2

extract the value from a name

e$a; e[["a"]]; get("a", envir = e)
## [1] 1
## [1] 1
## [1] 1

use rm() to remove the binding.

rm("a", envir = e)
ls.str(e)
## b :  num 2

use exists() to determine if a variable in an environment

x <- 10
exists("x", envir = e)
## [1] TRUE
identical(globalenv(), environment())
## [1] TRUE

Recursing over environments

where() finds the environment where that name is defined, using R’s regular scoping rules:

library(pryr)
x <- 5
where("x")
## <environment: R_GlobalEnv>
where("mean")
## <environment: base>

definition of where()

where
## function (name, env = parent.frame()) 
## {
##     stopifnot(is.character(name), length(name) == 1)
##     env <- to_env(env)
##     if (identical(env, emptyenv())) {
##         stop("Can't find ", name, call. = FALSE)
##     }
##     if (exists(name, env, inherits = FALSE)) {
##         env
##     }
##     else {
##         where(name, parent.env(env))
##     }
## }
## <bytecode: 0x7ff93ebb6e40>
## <environment: namespace:pryr>
Not Interesting

Explicit environments

the original list is not changed because modifying a list actually creates and modifies a copy.

modify <- function(x) {
  x$a <- 2
  invisible()
}
x_l <- list()
x_l$a <- 1
modify(x_l)
x_l$a
## [1] 1

apply it to an environment, the original environment is modified:

x_e <- new.env()
x_e$a <- 1
modify(x_e)
x_e$a
## [1] 2

3. Functional programming

R is a functional programming (FP) language.

three building blocks of functional programming

  • anonymous functions
  • closures (functions written by functions)
  • lists of functions.
# Generate a sample dataset
set.seed(1014)
df <- data.frame(replicate(6, sample(c(1:10, -99), 6, rep = TRUE)))
names(df) <- letters[1:6]
df
##    a  b c   d   e f
## 1  1  6 1   5 -99 1
## 2 10  4 4 -99   9 3
## 3  7  9 5   4   1 4
## 4  2  9 3   8   6 8
## 5  1 10 5   9   8 6
## 6  6  2 1   3   8 5
fix_missing <- function(x) {
  x[x == -99] <- NA
  x
}
df[] <- lapply(df, fix_missing)
df
##    a  b c  d  e f
## 1  1  6 1  5 NA 1
## 2 10  4 4 NA  9 3
## 3  7  9 5  4  1 4
## 4  2  9 3  8  6 8
## 5  1 10 5  9  8 6
## 6  6  2 1  3  8 5

code have more advantages

Anonymous functions

# the anonymous function is same
(function(x) x + 3)(10)
## [1] 13
f <- function(x) x + 3
f(10)
## [1] 13

Closures

closure have two

# two variable inside
power <- function(exponent) {
  function(x) {
    x ^ exponent
  }
}

square <- power(2) # exponent is 2 
square(3) # x is 3
## [1] 9

print a closure, can see anything useful:

square
## function(x) {
##     x ^ exponent
##   }
## <environment: 0x7ff93f8d1b70>

the function itself doesn’t change, the difference is the enclosing environment

environment(square) %>% as.list()
## $exponent
## [1] 2

another way to see the structure

library(pryr)
unenclose(square)
## function (x) 
## {
##     x^2
## }

<<- the double arrow operator will keep looking up the chain of parent environments until it finds a matching name

new_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1
    i
  }
}

counter_one <- new_counter()
unenclose(counter_one)
## function () 
## {
##     0 <<- 0 + 1
##     0
## }
counter_one()
## [1] 1
unenclose(counter_one)
## function () 
## {
##     1 <<- 1 + 1
##     1
## }
counter_one()
## [1] 2

Lists of functions

functions can be stored in lists.

compute_mean <- list(
  base = function(x) mean(x),
  sum = function(x) sum(x) / length(x),
  manual = function(x) {
    total <- 0
    n <- length(x)
    for (i in seq_along(x)) {
      total <- total + x[i] / n
    }
    total
  }
)

to call each function use lapply() and need an anonymous function or a new named function,

x <- runif(1e5) # return a uniform distribution data
lapply(compute_mean, function(f) f(x))
## $base
## [1] 0.4994771
## 
## $sum
## [1] 0.4994771
## 
## $manual
## [1] 0.4994771
lapply(compute_mean, function(f) system.time(f(x)))
## $base
##    user  system elapsed 
##   0.001   0.000   0.001 
## 
## $sum
##    user  system elapsed 
##   0.002   0.000   0.002 
## 
## $manual
##    user  system elapsed 
##   0.006   0.000   0.006
x <- 1:10
funs <- list(
  sum = sum,
  mean = mean,
  median = median
)
lapply(funs, function(f) f(x))
## $sum
## [1] 55
## 
## $mean
## [1] 5.5
## 
## $median
## [1] 5.5
# put a function in the aggrement of a function
midpoint <- function(f, a, b) {
  (b - a) * f((a + b) / 2)
}

trapezoid <- function(f, a, b) {
  (b - a) / 2 * (f(a) + f(b))
}

midpoint(sin, 0, pi)
## [1] 3.141593
trapezoid(sin, 0, pi)
## [1] 1.923671e-16

4. Functionals

A higher-order function is a function that takes a function as an input or returns a function as output. The complement to a closure is a functional

randomise <- function(f) f(runif(1e3)) # f is a function as input
randomise(mean)
## [1] 0.5078925

lapply() takes a function, applies it to each element in a list, and returns the results in the form of a list.

lapply2 <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}

# Create some random data
l <- replicate(20, runif(sample(1:10, 1)), simplify = FALSE)

# With lapply
unlist(lapply(l, length))
##  [1]  8  3  7  4  2  2  8  4  7  6  1  6  7  8  2 10  8  6  8  3
# What class is each column?
unlist(lapply(mtcars, class))
##       mpg       cyl      disp        hp      drat        wt      qsec 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##        vs        am      gear      carb 
## "numeric" "numeric" "numeric" "numeric"
# rcauchy return a cauchy distribution
trims <- c(0, 0.1, 0.2, 0.5)
x <- rcauchy(1000)
unlist(lapply(trims, function(trim) mean(x, trim = trim)))
## [1] 0.31527829 0.07795881 0.04608098 0.04652580

sapply and vapply

vapply() takes an additional argument specifying the output type.

sapply(mtcars, is.numeric)
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
args(vapply)
## function (X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE) 
## NULL
vapply(mtcars, is.numeric, FUN.VALUE = logical(1))
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
df <- data.frame(x = 1:10, y = letters[1:10])
sapply(df, class)
##         x         y 
## "integer"  "factor"
vapply(df, class, character(1))
##         x         y 
## "integer"  "factor"
df2 <- data.frame(x = 1:10, y = Sys.time() + 1:10)
sapply(df2, class)
## $x
## [1] "integer"
## 
## $y
## [1] "POSIXct" "POSIXt"

Multiple inputs: Map (and mapply)

# Generate some sample data
xs <- replicate(5, runif(10), simplify = FALSE)
ws <- replicate(5, rpois(10, 5) + 1, simplify = FALSE)

compute the unweighted means:

sapply(xs, mean)
## [1] 0.6038653 0.5604356 0.5082535 0.3454798 0.4212558

calculate weighted.mean()

unlist(lapply(seq_along(xs), function(i) {
  weighted.mean(xs[[i]], ws[[i]])
}))
## [1] 0.5771461 0.5824162 0.4680431 0.3401923 0.4564215
# or 
unlist(lapply(1:5, function(i) {
  weighted.mean(xs[[i]], ws[[i]])
}))
## [1] 0.5771461 0.5824162 0.4680431 0.3401923 0.4564215

Map, a variant of lapply()

Map(weighted.mean, xs, ws)  %>% unlist() # Map(f, ...)
## [1] 0.5771461 0.5824162 0.4680431 0.3401923 0.4564215
a <- matrix(1:20, nrow = 5)
apply(a, 1, mean)
## [1]  8.5  9.5 10.5 11.5 12.5

Function operators

chatty <- function(f) {
  function(x, ...) {
    res <- f(x, ...)
    cat("Processing ", x, "\n", sep = "")
    res
  }
}
f <- function(x) x ^ 2
s <- c(3, 2, 1)
chatty(f)(1)
## Processing 1
## [1] 1
vapply(s, chatty(f), numeric(1))
## Processing 3
## Processing 2
## Processing 1
## [1] 9 4 1

Last modified on 2018-12-06