R整洁代码

KJY / 2022-12-05


R整洁代码

数据屏蔽:使得可以不用带数据框(环境变量)名字,就能使用数据框内的变量(数据变量),便 于在数据集内计算值

  • 环境变量(env-variables) ,一般你在Rstuido右上角的Environment中发现它。比如n <- 10这里的n

  • 数据变量(data-variables),一般指数据框的某个变量。比如data <- data.frame(x = 1, n = 2)中的data$n

但作为函数参数时的间接使用,正常是环境变量,要想作为数据变量使用,则需要用两个大括号括起来 {{var}}:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2
## Warning: package 'readr' was built under R version 4.0.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
var_summary = function(data, var) { 
  data %>% summarise(n = n(), mean = mean({{var}}))
}


mtcars %>%
  group_by(cyl) %>%
  var_summary(mpg)
## # A tibble: 3 × 3
##     cyl     n  mean
##   <dbl> <int> <dbl>
## 1     4    11  26.7
## 2     6     7  19.7
## 3     8    14  15.1

若是字符向量形式,想作为数据变量,则需要在函数体中使用 .data[[var]],这里 .data 是代替 数据集的代词:

var_summary = function(data, var) { data %>%
    summarise(n = n(), mean = mean(.data[[var]]))
}
mtcars %>%
  group_by(cyl) %>%
  var_summary("mpg")
## # A tibble: 3 × 3
##     cyl     n  mean
##   <dbl> <int> <dbl>
## 1     4    11  26.7
## 2     6     7  19.7
## 3     8    14  15.1

同样地,整洁选择作为函数参数时的间接使用,也需要用两个大括号括起来 {{vars}}:

summarise_mean = function(data, vars) { data %>%
    summarise(n = n(), across({{vars}}, mean))
}
mtcars %>%
  group_by(cyl) %>%
  summarise_mean(where(is.numeric))
## # A tibble: 3 × 12
##     cyl     n   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     4    11  26.7  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
## 2     6     7  19.7  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
## 3     8    14  15.1  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5

若是字符向量形式,则需要借助函数 all_of() 或 any_of(),取决于你的选择:

vars = c("mpg", "vs")
mtcars %>% select(all_of(vars))
##                      mpg vs
## Mazda RX4           21.0  0
## Mazda RX4 Wag       21.0  0
## Datsun 710          22.8  1
## Hornet 4 Drive      21.4  1
## Hornet Sportabout   18.7  0
## Valiant             18.1  1
## Duster 360          14.3  0
## Merc 240D           24.4  1
## Merc 230            22.8  1
## Merc 280            19.2  1
## Merc 280C           17.8  1
## Merc 450SE          16.4  0
## Merc 450SL          17.3  0
## Merc 450SLC         15.2  0
## Cadillac Fleetwood  10.4  0
## Lincoln Continental 10.4  0
## Chrysler Imperial   14.7  0
## Fiat 128            32.4  1
## Honda Civic         30.4  1
## Toyota Corolla      33.9  1
## Toyota Corona       21.5  1
## Dodge Challenger    15.5  0
## AMC Javelin         15.2  0
## Camaro Z28          13.3  0
## Pontiac Firebird    19.2  0
## Fiat X1-9           27.3  1
## Porsche 914-2       26.0  0
## Lotus Europa        30.4  1
## Ford Pantera L      15.8  0
## Ferrari Dino        19.7  0
## Maserati Bora       15.0  0
## Volvo 142E          21.4  1
mtcars %>% select(!all_of(vars))
##                     cyl  disp  hp drat    wt  qsec am gear carb
## Mazda RX4             6 160.0 110 3.90 2.620 16.46  1    4    4
## Mazda RX4 Wag         6 160.0 110 3.90 2.875 17.02  1    4    4
## Datsun 710            4 108.0  93 3.85 2.320 18.61  1    4    1
## Hornet 4 Drive        6 258.0 110 3.08 3.215 19.44  0    3    1
## Hornet Sportabout     8 360.0 175 3.15 3.440 17.02  0    3    2
## Valiant               6 225.0 105 2.76 3.460 20.22  0    3    1
## Duster 360            8 360.0 245 3.21 3.570 15.84  0    3    4
## Merc 240D             4 146.7  62 3.69 3.190 20.00  0    4    2
## Merc 230              4 140.8  95 3.92 3.150 22.90  0    4    2
## Merc 280              6 167.6 123 3.92 3.440 18.30  0    4    4
## Merc 280C             6 167.6 123 3.92 3.440 18.90  0    4    4
## Merc 450SE            8 275.8 180 3.07 4.070 17.40  0    3    3
## Merc 450SL            8 275.8 180 3.07 3.730 17.60  0    3    3
## Merc 450SLC           8 275.8 180 3.07 3.780 18.00  0    3    3
## Cadillac Fleetwood    8 472.0 205 2.93 5.250 17.98  0    3    4
## Lincoln Continental   8 460.0 215 3.00 5.424 17.82  0    3    4
## Chrysler Imperial     8 440.0 230 3.23 5.345 17.42  0    3    4
## Fiat 128              4  78.7  66 4.08 2.200 19.47  1    4    1
## Honda Civic           4  75.7  52 4.93 1.615 18.52  1    4    2
## Toyota Corolla        4  71.1  65 4.22 1.835 19.90  1    4    1
## Toyota Corona         4 120.1  97 3.70 2.465 20.01  0    3    1
## Dodge Challenger      8 318.0 150 2.76 3.520 16.87  0    3    2
## AMC Javelin           8 304.0 150 3.15 3.435 17.30  0    3    2
## Camaro Z28            8 350.0 245 3.73 3.840 15.41  0    3    4
## Pontiac Firebird      8 400.0 175 3.08 3.845 17.05  0    3    2
## Fiat X1-9             4  79.0  66 4.08 1.935 18.90  1    4    1
## Porsche 914-2         4 120.3  91 4.43 2.140 16.70  1    5    2
## Lotus Europa          4  95.1 113 3.77 1.513 16.90  1    5    2
## Ford Pantera L        8 351.0 264 4.22 3.170 14.50  1    5    4
## Ferrari Dino          6 145.0 175 3.62 2.770 15.50  1    5    6
## Maserati Bora         8 301.0 335 3.54 3.570 14.60  1    5    8
## Volvo 142E            4 121.0 109 4.11 2.780 18.60  1    4    2

以下则不合适

vars = c("mpg", "vs")
mtcars %>% select(vars)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(vars)` instead of `vars` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
##                      mpg vs
## Mazda RX4           21.0  0
## Mazda RX4 Wag       21.0  0
## Datsun 710          22.8  1
## Hornet 4 Drive      21.4  1
## Hornet Sportabout   18.7  0
## Valiant             18.1  1
## Duster 360          14.3  0
## Merc 240D           24.4  1
## Merc 230            22.8  1
## Merc 280            19.2  1
## Merc 280C           17.8  1
## Merc 450SE          16.4  0
## Merc 450SL          17.3  0
## Merc 450SLC         15.2  0
## Cadillac Fleetwood  10.4  0
## Lincoln Continental 10.4  0
## Chrysler Imperial   14.7  0
## Fiat 128            32.4  1
## Honda Civic         30.4  1
## Toyota Corolla      33.9  1
## Toyota Corona       21.5  1
## Dodge Challenger    15.5  0
## AMC Javelin         15.2  0
## Camaro Z28          13.3  0
## Pontiac Firebird    19.2  0
## Fiat X1-9           27.3  1
## Porsche 914-2       26.0  0
## Lotus Europa        30.4  1
## Ford Pantera L      15.8  0
## Ferrari Dino        19.7  0
## Maserati Bora       15.0  0
## Volvo 142E          21.4  1

使用数据屏蔽或整洁选择同时修改列名的用法:

my_summarise = function(data, mean_var, sd_var) { data %>%
summarise("mean_{{mean_var}}" := mean({{mean_var}}), "sd_{{sd_var}}" := sd({{sd_var}}))
}
mtcars %>%
  group_by(cyl) %>%
  my_summarise(mpg, disp)
## # A tibble: 3 × 3
##     cyl mean_mpg sd_disp
##   <dbl>    <dbl>   <dbl>
## 1     4     26.7    26.9
## 2     6     19.7    41.6
## 3     8     15.1    67.8
my_summarise = function(data, group_var, summarise_var) { data %>%
    group_by(across({{group_var}})) %>% summarise(across({{summarise_var}}, mean, .names = "mean_{.col}")) #整洁选择作为函数参数时的间接使用,也需要用两个大括号括起来 {{vars}}:
#环境变量要想作为数据变量使用,则需要用两个大括号括起来
}
mtcars %>%
  my_summarise(c(am, cyl), where(is.numeric))
## `summarise()` has grouped output by 'am'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 11
## # Groups:   am [2]
##      am   cyl mean_mpg mean_disp mean_hp mean_…¹ mean_wt mean_…² mean_vs mean_…³
##   <dbl> <dbl>    <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1     0     4     22.9     136.     84.7    3.77    2.94    21.0   1        3.67
## 2     0     6     19.1     205.    115.     3.42    3.39    19.2   1        3.5 
## 3     0     8     15.0     358.    194.     3.12    4.10    17.1   0        3   
## 4     1     4     28.1      93.6    81.9    4.18    2.04    18.4   0.875    4.25
## 5     1     6     20.6     155     132.     3.81    2.76    16.3   0        4.33
## 6     1     8     15.4     326     300.     3.88    3.37    14.6   0        5   
## # … with 1 more variable: mean_carb <dbl>, and abbreviated variable names
## #   ¹​mean_drat, ²​mean_qsec, ³​mean_gear

环境变量在函数中当作mtcars中的数据变量,即当做mtcars的一个列的名字来使用,就需要引用(quote)和解引用(unquote)两个工序:

第一步,用 enquo()把用户传递过来的参数引用起来(引用可以理解为冷冻起来)

第二步,用 !! 解开这个引用(解引用可以理解为解冷),然后使用参数的内容

这个quote-unquote的过程让环境变量名变成了数据变量,也可以理解为在函数评估过程中,数据变量(data-variable)遮盖了环境变量(env-variable),即数据遮盖(data masking),看到cyl,正常情况下,本来应该是到环境变量里去找这个cyl对应的值,然而,数据遮盖机制,插队了,让代码去数据变量中去找cyl以及对应的值。

group_var <-  quote(cyl)
summary_var <-  quote(mpg)
    
rlang::qq_show( 
    data %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!summary_var))
)
## data %>% group_by(cyl) %>% summarise(mean = mean(mpg))
group_var <-  quote(cyl)
summary_var <-  quote(mpg)
mtcars %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!summary_var))
## # A tibble: 3 × 2
##     cyl  mean
##   <dbl> <dbl>
## 1     4  26.7
## 2     6  19.7
## 3     8  15.1
grouped_mean = function(data, summary_var, group_var) { 
  summary_var = enquo(summary_var)
  group_var = enquo(group_var)
data %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!summary_var))
}
grouped_mean(mtcars, mpg, cyl)
## # A tibble: 3 × 2
##     cyl  mean
##   <dbl> <dbl>
## 1     4  26.7
## 2     6  19.7
## 3     8  15.1

是一样的,但是使用{{}}的代码更加简单些

grouped_mean = function(data, summary_var, group_var) { 
data %>%
    group_by({{group_var}}) %>%
    summarise(mean = mean({{summary_var}}))
}
grouped_mean(mtcars, mpg, cyl)
## # A tibble: 3 × 2
##     cyl  mean
##   <dbl> <dbl>
## 1     4  26.7
## 2     6  19.7
## 3     8  15.1

要想修改结果列名,可借助 as_label() 函数从引用中提取名字:

grouped_mean = function(data, summary_var, group_var) { 
  summary_var = enquo(summary_var)
  group_var = enquo(group_var)
  summary_nm = str_c("mean_", as_label(summary_var))
  group_nm = str_c("group_", as_label(group_var))
  
   data %>%
    group_by(!!group_nm := !!group_var) %>%
    summarise(!!summary_nm := mean(!!summary_var))
}
grouped_mean(mtcars, mpg, cyl)
## # A tibble: 3 × 2
##   group_cyl mean_mpg
##       <dbl>    <dbl>
## 1         4     26.7
## 2         6     19.7
## 3         8     15.1

我们希望输出的统计结果中,统计参数名加一个前缀 “avg_”, 可以分三步完成

获取引用参数的默认名 修改参数的默认名,比如加前缀或者后缀 !! 解引用并放在 := 左边

grouped_mean2 <- function(.data, .summary_var, ...) {
  summary_var <- enquo(.summary_var)
  group_vars <- enquos(...)

  # Get and modify the default name
  summary_nm <- as_label(summary_var)
  summary_nm <- paste0("avg_", summary_nm)

  .data %>%
    group_by(!!!group_vars) %>%
    summarise(!!summary_nm := mean(!!summary_var))  # Unquote the name
}

grouped_mean2(mtcars, disp, cyl, am)
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 3
## # Groups:   cyl [3]
##     cyl    am avg_disp
##   <dbl> <dbl>    <dbl>
## 1     4     0    136. 
## 2     4     1     93.6
## 3     6     0    205. 
## 4     6     1    155  
## 5     8     0    358. 
## 6     8     1    326

或者更简洁的办法

my_summarise <- function(data, group_var, summarise_var) {
  data %>%
    group_by(across({{ group_var }})) %>%
    summarise(across({{ summarise_var }}, mean, .names = "mean_{col}"))
}

my_summarise(starwars, species, height)
## # A tibble: 38 × 2
##    species   mean_height
##    <chr>           <dbl>
##  1 Aleena            79 
##  2 Besalisk         198 
##  3 Cerean           198 
##  4 Chagrian         196 
##  5 Clawdite         168 
##  6 Droid             NA 
##  7 Dug              112 
##  8 Ewok              88 
##  9 Geonosian        183 
## 10 Gungan           209.
## # … with 28 more rows

最后一次修改于 2022-12-05