class: title-slide, inverse, bottom
background-image: url(img/gradient-background.png)
background-size: cover

# R programming basics

### CFE R Training - Module 3

<br/>

María Paula Caldas and Jolien Noels

---
class: middle

# Useful links

[Slides](https://oecd-cfe-eds.github.io/cfe-r-training/03_programming.html), if you want to navigate on your own

[RStudio Project](https://rstudio.cloud/project/2940340), to try out the exercises

[Teams Space](https://teams.microsoft.com/l/team/19%3aewi8FvNssJHrCsxFDSJbA7IL4q4kGH0E8IRMfMp8zPA1%40thread.tacv2/conversations?groupId=c957fd70-0f85-4bcc-b3a4-e453919316de&tenantId=ac41c7d4-1f61-460d-b0f4-fc925a2b471c), for discussions

[GitHub repository](https://github.com/mpaulacaldas/cfe-r-training), for later reference

---
class: middle

# Housekeeping matters

🙋‍♀️ During the session, ask questions in the chat or raise your hand

📷 Sessions are recorded. Remember to turn off your camera if you prefer not to appear in the recording

💬 After the session, post follow-up questions, comments or reactions in the [Teams space](https://teams.microsoft.com/l/team/19%3aewi8FvNssJHrCsxFDSJbA7IL4q4kGH0E8IRMfMp8zPA1%40thread.tacv2/conversations?groupId=c957fd70-0f85-4bcc-b3a4-e453919316de&tenantId=ac41c7d4-1f61-460d-b0f4-fc925a2b471c)

📝 If you are going through these slides on your own, type `p` to see the presenter notes

???

The presenter notes are where you might also find OECD- or CFE-specific information.

---
class: middle

# Learning objectives for today

1. Understand the difference between vectors, lists and data frames and how to subset them

1. Know how to write simple loops and functions

1. Understand how to use functions and lists as an alternative to loops

---
class:inverse, bottom, left

# Vectors

---

# Vectors

.pull-left[

__Vectors__ are the most common and basic data structure in R. There are two types:

- Atomic vectors
- Lists

There is a related object, `NULL`, which represents the _absence_ of a vector.
]

.pull-right[

<img src="https://github.com/hadley/r4ds/blob/master/diagrams/data-structures-overview.png?raw=true" width="100%" />

]

.footnote[
Ref: [Chapter 20, R4DS](https://r4ds.had.co.nz/vectors.html#important-types-of-atomic-vector)
]

---

# Atomic vectors

.pull-left[

Atomic vectors are __homogeneous__, i.e. all their elements need to be of the same _type_.

```r
lgl <- c(TRUE, FALSE, TRUE, TRUE)
int <- c(1L, 2L, 3L, 4L)
dbl <- c(5.5, 33.6, 8, 12.5)
chr <- c("coffee", "café")
```

We can create vectors using `c()` and assign them to an object using `<-`

]

.pull-right[

<img src="https://github.com/hadley/r4ds/blob/master/diagrams/data-structures-overview.png?raw=true" width="100%" />

]

.footnote[
Ref: [Chapter 20, R4DS](https://r4ds.had.co.nz/vectors.html#important-types-of-atomic-vector)
]

---

# Vector types

It can be difficult to tell the _type_ of a vector solely by the way it is printed in the console.

.pull-left[

When we print them in R:

```r
lgl
#> [1]  TRUE FALSE  TRUE  TRUE
int
#> [1] 1 2 3 4
dbl
#> [1]  5.5 33.6  8.0 12.5
chr
#> [1] "coffee" "café"
```

]

.pull-right[

When we explore their type:

```r
typeof(lgl)
#> [1] "logical"
typeof(int)
#> [1] "integer"
typeof(dbl)
#> [1] "double"
typeof(chr)
#> [1] "character"
```

]

---

# Logical operations

.pull-left[

_From the first session:_

| Condition | Reads |
|----------|-------|
| `x > y` | `x` is greater than `y` |
| `x >= y` | `x` is greater than or equal to `y` |
| `x == 3` | `x` is equal to 3 |
| `x %in% c(2, 8)` | `x` is either 2 or 8 |
| `x > y & x < z` | `x` is greater than `y` AND smaller than `z` |
| <code>x > y &#124; x &lt; z</code> | `x` is greater than `y` OR smaller than `z` |

]

.pull-right[

Logical operations return logical vectors

```r
telework_days <- c(2, 4:5)
office_days <- c("monday", "wednesday")

4 %in% telework_days
#> [1] TRUE
3:5 %in% telework_days
#> [1] FALSE  TRUE  TRUE
is.character(office_days)
#> [1] TRUE
```

]

---

# Coercion

There are two ways to convert, or __coerce__, vectors from one type to another:

.pull-left[

### Explicit

```r
lgl
#> [1]  TRUE FALSE  TRUE  TRUE
as.numeric(lgl)
#> [1] 1 0 1 1

int
#> [1] 1 2 3 4
as.character(int)
#> [1] "1" "2" "3" "4"
```

]

.pull-right[

### Implicit

```r
lgl
#> [1]  TRUE FALSE  TRUE  TRUE
lgl * 5
#> [1] 5 0 5 5

dbl
#> [1]  5.5 33.6  8.0 12.5
chr
#> [1] "coffee" "café"
c(chr, dbl)
#> [1] "coffee" "café"   "5.5"    "33.6"   "8"      "12.5"
```

]

---

# Coercion

This concept is important to understand _warnings_:

```r
ages_chr <- c("29", "88", "46", ">100")
ages_num <- as.numeric(ages_chr)
#> Warning: NAs introduced by coercion
ages_num
#> [1] 29 88 46 NA
```

⚠️ __Avoid ignoring warnings__. Warnings, as opposed to errors, don't stop code execution, so mistakes can propagate.

---

# Missing values

Each type of atomic vector has its own type of missing value

.pull-left[

```r
NA
#> [1] NA
NA_integer_
#> [1] NA
NA_real_
#> [1] NA
NA_character_
#> [1] NA
```

]

.pull-right[

```r
typeof(NA)
#> [1] "logical"
typeof(NA_integer_)
#> [1] "integer"
typeof(NA_real_)
#> [1] "double"
typeof(NA_character_)
#> [1] "character"
```

]

<br/>

Because of coercion, this often doesn't matter.

.pull-left[

```r
c("tea", NA, "té")
#> [1] "tea" NA    "té"
```

]

.pull-right[

```r
typeof(c("tea", NA, "té"))
#> [1] "character"
```

]

---

# Missing values are _contagious_

.pull-left[

Some functions have an option to remove them, but you need to be explicit about it.

```r
mean(c(10, 20, NA))
#> [1] NA
mean(c(10, 20, NA), na.rm = TRUE)
#> [1] 15
```

]

.pull-right[

Functions that remove them by default tend to give a warning:

```r
library(ggplot2)
ggplot(airquality, aes(Ozone, Temp)) +
  geom_point()
#> Warning: Removed 37 rows containing missing values (geom_point).
```

<img src="03_programming_files/figure-html/unnamed-chunk-14-1.png" width="50%" style="display: block; margin: auto;" />

]

---

# Missing values are _contagious_

⚠️ You can't use `==` to identify the missing values in a vector. Use `is.na()` instead.

.pull-left[

```r
x <- c(10, 20)
x == 10
#> [1]  TRUE FALSE
x == NA
#> [1] NA NA
is.na(x)
#> [1] FALSE FALSE
```

]

.pull-right[

```r
y <- c(10, NA)
y == 10
#> [1] TRUE   NA
y == NA
#> [1] NA NA
is.na(y)
#> [1] FALSE  TRUE
```

]

---

# Recycling

When an operation combines vectors of different lengths, the shorter vector is __recycled__ to match the length of the longer one.

This makes fairly common operations quicker to type

```r
rates <- c(0.93, 0.85, 0.43)
rates * 100
#> [1] 93 85 43
rates * c(100, 100, 100)
#> [1] 93 85 43
```

---

# Recycling

Recycling can happen for vectors of any length, not just those of length 1

```r
c(10, 100, 1000, 10000) * c(1, 3)
#> [1]    10   300  1000 30000
c(10, 100, 1000, 10000) * c(1, 3, 1, 3)
#> [1]    10   300  1000 30000
```

And it also explains a fairly common warning:

```r
1:5 + 1:3
#> Warning in 1:5 + 1:3: longer object length is not a multiple of shorter object length
#> [1] 2 4 6 5 7
```

---

# Subsetting

### By position

```r
office <- c("alexandre", "nikos", "maria paula", "tahsin")
office[2]
#> [1] "nikos"
office[c(1, 4)]
#> [1] "alexandre" "tahsin"
office[-1]
#> [1] "nikos"       "maria paula" "tahsin"
```

---

# Subsetting

### With a logical vector

```r
office
#> [1] "alexandre"   "nikos"       "maria paula" "tahsin"
present_on_monday <- c(TRUE, FALSE, TRUE, TRUE)
office[present_on_monday]
#> [1] "alexandre"   "maria paula" "tahsin"
```

⚠️ This is one place where we need to be aware of R's recycling rules: a logical vector shorter than the vector being subset is recycled

```r
office[c(TRUE, FALSE)]
#> [1] "alexandre"   "maria paula"
```

---

# Subsetting

### By name

Vectors can be _named_ and we can use those names to subset them

```r
names(office) <- c("BANQUET", "PATIAS", "CALDAS", "MEDHI")
office
#>       BANQUET        PATIAS        CALDAS         MEDHI 
#>   "alexandre"       "nikos" "maria paula"      "tahsin" 
office[c("CALDAS", "PATIAS")]
#>        CALDAS        PATIAS 
#> "maria paula"       "nikos" 
```

---
class: exercise

# 📝 Vectors

Head to the [RStudio Cloud Project](https://rstudio.cloud/project/2940340) and follow the instructions in the `vectors.R` script.
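---

# Vectors: putting it together

The sketch below is not part of the original exercises. It combines coercion, `is.na()` and logical subsetting, reusing the `ages_chr` values from the coercion slide; the object names `ages_num` and `failed` are illustrative.

```r
# Illustrative data: ages recorded as text
ages_chr <- c("29", "88", "46", ">100")

# Explicit coercion: text that cannot be parsed becomes NA (with a warning)
ages_num <- as.numeric(ages_chr)
#> Warning: NAs introduced by coercion

# is.na() returns a logical vector, which we can use to subset the original
failed <- is.na(ages_num)
ages_chr[failed]
#> [1] ">100"

# Missing values are contagious unless we remove them explicitly
mean(ages_num, na.rm = TRUE)
#> [1] 54.33333
```

Looking up which values turned into `NA` is one way to act on a coercion warning instead of ignoring it.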
---
class:inverse, bottom, left

# Conditions

---

# If conditions

.pull-left[

### Syntax

```
if (<CONDITION>) {
  <CODE_TO_EXECUTE_IF_CONDITION_IS_TRUE>
}
```

]

.pull-right[

### Examples

```r
if (TRUE) {
  "I will print!"
}
#> [1] "I will print!"

if (FALSE) {
  "Nothing will happen!"
}
```

]

---

# Conditions: warnings

⚠️ `<CONDITION>` should evaluate to a logical vector of length 1. If it's a longer vector, only the first element is used and R gives a warning. From R 4.2.0 onwards, this is an error instead.

```r
if (c(TRUE, FALSE)) {
  "I will print!"
}
#> Warning in if (c(TRUE, FALSE)) {: the condition has length > 1 and only the first
#> element will be used
#> [1] "I will print!"

if (c(FALSE, TRUE)) {
  "Nothing will happen!"
}
#> Warning in if (c(FALSE, TRUE)) {: the condition has length > 1 and only the first
#> element will be used
```

---

# Conditions: errors

⚠️ `if` fails when `<CONDITION>` evaluates to a missing value.

```r
if (NA) {
  "I will fail!"
}
#> Error in if (NA) {: missing value where TRUE/FALSE needed

if (c(NA, TRUE)) {
  "I will fail too!"
}
#> Warning in if (c(NA, TRUE)) {: the condition has length > 1 and only the first element
#> will be used
#> Error in if (c(NA, TRUE)) {: missing value where TRUE/FALSE needed
```

---

# If-else conditions

.pull-left[

### Syntax

```
if (<CONDITION>) {
  <CODE_TO_EXECUTE_IF_CONDITION_IS_TRUE>
} else {
  <CODE_TO_EXECUTE_OTHERWISE>
}
```

]

.pull-right[

### Examples

```r
language <- "spanish"

if (language == "spanish") {
  "¡Hola!"
} else {
  "Hi!"
}
#> [1] "¡Hola!"
```

]

---

# Multiple conditions

.pull-left[

### Syntax

```
if (<CONDITION1>) {
  <CODE_TO_EXECUTE_IF_CONDITION1_IS_TRUE>
} else if (<CONDITION2>) {
  <CODE_TO_EXECUTE_IF_CONDITION2_IS_TRUE>
} else {
  <CODE_TO_EXECUTE_OTHERWISE>
}
```

]

.pull-right[

### Examples

```r
language <- "french"

if (language == "spanish") {
  "¡Hola!"
} else if (language == "french") {
  "Salut!"
} else {
  "Hi!"
}
#> [1] "Salut!"
```

]

---

# Vectorised alternatives

`if` tests a single value. Its vectorised alternatives test every element of a vector, which makes them more useful when we work with data frames.
```r
byear <- c(1970, 2005, 1992, 1962)
```

.pull-left[

### `ifelse()`

```r
ifelse(2021 - byear < 18, "young", "old")
#> [1] "old"   "young" "old"   "old"
```

]

.pull-right[

### `dplyr::case_when()`

```r
dplyr::case_when(
  byear <= 1964 ~ "boomer",
  byear %in% 1965:1980 ~ "gen x",
  byear %in% 1981:1996 ~ "millennial",
  TRUE ~ "gen z"
)
#> [1] "gen x"      "gen z"      "millennial" "boomer"
```

]

---
class: exercise

# 📝 Conditions

Head to the [RStudio Cloud Project](https://rstudio.cloud/project/2940340) and follow the instructions in the `conditions.R` script.
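---

# Conditions: guarding against `NA` and long vectors

A minimal sketch, not from the original slides, of one way to keep `if` away from missing values and vectors of length greater than 1. `safe_greeting()` is a hypothetical helper; `isTRUE()`, `any()` and `all()` are base R.

```r
# `safe_greeting()` is a hypothetical helper used for illustration only
safe_greeting <- function(language) {
  # isTRUE() is TRUE only for a single, non-missing TRUE value,
  # so `if` never receives NA or a vector of length > 1
  if (isTRUE(language == "spanish")) {
    "Hola"
  } else {
    "Hi"
  }
}

safe_greeting("spanish")
#> [1] "Hola"
safe_greeting(NA)
#> [1] "Hi"
safe_greeting(c("spanish", "french"))
#> [1] "Hi"

# To decide based on a whole vector, collapse it first with any() or all()
byear <- c(1970, 2005, 1992, 1962)
under_18 <- any(2021 - byear < 18)
under_18
#> [1] TRUE
```

Falling back silently to the default is a design choice. Depending on the use case, stopping with an informative error may be preferable.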
---
class:inverse, bottom, left

# Loops

---

# For loops

.pull-left[

### Steps

1. Create an empty vector to store results

2. Specify the sequence to iterate over

3. Define what you want to do and where you want to store the result

4. (Optional) print the output

]

.pull-right[

### Structure

```r
# format() returns text, so the output vector is of type character
output <- vector("character", 7)

for (m in seq_along(output)) {
  output[m] <- format(
    Sys.Date() + m,
    "%e %b, %Y"
  )
}

output
#> [1] " 1 Oct, 2021" " 2 Oct, 2021" " 3 Oct, 2021" " 4 Oct, 2021" " 5 Oct, 2021"
#> [6] " 6 Oct, 2021" " 7 Oct, 2021"
```

]

---

# For loops

### Less-than-ideal patterns

- _Growing the output vector with each iteration_ is computationally inefficient. If you know the size that the output vector should be, create it in advance with `vector()`.

- _Using colon notation and `length()` to define the sequence_ can lead to unexpected behaviour. What if `days` were a zero-length vector? What would `1:length(days)` return? What would `seq_along(days)` return?

```r
days <- c("tomorrow", "day after tomorrow")
output <- NULL

for (m in 1:length(days)) {
  output <- c(output, format(Sys.Date() + m, "%e %b, %Y"))
}

output
#> [1] " 1 Oct, 2021" " 2 Oct, 2021"
```

---
class: exercise

# 📝 For loops

Head to the [RStudio Cloud Project](https://rstudio.cloud/project/2940340) and follow the instructions in the `loops.R` script.
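---

# For loops: `1:length()` versus `seq_along()`

The questions on the less-than-ideal patterns can be checked directly. The sketch below is an illustration rather than the exercise solution; `toupper()` stands in for whatever work the loop body would do.

```r
days <- character(0)  # a zero-length vector

1:length(days)   # counts down from 1 to 0, so a loop body would run twice
#> [1] 1 0
seq_along(days)  # empty sequence, so a loop body would never run
#> integer(0)

# Pre-allocate the output with vector(), then fill it in by position
days <- c("tomorrow", "day after tomorrow")
output <- vector("character", length(days))

for (m in seq_along(days)) {
  output[m] <- toupper(days[m])
}

output
#> [1] "TOMORROW"           "DAY AFTER TOMORROW"
```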
---
class: inverse, bottom, left
background-image: url(https://images.unsplash.com/photo-1421986872218-300a0fea5895?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=634&q=80)
background-size: cover

# Break
---
class:inverse, bottom, left

# Lists

---

# Lists

.pull-left[

Unlike atomic vectors, lists can be __heterogeneous__. They can be made up of vectors of many different types, including other lists.

```r
a <- list(
  a = 1:3,
  b = "a string",
  c = pi,
  d = list(-1, -5)
)
```

]

.pull-right[

<img src="https://github.com/hadley/r4ds/blob/master/diagrams/data-structures-overview.png?raw=true" width="100%" />

]

.footnote[
Ref: [Chapter 20, R4DS](https://r4ds.had.co.nz/vectors.html#important-types-of-atomic-vector)
]

???

Since they can contain other lists, lists are sometimes known as recursive vectors.

---

# Inspecting lists

Lists can be very large, so it's not always a good idea to print them to the console.

A useful way to examine their structure is `str()`

```r
str(a)
#> List of 4
#>  $ a: int [1:3] 1 2 3
#>  $ b: chr "a string"
#>  $ c: num 3.14
#>  $ d:List of 2
#>   ..$ : num -1
#>   ..$ : num -5
```

In the RStudio IDE, you can also use `View(a)`

---

# Subsetting lists

.pull-left[

Extract the component, returning a list:

```r
a["b"]
#> $b
#> [1] "a string"
typeof(a["b"])
#> [1] "list"
```

]

.pull-right[

Extract the component, removing a level of hierarchy:

```r
a[["b"]]
#> [1] "a string"
typeof(a[["b"]])
#> [1] "character"

a$b
#> [1] "a string"
typeof(a$b)
#> [1] "character"
```

]

---
class: exercise

# 📝 Subsetting lists

.pull-left[

1. Go to the [RStudio Project](https://rstudio.cloud/project/2940340).

1. Type out the different subsets presented in the figure on the right. What are the vector types of the outputs you get?

1. Take the bottom row of the figure. How would you re-write those subsetting operations using `$`?

]

.pull-right[

<img src="https://github.com/hadley/r4ds/blob/master/diagrams/lists-subsetting.png?raw=true" width="100%" />

]

.footnote[
Ref: [Chapter 20, R4DS](https://r4ds.had.co.nz/vectors.html#important-types-of-atomic-vector)
]
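---

# Subsetting lists: chaining operators

A short sketch of how `[`, `[[` and `$` combine on the list `a` defined earlier in this section. It is meant as an illustration of the rules above, not as the solution to the exercise.

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

# `[` always returns a list, even for a single component
typeof(a["d"])
#> [1] "list"

# `[[` and `$` remove one level of hierarchy, so they can be chained
a[["d"]][[2]]
#> [1] -5
a$d[[2]]
#> [1] -5

# Once we reach an atomic vector, `[` subsets its elements
a$a[2:3]
#> [1] 2 3
```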
---
class:inverse, bottom, left

# Functions

---

# Functions

.pull-left[

### Elements of a function

1. __Name__: Name of the function. It's a good idea to make it a verb.

1. __Arguments__: A function can have no arguments, and arguments can have default values.

1. __Body__: The code that you want to execute, which can depend on the values taken by the arguments. At the end, the function should _return_ a value; by default, this is the value of the last expression evaluated.

]

.pull-right[

### Example

```r
greet <- function(person, language = "ENG") {
  greeting <- "Hi"
  if (language == "ESP") {
    greeting <- "Hola"
  }
  paste0(greeting, ", ", person, "!")
}

greet(person = "Jolien")
#> [1] "Hi, Jolien!"
greet("Jolien")
#> [1] "Hi, Jolien!"
greet("Jolien", "ESP")
#> [1] "Hola, Jolien!"
```

]

---
class: exercise

# 📝 Functions

Head to the [RStudio Cloud Project](https://rstudio.cloud/project/2940340) and follow the instructions in the `functions.R` script.
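---

# Functions: checking arguments

In the `greet()` example, any language other than `"ESP"` silently produces the English greeting. The sketch below shows one possible variation that rejects unknown values. `greet_checked()` is a hypothetical name, and the use of base R's `match.arg()` and `switch()` is a choice made for this illustration, not something taken from the exercises.

```r
# Hypothetical variant of greet(), for illustration only
greet_checked <- function(person, language = c("ENG", "ESP")) {
  # match.arg() picks the first choice when `language` is not supplied
  # and stops with an error when the value is not one of the choices
  language <- match.arg(language)
  greeting <- switch(language, ENG = "Hi", ESP = "Hola")
  paste0(greeting, ", ", person, "!")
}

greet_checked("Jolien")
#> [1] "Hi, Jolien!"
greet_checked("Jolien", "ESP")
#> [1] "Hola, Jolien!"

# greet_checked("Jolien", "FRA") would stop with an error
```

Listing the valid choices in the argument's default also documents them for anyone reading the function signature.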
---
class:inverse, bottom, left
background-image: url(https://raw.githubusercontent.com/tidyverse/purrr/master/man/figures/logo.png)
background-position: 1050px 20px
background-size: 100px

# Functional programming

---

# Iteration over vectors with purrr

The purrr package has a family of __map__ functions that allow us to iterate over vectors.

.pull-left[

<br/>

```r
map(
  .x, # for every element of .x
  .f  # do .f
)
```

]

.pull-right[

<img src="https://github.com/hadley/adv-r/blob/master/diagrams/functionals/map.png?raw=true" width="80%" />

]

.footnote[
Ref: [Advanced R, Chapter 9](https://adv-r.hadley.nz/functionals.html)
]

---

# Iteration over vectors with purrr

The purrr package has a family of __map__ functions that allow us to iterate over vectors.

.pull-left[

```r
library(purrr)

triple <- function(x) x * 3

map(1:3, triple)
#> [[1]]
#> [1] 3
#> 
#> [[2]]
#> [1] 6
#> 
#> [[3]]
#> [1] 9
```

]

.pull-right[

<img src="https://github.com/hadley/adv-r/blob/master/diagrams/functionals/map.png?raw=true" width="80%" />

]

---

# Control the type of output vector

.pull-left[

`map()` always returns a list.

```r
map(1:3, triple)
#> [[1]]
#> [1] 3
#> 
#> [[2]]
#> [1] 6
#> 
#> [[3]]
#> [1] 9
```

]

.pull-right[

We can change the type of the output vector with the `map_*()` variants.

```r
map_dbl(1:3, triple)
#> [1] 3 6 9
map_chr(1:3, triple)
#> [1] "3.000000" "6.000000" "9.000000"
```

]

---

# Pass arguments to `.f`

We can pass arguments to `.f` via `...`

.pull-left[

<br/>

```r
map(
  .x,  # for every element of .x
  .f,  # do .f
  ...  # extra arguments to .f
)
```

]

.pull-right[

<img src="https://github.com/hadley/adv-r/blob/master/diagrams/functionals/map-arg.png?raw=true" width="80%" />

]

.footnote[
Ref: [Advanced R, Chapter 9](https://adv-r.hadley.nz/functionals.html)
]

---

# Pass arguments to `.f`

We can pass arguments to `.f` via `...`

.pull-left[

```r
seniority <- list(
  eds = c(2, 10, 5, 3),
  rdg = c(3, 16, NA)
)

map_dbl(seniority, mean)
#> eds rdg 
#>   5  NA 
map_dbl(seniority, mean, na.rm = TRUE)
#> eds rdg 
#> 5.0 9.5 
```

]

.pull-right[

<img src="https://github.com/hadley/adv-r/blob/master/diagrams/functionals/map-arg.png?raw=true" width="80%" />

]

.footnote[
Ref: [Advanced R, Chapter 9](https://adv-r.hadley.nz/functionals.html)
]

---

# Other ways to define `.f`

We can call `map()` with existing or user-defined functions:

```r
triple <- function(x) x * 3
map_dbl(1:3, triple)
#> [1] 3 6 9
```

.pull-left[

### Anonymous functions

```r
map_dbl(1:3, function(x) x * 3)
#> [1] 3 6 9
```

]

.pull-right[

### Tilde notation

```r
map_dbl(1:3, ~ .x * 3)
#> [1] 3 6 9
```

]

---
class: exercise

# 📝 Iteration with purrr

Head to the [RStudio Cloud Project](https://rstudio.cloud/project/2940340) and follow the instructions in the `purrr.R` script.
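---

# From a loop to a functional

To connect this section with the loops section, the sketch below computes the same result twice: once with a `for` loop and once with a functional. It uses base R's `vapply()` so that it runs without any package installed; `map_dbl(seniority, mean, na.rm = TRUE)` is the purrr counterpart. The object names `out_loop` and `out_vapply` are illustrative.

```r
seniority <- list(
  eds = c(2, 10, 5, 3),
  rdg = c(3, 16, NA)
)

# Loop version: pre-allocate, iterate, store
out_loop <- vector("double", length(seniority))
names(out_loop) <- names(seniority)
for (i in seq_along(seniority)) {
  out_loop[i] <- mean(seniority[[i]], na.rm = TRUE)
}

# Functional version: FUN.VALUE plays the role of the `_dbl` suffix,
# and extra arguments such as na.rm are passed on to mean()
out_vapply <- vapply(seniority, mean, FUN.VALUE = double(1), na.rm = TRUE)

out_vapply
#> eds rdg 
#> 5.0 9.5 
all.equal(out_loop, out_vapply)
#> [1] TRUE
```

The functional version leaves the bookkeeping (allocation, indexing, naming) to the function, so the code only states what is done to each element.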
---
class: exercise

# 👩‍💻 Demo: Automated plots

The code used in this demonstration is in the [RStudio Cloud project](https://rstudio.cloud/project/2940340).

---
class: inverse, bottom, left
background-image: url(img/gradient-background.png)
background-size: cover

# Annex

---

# To learn more

.pull-left[

[**R for Data Science**](https://r4ds.had.co.nz/), [Section III - Program](https://r4ds.had.co.nz/program-intro.html)

[**Advanced R**](https://adv-r.hadley.nz/index.html), [Chapter 9 - Functionals](https://adv-r.hadley.nz/functionals.html)

]

.pull-right[

<img src="https://d33wubrfki0l68.cloudfront.net/b88ef926a004b0fce72b2526b0b5c4413666a4cb/24a30/cover.png" width="60%" />

]

---

# To learn more

.pull-left[

]

.pull-right[

Danielle Navarro's [YouTube playlist](https://www.youtube.com/watch?v=aozL5TKQgfY&list=PLRPB0ZzEYegMStVRojPITLUZ6A8YGUHi-) of her __aRt programming__ class.

Jenny Bryan's [purrr workshop](https://speakerdeck.com/jennybc/purrr-workshop).

]

---
class:inverse, bottom, left

# Data frames and tibbles

---

# Data frames and tibbles

### Printing

_Tibbles_ have a nicer printing method.

.pull-left[

```r
df <- airquality[1:3, ]
df
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
```

]

.pull-right[

```r
library(tibble)
tb <- tibble(df)
tb
#> # A tibble: 3 x 6
#>   Ozone Solar.R  Wind  Temp Month   Day
#>   <int>   <int> <dbl> <int> <int> <int>
#> 1    41     190   7.4    67     5     1
#> 2    36     118   8      72     5     2
#> 3    12     149  12.6    74     5     3
```

]

???

In R, tables with data are usually represented using __data frames__. The tidyverse introduces a similar structure, called __tibbles__, with slightly different behaviours.

---

# Data frames and tibbles

### Column subsetting

`$` and `[[` behave the same way for data frames and tibbles: both return the column as a vector

.pull-left[

```r
df$Ozone
#> [1] 41 36 12
df[["Ozone"]]
#> [1] 41 36 12
```

]

.pull-right[

```r
tb$Ozone
#> [1] 41 36 12
tb[["Ozone"]]
#> [1] 41 36 12
```

]

---

# Data frames and tibbles

### Column subsetting

With data frames, `[` returns a data frame when several columns are selected but a vector when a single column is selected. With tibbles, `[` always returns a tibble.

.pull-left[

```r
df[, c("Ozone", "Wind")]
#>   Ozone Wind
#> 1    41  7.4
#> 2    36  8.0
#> 3    12 12.6
df[, "Ozone"]
#> [1] 41 36 12
```

]

.pull-right[

```r
tb[, c("Ozone", "Wind")]
#> # A tibble: 3 x 2
#>   Ozone  Wind
#>   <int> <dbl>
#> 1    41   7.4
#> 2    36   8  
#> 3    12  12.6
tb[, "Ozone"]
#> # A tibble: 3 x 1
#>   Ozone
#>   <int>
#> 1    41
#> 2    36
#> 3    12
```

]
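
---

# Data frames and tibbles

### Keeping a data frame with `drop = FALSE`

A small sketch, not part of the original annex, of two base R ways to keep a data frame when selecting a single column. It uses the built-in `airquality` data, as on the previous slides.

```r
df <- airquality[1:3, ]

# Default behaviour: a single column is simplified to a vector
df[, "Ozone"]
#> [1] 41 36 12

# drop = FALSE switches the simplification off
df[, "Ozone", drop = FALSE]
#>   Ozone
#> 1    41
#> 2    36
#> 3    12

# Subsetting with a single index treats the data frame as a list of columns,
# and `[` on a list keeps the container
is.data.frame(df["Ozone"])
#> [1] TRUE
```

This matters most inside functions and loops, where the number of selected columns may not be known in advance.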