class: title-slide, inverse, bottom background-image: url(img/gradient-background.png) background-size: cover # Introduction to R, RStudio and the tidyverse ### CFE R Training - Module 1 <br/> María Paula Caldas and Jolien Noels --- class: middle # Useful links [Slides](https://oecd-cfe-eds.github.io/cfe-r-training/01_intro.html), if you want to navigate on your own [RStudio Project](https://rstudio.cloud/project/2864232), to try out the exercises [Teams Space](https://teams.microsoft.com/l/team/19%3aewi8FvNssJHrCsxFDSJbA7IL4q4kGH0E8IRMfMp8zPA1%40thread.tacv2/conversations?groupId=c957fd70-0f85-4bcc-b3a4-e453919316de&tenantId=ac41c7d4-1f61-460d-b0f4-fc925a2b471c), for discussions [Github repository](https://github.com/mpaulacaldas/cfe-r-training), for later reference --- class: middle # Housekeeping matters 🙋♀️ During the session, ask questions in the chat or raise your hand 📷 Sessions are recorded. Remember to turn off your camera if its your preference 💬 After the session, post follow-up questions, comments or reactions in the [Teams space](https://teams.microsoft.com/l/team/19%3aewi8FvNssJHrCsxFDSJbA7IL4q4kGH0E8IRMfMp8zPA1%40thread.tacv2/conversations?groupId=c957fd70-0f85-4bcc-b3a4-e453919316de&tenantId=ac41c7d4-1f61-460d-b0f4-fc925a2b471c) 📝 If you are going through these slides on your own, type `p` to see the presenter notes ??? The presenter notes are where you might also find OECD or CFE specific information. --- class: middle # Learning objectives for today 1. Understand how to use the RStudio IDE to execute R code 1. Know how to import data from common file formats into R with `readr`, `readxl` and `haven`. 1. Learn the basics of data manipulation with `dplyr` --- class:inverse, bottom, left # R, RStudio and the tidyverse --- # R and RStudio .pull-left[ __R__ is an open source programming language for statistical computing and graphics. <br/> .center[ <img src="https://www.r-project.org/logo/Rlogo.svg" width="50%" /> ] ] .pull-right[ __RStudio__ is an integrated development environment (IDE) for R programming. ![](https://upload.wikimedia.org/wikipedia/commons/e/ed/RStudio_IDE_screenshot.png)<!-- --> ] ??? Think of it like a car: R is the engine, and RStudio is the car. In this training we will use RStudio Cloud, a lightweight, cloud-based solution that mimics the RStudio IDE. This is to make sure that everyone is using a standardised computing environment (same R and package versions). R and RStudio are available to OECD staff upon request to the Service Desk. --- # Parts of the RStudio IDE .pull-left[ These are known as _panes_. - __Source:__ For scripts. - __Console:__ Where commands are executed. - __Environment:__ Objects you have created and are available in memory. - __Files/Plots/Help:__ To navigate the file system, see plots, get help. ] .pull-right[ ![](https://datacarpentry.org/genomics-r-intro/fig/rstudio_session_4pane_layout.png)<!-- --> ] .footnote[https://datacarpentry.org/genomics-r-intro/01-introduction/index.html] --- class: exercise # 📝 Discover the RStudio IDE <br/> 1. Go to the [RStudio Cloud project](https://rstudio.cloud/project/2864232). On the top-right corner you will see a red, flashing sign that reads "TEMPORARY COPY". Click on the button in the right that reads __"Save a Permanent Copy"__. <img src="img/save-permanent-copy.png" width="60%" style="display: block; margin: auto;" /> 1. Try do some simple math operations in the __Console__. Press `Enter` to execute the code. 1. Open the `discover-rstudio.R` script from the __Files__ pane, and follow the instructions that appear there.
05
:
00
--- # R packages .pull-left[ __Packages__ are collections of R functions, data, and compiled code in a well-defined format. The directory where they are stored is called the __library__. ```r # to install a package install.packages("dplyr") # to use a package library(dplyr) ``` ] .pull-right[ The __tidyverse__ is an opinionated collection of R packages designed for data science. Tidyverse packages share an underlying philosophy and common APIs. ```r library(tidyverse) #> -- Attaching packages -------------------------------------------------------------------- tidyverse 1.3.1 -- #> v ggplot2 3.3.5 v purrr 0.3.4 #> v tibble 3.1.2 v dplyr 1.0.7 #> v tidyr 1.1.3 v stringr 1.4.0 #> v readr 1.4.0 v forcats 0.5.1 #> -- Conflicts ----------------------------------------------------------------------- tidyverse_conflicts() -- #> x dplyr::filter() masks stats::filter() #> x dplyr::lag() masks stats::lag() ``` ] ??? Think of R as a new phone, and R packages as the apps that you install there. (Credit: WB DIME Analytics Trainings) Most of the functionality in open source languages like R and Python comes from packages or libraries. Anyone can write and publish an R package. This means there can be many (up to hundreds of) packages to do the same thing. Given the large variety of packages, it's important to choose packages that are of good quality (e.g. that are well-tested or actively maintained), that have a large user base, or that have a cohesive design. As a team, it's also a good idea to have some cohesiveness on the packages used, particularly for the most common data tasks. The tidyverse is a good convention to adopt: it is a collection of R packages which offer a unified grammar and design philosophy that makes it easier to learn one package after the next. --- class:inverse, bottom, left # Importing data into R --- background-image: url(https://raw.githubusercontent.com/tidyverse/readr/master/man/figures/logo.png) background-position: 1050px 20px background-size: 100px # Tabular data with `readr` `readr` is useful to read (and write) rectangular data from (to) flat text files. .panelset.sideways[ .panel[.panel-name[Tabular data] `defunciones.csv` in a text editor: ``` Comuna,Defunciones,Fecha Contulmo,1,2019-06-05 Calbuco,1,2020-01-16 Chillán Viejo,1,2021-08-16 Gorbea,2,2011-05-30 San Miguel,1,2012-02-11 Lumaco,1,2021-09-08 ``` ] .panel[.panel-name[Syntax] In <svg viewBox="0 0 581 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> : ```r library(readr) read_csv("defunciones.csv") #> # A tibble: 6 x 3 #> Comuna Defunciones Fecha #> <chr> <dbl> <date> #> 1 Contulmo 1 2019-06-05 #> 2 Calbuco 1 2020-01-16 #> 3 Chillán Viejo 1 2021-08-16 #> 4 Gorbea 2 2011-05-30 #> 5 San Miguel 1 2012-02-11 #> 6 Lumaco 1 2021-09-08 ``` ] .panel[.panel-name[Functions] .pull-left[ <br/> Alternatively, you can use the `read.*` functions that are already loaded with the basic installation of R. The most general of these functions is `read_delim()`, where you can specify the specific delimiter of the file. Every `read_*` function has a sister `write_*` function to save data in different formats. ] .pull-right[ <br/> | Base R function | `readr` function | Delimiter | |-----------------|------------------|:----:| | `read.csv()` | `read_csv()` | `,` | | `read.csv2()` | `read_csv2()` | `;` | | `read.tsv()` | `read_tsv()` | `\t` | | `read.delim()` | `read_delim()` | Any | ] ] .panel[.panel-name[Arguments] You can have more control over the way the data is read by using the _arguments_ of the `readr` functions. The most useful ones are: - `delim` - `col_names` - `col_types` - `skip` - `n_max` To read about them, type `?read_delim` in the R Console. ] .panel[.panel-name[Troubleshooting] When you import data with `readr`, the `problems()` function is useful to let you know if something went wrong and where. ```r oops <- read_csv("path/to/problematic-file.csv") #> Warning: 1 parsing failure. #> row col expected actual file #> 6 Fecha valid date 2021-13-08 literal data problems(oops) #> # A tibble: 1 x 5 #> row col expected actual file #> <int> <chr> <chr> <chr> <chr> #> 1 6 Fecha valid date 2021-13-08 literal data ``` ] ] ??? Rectangular data means "tables", with a series of cells organised into rows and columns. Not all types of data are necessarily stored in this format. The first tab of this slide shows what tabular data looks like when you open a file with a text editor like the windows notepad. On the right, we see how the same data looks in R when we import it using readr. --- background-image: url(https://raw.githubusercontent.com/tidyverse/readxl/master/man/figures/logo.png) background-position: 1050px 20px background-size: 100px # Spreadsheets `readxl` and `writexl` are two light-weight packages designed to read data into R. .pull-left[ ### [`readxl`](https://readxl.tidyverse.org/) - `read_excel()` to read spreadsheets of `xlsx` or `xls` files. - `excel_sheets()` to list the spreadsheets in a file. ] .pull-right[ ### [`writexl`](https://docs.ropensci.org/writexl/) - `write_xlsx()` to write tables to `xlsx` files. - `xl_formula()` or `xl_hyperlink()`, to write special column types to a spreadsheet. ] --- background-image: url(https://raw.githubusercontent.com/tidyverse/haven/master/man/figures/logo.png) background-position: 1050px 20px background-size: 100px # SPSS, Stata and SAS with `haven` .pull-left[ ### [`haven`](https://haven.tidyverse.org/) Enables R to read and write various data formats used by other statistical packages. | Software | Read functions | Write functions | |-|-|-| | SAS | `read_sas()`, `read_xpt()` | None | | SPSS | `read_sav()`, `read_por()` | `write_sav()` | | Stata | `read_dta()` | `write_dta()` | ] .pull-right[ ### [`foreign`](https://cran.r-project.org/web/packages/foreign/foreign.pdf) A commonly used alternative to `haven` is the `foreign` package. - Notably, it reads `dbf` files. - Unfortunately, it doesn't read Stata files past version 12. ] --- # R native formats .panelset[ .panel[.panel-name[⚠️ RData] .pull-left[ You can save multiple R objects to a single RData file. When you load a new RData file into R, any R objects inside it will keep the names you originally assigned to them. `load()` will silently overwrite objects with conflicting names. __This makes it dangerous__. ] .pull-right[ ```r # load the uk_irl.RData, which can have # more than one object load("gbr_irl.RData") eu <- read_csv("eu.csv") us <- read_dta("us.dta") # saves 4 objects from the environment to # all_countries.RData save( irl, gbr, eu, us, file = "all_countries.RData" ) ``` ] ] .panel[.panel-name[✅ RDS] .pull-left[ You can only save one object at a time to an RDS file. You can assign any new name to the object when you read it into R. ] .pull-right[ **01_clean-postcodes.R** ```r saveRDS(gbr, "postcodes/gbr.rds") saveRDS(irl, "postcodes/irl.rds") ``` **02_analysis.R** ```r indicators <- readRDS("indicators/irl.rds") postcodes <- readRDS("postcodes/irl.rds") ``` ] ] ] .footnote[ 📖 https://yihui.org/en/2017/12/save-vs-saverds/ ] ??? If you are going to be working mainly in R, you may want to save intermediary objects in a format that can be easily read (i.e. fast) and interpreted by R. Unless you have a very good reason to do so, you should always favour RDS over RData files. The latter is dangerous because similarly named objects can be silently overwritten. RData files are also harder to iterate over, exactly because you don't know the names of the objects that are being loaded into the environment. That is not to say that RData files are not useful. They can be extremely useful, for example, in Shiny apps, where it's often more efficient to have smaller objects. The limitation of "one object per RDS" is less of an issue because of a special type of R structure known as a "list". We will speak more about these in a later session. --- class: exercise # 📝 Let's read some data <br/> Go to the [RStudio Cloud project](https://rstudio.cloud/project/2864232), open and follow the instructions in the `import.R` script.
05
:
00
Prepare yourself for some show-and-tell after the 5 minutes are over. Volunteers are welcome! --- class: inverse, bottom, left background-image: url(https://images.unsplash.com/photo-1421986872218-300a0fea5895?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=634&q=80) background-size: cover # Break
10
:
00
--- class: inverse, bottom, left # Data manipulation --- background-image: url(https://raw.githubusercontent.com/tidyverse/dplyr/master/man/figures/logo.png) background-position: 1050px 20px background-size: 100px # Main `dplyr` verbs The main functions, or _verbs_ from `dplyr` are: - `rename()`, to rename columns. - `select()`, to select columns. - `filter()`, to select rows. - `mutate()`, to create or modify columns. - `summarise()`, to summarise information from columns. The last three of these can be used with `group_by()` to do operations by groups of rows. ??? dplyr is a package focused on data manipulation. In R circles, more often people use the term "data wrangling". dplyr functions are often referred to as "dplyr verbs". They are verbs because they perform an action on a table. All dplyr verbs take a table as input and return a table as output. In the next slides, we explore the syntax of each verb one by one. --- background-image: url(https://raw.githubusercontent.com/tidyverse/dplyr/master/man/figures/logo.png) background-position: 1050px 20px background-size: 100px ## Air pollution in PM2.5 Average level in µg/m³ experienced by the population ```r (air_pol <- tibble(readsdmx::read_sdmx( "http://dotstat.oecd.org/restsdmx/sdmx.ashx/GetData/REGION_SOCIAL/1+2..AIR_POL.ALL./all?" ))) #> # A tibble: 9,480 x 8 #> OBS_STATUS ObsValue POS POWERCODE REG_ID TIME TL VAR #> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> #> 1 <NA> 3.88 ALL 0 AUS 2003 1 AIR_POL #> 2 <NA> 3.301 ALL 0 AUS 2006 1 AIR_POL #> 3 <NA> 3.594 ALL 0 AUS 2009 1 AIR_POL #> 4 <NA> 7 ALL 0 AUS 2012 1 AIR_POL #> 5 <NA> 4.744 ALL 0 AU1 2003 2 AIR_POL #> 6 <NA> 3.277 ALL 0 AU1 2006 2 AIR_POL #> 7 <NA> 3.443 ALL 0 AU1 2009 2 AIR_POL #> 8 <NA> 7.08 ALL 0 AU1 2012 2 AIR_POL #> 9 <NA> 4.315 ALL 0 AU2 2003 2 AIR_POL #> 10 <NA> 4.142 ALL 0 AU2 2006 2 AIR_POL #> # ... with 9,470 more rows ``` ??? This is the data that we will use in our next examples. It comes from DOTSTAT. --- background-image: url(https://raw.githubusercontent.com/tidyverse/dplyr/master/man/figures/logo.png) background-position: 1050px 20px background-size: 100px # `rename()` .panelset.sideways[ .panel[.panel-name[Syntax] .center[`rename(<DATA>, <NEWNAME> = <0LDNAME>, ...)`] ```r rename(air_pol, reg_id = REG_ID, year = TIME, pm2_5 = ObsValue) #> # A tibble: 9,480 x 8 #> OBS_STATUS pm2_5 POS POWERCODE reg_id year TL VAR #> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> #> 1 <NA> 3.88 ALL 0 AUS 2003 1 AIR_POL #> 2 <NA> 3.301 ALL 0 AUS 2006 1 AIR_POL #> 3 <NA> 3.594 ALL 0 AUS 2009 1 AIR_POL #> 4 <NA> 7 ALL 0 AUS 2012 1 AIR_POL #> 5 <NA> 4.744 ALL 0 AU1 2003 2 AIR_POL #> 6 <NA> 3.277 ALL 0 AU1 2006 2 AIR_POL #> 7 <NA> 3.443 ALL 0 AU1 2009 2 AIR_POL #> 8 <NA> 7.08 ALL 0 AU1 2012 2 AIR_POL #> 9 <NA> 4.315 ALL 0 AU2 2003 2 AIR_POL #> 10 <NA> 4.142 ALL 0 AU2 2006 2 AIR_POL #> # ... with 9,470 more rows ``` ] .panel[.panel-name[Extension] `rename_with()` is useful to rename many columns at once using a function. ```r rename_with(air_pol, tolower) #> # A tibble: 9,480 x 8 #> obs_status obsvalue pos powercode reg_id time tl var #> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> #> 1 <NA> 3.88 ALL 0 AUS 2003 1 AIR_POL #> 2 <NA> 3.301 ALL 0 AUS 2006 1 AIR_POL #> 3 <NA> 3.594 ALL 0 AUS 2009 1 AIR_POL #> 4 <NA> 7 ALL 0 AUS 2012 1 AIR_POL #> 5 <NA> 4.744 ALL 0 AU1 2003 2 AIR_POL #> 6 <NA> 3.277 ALL 0 AU1 2006 2 AIR_POL #> 7 <NA> 3.443 ALL 0 AU1 2009 2 AIR_POL #> 8 <NA> 7.08 ALL 0 AU1 2012 2 AIR_POL #> 9 <NA> 4.315 ALL 0 AU2 2003 2 AIR_POL #> 10 <NA> 4.142 ALL 0 AU2 2006 2 AIR_POL #> # ... with 9,470 more rows ``` ] ] ??? rename() is used to change the column names in a table. Take a look at the syntax. Although it may seem a bit counter-intuitive, the new name is goes on the left-hand side of the equal (=) sign. This is for consistency with the mutate() function, that we will see in a little while. rename_with() is an extension to rename() that allows to rename many columns at once by applying a function to the names of a table. In this example we use it to rename all columns, but know that it is a very flexible function that can implement more complex rules. --- background-image: url(https://raw.githubusercontent.com/tidyverse/dplyr/master/man/figures/logo.png) background-position: 1050px 20px background-size: 100px # `select()` .panelset.sideways[ .panel[.panel-name[Syntax] .center[`select(<DATA>, <VAR1>, <VAR2>, ...)`] ```r select(air_pol, REG_ID, TIME, ObsValue) #> # A tibble: 9,480 x 3 #> REG_ID TIME ObsValue #> <chr> <chr> <chr> #> 1 AUS 2003 3.88 #> 2 AUS 2006 3.301 #> 3 AUS 2009 3.594 #> 4 AUS 2012 7 #> 5 AU1 2003 4.744 #> 6 AU1 2006 3.277 #> 7 AU1 2009 3.443 #> 8 AU1 2012 7.08 #> 9 AU2 2003 4.315 #> 10 AU2 2006 4.142 #> # ... with 9,470 more rows ``` ] .panel[.panel-name[Selection helpers] `select()` is most useful when used with __selection helpers__, as documented in `?tidyselect::language`. .pull-left[ ```r select(air_pol, REG_ID:TL) #> # A tibble: 9,480 x 3 #> REG_ID TIME TL #> <chr> <chr> <chr> #> 1 AUS 2003 1 #> 2 AUS 2006 1 #> 3 AUS 2009 1 #> 4 AUS 2012 1 #> 5 AU1 2003 2 #> 6 AU1 2006 2 #> 7 AU1 2009 2 #> 8 AU1 2012 2 #> 9 AU2 2003 2 #> 10 AU2 2006 2 #> # ... with 9,470 more rows ``` ] .pull-right[ ```r select(air_pol, REG_ID, starts_with("T")) #> # A tibble: 9,480 x 3 #> REG_ID TIME TL #> <chr> <chr> <chr> #> 1 AUS 2003 1 #> 2 AUS 2006 1 #> 3 AUS 2009 1 #> 4 AUS 2012 1 #> 5 AU1 2003 2 #> 6 AU1 2006 2 #> 7 AU1 2009 2 #> 8 AU1 2012 2 #> 9 AU2 2003 2 #> 10 AU2 2006 2 #> # ... with 9,470 more rows ``` ] ] .panel[.panel-name[Extension] Columns can be renamed also during selection: ```r select(air_pol, reg_id = REG_ID, year = TIME, pm2_5 = ObsValue) #> # A tibble: 9,480 x 3 #> reg_id year pm2_5 #> <chr> <chr> <chr> #> 1 AUS 2003 3.88 #> 2 AUS 2006 3.301 #> 3 AUS 2009 3.594 #> 4 AUS 2012 7 #> 5 AU1 2003 4.744 #> 6 AU1 2006 3.277 #> 7 AU1 2009 3.443 #> 8 AU1 2012 7.08 #> 9 AU2 2003 4.315 #> 10 AU2 2006 4.142 #> # ... with 9,470 more rows ``` ] ] ??? select() is used to subset a table to keep only certain columns. It is most often used with a special set of functions known as "selection helpers". These are useful when you want to select many columns based on their names, position, or some conditions. Because it is such a common pattern, select() can be used to select and rename columns on the fly. Stata users: "select" is the equivalent of "keep" --- background-image: url(https://raw.githubusercontent.com/tidyverse/dplyr/master/man/figures/logo.png) background-position: 1050px 20px background-size: 100px # `filter()` __rows__ .panelset.sideways[ .panel[.panel-name[Syntax] .center[ `filter(<DATA>, <COND1>, <COND2>, ...)` ] <br/> - `filter()` returns a table with the rows where the condition returns `TRUE` - Rows where the condition returns missing values (`NA`) are removed. - The comma is equivalent to an AND (`&`) condition. ] .panel[.panel-name[Logical comparisons] | Condition | Reads | |----------|-------| | `x > y` | `x` is greater than `y` | | `x >= y` | `x` is greater or equal to `y` | | `x == 3` | `x` is equal to 3 | | `x %in% c(2, 8)` | `x` is either 2 or 8 | | `x > y & x < z` | `x` is greater than `y` AND smaller than `z` | | <code>x > y | x < z</code> | `x` is greater than `y` OR smaller than `z` | <!-- https://stackoverflow.com/a/17320389/14115463 --> ] .panel[.panel-name[Examples] ```r # Keep only rows for France filter(air_pol, REG_ID == "FRA") # Get the region with the lowest level of air pollution filter(air_pol, ObsValue == min(ObsValue)) # What was the level of PM2.5 in Colombia in 2000? filter(air_pol, TIME == "2000" & REG_ID == "COL") filter(air_pol, TIME == "2000", REG_ID == "COL") # mind the comma! ``` ] ] ??? Like select(), filter() also allows us to subset a table. The key difference is that filter() acts on the rows, whereas select acts on the columns. Whenever you see a "comma" inside filter() you should interpret is as an "AND" condition, not an "OR". If you want an "OR" you need to use a vertical bar. Stata users: "filter" is the equivalent of "keep if" --- background-image: url(https://raw.githubusercontent.com/tidyverse/dplyr/master/man/figures/logo.png) background-position: 1050px 20px background-size: 100px # `mutate()` .panelset.sideways[ .panel[.panel-name[Syntax] .center[ `mutate(<DATA>, <NEW_OR_UPDATED_COLUMN> = <OPERATIONS ON OTHER COLUMNS>, ...)` ] <br/> - To update existing columns - Create new ones ] .panel[.panel-name[Example] ```r mutate( air_pol, ObsValue = as.numeric(ObsValue), # original columns where character pm_rounded = round(ObsValue, 2) # new column, at the end ) #> # A tibble: 9,480 x 9 #> OBS_STATUS ObsValue POS POWERCODE REG_ID TIME TL VAR pm_rounded #> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> #> 1 <NA> 3.88 ALL 0 AUS 2003 1 AIR_POL 3.88 #> 2 <NA> 3.30 ALL 0 AUS 2006 1 AIR_POL 3.3 #> 3 <NA> 3.59 ALL 0 AUS 2009 1 AIR_POL 3.59 #> 4 <NA> 7 ALL 0 AUS 2012 1 AIR_POL 7 #> 5 <NA> 4.74 ALL 0 AU1 2003 2 AIR_POL 4.74 #> 6 <NA> 3.28 ALL 0 AU1 2006 2 AIR_POL 3.28 #> 7 <NA> 3.44 ALL 0 AU1 2009 2 AIR_POL 3.44 #> 8 <NA> 7.08 ALL 0 AU1 2012 2 AIR_POL 7.08 #> 9 <NA> 4.32 ALL 0 AU2 2003 2 AIR_POL 4.32 #> 10 <NA> 4.14 ALL 0 AU2 2006 2 AIR_POL 4.14 #> # ... with 9,470 more rows ``` ] ] ??? Mutate is used to create or update existing columns. All original columns are kept in the final table. --- background-image: url(https://raw.githubusercontent.com/tidyverse/dplyr/master/man/figures/logo.png) background-position: 1050px 20px background-size: 100px # `summarise()` .panelset.sideways[ .panel[.panel-name[Syntax] .center[ `summarise(<DATA>, <SUMMARISED_COLUMN> = <SUMMARY OPERATIONS ON OTHER COLUMNS>, ...)` ] <br/> - Summarise information from a column according to a function. ] .panel[.panel-name[Example] ```r summarise( air_pol, n_obs = n(), n_regions = n_distinct(REG_ID), min_value = min(ObsValue), max_value = max(ObsValue), min_year = min(TIME), max_year = max(TIME) ) #> # A tibble: 1 x 6 #> n_obs n_regions min_value max_value min_year max_year #> <int> <int> <chr> <chr> <chr> <chr> #> 1 9480 710 1.01 95.751 1990 2019 ``` ] ] ??? summarise() is useful to calculate summary statistics on a table. It will return one row summarising all observations in the original table, unless it is paired with group_by() Stata users: "summarise" is the equivalent of "collapse" --- background-image: url(https://raw.githubusercontent.com/tidyverse/dplyr/master/man/figures/logo.png) background-position: 1050px 20px background-size: 100px # `group_by()` .panelset.sideways[ .panel[.panel-name[Syntax] .center[ `group_by(<DATA>, <COLUMN1>, <COLUMN2>, ...)` ] <br/> `group_by()` is used to perform operations by groups. It alters the default behaviour of: - `filter()` - `mutate()` - `summarise()` ] .panel[.panel-name[Example] ```r air_pol_grouped <- group_by(air_pol, REG_ID, TL) air_pol_grouped #> # A tibble: 9,480 x 8 #> # Groups: REG_ID, TL [710] #> OBS_STATUS ObsValue POS POWERCODE REG_ID TIME TL VAR #> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> #> 1 <NA> 3.88 ALL 0 AUS 2003 1 AIR_POL #> 2 <NA> 3.301 ALL 0 AUS 2006 1 AIR_POL #> 3 <NA> 3.594 ALL 0 AUS 2009 1 AIR_POL #> 4 <NA> 7 ALL 0 AUS 2012 1 AIR_POL #> 5 <NA> 4.744 ALL 0 AU1 2003 2 AIR_POL #> 6 <NA> 3.277 ALL 0 AU1 2006 2 AIR_POL #> 7 <NA> 3.443 ALL 0 AU1 2009 2 AIR_POL #> 8 <NA> 7.08 ALL 0 AU1 2012 2 AIR_POL #> 9 <NA> 4.315 ALL 0 AU2 2003 2 AIR_POL #> 10 <NA> 4.142 ALL 0 AU2 2006 2 AIR_POL #> # ... with 9,470 more rows ``` ] .panel[.panel-name[with `filter()`] ```r filter(air_pol_grouped, ObsValue == max(ObsValue)) #> # A tibble: 719 x 8 #> # Groups: REG_ID, TL [710] #> OBS_STATUS ObsValue POS POWERCODE REG_ID TIME TL VAR #> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> #> 1 <NA> 22.492 ALL 0 AT11 2003 2 AIR_POL #> 2 <NA> 20.003 ALL 0 AT21 2006 2 AIR_POL #> 3 <NA> 9.91 ALL 0 CAN 2003 1 AIR_POL #> 4 <NA> 7.772 ALL 0 ME18 2006 2 AIR_POL #> 5 <NA> 9.061 ALL 0 ME20 2003 2 AIR_POL #> 6 <NA> 5.591 ALL 0 ME23 2003 2 AIR_POL #> 7 <NA> 7.386 ALL 0 ME25 2006 2 AIR_POL #> 8 <NA> 6.968 ALL 0 ME26 2009 2 AIR_POL #> 9 <NA> 9.459 ALL 0 ME27 2009 2 AIR_POL #> 10 <NA> 9.695 ALL 0 ME28 2006 2 AIR_POL #> # ... with 709 more rows ``` ] .panel[.panel-name[with `mutate()`] ```r mutate(air_pol_grouped, ObsValue = max(ObsValue)) #> # A tibble: 9,480 x 8 #> # Groups: REG_ID, TL [710] #> OBS_STATUS ObsValue POS POWERCODE REG_ID TIME TL VAR #> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> #> 1 <NA> 7.62 ALL 0 AUS 2003 1 AIR_POL #> 2 <NA> 7.62 ALL 0 AUS 2006 1 AIR_POL #> 3 <NA> 7.62 ALL 0 AUS 2009 1 AIR_POL #> 4 <NA> 7.62 ALL 0 AUS 2012 1 AIR_POL #> 5 <NA> 7.5 ALL 0 AU1 2003 2 AIR_POL #> 6 <NA> 7.5 ALL 0 AU1 2006 2 AIR_POL #> 7 <NA> 7.5 ALL 0 AU1 2009 2 AIR_POL #> 8 <NA> 7.5 ALL 0 AU1 2012 2 AIR_POL #> 9 <NA> 8.52 ALL 0 AU2 2003 2 AIR_POL #> 10 <NA> 8.52 ALL 0 AU2 2006 2 AIR_POL #> # ... with 9,470 more rows ``` ] .panel[.panel-name[with `summarise()`] ```r summarise(air_pol_grouped, ObsValue = max(ObsValue)) #> `summarise()` has grouped output by 'REG_ID'. You can override using the `.groups` argument. #> # A tibble: 710 x 3 #> # Groups: REG_ID [710] #> REG_ID TL ObsValue #> <chr> <chr> <chr> #> 1 AT11 2 22.492 #> 2 AT12 2 21.71 #> 3 AT13 2 25.36 #> 4 AT21 2 20.003 #> 5 AT22 2 22.55 #> 6 AT31 2 21.36 #> 7 AT32 2 19.31 #> 8 AT33 2 9.67 #> 9 AT34 2 9.84 #> 10 AU1 2 7.5 #> # ... with 700 more rows ``` ] .panel[.panel-name[`ungroup()`] It's good practice to ungroup the data once you no longer need the grouping. ```r ungroup(air_pol_grouped) #> # A tibble: 9,480 x 8 #> OBS_STATUS ObsValue POS POWERCODE REG_ID TIME TL VAR #> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> #> 1 <NA> 3.88 ALL 0 AUS 2003 1 AIR_POL #> 2 <NA> 3.301 ALL 0 AUS 2006 1 AIR_POL #> 3 <NA> 3.594 ALL 0 AUS 2009 1 AIR_POL #> 4 <NA> 7 ALL 0 AUS 2012 1 AIR_POL #> 5 <NA> 4.744 ALL 0 AU1 2003 2 AIR_POL #> 6 <NA> 3.277 ALL 0 AU1 2006 2 AIR_POL #> 7 <NA> 3.443 ALL 0 AU1 2009 2 AIR_POL #> 8 <NA> 7.08 ALL 0 AU1 2012 2 AIR_POL #> 9 <NA> 4.315 ALL 0 AU2 2003 2 AIR_POL #> 10 <NA> 4.142 ALL 0 AU2 2006 2 AIR_POL #> # ... with 9,470 more rows ``` ] ] ??? When working with grouped data, it's important to keep an eye on those first two lines that are printed in the console. For historical reasons, summarise() drops by default the last variable of the grouping. This behaviour can be changed with the .groups argument of summarise(). --- background-image: url(https://raw.githubusercontent.com/tidyverse/magrittr/master/man/figures/logo.png) background-position: 1050px 20px background-size: 100px # _Piping_ operations with `%>%` _Piping_ helps chain together operations and improves code readability. .panelset[ .panel[.panel-name[In theory] .pull-left[ #### Not great ```r g(f(x, y), z) ``` #### Better ```r x_y <- f(x, y) g(x_y, z) ``` ] .pull-right[ #### Best ```r x %>% f(y) %>% g(z) ``` ] ] .panel[.panel-name[In practice] 🙋♀️ What question do you think we are trying to answer with this code? .pull-left[ ```r air_pol %>% filter(TL == "1") %>% group_by(TIME) %>% filter(ObsValue == max(as.numeric(ObsValue))) %>% ungroup() %>% arrange(desc(TIME)) %>% select(REG_ID, TIME) %>% filter(row_number() < 10) ``` ] .pull-right[ ``` #> # A tibble: 9 x 2 #> REG_ID TIME #> <chr> <chr> #> 1 KOR 2019 #> 2 KOR 2018 #> 3 TUR 2017 #> 4 TUR 2016 #> 5 IND 2015 #> 6 TUR 2014 #> 7 CHN 2013 #> 8 CHN 2012 #> 9 TUR 2011 ``` ] ] ] --- class: exercise ## 👩💻 Demo: Preparing data for an OECD report .center[ <img src="https://www.oecd-ilibrary.org/sites/959d5ba0-en/images/images/02-chapter2/emf/g2-11.png" width="80%" /> ] The code used in this demonstration is in the [RStudio Cloud project](https://rstudio.cloud/project/2864232). --- class: exercise # 📝 Let's play with some data <br/> Go to the [RStudio Cloud project](https://rstudio.cloud/project/2864232), open and follow the instructions in **Part 1** of the `wrangle.R` script.
05
:
00
Prepare yourself for some show-and-tell after the 5 minutes are over. Volunteers are welcome! --- background-image: url(https://raw.githubusercontent.com/tidyverse/dplyr/master/man/figures/logo.png) background-position: 1050px 20px background-size: 100px # Joining tables .panelset.sideways[ .panel[.panel-name[Functions] __Mutating joins__, that add new variables to a table depending on the contents of another one. - `left_join()` - `right_join()` - `full_join()` - `inner_join()` __Filtering joins__, that filter rows in a table depending on the contents of another one. - `semi_join()` - `anti_join()` ] .panel[.panel-name[Syntax] The syntax for all the join functions goes like: ``` <TYPE>_join( <LEFT_HAND_TABLE>, <RIGHT_HAND_TABLE>, by = c("<COMMON_VAR1>", "<COMMON_VAR2>", ...) ) ``` For the next examples, let's assume we have two tables: .pull-left[ ```r x <- tribble( ~key, ~val_x, 1, "x1", 2, "x2", 3, "x3" ) ``` ] .pull-right[ ```r y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2", 4, "y3" ) ``` ] ] .panel[.panel-name[`left_join()`] .pull-left[ ```r x <- tribble( ~key, ~val_x, 1, "x1", 2, "x2", 3, "x3" ) ``` ] .pull-right[ ```r y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2", 4, "y3" ) ``` ] .left[ ```r left_join(x, y, by = "key") #> # A tibble: 3 x 3 #> key val_x val_y #> <dbl> <chr> <chr> #> 1 1 x1 y1 #> 2 2 x2 y2 #> 3 3 x3 <NA> ``` ] ] .panel[.panel-name[`inner_join()`] .pull-left[ ```r x <- tribble( ~key, ~val_x, 1, "x1", 2, "x2", 3, "x3" ) ``` ] .pull-right[ ```r y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2", 4, "y3" ) ``` ] .left[ ```r inner_join(x, y, by = "key") #> # A tibble: 2 x 3 #> key val_x val_y #> <dbl> <chr> <chr> #> 1 1 x1 y1 #> 2 2 x2 y2 ``` ] ] .panel[.panel-name[`anti_join()`] .pull-left[ ```r x <- tribble( ~key, ~val_x, 1, "x1", 2, "x2", 3, "x3" ) ``` ] .pull-right[ ```r y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2", 4, "y3" ) ``` ] .left[ ```r anti_join(x, y, by = "key") #> # A tibble: 1 x 2 #> key val_x #> <dbl> <chr> #> 1 3 x3 ``` ] ] ] ??? Joins are how we merge or combine relational tables in R. i.e. tables that share common IDs, or keys. Left join is the most common mutating join. It preserves all the rows from the left-hand table, and adds new columns with information from the right-hand table. Inner join is also a very useful join. It returns an intersection: only the rows from the left-hand table that have a match in the right-hand table. Anti join is the most useful filtering join. It preserves all the rows from the left-hand table that do not have a match in the "right-hand" table. --- background-image: url(https://raw.githubusercontent.com/tidyverse/dplyr/master/man/figures/logo.png) background-position: 1050px 20px background-size: 100px # Binding tables .panelset[ .panel[.panel-name[✅ `bind_rows()`] .pull-left[ ```r u <- tribble( ~num, ~val, 1, "u1", 2, "u2", 3, "u3" ) w <- tribble( ~num, ~val, 1, "w1", 2, "w2", 4, "w3" ) ``` ] .pull-right[ ```r bind_rows(u, w) #> # A tibble: 6 x 2 #> num val #> <dbl> <chr> #> 1 1 u1 #> 2 2 u2 #> 3 3 u3 #> 4 1 w1 #> 5 2 w2 #> 6 4 w3 ``` ] ] .panel[.panel-name[⚠️ `bind_cols()`] ```r bind_cols(u, w) #> New names: #> * num -> num...1 #> * val -> val...2 #> * num -> num...3 #> * val -> val...4 #> # A tibble: 3 x 4 #> num...1 val...2 num...3 val...4 #> <dbl> <chr> <dbl> <chr> #> 1 1 u1 1 w1 #> 2 2 u2 2 w2 #> 3 3 u3 4 w3 ``` ] ] ??? Binding rows is equivalent to what Stata users call "append". We take two tables and stack them on top of each other. You should be careful with bind_cols(). There is barely ever a need for it. In general, what you would want in this case is an append. --- class: exercise # 📝 Last round! <br/> Go to the [RStudio Cloud project](https://rstudio.cloud/project/2864232), open and follow the instructions in **Part 2** of the `wrangle.R` script.
05
:
00
--- class: inverse, bottom, left background-image: url(img/gradient-background.png) background-size: cover # Annex --- # To learn more .pull-left[ <https://r4ds.had.co.nz/> [Chapter 5](https://r4ds.had.co.nz/transform.html) and [Part II - Wrangle](https://r4ds.had.co.nz/wrangle-intro.html) ] .pull-right[ <img src="https://d33wubrfki0l68.cloudfront.net/b88ef926a004b0fce72b2526b0b5c4413666a4cb/24a30/cover.png" width="60%" /> ] --- # More on reading data Information on how to connect to our internal tools is in the Algobank, provided by Claire Hoffmann (CFE/EDS): - [OECD Database](https://algobank.oecd.org:4430/territorial-stats-unit/technical-docs/-/blob/main/accessing-dotstat-data.md) - [Spatial data from the PostGIS Database](https://algobank.oecd.org:4430/gis-community-of-practise/oecd-gis-technical-docs/-/blob/master/r-and-postgis.md) Other import packages that we did not cover: - [`readsdmx`](https://github.com/mdequeljoe/readsdmx): Created by Matthew de Queljoe (ECO). - [`arrow`](https://arrow.apache.org/docs/r/): To save large data in `parquet` format. Highly compressed format optimised for out-of-memory queries. - [`sf`](https://r-spatial.github.io/sf/) and [`stars`](https://r-spatial.github.io/stars/) for vector and raster data. --- # More on data manipulation There are many useful `dplyr` functions that we did not cover today: - `count()` - `distinct()` - `across()` For some examples, see in their documentation or the vignettes. - Online: <https://dplyr.tidyverse.org/> - In R: - For the documentation of a function, see `?function_name` or `help("function_name")` - Use `vignette(package = "dplyr")` to see the available vignettes, and `vignette("name-of-the-vignette")` to navigate to a specific one.