class: title-slide, inverse, bottom background-image: url(img/gradient-background.png) background-size: cover # Good practice and workflow with R ### CFE R Training - Module 5 <br/> MarΓa Paula Caldas and Jolien Noels --- class: middle # Useful links [Slides](https://oecd-cfe-eds.github.io/cfe-r-training/05_goodpractice.html), if you want to navigate on your own [RStudio Project](https://rstudio.cloud/project/3022018), to try out the exercises [Teams Space](https://teams.microsoft.com/l/team/19%3aewi8FvNssJHrCsxFDSJbA7IL4q4kGH0E8IRMfMp8zPA1%40thread.tacv2/conversations?groupId=c957fd70-0f85-4bcc-b3a4-e453919316de&tenantId=ac41c7d4-1f61-460d-b0f4-fc925a2b471c), for discussions [Github repository](https://github.com/mpaulacaldas/cfe-r-training), for later reference --- class: middle # Housekeeping matters πββοΈ During the session, ask questions in the chat or raise your hand π· Sessions are recorded. Remember to turn off your camera if its your preference π¬ After the session, post follow-up questions, comments or reactions in the [Teams space](https://teams.microsoft.com/l/team/19%3aewi8FvNssJHrCsxFDSJbA7IL4q4kGH0E8IRMfMp8zPA1%40thread.tacv2/conversations?groupId=c957fd70-0f85-4bcc-b3a4-e453919316de&tenantId=ac41c7d4-1f61-460d-b0f4-fc925a2b471c) π If you are going through these slides on your own, type `p` to see the presenter notes ??? The presenter notes are where you might also find OECD or CFE specific information. --- class: middle # Learning objectives for today 1. Understand the some of the main principles of good practice in R 1. Learn about R packages and tools that can improve your workflow 1. Get acquainted with software engineering principles to improve the quality of your code ??? Notably, today we won't cover good practices in other programming languages or software. Missing also is one key aspect of modern data science: version control. At OECD, we use Git and Gitlab (a.k.a. Algobank). Drafting of an unified manual is currently in the works. --- class:inverse, bottom, left # Introduction --- # What is good practice? The aim of good practice is to make analyses __easier to understand__ and __easier to reproduce__ by others. - By _others_, also think about _future you_. It's not the same person as present you. Implementing good practices can come with a __cost__ - More time spent writing documentation - Re-writing code - Learning new tools. The key is learning to adopt practices that are __good enough__ for the task at hand --- # What is a workflow? A workflow refers to the way in which we work on projects, either by ourselves or with others. - The way we organise folders, scripts and documents. - The way we share information with others. - The way we handle repetitive tasks. --- # Why bother with workflows? The aim of a workflow is to provide __consistency__, reducing the __cognitive load__ of repetitive tasks. For example: - Where is the raw data? - Where is the code for Figure 7? - Was this table based in the latest data? - What is the link to the latest version of the report again? --- class:inverse, bottom, left # RStudio configurations --- # Beware of `.RData` .pull-left[ ### What are `.RData` files? By default, R saves all elements from the workspace to an `.RData` file that will be loaded by default on your next session. Although it sounds convenient, __you should not rely on this behaviour!__ ] .pull-right[ ### Instead... - Make sure all of your steps are documented in R scripts. - Load and save intermediary objects using those scripts. - Configure RStudio to not rely on `.RData`. ] --- background-image: url(img/rdata.png) background-position: 600px 70px background-size: 550px # Use a blank slate .pull-left[ <br/> In the RStudio IDE, you can change the configurations via: `Tools -> Global Options -> General` <br/> Alternatively, in the Console: ```r usethis::use_blank_slate() ``` ] --- background-image: url(https://raw.githubusercontent.com/mpaulacaldas/good-enough-practices/master/figures/global-opts-encoding.PNG) background-position: 600px 70px background-size: 550px ## Use UTF-8 encoding .pull-left[ By default, scripts saved with the RStudio IDE will be saved with the computer's default [__encoding__](https://en.wikipedia.org/wiki/Character_encoding). To avoid issues when running your scripts on different operating systems, default text files to `UTF-8` encoding. <br/> In the RStudio IDE: `Tools -> Global Options -> ` `Code -> Saving` ] ??? If you're using Windows, this will likely create issues when you push your code to a computer with a Linux OS. Some of the tools we use that rely on Linux machines include shinyapps.io, to deploy Shiny apps, and the Algobank (our Gitlab instance). --- background-image: url(https://raw.githubusercontent.com/mpaulacaldas/good-enough-practices/master/figures/global-opts-margins.PNG) background-position: 600px 70px background-size: 550px # Nice-to-have's .pull-left[ Code gets harder to read the wider it becomes. Try to respect the [__80 character limit__](https://softwareengineering.stackexchange.com/questions/148677/why-is-80-characters-the-standard-limit-for-code-width) <br/> In the RStudio IDE: `Tools -> Global Options -> Code -> Display` ] --- class:inverse, bottom, left # Projects and folder structure --- # Project-based workflows .pull-left[ The convention in many programming languages is to use __project-based workflows.__ These workflows assume that _all_ files associated with a project are stored in the same folder, with a nested architecture. ] .pull-right[ ```r #> mortality #> +-- mortality.Rproj #> +-- data #> | +-- aus.rds #> | +-- chl.rds #> | \-- usa.rds #> +-- data-raw #> | +-- can #> | | +-- 2020-07-20 #> | | | +-- 13100768.csv #> | | | \-- 13100768_MetaData.csv #> | | +-- 2020-11-10 #> | | +-- 13100768.csv #> | | \-- 13100768_MetaData.csv #> | \------ README.md #> +-- figures #> | \-- tl3-comparisons.R #> +-- 01_download.md #> +-- 02_treat.md #> \-- README.md ``` ] --- # Working directory and paths .pull-left[ ### Working directory One way of adopting a project-based workflow is by __explicitly setting the working directory__ within scripts. ```r # only for the illustrative purposes # this is bad practice! see next slides setwd("V:/mortality/data/") ``` The __working directory__ is the folder that R will use to read and write files. ] .pull-right[ ### Paths You can read/write files in R using __absolute paths__ that give the location of the file relative to the root directory. ``` "V:/mortality/data/aus.rds" ``` Or using __relative paths__, that give the location of the file relative to the working directory. ``` "data/aus.rds" ``` ] --- # β οΈ Avoid using `setwd()` Although presented earlier, know that `setwd()` is considered __bad practice__. - If you're going to use a __fully self-contained project__, better tools exist. Like [RStudio projects](https://rstats.wtf/project-oriented-workflow.html#rstudio-projects), or the [here](https://here.r-lib.org/) package. - If your project needs to eventually read from or write to files __outside of the project__, then it's better to use an absolute path in the read/write functions directly, without changing the working directory. .footnote[[What They Forgot to Teach You About R, Chapter 2](https://rstats.wtf/project-oriented-workflow.html)] --- # RStudio projects To facilitate project-based workflows, RStudio implemented [projects](https://r4ds.had.co.nz/workflow-projects.html#rstudio-projects). .pull-left[ Projects are text files with a `.Rproj` extension. Double clicking one of them will launch the RStudio IDE and will set the R __working directory__ to the folder containing the `.Rproj` file. Other features of the IDE will also be affected, like the _Files_ pane, and tab autocompletion. The main advantage of using projects is that they are _portable_. ] .pull-right[ <img src="https://oecd-cfe-eds.github.io/good-enough-practices/figures/project0.PNG" width="80%" /> ] --- class: exercise # π©βπ» Create and use an RStudio project We will demonstrate the following: - How to create an RStudio project using the RStudio IDE. - How to switch between projects. - The the working directory changes. - How paths work. --- # OECD project folders Code, data and outputs should be stored in __dedicated project folders__. You can open a project via [Service Now](https://oecd.service-now.com/nav_to.do?uri=%2Fcom.glideapp.servicecatalog_cat_item_view.do%3Fv%3D1%26sysparm_id%3D1e3839fc4ff37e005f7668d18110c79c%26sysparm_link_parent%3D6a41e6d5dbf7730099b4a5ca0b961988%26sysparm_catalog%3D35d955d26f66310098ddc5cbbb3ee4dc%26sysparm_catalog_view%3Dcatalog_default%26sysparm_view%3Dcatalog_default). You will get an e-mail confirming the project has been created, with the required folder architecture: | Description | Path | |--|--| | For Sources | `\\main.oecd.org\em_sources\NAME_OF_PROJECT` | | For Work | `\\main.oecd.org\em_data\NAME_OF_PROJECT` | | For Results | `\\main.oecd.org\ASgenCFE\NAME_OF_PROJECT` (or `V:\NAME_OF_PROJECT`) | | For Archives | `\\main.oecd.org\em_arc\NAME_OF_PROJECT` | For Code Sharing | `https://algobank.oecd.org:4430/NAME_OF_PROJECT` | β οΈ When working with other Directorates, avoid using the `V:/` alias. The alias points to a different location for each Directorate. --- class: exercise # π©βπ» Leveraging the `config` R package The [__config__](https://cran.r-project.org/web/packages/config/vignettes/introduction.html) package makes it easy to manage environment specific configuration values. One good use-case is to write paths that you want to be available to everyone working on the same project. --- class: inverse, bottom, left # Files and file names --- # πΎ Good names The [three principles for file names](https://speakerdeck.com/jennybc/how-to-name-files) state that file names should: 1. Be machine readable 2. Be human readable 3. Play well will default ordering .footnote[Based on the presentation by [@JennyBryan](https://twitter.com/JennyBryan).] .pull-left[ .center[β __Bad__] ``` code.do Import des donnΓ©es avec accents.R histogram with spaces.png histogramWithNOINFO.PNG File with (Special characters).pdf ``` ] .pull-right[ .center[βοΈ __Good__] ``` analyse-prix.do import-donnees-enseignes.R fig01_histogram-price.png fig02_scatter-price-coverage.png 2017-03-04_speed-measurements.txt ``` ] --- # πΎ Good names ### 1. Machine readable .pull-left[ - Work well with [**regular expressions**](https://en.wikipedia.org/wiki/Regular_expression). - Should not contain accents, special characters or spaces. - Should preferably be in the same case ] .pull-right[ _For example:_ ``` CLIENT_FR_FVO_2015-09-29_17-41-42.txt CLIENT_FR_FVO_2015-09-29_17-41-54.txt CLIENT_FR_FVO_2015-09-30_14-05-57.txt CLIENT_FR_FVO_2015-11-05_10-28-55.txt CLIENT_FR_FVO_2015-11-06_10-28-55.txt CLIENT_FR_FVO_2015-11-09_16-10-04.txt ``` ] ??? The logic for making files work with regular expressions is that they make tasks like search and mass import easier. The only exceptions for special characters are for scores and underscores. Regarding the case, keep in mind that Windows is case insensitive, while Linux isn't. This means that code that works in your Windows computer may fail when you deploy it to a platform hosted in a Linux environment (e.g. Shinyapps.io, Algobank). --- # πΎ Good names ### 2. Human readable .pull-left[ .middle[- File names should be __informative__.] ] .pull-right[ _For example:_ ``` 01_geocodage-pdv-cible.R 02_preparation-couples.R 03_requests-pour-arcgis.R 04_impute-missing-dist.pdf 04_impute-missing-dist.Rmd 05_clean-distancier-traj.R 99_data-for-app.R ``` ] --- # πΎ Good names ### 3. Play well with default ordering .pull-left[ - Scripts should be numbered if they need to be run in a given order. - Think about left-padding numbers with `0`. - Dates should be in the format `YYYY-MM-DD` (international standard). ] .pull-right[ _For example:_ ``` 01_build-data17ca16.R 02_import-wrh17_0.R 03_build-data17_0.R checks01_x-et-y.html checks01_x-et-y.Rmd checks02_x-y-w.html checks02_x-y-w.Rmd ``` ] --- # README files .pull-left[ ### What they are... A [`README`](https://en.wikipedia.org/wiki/README) file contains information about other files in a directory or archive of computer software. Some examples: - [The Economist's Graphic Detail data](https://github.com/TheEconomist/graphic-detail-data) - [Replication repository of a WB paper](https://github.com/worldbank/brazil-pip-education/blob/master/README.md) ] .pull-right[ ### Tips for writing READMEs - Assume that you will not around to explain it! - Use a light-weight file format. - The convention is to use `.md` files, since these are rendered nicely in Git hosting platforms. ] --- class: exercise # π©βπ» Building a project's infrastructure We can automate some repetitive tasks using [fs](https://fs.r-lib.org/), an R package for file system operations. It's functions are divided into four main categories: - `path_` for manipulating and constructing paths - `file_` for files - `dir_` for directories - `link_` for links ??? Directory is a fancy name for a folder. --- class: inverse, bottom, left background-image: url(https://images.unsplash.com/photo-1421986872218-300a0fea5895?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=634&q=80) background-size: cover # Break
10
:
00
--- class: inverse, bottom, left # How to work interactively --- # Re-starting R and sourcing scripts The recommended workflow goes something like this: 1. Launch RStudio from your the `.Rproj` file 1. Open your script, make your edits, run your code Every once in a while: 1. Stop what you are doing 1. Restart R with `Ctrl + Shift + F10` 1. Source the script with `Ctrl + Shift + Enter` If you switch scripts, restart R again. --- class: exercise # π©βπ» Working interactively Let's demonstrate these steps. --- class: inverse, bottom, left # Code style --- class: middle ### Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread [Hadley Wickham](https://twitter.com/hadleywickham) --- # The tidyverse style guide This is the __R style guide__ that the tidyverse development team has decided to adopt. It is quickly becoming the default R style guide adopted by most R developers, data scientists and data analysts. .center[`http://style.tidyverse.org/index.html`] <br/> In addition, there's also the [`styler`](https://github.com/r-lib/styler) package, to interactively restyle text, files or projects. --- ## Packages and parameters at the top .pull-left[ ```r # Bad library(sf) deu_tl2 <- read_sf("path/to/deu-tl2.shp") # many, many lines library(tidyverse) deu_tl2_names <- read_csv("path/elsewhere/codes-tl2.csv") # even more lines library(writexl) write_xlsx(deu_tl2_final, "deu-tl2.xlsx") ``` ] .pull-right[ ```r # Good library(sf) library(tidyverse) library(writexl) path_shp <- "path/to/deu-tl2.shp" path_nms <- "path/elsewhere/codes-tl2.csv" path_out <- "deu-tl2.xlsx" deu_tl2 <- read_sf(path_shp) # many, many lines deu_tl2_names <- read_csv(path_nms) # even more lines write_xlsx(deu_tl2_final, path_out) ``` ] --- ## Use snake case for variable and function names ```r # Good day_one day_1 # Bad DayOne dayone ``` --- ## Respect spacing .pull-left[ ```r # Good mean(x, na.rm = TRUE) # Bad mean (x, na.rm = TRUE) mean( x, na.rm = TRUE ) ``` ```r # Good height <- (feet * 12) + inches mean(x, na.rm = TRUE) # Bad height<-feet*12+inches mean(x, na.rm=TRUE) ``` ] .pull-right[ ```r # Good if (debug) { show(x) } # Bad if(debug){ show(x) } ``` ] --- ## Avoid long lines ```r # Good do_something_very_complicated( something = "that", requires = many, arguments = "some of which may be long" ) # Bad do_something_very_complicated("that", requires, many, arguments, "some of which may be long" ) ``` --- ## Comments should explain the "why" not the "what" ```r # Good # Take out "not regionalised" observations data_checked <- data_iso2 %>% filter(check != "ZZ") # Bad # Filter observations data_checked <- data_iso2 %>% filter(check != "ZZ") ``` --- ## Follow the `%>%` with a new line ```r # Good iris %>% group_by(Species) %>% summarize_if(is.numeric, mean) %>% ungroup() %>% gather(measure, value, -Species) %>% arrange(value) # Bad iris %>% group_by(Species) %>% summarize_all(mean) %>% ungroup %>% gather(measure, value, -Species) %>% arrange(value) ``` --- class: exercise # π Restyle code Head over to the [RStudio Cloud Project](https://rstudio.cloud/project/3022018) and open the `air pollution.R` script. What would you change given what you have learned today?
05
:
00
--- class: inverse, bottom, left # Adopting practices from software engineering --- # Software engineering practices Doing data analysis and data science with R involves a lot of programming, so it makes sense to adopt engineering principles that help improve the __readability__ and __maintainability__ of code. - __Code reviews__: These can be for quality control, to help onboard someone to a project, or for pedagogical purposes. - __DRY principle__: Stands for _Do not repeat yourself_. Concretely, this might mean creating a function to replace several instances of copy-pasted code. - __Refactoring__: The process of going over your code again to improve its quality and remove redundancies. - __Technical debt__: What you accrue with quick fixes. Useful term to frame discussions. - __Defensive programming__: Adding tests and checks for eventual edge cases or changes. --- class: exercise # π©βπ» Reviewing a script Let's do a live code review of the code that was styled during the previous exercise. --- class: inverse, bottom, left background-image: url(img/gradient-background.png) background-size: cover # Annex --- # Back-slashes in paths Back-slashes (`\`) have a special meaning in R, so you must use forward-slashes (`/`): ```r fs::dir_exists("C:\Users\Caldas_M") fs::dir_exists("C:/Users/Caldas_M") #> Error: '\U' used without hex digits in character string starting ""C:\U" ``` Alternatively, you can _escape_ a back-slash by doubling it: ```r fs::dir_exists("C:\\Users\\Caldas_M") #> C:\\Users\\Caldas_M #> TRUE ``` Or use the special `r"(...)"` notation introduced in `R 4.0.0`: ```r fs::dir_exists(r"(C:\Users\Caldas_M)") #> C:\\Users\\Caldas_M #> TRUE ``` --- # UNC paths and mapped drives There are two ways to call a file in our local network. Either using the UNC path, making sure to properly escape the back-slashes: ```r dir(r"(\\main.oecd.org\ASgenCFE\EXCESS_MORTALITY)") #> [1] "COVID19 Regional Determinants" "data-fetching" ``` Or using the path with the associated mapped drive letter: ```r dir("V:/EXCESS_MORTALITY") #> [1] "COVID19 Regional Determinants" "data-fetching" ``` β οΈ Use the UNC path for `V:/` when you're working with colleagues from another Directorate! o --- # To learn more .pull-left[ ### Books - [What They Forgot to Teach You About R](https://rstats.wtf/index.html), Part I. - [R for Data Science](https://r4ds.had.co.nz/), "Workflow" chapters in Part I. ### What others are doing - [Using R at Grattan Institute](https://grattan.github.io/R_at_Grattan/index.html) (Australia) - [utilitR](https://www.book.utilitr.org/index.html) project, by the Insee (France) ] .pull-right[ ### Previous presentations - [Good Practice in Data Analysis with R](https://algobank.oecd.org:4430/MariaPaula.CALDAS/rladies-good-practice/-/tree/master), OECD R-Ladies - [Code Reviews](https://algobank.oecd.org:4430/r-ladies/index/-/blob/master/meeting_2021-01-29.md), OECD R-Ladies - [Workflow Tips and Tricks](https://oecd-cfe-eds.github.io/r-ladies-workflow/#1), R-Ladies Dammam ]