Good practice and workflow with R

class: title-slide, inverse, bottom
background-image: url(img/gradient-background.png)
background-size: cover

# Good practice and workflow with R
### CFE R Training - Module 5

<br/>

María Paula Caldas and Jolien Noels

---
class: middle

# Useful links

[Slides](https://oecd-cfe-eds.github.io/cfe-r-training/05_goodpractice.html), if you want to navigate on your own

[RStudio Project](https://rstudio.cloud/project/3022018), to try out the exercises

[Teams Space](https://teams.microsoft.com/l/team/19%3aewi8FvNssJHrCsxFDSJbA7IL4q4kGH0E8IRMfMp8zPA1%40thread.tacv2/conversations?groupId=c957fd70-0f85-4bcc-b3a4-e453919316de&tenantId=ac41c7d4-1f61-460d-b0f4-fc925a2b471c), for discussions

[Github repository](https://github.com/mpaulacaldas/cfe-r-training), for later reference

---
class: middle

# Housekeeping matters

🙋‍♀️&nbsp; During the session, ask questions in the chat 
or raise your hand

📷&nbsp; Sessions are recorded. Remember to turn off your camera if 
its your preference

💬&nbsp; After the session, post follow-up questions, 
comments or reactions in the [Teams space](https://teams.microsoft.com/l/team/19%3aewi8FvNssJHrCsxFDSJbA7IL4q4kGH0E8IRMfMp8zPA1%40thread.tacv2/conversations?groupId=c957fd70-0f85-4bcc-b3a4-e453919316de&tenantId=ac41c7d4-1f61-460d-b0f4-fc925a2b471c)

📝&nbsp; If you are going through these slides on your own, type `p` 
to see the presenter notes

???

The presenter notes are where you might also find OECD or CFE specific 
information.

---
class: middle

# Learning objectives for today

1. Understand the some of the main principles of good practice in R

1. Learn about R packages and tools that can improve your workflow

1. Get acquainted with software engineering principles to improve the quality of your code

???

Notably, today we won't cover good practices in other programming languages or software.

Missing also is one key aspect of modern data science: version control. At OECD, we use Git and Gitlab (a.k.a. Algobank). Drafting of an unified manual is currently in the works.

---
class:inverse, bottom, left

# Introduction

---
# What is good practice?

The aim of good practice is to make analyses __easier to understand__ and __easier to reproduce__ by others.

- By _others_, also think about _future you_. It's not the same person as present you.

Implementing good practices can come with a __cost__

- More time spent writing documentation
- Re-writing code
- Learning new tools.

The key is learning to adopt practices that are __good enough__ for the task at hand

---
# What is a workflow?

A workflow refers to the way in which we work on projects, either by ourselves or with others.

- The way we organise folders, scripts and documents.

- The way we share information with others.

- The way we handle repetitive tasks.

---
# Why bother with workflows?

The aim of a workflow is to provide __consistency__, reducing the __cognitive load__ of repetitive tasks.

For example:

- Where is the raw data?
- Where is the code for Figure 7?
- Was this table based in the latest data?
- What is the link to the latest version of the report again?

---
class:inverse, bottom, left

# RStudio configurations

---
# Beware of `.RData`

.pull-left[
### What are `.RData` files?
By default, R saves all elements from the workspace to an `.RData` file that will 
be loaded by default on your next session.

Although it sounds convenient, __you should not rely on this behaviour!__
]
.pull-right[
### Instead...

- Make sure all of your steps are documented in R scripts.

- Load and save intermediary objects using those scripts.

- Configure RStudio to not rely on `.RData`.

]

---
background-image: url(img/rdata.png)
background-position: 600px 70px
background-size: 550px

# Use a blank slate

.pull-left[

<br/>

In the RStudio IDE, you can change the configurations via:

`Tools -> Global Options -> General`

<br/>

Alternatively, in the Console:

```r
usethis::use_blank_slate()
```

]

---
background-image: url(https://raw.githubusercontent.com/mpaulacaldas/good-enough-practices/master/figures/global-opts-encoding.PNG)
background-position: 600px 70px
background-size: 550px

## Use UTF-8 encoding

.pull-left[
By default, scripts saved with the RStudio IDE will be saved with the computer's default [__encoding__](https://en.wikipedia.org/wiki/Character_encoding).

To avoid issues when running your scripts on different operating systems, default text files to `UTF-8` encoding.

<br/>

In the RStudio IDE:

`Tools -> Global Options -> `
`Code -> Saving`

]

???

If you're using Windows, this will likely create issues when you push your code to a computer with a Linux OS. Some of the tools we use that rely on Linux machines include shinyapps.io, to deploy Shiny apps, and the Algobank (our Gitlab instance).

---
background-image: url(https://raw.githubusercontent.com/mpaulacaldas/good-enough-practices/master/figures/global-opts-margins.PNG)
background-position: 600px 70px
background-size: 550px

# Nice-to-have's

.pull-left[

Code gets harder to read the wider it becomes.

Try to respect the [__80 character limit__](https://softwareengineering.stackexchange.com/questions/148677/why-is-80-characters-the-standard-limit-for-code-width)

<br/>

In the RStudio IDE:

`Tools -> Global Options -> Code -> Display`

]

---
class:inverse, bottom, left

# Projects and folder structure

---
# Project-based workflows

.pull-left[
The convention in many programming languages is to use __project-based workflows.__

These workflows assume that _all_ files associated with a project are stored in the same folder, with a nested architecture.

---
# Working directory and paths

.pull-left[
### Working directory
One way of adopting a project-based workflow is by __explicitly setting the working directory__ within scripts.

```r
# only for the illustrative purposes
# this is bad practice! see next slides
setwd("V:/mortality/data/")
```

The __working directory__ is the folder that R will use to read and write files.

]
.pull-right[
### Paths

You can read/write files in R using __absolute paths__ that give the location of the file relative to the root directory.

```
"V:/mortality/data/aus.rds"
```

Or using __relative paths__, that give the location of the file relative to the working directory.

```
"data/aus.rds"
```

]

---
# ⚠️&nbsp; Avoid using `setwd()`

Although presented earlier, know that `setwd()` is considered __bad practice__.

- If you're going to use a __fully self-contained project__, better tools exist. Like [RStudio projects](https://rstats.wtf/project-oriented-workflow.html#rstudio-projects), or the [here](https://here.r-lib.org/) package.

- If your project needs to eventually read from or write to files __outside of the project__, then it's better to use an absolute path in the read/write functions directly, without changing the working directory.

.footnote[[What They Forgot to Teach You About R, Chapter 2](https://rstats.wtf/project-oriented-workflow.html)]

---
# RStudio projects

To facilitate project-based workflows, RStudio implemented [projects](https://r4ds.had.co.nz/workflow-projects.html#rstudio-projects).

.pull-left[
Projects are text files with a `.Rproj` extension.

Double clicking one of them will launch the RStudio IDE and will set the R __working directory__ to the folder containing the `.Rproj` file.

Other features of the IDE will also be affected, like the _Files_ pane, and tab autocompletion.

The main advantage of using projects is that they are _portable_.

]
.pull-right[

<img src="https://oecd-cfe-eds.github.io/good-enough-practices/figures/project0.PNG" width="80%" />
]

---
class: exercise

# 👩‍💻&nbsp; Create and use an RStudio project

We will demonstrate the following:

- How to create an RStudio project using the RStudio IDE.
- How to switch between projects.
- The the working directory changes.
- How paths work.

---
# OECD project folders

Code, data and outputs should be stored in __dedicated project folders__. You can open a project via [Service Now](https://oecd.service-now.com/nav_to.do?uri=%2Fcom.glideapp.servicecatalog_cat_item_view.do%3Fv%3D1%26sysparm_id%3D1e3839fc4ff37e005f7668d18110c79c%26sysparm_link_parent%3D6a41e6d5dbf7730099b4a5ca0b961988%26sysparm_catalog%3D35d955d26f66310098ddc5cbbb3ee4dc%26sysparm_catalog_view%3Dcatalog_default%26sysparm_view%3Dcatalog_default).

You will get an e-mail confirming the project has been created, with the required folder architecture:

⚠️&nbsp; When working with other Directorates, avoid using the `V:/` alias. The alias points to a different location for each Directorate.

---
class: exercise

# 👩‍💻&nbsp; Leveraging the `config` R package

The [__config__](https://cran.r-project.org/web/packages/config/vignettes/introduction.html) package makes it easy to manage environment specific configuration values.

One good use-case is to write paths that you want to be available to everyone working on the same project.

---
class: inverse, bottom, left

# Files and file names

---
# 💾&nbsp; Good names

The [three principles for file names](https://speakerdeck.com/jennybc/how-to-name-files) state that file names should:

1. Be machine readable
2. Be human readable
3. Play well will default ordering

.footnote[Based on the presentation by [@JennyBryan](https://twitter.com/JennyBryan).]

.pull-left[

.center[❌&nbsp; __Bad__]

```
code.do
Import des données avec accents.R
histogram with spaces.png
histogramWithNOINFO.PNG
File with (Special characters).pdf
```

]

.pull-right[

.center[✔️&nbsp; __Good__]

```
analyse-prix.do
import-donnees-enseignes.R
fig01_histogram-price.png
fig02_scatter-price-coverage.png
2017-03-04_speed-measurements.txt
```

]

---

# 💾&nbsp; Good names
### 1. Machine readable

.pull-left[

- Work well with [**regular expressions**](https://en.wikipedia.org/wiki/Regular_expression).

- Should not contain accents, special characters or spaces.

- Should preferably be in the same case

]
.pull-right[

_For example:_

```
CLIENT_FR_FVO_2015-09-29_17-41-42.txt
CLIENT_FR_FVO_2015-09-29_17-41-54.txt
CLIENT_FR_FVO_2015-09-30_14-05-57.txt
CLIENT_FR_FVO_2015-11-05_10-28-55.txt
CLIENT_FR_FVO_2015-11-06_10-28-55.txt
CLIENT_FR_FVO_2015-11-09_16-10-04.txt
```
]

???

The logic for making files work with regular expressions is that they make tasks like search and mass import easier.

The only exceptions for special characters are for scores and underscores.

Regarding the case, keep in mind that Windows is case insensitive, while Linux isn't. This means that code that works in your Windows computer may fail when you deploy it to a platform hosted in a Linux environment (e.g. Shinyapps.io, Algobank).

---

# 💾&nbsp; Good names
### 2. Human readable

.pull-left[

.middle[- File names should be __informative__.]

]
.pull-right[

_For example:_

```
01_geocodage-pdv-cible.R
02_preparation-couples.R
03_requests-pour-arcgis.R 
04_impute-missing-dist.pdf
04_impute-missing-dist.Rmd
05_clean-distancier-traj.R
99_data-for-app.R
```
]

---

# 💾&nbsp; Good names
### 3. Play well with default ordering

.pull-left[

- Scripts should be numbered if they need to be run in a given order.

- Think about left-padding numbers with `0`.

- Dates should be in the format `YYYY-MM-DD` (international standard).

]
.pull-right[

_For example:_

```
01_build-data17ca16.R
02_import-wrh17_0.R
03_build-data17_0.R
checks01_x-et-y.html
checks01_x-et-y.Rmd
checks02_x-y-w.html
checks02_x-y-w.Rmd
```
]

---
#  README files

.pull-left[
### What they are...

A [`README`](https://en.wikipedia.org/wiki/README) file contains information about other files in a directory or archive of computer software.

Some examples:

- [The Economist's Graphic Detail data](https://github.com/TheEconomist/graphic-detail-data)
- [Replication repository of a WB paper](https://github.com/worldbank/brazil-pip-education/blob/master/README.md)

]

.pull-right[
### Tips for writing READMEs

- Assume that you will not around to explain it!

- Use a light-weight file format.

- The convention is to use `.md` files, since these are rendered nicely in Git hosting platforms.

]

---
class: exercise

# 👩‍💻&nbsp; Building a project's infrastructure

We can automate some repetitive tasks using [fs](https://fs.r-lib.org/), an R package for file system operations.

It's functions are divided into four main categories:

- `path_` for manipulating and constructing paths
- `file_` for files
- `dir_` for directories
- `link_` for links

???
Directory is a fancy name for a folder.

---
class: inverse, bottom, left
background-image: url(https://images.unsplash.com/photo-1421986872218-300a0fea5895?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=634&q=80)
background-size: cover

# Break

---
class: inverse, bottom, left

# How to work interactively

---

# Re-starting R and sourcing scripts

The recommended workflow goes something like this:

1. Launch RStudio from your the `.Rproj` file
1. Open your script, make your edits, run your code

Every once in a while:

1. Stop what you are doing
1. Restart R with `Ctrl + Shift + F10`
1. Source the script with `Ctrl + Shift + Enter`

If you switch scripts, restart R again.

---
class: exercise

# 👩‍💻&nbsp; Working interactively

Let's demonstrate these steps.

---
class: inverse, bottom, left

# Code style

---
class: middle

### Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread

[Hadley Wickham](https://twitter.com/hadleywickham)

---
# The tidyverse style guide

This is the __R style guide__ that the tidyverse development team has decided to adopt.

It is quickly becoming the default R style guide adopted by most R developers, data scientists and data analysts.

.center[`http://style.tidyverse.org/index.html`]

<br/>

In addition, there's also the [`styler`](https://github.com/r-lib/styler) package, to interactively restyle text, files or projects.

---
## Packages and parameters at the top

.pull-left[

```r
# Bad

library(sf)

deu_tl2 <- read_sf("path/to/deu-tl2.shp")

# many, many lines

library(tidyverse)
deu_tl2_names <- read_csv("path/elsewhere/codes-tl2.csv")

# even more lines

library(writexl)
write_xlsx(deu_tl2_final, "deu-tl2.xlsx")
```

]
.pull-right[

```r
# Good

library(sf)
library(tidyverse)
library(writexl)

path_shp <- "path/to/deu-tl2.shp"
path_nms <- "path/elsewhere/codes-tl2.csv"
path_out <- "deu-tl2.xlsx"

deu_tl2 <- read_sf(path_shp)

# many, many lines

deu_tl2_names <- read_csv(path_nms)

# even more lines

write_xlsx(deu_tl2_final, path_out)
```

]

---
## Use snake case for variable and function names

```r
# Good
day_one
day_1

# Bad
DayOne
dayone
```

---
## Respect spacing

.pull-left[

```r
# Good
mean(x, na.rm = TRUE)

# Bad
mean (x, na.rm = TRUE)
mean( x, na.rm = TRUE )
```

```r
# Good
height <- (feet * 12) + inches
mean(x, na.rm = TRUE)

# Bad
height<-feet*12+inches
mean(x, na.rm=TRUE)
```

]

.pull-right[

```r
# Good
if (debug) {
  show(x)
}

# Bad
if(debug){
  show(x)
}
```

]

---
## Avoid long lines

```r
# Good
do_something_very_complicated(
  something = "that",
  requires = many,
  arguments = "some of which may be long"
)

# Bad
do_something_very_complicated("that", requires, many, arguments,
                              "some of which may be long"
                              )
```

---
## Comments should explain the "why" not the "what"

```r
# Good

# Take out "not regionalised" observations
data_checked <- data_iso2 %>% 
  filter(check != "ZZ")

# Bad

# Filter observations
data_checked <- data_iso2 %>% 
  filter(check != "ZZ")
```

---
## Follow the `%>%` with a new line

```r
# Good
iris %>%
  group_by(Species) %>%
  summarize_if(is.numeric, mean) %>%
  ungroup() %>%
  gather(measure, value, -Species) %>%
  arrange(value)

# Bad
iris %>% group_by(Species) %>% summarize_all(mean) %>%
ungroup %>% gather(measure, value, -Species) %>%
arrange(value)
```

---
class: exercise

# 📝&nbsp; Restyle code

Head over to the [RStudio Cloud Project](https://rstudio.cloud/project/3022018) and open the `air pollution.R` script. What would you change given what you have learned today?

---
class: inverse, bottom, left

# Adopting practices from software engineering

---
# Software engineering practices

Doing data analysis and data science with R involves a lot of programming, so it makes sense to adopt engineering principles that help improve the __readability__ and __maintainability__ of code.

- __Code reviews__: These can be for quality control, to help onboard someone to a project, or for pedagogical purposes.

- __DRY principle__: Stands for _Do not repeat yourself_. Concretely, this might mean creating a function to replace several instances of copy-pasted code.

- __Refactoring__: The process of going over your code again to improve its
quality and remove redundancies.

- __Technical debt__: What you accrue with quick fixes. Useful term to
frame discussions.

- __Defensive programming__: Adding tests and checks for eventual edge cases or changes.

---
class: exercise

# 👩‍💻&nbsp; Reviewing a script

Let's do a live code review of the code that was styled during the previous exercise.

---
class: inverse, bottom, left
background-image: url(img/gradient-background.png)
background-size: cover

# Annex

---
# Back-slashes in paths

Back-slashes (`\`) have a special meaning in R, so you must use forward-slashes (`/`):

```r
fs::dir_exists("C:\Users\Caldas_M")
fs::dir_exists("C:/Users/Caldas_M")
#> Error: '\U' used without hex digits in character string starting ""C:\U"
```

Alternatively, you can _escape_ a back-slash by doubling it:

```r
fs::dir_exists("C:\\Users\\Caldas_M")
#> C:\\Users\\Caldas_M 
#>                TRUE
```

Or use the special `r"(...)"` notation introduced in `R 4.0.0`:

```r
fs::dir_exists(r"(C:\Users\Caldas_M)")
#> C:\\Users\\Caldas_M 
#>                TRUE
```

---
# UNC paths and mapped drives

There are two ways to call a file in our local network.

Either using the UNC path, making sure to properly escape the back-slashes:

```r
dir(r"(\\main.oecd.org\ASgenCFE\EXCESS_MORTALITY)")
#> [1] "COVID19 Regional Determinants" "data-fetching"
```

Or using the path with the associated mapped drive letter:

```r
dir("V:/EXCESS_MORTALITY")
#> [1] "COVID19 Regional Determinants" "data-fetching"
```

⚠️&nbsp; Use the UNC path for `V:/` when you're working with colleagues from another Directorate! o

---
# To learn more

.pull-left[
### Books

- [What They Forgot to Teach You About R](https://rstats.wtf/index.html), Part I.
- [R for Data Science](https://r4ds.had.co.nz/), "Workflow" chapters in Part I.

### What others are doing

- [Using R at Grattan Institute](https://grattan.github.io/R_at_Grattan/index.html) (Australia)
- [utilitR](https://www.book.utilitr.org/index.html) project, by the Insee (France)
]
.pull-right[
### Previous presentations

- [Good Practice in Data Analysis with R](https://algobank.oecd.org:4430/MariaPaula.CALDAS/rladies-good-practice/-/tree/master), OECD R-Ladies
- [Code Reviews](https://algobank.oecd.org:4430/r-ladies/index/-/blob/master/meeting_2021-01-29.md), OECD R-Ladies
- [Workflow Tips and Tricks](https://oecd-cfe-eds.github.io/r-ladies-workflow/#1), R-Ladies Dammam

]