class: title-slide, inverse, bottom
background-image: url(img/gradient-background.png)
background-size: cover

# Data visualisation with ggplot2

### CFE R Training - Module 2

<br/>

María Paula Caldas and Jolien Noels

---
class: middle

# Useful links

[Slides](https://oecd-cfe-eds.github.io/cfe-r-training/02_dataviz.html), if you want to navigate on your own

[RStudio Project](https://rstudio.cloud/project/2907237), to try out the exercises

[Teams Space](https://teams.microsoft.com/l/team/19%3aewi8FvNssJHrCsxFDSJbA7IL4q4kGH0E8IRMfMp8zPA1%40thread.tacv2/conversations?groupId=c957fd70-0f85-4bcc-b3a4-e453919316de&tenantId=ac41c7d4-1f61-460d-b0f4-fc925a2b471c), for discussions

[GitHub repository](https://github.com/mpaulacaldas/cfe-r-training), for later reference

---
class: middle

# Housekeeping matters

🙋‍♀️ During the session, ask questions in the chat or raise your hand

📷 Sessions are recorded. Remember to turn off your camera if that is your preference

💬 After the session, post follow-up questions, comments or reactions in the [Teams space](https://teams.microsoft.com/l/team/19%3aewi8FvNssJHrCsxFDSJbA7IL4q4kGH0E8IRMfMp8zPA1%40thread.tacv2/conversations?groupId=c957fd70-0f85-4bcc-b3a4-e453919316de&tenantId=ac41c7d4-1f61-460d-b0f4-fc925a2b471c)

If you are going through these slides on your own, type `p` to see the presenter notes

???

The presenter notes are where you might also find OECD- or CFE-specific information.

---
class: middle

# Learning objectives for today

1. Understand the main ideas behind the "layered grammar of graphics" and "tidy data"
1. Learn how to create basic charts with ggplot2
1. Learn to reshape data between wide and long format using the pivot functions from tidyr

---
class:inverse, bottom, left

# Introduction to data visualisation in R

---

# Data visualisation in R

<img src="02_dataviz_files/figure-html/unnamed-chunk-1-1.png" width="80%" style="display: block; margin: auto;" />

???

There are many possible ways to visualise data with R.
In general, libraries are divided into two camps: static visualisations and dynamic visualisations. Today we will concentrate on ggplot2, a package for highly customisable static visualisations that facilitates quick iteration and data exploration.

---

# Data visualisation in R
???

Although not covered today, there are many other R packages designed for interactive data visualisation. This example shows highcharter. Another popular one is plotly. For maps, there is leaflet.

---
background-position: 1050px 20px
background-size: 100px
background-image: url(https://raw.githubusercontent.com/tidyverse/ggplot2/master/man/figures/logo.png)

# Why ggplot2?

Today, we will focus on __static data visualisation__ using the [ggplot2](https://ggplot2.tidyverse.org/) package, from the tidyverse.

There are some advantages to using `ggplot2`:

1. It is based on the [Grammar of Graphics](http://vita.had.co.nz/papers/layered-grammar.pdf), a theory for building statistical graphics.
1. It provides a high degree of customisation.
1. It allows for fast exploration and iteration.
1. It is the most popular R graphics library, is actively maintained and has many extensions.

---
background-position: 1050px 20px
background-size: 100px
background-image: url(https://raw.githubusercontent.com/tidyverse/ggplot2/master/man/figures/logo.png)

# The layered grammar of graphics

Below we have an example of the grammar that we would use in ggplot2 to create a graph. Our focus today will be on the highlighted elements.

.pull-left[
```r
*ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
*   mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
* <FACET_FUNCTION> +
* <SCALE_FUNCTION> +
  <THEME_FUNCTION>
```
]

.pull-right[
- `DATA`: Data we want to visualise.
- `GEOM`: Geometric shapes representing the data.
- `MAPPINGS`: Aesthetic mappings of the geometric and statistical objects.
- `FACET`: Arrangement of data into a grid of plots.
- `SCALE`: Transformations to the `MAPPINGS`.
]

.footnote[
Inspired by the [World Bank's DIME Analytics](https://raw.githack.com/worldbank/dime-r-training/master/Presentations/03-data-visualization.html#12) training.
]

???

A chart is made up of different layers.
To combine layers, we use the plus sign (`+`).

---
class:inverse, bottom, left

# Introduction to ggplot2

---

# Data and packages

In addition to the tidyverse, we will use the [gapminder](https://github.com/jennybc/gapminder/) package for its data.

.footnote[This data is only for __teaching purposes__. You may remember it from some [TED Talks](https://www.ted.com/playlists/474/the_best_hans_rosling_talks_yo).]

.pull-left[
```r
library(gapminder)
library(tidyverse)
#> -- Attaching packages ---------------------------------------------- tidyverse 1.3.1 --
#> v tibble  3.1.2     v dplyr   1.0.7
#> v tidyr   1.1.3     v stringr 1.4.0
#> v readr   1.4.0     v forcats 0.5.1
#> v purrr   0.3.4
#> -- Conflicts ------------------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
```
]

.pull-right[
<img src="https://talkstar-assets.s3.amazonaws.com/production/playlists/playlist_474/2d21d632-4d8c-4555-b38b-41caa4419e8e/best_hans_talks_1200x627.jpg" width="100%" style="display: block; margin: auto;" />
]

---

# The gapminder data

The full dataset has 1,704 rows and 6 columns. The first two columns are of type _factor_. This is useful for plotting.

```r
gapminder
#> # A tibble: 1,704 x 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # ... with 1,694 more rows
```

???

Factor variables are mainly useful in plotting. They look like strings (i.e. character vectors) but in reality they are numbers (i.e.
numeric vectors) with an associated label. These two characteristics make them very useful for plotting because we can use them to order things.

For now, you don't need to retain any more information than that about factors. We will explore them in a bit more detail in the exercises.

---

# Subsetting gapminder

```r
gapminder_example <- gapminder %>%
  filter(year %in% c(1957, 2007))

summary(gapminder_example)
#>         country       continent        year         lifeExp           pop
#>  Afghanistan:  2   Africa  :104   Min.   :1957   Min.   :30.33   Min.   :6.132e+04
#>  Albania    :  2   Americas: 50   1st Qu.:1957   1st Qu.:45.64   1st Qu.:2.876e+06
#>  Algeria    :  2   Asia    : 66   Median :1982   Median :60.47   Median :7.297e+06
#>  Angola     :  2   Europe  : 60   Mean   :1982   Mean   :59.26   Mean   :3.139e+07
#>  Argentina  :  2   Oceania :  4   3rd Qu.:2007   3rd Qu.:72.39   3rd Qu.:2.185e+07
#>  Australia  :  2                  Max.   :2007   Max.   :82.60   Max.   :1.319e+09
#>  (Other)    :272
#>    gdpPercap
#>  Min.   :   277.6
#>  1st Qu.:  1237.8
#>  Median :  3499.0
#>  Mean   :  7989.7
#>  3rd Qu.:  9662.5
#>  Max.   :113523.1
#>
```

---

# Setting the stage

.pull-left[
```r
ggplot(data = gapminder_example)
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-10-1.png" width="504" />
]

---

# Aesthetics: x and y

.pull-left[
There are two arguments we can pass to `ggplot()`: `data` and `mapping`. The second must always use the function `aes()`.

```r
ggplot(
  data = gapminder_example,
* mapping = aes(x = gdpPercap, y = lifeExp)
)
```

Or alternatively...

```r
gapminder_example %>%
  ggplot(aes(x = gdpPercap, y = lifeExp))
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-13-1.png" width="504" />
]

---

# Geometries

.pull-left[
```r
gapminder_example %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
* geom_point()
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-15-1.png" width="504" />
]

---
class: exercise

# Code along #1

1. Open today's [RStudio Cloud project](https://rstudio.cloud/project/2907237)
1. Open the `code-along.R` script and run the code from the slides above.
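---

# Aesthetics: where to declare them

The slides above pass `aes()` to `ggplot()`. As a minimal sketch, assuming the same `gapminder_example` subset, the mapping can also be declared inside the geom. The names `p_global`, `p_local` and `same_points` are ours and only serve the illustration.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Same subset as in the slides
gapminder_example <- gapminder %>%
  filter(year %in% c(1957, 2007))

# Mapping declared in ggplot(): inherited by every layer added afterwards
p_global <- ggplot(gapminder_example, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# Mapping declared inside the geom: applies to that layer only
p_local <- ggplot(gapminder_example) +
  geom_point(aes(x = gdpPercap, y = lifeExp))

# layer_data() returns the data ggplot2 computed for a layer, so we can
# compare the point positions of the two plots without drawing them
same_points <- isTRUE(all.equal(
  layer_data(p_global)[c("x", "y")],
  layer_data(p_local)[c("x", "y")]
))
same_points
```

Declaring the mapping in `ggplot()` is usually more convenient when several layers share the same `x` and `y`.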
---

# More aesthetics: colours

.pull-left[
```r
gapminder_example %>%
  ggplot(
    aes(
      x = gdpPercap,
      y = lifeExp,
*     colour = continent
    )
  ) +
  geom_point()
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-18-1.png" width="504" />
]

---

# More aesthetics: shape

.pull-left[
```r
gapminder_example %>%
  ggplot(
    aes(
      x = gdpPercap,
      y = lifeExp,
*     shape = continent
    )
  ) +
  geom_point()
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-20-1.png" width="504" />
]

---
class: exercise

# Code along #2

1. Open today's [RStudio Cloud project](https://rstudio.cloud/project/2907237)
1. Open the `code-along.R` script and run the code from the slides above.
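---

# Mapping vs. setting an aesthetic

The two slides above map `colour` and `shape` to a column inside `aes()`. A common point of confusion is the difference with setting a constant value outside `aes()`. The sketch below, assuming the same `gapminder_example` subset, contrasts the two. The colour `"steelblue"` and the object names are arbitrary choices of ours.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gapminder_example <- gapminder %>%
  filter(year %in% c(1957, 2007))

# Mapped: the colour depends on a column, so it goes inside aes()
p_mapped <- gapminder_example %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(colour = continent))

# Set: one constant colour for all points, so it goes outside aes()
p_set <- gapminder_example %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point(colour = "steelblue")

# Number of distinct colours that ggplot2 ends up drawing in each plot
n_colours_mapped <- length(unique(layer_data(p_mapped)$colour))
n_colours_set <- length(unique(layer_data(p_set)$colour))
```

Only the mapped version produces a legend, because only there does the colour carry information from the data.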
---

# Scale transformations

.pull-left[
As a reminder, this is the default view without any scale transformations.

```r
gapminder_example %>%
  ggplot(
    aes(
      x = gdpPercap,
      y = lifeExp,
      colour = continent
    )
  ) +
  geom_point()
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-23-1.png" width="504" />
]

---

# Scale transformations

.pull-left[
With a transformation to the `x` aesthetic:

```r
gapminder_example %>%
  ggplot(
    aes(
*     x = gdpPercap,
      y = lifeExp,
      colour = continent
    )
  ) +
  geom_point() +
* scale_x_log10()
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-25-1.png" width="504" />
]

---

# Scale transformations

.pull-left[
With a transformation to the `colour` aesthetic:

```r
gapminder_example %>%
  ggplot(
    aes(
      x = gdpPercap,
      y = lifeExp,
*     colour = continent
    )
  ) +
  geom_point() +
  scale_x_log10() +
* scale_colour_brewer(palette = "Dark2")
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-27-1.png" width="504" />
]

---
class: exercise

# Code along #3

1. Open today's [RStudio Cloud project](https://rstudio.cloud/project/2907237)
1. Open the `code-along.R` script and run the code from the slides above.
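---

# Scale transformations: names and labels

Scale functions also control how an aesthetic is named and labelled. The sketch below builds on the previous slides and assumes the scales package is installed (it comes with ggplot2). The axis and legend titles are our own wording, not part of the original material.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gapminder_example <- gapminder %>%
  filter(year %in% c(1957, 2007))

p_scaled <- gapminder_example %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point() +
  # log10 transformation, with a title and comma-separated tick labels
  scale_x_log10(
    name = "GDP per capita (log scale)",
    labels = scales::label_comma()
  ) +
  # same palette as in the slides, plus a legend title
  scale_colour_brewer(name = "Continent", palette = "Dark2")

# The x positions that ggplot2 computes are on the log10 scale
x_built <- layer_data(p_scaled)$x
```

Note that the scale transforms the positions used for drawing. The `gdpPercap` column in `gapminder_example` is left untouched.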
---

# Faceting

.pull-left[
```r
gapminder_example %>%
  ggplot(
    aes(
      x = gdpPercap,
      y = lifeExp,
      colour = continent
    )
  ) +
  geom_point(show.legend = FALSE) +
* facet_wrap(~ continent)
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-30-1.png" width="504" />
]

---

# Faceting

.pull-left[
```r
gapminder_example %>%
  ggplot(
    aes(
      x = gdpPercap,
      y = lifeExp,
      colour = continent
    )
  ) +
  geom_point(show.legend = FALSE) +
* facet_grid(continent ~ year)
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-32-1.png" width="504" />
]

---
class: exercise

# Code along #4

1. Open today's [RStudio Cloud project](https://rstudio.cloud/project/2907237)
1. Open the `code-along.R` script and run the code from the slides above.
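---

# Faceting: rows and columns

In `facet_grid()`, the variable on the left of `~` defines the rows and the one on the right defines the columns. `facet_wrap()` instead lays panels out one after the other. The sketch below, assuming the same `gapminder_example` subset, swaps the formula used in the slides and counts the resulting panels. The object names are ours.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gapminder_example <- gapminder %>%
  filter(year %in% c(1957, 2007))

# A plot can be stored in an object and facets added to it later
base_plot <- gapminder_example %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point(show.legend = FALSE)

# One panel per continent, all on a single row
p_wrap <- base_plot + facet_wrap(~ continent, nrow = 1)

# Years as rows, continents as columns (the reverse of the slide above)
p_grid <- base_plot + facet_grid(year ~ continent)

# Count the panels that contain data in each plot
n_panels_wrap <- length(unique(layer_data(p_wrap)$PANEL))
n_panels_grid <- length(unique(layer_data(p_grid)$PANEL))
```

With five continents and two years, the grid version has twice as many panels as the wrapped one.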
---
class: inverse, bottom, left
background-image: url(https://images.unsplash.com/photo-1421986872218-300a0fea5895?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=634&q=80)
background-size: cover

# Break
---
class:inverse, bottom, left

# Common charts

---

# Histograms

.pull-left[
```r
gapminder_example %>%
  ggplot(aes(x = gdpPercap)) +
  geom_histogram() +
  facet_grid(. ~ year)
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-36-1.png" width="504" />
]

---

# Columns

.pull-left[
```r
gapminder_summary <- gapminder_example %>%
  group_by(continent, year) %>%
  summarise(
    gdp_wmean = weighted.mean(
      gdpPercap, pop
    ),
    .groups = "drop"
  )

ggplot(gapminder_summary) +
  geom_col(
    aes(
      x = continent,
      y = gdp_wmean
    )
  ) +
  facet_grid(. ~ year)
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-37-1.png" width="504" />
]

---

# Lines

.pull-left[
```r
gapminder %>%
  ggplot(
    aes(
      x = year,
      y = lifeExp,
      group = country
    )
  ) +
  geom_line(show.legend = FALSE)
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-39-1.png" width="504" />
]

---
class: exercise

# Recreate

.pull-left[
Take another look at the code from the previous slide. How would you change the code to create the figure on the right?

```r
gapminder %>%
  ggplot(
    aes(
      x = year,
      y = lifeExp,
      group = country
    )
  ) +
  geom_line(show.legend = FALSE)
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/exo-lines-1.png" width="504" style="display: block; margin: auto;" />
]
---
class: exercise

# Answer

```r
gapminder %>%
  ggplot(aes(year, lifeExp, group = country, colour = continent)) +
  geom_line(show.legend = FALSE) +
  facet_wrap(~ continent)
```

---
class:inverse, bottom, left

# Tidy data

---

# Definition

There are three interrelated rules which make a dataset tidy:

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

![](https://github.com/hadley/r4ds/raw/master/images/tidy-1.png)

.right[Source: [R for Data Science](https://r4ds.had.co.nz/tidy-data.html)]

---

# Pivoting from long to wide

```r
gapminder_long
#> # A tibble: 5,112 x 5
#>    country     continent  year indicator      value
#>    <fct>       <fct>     <int> <chr>          <dbl>
#>  1 Afghanistan Asia       1952 lifeExp         28.8
#>  2 Afghanistan Asia       1952 pop        8425333
#>  3 Afghanistan Asia       1952 gdpPercap      779.
#>  4 Afghanistan Asia       1957 lifeExp         30.3
#>  5 Afghanistan Asia       1957 pop        9240934
#>  6 Afghanistan Asia       1957 gdpPercap      821.
#>  7 Afghanistan Asia       1962 lifeExp         32.0
#>  8 Afghanistan Asia       1962 pop       10267083
#>  9 Afghanistan Asia       1962 gdpPercap      853.
#> 10 Afghanistan Asia       1967 lifeExp         34.0
#> # ... with 5,102 more rows
```

---
background-position: 1050px 20px
background-size: 100px
background-image: url(https://raw.githubusercontent.com/tidyverse/tidyr/master/man/figures/logo.png)

# Pivoting from long to wide

`pivot_wider()` has two main arguments:

- `names_from`: The name of the column from the long table that will be used to create new columns to form a wider table.
- `values_from`: The name of the column from the long table that will be used to fill the contents of the new columns in the wider table.

```r
gapminder_long %>%
  pivot_wider(names_from = indicator, values_from = value)
#> # A tibble: 1,704 x 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <dbl>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # ... with 1,694 more rows
```

---
class: exercise

# From long to wide

1. Open today's [RStudio Cloud project](https://rstudio.cloud/project/2907237)
1. Open the `pivot.R` script, run the code from the slides above and answer the different questions.
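---

# From long to wide: a round trip

The slides do not show how `gapminder_long` was built. The sketch below is one plausible way to obtain a table with the same shape, which is an assumption on our part, followed by the `pivot_wider()` call from the slides. The name `gapminder_roundtrip` is ours.

```r
library(dplyr)
library(tidyr)
library(gapminder)

# Assumed construction: stack the three indicator columns into
# a name column (indicator) and a value column (value)
gapminder_long <- gapminder %>%
  pivot_longer(
    cols = c(lifeExp, pop, gdpPercap),
    names_to = "indicator",
    values_to = "value"
  )

# Back to one column per indicator, as in the slides
gapminder_roundtrip <- gapminder_long %>%
  pivot_wider(names_from = indicator, values_from = value)

dim(gapminder_long)
dim(gapminder_roundtrip)
```

Because the three indicators share a single `value` column in the long table, `pop` comes back as a double rather than an integer. This matches the `<dbl>` type shown in the output above.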
---

# Pivoting from wide to long

```r
gapminder_wide
#> # A tibble: 142 x 14
#>    country     continent `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
#>    <fct>       <fct>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1 Afghanistan Asia        28.8   30.3   32.0   34.0   36.1   38.4   39.9   40.8   41.7
#>  2 Albania     Europe      55.2   59.3   64.8   66.2   67.7   68.9   70.4   72     71.6
#>  3 Algeria     Africa      43.1   45.7   48.3   51.4   54.5   58.0   61.4   65.8   67.7
#>  4 Angola      Africa      30.0   32.0   34     36.0   37.9   39.5   39.9   39.9   40.6
#>  5 Argentina   Americas    62.5   64.4   65.1   65.6   67.1   68.5   69.9   70.8   71.9
#>  6 Australia   Oceania     69.1   70.3   70.9   71.1   71.9   73.5   74.7   76.3   77.6
#>  7 Austria     Europe      66.8   67.5   69.5   70.1   70.6   72.2   73.2   74.9   76.0
#>  8 Bahrain     Asia        50.9   53.8   56.9   59.9   63.3   65.6   69.1   70.8   72.6
#>  9 Bangladesh  Asia        37.5   39.3   41.2   43.5   45.3   46.9   50.0   52.8   56.0
#> 10 Belgium     Europe      68     69.2   70.2   70.9   71.4   72.8   73.9   75.4   76.5
#> # ... with 132 more rows, and 3 more variables: 1997 <dbl>, 2002 <dbl>, 2007 <dbl>
```

---
background-position: 1050px 20px
background-size: 100px
background-image: url(https://raw.githubusercontent.com/tidyverse/tidyr/master/man/figures/logo.png)

# Pivoting from wide to long

.pull-left[
`pivot_longer()` has three main arguments:

- `cols`: Columns to pivot into the longer format. Here you can select elements in the same way as you would with `dplyr::select()`.
- `names_to`: The name of the column to create from the column names of the data in wide format.
- `values_to`: The name of the column to create from the data stored in cell values.
]

.pull-right[
```r
gapminder_wide %>%
  pivot_longer(
    cols = `1952`:`2007`,
    names_to = "year",
    values_to = "lifeExp"
  )
#> # A tibble: 1,704 x 4
#>    country     continent year  lifeExp
#>    <fct>       <fct>     <chr>   <dbl>
#>  1 Afghanistan Asia      1952     28.8
#>  2 Afghanistan Asia      1957     30.3
#>  3 Afghanistan Asia      1962     32.0
#>  4 Afghanistan Asia      1967     34.0
#>  5 Afghanistan Asia      1972     36.1
#>  6 Afghanistan Asia      1977     38.4
#>  7 Afghanistan Asia      1982     39.9
#>  8 Afghanistan Asia      1987     40.8
#>  9 Afghanistan Asia      1992     41.7
#> 10 Afghanistan Asia      1997     41.8
#> # ... with 1,694 more rows
```
]

---
class: exercise

# From wide to long

1. Open today's [RStudio Cloud project](https://rstudio.cloud/project/2907237)
1. Open the `pivot.R` script, run the code from the slides above and answer the different questions.
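---

# From wide to long: keeping `year` numeric

In the output above, `year` comes back as a character column (`<chr>`) because it is created from column names. The sketch below shows one way to convert it while pivoting, using the `names_transform` argument of `pivot_longer()`. How `gapminder_wide` was built is not shown in the slides, so the construction here is our assumption, as is the name `gapminder_lifeexp`.

```r
library(dplyr)
library(tidyr)
library(gapminder)

# Assumed construction of the wide table: one column per year
gapminder_wide <- gapminder %>%
  select(country, continent, year, lifeExp) %>%
  pivot_wider(names_from = year, values_from = lifeExp)

# names_transform converts the new `year` column as it is created
gapminder_lifeexp <- gapminder_wide %>%
  pivot_longer(
    cols = `1952`:`2007`,
    names_to = "year",
    values_to = "lifeExp",
    names_transform = list(year = as.integer)
  )

class(gapminder_lifeexp$year)
```

An integer `year` is usually what we want before plotting, so that ggplot2 treats it as a continuous variable on the x axis.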
---
class: exercise

## 👩‍💻 Demo: Recreating the OECD style with ggplot2

The code used in this demonstration is in the [RStudio Cloud project](https://rstudio.cloud/project/2907237).

.center[
<img src="https://www.oecd-ilibrary.org/sites/959d5ba0-en/images/images/02-chapter2/emf/g2-11.png" width="80%" />
]

---
class: inverse, bottom, left
background-image: url(img/gradient-background.png)
background-size: cover

# Annex

---

# To learn more

.pull-left[
[**R for Data Science**](https://r4ds.had.co.nz/), [Chapter 3](https://r4ds.had.co.nz/data-visualisation.html) and [Chapter 28](https://r4ds.had.co.nz/graphics-for-communication.html)

[**ggplot2: Elegant graphics for data analysis**](https://ggplot2-book.org/), for a comprehensive explanation of the package and its implementation of the Grammar of Graphics.

[**R Graphics Cookbook**](https://r-graphics.org/), for "recipes" to create specific types of charts.
]

.pull-right[
<img src="https://d33wubrfki0l68.cloudfront.net/b88ef926a004b0fce72b2526b0b5c4413666a4cb/24a30/cover.png" width="60%" />
]

---

# Columns: a better example

.pull-left[
```r
gapminder_summary <- gapminder_example %>%
  group_by(continent, year) %>%
  summarise(
    gdp_wmean = weighted.mean(
      gdpPercap, pop
    ),
    .groups = "drop"
  )

ggplot(gapminder_summary) +
  geom_col(
    aes(
      x = continent,
      y = gdp_wmean,
      fill = factor(year)
    ),
    position = "dodge"
  )
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-53-1.png" width="504" />
]

---

# Bars vs. columns

.pull-left[
```r
gapminder_example %>%
  ggplot() +
  # we don't specify a 'y' aesthetic
  geom_bar(aes(x = continent)) +
  facet_grid(. ~ year)
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-55-1.png" width="504" />
]

---

# Bars vs. columns

.pull-left[
```r
gapminder_example %>%
  count(continent, year) %>%
  ggplot() +
  # we need a 'y' aesthetic
  geom_col(aes(x = continent, y = n)) +
  facet_grid(. ~ year)
```
]

.pull-right[
<img src="02_dataviz_files/figure-html/unnamed-chunk-57-1.png" width="504" />
]

???

`geom_col()` will generally be more useful to us than `geom_bar()` because we often want to be very specific about the way we summarise the data. Imagine, for example, that these bars had to show a weighted mean.
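---

# Bars vs. columns: checking the equivalence

The two slides above should draw the same bars: `geom_bar()` counts the rows itself, while `geom_col()` expects the counts to be computed beforehand. The sketch below checks this without facets, assuming the same `gapminder_example` subset. The object names are ours.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gapminder_example <- gapminder %>%
  filter(year %in% c(1957, 2007))

# geom_bar(): ggplot2 counts the rows per continent for us
p_bar <- ggplot(gapminder_example) +
  geom_bar(aes(x = continent))

# geom_col(): we count first, then plot the result as-is
continent_counts <- gapminder_example %>%
  count(continent)

p_col <- ggplot(continent_counts) +
  geom_col(aes(x = continent, y = n))

# Bar heights computed by ggplot2 for each plot
heights_bar <- as.numeric(layer_data(p_bar)$y)
heights_col <- as.numeric(layer_data(p_col)$y)

same_heights <- isTRUE(all.equal(heights_bar, heights_col))
same_heights
```

The `geom_col()` route takes one more step, but it makes the summary explicit and easy to replace with another statistic.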