5  Data wrangling

Until now, every dataset in this book has arrived in perfect rows and columns. There were no columns to rename, no rows to drop, and nothing to reshape or merge. They were ready for whatever analyses or plotting we needed to do. Real-world datasets, by contrast, are rarely so accommodating. You might encounter superfluous columns, or need to create new variables. Perhaps you’ll only need a subset of the rows, or have to reshape and restructure the data, or even merge multiple datasets into one. Usually, before any analysis can begin, however simple, we have to reformat, clean, or transform our data.

This activity of cleaning or transforming data is often called data wrangling. The verb to wrangle originally meant “to struggle or argue at length.” Ranchers later used it for herding or taming uncooperative livestock. Modern analysts have borrowed the word because taming data can feel exactly the same: time-consuming, difficult, and occasionally infuriating. Many surveys (for example, Anaconda, 2022) suggest that analysts spend 40–60% of their time on data wrangling; more than on statistical analysis itself. Despite this, the topic is often neglected in statistics textbooks and courses.

In this chapter, you’ll learn about a set of interrelated R commands used to wrangle data. They go by names like select, filter, and mutate, and they’re almost all part of the broad tidyverse family of R tools. Each of these commands is designed to do one thing well while working seamlessly with the others. As such, they can be used to create powerful sequences of data transformations, known as pipelines, that take data from an unruly raw state and transform it into something clean, tidy, and ready for analysis.

We’ll go through many of the major wrangling tools, explaining their purpose and how they work. Then, we’ll provide examples of how they can be combined into pipelines to transform messy raw data into clean, tidy data. Everything we need, whether commands or data, is available in either the tidyverse meta-package or this textbook’s sgsur package.

library(tidyverse)
library(sgsur)

5.1 This chapter covers…

  • What data wrangling means.
    Why transforming raw data into tidy, analysis-ready form is an essential step before any analysis.

  • Core tidyverse verbs.
    Using select(), rename(), relocate(), slice(), filter(), mutate(), and arrange() to reshape and transform data.

  • Reshaping data.
    Moving between wide and long formats with pivot_longer() and pivot_wider().

  • Joining datasets.
    Combining related information with inner_join() and left_join().

  • Building pipelines.
    Linking steps with the native pipe |> to create clean, reproducible workflows.

  • Developing tidy data skills.
    Turning messy inputs into reliable datasets ready for modelling and visualisation.

5.2 Selecting columns with select

The first wrangling tool we will look at is select. This is part of the dplyr package, which is part of the tidyverse meta-package, so is loaded already. To explore this command, we will use the wdi_panel dataset that comes with the sgsur package.

Let’s look at this data:

wdi_panel
# A tibble: 6,582 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
 5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
 6 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 7 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 8 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 9 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
10 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

As you can see, this is a data frame with 6582 rows and 8 columns. It provides some variables that describe the economy of many different countries over recent decades; see help("wdi_panel") or ?wdi_panel for full details about this dataset.

With select, we can choose, or select, a subset of the columns of any data frame. As we will see, it provides many different ways to do this, but the simplest is just to name the individual columns that we want to choose. For example, if we just want the columns that give the country (country), year (year), its gross domestic product (GDP, gdp), and country code (iso2c), we can do the following:

select(wdi_panel, country, year, gdp)
# A tibble: 6,582 × 3
   country      year     gdp
   <chr>       <int>   <dbl>
 1 Afghanistan  2016 102672.
 2 Afghanistan  2023  82595.
 3 Afghanistan  2015 100402.
 4 Afghanistan  2006  49061.
 5 Afghanistan  2017 105389.
 6 Afghanistan  2014  98966.
 7 Afghanistan  2019 110814.
 8 Afghanistan  2021  85768.
 9 Afghanistan  2013  96341.
10 Afghanistan  2018 106643.
# ℹ 6,572 more rows

As you can see, the way select works in this example is very simple: first we give the name of the data frame (wdi_panel), and then give the names of the columns of that data frame that we want to select (country, year, gdp).

We can name as many columns as we wish. For example, if we wanted the country code (iso2c) as well, we could do the following:

select(wdi_panel, country, iso2c, year, gdp)
# A tibble: 6,582 × 4
   country     iso2c  year     gdp
   <chr>       <chr> <int>   <dbl>
 1 Afghanistan AF     2016 102672.
 2 Afghanistan AF     2023  82595.
 3 Afghanistan AF     2015 100402.
 4 Afghanistan AF     2006  49061.
 5 Afghanistan AF     2017 105389.
 6 Afghanistan AF     2014  98966.
 7 Afghanistan AF     2019 110814.
 8 Afghanistan AF     2021  85768.
 9 Afghanistan AF     2013  96341.
10 Afghanistan AF     2018 106643.
# ℹ 6,572 more rows

The selected columns appear in the order we name them. For example, here are the same four selected columns but in a different order:

select(wdi_panel, year, iso2c, country, gdp)
# A tibble: 6,582 × 4
    year iso2c country         gdp
   <int> <chr> <chr>         <dbl>
 1  2016 AF    Afghanistan 102672.
 2  2023 AF    Afghanistan  82595.
 3  2015 AF    Afghanistan 100402.
 4  2006 AF    Afghanistan  49061.
 5  2017 AF    Afghanistan 105389.
 6  2014 AF    Afghanistan  98966.
 7  2019 AF    Afghanistan 110814.
 8  2021 AF    Afghanistan  85768.
 9  2013 AF    Afghanistan  96341.
10  2018 AF    Afghanistan 106643.
# ℹ 6,572 more rows

An important point to make about the select function is that it does not change the original data frame. This is a general property of tidyverse commands, which favour immutable data frames: every command returns a new object instead of altering the original. In other words, while we put wdi_panel into the select function, a new data frame comes out; wdi_panel itself is unchanged. In order to see the output in the examples used above, we did not use the <- assignment operator to save the new data frame that the function returned, but we can always do this. For example, the following creates a new data frame named wdi_panel_a that just has the country, year, and GDP:

wdi_panel_a <- select(wdi_panel, country, year, gdp)

The select function is far more flexible than simply selecting columns by listing their names. It has several helper tricks and functions — ranges, numeric positions, pattern-matching functions, negation, and others — that let you specify whole groups of variables in one go.

Selecting a consecutive set of columns
If the columns that you want to select occur consecutively, you can select them by just stating the first and last column in the range you want. For example, suppose we need the first five columns (country to gdp_growth). We can select them all at once using the colon operator as follows:
select(wdi_panel, country:gdp_growth)
# A tibble: 6,582 × 5
   country     iso2c  year     gdp gdp_growth
   <chr>       <chr> <int>   <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26
 2 Afghanistan AF     2023  82595.       2.71
 3 Afghanistan AF     2015 100402.       1.45
 4 Afghanistan AF     2006  49061.       5.36
 5 Afghanistan AF     2017 105389.       2.65
 6 Afghanistan AF     2014  98966.       2.72
 7 Afghanistan AF     2019 110814.       3.91
 8 Afghanistan AF     2021  85768.     -20.7 
 9 Afghanistan AF     2013  96341.       5.60
10 Afghanistan AF     2018 106643.       1.19
# ℹ 6,572 more rows

The colon tells select() to keep all columns from the one named on the left up to the one named on the right, in their original order.

Selecting by column index
We can also keep columns purely by their numerical index or position. Sometimes this is helpful as you might not yet know the names or the names vary across files: For example, here we keep columns 1, 3, and 4:
select(wdi_panel, 1, 3, 4)
# A tibble: 6,582 × 3
   country      year     gdp
   <chr>       <int>   <dbl>
 1 Afghanistan  2016 102672.
 2 Afghanistan  2023  82595.
 3 Afghanistan  2015 100402.
 4 Afghanistan  2006  49061.
 5 Afghanistan  2017 105389.
 6 Afghanistan  2014  98966.
 7 Afghanistan  2019 110814.
 8 Afghanistan  2021  85768.
 9 Afghanistan  2013  96341.
10 Afghanistan  2018 106643.
# ℹ 6,572 more rows

Just like with column names, we can select a sequence of consecutive columns with the colon operator. For example, here we select the first to the fourth column with 1:4:

select(wdi_panel, 1:4)
# A tibble: 6,582 × 4
   country     iso2c  year     gdp
   <chr>       <chr> <int>   <dbl>
 1 Afghanistan AF     2016 102672.
 2 Afghanistan AF     2023  82595.
 3 Afghanistan AF     2015 100402.
 4 Afghanistan AF     2006  49061.
 5 Afghanistan AF     2017 105389.
 6 Afghanistan AF     2014  98966.
 7 Afghanistan AF     2019 110814.
 8 Afghanistan AF     2021  85768.
 9 Afghanistan AF     2013  96341.
10 Afghanistan AF     2018 106643.
# ℹ 6,572 more rows
Selecting by names and indices at the same time
We can select by column names and column indices, or sequences of them using the colon, at the same time. For example, this selects country, year, and the seventh and eighth columns:
select(wdi_panel, country, year, 7:8)
# A tibble: 6,582 × 4
   country      year lfp_male lfp_female
   <chr>       <int>    <dbl>      <dbl>
 1 Afghanistan  2016     74.4      20.2 
 2 Afghanistan  2023     70.2       5.16
 3 Afghanistan  2015     75.3      19.1 
 4 Afghanistan  2006     79.3      14.5 
 5 Afghanistan  2017     73.4      21.2 
 6 Afghanistan  2014     76.2      18.1 
 7 Afghanistan  2019     69.4      18.3 
 8 Afghanistan  2021     67.3      14.7 
 9 Afghanistan  2013     77.1      17.1 
10 Afghanistan  2018     71.4      19.8 
# ℹ 6,572 more rows
Selecting columns that start with, end with, or contain a pattern
The select function provides helper functions like starts_with, ends_with, contains, and matches that allow us to select columns on the basis of specific patterns or sequences of characters in their names. For example, in the wdi_panel data, the GDP-related variables begin with "gdp". To grab them in one go, we can use the starts_with helper function as follows:
select(wdi_panel, starts_with("gdp"))
# A tibble: 6,582 × 2
       gdp gdp_growth
     <dbl>      <dbl>
 1 102672.       2.26
 2  82595.       2.71
 3 100402.       1.45
 4  49061.       5.36
 5 105389.       2.65
 6  98966.       2.72
 7 110814.       3.91
 8  85768.     -20.7 
 9  96341.       5.60
10 106643.       1.19
# ℹ 6,572 more rows

We can use ends_with in a similar way:

select(wdi_panel, ends_with("male"))
# A tibble: 6,582 × 2
   lfp_male lfp_female
      <dbl>      <dbl>
 1     74.4      20.2 
 2     70.2       5.16
 3     75.3      19.1 
 4     79.3      14.5 
 5     73.4      21.2 
 6     76.2      18.1 
 7     69.4      18.3 
 8     67.3      14.7 
 9     77.1      17.1 
10     71.4      19.8 
# ℹ 6,572 more rows

We can use contains to select any column whose name includes a pattern or sequence of characters anywhere, not necessarily at the beginning or end. For example, to fetch any column whose name includes "inf" anywhere, we can do:

select(wdi_panel, contains("inf"))
# A tibble: 6,582 × 1
   inflation
       <dbl>
 1     4.38 
 2    -4.64 
 3    -0.662
 4     6.78 
 5     4.98 
 6     4.67 
 7     2.30 
 8     5.13 
 9     7.39 
10     0.626
# ℹ 6,572 more rows

All of these helper functions, starts_with, ends_with, contains, are not sensitive to case, upper or lower case, by default. For example, contains("INF") is the same as contains('inf') by default. If we want it to be case sensitive, we can indicate this by using ignore.case = FALSE inside the helper. For example, here is how we tell it to match the exact case:

# returns only columns containing "INF", of which there are none
select(wdi_panel, contains("INF", ignore.case = FALSE)) 
# A tibble: 6,582 × 0
Deselecting columns by negation
By putting a minus sign in front of any selector, we can drop or deselect columns instead of keeping them. For example, in the following example, we keep everything except iso2c:
select(wdi_panel, -iso2c)
# A tibble: 6,582 × 7
   country      year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan  2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan  2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan  2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan  2006  49061.       5.36     6.78      79.3      14.5 
 5 Afghanistan  2017 105389.       2.65     4.98      73.4      21.2 
 6 Afghanistan  2014  98966.       2.72     4.67      76.2      18.1 
 7 Afghanistan  2019 110814.       3.91     2.30      69.4      18.3 
 8 Afghanistan  2021  85768.     -20.7      5.13      67.3      14.7 
 9 Afghanistan  2013  96341.       5.60     7.39      77.1      17.1 
10 Afghanistan  2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

You can read the minus sign as meaning everything except. We can put it in front of every helper function too. For example, -starts_with("lfp_") drops all labour-force participation variables, which all start with lfp_.

Renaming while selecting
When we use select, we can optionally rename columns by using new_name = old_name inside the call. For example, here we select four columns but rename two as we do so:
select(
  wdi_panel,
  country,
  country_code = iso2c,
  year,
  gdp,
  growth = gdp
)
# A tibble: 6,582 × 4
   country     country_code  year  growth
   <chr>       <chr>        <int>   <dbl>
 1 Afghanistan AF            2016 102672.
 2 Afghanistan AF            2023  82595.
 3 Afghanistan AF            2015 100402.
 4 Afghanistan AF            2006  49061.
 5 Afghanistan AF            2017 105389.
 6 Afghanistan AF            2014  98966.
 7 Afghanistan AF            2019 110814.
 8 Afghanistan AF            2021  85768.
 9 Afghanistan AF            2013  96341.
10 Afghanistan AF            2018 106643.
# ℹ 6,572 more rows

As before however, it is the data frame that comes out that has the new column names; the original wdi_panel remains unchanged.

5.3 Renaming columns with rename

Our next wrangling command is rename(). We just saw that select() can both pick columns and rename them. As useful as that is, if we just want to rename some variables but not select variables at the same time, we should use rename instead. This command does only one thing: it changes column names while leaving every column in place. We often use it when we want to keep the current set of variables but the labels are cryptic, too long, or inconsistent and so need to be changed. Here we rename, as we just did using select, iso2c and gdp_growth to country_code and growth, respectively:

rename(wdi_panel,
  country_code = iso2c,
  growth = gdp_growth
)
# A tibble: 6,582 × 8
   country     country_code  year     gdp growth inflation lfp_male lfp_female
   <chr>       <chr>        <int>   <dbl>  <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF            2016 102672.   2.26     4.38      74.4      20.2 
 2 Afghanistan AF            2023  82595.   2.71    -4.64      70.2       5.16
 3 Afghanistan AF            2015 100402.   1.45    -0.662     75.3      19.1 
 4 Afghanistan AF            2006  49061.   5.36     6.78      79.3      14.5 
 5 Afghanistan AF            2017 105389.   2.65     4.98      73.4      21.2 
 6 Afghanistan AF            2014  98966.   2.72     4.67      76.2      18.1 
 7 Afghanistan AF            2019 110814.   3.91     2.30      69.4      18.3 
 8 Afghanistan AF            2021  85768. -20.7      5.13      67.3      14.7 
 9 Afghanistan AF            2013  96341.   5.60     7.39      77.1      17.1 
10 Afghanistan AF            2018 106643.   1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

Note that the syntax is new = old, just as we used when renaming while selecting with select.

Remember, as with the other commands we are looking at, the rename command returns a new data frame in which the specified variables have fresh names, while every other column is unchanged and unreordered. The original data frame is untouched. So if we want to permanently work with the renamed variables, we need to assign the new data frame returned by rename to a new name, as in this example:

wdi_panel_renamed <- rename(
  wdi_panel,
  country_code = iso2c,
  growth = gdp_growth
)

Now, the data frame wdi_panel_renamed is the data frame with the new names, while wdi_panel is the original data frame, and both of these can exist at the same time.

If we need to rename many or even all variables, typing dozens of pairs can be tedious. The rename_with function lets you pass a function to do renaming, which will be applied to every selected column name.

For example, the function toupper turns any string given as input into uppercase, and so if we do

rename_with(wdi_panel, toupper)
# A tibble: 6,582 × 8
   COUNTRY     ISO2C  YEAR     GDP GDP_GROWTH INFLATION LFP_MALE LFP_FEMALE
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
 5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
 6 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 7 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 8 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 9 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
10 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

we will make all the column names uppercase.

You can combine rename_with with a selection helper, such as those used above, so the function applies to only certain columns. Below we convert every GDP-related variable (those that start with "gdp") to upper-case, leaving others alone:

rename_with(
  wdi_panel,
  toupper,
  .cols = starts_with("gdp")
)
# A tibble: 6,582 × 8
   country     iso2c  year     GDP GDP_GROWTH inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
 5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
 6 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 7 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 8 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 9 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
10 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

5.4 Re-ordering columns with relocate

After we have selected the variables we need and renamed them sensibly, a common housekeeping step is to arrange those columns in an order that makes the data frame easier to read. That is what relocate does: it moves one or more columns to a new position while leaving every other column untouched. As with the other commands, relocate returns a new data frame; the original remains unchanged unless you assign it to a new name.

To see relocate in action, suppose we want iso2c to be the very first column. We do so as follows:

relocate(wdi_panel, iso2c)
# A tibble: 6,582 × 8
   iso2c country      year     gdp gdp_growth inflation lfp_male lfp_female
   <chr> <chr>       <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 AF    Afghanistan  2016 102672.       2.26     4.38      74.4      20.2 
 2 AF    Afghanistan  2023  82595.       2.71    -4.64      70.2       5.16
 3 AF    Afghanistan  2015 100402.       1.45    -0.662     75.3      19.1 
 4 AF    Afghanistan  2006  49061.       5.36     6.78      79.3      14.5 
 5 AF    Afghanistan  2017 105389.       2.65     4.98      73.4      21.2 
 6 AF    Afghanistan  2014  98966.       2.72     4.67      76.2      18.1 
 7 AF    Afghanistan  2019 110814.       3.91     2.30      69.4      18.3 
 8 AF    Afghanistan  2021  85768.     -20.7      5.13      67.3      14.7 
 9 AF    Afghanistan  2013  96341.       5.60     7.39      77.1      17.1 
10 AF    Afghanistan  2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

By default, relocate() puts the chosen column(s) before all others in the data frame.

We can move several columns as a block. Here we place year and iso2c ahead of everything else:

relocate(wdi_panel, year, iso2c)
# A tibble: 6,582 × 8
    year iso2c country         gdp gdp_growth inflation lfp_male lfp_female
   <int> <chr> <chr>         <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1  2016 AF    Afghanistan 102672.       2.26     4.38      74.4      20.2 
 2  2023 AF    Afghanistan  82595.       2.71    -4.64      70.2       5.16
 3  2015 AF    Afghanistan 100402.       1.45    -0.662     75.3      19.1 
 4  2006 AF    Afghanistan  49061.       5.36     6.78      79.3      14.5 
 5  2017 AF    Afghanistan 105389.       2.65     4.98      73.4      21.2 
 6  2014 AF    Afghanistan  98966.       2.72     4.67      76.2      18.1 
 7  2019 AF    Afghanistan 110814.       3.91     2.30      69.4      18.3 
 8  2021 AF    Afghanistan  85768.     -20.7      5.13      67.3      14.7 
 9  2013 AF    Afghanistan  96341.       5.60     7.39      77.1      17.1 
10  2018 AF    Afghanistan 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

Note how the two columns stay in the order in which we list them.

The relocate function will do more than just pull columns to the very front. To position columns somewhere other than the very front, we supply a target column with .before or .after arguments. For example, to put inflation right before gdp, we can do the following:

relocate(wdi_panel, inflation, .before = gdp)
# A tibble: 6,582 × 8
   country     iso2c  year inflation     gdp gdp_growth lfp_male lfp_female
   <chr>       <chr> <int>     <dbl>   <dbl>      <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016     4.38  102672.       2.26     74.4      20.2 
 2 Afghanistan AF     2023    -4.64   82595.       2.71     70.2       5.16
 3 Afghanistan AF     2015    -0.662 100402.       1.45     75.3      19.1 
 4 Afghanistan AF     2006     6.78   49061.       5.36     79.3      14.5 
 5 Afghanistan AF     2017     4.98  105389.       2.65     73.4      21.2 
 6 Afghanistan AF     2014     4.67   98966.       2.72     76.2      18.1 
 7 Afghanistan AF     2019     2.30  110814.       3.91     69.4      18.3 
 8 Afghanistan AF     2021     5.13   85768.     -20.7      67.3      14.7 
 9 Afghanistan AF     2013     7.39   96341.       5.60     77.1      17.1 
10 Afghanistan AF     2018     0.626 106643.       1.19     71.4      19.8 
# ℹ 6,572 more rows

As another example, if we want to send all labour force participation columns, which start with lfp_, to occur just after year, we can do:

relocate(wdi_panel, starts_with('lfp_'), .after = year)
# A tibble: 6,582 × 8
   country     iso2c  year lfp_male lfp_female     gdp gdp_growth inflation
   <chr>       <chr> <int>    <dbl>      <dbl>   <dbl>      <dbl>     <dbl>
 1 Afghanistan AF     2016     74.4      20.2  102672.       2.26     4.38 
 2 Afghanistan AF     2023     70.2       5.16  82595.       2.71    -4.64 
 3 Afghanistan AF     2015     75.3      19.1  100402.       1.45    -0.662
 4 Afghanistan AF     2006     79.3      14.5   49061.       5.36     6.78 
 5 Afghanistan AF     2017     73.4      21.2  105389.       2.65     4.98 
 6 Afghanistan AF     2014     76.2      18.1   98966.       2.72     4.67 
 7 Afghanistan AF     2019     69.4      18.3  110814.       3.91     2.30 
 8 Afghanistan AF     2021     67.3      14.7   85768.     -20.7      5.13 
 9 Afghanistan AF     2013     77.1      17.1   96341.       5.60     7.39 
10 Afghanistan AF     2018     71.4      19.8  106643.       1.19     0.626
# ℹ 6,572 more rows

5.5 Selecting rows with slice

The wrangling commands covered thus far were all applied to columns. There are other commands that apply to rows instead. One of the simplest such functions is slice whose main purpose is to select rows according to their position. In this sense, slice is roughly equivalent to select when it uses column numerical indices.

If we wanted to select the first five rows, we could use slice as follows:

slice(wdi_panel, 1:5)
# A tibble: 5 × 8
  country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
  <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 

Similarly, if we wanted rows 10 to 20 inclusive, we could do:

slice(wdi_panel, 10:20)
# A tibble: 11 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2018 106643.      1.19      0.626     71.4      19.8 
 2 Afghanistan AF     2005  46567.     11.2      12.7       79.5      14.3 
 3 Afghanistan AF     2004  41865.      1.41     NA         79.6      14.2 
 4 Afghanistan AF     2002  37931.     28.6      NA         79.8      14.2 
 5 Afghanistan AF     2003  41282.      8.83     NA         79.7      14.2 
 6 Afghanistan AF     2022  80416.     -6.24     13.7       70.2       5.16
 7 Afghanistan AF     2007  55845.     13.8       8.68      79.2      14.7 
 8 Afghanistan AF     2008  58037.      3.92     26.4       78.9      15.0 
 9 Afghanistan AF     2020 108209.     -2.35      5.60      66.8      16.5 
10 Afghanistan AF     2012  91231.     12.8       6.44      78.0      16.2 
11 Afghanistan AF     2011  80913.      0.426    11.8       78.2      15.9 

Note how we use the colon operator here just like we did with select when we selected consecutive columns by their position (e.g., select(wdi_panel, 1:4)).

Also, as we saw when using select, we can use the minus sign to deselect rows. For example, the following selects all rows except the first row:

slice(wdi_panel, -1)
# A tibble: 6,581 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
 2 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
 3 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
 4 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
 5 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 6 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 7 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 8 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
 9 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
10 Afghanistan AF     2005  46567.      11.2     12.7       79.5      14.3 
# ℹ 6,571 more rows

Likewise, we could negate the set of numbers selected with the colon. For example, the following selects all rows except the first five rows:

slice(wdi_panel, -(1:5))
# A tibble: 6,577 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 2 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 3 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 4 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
 5 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
 6 Afghanistan AF     2005  46567.      11.2     12.7       79.5      14.3 
 7 Afghanistan AF     2004  41865.       1.41    NA         79.6      14.2 
 8 Afghanistan AF     2002  37931.      28.6     NA         79.8      14.2 
 9 Afghanistan AF     2003  41282.       8.83    NA         79.7      14.2 
10 Afghanistan AF     2022  80416.      -6.24    13.7       70.2       5.16
# ℹ 6,567 more rows

There are some helper functions to make some slicing operations easier. For example, we can select the first n rows as follows:

# same as slice(wdi_panel, 1:5)
slice_head(wdi_panel, n = 5)
# A tibble: 5 × 8
  country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
  <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 

There is a counterpart to slice_head that can be used to obtain rows from the end of the data frame. For example, if we wanted the last 5 rows, we can do the following:

slice_tail(wdi_panel, n = 5)
# A tibble: 5 × 8
  country  iso2c  year    gdp gdp_growth inflation lfp_male lfp_female
  <chr>    <chr> <int>  <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
1 Zimbabwe ZW     2019 50316.      -6.33     255.      71.9       60.7
2 Zimbabwe ZW     2020 46383.      -7.82     557.      71.1       59.5
3 Zimbabwe ZW     2021 50311.       8.47      98.5     71.7       60.1
4 Zimbabwe ZW     2022 53399.       6.14     105.      71.7       59.8
5 Zimbabwe ZW     2023 56249.       5.34      NA       71.3       59.7

Another slice helper function, slice_sample, allows us to randomly sample rows. We could use this, for example, if we need to subsample a very large data frame for the purposes of plotting. The following samples 100 rows at random:

slice_sample(wdi_panel, n = 100)
# A tibble: 100 × 8
   country      iso2c  year      gdp gdp_growth inflation lfp_male lfp_female
   <chr>        <chr> <int>    <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Burkina Faso BF     2016   41816.       5.96     0.441     69.0       53.6
 2 Guinea       GN     1992   14358.       3.27    NA         64.1       63.2
 3 Maldives     MV     2018    9935.       8.67    -0.133     74.5       40.9
 4 China        CN     2006 9060142.      12.7      1.65      79.0       65.4
 5 Ecuador      EC     2023  260213.       2.36     2.22      76.7       52.5
 6 Gabon        GA     2013   37948.       5.64     0.505     58.4       40.8
 7 Honduras     HN     1997   27305.       4.60    20.2       86.9       45.1
 8 Mauritius    MU     2005   18676.       1.78     4.94      77.1       41.2
 9 Senegal      SN     2009   41595.       2.75    -2.25      64.2       33.6
10 Maldives     MV     1993    2287.       5.56    20.1       74.4       24.4
# ℹ 90 more rows

Another way of using slice_sample is with the by option. For example, instead of just sampling, say, n=3 rows from anywhere in the data frame, we could sample n=3 rows from each country as follows:

slice_sample(wdi_panel, n = 3, by = country)
# A tibble: 597 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2008  58037.       3.92     26.4      78.9       15.0
 2 Afghanistan AF     2021  85768.     -20.7       5.13     67.3       14.7
 3 Afghanistan AF     2014  98966.       2.72      4.67     76.2       18.1
 4 Albania     AL     2018  42170.       4.02      2.03     67.5       51.4
 5 Albania     AL     2017  40540.       3.80      2.06     66.8       49.7
 6 Albania     AL     2021  45346.       8.97      2.04     66.2       51.9
 7 Algeria     DZ     1996 313815.       4.10     18.7      77.2       11.7
 8 Algeria     DZ     2004 431567.       4.50      3.96     72.7       12.7
 9 Algeria     DZ     2012 553306.       2.40      8.89     70.0       14.8
10 Andorra     AD     2016   4649.       3.71     NA        NA         NA  
# ℹ 587 more rows

5.6 Keeping (or dropping) rows with filter

The slice function and variants are useful for selecting rows but they don’t provide all the functionality we usually need for selecting rows. In particular, we very often need to select rows that match some condition or set of conditions. For example, with data like wdi_panel, we may wish to select only those rows that meet some or all of the conditions: “years since 2010”, “countries in the EU”, “non-missing GDP”. Doing this with slice would be difficult or impossible. Instead, for tasks like this we use filter, which is one of the most powerful data wrangling work-horses.

As a simple example of filter in action, here we keep only those rows from the year 2020:

filter(wdi_panel, year == 2020)
# A tibble: 199 × 8
   country           iso2c  year    gdp gdp_growth inflation lfp_male lfp_female
   <chr>             <chr> <int>  <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan       AF     2020 1.08e5     -2.35      5.60      66.8       16.5
 2 Albania           AL     2020 4.16e4     -3.31      1.62      65.3       50.6
 3 Algeria           DZ     2020 6.25e5     -5.00      2.42      64.3       13.7
 4 Andorra           AD     2020 4.29e3    -11.2      NA         NA         NA  
 5 Angola            AO     2020 2.53e5     -5.64     22.3       76.8       72.7
 6 Antigua and Barb… AG     2020 2.20e3    -18.9       0.626     NA         NA  
 7 Argentina         AR     2020 1.08e6     -9.90     NA         65.9       46.5
 8 Armenia           AM     2020 4.46e4     -7.20      1.21      68.1       57.9
 9 Aruba             AW     2020 3.10e3    -26.2      NA         NA         NA  
10 Australia         AU     2020 1.46e6     -0.120     0.847     70.2       60.2
# ℹ 189 more rows

A few simple but important things to note about this. First, think “filter in” and not “filter out”. In other words, filter keeps, rather than drops, the rows that match the condition. Second, in this case we use == and not =. The == is how we test equality. We effectively ask if the thing on the left of == is identical to the thing on the right. You can read this example aloud as “keep rows where year equals 2020.” We use the single = often in R, and you’ve already seen many examples, but == means something completely different. It’s common to mistake the two when using filter, so if you use = instead of ==, filter will inform you that you probably intended ==, as in the following:

filter(wdi_panel, year = 2020)
Error in `filter()`:
! We detected a named input.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `year == 2020`?

In general with filter, we use conditions: is something true or false. Equality conditions like year == 2020 are common numerical conditions, but other numerical conditions like != (not equal to), > (greater than), < (less than), >= (greater than or equal to), <= (less than or equal to) are also very commonly used. For example, the following selects those rows from before, but not including, 2020:

filter(wdi_panel, year < 2020)
# A tibble: 5,794 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4       20.2
 2 Afghanistan AF     2015 100402.       1.45    -0.662     75.3       19.1
 3 Afghanistan AF     2006  49061.       5.36     6.78      79.3       14.5
 4 Afghanistan AF     2017 105389.       2.65     4.98      73.4       21.2
 5 Afghanistan AF     2014  98966.       2.72     4.67      76.2       18.1
 6 Afghanistan AF     2019 110814.       3.91     2.30      69.4       18.3
 7 Afghanistan AF     2013  96341.       5.60     7.39      77.1       17.1
 8 Afghanistan AF     2018 106643.       1.19     0.626     71.4       19.8
 9 Afghanistan AF     2005  46567.      11.2     12.7       79.5       14.3
10 Afghanistan AF     2004  41865.       1.41    NA         79.6       14.2
# ℹ 5,784 more rows

We can combine multiple conditions in filter. For example, to select the rows from France in or after the year 2020, we can do the following:

filter(wdi_panel, country == 'France', year >= 2020)
# A tibble: 4 × 8
  country iso2c  year      gdp gdp_growth inflation lfp_male lfp_female
  <chr>   <chr> <int>    <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
1 France  FR     2022 3654521.      2.57      5.22      60.2       51.9
2 France  FR     2021 3562924.      6.88      1.64      59.8       51.6
3 France  FR     2023 3688745.      0.936     4.88      59.8       51.8
4 France  FR     2020 3333501.     -7.44      0.476     58.9       50.0

Here, the comma effectively means “and”, so we are keeping those rows where the country is France and the year is 2020 or later. You can use as many conditions as you wish by just adding more conditions with commas between them. For example, here we select the rows from France in or after the year 2020 where the GDP growth was less than 2 (2% growth):

filter(wdi_panel, country == 'France', year >= 2020, gdp_growth < 2.0)
# A tibble: 2 × 8
  country iso2c  year      gdp gdp_growth inflation lfp_male lfp_female
  <chr>   <chr> <int>    <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
1 France  FR     2023 3688745.      0.936     4.88      59.8       51.8
2 France  FR     2020 3333501.     -7.44      0.476     58.9       50.0

Conditions that are “and” combinations like this are known in logic as conjunctions: a conjunction is true if and only if all of its constituent conditions are true. By contrast, disjunctions are “or” combinations. A disjunction is true if any of its conditions are true. We commonly use disjunctions when using filter. For example, to select rows where GDP growth was negative or inflation was over 20 (20% inflation), we do:

filter(wdi_panel, gdp_growth < 0 | inflation > 20)
# A tibble: 1,494 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2021  85768.     -20.7       5.13     67.3      14.7 
 2 Afghanistan AF     2022  80416.      -6.24     13.7      70.2       5.16
 3 Afghanistan AF     2008  58037.       3.92     26.4      78.9      15.0 
 4 Afghanistan AF     2020 108209.      -2.35      5.60     66.8      16.5 
 5 Afghanistan AF     2001  29496.      -9.43     NA        79.9      14.2 
 6 Albania     AL     1994  13896.       8.30     22.6      74.4      52.5 
 7 Albania     AL     2020  41613.      -3.31      1.62     65.3      50.6 
 8 Albania     AL     1993  12831.       9.56     85.0      75.1      53.2 
 9 Albania     AL     1991  12618.     -28.0      NA        75.2      53.4 
10 Albania     AL     1990  17526.      -9.58     NA        72.7      50.8 
# ℹ 1,484 more rows

Here, we don’t use a comma to separate conditions, but use |, which is the symbol in R meaning logical or. We can read this as “keep rows where gdp_growth is less than zero or inflation is greater than 20, or both”. Note that in some of the selected rows, the gdp_growth is positive and in others the inflation is not greater than 20, but in each case, one of the two conditions was met.

What if we wanted to select rows where values were within some range of values? For example, what if we wanted to select rows where the year was between 2000 and 2009? One possibility is to use a conjunction as follows:

filter(wdi_panel, year >= 2000, year <= 2009)
# A tibble: 1,934 × 8
   country     iso2c  year    gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>  <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2006 49061.       5.36      6.78     79.3       14.5
 2 Afghanistan AF     2005 46567.      11.2      12.7      79.5       14.3
 3 Afghanistan AF     2004 41865.       1.41     NA        79.6       14.2
 4 Afghanistan AF     2002 37931.      28.6      NA        79.8       14.2
 5 Afghanistan AF     2003 41282.       8.83     NA        79.7       14.2
 6 Afghanistan AF     2007 55845.      13.8       8.68     79.2       14.7
 7 Afghanistan AF     2008 58037.       3.92     26.4      78.9       15.0
 8 Afghanistan AF     2001 29496.      -9.43     NA        79.9       14.2
 9 Afghanistan AF     2000 32567.      NA        NA        79.9       14.3
10 Afghanistan AF     2009 70451.      21.4      -6.81     78.7       15.3
# ℹ 1,924 more rows

Here we are asking for those rows where year is greater than or equal to 2000 and also less than or equal to 2009. The only years that satisfy this conjunction are those between 2000 and 2009, inclusive. Another option, and probably easier too, is to use the between helper function as follows:

filter(wdi_panel, between(year, 2000, 2009))
# A tibble: 1,934 × 8
   country     iso2c  year    gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>  <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2006 49061.       5.36      6.78     79.3       14.5
 2 Afghanistan AF     2005 46567.      11.2      12.7      79.5       14.3
 3 Afghanistan AF     2004 41865.       1.41     NA        79.6       14.2
 4 Afghanistan AF     2002 37931.      28.6      NA        79.8       14.2
 5 Afghanistan AF     2003 41282.       8.83     NA        79.7       14.2
 6 Afghanistan AF     2007 55845.      13.8       8.68     79.2       14.7
 7 Afghanistan AF     2008 58037.       3.92     26.4      78.9       15.0
 8 Afghanistan AF     2001 29496.      -9.43     NA        79.9       14.2
 9 Afghanistan AF     2000 32567.      NA        NA        79.9       14.3
10 Afghanistan AF     2009 70451.      21.4      -6.81     78.7       15.3
# ℹ 1,924 more rows

The helper near can be used to find values that are within a certain distance from a specified value. For example, to find country/years where the GDP growth was \(5\% \pm 1\%\), we can use near as follows:

filter(wdi_panel, near(gdp_growth, 5, 1)) # within ±1.0 of 5 
# A tibble: 1,323 × 8
   country     iso2c  year    gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>  <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2006 49061.       5.36     6.78      79.3       14.5
 2 Afghanistan AF     2013 96341.       5.60     7.39      77.1       17.1
 3 Albania     AL     2018 42170.       4.02     2.03      67.5       51.4
 4 Albania     AL     2022 47535.       4.83     6.73      68.0       53.2
 5 Albania     AL     2007 30022.       5.98     2.93      62.9       46.1
 6 Albania     AL     2005 26748.       5.53     2.37      66.2       47.2
 7 Albania     AL     2003 24023.       5.53     0.484     69.4       48.5
 8 Albania     AL     2004 25348.       5.51     2.28      67.8       47.8
 9 Albania     AL     2006 28327.       5.90     2.37      64.6       46.7
10 Albania     AL     2002 22764.       4.54     7.77      70.9       49.1
# ℹ 1,313 more rows

Related situations arise with non-numeric data. What if, for example, we wanted to select rows from France, Germany, and Italy only? One option is as follows:

filter(wdi_panel, country == 'France' | country == 'Germany' | country == 'Italy')
# A tibble: 102 × 8
   country iso2c  year      gdp gdp_growth inflation lfp_male lfp_female
   <chr>   <chr> <int>    <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 France  FR     1994 2361212.      2.38     1.66       63.3       47.5
 2 France  FR     2010 3194765.      2.00     1.53       61.8       51.1
 3 France  FR     2009 3132113.     -2.82     0.0876     62.0       51.1
 4 France  FR     1993 2306239.     -0.359    2.10       63.7       47.3
 5 France  FR     2012 3278656.      0.184    1.95       61.7       51.2
 6 France  FR     2011 3272640.      2.44     2.11       61.5       51.0
 7 France  FR     2004 2992434.      2.87     2.14       62.7       49.6
 8 France  FR     2017 3472749.      2.08     1.03       60.3       50.6
 9 France  FR     2016 3401867.      0.860    0.183      60.4       50.8
10 France  FR     1995 2415633.      2.30     1.80       63.0       47.8
# ℹ 92 more rows

While that works, it is a bit awkward and an easier alternative is to use the %in% operator:

filter(wdi_panel, country %in% c('France', 'Germany', 'Italy'))
# A tibble: 102 × 8
   country iso2c  year      gdp gdp_growth inflation lfp_male lfp_female
   <chr>   <chr> <int>    <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 France  FR     1994 2361212.      2.38     1.66       63.3       47.5
 2 France  FR     2010 3194765.      2.00     1.53       61.8       51.1
 3 France  FR     2009 3132113.     -2.82     0.0876     62.0       51.1
 4 France  FR     1993 2306239.     -0.359    2.10       63.7       47.3
 5 France  FR     2012 3278656.      0.184    1.95       61.7       51.2
 6 France  FR     2011 3272640.      2.44     2.11       61.5       51.0
 7 France  FR     2004 2992434.      2.87     2.14       62.7       49.6
 8 France  FR     2017 3472749.      2.08     1.03       60.3       50.6
 9 France  FR     2016 3401867.      0.860    0.183      60.4       50.8
10 France  FR     1995 2415633.      2.30     1.80       63.0       47.8
# ℹ 92 more rows

The %in% operator tests if the value on its left is an element of the set to its right. In this case, we use the c() function to create a vector of three countries.

As mentioned, filter keeps rows rather than dropping them. However, we can easily get it to drop rows in various ways. One common option is to negate a condition with ! (logical not). For example, to drop all years between 2000 and 2009, we can do the following:

filter(wdi_panel, !between(year, 2000, 2009))
# A tibble: 4,648 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
 5 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 6 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 7 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 8 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
 9 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
10 Afghanistan AF     2022  80416.      -6.24    13.7       70.2       5.16
# ℹ 4,638 more rows

The way to read !between(year, 2000, 2009) is any year not between 2000 and 2009, so we end up selecting all but the years between 2000 and 2009 inclusive.

As another example, to drop rows from France, Germany, or Italy, we can negate the selection made with %in% above:

filter(wdi_panel, !(country %in% c('France', 'Germany', 'Italy')))
# A tibble: 6,480 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
 5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
 6 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 7 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 8 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 9 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
10 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,470 more rows

It is very common to need to remove rows where some or all values are missing, indicated by NA. We can use negation in filter to do this. In particular, we can drop rows whenever there is a missing value in a specified column by using the is.na function with negation. The is.na function tests if a value is NA (missing), so negating it tests if it is not NA (not missing). For example, to drop all rows where the gdp value is missing, we can do the following:

filter(wdi_panel, !is.na(gdp))
# A tibble: 6,582 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
 5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
 6 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 7 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 8 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 9 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
10 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

However, there is a special command, drop_na, that can be used instead of using filter. This is an important command to be aware of in its own right, and can usually be a more straightforward alternative to using filter to remove missing values.

On its own, drop_na will remove all rows where there is a missing value anywhere.

# drop rows containing missing values
# aka rowwise deletion of missing values
drop_na(wdi_panel)
# A tibble: 5,410 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
 5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
 6 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 7 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 8 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 9 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
10 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 5,400 more rows

If we name specified columns, it will only drop rows where values of these columns are missing. For example, to drop rows wherever gdp is missing, we would do the following:

drop_na(wdi_panel, gdp)
# A tibble: 6,582 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
 5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
 6 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 7 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 8 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 9 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
10 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

To drop rows wherever gdp or gdp_growth is missing, we would do the following:

drop_na(wdi_panel, gdp, gdp_growth)
# A tibble: 6,557 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
 5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
 6 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 7 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 8 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 9 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
10 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,547 more rows

We can use selection helpers like starts_with inside drop_na too. The following will drop any rows if any column starting with gdp has a missing value:

drop_na(wdi_panel, any_of(starts_with('gdp')))
# A tibble: 6,557 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
 5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
 6 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 7 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 8 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 9 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
10 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,547 more rows

5.7 Creating and transforming variables with mutate

The mutate command is another powerful wrangling tool. It creates one or more new columns from the data already present, or it re-calculates or transforms an existing column in place.

As a simple example, let’s create a variable that calculates GDP, which is currently in millions of dollars, in billions of dollars instead:

mutate(wdi_panel, gdp_b = gdp / 1000)
# A tibble: 6,582 × 9
   country     iso2c  year    gdp gdp_growth inflation lfp_male lfp_female gdp_b
   <chr>       <chr> <int>  <dbl>      <dbl>     <dbl>    <dbl>      <dbl> <dbl>
 1 Afghanistan AF     2016 1.03e5       2.26     4.38      74.4      20.2  103. 
 2 Afghanistan AF     2023 8.26e4       2.71    -4.64      70.2       5.16  82.6
 3 Afghanistan AF     2015 1.00e5       1.45    -0.662     75.3      19.1  100. 
 4 Afghanistan AF     2006 4.91e4       5.36     6.78      79.3      14.5   49.1
 5 Afghanistan AF     2017 1.05e5       2.65     4.98      73.4      21.2  105. 
 6 Afghanistan AF     2014 9.90e4       2.72     4.67      76.2      18.1   99.0
 7 Afghanistan AF     2019 1.11e5       3.91     2.30      69.4      18.3  111. 
 8 Afghanistan AF     2021 8.58e4     -20.7      5.13      67.3      14.7   85.8
 9 Afghanistan AF     2013 9.63e4       5.60     7.39      77.1      17.1   96.3
10 Afghanistan AF     2018 1.07e5       1.19     0.626     71.4      19.8  107. 
# ℹ 6,572 more rows

Note how the newly created gdp_b is inserted at the end of the data frame by default. We can optionally use .after or .before to tell mutate where to insert the new variable. For example, to insert gdp_b directly after gdp, we can do:

mutate(wdi_panel, gdp_b = gdp / 1000, .after = gdp)
# A tibble: 6,582 × 9
   country     iso2c  year    gdp gdp_b gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>  <dbl> <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 1.03e5 103.        2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023 8.26e4  82.6       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 1.00e5 100.        1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2006 4.91e4  49.1       5.36     6.78      79.3      14.5 
 5 Afghanistan AF     2017 1.05e5 105.        2.65     4.98      73.4      21.2 
 6 Afghanistan AF     2014 9.90e4  99.0       2.72     4.67      76.2      18.1 
 7 Afghanistan AF     2019 1.11e5 111.        3.91     2.30      69.4      18.3 
 8 Afghanistan AF     2021 8.58e4  85.8     -20.7      5.13      67.3      14.7 
 9 Afghanistan AF     2013 9.63e4  96.3       5.60     7.39      77.1      17.1 
10 Afghanistan AF     2018 1.07e5 107.        1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

We can use mutate to perform multiple transformations at once. You can add several columns in the same call, and later names may refer to earlier ones created in the same line. In the following example, we create a GDP-in-billions variable as before, calculate the annual GDP growth as a multiplier, and then use both of these newly created variables to calculate projected GDP:

mutate(wdi_panel,
  gdp_b = gdp / 1000,
  gdp_growth_mult = 1 + gdp_growth / 100,
  gdp_b_proj = gdp_b * gdp_growth_mult,
  .after = gdp
)
# A tibble: 6,582 × 11
   country     iso2c  year     gdp gdp_b gdp_growth_mult gdp_b_proj gdp_growth
   <chr>       <chr> <int>   <dbl> <dbl>           <dbl>      <dbl>      <dbl>
 1 Afghanistan AF     2016 102672. 103.            1.02       105.        2.26
 2 Afghanistan AF     2023  82595.  82.6           1.03        84.8       2.71
 3 Afghanistan AF     2015 100402. 100.            1.01       102.        1.45
 4 Afghanistan AF     2006  49061.  49.1           1.05        51.7       5.36
 5 Afghanistan AF     2017 105389. 105.            1.03       108.        2.65
 6 Afghanistan AF     2014  98966.  99.0           1.03       102.        2.72
 7 Afghanistan AF     2019 110814. 111.            1.04       115.        3.91
 8 Afghanistan AF     2021  85768.  85.8           0.793       68.0     -20.7 
 9 Afghanistan AF     2013  96341.  96.3           1.06       102.        5.60
10 Afghanistan AF     2018 106643. 107.            1.01       108.        1.19
# ℹ 6,572 more rows
# ℹ 3 more variables: inflation <dbl>, lfp_male <dbl>, lfp_female <dbl>

It is common to use mutate to create discrete or categorical variables. For example, we can create a new variable higher_gdp that indicates if any gdp value is above or below average as follows:

mutate(wdi_panel, higher_gdp = gdp >= mean(gdp), .after = gdp) 
# A tibble: 6,582 × 9
   country     iso2c  year     gdp higher_gdp gdp_growth inflation lfp_male
   <chr>       <chr> <int>   <dbl> <lgl>           <dbl>     <dbl>    <dbl>
 1 Afghanistan AF     2016 102672. FALSE            2.26     4.38      74.4
 2 Afghanistan AF     2023  82595. FALSE            2.71    -4.64      70.2
 3 Afghanistan AF     2015 100402. FALSE            1.45    -0.662     75.3
 4 Afghanistan AF     2006  49061. FALSE            5.36     6.78      79.3
 5 Afghanistan AF     2017 105389. FALSE            2.65     4.98      73.4
 6 Afghanistan AF     2014  98966. FALSE            2.72     4.67      76.2
 7 Afghanistan AF     2019 110814. FALSE            3.91     2.30      69.4
 8 Afghanistan AF     2021  85768. FALSE          -20.7      5.13      67.3
 9 Afghanistan AF     2013  96341. FALSE            5.60     7.39      77.1
10 Afghanistan AF     2018 106643. FALSE            1.19     0.626     71.4
# ℹ 6,572 more rows
# ℹ 1 more variable: lfp_female <dbl>

The newly created higher_gdp is a logical vector, with values TRUE or FALSE, where TRUE means the corresponding gdp is equal to or above average GDP, and FALSE means it is below average. An alternative approach is to use if_else as follows:

mutate(wdi_panel,
  higher_gdp = if_else(gdp >= mean(gdp), "above_avg", "below_avg"),
  .after = gdp
)
# A tibble: 6,582 × 9
   country     iso2c  year     gdp higher_gdp gdp_growth inflation lfp_male
   <chr>       <chr> <int>   <dbl> <chr>           <dbl>     <dbl>    <dbl>
 1 Afghanistan AF     2016 102672. below_avg        2.26     4.38      74.4
 2 Afghanistan AF     2023  82595. below_avg        2.71    -4.64      70.2
 3 Afghanistan AF     2015 100402. below_avg        1.45    -0.662     75.3
 4 Afghanistan AF     2006  49061. below_avg        5.36     6.78      79.3
 5 Afghanistan AF     2017 105389. below_avg        2.65     4.98      73.4
 6 Afghanistan AF     2014  98966. below_avg        2.72     4.67      76.2
 7 Afghanistan AF     2019 110814. below_avg        3.91     2.30      69.4
 8 Afghanistan AF     2021  85768. below_avg      -20.7      5.13      67.3
 9 Afghanistan AF     2013  96341. below_avg        5.60     7.39      77.1
10 Afghanistan AF     2018 106643. below_avg        1.19     0.626     71.4
# ℹ 6,572 more rows
# ℹ 1 more variable: lfp_female <dbl>

Now, the new higher_gdp variable is a character vector with two possible values: above_avg and below_avg. With the if_else function, there are always three arguments: the first is some condition to be tested, the second is the value returned if the condition is true, and the third is the value returned if the condition is false.

To create a discrete or categorical variable with more than two possible values, a very useful function is case_when. For example, we can create a new variable infl_rate to indicate if inflation is “high” (above 5%), “medium” (between 2% and 5%), or “low” (less than 2%) as follows:

mutate(wdi_panel,
       infl_rate = case_when(
         inflation > 5 ~ 'high',
         inflation > 2 ~ 'medium',
         TRUE ~ 'low'
       ),
       .after = inflation
)
# A tibble: 6,582 × 9
   country iso2c  year    gdp gdp_growth inflation infl_rate lfp_male lfp_female
   <chr>   <chr> <int>  <dbl>      <dbl>     <dbl> <chr>        <dbl>      <dbl>
 1 Afghan… AF     2016 1.03e5       2.26     4.38  medium        74.4      20.2 
 2 Afghan… AF     2023 8.26e4       2.71    -4.64  low           70.2       5.16
 3 Afghan… AF     2015 1.00e5       1.45    -0.662 low           75.3      19.1 
 4 Afghan… AF     2006 4.91e4       5.36     6.78  high          79.3      14.5 
 5 Afghan… AF     2017 1.05e5       2.65     4.98  medium        73.4      21.2 
 6 Afghan… AF     2014 9.90e4       2.72     4.67  medium        76.2      18.1 
 7 Afghan… AF     2019 1.11e5       3.91     2.30  medium        69.4      18.3 
 8 Afghan… AF     2021 8.58e4     -20.7      5.13  high          67.3      14.7 
 9 Afghan… AF     2013 9.63e4       5.60     7.39  high          77.1      17.1 
10 Afghan… AF     2018 1.07e5       1.19     0.626 low           71.4      19.8 
# ℹ 6,572 more rows

How case_when() works may not be obvious at first glance.
Think of it as checking each row against the rules in order:

  1. Test the first rule, inflation > 5.
    – If it is true, write "high" and stop for that row.
  2. Otherwise, test the second rule, inflation > 2.
    – If true, write "medium" (we already know the value is ≤ 5).
  3. If neither rule matched, the final catch-all line TRUE ~ "low" fires, so the row gets "low".

Because the last rule is always TRUE, every row receives exactly one of the three labels — high, medium, or low — and the checking stops as soon as the first matching rule is found.

Another handy helper inside mutate() is across(). Use it when the same change, for example, “divide by 100”, should be applied to several columns in one go.

mutate(
  wdi_panel,
  across(gdp_growth:lfp_female, ~ . / 100)
)
# A tibble: 6,582 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.     0.0226   0.0438     0.744     0.202 
 2 Afghanistan AF     2023  82595.     0.0271  -0.0464     0.702     0.0516
 3 Afghanistan AF     2015 100402.     0.0145  -0.00662    0.753     0.191 
 4 Afghanistan AF     2006  49061.     0.0536   0.0678     0.793     0.145 
 5 Afghanistan AF     2017 105389.     0.0265   0.0498     0.734     0.212 
 6 Afghanistan AF     2014  98966.     0.0272   0.0467     0.762     0.181 
 7 Afghanistan AF     2019 110814.     0.0391   0.0230     0.694     0.183 
 8 Afghanistan AF     2021  85768.    -0.207    0.0513     0.673     0.147 
 9 Afghanistan AF     2013  96341.     0.0560   0.0739     0.771     0.171 
10 Afghanistan AF     2018 106643.     0.0119   0.00626    0.714     0.198 
# ℹ 6,572 more rows

What happens, step-by-step:

  1. gdp_growth:lfp_female selects the block of four percentage columns (gdp_growth, inflation, lfp_male, lfp_female).
  2. For every value in each of those columns, the little rule ~ . / 100 is carried out. Read it as “take the value (the dot) and divide it by 100.”
  3. The result, which is now a true proportion like 0.0226 instead of 2.26, is written back into the same columns. All other columns stay untouched.

So across() lets you say “do this to all these columns” once, instead of repeating the same line four times.

5.8 Sorting rows with arrange

Data often need to be displayed or examined in a particular order rather than the order they were read from disk. The arrange function reorders rows based on the values in one or more columns while leaving every column unchanged.

Sorting the wdi_panel data by gross domestic product allows us to see which country–years have the largest or smallest economies in the table.

arrange(wdi_panel, gdp)
# A tibble: 6,582 × 8
   country iso2c  year   gdp gdp_growth inflation lfp_male lfp_female
   <chr>   <chr> <int> <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Tuvalu  TV     1990  26.3      15.4         NA       NA         NA
 2 Tuvalu  TV     1991  27.2       3.61        NA       NA         NA
 3 Tuvalu  TV     1992  28.0       2.79        NA       NA         NA
 4 Tuvalu  TV     1996  28.7      -5.96        NA       NA         NA
 5 Tuvalu  TV     1993  29.1       4.11        NA       NA         NA
 6 Tuvalu  TV     1995  30.5      -5.00        NA       NA         NA
 7 Tuvalu  TV     1997  31.5      10.0         NA       NA         NA
 8 Tuvalu  TV     1994  32.1      10.3         NA       NA         NA
 9 Tuvalu  TV     2005  34.7      -4.65        NA       NA         NA
10 Tuvalu  TV     2006  35.5       2.44        NA       NA         NA
# ℹ 6,572 more rows

The rows now appear from the smallest GDP to the largest.

To reverse the order wrap the column name in desc, which stands for descending.

arrange(wdi_panel, desc(gdp))
# A tibble: 6,582 × 8
   country       iso2c  year       gdp gdp_growth inflation lfp_male lfp_female
   <chr>         <chr> <int>     <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 China         CN     2023 31229733.       5.25     0.235     71.6       59.9
 2 China         CN     2022 29672080.       2.95     1.97      71.6       59.9
 3 China         CN     2021 28821649.       8.45     0.981     72.9       61.2
 4 China         CN     2020 26576354.       2.24     2.42      71.9       59.9
 5 China         CN     2019 25994432.       5.95     2.90      73.4       61.2
 6 United States US     2023 24977118.       2.89     4.12      67.6       56.6
 7 China         CN     2018 24534506.       6.75     2.07      74.0       61.5
 8 United States US     2022 24276131.       2.51     8.00      67.6       56.0
 9 United States US     2021 23681171        6.06     4.70      67.4       55.6
10 China         CN     2017 22983193.       6.95     1.59      74.6       61.9
# ℹ 6,572 more rows

This version lists the country–year with the highest GDP first and moves down to the lowest. Multiple sort keys can be combined by listing additional columns after the first, allowing ties to be broken or secondary ordering to be imposed, but a single arrange call with or without desc is enough for most quick inspections.

5.9 Reshaping data from wide to long format and back

In data frames, data is often arranged in one of two fundamental ways for specific sets of variables: wide format and long format. For some analyses it is necessary to use a wide format and for other analyses a long format is required. Transforming your data from wide to long, or long to wide, is known as reshaping. Understanding these formats and how to switch between them is an essential data-wrangling skill. The tidyverse, specifically the tidyr package, provides two tools to reshape data: pivot_longer, which converts wide format to long, and pivot_wider, which converts long format to wide.

It is easier to understand long and wide formats using examples. Consider the following fictitious data.

neighbourhood crime_2019 crime_2020 crime_2021
Downtown 120 115 110
Riverside 85 90 88
Westfield 60 55 52
Lakeshore 40 38 35

Each row is a neighbourhood and each column after the first records the reported crime rate (crime per 10 000 residents) in a different year. Because the same measure is stored in separate year-specific columns this data frame is in wide form.

By contrast, the next data frame is the same data in long format.

neighbourhood year crime
Downtown 2019 120
Downtown 2020 115
Downtown 2021 110
Riverside 2019 85
Riverside 2020 90
Riverside 2021 88
Westfield 2019 60
Westfield 2020 55
Westfield 2021 52
Lakeshore 2019 40
Lakeshore 2020 38
Lakeshore 2021 35

The long version stacks the three year columns into two new columns: one for the year and one for the crime rate. A wide format is often easier for humans to read and can help with comparisons across columns. A long format is usually preferable or required for many statistical analyses and visualisations, and it is more flexible for adding new data, filtering, and performing calculations. As a rule, when analysing data in R, a long format is often the preferred default.

The data frame wdi_panel has two lfp_ columns in wide format, one for male and one for female labour-force participation. We can convert these columns to long format with pivot_longer.

wdi_panel |> 
  pivot_longer(
    cols       = starts_with("lfp_"),
    names_to   = "gender",
    values_to  = "lfp"
  )
# A tibble: 13,164 × 8
   country     iso2c  year     gdp gdp_growth inflation gender       lfp
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl> <chr>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38  lfp_male   74.4 
 2 Afghanistan AF     2016 102672.       2.26     4.38  lfp_female 20.2 
 3 Afghanistan AF     2023  82595.       2.71    -4.64  lfp_male   70.2 
 4 Afghanistan AF     2023  82595.       2.71    -4.64  lfp_female  5.16
 5 Afghanistan AF     2015 100402.       1.45    -0.662 lfp_male   75.3 
 6 Afghanistan AF     2015 100402.       1.45    -0.662 lfp_female 19.1 
 7 Afghanistan AF     2006  49061.       5.36     6.78  lfp_male   79.3 
 8 Afghanistan AF     2006  49061.       5.36     6.78  lfp_female 14.5 
 9 Afghanistan AF     2017 105389.       2.65     4.98  lfp_male   73.4 
10 Afghanistan AF     2017 105389.       2.65     4.98  lfp_female 21.2 
# ℹ 13,154 more rows

The cols argument identifies the columns to pivot, which are the two that start with lfp_. For each of those columns, pivot_longer creates a new row for every original row, writes the column name into gender, and writes the original cell value into lfp. All columns that were not selected remain unchanged.

The names that appear in gender are lfp_male and lfp_female. It is preferable to remove the lfp_ prefix from these names. We can drop that shared prefix during the pivot as follows.

wdi_panel |> 
  pivot_longer(
    cols          = starts_with("lfp_"),
    names_to      = "gender",
    names_prefix  = "lfp_",
    values_to     = "lfp"
  )
# A tibble: 13,164 × 8
   country     iso2c  year     gdp gdp_growth inflation gender   lfp
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl> <chr>  <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38  male   74.4 
 2 Afghanistan AF     2016 102672.       2.26     4.38  female 20.2 
 3 Afghanistan AF     2023  82595.       2.71    -4.64  male   70.2 
 4 Afghanistan AF     2023  82595.       2.71    -4.64  female  5.16
 5 Afghanistan AF     2015 100402.       1.45    -0.662 male   75.3 
 6 Afghanistan AF     2015 100402.       1.45    -0.662 female 19.1 
 7 Afghanistan AF     2006  49061.       5.36     6.78  male   79.3 
 8 Afghanistan AF     2006  49061.       5.36     6.78  female 14.5 
 9 Afghanistan AF     2017 105389.       2.65     4.98  male   73.4 
10 Afghanistan AF     2017 105389.       2.65     4.98  female 21.2 
# ℹ 13,154 more rows

Here names_prefix tells pivot_longer to remove lfp_ before writing the value into gender. The gender column now holds the labels male and female.

Because pivots are reversible, we can restore the original wide layout with pivot_wider, which transforms long format to wide. In the code below we first save the output of pivot_longer as wdi_longer and then use pivot_wider to reverse it.

wdi_longer <- pivot_longer(
    wdi_panel,
    cols          = starts_with("lfp_"),
    names_to      = "gender",
    names_prefix  = "lfp_",
    values_to     = "lfp"
  )

pivot_wider(
    wdi_longer,
    names_from   = "gender",
    values_from  = "lfp",
    names_prefix = "lfp_"
  )
# A tibble: 6,582 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
 5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
 6 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 7 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 8 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 9 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
10 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

pivot_wider spreads the values in gender back into separate columns, fills them with the numbers from lfp, and re-attaches the prefix so the resulting names are lfp_male and lfp_female again. There is a clear symmetry between pivot_longer and pivot_wider: names_to matches names_from, and values_to matches values_from.

5.10 Joining data frames

Analyses often require information that lives in more than one data frame. A join operation combines two data frames by matching rows that share the same identifying columns, sometimes called keys.

Imagine if we have two data frames:

  • data_frame_A: A first data frame could be a list of student names and their student ID numbers.
  • data_frame_B: A second data frame could be a list of classes and the ID numbers of the students taking them.

If you want to find out the names of all the students in all the classes, we need to link up these two data frames. The join operation does exactly that by looking for the common piece of information between the two; in this case, it is the student ID number. It effectively looks at each student ID value in data_frame_A and then finds all the rows in data_frame_B with matching student IDs. It then combines the information from both lists for each matching ID, creating a new and more complete data frame that tells you which student (by name) is taking which class.

The tidyverse, specifically the dplyr package, provides a set of joining functions: inner_join, left_join, and others. To see this in action, let’s say we want to get the population for every country and year in the wdi_panel. We can do this by joining wdi_panel with the population data frame, which is part of the tidyr package and so is automatically loaded when you load the tidyverse:

population
# A tibble: 4,060 × 3
   country      year population
   <chr>       <dbl>      <dbl>
 1 Afghanistan  1995   17586073
 2 Afghanistan  1996   18415307
 3 Afghanistan  1997   19021226
 4 Afghanistan  1998   19496836
 5 Afghanistan  1999   19987071
 6 Afghanistan  2000   20595360
 7 Afghanistan  2001   21347782
 8 Afghanistan  2002   22202806
 9 Afghanistan  2003   23116142
10 Afghanistan  2004   24018682
# ℹ 4,050 more rows

As you can see, population has country and year, as does wdi_panel, and so we can join these two by

inner_join(wdi_panel, population)
# A tibble: 3,095 × 9
   country     iso2c  year    gdp gdp_growth inflation lfp_male lfp_female population
   <chr>       <chr> <dbl>  <dbl>      <dbl>     <dbl>    <dbl>      <dbl>      <dbl>
 1 Afghanistan AF     2006 49061.      5.36       6.78     79.3       14.5   25631282
 2 Afghanistan AF     2013 96341.      5.60       7.39     77.1       17.1   30551674
 3 Afghanistan AF     2005 46567.     11.2       12.7      79.5       14.3   24860855
 4 Afghanistan AF     2004 41865.      1.41      NA        79.6       14.2   24018682
 5 Afghanistan AF     2002 37931.     28.6       NA        79.8       14.2   22202806
 6 Afghanistan AF     2003 41282.      8.83      NA        79.7       14.2   23116142
 7 Afghanistan AF     2007 55845.     13.8        8.68     79.2       14.7   26349243
 8 Afghanistan AF     2008 58037.      3.92      26.4      78.9       15.0   27032197
 9 Afghanistan AF     2012 91231.     12.8        6.44     78.0       16.2   29824536
10 Afghanistan AF     2011 80913.      0.426     11.8      78.2       15.9   29105480
# ℹ 3,085 more rows

In this example, inner_join finds all columns shared between wdi_panel and population and joins by every combination of these two variables. Sometimes it is helpful or necessary to explicitly specify the shared columns to join on. We can do that by using by = as follows:

inner_join(wdi_panel, population, by = c("country", "year"))
# A tibble: 3,095 × 9
   country     iso2c  year    gdp gdp_growth inflation lfp_male lfp_female population
   <chr>       <chr> <dbl>  <dbl>      <dbl>     <dbl>    <dbl>      <dbl>      <dbl>
 1 Afghanistan AF     2006 49061.      5.36       6.78     79.3       14.5   25631282
 2 Afghanistan AF     2013 96341.      5.60       7.39     77.1       17.1   30551674
 3 Afghanistan AF     2005 46567.     11.2       12.7      79.5       14.3   24860855
 4 Afghanistan AF     2004 41865.      1.41      NA        79.6       14.2   24018682
 5 Afghanistan AF     2002 37931.     28.6       NA        79.8       14.2   22202806
 6 Afghanistan AF     2003 41282.      8.83      NA        79.7       14.2   23116142
 7 Afghanistan AF     2007 55845.     13.8        8.68     79.2       14.7   26349243
 8 Afghanistan AF     2008 58037.      3.92      26.4      78.9       15.0   27032197
 9 Afghanistan AF     2012 91231.     12.8        6.44     78.0       16.2   29824536
10 Afghanistan AF     2011 80913.      0.426     11.8      78.2       15.9   29105480
# ℹ 3,085 more rows

The inner_join function keeps only those rows where the key values appear in both data frames. If a particular country and year pair is missing from either table, that row is dropped. In our case the result shrinks from 6582 to 3095 rows, keeping only the country-year combinations for which a population figure is available.

In contrast to inner_join, left_join keeps every row from the first (left) data frame and adds columns from the second wherever the keys match:

left_join(wdi_panel, population, by = c("country", "year"))
# A tibble: 6,582 × 9
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female population
   <chr>       <chr> <dbl>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2          NA
 2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16         NA
 3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1          NA
 4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5    25631282
 5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2          NA
 6 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1          NA
 7 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3          NA
 8 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7          NA
 9 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1    30551674
10 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8          NA
# ℹ 6,572 more rows

When using left_join, if the second data frame lacks a matching row, the new columns are filled with NA. In this example, the resulting output therefore has the same number of rows as wdi_panel (6582) and has missing values for population where no match was found.

Use inner_join when you need complete cases and are happy to drop observations that lack matches. Use left_join when you want to preserve all rows from the main table and simply attach whatever extra information is available.

Other join variants exist such as right_join, full_join, and anti_join, but inner_join and left_join cover most day-to-day tasks and are all you need at this stage.

5.11 Data-wrangling pipelines

Chapter 2 introduced the pipe operator |> and showed how it passes the result of one expression straight into the next function. Data wrangling is where that left-to-right flow shines. Each of the tidyverse commands we focused on here does one job, so real wrangling work usually means running several commands in sequence. A pipeline writes those steps as a single sentence that the reader can follow from the raw data to the finished table.

Below are three short pipelines that use almost every tool covered so far. They all start with wdi_panel, add information or reshape it, and end with a table that could feed directly into a plot or a model.

Pipeline example 1: GDP per person, ranked
This pipeline builds a ranked list of GDP-per-capita across different country-years.
inner_join(wdi_panel, population, by = c("country", "year")) |>
  mutate(gdp_pc = 1e6 * gdp / population) |>
  select(country, year, gdp_pc) |>
  arrange(desc(gdp_pc))
# A tibble: 3,095 × 3
   country     year  gdp_pc
   <chr>      <dbl>   <dbl>
 1 Luxembourg  2007 139017.
 2 Luxembourg  2008 135467.
 3 Luxembourg  2013 132423.
 4 Luxembourg  2006 131384.
 5 Luxembourg  2010 130498.
 6 Luxembourg  2012 129978.
 7 Luxembourg  2011 129700.
 8 Luxembourg  2009 128236.
 9 Luxembourg  2005 126191.
10 Luxembourg  2004 124852.
# ℹ 3,085 more rows

The steps are as follows:

  1. inner_join(wdi_panel, population, by = c("country", "year")) Keep only those country–year rows that appear in both tables, attaching the population column to the matching rows in wdi_panel.

  2. mutate(gdp_pc = 1e6 * gdp / population) Convert GDP from millions to dollars (gdp * 1e6) and divide by population to create a new column gdp_pc, the per-capita GDP.

  3. select(country, year, gdp_pc) Retain just the identifying columns country and year along with the newly calculated gdp_pc, discarding all other variables.

  4. arrange(desc(gdp_pc)) Sort the resulting table in descending order so that the highest GDP-per-capita observations appear first.

Pipeline example 2: Recent inflation by income group
This pipeline produces a five-row summary that compares average GDP and average inflation across GDP quintiles for recent years.
wdi_panel |> 
  filter(year >= 2015) |>
  mutate(gdp_quintile = ntile(gdp, 5)) |> 
  summarise(avg_gdp = mean(gdp), 
            avg_inflation = mean(inflation, na.rm = TRUE), 
            .by = gdp_quintile) |> 
  arrange(gdp_quintile)
# A tibble: 5 × 3
  gdp_quintile  avg_gdp avg_inflation
         <int>    <dbl>         <dbl>
1            1    3139.          3.18
2            2   28657.          5.51
3            3   84298.          8.75
4            4  318744.          7.56
5            5 3253163.          5.53

Here is what happens, step by step.

  1. filter(year >= 2015) Keep only rows from 2015 onward. Everything earlier is discarded.

  2. mutate(gdp_quintile = ntile(gdp, 5)) Rank the remaining rows by their GDP value and split them into five equal-sized groups. The poorest fifth gets gdp_quintile = 1, the richest fifth gets 5.

  3. summarise(..., .by = gdp_quintile) For each quintile calculate • avg_gdp: the mean GDP within that group. • avg_inflation: the mean inflation rate, ignoring any missing values. Because of .by = gdp_quintile, the summary is done separately for each quintile, yielding one row per group.

  4. arrange(gdp_quintile) Order the resulting five rows from the lowest GDP quintile to the highest so the output reads naturally from poorer to richer economies.

  • filter to select the last ten years.
  • summarise to group the rows by World Bank income category. Compute the average inflation rate within each group. List groups from highest to lowest inflation.
Pipeline example 3: Gender gaps in labour participation by region
This pipeline averages the labour force participation rate for males and females, averaged over all countries and years:
wdi_panel |> 
  pivot_longer(
    cols          = starts_with("lfp_"),
    names_to      = "gender",
    names_prefix  = "lfp_",
    values_to     = "lfp"
  ) |> summarise(lfp = mean(lfp, na.rm = TRUE),
                 .by = gender)
# A tibble: 2 × 2
  gender   lfp
  <chr>  <dbl>
1 male    72.2
2 female  50.0
  1. pivot_longer(cols = starts_with("lfp_"), names_to = "gender", names_prefix = "lfp_", values_to = "lfp") Select the columns whose names begin with lfp_. Stack those columns into two new columns where the former column name, stripped of the lfp_ prefix, is stored in gender and the numeric value is stored in lfp.

  2. summarise(lfp = mean(lfp, na.rm = TRUE), .by = gender) Group the long data by gender and compute the mean of lfp, ignoring missing values. The result is one row for male and one row for female, each containing the average labour-force participation rate.

These three examples show how a pipe strings together joins, filters, calculations, reshapes, summaries, and sorts. The same pattern scales to longer pipelines; you just add another step on a new line.

5.12 Further Reading

The wrangling techniques and tidyverse functions presented in this chapter provide a foundation for cleaning and transforming data. The tidyverse contains many additional functions and advanced techniques that offer more power and flexibility. Exploring these resources will deepen your understanding and expand your data-wrangling toolkit.

The Official tidyverse Documentation
This chapter focused on core dplyr and tidyr functions. You can learn more about these packages and others in the official documentation, which includes many examples.
R for Data Science (R4DS) by Hadley Wickham and Garrett Grolemund
This textbook (Wickham et al., 2023) is the standard reference for learning the tidyverse. It includes detailed chapters on data transformation and is freely available online at https://r4ds.hadley.nz/.
Specific Package Cheatsheets
Many tidyverse packages have quick-reference cheatsheets. These can be found on the Posit cheatsheets page at https://posit.co/resources/cheatsheets/ or via the individual package websites.

5.13 Chapter summary

What you learned
- Why data wrangling matters: real-world datasets are rarely tidy, and cleaning, reshaping, and transforming them is essential before any analysis can begin.
- How to use core tidyverse verbs for working systematically with data:
select() to choose columns, rename() and relocate() to tidy and reorder them,
slice() and filter() to keep or drop rows,
mutate() to create or transform variables,
and arrange() to sort the results.
- How to reshape data between wide and long formats using pivot_longer() and pivot_wider().
- How to merge datasets using joins such as inner_join() and left_join(), matching rows based on shared identifiers.
- How to build pipelines using the native pipe |> to link transformations together into clear, reproducible workflows.
- How tidy data principles — each variable in a column, each observation in a row, each value in a cell — underpin all wrangling operations and make analysis easier.

Common pitfalls to avoid
- Thinking of wrangling as separate from analysis rather than its foundation.
- Forgetting that tidyverse functions are immutable — they return new data frames instead of changing the original.
- Using manual or repetitive transformations instead of chaining commands in a pipeline.
- Losing track of the tidy data structure when reshaping or joining.

Takeaway
- Chapter 5 introduced the practical toolkit for transforming messy real-world data into clean, analysis-ready form.
- You learned to use the tidyverse verbs and the pipe |> to construct transparent wrangling pipelines that anyone can read and reproduce.
- These skills are the foundation for all later work in the rest of the book.