5  Data wrangling

Until now, every dataset in this book has arrived in perfect rows and columns. There were no columns to rename, no rows to drop, and nothing to reshape or merge. They were ready for whatever analyses or plotting we needed to do. Real-world datasets, by contrast, are rarely so accommodating. You might encounter superfluous columns, or need to create new variables. Perhaps you’ll only need a subset of the rows, or have to reshape and restructure the data, or even merge multiple datasets into one. Usually, before any analysis can begin, however simple, we have to reformat, clean, or transform our data.

This activity of cleaning or transforming data is often called data wrangling. The verb to wrangle originally meant “to struggle or argue at length.” Ranchers later used it for herding or taming uncooperative livestock. Modern analysts have borrowed the word because taming data can feel exactly the same: time-consuming, difficult, and occasionally infuriating. Many surveys (for example, Anaconda, 2022) suggest that analysts spend 40–60% of their time on data wrangling; more than on statistical analysis itself. Despite this, the topic is often neglected in statistics textbooks and courses.

In this chapter, you’ll learn about a set of interrelated R commands used to wrangle data. They go by names like select(), filter(), and mutate(), and they’re almost all part of the broad tidyverse family of R tools. Each of these commands is designed to do one thing well while working seamlessly with the others. As such, they can be used to create powerful sequences of data transformations, known as pipelines, that take data from an unruly raw state and transform it into something clean, tidy, and ready for analysis.

We’ll go through many of the major wrangling tools, explaining their purpose and how they work. Then, we’ll provide examples of how they can be combined into pipelines to transform messy raw data into clean, tidy data. Everything we need, whether commands or data, is available in either the tidyverse meta-package or this textbook’s sgsur package.

library(tidyverse)
library(sgsur)

The chapter covers

  • Data wrangling transforms raw, messy datasets into tidy, analysis-ready format, consuming substantial analyst time but enabling all subsequent work.
  • Core tidyverse verbs like select(), filter(), mutate(), and arrange() each perform one transformation task efficiently and combine seamlessly.
  • Reshaping operations with pivot_longer() and pivot_wider() convert between wide and long data formats as different analyses require.
  • Join operations like inner_join() and left_join() combine information from multiple related datasets based on shared key variables.
  • Pipelines created with the native pipe operator link wrangling steps into readable, reproducible sequences that transform data systematically.
  • Developing fluency with these tools reduces time spent fighting data and increases time available for meaningful analysis and interpretation.

5.1 Selecting columns with select()

The first wrangling tool we will look at is select(). This is part of the dplyr package, which is part of the tidyverse meta-package, so is loaded already. To explore this command, we will use the wdi_panel dataset that comes with the sgsur package.

Let’s look at this data:

wdi_panel
# A tibble: 6,582 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
 5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
 6 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 7 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 8 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 9 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
10 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

As you can see, this is a data frame with 6582 rows and 8 columns. It provides some variables that describe the economy of many different countries over recent decades; see help("wdi_panel") or ?wdi_panel for full details about this dataset.

With select(), we can choose, or select, a subset of the columns of any data frame. As we will see, it provides many different ways to do this, but the simplest is just to name the individual columns that we want to choose. For example, if we just want the columns that give the country (country), year (year), its gross domestic product (GDP, gdp), and country code (iso2c), we can do the following:

select(wdi_panel, country, year, gdp)
# A tibble: 6,582 × 3
   country      year     gdp
   <chr>       <int>   <dbl>
 1 Afghanistan  2016 102672.
 2 Afghanistan  2023  82595.
 3 Afghanistan  2015 100402.
 4 Afghanistan  2006  49061.
 5 Afghanistan  2017 105389.
 6 Afghanistan  2014  98966.
 7 Afghanistan  2019 110814.
 8 Afghanistan  2021  85768.
 9 Afghanistan  2013  96341.
10 Afghanistan  2018 106643.
# ℹ 6,572 more rows

As you can see, the way select works in this example is very simple: first we give the name of the data frame (wdi_panel), and then give the names of the columns of that data frame that we want to select (country, year, gdp).

We can name as many columns as we wish, and the selected columns appear in the order we name them. For example:

select(wdi_panel, country, iso2c, year, gdp)
select(wdi_panel, year, iso2c, country, gdp)

An important point to make about the select() function is that it does not change the original data frame. This is a general property of tidyverse commands, which favour immutable data frames: every command returns a new object instead of altering the original. In other words, while we put wdi_panel into the select() function, a new data frame comes out; wdi_panel itself is unchanged. In order to see the output in the examples used above, we did not use the <- assignment operator to save the new data frame that the function returned, but we can always do this. For example, the following creates a new data frame named wdi_panel_a that just has the country, year, and GDP:

wdi_panel_a <- select(wdi_panel, country, year, gdp)

The select() function is far more flexible than simply selecting columns by listing their names. It has several helper tricks and functions — ranges, numeric positions, pattern-matching functions, negation, and others — that let you specify whole groups of variables in one go.

Selecting a consecutive set of columns
If the columns that you want to select occur consecutively, you can select them by just stating the first and last column in the range you want. For example, suppose we need the first five columns (country to gdp_growth). We can select them all at once using the colon operator as follows:
select(wdi_panel, country:gdp_growth)
# A tibble: 6,582 × 5
   country     iso2c  year     gdp gdp_growth
   <chr>       <chr> <int>   <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26
 2 Afghanistan AF     2023  82595.       2.71
 3 Afghanistan AF     2015 100402.       1.45
 4 Afghanistan AF     2006  49061.       5.36
 5 Afghanistan AF     2017 105389.       2.65
 6 Afghanistan AF     2014  98966.       2.72
 7 Afghanistan AF     2019 110814.       3.91
 8 Afghanistan AF     2021  85768.     -20.7 
 9 Afghanistan AF     2013  96341.       5.60
10 Afghanistan AF     2018 106643.       1.19
# ℹ 6,572 more rows

The colon tells select() to keep all columns from the one named on the left up to the one named on the right, in their original order.

Selecting by column index
We can also keep columns purely by their numerical index or position, and we can use the colon operator for consecutive columns:
select(wdi_panel, 1, 3, 4)        # columns 1, 3, and 4
select(wdi_panel, 1:4)             # columns 1 through 4
Selecting by names and indices at the same time
We can select by column names and column indices, or sequences of them using the colon, at the same time. For example, this selects country, year, and the seventh and eighth columns:
select(wdi_panel, country, year, 7:8)
# A tibble: 6,582 × 4
   country      year lfp_male lfp_female
   <chr>       <int>    <dbl>      <dbl>
 1 Afghanistan  2016     74.4      20.2 
 2 Afghanistan  2023     70.2       5.16
 3 Afghanistan  2015     75.3      19.1 
 4 Afghanistan  2006     79.3      14.5 
 5 Afghanistan  2017     73.4      21.2 
 6 Afghanistan  2014     76.2      18.1 
 7 Afghanistan  2019     69.4      18.3 
 8 Afghanistan  2021     67.3      14.7 
 9 Afghanistan  2013     77.1      17.1 
10 Afghanistan  2018     71.4      19.8 
# ℹ 6,572 more rows
Selecting columns that start with, end with, or contain a pattern
The select() function provides helper functions like starts_with(), ends_with(), and contains() for pattern matching:
select(wdi_panel, starts_with("gdp"))
# A tibble: 6,582 × 2
       gdp gdp_growth
     <dbl>      <dbl>
 1 102672.       2.26
 2  82595.       2.71
 3 100402.       1.45
 4  49061.       5.36
 5 105389.       2.65
 6  98966.       2.72
 7 110814.       3.91
 8  85768.     -20.7 
 9  96341.       5.60
10 106643.       1.19
# ℹ 6,572 more rows
select(wdi_panel, ends_with("male"))
select(wdi_panel, contains("inf"))

All of these helper functions, starts_with(), ends_with(), contains(), are not sensitive to case, upper or lower case, by default. For example, contains("INF") is the same as contains('inf') by default. If we want it to be case sensitive, we can indicate this by using ignore.case = FALSE inside the helper. For example, here is how we tell it to match the exact case:

# returns only columns containing "INF", of which there are none
select(wdi_panel, contains("INF", ignore.case = FALSE))
# A tibble: 6,582 × 0
Deselecting columns by negation
By putting a minus sign in front of any selector, we can drop or deselect columns instead of keeping them. For example, in the following example, we keep everything except iso2c:
select(wdi_panel, -iso2c)
# A tibble: 6,582 × 7
   country      year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan  2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan  2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan  2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan  2006  49061.       5.36     6.78      79.3      14.5 
 5 Afghanistan  2017 105389.       2.65     4.98      73.4      21.2 
 6 Afghanistan  2014  98966.       2.72     4.67      76.2      18.1 
 7 Afghanistan  2019 110814.       3.91     2.30      69.4      18.3 
 8 Afghanistan  2021  85768.     -20.7      5.13      67.3      14.7 
 9 Afghanistan  2013  96341.       5.60     7.39      77.1      17.1 
10 Afghanistan  2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

You can read the minus sign as meaning everything except. We can put it in front of every helper function too. For example, -starts_with("lfp_") drops all labour-force participation variables, which all start with lfp_.

Renaming while selecting
When we use select, we can optionally rename columns by using new_name = old_name inside the call. For example, here we select four columns but rename two as we do so:
select(
  wdi_panel,
  country,
  country_code = iso2c,
  year,
  gdp,
  growth = gdp
)
# A tibble: 6,582 × 4
   country     country_code  year  growth
   <chr>       <chr>        <int>   <dbl>
 1 Afghanistan AF            2016 102672.
 2 Afghanistan AF            2023  82595.
 3 Afghanistan AF            2015 100402.
 4 Afghanistan AF            2006  49061.
 5 Afghanistan AF            2017 105389.
 6 Afghanistan AF            2014  98966.
 7 Afghanistan AF            2019 110814.
 8 Afghanistan AF            2021  85768.
 9 Afghanistan AF            2013  96341.
10 Afghanistan AF            2018 106643.
# ℹ 6,572 more rows

As before however, it is the data frame that comes out that has the new column names; the original wdi_panel remains unchanged.

5.2 Renaming columns with rename()

Our next wrangling command is rename(). We just saw that select() can both pick columns and rename them. As useful as that is, if we just want to rename some variables but not select variables at the same time, we should use rename() instead. This command does only one thing: it changes column names while leaving every column in place. We often use it when we want to keep the current set of variables but the labels are cryptic, too long, or inconsistent and so need to be changed. Here we rename, as we just did using select, iso2c and gdp_growth to country_code and growth, respectively:

rename(wdi_panel,
  country_code = iso2c,
  growth = gdp_growth
)
# A tibble: 6,582 × 8
   country     country_code  year     gdp growth inflation lfp_male lfp_female
   <chr>       <chr>        <int>   <dbl>  <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF            2016 102672.   2.26     4.38      74.4      20.2 
 2 Afghanistan AF            2023  82595.   2.71    -4.64      70.2       5.16
 3 Afghanistan AF            2015 100402.   1.45    -0.662     75.3      19.1 
 4 Afghanistan AF            2006  49061.   5.36     6.78      79.3      14.5 
 5 Afghanistan AF            2017 105389.   2.65     4.98      73.4      21.2 
 6 Afghanistan AF            2014  98966.   2.72     4.67      76.2      18.1 
 7 Afghanistan AF            2019 110814.   3.91     2.30      69.4      18.3 
 8 Afghanistan AF            2021  85768. -20.7      5.13      67.3      14.7 
 9 Afghanistan AF            2013  96341.   5.60     7.39      77.1      17.1 
10 Afghanistan AF            2018 106643.   1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

Note that the syntax is new = old, just as we used when renaming while selecting with select.

Remember, rename() returns a new data frame; the original is untouched. To permanently work with renamed variables, assign the result to a name:

wdi_panel_renamed <- rename(wdi_panel, country_code = iso2c, growth = gdp_growth)

If we need to rename many or even all variables, typing dozens of pairs can be tedious. The rename_with function lets you pass a function to do renaming, which will be applied to every selected column name.

For example, toupper converts strings to uppercase. You can combine rename_with with selection helpers to target specific columns:

rename_with(wdi_panel, toupper)                              # all columns
# A tibble: 6,582 × 8
   COUNTRY     ISO2C  YEAR     GDP GDP_GROWTH INFLATION LFP_MALE LFP_FEMALE
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
 5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
 6 Afghanistan AF     2014  98966.       2.72     4.67      76.2      18.1 
 7 Afghanistan AF     2019 110814.       3.91     2.30      69.4      18.3 
 8 Afghanistan AF     2021  85768.     -20.7      5.13      67.3      14.7 
 9 Afghanistan AF     2013  96341.       5.60     7.39      77.1      17.1 
10 Afghanistan AF     2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows
rename_with(wdi_panel, toupper, .cols = starts_with("gdp"))  # only GDP variables

5.3 Re-ordering columns with relocate()

After we have selected the variables we need and renamed them sensibly, a common housekeeping step is to arrange those columns in an order that makes the data frame easier to read. That is what relocate() does: it moves one or more columns to a new position while leaving every other column untouched. As with the other commands, relocate() returns a new data frame; the original remains unchanged unless you assign it to a new name.

By default, relocate() moves columns to the front. You can move multiple columns as a block, or position them relative to other columns with .before or .after:

relocate(wdi_panel, iso2c)
# A tibble: 6,582 × 8
   iso2c country      year     gdp gdp_growth inflation lfp_male lfp_female
   <chr> <chr>       <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 AF    Afghanistan  2016 102672.       2.26     4.38      74.4      20.2 
 2 AF    Afghanistan  2023  82595.       2.71    -4.64      70.2       5.16
 3 AF    Afghanistan  2015 100402.       1.45    -0.662     75.3      19.1 
 4 AF    Afghanistan  2006  49061.       5.36     6.78      79.3      14.5 
 5 AF    Afghanistan  2017 105389.       2.65     4.98      73.4      21.2 
 6 AF    Afghanistan  2014  98966.       2.72     4.67      76.2      18.1 
 7 AF    Afghanistan  2019 110814.       3.91     2.30      69.4      18.3 
 8 AF    Afghanistan  2021  85768.     -20.7      5.13      67.3      14.7 
 9 AF    Afghanistan  2013  96341.       5.60     7.39      77.1      17.1 
10 AF    Afghanistan  2018 106643.       1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows
relocate(wdi_panel, year, iso2c)                             # multiple to front
relocate(wdi_panel, inflation, .before = gdp)                # before gdp
relocate(wdi_panel, starts_with('lfp_'), .after = year)      # after year

5.4 Selecting rows with slice()

The wrangling commands covered thus far were all applied to columns. There are other commands that apply to rows instead. One of the simplest such functions is slice() whose main purpose is to select rows according to their position. In this sense, slice() is roughly equivalent to select() when it uses column numerical indices.

You can select rows by position, use ranges with :, or negate with -:

slice(wdi_panel, 1:5)                 # rows 1 to 5
# A tibble: 5 × 8
  country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
  <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
1 Afghanistan AF     2016 102672.       2.26     4.38      74.4      20.2 
2 Afghanistan AF     2023  82595.       2.71    -4.64      70.2       5.16
3 Afghanistan AF     2015 100402.       1.45    -0.662     75.3      19.1 
4 Afghanistan AF     2006  49061.       5.36     6.78      79.3      14.5 
5 Afghanistan AF     2017 105389.       2.65     4.98      73.4      21.2 
slice(wdi_panel, 10:20)               # rows 10 to 20
slice(wdi_panel, -1)                  # all except row 1
slice(wdi_panel, -(1:5))              # all except rows 1 to 5

Helper functions slice_head() and slice_tail() select from the beginning or end:

slice_head(wdi_panel, n = 5)          # first 5 rows
slice_tail(wdi_panel, n = 5)          # last 5 rows

The slice_sample() function randomly samples rows, optionally grouped by a variable:

slice_sample(wdi_panel, n = 100)               # 100 random rows
slice_sample(wdi_panel, n = 3, by = country)   # 3 rows per country

5.5 Keeping (or dropping) rows with filter()

The slice() function and variants are useful for selecting rows but they don’t provide all the functionality we usually need for selecting rows. In particular, we very often need to select rows that match some condition or set of conditions. For example, with data like wdi_panel, we may wish to select only those rows that meet some or all of the conditions: “years since 2010”, “countries in the EU”, “non-missing GDP”. Doing this with slice() would be difficult or impossible. Instead, for tasks like this we use filter(), which is one of the most powerful data wrangling work-horses.

Think “filter in”: filter() keeps rows that match conditions. Use == for equality (not =), and other operators like !=, >, <, >=, <=:

filter(wdi_panel, year == 2020)
# A tibble: 199 × 8
   country           iso2c  year    gdp gdp_growth inflation lfp_male lfp_female
   <chr>             <chr> <int>  <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan       AF     2020 1.08e5     -2.35      5.60      66.8       16.5
 2 Albania           AL     2020 4.16e4     -3.31      1.62      65.3       50.6
 3 Algeria           DZ     2020 6.25e5     -5.00      2.42      64.3       13.7
 4 Andorra           AD     2020 4.29e3    -11.2      NA         NA         NA  
 5 Angola            AO     2020 2.53e5     -5.64     22.3       76.8       72.7
 6 Antigua and Barb… AG     2020 2.20e3    -18.9       0.626     NA         NA  
 7 Argentina         AR     2020 1.08e6     -9.90     NA         65.9       46.5
 8 Armenia           AM     2020 4.46e4     -7.20      1.21      68.1       57.9
 9 Aruba             AW     2020 3.10e3    -26.2      NA         NA         NA  
10 Australia         AU     2020 1.46e6     -0.120     0.847     70.2       60.2
# ℹ 189 more rows
filter(wdi_panel, year < 2020)

Combine conditions with commas (AND) or | (OR):

filter(wdi_panel, country == 'France', year >= 2020)                    # AND
filter(wdi_panel, country == 'France', year >= 2020, gdp_growth < 2.0)  # multiple AND
filter(wdi_panel, gdp_growth < 0 | inflation > 20)                      # OR

Use between() for ranges and near() for approximate matches:

filter(wdi_panel, year >= 2000, year <= 2009)       # manual range
filter(wdi_panel, between(year, 2000, 2009))        # between helper
filter(wdi_panel, near(gdp_growth, 5, 1))           # within ±1.0 of 5

Use %in% to match multiple values, and ! to negate conditions:

filter(wdi_panel, country %in% c('France', 'Germany', 'Italy'))    # match set
filter(wdi_panel, !between(year, 2000, 2009))                      # NOT between
filter(wdi_panel, !(country %in% c('France', 'Germany', 'Italy'))) # NOT in set

To remove missing values, use !is.na() in filter or the dedicated drop_na() function:

filter(wdi_panel, !is.na(gdp))                          # filter approach
drop_na(wdi_panel)                                      # all rows with any NA
drop_na(wdi_panel, gdp)                                 # rows with gdp NA
drop_na(wdi_panel, gdp, gdp_growth)                     # rows with either NA
drop_na(wdi_panel, any_of(starts_with('gdp')))          # with selection helpers

5.6 Creating and transforming variables with mutate()

The mutate() command is another powerful wrangling tool. It creates one or more new columns from the data already present, or it re-calculates or transforms an existing column in place.

As a simple example, let’s create a variable that calculates GDP, which is currently in millions of dollars, in billions of dollars instead:

mutate(wdi_panel, gdp_b = gdp / 1000)
# A tibble: 6,582 × 9
   country     iso2c  year    gdp gdp_growth inflation lfp_male lfp_female gdp_b
   <chr>       <chr> <int>  <dbl>      <dbl>     <dbl>    <dbl>      <dbl> <dbl>
 1 Afghanistan AF     2016 1.03e5       2.26     4.38      74.4      20.2  103. 
 2 Afghanistan AF     2023 8.26e4       2.71    -4.64      70.2       5.16  82.6
 3 Afghanistan AF     2015 1.00e5       1.45    -0.662     75.3      19.1  100. 
 4 Afghanistan AF     2006 4.91e4       5.36     6.78      79.3      14.5   49.1
 5 Afghanistan AF     2017 1.05e5       2.65     4.98      73.4      21.2  105. 
 6 Afghanistan AF     2014 9.90e4       2.72     4.67      76.2      18.1   99.0
 7 Afghanistan AF     2019 1.11e5       3.91     2.30      69.4      18.3  111. 
 8 Afghanistan AF     2021 8.58e4     -20.7      5.13      67.3      14.7   85.8
 9 Afghanistan AF     2013 9.63e4       5.60     7.39      77.1      17.1   96.3
10 Afghanistan AF     2018 1.07e5       1.19     0.626     71.4      19.8  107. 
# ℹ 6,572 more rows

Note how the newly created gdp_b is inserted at the end of the data frame by default. We can optionally use .after or .before to tell mutate() where to insert the new variable. For example, to insert gdp_b directly after gdp, we can do:

mutate(wdi_panel, gdp_b = gdp / 1000, .after = gdp)
# A tibble: 6,582 × 9
   country     iso2c  year    gdp gdp_b gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>  <dbl> <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 1.03e5 103.        2.26     4.38      74.4      20.2 
 2 Afghanistan AF     2023 8.26e4  82.6       2.71    -4.64      70.2       5.16
 3 Afghanistan AF     2015 1.00e5 100.        1.45    -0.662     75.3      19.1 
 4 Afghanistan AF     2006 4.91e4  49.1       5.36     6.78      79.3      14.5 
 5 Afghanistan AF     2017 1.05e5 105.        2.65     4.98      73.4      21.2 
 6 Afghanistan AF     2014 9.90e4  99.0       2.72     4.67      76.2      18.1 
 7 Afghanistan AF     2019 1.11e5 111.        3.91     2.30      69.4      18.3 
 8 Afghanistan AF     2021 8.58e4  85.8     -20.7      5.13      67.3      14.7 
 9 Afghanistan AF     2013 9.63e4  96.3       5.60     7.39      77.1      17.1 
10 Afghanistan AF     2018 1.07e5 107.        1.19     0.626     71.4      19.8 
# ℹ 6,572 more rows

We can use mutate() to perform multiple transformations at once. You can add several columns in the same call, and later names may refer to earlier ones created in the same line. In the following example, we create a GDP-in-billions variable as before, calculate the annual GDP growth as a multiplier, and then use both of these newly created variables to calculate projected GDP:

mutate(wdi_panel,
  gdp_b = gdp / 1000,
  gdp_growth_mult = 1 + gdp_growth / 100,
  gdp_b_proj = gdp_b * gdp_growth_mult,
  .after = gdp
)
# A tibble: 6,582 × 11
   country     iso2c  year     gdp gdp_b gdp_growth_mult gdp_b_proj gdp_growth
   <chr>       <chr> <int>   <dbl> <dbl>           <dbl>      <dbl>      <dbl>
 1 Afghanistan AF     2016 102672. 103.            1.02       105.        2.26
 2 Afghanistan AF     2023  82595.  82.6           1.03        84.8       2.71
 3 Afghanistan AF     2015 100402. 100.            1.01       102.        1.45
 4 Afghanistan AF     2006  49061.  49.1           1.05        51.7       5.36
 5 Afghanistan AF     2017 105389. 105.            1.03       108.        2.65
 6 Afghanistan AF     2014  98966.  99.0           1.03       102.        2.72
 7 Afghanistan AF     2019 110814. 111.            1.04       115.        3.91
 8 Afghanistan AF     2021  85768.  85.8           0.793       68.0     -20.7 
 9 Afghanistan AF     2013  96341.  96.3           1.06       102.        5.60
10 Afghanistan AF     2018 106643. 107.            1.01       108.        1.19
# ℹ 6,572 more rows
# ℹ 3 more variables: inflation <dbl>, lfp_male <dbl>, lfp_female <dbl>

It is common to use mutate() to create discrete or categorical variables. For example, we can create a new variable higher_gdp that indicates if any gdp value is above or below average as follows:

mutate(wdi_panel, higher_gdp = gdp >= mean(gdp), .after = gdp)
# A tibble: 6,582 × 9
   country     iso2c  year     gdp higher_gdp gdp_growth inflation lfp_male
   <chr>       <chr> <int>   <dbl> <lgl>           <dbl>     <dbl>    <dbl>
 1 Afghanistan AF     2016 102672. FALSE            2.26     4.38      74.4
 2 Afghanistan AF     2023  82595. FALSE            2.71    -4.64      70.2
 3 Afghanistan AF     2015 100402. FALSE            1.45    -0.662     75.3
 4 Afghanistan AF     2006  49061. FALSE            5.36     6.78      79.3
 5 Afghanistan AF     2017 105389. FALSE            2.65     4.98      73.4
 6 Afghanistan AF     2014  98966. FALSE            2.72     4.67      76.2
 7 Afghanistan AF     2019 110814. FALSE            3.91     2.30      69.4
 8 Afghanistan AF     2021  85768. FALSE          -20.7      5.13      67.3
 9 Afghanistan AF     2013  96341. FALSE            5.60     7.39      77.1
10 Afghanistan AF     2018 106643. FALSE            1.19     0.626     71.4
# ℹ 6,572 more rows
# ℹ 1 more variable: lfp_female <dbl>

The newly created higher_gdp is a logical vector, with values TRUE or FALSE, where TRUE means the corresponding gdp is equal to or above average GDP, and FALSE means it is below average. An alternative approach is to use if_else() as follows:

mutate(wdi_panel,
  higher_gdp = if_else(gdp >= mean(gdp), "above_avg", "below_avg"),
  .after = gdp
)
# A tibble: 6,582 × 9
   country     iso2c  year     gdp higher_gdp gdp_growth inflation lfp_male
   <chr>       <chr> <int>   <dbl> <chr>           <dbl>     <dbl>    <dbl>
 1 Afghanistan AF     2016 102672. below_avg        2.26     4.38      74.4
 2 Afghanistan AF     2023  82595. below_avg        2.71    -4.64      70.2
 3 Afghanistan AF     2015 100402. below_avg        1.45    -0.662     75.3
 4 Afghanistan AF     2006  49061. below_avg        5.36     6.78      79.3
 5 Afghanistan AF     2017 105389. below_avg        2.65     4.98      73.4
 6 Afghanistan AF     2014  98966. below_avg        2.72     4.67      76.2
 7 Afghanistan AF     2019 110814. below_avg        3.91     2.30      69.4
 8 Afghanistan AF     2021  85768. below_avg      -20.7      5.13      67.3
 9 Afghanistan AF     2013  96341. below_avg        5.60     7.39      77.1
10 Afghanistan AF     2018 106643. below_avg        1.19     0.626     71.4
# ℹ 6,572 more rows
# ℹ 1 more variable: lfp_female <dbl>

Now, the new higher_gdp variable is a character vector with two possible values: above_avg and below_avg. With the if_else() function, there are always three arguments: the first is some condition to be tested, the second is the value returned if the condition is true, and the third is the value returned if the condition is false.

To create a discrete or categorical variable with more than two possible values, a very useful function is case_when(). For example, we can create a new variable infl_rate to indicate if inflation is “high” (above 5%), “medium” (between 2% and 5%), or “low” (less than 2%) as follows:

mutate(wdi_panel,
       infl_rate = case_when(
         inflation > 5 ~ 'high',
         inflation > 2 ~ 'medium',
         TRUE ~ 'low'
       ),
       .after = inflation
)
# A tibble: 6,582 × 9
   country iso2c  year    gdp gdp_growth inflation infl_rate lfp_male lfp_female
   <chr>   <chr> <int>  <dbl>      <dbl>     <dbl> <chr>        <dbl>      <dbl>
 1 Afghan… AF     2016 1.03e5       2.26     4.38  medium        74.4      20.2 
 2 Afghan… AF     2023 8.26e4       2.71    -4.64  low           70.2       5.16
 3 Afghan… AF     2015 1.00e5       1.45    -0.662 low           75.3      19.1 
 4 Afghan… AF     2006 4.91e4       5.36     6.78  high          79.3      14.5 
 5 Afghan… AF     2017 1.05e5       2.65     4.98  medium        73.4      21.2 
 6 Afghan… AF     2014 9.90e4       2.72     4.67  medium        76.2      18.1 
 7 Afghan… AF     2019 1.11e5       3.91     2.30  medium        69.4      18.3 
 8 Afghan… AF     2021 8.58e4     -20.7      5.13  high          67.3      14.7 
 9 Afghan… AF     2013 9.63e4       5.60     7.39  high          77.1      17.1 
10 Afghan… AF     2018 1.07e5       1.19     0.626 low           71.4      19.8 
# ℹ 6,572 more rows

How case_when() works may not be obvious at first glance. Think of it as checking each row against the rules in order:

  1. Test the first rule, inflation > 5. – If it is true, write "high" and stop for that row.
  2. Otherwise, test the second rule, inflation > 2. – If true, write "medium" (we already know the value is ≤ 5).
  3. If neither rule matched, the final catch-all line TRUE ~ "low" fires, so the row gets "low".

Because the last rule is always TRUE, every row receives exactly one of the three labels — high, medium, or low — and the checking stops as soon as the first matching rule is found.

Another handy helper inside mutate() is across(). Use it when the same change, for example, “divide by 100”, should be applied to several columns in one go.

mutate(
  wdi_panel,
  across(gdp_growth:lfp_female, ~ . / 100)
)
# A tibble: 6,582 × 8
   country     iso2c  year     gdp gdp_growth inflation lfp_male lfp_female
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Afghanistan AF     2016 102672.     0.0226   0.0438     0.744     0.202 
 2 Afghanistan AF     2023  82595.     0.0271  -0.0464     0.702     0.0516
 3 Afghanistan AF     2015 100402.     0.0145  -0.00662    0.753     0.191 
 4 Afghanistan AF     2006  49061.     0.0536   0.0678     0.793     0.145 
 5 Afghanistan AF     2017 105389.     0.0265   0.0498     0.734     0.212 
 6 Afghanistan AF     2014  98966.     0.0272   0.0467     0.762     0.181 
 7 Afghanistan AF     2019 110814.     0.0391   0.0230     0.694     0.183 
 8 Afghanistan AF     2021  85768.    -0.207    0.0513     0.673     0.147 
 9 Afghanistan AF     2013  96341.     0.0560   0.0739     0.771     0.171 
10 Afghanistan AF     2018 106643.     0.0119   0.00626    0.714     0.198 
# ℹ 6,572 more rows

What happens, step-by-step:

  1. gdp_growth:lfp_female selects the block of four percentage columns (gdp_growth, inflation, lfp_male, lfp_female).
  2. For every value in each of those columns, the little rule ~ . / 100 is carried out. Read it as “take the value (the dot) and divide it by 100.”
  3. The result, which is now a true proportion like 0.0226 instead of 2.26, is written back into the same columns. All other columns stay untouched.

So across() lets you say “do this to all these columns” once, instead of repeating the same line four times.

5.7 Sorting rows with arrange()

Data often need to be displayed or examined in a particular order rather than the order they were read from disk. The arrange() function reorders rows based on the values in one or more columns while leaving every column unchanged.

Sorting the wdi_panel data by gross domestic product allows us to see which country–years have the largest or smallest economies in the table.

arrange(wdi_panel, gdp)
# A tibble: 6,582 × 8
   country iso2c  year   gdp gdp_growth inflation lfp_male lfp_female
   <chr>   <chr> <int> <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 Tuvalu  TV     1990  26.3      15.4         NA       NA         NA
 2 Tuvalu  TV     1991  27.2       3.61        NA       NA         NA
 3 Tuvalu  TV     1992  28.0       2.79        NA       NA         NA
 4 Tuvalu  TV     1996  28.7      -5.96        NA       NA         NA
 5 Tuvalu  TV     1993  29.1       4.11        NA       NA         NA
 6 Tuvalu  TV     1995  30.5      -5.00        NA       NA         NA
 7 Tuvalu  TV     1997  31.5      10.0         NA       NA         NA
 8 Tuvalu  TV     1994  32.1      10.3         NA       NA         NA
 9 Tuvalu  TV     2005  34.7      -4.65        NA       NA         NA
10 Tuvalu  TV     2006  35.5       2.44        NA       NA         NA
# ℹ 6,572 more rows

The rows now appear from the smallest GDP to the largest.

To reverse the order wrap the column name in desc, which stands for descending.

arrange(wdi_panel, desc(gdp))
# A tibble: 6,582 × 8
   country       iso2c  year       gdp gdp_growth inflation lfp_male lfp_female
   <chr>         <chr> <int>     <dbl>      <dbl>     <dbl>    <dbl>      <dbl>
 1 China         CN     2023 31229733.       5.25     0.235     71.6       59.9
 2 China         CN     2022 29672080.       2.95     1.97      71.6       59.9
 3 China         CN     2021 28821649.       8.45     0.981     72.9       61.2
 4 China         CN     2020 26576354.       2.24     2.42      71.9       59.9
 5 China         CN     2019 25994432.       5.95     2.90      73.4       61.2
 6 United States US     2023 24977118.       2.89     4.12      67.6       56.6
 7 China         CN     2018 24534506.       6.75     2.07      74.0       61.5
 8 United States US     2022 24276131.       2.51     8.00      67.6       56.0
 9 United States US     2021 23681171        6.06     4.70      67.4       55.6
10 China         CN     2017 22983193.       6.95     1.59      74.6       61.9
# ℹ 6,572 more rows

This version lists the country–year with the highest GDP first and moves down to the lowest. Multiple sort keys can be combined by listing additional columns after the first, allowing ties to be broken or secondary ordering to be imposed, but a single arrange() call with or without desc is enough for most quick inspections.

5.8 Reshaping data from wide to long format and back

In data frames, data is often arranged in one of two fundamental ways for specific sets of variables: wide format and long format. For some analyses it is necessary to use a wide format and for other analyses a long format is required. Transforming your data from wide to long, or long to wide, is known as reshaping. Understanding these formats and how to switch between them is an essential data-wrangling skill. The tidyverse, specifically the tidyr package, provides two tools to reshape data: pivot_longer(), which converts wide format to long, and pivot_wider(), which converts long format to wide.

It is easier to understand long and wide formats using examples. Consider the following fictitious data.

neighbourhood crime_2019 crime_2020 crime_2021
Downtown 120 115 110
Riverside 85 90 88
Westfield 60 55 52
Lakeshore 40 38 35

Each row is a neighbourhood and each column after the first records the reported crime rate (crime per 10 000 residents) in a different year. Because the same measure is stored in separate year-specific columns this data frame is in wide form.

By contrast, the next data frame is the same data in long format.

neighbourhood year crime
Downtown 2019 120
Downtown 2020 115
Downtown 2021 110
Riverside 2019 85
Riverside 2020 90
Riverside 2021 88
Westfield 2019 60
Westfield 2020 55
Westfield 2021 52
Lakeshore 2019 40
Lakeshore 2020 38
Lakeshore 2021 35

The long version stacks the three year columns into two new columns: one for the year and one for the crime rate. A wide format is often easier for humans to read and can help with comparisons across columns. A long format is usually preferable or required for many statistical analyses and visualisations, and it is more flexible for adding new data, filtering, and performing calculations. As a rule, when analysing data in R, a long format is often the preferred default.

The data frame wdi_panel has two lfp_ columns in wide format. We can convert them to long format with pivot_longer(), optionally removing a shared prefix:

wdi_panel |>
  pivot_longer(cols = starts_with("lfp_"), names_to = "gender", values_to = "lfp")
# A tibble: 13,164 × 8
   country     iso2c  year     gdp gdp_growth inflation gender       lfp
   <chr>       <chr> <int>   <dbl>      <dbl>     <dbl> <chr>      <dbl>
 1 Afghanistan AF     2016 102672.       2.26     4.38  lfp_male   74.4 
 2 Afghanistan AF     2016 102672.       2.26     4.38  lfp_female 20.2 
 3 Afghanistan AF     2023  82595.       2.71    -4.64  lfp_male   70.2 
 4 Afghanistan AF     2023  82595.       2.71    -4.64  lfp_female  5.16
 5 Afghanistan AF     2015 100402.       1.45    -0.662 lfp_male   75.3 
 6 Afghanistan AF     2015 100402.       1.45    -0.662 lfp_female 19.1 
 7 Afghanistan AF     2006  49061.       5.36     6.78  lfp_male   79.3 
 8 Afghanistan AF     2006  49061.       5.36     6.78  lfp_female 14.5 
 9 Afghanistan AF     2017 105389.       2.65     4.98  lfp_male   73.4 
10 Afghanistan AF     2017 105389.       2.65     4.98  lfp_female 21.2 
# ℹ 13,154 more rows
wdi_panel |>
  pivot_longer(
    cols         = starts_with("lfp_"),
    names_to     = "gender",
    names_prefix = "lfp_",            # strip prefix
    values_to    = "lfp"
  )

The reverse operation, pivot_wider(), transforms long format back to wide:

wdi_longer <- pivot_longer(wdi_panel, cols = starts_with("lfp_"),
                           names_to = "gender", names_prefix = "lfp_", values_to = "lfp")
pivot_wider(wdi_longer, names_from = "gender", values_from = "lfp", names_prefix = "lfp_")

The symmetry is clear: names_to matches names_from, and values_to matches values_from.

5.9 Joining data frames

Analyses often require information that lives in more than one data frame. A join operation combines two data frames by matching rows that share the same identifying columns, sometimes called keys.

Imagine if we have two data frames:

  • data_frame_A: A first data frame could be a list of student names and their student ID numbers.
  • data_frame_B: A second data frame could be a list of classes and the ID numbers of the students taking them.

If you want to find out the names of all the students in all the classes, we need to link up these two data frames. The join operation does exactly that by looking for the common piece of information between the two; in this case, it is the student ID number. It effectively looks at each student ID value in data_frame_A and then finds all the rows in data_frame_B with matching student IDs. It then combines the information from both lists for each matching ID, creating a new and more complete data frame that tells you which student (by name) is taking which class.

Joins combine data frames by matching keys. inner_join() keeps rows present in both tables; left_join() keeps all rows from the left table:

inner_join(wdi_panel, population)                          # auto-detect keys
# A tibble: 3,095 × 9
   country     iso2c  year    gdp gdp_growth inflation lfp_male lfp_female population
   <chr>       <chr> <dbl>  <dbl>      <dbl>     <dbl>    <dbl>      <dbl>      <dbl>
 1 Afghanistan AF     2006 49061.      5.36       6.78     79.3       14.5   25631282
 2 Afghanistan AF     2013 96341.      5.60       7.39     77.1       17.1   30551674
 3 Afghanistan AF     2005 46567.     11.2       12.7      79.5       14.3   24860855
 4 Afghanistan AF     2004 41865.      1.41      NA        79.6       14.2   24018682
 5 Afghanistan AF     2002 37931.     28.6       NA        79.8       14.2   22202806
 6 Afghanistan AF     2003 41282.      8.83      NA        79.7       14.2   23116142
 7 Afghanistan AF     2007 55845.     13.8        8.68     79.2       14.7   26349243
 8 Afghanistan AF     2008 58037.      3.92      26.4      78.9       15.0   27032197
 9 Afghanistan AF     2012 91231.     12.8        6.44     78.0       16.2   29824536
10 Afghanistan AF     2011 80913.      0.426     11.8      78.2       15.9   29105480
# ℹ 3,085 more rows
inner_join(wdi_panel, population, by = c("country", "year"))  # explicit keys
left_join(wdi_panel, population, by = c("country", "year"))   # keep all left rows

Use inner_join() when you need complete cases and are happy to drop observations that lack matches. Use left_join() when you want to preserve all rows from the main table and simply attach whatever extra information is available.

Other join variants exist such as right_join(), full_join(), and anti_join(), but inner_join() and left_join() cover most day-to-day tasks and are all you need at this stage.

5.10 Data-wrangling pipelines

Chapter 2 introduced the pipe operator |> and showed how it passes the result of one expression straight into the next function. Data wrangling is where that left-to-right flow shines. Each of the tidyverse commands we focused on here does one job, so real wrangling work usually means running several commands in sequence. A pipeline writes those steps as a single sentence that the reader can follow from the raw data to the finished table.

Below are three short pipelines that use almost every tool covered so far. They all start with wdi_panel, add information or reshape it, and end with a table that could feed directly into a plot or a model.

Pipeline example 1: GDP per person, ranked
This pipeline builds a ranked list of GDP-per-capita across different country-years.
inner_join(wdi_panel, population, by = c("country", "year")) |>
  mutate(gdp_pc = 1e6 * gdp / population) |>
  select(country, year, gdp_pc) |>
  arrange(desc(gdp_pc))
# A tibble: 3,095 × 3
   country     year  gdp_pc
   <chr>      <dbl>   <dbl>
 1 Luxembourg  2007 139017.
 2 Luxembourg  2008 135467.
 3 Luxembourg  2013 132423.
 4 Luxembourg  2006 131384.
 5 Luxembourg  2010 130498.
 6 Luxembourg  2012 129978.
 7 Luxembourg  2011 129700.
 8 Luxembourg  2009 128236.
 9 Luxembourg  2005 126191.
10 Luxembourg  2004 124852.
# ℹ 3,085 more rows

The steps are as follows:

  1. inner_join(wdi_panel, population, by = c("country", "year")) Keep only those country–year rows that appear in both tables, attaching the population column to the matching rows in wdi_panel.

  2. mutate(gdp_pc = 1e6 * gdp / population) Convert GDP from millions to dollars (gdp * 1e6) and divide by population to create a new column gdp_pc, the per-capita GDP.

  3. select(country, year, gdp_pc) Retain just the identifying columns country and year along with the newly calculated gdp_pc, discarding all other variables.

  4. arrange(desc(gdp_pc)) Sort the resulting table in descending order so that the highest GDP-per-capita observations appear first.

Pipeline example 2: Recent inflation by income group
This pipeline produces a five-row summary that compares average GDP and average inflation across GDP quintiles for recent years.
wdi_panel |>
  filter(year >= 2015) |>
  mutate(gdp_quintile = ntile(gdp, 5)) |>
  summarise(avg_gdp = mean(gdp),
            avg_inflation = mean(inflation, na.rm = TRUE),
            .by = gdp_quintile) |>
  arrange(gdp_quintile)
# A tibble: 5 × 3
  gdp_quintile  avg_gdp avg_inflation
         <int>    <dbl>         <dbl>
1            1    3139.          3.18
2            2   28657.          5.51
3            3   84298.          8.75
4            4  318744.          7.56
5            5 3253163.          5.53

Here is what happens, step by step.

  1. filter(year >= 2015) Keep only rows from 2015 onward. Everything earlier is discarded.

  2. mutate(gdp_quintile = ntile(gdp, 5)) Rank the remaining rows by their GDP value and split them into five equal-sized groups. The poorest fifth gets gdp_quintile = 1, the richest fifth gets 5.

  3. summarise(..., .by = gdp_quintile) For each quintile calculate • avg_gdp: the mean GDP within that group. • avg_inflation: the mean inflation rate, ignoring any missing values. Because of .by = gdp_quintile, the summary is done separately for each quintile, yielding one row per group.

  4. arrange(gdp_quintile) Order the resulting five rows from the lowest GDP quintile to the highest so the output reads naturally from poorer to richer economies.

  • filter to select the last ten years.
  • summarise() to group the rows by World Bank income category. Compute the average inflation rate within each group. List groups from highest to lowest inflation.
Pipeline example 3: Gender gaps in labour participation by region
This pipeline averages the labour force participation rate for males and females, averaged over all countries and years:
wdi_panel |>
  pivot_longer(
    cols          = starts_with("lfp_"),
    names_to      = "gender",
    names_prefix  = "lfp_",
    values_to     = "lfp"
  ) |> summarise(lfp = mean(lfp, na.rm = TRUE),
                 .by = gender)
# A tibble: 2 × 2
  gender   lfp
  <chr>  <dbl>
1 male    72.2
2 female  50.0
  1. pivot_longer(cols = starts_with("lfp_"), names_to = "gender", names_prefix = "lfp_", values_to = "lfp") Select the columns whose names begin with lfp_. Stack those columns into two new columns where the former column name, stripped of the lfp_ prefix, is stored in gender and the numeric value is stored in lfp.

  2. summarise(lfp = mean(lfp, na.rm = TRUE), .by = gender) Group the long data by gender and compute the mean of lfp, ignoring missing values. The result is one row for male and one row for female, each containing the average labour-force participation rate.

These three examples show how a pipe strings together joins, filters, calculations, reshapes, summaries, and sorts. The same pattern scales to longer pipelines; you just add another step on a new line.

5.11 Further reading

  • Hadley Wickham and Garrett Grolemund’s R for Data Science (Wickham & Grolemund, 2023) provides the standard reference for learning the tidyverse, with detailed chapters on data transformation principles and techniques.

  • The official tidyverse documentation at tidyverse.org offers comprehensive function references and examples across the ecosystem of packages including dplyr, tidyr, stringr, and others.

5.12 Chapter summary

  • Data wrangling transforms raw, messy datasets into tidy, analysis-ready format, typically consuming 40-60% of analyst time but making all subsequent statistical work possible.
  • The select() function chooses columns by name, position, or pattern, with helpers like starts_with(), ends_with(), and contains() enabling flexible column selection.
  • The filter() function keeps rows matching specified conditions using comparison operators, logical combinations, and helpers like between() and %in%.
  • The mutate() function creates new variables or transforms existing ones, supporting sequential calculations, conditional logic with if_else() and case_when(), and batch transformations with across().
  • Reshaping operations convert data between wide format (multiple columns per measure) and long format (stacked observations) using pivot_longer() and pivot_wider().
  • Join operations like inner_join() and left_join() combine information from multiple data frames by matching rows on shared key variables.
  • The pipe operator |> chains wrangling steps into readable, reproducible pipelines that flow left-to-right from raw data to finished tables.
  • tidyverse functions are immutable, returning new data frames without modifying originals, requiring assignment with <- to preserve transformations.