library(tidyverse)
library(sgsur)5 Data wrangling
Until now, every dataset in this book has arrived in perfect rows and columns. There were no columns to rename, no rows to drop, and nothing to reshape or merge. They were ready for whatever analyses or plotting we needed to do. Real-world datasets, by contrast, are rarely so accommodating. You might encounter superfluous columns, or need to create new variables. Perhaps you’ll only need a subset of the rows, or have to reshape and restructure the data, or even merge multiple datasets into one. Usually, before any analysis can begin, however simple, we have to reformat, clean, or transform our data.
This activity of cleaning or transforming data is often called data wrangling. The verb to wrangle originally meant “to struggle or argue at length.” Ranchers later used it for herding or taming uncooperative livestock. Modern analysts have borrowed the word because taming data can feel exactly the same: time-consuming, difficult, and occasionally infuriating. Many surveys (for example, Anaconda, 2022) suggest that analysts spend 40–60% of their time on data wrangling; more than on statistical analysis itself. Despite this, the topic is often neglected in statistics textbooks and courses.
In this chapter, you’ll learn about a set of interrelated R commands used to wrangle data. They go by names like select(), filter(), and mutate(), and they’re almost all part of the broad tidyverse family of R tools. Each of these commands is designed to do one thing well while working seamlessly with the others. As such, they can be used to create powerful sequences of data transformations, known as pipelines, that take data from an unruly raw state and transform it into something clean, tidy, and ready for analysis.
We’ll go through many of the major wrangling tools, explaining their purpose and how they work. Then, we’ll provide examples of how they can be combined into pipelines to transform messy raw data into clean, tidy data. Everything we need, whether commands or data, is available in either the tidyverse meta-package or this textbook’s sgsur package.
The chapter covers
- Data wrangling transforms raw, messy datasets into tidy, analysis-ready format, consuming substantial analyst time but enabling all subsequent work.
- Core tidyverse verbs like
select(),filter(),mutate(), andarrange()each perform one transformation task efficiently and combine seamlessly. - Reshaping operations with
pivot_longer()andpivot_wider()convert between wide and long data formats as different analyses require. - Join operations like
inner_join()andleft_join()combine information from multiple related datasets based on shared key variables. - Pipelines created with the native pipe operator link wrangling steps into readable, reproducible sequences that transform data systematically.
- Developing fluency with these tools reduces time spent fighting data and increases time available for meaningful analysis and interpretation.
5.1 Selecting columns with select()
The first wrangling tool we will look at is select(). This is part of the dplyr package, which is part of the tidyverse meta-package, so is loaded already. To explore this command, we will use the wdi_panel dataset that comes with the sgsur package.
Let’s look at this data:
wdi_panel# A tibble: 6,582 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
As you can see, this is a data frame with 6582 rows and 8 columns. It provides some variables that describe the economy of many different countries over recent decades; see help("wdi_panel") or ?wdi_panel for full details about this dataset.
With select(), we can choose, or select, a subset of the columns of any data frame. As we will see, it provides many different ways to do this, but the simplest is just to name the individual columns that we want to choose. For example, if we just want the columns that give the country (country), year (year), its gross domestic product (GDP, gdp), and country code (iso2c), we can do the following:
select(wdi_panel, country, year, gdp)# A tibble: 6,582 × 3
country year gdp
<chr> <int> <dbl>
1 Afghanistan 2016 102672.
2 Afghanistan 2023 82595.
3 Afghanistan 2015 100402.
4 Afghanistan 2006 49061.
5 Afghanistan 2017 105389.
6 Afghanistan 2014 98966.
7 Afghanistan 2019 110814.
8 Afghanistan 2021 85768.
9 Afghanistan 2013 96341.
10 Afghanistan 2018 106643.
# ℹ 6,572 more rows
As you can see, the way select works in this example is very simple: first we give the name of the data frame (wdi_panel), and then give the names of the columns of that data frame that we want to select (country, year, gdp).
We can name as many columns as we wish, and the selected columns appear in the order we name them. For example:
select(wdi_panel, country, iso2c, year, gdp)
select(wdi_panel, year, iso2c, country, gdp)An important point to make about the select() function is that it does not change the original data frame. This is a general property of tidyverse commands, which favour immutable data frames: every command returns a new object instead of altering the original. In other words, while we put wdi_panel into the select() function, a new data frame comes out; wdi_panel itself is unchanged. In order to see the output in the examples used above, we did not use the <- assignment operator to save the new data frame that the function returned, but we can always do this. For example, the following creates a new data frame named wdi_panel_a that just has the country, year, and GDP:
wdi_panel_a <- select(wdi_panel, country, year, gdp)The select() function is far more flexible than simply selecting columns by listing their names. It has several helper tricks and functions — ranges, numeric positions, pattern-matching functions, negation, and others — that let you specify whole groups of variables in one go.
- Selecting a consecutive set of columns
-
If the columns that you want to select occur consecutively, you can select them by just stating the first and last column in the range you want. For example, suppose we need the first five columns (
countrytogdp_growth). We can select them all at once using the colon operator as follows:
select(wdi_panel, country:gdp_growth)# A tibble: 6,582 × 5
country iso2c year gdp gdp_growth
<chr> <chr> <int> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26
2 Afghanistan AF 2023 82595. 2.71
3 Afghanistan AF 2015 100402. 1.45
4 Afghanistan AF 2006 49061. 5.36
5 Afghanistan AF 2017 105389. 2.65
6 Afghanistan AF 2014 98966. 2.72
7 Afghanistan AF 2019 110814. 3.91
8 Afghanistan AF 2021 85768. -20.7
9 Afghanistan AF 2013 96341. 5.60
10 Afghanistan AF 2018 106643. 1.19
# ℹ 6,572 more rows
The colon tells select() to keep all columns from the one named on the left up to the one named on the right, in their original order.
- Selecting by column index
- We can also keep columns purely by their numerical index or position, and we can use the colon operator for consecutive columns:
select(wdi_panel, 1, 3, 4) # columns 1, 3, and 4
select(wdi_panel, 1:4) # columns 1 through 4- Selecting by names and indices at the same time
-
We can select by column names and column indices, or sequences of them using the colon, at the same time. For example, this selects
country,year, and the seventh and eighth columns:
select(wdi_panel, country, year, 7:8)# A tibble: 6,582 × 4
country year lfp_male lfp_female
<chr> <int> <dbl> <dbl>
1 Afghanistan 2016 74.4 20.2
2 Afghanistan 2023 70.2 5.16
3 Afghanistan 2015 75.3 19.1
4 Afghanistan 2006 79.3 14.5
5 Afghanistan 2017 73.4 21.2
6 Afghanistan 2014 76.2 18.1
7 Afghanistan 2019 69.4 18.3
8 Afghanistan 2021 67.3 14.7
9 Afghanistan 2013 77.1 17.1
10 Afghanistan 2018 71.4 19.8
# ℹ 6,572 more rows
- Selecting columns that start with, end with, or contain a pattern
-
The
select()function provides helper functions likestarts_with(),ends_with(), andcontains()for pattern matching:
select(wdi_panel, starts_with("gdp"))# A tibble: 6,582 × 2
gdp gdp_growth
<dbl> <dbl>
1 102672. 2.26
2 82595. 2.71
3 100402. 1.45
4 49061. 5.36
5 105389. 2.65
6 98966. 2.72
7 110814. 3.91
8 85768. -20.7
9 96341. 5.60
10 106643. 1.19
# ℹ 6,572 more rows
select(wdi_panel, ends_with("male"))
select(wdi_panel, contains("inf"))All of these helper functions, starts_with(), ends_with(), contains(), are not sensitive to case, upper or lower case, by default. For example, contains("INF") is the same as contains('inf') by default. If we want it to be case sensitive, we can indicate this by using ignore.case = FALSE inside the helper. For example, here is how we tell it to match the exact case:
# returns only columns containing "INF", of which there are none
select(wdi_panel, contains("INF", ignore.case = FALSE))# A tibble: 6,582 × 0
- Deselecting columns by negation
-
By putting a minus sign in front of any selector, we can drop or deselect columns instead of keeping them. For example, in the following example, we keep everything except
iso2c:
select(wdi_panel, -iso2c)# A tibble: 6,582 × 7
country year gdp gdp_growth inflation lfp_male lfp_female
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
You can read the minus sign as meaning everything except. We can put it in front of every helper function too. For example, -starts_with("lfp_") drops all labour-force participation variables, which all start with lfp_.
- Renaming while selecting
-
When we use
select, we can optionally rename columns by usingnew_name = old_nameinside the call. For example, here we select four columns but rename two as we do so:
select(
wdi_panel,
country,
country_code = iso2c,
year,
gdp,
growth = gdp
)# A tibble: 6,582 × 4
country country_code year growth
<chr> <chr> <int> <dbl>
1 Afghanistan AF 2016 102672.
2 Afghanistan AF 2023 82595.
3 Afghanistan AF 2015 100402.
4 Afghanistan AF 2006 49061.
5 Afghanistan AF 2017 105389.
6 Afghanistan AF 2014 98966.
7 Afghanistan AF 2019 110814.
8 Afghanistan AF 2021 85768.
9 Afghanistan AF 2013 96341.
10 Afghanistan AF 2018 106643.
# ℹ 6,572 more rows
As before however, it is the data frame that comes out that has the new column names; the original wdi_panel remains unchanged.
5.2 Renaming columns with rename()
Our next wrangling command is rename(). We just saw that select() can both pick columns and rename them. As useful as that is, if we just want to rename some variables but not select variables at the same time, we should use rename() instead. This command does only one thing: it changes column names while leaving every column in place. We often use it when we want to keep the current set of variables but the labels are cryptic, too long, or inconsistent and so need to be changed. Here we rename, as we just did using select, iso2c and gdp_growth to country_code and growth, respectively:
rename(wdi_panel,
country_code = iso2c,
growth = gdp_growth
)# A tibble: 6,582 × 8
country country_code year gdp growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
Note that the syntax is new = old, just as we used when renaming while selecting with select.
Remember, rename() returns a new data frame; the original is untouched. To permanently work with renamed variables, assign the result to a name:
wdi_panel_renamed <- rename(wdi_panel, country_code = iso2c, growth = gdp_growth)If we need to rename many or even all variables, typing dozens of pairs can be tedious. The rename_with function lets you pass a function to do renaming, which will be applied to every selected column name.
For example, toupper converts strings to uppercase. You can combine rename_with with selection helpers to target specific columns:
rename_with(wdi_panel, toupper) # all columns# A tibble: 6,582 × 8
COUNTRY ISO2C YEAR GDP GDP_GROWTH INFLATION LFP_MALE LFP_FEMALE
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
rename_with(wdi_panel, toupper, .cols = starts_with("gdp")) # only GDP variables5.3 Re-ordering columns with relocate()
After we have selected the variables we need and renamed them sensibly, a common housekeeping step is to arrange those columns in an order that makes the data frame easier to read. That is what relocate() does: it moves one or more columns to a new position while leaving every other column untouched. As with the other commands, relocate() returns a new data frame; the original remains unchanged unless you assign it to a new name.
By default, relocate() moves columns to the front. You can move multiple columns as a block, or position them relative to other columns with .before or .after:
relocate(wdi_panel, iso2c)# A tibble: 6,582 × 8
iso2c country year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AF Afghanistan 2016 102672. 2.26 4.38 74.4 20.2
2 AF Afghanistan 2023 82595. 2.71 -4.64 70.2 5.16
3 AF Afghanistan 2015 100402. 1.45 -0.662 75.3 19.1
4 AF Afghanistan 2006 49061. 5.36 6.78 79.3 14.5
5 AF Afghanistan 2017 105389. 2.65 4.98 73.4 21.2
6 AF Afghanistan 2014 98966. 2.72 4.67 76.2 18.1
7 AF Afghanistan 2019 110814. 3.91 2.30 69.4 18.3
8 AF Afghanistan 2021 85768. -20.7 5.13 67.3 14.7
9 AF Afghanistan 2013 96341. 5.60 7.39 77.1 17.1
10 AF Afghanistan 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
relocate(wdi_panel, year, iso2c) # multiple to front
relocate(wdi_panel, inflation, .before = gdp) # before gdp
relocate(wdi_panel, starts_with('lfp_'), .after = year) # after year5.4 Selecting rows with slice()
The wrangling commands covered thus far were all applied to columns. There are other commands that apply to rows instead. One of the simplest such functions is slice() whose main purpose is to select rows according to their position. In this sense, slice() is roughly equivalent to select() when it uses column numerical indices.
You can select rows by position, use ranges with :, or negate with -:
slice(wdi_panel, 1:5) # rows 1 to 5# A tibble: 5 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
slice(wdi_panel, 10:20) # rows 10 to 20
slice(wdi_panel, -1) # all except row 1
slice(wdi_panel, -(1:5)) # all except rows 1 to 5Helper functions slice_head() and slice_tail() select from the beginning or end:
slice_head(wdi_panel, n = 5) # first 5 rows
slice_tail(wdi_panel, n = 5) # last 5 rowsThe slice_sample() function randomly samples rows, optionally grouped by a variable:
slice_sample(wdi_panel, n = 100) # 100 random rows
slice_sample(wdi_panel, n = 3, by = country) # 3 rows per country5.5 Keeping (or dropping) rows with filter()
The slice() function and variants are useful for selecting rows but they don’t provide all the functionality we usually need for selecting rows. In particular, we very often need to select rows that match some condition or set of conditions. For example, with data like wdi_panel, we may wish to select only those rows that meet some or all of the conditions: “years since 2010”, “countries in the EU”, “non-missing GDP”. Doing this with slice() would be difficult or impossible. Instead, for tasks like this we use filter(), which is one of the most powerful data wrangling work-horses.
Think “filter in”: filter() keeps rows that match conditions. Use == for equality (not =), and other operators like !=, >, <, >=, <=:
filter(wdi_panel, year == 2020)# A tibble: 199 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2020 1.08e5 -2.35 5.60 66.8 16.5
2 Albania AL 2020 4.16e4 -3.31 1.62 65.3 50.6
3 Algeria DZ 2020 6.25e5 -5.00 2.42 64.3 13.7
4 Andorra AD 2020 4.29e3 -11.2 NA NA NA
5 Angola AO 2020 2.53e5 -5.64 22.3 76.8 72.7
6 Antigua and Barb… AG 2020 2.20e3 -18.9 0.626 NA NA
7 Argentina AR 2020 1.08e6 -9.90 NA 65.9 46.5
8 Armenia AM 2020 4.46e4 -7.20 1.21 68.1 57.9
9 Aruba AW 2020 3.10e3 -26.2 NA NA NA
10 Australia AU 2020 1.46e6 -0.120 0.847 70.2 60.2
# ℹ 189 more rows
filter(wdi_panel, year < 2020)Combine conditions with commas (AND) or | (OR):
filter(wdi_panel, country == 'France', year >= 2020) # AND
filter(wdi_panel, country == 'France', year >= 2020, gdp_growth < 2.0) # multiple AND
filter(wdi_panel, gdp_growth < 0 | inflation > 20) # ORUse between() for ranges and near() for approximate matches:
filter(wdi_panel, year >= 2000, year <= 2009) # manual range
filter(wdi_panel, between(year, 2000, 2009)) # between helper
filter(wdi_panel, near(gdp_growth, 5, 1)) # within ±1.0 of 5Use %in% to match multiple values, and ! to negate conditions:
filter(wdi_panel, country %in% c('France', 'Germany', 'Italy')) # match set
filter(wdi_panel, !between(year, 2000, 2009)) # NOT between
filter(wdi_panel, !(country %in% c('France', 'Germany', 'Italy'))) # NOT in setTo remove missing values, use !is.na() in filter or the dedicated drop_na() function:
filter(wdi_panel, !is.na(gdp)) # filter approach
drop_na(wdi_panel) # all rows with any NA
drop_na(wdi_panel, gdp) # rows with gdp NA
drop_na(wdi_panel, gdp, gdp_growth) # rows with either NA
drop_na(wdi_panel, any_of(starts_with('gdp'))) # with selection helpers5.6 Creating and transforming variables with mutate()
The mutate() command is another powerful wrangling tool. It creates one or more new columns from the data already present, or it re-calculates or transforms an existing column in place.
As a simple example, let’s create a variable that calculates GDP, which is currently in millions of dollars, in billions of dollars instead:
mutate(wdi_panel, gdp_b = gdp / 1000)# A tibble: 6,582 × 9
country iso2c year gdp gdp_growth inflation lfp_male lfp_female gdp_b
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 1.03e5 2.26 4.38 74.4 20.2 103.
2 Afghanistan AF 2023 8.26e4 2.71 -4.64 70.2 5.16 82.6
3 Afghanistan AF 2015 1.00e5 1.45 -0.662 75.3 19.1 100.
4 Afghanistan AF 2006 4.91e4 5.36 6.78 79.3 14.5 49.1
5 Afghanistan AF 2017 1.05e5 2.65 4.98 73.4 21.2 105.
6 Afghanistan AF 2014 9.90e4 2.72 4.67 76.2 18.1 99.0
7 Afghanistan AF 2019 1.11e5 3.91 2.30 69.4 18.3 111.
8 Afghanistan AF 2021 8.58e4 -20.7 5.13 67.3 14.7 85.8
9 Afghanistan AF 2013 9.63e4 5.60 7.39 77.1 17.1 96.3
10 Afghanistan AF 2018 1.07e5 1.19 0.626 71.4 19.8 107.
# ℹ 6,572 more rows
Note how the newly created gdp_b is inserted at the end of the data frame by default. We can optionally use .after or .before to tell mutate() where to insert the new variable. For example, to insert gdp_b directly after gdp, we can do:
mutate(wdi_panel, gdp_b = gdp / 1000, .after = gdp)# A tibble: 6,582 × 9
country iso2c year gdp gdp_b gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 1.03e5 103. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 8.26e4 82.6 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 1.00e5 100. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 4.91e4 49.1 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 1.05e5 105. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 9.90e4 99.0 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 1.11e5 111. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 8.58e4 85.8 -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 9.63e4 96.3 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 1.07e5 107. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
We can use mutate() to perform multiple transformations at once. You can add several columns in the same call, and later names may refer to earlier ones created in the same line. In the following example, we create a GDP-in-billions variable as before, calculate the annual GDP growth as a multiplier, and then use both of these newly created variables to calculate projected GDP:
mutate(wdi_panel,
gdp_b = gdp / 1000,
gdp_growth_mult = 1 + gdp_growth / 100,
gdp_b_proj = gdp_b * gdp_growth_mult,
.after = gdp
)# A tibble: 6,582 × 11
country iso2c year gdp gdp_b gdp_growth_mult gdp_b_proj gdp_growth
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 103. 1.02 105. 2.26
2 Afghanistan AF 2023 82595. 82.6 1.03 84.8 2.71
3 Afghanistan AF 2015 100402. 100. 1.01 102. 1.45
4 Afghanistan AF 2006 49061. 49.1 1.05 51.7 5.36
5 Afghanistan AF 2017 105389. 105. 1.03 108. 2.65
6 Afghanistan AF 2014 98966. 99.0 1.03 102. 2.72
7 Afghanistan AF 2019 110814. 111. 1.04 115. 3.91
8 Afghanistan AF 2021 85768. 85.8 0.793 68.0 -20.7
9 Afghanistan AF 2013 96341. 96.3 1.06 102. 5.60
10 Afghanistan AF 2018 106643. 107. 1.01 108. 1.19
# ℹ 6,572 more rows
# ℹ 3 more variables: inflation <dbl>, lfp_male <dbl>, lfp_female <dbl>
It is common to use mutate() to create discrete or categorical variables. For example, we can create a new variable higher_gdp that indicates if any gdp value is above or below average as follows:
mutate(wdi_panel, higher_gdp = gdp >= mean(gdp), .after = gdp)# A tibble: 6,582 × 9
country iso2c year gdp higher_gdp gdp_growth inflation lfp_male
<chr> <chr> <int> <dbl> <lgl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. FALSE 2.26 4.38 74.4
2 Afghanistan AF 2023 82595. FALSE 2.71 -4.64 70.2
3 Afghanistan AF 2015 100402. FALSE 1.45 -0.662 75.3
4 Afghanistan AF 2006 49061. FALSE 5.36 6.78 79.3
5 Afghanistan AF 2017 105389. FALSE 2.65 4.98 73.4
6 Afghanistan AF 2014 98966. FALSE 2.72 4.67 76.2
7 Afghanistan AF 2019 110814. FALSE 3.91 2.30 69.4
8 Afghanistan AF 2021 85768. FALSE -20.7 5.13 67.3
9 Afghanistan AF 2013 96341. FALSE 5.60 7.39 77.1
10 Afghanistan AF 2018 106643. FALSE 1.19 0.626 71.4
# ℹ 6,572 more rows
# ℹ 1 more variable: lfp_female <dbl>
The newly created higher_gdp is a logical vector, with values TRUE or FALSE, where TRUE means the corresponding gdp is equal to or above average GDP, and FALSE means it is below average. An alternative approach is to use if_else() as follows:
mutate(wdi_panel,
higher_gdp = if_else(gdp >= mean(gdp), "above_avg", "below_avg"),
.after = gdp
)# A tibble: 6,582 × 9
country iso2c year gdp higher_gdp gdp_growth inflation lfp_male
<chr> <chr> <int> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. below_avg 2.26 4.38 74.4
2 Afghanistan AF 2023 82595. below_avg 2.71 -4.64 70.2
3 Afghanistan AF 2015 100402. below_avg 1.45 -0.662 75.3
4 Afghanistan AF 2006 49061. below_avg 5.36 6.78 79.3
5 Afghanistan AF 2017 105389. below_avg 2.65 4.98 73.4
6 Afghanistan AF 2014 98966. below_avg 2.72 4.67 76.2
7 Afghanistan AF 2019 110814. below_avg 3.91 2.30 69.4
8 Afghanistan AF 2021 85768. below_avg -20.7 5.13 67.3
9 Afghanistan AF 2013 96341. below_avg 5.60 7.39 77.1
10 Afghanistan AF 2018 106643. below_avg 1.19 0.626 71.4
# ℹ 6,572 more rows
# ℹ 1 more variable: lfp_female <dbl>
Now, the new higher_gdp variable is a character vector with two possible values: above_avg and below_avg. With the if_else() function, there are always three arguments: the first is some condition to be tested, the second is the value returned if the condition is true, and the third is the value returned if the condition is false.
To create a discrete or categorical variable with more than two possible values, a very useful function is case_when(). For example, we can create a new variable infl_rate to indicate if inflation is “high” (above 5%), “medium” (between 2% and 5%), or “low” (less than 2%) as follows:
mutate(wdi_panel,
infl_rate = case_when(
inflation > 5 ~ 'high',
inflation > 2 ~ 'medium',
TRUE ~ 'low'
),
.after = inflation
)# A tibble: 6,582 × 9
country iso2c year gdp gdp_growth inflation infl_rate lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 Afghan… AF 2016 1.03e5 2.26 4.38 medium 74.4 20.2
2 Afghan… AF 2023 8.26e4 2.71 -4.64 low 70.2 5.16
3 Afghan… AF 2015 1.00e5 1.45 -0.662 low 75.3 19.1
4 Afghan… AF 2006 4.91e4 5.36 6.78 high 79.3 14.5
5 Afghan… AF 2017 1.05e5 2.65 4.98 medium 73.4 21.2
6 Afghan… AF 2014 9.90e4 2.72 4.67 medium 76.2 18.1
7 Afghan… AF 2019 1.11e5 3.91 2.30 medium 69.4 18.3
8 Afghan… AF 2021 8.58e4 -20.7 5.13 high 67.3 14.7
9 Afghan… AF 2013 9.63e4 5.60 7.39 high 77.1 17.1
10 Afghan… AF 2018 1.07e5 1.19 0.626 low 71.4 19.8
# ℹ 6,572 more rows
How case_when() works may not be obvious at first glance. Think of it as checking each row against the rules in order:
- Test the first rule,
inflation > 5. – If it is true, write"high"and stop for that row. - Otherwise, test the second rule,
inflation > 2. – If true, write"medium"(we already know the value is ≤ 5). - If neither rule matched, the final catch-all line
TRUE ~ "low"fires, so the row gets"low".
Because the last rule is always TRUE, every row receives exactly one of the three labels — high, medium, or low — and the checking stops as soon as the first matching rule is found.
Another handy helper inside mutate() is across(). Use it when the same change, for example, “divide by 100”, should be applied to several columns in one go.
mutate(
wdi_panel,
across(gdp_growth:lfp_female, ~ . / 100)
)# A tibble: 6,582 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 0.0226 0.0438 0.744 0.202
2 Afghanistan AF 2023 82595. 0.0271 -0.0464 0.702 0.0516
3 Afghanistan AF 2015 100402. 0.0145 -0.00662 0.753 0.191
4 Afghanistan AF 2006 49061. 0.0536 0.0678 0.793 0.145
5 Afghanistan AF 2017 105389. 0.0265 0.0498 0.734 0.212
6 Afghanistan AF 2014 98966. 0.0272 0.0467 0.762 0.181
7 Afghanistan AF 2019 110814. 0.0391 0.0230 0.694 0.183
8 Afghanistan AF 2021 85768. -0.207 0.0513 0.673 0.147
9 Afghanistan AF 2013 96341. 0.0560 0.0739 0.771 0.171
10 Afghanistan AF 2018 106643. 0.0119 0.00626 0.714 0.198
# ℹ 6,572 more rows
What happens, step-by-step:
gdp_growth:lfp_femaleselects the block of four percentage columns (gdp_growth,inflation,lfp_male,lfp_female).- For every value in each of those columns, the little rule
~ . / 100is carried out. Read it as “take the value (the dot) and divide it by 100.” - The result, which is now a true proportion like 0.0226 instead of 2.26, is written back into the same columns. All other columns stay untouched.
So across() lets you say “do this to all these columns” once, instead of repeating the same line four times.
5.7 Sorting rows with arrange()
Data often need to be displayed or examined in a particular order rather than the order they were read from disk. The arrange() function reorders rows based on the values in one or more columns while leaving every column unchanged.
Sorting the wdi_panel data by gross domestic product allows us to see which country–years have the largest or smallest economies in the table.
arrange(wdi_panel, gdp)# A tibble: 6,582 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Tuvalu TV 1990 26.3 15.4 NA NA NA
2 Tuvalu TV 1991 27.2 3.61 NA NA NA
3 Tuvalu TV 1992 28.0 2.79 NA NA NA
4 Tuvalu TV 1996 28.7 -5.96 NA NA NA
5 Tuvalu TV 1993 29.1 4.11 NA NA NA
6 Tuvalu TV 1995 30.5 -5.00 NA NA NA
7 Tuvalu TV 1997 31.5 10.0 NA NA NA
8 Tuvalu TV 1994 32.1 10.3 NA NA NA
9 Tuvalu TV 2005 34.7 -4.65 NA NA NA
10 Tuvalu TV 2006 35.5 2.44 NA NA NA
# ℹ 6,572 more rows
The rows now appear from the smallest GDP to the largest.
To reverse the order wrap the column name in desc, which stands for descending.
arrange(wdi_panel, desc(gdp))# A tibble: 6,582 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 China CN 2023 31229733. 5.25 0.235 71.6 59.9
2 China CN 2022 29672080. 2.95 1.97 71.6 59.9
3 China CN 2021 28821649. 8.45 0.981 72.9 61.2
4 China CN 2020 26576354. 2.24 2.42 71.9 59.9
5 China CN 2019 25994432. 5.95 2.90 73.4 61.2
6 United States US 2023 24977118. 2.89 4.12 67.6 56.6
7 China CN 2018 24534506. 6.75 2.07 74.0 61.5
8 United States US 2022 24276131. 2.51 8.00 67.6 56.0
9 United States US 2021 23681171 6.06 4.70 67.4 55.6
10 China CN 2017 22983193. 6.95 1.59 74.6 61.9
# ℹ 6,572 more rows
This version lists the country–year with the highest GDP first and moves down to the lowest. Multiple sort keys can be combined by listing additional columns after the first, allowing ties to be broken or secondary ordering to be imposed, but a single arrange() call with or without desc is enough for most quick inspections.
5.8 Reshaping data from wide to long format and back
In data frames, data is often arranged in one of two fundamental ways for specific sets of variables: wide format and long format. For some analyses it is necessary to use a wide format and for other analyses a long format is required. Transforming your data from wide to long, or long to wide, is known as reshaping. Understanding these formats and how to switch between them is an essential data-wrangling skill. The tidyverse, specifically the tidyr package, provides two tools to reshape data: pivot_longer(), which converts wide format to long, and pivot_wider(), which converts long format to wide.
It is easier to understand long and wide formats using examples. Consider the following fictitious data.
| neighbourhood | crime_2019 | crime_2020 | crime_2021 |
|---|---|---|---|
| Downtown | 120 | 115 | 110 |
| Riverside | 85 | 90 | 88 |
| Westfield | 60 | 55 | 52 |
| Lakeshore | 40 | 38 | 35 |
Each row is a neighbourhood and each column after the first records the reported crime rate (crime per 10 000 residents) in a different year. Because the same measure is stored in separate year-specific columns this data frame is in wide form.
By contrast, the next data frame is the same data in long format.
| neighbourhood | year | crime |
|---|---|---|
| Downtown | 2019 | 120 |
| Downtown | 2020 | 115 |
| Downtown | 2021 | 110 |
| Riverside | 2019 | 85 |
| Riverside | 2020 | 90 |
| Riverside | 2021 | 88 |
| Westfield | 2019 | 60 |
| Westfield | 2020 | 55 |
| Westfield | 2021 | 52 |
| Lakeshore | 2019 | 40 |
| Lakeshore | 2020 | 38 |
| Lakeshore | 2021 | 35 |
The long version stacks the three year columns into two new columns: one for the year and one for the crime rate. A wide format is often easier for humans to read and can help with comparisons across columns. A long format is usually preferable or required for many statistical analyses and visualisations, and it is more flexible for adding new data, filtering, and performing calculations. As a rule, when analysing data in R, a long format is often the preferred default.
The data frame wdi_panel has two lfp_ columns in wide format. We can convert them to long format with pivot_longer(), optionally removing a shared prefix:
wdi_panel |>
pivot_longer(cols = starts_with("lfp_"), names_to = "gender", values_to = "lfp")# A tibble: 13,164 × 8
country iso2c year gdp gdp_growth inflation gender lfp
<chr> <chr> <int> <dbl> <dbl> <dbl> <chr> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 lfp_male 74.4
2 Afghanistan AF 2016 102672. 2.26 4.38 lfp_female 20.2
3 Afghanistan AF 2023 82595. 2.71 -4.64 lfp_male 70.2
4 Afghanistan AF 2023 82595. 2.71 -4.64 lfp_female 5.16
5 Afghanistan AF 2015 100402. 1.45 -0.662 lfp_male 75.3
6 Afghanistan AF 2015 100402. 1.45 -0.662 lfp_female 19.1
7 Afghanistan AF 2006 49061. 5.36 6.78 lfp_male 79.3
8 Afghanistan AF 2006 49061. 5.36 6.78 lfp_female 14.5
9 Afghanistan AF 2017 105389. 2.65 4.98 lfp_male 73.4
10 Afghanistan AF 2017 105389. 2.65 4.98 lfp_female 21.2
# ℹ 13,154 more rows
wdi_panel |>
pivot_longer(
cols = starts_with("lfp_"),
names_to = "gender",
names_prefix = "lfp_", # strip prefix
values_to = "lfp"
)The reverse operation, pivot_wider(), transforms long format back to wide:
wdi_longer <- pivot_longer(wdi_panel, cols = starts_with("lfp_"),
names_to = "gender", names_prefix = "lfp_", values_to = "lfp")
pivot_wider(wdi_longer, names_from = "gender", values_from = "lfp", names_prefix = "lfp_")The symmetry is clear: names_to matches names_from, and values_to matches values_from.
5.9 Joining data frames
Analyses often require information that lives in more than one data frame. A join operation combines two data frames by matching rows that share the same identifying columns, sometimes called keys.
Imagine if we have two data frames:
data_frame_A: A first data frame could be a list of student names and their student ID numbers.data_frame_B: A second data frame could be a list of classes and the ID numbers of the students taking them.
If you want to find out the names of all the students in all the classes, we need to link up these two data frames. The join operation does exactly that by looking for the common piece of information between the two; in this case, it is the student ID number. It effectively looks at each student ID value in data_frame_A and then finds all the rows in data_frame_B with matching student IDs. It then combines the information from both lists for each matching ID, creating a new and more complete data frame that tells you which student (by name) is taking which class.
Joins combine data frames by matching keys. inner_join() keeps rows present in both tables; left_join() keeps all rows from the left table:
inner_join(wdi_panel, population) # auto-detect keys# A tibble: 3,095 × 9
country iso2c year gdp gdp_growth inflation lfp_male lfp_female population
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5 25631282
2 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1 30551674
3 Afghanistan AF 2005 46567. 11.2 12.7 79.5 14.3 24860855
4 Afghanistan AF 2004 41865. 1.41 NA 79.6 14.2 24018682
5 Afghanistan AF 2002 37931. 28.6 NA 79.8 14.2 22202806
6 Afghanistan AF 2003 41282. 8.83 NA 79.7 14.2 23116142
7 Afghanistan AF 2007 55845. 13.8 8.68 79.2 14.7 26349243
8 Afghanistan AF 2008 58037. 3.92 26.4 78.9 15.0 27032197
9 Afghanistan AF 2012 91231. 12.8 6.44 78.0 16.2 29824536
10 Afghanistan AF 2011 80913. 0.426 11.8 78.2 15.9 29105480
# ℹ 3,085 more rows
inner_join(wdi_panel, population, by = c("country", "year")) # explicit keys
left_join(wdi_panel, population, by = c("country", "year")) # keep all left rowsUse inner_join() when you need complete cases and are happy to drop observations that lack matches. Use left_join() when you want to preserve all rows from the main table and simply attach whatever extra information is available.
Other join variants exist such as right_join(), full_join(), and anti_join(), but inner_join() and left_join() cover most day-to-day tasks and are all you need at this stage.
5.10 Data-wrangling pipelines
Chapter 2 introduced the pipe operator |> and showed how it passes the result of one expression straight into the next function. Data wrangling is where that left-to-right flow shines. Each of the tidyverse commands we focused on here does one job, so real wrangling work usually means running several commands in sequence. A pipeline writes those steps as a single sentence that the reader can follow from the raw data to the finished table.
Below are three short pipelines that use almost every tool covered so far. They all start with wdi_panel, add information or reshape it, and end with a table that could feed directly into a plot or a model.
- Pipeline example 1: GDP per person, ranked
- This pipeline builds a ranked list of GDP-per-capita across different country-years.
inner_join(wdi_panel, population, by = c("country", "year")) |>
mutate(gdp_pc = 1e6 * gdp / population) |>
select(country, year, gdp_pc) |>
arrange(desc(gdp_pc))# A tibble: 3,095 × 3
country year gdp_pc
<chr> <dbl> <dbl>
1 Luxembourg 2007 139017.
2 Luxembourg 2008 135467.
3 Luxembourg 2013 132423.
4 Luxembourg 2006 131384.
5 Luxembourg 2010 130498.
6 Luxembourg 2012 129978.
7 Luxembourg 2011 129700.
8 Luxembourg 2009 128236.
9 Luxembourg 2005 126191.
10 Luxembourg 2004 124852.
# ℹ 3,085 more rows
The steps are as follows:
inner_join(wdi_panel, population, by = c("country", "year"))Keep only those country–year rows that appear in both tables, attaching thepopulationcolumn to the matching rows inwdi_panel.mutate(gdp_pc = 1e6 * gdp / population)Convert GDP from millions to dollars (gdp * 1e6) and divide bypopulationto create a new columngdp_pc, the per-capita GDP.select(country, year, gdp_pc)Retain just the identifying columnscountryandyearalong with the newly calculatedgdp_pc, discarding all other variables.arrange(desc(gdp_pc))Sort the resulting table in descending order so that the highest GDP-per-capita observations appear first.
- Pipeline example 2: Recent inflation by income group
- This pipeline produces a five-row summary that compares average GDP and average inflation across GDP quintiles for recent years.
wdi_panel |>
filter(year >= 2015) |>
mutate(gdp_quintile = ntile(gdp, 5)) |>
summarise(avg_gdp = mean(gdp),
avg_inflation = mean(inflation, na.rm = TRUE),
.by = gdp_quintile) |>
arrange(gdp_quintile)# A tibble: 5 × 3
gdp_quintile avg_gdp avg_inflation
<int> <dbl> <dbl>
1 1 3139. 3.18
2 2 28657. 5.51
3 3 84298. 8.75
4 4 318744. 7.56
5 5 3253163. 5.53
Here is what happens, step by step.
filter(year >= 2015)Keep only rows from 2015 onward. Everything earlier is discarded.mutate(gdp_quintile = ntile(gdp, 5))Rank the remaining rows by their GDP value and split them into five equal-sized groups. The poorest fifth getsgdp_quintile = 1, the richest fifth gets5.summarise(..., .by = gdp_quintile)For each quintile calculate •avg_gdp: the mean GDP within that group. •avg_inflation: the mean inflation rate, ignoring any missing values. Because of.by = gdp_quintile, the summary is done separately for each quintile, yielding one row per group.arrange(gdp_quintile)Order the resulting five rows from the lowest GDP quintile to the highest so the output reads naturally from poorer to richer economies.
filterto select the last ten years.summarise()to group the rows by World Bank income category. Compute the average inflation rate within each group. List groups from highest to lowest inflation.
- Pipeline example 3: Gender gaps in labour participation by region
- This pipeline averages the labour force participation rate for males and females, averaged over all countries and years:
wdi_panel |>
pivot_longer(
cols = starts_with("lfp_"),
names_to = "gender",
names_prefix = "lfp_",
values_to = "lfp"
) |> summarise(lfp = mean(lfp, na.rm = TRUE),
.by = gender)# A tibble: 2 × 2
gender lfp
<chr> <dbl>
1 male 72.2
2 female 50.0
pivot_longer(cols = starts_with("lfp_"), names_to = "gender", names_prefix = "lfp_", values_to = "lfp")Select the columns whose names begin withlfp_. Stack those columns into two new columns where the former column name, stripped of thelfp_prefix, is stored ingenderand the numeric value is stored inlfp.summarise(lfp = mean(lfp, na.rm = TRUE), .by = gender)Group the long data bygenderand compute the mean oflfp, ignoring missing values. The result is one row formaleand one row forfemale, each containing the average labour-force participation rate.
These three examples show how a pipe strings together joins, filters, calculations, reshapes, summaries, and sorts. The same pattern scales to longer pipelines; you just add another step on a new line.
5.11 Further reading
Hadley Wickham and Garrett Grolemund’s R for Data Science (Wickham & Grolemund, 2023) provides the standard reference for learning the
tidyverse, with detailed chapters on data transformation principles and techniques.The official
tidyversedocumentation at tidyverse.org offers comprehensive function references and examples across the ecosystem of packages includingdplyr,tidyr,stringr, and others.
5.12 Chapter summary
- Data wrangling transforms raw, messy datasets into tidy, analysis-ready format, typically consuming 40-60% of analyst time but making all subsequent statistical work possible.
- The
select()function chooses columns by name, position, or pattern, with helpers likestarts_with(),ends_with(), andcontains()enabling flexible column selection. - The
filter()function keeps rows matching specified conditions using comparison operators, logical combinations, and helpers likebetween()and%in%. - The
mutate()function creates new variables or transforms existing ones, supporting sequential calculations, conditional logic withif_else()andcase_when(), and batch transformations withacross(). - Reshaping operations convert data between wide format (multiple columns per measure) and long format (stacked observations) using
pivot_longer()andpivot_wider(). - Join operations like
inner_join()andleft_join()combine information from multiple data frames by matching rows on shared key variables. - The pipe operator
|>chains wrangling steps into readable, reproducible pipelines that flow left-to-right from raw data to finished tables. tidyversefunctions are immutable, returning new data frames without modifying originals, requiring assignment with<-to preserve transformations.