library(tidyverse)
library(sgsur)5 Data wrangling
Until now, every dataset in this book has arrived in perfect rows and columns. There were no columns to rename, no rows to drop, and nothing to reshape or merge. They were ready for whatever analyses or plotting we needed to do. Real-world datasets, by contrast, are rarely so accommodating. You might encounter superfluous columns, or need to create new variables. Perhaps you’ll only need a subset of the rows, or have to reshape and restructure the data, or even merge multiple datasets into one. Usually, before any analysis can begin, however simple, we have to reformat, clean, or transform our data.
This activity of cleaning or transforming data is often called data wrangling. The verb to wrangle originally meant “to struggle or argue at length.” Ranchers later used it for herding or taming uncooperative livestock. Modern analysts have borrowed the word because taming data can feel exactly the same: time-consuming, difficult, and occasionally infuriating. Many surveys (for example, Anaconda, 2022) suggest that analysts spend 40–60% of their time on data wrangling; more than on statistical analysis itself. Despite this, the topic is often neglected in statistics textbooks and courses.
In this chapter, you’ll learn about a set of interrelated R commands used to wrangle data. They go by names like select, filter, and mutate, and they’re almost all part of the broad tidyverse family of R tools. Each of these commands is designed to do one thing well while working seamlessly with the others. As such, they can be used to create powerful sequences of data transformations, known as pipelines, that take data from an unruly raw state and transform it into something clean, tidy, and ready for analysis.
We’ll go through many of the major wrangling tools, explaining their purpose and how they work. Then, we’ll provide examples of how they can be combined into pipelines to transform messy raw data into clean, tidy data. Everything we need, whether commands or data, is available in either the tidyverse meta-package or this textbook’s sgsur package.
5.1 This chapter covers…
What data wrangling means.
Why transforming raw data into tidy, analysis-ready form is an essential step before any analysis.Core tidyverse verbs.
Usingselect(),rename(),relocate(),slice(),filter(),mutate(), andarrange()to reshape and transform data.Reshaping data.
Moving between wide and long formats withpivot_longer()andpivot_wider().Joining datasets.
Combining related information withinner_join()andleft_join().Building pipelines.
Linking steps with the native pipe|>to create clean, reproducible workflows.Developing tidy data skills.
Turning messy inputs into reliable datasets ready for modelling and visualisation.
5.2 Selecting columns with select
The first wrangling tool we will look at is select. This is part of the dplyr package, which is part of the tidyverse meta-package, so is loaded already. To explore this command, we will use the wdi_panel dataset that comes with the sgsur package.
Let’s look at this data:
wdi_panel# A tibble: 6,582 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
As you can see, this is a data frame with 6582 rows and 8 columns. It provides some variables that describe the economy of many different countries over recent decades; see help("wdi_panel") or ?wdi_panel for full details about this dataset.
With select, we can choose, or select, a subset of the columns of any data frame. As we will see, it provides many different ways to do this, but the simplest is just to name the individual columns that we want to choose. For example, if we just want the columns that give the country (country), year (year), its gross domestic product (GDP, gdp), and country code (iso2c), we can do the following:
select(wdi_panel, country, year, gdp)# A tibble: 6,582 × 3
country year gdp
<chr> <int> <dbl>
1 Afghanistan 2016 102672.
2 Afghanistan 2023 82595.
3 Afghanistan 2015 100402.
4 Afghanistan 2006 49061.
5 Afghanistan 2017 105389.
6 Afghanistan 2014 98966.
7 Afghanistan 2019 110814.
8 Afghanistan 2021 85768.
9 Afghanistan 2013 96341.
10 Afghanistan 2018 106643.
# ℹ 6,572 more rows
As you can see, the way select works in this example is very simple: first we give the name of the data frame (wdi_panel), and then give the names of the columns of that data frame that we want to select (country, year, gdp).
We can name as many columns as we wish. For example, if we wanted the country code (iso2c) as well, we could do the following:
select(wdi_panel, country, iso2c, year, gdp)# A tibble: 6,582 × 4
country iso2c year gdp
<chr> <chr> <int> <dbl>
1 Afghanistan AF 2016 102672.
2 Afghanistan AF 2023 82595.
3 Afghanistan AF 2015 100402.
4 Afghanistan AF 2006 49061.
5 Afghanistan AF 2017 105389.
6 Afghanistan AF 2014 98966.
7 Afghanistan AF 2019 110814.
8 Afghanistan AF 2021 85768.
9 Afghanistan AF 2013 96341.
10 Afghanistan AF 2018 106643.
# ℹ 6,572 more rows
The selected columns appear in the order we name them. For example, here are the same four selected columns but in a different order:
select(wdi_panel, year, iso2c, country, gdp)# A tibble: 6,582 × 4
year iso2c country gdp
<int> <chr> <chr> <dbl>
1 2016 AF Afghanistan 102672.
2 2023 AF Afghanistan 82595.
3 2015 AF Afghanistan 100402.
4 2006 AF Afghanistan 49061.
5 2017 AF Afghanistan 105389.
6 2014 AF Afghanistan 98966.
7 2019 AF Afghanistan 110814.
8 2021 AF Afghanistan 85768.
9 2013 AF Afghanistan 96341.
10 2018 AF Afghanistan 106643.
# ℹ 6,572 more rows
An important point to make about the select function is that it does not change the original data frame. This is a general property of tidyverse commands, which favour immutable data frames: every command returns a new object instead of altering the original. In other words, while we put wdi_panel into the select function, a new data frame comes out; wdi_panel itself is unchanged. In order to see the output in the examples used above, we did not use the <- assignment operator to save the new data frame that the function returned, but we can always do this. For example, the following creates a new data frame named wdi_panel_a that just has the country, year, and GDP:
wdi_panel_a <- select(wdi_panel, country, year, gdp)The select function is far more flexible than simply selecting columns by listing their names. It has several helper tricks and functions — ranges, numeric positions, pattern-matching functions, negation, and others — that let you specify whole groups of variables in one go.
- Selecting a consecutive set of columns
-
If the columns that you want to select occur consecutively, you can select them by just stating the first and last column in the range you want. For example, suppose we need the first five columns (
countrytogdp_growth). We can select them all at once using the colon operator as follows:
select(wdi_panel, country:gdp_growth)# A tibble: 6,582 × 5
country iso2c year gdp gdp_growth
<chr> <chr> <int> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26
2 Afghanistan AF 2023 82595. 2.71
3 Afghanistan AF 2015 100402. 1.45
4 Afghanistan AF 2006 49061. 5.36
5 Afghanistan AF 2017 105389. 2.65
6 Afghanistan AF 2014 98966. 2.72
7 Afghanistan AF 2019 110814. 3.91
8 Afghanistan AF 2021 85768. -20.7
9 Afghanistan AF 2013 96341. 5.60
10 Afghanistan AF 2018 106643. 1.19
# ℹ 6,572 more rows
The colon tells select() to keep all columns from the one named on the left up to the one named on the right, in their original order.
- Selecting by column index
- We can also keep columns purely by their numerical index or position. Sometimes this is helpful as you might not yet know the names or the names vary across files: For example, here we keep columns 1, 3, and 4:
select(wdi_panel, 1, 3, 4)# A tibble: 6,582 × 3
country year gdp
<chr> <int> <dbl>
1 Afghanistan 2016 102672.
2 Afghanistan 2023 82595.
3 Afghanistan 2015 100402.
4 Afghanistan 2006 49061.
5 Afghanistan 2017 105389.
6 Afghanistan 2014 98966.
7 Afghanistan 2019 110814.
8 Afghanistan 2021 85768.
9 Afghanistan 2013 96341.
10 Afghanistan 2018 106643.
# ℹ 6,572 more rows
Just like with column names, we can select a sequence of consecutive columns with the colon operator. For example, here we select the first to the fourth column with 1:4:
select(wdi_panel, 1:4)# A tibble: 6,582 × 4
country iso2c year gdp
<chr> <chr> <int> <dbl>
1 Afghanistan AF 2016 102672.
2 Afghanistan AF 2023 82595.
3 Afghanistan AF 2015 100402.
4 Afghanistan AF 2006 49061.
5 Afghanistan AF 2017 105389.
6 Afghanistan AF 2014 98966.
7 Afghanistan AF 2019 110814.
8 Afghanistan AF 2021 85768.
9 Afghanistan AF 2013 96341.
10 Afghanistan AF 2018 106643.
# ℹ 6,572 more rows
- Selecting by names and indices at the same time
-
We can select by column names and column indices, or sequences of them using the colon, at the same time. For example, this selects
country,year, and the seventh and eighth columns:
select(wdi_panel, country, year, 7:8)# A tibble: 6,582 × 4
country year lfp_male lfp_female
<chr> <int> <dbl> <dbl>
1 Afghanistan 2016 74.4 20.2
2 Afghanistan 2023 70.2 5.16
3 Afghanistan 2015 75.3 19.1
4 Afghanistan 2006 79.3 14.5
5 Afghanistan 2017 73.4 21.2
6 Afghanistan 2014 76.2 18.1
7 Afghanistan 2019 69.4 18.3
8 Afghanistan 2021 67.3 14.7
9 Afghanistan 2013 77.1 17.1
10 Afghanistan 2018 71.4 19.8
# ℹ 6,572 more rows
- Selecting columns that start with, end with, or contain a pattern
-
The
selectfunction provides helper functions likestarts_with,ends_with,contains, andmatchesthat allow us to select columns on the basis of specific patterns or sequences of characters in their names. For example, in thewdi_paneldata, the GDP-related variables begin with"gdp". To grab them in one go, we can use thestarts_withhelper function as follows:
select(wdi_panel, starts_with("gdp"))# A tibble: 6,582 × 2
gdp gdp_growth
<dbl> <dbl>
1 102672. 2.26
2 82595. 2.71
3 100402. 1.45
4 49061. 5.36
5 105389. 2.65
6 98966. 2.72
7 110814. 3.91
8 85768. -20.7
9 96341. 5.60
10 106643. 1.19
# ℹ 6,572 more rows
We can use ends_with in a similar way:
select(wdi_panel, ends_with("male"))# A tibble: 6,582 × 2
lfp_male lfp_female
<dbl> <dbl>
1 74.4 20.2
2 70.2 5.16
3 75.3 19.1
4 79.3 14.5
5 73.4 21.2
6 76.2 18.1
7 69.4 18.3
8 67.3 14.7
9 77.1 17.1
10 71.4 19.8
# ℹ 6,572 more rows
We can use contains to select any column whose name includes a pattern or sequence of characters anywhere, not necessarily at the beginning or end. For example, to fetch any column whose name includes "inf" anywhere, we can do:
select(wdi_panel, contains("inf"))# A tibble: 6,582 × 1
inflation
<dbl>
1 4.38
2 -4.64
3 -0.662
4 6.78
5 4.98
6 4.67
7 2.30
8 5.13
9 7.39
10 0.626
# ℹ 6,572 more rows
All of these helper functions, starts_with, ends_with, contains, are not sensitive to case, upper or lower case, by default. For example, contains("INF") is the same as contains('inf') by default. If we want it to be case sensitive, we can indicate this by using ignore.case = FALSE inside the helper. For example, here is how we tell it to match the exact case:
# returns only columns containing "INF", of which there are none
select(wdi_panel, contains("INF", ignore.case = FALSE)) # A tibble: 6,582 × 0
- Deselecting columns by negation
-
By putting a minus sign in front of any selector, we can drop or deselect columns instead of keeping them. For example, in the following example, we keep everything except
iso2c:
select(wdi_panel, -iso2c)# A tibble: 6,582 × 7
country year gdp gdp_growth inflation lfp_male lfp_female
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
You can read the minus sign as meaning everything except. We can put it in front of every helper function too. For example, -starts_with("lfp_") drops all labour-force participation variables, which all start with lfp_.
- Renaming while selecting
-
When we use
select, we can optionally rename columns by usingnew_name = old_nameinside the call. For example, here we select four columns but rename two as we do so:
select(
wdi_panel,
country,
country_code = iso2c,
year,
gdp,
growth = gdp
)# A tibble: 6,582 × 4
country country_code year growth
<chr> <chr> <int> <dbl>
1 Afghanistan AF 2016 102672.
2 Afghanistan AF 2023 82595.
3 Afghanistan AF 2015 100402.
4 Afghanistan AF 2006 49061.
5 Afghanistan AF 2017 105389.
6 Afghanistan AF 2014 98966.
7 Afghanistan AF 2019 110814.
8 Afghanistan AF 2021 85768.
9 Afghanistan AF 2013 96341.
10 Afghanistan AF 2018 106643.
# ℹ 6,572 more rows
As before however, it is the data frame that comes out that has the new column names; the original wdi_panel remains unchanged.
5.3 Renaming columns with rename
Our next wrangling command is rename(). We just saw that select() can both pick columns and rename them. As useful as that is, if we just want to rename some variables but not select variables at the same time, we should use rename instead. This command does only one thing: it changes column names while leaving every column in place. We often use it when we want to keep the current set of variables but the labels are cryptic, too long, or inconsistent and so need to be changed. Here we rename, as we just did using select, iso2c and gdp_growth to country_code and growth, respectively:
rename(wdi_panel,
country_code = iso2c,
growth = gdp_growth
)# A tibble: 6,582 × 8
country country_code year gdp growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
Note that the syntax is new = old, just as we used when renaming while selecting with select.
Remember, as with the other commands we are looking at, the rename command returns a new data frame in which the specified variables have fresh names, while every other column is unchanged and unreordered. The original data frame is untouched. So if we want to permanently work with the renamed variables, we need to assign the new data frame returned by rename to a new name, as in this example:
wdi_panel_renamed <- rename(
wdi_panel,
country_code = iso2c,
growth = gdp_growth
)Now, the data frame wdi_panel_renamed is the data frame with the new names, while wdi_panel is the original data frame, and both of these can exist at the same time.
If we need to rename many or even all variables, typing dozens of pairs can be tedious. The rename_with function lets you pass a function to do renaming, which will be applied to every selected column name.
For example, the function toupper turns any string given as input into uppercase, and so if we do
rename_with(wdi_panel, toupper)# A tibble: 6,582 × 8
COUNTRY ISO2C YEAR GDP GDP_GROWTH INFLATION LFP_MALE LFP_FEMALE
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
we will make all the column names uppercase.
You can combine rename_with with a selection helper, such as those used above, so the function applies to only certain columns. Below we convert every GDP-related variable (those that start with "gdp") to upper-case, leaving others alone:
rename_with(
wdi_panel,
toupper,
.cols = starts_with("gdp")
)# A tibble: 6,582 × 8
country iso2c year GDP GDP_GROWTH inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
5.4 Re-ordering columns with relocate
After we have selected the variables we need and renamed them sensibly, a common housekeeping step is to arrange those columns in an order that makes the data frame easier to read. That is what relocate does: it moves one or more columns to a new position while leaving every other column untouched. As with the other commands, relocate returns a new data frame; the original remains unchanged unless you assign it to a new name.
To see relocate in action, suppose we want iso2c to be the very first column. We do so as follows:
relocate(wdi_panel, iso2c)# A tibble: 6,582 × 8
iso2c country year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AF Afghanistan 2016 102672. 2.26 4.38 74.4 20.2
2 AF Afghanistan 2023 82595. 2.71 -4.64 70.2 5.16
3 AF Afghanistan 2015 100402. 1.45 -0.662 75.3 19.1
4 AF Afghanistan 2006 49061. 5.36 6.78 79.3 14.5
5 AF Afghanistan 2017 105389. 2.65 4.98 73.4 21.2
6 AF Afghanistan 2014 98966. 2.72 4.67 76.2 18.1
7 AF Afghanistan 2019 110814. 3.91 2.30 69.4 18.3
8 AF Afghanistan 2021 85768. -20.7 5.13 67.3 14.7
9 AF Afghanistan 2013 96341. 5.60 7.39 77.1 17.1
10 AF Afghanistan 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
By default, relocate() puts the chosen column(s) before all others in the data frame.
We can move several columns as a block. Here we place year and iso2c ahead of everything else:
relocate(wdi_panel, year, iso2c)# A tibble: 6,582 × 8
year iso2c country gdp gdp_growth inflation lfp_male lfp_female
<int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2016 AF Afghanistan 102672. 2.26 4.38 74.4 20.2
2 2023 AF Afghanistan 82595. 2.71 -4.64 70.2 5.16
3 2015 AF Afghanistan 100402. 1.45 -0.662 75.3 19.1
4 2006 AF Afghanistan 49061. 5.36 6.78 79.3 14.5
5 2017 AF Afghanistan 105389. 2.65 4.98 73.4 21.2
6 2014 AF Afghanistan 98966. 2.72 4.67 76.2 18.1
7 2019 AF Afghanistan 110814. 3.91 2.30 69.4 18.3
8 2021 AF Afghanistan 85768. -20.7 5.13 67.3 14.7
9 2013 AF Afghanistan 96341. 5.60 7.39 77.1 17.1
10 2018 AF Afghanistan 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
Note how the two columns stay in the order in which we list them.
The relocate function will do more than just pull columns to the very front. To position columns somewhere other than the very front, we supply a target column with .before or .after arguments. For example, to put inflation right before gdp, we can do the following:
relocate(wdi_panel, inflation, .before = gdp)# A tibble: 6,582 × 8
country iso2c year inflation gdp gdp_growth lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 4.38 102672. 2.26 74.4 20.2
2 Afghanistan AF 2023 -4.64 82595. 2.71 70.2 5.16
3 Afghanistan AF 2015 -0.662 100402. 1.45 75.3 19.1
4 Afghanistan AF 2006 6.78 49061. 5.36 79.3 14.5
5 Afghanistan AF 2017 4.98 105389. 2.65 73.4 21.2
6 Afghanistan AF 2014 4.67 98966. 2.72 76.2 18.1
7 Afghanistan AF 2019 2.30 110814. 3.91 69.4 18.3
8 Afghanistan AF 2021 5.13 85768. -20.7 67.3 14.7
9 Afghanistan AF 2013 7.39 96341. 5.60 77.1 17.1
10 Afghanistan AF 2018 0.626 106643. 1.19 71.4 19.8
# ℹ 6,572 more rows
As another example, if we want to send all labour force participation columns, which start with lfp_, to occur just after year, we can do:
relocate(wdi_panel, starts_with('lfp_'), .after = year)# A tibble: 6,582 × 8
country iso2c year lfp_male lfp_female gdp gdp_growth inflation
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 74.4 20.2 102672. 2.26 4.38
2 Afghanistan AF 2023 70.2 5.16 82595. 2.71 -4.64
3 Afghanistan AF 2015 75.3 19.1 100402. 1.45 -0.662
4 Afghanistan AF 2006 79.3 14.5 49061. 5.36 6.78
5 Afghanistan AF 2017 73.4 21.2 105389. 2.65 4.98
6 Afghanistan AF 2014 76.2 18.1 98966. 2.72 4.67
7 Afghanistan AF 2019 69.4 18.3 110814. 3.91 2.30
8 Afghanistan AF 2021 67.3 14.7 85768. -20.7 5.13
9 Afghanistan AF 2013 77.1 17.1 96341. 5.60 7.39
10 Afghanistan AF 2018 71.4 19.8 106643. 1.19 0.626
# ℹ 6,572 more rows
5.5 Selecting rows with slice
The wrangling commands covered thus far were all applied to columns. There are other commands that apply to rows instead. One of the simplest such functions is slice whose main purpose is to select rows according to their position. In this sense, slice is roughly equivalent to select when it uses column numerical indices.
If we wanted to select the first five rows, we could use slice as follows:
slice(wdi_panel, 1:5)# A tibble: 5 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
Similarly, if we wanted rows 10 to 20 inclusive, we could do:
slice(wdi_panel, 10:20)# A tibble: 11 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
2 Afghanistan AF 2005 46567. 11.2 12.7 79.5 14.3
3 Afghanistan AF 2004 41865. 1.41 NA 79.6 14.2
4 Afghanistan AF 2002 37931. 28.6 NA 79.8 14.2
5 Afghanistan AF 2003 41282. 8.83 NA 79.7 14.2
6 Afghanistan AF 2022 80416. -6.24 13.7 70.2 5.16
7 Afghanistan AF 2007 55845. 13.8 8.68 79.2 14.7
8 Afghanistan AF 2008 58037. 3.92 26.4 78.9 15.0
9 Afghanistan AF 2020 108209. -2.35 5.60 66.8 16.5
10 Afghanistan AF 2012 91231. 12.8 6.44 78.0 16.2
11 Afghanistan AF 2011 80913. 0.426 11.8 78.2 15.9
Note how we use the colon operator here just like we did with select when we selected consecutive columns by their position (e.g., select(wdi_panel, 1:4)).
Also, as we saw when using select, we can use the minus sign to deselect rows. For example, the following selects all rows except the first row:
slice(wdi_panel, -1)# A tibble: 6,581 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
2 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
3 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
4 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
5 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
6 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
7 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
8 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
9 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
10 Afghanistan AF 2005 46567. 11.2 12.7 79.5 14.3
# ℹ 6,571 more rows
Likewise, we could negate the set of numbers selected with the colon. For example, the following selects all rows except the first five rows:
slice(wdi_panel, -(1:5))# A tibble: 6,577 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
2 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
3 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
4 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
5 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
6 Afghanistan AF 2005 46567. 11.2 12.7 79.5 14.3
7 Afghanistan AF 2004 41865. 1.41 NA 79.6 14.2
8 Afghanistan AF 2002 37931. 28.6 NA 79.8 14.2
9 Afghanistan AF 2003 41282. 8.83 NA 79.7 14.2
10 Afghanistan AF 2022 80416. -6.24 13.7 70.2 5.16
# ℹ 6,567 more rows
There are some helper functions to make some slicing operations easier. For example, we can select the first n rows as follows:
# same as slice(wdi_panel, 1:5)
slice_head(wdi_panel, n = 5)# A tibble: 5 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
There is a counterpart to slice_head that can be used to obtain rows from the end of the data frame. For example, if we wanted the last 5 rows, we can do the following:
slice_tail(wdi_panel, n = 5)# A tibble: 5 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Zimbabwe ZW 2019 50316. -6.33 255. 71.9 60.7
2 Zimbabwe ZW 2020 46383. -7.82 557. 71.1 59.5
3 Zimbabwe ZW 2021 50311. 8.47 98.5 71.7 60.1
4 Zimbabwe ZW 2022 53399. 6.14 105. 71.7 59.8
5 Zimbabwe ZW 2023 56249. 5.34 NA 71.3 59.7
Another slice helper function, slice_sample, allows us to randomly sample rows. We could use this, for example, if we need to subsample a very large data frame for the purposes of plotting. The following samples 100 rows at random:
slice_sample(wdi_panel, n = 100)# A tibble: 100 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Burkina Faso BF 2016 41816. 5.96 0.441 69.0 53.6
2 Guinea GN 1992 14358. 3.27 NA 64.1 63.2
3 Maldives MV 2018 9935. 8.67 -0.133 74.5 40.9
4 China CN 2006 9060142. 12.7 1.65 79.0 65.4
5 Ecuador EC 2023 260213. 2.36 2.22 76.7 52.5
6 Gabon GA 2013 37948. 5.64 0.505 58.4 40.8
7 Honduras HN 1997 27305. 4.60 20.2 86.9 45.1
8 Mauritius MU 2005 18676. 1.78 4.94 77.1 41.2
9 Senegal SN 2009 41595. 2.75 -2.25 64.2 33.6
10 Maldives MV 1993 2287. 5.56 20.1 74.4 24.4
# ℹ 90 more rows
Another way of using slice_sample is with the by option. For example, instead of just sampling, say, n=3 rows from anywhere in the data frame, we could sample n=3 rows from each country as follows:
slice_sample(wdi_panel, n = 3, by = country)# A tibble: 597 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2008 58037. 3.92 26.4 78.9 15.0
2 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
3 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
4 Albania AL 2018 42170. 4.02 2.03 67.5 51.4
5 Albania AL 2017 40540. 3.80 2.06 66.8 49.7
6 Albania AL 2021 45346. 8.97 2.04 66.2 51.9
7 Algeria DZ 1996 313815. 4.10 18.7 77.2 11.7
8 Algeria DZ 2004 431567. 4.50 3.96 72.7 12.7
9 Algeria DZ 2012 553306. 2.40 8.89 70.0 14.8
10 Andorra AD 2016 4649. 3.71 NA NA NA
# ℹ 587 more rows
5.6 Keeping (or dropping) rows with filter
The slice function and variants are useful for selecting rows but they don’t provide all the functionality we usually need for selecting rows. In particular, we very often need to select rows that match some condition or set of conditions. For example, with data like wdi_panel, we may wish to select only those rows that meet some or all of the conditions: “years since 2010”, “countries in the EU”, “non-missing GDP”. Doing this with slice would be difficult or impossible. Instead, for tasks like this we use filter, which is one of the most powerful data wrangling work-horses.
As a simple example of filter in action, here we keep only those rows from the year 2020:
filter(wdi_panel, year == 2020)# A tibble: 199 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2020 1.08e5 -2.35 5.60 66.8 16.5
2 Albania AL 2020 4.16e4 -3.31 1.62 65.3 50.6
3 Algeria DZ 2020 6.25e5 -5.00 2.42 64.3 13.7
4 Andorra AD 2020 4.29e3 -11.2 NA NA NA
5 Angola AO 2020 2.53e5 -5.64 22.3 76.8 72.7
6 Antigua and Barb… AG 2020 2.20e3 -18.9 0.626 NA NA
7 Argentina AR 2020 1.08e6 -9.90 NA 65.9 46.5
8 Armenia AM 2020 4.46e4 -7.20 1.21 68.1 57.9
9 Aruba AW 2020 3.10e3 -26.2 NA NA NA
10 Australia AU 2020 1.46e6 -0.120 0.847 70.2 60.2
# ℹ 189 more rows
A few simple but important things to note about this. First, think “filter in” and not “filter out”. In other words, filter keeps, rather than drops, the rows that match the condition. Second, in this case we use == and not =. The == is how we test equality. We effectively ask if the thing on the left of == is identical to the thing on the right. You can read this example aloud as “keep rows where year equals 2020.” We use the single = often in R, and you’ve already seen many examples, but == means something completely different. It’s common to mistake the two when using filter, so if you use = instead of ==, filter will inform you that you probably intended ==, as in the following:
filter(wdi_panel, year = 2020)Error in `filter()`:
! We detected a named input.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `year == 2020`?
In general with filter, we use conditions: is something true or false. Equality conditions like year == 2020 are common numerical conditions, but other numerical conditions like != (not equal to), > (greater than), < (less than), >= (greater than or equal to), <= (less than or equal to) are also very commonly used. For example, the following selects those rows from before, but not including, 2020:
filter(wdi_panel, year < 2020)# A tibble: 5,794 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
3 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
4 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
5 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
6 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
7 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
8 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
9 Afghanistan AF 2005 46567. 11.2 12.7 79.5 14.3
10 Afghanistan AF 2004 41865. 1.41 NA 79.6 14.2
# ℹ 5,784 more rows
We can combine multiple conditions in filter. For example, to select the rows from France in or after the year 2020, we can do the following:
filter(wdi_panel, country == 'France', year >= 2020)# A tibble: 4 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 France FR 2022 3654521. 2.57 5.22 60.2 51.9
2 France FR 2021 3562924. 6.88 1.64 59.8 51.6
3 France FR 2023 3688745. 0.936 4.88 59.8 51.8
4 France FR 2020 3333501. -7.44 0.476 58.9 50.0
Here, the comma effectively means “and”, so we are keeping those rows where the country is France and the year is 2020 or later. You can use as many conditions as you wish by just adding more conditions with commas between them. For example, here we select the rows from France in or after the year 2020 where the GDP growth was less than 2 (2% growth):
filter(wdi_panel, country == 'France', year >= 2020, gdp_growth < 2.0)# A tibble: 2 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 France FR 2023 3688745. 0.936 4.88 59.8 51.8
2 France FR 2020 3333501. -7.44 0.476 58.9 50.0
Conditions that are “and” combinations like this are known in logic as conjunctions: a conjunction is true if and only if all of its constituent conditions are true. By contrast, disjunctions are “or” combinations. A disjunction is true if any of its conditions are true. We commonly use disjunctions when using filter. For example, to select rows where GDP growth was negative or inflation was over 20 (20% inflation), we do:
filter(wdi_panel, gdp_growth < 0 | inflation > 20)# A tibble: 1,494 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
2 Afghanistan AF 2022 80416. -6.24 13.7 70.2 5.16
3 Afghanistan AF 2008 58037. 3.92 26.4 78.9 15.0
4 Afghanistan AF 2020 108209. -2.35 5.60 66.8 16.5
5 Afghanistan AF 2001 29496. -9.43 NA 79.9 14.2
6 Albania AL 1994 13896. 8.30 22.6 74.4 52.5
7 Albania AL 2020 41613. -3.31 1.62 65.3 50.6
8 Albania AL 1993 12831. 9.56 85.0 75.1 53.2
9 Albania AL 1991 12618. -28.0 NA 75.2 53.4
10 Albania AL 1990 17526. -9.58 NA 72.7 50.8
# ℹ 1,484 more rows
Here, we don’t use a comma to separate conditions, but use |, which is the symbol in R meaning logical or. We can read this as “keep rows where gdp_growth is less than zero or inflation is greater than 20, or both”. Note that in some of the selected rows, the gdp_growth is positive and in others the inflation is not greater than 20, but in each case, one of the two conditions was met.
What if we wanted to select rows where values were within some range of values? For example, what if we wanted to select rows where the year was between 2000 and 2009? One possibility is to use a conjunction as follows:
filter(wdi_panel, year >= 2000, year <= 2009)# A tibble: 1,934 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
2 Afghanistan AF 2005 46567. 11.2 12.7 79.5 14.3
3 Afghanistan AF 2004 41865. 1.41 NA 79.6 14.2
4 Afghanistan AF 2002 37931. 28.6 NA 79.8 14.2
5 Afghanistan AF 2003 41282. 8.83 NA 79.7 14.2
6 Afghanistan AF 2007 55845. 13.8 8.68 79.2 14.7
7 Afghanistan AF 2008 58037. 3.92 26.4 78.9 15.0
8 Afghanistan AF 2001 29496. -9.43 NA 79.9 14.2
9 Afghanistan AF 2000 32567. NA NA 79.9 14.3
10 Afghanistan AF 2009 70451. 21.4 -6.81 78.7 15.3
# ℹ 1,924 more rows
Here we are asking for those rows where year is greater than or equal to 2000 and also less than or equal to 2009. The only years that satisfy this conjunction are those between 2000 and 2009, inclusive. Another option, and probably easier too, is to use the between helper function as follows:
filter(wdi_panel, between(year, 2000, 2009))# A tibble: 1,934 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
2 Afghanistan AF 2005 46567. 11.2 12.7 79.5 14.3
3 Afghanistan AF 2004 41865. 1.41 NA 79.6 14.2
4 Afghanistan AF 2002 37931. 28.6 NA 79.8 14.2
5 Afghanistan AF 2003 41282. 8.83 NA 79.7 14.2
6 Afghanistan AF 2007 55845. 13.8 8.68 79.2 14.7
7 Afghanistan AF 2008 58037. 3.92 26.4 78.9 15.0
8 Afghanistan AF 2001 29496. -9.43 NA 79.9 14.2
9 Afghanistan AF 2000 32567. NA NA 79.9 14.3
10 Afghanistan AF 2009 70451. 21.4 -6.81 78.7 15.3
# ℹ 1,924 more rows
The helper near can be used to find values that are within a certain distance from a specified value. For example, to find country/years where the GDP growth was \(5\% \pm 1\%\), we can use near as follows:
filter(wdi_panel, near(gdp_growth, 5, 1)) # within ±1.0 of 5 # A tibble: 1,323 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
2 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
3 Albania AL 2018 42170. 4.02 2.03 67.5 51.4
4 Albania AL 2022 47535. 4.83 6.73 68.0 53.2
5 Albania AL 2007 30022. 5.98 2.93 62.9 46.1
6 Albania AL 2005 26748. 5.53 2.37 66.2 47.2
7 Albania AL 2003 24023. 5.53 0.484 69.4 48.5
8 Albania AL 2004 25348. 5.51 2.28 67.8 47.8
9 Albania AL 2006 28327. 5.90 2.37 64.6 46.7
10 Albania AL 2002 22764. 4.54 7.77 70.9 49.1
# ℹ 1,313 more rows
Related situations arise with non-numeric data. What if, for example, we wanted to select rows from France, Germany, and Italy only? One option is as follows:
filter(wdi_panel, country == 'France' | country == 'Germany' | country == 'Italy')# A tibble: 102 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 France FR 1994 2361212. 2.38 1.66 63.3 47.5
2 France FR 2010 3194765. 2.00 1.53 61.8 51.1
3 France FR 2009 3132113. -2.82 0.0876 62.0 51.1
4 France FR 1993 2306239. -0.359 2.10 63.7 47.3
5 France FR 2012 3278656. 0.184 1.95 61.7 51.2
6 France FR 2011 3272640. 2.44 2.11 61.5 51.0
7 France FR 2004 2992434. 2.87 2.14 62.7 49.6
8 France FR 2017 3472749. 2.08 1.03 60.3 50.6
9 France FR 2016 3401867. 0.860 0.183 60.4 50.8
10 France FR 1995 2415633. 2.30 1.80 63.0 47.8
# ℹ 92 more rows
While that works, it is a bit awkward and an easier alternative is to use the %in% operator:
filter(wdi_panel, country %in% c('France', 'Germany', 'Italy'))# A tibble: 102 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 France FR 1994 2361212. 2.38 1.66 63.3 47.5
2 France FR 2010 3194765. 2.00 1.53 61.8 51.1
3 France FR 2009 3132113. -2.82 0.0876 62.0 51.1
4 France FR 1993 2306239. -0.359 2.10 63.7 47.3
5 France FR 2012 3278656. 0.184 1.95 61.7 51.2
6 France FR 2011 3272640. 2.44 2.11 61.5 51.0
7 France FR 2004 2992434. 2.87 2.14 62.7 49.6
8 France FR 2017 3472749. 2.08 1.03 60.3 50.6
9 France FR 2016 3401867. 0.860 0.183 60.4 50.8
10 France FR 1995 2415633. 2.30 1.80 63.0 47.8
# ℹ 92 more rows
The %in% operator tests if the value on its left is an element of the set to its right. In this case, we use the c() function to create a vector of three countries.
As mentioned, filter keeps rows rather than dropping them. However, we can easily get it to drop rows in various ways. One common option is to negate a condition with ! (logical not). For example, to drop all years between 2000 and 2009, we can do the following:
filter(wdi_panel, !between(year, 2000, 2009))# A tibble: 4,648 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
5 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
6 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
7 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
8 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
9 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
10 Afghanistan AF 2022 80416. -6.24 13.7 70.2 5.16
# ℹ 4,638 more rows
The way to read !between(year, 2000, 2009) is any year not between 2000 and 2009, so we end up selecting all but the years between 2000 and 2009 inclusive.
As another example, to drop rows from France, Germany, or Italy, we can negate the selection made with %in% above:
filter(wdi_panel, !(country %in% c('France', 'Germany', 'Italy')))# A tibble: 6,480 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,470 more rows
It is very common to need to remove rows where some or all values are missing, indicated by NA. We can use negation in filter to do this. In particular, we can drop rows whenever there is a missing value in a specified column by using the is.na function with negation. The is.na function tests if a value is NA (missing), so negating it tests if it is not NA (not missing). For example, to drop all rows where the gdp value is missing, we can do the following:
filter(wdi_panel, !is.na(gdp))# A tibble: 6,582 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
However, there is a special command, drop_na, that can be used instead of using filter. This is an important command to be aware of in its own right, and can usually be a more straightforward alternative to using filter to remove missing values.
On its own, drop_na will remove all rows where there is a missing value anywhere.
# drop rows containing missing values
# aka rowwise deletion of missing values
drop_na(wdi_panel)# A tibble: 5,410 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 5,400 more rows
If we name specified columns, it will only drop rows where values of these columns are missing. For example, to drop rows wherever gdp is missing, we would do the following:
drop_na(wdi_panel, gdp)# A tibble: 6,582 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
To drop rows wherever gdp or gdp_growth is missing, we would do the following:
drop_na(wdi_panel, gdp, gdp_growth)# A tibble: 6,557 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,547 more rows
We can use selection helpers like starts_with inside drop_na too. The following will drop any rows if any column starting with gdp has a missing value:
drop_na(wdi_panel, any_of(starts_with('gdp')))# A tibble: 6,557 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,547 more rows
5.7 Creating and transforming variables with mutate
The mutate command is another powerful wrangling tool. It creates one or more new columns from the data already present, or it re-calculates or transforms an existing column in place.
As a simple example, let’s create a variable that calculates GDP, which is currently in millions of dollars, in billions of dollars instead:
mutate(wdi_panel, gdp_b = gdp / 1000)# A tibble: 6,582 × 9
country iso2c year gdp gdp_growth inflation lfp_male lfp_female gdp_b
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 1.03e5 2.26 4.38 74.4 20.2 103.
2 Afghanistan AF 2023 8.26e4 2.71 -4.64 70.2 5.16 82.6
3 Afghanistan AF 2015 1.00e5 1.45 -0.662 75.3 19.1 100.
4 Afghanistan AF 2006 4.91e4 5.36 6.78 79.3 14.5 49.1
5 Afghanistan AF 2017 1.05e5 2.65 4.98 73.4 21.2 105.
6 Afghanistan AF 2014 9.90e4 2.72 4.67 76.2 18.1 99.0
7 Afghanistan AF 2019 1.11e5 3.91 2.30 69.4 18.3 111.
8 Afghanistan AF 2021 8.58e4 -20.7 5.13 67.3 14.7 85.8
9 Afghanistan AF 2013 9.63e4 5.60 7.39 77.1 17.1 96.3
10 Afghanistan AF 2018 1.07e5 1.19 0.626 71.4 19.8 107.
# ℹ 6,572 more rows
Note how the newly created gdp_b is inserted at the end of the data frame by default. We can optionally use .after or .before to tell mutate where to insert the new variable. For example, to insert gdp_b directly after gdp, we can do:
mutate(wdi_panel, gdp_b = gdp / 1000, .after = gdp)# A tibble: 6,582 × 9
country iso2c year gdp gdp_b gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 1.03e5 103. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 8.26e4 82.6 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 1.00e5 100. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 4.91e4 49.1 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 1.05e5 105. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 9.90e4 99.0 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 1.11e5 111. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 8.58e4 85.8 -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 9.63e4 96.3 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 1.07e5 107. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
We can use mutate to perform multiple transformations at once. You can add several columns in the same call, and later names may refer to earlier ones created in the same line. In the following example, we create a GDP-in-billions variable as before, calculate the annual GDP growth as a multiplier, and then use both of these newly created variables to calculate projected GDP:
mutate(wdi_panel,
gdp_b = gdp / 1000,
gdp_growth_mult = 1 + gdp_growth / 100,
gdp_b_proj = gdp_b * gdp_growth_mult,
.after = gdp
)# A tibble: 6,582 × 11
country iso2c year gdp gdp_b gdp_growth_mult gdp_b_proj gdp_growth
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 103. 1.02 105. 2.26
2 Afghanistan AF 2023 82595. 82.6 1.03 84.8 2.71
3 Afghanistan AF 2015 100402. 100. 1.01 102. 1.45
4 Afghanistan AF 2006 49061. 49.1 1.05 51.7 5.36
5 Afghanistan AF 2017 105389. 105. 1.03 108. 2.65
6 Afghanistan AF 2014 98966. 99.0 1.03 102. 2.72
7 Afghanistan AF 2019 110814. 111. 1.04 115. 3.91
8 Afghanistan AF 2021 85768. 85.8 0.793 68.0 -20.7
9 Afghanistan AF 2013 96341. 96.3 1.06 102. 5.60
10 Afghanistan AF 2018 106643. 107. 1.01 108. 1.19
# ℹ 6,572 more rows
# ℹ 3 more variables: inflation <dbl>, lfp_male <dbl>, lfp_female <dbl>
It is common to use mutate to create discrete or categorical variables. For example, we can create a new variable higher_gdp that indicates if any gdp value is above or below average as follows:
mutate(wdi_panel, higher_gdp = gdp >= mean(gdp), .after = gdp) # A tibble: 6,582 × 9
country iso2c year gdp higher_gdp gdp_growth inflation lfp_male
<chr> <chr> <int> <dbl> <lgl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. FALSE 2.26 4.38 74.4
2 Afghanistan AF 2023 82595. FALSE 2.71 -4.64 70.2
3 Afghanistan AF 2015 100402. FALSE 1.45 -0.662 75.3
4 Afghanistan AF 2006 49061. FALSE 5.36 6.78 79.3
5 Afghanistan AF 2017 105389. FALSE 2.65 4.98 73.4
6 Afghanistan AF 2014 98966. FALSE 2.72 4.67 76.2
7 Afghanistan AF 2019 110814. FALSE 3.91 2.30 69.4
8 Afghanistan AF 2021 85768. FALSE -20.7 5.13 67.3
9 Afghanistan AF 2013 96341. FALSE 5.60 7.39 77.1
10 Afghanistan AF 2018 106643. FALSE 1.19 0.626 71.4
# ℹ 6,572 more rows
# ℹ 1 more variable: lfp_female <dbl>
The newly created higher_gdp is a logical vector, with values TRUE or FALSE, where TRUE means the corresponding gdp is equal to or above average GDP, and FALSE means it is below average. An alternative approach is to use if_else as follows:
mutate(wdi_panel,
higher_gdp = if_else(gdp >= mean(gdp), "above_avg", "below_avg"),
.after = gdp
)# A tibble: 6,582 × 9
country iso2c year gdp higher_gdp gdp_growth inflation lfp_male
<chr> <chr> <int> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. below_avg 2.26 4.38 74.4
2 Afghanistan AF 2023 82595. below_avg 2.71 -4.64 70.2
3 Afghanistan AF 2015 100402. below_avg 1.45 -0.662 75.3
4 Afghanistan AF 2006 49061. below_avg 5.36 6.78 79.3
5 Afghanistan AF 2017 105389. below_avg 2.65 4.98 73.4
6 Afghanistan AF 2014 98966. below_avg 2.72 4.67 76.2
7 Afghanistan AF 2019 110814. below_avg 3.91 2.30 69.4
8 Afghanistan AF 2021 85768. below_avg -20.7 5.13 67.3
9 Afghanistan AF 2013 96341. below_avg 5.60 7.39 77.1
10 Afghanistan AF 2018 106643. below_avg 1.19 0.626 71.4
# ℹ 6,572 more rows
# ℹ 1 more variable: lfp_female <dbl>
Now, the new higher_gdp variable is a character vector with two possible values: above_avg and below_avg. With the if_else function, there are always three arguments: the first is some condition to be tested, the second is the value returned if the condition is true, and the third is the value returned if the condition is false.
To create a discrete or categorical variable with more than two possible values, a very useful function is case_when. For example, we can create a new variable infl_rate to indicate if inflation is “high” (above 5%), “medium” (between 2% and 5%), or “low” (less than 2%) as follows:
mutate(wdi_panel,
infl_rate = case_when(
inflation > 5 ~ 'high',
inflation > 2 ~ 'medium',
TRUE ~ 'low'
),
.after = inflation
)# A tibble: 6,582 × 9
country iso2c year gdp gdp_growth inflation infl_rate lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 Afghan… AF 2016 1.03e5 2.26 4.38 medium 74.4 20.2
2 Afghan… AF 2023 8.26e4 2.71 -4.64 low 70.2 5.16
3 Afghan… AF 2015 1.00e5 1.45 -0.662 low 75.3 19.1
4 Afghan… AF 2006 4.91e4 5.36 6.78 high 79.3 14.5
5 Afghan… AF 2017 1.05e5 2.65 4.98 medium 73.4 21.2
6 Afghan… AF 2014 9.90e4 2.72 4.67 medium 76.2 18.1
7 Afghan… AF 2019 1.11e5 3.91 2.30 medium 69.4 18.3
8 Afghan… AF 2021 8.58e4 -20.7 5.13 high 67.3 14.7
9 Afghan… AF 2013 9.63e4 5.60 7.39 high 77.1 17.1
10 Afghan… AF 2018 1.07e5 1.19 0.626 low 71.4 19.8
# ℹ 6,572 more rows
How case_when() works may not be obvious at first glance.
Think of it as checking each row against the rules in order:
- Test the first rule,
inflation > 5.
– If it is true, write"high"and stop for that row. - Otherwise, test the second rule,
inflation > 2.
– If true, write"medium"(we already know the value is ≤ 5). - If neither rule matched, the final catch-all line
TRUE ~ "low"fires, so the row gets"low".
Because the last rule is always TRUE, every row receives exactly one of the three labels — high, medium, or low — and the checking stops as soon as the first matching rule is found.
Another handy helper inside mutate() is across(). Use it when the same change, for example, “divide by 100”, should be applied to several columns in one go.
mutate(
wdi_panel,
across(gdp_growth:lfp_female, ~ . / 100)
)# A tibble: 6,582 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 0.0226 0.0438 0.744 0.202
2 Afghanistan AF 2023 82595. 0.0271 -0.0464 0.702 0.0516
3 Afghanistan AF 2015 100402. 0.0145 -0.00662 0.753 0.191
4 Afghanistan AF 2006 49061. 0.0536 0.0678 0.793 0.145
5 Afghanistan AF 2017 105389. 0.0265 0.0498 0.734 0.212
6 Afghanistan AF 2014 98966. 0.0272 0.0467 0.762 0.181
7 Afghanistan AF 2019 110814. 0.0391 0.0230 0.694 0.183
8 Afghanistan AF 2021 85768. -0.207 0.0513 0.673 0.147
9 Afghanistan AF 2013 96341. 0.0560 0.0739 0.771 0.171
10 Afghanistan AF 2018 106643. 0.0119 0.00626 0.714 0.198
# ℹ 6,572 more rows
What happens, step-by-step:
gdp_growth:lfp_femaleselects the block of four percentage columns (gdp_growth,inflation,lfp_male,lfp_female).- For every value in each of those columns, the little rule
~ . / 100is carried out. Read it as “take the value (the dot) and divide it by 100.” - The result, which is now a true proportion like 0.0226 instead of 2.26, is written back into the same columns. All other columns stay untouched.
So across() lets you say “do this to all these columns” once, instead of repeating the same line four times.
5.8 Sorting rows with arrange
Data often need to be displayed or examined in a particular order rather than the order they were read from disk. The arrange function reorders rows based on the values in one or more columns while leaving every column unchanged.
Sorting the wdi_panel data by gross domestic product allows us to see which country–years have the largest or smallest economies in the table.
arrange(wdi_panel, gdp)# A tibble: 6,582 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Tuvalu TV 1990 26.3 15.4 NA NA NA
2 Tuvalu TV 1991 27.2 3.61 NA NA NA
3 Tuvalu TV 1992 28.0 2.79 NA NA NA
4 Tuvalu TV 1996 28.7 -5.96 NA NA NA
5 Tuvalu TV 1993 29.1 4.11 NA NA NA
6 Tuvalu TV 1995 30.5 -5.00 NA NA NA
7 Tuvalu TV 1997 31.5 10.0 NA NA NA
8 Tuvalu TV 1994 32.1 10.3 NA NA NA
9 Tuvalu TV 2005 34.7 -4.65 NA NA NA
10 Tuvalu TV 2006 35.5 2.44 NA NA NA
# ℹ 6,572 more rows
The rows now appear from the smallest GDP to the largest.
To reverse the order wrap the column name in desc, which stands for descending.
arrange(wdi_panel, desc(gdp))# A tibble: 6,582 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 China CN 2023 31229733. 5.25 0.235 71.6 59.9
2 China CN 2022 29672080. 2.95 1.97 71.6 59.9
3 China CN 2021 28821649. 8.45 0.981 72.9 61.2
4 China CN 2020 26576354. 2.24 2.42 71.9 59.9
5 China CN 2019 25994432. 5.95 2.90 73.4 61.2
6 United States US 2023 24977118. 2.89 4.12 67.6 56.6
7 China CN 2018 24534506. 6.75 2.07 74.0 61.5
8 United States US 2022 24276131. 2.51 8.00 67.6 56.0
9 United States US 2021 23681171 6.06 4.70 67.4 55.6
10 China CN 2017 22983193. 6.95 1.59 74.6 61.9
# ℹ 6,572 more rows
This version lists the country–year with the highest GDP first and moves down to the lowest. Multiple sort keys can be combined by listing additional columns after the first, allowing ties to be broken or secondary ordering to be imposed, but a single arrange call with or without desc is enough for most quick inspections.
5.9 Reshaping data from wide to long format and back
In data frames, data is often arranged in one of two fundamental ways for specific sets of variables: wide format and long format. For some analyses it is necessary to use a wide format and for other analyses a long format is required. Transforming your data from wide to long, or long to wide, is known as reshaping. Understanding these formats and how to switch between them is an essential data-wrangling skill. The tidyverse, specifically the tidyr package, provides two tools to reshape data: pivot_longer, which converts wide format to long, and pivot_wider, which converts long format to wide.
It is easier to understand long and wide formats using examples. Consider the following fictitious data.
| neighbourhood | crime_2019 | crime_2020 | crime_2021 |
|---|---|---|---|
| Downtown | 120 | 115 | 110 |
| Riverside | 85 | 90 | 88 |
| Westfield | 60 | 55 | 52 |
| Lakeshore | 40 | 38 | 35 |
Each row is a neighbourhood and each column after the first records the reported crime rate (crime per 10 000 residents) in a different year. Because the same measure is stored in separate year-specific columns this data frame is in wide form.
By contrast, the next data frame is the same data in long format.
| neighbourhood | year | crime |
|---|---|---|
| Downtown | 2019 | 120 |
| Downtown | 2020 | 115 |
| Downtown | 2021 | 110 |
| Riverside | 2019 | 85 |
| Riverside | 2020 | 90 |
| Riverside | 2021 | 88 |
| Westfield | 2019 | 60 |
| Westfield | 2020 | 55 |
| Westfield | 2021 | 52 |
| Lakeshore | 2019 | 40 |
| Lakeshore | 2020 | 38 |
| Lakeshore | 2021 | 35 |
The long version stacks the three year columns into two new columns: one for the year and one for the crime rate. A wide format is often easier for humans to read and can help with comparisons across columns. A long format is usually preferable or required for many statistical analyses and visualisations, and it is more flexible for adding new data, filtering, and performing calculations. As a rule, when analysing data in R, a long format is often the preferred default.
The data frame wdi_panel has two lfp_ columns in wide format, one for male and one for female labour-force participation. We can convert these columns to long format with pivot_longer.
wdi_panel |>
pivot_longer(
cols = starts_with("lfp_"),
names_to = "gender",
values_to = "lfp"
)# A tibble: 13,164 × 8
country iso2c year gdp gdp_growth inflation gender lfp
<chr> <chr> <int> <dbl> <dbl> <dbl> <chr> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 lfp_male 74.4
2 Afghanistan AF 2016 102672. 2.26 4.38 lfp_female 20.2
3 Afghanistan AF 2023 82595. 2.71 -4.64 lfp_male 70.2
4 Afghanistan AF 2023 82595. 2.71 -4.64 lfp_female 5.16
5 Afghanistan AF 2015 100402. 1.45 -0.662 lfp_male 75.3
6 Afghanistan AF 2015 100402. 1.45 -0.662 lfp_female 19.1
7 Afghanistan AF 2006 49061. 5.36 6.78 lfp_male 79.3
8 Afghanistan AF 2006 49061. 5.36 6.78 lfp_female 14.5
9 Afghanistan AF 2017 105389. 2.65 4.98 lfp_male 73.4
10 Afghanistan AF 2017 105389. 2.65 4.98 lfp_female 21.2
# ℹ 13,154 more rows
The cols argument identifies the columns to pivot, which are the two that start with lfp_. For each of those columns, pivot_longer creates a new row for every original row, writes the column name into gender, and writes the original cell value into lfp. All columns that were not selected remain unchanged.
The names that appear in gender are lfp_male and lfp_female. It is preferable to remove the lfp_ prefix from these names. We can drop that shared prefix during the pivot as follows.
wdi_panel |>
pivot_longer(
cols = starts_with("lfp_"),
names_to = "gender",
names_prefix = "lfp_",
values_to = "lfp"
)# A tibble: 13,164 × 8
country iso2c year gdp gdp_growth inflation gender lfp
<chr> <chr> <int> <dbl> <dbl> <dbl> <chr> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 male 74.4
2 Afghanistan AF 2016 102672. 2.26 4.38 female 20.2
3 Afghanistan AF 2023 82595. 2.71 -4.64 male 70.2
4 Afghanistan AF 2023 82595. 2.71 -4.64 female 5.16
5 Afghanistan AF 2015 100402. 1.45 -0.662 male 75.3
6 Afghanistan AF 2015 100402. 1.45 -0.662 female 19.1
7 Afghanistan AF 2006 49061. 5.36 6.78 male 79.3
8 Afghanistan AF 2006 49061. 5.36 6.78 female 14.5
9 Afghanistan AF 2017 105389. 2.65 4.98 male 73.4
10 Afghanistan AF 2017 105389. 2.65 4.98 female 21.2
# ℹ 13,154 more rows
Here names_prefix tells pivot_longer to remove lfp_ before writing the value into gender. The gender column now holds the labels male and female.
Because pivots are reversible, we can restore the original wide layout with pivot_wider, which transforms long format to wide. In the code below we first save the output of pivot_longer as wdi_longer and then use pivot_wider to reverse it.
wdi_longer <- pivot_longer(
wdi_panel,
cols = starts_with("lfp_"),
names_to = "gender",
names_prefix = "lfp_",
values_to = "lfp"
)
pivot_wider(
wdi_longer,
names_from = "gender",
values_from = "lfp",
names_prefix = "lfp_"
)# A tibble: 6,582 × 8
country iso2c year gdp gdp_growth inflation lfp_male lfp_female
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8
# ℹ 6,572 more rows
pivot_wider spreads the values in gender back into separate columns, fills them with the numbers from lfp, and re-attaches the prefix so the resulting names are lfp_male and lfp_female again. There is a clear symmetry between pivot_longer and pivot_wider: names_to matches names_from, and values_to matches values_from.
5.10 Joining data frames
Analyses often require information that lives in more than one data frame. A join operation combines two data frames by matching rows that share the same identifying columns, sometimes called keys.
Imagine if we have two data frames:
data_frame_A: A first data frame could be a list of student names and their student ID numbers.data_frame_B: A second data frame could be a list of classes and the ID numbers of the students taking them.
If you want to find out the names of all the students in all the classes, we need to link up these two data frames. The join operation does exactly that by looking for the common piece of information between the two; in this case, it is the student ID number. It effectively looks at each student ID value in data_frame_A and then finds all the rows in data_frame_B with matching student IDs. It then combines the information from both lists for each matching ID, creating a new and more complete data frame that tells you which student (by name) is taking which class.
The tidyverse, specifically the dplyr package, provides a set of joining functions: inner_join, left_join, and others. To see this in action, let’s say we want to get the population for every country and year in the wdi_panel. We can do this by joining wdi_panel with the population data frame, which is part of the tidyr package and so is automatically loaded when you load the tidyverse:
population# A tibble: 4,060 × 3
country year population
<chr> <dbl> <dbl>
1 Afghanistan 1995 17586073
2 Afghanistan 1996 18415307
3 Afghanistan 1997 19021226
4 Afghanistan 1998 19496836
5 Afghanistan 1999 19987071
6 Afghanistan 2000 20595360
7 Afghanistan 2001 21347782
8 Afghanistan 2002 22202806
9 Afghanistan 2003 23116142
10 Afghanistan 2004 24018682
# ℹ 4,050 more rows
As you can see, population has country and year, as does wdi_panel, and so we can join these two by
inner_join(wdi_panel, population)# A tibble: 3,095 × 9
country iso2c year gdp gdp_growth inflation lfp_male lfp_female population
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5 25631282
2 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1 30551674
3 Afghanistan AF 2005 46567. 11.2 12.7 79.5 14.3 24860855
4 Afghanistan AF 2004 41865. 1.41 NA 79.6 14.2 24018682
5 Afghanistan AF 2002 37931. 28.6 NA 79.8 14.2 22202806
6 Afghanistan AF 2003 41282. 8.83 NA 79.7 14.2 23116142
7 Afghanistan AF 2007 55845. 13.8 8.68 79.2 14.7 26349243
8 Afghanistan AF 2008 58037. 3.92 26.4 78.9 15.0 27032197
9 Afghanistan AF 2012 91231. 12.8 6.44 78.0 16.2 29824536
10 Afghanistan AF 2011 80913. 0.426 11.8 78.2 15.9 29105480
# ℹ 3,085 more rows
In this example, inner_join finds all columns shared between wdi_panel and population and joins by every combination of these two variables. Sometimes it is helpful or necessary to explicitly specify the shared columns to join on. We can do that by using by = as follows:
inner_join(wdi_panel, population, by = c("country", "year"))# A tibble: 3,095 × 9
country iso2c year gdp gdp_growth inflation lfp_male lfp_female population
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5 25631282
2 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1 30551674
3 Afghanistan AF 2005 46567. 11.2 12.7 79.5 14.3 24860855
4 Afghanistan AF 2004 41865. 1.41 NA 79.6 14.2 24018682
5 Afghanistan AF 2002 37931. 28.6 NA 79.8 14.2 22202806
6 Afghanistan AF 2003 41282. 8.83 NA 79.7 14.2 23116142
7 Afghanistan AF 2007 55845. 13.8 8.68 79.2 14.7 26349243
8 Afghanistan AF 2008 58037. 3.92 26.4 78.9 15.0 27032197
9 Afghanistan AF 2012 91231. 12.8 6.44 78.0 16.2 29824536
10 Afghanistan AF 2011 80913. 0.426 11.8 78.2 15.9 29105480
# ℹ 3,085 more rows
The inner_join function keeps only those rows where the key values appear in both data frames. If a particular country and year pair is missing from either table, that row is dropped. In our case the result shrinks from 6582 to 3095 rows, keeping only the country-year combinations for which a population figure is available.
In contrast to inner_join, left_join keeps every row from the first (left) data frame and adds columns from the second wherever the keys match:
left_join(wdi_panel, population, by = c("country", "year"))# A tibble: 6,582 × 9
country iso2c year gdp gdp_growth inflation lfp_male lfp_female population
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan AF 2016 102672. 2.26 4.38 74.4 20.2 NA
2 Afghanistan AF 2023 82595. 2.71 -4.64 70.2 5.16 NA
3 Afghanistan AF 2015 100402. 1.45 -0.662 75.3 19.1 NA
4 Afghanistan AF 2006 49061. 5.36 6.78 79.3 14.5 25631282
5 Afghanistan AF 2017 105389. 2.65 4.98 73.4 21.2 NA
6 Afghanistan AF 2014 98966. 2.72 4.67 76.2 18.1 NA
7 Afghanistan AF 2019 110814. 3.91 2.30 69.4 18.3 NA
8 Afghanistan AF 2021 85768. -20.7 5.13 67.3 14.7 NA
9 Afghanistan AF 2013 96341. 5.60 7.39 77.1 17.1 30551674
10 Afghanistan AF 2018 106643. 1.19 0.626 71.4 19.8 NA
# ℹ 6,572 more rows
When using left_join, if the second data frame lacks a matching row, the new columns are filled with NA. In this example, the resulting output therefore has the same number of rows as wdi_panel (6582) and has missing values for population where no match was found.
Use inner_join when you need complete cases and are happy to drop observations that lack matches. Use left_join when you want to preserve all rows from the main table and simply attach whatever extra information is available.
Other join variants exist such as right_join, full_join, and anti_join, but inner_join and left_join cover most day-to-day tasks and are all you need at this stage.
5.11 Data-wrangling pipelines
Chapter 2 introduced the pipe operator |> and showed how it passes the result of one expression straight into the next function. Data wrangling is where that left-to-right flow shines. Each of the tidyverse commands we focused on here does one job, so real wrangling work usually means running several commands in sequence. A pipeline writes those steps as a single sentence that the reader can follow from the raw data to the finished table.
Below are three short pipelines that use almost every tool covered so far. They all start with wdi_panel, add information or reshape it, and end with a table that could feed directly into a plot or a model.
- Pipeline example 1: GDP per person, ranked
- This pipeline builds a ranked list of GDP-per-capita across different country-years.
inner_join(wdi_panel, population, by = c("country", "year")) |>
mutate(gdp_pc = 1e6 * gdp / population) |>
select(country, year, gdp_pc) |>
arrange(desc(gdp_pc))# A tibble: 3,095 × 3
country year gdp_pc
<chr> <dbl> <dbl>
1 Luxembourg 2007 139017.
2 Luxembourg 2008 135467.
3 Luxembourg 2013 132423.
4 Luxembourg 2006 131384.
5 Luxembourg 2010 130498.
6 Luxembourg 2012 129978.
7 Luxembourg 2011 129700.
8 Luxembourg 2009 128236.
9 Luxembourg 2005 126191.
10 Luxembourg 2004 124852.
# ℹ 3,085 more rows
The steps are as follows:
inner_join(wdi_panel, population, by = c("country", "year"))Keep only those country–year rows that appear in both tables, attaching thepopulationcolumn to the matching rows inwdi_panel.mutate(gdp_pc = 1e6 * gdp / population)Convert GDP from millions to dollars (gdp * 1e6) and divide bypopulationto create a new columngdp_pc, the per-capita GDP.select(country, year, gdp_pc)Retain just the identifying columnscountryandyearalong with the newly calculatedgdp_pc, discarding all other variables.arrange(desc(gdp_pc))Sort the resulting table in descending order so that the highest GDP-per-capita observations appear first.
- Pipeline example 2: Recent inflation by income group
- This pipeline produces a five-row summary that compares average GDP and average inflation across GDP quintiles for recent years.
wdi_panel |>
filter(year >= 2015) |>
mutate(gdp_quintile = ntile(gdp, 5)) |>
summarise(avg_gdp = mean(gdp),
avg_inflation = mean(inflation, na.rm = TRUE),
.by = gdp_quintile) |>
arrange(gdp_quintile)# A tibble: 5 × 3
gdp_quintile avg_gdp avg_inflation
<int> <dbl> <dbl>
1 1 3139. 3.18
2 2 28657. 5.51
3 3 84298. 8.75
4 4 318744. 7.56
5 5 3253163. 5.53
Here is what happens, step by step.
filter(year >= 2015)Keep only rows from 2015 onward. Everything earlier is discarded.mutate(gdp_quintile = ntile(gdp, 5))Rank the remaining rows by their GDP value and split them into five equal-sized groups. The poorest fifth getsgdp_quintile = 1, the richest fifth gets5.summarise(..., .by = gdp_quintile)For each quintile calculate •avg_gdp: the mean GDP within that group. •avg_inflation: the mean inflation rate, ignoring any missing values. Because of.by = gdp_quintile, the summary is done separately for each quintile, yielding one row per group.arrange(gdp_quintile)Order the resulting five rows from the lowest GDP quintile to the highest so the output reads naturally from poorer to richer economies.
filterto select the last ten years.summariseto group the rows by World Bank income category. Compute the average inflation rate within each group. List groups from highest to lowest inflation.
- Pipeline example 3: Gender gaps in labour participation by region
- This pipeline averages the labour force participation rate for males and females, averaged over all countries and years:
wdi_panel |>
pivot_longer(
cols = starts_with("lfp_"),
names_to = "gender",
names_prefix = "lfp_",
values_to = "lfp"
) |> summarise(lfp = mean(lfp, na.rm = TRUE),
.by = gender)# A tibble: 2 × 2
gender lfp
<chr> <dbl>
1 male 72.2
2 female 50.0
pivot_longer(cols = starts_with("lfp_"), names_to = "gender", names_prefix = "lfp_", values_to = "lfp")Select the columns whose names begin withlfp_. Stack those columns into two new columns where the former column name, stripped of thelfp_prefix, is stored ingenderand the numeric value is stored inlfp.summarise(lfp = mean(lfp, na.rm = TRUE), .by = gender)Group the long data bygenderand compute the mean oflfp, ignoring missing values. The result is one row formaleand one row forfemale, each containing the average labour-force participation rate.
These three examples show how a pipe strings together joins, filters, calculations, reshapes, summaries, and sorts. The same pattern scales to longer pipelines; you just add another step on a new line.
5.12 Further Reading
The wrangling techniques and tidyverse functions presented in this chapter provide a foundation for cleaning and transforming data. The tidyverse contains many additional functions and advanced techniques that offer more power and flexibility. Exploring these resources will deepen your understanding and expand your data-wrangling toolkit.
- The Official
tidyverseDocumentation -
This chapter focused on core
dplyrandtidyrfunctions. You can learn more about these packages and others in the official documentation, which includes many examples.
dplyr: https://dplyr.tidyverse.orgtidyr: https://tidyr.tidyverse.orgstringrfor string manipulation: https://stringr.tidyverse.orglubridatefor dates and times: https://lubridate.tidyverse.orgforcatsfor factors: https://forcats.tidyverse.orgpurrrfor functional programming and iteration: https://purrr.tidyverse.org- Explore the entire
tidyverseecosystem: https://www.tidyverse.org
- R for Data Science (R4DS) by Hadley Wickham and Garrett Grolemund
-
This textbook (Wickham et al., 2023) is the standard reference for learning the
tidyverse. It includes detailed chapters on data transformation and is freely available online at https://r4ds.hadley.nz/. - Specific Package Cheatsheets
-
Many
tidyversepackages have quick-reference cheatsheets. These can be found on the Posit cheatsheets page at https://posit.co/resources/cheatsheets/ or via the individual package websites.
5.13 Chapter summary
What you learned
- Why data wrangling matters: real-world datasets are rarely tidy, and cleaning, reshaping, and transforming them is essential before any analysis can begin.
- How to use core tidyverse verbs for working systematically with data:
select() to choose columns, rename() and relocate() to tidy and reorder them,
slice() and filter() to keep or drop rows,
mutate() to create or transform variables,
and arrange() to sort the results.
- How to reshape data between wide and long formats using pivot_longer() and pivot_wider().
- How to merge datasets using joins such as inner_join() and left_join(), matching rows based on shared identifiers.
- How to build pipelines using the native pipe |> to link transformations together into clear, reproducible workflows.
- How tidy data principles — each variable in a column, each observation in a row, each value in a cell — underpin all wrangling operations and make analysis easier.
Common pitfalls to avoid
- Thinking of wrangling as separate from analysis rather than its foundation.
- Forgetting that tidyverse functions are immutable — they return new data frames instead of changing the original.
- Using manual or repetitive transformations instead of chaining commands in a pipeline.
- Losing track of the tidy data structure when reshaping or joining.
Takeaway
- Chapter 5 introduced the practical toolkit for transforming messy real-world data into clean, analysis-ready form.
- You learned to use the tidyverse verbs and the pipe |> to construct transparent wrangling pipelines that anyone can read and reproduce.
- These skills are the foundation for all later work in the rest of the book.