1 Introducing statistics
This book is about statistics in the social sciences. The social sciences — fields like psychology, sociology, economics, political science, and many more — study human behavior and interactions, examining how we form and are then shaped by social structures and systems. Social scientists address questions that affect most of our lives: How does income inequality affect crime and educational outcomes? Why do people vote the way they do? Does social media damage our mental health? To investigate these questions, they employ various methods, including qualitative approaches (such as interviews and ethnography) and quantitative approaches (such as surveys and experiments). This book focuses specifically on how we analyze quantitative data to help us answer these important questions with precision and confidence. Statistics is the science of how we turn quantitative social science data, which is often messy and complex, into at least provisional answers to important research questions.
This book aims to cover all the core statistical topics you’ll encounter in typical social science research: topics like t-tests, regression analysis, ANOVA, p-values, confidence intervals, and hypothesis testing. But rather than treating these as a collection of separate recipes or tricks, each with its own rules and procedures, we present them from the unified point of view of building, testing, and reasoning with statistical models. A statistical model is just a simplified mathematical description of systems or phenomena in the real world, whether that’s the relationship between income and voting behavior or the effect of a new therapy on anxiety levels. We believe that taking this statistics-as-model-building approach makes learning easier because you’ll see how all statistical methods are really just variations of the same fundamental principles: some models are simpler, others more complex, some make different assumptions, but all follow the same logic of model building and evaluation. We also think this approach will make it easier for you to continue learning statistics beyond this course, allowing you to more easily transition to new and more advanced topics and confidently approach new statistical challenges in your future research.
We (your authors, Mark and Lucy) both love statistics and data analysis, but we recognize that not everyone shares our enthusiasm, at least not yet. We know that many students come to the social sciences because they’re fascinated by how humans and societies work, and discovering they need to learn statistics can feel like an unwelcome surprise that seems irrelevant to their interests. Others may understand why statistics matters for research but still feel uncomfortable when they encounter mathematical symbols, formulas, or code, which can be genuinely intimidating. We do understand this, and we’ve written this book with these concerns in mind. Throughout every chapter, we work hard to show you exactly why statistics is useful or even essential for answering the social science questions people care about. At the same time, we’ve tried to make everything as accessible as possible by keeping technical details to a minimum — not zero, because some symbols, formulas, and code are unavoidable if we want to explain things properly, but only as much as necessary and no more. When we do introduce technical elements, we explain and break down everything step by step, building up your understanding gradually so you will hopefully never feel lost or overwhelmed.
1.1 This chapter covers…
Why statistics matters.
Real research examples from psychology, economics, politics, sociology, criminology, and education show why statistical reasoning is essential for answering social-science questions.
The core ideas of statistics.
Five recurring concepts — populations, samples, models, inference, and uncertainty — that provide a lens for understanding data throughout the book.
Why data need statistics.
How non-determinism, sampling variability, measurement noise, confounding, and multiple causes make statistical modelling unavoidable.
The role of computation.
How modern computing transforms classical methods: cleaning large datasets, fitting complex models, visualising results, and ensuring reproducibility.
Getting started with R.
Why we use R, what makes it powerful, and how it supports transparent, reproducible workflows.
Your guides and approach.
The authors’ guiding principles: practical focus, clear language, minimal technicalities, and reproducible analysis.
1.2 Why we need statistics
In every field of social science, we can find endless examples of research where statistical methods are not just useful tools but are essential for answering important research questions. Here is a short list of illustrative examples of social science research studies where statistics play a vital role:
- Social media and mental well-being
- In 2013, psychologist Ethan Kross and colleagues (Kross et al., 2013) tracked 82 young adults for two weeks to see whether using Facebook in everyday life lifted their spirits or dragged them down. Before the study began, each participant filled out standard questionnaires on life satisfaction, mood, self-esteem, and related traits. For the next 14 days the researchers sent each participant five text messages a day, randomly spaced between 10am and midnight. Every time the phone buzzed, the participant tapped a tiny online survey with five sliders asking how good or bad they felt right then, how worried they were, how lonely they felt, how much Facebook they had used since the last text, and how much face-to-face or phone contact with other people they had had since the last text. Over the course of the two weeks, this created thousands of moment-to-moment snapshots of the participants’ mood and behavior. When the answers were lined up in time order, some patterns emerged. Periods of heavier Facebook use were usually followed by slightly worse feelings at the next check-in. By contrast, spending time with people in person predicted a small boost, rather than a dip, in mood. These links held regardless of personality, number of Facebook friends, or how supportive the participants thought their online network was.
- Does working from home hurt or help business?
- In 2021-22, economist Nicholas Bloom and colleagues (Bloom et al., 2024) partnered with Trip.com, a 35,000-employee travel technology company in Shanghai, to settle a hot post-pandemic question: is letting staff work from home for a couple of days each week good or bad for employees and the firm? They invited 1,612 engineers, marketers, and finance staff to join a six-month experiment. Workers were divided randomly into two groups. Those with odd-numbered birthdays could work from home every Wednesday and Friday (the “hybrid” group); even-birthday colleagues kept the traditional five-day office week (the “office” group). Everyone clocked the same hours, used identical software, and faced the same quarterly performance reviews. Researchers tracked job-satisfaction surveys, HR records on quits, promotion, and pay, and, for coders, the amount of computer code written. They also asked managers before and after the trial how they thought hybrid work would affect productivity. After six months, hybrid workers were one-third less likely to quit than office-only peers and reported higher job satisfaction. The benefits were largest for staff with long commutes, women, and non-managers. Crucially, supervisors’ performance ratings, promotion rates, and objective metrics such as lines of code looked no different between the two groups—even when the researchers checked again two years later. Seeing the evidence, managers shifted from expecting hybrid work to lower productivity to believing it might raise it slightly.
- Does vote-by-mail tilt US elections?
- In the run-up to the 2020 US presidential election, commentators on both sides claimed that mailing every registered voter a ballot would hand an advantage to one party. Political scientist Daniel M. Thompson and colleagues (Thompson et al., 2020) asked if universal vote-by-mail actually shifts turnout or the partisan balance of votes. The authors assembled a mammoth archive of county returns from 1996-2018 for every election in California, Utah, and Washington states, which were the only three states that rolled out universal vote-by-mail county-by-county rather than all at once. Because neighboring counties adopted the policy in different years, each election creates a built-in comparison group. They paired the 1,240 county-election records with tens of millions of anonymized voter-file entries that identify party registration, allowing them to see not just how many people voted but who they were. Treating each reform as a natural before-and-after experiment, the team compared changes in turnout and Democratic/Republican vote share in counties as they switched to all-mail voting with changes in counties that had not yet switched. The main results show that the reform raised overall turnout by about two percentage points, which is roughly the same boost that moving a presidential election from a rainy to a sunny day can cause. Moreover, it left both parties’ fortunes unchanged. Neither the share of voters who were Democrats nor the Democratic vote share budged in a statistically meaningful way. The same null result held across midterms and presidential races, urban and rural counties, and high- and low-propensity voters. In short, universal vote-by-mail brings a modest participation bump without tilting the electoral playing field.
- Does cannabis raise the risk of psychosis?
- It has long been debated whether regular cannabis use, especially of high-potency “skunk” varieties, causes psychotic illnesses such as schizophrenia. Psychiatrist Marta Di Forti and colleagues (Di Forti et al., 2019) conducted a large international study on this topic. They identified 901 adults presenting with a first episode of psychosis at mental-health services in 11 cities across Europe and one in Brazil between 2010 and 2015. These cases were matched to 1,237 individuals representative of the local population in the same locations and times. Trained interviewers used a standard questionnaire to record lifetime cannabis habits, such as whether participants had ever used cannabis, how frequently they used it, and the potency of the drug in their areas (based on local seizure data). They also recorded sociodemographic variables, other drug use, and family psychiatric history so these factors could be taken into account. Across the sample, daily users of high-potency cannabis were about five times more likely to develop a psychotic disorder than people who had never tried the drug. Daily use of any cannabis carried a threefold increase, while occasional low-potency use carried a smaller increase in risk. Cities where strong daily use was common, such as London and Amsterdam, also recorded the highest incidence of psychosis.
- Does pre-trial detention affect conviction or future crime?
- In the United States more than one-fifth of people behind bars are awaiting trial, being held because they could not post bail. Critics argue that this form of “wealth-based detention” pressures defendants to plead guilty and derails lives; defenders claim it keeps communities safer. Will Dobbie and colleagues (Dobbie et al., 2018) investigated the effects of being detained pre-trial on case outcomes, later crime, and employment. The authors exploited a quirk of big-city courtrooms: defendants are randomly assigned to bail judges who differ sharply in their tendency to release or detain. Because judge “toughness” is effectively random, comparing defendants seen by lenient versus strict judges mimics an experiment while using purely observational data. The main findings were that defendants assigned to a stricter judge, and so more likely to be held, were 15 percentage points more likely to be convicted, almost entirely via increases in guilty pleas. Pre-trial detention had little detectable effect on future crime, but it cut post-release employment and earnings by about 13 percent, consistent with a conviction tarnishing resumes.
- Do social networks lead to social mobility?
- Do cross-class friendships help poor children climb the income ladder? Raj Chetty and colleagues (Chetty et al., 2022) used the largest social-network dataset ever assembled to look at how social networks affect economic mobility. Working with privacy-protected Facebook records, the team mapped 21 billion friendship ties among 72 million U.S. adults aged 25-44. Each user’s socioeconomic status (SES) was predicted from home postal address, education, and smartphone model, then validated against Census and tax statistics. They built three indices: economic connectedness, or the share of high-SES friends in the networks of low-SES people; network cohesion, or how tightly friends’ friends cluster; civic engagement, or volunteering and group membership rates. The Facebook measures were then matched to previously published Internal Revenue Service estimates of upward income mobility for children born in the early 1980s, creating a nationwide panel of 70,000 communities. They found that areas where low-income residents have more high-income friends exhibit sharply higher adult earnings for children from poor families. Economic connectedness is a stronger mobility predictor than school quality, job growth, or racial segregation. Network cohesion and civic engagement, by contrast, show little or no link once connectedness is held constant.
- Do school closures affect children’s learning?
- When COVID-19 shut classrooms in spring 2020, the public and policymakers were worried about the possible long-term consequences of these closures for children’s education. Demographers Per Engzell and colleagues (Engzell et al., 2021) exploited uniquely rich Dutch test records to measure exactly how much progress stalled, and for whom, when schools in the Netherlands were closed during the COVID pandemic. The Netherlands requires all primary schools to give identical standardized tests each January and June. The authors collected anonymized scores for around 350,000 pupils in grades 3-6 across 2017-2020, then linked them to background variables such as parental education. Because the 2020 spring term featured an eight-week national lockdown with online instruction, that cohort’s January-to-June gains could be compared with the same interval in the three prior years. Despite world-leading broadband access and a short closure, pupils in 2020 made roughly one-fifth of a school year less progress than in previous years, a loss almost exactly equal to the time they spent out of class. Losses were 60% larger for children whose parents lacked higher education, confirming fears that remote learning widens socioeconomic gaps in outcomes.
Although these seven studies span psychology, economics, political science, criminology, sociology, and education, and use everything from smartphone surveys to nationwide administrative files, they all, like countless other social science studies, rely heavily on statistical analyses to address the research questions they pose. Were you to look through all the original research articles cited above, you would find that all the results are presented in the seemingly arcane language of modern statistics: standard deviations, standard errors, p-values, confidence intervals, odds ratios, and so on. Despite the ubiquity of statistics wherever there is quantitative data in the social sciences, it is still worth asking exactly why we use statistics in this research. After all, could we not just let the data speak for themselves? Can we not answer the questions we want to pose without all the machinery of statistics? This question is not necessarily naïve. After all, the eminent physicist Ernest Rutherford, said to be the father of nuclear physics, famously said: “If your experiment needs statistics, you ought to have done a better experiment.” For Rutherford, then, statistics might be avoidable had the research been done more carefully.
In social science, unlike the tidy world of early-twentieth-century physics, where a single tabletop experiment could overturn everything we know about the structure of the atom, statistics can probably never be avoided because the subject matter is inherently messier and more complicated. For either ethical or practical reasons, we cannot just design and conduct perfect experiments that definitively answer our questions. We cannot lock millions of Facebook users in a laboratory for the rest of their lives, randomly assign people to smoke cannabis regularly or not, or rerun a pandemic with and without school closures. The real world, the real human social world, supplies only fragments of evidence, shot through with chance events, missing pieces, and competing explanations. With statistics, we can see through the clutter and complications.
In more detail, social-science research is complicated by several fundamental factors, such as non-determinism, sampling variability, noise and uncertainty, and intertwined causes. These require us to use statistical reasoning rather than simple inspection of the numbers.
Some things in the world unfold like clockwork, others always surprise us, and most sit somewhere in between.
Here are four ideas to help describe these different kinds of processes:
Deterministic
Processes that are fixed and predictable: the same inputs always give the same outputs. Typing 2 + 2 into a calculator will always return 4, and dropping an object on Earth will always make it fall. Deterministic systems never surprise you.
Non-deterministic
Processes that follow fixed rules but produce different outcomes each time. Shuffling a deck of cards uses the same procedure, yet the order is always different. Rolling a die works the same way: the mechanics don’t change, but the result does.
Randomness
The name we give to this unpredictable variation in outcomes. Each die roll or coin toss is random — you can’t know the next outcome in advance. Randomness is the raw material of uncertainty.
Probabilistic
The way we describe and reason about randomness. Probability gives us a structured way to say how likely different outcomes are. A coin toss is random, but probabilistic reasoning tells us the chance of heads is 0.5, or 50%.
Statistics lives in this space where randomness is everywhere, and probability is the language we use to make sense of it. Deterministic rules still exist, but our main interest is in the non-deterministic, probabilistic processes that shape the social world.
Probability runs on a simple scale:
- 0 means the event cannot happen (impossible).
- 1 means the event will definitely happen (certain).
- Values between 0 and 1 reflect varying degrees of likelihood.
Examples
- Rolling a 7 on a standard six-sided die = 0 (impossible).
- The sun rising tomorrow = 1 (certain).
- Flipping a coin and getting heads = 0.5 (equally likely either way).
In this book, when we first talk about probability, we’ll usually mean it in the long-run sense: not what will happen on the very next trial, but the pattern you see when the process is repeated many times. Toss a coin once and you might get tails; toss it 1,000 times and you’ll get close to half heads, half tails.
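You will not need to write any code until Chapter 2, but as a small preview of this long-run view of probability, the following R sketch (with object names of our own choosing) simulates 1,000 coin tosses and computes the proportion of heads.

```r
# Simulate 1,000 fair coin tosses and compute the proportion of heads
set.seed(101)  # makes the simulated result repeatable
tosses <- sample(c("heads", "tails"), size = 1000, replace = TRUE)
mean(tosses == "heads")  # close to 0.5, but rarely exactly 0.5
```

Run it a few times without the set.seed() line and the proportion will wobble around 0.5, which is exactly what the long-run interpretation of probability describes.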
Different forms
We can write probabilities in more than one way, and you’ll see all of these in research papers and reports:
Percentages. The most familiar.
A probability of 0.3 means a 30% chance — like saying there’s a 30% chance of rain tomorrow.
Odds. Another way of framing likelihood.
A probability of 0.75 is the same as odds of 3 to 1 — three successes for every one failure.
Odds ratios. A comparison of odds between groups.
For example: “Group A has 1.5 times the odds of success compared to Group B.”
We’ll revisit these topics later in the book, so don’t worry if it feels abstract for now.
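To show how these forms connect, here is a short R sketch (our own toy numbers, not taken from any study) that converts a probability into a percentage and into odds, and then computes an odds ratio for two hypothetical groups.

```r
# Probability, percentage, and odds for the same event
p <- 0.75
p * 100       # as a percentage: 75
p / (1 - p)   # as odds: 3, i.e. 3 to 1

# An odds ratio compares the odds of two groups
p_a <- 0.60   # hypothetical probability of success in Group A
p_b <- 0.50   # hypothetical probability of success in Group B
(p_a / (1 - p_a)) / (p_b / (1 - p_b))   # odds ratio: 1.5
```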
- Relationships are non-deterministic
- Every finding in our seven examples above is a tendency, not a law. Many daily Facebook users probably stayed perfectly cheerful; some employees thrived even in the five-day office; plenty of daily high-potency cannabis smokers never developed psychosis. Human behaviour is shaped simultaneously by biology, context, choice, and chance, so the same cause can lead to different outcomes for different people or at different times. Statistics gives us a probabilistic language — average effects, risk ratios, differences in means — to describe such probabilistic regularities without pretending they are iron rules or laws of nature.
- Sampling variability
- Every dataset used in the studies just described is a mere sliver of the general reality that the researchers ultimately want to consider. Kross’s 82 students are obviously just a tiny sample of the millions of daily social-media users; Bloom’s 1,612 Trip.com employees are just a tiny sample of the world’s white-collar workforce. Were these studies to be re-run, they would use different samples and the results would not be identical. A central statistical task, therefore, is dealing with this sampling variability so that we can draw conclusions about the larger reality from any small sample. Statistical techniques such as confidence intervals and p-values quantify how far a sample estimate (e.g., a two-point voter turnout bump) would be likely to vary if different samples were used, thus allowing researchers to speak cautiously yet meaningfully beyond the data at hand. A toy simulation after this list makes the idea of sampling variability concrete.
- Data is noisy
- Even if the underlying tendency is strong, the measurements we record are noisy representations of reality. A coder’s number of lines of code does not perfectly measure their productivity; a Facebook user’s self-report does not perfectly measure their momentary happiness; a cannabis user’s self-report does not perfectly measure the frequency of their cannabis exposure. Noise is not just a matter of imprecise measuring instruments. There is also natural variability: Facebook use does not occur like clockwork, voter turnout in any county bounces around from year to year, and a pupil’s exam performance can be higher or lower from one exam to the next. All this noise and variability blurs the true signal. By using models that summarize the typical link between variables, together with a margin of error that accounts for noise, researchers can say how much of what they see is likely to be real and how much could be random fuzz.
- Correlation without causation
- Even when there is a reliable relationship between two variables, what causes what can be unclear and misleading. Sometimes there is a hidden factor, which we call a confound, that affects both the presumed cause and its effect, creating a spurious link between them. Maybe people with higher anxiety both use Facebook more and report lower mood. Maybe childhood adversity leads people to both more heavy cannabis use and also to mental illness. Maybe more prosperous areas foster both cross-class friendships and increased earnings over time. Statistical methods provide tools to control these possible spoilers so that we don’t mistake coincidence for causation. For example, Di Forti’s international study measures parental education, unemployment, and other disadvantages, then adjusts for them in the model, showing the cannabis-psychosis link survives even after those common roots are removed. Or in Kross’s Facebook study, each participant is compared with themselves across many pings, so any constant background factors like personality drop out.
- Multiple causes
- Even if we deal with confounding, we often discover that several genuine causes combine to produce the same outcome at the same time. Think of the school-closure study: any given child’s progress may have been affected both by lessons moving online and by the parents’ level of education. In the cannabis study, psychosis risk may increase because of the frequency of drug use, or its potency, or both. Each of these potential factors is real, none is a mere statistical illusion, and they are not independent of one another. Social scientists’ statistical tools can help handle these many-drivers problems. For example, in their statistical models, the Dutch researchers include separate terms for weeks out of school and parental education. The resulting effect for school closure then tells us how much scores would have dropped had parental education been held constant. In the cannabis study, daily use raises psychosis risk more sharply when the product is high-THC, and statistical models showed that potency modifies the basic cannabis effect. Modelling multiple real causes allows us to see how several causal levers operate together, sometimes reinforcing, sometimes dampening one another.
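Here is the toy simulation of sampling variability promised above: an R sketch (using entirely made-up numbers) that draws several samples of 82 people from a large hypothetical population and shows how the sample average bounces around from one sample to the next.

```r
# Each sample of 82 people gives a slightly different average, purely by chance
set.seed(2025)
population <- rnorm(1e6, mean = 50, sd = 10)  # a large hypothetical population
sample_means <- replicate(5, mean(sample(population, size = 82)))
round(sample_means, 1)  # five slightly different estimates of the same population mean
```

Quantifying how much such estimates vary from sample to sample is exactly what standard errors and confidence intervals, introduced in Chapter 4, are for.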
In short, while physics may sometimes get by without statistics, whenever people, policies, and free will are involved, careful statistical modelling and inference are the only way to move from the limited, noisy, complicated data that we can observe to the larger truths we hope to understand.
1.3 What is statistics, really?
When most people hear “statistics” they think of a compendium of facts, often about countries: the unemployment rate, average family size, a country’s GDP. That meaning traces back to the word’s eighteenth-century roots: the German word Statistik literally meant “description of the state.” Modern scientific research keeps the name but changes the focus. In this context, statistics is first and foremost a scientific method: a set of ideas for using limited, messy data to learn something general about a reality that we can never observe in full or directly. It will take this whole textbook to properly introduce statistics as a scientific method, but as you’ll see, it often revolves around a small set of interrelated core concepts or ideas: populations and samples, models, inference, and uncertainty. Let’s now preview each of these ideas in turn.
- Populations and samples
- The concept of a population may seem like a very familiar one. A population is just the set of people in some place, right? Well, not exactly. In statistics, this term has a more general and technical meaning. It refers to the complete set of all individuals, objects, or measurements that we’re interested in studying or making conclusions about. It’s the entire group that our research question tries to generalize to. Consider Kross’s Facebook study. The researchers’ data pertains to sporadic moments over two weeks in a sample of just 82 Michigan students, yet their research question pertained to a much larger phenomenon: every moment of social-media use and mood for all young adults, anywhere in the world, on any day. That vastly larger set is the population in their study. The same distinction holds for Bloom’s hybrid-work experiment. Trip.com supplied 1,612 employees in Shanghai over a six-month period in the early 2020s, but the target population is every white-collar job where a hybrid work schedule is feasible, from tech firms in Seattle to tax accountants in Manila, at any time. Likewise, Thompson’s vote-by-mail study analyzes 1,240 county-election records from three states, yet the population of interest is all U.S. elections, present and future, that might adopt universal mailed ballots. Even the large Dutch school-closure dataset, with its 350,000 pupils, captures only a sliver of the conceptual population: all children worldwide whose schooling was disrupted, or could have been disrupted, by a pandemic, real or hypothetical. In all these cases, the data is said to be a sample from a population. The population is what we are ultimately interested in understanding, but the data is the empirical evidence or observations that we have. The machinery of statistics is there to bridge the gap between the sample and the population; to allow us to use the limited, messy sample we can actually observe to make defensible, explicitly uncertain statements about the far larger population that we ultimately care about.
- Models
- Because social life is neither clockwork nor perfectly measured, we always describe phenomena using averages, trends, or tendencies plus some variation or wiggle room. We usually put these descriptions together in simple mathematical or formal terms, and this is what we call a statistical model. For example, Kross’s study tells us that mood tends to dip as recent Facebook usage rises; Bloom’s study says that quit rates tend to fall when staff get two days at home; Di Forti’s study says that frequent cannabis use leads to higher chances of psychosis. In all these cases, these relationships are described by models that describe the average pattern or tendency while always allowing that any single person at any given time can and will behave differently. Models allow us to address many research questions simultaneously. For example, they can describe the typical relationship or trend (“universal mailed ballots boost turnout by about two points”). They can be used to compare scenarios (“how much bigger is the effect for daily high-THC users than for weekly low-THC users?”). They also acknowledge uncertainty. The wiggle term turns into margins of error that say, in essence, “Here is how far the real world might drift from our average rule.”
- Inference and Uncertainty
- Once a model is fitted to the data it returns two headline results. The first is what we usually call an estimate or an effect size. This summarizes the strength of the relationship. For example, daily strong-THC use multiplies psychosis risk fivefold; pre-trial detention leads to a 15 percentage-point increase in the probability of conviction. The second result quantifies the uncertainty of these estimates or effects, because all findings are subject to random variability arising from the inherent randomness of the sample. The Dutch learning-loss estimate, the cannabis study risk ratio, the social network economic connectedness measure, and so on, are all subject to uncertainty and are expressed using carefully calculated confidence intervals. These data analysis steps, which are often put under the general heading of statistical inference, remind us that every conclusion drawn from a sample is provisional, always subject to the possibility that another random slice of data would have looked a bit different.
Put together, these ideas—populations, samples, models, inference, and uncertainty—form the backbone of statistical reasoning. Whether the raw data arrive from a randomized experiment, a natural policy change, a survey, or an administrative file, the social scientist’s task is the same: build a transparent model of the relationship of interest, use it to infer a population-level answer from a sample, and report how certain (or uncertain) that answer is. The rest of this book unpacks the tools that make this all possible.
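As a small preview of what this looks like in practice, the R sketch below (simulated numbers, not data from any of the studies above) estimates a mean from a sample and reports a 95% confidence interval for the corresponding population value.

```r
# Estimate a population mean from a sample and quantify the uncertainty
set.seed(42)
scores <- rnorm(100, mean = 25, sd = 5)  # a simulated sample of 100 measurements
mean(scores)                             # the sample estimate
t.test(scores)$conf.int                  # a 95% confidence interval for the population mean
```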
1.4 Statistics with Computers
Modern statistical thinking in the social sciences can trace its roots to R. A. Fisher’s landmark Statistical Methods for Research Workers, published almost exactly 100 years ago in 1925. In that slim volume, Fisher wove together ideas that had been scattered across the previous century—sampling distributions, standard errors, \(t\) and \(\chi^2\) tests, p-values, the logic of hypothesis testing—into a single toolkit that researchers in psychology, sociology, and economics would eventually put to work in the lab or in the field. Many of these core ideas still serve as the basis of introductory statistics courses or textbooks, including this one. For the next fifty years, however, applying Fisher’s toolkit was a slog. Before the 1970s, in every field of science, you would have found graduate students bent over adding machines, punching in columns of numbers one at a time, then leafing through phone-book-sized tables to look up p-values. A single analysis of the kind we will cover even in the early chapters of this book could take them days.
Fast-forward to today: the Dutch learning-loss project involves test results of hundreds of thousands of children; the Facebook-mobility study crunches 21 billion friendship ties. Statistical modeling that once required a small army of analysts now finishes in milliseconds because we hand off the calculations to computers. This partnership of statistical reasoning powered by modern computation is what makes today’s statistics possible. It lets social scientists pose questions that would have been unthinkable in Fisher’s era and, just as importantly, makes rigorous analysis accessible to every student with a laptop.
When we talk about computation in statistics, we mean everything that has to happen between a raw data file and a persuasive conclusion. This involves major steps like the following:
- Data wrangling. Chetty’s team had to merge huge social-media friendship graphs with tax files; Engzell had to work with four years of Dutch exam data. Software turns those scattered, disparate sources into a single, analysis-ready table without the previously required months of effort by teams of assistants.
- Statistical modeling. Whether you are fitting a simple “Facebook minutes predicts mood” model or a more complex multiple-predictor model, the calculations involve numerical matrices with thousands or even millions of rows. Computers can do the necessary calculations in seconds.
- Visualization. A quickly created plot could reveal complex patterns in your data that you would never otherwise notice. These plots can guide modeling choices and be used to explore the validity of our models. Graphics are not decoration; they are how we catch mistakes and spot new patterns.
- Communication and reproducibility. A single file with analysis code can load raw data, clean it up for analysis, run the models, create the tables and figures of results, and insert them directly in our reports. Now anyone—your co-author, your supervisor, or just your future self — can rerun the script and see exactly how each number was produced.
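To give a flavour of what such a script looks like, here is a minimal sketch of a complete pipeline (import, clean, model, visualize) using a hypothetical data file and variable names of our own invention.

```r
# A minimal, fully reproducible analysis script (file and variable names are hypothetical)
library(tidyverse)

survey <- read_csv("survey_raw.csv") |>          # import the raw data
  filter(!is.na(wellbeing)) |>                   # drop incomplete responses
  mutate(hours_online = minutes_online / 60)     # derive a new variable

model <- lm(wellbeing ~ hours_online, data = survey)  # fit a simple linear model
summary(model)                                        # inspect the estimates

ggplot(survey, aes(x = hours_online, y = wellbeing)) +  # visualize the relationship
  geom_point() +
  geom_smooth(method = "lm")
```

Anyone with this script and the raw data file can rerun every step, from import to figure, and obtain exactly the same results.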
Seen this way, advances in computing are not separate from statistics; they enable the modern way we do statistics. Classical ideas such as sampling, modelling, and inference now run hand in hand with the power of computers, which let us clean gigabytes of survey data in seconds, fit models with millions of parameters, and turn results into interactive graphics.
Modern computers are statistics’ power source, not a replacement. Without them, none of the large-scale studies highlighted in this chapter could even have been attempted. With them, the same analytical capacity is available to any undergraduate armed with a laptop and a curiosity about how society works.
The chapters ahead will show you how to harness that power—how to clean data, build and check models, visualize findings, and report them—so you can ask bigger, sharper questions about the social world and answer them convincingly.
1.5 What is R and why should we use it?
Imagine a freely available toolkit that can, in one single command, read in a jumble of survey spreadsheets, clean and reshape them in seconds, perform any conceivable statistical analysis, produce publication-ready plots, and automatically insert all analysis results, including tables and figures, into a report or webpage. That, in a nutshell, is what R offers. It is the data analysis software we will use throughout this book, and here we will explain why it is worth learning even if you have never typed a line of computer code, and never wanted to either.
R is a coding language and a computing environment designed from the ground up for statistics, data analysis, and graphics. It started in the early 1990s as an academic project in New Zealand, borrowing ideas from an earlier statistics language called S. By the early 2000s it had spread widely across statistics-focused departments in universities. Today it is a mainstream tool in government agencies, NGOs, finance and tech companies and, increasingly, the social-science departments where many of you will work or study.
Put in everyday terms, R is software for statistical data analysis. But there are dozens if not hundreds of software packages or services that focus on data, so why should we use R? For us, the case for R boils down to three very compelling points: it is free and open-source software; it is extremely powerful; it is extremely widely used.
1.5.1 Free, open and everywhere
One of R’s main attractions is that it is open-source software that is freely available to everyone; it costs nothing to use and is available via a license that guarantees that it will always remain free and publicly available, no matter how much it changes in the future. In a sense, no one, or rather everyone, owns it. You can download it legally, keep it forever, and install it on as many machines as you like. It is widely used on all the major platforms like Windows, macOS, Linux, but it can in fact run on every device from a mobile phone to a supercomputer. Hundreds of volunteer developers worldwide maintain and improve the core, and thousands more contribute add-on packages. You are not locked into proprietary formats, yearly licenses, or vendor upgrades; the analyses you write today will still run in years or even decades from now.
The profound practical consequences of R’s cost and licensing may not be obvious initially. But almost every digital tool you already use comes with hidden strings, barriers, or limitations of some kind. The apps or web services we use often offer a “free” tier until the ads and data-tracking push us to a paid plan. A cloud spreadsheet lets you analyze a few thousand rows, but then walls off essential functions behind a monthly fee. A proprietary stats package may seem perfect when you are a student and have a student license or your institution has a site license, but then it prices itself out of reach when you leave campus or shift to another job where the heavy license fees are unsustainable. Worst of all, a company behind software can simply kill a product or service and all your carefully crafted files from years of work become obsolete overnight.
R is immune to all of these problems. Its core is released under the GNU General Public License, which imposes one powerful rule: any new version or spin-off must remain freely redistributable, with source code visible to everyone. No firm can buy the project and restrict its openness and charge fees to use it; there will never be free and paid tiers. If development stopped tomorrow (unlikely, given the thousands of contributors) the current source could still be compiled, shared, and improved by anyone. That permanence matters when you intend to keep research scripts for a decade or archive them for replication.
Because R is free and open software, we can take for granted many advantages that are impossible or not guaranteed with most computing products or services. There is no surveillance economy. It does not track you, harvest your data, or push ads. There is no platform or vendor lock-in. The same script runs on Windows in the campus lab, macOS on your laptop, a Raspberry Pi in a field station or a 10,000-core Linux cluster in the cloud. If you switch machines, everything keeps working. Its open source code also provides complete transparency about everything it does. For every analysis or result you obtain, you can always inspect, and even alter, every function or piece of code that produced it. That level of transparency and auditability is impossible with proprietary software that runs as opaque binary code or on a server that you do not control. There is also collective insurance. Should a critical bug appear, hundreds of volunteers are at hand to fix it, and new features appear almost immediately whenever the community needs them; we’re not waiting for managers to decide on the business case for adding updates or fixing bugs.
For students and researchers, whose budgets are usually tight, and who must share code openly or need their analyses to be usable for years into the future, these guarantees are more than ideological niceties. They are practical safeguards that let you focus on the science, not on budgets, or on the inherent limitations of a license or service, or on its uncertain future beyond your control.
1.5.2 A power tool for data analysis
R provides an almost unlimited set of tools for data analysis. There is virtually nothing in the realm of statistics or data science that cannot be done using R. Out of the box, you get the same core toolkit found in almost all statistics packages: linear and generalized linear models, ANOVA, non-parametric tests, time-series analysis, high-quality graphics, and hundreds of functions for data manipulation and statistical calculation. In other words, R ships with the “batteries included.” But R’s real power comes from its massive add-on package ecosystem. There are over 22,000 packages on the official repository (known as CRAN), plus countless more on other public code repositories like GitHub. These packages extend the core R functionality into virtually every corner of modern data science. And whenever a new statistical method is published in a scientific journal, the chances are high that the authors have already implemented it in an R package or that one will soon become available.
You can think of these R packages like the apps you use on your smartphone. Whenever you need something new, you just install it directly into R by issuing a simple command. And just like we say “there’s an app for that” to highlight the vast array of apps readily available to us for almost everything, when it comes to any statistical method or problem, chances are that “there’s a package for that” too. Unlike mobile apps, however, every part of the R package ecosystem is free and open-source. As with the rest of R, for each of these packages, you pay no license fees, there are no free versus paid tiers, no ads, no data tracking, and you can inspect, modify, or redistribute the code.
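For example, installing a package from CRAN and then loading it for use takes one short command each; here we use the widely used tidyverse collection as an example.

```r
install.packages("tidyverse")  # download and install the package from CRAN (done once)
library(tidyverse)             # load it at the start of any script that uses it
```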
While R and its package ecosystem give us a vast statistical toolbox, beginners may worry that R will be overwhelming and that they must master everything before they can do anything at all. Happily, that isn’t how learning R works. Think of R more like a huge and fully stocked hardware store: you don’t need to master every tool and gadget, or even know what each one does, in order to learn how to tighten a single bolt. Even veteran users tap into only a fraction of what R offers. It is built for incremental learning. Start by learning a small, consistent core. Many functions follow the same naming and argument conventions, so once you recognize one pattern you start to recognize them everywhere. Gradually, add new capabilities as and when you need them. Expanding your toolkit happens gradually and on demand. It does not happen all upfront. You can postpone everything beyond the basics until the moment it becomes useful. Each time you expand to new tools, you build on what you already know and the learning becomes ever faster. New methods usually resemble the ones you have used before, so each step feels like an intuitive incremental extension, not like being set back to square one.
1.5.3 Increasingly widely used
Although R began life as an academic project, the numbers show that by 2025 it has become one of the most widely used analytic platforms in the world, both within and beyond academia. There are many hard facts that attest to its pervasiveness and growth.
As mentioned, the CRAN repository alone now hosts more than 22,000 add-on packages and has been growing exponentially since the early 2000s. About 10 years ago, in April 2015, CRAN listed only about 6,200 packages. Going from 6K to 22K in ten years works out to roughly 13–14% compound annual growth. A broader longitudinal study covering R package repositories beyond CRAN estimates the average growth of active R packages at around 29% per year, sustained for two decades (Bommarito & Bommarito, 2021).
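The compound-growth arithmetic behind that first figure is easy to check with a single line of R:

```r
# Average annual growth implied by going from ~6,200 to ~22,000 packages in ten years
(22000 / 6200)^(1 / 10) - 1  # roughly 0.135, i.e. about 13-14% per year
```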
The Journal of Statistical Software is a leading peer-reviewed academic journal that publishes research on statistical software development and computational methods. The journal publishes articles on statistical software along with the source code of the software itself and replication code for all empirical results. While the journal accepts articles describing methods implemented across various programming languages, R is the dominant language featured in its publications. This pattern reflects R’s central role in implementing cutting-edge statistical methods.
In terms of users, Posit, a company that has invested heavily in developing R packages, estimates that many of its open-source package families have each had billions of cumulative downloads (Posit PBC, 2024). They also estimate that the RStudio IDE for R is used by millions of people weekly, giving a conservative lower bound for the active R user base. In programming-language popularity rankings such as TIOBE, RedMonk, and PyPL, R has for years consistently been in or near the top 10 most popular programming languages of any kind, well above other statistics-specialized languages like SAS and Stata.
1.5.4 But is coding hard?
To do any analysis in R, you write commands. In some other statistics packages, such as SPSS, which is traditionally very widely used in social sciences, analyses are chosen from menus and options are set with clicks in dialog boxes. For many people, especially students beginning their journey into data analysis in the social sciences, writing code may seem like it will be much more difficult and frustrating than using menus and clicking boxes. We are familiar with this concern, having felt the same way when we were students first learning data analysis, but ultimately we think it is a mistaken initial impression.
Calling R a programming language makes it sound very intimidating, but in day-to-day use for almost all users it behaves more like an advanced calculator: you type a short instruction (essentially saying “do this to that data”), press Enter, and get your result. These single-line commands are no more complicated than the formulas many people regularly use in Excel. If you have ever seen or written something like =AVERAGE(A1:A10) in a spreadsheet, you already have the mental model you need for R commands. The concepts that people associate with “real programming”—iterations, conditionals, functions, and the like—are best seen as optional extras for more advanced and experienced users. Many competent R users never do this kind of programming in R, and these constructs will play no role whatsoever in this book.
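To make the parallel concrete, here is the R counterpart of that spreadsheet formula (the variable name and numbers are just an example).

```r
# The R equivalent of the spreadsheet formula =AVERAGE(A1:A10)
scores <- c(72, 85, 90, 66, 78, 81, 95, 70, 88, 74)  # ten example values
mean(scores)                                          # their average: 79.9
```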
Graphical menus and dialog boxes may be easy for selecting different tests, but the moment you need to do something like recode 200 survey items, merge multiple files, reshape data sets, or fix the same typo in 30 columns, this workflow turns into a tedious slog of endless manual operations. By contrast, a short sequence of R commands handles those chores in one pass. These commands can be saved to a file, tweaked, and rerun whenever new data arrive. What feels easy and natural on day one often proves slower and more error-prone as time goes by.
Think of writing in code like writing a lab notebook. Every command you run is a permanent, human-readable record of what happened to the data, from raw data import to final figures. If you need to update a plot or re-do an analysis after the dataset changes, just re-run your commands. By putting your code for any given analysis in one file, which we call a script, you can run the whole script with one command (or even a single mouse click). With this, every step of your entire complex analysis sequence can be re-run, often in seconds. With purely graphical interactive software, the analytic steps live only in your memory (or a screencast); recreating them means retracing every menu and checkbox by hand, hoping you don’t miss one.
Because an R script can be shared as plain text, collaborators and reviewers can inspect, rerun, or audit the work line by line. That transparency is fast becoming an expectation in both research and industry. Graphical statistics packages will often export output, but not the entire pipeline from data import to finished product, making verification cumbersome or impossible.
As mentioned earlier, you learn R incrementally, not all at once. The first dozen or so R commands you learn will cover around 80% of your everyday analyses. Each additional snippet then builds on familiar patterns, so the skill curve slopes upward smoothly rather than suddenly throwing you into a complex world of programming. The return on that small upfront investment is a workflow that scales from your first homework assignments to full-scale research projects.
Menus and buttons may look friendlier, but for anything beyond some simple one-shot analysis, we believe they lock you into repetitive manual labor and an opaque audit trail. R’s command-driven approach trades a small amount of initial typing for speed, accuracy, and reproducibility. These are advantages that grow exponentially as your projects, and your ambition, expand.
1.6 Who we are and why we wrote this book
We are Mark Andrews and Lucy Justice, both academics in the School of Social Sciences at Nottingham Trent University. Our academic journeys began like those of most psychology or social science students, and we learned statistics initially from the same introductory stats modules that most social-science students meet in their first years as undergraduates. Along the way, to our surprise, we found ourselves gravitating towards specializing in statistics and data analysis. Today, all our university teaching is about statistics, which we teach at different levels, from first-year undergraduates to specialized MSc students to advanced analyses for PhD students and early career researchers.
Although we specialize in data analysis, we still see this as just a tool for doing science and not as an end in itself. We value practical insight over mathematical or computational pyrotechnics, and we remember vividly exactly what it felt like to stare at a data set not knowing where to even begin with the analysis. That perspective hopefully guides everything in this book: plain language, practical focus, no more technicalities than necessary, and plenty of real-world examples.
We wrote this book because we found that too many introductory textbooks were at odds with how data analysis is now done in practice. Modern data analysis, even for relatively simple or routine problems, usually involves some or all of the following:
- cleaning and reshaping messy data
- visualizing data before, during, and after modeling
- fitting and evaluating different statistical models
- mixing classical and Bayesian ideas
- keeping the whole analysis workflow transparent and reproducible
Modern computer hardware and software like R have made all of these steps straightforward, and they have now become part of the normal workflow of most researchers and data analysts regardless of their focus or specialty. However, the introductory stats curriculum often proceeds as if these developments never occurred.
While teaching statistics we began writing short handouts to guide students. Those pages grew into full lecture notes and, eventually, into the book you are reading. As the material evolved, a handful of guiding principles kept us on track:
Connect old staples with new workflows. Classical statistical concepts and ideas — p-values, confidence intervals, regression, ANOVA — still matter, but they live inside a wider habit of model-building, visual exploration, and iterative refinement.
Keep the code lean. Most examples should fit on a few lines, or just one, and should feel no more complicated than a spreadsheet formula. Heavier programming constructs (iterations, custom functions, building packages) can wait for later courses.
Make every step visible and reproducible. From raw data to final figure, the path is a single, runnable script you can inspect, rerun or tweak. Everything is transparent. No hidden steps. Everything we cover in the book can be followed and reproduced exactly, line by line, by those reading this book.
Minimize mathematical technicalities. Clear prose, plenty of examples, and just enough math to explain how the method works.
Treat data wrangling and visualization as first-class citizens. Real projects live or die on cleaning messy files and spotting patterns; we give those tasks the space and prominence they deserve.
Favor practice over jargon. Short explanations followed by hands-on examples let you move quickly from concept to application.
These principles shape every chapter. The result, we hope, is the book we wished for as students: practical, modern, transparent—and friendly to anyone taking their first steps in statistics with R.
1.7 What we cover and why
This book has three parts and 14 chapters overall.
1.7.1 Part I: Foundations
Part I is all about introducing the fundamentals of data analysis. It’s about getting you familiar and comfortable with R, with the core ideas and principles of descriptive and inferential statistics, and with cleaning and visualizing real-world data. By the time you reach the end of this part you’ll know how to import raw data sets, clean them for analysis, visualize and explore them, and perform your first inferential statistical tests. Everything else in the book builds on these foundations.
Chapter 2 provides a gentle, step-by-step introduction to R and its companion editor, RStudio. It takes you from complete beginner to fully up and running: installing both R and RStudio, creating your first project folder, installing the few main packages we’ll use throughout the book, and running your very first commands. You’ll see how an R command is simply a plain-text instruction, no more complex than a spreadsheet formula, telling R what to do to or with your data. You’ll also see how an R script is just a file containing a series of commands that you can save, edit, share, and re-run at any time. We’ll finish by reading several common data-file formats into R with quick, single-line commands. Our aim here is not to show you everything R can do, but to get you comfortable with the basics—comfort that will carry you a long way and make learning everything else easier.
With the software in place, Chapter 3 shows you how to explore and summarize real datasets—often called exploratory data analysis (EDA). EDA is a crucial step in any analysis, no matter how sophisticated, yet it involves little more than looking and thinking. We’ll introduce major plotting and visualization tools such as histograms, boxplots, and scatterplots; these reveal the shapes of distributions and the relationships between variables. We’ll reinforce the visual patterns with numerical summaries that describe where the data are centered, how spread out they are, how symmetric (or not) they look, and so on. The goal is to show how raw numbers can be transformed into patterns your eyes and intuition can easily grasp.
Chapter 4 introduces the logic of statistical inference, or how we can make reliable generalizations from our data. Inference is a broad topic and, frankly, a challenging one, because some of its concepts are subtle and not always intuitive. Yet at its heart the fundamental ideas are simple, and this chapter presents them as plainly as possible. By using deliberately simple examples, we keep your focus on the main principles rather than on distracting technicalities. Key concepts such as sampling distributions, standard errors, hypothesis tests, p-values, confidence intervals, and more will appear again and again in later chapters.
Finally, Chapter 5 tackles the messy realities that many statistics textbooks sidestep: real datasets are rarely born clean and tidy. Column names may be cryptic, many columns or rows may be superfluous, data can be spread across multiple files, or be stored in the wrong shape for analysis. Here you’ll meet a concise toolkit of data-wrangling verbs, each specialized for a crucial task, and you’ll learn how to chain them together into pipelines. Mastering these tools, often postponed in introductory courses (we think that’s a mistake), will save you immense amounts of time and frustration whenever you handle real data.
Together these chapters arm you with the essentials: a working R setup; the habit of exploring data visually and numerically; a firm grasp of why inference works; and a set of powerful tools for cleaning the data you’ll analyze in the rest of the book.
1.7.2 Part II: Linear Models and Friends
Part II brings together the statistical workhorses that power an enormous share of real-world data analysis. Many textbooks present these tools — t-tests, ANOVA, linear regression, and others — as items in a toolbox, each suited to a single, isolated purpose. We take a different approach: every method here is a variation on a single big idea, the linear model. Some chapters focus squarely on classical linear regression; others extend it by adding categorical predictors, allowing parameters to vary across groups, or swapping the normal distribution for a different one. Seeing the family resemblance from the start pays dividends: the same principles of modeling, estimation, and testing appear from Chapter 6 through Chapter 14, so each new topic feels like a step forward rather than a fresh mountain to climb.
We begin in Chapter 6 with the classic problem of comparing normal means, usually called the t-test. Here you’ll meet the normal distribution and the t-statistic, compute standard errors, confidence intervals, and p-values, report effect sizes, and see how power analysis guides sample-size planning.
In Chapter 7, we move on to simple linear regression, where a single predictor explains or predicts a single outcome. Geometry meets statistics as intercepts and slopes form the backbone of this immensely useful model. You’ll also see how the t-test from Chapter 6 is simply regression with a binary predictor — an early proof that these methods are all variations on a theme.
Chapter 8 extends this framework to multiple regression, where many predictors can be considered at once. This lets you control for confounds, isolate unique effects, and build richer explanations, while also tackling issues such as multicollinearity, model comparisons, and interactions.
In Chapter 9, we reintroduce ANOVA, but show it as part of the same general linear model family. Here categorical predictors, sums of squares, contrasts, post-hoc tests, and factorial designs all fit naturally into the framework you already know.
Chapter 10 turns to repeated-measures designs, the kinds of data where each participant contributes multiple observations. We begin with traditional repeated-measures ANOVA, but quickly connect it to multilevel models, which relax assumptions, cope with missing data, and allow for subject-specific trajectories.
In Chapter 11, we step more fully into multilevel (mixed-effects) linear models. These are essential whenever data are clustered — pupils within classes, reaction times within participants, schools within cities — and allow intercepts and slopes to vary across groups while quantifying that variation.
Chapter 12 introduces logistic regression, the extension of the linear model to categorical outcomes. Here you’ll learn how to model the probability of binary, nominal, or ordinal outcomes as a function of predictors, using a different link function and distribution to match the data’s nature.
Finally in Chapter 13, we introduce models for count data, which often arise in the form of numbers of events or occurrences. You’ll see why the normal model fails for such data, and how Poisson regression, negative binomial models, and zero-inflated models provide better fits while following the same modelling principles you already know.
1.7.3 Part III: Bayesian Methods
Part III turns to a different way of thinking about statistical inference, one that has grown rapidly in influence across the sciences. Bayesian methods approach uncertainty by placing probability distributions directly on parameters, combining prior knowledge with data to form a posterior distribution. This perspective offers a flexible and coherent framework for estimation, prediction, and model comparison.
In Chapter 14 we start with the core ideas of Bayesian inference: prior distributions, likelihoods, and Bayes’s theorem. From there, we move to practical Bayesian modeling in the kinds of regression frameworks you mastered in Part II, including multiple predictors, interactions, and hierarchical models. You will learn how Markov chain Monte Carlo (MCMC) methods make these models computationally feasible. This single chapter is designed as both an accessible introduction for those new to Bayesian thinking and a bridge to more advanced work should you choose to pursue it.
1.8 Further Reading and Listening
If you’d like to explore beyond this chapter, here are some friendly starting points:
Statistics Without Tears by Derek Rowntree.
First published in the 1980s but still a classic, this short book takes you through the big ideas of statistics using plain language and everyday examples. There are no formulas, no software, and no heavy maths — just clear explanations of what concepts like averages, significance, and correlation really mean.
More or Less (BBC Radio 4 podcast).
A weekly radio show and podcast that takes claims in the news — from the size of the UK population to whether eating chocolate helps you live longer — and subjects them to statistical scrutiny. The episodes are short (10–30 minutes), witty, and show how statistical thinking works in real debates and controversies.
How to Make the World Add Up by Tim Harford.
A highly readable book by the FT’s “Undercover Economist,” this is about how to think clearly when faced with numbers in everyday life.
Each chapter introduces a principle and illustrates it with engaging stories from history, politics, and economics.
For social science students, it’s a lively reminder that statistics isn’t an abstract classroom exercise but a practical way to interrogate claims, uncover patterns, and understand the forces that shape society.
1.9 Chapter summary
What you learned
- Why statistics is indispensable in the social sciences, where outcomes vary across people, places, and time.
- A modelling-first approach: we examined populations vs samples, statistical models, uncertainty, and inference, and discussed why they matter for drawing cautious conclusions beyond the data at hand.
- Where uncertainty comes from: non-determinism, sampling variability, measurement noise, confounding, and multiple causes.
- How computation powers modern practice: wrangling, visualising, modelling, communicating, reproducibility.
- Why we use R: free, open, powerful, widely used, excellent for transparent workflows.
Common pitfalls to avoid
- Treating a sample pattern as a law for the population.
- Confusing correlation with causation, or ignoring competing explanations.
- Forgetting that measurements are noisy, which can blur or inflate effects.
- Over-interpreting a single p-value or a single confidence interval in isolation.
Takeaway
Statistics is the craft of using imperfect samples to say something careful about larger social realities.
Models provide the scaffold, computation does the heavy lifting, and uncertainty is reported rather than ignored.
Next, you will put this into practice with R.