
Bayesian data analysis and political journalism

(Accurate) storytelling with data

G. Elliott Morris

Mar 13, 2023 | Ithaca/New York, NY

1 / 40

2 / 40
  • We're going to go through a few case studies of my work, including: {click through}

Goals in data journalism:

1. Analyze a subject that is newsy or noteworthy

  • Elections, voting, specific policy votes

2. In a way that is novel or visually striking;

  • E.g., use multi-level regression and post-stratification to predict support for abortion rights in each state

3. Or in a way that beats the competition

  • Election forecasts that are fully Bayesian
3 / 40
  • But before that, let's talk about our goals. There are three:

    1. To analyze a subject that is newsy or noteworthy
    2. To do that in a way that is novel or visually striking (or, sometimes, just newsy)
    3. Or in a way that beats the competition

Goals in social science:

Similar to the goals in data journalism!

1. Identify a phenomenon

  • Maybe it is a gap in the literature or a new development

2. Measure it

  • E.g., with a survey

3. Explain it

  • Often involves some level of modeling or prediction
  • E.g., a randomized experiment or regression
4 / 40
  • Goals of data journalism are not so different from the goals of social science

  • Adapted here from the book Data Analysis for Social Science by Elena Llaudet and Kosuke Imai

  • So I hope the presentation can help students decide whether they want to go into journalism, which has the distinguished characteristic of being one of the few industries that probably offers only marginal returns over becoming an academic

Case study one: wacky polls

Method: weighting survey data

5 / 40

1. Weighting survey data

6 / 40

1. Weighting survey data

  • Story: Outlier poll getting a lot of attention. Fishy results, shady firm.

  • Novelty: Asked pollster for their data, re-weighted it to generate results.

  • Explanation: Pollsters repeating mistakes of 2016, not weighting data by education

  • Explanation is visually simple

7 / 40
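The re-weighting step described above can be sketched numerically. This is a toy illustration with invented shares and support rates, not the pollster's actual data: each group's contribution to the topline is scaled from its sample share to its (assumed) population share.

```python
# Hypothetical re-weighting of a poll by education (all numbers invented).
# The raw topline uses the sample's composition; the weighted topline
# replaces it with the target electorate's composition.

sample_share = {"college": 0.55, "no_college": 0.45}   # over-represents graduates
target_share = {"college": 0.40, "no_college": 0.60}   # e.g., from Census benchmarks

# Raw support for a candidate within each education group
support = {"college": 0.42, "no_college": 0.56}

# Unweighted topline: weight groups by their share of the sample
raw = sum(sample_share[g] * support[g] for g in support)

# Education-weighted topline: weight groups by their share of the electorate
weighted = sum(target_share[g] * support[g] for g in support)

print(round(raw, 3), round(weighted, 3))
```

With these made-up numbers the re-weighting moves the topline by about two points, which is the order of magnitude the education adjustment mattered in 2016-style polling misses.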
  • Explain process for story

  • While not a story where we needed Bayesian analysis, it could have taught us more:

1. Weighting survey data

Potential Bayesian iteration?

  • Train regression model to predict voting ("target-estimation")
  • Use features with large coefficients to weight the poll

Drawback?

  • Of marginal utility journalistically
8 / 40

Case study two: hypothetical elections, demographic patterns, uncertainty

Method: Multilevel regression and post-stratification (MRP)

9 / 40

2. Mister-P

Answer questions like:

  • What would happen if everyone in America voted?
  • Which groups support Joe Biden more than Hillary Clinton?
  • How can we better measure uncertainty in polling?
10 / 40

2. Mister-P: If everyone voted

Guiding questions

1. How many Democrats and Republicans are there?

Given data constraints, we're really asking: How many Clinton and Trump voters are there?

2. How are they distributed geographically?

The answer lets us assign Electoral College votes.

11 / 40

2. Mister-P: If everyone voted

Data

1. Cooperative Congressional Election Study (CCES): A survey of 64,000 Americans

Includes demographic data and 2016 vote choice for 40,000+ validated voters

2. American Community Survey (ACS): A Census Bureau survey of 175,000 Americans

Includes the same demographic data as the CCES, yielding 380,000 “cells”

12 / 40

2. Mister-P: If everyone voted

Method

1. Train a predictive model on CCES data

  • Multi-level logistic regression
  • Predict vote choice with: age, gender, race, education, region and interactions between them

2. Use the model to predict voting habits for every eligible American

Via “post-stratification” on the ACS

13 / 40

2. Mister-P: If everyone voted

ACS Post-stratification

1. Each "type" of person gets their own "cell":

  • One cell for white men ages 18-30 without college degrees who live in the Northeast
  • Another for non-white men ages 18-30 without college degrees who live in the Northeast
  • etc.

2. We know how many voters in that "cell" live in each state

3. So we can say that x% and y% of each "cell" vote for Clinton or Trump, then add up

  • For example, a Latina aged 18-30 with a college degree in Texas is 85% likely to vote for a Democrat for president, and there are 20k of them
14 / 40
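The add-up in step 3 is just a population-weighted average over a state's cells. A minimal sketch, with invented cell counts and support rates standing in for the ACS counts and model predictions:

```python
# Minimal post-stratification sketch (all counts and probabilities invented).
# Each ACS "cell" carries a population count and a model-predicted
# P(vote Democratic); the state estimate is the count-weighted average.

cells = [
    # (state, cell population, predicted Democratic support)
    ("TX", 20_000, 0.85),   # e.g., Latina women 18-30 with a college degree
    ("TX", 50_000, 0.40),
    ("TX", 30_000, 0.55),
]

def poststratify(cells, state):
    """Population-weighted average of cell-level predictions for one state."""
    total = sum(n for s, n, _ in cells if s == state)
    return sum(n * p for s, n, p in cells if s == state) / total

print(round(poststratify(cells, "TX"), 3))
```

The same weighted average, run on every state's cells, is what produces the Electoral College tallies in the "if everyone voted" scenario.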

2. Mister-P: If everyone voted

15 / 40

2. Mister-P: If everyone voted

16 / 40

2. Mister-P: If everyone voted

17 / 40

2. Mister-P: Polling uncertainty

Source: Groves et al., 2009

18 / 40

2. Mister-P: Polling uncertainty

Traditional margin of error only covers one source of error (sampling)

We can use MRP to take non-response and adjustment into account too

Via the posterior predictive distribution

19 / 40
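For reference, the traditional sampling-only margin of error mentioned above is the familiar normal-approximation formula, which is exactly what the wider MRP intervals are being compared against:

```python
import math

def sampling_moe(p, n, z=1.96):
    """Classic 95% margin of error from sampling variance alone:
    z * sqrt(p * (1 - p) / n). Ignores non-response and adjustment error."""
    return z * math.sqrt(p * (1 - p) / n)

# A 1,000-person poll at 50/50 gives the familiar "plus or minus 3 points"
print(round(sampling_moe(0.5, 1000), 3))
```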

2. Mister-P: Polling uncertainty

In brms syntax:

20 / 40
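The code from this slide is not reproduced in the export. A multilevel logistic model of the kind described earlier might look roughly like this in brms; the variable names are illustrative assumptions, not the original code:

```r
# Illustrative brms specification (variable names assumed, not the original):
# varying intercepts for demographic groups, an interaction, and state,
# with a binomial outcome for vote choice.
fit <- brm(
  dem_vote | trials(n) ~ (1 | age) + (1 | race) + (1 | educ) +
    (1 | race:educ) + (1 | state),
  family = binomial(link = "logit"),
  data = cces
)

# posterior_predict(fit, newdata = acs_cells) then yields draws for every
# post-stratification cell, which is how the non-sampling sources of
# uncertainty propagate into the final estimates.
```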

2. Mister-P: Polling uncertainty

Model estimates of parameter uncertainty

21 / 40

2. Mister-P: Polling uncertainty

Posterior draws for every cell account for sampling, non-response, and adjustment error

22 / 40
  • Sampling, via the Bayesian logit updater (and survey weights)

  • Non-response, via adjustment back to survey frame

  • And adjustment error, via varying parameter estimates and partial pooling

  • Or, better yet, via Bayesian model averaging

2. Mister-P: Polling uncertainty

Wider error bars = truer measure of uncertainty in polling

23 / 40

Case study three: Fully Bayesian election forecasting

Method: Dynamic linear model (latent traits + measurement model)

24 / 40

3. Elections DLMs

Economist presidential model

1. National economic + political fundamentals

2. Decompose into state-level priors

3. Polls

Uncertainty is propagated through the models and incorporated via MCMC sampling in step 3.

25 / 40

3. Elections DLMs

It's just a trend through points...

26 / 40

3. Elections DLMs

(...but with some fancy extra stuff)

mu_b[:, T] = cholesky_ss_cov_mu_b_T * raw_mu_b_T + mu_b_prior;
for (i in 1:(T - 1))
  mu_b[:, T - i] = cholesky_ss_cov_mu_b_walk * raw_mu_b[:, T - i] + mu_b[:, T + 1 - i];
national_mu_b_average = transpose(mu_b) * state_weights;
mu_c = raw_mu_c * sigma_c;
mu_m = raw_mu_m * sigma_m;
mu_pop = raw_mu_pop * sigma_pop;
e_bias[1] = raw_e_bias[1] * sigma_e_bias;
sigma_rho = sqrt(1 - square(rho_e_bias)) * sigma_e_bias;
for (t in 2:T)
  e_bias[t] = mu_e_bias + rho_e_bias * (e_bias[t - 1] - mu_e_bias) + raw_e_bias[t] * sigma_rho;

//*** fill pi_democrat
for (i in 1:N_state_polls) {
  logit_pi_democrat_state[i] =
    mu_b[state[i], day_state[i]] +
    mu_c[poll_state[i]] +
    mu_m[poll_mode_state[i]] +
    mu_pop[poll_pop_state[i]] +
    unadjusted_state[i] * e_bias[day_state[i]] +
    raw_measure_noise_state[i] * sigma_measure_noise_state +
    polling_bias[state[i]];
}
27 / 40

3. Elections DLMs

Poll-level model

i. Latent state-level vote shares evolve as a random walk over time
  • Pooling toward the state-level fundamentals more as we are further out from election day
ii. Polls are observations with measurement error that are debiased on the basis of:
  • Pollster firm (so-called "house effects")
  • Poll mode
  • Poll population
iii. Correcting for partisan non-response
  • Whether a pollster weights by party registration or past vote
  • Incorporated as a residual AR process
28 / 40

3. Elections DLMs

Notable improvements from adjusting for partisan non-response and other weighting issues

29 / 40

3. Elections DLMs

Notable improvements from adjusting for partisan non-response and other weighting issues

30 / 40

3. Elections DLMs

2016: good!

31 / 40

3. Elections DLMs

2020: not as good!

32 / 40

3. Elections DLMs

Problem with non-response/weighting adjustments

1. Pollsters change their methods

2. Not all adjustments work

33 / 40

3. Elections DLMs

Solution? Conditional forecasting!

- Present aggregates assuming some amount of polling bias.

- As a way to explain to readers how bias enters the process of polling

- And what happens to forecasts if bias now does not follow historical distributions

34 / 40

3. Elections DLMs

Conditional forecasting:

1. Debias polls

2. Rerun simulations

35 / 40
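A toy version of the debias-and-rerun procedure: shift the poll average by an assumed uniform polling bias, then recompute a win probability by simulation. All numbers here are invented for illustration, not the Economist model's.

```python
# Conditional forecasting sketch: subtract an assumed polling bias from the
# aggregate margin, then rerun the simulations to get a conditional win
# probability. Margins, sd, and bias scenarios are all illustrative.
import random

random.seed(42)

def win_probability(poll_margin, assumed_bias, sd=0.03, sims=10_000):
    """Share of simulated true margins that stay positive after removing
    an assumed uniform bias from the polls."""
    debiased = poll_margin - assumed_bias
    return sum(random.gauss(debiased, sd) > 0 for _ in range(sims)) / sims

# Same polls, three bias scenarios: no bias, a 2-point miss, a 4-point miss
for bias in (0.0, 0.02, 0.04):
    print(bias, win_probability(0.04, bias))
```

Presenting the whole row of scenarios, rather than a single headline probability, is what lets readers see how sensitive the forecast is to a 2016- or 2020-sized polling miss.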

3. Elections DLMs

2. Rerun simulations

36 / 40

3. Elections DLMs

2. Rerun simulations

Advantage: leaves readers with a much clearer picture of possibilities for election outcomes if past patterns of bias aren't predictive of bias now (2016, 2020)

37 / 40

3. Elections DLMs

But exploring parameter conditionality is not always necessary or helpful:

38 / 40

Questions?

39 / 40

Thank you!



Website: gelliottmorris.com

Twitter: @gelliottmorris

Questions?


These slides were made using the xaringan package for R. They are available online at https://www.gelliottmorris.com/slides/

40 / 40
