Logistic regression, multilevel models, and t-tests

A simulation study inspired by experiments in improving Wikipedia editing experience, and demonstrating multiple methodologies for analyzing data.

Mikhail Popov (https://mpopov.com/), Wikimedia Foundation (https://wikimediafoundation.org/)
2021-03-12

Setup

In these notes we will use the statistical software and programming language R, along with a few packages. RStudio IDE is highly recommended, but not required.

# Essentials:
library(zeallot) # multi-assignment operator %<-%
library(tidyverse)
# Presentable tables:
library(gt)
library(gtsummary)
# Modelling:
library(lme4)
library(brms)
library(tidybayes)

Use the following code to install those packages if you want to follow along:

install.packages(
  c("tidyverse", "zeallot", "lme4", "car", "brms", "tidybayes", "cmdstanr"),
  repos = c("https://mc-stan.org/r-packages/", getOption("repos"))
)
cmdstanr::install_cmdstan()

# Optional:
install.packages(c("gt", "gtsummary", "broom.mixed")) # tables

Introduction

Suppose we’re running a test of a new editor interface for making edits to an article and we wish to know whether this new editing interface is an improvement – defined as “whether the user published the edit after starting it.” Specifically, we’re interested in a lift of 20% in the publish (successful edit attempt) probability.

We will build a model of edit attempt outcomes \(y\) – which is 1 if the edit was published and 0 if it was not. We use \(p\) to indicate the probability of a successful edit – in other words: \(p = \text{Pr}(y = 1)\). The simplest version of our model is a logistic regression:

\[ \begin{align*} y &\sim \text{Bernoulli}(p)\\ \text{logit}(p) &= \beta_0 + \beta_1 \times \text{used new interface} \end{align*} \]

where \(\beta_0\) is the intercept and \(\beta_1\) is the slope – the effect associated with using the new interface. In other words:

\[ \begin{align*} & \text{Pr}(\text{edit published}|\text{did not use new interface}) & =~ & \text{logit}^{-1}(\beta_0)\\ & \text{Pr}(\text{edit published}|\text{used new interface}) & =~ & \text{logit}^{-1}(\beta_0 + \beta_1) \end{align*} \]

The goal of the statistical model here is to infer the values of \(\beta_0\) and \(\beta_1\).

It is important to remember that these model parameters (also called coefficients) are on the log-odds scale, so when interpreting their estimates we must be careful and sometimes apply transformations to make sense of them.

For example, Gelman, Hill, and Vehtari (2021) suggest using the “divide-by-4” rule for convenience. Under this rule, \(\beta_1/4\) is a reasonable approximation of the maximum increase in the probability of success corresponding to a unit difference – in our case the difference between which interface was used (\(\text{used new interface} = 0\) vs \(\text{used new interface} = 1\)).

Alternatively, \(\exp(\beta_1)\) – or \(e^{\beta_1}\) – represents the multiplicative effect on the odds of an edit getting published.
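
As a quick numerical illustration, here are both transformations applied to a slope of 0.8 – the value of \(\beta_1\) we will use in the simulation below:

0.8 / 4  # ~0.20: approximate maximum increase in publish probability
exp(0.8) # ~2.23: multiplicative effect on the odds of publishing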

Simulation

To see the methods in action we will simulate users, their edit attempts, and the outcomes of those attempts – whether they were successful (edit published) or not – and then see how the models are able to recover the true values of the parameters which were used in the simulation.

c(b0, b1) %<-% c(-0.7, 0.8)

c(logit, invlogit)  %<-% c(qlogis, plogis) # Easier to remember

where b0 is the baseline log-odds of publishing with the old interface and b1 is the effect of the new editor interface on the log-odds scale (refer to the introduction above).

simple_example <- tibble(
  new_interface = c(FALSE, TRUE),
  formula = c("\\[\\beta_0\\]", "\\[\\beta_0 + \\beta_1\\]")
) %>%
  mutate(
    log_odds = b0 + b1 * new_interface,
    prob = invlogit(log_odds)
  )
simple_example %>%
  gt() %>%
  fmt(vars(new_interface), fns = function(x) ifelse(x, "New", "Old")) %>%
  fmt_percent(vars(prob), decimals = 1) %>%
  cols_label(
    new_interface = "Editing interface",
    formula = html("logit<sup>-1</sup>(p)"),
    log_odds = "Log-odds scale",
    prob = "Publish probability"
  ) %>%
  tab_header(
    "Logistic regression model",
    html(sprintf("Using &beta;<sub>0</sub> = %.1f and &beta;<sub>1</sub> = %.1f", b0, b1))
  )
Logistic regression model
Using β0 = -0.7 and β1 = 0.8

Editing interface | logit⁻¹(p) | Log-odds scale | Publish probability
Old               | β0         | -0.7           | 33.2%
New               | β0 + β1    | 0.1            | 52.5%

Let’s generate some data to use for our study! We will first build a user simulator and then use that to simulate many users making edits, sometimes using the new interface and sometimes using the old interface.

simulate_user_edit_attempts <- function() {

  # unique for each user; informs that user's "base publish prob."
  user_intercept <- b0 + rnorm(1, 0, 0.75)

  # how many edit attempts did this user make?
  n_edit_attempts <- ceiling(rexp(1, 1/50))

  # simulate very active users either very into new interface or not
  prob_using_new_interface <- ifelse(
    n_edit_attempts > 150,    # 'active': 150+ edit attempts
    sample(c(0.05, 0.85), 1), # for active users
    rbeta(1, 1.5, 1.5)        # for everyone else
  )
  # simulate which edit attempts were performed with which interface:
  used_new_interface <- rbernoulli(n_edit_attempts, p = prob_using_new_interface)

  # each edit attempt's success prob. depends on user and interface used:
  user_log_odds <- user_intercept + b1 * used_new_interface
  success_probabilities <- invlogit(user_log_odds)

  # simulate each edit attempt's outcome, based on that edit attempt's prob.:
  edit_attempts <- map_lgl(success_probabilities, rbernoulli, n = 1)

  # return simulated user's edits and outcomes:
  tibble(
    user_intercept = user_intercept,
    user_success = boot::inv.logit(user_intercept),
    new_interface = used_new_interface * 1L,
    success_prob = success_probabilities,
    edit_success = edit_attempts * 1L
  )
}

To see this simulation in action, let’s simulate one user and print the first 10 edit attempts:

set.seed(0)
simulate_user_edit_attempts() %>%
  head(10)
# A tibble: 10 × 5
   user_intercept user_success new_interface success_prob edit_success
            <dbl>        <dbl>         <int>        <dbl>        <int>
 1          0.247        0.561             0        0.561            1
 2          0.247        0.561             1        0.740            0
 3          0.247        0.561             1        0.740            1
 4          0.247        0.561             1        0.740            1
 5          0.247        0.561             1        0.740            0
 6          0.247        0.561             0        0.561            1
 7          0.247        0.561             0        0.561            0
 8          0.247        0.561             0        0.561            1
 9          0.247        0.561             1        0.740            1
10          0.247        0.561             0        0.561            0

Notice how the user’s individual intercept gets transformed into their base probability, how the choice of the interface either keeps it the same or lifts it, and how even using the new interface (and in this case having 74% probability of publishing the edit) did not yield successes in every instance.
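
As a quick check, we can reproduce those two probabilities directly from this user's intercept (roughly 0.247, per the printed output above):

invlogit(0.247)      # ~0.561: publish probability with the old interface
invlogit(0.247 + b1) # ~0.740: publish probability with the new interface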

That was just one user, but our analysis will require more. Let’s simulate 100 users:

set.seed(42)
simulated_edits <- map_dfr(
  1:100,
  ~ simulate_user_edit_attempts(),
  .id = "user_id"
)
simulated_edits <- simulated_edits %>%
  mutate(
    # format user IDs to use zero padding:
    user_id = factor(sprintf("%03.0f", as.numeric(user_id)))
  )
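
Before modelling, a quick sanity check of the simulated dataset can be helpful – for example, the number of users, edit attempts, and raw publish rates by interface (a sketch; the exact numbers depend on the seed):

simulated_edits %>%
  group_by(new_interface) %>%
  summarize(
    users = n_distinct(user_id),
    edit_attempts = n(),
    publish_rate = mean(edit_success)
  )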

Inference

Simple Logistic Regression

First, let’s see what results we get from a simple model that treats all edit attempts as independent and interchangeable, even though we know this to be wrong since edit attempts from the same user are going to share a base success probability (based on that user’s simulated intercept):

fit0 <- glm(
  formula = edit_success ~ new_interface, # intercept is implied
  family = binomial(link = "logit"),
  data = simulated_edits
)

The results are:

fit0 %>%
  tbl_regression(
    label = list(new_interface = "Using new interface"),
    intercept = TRUE
  )
Characteristic      | log(OR) [1] | 95% CI [1]   | p-value
(Intercept)         | -0.82       | -0.90, -0.74 | <0.001
Using new interface | 0.82        | 0.71, 0.93   | <0.001
[1] OR = Odds Ratio, CI = Confidence Interval

This model has done an OK job of estimating the coefficients \(\beta_0\) = -0.7 and \(\beta_1\) = 0.8.
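
If you prefer to inspect the fitted coefficients directly rather than through a formatted table, one way (using base R helpers) is:

coef(fit0)                # point estimates on the log-odds scale
confint(fit0)             # profile-likelihood 95% confidence intervals
invlogit(sum(coef(fit0))) # implied publish probability with the new interface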

Hierarchical Logistic Regression

However, the model above is fundamentally flawed because it assumes all of the edit attempts are independent and identically distributed, which is not the case. A more appropriate model would incorporate the fact that these edit attempts are related to each other because they were made by the same users.

Hierarchical regression models – also known as multilevel models and mixed-effects models (due to the presence of random effects in addition to fixed effects) – are used to model this structure when individual observations are grouped; the group-specific terms are the random effects. In fact, we can model even more deeply nested structures – for example, if we had simulated multiple wikis and multiple users within those wikis, we could model that hierarchy, as sketched below.
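
A minimal sketch of such a nested specification, assuming a hypothetical dataset with a wiki column (our simulated data has no such column):

glmer(
  edit_success ~ new_interface + (1 | wiki / user_id), # users nested within wikis
  family = binomial(link = "logit"),
  data = simulated_edits_with_wikis # hypothetical dataset containing a `wiki` column
)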

fit1 <- glmer(
  formula = edit_success ~ new_interface + (1 | user_id),
  family = binomial(link = "logit"),
  data = simulated_edits
)

The results are:

tbl1 <- fit1 %>%
  tbl_regression(
    label = list(new_interface = "Using new interface"),
    intercept = TRUE
  )
tbl1
Characteristic      | log(OR) [1] | 95% CI [1]   | p-value
(Intercept)         | -0.77       | -0.95, -0.60 | <0.001
Using new interface | 0.86        | 0.72, 1.0    | <0.001
[1] OR = Odds Ratio, CI = Confidence Interval

This model has done a better job of estimating the coefficients \(\beta_0\) = -0.7 and \(\beta_1\) = 0.8.
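
We can also check how well the model recovered the user-to-user variation – recall that the simulation drew each user's intercept with a standard deviation of 0.75:

fixef(fit1)   # fixed effects; compare to b0 = -0.7 and b1 = 0.8
VarCorr(fit1) # random-intercept standard deviation; compare to the 0.75 used in the simulation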

Interpretation

In order to interpret these estimates, we have to apply a transformation, but then we need to be careful about the confidence intervals: the delta method can be used to approximate standard errors and confidence intervals for the transformed parameters:

c("b1/4", "exp(b1)") %>%
  map_dfr(
    .f = car::deltaMethod,
    object = fit1,
    parameterNames = paste0("b", 0:1)
  ) %>%
  rownames_to_column(var = "transformation") %>%
  gt(rowname_col = "transformation") %>%
  fmt_number(vars(`Estimate`, `SE`, `2.5 %`, `97.5 %`)) %>%
  cols_merge(vars(`2.5 %`, `97.5 %`), pattern = "({1}, {2})") %>%
  cols_label(`2.5 %` = "95% CI", SE = "Standard Error") %>%
  tab_header("Transformed effects", "Estimated using delta method") %>%
  tab_footnote("CI: Confidence Interval", cells_column_labels(vars(`2.5 %`)))
Transformed effects
Estimated using delta method

        | Estimate | Standard Error | 95% CI [1]
b1/4    | 0.21     | 0.02           | (0.18, 0.25)
exp(b1) | 2.36     | 0.17           | (2.03, 2.69)
[1] CI: Confidence Interval

The maximum lift in edit attempt success probability is 21% (95% CI: 18%-25%), and because the confidence interval does not include 0 this is statistically significant at the 0.05 level. When using the new interface the odds of publishing the edit are multiplied by 2.36 (95% CI: 2.03-2.69) – effectively double – and in this case we would determine statistical significance (at the 0.05 level) by checking whether this confidence interval contains 1, since a multiplicative effect of 1 is no change either way.
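
Note that the point estimates in that table are just direct transformations of the fitted coefficient; the delta method is what supplies the standard errors and confidence intervals. For example:

b1_hat <- fixef(fit1)["new_interface"]
b1_hat / 4  # divide-by-4 approximation of the maximum lift
exp(b1_hat) # multiplicative effect on the odds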

Highly-active users

One of the things we did in the simulation was include special behavior for highly-active users. Specifically, when a user made more than 150 simulated edit attempts, our simulation made approximately either 5% or 85% of their edits use the new interface – effectively simulating power users who are either very apprehensive about the new interface or have a strong preference for it.

most_active_users <- simulated_edits %>%
  group_by(user_id) %>%
  summarize(
    n_edits = n(),
    n_new_interface = sum(new_interface),
    n_successes_new = sum(edit_success * new_interface),
    n_successes = sum(edit_success)
  ) %>%
  top_n(10, n_edits)
most_active_users %>%
  mutate(
    prop_new_interface = n_new_interface / n_edits,
    prop_successes = n_successes / n_edits,
    prop_successes_new = n_successes_new / n_new_interface
  ) %>%
  gt() %>%
  fmt(
    starts_with("prop_"),
    fns = function(x) sprintf("%s%.0f%%", ifelse(x < 0.1, "&nbsp;", ""), 100 * x)
  ) %>%
  cols_merge(ends_with("_new_interface"), pattern = "{1} (<b>{2}</b>)") %>%
  cols_merge(ends_with("_successes"), pattern = "{1} (<b>{2}</b>)") %>%
  cols_merge(ends_with("_successes_new"), pattern = "{1} (<b>{2}</b>)") %>%
  cols_align("left", vars(user_id)) %>%
  cols_align("right", starts_with("n_")) %>%
  cols_label(
    user_id = "User ID",
    n_edits = "Total edit attempts",
    n_new_interface = "Edits made using new interface",
    n_successes = "Total edits published",
    n_successes_new = "Edits published with new interface"
  ) %>%
  tab_footnote(
    footnote = html("(<b>%</b>) is the proportion of total edits which were made using the new interface"),
    locations = cells_column_labels(vars(n_new_interface))
  ) %>%
  tab_footnote(
    footnote = html("(<b>%</b>) is the proportion of edits made using the new interface which were published"),
    locations = cells_column_labels(vars(n_successes_new))
  ) %>%
  tab_footnote(
    footnote = html("(<b>%</b>) is the proportion of total edits published"),
    locations = cells_column_labels(vars(n_successes))
  ) %>%
  tab_style(
    cell_text(font = "monospace", size = "large"),
    cells_body(starts_with("n_"))
  ) %>%
  tab_style(
    style = list(
      cell_fill(color = "gray95")
    ),
    cells_body(rows = (1:nrow(most_active_users)) %% 2 == 1)
  ) %>%
  tab_header("Most active (simulated) users")
Most active (simulated) users

User ID | Total edit attempts | Edits made using new interface [1] | Edits published with new interface [2] | Total edits published [3]
002     | 424                 | 349 (82%)                          | 142 (41%)                              | 161 (38%)
003     | 262                 |   8 ( 3%)                          |   2 (25%)                              | 118 (45%)
023     | 150                 | 118 (79%)                          |  82 (69%)                              |  99 (66%)
025     | 125                 |  22 (18%)                          |  10 (45%)                              |  37 (30%)
035     | 180                 |   6 ( 3%)                          |   3 (50%)                              |  55 (31%)
040     | 182                 | 164 (90%)                          |  75 (46%)                              |  81 (45%)
059     | 188                 |  10 ( 5%)                          |   2 (20%)                              |  43 (23%)
067     | 122                 |  90 (74%)                          |  45 (50%)                              |  52 (43%)
070     | 176                 |   9 ( 5%)                          |   1 (11%)                              |  16 ( 9%)
094     | 159                 |   8 ( 5%)                          |   2 (25%)                              |  15 ( 9%)
[1] (%) is the proportion of total edits which were made using the new interface
[2] (%) is the proportion of edits made using the new interface which were published
[3] (%) is the proportion of total edits published

These users account for 36% of all edits in this simulated dataset.
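
(One way to compute that share from the objects defined above:)

sum(most_active_users$n_edits) / nrow(simulated_edits) # ~0.36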

Let us see what happens to the results when we exclude those users from analysis:

fit2 <- glmer(
  formula = edit_success ~ new_interface + (1 | user_id),
  family = binomial(link = "logit"),
  data = anti_join(simulated_edits, most_active_users, by = "user_id")
)
tbl2 <- fit2 %>%
  tbl_regression(
    label = list(new_interface = "Using new interface"),
    intercept = TRUE
  )
tbl_merge(
  list(tbl1, tbl2),
  c("Including most active users", "Excluding most active users")
)
                    | Including most active users          | Excluding most active users
Characteristic      | log(OR) [1] | 95% CI [1]   | p-value | log(OR) [1] | 95% CI [1]   | p-value
(Intercept)         | -0.77       | -0.95, -0.60 | <0.001  | -0.74       | -0.92, -0.55 | <0.001
Using new interface | 0.86        | 0.72, 1.0    | <0.001  | 0.89        | 0.73, 1.0    | <0.001
[1] OR = Odds Ratio, CI = Confidence Interval
list(
  `Including most active users` = c("b1/4", "exp(b1)") %>%
    map_dfr(
      .f = car::deltaMethod,
      object = fit1,
      parameterNames = paste0("b", 0:1)
    ) %>%
    rownames_to_column(var = "transformation"),
  `Excluding most active users` = c("b1/4", "exp(b1)") %>%
    map_dfr(
      .f = car::deltaMethod,
      object = fit2,
      parameterNames = paste0("b", 0:1)
    ) %>%
    rownames_to_column(var = "transformation")
) %>%
  bind_rows(.id = "dataset") %>%
  gt(rowname_col = "transformation", groupname_col = "dataset") %>%
  fmt_number(vars(`Estimate`, `SE`, `2.5 %`, `97.5 %`)) %>%
  cols_merge(vars(`2.5 %`, `97.5 %`), pattern = "({1}, {2})") %>%
  cols_label(`2.5 %` = "95% CI", SE = "Standard Error") %>%
  tab_header("Transformed effects") %>%
  tab_footnote("CI: Confidence Interval", cells_column_labels(vars(`2.5 %`)))
Transformed effects

                            | Estimate | Standard Error | 95% CI [1]
Including most active users
  b1/4                      | 0.21     | 0.02           | (0.18, 0.25)
  exp(b1)                   | 2.36     | 0.17           | (2.03, 2.69)
Excluding most active users
  b1/4                      | 0.22     | 0.02           | (0.18, 0.26)
  exp(b1)                   | 2.43     | 0.20           | (2.04, 2.81)
[1] CI: Confidence Interval

In this case the estimates did not change much when excluding the most active users, and in any case there is no compelling reason to exclude them.

Bayesian Hierarchical Logistic Regression Model

One of the benefits of switching to a Bayesian approach with this model is that we can reason about the parameters (and any other quantities) probabilistically. This benefit will become apparent soon, but first we need to fit the Bayesian model and obtain draws of the parameters from what's called the joint posterior distribution – so called because it is a re-allocation of plausibility across parameter values (away from the values we thought were likely a priori) after taking the observed data into account.

priors <- c(
  set_prior(prior = "std_normal()", class = "b"),
  set_prior("cauchy(0, 5)", class = "sd")
)

fit3 <- brm(
  edit_success ~ new_interface + (1 | user_id),
  family = bernoulli(link = "logit"),
  data = simulated_edits,
  prior = priors,
  chains = 4, cores = 4, backend = "cmdstanr"
)
summary(fit3)
 Family: bernoulli 
  Links: mu = logit 
Formula: edit_success ~ new_interface + (1 | user_id) 
   Data: simulated_edits (Number of observations: 5519) 
  Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~user_id (Number of levels: 100) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
sd(Intercept)     0.70      0.07     0.57     0.87 1.00     1195
              Tail_ESS
sd(Intercept)     2055

Population-Level Effects: 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
Intercept        -0.77      0.09    -0.94    -0.59 1.00     1137
new_interface     0.86      0.07     0.71     1.00 1.00     5915
              Tail_ESS
Intercept         2088
new_interface     3332

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
tbl3 <- fit3 %>%
  spread_draws(b_new_interface, b_Intercept) %>%
  mutate(
    exp_b = exp(b_new_interface),
    b4 = b_new_interface / 4,
    avg_lift = invlogit(b_Intercept + b_new_interface) - invlogit(b_Intercept)
  ) %>%
  pivot_longer(
    b_new_interface:avg_lift,
    names_to = "param",
    values_to = "val"
  ) %>%
  group_by(param) %>%
  summarize(
    ps = c(0.025, 0.5, 0.975),
    qs = quantile(val, probs = ps),
    .groups = "drop"
  ) %>%
  mutate(
    quantity = ifelse(
      param %in% c("b_Intercept", "b_new_interface"),
      "Parameter", "Function of parameter(s)"
    ),
    param = factor(
      param,
      c("b_Intercept", "b_new_interface", "exp_b", "b4", "avg_lift"),
      c("(Intercept)", "Using new interface", "Multiplicative effect on odds", "Divide-by-4 rule", "Average lift")
    ),
    ps = factor(ps, c(0.025, 0.5, 0.975), c("lower", "median", "upper")),
  ) %>%
  pivot_wider(names_from = "ps", values_from = "qs") %>%
  arrange(quantity, param)
tbl3 %>%
  gt(rowname_col = "param", groupname_col = "quantity") %>%
  row_group_order(c("Parameter", "Function of parameter(s)")) %>%
  fmt_number(vars(lower, median, upper), decimals = 3) %>%
  fmt_percent(columns = vars(median, lower, upper), rows = 2:3, decimals = 1) %>%
  cols_align("center", vars(median, lower, upper)) %>%
  cols_merge(vars(lower, upper), pattern = "({1}, {2})") %>%
  cols_move_to_end(vars(lower)) %>%
  cols_label(median = "Point Estimate", lower = "95% CI") %>%
  tab_style(cell_text(weight = "bold"), cells_row_groups()) %>%
  tab_footnote("CI: Credible Interval", cells_column_labels(vars(lower))) %>%
  tab_footnote(
    html("Average lift = Pr(Success|New interface) - Pr(Success|Old interface) = logit<sup>-1</sup>(&beta;<sub>0</sub> + &beta;<sub>1</sub>) - logit<sup>-1</sup>(&beta;<sub>0</sub>)"),
    cells_body(vars(median), 3)
  ) %>%
  tab_header("Posterior summary of model parameters")
Posterior summary of model parameters

                                | Point Estimate | 95% CI [1]
Parameter
  (Intercept)                   | −0.769         | (−0.936, −0.587)
  Using new interface           | 0.856          | (0.712, 1.002)
Function of parameter(s)
  Multiplicative effect on odds | 2.354          | (2.039, 2.722)
  Divide-by-4 rule              | 21.4%          | (17.8%, 25.0%)
  Average lift [2]              | 20.5%          | (17.1%, 23.9%)
[1] CI: Credible Interval
[2] Average lift = Pr(Success|New interface) − Pr(Success|Old interface) = logit⁻¹(β0 + β1) − logit⁻¹(β0)

The results are essentially the same as the non-Bayesian model in the previous section. A couple of key differences: the 95% intervals are now credible intervals, which can be read directly as probability statements about the parameters, and uncertainty for transformed quantities (such as the multiplicative effect on the odds or the average lift) comes straight from the posterior draws, with no need for the delta method.

Hypothesis Testing

Hypothesis testing in this paradigm works a little differently. Rather than performing null hypothesis significance testing (NHST) – rejecting or failing to reject the null hypothesis and calculating p-values – we can ask for the probability that a parameter (or a function of parameters) falls within a range of interest.

What’s the probability using the new interface at least doubles the odds of an edit getting published?

hyp1 <- "exp(new_interface) > 2"
fit3 %>%
  hypothesis(hyp1)
Hypothesis Tests for class b:
                Hypothesis Estimate Est.Error CI.Lower CI.Upper
1 (exp(new_interfac... > 0     0.36      0.17     0.09     0.65
  Evid.Ratio Post.Prob Star
1      74.47      0.99    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

Of interest is the Post.Prob column – the posterior probability. In this case, we can be almost certain that the odds of an edit getting published at least double when the new interface is used.

What’s the probability that the average lift in edit attempt success probability is positive?

hyp2 <- "invlogit((Intercept) + new_interface) - invlogit((Intercept)) > 0"
fit3 %>%
  hypothesis(hyp2)
Hypothesis Tests for class b:
                Hypothesis Estimate Est.Error CI.Lower CI.Upper
1 (invlogit((Interc... > 0      0.2      0.02     0.18     0.23
  Evid.Ratio Post.Prob Star
1        Inf         1    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

What’s the probability that the average lift in edit attempt success probability is at least 20%?

h <- "invlogit((Intercept) + new_interface) - invlogit((Intercept)) > 0.2"
(hyp3 <- hypothesis(fit3, hypothesis = h))
Hypothesis Tests for class b:
                Hypothesis Estimate Est.Error CI.Lower CI.Upper
1 (invlogit((Interc... > 0        0      0.02    -0.02     0.03
  Evid.Ratio Post.Prob Star
1       1.53       0.6     
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

There is a 0.60425 probability that the difference between the Pr(Success | New interface) and Pr(Success | Old interface) is greater than 0.2 – and it is up to the stakeholder to decide whether that is high enough for them or if they would prefer to see a probability of, say, 0.8.
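
If needed, that posterior probability can also be extracted programmatically from the returned object (a sketch, assuming the structure of the brmshypothesis object):

hyp3$hypothesis$Post.Prob # posterior probability of the stated hypothesis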

t-test

Another question we may ask is “is there a significant difference in proportions of successful edits between edits made using the new interface and those made using the old interface?”

edit_counts <- simulated_edits %>%
  group_by(user_id) %>%
  summarize(
    n_total_edits = n(),
    new_ui_total_edits = sum(new_interface),
    old_ui_total_edits = sum(1 - new_interface),
    new_ui_successful_edits = sum(edit_success * new_interface),
    old_ui_successful_edits = sum(edit_success * (1 - new_interface))
  ) %>%
  # Only users who have used both UIs at least once:
  filter(
    new_ui_total_edits > 0,
    old_ui_total_edits > 0
  ) %>%
  mutate(
    new_ui_prop_success = new_ui_successful_edits / new_ui_total_edits,
    old_ui_prop_success = old_ui_successful_edits / old_ui_total_edits
  )

Note: 4 users were excluded from this analysis because they did not use both interfaces, and this analysis requires at least one edit attempt using each of the interfaces.
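
(The number of excluded users can be verified with, for example:)

n_distinct(simulated_edits$user_id) - nrow(edit_counts) # 4 users did not use both interfaces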

tidy_counts <- edit_counts %>%
  select(user_id, contains("ui")) %>%
  pivot_longer(
    !user_id, 
    names_to = c("interface", ".value"),
    names_pattern = "(.*)_ui_(.*)",
    values_drop_na = TRUE
  )

ggplot(tidy_counts, aes(y = prop_success, x = interface)) +
  geom_boxplot() +
  geom_jitter(aes(size = total_edits), width = 0.2, height = 0, alpha = 0.5) +
  scale_y_continuous("Proportion of edits published", labels = scales::percent_format(1)) +
  scale_size_continuous(
    "Edits attempted by user",
    range = c(1, 4), trans = "log", breaks = c(1, 5, 55)
  ) +
  ggtitle("Proportion of edits published, by interface") +
  theme(legend.position = "bottom")

The users at the top (100%) and bottom (0%) made very few edits, so it is easy for the proportion to be “all” or “none” when the denominator is, say, 1.

To actually test the hypothesis we perform a paired t-test, since we have a pair of success proportions – one for each interface – from each user.

t.test(
  x = edit_counts$new_ui_prop_success,
  y = edit_counts$old_ui_prop_success,
  alternative = "greater",
  paired = TRUE
)

    Paired t-test

data:  edit_counts$new_ui_prop_success and edit_counts$old_ui_prop_success
t = 5.0261, df = 95, p-value = 1.175e-06
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.09907877        Inf
sample estimates:
mean of the differences 
              0.1479853 

The average within-user difference in the proportion of published edits between the interfaces is 14.8 percentage points (p < 0.001), which is both practically and statistically significant.

The underlying question here is different, however. The models above inferred the latent (hidden) publish probability, which is not the same as inferring the average per-user proportion of published edits.
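
One way to see the distinction: the paired t-test weights every user equally regardless of how many edits they attempted, and its estimate is simply the mean of the per-user differences:

mean(edit_counts$new_ui_prop_success - edit_counts$old_ui_prop_success) # ~0.148, matching the t-test output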

Also, for anyone curious:

hyp4 <- "invlogit((Intercept) + new_interface) - invlogit((Intercept)) > 0.148"
fit3 %>%
  hypothesis(hyp4)
Hypothesis Tests for class b:
                Hypothesis Estimate Est.Error CI.Lower CI.Upper
1 (invlogit((Interc... > 0     0.06      0.02     0.03     0.08
  Evid.Ratio Post.Prob Star
1    1332.33         1    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

Environment

Software | Version
OS       | macOS Big Sur 10.16
R        | R version 4.1.0 (2021-05-18)
CmdStan  | 2.29.2

References

Allaire, JJ, Rich Iannone, Alison Presmanes Hill, and Yihui Xie. 2021. distill: 'R Markdown' Format for Scientific and Technical Writing. https://CRAN.R-project.org/package=distill.
Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2021. rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.
Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. "Fitting Linear Mixed-Effects Models Using lme4." Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.
Bürkner, Paul-Christian. 2017. "brms: An R Package for Bayesian Multilevel Models Using Stan." Journal of Statistical Software 80 (1): 1–28. https://doi.org/10.18637/jss.v080.i01.
———. 2018. "Advanced Bayesian Multilevel Modeling with the R Package brms." The R Journal 10 (1): 395–411. https://doi.org/10.32614/RJ-2018-017.
Gabry, Jonah, and Rok Češnovar. 2020. cmdstanr: R Interface to 'CmdStan'.
Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2021. Regression and Other Stories. https://doi.org/10.1017/9781139161879.
Iannone, Richard, Joe Cheng, and Barret Schloerke. 2020. gt: Easily Create Presentation-Ready Display Tables. https://CRAN.R-project.org/package=gt.
Kay, Matthew. 2020. tidybayes: Tidy Data and Geoms for Bayesian Models. https://doi.org/10.5281/zenodo.1308151.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sjoberg, Daniel D., Michael Curry, Margie Hannum, Karissa Whiting, and Emily C. Zabor. 2021. gtsummary: Presentation-Ready Data Summary and Analytic Result Tables. https://CRAN.R-project.org/package=gtsummary.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D'Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. "Welcome to the tidyverse." Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.


Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/bearloga/wmf-mlm-glm-notes, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".