Estimation with non-linearly scaled interval responses in surveys

If a survey question asks how much time the respondent took to complete a task – with the underlying quantity of interest being “median task completion time” – presenting options such as “5-15 minutes” and “1-2 days” (instead of freeform text fields) reduces cognitive load and makes analysis simpler. The proposed methodology enables accurate inference on the quantity of interest given survey responses in the form of intervals.

Mikhail Popov (https://mpopov.com/), Wikimedia Foundation (https://wikimediafoundation.org/)
2021-02-12

Introduction

Consider the following survey question:

Figure 1: Example of a question from Better Use of Data program’s Data Dexterity Survey for Q2 with non-linear interval responses

The survey authors’ goal is to estimate how much time that task takes on average by asking the survey respondents to pick the interval that best captures how much time they spent. A past iteration used freeform text, which created difficulties at response time (for people who might worry about providing an exact number) and at analysis time. By providing these options, the survey authors hoped to lessen the survey taker’s cognitive burden and to make the analysis easier.

A natural question, of course, is how one actually uses the responses to answer the question of interest – how much time that task takes on average. One possible, naive approach is to take the midpoint of each interval – e.g. “2-4 hours” becomes “3 hours” – and compute the mean of all the responses. In this article, we propose a methodology for estimating that quantity and show that it greatly outperforms the naive approach, especially on non-linearly scaled ranges.

Algorithm

Given \(R\) responses and a response \(r \in R\) with a range \([a_r, b_r]\):

  1. Draw \(n\) observations from \(\text{Uniform}(a_r, b_r)\) (if all values in the interval are assumed equally likely) or from \(\mathcal{N}\left(b_r - \frac{b_r-a_r}{2}, \frac{b_r-a_r}{6}\right)\) (if it is desirable to give the midpoint more weight),
  2. Estimate \(\theta\) in the model \(y_i \sim \mathcal{D}(\theta),~i = 1, \ldots, N\), using the \(N = nR\) generated observations as data,

where \(\mathcal{D}(\theta)\) is the normal \(\mathcal{N}(\mu, \sigma)\) or the \(\text{Log-normal}(\mu, \sigma)\) distribution, depending on what is more appropriate.

Regarding \(\mathcal{N}\left(b_r - \frac{b_r-a_r}{2}, \frac{b_r-a_r}{6}\right)\): when values come from a normal distribution, 99.7% of them fall within 3 standard deviations of the mean. If we assume that \(b = \mu + 3 \sigma\) and \(a = \mu - 3 \sigma\) (any values outside of \([a, b]\) can be safely clipped), solving those two equations gives \(\mu = b - \frac{b-a}{2}\) (the midpoint of the interval) and \(\sigma = \frac{b-a}{6}\).

Suppose the survey taker responded with “between 6 and 12 minutes” – there are two ways we can specify what we think the actual duration was. Using the Uniform distribution says “all values between 6 and 12 are equally likely”, while using the Normal distribution with the formulas above says “there is a 68% probability that the duration was between 8 and 10 minutes, and a 95% probability that it was between 7 and 11 minutes”.

Figure 2: Visualization of what the choice of distribution represents for capturing beliefs and uncertainty for which values are more likely than others in the interval [6, 12].
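
As a minimal sketch of step 1 for this particular response (not the code used in the original analysis, and with an assumed number of draws per response), the interval could be “filled out” in R as follows:

    # Fill out one response's interval of [6, 12] minutes with n simulated
    # observations; n_per_response is an assumed tuning constant.
    set.seed(42)
    n_per_response <- 100
    a <- 6
    b <- 12

    # Option 1: all values in the interval are equally likely
    y_uniform <- runif(n_per_response, min = a, max = b)

    # Option 2: give the midpoint more weight, clipping any draws that land
    # outside [a, b]
    y_normal <- rnorm(n_per_response, mean = b - (b - a) / 2, sd = (b - a) / 6)
    y_normal <- pmin(pmax(y_normal, a), b)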

Examples

We demonstrate the benefits of the proposed method for two data-generating processes. In the first, data comes from a normal distribution and the response intervals are evenly spaced out. In the second, data comes from a log-normal distribution and the response intervals are unevenly spaced out.

In both scenarios we perform step 1 of the algorithm twice to generate two “extended” datasets based on the responses – once by drawing from Uniform distributions and once by drawing from Normal distributions centered at each interval’s midpoint.

Example 1

In this scenario the underlying distribution is \(\text{Normal}(\mu = 12, \sigma = 4)\) and the quantity of interest is \(\mu = 12\):

(0,4] (4,8] (8,12] (12,16] (16,20] (20,24]
0 2 6 8 3 1
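
For illustration, responses like these can be simulated by drawing task times from the underlying distribution and binning them into the answer options – a minimal sketch, assuming 20 respondents (the total count above):

    # Simulate 20 respondents whose true task times come from Normal(12, 4)
    # and who answer by picking the interval containing their time.
    set.seed(0)
    n_respondents <- 20
    true_times <- rnorm(n_respondents, mean = 12, sd = 4)
    # Clip to the range covered by the answer options before binning:
    true_times <- pmin(pmax(true_times, 0.01), 24)

    breaks <- seq(0, 24, by = 4)
    responses <- cut(true_times, breaks = breaks)
    table(responses)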

We will aim to estimate \(\mu\) through the naive method described previously and the proposed, more sophisticated method. First, let’s establish a baseline by applying the naive midpoint approach to these responses:

Median Mean Standard Deviation
14.00 13.00 4.08

To understand the differences between intervals filled out uniformly and intervals filled out normally, we can visualize the distributions of the resulting datasets:

Figure 3: Comparison of data generated from one distribution vs mixtures of Uniform and Normal distributions.

Stan Model

A Stan model is used to infer the parameters of interest; the CmdStanR package is used to fit it to the “extended” data.
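
The original listing is not reproduced here; a minimal sketch of what such a model might look like is given below, where the number of draws per response, the uniform interval filling, and the priors are all assumptions rather than the original choices:

    library(cmdstanr)

    # "Extended" dataset: fill each responded interval uniformly (step 1),
    # using the response counts from the table above and an assumed n = 100
    # draws per response.
    lower  <- c(4, 8, 12, 16, 20)   # intervals with at least one response
    upper  <- c(8, 12, 16, 20, 24)
    counts <- c(2, 6, 8, 3, 1)
    set.seed(0)
    y_extended <- unlist(mapply(
      function(a, b, k) runif(k * 100, a, b),
      lower, upper, counts, SIMPLIFY = FALSE
    ))

    # Minimal sketch of the normal model; the priors below are assumptions,
    # not necessarily the ones used in the original analysis.
    normal_model_code <- "
    data {
      int<lower=1> N;
      vector[N] y;
    }
    parameters {
      real mu;
      real<lower=0> sigma;
    }
    model {
      mu ~ normal(12, 10);      // centered mid-range of the options (assumption)
      sigma ~ exponential(0.5); // assumption
      y ~ normal(mu, sigma);
    }
    "

    model <- cmdstan_model(write_stan_file(normal_model_code))
    fit <- model$sample(data = list(N = length(y_extended), y = y_extended),
                        refresh = 0)
    fit$summary(c("mu", "sigma"))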

Estimation with different interval-filling strategies
Estimate (95% Credible Interval)
mu
Uniform 12.975 (12.828, 13.125)
Normal 12.975 (12.828, 13.125)
sigma
Uniform 4.026 (3.923, 4.139)
Normal 4.026 (3.923, 4.139)

Comparison

By repeatedly simulating data, applying the competing estimation approaches, and saving the results, we can obtain many absolute percentage errors – which we can then average to calculate the mean absolute percentage error (MAPE) for a variety of sample sizes, and even investigate differences between sampling from the intervals uniformly or with the Normal distribution:

Figure 4: Comparison of the various approaches’ performance at estimating the two parameters (mean and standard deviation). The dashed line indicates the true value of the mean while the dotted lines indicate the true value of the mean +/- the true value of the standard deviation. Five simulations were randomly chosen to showcase.
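
A sketch of how such a simulation study can be set up for the naive (midpoint mean) estimator – the Bayesian fits slot into the same loop – assuming 20 respondents per simulated survey and 100 simulations:

    # Sketch of the simulation study for the naive (midpoint mean) estimator;
    # the Bayesian fits would be slotted into the same loop.
    set.seed(1)
    true_mu <- 12
    true_sigma <- 4
    breaks <- seq(0, 24, by = 4)
    midpoints <- head(breaks, -1) + diff(breaks) / 2

    mape <- function(estimates, truth) mean(abs(estimates - truth) / truth)

    naive_means <- replicate(100, {
      times <- pmin(pmax(rnorm(20, true_mu, true_sigma), 0.01), 24)
      counts <- as.numeric(table(cut(times, breaks = breaks)))
      weighted.mean(midpoints, w = counts)
    })
    mape(naive_means, true_mu)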

Mean absolute percentage error (MAPE)
Calculated from 100 simulations for each sample size
Parameter (Estimation approach)
Median (Naive) Mean (Naive) Mu (Bayes:Uniform) Mu (Bayes:Normal)
15.4% 6.4% 6.4% 6.4%

Example 2

In practice we rarely encounter the normal distribution, and we are much more likely to encounter a skewed distribution like the log-normal where we are dealing only with positive values and smaller values are more likely but we can still observe very large values:

Figure 5: Example of a log-normal distribution and how the mode, median, and mean differ.

Imagine most requests requiring no more than a full work day, while still occasionally encountering a request that takes a week or a month. This example demonstrates the utility of the proposed methodology in precisely such situations.

Suppose the random variable “time spent completing a task” follows a \(\text{Log-normal}(\mu = 8, \sigma = 2)\) distribution. The median of a log-normal is \(\exp(\mu)\) (here in seconds), which comes out to 49m 41s in this case, and it’s the quantity we’re most interested in because the mean is pulled too far to the right by the tail. The median is easy to interpret – half of the tasks are completed in less time, half are completed in more. The mode is also interesting, but because the mode is calculated from both parameters, our uncertainty about it is impacted by having two sources of uncertainty.

Percentiles from a random sample of 1,000 data points drawn from this distribution are:

0% 10% 25% 50% 75% 90% 100%
5s 3m 46s 12m 3s 44m 10s 3h 16m 33s 11h 2m 8s 23d 17h 8m 52s

In this sample, half of the “tasks” were completed in 44m 10s or less.

Let’s generate a set of fake responses as we did before, only this time we’re drawing from the log-normal distribution.

0-5 minutes 5-60 minutes 1-6 hours 6-12 hours 12-14 hours
2 7 6 4 1
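
A sketch of how such responses can be simulated, with the interval bounds expressed in seconds and durations beyond the last option recorded in the top interval (an assumption about how respondents would answer):

    # Simulate 20 respondents whose true task times (in seconds) come from
    # Lognormal(8, 2) and who answer with the uneven interval options.
    set.seed(3)
    true_seconds <- rlnorm(20, meanlog = 8, sdlog = 2)

    # 0-5 min, 5-60 min, 1-6 h, 6-12 h, 12-14 h (bounds in seconds)
    breaks_sec <- c(0, 5 * 60, 60 * 60, 6 * 3600, 12 * 3600, 14 * 3600)
    responses <- cut(pmin(true_seconds, 14 * 3600), breaks = breaks_sec)
    table(responses)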

If we divide each count of responses by the total number of responses, we get the following proportions:

0-5 minutes 5-60 minutes 1-6 hours 6-12 hours 12-14 hours
10% 35% 30% 20% 5%

The idea, then, is to estimate “median time spent on task across the team” based on these intervals. First, a naive approach where we find the midpoint of each interval and calculate a median and a mean across the midpoints:

Median Mean
3h 30m 3h 41m 38s
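
As a minimal sketch of that naive calculation, with the midpoints expressed in minutes and the counts taken from the response table above:

    # Naive approach: midpoints of each interval (in minutes), repeated
    # according to the observed response counts.
    midpoints_min <- c(2.5, 32.5, 3.5 * 60, 9 * 60, 13 * 60)
    counts <- c(2, 7, 6, 4, 1)

    obs <- rep(midpoints_min, times = counts)
    median(obs) / 60  # about 3.5 hours
    mean(obs) / 60    # about 3.7 hours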

Again, the true median is 49m 41s, so neither the mean nor the median in the naive approach handles the skew especially well. Let us try the proposed approach.

First, we generate two datasets – one with observations drawn uniformly from the specified intervals, and one with observations drawn from Normal distributions parameterized according to the specified intervals. In visualizing the resulting distributions we also include the distribution of the sample from earlier – 1000 data points drawn directly from \(\text{Lognormal}(\mu = 8, \sigma = 2)\) – for comparison.

Figure 6: Comparison of data generated from one distribution (the log-normal) vs mixtures of Uniform and Normal distributions.

Because of the skewed nature of the data, it’s difficult to make out much detail near 0. We can also visualize the data on the log-scale, in which case the log-normal comparison distribution looks normal:

Figure 7: Comparison of data generated from one distribution (the log-normal) vs mixtures of Uniform and Normal distributions.

Before we continue, let us also see what happens if we were to calculate the median for these two datasets:

Interval-filling distribution Log(Median): approx. mu Median on practical scale
Normal 9.20 2h 44m 11s
Uniform 8.81 1h 51m 15s

This appears to perform better than the simple naive approach based on midpoints. Let’s call it the “extended naive” approach; later we will compare it with the other approaches.

Stan Model

A Stan model is again used, fit to the “extended” data with the CmdStanR package to infer the parameters of interest – namely, \(e^\mu\).
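
Again, the original listing is not reproduced here; below is a minimal sketch in which the uniform interval filling, the number of draws per response, and the priors (including the prior location of 3 hours, per the discussion that follows) are assumptions rather than the original choices:

    library(cmdstanr)

    # "Extended" dataset for Example 2: fill each interval uniformly, with
    # the bounds in seconds and the response counts from the table above.
    lower_sec <- c(0, 5, 60, 6 * 60, 12 * 60) * 60
    upper_sec <- c(5, 60, 6 * 60, 12 * 60, 14 * 60) * 60
    counts    <- c(2, 7, 6, 4, 1)
    set.seed(0)
    y_extended <- unlist(mapply(
      function(a, b, k) runif(k * 100, a, b),
      lower_sec, upper_sec, counts, SIMPLIFY = FALSE
    ))

    # Minimal sketch of the log-normal model; priors are assumptions, not
    # necessarily the ones used in the original analysis.
    lognormal_model_code <- "
    data {
      int<lower=1> N;
      vector<lower=0>[N] y;        // 'extended' durations in seconds
      real<lower=0> standard_dev;  // prior standard deviation for mu
    }
    parameters {
      real mu;
      real<lower=0> sigma;
    }
    model {
      mu ~ normal(log(10800), standard_dev); // prior centered at 3 hours
      sigma ~ exponential(1);                // assumption
      y ~ lognormal(mu, sigma);
    }
    generated quantities {
      real median_seconds = exp(mu); // the quantity of interest
    }
    "

    model <- cmdstan_model(write_stan_file(lognormal_model_code))
    fit <- model$sample(
      data = list(N = length(y_extended), y = y_extended, standard_dev = 1.0),
      refresh = 0
    )
    fit$summary(c("mu", "sigma", "median_seconds"))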

A note on the standard deviation of the prior on \(\mu\): because we’re working on the log scale, we need to be especially careful about specifying the prior distribution. See the following table for a comparison of different choices of standard_dev:

Effect of choice of standard deviation on prior
Assuming the distribution of mu is centered at "3 hours"
standard_dev Standard deviations from center
±1 ±2 ±3
0.01 2h 58m – 3h 1m 2h 56m – 3h 3m 2h 54m – 3h 5m
0.10 2h 42m – 3h 18m 2h 27m – 3h 39m 2h 13m – 4h 2m
0.20 2h 27m – 3h 39m 2h 39s – 4h 28m 1h 38m – 5h 27m
0.50 1h 49m – 4h 56m 1h 6m – 8h 9m 40m 10s – 13h 26m
1.00 1h 6m – 8h 9m 24m 22s – 22h 10m 8m 58s – 2d 12h 15m
1.50 40m 10s – 13h 26m 8m 58s – 2d 12h 15m 2m – 11d 6h 3m
2.00 24m 22s – 22h 10m 3m 18s – 6d 19h 47m 27s – 50d 10h 17m
2.50 14m 47s – 1d 12h 32m 1m 13s – 18d 13h 14m 6s – 226d 7m

Based on this table, a standard_dev of 1.0 works quite well for us. What we’re saying is that – before observing the data – we think the typical task completion time is somewhere between about an hour and a whole (8-hour) work day, and most likely closer to 3 hours. This is actually a big overestimate compared to the true 49m 41s, but that’s why we have data! At the “practical” scale that seems very wide, but at the “parameter” scale inside the model it is only 8.287-10.287.
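
That range on the practical scale can be checked directly – a quick sketch, with the prior centered at 3 hours (10,800 seconds):

    # Prior range on the practical scale for standard_dev = 1.0:
    # exp(center ± 1 SD), with the center at log(10800) ~= 9.287 log-seconds
    standard_dev <- 1.0
    exp(log(10800) + c(-1, 1) * standard_dev) / 3600
    # ~1.1 and ~8.2 hours, i.e. roughly an hour to a full work day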

Estimation with different interval-filling strategies
Estimate 95% Credible Interval
\[\mu\]
Uniform 8.47 (8.41, 8.54)
Normal 8.56 (8.49, 8.62)
\[\sigma\]
Uniform 1.84 (1.79, 1.89)
Normal 1.72 (1.68, 1.77)
\[\exp(\mu)\]
Uniform 1h 19m 41s 1h 14m 31s – 1h 25m 26s
Normal 1h 26m 38s 1h 21m 13s – 1h 32m 27s

That is a lot closer to the truth (49m 41s) than the naive approach!

Comparison

Figure 8: Comparison of the various approaches’ performance at estimating the parameter, shown on the parameter scale. Five simulations were randomly chosen to showcase.

On the scale meaningful to us:

Figure 9: Comparison of the various approaches’ performance at estimating the parameter, shown on the practical scale. Five simulations were randomly chosen to showcase.

On the logarithmic version:

Figure 10: Comparison of the various approaches’ performance at estimating the parameter, shown on the logarithmic scale. Five simulations were randomly chosen to showcase.

Mean absolute percentage error (MAPE)
On the parameter scale (log-seconds), not exponentiated to seconds; calculated from 100 simulations for each sample size
Sample size Parameter (Estimation approach)
Median (Naive) Mean (Naive) "Mu" (Ext Naive:Uniform) "Mu" (Ext Naive:Normal) Mu (Bayes:Uniform) Mu (Bayes:Normal)
N=20 9.7% 15.8% 4.6% 6.7% 4.0% 4.5%
NA 9.6% 15.8% 4.3% 6.0% 4.1% 4.4%

Using the Uniform (instead of the Normal) distribution to fill out the intervals was consistently the best way to create the fake data that the model is fit on. Even with 8 responses the method achieves a mean absolute percentage error of less than 5% and is consistently better than any of the other approaches.

Application

Survey

Instrumentation

In a survey focused on dexterity of working with data, data analysts and data scientists were asked to reflect on a recent piece of instrumentation deployed to production.

They were asked the questions:

  1. How long did it take from starting work to instrumentation plan approval (sign-off from primary stakeholder)?
  2. How long did it take from instrumentation plan approval (sign-off from primary stakeholder) to the initial deployment of the instrument to production?
  3. If the instrumentation required a fix & re-deployment, how long did that take?

Deliverables

In that same survey, data analysts and data scientists were asked to reflect on a specific request they worked on in the last month where the deliverable was a report, a dataset, or a data-based insight.

They were asked the question:

Once you began working on that request, how long did it take you to deliver the requested data or actionable insight(s) to the stakeholder(s)?

Responses

And the following responses were given:

Responses to data dexterity survey questions
Time spent/elapsed
30-60 minutes 1-2 hours 2-4 hours 1-5 work days 1-2 weeks 2-4 weeks 1-2 months
Instrumentation plan 0 0 0 1 1 1 0
Instrument deployment 0 0 0 0 0 1 1
Instrument redeployment 0 0 0 1 0 0 0
Deliverable 1 2 1 3 2 0 0

Analysis

We proceed as we did in Example 2, but generate the dataset by sampling from the intervals uniformly, as that approach yielded the best results in the simulation study.
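
A sketch of that expansion for the “Deliverable” responses, with interval bounds converted to minutes under the 8-hour work day and 5-day work week convention used in the results table below (the number of draws per response is, again, an assumption):

    # Uniform interval filling for the "Deliverable" responses; bounds are in
    # minutes, assuming 8-hour work days and 5-day work weeks.
    set.seed(2)
    n_per_response <- 100

    bounds_min <- list(
      c(30, 60),       # 30-60 minutes
      c(60, 120),      # 1-2 hours
      c(120, 240),     # 2-4 hours
      c(480, 2400),    # 1-5 work days
      c(2400, 4800)    # 1-2 weeks
    )
    counts <- c(1, 2, 1, 3, 2)

    deliverable_minutes <- unlist(mapply(
      function(bounds, count) runif(count * n_per_response, bounds[1], bounds[2]),
      bounds_min, counts, SIMPLIFY = FALSE
    ))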

Similarly, we can visualize the resulting distributions in the generated datasets:

Figure 11: “Extended” times using uniform interval-filling approach.

As with Example 2, we assume the time spent on task follows the \(\text{Lognormal}(\mu, \sigma)\) distribution – with \(e^\mu\) being the median of the distribution and the latent quantity we are interested in inferring from the data.

Per-stage time spent gathering and wrangling data
Days calculated as standard 8-hour work days; 5 work days = 1 week
Estimate 95% Credible Interval
Instrumentation plan
\[\mu\] 8.08 (8.01, 8.15)
\[\sigma\] 0.73 (0.68, 0.78)
\[\exp(\mu)\] 1w 1d 6h 1w 1d 2h – 1w 2d 1h
Instrument deployment
\[\mu\] 9.20 (9.15, 9.25)
\[\sigma\] 0.41 (0.38, 0.45)
\[\exp(\mu)\] 4w 4h 44m 3w 4d 4h – 4w 1d 4h
Instrument redeployment
\[\mu\] 7.22 (7.15, 7.28)
\[\sigma\] 0.41 (0.37, 0.46)
\[\exp(\mu)\] 2d 6h 2d 5h – 3d
Deliverable
\[\mu\] 6.20 (6.11, 6.29)
\[\sigma\] 1.64 (1.58, 1.70)
\[\exp(\mu)\] 1d 7h 32m – 1d 1h

According to the fitted model, the median time for delivering requested data or actionable insights in Q2 was 1 work day (8 hours). The median times for instrumentation planning, deployment, and re-deployment were approximately 1.1 weeks, 4 weeks, and almost 3 days, respectively – or about 5.5-6 weeks in total.

Figure 12: Cumulative distributions of estimated work time for instrumentation and request completion – based on posterior draws of the model parameters – showing the probability of an endeavor being completed in some time or earlier.
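
Curves like these can be computed by averaging the log-normal cumulative distribution function over the posterior draws – a minimal sketch, assuming a fitted CmdStanR object fit for one of the questions, with times in the same unit the model was fit on (minutes in the sketch above):

    # Probability that a task is finished within time t, averaged over the
    # posterior draws of (mu, sigma); `fit` is a fitted CmdStanR model and t
    # is in the same unit the model was fit on.
    library(posterior)

    draws <- as_draws_df(fit$draws(c("mu", "sigma")))

    p_done_by <- function(t) {
      mean(plnorm(t, meanlog = draws$mu, sdlog = draws$sigma))
    }

    p_done_by(8 * 60)      # within one 8-hour work day (in minutes)
    p_done_by(5 * 8 * 60)  # within one work week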

Key takeaways based on these cumulative distributions and some rounding:

Discussion

The examples demonstrated here are a promising proof of concept. Perhaps seconds was too granular a unit to work with, and the process would be improved by switching to minutes.

Furthermore, we can explore other distributions in addition to the log-normal. One potential alternative is the Gamma distribution, which also has support over positive real numbers.

Future Work: This methodology can also be extended to include regression – to enable assessing how task completion time differs by team, technology stack, or even by an intervention in an experiment.