If a survey question asks how much time the respondent took to complete a task – with the underlying quantity of interest being “median task completion time” – presenting options such as “5-15 minutes” and “1-2 days” (instead of freeform text fields) reduces cognitive load and makes analysis simpler. The methodology proposed here enables accurate inference on the quantity of interest from survey responses given in the form of intervals.
Consider the following survey question:
The survey authors’ goal is to estimate how much time that task takes on average by asking survey respondents to pick the interval that best captures how much time they spent. A past iteration used a freeform text field, which created difficulties at response time (for people who might worry about providing an exact number) and at analysis time. By providing these options, the survey authors hoped to lessen the survey taker’s cognitive burden and to make the analysis easier.
A natural question, of course, is how one actually uses the responses to answer the question of interest – how much time that task takes on average. One possible, naive approach would be to use the midpoint of each interval – e.g. “2-4 hours” becomes “3 hours” – and take the mean of all the responses. In this article, we propose a methodology for estimating that quantity and show that it greatly outperforms the naive approach, especially on non-linearly scaled ranges.
Given \(R\) responses, where each response \(r \in R\) comes with a range \([a_r, b_r]\), the proposed approach is:

1. For each response \(r\), draw values from its range – either uniformly, \(y \sim \text{Uniform}(a_r, b_r)\), or from \(\mathcal{N}\!\left(b_r - \frac{b_r-a_r}{2}, \frac{b_r-a_r}{6}\right)\) – and pool all draws into an “extended” dataset.
2. Fit the model \(y \sim \mathcal{D}(\theta)\) to the extended dataset and perform inference on \(\theta\),

where \(\mathcal{D}(\theta)\) is the normal \(\mathcal{N}(\mu, \sigma)\) or the \(\text{Log-normal}(\mu, \sigma)\) distribution, depending on which is more appropriate.
Regarding \(\mathcal{N}\!\left(b_r - \frac{b_r-a_r}{2}, \frac{b_r-a_r}{6}\right)\): when values come from a normal distribution, 99.7% of them fall within 3 standard deviations of the mean. If we assume that \(b_r = \mu + 3 \sigma\) and \(a_r = \mu - 3 \sigma\) (any drawn values outside of \([a_r, b_r]\) can safely be clipped), we can derive the respective formulas for \(\mu\) and \(\sigma\).
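Solving those two equations for the two parameters gives

\[
\mu = \frac{a_r + b_r}{2} = b_r - \frac{b_r - a_r}{2}, \qquad \sigma = \frac{b_r - a_r}{6}.
\]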
Suppose the survey taker responded with “between 6 and 12 minutes” – there are two ways to specify what we think the actual duration was. By using the Uniform distribution we’re saying “all values between 6 and 12 minutes are equally likely”, while by using the Normal distribution with the formulas given above we’re saying “there is a 68% probability that the duration was between 8 and 10 minutes, and a 95% probability it was between 7 and 11 minutes”.
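As a quick sketch of what filling in that single “between 6 and 12 minutes” response can look like in R (the number of draws is an illustrative choice, not something the methodology prescribes):

```r
a <- 6
b <- 12
n_draws <- 100  # illustrative; not prescribed by the methodology

# "All values between 6 and 12 are equally likely"
uniform_fill <- runif(n_draws, min = a, max = b)

# Normal centered at the midpoint, with sd = (b - a) / 6, clipped to [a, b]
normal_fill <- rnorm(n_draws, mean = b - (b - a) / 2, sd = (b - a) / 6)
normal_fill <- pmin(pmax(normal_fill, a), b)
```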
We demonstrate the benefits of the proposed method for two data-generating processes. In the first, data comes from a normal distribution and the response intervals are evenly spaced out. In the second, data comes from a log-normal distribution and the response intervals are unevenly spaced out.
In both scenarios we perform step 1 of the algorithm twice to generate two “extended” datasets based on the responses – once by drawing from Uniform distributions and once by drawing from Normal distributions centered at each interval’s midpoint.
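A sketch of how this expansion can be implemented, given the lower bounds, upper bounds, and response counts for the offered intervals (the function and argument names are hypothetical, and the number of draws per response is again an illustrative choice):

```r
# Expand interval responses into an "extended" dataset.
# lower, upper: interval bounds; counts: number of responses per interval.
fill_intervals <- function(lower, upper, counts, method = c("uniform", "normal"),
                           draws_per_response = 100) {
  method <- match.arg(method)
  out <- Map(function(a, b, n) {
    if (n == 0) return(numeric(0))
    total <- n * draws_per_response
    if (method == "uniform") {
      runif(total, min = a, max = b)
    } else {
      # Normal centered at the midpoint with sd = width / 6, clipped to [a, b]
      pmin(pmax(rnorm(total, mean = (a + b) / 2, sd = (b - a) / 6), a), b)
    }
  }, lower, upper, counts)
  unlist(out)
}

# Example: six intervals of width 4, with counts as in the first scenario below
extended_uniform <- fill_intervals(lower = seq(0, 20, 4), upper = seq(4, 24, 4),
                                   counts = c(0, 2, 6, 8, 3, 1))
```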
In this scenario the underlying distribution is \(\text{Normal}(\mu = 12, \sigma = 4)\) and the quantity of interest is \(\mu = 12\):
(0,4] | (4,8] | (8,12] | (12,16] | (16,20] | (20,24] |
---|---|---|---|---|---|
0 | 2 | 6 | 8 | 3 | 1 |
We will aim to estimate \(\mu\) through the naive method described previously and the proposed, more sophisticated method. First, let’s establish a baseline:
Median | Mean | Standard Deviation |
---|---|---|
14.00 | 13.00 | 4.08 |
To understand the differences between filling the intervals uniformly and filling them normally, we can visualize the distributions of the resulting datasets:
The model is specified in Stan, and the CmdStanR package is used to fit it to the “extended” data and infer the parameters of interest.
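A minimal sketch of what such a model can look like (the prior choices here are illustrative assumptions, not necessarily the ones used to produce the results below):

```stan
data {
  int<lower=1> N;   // number of values in the "extended" dataset
  vector[N] y;      // interval-filled observations
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 20);        // weakly informative priors (illustrative choices)
  sigma ~ exponential(0.5);
  y ~ normal(mu, sigma);     // the extended data is modeled as Normal(mu, sigma)
}
```

Fitting then follows the usual CmdStanR workflow – compile with `cmdstan_model()` and draw posterior samples with `$sample()`, passing `N` and `y` as data.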
Estimation with different interval-filling strategies:

| Parameter | Interval-filling distribution | Estimate (95% Credible Interval) |
|---|---|---|
| \(\mu\) | Uniform | 12.975 (12.828, 13.125) |
| \(\mu\) | Normal | 12.975 (12.828, 13.125) |
| \(\sigma\) | Uniform | 4.026 (3.923, 4.139) |
| \(\sigma\) | Normal | 4.026 (3.923, 4.139) |
By repeatedly simulating data, applying the competing estimation approaches, and saving the results, we obtain many absolute percentage errors. Averaging these gives the mean absolute percentage error (MAPE) for a variety of sample sizes, and also lets us investigate differences between filling the intervals uniformly versus with the Normal distribution:
Mean absolute percentage error (MAPE), calculated from 100 simulations per sample size:

| Median (Naive) | Mean (Naive) | Mu (Bayes: Uniform) | Mu (Bayes: Normal) |
|---|---|---|---|
| 15.4% | 6.4% | 6.4% | 6.4% |
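For reference, a skeleton of such a simulation for the naive-mean column might look like the sketch below; the Bayesian columns are produced analogously by fitting the Stan model to each simulated, interval-filled dataset (the sample size and seed here are illustrative):

```r
set.seed(2023)  # illustrative
mu_true <- 12
sigma_true <- 4
breaks <- seq(0, 24, by = 4)   # the offered intervals: (0,4], (4,8], ..., (20,24]

ape <- replicate(100, {
  y <- rnorm(20, mu_true, sigma_true)                 # one simulated sample, N = 20
  bin <- cut(pmin(pmax(y, 0), 24), breaks = breaks,   # clip into the offered range
             include.lowest = TRUE)
  lower <- breaks[as.integer(bin)]
  upper <- lower + 4
  naive_mean <- mean((lower + upper) / 2)             # naive midpoint-based mean
  abs(naive_mean - mu_true) / mu_true                 # absolute percentage error
})
mean(ape)  # MAPE for the naive-mean estimator at this sample size
```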
In practice we rarely encounter the normal distribution; we are much more likely to encounter a skewed distribution like the log-normal, where all values are positive, smaller values are more likely, and yet very large values can still occur:

Imagine most requests requiring no more than a full work day, while occasionally a request takes a week or even a month. This example demonstrates the utility of the proposed methodology in precisely such situations.
Suppose the random variable “time spent completing a task” follows a \(\text{Log-normal}(\mu = 8, \sigma = 2)\) distribution, with time measured in seconds. The median of a log-normal distribution is \(\exp(\mu)\), which comes out to 49m 41s in this case, and it’s the quantity we’re most interested in because the mean is pulled too far to the right by the tail. The median is easy to interpret – half of the tasks are completed in less time, half in more. The mode is also interesting, but because it is calculated from both parameters, our uncertainty about it combines two sources of uncertainty.
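For reference, the standard log-normal facts used here are

\[
\text{median} = e^{\mu}, \qquad \text{mean} = e^{\mu + \sigma^2/2}, \qquad \text{mode} = e^{\mu - \sigma^2},
\]

and with \(\mu = 8\) log-seconds the median is \(e^{8} \approx 2981\) seconds, i.e. 49m 41s.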
Percentiles from a random sample of 1,000 data points drawn from this distribution are:
0% | 10% | 25% | 50% | 75% | 90% | 100% |
---|---|---|---|---|---|---|
5s | 3m 46s | 12m 3s | 44m 10s | 3h 16m 33s | 11h 2m 8s | 23d 17h 8m 52s |
In this sample, half of the “tasks” were completed in 44m 10s or less.
Let’s generate a set of fake responses as we did before, only this time we’re drawing from the log-normal distribution.
0-5 minutes | 5-60 minutes | 1-6 hours | 6-12 hours | 12-14 hours |
---|---|---|---|---|
2 | 7 | 6 | 4 | 1 |
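One way such fake responses can be produced is to draw durations (in seconds) from the log-normal distribution and bin them into the offered intervals; clamping values above the top option into it is an assumption made here only for illustration, and the exact counts will vary with the seed:

```r
set.seed(1)  # illustrative seed
n_responses <- 20
durations <- rlnorm(n_responses, meanlog = 8, sdlog = 2)  # seconds

breaks <- c(0, 5 * 60, 60 * 60, 6 * 3600, 12 * 3600, 14 * 3600)
labels <- c("0-5 minutes", "5-60 minutes", "1-6 hours", "6-12 hours", "12-14 hours")

# Clamp anything above the top option into it (an illustrative assumption)
responses <- cut(pmin(durations, 14 * 3600), breaks = breaks, labels = labels)
table(responses)
```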
If we divide each count of responses by total number of responses, we get the following proportions:
0-5 minutes | 5-60 minutes | 1-6 hours | 6-12 hours | 12-14 hours |
---|---|---|---|---|
10% | 35% | 30% | 20% | 5% |
So the idea is to estimate the “median time spent on task across the team” from these intervals. First, a naive approach: replace each response with the midpoint of its interval, then calculate the median and the mean of those values:
Median | Mean |
---|---|
3h 30m | 3h 41m 38s |
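For reference, a sketch of this naive computation in R (interval bounds converted to seconds; it reproduces the values above):

```r
midpoints <- c(
  "0-5 minutes"  = 2.5 * 60,
  "5-60 minutes" = 32.5 * 60,
  "1-6 hours"    = 3.5 * 3600,
  "6-12 hours"   = 9 * 3600,
  "12-14 hours"  = 13 * 3600
)
counts <- c(2, 7, 6, 4, 1)

# Replace every response with its interval midpoint
naive_values <- rep(midpoints, times = counts)

median(naive_values) / 3600  # 3.5 hours   -> "3h 30m"
mean(naive_values) / 3600    # ~3.69 hours -> "3h 41m 38s"
```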
Again, the true median is 49m 41s, so neither the naive mean nor the naive median handles the skew especially well. Let us try the proposed approach.
First, we generate two datasets – one with observations drawn uniformly from the specified intervals, and one with observations drawn from Normal distributions parameterized according to the specified intervals. In visualizing the resulting distributions we also include the distribution of the sample from earlier – 1000 data points drawn directly from \(\text{Lognormal}(\mu = 8, \sigma = 2)\) – for comparison.
Because of the skewed nature of the data, it’s difficult to make out much detail near 0. We can also visualize the data on the log-scale, in which case the log-normal comparison distribution looks normal:
Before we continue, let us also see what happens if we were to calculate the median for these two datasets:
Interval-filling distribution | Log(Median): approx. mu | Median on practical scale |
---|---|---|
Normal | 9.20 | 2h 44m 11s |
Uniform | 8.81 | 1h 51m 15s |
Compared to the simple naive approach based on midpoints, this already performs better. Let’s call it the “extended naive” approach; we will compare it with the other approaches later.
The model is again specified in Stan and fit with CmdStanR to the “extended” data to infer the parameter of interest – namely, \(e^\mu\).
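A minimal sketch of what such a model can look like; the prior center of \(\log(10800)\) (i.e. “3 hours”) matches the discussion below, while the prior on \(\sigma\) is an illustrative assumption:

```stan
data {
  int<lower=1> N;
  vector<lower=0>[N] y;        // interval-filled durations, in seconds
  real<lower=0> standard_dev;  // standard deviation of the prior on mu
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(log(10800), standard_dev);  // centered at "3 hours" (10800 seconds)
  sigma ~ exponential(1);                 // illustrative prior
  y ~ lognormal(mu, sigma);
}
generated quantities {
  real median_seconds = exp(mu);  // the quantity of interest
}
```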
A note on the standard deviation of the prior on \(\mu\): because we’re working on the log scale, we need to be especially careful about specifying the prior distribution. See the following table for a comparison of different choices of `standard_dev`:
Effect of the choice of standard deviation on the prior, assuming the prior on \(\mu\) is centered at “3 hours”:

| `standard_dev` | 1 SD from center | 2 SD from center | 3 SD from center |
|---|---|---|---|
| 0.01 | 2h 58m – 3h 1m | 2h 56m – 3h 3m | 2h 54m – 3h 5m |
| 0.10 | 2h 42m – 3h 18m | 2h 27m – 3h 39m | 2h 13m – 4h 2m |
| 0.20 | 2h 27m – 3h 39m | 2h 39s – 4h 28m | 1h 38m – 5h 27m |
| 0.50 | 1h 49m – 4h 56m | 1h 6m – 8h 9m | 40m 10s – 13h 26m |
| 1.00 | 1h 6m – 8h 9m | 24m 22s – 22h 10m | 8m 58s – 2d 12h 15m |
| 1.50 | 40m 10s – 13h 26m | 8m 58s – 2d 12h 15m | 2m – 11d 6h 3m |
| 2.00 | 24m 22s – 22h 10m | 3m 18s – 6d 19h 47m | 27s – 50d 10h 17m |
| 2.50 | 14m 47s – 1d 12h 32m | 1m 13s – 18d 13h 14m | 6s – 226d 7m |
Based on this table, a `standard_dev` of 1.0 works quite well for us. What we’re saying is that – before observing the data – we think the typical task completion time is roughly between an hour and a full work day (the ±1 SD range in the table), and most likely close to 3 hours. This is actually a big overestimate compared to the true 49m 41s, but that’s why we have data! At the “practical” scale that prior seems very wide, but at the “parameter” scale inside the model it is 8.287–10.287.
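The conversion between the two scales is just exponentiation; for example, using the “3 hours” prior center discussed above:

```r
center <- log(3 * 3600)                 # ~9.287 on the parameter (log-seconds) scale
exp(c(center - 1, center + 1)) / 3600   # +/- 1 prior SD: ~1.1 and ~8.2 hours
```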
Estimation with different interval-filling strategies:

| Parameter | Interval-filling distribution | Estimate | 95% Credible Interval |
|---|---|---|---|
| \(\mu\) | Uniform | 8.47 | (8.41, 8.54) |
| \(\mu\) | Normal | 8.56 | (8.49, 8.62) |
| \(\sigma\) | Uniform | 1.84 | (1.79, 1.89) |
| \(\sigma\) | Normal | 1.72 | (1.68, 1.77) |
| \(\exp(\mu)\) | Uniform | 1h 19m 41s | 1h 14m 31s – 1h 25m 26s |
| \(\exp(\mu)\) | Normal | 1h 26m 38s | 1h 21m 13s – 1h 32m 27s |
That is a lot closer to the truth (49m 41s) than the naive approach!
On the scale meaningful to us:
On the logarithmic version:
Mean absolute percentage error (MAPE), on the parameter scale (log-seconds, not exponentiated to seconds); calculated from 100 simulations per sample size:

| Sample size | Median (Naive) | Mean (Naive) | "Mu" (Ext Naive: Uniform) | "Mu" (Ext Naive: Normal) | Mu (Bayes: Uniform) | Mu (Bayes: Normal) |
|---|---|---|---|---|---|---|
| N=20 | 9.7% | 15.8% | 4.6% | 6.7% | 4.0% | 4.5% |
| NA | 9.6% | 15.8% | 4.3% | 6.0% | 4.1% | 4.4% |
Using the Uniform (rather than the Normal) distribution to fill out the intervals was consistently the best way to create the fake data the model is fit on. Even with 8 responses the method achieves a mean absolute percentage error (MAPE) below 5% and is consistently better than any of the other approaches.
In a survey focused on dexterity of working with data, data analysts and data scientists were asked to reflect on a recent piece of instrumentation deployed to production.
They were asked the questions:
In that same survey, data analysts and data scientists were also asked to reflect on a specific request they worked on in the last month where the deliverable was a report, a dataset, or a data-based insight.
They were asked the question:
Once you began working on that request, how long did it take you to deliver the requested data or actionable insight(s) to the stakeholder(s)?
And the following responses were given:
Responses to data dexterity survey questions (time spent/elapsed):

| Task | 30-60 minutes | 1-2 hours | 2-4 hours | 1-5 work days | 1-2 weeks | 2-4 weeks | 1-2 months |
|---|---|---|---|---|---|---|---|
| Instrumentation plan | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| Instrument deployment | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Instrument redeployment | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Deliverable | 1 | 2 | 1 | 3 | 2 | 0 | 0 |
We proceed as we did in Example 2, but generate the datasets by sampling from the intervals uniformly, as that approach yielded the best results in the simulation study.
Similarly, we can visualize the resulting distributions in the generated datasets:
As with Example 2, we assume the time spent on task follows the \(\text{Lognormal}(\mu, \sigma)\) distribution – with \(e^\mu\) being the median of the distribution and the latent quantity we are interested in inferring from the data.
Per-stage time spent gathering and wrangling data (days calculated as standard 8-hour work days; 5 days = 1 week):

| Stage | Parameter | Estimate | 95% Credible Interval |
|---|---|---|---|
| Instrumentation plan | \(\mu\) | 8.08 | (8.01, 8.15) |
| Instrumentation plan | \(\sigma\) | 0.73 | (0.68, 0.78) |
| Instrumentation plan | \(\exp(\mu)\) | 1w 1d 6h | 1w 1d 2h – 1w 2d 1h |
| Instrument deployment | \(\mu\) | 9.20 | (9.15, 9.25) |
| Instrument deployment | \(\sigma\) | 0.41 | (0.38, 0.45) |
| Instrument deployment | \(\exp(\mu)\) | 4w 4h 44m | 3w 4d 4h – 4w 1d 4h |
| Instrument redeployment | \(\mu\) | 7.22 | (7.15, 7.28) |
| Instrument redeployment | \(\sigma\) | 0.41 | (0.37, 0.46) |
| Instrument redeployment | \(\exp(\mu)\) | 2d 6h | 2d 5h – 3d |
| Deliverable | \(\mu\) | 6.20 | (6.11, 6.29) |
| Deliverable | \(\sigma\) | 1.64 | (1.58, 1.70) |
| Deliverable | \(\exp(\mu)\) | 1d | 7h 32m – 1d 1h |
According to the fitted model, the median time for delivering requested data or actionable insights in Q2 was 1 work day (8 hours). The median times for instrumentation planning, deployment, and redeployment were approximately 1.3 weeks, 4 weeks, and almost 3 days, respectively – or about 5.5-6 weeks in total.
Key takeaways based on these cumulative distributions and some rounding:
The examples demonstrated here are a promising proof of concept. Perhaps seconds is too granular a unit to work with, and the process would be improved by switching to minutes.
Furthermore, we can explore other distributions in addition to the log-normal. One potential alternative is the Gamma distribution, which also has support over positive real numbers.
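As a sketch of what that swap might look like in Stan (the priors are illustrative assumptions; the Gamma median has no closed form, so it would be computed from posterior draws rather than inside the model):

```stan
data {
  int<lower=1> N;
  vector<lower=0>[N] y;  // interval-filled durations
}
parameters {
  real<lower=0> shape;
  real<lower=0> rate;
}
model {
  shape ~ exponential(1);  // illustrative priors
  rate ~ exponential(1);
  y ~ gamma(shape, rate);
}
generated quantities {
  real mean_time = shape / rate;  // the Gamma mean; the median requires numerical evaluation
}
```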
Future Work: This methodology can also be extended to include regression – to assess how task completion time differs by team, technology stack, or even by an intervention in an experiment.