Approaches to accuracy for Mechanical Turk

September 30th, 2011 by Yali

This is the third blog post in our series on using Amazon’s Mechanical Turk to build scalable business processes. Please see also our introductory post and our second post, getting started with Mechanical Turk.

Amazon’s Mechanical Turk provides a very convenient platform for getting large numbers of workers to perform manual steps as part of large scale business processes, such as cleaning data sets for use in machine-learning algorithms, or moderating content.

However, it is not enough for Mechanical Turk to provide results fast. The results themselves need to be reliable and hence it is critical that companies using Mechanical Turk invest in a suitable strategy for accuracy.

The mirror in the Hubble Space Telescope, the most precise ever made, was initially ground about 2 micrometers off the correct curvature at its edge. The inaccuracy was catastrophic and cost hundreds of millions of dollars to fix

Amazon provides two primary tools to help users validate the accuracy of results. We’ll look briefly at both of these, before outlining a third technique which, used in combination with the first two, can deliver a very rigorous approach to accuracy. These three strategies for accuracy are as follows:


1. Use multiple workers to perform each task independently and compare the results

Mechanical Turk is designed to allow the same task to be given to multiple different workers, and makes it easy to compare their different responses.  It is possible, then, to accept all results where there is consensus amongst the different workers, and to manually check (or even just disregard) results when there is a discrepancy.

The trouble with this approach is that even when all workers give the same answer, it is still possible that they are all wrong. In the latter half of this post, we will show how you can begin to quantify that probability, and hence start to accurately measure the confidence levels of your results.
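To make the consensus check concrete, here is a minimal sketch in Python. The `assignments` list of (HIT, worker, answer) tuples is a hypothetical stand-in for whatever format your downloaded Mechanical Turk results take:

    from collections import defaultdict

    # Hypothetical results downloaded from Mechanical Turk: (hit_id, worker_id, answer)
    assignments = [
        ("hit-1", "worker-a", "yes"),
        ("hit-1", "worker-b", "yes"),
        ("hit-1", "worker-c", "yes"),
        ("hit-2", "worker-a", "yes"),
        ("hit-2", "worker-d", "no"),
        ("hit-2", "worker-e", "yes"),
    ]

    answers_by_hit = defaultdict(list)
    for hit_id, worker_id, answer in assignments:
        answers_by_hit[hit_id].append(answer)

    # Accept HITs where every worker gave the same answer; flag the rest for manual review.
    accepted, needs_review = {}, []
    for hit_id, answers in answers_by_hit.items():
        if len(set(answers)) == 1:
            accepted[hit_id] = answers[0]
        else:
            needs_review.append(hit_id)

    print(accepted)      # {'hit-1': 'yes'}
    print(needs_review)  # ['hit-2']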

2. Make workers perform a set of qualifying tasks

Amazon makes it very easy to define your own qualifications, assign them to some of your workers (based, presumably, on those workers’ accuracy on previous tasks) and then allow only “qualified” workers to complete future tasks.

This functionality makes it straightforward for companies to set pre-task “tests” where the correct answers are known, and workers’ answers can then be compared against the known answers. Workers who answer accurately can be accorded the qualification, enabling them to go on to perform tasks where the answers are not known.

This provides a rigorous method of assessing accuracy. However, it runs the risk that once workers have qualified, the quality of their work declines (because they know they have already “won” the qualification), at just the point they start performing tasks where there is no objective yardstick against which to measure their output.

A variation on this is to qualify workers whose answers commonly agree with those of other workers. The danger here, however, is that workers become qualified based on how frequently they give the “average” answer, rather than necessarily the “right” answer. If most of the workers are giving an incorrect answer, future workers will be judged on whether they agree with those inaccurate workers, leading to an accuracy “death-spiral”.

3. Mix a set of tasks with known answers in with the unknown answers

Another technique (to be used instead of or alongside pre-task qualifications) is to mix a set of known tasks into a batch of mostly unknown tasks. An example here would be to make 5-10% of the tasks be questions where the correct answer is already known. This circumvents the problems identified above, because:

  1. The accuracy of workers completing the tasks is measured against the “right” answers, not against the “average” answers from the pool of workers, preventing the accuracy “death-spiral”, and:
  2. The workers cannot distinguish between those tasks which we use for qualifying and those that we do not, incentivizing them to continue completing all of the tasks accurately.

Given the above advantages, this third approach is Keplar’s recommended approach to answer accuracy when using Mechanical Turk.
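As a rough illustration of how such a batch might be assembled before upload, here is a short Python sketch. The item lists, field names and the 10% gold proportion are all hypothetical, not part of any Mechanical Turk API:

    import random

    # Hypothetical inputs: items we want checked, plus "gold" items whose answer we already know.
    unknown_items = [{"id": f"u{i}", "text": "..."} for i in range(90)]
    known_items = [{"id": f"k{i}", "text": "...", "known_answer": "no"} for i in range(10)]

    def build_batch(unknown, known, seed=42):
        """Shuffle known-answer ("gold") tasks in among the unknown tasks (~10% gold here)."""
        batch = unknown + known
        random.Random(seed).shuffle(batch)
        return batch

    batch = build_batch(unknown_items, known_items)
    gold_ids = {item["id"] for item in known_items}  # kept on our side; workers never see which tasks are gold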

Measuring accuracy

Let us return to the example that we’ve been working on at Keplar: namely checking a dataset of short content items to ensure that each item is in the language that we believe it to be in, so we can use the dataset to train a language detection bot. As you might remember from our previous post, each Mechanical Turk HIT should be “as small as possible”, and we should ask workers to answer closed rather than open questions. So rather than ask “what language is this content in?”, we ask e.g. “is this content in French?”  The purpose is to end up with a data set of thousands of content items that we are very sure are French.

For each HIT on Mechanical Turk, then, there are four possibilities:

  1. Content is in French, and worker confirms it is in French – let this possibility be “FY”
  2. Content is in French, and worker identifies it as not French – let this possibility be “FN”
  3. Content is not in French, and worker identifies it as in French – let this possibility be “GY”
  4. Content is not in French, and worker identifies it as not in French – let this possibility be “GN”

The 4 possibilities can be mapped on a tree diagram:

A probability tree of the different possibilities for each HIT

We created a batch of tasks that asked workers to confirm whether or not a content item was in French. We specified that five workers should examine each item.

In our example, our primary concern was that an item not in French is incorrectly classified as being in French, because the workers fail to notice that it is actually in, say, German. Hence, the possibility that worries us is labelled “GY” on the diagram. We were much less concerned if a content item that really is in French is incorrectly classified as “not in French” (“FN” on the diagram), because disregarding that result simply reduces the size of our output dataset, not its quality or reliability. So our worst-case scenario was that the 5 workers who answer the HIT “is this content item in French?” all got it wrong and said “yes” (“GY”) when the answer was really “no” (“GN”).

Because we were interested in false positives (“GY”s), we measured the likelihood of a worker incorrectly identifying a non-French content item as French. Hence, the 10% of known tasks included in our batch were all content items that were not in French (i.e. known “G”s).

For each worker “n”, we calculated the percentage of known, non-French content items that they correctly identified as not French (“p_n”) and the percentage of known, non-French content items that they incorrectly identified as being in French (“q_n”). To restate:

P(N_n \mid G) = p_n
P(Y_n \mid G) = q_n
p_n + q_n = 1

where N_n is the outcome where worker n answers “No – this content item is not in French”, Y_n is the outcome where worker n answers “Yes – it is in French”, and
p_n and q_n are the measured probabilities for worker n based on their responses to the known “G”s.
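In code, the per-worker measurement might look something like the following sketch (Python; the `gold_responses` list is a hypothetical record of each worker’s answers to the known non-French items only):

    from collections import defaultdict

    # Hypothetical answers to the known non-French ("G") items; the correct answer is always "no".
    gold_responses = [
        ("worker-a", "no"), ("worker-a", "no"), ("worker-a", "yes"),
        ("worker-b", "no"), ("worker-b", "no"), ("worker-b", "no"),
    ]

    counts = defaultdict(lambda: {"correct": 0, "total": 0})
    for worker_id, answer in gold_responses:
        counts[worker_id]["total"] += 1
        if answer == "no":  # correctly identified a known non-French item as not French
            counts[worker_id]["correct"] += 1

    # p_n = share of known "G" items worker n got right; q_n = 1 - p_n
    for worker_id, c in counts.items():
        p_n = c["correct"] / c["total"]
        q_n = 1 - p_n
        print(worker_id, round(p_n, 3), round(q_n, 3))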

Results from workers whose measured accuracy fell below our threshold (e.g. p_n < 95%, equivalently q_n > 5%) were rejected, and these tasks were listed again on Mechanical Turk, to be completed by new workers, whose accuracy was again measured. Eventually, each HIT had been answered five times, in each case by a worker who accurately identified at least 95% of the known bads in the data set as “not French”.

Across the entire data set, we calculated the percentage of known, non-French content items that were correctly identified as not French (“P”) and the percentage that were incorrectly identified as being in French (“Q”). Because we rejected all workers with p_n < 95% (equivalently q_n > 5%), we know that P ≥ 95% and Q ≤ 5%. To restate this mathematically:

P(N \mid G) = P
P(Y \mid G) = Q
P + Q = 1

where P and Q are the measured probabilities across all workers, based on their collective responses to the known “G”s.

Based on this figure, and assuming the workers answer independently of one another, we can work out the probability that if a content item is not in French, all 5 workers would incorrectly classify it as French:

P(Y_1 \cap Y_2 \cap Y_3 \cap Y_4 \cap Y_5 \mid G)
= P(Y \mid G)^5
= Q^5

Then, assuming Q = 5% (in actual fact, it can be no higher than that):

P(Y_1 \cap Y_2 \cap Y_3 \cap Y_4 \cap Y_5 \mid G)
= Q^5
= 0.05^5
= 0.0000003125

That means that there is a less than 0.0001% chance (in fact, about 1 in 3.2 million) that if a content item is not in French, all 5 workers confirm (erroneously) that it is in French.

More generally, if the probability (as measured on a set of known tasks) of inaccuracy on a task = Q, and n workers are asked to confirm the result, and all of them agree, the probability that all the workers are wrong is given by:

P(Y_1 \cap \dots \cap Y_n \mid G) = Q^n
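As a quick sanity check of the arithmetic, here is that formula as a short Python helper (the function name is ours, purely for illustration):

    def prob_all_wrong(q, n):
        """Probability that n independent workers all give the wrong answer,
        given a per-worker error rate q measured on known tasks."""
        return q ** n

    print(prob_all_wrong(q=0.05, n=5))  # 3.125e-07, i.e. the 0.0000003125 figure above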

Returning to our practical problem: we want to know how certain we can be, given that “n” workers have each independently confirmed that a particular content item is in French, that the content item really is in French (i.e. that they are not all wrong). In other words, we want the probability that the content item is in French, given that “n” workers have independently verified that it is:

P(F \mid Y_1 \cap \dots \cap Y_n)
= \frac{P(F \cap Y_1 \cap \dots \cap Y_n)}{P(Y_1 \cap \dots \cap Y_n)}
= \frac{P(F)\,P(Y_1 \cap \dots \cap Y_n \mid F)}{P(F)\,P(Y_1 \cap \dots \cap Y_n \mid F) + P(G)\,P(Y_1 \cap \dots \cap Y_n \mid G)}

Note that the first equality is the definition of conditional probability, and the second follows from rewriting the numerator as P(F)·P(Y_1 ∩ … ∩ Y_n | F) and expanding the denominator with the law of total probability – i.e. this is Bayes’ theorem. The resulting function is of the form:

P(F \mid Y_1 \cap \dots \cap Y_n) = \frac{x}{x + \delta}

where \delta = P(G)\,Q^n
and x = P(F)\,P(Y_1 \cap \dots \cap Y_n \mid F)

δ is incredibly small: it is the product of the probability that a content item is not in French (which should be less than 0.5) and Q^n. As a result, for any plausible value of x, the probability of a content item being in French when all the workers indicate that it is, is going to be very close to 1. For example, if:

Q^n = 0.05^5
and P(G) = 0.5 (which would be surprisingly high)
and x = 0.5 (which would be surprisingly low),
then P(F \mid Y_1 \cap \dots \cap Y_5) ≈ 0.999999688
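Putting the whole calculation together, here is a small Python sketch of that posterior. The function and its parameter names are ours, purely for illustration; the figures are the conservative ones used above:

    def posterior_french(p_f, p_yes_given_f, q, n):
        """P(F | Y_1..n): probability an item really is French given n independent
        "yes" answers, using the per-worker false-positive rate q measured on known tasks."""
        x = p_f * (p_yes_given_f ** n)   # P(F) * P(Y_1..n | F)
        delta = (1 - p_f) * (q ** n)     # P(G) * Q^n
        return x / (x + delta)

    # P(G) = 0.5 and x = 0.5 (so P(Y|F) = 1.0), Q = 0.05, n = 5, as in the example above
    print(posterior_french(p_f=0.5, p_yes_given_f=1.0, q=0.05, n=5))  # ≈ 0.999999688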

In summary

By mixing a set of questions with known answers into a batch of Mechanical Turk questions with unknown answers, companies can track the accuracy of workers and take a statistical approach to measuring their confidence level in the results those workers produce on Mechanical Turk. This makes Mechanical Turk an extremely powerful tool for incorporating input from sometimes-unreliable humans into highly scalable – and reliable – business processes.

In the next post in this series, we will look at how to use Python scripting to start to integrate Mechanical Turk into your business processes in an automated, scalable way.
