At Nielsen, I oversaw a product that involved 12,000 surveys a day (about 4,400,000 surveys a year). If a panelist cheated on the surveys (copied answers from the web) or randomly guessed at answers, those behaviors degraded the quality of our product. So identifying (and then removing) guessing and cheating behaviors was very important.
I started by researching what other companies did to identify random guessing behaviors. I found lots of suggestions on how to identify these behaviors. And what I learned helped me get started. But the “general wisdom” on the subject didn’t go far enough and (as I later learned) was sometimes wrong.
Basically, by using the random guessing “general wisdom” (labeled in the graph below as “Obvious Markers”, the blue line), I could find that 24% of the panelists/surveys were not guessed (the data suggested they spent time and effort to give honest answers) and that 4% were clearly randomly guessed (they answered too quickly and/or just typed “A” for every question, for example). But that left 72% where I just wasn’t sure whether they guessed or not. So I analyzed the behaviors of the obvious guessers, compared them to the obvious non-guessers, and used Bayesian inference models along with what I learned to create new models that did a better job of identifying clear guessing behaviors. I did this again and again (about 19 times) before I was satisfied that I had identified most of the random guessing behaviors.
In the end, I identified 15% of panelists/surveys as clear guessers (the black line below, to the right near 100%). I also found that 71% were clearly not guessers (the black line to the left near 0%), and that left 14% where I still wasn’t sure.
After all that effort, I could use the Bayesian Inference models to score new surveys that came in (very CPU intensive) or use the results of the Bayesian Inference models to train more practical modeling methods that were almost as good.
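To make the idea of scoring a survey with Bayesian inference more concrete, here is a minimal sketch of a naive Bayes update in Python. It is not the model I built at Nielsen. The function name, the prior, and the marker likelihoods are all hypothetical, and the sketch assumes the markers are independent of each other, which real survey behaviors usually are not.

```python
def posterior_guess_probability(prior, marker_likelihoods):
    """Combine a prior guessing probability with observed markers via Bayes' rule.

    prior: assumed probability that a survey was guessed before looking at it.
    marker_likelihoods: one (p_marker_given_guesser, p_marker_given_honest)
        pair per marker observed on this survey. Markers are treated as
        independent (a simplifying assumption).
    """
    # Work in odds form: posterior odds = prior odds * product of likelihood ratios.
    odds = prior / (1.0 - prior)
    for p_given_guesser, p_given_honest in marker_likelihoods:
        odds *= p_given_guesser / p_given_honest
    # Convert odds back to a probability.
    return odds / (1.0 + odds)


# Illustrative numbers only: two markers that are each more common among guessers.
example_markers = [(0.8, 0.2), (0.6, 0.3)]
score = posterior_guess_probability(prior=0.5, marker_likelihoods=example_markers)
print(f"Estimated probability of guessing: {score:.2f}")
```

A score near 1 would put a survey in the "clear guesser" group, a score near 0 in the "clearly not guessing" group, and anything in between in the "not sure" group.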
I also learned some things about guessing from this process that often surprised me. Two examples are below:
I found in my research that most sources agreed that the best way to identify guessers is to look at the average time it takes to complete a survey. The belief is that the people who finish the surveys at record speed are the guessers. And that is true, but there are also lots of guessers who take a long time to finish the survey.
What I found is that what matters isn’t the average time the panelist takes to finish the survey; it is the time they take to answer most of the questions in the survey. Look at the graph below, which shows the more common pattern of how quickly guessing panelists answer a survey of 20 questions. Most questions are sped through, but a few take a long time to answer. Likely, speeding through a long survey is boring, so the guessers get bored and are easily distracted, either by a few specific questions or, more likely, by something external to the survey because they are multitasking. So, while most of the questions are answered very quickly (less than 3 seconds each), the average time for the survey might not look that fast.
I found that while looking at the average time to finish a survey catches some guessers, if you look at the Median Time to answer the questions in the survey you can catch most of them.
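The difference between the two measures is easy to see with a small Python sketch. The answer times below are made up to mimic the pattern described above (a 20-question survey, mostly sped through, with a few long pauses). The 3-second cutoff comes from the pattern described above, but the function name and the way the cutoff is applied are my assumptions for illustration.

```python
import statistics

# Hypothetical per-question answer times in seconds for one 20-question survey:
# 17 questions sped through, 3 questions with long pauses (distraction).
answer_times = [2.0] * 17 + [60.0] * 3


def is_speeding(times, cutoff_seconds=3.0, use_median=True):
    """Flag a survey as sped through if its typical answer time is under the cutoff."""
    typical = statistics.median(times) if use_median else statistics.mean(times)
    return typical < cutoff_seconds


mean_time = statistics.mean(answer_times)
median_time = statistics.median(answer_times)

print(f"Mean time per question:   {mean_time:.1f} s")
print(f"Median time per question: {median_time:.1f} s")
print("Flagged by mean:  ", is_speeding(answer_times, use_median=False))
print("Flagged by median:", is_speeding(answer_times, use_median=True))
```

The three long pauses pull the mean well above the cutoff, so the survey looks normal by that measure, while the median still reflects how the panelist answered most of the questions.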
Another surprise was what I called “D phobia”. Generally, in multiple-choice questions (4 choices: A, B, C, and D), non-guessers will pick each letter about 25% of the time (more or less). So clearly, someone who answers only the letter “C” is likely just guessing. And it is no surprise that people who answer most of the multiple-choice questions with a single letter are randomly guessing. If you look at panelists who answer with a single letter 70%+ of the time, you can identify 32% of the guessers.
But the surprise is that you can identify 50% of the guessers by just looking for people who never answer with the letter “D” (or answer with the letter “D” less than 5% of the time). For some reason, guessers have “D phobia” and avoid clicking on the letter “D” answer. So, even more than by looking for people who answer “C” too often, you can find guessers by looking for those who answer “D” never or almost never.
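Both letter-based markers can be checked with a few lines of Python. This is only a sketch: the 70% and 5% thresholds are the ones mentioned above, but the function names and the sample answer strings are hypothetical.

```python
from collections import Counter


def letter_shares(answers):
    """Return the share of answers given to each of the letters A-D."""
    counts = Counter(answers)
    total = len(answers)
    return {letter: counts.get(letter, 0) / total for letter in "ABCD"}


def has_single_letter_bias(answers, threshold=0.70):
    """True if one letter accounts for at least `threshold` of all answers."""
    return max(letter_shares(answers).values()) >= threshold


def has_d_phobia(answers, threshold=0.05):
    """True if the letter D is chosen less than `threshold` of the time."""
    return letter_shares(answers)["D"] < threshold


# Hypothetical 20-question surveys, one letter per question.
surveys = {
    "all C": "C" * 20,
    "avoids D": "ABCABCCBAACBBACABCAB",
    "balanced": "ABCD" * 5,
}

for name, answers in surveys.items():
    print(name, has_single_letter_bias(answers), has_d_phobia(answers))
```

Note that the "avoids D" survey spreads its answers across A, B, and C, so the single-letter check misses it while the "D phobia" check catches it.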