Advanced Techniques for Solid Research: How to Clean Your Data
As you may remember from “Product Research 101” and “Product Research 201,” ensuring the right people are taking your survey from the get-go is crucial for producing actionable data. This comes down to sourcing, supply, and qualifying.
“Product Research 301: An Advanced Guide to Analyzing & Employing Data” shares your next step: data cleaning.
Cleanliness is next to godliness. Nowhere is this more true than when it comes to survey data. If you don’t have clean data, you probably won’t get the insights you were seeking when you embarked on your research.
Data cleaning is the process of ensuring your data is reliable by using tools to filter out problematic responses that could affect the validity of your results. There are several different methods, and it’s important to use a multifaceted approach.
Data cleaning may include flagging, and potentially removing, the following types of participants:
Speeders: These people move through the survey too quickly to provide meaningful answers. Different methods can be employed to determine how fast is too fast, either at the survey or the question level, and these checks are often automated by platforms (see the first sketch after this list).
Straight-liners: With a grid question, if someone repeatedly chooses the same answer, for example choosing “5” all the way down the line, it can indicate inattentiveness or other poor survey-taking behavior. However, first ask whether it’s reasonable to expect someone to give the same answer, say “5-extremely satisfied,” to every question. This could be legitimate behavior, so don’t assume all straight-lining is bad. To assess straight-lining accurately, include reverse-worded questions, which will catch those who aren’t paying attention (see the second sketch after this list).
Bad open-ends: With open-end questions, you’ve asked participants to answer in their own words. Here, you want to weed out gibberish, copied-and-pasted answers, or anything nonsensical. There is a bit of science and a bit of art here, as many responses sit on the fringe of what you might deem reasonable. If you ask, for example, whether someone is interested in a feature and they respond, “I’m not sure,” that could be a low-effort response, or it could be completely legitimate; they may not have fully understood your question or concept. People may also give quite short, one- or two-word responses. While lengthy open-ends are great, we shouldn’t assume short ones aren’t valuable. The general guidance is to exclude clearly poor open-ends but include ones that sit in grey areas (see the third sketch after this list).
Bots: While most platforms have some form of bot detection and prevention built in, nothing is bulletproof, and you may see questionable responses make their way into your data (and you definitely will if the platform doesn’t employ such preventive measures). You may notice responses that are coherent but repeated across all open-ends, or language that seems a bit, well, robotic. This could be the result of a bot entering the survey, or of a human behaving like one, such as copying and pasting or running scripts to complete surveys. In the grand scheme of things, many platforms do a great job of weeding out problematic responses, most of which you’ll never see. The final assessment, however, rests with you, the person using the data; the duplicate check in the third sketch after this list is one simple signal.
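To make these checks concrete, here is a minimal sketch in Python (using pandas) of a speeder flag. Everything in it is an illustrative assumption, not DISQO’s implementation: the DataFrame, its column names, and the one-third-of-median-time rule of thumb are all stand-ins for whatever your platform or survey warrants.

```python
import pandas as pd

# Hypothetical respondent data; column names are illustrative, not a real schema.
df = pd.DataFrame({
    "respondent_id":      [1, 2, 3, 4, 5],
    "completion_seconds": [240, 310, 45, 280, 50],
    "q1": [5, 4, 5, 2, 5],           # grid items on a 1-5 satisfaction scale
    "q2": [4, 4, 5, 3, 5],
    "q3": [5, 3, 5, 2, 5],
    "q4_reversed": [2, 3, 5, 4, 5],  # reverse-worded: attentive "5" fans should flip here
    "open_end": [
        "The export feature saves me an hour a week.",
        "Not sure",
        "asdf jkl",
        "I like the new dashboard layout.",
        "asdf jkl",
    ],
})

# Flag anyone who finished in under a third of the median completion time,
# one common rule of thumb. Tune the threshold to your survey's length.
median_time = df["completion_seconds"].median()
df["speeder_flag"] = df["completion_seconds"] < median_time / 3
```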
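Straight-lining can be flagged in the same spirit, continuing with the hypothetical df above. Including the reverse-worded q4_reversed item in the grid is what separates inattentive straight-liners from people who genuinely are “5-extremely satisfied” across the board:

```python
# Continuing with the hypothetical df from the speeder sketch.
grid_cols = ["q1", "q2", "q3", "q4_reversed"]

# Exactly one distinct value across the whole grid means straight-lining.
# Because q4_reversed is reverse-worded, a genuinely satisfied respondent
# should flip there and will NOT be flagged; only the inattentive will.
df["straightline_flag"] = df[grid_cols].nunique(axis=1) == 1
```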
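For open-ends, and the bot-style duplicates described above, a few rough heuristics can triage responses for human review. The thresholds below are assumptions to tune, and real platforms use far more sophisticated natural language processing; this is a sketch, not a substitute:

```python
# Continuing with the hypothetical df from the sketches above.
answers = df["open_end"].str.strip().str.lower()

# Identical answers repeated across respondents suggest copy-paste or bot
# behavior; requiring 2+ repeats is an assumed threshold.
counts = answers.value_counts()
df["dupe_flag"] = answers.isin(counts[counts >= 2].index)

# A crude gibberish signal: almost no vowels relative to length.
df["gibberish_flag"] = answers.str.count(r"[aeiou]") / answers.str.len() < 0.2

# Very short answers go to a review queue, not straight removal;
# "Not sure" can be a perfectly legitimate response.
df["short_open_end"] = answers.str.split().str.len() <= 2
```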
Please remember: Just because someone provides one bad response or straight-lines through one question does not mean their feedback isn’t valuable or that they’re a bad actor. If they fail multiple quality checks (try the 3-strikes-you’re-out rule, a simple tally sketched below) or exhibit truly egregious behavior, they should be removed. But it’s usually fair to err on the side of inclusion, especially if keeping these people in the dataset doesn’t change the story the data tells.
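In code, a 3-strikes rule is just a tally over whichever flags you’ve computed. Which flags count as strikes is your call; the short-answer flag is deliberately left out here, matching the guidance above to err on the side of inclusion:

```python
# Continuing with the hypothetical df from the sketches above.
strike_cols = ["speeder_flag", "straightline_flag", "dupe_flag", "gibberish_flag"]
df["strikes"] = df[strike_cols].sum(axis=1)

keep = df[df["strikes"] < 3]     # err on the side of inclusion
review = df[df["strikes"] >= 3]  # removal candidates, worth a final human look
```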
Manually verifying data quality and validity takes time and effort, and it’s still easy to miss things along the way. Automated data quality checks are your best friends through this process: they check and clean data quickly and thoroughly without a human (you) having to be involved. Platforms like the DISQO CX platform proactively take such measures to address data quality, making it much easier to ensure your data is squeaky-clean and ready to deploy.
For more tips to ensure your data is in tiptop shape and ready for analysis, download “Product Research 301: An Advanced Guide to Analyzing & Employing Data.”
“While all survey data requires some level of manual cleaning, and people should expect to see some less-than-pristine responses, we reduce the burden of data cleaning as much as possible by employing a series of automated quality checks and removals, from evaluating participants before they begin a survey to applying natural language processing to remove problematic open-ends after data collection,” says Roddy Knowles, DISQO’s senior director of research and product-led growth.