Just last month an article was shared which showed that over 30% of the data used by Google for one of their shared machine learning models was mislabeled. Not only was the model itself full of errors, but the training data used by that model was full of errors too. How could anyone using Google’s model ever hope to trust the results if it’s full of human-induced errors that computers can’t fix? And Google isn’t alone in mislabeling data. A 2021 MIT study found that nearly 6% of images in the industry-standard ImageNet database are mislabeled and, further, found “labeling errors in test sets of 10 of the most commonly used computer vision, natural language, and audio datasets.” How can we expect to trust or use these models if the data used to train them is so bad?
The answer is that you cannot trust this data or these models. With AI, garbage in is most definitely garbage out, and AI projects suffer from a serious garbage data problem. If Google, ImageNet and others are making this mistake, you are definitely making this mistake too. Cognilytica research shows that more than 80% of the time spent on AI projects is spent managing data, from collecting and aggregating that data to cleaning and labeling it. Even with all this time, errors are bound to happen, and that’s if the data is of good quality to begin with. Bad data means bad results. This has been the case for all kinds of data-driven projects for decades, and now it’s a significant problem for AI projects, which are essentially just Big Data projects.
Data quality is more than just “bad data”
Data is at the heart of AI. What drives AI and ML projects is not the programmatic code, but rather the data from which learning must be derived. Too often, organizations move too quickly with their AI projects only to realize later that poor data quality is causing their AI systems to fail. If your data is not in a good quality state, don’t be surprised when your AI projects are plagued with problems.
Data quality is not limited to “bad data” such as incorrect data labels, missing or erroneous data points, noisy data or poor quality images. Major data quality issues also arise when you acquire or merge datasets. They also occur when capturing data and enhancing data with third-party datasets. Each of these actions, and more, introduces many potential sources of data quality issues.
So how do you assess the quality of your data before you even start your AI project? It’s important to assess the state of your data from the start, rather than going ahead with your AI project only to realize too late that you don’t have the good quality data your project needs. Teams must identify their data sources, such as streaming data, customer data, or third-party data, and then work out how to successfully merge and combine data from these different sources. Unfortunately, most data does not arrive in a good, usable state. You must remove superfluous, incomplete, duplicate, or otherwise unusable data. You will also need to filter this data to help minimize bias.
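As a rough illustration of the removal step, here is a minimal Python sketch that drops incomplete and duplicate records and reports how many of each it found. The function name clean_records, the record layout, and the choice to treat records with identical required fields as duplicates are all assumptions made for this example, not a prescribed method.

```python
from collections import Counter

def clean_records(records, required_fields):
    """Drop incomplete and duplicate records; return (kept, report).

    Hypothetical helper: assumes each record is a dict of scalar values.
    """
    seen = set()
    kept = []
    report = Counter()
    for rec in records:
        # Incomplete: a required field is missing, None, or an empty string.
        if any(rec.get(f) in (None, "") for f in required_fields):
            report["incomplete"] += 1
            continue
        # Duplicate: same required-field values as an earlier kept record.
        key = tuple(rec[f] for f in required_fields)
        if key in seen:
            report["duplicate"] += 1
            continue
        seen.add(key)
        kept.append(rec)
    report["kept"] = len(kept)
    return kept, dict(report)

# Illustrative data only.
sample = [
    {"id": 1, "label": "boat"},
    {"id": 2, "label": ""},         # incomplete: empty label
    {"id": 1, "label": "boat"},     # duplicate of the first record
    {"id": 3, "label": "no_boat"},
]
kept, report = clean_records(sample, ["id", "label"])
print(report)
```

Keeping the counts alongside the cleaned data matters as much as the cleaning itself: if a large share of records is being thrown away, that is a data quality finding to raise before any model is built.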
But we’re not done yet. You will also need to think about how the data needs to be transformed to meet the specific requirements you have. How are you going to implement data cleansing, data transformation, and data manipulation? Not all data is created equal, and over time you will experience data degradation and drift.
Have you thought about how you’re going to monitor and evaluate that data to make sure the quality stays at the level you need? If you need labeled data, how do you get that data? There are also data augmentation steps to possibly consider. If you need to augment the data, how are you going to monitor that? Yes, there are a lot of steps involved in data quality, and these are all things you need to think about for your project to be successful.
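One simple way to make this kind of monitoring concrete is to compare each new batch of a numeric feature against a reference sample. The Python sketch below measures how far the batch mean has moved, in units of the reference standard deviation. The function names and the 0.5 threshold are assumptions chosen for illustration; real projects often use richer statistical tests.

```python
import statistics

def mean_shift(reference, current):
    """Distance between the current and reference means,
    measured in reference standard deviations."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    if ref_std == 0:
        raise ValueError("reference values have no spread")
    return abs(statistics.mean(current) - ref_mean) / ref_std

def has_drifted(reference, current, threshold=0.5):
    """Flag drift when the mean shift exceeds the (assumed) threshold."""
    return mean_shift(reference, current) > threshold

# Illustrative values only: a reference sample and a later batch.
reference = [10, 12, 11, 13, 9, 11, 10, 12]
new_batch = [15, 16, 14, 17]
print(has_drifted(reference, new_batch))
```

A check like this is cheap enough to run on every incoming batch, which is the point: drift caught at ingestion costs far less than drift discovered through failing predictions.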
Data labeling in particular is a common area where many teams get stuck. For supervised learning approaches to work, they need to be fed good, clean, well-labeled data so they can learn from example. If you are trying to identify images of boats in the ocean, you need to feed the system good, well-labeled images of boats to train your model. This way, when you give it an image that it has never seen before, it can tell you with a high degree of certainty whether or not the image contains a boat. If you are only training your system with boats in the ocean on sunny days with no cloud cover, then how should the AI system react when it sees a boat at night or a boat under 50% cloud cover? If your test data doesn’t match real-world data or real-world scenarios, you’re in for a problem.
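The boat example suggests a basic coverage check: list the conditions that show up in real-world data and see how well each is represented in the training set. The Python sketch below assumes every image carries a condition tag such as "sunny" or "night"; the tags, the function name coverage_gaps, and the 5% minimum share are hypothetical choices for illustration.

```python
from collections import Counter

def coverage_gaps(train_conditions, real_conditions, min_share=0.05):
    """Return real-world conditions whose share of the training data
    falls below min_share, mapped to that share."""
    train_counts = Counter(train_conditions)
    total = len(train_conditions)
    gaps = {}
    for condition in sorted(set(real_conditions)):
        # Counter returns 0 for conditions never seen in training.
        share = train_counts[condition] / total if total else 0.0
        if share < min_share:
            gaps[condition] = share
    return gaps

# Illustrative tags only: training data is almost entirely sunny-day images.
train = ["sunny"] * 97 + ["cloudy"] * 3
real_world = ["sunny", "cloudy", "night"]
print(coverage_gaps(train, real_world))
```

Any condition reported here is one where the model has seen few or no examples, so its predictions under that condition deserve little trust until more labeled data is collected.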
Even when teams spend a lot of time ensuring their test data is perfect, the quality of training data often doesn’t reflect real-world data. In one publicly documented example, AI industry leader Andrew Ng explained how, in his project with Stanford Health, the quality of data in his test environment did not match the quality of medical images in the real world, rendering his AI models useless outside of the test environment. This caused the entire project to fail, putting millions of dollars and years of investment at risk.
Planning for project success
All of this activity focused on data quality can seem overwhelming, which is why these steps are often skipped. But of course, as stated above, bad data is what kills AI projects. So neglecting these steps is a major cause of AI project failure overall. That’s why organizations are increasingly adopting best practice approaches such as CRISP-DM, Agile, and Le to ensure they don’t miss or skip crucial data quality steps that will help avoid AI project failure.
It is all too common for teams to move forward without planning for project success. In fact, the second and third phases of both the CRISP-DM and CPMAI methodologies are “Data Understanding” and “Data Preparation”. These phases come before any model building begins and are therefore considered best practice for organizations looking to succeed with AI.
Indeed, if the Stanford medical project had adopted CPMAI or a similar approach, the team would have realized well before spending millions of dollars and several years that data quality issues would sink the project. While it might be heartening to realize that even luminaries like Andrew Ng and companies like Google make serious data quality mistakes, you still don’t want to unnecessarily join this club and let data quality issues affect your AI projects.