With so much data available from all parts of the business, it is often hard to determine which data is useful for which decisions. It is also widely accepted that deeper insights come when we combine data from various aspects of the business and take a more holistic view of the problem itself. This is precisely the question that we raised in our previous discussion. Once we have framed the right questions, we need to determine which datasets must be fed into the AI system for it to derive the right answers.
Any large enterprise will typically have its data distributed across multiple systems of record residing in various departments, each with strict ownership, access and usage rights. The map below broadly shows the steps involved in the workflow to identify and prepare the data for training an AI model.
The first challenge is to locate the right data sources and get access to them. Once that is done and the data is sourced and collated into a central location, the task of detailed analysis begins. This is probably the most critical step and typically takes the most time. The analysis should reveal the gaps between the present format of the data and the desired format that can be fed to the AI model. It is very important to have a business (domain) specific view of the data at this stage, so it should involve experts who understand the business context and what is desired from the AI model as output. Any missing data is also identified in this step, and we should then loop back to the first step to locate and source it either from within or (sometimes) from outside the enterprise.
The data transformation and cleaning exercise that follows should continue to be a joint effort between the data scientists and the domain experts, concluding with a thorough data validation exercise. Once the desired set of data is identified and cleansed, it needs to be tagged for supervised learning of the AI model. The tagging exercise can be very detailed and time-consuming, and it can be performed by relatively lower-skilled resources, but it is again imperative to implement strict QC measures with the oversight of domain experts. Only this ensures that the tagged data is of the highest quality, so that the AI model learns from it appropriately and produces the desired results.
Finally, the data scientists split the tagged data set into training and validation sets before they proceed to train the AI model.
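To make two of these steps a little more concrete, here is a minimal sketch in Python of the gap analysis and the final training/validation split. It is an illustration under stated assumptions and not a prescribed implementation: the field names in `REQUIRED_FIELDS`, the sample records, and the helper names `find_gaps`, `is_complete` and `split_dataset` are all hypothetical, and a real project would take its required fields from the domain experts.

```python
import random

# Hypothetical schema: the fields we assume the AI model needs.
REQUIRED_FIELDS = ("customer_id", "text", "label")


def find_gaps(records, required_fields=REQUIRED_FIELDS):
    """Count, per required field, the records where it is missing or empty."""
    gaps = {field: 0 for field in required_fields}
    for record in records:
        for field in required_fields:
            if record.get(field) in (None, ""):
                gaps[field] += 1
    return gaps


def is_complete(record, required_fields=REQUIRED_FIELDS):
    """True when every required field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in required_fields)


def split_dataset(records, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the records and return (training, validation)."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Made-up sample data standing in for the collated, tagged records.
records = [
    {"customer_id": 1, "text": "late delivery", "label": "complaint"},
    {"customer_id": 2, "text": "", "label": "praise"},
    {"customer_id": 3, "text": "great support", "label": None},
    {"customer_id": 4, "text": "refund please", "label": "complaint"},
    {"customer_id": 5, "text": "thanks", "label": "praise"},
]

gaps = find_gaps(records)  # tells us what to go back and source
complete = [r for r in records if is_complete(r)]
training, validation = split_dataset(complete, validation_fraction=0.34)
```

The gap counts are what would send us back to the sourcing step, and only the complete, tagged records reach the split. In practice the incomplete records would be repaired with the domain experts where possible instead of simply being dropped.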
While the overall flow of the data remains the same when training most AI models, the individual steps can become extremely complex and time-consuming when dealing with deep learning models that process images, videos or audio files as inputs. Without going into details of the processes themselves, it is easy to imagine the difference between analyzing a set of 500 rows of text data and a set of 500 video or audio files. And in real-world use cases, training data sets are typically orders of magnitude larger than that.
As discussed in the last two threads on this topic, enterprises often underestimate the time, effort and cost of this step while planning to implement AI to improve their business outcomes. There are several documented examples where efforts to implement AI applications within the enterprise ecosystem have led to a 3 to 5-year data consolidation and transformation program. While some will go for the big bang approach, it is definitely not the only way to get there. A focused and well thought out AI strategy can target very specific business outcomes and ring-fence the data requirements after a detailed analysis. Several such smaller projects with relatively independent data requirements can also be executed in parallel and spread across various divisions of the enterprise. These can be coupled with existing data transformation programs that may already be underway within certain areas of the enterprise. In fact, a strategy like this can bring other benefits: smaller projects help improve the overall data quality of the enterprise and in turn enable it to execute bigger, cross-division transformation programs.
There are many ways for an enterprise to make a successful transition into the world of data-driven decision making powered by AI – one size certainly does not fit all! However, no matter which path one chooses, it is imperative that such programs be seen as part of the larger data ecosystem of the enterprise and that the right resources be allocated at the right stages.