Generative AI is today one of the most talked-about topics, from boardrooms to school classrooms. The reason is that it serves a fundamental need shared by all: creating relevant, realistic content from multiple inputs drawn from contextual sources in different forms.
Data is at the heart of any AI-led business transformation -- whether developing new products and services or elevating customer experience -- and Generative AI is no exception. The variety, volume, and veracity of data directly affect the outcomes and efficacy of Generative AI models. This article examines the critical aspects of data for Generative AI: how data is used to train Large Language Models (LLMs) and determines the value of their outcomes; how policies and prompts can remove biases and help ensure privacy and security; and how outcomes can be evaluated against the right datasets to verify efficacy.
The Importance of Volume and Quality
Large volumes of data allow a generative AI model to capture greater dimensionality and a broader range of patterns, enabling it to generate more versatile and apt outputs. Leveraging data not only from the enterprise, but also from the ecosystem in which it participates and from the wider universe that could affect the organization (e.g., public or open datasets), improves the value of the generated outcomes.
The relevance of data sources, data quality (granularity and level of detail), and provenance -- a documented trail of the data's journey -- become even more important in light of what Generative AI is already capable of. As everyone would appreciate: garbage in, garbage out. Appropriate pre-processing ensures completeness and compatibility (for external dataset integration and interoperability) and improves the model's ability to learn. Models can then be tuned to absorb only relevant data and ignore the noise. Proper data governance, data privacy, and data security are therefore critical to the success of Generative AI for any enterprise.
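As a minimal illustrative sketch of such pre-processing -- not a prescribed pipeline, and with thresholds and helper names chosen purely for illustration -- a training-data filter might clean markup, discard trivial fragments, and drop exact duplicates before anything reaches the model:

```python
import hashlib
import re

def clean_record(text: str) -> str:
    """Normalize whitespace and strip stray markup from a raw text record."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop leftover HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

def preprocess(records: list[str], min_length: int = 40) -> list[str]:
    """Keep cleaned, non-trivial, de-duplicated records for training."""
    seen, kept = set(), []
    for raw in records:
        text = clean_record(raw)
        if len(text) < min_length:            # treat very short fragments as noise
            continue
        digest = hashlib.sha256(text.lower().encode()).hexdigest()
        if digest in seen:                    # exact-duplicate filter
            continue
        seen.add(digest)
        kept.append(text)
    return kept

if __name__ == "__main__":
    sample = ["<p>Quarterly churn fell by 3% after the loyalty revamp.</p>",
              "Quarterly churn fell by 3% after the loyalty revamp.",
              "ok"]
    print(preprocess(sample))  # one clean record survives
```

Real pipelines would add near-duplicate detection, PII scrubbing, and language or domain filters on top of this skeleton.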
AI for Data: A Complementary Approach
While obtaining 'Data for AI' matters, applying 'AI for Data' is also important. When training Generative AI models, augmenting the input data through techniques such as data synthesis -- introducing variations on real data based on multiple factors -- can generate more diverse samples and improve the model's ability to handle different scenarios. The right metadata, provenance, and lineage of the data must be tracked and maintained. Quality means the right data at the right time and at the right level of granularity. Real data from trusted or reputed sources (or partners) within an ecosystem helps ensure that an organisation does not end up, as a bad actor, using generative AI tools for malicious purposes in that ecosystem. Weak data governance, especially with respect to data security, will lead to undesirable, if not dangerous, outcomes.
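To make the data-synthesis idea concrete, here is a deliberately simple sketch: word-level perturbation of a real record to produce synthetic variants. Richer approaches (LLM-based paraphrasing, tabular data synthesizers) follow the same pattern of deriving controlled variations from trusted real data; the function and parameters below are illustrative assumptions, not a specific product's API.

```python
import random

def augment(text: str, n_variants: int = 3, drop_prob: float = 0.1, seed: int = 7) -> list[str]:
    """Create synthetic variants of a real record via random word dropout and a local swap."""
    rng = random.Random(seed)
    words = text.split()
    variants = []
    for _ in range(n_variants):
        sample = [w for w in words if rng.random() > drop_prob]  # random word dropout
        if len(sample) > 3:
            i = rng.randrange(len(sample) - 1)
            sample[i], sample[i + 1] = sample[i + 1], sample[i]   # swap adjacent words
        variants.append(" ".join(sample))
    return variants

if __name__ == "__main__":
    real = "The customer reported a billing error on the March invoice"
    for variant in augment(real):
        print(variant)
```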
Addressing Bias in Generative AI
Another dimension worth considering is that Generative AI models tend to absorb biases present in the training data. If the training data is biased -- failing to fairly represent the entire population -- the generated output can exhibit the same biases, potentially leading to unfair or discriminatory outcomes. It is therefore crucial to carefully curate and evaluate the training data so that ethical considerations are addressed. While real-world data is important, it may sometimes be prudent to use synthetic data, or to generate data from a real-world base by applying multiple factors -- possibly leveraging AI/ML techniques -- to obtain the 'right' data, free of biases and privacy violations, that can then be fed into the model.
"It is crucial to carefully curate and evaluate the training data to ensure that ethical considerations are considered."
Data Engineering and Prompt Engineering
Data engineering, or data preparation, is therefore key to success. Content needs to be high quality before the LLMs are customized in any fashion. With ChatGPT or Google Bard, curated domain databases are widely available, but they are still generic. Companies need to rely on human curation to ensure that knowledge content is accurate, timely, and complete. Some mature organizations have knowledge workers who constantly score documents along multiple criteria to determine their suitability for incorporation into GPT-4 or PaLM 2 based systems. It is part of their data governance structure.
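A lightweight sketch of such document scoring might look like the following. The criteria, weights, and threshold are illustrative assumptions, not the scoring scheme of any particular organization; the point is simply that curation decisions can be made explicit and auditable.

```python
from dataclasses import dataclass

@dataclass
class DocumentScore:
    """Curation scores (0-1) assigned by knowledge workers; criteria are illustrative."""
    accuracy: float
    timeliness: float
    completeness: float

def suitable_for_ingestion(score: DocumentScore, threshold: float = 0.75) -> bool:
    """Weighted aggregate of curation scores; only documents above the threshold are ingested."""
    weights = {"accuracy": 0.5, "timeliness": 0.2, "completeness": 0.3}
    aggregate = (weights["accuracy"] * score.accuracy
                 + weights["timeliness"] * score.timeliness
                 + weights["completeness"] * score.completeness)
    return aggregate >= threshold

if __name__ == "__main__":
    print(suitable_for_ingestion(DocumentScore(accuracy=0.9, timeliness=0.8, completeness=0.7)))  # True
    print(suitable_for_ingestion(DocumentScore(accuracy=0.6, timeliness=0.9, completeness=0.5)))  # False
```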
It is also important to understand the link between data engineering and prompt engineering, which ensures the models receive the right inputs to act on. As an HBR report on Generative AI rightly highlights, content creators need to be guided, through training or policies, to create and tag useful content, craft effective prompts (specifying which types of prompts and dialogues are allowed and which are not), use the system's responses appropriately with customers and partners, and ensure there are no enterprise privacy or security violations. It ultimately boils down to the quality of content, created and curated as part of data preparation, on which the LLMs are trained.
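Such prompt policies can be partially enforced in code before a prompt ever reaches the model. The sketch below is a minimal, hypothetical example: the allowed topics and blocked patterns are placeholders an enterprise would replace with its own policy catalogue.

```python
import re

# Illustrative policy: topics the assistant may handle and patterns it must refuse.
ALLOWED_TOPICS = {"billing", "product usage", "order status"}
BLOCKED_PATTERNS = [
    re.compile(r"\b(ssn|social security|passport number)\b", re.IGNORECASE),  # personal identifiers
    re.compile(r"\b(salary|payroll)\s+of\s+\w+", re.IGNORECASE),              # employee-private data
]

def prompt_allowed(prompt: str, topic: str) -> bool:
    """Apply the enterprise prompt policy before the prompt reaches the model."""
    if topic not in ALLOWED_TOPICS:
        return False
    return not any(pattern.search(prompt) for pattern in BLOCKED_PATTERNS)

if __name__ == "__main__":
    print(prompt_allowed("Why was my March invoice higher than usual?", "billing"))  # True
    print(prompt_allowed("Share the salary of the account manager", "billing"))      # False
```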
Avoiding "Hallucinations" Through Evaluation
Finally, organizations adopting generative AI should avoid what are termed 'hallucinations': model outputs that are nonsensical or outright false. They need the right evaluation mechanism for their domain, e.g., validating against relevant public datasets -- cross-verifying certain identities against public financial datasets for financial investment outcomes, against public identity datasets for answering telecom usage questions, or against security-related data for SOC 2 compliance in an IT services company. Organizations may also create policies through a series of "pre-prompts" that tell the generative AI system which types of questions it should answer and which it should avoid.
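The sketch below illustrates both ideas under stated assumptions: a pre-prompt that constrains what the system answers, and a cross-check of a generated figure against a trusted reference dataset. The fund identifiers, NAV values, and tolerance are hypothetical; the pattern, not the data, is the point.

```python
# A "pre-prompt" instructing the system which questions to answer, plus a simple
# cross-check of a generated figure against a trusted reference dataset.
PRE_PROMPT = (
    "You answer only questions about our published investment products. "
    "Decline questions about individuals, legal advice, or figures you cannot verify."
)

# Hypothetical reference data an enterprise might hold or license.
REFERENCE_NAV = {"GROWTH-FUND-A": 142.37, "INCOME-FUND-B": 98.12}

def verify_nav_claim(fund_id: str, claimed_nav: float, tolerance: float = 0.01) -> bool:
    """Flag a model-generated NAV figure that deviates from the reference dataset."""
    reference = REFERENCE_NAV.get(fund_id)
    if reference is None:
        return False                      # unknown fund: treat the claim as unverified
    return abs(claimed_nav - reference) / reference <= tolerance

if __name__ == "__main__":
    print(verify_nav_claim("GROWTH-FUND-A", 142.40))  # True  -- within 1% of reference
    print(verify_nav_claim("GROWTH-FUND-A", 180.00))  # False -- likely hallucinated
```

In production this kind of check typically sits behind a retrieval layer, so the model cites the reference record rather than being corrected after the fact.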
Conclusion
Overall, the criticality of data in generative AI cannot be overstated. The quality, quantity, diversity, and ethical soundness of the data used for training and fine-tuning the models directly influence their performance, accuracy, and impact across applications. Organizations should conduct a formal assessment of their maturity in foundational data and AI, and of their readiness for generative AI, from strategic, delivery, and governance perspectives. This will enable the enterprise to take the appropriate steps to leverage Generative AI effectively for its organizational purpose.