
Everyone’s talking about unstructured data these days. Unstructured data in the form of user documents has been around for decades, but its volume, variety, and the number of applications that generate it, from self-driving cars to smart cameras to genome sequencers, have exploded in recent years. That has made it the largest and most valuable source of data in an organization, especially in the age of generative AI.
As noted by the authors of a recent Harvard Business Review article: “A company’s content lies largely in ‘unstructured data’ — those emails, contracts, forms, SharePoint files, recordings of meetings and so forth created via work processes. That proprietary content makes gen AI more distinctive, more knowledgeable about your products and services, less likely to hallucinate, and more likely to bring economic value. As a chief data officer we interviewed pointed out, ‘You’re unlikely to get much return on your investment by simply installing CoPilot.’”
The problem is that unstructured data is vast, typically found in files and directories scattered across the enterprise, on-premises, and in the cloud. It’s difficult to search and move and, as the HBR authors accurately note, “is frequently of poor quality — obsolete, duplicative, inaccurate, and poorly-structured.” Unstructured data is also multi-modal, meaning it can include images, audio, text, documents, medical DICOM or VNA images, BAM files, and other formats.
For AI initiatives to be successful and relevant to an organization, they must have the right unstructured data at the right time. IT infrastructure and operations leaders should strive to deliver simple visibility across all unstructured data, advanced data classification and segmentation, and secure, high-performance data mobility for AI data ingestion. This is no easy task, but it is possible without hiring expensive consultants.
The payoff of proper unstructured data preparation for AI
Why not just copy all your file data into a secure data lake in the cloud, from which data scientists can cull data for their projects as needed? While data lakes remain a popular option for semi-structured data such as spreadsheets and Parquet files, blindly dumping billions of unstructured data files into data lakes does not work for AI for two reasons:
- They become unwieldy data swamps that are hard to search.
- The iterative nature of AI workflows means that IT will need to move data to different processors, which reduces the effectiveness of a data lake.
With no unifying structure, data lakes of unstructured data become nearly impossible to search, making it hard to discover the right data for the task at hand. Meanwhile, the cost of storing petabytes quickly adds up. Furthermore, AI processing can happen at the edge, in data centers, and in the cloud, so you may need to move data to each processing site. This is redundant, costly, and time-consuming. Why copy all unstructured data to a data lake only to copy it again to each AI process? Your costs multiply if the same data is sent to multiple AI processors or retained even after the processing is complete.
The conundrum is this: If you send more data than is needed for a project, across many projects that might be running at once, or if different users send the same data to the same processor at different times, AI processing costs become prohibitively expensive for most organizations. If you send too little data, your results will be suboptimal or even inaccurate. And if employees send sensitive, restricted data to their AI projects, you’re now looking at public exposure of company secrets, as well as potential compliance violations and lawsuits.
This gets us back to the core challenge: delivering just the right amount of high-quality, relevant unstructured data to AI projects, without lengthy delays and manual effort.
In the Komprise IT Survey: AI, Data & Enterprise Risk, IT leaders shared that their top challenge in preparing unstructured data for AI is quickly finding and moving the right unstructured data to locations where AI lives. Secondary challenges include a lack of visibility across data stores to understand and identify risks, and segmenting and classifying data. Also, more than 30% lack internal agreement on the right strategy for data management and governance. This is not a surprise, given how early enterprises are in their AI initiatives.
Where to focus on AI data preparation
Enterprise IT organizations seek easier, automated ways to prepare data for AI. The metadata automatically generated by file systems is too basic to add useful context or structure to the data. Manual search and metadata enrichment/tagging across billions of files to classify and organize data is not viable. Consider these four areas of focus for AI data preparation.
Sensitive data detection
IT’s top job is to protect sensitive data, with a majority of survey respondents (74%) looking to use workflow automation tools to classify sensitive data and prevent its improper use with AI. The second leading tactic for AI data preparation is automated scanning and classification to bring needed structure to unstructured data.
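To make this concrete, here is a minimal sketch of the kind of rule-based scan such tools automate, written in Python. The directory path and the two regex patterns (US Social Security numbers and email addresses) are illustrative assumptions; commercial scanners combine many more detectors, including trained models and locale-specific rules.

```python
import re
from pathlib import Path

# Illustrative patterns only; real scanners use many more detectors.
SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def scan_for_sensitive_data(root: str) -> dict:
    """Walk a directory tree and report which files match which patterns."""
    findings = {}
    for path in Path(root).rglob("*.txt"):
        text = path.read_text(errors="ignore")
        hits = [name for name, rx in SENSITIVE_PATTERNS.items() if rx.search(text)]
        if hits:
            # Flag for review or quarantine before any AI ingestion.
            findings[str(path)] = hits
    return findings

if __name__ == "__main__":
    print(scan_for_sensitive_data("/data/shares"))  # hypothetical file share path
```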
Data classification
While still nascent, unstructured data management technologies are starting to include automated classification capabilities: they scan file contents across the organization’s data estate, tag files with labels to identify them and, when needed, confine the data so that it cannot be ingested into AI. Integrations with AI tools can also deliver rapid data classification across large data sets by cracking open files, searching for keywords, and creating a curated set of data.
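As a rough illustration of that pattern, the Python sketch below cracks open readable files, matches them against keyword lists, and records tags, plus a “confined” flag, in a small JSON catalog. The keyword rules, tag names, and catalog format are hypothetical; production tools index billions of files and integrate directly with storage systems.

```python
import json
from pathlib import Path

# Hypothetical keyword-to-tag rules; a real classifier would be far richer.
CLASSIFICATION_RULES = {
    "contract": ["hereinafter", "indemnification", "governing law"],
    "radiology": ["mri", "glioma", "contrast-enhanced"],
    "restricted": ["confidential", "do not distribute"],
}

def classify_files(root: str, catalog_path: str = "catalog.json") -> None:
    """Scan files, assign tags by keyword match, and write a simple catalog."""
    catalog = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore").lower()
        except OSError:
            continue
        tags = [tag for tag, words in CLASSIFICATION_RULES.items()
                if any(word in text for word in words)]
        catalog.append({
            "file": str(path),
            "tags": tags,
            # Files matching "restricted" are confined: excluded from AI ingestion.
            "confined": "restricted" in tags,
        })
    Path(catalog_path).write_text(json.dumps(catalog, indent=2))
```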
Metadata enrichment for search
Once unstructured data is further classified through tagging, also called metadata enrichment, file data becomes easier and faster to search, segment, protect, and curate for AI projects. A researcher could use an unstructured data management solution to search on keywords and locate all the related files across distributed file systems without the assistance of IT. The survey showed equal interest in data management approaches and AI-based approaches for data classification via metadata enrichment.
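Continuing the hypothetical catalog from the classification sketch above, a self-service search might look like the following; a real data management product would run this across distributed file systems and object stores rather than a local JSON file.

```python
import json
from pathlib import Path

def find_files_by_tags(catalog_path: str, required_tags: set) -> list:
    """Return files whose enriched metadata contains all requested tags,
    skipping anything marked as confined."""
    catalog = json.loads(Path(catalog_path).read_text())
    return [entry["file"] for entry in catalog
            if required_tags.issubset(entry["tags"]) and not entry["confined"]]

# Example: a researcher locates radiology files without help from IT.
matches = find_files_by_tags("catalog.json", {"radiology"})
```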
RAG
Another top data preparation tactic for AI, according to 60% of the survey respondents, is to store data in vector databases for semantic search and retrieval-augmented generation (RAG). Vector databases allow organizations to convert file data into formats that capture meaning rather than just keywords, making this a useful strategy for search engines, chatbots, and recommendation systems.
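The sketch below shows the retrieval half of that idea, assuming the open-source sentence-transformers library for embeddings and a plain in-memory index standing in for a real vector database; the document chunks and the query are made up. In a full RAG pipeline, the retrieved chunks would be supplied to a generative model as grounding context.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed to be installed

model = SentenceTransformer("all-MiniLM-L6-v2")

# Chunks extracted from curated files (made-up examples).
chunks = [
    "MRI findings show a contrast-enhancing lesion consistent with glioma.",
    "Quarterly revenue grew 12% driven by subscription renewals.",
    "Patient reported headaches; follow-up imaging is recommended.",
]
# Embed once and keep the vectors; a vector database would store these instead.
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

def retrieve(question: str, top_k: int = 2) -> list:
    """Return the chunks whose embeddings are closest in meaning to the question."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q  # cosine similarity, since vectors are normalized
    return [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]

# The question shares few keywords with the MRI chunk, but semantic similarity
# should still surface it ahead of the unrelated revenue chunk.
print(retrieve("What did the brain scan show?"))
```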
Getting the right unstructured data to AI
Once the unstructured data has been tagged, classified, and segmented, organizations need efficient ways to move the data to AI pipelines. Copying large data sets can take weeks and introduce data loss or security risks, especially if you need to move millions of small files over the WAN to a cloud AI service. IT teams typically use one or more methods for these tasks, such as manual copies, free tools, or data management tools, yet an automated data management solution is the most common preference today, cited by 64% of the survey respondents.
Automated unstructured data workflow technologies can streamline the process of curating and moving the right data from storage to locations for use in AI with proper governance. This technology can index data across hybrid storage, identify and confine sensitive data, and execute policy-based automated tagging of data sets to help users search for the exact data they need.
An automated workflow could search for data tagged with “MRI,” “glioma,” and “female,” copy the data to the cloud, and then repeat the process as new data enters the organization. Unstructured data workflow solutions include dashboards to monitor workflows in progress and allow IT to investigate which data sets were used, and by whom, in a specific project, if needed.
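Here is a rough sketch of that loop, reusing the hypothetical tag catalog from the earlier classification example and the AWS boto3 SDK for the cloud copy. The bucket name, tag values, and daily schedule are assumptions; a real workflow engine would add proper scheduling, audit logging, and incremental change detection.

```python
import json
import time
from pathlib import Path

import boto3  # AWS SDK; the target bucket below is a hypothetical staging location

s3 = boto3.client("s3")
BUCKET = "example-ai-staging"
REQUIRED_TAGS = {"mri", "glioma", "female"}  # tags assumed to exist in the catalog

def copy_tagged_files(catalog_path: str = "catalog.json") -> None:
    """Find files carrying all required tags and copy them to cloud object storage."""
    catalog = json.loads(Path(catalog_path).read_text())
    for entry in catalog:
        if REQUIRED_TAGS.issubset(entry["tags"]) and not entry["confined"]:
            src = entry["file"]
            s3.upload_file(src, BUCKET, Path(src).name)

if __name__ == "__main__":
    # Rerun periodically so newly arrived and newly tagged data is picked up.
    while True:
        copy_tagged_files()
        time.sleep(24 * 60 * 60)
```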
AI data governance capabilities are non-negotiable today, as shadow AI is growing and leading to data leakage into commercial AI tools as well as false and inaccurate outcomes.
The unstructured data mandate for AI in business
Most IT organizations are still trying to efficiently store massive volumes of unstructured data that are growing exponentially, yet it’s crucial to go beyond cost savings and unlock data value for AI agents and other GenAI initiatives. Finding the right data and systematically feeding it to the right AI tool, with measurable data governance built in, is a top CIO initiative for 2025 and beyond. The old expression “garbage in, garbage out” has never been more relevant.
This article was written by Krishna Subramanian, who is the COO and co-founder of Komprise, a company focused on unstructured data management for large enterprises across industries.