Businesses have to govern their data to keep it clean, organized and usable. Many focus their governance on systems of record and structured data, but what about big, unstructured data such as photos, videos, digitized hard-copy documents and continuous streams of text from social media?

To improve unstructured data governance, businesses need to take several proactive steps, including using trusted sources and establishing guidelines for user access. However, there are some limitations that may hinder the effective governance of unstructured data.

Challenges of big data governance

The nature of unstructured data, and the complexity of ensuring its quality, security and compliance, create several governance challenges:

  • Lack of inherent organization: Unstructured data lacks a fixed schema (predefined categories or labels), making it difficult to define a standard structure for analysis, governance, classification and retrieval.
  • Data security and privacy: When collecting data from disparate sources, the unstructured data may include sensitive information that should be identified and protected from unauthorized access, use and disclosure to comply with regulations such as CCPA or GDPR.
  • Contextual understanding: Discerning context from text, images or videos can be challenging, potentially leading to misinterpretations.
  • Limited expertise: Relying on data scientists who lack the IT skills to set up data standards and procedures can lead to inconsistent data practices, security vulnerabilities and compliance problems.
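
The data security and privacy challenge above starts with finding sensitive values inside free text. The following is a minimal sketch of that step, assuming that simple regular expressions are enough to flag email addresses and U.S.-style phone numbers. The pattern set, the function names `find_sensitive` and `redact`, and the placeholder format are illustrative assumptions; a production scanner would need much broader coverage.

```python
import re

# Hypothetical, deliberately simple patterns; real PII detection needs more than regexes.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_sensitive(text):
    """Return a dict mapping each pattern label to the matches found in text."""
    hits = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[label] = matches
    return hits


def redact(text):
    """Replace every match with a placeholder such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```

Running `find_sensitive` during ingestion would let a document be tagged as sensitive before it reaches a shared repository, while `redact` could produce a copy that is safer to hand to analysts.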

So, how can we improve the governance of unstructured data, which is commonly estimated to make up roughly 80% of corporate data under management? Here are five ways to tackle the problem in the enterprise.

Top 5 ways to improve the governance of your unstructured data

1. Use trusted data sources

The data that organizations have directly created and accumulated is trusted, but most organizations also acquire data from outside cloud sources as they build an aggregated data repository for analytics.

How do you know that data from these outside sources is trustworthy? You don’t — unless you vet the data provider, understand where the provider has gotten its data, and know how the provider has prepared and secured the data. For example, if you’re in a sensitive industry such as healthcare, you’ll want to know that data on individual patients has been anonymized to meet privacy requirements.

Checking that a vendor’s governance standards align with your own should be a routine task before any contract is signed. At the same stage, request the vendor’s latest IT audit so you can review its recent governance and security performance.

2. Establish unstructured data guidelines for user access and permissions

Structured data in systems of record has firm rules in place for user access and permissions, but unstructured data may not. Access to unstructured data should follow the same rules that apply to structured data.

In other words, access to unstructured data should be limited to those users who require the data. Within the category of access, there are likely to be levels of permission, with some users getting more access to data than others, depending on job function or role.

These user access decisions should be made jointly by IT and end-user departments. Access should be reviewed at least annually, and procedures should be in place so that when an individual leaves the company, access is removed immediately as part of the separation process.
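
The access rules in this section can be sketched as a small role-based check. This is an illustration only, assuming three permission levels and an in-memory policy; the `AccessPolicy` class, the role names and the collection names are hypothetical, and a real deployment would rely on the organization's identity and access management system.

```python
from dataclasses import dataclass, field

# Hypothetical permission levels, ordered from least to most access.
LEVELS = {"none": 0, "read": 1, "write": 2}


@dataclass
class AccessPolicy:
    # Maps a role name to a permission level for each data collection.
    grants: dict = field(default_factory=dict)
    # Maps a user name to a role; removing the entry revokes all access.
    user_roles: dict = field(default_factory=dict)

    def grant(self, role, collection, level):
        self.grants.setdefault(role, {})[collection] = level

    def assign(self, user, role):
        self.user_roles[user] = role

    def revoke_user(self, user):
        """Called from the separation process when someone leaves."""
        self.user_roles.pop(user, None)

    def can(self, user, collection, needed):
        """Return True if the user's role grants at least the needed level."""
        role = self.user_roles.get(user)
        level = self.grants.get(role, {}).get(collection, "none")
        return LEVELS[level] >= LEVELS[needed]
```

The default of `"none"` means a user with no role, or a role with no grant on a collection, is denied, which matches the principle that access is limited to users who require the data.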

3. Secure all data

The basics of data security are trusted networks; strong user access methods and monitoring; perimeter monitoring that checks for vulnerabilities and potential breaches; and user habits that align with security best practices (such as not sharing passwords and not copying data to thumb drives that can be carried away). If data is stored on hardware at the edge of the enterprise, that hardware should, when possible, be physically caged and secured so that only authorized people can reach it.

Most of these standards and practices are in place with structured data but not necessarily with data that is unstructured, such as Internet of Things data.

Unstructured data should be governed by the same levels of security guidelines and practices that its structured counterpart is.

4. Use logging and traceability

Robust logging and traceability software should be continuously at work where big data is concerned. Who or what is accessing the data? When and from where is the data being accessed? If an issue arises, what event initiated it?

Logging, tracing and (in the future) observability all reduce the time needed to resolve problems and are integral to security.
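
The questions above (who, what, when, from where) map directly onto the fields of an audit record. Here is a minimal sketch using Python's standard `logging` and `json` modules; the logger name, the field names and the helper functions `record_access` and `events_for` are assumptions made for illustration, not a reference to any particular logging product.

```python
import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("data_access_audit")  # hypothetical logger name


def record_access(actor, resource, action, source_ip, allowed):
    """Build one audit record, emit it as a JSON line, and return it."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when
        "actor": actor,          # who or what accessed the data
        "resource": resource,    # which object was touched
        "action": action,        # e.g. "read", "delete"
        "source_ip": source_ip,  # from where
        "allowed": allowed,      # outcome of the access check
    }
    audit_log.info(json.dumps(event))
    return event


def events_for(events, resource):
    """Trace back: all recorded events that touched a resource, oldest first."""
    return sorted(
        (e for e in events if e["resource"] == resource),
        key=lambda e: e["timestamp"],
    )
```

Nothing is written anywhere until a handler and level are configured on the logger, for example one that ships the JSON lines to a central log store. Filtering events by resource is the simplest form of tracing an issue back to the event that initiated it.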

5. Dispose of bad data

As an upfront data cleaning practice, bad data should be eliminated as raw, incoming big data streams in. There is a lot of bad big data, whether it’s documents that aren’t needed, IoT streams that contain as many device handshakes as salient information, or superfluous social media threads.

The data preparation process that’s part of data ingestion should eliminate this data so it never takes up real estate in storage. Big data repositories should also be regularly refreshed and revisited, and data that’s no longer needed should be discarded.
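
Both halves of this practice, filtering at ingestion and discarding stale data later, can be sketched in a few lines. The example assumes incoming messages are dictionaries with `type` and `payload` keys and that a 365-day retention window applies; the message shape, the `NOISE_TYPES` set and the retention period are all hypothetical and would come from your own data sources and retention policy.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rule: IoT messages of these types carry no analytic value.
NOISE_TYPES = {"handshake", "heartbeat"}


def keep_on_ingest(message):
    """Return True if an incoming message (a dict) is worth storing."""
    if message.get("type") in NOISE_TYPES:
        return False
    payload = message.get("payload")
    return payload is not None and payload != ""


def filter_stream(messages):
    """Yield only the messages that pass the ingest check."""
    for message in messages:
        if keep_on_ingest(message):
            yield message


def expired(stored_at, now=None, retention_days=365):
    """Return True if a stored item is older than the retention window."""
    now = now or datetime.now(timezone.utc)
    return now - stored_at > timedelta(days=retention_days)
```

Because `filter_stream` is a generator, rejected messages are dropped as they arrive and never reach storage, while `expired` would be run on a schedule against what is already in the repository.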

Use of AI tools in handling unstructured data

Unstructured data, compared to structured data, is usually very complex to process and analyze for insights, which is one of the reasons why it’s not often used for business intelligence. AI technologies can make the process of indexing, tracking, mining, analyzing and deriving insights from unstructured data more efficient. AI-enabled tools offer several capabilities that can handle information not organized in a predefined manner:

  • Natural language processing: With NLP, you can automatically extract information from unstructured text using techniques such as sentiment analysis, named entity recognition, topic extraction and language translation.
  • Image and video recognition: AI tools with object recognition and classification technology can identify objects, people and scenes in images or videos, enabling better analysis of visual data.
  • Speech and audio analysis: These capabilities enable users to transcribe and analyze recordings of spoken content such as customer service calls, conversations and interviews.
  • Recommendation systems: Businesses can use AI tools to analyze unstructured data such as customer feedback and generate personalized recommendations. They can use the results to improve their products, services and customer experience, which can in turn drive business growth.
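
To make the idea of topic extraction concrete, here is a deliberately simple stand-in that tags text by keyword matching. It is not an AI model: the `TOPIC_KEYWORDS` lexicon and the `tag_topics` function are hypothetical, and a trained NLP model would replace the lookup in practice. The sketch only shows how free text can be turned into labels that governance rules can act on.

```python
import re
from collections import Counter

# Hypothetical topic lexicon; a trained NLP model would replace this lookup.
TOPIC_KEYWORDS = {
    "billing": {"invoice", "refund", "charge", "payment"},
    "shipping": {"delivery", "package", "courier", "late"},
}


def tag_topics(text):
    """Count keyword hits per topic and return topics ordered by hit count."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for topic, keywords in TOPIC_KEYWORDS.items():
        hits = sum(1 for word in words if word in keywords)
        if hits:
            counts[topic] = hits
    return [topic for topic, _ in counts.most_common()]
```

Tags produced this way could be stored as metadata alongside each document, giving unstructured content the kind of labels that classification, access and retention policies depend on.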

When shopping for a data governance solution, it’s best to select a tool that aligns with governance practices for unstructured data. Such a tool will help you enforce consistent standards throughout your organization. It will promote adherence to industry regulations and data protection laws and offer data quality assurance, which will give your data long-term value.

Remember that when it comes to data governance, there is no one-size-fits-all solution. The best data governance tool for your business depends on your data needs and preferences.
