Data is valuable. It’s the lifeblood of a modern business, underpinning everything you do. That means you need to control it, if only to stay compliant with regulations and to avoid hefty fines after a data breach. If you know what you have and where it’s stored, then you’re ready to protect what’s important and monitor what’s not.
Cloud platforms like Microsoft Azure make it trivial to generate vast amounts of data, with storage and databases offered as services that can be provisioned in minutes and can replicate data across regions. There’s support for large-scale data lakes, massively replicated Cosmos DB NoSQL, quick MariaDB stores and familiar Azure SQL. Microsoft describes it as a “data supply chain” that covers everything from raw data from Internet of Things sensors and business applications to the analytics workspaces used by business analysts and low-code Power Platform tools, working with on-premises and in-cloud data.
With data scattered across so much of your digital estate, and so easy to create, what’s needed is some form of data governance tooling. It doesn’t need to control your data completely, but it does need to let you understand where it is and how it’s used. It should also be able to help users find the data they need for their projects, exposing what’s been catalogued to anyone with the appropriate permissions.
Introducing Azure Purview
That’s where Azure Purview comes in, building on Microsoft’s own internal data governance tools. It’s a suite of applications, with three key components: Azure Purview Data Map, Azure Purview Data Catalog and Azure Purview Data Insights.
Azure Purview is at heart a tool for data discovery, and that lets it address multiple audiences. Developers and business users can treat it as a registry of available data sources. It can be hard to know what’s available in applications or in analysis tools, so having a place where data and documentation can be found will make users’ lives a lot easier. The same is true for the users and systems that produce that data: They can automate producing documentation and use Purview as a hub for sharing their data with the rest of the business.
Most important, however, is the security team. It’s now tasked with ensuring the business complies with data protection regulations as well as controlling access for users and applications. Running Purview as an automated tool for discovering and registering data gives the team the option of using its tools to check for sensitive data and to add compliance rules to data.
What Purview provides is relatively simple. It’s a service where you can register your data services then tag them with appropriate metadata. The resulting catalog is indexed and searchable, and anyone can add new metadata to a source. Metadata can include common database features, like column and table names, as well as data types and API URLs. Your data never leaves where it’s stored: All that happens is that Purview acts as a central clearing house for your data, storing its location along with the source metadata.
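The register, tag and search model described above can be pictured with a small in-memory sketch. This is not Purview’s API: the `Catalog` and `CatalogEntry` names, the example sources and the `example.invalid` URLs are all hypothetical, and the point is only that the catalog holds a location and metadata while the data stays where it is.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One registered source: a pointer to the data plus descriptive metadata."""
    name: str
    location: str  # where the data lives; the data itself is never copied
    metadata: dict[str, str] = field(default_factory=dict)


class Catalog:
    """Hypothetical stand-in for a searchable data catalog."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, name: str, location: str, **metadata: str) -> CatalogEntry:
        entry = CatalogEntry(name, location, dict(metadata))
        self._entries[name] = entry
        return entry

    def tag(self, name: str, key: str, value: str) -> None:
        # Add or overwrite one metadata item on an existing source.
        self._entries[name].metadata[key] = value

    def search(self, term: str) -> list[str]:
        # Case-insensitive substring match over names, metadata keys and values.
        term = term.lower()
        hits = []
        for entry in self._entries.values():
            haystack = [entry.name, *entry.metadata.keys(), *entry.metadata.values()]
            if any(term in text.lower() for text in haystack):
                hits.append(entry.name)
        return sorted(hits)


catalog = Catalog()
catalog.register("sales-db", "https://example.invalid/sales",
                 table="orders", column="customer_email")
catalog.register("telemetry-lake", "https://example.invalid/iot", format="parquet")
catalog.tag("telemetry-lake", "owner", "iot-team")
print(catalog.search("email"))
```

Searching for "email" finds the sales database through its column metadata, which is the kind of discovery a shared catalog is meant to enable.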
How to build your first data catalog in Purview
It’s simple enough to get started with Purview: You’ll need an Azure account and an Azure Active Directory. Purview needs specific permissions, so make sure you have a policy that allows applications to create a storage account and an Event Hubs namespace, as the service will set these up automatically. Once that’s in place, register Purview, Azure Storage and Event Hubs as resource providers, attached to a subscription with administrative access rights.
You can now create a Purview account from the Azure Portal, choosing how much capacity you want to assign to your account. With everything in place, create the account and launch your Purview workspace from the Azure Portal. You’ll then need to set up roles and accounts, assigning roles to users in your AAD. Users can be Data Readers, Data Curators or Data Source Administrators. Most users will be readers, with access to the catalog. If they’re managing sources and metadata, make them curators. If they’re running scans, then they’re Data Source Administrators.
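As a rough way to plan those assignments before touching the portal, the mapping from what a user does to the role they need can be written down as a lookup. The task names here are hypothetical labels invented for illustration; only the three role names come from the service, and the sketch does not assume that any role includes another.

```python
from enum import Enum


class Role(Enum):
    DATA_READER = "Data Reader"
    DATA_CURATOR = "Data Curator"
    DATA_SOURCE_ADMIN = "Data Source Administrator"


# Hypothetical task names, mapped to roles as the text describes them.
ROLE_FOR_TASK = {
    "browse_catalog": Role.DATA_READER,
    "manage_sources_and_metadata": Role.DATA_CURATOR,
    "run_scans": Role.DATA_SOURCE_ADMIN,
}


def roles_needed(tasks):
    """Return the set of roles that covers the given tasks."""
    unknown = set(tasks) - set(ROLE_FOR_TASK)
    if unknown:
        raise ValueError(f"unknown tasks: {sorted(unknown)}")
    return {ROLE_FOR_TASK[task] for task in tasks}


analyst = roles_needed(["browse_catalog"])
steward = roles_needed(["browse_catalog", "manage_sources_and_metadata"])
print(sorted(role.value for role in steward))
```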
How to manage permissions and secrets in Purview
Before it scans your data, Purview will need to be given access to data sources. You can do this by either giving the Azure Purview managed identity access rights or by using it in conjunction with credentials stored in Azure Key Vault. Both have their benefits, but if you’re using Azure best practices, you’re most likely to want to work with Key Vault secrets.
Configuring Purview for a first scan can take time: You need to provide links to subscriptions and secrets, as well as configure the service’s Azure PowerShell cmdlets. The first set of scripts checks for available data sources in each subscription, and whether the service has access rights. Not all data sources are currently supported by the Azure Purview preview, but those that are account for a significant portion of Azure’s data storage usage. And while there are very few on-premises sources for now, Microsoft is planning to significantly increase the number of supported sources.
It’s worth spending a lot of time in the Azure Purview documentation before running a scan, as configuring data sources can be complex. Register sources and run the first scan from the Data Map view in the Purview portal, making sure you have connectors for all your planned scans. As Purview can work outside Azure, you will need to be careful that you don’t accidentally expose secrets to the whole world, especially for line-of-business systems like SAP HANA or cross-cloud resources like AWS S3.
How to use Purview data
Microsoft bundles much of the Purview tooling into its Azure Purview Studio, a web front end for the service that exposes much of the resulting graph of your data sources. Automatic scans can be annotated with data protection labelling to bring your data into the familiar Microsoft data protection frameworks. There are now over 200 different classifiers built into Purview for automated metadata generation, and you can build your own custom classifiers for business- and industry-specific data.
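To give a feel for what a custom classifier does, here is a minimal sketch of a pattern-based rule: a column is labelled when enough of its sampled values match a regular expression. It is an illustration under stated assumptions, not Purview’s implementation; the `classify_column` function, the employee ID format and the 60% threshold are all made up for the example.

```python
import re


def classify_column(values, pattern, min_match_ratio=0.6):
    """Return True when enough non-empty sample values fully match the pattern."""
    regex = re.compile(pattern)
    non_empty = [value for value in values if value]
    if not non_empty:
        return False
    matches = sum(1 for value in non_empty if regex.fullmatch(value))
    return matches / len(non_empty) >= min_match_ratio


# Hypothetical business-specific identifier: "EMP-" followed by six digits.
EMPLOYEE_ID = r"EMP-\d{6}"

sample = ["EMP-004211", "EMP-118230", "n/a", "EMP-000017"]
print(classify_column(sample, EMPLOYEE_ID))  # True: 3 of 4 values match
```

Using a threshold rather than requiring every value to match lets the rule tolerate placeholder values such as "n/a" in real columns.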
Under the hood is the open-source Apache Atlas platform, with APIs that support building your own applications and tools. Tools like Purview Catalog build on those APIs, so you can see how Microsoft uses them to navigate the resulting data graph, helping you decide what you want to do, and how you want to do it.
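For anyone planning to build on those APIs, it helps to know the general shape of an Apache Atlas entity: a type name plus a set of attributes, with `qualifiedName` acting as the unique key. The sketch below only builds and prints such a payload and makes no network call. The `atlas_entity` helper, the `azure_sql_table` type name, the qualified-name format and the `description` attribute are assumptions chosen for illustration, so check them against the type definitions in your own account before relying on them.

```python
import json


def atlas_entity(type_name, qualified_name, name, **attributes):
    """Build an Atlas-style entity payload as a plain dictionary."""
    return {
        "entity": {
            "typeName": type_name,
            "attributes": {
                "qualifiedName": qualified_name,  # unique key for the entity
                "name": name,
                **attributes,
            },
        }
    }


# Illustrative values only; the type name and URL format are assumptions.
payload = atlas_entity(
    "azure_sql_table",
    "mssql://example.invalid/sales/dbo/orders",
    "orders",
    description="Customer orders, refreshed nightly",
)
print(json.dumps(payload, indent=2))
```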
Microsoft may have initially built Purview to solve its own data governance problems, but it’s clear that the resulting tooling is suitable for anyone with a large data estate who needs to know what they’re storing. While it’s missing a way of determining who has access to that data, it gives you enough information to help determine the users and applications with access and, more importantly, ways to begin to control that access.
Control is key to effective governance and essential for regulatory compliance. With an explosion in cross-cloud, on-premises, and hybrid data storage, tools like Purview are going to be essential for CISOs and CTOs alike.