Microsoft's data classification tool is now out of preview. We talked to Microsoft's Mike Flasko about its future.
Azure Purview is Microsoft's data governance tool, designed to help organizations understand and manage their ever-growing data estates. With auto-scaling cloud data services a few clicks away, there's more scope for data to get out of control than there was when storage had to be provisioned in a data center. It's also easier for developers to hook up to an endpoint and consume that data, adding risks of data leakage or, more dangerously, uncontrolled use in machine learning models.
That last risk is one that's growing, as unsupervised use of data can embed dangerous biases in models. Then there's the added effect of increasingly rigorous data protection regulations, which prescribe how personal data can be used, and which bring along the threat of large fines for misuse or data leaks.
A tool like Purview makes a lot of sense here. It provides structure and automates many of the once-manual processes needed to build data governance across databases and line-of-business applications, ensuring that all your systems of record are managed and controlled while still allowing them to operate effectively.
New features on release: S3 support
Microsoft recently moved Azure Purview from preview to general availability, adding new features and tools, including a set of additional services and extensions that take it beyond Microsoft's cloud and into Amazon's and Google's. We sat down with Mike Flasko, the general manager of Azure's Data Governance Platform, to talk about the transition to general availability and what the future looks like for cloud-based data governance with Purview.
One of the more important new features is support for scanning Amazon S3 buckets. While Amazon's S3 APIs are used by other storage vendors, the Purview tooling is currently restricted to working inside AWS. You need to have an AWS role for the service, with appropriate credentials that can work with encrypted buckets. The role needs very few permissions, fewer in fact than Amazon's own minimum S3 permissions provide, so you need to create your own policy. The rules differ depending on whether you are scanning one specific bucket or working across all your AWS S3 resources.
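As a rough illustration of what such a narrowly scoped policy might look like, here is a minimal Python sketch that builds a read-only IAM policy document for either a single bucket or every bucket in an account. The function name `build_scan_policy`, the bucket name and the key ARN are hypothetical, and the exact set of actions Purview requires is an assumption here, so check Microsoft's documentation before using anything like this in production.

```python
import json


def build_scan_policy(bucket=None, kms_key_arn=None):
    """Build a read-only S3 policy document as a plain dict.

    bucket=None scopes the policy to all buckets in the account;
    passing a bucket name narrows it to that bucket and its objects.
    The action list is an assumption, not Purview's documented minimum.
    """
    if bucket is None:
        resources = ["arn:aws:s3:::*"]
    else:
        resources = [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]

    statements = [
        {
            "Sid": "ReadForScanning",
            "Effect": "Allow",
            "Action": ["s3:GetBucketLocation", "s3:ListBucket", "s3:GetObject"],
            "Resource": resources,
        }
    ]

    if bucket is None:
        # Enumerating buckets is only needed when scanning the whole account.
        statements.append(
            {
                "Sid": "ListAllBuckets",
                "Effect": "Allow",
                "Action": ["s3:ListAllMyBuckets"],
                "Resource": "*",
            }
        )

    if kms_key_arn is not None:
        # Encrypted buckets also need permission to decrypt with the key.
        statements.append(
            {
                "Sid": "DecryptForScanning",
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": kms_key_arn,
            }
        )

    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Hypothetical bucket name, used only for illustration.
    print(json.dumps(build_scan_policy("example-data-bucket"), indent=2))
```

Keeping the single-bucket and all-buckets cases in one function makes the difference between the two rule sets explicit: only the account-wide variant needs permission to enumerate buckets.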
Other new data sources include Google's BigQuery and integration with the Erwin data governance platform. Flasko noted that other popular enterprise storage platforms would soon get Purview support, including the cloud-scale Snowflake database. The intent is to have, as Flasko describes it, "a collection of data sources that we've expanded scanning to both on-premises and additional multi-cloud sources to further automate. You know what you can see and understand."
Taking advantage of intelligent data discovery
Perhaps the most important element of the release of Azure Purview is the data map. Instead of having separate tooling to catalogue and explore data, the map brings it all into one place and adds a visual layer. Flasko describes it as "providing a platform for intelligence about your data assets." That's a difference from other data management tooling, as the visual approach helps you understand the flows between your different data sources, and how that data is being shared and used across your organization. The idea here, Flasko said, is to use that information to "increase data agility but also ensure right use."
Data governance is increasingly important, especially when data is used for at-scale analytics or for building machine learning models. With a tool like Purview's data map you can see where sensitive data is being stored, and how it's being used. This points to a real-time approach to data governance. Governance used to be reactive, with policies built and deployed after data had already been stored and used. By mixing automation with dynamic mapping, tools like Purview offer a new insight-driven approach.
"I think some of the investments we've been making around automated scanning are connecting this conversation of data users with data curators, the folks who govern the data state," Flasko said, talking about the importance of this approach to Purview. "I think it's going to increasingly become more and more essential. It's one of the key areas of Purview, bringing together all of these users through the platform. We feel like there's an opportunity to create a lot more agility in terms of how data is used and further built upon in organizations."
The future of Azure Purview
The future of the platform is one of continuous improvement, adding more data sources and more automations. The more sources that are added and the more processes that are automated, the more value Purview will add. It's an advantage of working on a cloud cadence, Flasko said: "With every month going forward you'll see more and more data source support being added into Purview. One of the benefits of the cloud delivery model that we have is that as soon as they're ready, they'll be exposed."
Microsoft has used the preview release of Purview to understand what users want from a data governance platform, looking at the metadata they need and how they use it. It's a process that Flasko found fascinating: "We've been really excited and kind of amazed at times with some of our customers in terms of the number of different use cases they come back with." That's led to conversations with customers about what they've been seeing and how they can improve their discovery processes. Flasko describes it as customers asking themselves, "If I curated more or if I turned on these classifiers or if I did X, you know, I could use the data and leverage the data in so many more ways."
That's the real value of a tool like this: not so much what the designers and developers expected users to do, but what users are actually doing with it. As Flasko said, "That's the exciting part for me, to see how this platform can really enable data use, and appropriate data use across the organization and drive those types of conversations and brainstorming with our customers."
If there's one thing that comes out of talking to Flasko, it's that those customer conversations will clearly go on for a long time, as Microsoft works with customers to roll out new data sources and new features to help them get control of their data explosions. Microsoft's own internal experiences come into play here, as Flasko described Purview's use inside its financial organization as providing "an understanding of that data to all the folks on [the] team and then enabling everyone, if you will, to become data consumers across their tasks in the organization."