When you want to understand something, you need data. When you want to set policy, you need evidence. If you can’t see the problem, you can’t make good decisions about it. Connect enough dots and you can get a rich, detailed view of what’s going on, and start to understand why – and maybe what you can do about it. But governments and policy makers don’t always have the equivalent of business intelligence for handling that kind of data.
Sometimes they don’t even have the right data. Data you gather in a lab experiment, clinical trial or research study is relatively clean and controlled; you can regulate more of the variables—but you may also miss the interactions that happen in the complexity of the real world, that affect or even cause what’s happening. Sometimes you can discover more by combining research and real-world data. But for some things there’s no ethical or practical way to do research and you’re always going to be dealing with sensitive data about real people.
That’s particularly true for problems like human trafficking, where the data is about people who are already very vulnerable and at even greater risk if any of that data becomes public. If it looks like someone has asked the authorities for help, the traffickers might punish them for that. But without publicly available data, policy makers can’t understand the issues and make better decisions. Anonymizing data takes time and can lose nuance, plus it’s far too easy to deanonymize data. A better approach is to create synthetic data that has all the same properties as the real data and lets researchers get the same results when they analyze the data set—but that can’t leak any information about real victims and put them in even more danger.
Synthetic data is only useful if it’s accurate, Microsoft Research Director Darren Edge said. “You can generate synthetic data with perfect privacy but zero utility by sampling random values from random distributions.” Useful synthetic data has to match the distribution of the real data set, down to the combinations of individual characteristics (like age, nationality, location, occupation and so on).
But it mustn’t be too accurate: “You can get perfect utility but zero privacy by releasing the actual dataset but claiming it is synthetic. This might sound extreme, but if you use machine learning to learn the distributions of a sensitive dataset and then build a synthetic dataset by predicting record attributes, it is very easy to accidentally reproduce much of the sensitive data.”
Using Microsoft’s open source Synthetic Data Showcase tool, the United Nations’ International Organization for Migration created a synthetic human trafficking data set that has the same structure and statistics as the real data, so analyzing it reveals all the same insights about what kind of people are being exploited, where and how, but not enough information to track down real individuals. The release also includes a Power BI dashboard that you can open in the cloud or by using the free Power BI Desktop app.
The key is controlling the resolution of the data: Making sure that any particular combination of characteristics applies to a large enough number of people that it doesn’t act like a fingerprint for one specific person—think of it as safety in numbers. Microsoft does this with a technique called k-anonymity (k being the minimum number of people with each combination). It’s the same way password monitoring tools like Have I Been Pwned, 1Password and Google’s Password Checkup can tell you if your password has been leaked without you having to send them your password.
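To make the “safety in numbers” idea concrete, here is a minimal Python sketch of a k-anonymity check. It is not how Synthetic Data Showcase is implemented; the function names, column names and records are made up for illustration. It simply counts how many records share each combination of characteristics and flags any combination that applies to fewer than k people.

```python
from collections import Counter

def rare_combinations(records, columns, k):
    """Return each attribute combination shared by fewer than k records."""
    counts = Counter(tuple(record[c] for c in columns) for record in records)
    return {combo: n for combo, n in counts.items() if n < k}

def is_k_anonymous(records, columns, k):
    """True when every attribute combination applies to at least k records."""
    return not rare_combinations(records, columns, k)

# Made-up illustrative records (hypothetical values, not real data)
records = [
    {"age": "18-25", "region": "North", "sector": "agriculture"},
    {"age": "18-25", "region": "North", "sector": "agriculture"},
    {"age": "18-25", "region": "North", "sector": "agriculture"},
    {"age": "26-35", "region": "South", "sector": "construction"},
]
columns = ["age", "region", "sector"]

print(is_k_anonymous(records, columns, k=3))
print(rare_combinations(records, columns, k=3))
```

In this toy data the last record is the only one with its combination, so it would act like a fingerprint; a privacy-preserving release would have to drop it, blank out some of its values or merge it into a broader group.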
Synthetic Data Showcase may also help the people who collect data get it to the people who will use it to make decisions more quickly, Edge suggested. “If I can get a clearly understandable privacy guarantee, then perhaps I can share the data more quickly without recruiting a privacy expert to check the data for privacy leaks or negotiating a data-sharing agreement. Similarly, if I can visually review the data myself, perhaps I don’t need to recruit a data scientist to find insights on my behalf.”
Just because two things happen together doesn’t mean that one causes the other. The amount of mozzarella cheese people eat changes at the same rate as the number of civil engineering doctorates that are awarded. But when things are part of the same system you can use data to work out the impact of one particular part of the system—what might contribute to a particular medical condition, whether a particular drug might be helpful or whether the political situation in a country that suffers a natural disaster will lead to more people trying to find a new place to live and falling into the hands of human traffickers.
Trying to work out what’s the cause and what’s just associated with the outcome without being a reason it happens is known as causal inference. It’s a complex statistical process that often means triangulating data from multiple sources to see if they’re correlated and checking for confounders: variables that confound your attempt to identify the cause because they contribute to both the outcome and another variable you think is the cause. Did someone leave home because of a hurricane or because the economy suffered after the hurricane, and do those reasons change by their age or gender?
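A small simulation shows why confounders matter. This sketch uses NumPy and entirely synthetic numbers. The variable names and effect sizes are assumptions chosen for illustration, and linear regression stands in for the more careful methods real causal inference tools use. Here a confounder drives both the exposure and the outcome, while the exposure itself has no effect at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

z = rng.normal(size=n)            # confounder (hypothetical, e.g. economic damage)
x = 2.0 * z + rng.normal(size=n)  # exposure, driven by the confounder
y = 3.0 * z + rng.normal(size=n)  # outcome, driven by the confounder only

def ols_coefs(target, *columns):
    """Least-squares coefficients: intercept first, then one per column."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs

naive = ols_coefs(y, x)[1]        # ignores the confounder
adjusted = ols_coefs(y, x, z)[1]  # controls for the confounder

print(f"naive estimate of the effect of x on y:    {naive:.2f}")
print(f"estimate after adjusting for confounder z: {adjusted:.2f}")
```

The naive estimate suggests a strong effect that disappears once the confounder is included, which is the kind of trap a dashboard alone won’t warn you about.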
Not only does this require expertise, but because it’s a statistical technique you can get slightly different answers with different levels of confidence that one factor is or isn’t causal based on how you handle the different variables.
Several open source tools for developers can automate causal reasoning, including Microsoft’s DoWhy and EconML and Uber’s CausalML, but they’re definitely aimed at experts. The new ShowWhy application will be open source, too, when it’s released later this year. It uses Python and can save its results as Jupyter notebooks, but it’s aimed at people who aren’t experts or developers. ShowWhy will help you ask a causal question by filling in the different pieces, doing the analysis for you and showing you a diagram of possible causes and how any likely confounders fit in.
That analysis includes whether the results look different if you pick slightly different parameters for some of the statistical decisions. “The idea here is to test very many reasonable specifications of the problem, from how we define the population, exposure and outcome of the question to how we specify the causal model and estimators used to answer the question using causal inference,” Edge said.
If different causal models give quite different results, it’s important to check that the assumptions each model relies on are correct. A future release of ShowWhy will be able to test the assumptions against the data. Again, that’s bringing a very powerful technique—specification curve analysis, which Edge says can “use data and analysis to show us where our assumptions or decisions might be wrong, and guide us to learn more”—to non-experts.
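The idea behind specification curve analysis can be sketched in a few lines of Python. This is a simplified illustration on synthetic data, not ShowWhy’s implementation: the only thing varied here is which control variables go into a linear regression, whereas a real analysis would also vary how the population, exposure and outcome are defined. All variable names and effect sizes are assumptions.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
n = 4000

z = rng.normal(size=n)                      # a genuine confounder
w = rng.normal(size=n)                      # an irrelevant covariate
x = z + rng.normal(size=n)                  # exposure
y = 0.5 * x + 2.0 * z + rng.normal(size=n)  # outcome; true effect of x is 0.5

controls = {"z": z, "w": w}

def effect_of_x(outcome, exposure, covariates):
    """Regression coefficient on the exposure, given a list of control columns."""
    design = np.column_stack([np.ones(len(outcome)), exposure, *covariates])
    coefs, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coefs[1]

# One estimate per specification: every subset of the candidate controls
curve = {}
for size in range(len(controls) + 1):
    for names in combinations(controls, size):
        curve[names] = effect_of_x(y, x, [controls[name] for name in names])

# Sort the estimates to see how much they depend on the specification
for names, estimate in sorted(curve.items(), key=lambda item: item[1]):
    label = ", ".join(names) if names else "no controls"
    print(f"{label:12s} {estimate:.2f}")
```

Specifications that include the confounder cluster around the true effect, while those that leave it out give a much larger estimate. A spread like that is the signal to go back and check which assumptions hold.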
In Chicago, Microsoft is part of Project Eclipse, using cheap Internet of Things sensors on bus stops to capture pollution data and understand what contributes to air quality. Using causal inference may help avoid misunderstanding the problem because of where the sensors happen to be, and avoid making what Edge calls “the common mistake of confusing correlation in a dataset with causation in the real world.”
Visualizing the data with ShowWhy brings that technique to a coalition of community groups, businesses, environmental organizations and local governments that may not have data science expertise, so they get a clearer picture of the situation without making those mistakes. “It might be very easy to ‘see’ relationships in a dashboard visualization that actually have a common cause in an unobserved variable—something like the wind or air pressure, perhaps.”
Keeping up with the data
Situations change over time, and policy needs to change to match. It’s fairly easy to see obvious changes in a single variable like where people are calling a helpline from, what kind of job they’re being exploited in or how old they are. But that’s not usually enough to understand the kinds of complex real-world situations that you need a new policy to deal with.
“There is some insight to be had by counting or averaging attributes in isolation, but this tells you little about what to do about it,” Edge explained. “While individual attributes can describe whole populations but with little useful context, complete records describe individuals with so much context as to offer little generalizable value. Attribute combinations offer a sweet spot of just enough structure and generality to suggest specific courses of action for manageable subsets of data records/subjects, which in many cases is just what you need.”
But spotting emerging trends as they happen is harder when you have to notice changes in the combination of characteristics that add up to a new situation. There’s a huge number of possible combinations and only a few of them represent real changes rather than the real world being rather random from time to time.
“Many visualization techniques are about data aggregation, and many methods for exploring data visually are about rapidly changing how to aggregate the underlying data, ‘drilling down’ to ever smaller subsets of data. If you are always aggregating, you are going to be drawn to conclusions that result in extreme aggregates: the highest/lowest, greatest/smallest, and so on.” Real-world data is often just too noisy: “Neither absolute values nor relative changes tell you anything for sure, although the peaks and troughs that emerge from the aggregates look like they do.”
Looking at data as a connected graph captures meaningful relationships, and sometimes the fact those relationships exist at all can be more important than the numbers showing how strong they are. But most people aren’t trained to look at graphs of nodes and connections and quickly grasp what’s going on.
Microsoft has been working with the University of Bristol in the U.K. to use new techniques in graph statistics (called Unfolded Adjacency Spectral Embedding or UASE). These match up different pairs of characteristics by how much they have in common, normalize them over time so you can see meaningful changes even if the noise in the data means there are different numbers of nodes and links, and then map them so that things that behave more like each other are closer together. When they move closer together over time, that seems to reflect change in the situation, Edge said.
“Positions in the embedded space actually encode kinds of behavior. This means that new, unexpected behaviours should be detectable as groups of nodes all moving closer together in this space. And in practice, when we detect this behaviour and look at the actual patterns of attributes, they do indeed seem both unusual and representative of some emerging pattern of real-world behaviour.”
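For readers who want to see the mechanics, the following Python sketch shows the core of an unfolded adjacency spectral embedding using only NumPy. It is an illustration under stated assumptions rather than the code Microsoft and Bristol use: the `uase` and `sample_graph` helpers, the two-group random graphs and the choice of two dimensions are all hypothetical. The graphs for each time period are placed side by side, a single truncated SVD is taken, and each node gets one position per time period in the same shared space.

```python
import numpy as np

def uase(adjacency_matrices, d):
    """Embed T graphs on the same n nodes into one shared d-dimensional space."""
    n = adjacency_matrices[0].shape[0]
    unfolded = np.hstack(adjacency_matrices)  # shape n x (n * T)
    u, s, vt = np.linalg.svd(unfolded, full_matrices=False)
    u, s, vt = u[:, :d], s[:d], vt[:d, :]
    anchor = u * np.sqrt(s)                   # one position per node overall
    # One position per node per time period, shape (T, n, d)
    dynamic = (vt.T * np.sqrt(s)).reshape(len(adjacency_matrices), n, d)
    return anchor, dynamic

def sample_graph(labels, rng, p_in=0.5, p_out=0.1):
    """Random undirected graph: nodes in the same group link more often."""
    probs = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    return (upper | upper.T).astype(float)

rng = np.random.default_rng(2)
labels_before = np.repeat([0, 1], 100)  # two groups of 100 nodes
labels_after = labels_before.copy()
labels_after[:20] = 1                   # 20 nodes start behaving like group 1

graphs = [sample_graph(labels_before, rng), sample_graph(labels_after, rng)]
anchor, dynamic = uase(graphs, d=2)

# How far each node moved between the two time periods
displacement = np.linalg.norm(dynamic[1] - dynamic[0], axis=1)
print("mean movement, nodes that changed behaviour:", displacement[:20].mean())
print("mean movement, all other nodes:             ", displacement[20:].mean())
```

Because both time periods share one embedding, the nodes whose behaviour changed move much further than the rest, even though every individual link is noisy. In the real analysis the nodes would stand for attribute values rather than the abstract nodes used here.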
Microsoft will show the dynamic graph analysis at the upcoming Microsoft Research Summit and then add it to its open-source graspologic graph statistics package.
Open data tools for the real world
The common theme with all three tools is that data about the real world is messy, complicated and often hides trends and causes in a combination of characteristics that it takes an expert in the field to understand—if only they have tools to help them spot which combinations are significant.
And usually, those tools are built for data scientists who aren’t experts in the problem. Here, they’re designed to bring the power of data science techniques to the people who do understand the problem but don’t have the data science or statistics expertise.
With ShowWhy, Edge told us, “We want to support domain experts who have no prior experience with data wrangling, data science or causal inference to answer causal questions over real-world datasets.” This could be extremely powerful, but building the tools to make it accessible is also hugely challenging, and ShowWhy will definitely evolve.
“We know that early versions of the tool will assume too much, even with step-by-step guidance along the way and on-demand explanations for technical terms. But by building a tool that ‘technically’ works end-to-end for a wide range of datasets and questions, we can iteratively refine our explanations and user experience with people working in the kinds of roles that we’d like to support.”
If you try out ShowWhy when it’s available, you will see some pretty technical jargon, but it will be introduced logically as you work through putting in your data.
“We don’t want to overwhelm users, but at the same time, we have a responsibility to equip them with the knowledge that they need to present and defend their estimates. This means taking time before introducing technical concepts like confounders. We don’t need to rush in and say ‘this is a confounder, now what are your confounders?’ We can take it slowly, asking about causally relevant factors of any kind, before asking whether they might cause or be caused by the exposure or the outcome. With this information, we can think about defining a confounder to the user using relevant concepts that the user already understands. By the time the user gets to the domain model page, they have already been thinking about causal relationships for a while, so will hopefully be ready to see a simplified causal graph and appreciate the nature of a confound.”
These tools are useful but not foolproof. For instance, Synthetic Data Showcase won’t work for every data set, Edge warned; in particular it won’t help if you’re trying to anonymize datasets where the records have very little overlap and where there are several unique combinations of characteristics, which he notes is common with numeric datasets that have a lot of dimensions.
“We’re working on ways to guide the user through the process of selecting and processing data columns with feedback about the ‘synthesizability’ of the dataset in progress. In the meantime, we prioritize privacy over utility—we will always uphold the privacy guarantee and we will always generate a synthetic dataset—but that dataset might have many missing values as the price of privacy.”
“Similarly, for our graph methods, if your graphs don’t overlap over time, we won’t be able to detect meaningful changes (as everything changes), and if your exposed and unexposed groups in ShowWhy do not overlap in terms of outcomes, it is impossible to estimate the causal effect. What we can do in all cases is to detect the problem if it arises and offer suggestions about how to resolve it: for example, combining data values in Synthetic Data Showcase and broadening the time period for UASE.”
Synthetic data could be useful in a lot of places, like sharing business information from Dynamics with a supplier or partner who also competes with you. In SQL Server it could allow developers in your organization to work with data that matches what the systems they build will be processing, while making sure they can’t leak live customer data by losing a laptop or leaving a test server unsecured. Similarly, causal inference and the new graph statistics visualization techniques could find a natural home in Power BI.
Indeed, Edge says the tools could find a home “in multiple Microsoft products” but, he warns, “they need to pass through multiple stages of maturity, validation and generalization to get there.”
“In the meantime, we’re trying to take the most direct route to impact, which means building open technologies, in the open, with community partners.” Even at this very early stage, they might do some real good, and the feedback will, he hopes, help Microsoft build “better end-products that can be adopted at scale for problems that matter.”