With the proliferation of generative AI in the business world today, it’s critical that organizations understand where AI applications are drawing their data from and who has access to it.
I spoke with Moe Tanabian, chief product officer at industrial software company Cognite and former Microsoft Azure global vice president, about acquiring trustworthy data, AI hallucinations and the future of AI. The following is a transcript of my interview with Tanabian. The interview has been edited for length and clarity.
- Trustworthy data comes from a mix of human and AI knowledge
- Balancing public and private information is key
- Questions to ask to cut down on AI hallucinations
Trustworthy data comes from a mix of human and AI knowledge
Megan Crouse: Define what trustworthy data is to you and how Cognite sees it.
Moe Tanabian: Data has two dimensions. One is the actual value of the data and the parameter it represents; for example, the temperature of an asset in a factory. Then there is the relational aspect of the data, which shows how the source of that temperature sensor is connected to the rest of the data generators. The value-oriented aspect and the relational aspect of data are both important for its quality, trustworthiness, history, revisions and versioning.
There’s obviously the communication pipeline, and you need to make sure the connections between your data sources and your data platform are sufficiently reliable and secure. Make sure the data travels with integrity and is protected against malicious intent.
First, you get the data inside your data platform, then it starts to shape up, and you can now detect and build up the relational aspect of the data.
You obviously need a fairly accurate representation of your physical world in your digital domain, and we do it through Cognite Data Fusion. Artificial intelligence is great at doing 97% of the job, but in the last 3%, there is always something that is not quite there. The AI model wasn’t trained for that 3%, or the data we used to train for that 3% was not high-quality data. So there is always an audit mechanism in the process. You put a human in the mix, and the human catches that 3%: basically, data quality deficiencies and data accuracy deficiencies. Then it becomes a training cycle for the AI engine. Next time, the AI engine will be knowledgeable enough not to make the same mistake.
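The audit cycle Tanabian describes can be sketched in code. The confidence threshold, field names and simulated review step below are illustrative assumptions, not Cognite's implementation:

```python
# Hedged sketch of a human-in-the-loop audit cycle: high-confidence AI
# outputs pass through, low-confidence ones are routed to a human, and
# the corrections become training data for the next model version.
# All field names and the 0.97 threshold are illustrative.

def audit_cycle(predictions, confidence_threshold=0.97):
    """Split AI outputs into accepted results and human corrections."""
    accepted, corrections = [], []
    for p in predictions:
        if p["confidence"] >= confidence_threshold:
            accepted.append(p)
        else:
            # A real system would queue this item for human review;
            # here we simulate the reviewer supplying the right label.
            corrections.append(dict(p, label=p["human_label"]))
    return accepted, corrections

preds = [
    {"id": 1, "label": "valve", "confidence": 0.99, "human_label": "valve"},
    {"id": 2, "label": "pump", "confidence": 0.80, "human_label": "compressor"},
]
accepted, corrections = audit_cycle(preds)
# `corrections` would be fed back as training examples so the model
# does not repeat the same mistake.
```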
We let ChatGPT consult a knowledge graph, that digital twin, which we call a flexible data model. And there you bring the rate of hallucinations [down]. So this combination of knowledge that represents the physical world versus a large language model that can take a natural language query and turn it into a computer-understandable query language — the combination of both creates magic.
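That pattern, in which the LLM only translates the question and the answer itself comes from the knowledge graph, can be sketched roughly as follows. The graph contents, asset names and the `translate()` stub are stand-ins for a real digital twin and a real LLM call:

```python
# Minimal sketch of grounding an LLM answer in a knowledge graph.
# The LLM translates natural language into a structured query; the
# value is then read from the graph, not generated, which is what
# keeps hallucinations out of the final answer.

knowledge_graph = {
    "pump-101": {"type": "pump", "temperature_c": 68.5, "feeds": "tank-7"},
    "tank-7": {"type": "tank", "temperature_c": 21.0, "feeds": None},
}

def translate(question: str) -> dict:
    """Stand-in for the LLM step: natural language -> structured query.
    A real system would prompt an LLM to emit this structure."""
    if "temperature" in question and "pump-101" in question:
        return {"asset": "pump-101", "attribute": "temperature_c"}
    raise ValueError("query not understood")

def answer(question: str) -> float:
    query = translate(question)
    # The answer is looked up deterministically in the graph.
    return knowledge_graph[query["asset"]][query["attribute"]]

print(answer("What is the temperature of pump-101?"))  # 68.5
```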
Balancing public and private information is key
Megan Crouse: What does Cognite have in place to control what data the internal service is being trained on, and what public information the generative AI can access?
Moe Tanabian: The industry is divided on how to handle it. Like in the early days of, I don’t know, Windows or Microsoft DOS or the PC industry, the usage patterns weren’t quite established yet. I think within the next year or so we’re going to land on a stable architecture. But right now, there are two ways to do it.
One is, as I mentioned, to use an internal AI model — we call it a student model — that is trained on customers’ private data and doesn’t leave customers’ premises and cloud tenants. The big teacher model, which is basically ChatGPT or another LLM, connects to it through a set of APIs. This way, the data stays within the customer’s tenancy and doesn’t go out. That’s one architecture being practiced right now; Microsoft is a proponent of it, and the student-teacher architecture is Microsoft’s invention.
The second way is not to use ChatGPT or publicly hosted LLMs, and instead host your own LLM, like Llama. Llama 2 was recently announced by Meta. [Llama and Llama 2] are now available open source for commercial use. That’s a major, major tectonic shift in the industry. It is so big that we have not yet understood its impact, and the reason is that all of a sudden you have a fairly well-trained pre-trained transformer. [Writer’s note: A transformer in this context is a framework for generative AI. GPT stands for generative pre-trained transformer.] And you can host your own LLM as a customer or as a software vendor like us. This way, you protect customer data. It never leaves and goes to a publicly hosted LLM.
Questions to ask to cut down on AI hallucinations
Megan Crouse: What should tech professionals who are concerned about AI hallucinations have in mind when determining whether to use generative AI products?
Moe Tanabian: The first thing is: How am I representing my physical world, and where is my knowledge?
The second thing is the data that is coming into that knowledge graph: Is that data of high quality? Do I know where the data comes from, the lineage of the data? Is it accurate? Is it timely? There are a lot of dimensions, and a modern data ops platform can handle all of them.
And the last one is: Do I have a mechanism that I can interface the generative AI large language model with my data platform, with my digital twin, to avoid hallucinations and data loss?
If the answers to these three questions are clear, I have a pretty good foundation.
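The data-quality dimensions in Tanabian's second question could be enforced with a simple gate like this sketch; the field names, sensor range and freshness window are assumptions for illustration:

```python
# Illustrative data-quality gate covering three of the dimensions
# mentioned: lineage (do we know the source?), accuracy (is the value
# in a plausible range?) and timeliness (is the reading fresh?).
# The schema and thresholds are assumptions, not a real platform's.

from datetime import datetime, timedelta, timezone

def is_trustworthy(reading: dict,
                   max_age: timedelta = timedelta(minutes=5)) -> bool:
    has_lineage = bool(reading.get("source_id"))   # known provenance
    in_range = -40.0 <= reading.get("value", float("nan")) <= 200.0
    fresh = datetime.now(timezone.utc) - reading["timestamp"] <= max_age
    return has_lineage and in_range and fresh

reading = {
    "source_id": "sensor-42",
    "value": 68.5,
    "timestamp": datetime.now(timezone.utc),
}
print(is_trustworthy(reading))  # True
```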
Megan Crouse: What are you most excited about in regard to generative AI now?
Moe Tanabian: Generative AI is one of those foundational technologies, like how software changed the world. Marc [Andreessen, a partner in the Silicon Valley venture capital firm Andreessen Horowitz] said in 2011 that software is eating the world, and software has already eaten the world. It took 40 years for software to do this. I think AI is going to create another paradigm shift in our lives and the way we live and do business within the next five years.