Humans—most of us, at least—are pretty good at judging what's about to happen when we see two people approach each other. A simple handshake? Friendly hug? High-five?
Computers? Not so much.
That is, until now. According to a paper released on Tuesday by MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), its deep learning AI system can accurately predict human interactions.
The AI that MIT developed, which will be presented at the Conference on Computer Vision and Pattern Recognition (CVPR), was able to correctly decide, after viewing just one second of a scene, whether two people would hug, kiss, shake hands, or high-five. It was also able to anticipate what kind of object would appear in a video five seconds later. For instance, when presented with a microwave, the system could predict that a coffee mug might appear.
How did it do it? The machine learning system, which relies on neural-network-based algorithms to train itself on large sets of data, was fed more than 600 hours of video from shows like "The Office" and "Desperate Housewives."
Instead of using traditional methods, such as examining the pixels of a frame to generate a future image, or having humans label each scene, CSAIL took a new approach: it trained the computer on freeze-frame visual representations, each paired with multiple possible outcomes for what might happen next in the scene. Rather than focusing on pixel-level details, this approach takes the entire picture into account.
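In rough terms, the idea is to predict a compact representation of the future rather than the future image itself, and then classify that representation into an action. Below is a minimal, hypothetical sketch of that idea in PyTorch; the class name, layer sizes, and the number of hypothesized futures are illustrative assumptions, not CSAIL's actual architecture or code.

```python
# Hypothetical sketch (not CSAIL's code): predict several possible "future"
# feature representations from one frame, then classify each into an action.

import torch
import torch.nn as nn

ACTIONS = ["hug", "kiss", "handshake", "high-five"]

class FuturePredictor(nn.Module):
    def __init__(self, feat_dim=512, num_futures=3, num_actions=len(ACTIONS)):
        super().__init__()
        # Stand-in encoder: maps a single frame to a feature vector
        # (in practice this would be a pretrained convolutional network).
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 64 * 64, feat_dim),
            nn.ReLU(),
        )
        # One regression head per plausible future representation.
        self.future_heads = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(num_futures)]
        )
        # Shared classifier from a predicted representation to an action label.
        self.classifier = nn.Linear(feat_dim, num_actions)

    def forward(self, frame):
        feat = self.encoder(frame)                              # current-frame representation
        futures = [head(feat) for head in self.future_heads]    # multiple possible futures
        logits = [self.classifier(f) for f in futures]          # action scores per future
        return torch.stack(logits).mean(dim=0)                  # combine the hypotheses

model = FuturePredictor()
frame = torch.rand(1, 3, 64, 64)   # a single frame standing in for one second of video
probs = model(frame).softmax(dim=-1)
print(ACTIONS[int(probs.argmax())])
```

Averaging over several hypothesized futures is just one simple way to reconcile the multiple possible outcomes the researchers describe; the point of the sketch is only that prediction happens at the level of representations, not pixels.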
While it is impressive that the system can interpret human interactions, MIT's AI was successful only 43% of the time on the action-prediction task. The object-prediction experiment, meanwhile, improved on previous measures by 30%.
Still, we humans are not perfectly accurate either: in these experiments, people predicted the interactions correctly only 71% of the time.
Even so, the research here has implications for how machines may work with humans in the future. For example, if a system can train itself on videos to understand if people might hug, shake hands, etc., it might apply this knowledge to its own social exchanges with humans.
And to become even more adept at learning social cues, a robot could, for example, train itself by watching videos of its co-workers. Or a system could determine whether something was wrong, say in a hospital or school, by noticing when the predicted human interactions seemed off.
CSAIL hopes to further explore the complexity of human interactions to improve the predictions.
"There's a lot of subtlety to understanding and forecasting human interactions," said Carl Vondrick, lead author of the paper. "We hope to be able to work off of this example to be able to soon predict even more complex tasks."
Cybersecurity expert Roman Yampolskiy says that the system's "ability to correctly predict future events seconds in advance, based only on visual information, is a remarkable achievement."
"As AI systems continue to improve, we will see incorporation of other informational modalities and personality profiling," said Yampolskiy. "Combined with more powerful computers in the future, this will lead to machines capable of predicting future events in many domains, and with much longer prediction windows."