Combating the coronavirus with Twitter, data mining, and machine learning

Social media can send up an early warning sign of illness, and data analysis can predict how it will spread.

The coronavirus illness (nCoV) is now an international public health emergency, bigger than the SARS outbreak of 2003. Unlike SARS, this time around scientists have better genome sequencing, machine learning, and predictive analysis tools to understand and monitor the outbreak.

During the SARS outbreak, it took five months for scientists to sequence the virus's genome. However, the first 2019-nCoV case was reported in December, and scientists had the genome sequenced by January 10, only a month later.

SEE: Coronavirus having major effect on tech industry beyond supply chain delays (free PDF) (TechRepublic)

Researchers have been using mapping tools to track the spread of disease for several years. Ten European countries started Influenza Net in 2003 to track flu symptoms as reported by individuals, and the American version, Flu Near You, started a similar service in 2011.

Lauren Gardner, a civil engineering professor at Johns Hopkins and the co-director of the Center for Systems Science and Engineering, led the effort to launch a real-time map of the spread of the 2019-nCoV. The site displays statistics about deaths and confirmed cases of coronavirus on a worldwide map

Este Geraghty, MD, MS, MPH, GISP, and chief medical officer and health solutions director at Esri, said that since the SARS outbreak in 2003 there has been a revolution in applied geography through web-based tools.

"Now as we deploy these tools to protect human lives, we can ingest real-time data and display results in interactive dashboards like the coronavirus dashboard built by Johns Hopkins University using ArcGIS," she said.

SEE: The top 10 languages for machine learning hosted on GitHub (free PDF)

With this outbreak, scientists have another source of data that did not exist in 2003: Twitter and Facebook. In 2014, Chicago's Department of Innovation and Technology built an algorithm that  used social media mining and illness prediction technologies to target restaurants inspections. It worked: The algorithm found violations about 7.5 days before the normal inspection routine did.

The social media advantage

Theresa Do, MPH, leader of the Federal Healthcare Advisory and Solutions team at SAS, said that social media can be used as an early indicator that something is going on.

"When you're thinking on a world stage, a lot of times they don't have a lot of these technological advances, but what they do have is cell phones, so they may be tweeting out 'My whole village is sick, something's going on here,' she said.

Do said an analysis of social media posts can be combined with other data sources to predict who is most likely to develop illnesses like the coronavirus illness.

"You can use social media as a source but then validate it against other data sources," she said. "It's not always generalizable (is generalizable a word?), but it can be a sentinel source."

Do said predictive analytics has made significant advances since 2003, including refining the ability to combine multiple data sources. For example, algorithms can look at names on plane tickets and compare that information with data from other sources to predict who has been traveling to certain areas.

"Algorithms can allow you to say 'with some likelihood' it's likely to be the same person," she said.

Filling gaps in the data

The current challenge is identifying gaps in the data. She said that researchers have to balance between the need for real-time data and privacy concerns.

"If you think about the different smartwatches that people wear, you can tell if people are active or not and use that as part of your model, but people aren't always willing to share that because then you can track where someone is at all times," she said.

Do said that the coronavirus outbreak resembles the SARS outbreak, but that governments are sharing data more openly this time.

"We may be getting a lot more positives than they're revealing and that plays a role in how we build the models," she said. "A country doesn't want to be looked at as having the most cases but that is how you save lives." 

Also see

screen-shot-2020-01-31-at-12-27-00-pm.png

This map from Johns Hopkins shows reported cases of 2019-nCoV as of January 30, 2020 at 9:30 pm. The yellow line in the graph is cases outside of China while the orange line shows reported cases inside the country. 

Image: 2019-nCoV Global Cases by Johns Hopkins Center for Systems Science and Engineering