TECHNOLOGY FEATURE 09 JUNE 2020
Artificial-intelligence tools aim to tame the coronavirus literature
Developers hope that tools for processing natural language will help biomedical researchers and clinicians to find the COVID-19 papers that they need.
Matthew Hutson
The COVID-19 literature has grown in much the same way as the disease’s transmission: exponentially. The NIH’s COVID-19 Portfolio, a website that tracks papers related to the SARS-CoV-2 coronavirus and the disease it causes, lists more than 28,000 articles, far too many for any researcher to read (see ‘Explosive Growth’; code and data at https://github.com/jperkel/covidlit). But a fast-growing set of artificial-intelligence (AI) tools might help researchers and clinicians to quickly sift through the literature.
Driven by a combination of factors — including the availability of a large collection of relevant papers, advances in natural-language processing (NLP) technology and the urgency of the pandemic itself — these tools use AI to find the studies that are most relevant to the user, and in some cases to extract specific findings from the results. Beyond the current pandemic, such tools could help to bridge fields by making it easier to identify solutions from other disciplines, says Amalie Trewartha, one of the team leads for the literature-search tool COVIDScholar, at the Lawrence Berkeley National Laboratory in Berkeley, California.
The tools are still in development, and their utility is largely unproven. They can’t be used to make clinical or research decisions. Even using AI, “a vaccine is not going to emerge full-blown”, says Oren Etzioni, chief executive of the Allen Institute for AI (AI2) in Seattle. But developers hope the new technology will help researchers to focus their efforts. “Augmented intelligence is the best summary of AI,” Etzioni says.
Call to action
The impetus for many of these efforts was a 16 March ‘call to action’ in which the White House Office of Science and Technology Policy invited the AI community to develop tools for mining the COVID-19 literature. To get them started, the White House worked with several organizations to release the COVID-19 Open Research Dataset (CORD-19): a collection of 13,000 full-text papers on SARS-CoV-2 and other coronaviruses. AI2 formatted the files for easier parsing by algorithms, and adds new papers regularly; the collection now numbers some 68,000 papers and 67,000 abstracts. Dozens of tools have emerged as a result.
“The CORD-19 data set is inadvertently proving to be a super-interesting pragmatic test” for AI-based literature analysis, says Anthony Goldbloom, chief executive of the website Kaggle, a Google subsidiary in San Francisco, California, that hosts machine-learning competitions.
To focus AI researchers’ efforts, the White House generated a set of questions to answer, such as ‘What is known about adaptations (mutations) of the virus?’. Kaggle presents these questions — there are dozens — to its users, and awards a weekly US$1,000 prize to the team that has the best answers. Medical-student volunteers sort through the results and compile the best answers into a set of tables on a central page, which now acts as a continuously updated reference. More than 1,000 accounts have submitted algorithms.
José Morey, the chief medical-innovation officer at Liberty Biosecurity, a research firm in Arlington, Virginia, used the resulting reference lists to draft, in a matter of days, a not-yet-published review article summarizing risk factors for COVID-19 severity. “This would have taken me weeks of time to research on my own and aggregate,” he says.
Competitors typically use one of two AI methods, says Goldbloom. The first is an “old-school information-retrieval method” that requires explicit rules that look for specific keywords in papers and analyse the text around them; the second uses deep neural networks, a type of machine-learning method, trained on large data sets to recognize text related to a question or topic.
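To make the first, rule-based approach concrete, here is a minimal sketch in Python. It is not taken from any Kaggle submission. The function names, the 40-character context window and the ranking by raw hit count are all assumptions chosen for illustration. The second approach, a trained neural network, is not shown.

```python
import re


def keyword_in_context(text, keywords, window=40):
    """Find each keyword in the text and return the words around it.

    Returns a list of (keyword, snippet) pairs. `window` is a hypothetical
    setting: the number of characters of context kept on either side.
    """
    hits = []
    for kw in keywords:
        # Whole-word, case-insensitive match on the literal keyword.
        pattern = r"\b" + re.escape(kw) + r"\b"
        for match in re.finditer(pattern, text, re.IGNORECASE):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            hits.append((kw, text[start:end]))
    return hits


def rank_papers(papers, keywords):
    """Rank papers (a dict of title -> text) by number of keyword hits.

    Papers with no hits are dropped. Counting hits is a deliberately
    crude relevance score, used here only as an illustration.
    """
    scored = [
        (len(keyword_in_context(text, keywords)), title)
        for title, text in papers.items()
    ]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [title for score, title in ranked if score > 0]
```

A real system would add many more rules, such as synonyms, negation handling and section weighting, which is part of why such methods need explicit hand-written logic.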
The second phase of the competition, which runs until 16 June, seeks to automate the process of compiling outputs from these AI searches and filling in tables summarizing what the literature reports about facets of COVID-19, such as risk factors and therapeutics.
Search engines
A fast-growing collection of tools exists outside the Kaggle competitions. Google’s COVID-19 Research Explorer, for instance, allows users to ask such questions as ‘What are the rapid molecular diagnostics for COVID-19?’. The tool returns a list of papers, with key passages highlighted.
According to Keith Hall, a computer scientist who leads the project from New York City, the COVID-19 Research Explorer was already in the works as a biomedical-research tool before the pandemic. When CORD-19 came out, “it made it a little more obvious that we could provide a tool that might be helpful to researchers, even if it’s not fully integrated with other Google products”.
COVIDScholar, developed at Lawrence Berkeley, offers a simple search box for combing through the COVID-19 literature. The results page, however, uses AI to tag papers with keywords and topic labels, and offers filters — by attributes such as topic, year, peer-review status and source. Many papers in its corpus come from CORD-19, but the team developed its own scraping tools to collect documents from other sites as well, says Trewartha.
Oscar Whitney, a biology doctoral student at the University of California, Berkeley, used COVIDScholar while writing a paper on nucleic-acid testing for COVID-19. The tool helped him to refine his searches better than Google Scholar and PubMed could, he says — and it turned up papers he might not otherwise have found. “This is definitely the best literature-search tool I’ve ever used,” he says.
A search tool from AI2 called SPIKE-CORD focuses not only on retrieval of papers, but on extraction of information from them, using a simple query language. A search for ‘incubation period … from:* to|- to:* days’ returns a list of snippets such as ‘The incubation period ranges from 3 to 28 days’. With a click, users can download those results to a spreadsheet, with the low and high values in separate columns. Usually, “writing these kinds of queries over large text basically requires you to sit down and write some code”, says Yoav Goldberg, who directs research at AI2 Israel. SPIKE-CORD, he says, aims to expose the power of NLP to those who cannot program.
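For a sense of the code that Goldberg says such queries usually require, here is a minimal Python sketch that pulls incubation-period ranges out of sentences and writes the low and high values to separate spreadsheet columns. It uses a plain regular expression, not SPIKE-CORD’s query language or internals. The pattern, function names and column headings are assumptions made for illustration.

```python
import csv
import io
import re

# Hypothetical pattern: the phrase "incubation period", up to 60 characters
# within the same sentence, then "<low> to <high> days" or "<low>-<high> days".
RANGE_PATTERN = re.compile(
    r"incubation period[^.]{0,60}?"
    r"(\d+(?:\.\d+)?)\s*(?:to|-|–)\s*(\d+(?:\.\d+)?)\s*days",
    re.IGNORECASE,
)


def extract_ranges(sentences):
    """Return (low, high, snippet) tuples for sentences that state a range."""
    rows = []
    for sentence in sentences:
        match = RANGE_PATTERN.search(sentence)
        if match:
            low = float(match.group(1))
            high = float(match.group(2))
            rows.append((low, high, sentence))
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text, with low and high in separate columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["low_days", "high_days", "snippet"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    examples = [
        "The incubation period ranges from 3 to 28 days",
        "Fever was the most common symptom.",
    ]
    print(to_csv(extract_ranges(examples)))
```

Even this small pattern misses phrasings such as ‘between 2 and 14 days’, which hints at why a purpose-built query tool can be easier for non-programmers than maintaining regular expressions by hand.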
Forking paths
Other tools enable more open-ended exploration. For instance, SciSight, developed by AI2 in collaboration with the University of Washington in Seattle, blends four tools. The ‘faceted search’ tool generates a changing list of papers as the user refocuses the search by selecting facets from eight categories, including intervention (for example, vaccine), outcome (such as antibody response), author and journal. A ‘network of science’ tool shows which researchers are studying (and collaborating on) which aspects of the disease, and other tools demonstrate connections between diseases and medicines, and between genes and proteins. All the tools are visual and interactive, such that clicking on one variable presents further connections. The platform supports “this kind of iterative refinement”, says Tom Hope, who leads the SciSight project in Seattle, “which is useful when you don’t really know what it is that you don’t know — the ‘unknown unknowns’”.
Sravanthi Parasa, a gastroenterologist and clinical researcher at the Swedish Medical Center in Seattle, calls the SciSight disease–drug network search tool “a brilliant idea”. She envisions using it outside the pandemic with patients, who sometimes ask about uncommon drug interactions doctors don’t normally see. Parasa typically answers these using PubMed, but that “takes 10–15 minutes, even at lightning speed”.
KnetMiner for COVID-19, from Rothamsted Research, a non-profit organization in Harpenden, UK, sifts through the text of papers as well as data on genetic linkages, protein–protein interactions and gene expression to build a knowledge network of papers, genes, drugs, diseases and proteins. Joseph Hearnshaw, a bioinformatics scientist at Rothamsted, is using the tool to understand why COVID-19 kills more men than women, by looking for connections between the disease, hormones and certain genes — a process of interactive exploration, he says. “Within minutes, I can generate new hypotheses, and share the networks with clinicians.”
And COVID-19 Primer supplements the COVID-19 literature with other data sources, including news outlets and Twitter. The site tracks the most-discussed papers, top emerging topics and trending quotes from news sources, including National Public Radio and FOXnews.com, using neural networks and old-school information retrieval. It also constructs reading lists, both overall and in 11 research categories.
The anticipated user is the “general scientifically literate citizen, someone who wants to know what’s the state of play”, says John Bohannon, the director of science at Primer, a San Francisco-based company that develops AI technology. A little unexpectedly, the site has drawn front-line researchers, including Madeline Grade, an emergency-medicine physician and researcher at the University of California, San Francisco. Grade found COVID-19 Primer especially helpful early in the pandemic when “every aspect of care was changing on a daily basis”. She was inundated with information and needed to create daily protocol updates for the university’s hospital. “Amid that chaos,” she says, “the Primer app was actually a really amazing way to cut through the noise.”
In the fight
So far, these sites have received modest traffic. As of late May, COVID-19 Primer has welcomed 14,000 unique visitors per month, and SciSight has seen 11,000 since it launched. COVIDScholar receives about 500 visitors each day, and COVID-19 KnetMiner has had that many in total. The Kaggle challenge has received the most attention, with 1.7 million page views since it launched in mid-March. Of the COVID-19 researchers we contacted, most had not heard of the majority of these tools. And still more such tools are being developed around the world, including Vilokana in India and CovidAsk in South Korea.
To be clear, they are works in progress. But developers are bullish. “Someone asked me, ‘Is this a practice run, and the real show will come five years from now?’” Etzioni says. He responds: “I wouldn’t say this was a practice. I would say maybe a dress rehearsal. We’re not quite open with the show, but we’re definitely in there, in the fight, and trying to do good.”