
How ‘deep reading’ of the scientific literature will benefit pharma

Published on 03/09/15 at 10:43am
The 'Big Mechanism' project focuses on understanding cancer and tumour biology

The US Defense Advanced Research Projects Agency (DARPA)’s $45 million ‘Big Mechanism’ initiative has an ambitious aim.

It promises to “leapfrog state-of-the-art analytics by developing automated technologies to help explain the causes and effects that drive complicated systems such as diseases like cancer.” 

The three-phase, 4.5-year program has the potential to transform models for cancer drug R&D and spur the development, optimisation and selection of more targeted treatments – all within the next few years.

Dr Paul Cohen, manager of the Big Mechanism program, explains: “Big mechanisms are large, explanatory models of complicated systems in which interactions have important causal effects. The collection of big data is increasingly automated, but the creation of big mechanisms remains a human endeavor made increasingly difficult by the fragmentation and distribution of knowledge. To the extent that the construction of big mechanisms can be automated, it could change how science is done.” 

To catalyse that change, the Agency has tasked research teams with creating computerised systems that will: 

  • Read scientific and medical research papers in unprecedented depth and detail to uncover and extract relevant pieces of information;
  • Integrate those fragments of knowledge into existing computational models of cancer pathways, updating those models accordingly; and
  • Produce hypotheses and predictions about mechanisms that researchers will then test (this read–integrate–hypothesise loop is sketched below).
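
Together, the three steps form a loop. The following minimal Python sketch illustrates its shape; the sentence pattern, the model structure and the function names are all invented for this article rather than taken from any of the funded systems:

    import re
    from collections import defaultdict

    # Toy 'reading' step: pull (agent, effect, target) assertions from text.
    PATTERN = re.compile(r"(\w+) (activates|represses) (\w+)")

    def read(sentences):
        """Extract simple causal assertions from sentences."""
        for sentence in sentences:
            match = PATTERN.search(sentence)
            if match:
                yield match.groups()

    def integrate(model, assertions):
        """Fold newly read assertions into an existing pathway model."""
        for agent, effect, target in assertions:
            model[(agent, target)].add(effect)
        return model

    def hypothesise(model):
        """Turn edges with conflicting evidence into testable questions."""
        for (agent, target), effects in model.items():
            if {"activates", "represses"} <= effects:
                yield f"Does {agent} activate or repress {target}?"

    papers = ["KRAS activates RAF1 in colon cancer lines.",
              "By contrast, our data show KRAS represses RAF1 under hypoxia."]
    model = integrate(defaultdict(set), read(papers))
    print(list(hypothesise(model)))  # ['Does KRAS activate or repress RAF1?']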

The initial focus is cancer biology – specifically cancers driven by KRAS gene mutations, which underlie a significant percentage of colon, lung, and head and neck cancers – with an emphasis on signalling pathways. The project is expected to yield new knowledge about the mechanisms that influence sensitivity or resistance to current treatments, thereby facilitating more personalised therapies; currently, KRAS-driven tumours can be managed, but not cured. It will also demonstrate the utility of the computerised processes that unearth this knowledge, and produce models that industry and academic researchers can replicate in pursuit of more effective therapeutics.

Digging into the data

In this first phase of the Big Mechanism program, the Carnegie Mellon University (CMU) team – one of 12 funded by DARPA – is charged with making sense of a huge amount of data from diverse sources. The goal is to create a succinct summary of ‘core knowledge’ about KRAS signalling pathways for the computer scientists who are creating the new disease models.

To accomplish this, the team realised it needed to do much more than scan the published literature, although it does include all relevant peer-reviewed articles in its text-mining efforts as a matter of course. Limiting the input to that level would leave large information gaps because, for the most part, published articles are incomplete. Two major problems can undermine any conclusions drawn from such input alone:

  • The ‘methods’ sections of many articles do not show all the steps the authors took to reach their conclusions. Part of the CMU team’s brief is to read the methods section as well as the results section of each paper to see whether the underlying experimental logic can be extracted; if so, that information is passed to those doing the modelling. When only part of the story is present in the text – as is usually the case – there is no way to tell whether a claim or result is valid, so the results section’s claims and the authors’ interpretations of the findings cannot be taken at face value.
  • Statements in papers may be contradictory. Authors may oversimplify or make incorrect assertions that are not picked up during editing and publication. The CMU team provided an example of this problem in its initial proposal to DARPA: the introduction of a recently published article claimed that a particular protein could transcriptionally activate certain genes, citing a study that actually stated the contrary – namely, that the protein represses gene transcription (a toy version of such a consistency check appears below).
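
Once assertions have been extracted from both a citing paper and the study it cites, that second problem lends itself to automated checking: opposite claims about the same agent and target can be flagged mechanically. A minimal Python sketch, with invented stand-in names:

    # The example above: a paper claims a protein 'activates' certain genes,
    # while the study it cites says the protein 'represses' them.
    claim_in_citing_paper = ("PROTEIN_X", "activates", "GENE_Y")
    claim_in_cited_study = ("PROTEIN_X", "represses", "GENE_Y")

    OPPOSITES = {frozenset({"activates", "represses"})}

    def contradicts(a, b):
        """Same agent and target, but opposite asserted effects."""
        same_edge = (a[0], a[2]) == (b[0], b[2])
        return same_edge and frozenset({a[1], b[1]}) in OPPOSITES

    print(contradicts(claim_in_citing_paper, claim_in_cited_study))  # True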

These problems are not unique to the cancer domain; pharma R&D groups face similar challenges when trying to decide whether to move forward with a candidate compound, for example. To help inform decision-making in general, it is necessary to use technologies that mine the full text of both literature-based and experimental evidence, as well as relevant clinical data – all of which could provide potentially useful pieces of information. 

Speaking the same language

Project teams – whether in the Big Mechanism program or in big pharma – are made up of individuals with disparate areas of expertise, including biology, chemistry, informatics and visualisation. In such situations, communication is key. Different disciplines often describe the same phenomena differently, and an inability to find a common language can impede a project’s progress.

The same holds even between different laboratories in the same subfield, which often use different nomenclature when referring to the same proteins or processes; it is therefore crucial to amass an extensive database of aliases.

For example, in the current project, the gene of interest alone may be referred to as KRAS, KRAS2 or RASK2, according to the protein catalogue UniProt. Its protein may be referred to as GTPase Kras; K-Ras 2; Ki-Ras; c-K-ras; or c-Ki-ras. What’s more, KRAS interacts with about 150 other proteins encoded in the human genome, all of which have multiple synonyms of their own. This can result in hundreds of different ways of expressing the same basic information.
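
In code, the first line of defence is simply a large lookup table. A minimal sketch, in which the alias strings are the UniProt names quoted above but the canonical key and the normalisation rule are illustrative assumptions:

    # Fold case, hyphens and spaces, then map every surface form to one
    # canonical symbol. A real table would also cover KRAS's ~150
    # interaction partners and all of their synonyms.
    ALIASES = {"KRAS", "KRAS2", "RASK2", "GTPASEKRAS", "KIRAS", "CKRAS", "CKIRAS"}

    def canonical(name):
        """Return 'KRAS' for any known alias, else None."""
        key = name.upper().replace("-", "").replace(" ", "")
        return "KRAS" if key in ALIASES else None

    for mention in ["K-Ras 2", "c-Ki-ras", "GTPase KRas", "RASK2"]:
        print(mention, "->", canonical(mention))  # every one maps to KRAS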

A domain expert must wade through this thicket and explain that when one person says X and another says Y, they both mean the same thing – or that when someone says Z, or sometimes even X, he or she is really referring to something else.

The CMU team therefore relies heavily on the domain expertise of researchers with a thorough knowledge of biological terminology, supported by both a massive library of previously ingested papers and the capacity to handle new input.

Others in the CMU team are experts in natural language processing software, which provides vital tools to address the problem. By collecting and standardising nomenclature, and by automatically learning to recognise which variations of expression mean the same thing, the software facilitates understanding of both terminology and – to the extent that they can be modelled – experimental methods. This provides the ability to identify relevant information that might otherwise remain hidden in the literature and to express it in terms that all team members can understand.

Taken together, natural language processing software and domain expertise allow ‘deep reading’ to be applied to the literature and other relevant input. Simply put, this means the machines that plough through the articles are able to ‘read’ them not only at the simple surface (word) level, but more like scientists do – making judgments about statements and findings, then extracting only information that validates or contributes to existing knowledge.

This weeding-out and systematisation process will make it much easier for the team to take the next step – providing accurate input for the modellers. In addition, the system becomes semantically ‘smarter’ with each iteration, which will ultimately benefit everyone who uses it, including current and future industry, academic and government partners.

Proof of concept  

Less than a year into the program, the possibilities look increasingly exciting. DARPA has requested the development of a ‘use case’ that starts with the input of research articles and proceeds through automated analysis to inform automatically built and maintained models of the cancer processes under discussion. The system then suggests new questions about unresolved issues, prompting a bench scientist to run real experiments based on the output; that scientist contributes the results back into the system to help inform future investigations.

This project started about nine months ago. For the next 18 months, the team will mine all available documents that mention anything related to the KRAS protein (and its various synonyms); extract all relevant information into a central database; identify any gaps and inconsistencies; and ask the team’s biologist to perform specific experiments that could help close the gaps or resolve the inconsistencies. The use case will serve as a proof of concept for the approach.
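
The gap-finding step can be pictured in the same spirit: compare the interactions the pathway model asserts against the directional evidence actually extracted, and turn unsupported edges into candidate experiments. A toy Python sketch with invented data:

    # Interactions the model asserts (pairs only, direction unknown) and
    # the directional assertions extracted so far - both invented here.
    model_edges = {("KRAS", "RAF1"), ("KRAS", "PIK3CA"), ("KRAS", "RALGDS")}
    evidence = {("KRAS", "RAF1"): "activates"}

    # Model edges with no extracted evidence are gaps: each one becomes a
    # specific question for the team's biologist to test at the bench.
    for agent, target in sorted(model_edges - evidence.keys()):
        print(f"No directional evidence for {agent} -> {target}: propose an experiment.")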

Outcome for pharma

So what will this project mean for pharma companies? First, its goal is to develop deep-reading algorithms that can summarise and distil information from an unprecedented 70-80% of the data in a single paper. Even the best current systems capture only a fraction of that information – typically just over 30%.

With over a million articles published annually, companies looking to the academic literature for inspiration for new avenues of oncology drug research face a vast amount of information. Whether they spend valuable time and resources manually searching for links in the published research or rely on current technology, there is still a chance they will miss relationships between proteins and compounds that they need, because the data are dispersed across a wide range of sources.

Automated deep reading at this level could provide a great boost to researchers seeking to optimise leads, and assist companies specifically researching KRAS-related cancers. An automated system that provides a single, constantly updated and freely available model of KRAS-activated tumours would help scientists connect the dots between possible causes and treatments, and develop drugs faster.

Ultimately, the aim is for the algorithms and models developed in this project to be expanded beyond KRAS-activated cancers, giving researchers working on other cancers – and even other diseases entirely – the same ability to search publications, make connections and understand disease mechanisms more quickly and with greater confidence, speeding the route of treatments to market.

Eduard Hovy is a research professor at Carnegie Mellon University’s Language Technologies Institute, and Anton Yuryev is a consultant with Elsevier R&D Solutions’ Professional Services team.
