Researchers are combining AI and genomics to find thousands of new viruses

A 3D representation of an antibody (foreground left) and examples of high-priority “prototype” pathogens with the potential to threaten human health are at the heart of global pandemic preparedness research efforts. From left to right: Hantavirus, Yellow Fever Virus, Nipah Virus, Picornavirus and Chikungunya. | Photo credit: NIAID

For most of modern history, humans have overlooked viruses, even though they are the most abundant biological entity on the planet and have immense ecological importance. Viruses are found in every corner of the world – from soil and water to the atmosphere to extreme environments such as hot springs and hydrothermal vents.

Viruses are obligate parasites: they need a host to infect and multiply. But the relationship goes both ways. Thanks to advances in research, scientists increasingly recognize viruses not only as pathogens but also as integral components of ecosystems. Viruses drive genetic evolution through horizontal gene transfer, keep microbial populations in balance, and even influence biogeochemical cycles.

They play a crucial role in maintaining biodiversity and can even influence climate regulation. Understanding their influence is therefore key to unraveling the complexity of life on Earth. However, only a small proportion of the estimated 100 million to a trillion virus species have been identified so far.

The unknown-unknown threat

Beyond their role in the environment, understanding viruses is critical for us to anticipate emerging infectious diseases. Some studies estimate that there are approximately 300,000 mammalian viruses still to be discovered, many of which pose a zoonotic threat. Unlike microbes, which scientists have studied using culture-based methods, viruses remain poorly understood due to the challenges of cultivating them.

The rapidly increasing scale and decreasing cost of nucleotide sequencing have led to the widespread use of genome sequencing to understand microbes in the environment, particularly in metagenomic studies. Over the past decade, these approaches have transformed our ability to explore the vast diversity of microbes and viruses. In a metagenomic study, researchers analyze genetic material taken directly from environmental samples, which allows them to identify and study an organism without first having to cultivate it or other organic material, such as tissue, in an intermediate step.

More error-prone, but faster

In recent years, metagenomics has helped scientists identify an astonishing number of previously unknown microbes in various environments. These discoveries have significantly expanded our understanding of microbial ecosystems. As sequencing technologies become more precise, faster and more affordable, and as global data-sharing practices improve, scientists are beginning to unlock the mysteries of the microbial world at an unprecedented pace.

In this context, RNA viruses are particularly important because they mutate and adapt quickly to new conditions. DNA viruses have more stable genomes, and their replication machinery makes fewer “errors” as it copies the genome, while RNA viruses replicate faster and with higher error rates. This property is especially relevant to emerging infectious diseases: COVID-19, Ebola and influenza are all caused by RNA viruses.

Serratus ups the ante

One way to identify an RNA virus is to detect and isolate fragments of a specific gene that is essential for the virus to replicate: the gene for RNA-dependent RNA polymerase, or RdRp. RdRp is one of the oldest genes, so much so that many researchers believe it was among the first genes to exist. RdRp proteins have regions that are well conserved (i.e. that have remained largely unchanged over the course of evolution) and motifs that are essential for the protein's function, namely replicating RNA using a template.

In 2022, Canadian researchers led by Artem Babaian developed an open source tool called Serratus. Given sequencing data, Serratus can match it against sequences known to be associated with viral RdRp proteins. The researchers assembled more than 10 petabytes of sequencing data, spanning 5.7 million sequencing libraries from diverse ecologies. Running Serratus on this data set revealed more than 100,000 viruses, greatly expanding the diversity of viruses known to humanity. Their results were published in Nature in January 2022.
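To make the idea of searching raw sequence data for RdRp more concrete, here is a minimal Python sketch. It is not how Serratus works: real tools align reads against large collections of known RdRp sequences. This toy version only translates a nucleotide read in all six reading frames and looks for the Gly-Asp-Asp (GDD) tripeptide, which sits in a conserved catalytic motif of many RdRps. The function names and the example read are invented for illustration, and a three-residue motif on its own would produce many false positives in real data.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate a DNA sequence codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translate a read in three frames on each strand (RNA 'U' is read as 'T')."""
    seq = seq.upper().replace("U", "T")
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset}"] = translate(s[offset:])
    return frames


def find_rdrp_like_motif(seq, motif="GDD"):
    """Hypothetical helper: list (frame, protein position) for every motif hit."""
    hits = []
    for frame, protein in six_frame_translations(seq).items():
        pos = protein.find(motif)
        while pos != -1:
            hits.append((frame, pos))
            pos = protein.find(motif, pos + 1)
    return hits


# Invented example read: forward frame 0 translates to "MGDD*".
print(find_rdrp_like_motif("ATGGGTGATGACTAA"))  # [('+0', 1)]
```

The six-frame step matters because a short read carries no information about which strand or reading frame it came from, so a search has to consider all of them.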

In another study published in Science that same year, US researchers led by Ahmed Zayed of The Ohio State University used computational tools to comb through terabytes of RNA sequence data and identify thousands of new RNA virus species. In particular, this team identified a new virus species that fills an important gap in scientists' understanding of the evolution of RNA viruses; a new species that dominates the oceans; and another species that could infect mitochondria (organelles in cells that serve as an energy source and are thought to have descended from microbes).

A transformative effect

An important disadvantage of the metagenomic approach is that computer algorithms typically search for proteins that are very similar to sequences already present in databases. This puts them at risk of missing proteins that have evolved and changed shape. However, this risk may not last long. In a recent study, researchers from several Chinese research organizations combined genomics with a transformer.
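The limitation is easy to see in a toy example. The Python sketch below scores a query by percent identity against a tiny "database" and reports a hit only above a cutoff. Everything in it is an assumption made for illustration: the sequences are invented, the 60% cutoff is arbitrary, and real search tools use proper alignment and statistical scoring rather than position-by-position comparison.

```python
def percent_identity(a, b):
    """Percent of positions that match between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def similarity_hits(query, database, min_identity=60.0):
    """Hypothetical helper: names of database entries at or above the cutoff."""
    return [
        name
        for name, reference in database.items()
        if percent_identity(query, reference) >= min_identity
    ]


# Invented sequences: one database entry, one close relative and one
# distant relative that still keeps the central GDD residues.
database = {"known_rdrp_fragment": "MKTGDDLVSA"}
close_relative = "MKSGDDLVSA"    # 90% identical to the database entry
distant_relative = "LRAGDDMIGC"  # 30% identical to the database entry

print(similarity_hits(close_relative, database))    # ['known_rdrp_fragment']
print(similarity_hits(distant_relative, database))  # []
```

The distant relative keeps the residues that matter most, yet it falls below the cutoff and is never reported. That is the kind of miss a pattern-based model is meant to avoid.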

In deep learning, a transformer is a type of machine learning model known for its ability to train quickly to recognize specific patterns. In the study, the researchers fed their transformer genome sequencing data and data from ESMFold, another machine learning model that can predict the structures of proteins, and trained it to recognize genetic patterns corresponding to RdRp.

They then used the transformer to analyze large tranches of metagenomic data, identifying more than 160,000 new RNA viruses. More than half of these viruses were described for the first time, and many originated from unique and/or extreme environmental niches, including hot springs, salt lakes and air. Their results will be published in an upcoming issue of Cell.

Because transformers look for patterns rather than matching amino acid sequences, they can recognize related proteins even when their sequences differ significantly from one another. They can also help computers design proteins based on these patterns to perform functions that no natural protein can. The discovery of new RNA viruses in new locations in the environment is also important for our understanding of public health. Each new discovery improves our ability to identify and characterize similar viruses, teaches us what to look for and where we can improve our methods, and helps us discover more species more quickly.

Fighting pandemics before they start

On the ground, a key benefit of such discoveries lies in pandemic preparedness. As sequencing technology and data become more widespread, we are better able than ever to identify pathogenic viruses with zoonotic potential – those that could jump from animals to humans – long before they pose a significant threat. Early detection gives us the opportunity to intervene in a timely manner and even prevent large-scale outbreaks.

Looking forward, a deeper understanding of viruses and their evolution, gained through genomics, ecological monitoring and machine learning, will improve our preparedness for pandemics. By continually mapping virus diversity in nature and improving our understanding of virus-host interactions, we can also develop machine learning models that predict and help mitigate viral spillovers. This future promises not only to deal with emerging viruses, but also to combat the risk of pandemics on a microscopic rather than global scale.

The authors work at Karkinos Healthcare and are associate professors at IIT Kanpur and Dr. DY Patil Medical College, Hospital and Research Centre. The views expressed are personal.