Scientists have just discovered over 70,000 bizarre new viruses using AI

Viruses are everywhere. They are in the air; in sewage, lakes and oceans; in grasslands and rotting wood. Some thrive in extreme conditions, such as hydrothermal vents, Antarctic ice, and possibly even space.

They are also ancient. Some are probably as old as, if not older than, the very first cells.

Although we have coexisted with viruses since the beginning of our species, the viral universe remains largely mysterious. For decades, scientists have laboriously collected samples from all over the world and sequenced their genetic material. But viruses mutate quickly and these efforts only scratch the surface of the virosphere.

Most viral genetic material is biological “dark matter,” Mang Shi of Sun Yat-sen University and colleagues wrote in a recent paper published in Cell.

Using AI, the team sheds new light on the viral world. The AI, called LucaProt, relies on a large language model to understand parts of viral genetic material. Another algorithm further breaks down genetic data into more “digestible” pieces to increase effectiveness.

After analyzing nearly 10,500 samples – some from existing databases, others collected during the study – the AI discovered 70,458 new RNA viruses from samples around the world.

“Suddenly you can see things you hadn't seen before,” Artem Babaian of the University of Toronto, who was not involved in the study, told Nature.

Viruses have a bad reputation. The Covid-19 pandemic and the annual flu season show their destructive side. But they can also be used to combat antibiotic-resistant bacteria, to deliver gene therapies into cells or to develop vaccines.

Mapping the virus universe provides a bird's-eye view of virus evolution and mutation – with implications not only for biotechnology, but potentially for fighting the next pandemic.

Going viral

In humans, DNA carries the genetic blueprint. DNA is transcribed into RNA – also made up of four genetic letters – which carries the genetic information to a cellular factory that makes proteins.

Viruses are different. Some forgo DNA altogether and instead encode their genetic blueprint directly in RNA. It sounds unusual, but you already know some of these viruses: SARS-CoV-2, which causes Covid-19, is an RNA virus. These viruses have proteins that science knows little about, and they could also offer new insights into biology.

For decades, scientists have been trying to decipher the virosphere by collecting samples. Sources range from everyday water from a local stream to extreme sources such as Antarctic ice or deep seawater. The RNA extracted from these samples is carefully sequenced and deposited in databases. This method, called metagenomics, captures snippets of total viral RNA from an environment.

Making sense of this genetic goldmine requires more work. Classic computational methods have difficulty searching these large databases for meaningful insights.

Enter ESMFold. The program developed by Meta relies on large language models – the same technology that powers OpenAI's ChatGPT and Google's Gemini – to predict protein structures based on their amino acid “letters.” Similar methods, including DeepMind's AlphaFold and David Baker's RoseTTAFold, recently won their developers the 2024 Nobel Prize in Chemistry.

ESMFold takes in molecular sequences and predicts the 3D structures of proteins at the atomic level. For its first real-world task, scientists used the AI to decipher the “dark matter” proteins in microbes that we know the least about. Last year, the AI predicted the structure of over 700 million proteins from microorganisms. Ten percent were completely alien to anything previously discovered.

Shi's team took note of this and asked whether a similar strategy could work in the world of RNA viruses.

Search for viruses

Scientists have previously used AI to fish out potential new RNA viruses from petabytes of genetic sequencing data – an amount roughly equivalent to 500 million high-resolution photos.

These studies focused on RNA-dependent RNA polymerase, or RdRp, a family of proteins encoded in most RNA virus genomes, which makes them a useful marker. An early analysis identified nearly 132,000 new RNA viruses from their genetic data.

The problem? Viruses mutate quickly. If the genetic letters encoding RdRps change, AI trained on these sequences may not be able to detect mutated viruses. The new study addressed the problem by combining the previous approach with ESMFold in a two-channel AI.

The first channel uses a transformer-based model, similar to the one behind ChatGPT, to extract amino acid sequence “keywords” characteristic of viral RdRps from a large database. After training on the desired sequences and some randomly generated ones, the AI built a vocabulary of about 20,000 common protein sequence fragments found in RdRps.

Compared to previous methods, this step divides genetic libraries into more digestible sections, making it easier for AI to tackle longer genetic sequences and detect viral RdRp proteins.
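To make the idea of a protein “vocabulary” concrete, here is a minimal Python sketch of splitting an amino acid string into pieces from a known vocabulary. Everything in it is an illustrative assumption: the tiny vocabulary, the example sequence, and the greedy longest-match rule are made up for demonstration and are not LucaProt's actual tokenizer.

```python
def segment(sequence, vocab, max_len=4):
    """Greedy longest-match split of an amino acid string into vocabulary pieces.

    Any stretch not covered by the vocabulary falls back to single letters.
    """
    tokens = []
    i = 0
    while i < len(sequence):
        # Try the longest candidate first, then shrink down to one letter.
        for size in range(min(max_len, len(sequence) - i), 0, -1):
            piece = sequence[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary; the real one has about 20,000 entries.
TOY_VOCAB = {"MK", "GDD", "LT"}

print(segment("MKSGDDLT", TOY_VOCAB))  # ['MK', 'S', 'GDD', 'LT']
```

Eight letters become four pieces, which is the sense in which a vocabulary makes long sequences more “digestible” for a model.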

The second channel uses a version of ESMFold. This is the slow but careful reader. Instead of skimming through protein words, it “reads” each individual letter and predicts how each will structurally connect with others to form 3D protein shapes. This step grounds the AI and gives it an idea of what RdRps should look like in live viruses.

LucaProt was trained on nearly 6,000 sequences encoding RdRp proteins and over 229,500 sequences known to encode various proteins. When challenged with a test data set where researchers knew the answers, the AI was exceptionally accurate, producing false positives only 0.014 percent of the time.
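For a sense of scale, a false positive rate is the share of non-RdRp test sequences that the model wrongly flags as RdRp. The short Python sketch below shows the arithmetic. The counts are hypothetical, chosen only so that the result matches the reported 0.014 percent; the article does not give the actual size of the test set.

```python
def false_positive_rate(false_positives, true_negatives):
    """Fraction of truly negative examples that were wrongly flagged as positive."""
    negatives = false_positives + true_negatives
    if negatives == 0:
        raise ValueError("need at least one negative example")
    return false_positives / negatives


# Hypothetical counts: 14 wrong calls among 100,000 non-RdRp sequences.
fp, tn = 14, 99_986
rate = false_positive_rate(fp, tn)
print(f"{rate:.3%}")  # 0.014%
```

At that rate, roughly one in every 7,000 non-RdRp sequences would be mislabeled.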

The AI found 70,458 potential new, unique viruses. One isolated from soil had a surprisingly long genome – “one of the longest RNA viruses identified to date,” the team wrote. Others might thrive in hot springs and extremely salty lakes.

The expanded virosphere adds new viruses to known virus groups – for example, Flaviviridae, a family that includes the viruses that cause hepatitis C and yellow fever. LucaProt also identified 60 different virus groups, each of which differs greatly from all viruses known today.

That doesn't mean they cause disease, but they “have been largely overlooked in previous RNA virus discovery projects,” the team wrote.

For Babaian, the study found small pockets of RNA virus biodiversity that lie far out in the remote reaches of evolutionary space.

A viral hit?

Viruses need a living host to survive. The team is improving its AI to predict these hosts. Most RNA viruses infect eukaryotes, which include plants, animals and humans. Some viruses can also infect bacteria – their cat-and-mouse game inspired the gene editor CRISPR-Cas9.

“The evolutionary history of RNA viruses is at least as long, if not longer, than that of cellular organisms,” the authors write.

The third branch of life, archaea, is often ignored. These life forms evolved in the early stages of life on Earth and share similarities with bacteria and eukaryotes – for example, in the way their genetic material replicates.

But archaea are a distinct branch of life that thrive in extreme environments like hydrothermal vents or extremely salty water. There is evidence that RNA viruses could also infect archaea. If so, this could lead to new insights into our tree of life – and, as with CRISPR, potentially lead to new biotechnologies.

Image source: National Institute of Allergy and Infectious Diseases / Unsplash