Evolution shapes and conserves genomic signatures in viruses

Genomic signatures in viral genomes

To investigate the degree of conservation of genomic signatures in viruses, we analyzed the complete genome sequences of 2768 viral species from 105 families. We applied established methods for the analysis of k-mer frequencies using variable-length Markov chains (VLMCs)6,17,18. VLMCs generalize models that contain the frequencies of fixed-length substrings commonly present in a genome, as explained in Fig. 1. Every model that counts substrings of a fixed length is a VLMC. The converse is not true, because a VLMC adapts the depth of the tree representing the model to the statistics of the genome during training, and does so branch by branch. The motivation is to balance power and robustness: a large k is powerful whenever a frequent (k-1)-mer prefix allows the probabilities of A, C, G, or T following that prefix to be estimated reliably, and robustness is maintained by allowing a large k only if the counts for the prefix are sufficiently large. In practice, a maximal k is prescribed for the VLMC. The final depth, however, and whether it applies to all or only some prefixes, is always inferred automatically from the genome sequence.

Fig. 1: Variable-length Markov chains (VLMC) are generalizations of models relying on frequencies of fixed-length substrings such as individual.
A di- or tri-nucleotides, codons, di-codons or, generally, k-mers. B Depicts a VLMC in which probabilities are assigned either to di-nucleotides starting with G or individual nucleotides not following a G. C–E Demonstrate the intrinsic balance learned during the training.
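The branch-by-branch adaptation described above can be illustrated with a minimal sketch. It is not the training procedure used in this study: the count threshold `min_count`, the pseudocount, and the function names are assumptions made for illustration, and a simple count cutoff stands in for the statistical pruning of a real VLMC.

```python
from collections import Counter, defaultdict

ALPHABET = "ACGT"

def train_context_model(seq, max_depth=3, min_count=5):
    """Count the nucleotide following every context of length 0..max_depth.

    Assumption: a context is retained only if it was seen at least
    `min_count` times, so frequent prefixes keep deep contexts while
    rare prefixes fall back to shorter ones.
    """
    counts = defaultdict(Counter)
    for i, nxt in enumerate(seq):
        if nxt not in ALPHABET:
            continue
        for depth in range(max_depth + 1):
            if i - depth < 0:
                break
            context = seq[i - depth:i]
            if any(c not in ALPHABET for c in context):
                break  # longer contexts would contain the same ambiguous base
            counts[context][nxt] += 1
    # The empty context is always kept as the final fallback.
    return {ctx: c for ctx, c in counts.items()
            if ctx == "" or sum(c.values()) >= min_count}

def next_probability(model, history, nxt, pseudo=1.0):
    """P(nxt | history), using the longest retained suffix of `history`."""
    for start in range(len(history) + 1):
        context = history[start:]
        if context in model:
            c = model[context]
            total = sum(c.values()) + pseudo * len(ALPHABET)
            return (c[nxt] + pseudo) / total
    raise KeyError("model has no empty context (was the sequence empty?)")
```

With `max_depth=2`, a history ending in a well-supported prefix is scored with a depth-2 context, while a history whose longer suffixes were pruned is scored with a shorter one, which mirrors the mixed depths shown in Fig. 1B.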

To avoid possible bias caused by repeat regions, all genomes were trimmed using DustMasker to remove low-complexity regions prior to any analysis. For viruses with segmented genomes, each segment was analyzed separately, totaling 4273 viral sequences. We divided each sequence into two parts: the first 30%, termed the query, and the last 70%, termed the profile. We then compared the genomic signature in each query to the signatures in all profiles. If a query signature was most similar to its own profile signature, we considered it a conserved species-specific genomic signature, distinguishable from the signatures of all other viruses. For segmented viruses, a match to any segment profile of the same species was also considered species-specific. Even when a virus has a conserved genomic signature, this signature may in some cases be highly similar to the signatures of other members of the same genus or family. Possible reasons are homologous genomes, and thus similar signatures, or similar selection pressures acting on these viruses. In such cases, the signature in the query sequence of that virus cannot be distinguished from the signatures of the profile sequences of other viruses in the same genus or family. To highlight that these viruses present conserved genomic signatures that are nonetheless indistinguishable from those of other members of the same genus or family, we chose to call these signatures genus- or family-specific, depending on the match. Consequently, if a query signature matched a profile signature of a different viral species in the same genus or family, we classified it as genus- or family-specific, respectively.
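The query/profile scheme can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: signatures are represented here by plain di-nucleotide frequencies and compared with a Euclidean distance, both of which are stand-ins for the VLMC-based signatures and their dissimilarity measure, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def split_query_profile(seq, query_fraction=0.3):
    """Return (query, profile): the first 30% and the remaining 70%."""
    cut = int(len(seq) * query_fraction)
    return seq[:cut], seq[cut:]

def kmer_frequencies(seq, k=2):
    """Stand-in signature: relative frequencies of overlapping k-mers."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

def signature_distance(a, b):
    """Stand-in dissimilarity: Euclidean distance between frequency vectors."""
    keys = set(a) | set(b)
    return sqrt(sum((a.get(x, 0.0) - b.get(x, 0.0)) ** 2 for x in keys))

def classify_query(query_sig, own_taxon, profiles):
    """Classify a query by the taxonomy of its most similar profile.

    `own_taxon` and each profile taxon are (species, genus, family) tuples;
    `profiles` is a list of (signature, taxon) pairs.
    """
    _, best_taxon = min(profiles,
                        key=lambda p: signature_distance(query_sig, p[0]))
    for level, name in enumerate(("species", "genus", "family")):
        if best_taxon[level] == own_taxon[level]:
            return name
    return "no match"
```

Because the levels are checked from species upward, a query is labeled genus-specific only when its best hit is a different species of the same genus, matching the definition in the text.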

Our results show that viral genomic signatures are highly specific, often at the species level. To first explore the influence of genome size, we grouped viral genomes by size: <5000, 5000–9999, 10,000–19,999, 20,000–49,999, and ≥50,000 nucleotides (nt). Species-specificity was most prominent in viruses with large genomes and decreased gradually with decreasing genome size (Fig. 2a). More specifically, 78% of all viruses with genomes of at least 50,000 nt presented species-specific genomic signatures, distinct from those of other viruses from the same or different viral families, regardless of genome length. Among the remaining 22%, most had genus- (15%) or family-specific (1%) signatures. Only 6% did not match any signature from the same family.

Fig. 2: Analysis of genomic signatures in viral genomes.

a Signature specificity was determined for viruses with genomes of different lengths. The proportional specificity is color-coded and represented in circle charts for respective genome length. b The signature specificity was further determined for viruses with genomes longer than 10,000 nt from different Baltimore classes and taxonomic families.

Viruses with genomes ranging from 20,000 to 49,999 nt presented similar results, although with fewer viruses with species- (45%) and more with genus- or family-specific signatures (38% and 4%, respectively). In contrast, 13% of these viruses had genomes where the query did not match any profile signature from the same family.

For viruses with genomes of 10,000–19,999 nt and 5000–9999 nt, 22% and 16% had species-specific, 31% and 29% genus-specific, and 12% and 14% family-specific signatures, respectively. In these groups, 34% and 41% of the viruses, respectively, presented genomes where the query did not match any signature from the same family.

Lastly, for viruses with genomes under 5000 nt, 9% presented species-specific, 19% genus-specific, and 9% family-specific genomic signatures. In contrast, for 62% of these viruses, we found no match between the query and any profile within the same family.

To evaluate the statistical robustness of our results, we applied a simulation approach where we randomly paired queries and profiles and compared the number of matches with our results using a Bonferroni-corrected two-tailed t-test. This test demonstrated significant results for all size groups (p = 3.6 × 10⁻¹², 3.1 × 10⁻¹⁵, 5.8 × 10⁻¹³, 2.9 × 10⁻¹³, 1.4 × 10⁻¹⁴ for the family-specificity per category in order of increasing sequence length, with even lower p values for the genus- and species-specific matches, Supplementary Fig. 1).
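A random-pairing null model of this kind can be sketched as follows. The details are assumptions: the number of simulation rounds, the uniform sampling of profiles, and the function names are illustrative, and only the t statistic and the Bonferroni adjustment are shown (converting the statistic to a p value requires the t-distribution, which is omitted here).

```python
import random
from math import sqrt
from statistics import mean, stdev

def random_match_counts(query_families, profile_families, n_rounds=1000, seed=0):
    """Count family-level matches when each query gets a random profile."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_rounds):
        counts.append(sum(q == rng.choice(profile_families)
                          for q in query_families))
    return counts

def one_sample_t(simulated, observed):
    """t statistic for H0: the simulated mean equals the observed count."""
    return (mean(simulated) - observed) / (stdev(simulated) / sqrt(len(simulated)))

def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

A strongly negative statistic indicates that random pairing produces far fewer matches than were observed, which is the direction of the effect reported above.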

Impact of sequence length

Our analysis of genomic signatures relies on repetitive nucleotide patterns, which implies that a higher specificity is typically achievable in longer sequences due to a higher prevalence of each repeated k-mer. As a consequence, the lower frequencies of detectable genomic signatures in viruses with shorter genomes presented here may be derived from a methodological bias.

We therefore tested if the lower detection of genomic signatures in shorter genomes could be attributed to a methodological bias, or to less prominent signatures. We randomly extracted subsequences of 5000, 10,000, and 20,000 nt from the largest genomes (>50,000 nt) in our dataset. We then analyzed these subsequence segments for species-, genus-, and family-specific genomic signatures and compared the results to our previous findings of similar subsequence length (Fig. 2a). To minimize the impact of subsequence location in the genome, this process was repeated 100 times, and the results were summarized.
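The repeated subsampling can be sketched as follows. Uniformly random start positions and the normal-approximation confidence interval are assumptions made for illustration; the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

def sample_subsequences(genome, length, n_samples=100, seed=0):
    """Draw `n_samples` subsequences of `length` nt at random start positions."""
    if length > len(genome):
        raise ValueError("subsequence length exceeds genome length")
    rng = random.Random(seed)
    starts = [rng.randrange(len(genome) - length + 1) for _ in range(n_samples)]
    return [genome[s:s + length] for s in starts]

def summarize(fractions):
    """Mean and an approximate 95% confidence interval across repetitions.

    Assumption: normal approximation, mean +/- 1.96 standard errors.
    """
    m = mean(fractions)
    half_width = 1.96 * stdev(fractions) / sqrt(len(fractions))
    return m, m - half_width, m + half_width
```

Each repetition would yield one fraction of species-, genus-, or family-specific hits per subsequence length, and `summarize` condenses the 100 repetitions into a single value with its interval.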

With 5000 nt subsequences, species-specificity dropped from 78% to an average of 42% (Fig. 3a). This is still notably higher than the 22% seen in viruses with genomes between 5000 nt and 9999 nt. Likewise, genus- and family-specificity decreased from 91% and 94% to 53% and 66% (Fig. 3b, c), but remained higher than in viruses with genomes of similar length. This trend held for 10,000 nt and 20,000 nt subsequences, affirming the correlation between larger genomes and increased genomic signature specificity in viruses.

Fig. 3: Evaluation of methodological bias related to sequence length.

The genomic signatures of subsequences from viruses with genomes longer than 50,000 nt were compared to those of viruses with genomes of the corresponding sequence lengths. The analysis demonstrates that the (a) species-, (b) genus-, and (c) family-specificity for subsequences of 5000 nt (yellow), 10,000 nt (light purple), and 20,000 nt (light blue) from the sampled viruses have, on average, a higher fraction of specific signatures. Our results suggest that larger genomes tend to have more significant genomic signatures than smaller genomes, although there is also a methodological bias. The 95% confidence intervals are depicted as horizontal bars.

Impact of k-mer length

Here, we use a variable-length Markov model to analyze genomic signatures, for which we state a maximum k-mer length. Previous studies on prokaryotes have indicated that the predictive accuracy increases with k-mer size. However, the increase appears to be logarithmic: increasing the size from two to four nucleotides significantly increases the accuracy, while an increase from five to six, or six to eight, nucleotides only marginally improves the results6,19. A similar study demonstrated that the accuracy of classification was already good using k-mers of three nucleotides, while longer k-mers of five nucleotides improved the results20. Similarly, in a study using genomic signatures for phylogenetic studies, the authors found an improvement with a k-mer length between two and five nucleotides, while accuracy remained stable with increased length. It was subsequently concluded that a k-mer length of six nucleotides presented a good trade-off between sequence size and k-mer length, and this length was chosen for further studies21.

To investigate to what extent the maximum k-mer length affects our results on viral genomes, we repeated the analysis of all viral genomes using maximum k-mer lengths of one to seven nucleotides. Our results show that the optimal max length is not necessarily as large as possible but varies depending on genome length. While the genomic signature in viruses with short genomes was better analyzed with a max length of six or seven nucleotides, a shorter max length presented more accurate hits on species, genus, and family levels for viruses with larger genomes (Fig. 4). For simplicity and consistency, we decided to apply a maximum length of six nucleotides in all our analyses, which we consider a proper balance between computational time and accuracy.

Fig. 4: Impact of k-mer max length.

The genomic signatures of viral genomes of different lengths were analyzed using seven different k-mer max lengths (y-axis). Our results demonstrate that the optimal k-mer max length differs for viruses with different genome sizes.

Families and Baltimore classes

To examine the variation in signature specificity among different viral families and Baltimore classes, we performed a new analysis on the 811 viral genome sequences exceeding 10,000 nt to avoid possible bias caused by analyzing short genomes.

We observed more than 50% species-specific and over 75% genus-specific genomic signatures for these viruses (Fig. 2b). However, differences existed among Baltimore classes and families. In Baltimore class I (dsDNA viruses), the majority, especially in Baculoviridae and Herpesviridae, presented species-specific genomic signatures (Fig. 2b). Baltimore class IV ((+)ssRNA viruses) generally displayed high specificity, although slightly lower at the species level. For instance, in Coronaviridae, 53% presented species-, 40% genus-, and 7% family-specific genomic signatures. The lowest specificity was found in Baltimore class V ((-)ssRNA viruses). For example, Paramyxoviridae and Rhabdoviridae, the two largest families, only presented 6% and 19% species-specific genomic signatures, respectively. Nevertheless, most (-)ssRNA viruses, including those in Paramyxoviridae and Rhabdoviridae, exhibited at least family-level specificity.

A limited number of Retroviridae family members exceed 10,000 nt and were included from the ssRNA-RT viruses (Baltimore class VI). Among them, 75% showed species-specific, 15% genus-specific, and 5% family-specific genomic signatures. Only 5% of the viruses from this Baltimore class did not present a detectable genomic signature.

In dsDNA-RT (Baltimore class VII), only one virus exceeds 10,000 nt—the Cacao swollen shoot virus in the Caulimoviridae family, which had no discernible species-specific genomic signature.

We assessed the statistical significance of species, genus, and family-specific signatures for Baltimore classes, using the same statistical test as for the size groups (Supplementary Table 1). Except for Solemoviridae (p = 0.18), Nanoviridae (p = 0.43), and Marnaviridae (p = 1), all viral families demonstrated statistically significant fractions of their members with either species-, genus-, or family-specific signatures as compared to the random model (p

Variation within genomes

To analyze intra-genomic variations in signature conservation in viral genomes longer than 10,000 nt, we applied a sliding window approach. The window size ranged from 50 nt to 10,000 nt, or at most one-third of the genome length.
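A sliding-window scan with these size limits can be sketched as follows. The step size and the list of candidate window sizes are assumptions, since only the range (50 nt to 10,000 nt, capped at one-third of the genome length) is stated; the function names are hypothetical.

```python
def window_sizes(genome_length,
                 candidates=(50, 100, 200, 500, 1000, 2000, 5000, 10000)):
    """Candidate window sizes of at most 10,000 nt and one-third of the genome."""
    limit = min(10_000, genome_length // 3)
    return [w for w in candidates if w <= limit]

def sliding_windows(seq, size, step):
    """Yield (start, window) for every full-length window along `seq`."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]
```

Each window would then be treated as a query and compared against all profiles, so that every genomic region receives a species-, genus-, or family-level label, or none.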

We analyzed six viral genomes: three with sequence lengths between 20,000 nt and 49,999 nt, and three exceeding 50,000 nt. Species were randomly selected from viruses with species, genus, and family-specific signatures within their respective length group.

Most signatures were conserved across the analyzed sequences (Fig. 5a1–f1). For Cydia pomonella granulovirus (Fig. 5c1) and a segment from the Chelonus inanitus bracovirus (segment proviral CiV22.5g4 gene) (Fig. 5f1), species-specificity spanned nearly their entire sequences, with only a few non-matching regions, mainly repeat regions.

In Choristoneura rosaceana nucleopolyhedrovirus (Fig. 5b1) and Bat mastadenovirus B (Fig. 5e1), most regions were genus-specific. Duck atadenovirus A (Fig. 5d1) has two distinct regions: one mostly family-specific and the other mostly species-specific. Tupaiid betaherpesvirus 1 (Fig. 5a1) displays regions with predominantly species- or family-specific windows. For all viruses, windows shorter than 200 nt typically failed to match even the correct family.

We also explored the minimum sequence length needed to identify the species, genus, or family by analyzing genomic signatures (Fig. 5a2–f2, Supplementary Fig. 2). On average, analyzing 1000 nt can classify the correct family in 80% of viruses with genomes longer than 10,000 nt and the correct genus in 70% (Supplementary Fig. 2). Identifying the correct species is more challenging; analyzing 2000 nt can classify less than 40% of species, although results vary by species. For example, Cydia pomonella granulovirus (Fig. 5c2) requires just 500 nt to identify the correct species in 50% of cases, while Bat mastadenovirus B (Fig. 5b2) can achieve accurate genus classification at best.

Fig. 5: Genomic signatures across genomes.

a1–f1 We randomly selected six viruses that we had classified as containing family-, genus-, or species-specific genomic signatures, respectively, and applied a sliding window approach to analyze the species- (light green), genus- (dark green), or family-specific (purple) signatures in different regions of their genomes. Regions with no specificity related to the viral family are marked in red. a2–f2 We further depicted the proportion of windows that are most similar to the correct species, genus, and family for the respective virus to estimate the possibility of classifying a viral sequence based on its genomic signature for different sequence lengths. The viruses for each panel correspond to the virus with the same letter as in (a1–f1).

Variations within and between families

To further explore how genomic signatures differ between species within and between families, we first computed pairwise signature distances between all viruses with genomes larger than 10,000 nt. These were then used to create an unrooted neighbor-joining tree. This tree thus illustrates similarities and differences in genomic signatures between viruses based on their locations, rather than descent from common ancestors as in a phylogenetic tree.
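How a pairwise distance matrix becomes an unrooted tree can be illustrated with a textbook neighbor-joining sketch. This is not the software used in the study; the input format (a dictionary keyed by `frozenset` pairs) and the internal node names `node0`, `node1`, ... are assumptions made for illustration, and leaf labels are assumed not to collide with those names.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Textbook neighbor joining.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance for every leaf pair.
    Returns a list of (node, node, branch_length) edges of an unrooted tree.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Pick the pair minimizing Q(i, j) = (n - 2) d(i, j) - r_i - r_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        li = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((i, u, li))
        edges.append((j, u, dij - li))
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (d[frozenset((i, k))]
                                        + d[frozenset((j, k))] - dij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges
```

With signature distances as input, nearby leaves in the resulting tree are viruses with similar signatures, regardless of whether they share ancestry.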

Our results demonstrate that some viruses clustered according to their families or genera, such as Corona, Polydna, Toga, Flavi, and Baculo families. However, while all viruses within individual Flaviviridae genera (Flavivirus, Pestivirus, Pegivirus) presented similar genomic signatures, there were large distances between the genera (Fig. 6a). Similarly, Polydnaviridae presented conserved signatures within, but not between, genera (Bracovirus, Ichnovirus).

Fig. 6: Clustering of viruses based on their genomic signatures.

We constructed an unrooted tree of the viruses with genomes larger than 10,000 nt, where the distances between taxa correspond to differences in genomic signatures. a A subset of the families, where the species of the same genera have similar signatures. b A subset of the families, where signatures varied significantly within each family. c The respective Baltimore class is color-coded to illustrate their respective variation in genomic signatures.

While all viruses in the Baculo and Corona families, and most viruses in the Toga family, clustered according to their family taxonomy, the Baculoviridae branch also included 15 distantly related, or unrelated, viruses. Additionally, three Tobani family viruses had similar signatures to, and clustered together with, viruses in the Corona family. In contrast, in some viral families, such as the Herpesviridae, Alloherpesviridae, Malacoherpesviridae, Adenoviridae, and Poxviridae, the genomic signatures varied considerably within the families and the members were dispersed throughout the tree (Fig. 6b).

Although some viral families within the same Baltimore class presented similar genomic signatures (Fig. 6c), multiple clusters existed for each Baltimore class, except for the ssRNA-RT viruses. It is important to note, however, that the Retroviridae family is the sole family within this class.

A high-resolution tree with detailed information is presented in Supplementary Fig. 3.

Viral-host signature adaptation

As viruses partly depend on their host’s genetic machinery, they may have adapted their genomic signatures to converge with their hosts’ signatures. We tested this hypothesis using 2399 host genome sequences, including putative hosts, vectors, and reservoir hosts. We computed genomic signature profiles on host coding regions and compared them to viral signatures. Since many related eukaryotic species share sequence homology and thus likely have similar signatures, we counted matches on hosts from the same genus, family, and order as a positive match.

We found that only 45 viruses presented genomic signatures that were most similar to those of a host of the correct species, genus, family, or order. Six viruses had signatures most similar to the signature of the correct host species, namely one member of Retroviridae (Murine leukemia virus, which includes endogenous subspecies), three members of Potyviridae, one member of Iridoviridae, and one member of Hepadnaviridae (Supplementary Table 2). Expanding the criteria to host order revealed more matches than expected at random (Fig. 7). For three Baltimore classes, some viral families presented significant host-order similarity: dsDNA (Hepadnaviridae, Nudiviridae, Ascoviridae, Polydnaviridae, p −10), (+)ssRNA (Mesoniviridae, Potyviridae, Betaflexiviridae p Tospoviridae p = 0.005). Plant and insect viruses were overrepresented among viruses with genomic signatures matching those of their hosts. Specifically, 19 were plant viruses, 16 were insect viruses, one infects both insects and mammals, two infect both insects and plants, three infect other animals, and four infect fungi.

Fig. 7: Similarity analysis of genomic signatures in viruses and their hosts.

The percentage of viruses similar to a host within the same taxonomic order as its native host (in orange), compared to a random model (in gray). The viruses are subdivided by their genome composition. Viral families with significantly more viruses similar to their hosts than randomly expected are marked (*).