
Viral genomes are poorly annotated in metagenomic samples, representing an obstacle to understanding viral diversity and function. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by the paucity of characterized viral proteins and divergence among viral sequences. Here we show that protein language models can capture prokaryotic viral protein function, enabling new portions of viral sequence space to be assigned biologically meaningful labels. When applied to global ocean virome data, our classifier expanded the annotated fraction of viral protein families by 29%. Among previously unannotated sequences, we highlight the identification of an integrase defining a mobile element in marine picocyanobacteria and a capsid protein that anchors globally widespread viral elements. Furthermore, improved high-level functional annotation provides a means to characterize similarities in genomic organization among diverse viral sequences. Protein language models thus enhance remote homology detection of viral proteins, serving as a useful complement to existing approaches. © 2024. The Author(s), under exclusive licence to Springer Nature Limited.


Zachary N Flamholz, Steven J Biller, Libusha Kelly. Large language models improve annotation of prokaryotic viral proteins. Nature Microbiology. 2024 Feb;9(2):537-549.

PMID: 38287147
