Network Protein Sequence Analysis
NPS@ at CRCL is a fork of the original NPS@ server

References


SOFTWARE


BLAST
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res. 1997 Sep 1;25(17):3389-3402
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA. altschul@ncbi.nlm.nih.gov

The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
CLUSTALW
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
Nucleic Acids Res 1994 Nov 11;22(22):4673-4680
Thompson JD, Higgins DG, Gibson TJ
European Molecular Biology Laboratory, Heidelberg, Germany.

The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W which is freely available.
Coiled-coil prediction
Predicting coiled coils from protein sequences.
Science 1991 May 24;252(5010):1162-1164
Lupas A, Van Dyke M, Stock J
Department of Molecular Biology, Princeton University, NJ 08544.

The probability that a residue in a protein is part of a coiled-coil structure was assessed by comparison of its flanking sequences with sequences of known coiled-coil proteins. This method was used to delineate coiled-coil domains in otherwise globular proteins, such as the leucine zipper domains in transcriptional regulators, and to predict regions of discontinuity within coiled-coil structures, such as the hinge region in myosin. More than 200 proteins that probably have coiled-coil domains were identified in GenBank, including alpha- and beta-tubulins, flagellins, G protein beta subunits, some bacterial transfer RNA synthetases, and members of the heat shock protein (Hsp70) family.
DSC
Identification and application of the concepts important for accurate and reliable protein secondary structure prediction
Protein Sci 1996 Nov;5(11):2298-310
King RD, Sternberg MJ
Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, London, United Kingdom.

A protein secondary structure prediction method from multiply aligned homologous sequences is presented with an overall per residue three-state accuracy of 70.1%. There are two aims: to obtain high accuracy by identification of a set of concepts important for prediction followed by use of linear statistics; and to provide insight into the folding process. The important concepts in secondary structure prediction are identified as: residue conformational propensities, sequence edge effects, moments of hydrophobicity, position of insertions and deletions in aligned homologous sequence, moments of conservation, auto-correlation, residue ratios, secondary structure feedback effects, and filtering. Explicit use of edge effects, moments of conservation, and auto-correlation are new to this paper. The relative importance of the concepts used in prediction was analyzed by stepwise addition of information and examination of weights in the discrimination function. The simple and explicit structure of the prediction allows the method to be reimplemented easily. The accuracy of a prediction is predictable a priori. This permits evaluation of the utility of the prediction: 10% of the chains predicted were identified correctly as having a mean accuracy of > 80%. Existing high-accuracy prediction methods are "black-box" predictors based on complex nonlinear statistics (e.g., neural networks in PHD: Rost & Sander, 1993a). For medium- to short-length chains (> or = 90 residues and < 170 residues), the prediction method is significantly more accurate (P < 0.01) than the PHD algorithm (probably the most commonly used algorithm). In combination with the PHD, an algorithm is formed that is significantly more accurate than either method, with an estimated overall three-state accuracy of 72.4%, the highest accuracy reported for any prediction method.
DSSP
Dictionary of protein secondary structure : pattern recognition of hydrogen-bonded and geometrical features
Biopolymers 1983, 22: 2577-2637
Kabsch W & Sander C

FASTA
Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms.
Genomics 1991 Nov;11(3):635-650
Pearson WR
Department of Biochemistry, University of Virginia, Charlottesville 22908.

The sensitivity and selectivity of the FASTA and the Smith-Waterman protein sequence comparison algorithms were evaluated using the superfamily classification provided in the National Biomedical Research Foundation/Protein Identification Resource (PIR) protein sequence database. Sequences from each of the 34 superfamilies in the PIR database with 20 or more members were compared against the protein sequence database. The similarity scores of the related and unrelated sequences were determined using either the FASTA program or the Smith-Waterman local similarity algorithm. These two sets of similarity scores were used to evaluate the ability of the two comparison algorithms to identify distantly related protein sequences. The FASTA program using the ktup = 2 sensitivity setting performed as well as the Smith-Waterman algorithm for 19 of the 34 superfamilies. Increasing the sensitivity by setting ktup = 1 allowed FASTA to perform as well as Smith-Waterman on an additional 7 superfamilies. The rigorous Smith-Waterman method performed better than FASTA with ktup = 1 on 8 superfamilies, including the globins, immunoglobulin variable regions, calmodulins, and plastocyanins. Several strategies for improving the sensitivity of FASTA were examined. The greatest improvement in sensitivity was achieved by optimizing a band around the best initial region found for every library sequence. For every superfamily except the globins and immunoglobulin variable regions, this strategy was as sensitive as a full Smith-Waterman. For some sequences, additional sensitivity was achieved by including conserved but nonidentical residues in the lookup table used to identify the initial region.
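The role of the ktup setting and the lookup table mentioned above can be illustrated with a minimal sketch. It only counts shared k-tuples per diagonal between a query and one library sequence; it is not the FASTA program itself, and the function names are hypothetical. Scoring matrices, joining of regions and banded optimization are deliberately left out.

```python
from collections import defaultdict

def ktup_diagonal_hits(query, subject, ktup=2):
    """Count shared k-tuples per diagonal (subject offset minus query offset)."""
    # Lookup table: each k-tuple of the query -> list of its start positions.
    lookup = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        lookup[query[i:i + ktup]].append(i)

    # Scan the library sequence once and record every k-tuple hit.
    hits = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in lookup.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return dict(hits)

def best_diagonal(query, subject, ktup=2):
    """Return (diagonal, hit_count) for the densest diagonal, or None if no hit."""
    hits = ktup_diagonal_hits(query, subject, ktup)
    if not hits:
        return None
    return max(hits.items(), key=lambda kv: kv[1])
```

A smaller ktup produces more hits per diagonal and a slower scan, which is the sensitivity versus speed trade-off the abstract reports for ktup = 2 and ktup = 1.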


Improved tools for biological sequence comparison.
Proc Natl Acad Sci U S A 1988 Apr;85(8):2444-2448
Pearson WR, Lipman DJ
Department of Biochemistry, University of Virginia, Charlottesville 22908.

We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.
GOR IV
GOR secondary structure prediction method version IV
Methods in Enzymology 1996 R.F. Doolittle Ed., vol 266, 540-553
Garnier J, Gibrat J-F, Robson B

GOR: The GOR method is based on information theory and was developed by J. Garnier, D. Osguthorpe and B. Robson (J. Mol. Biol. 120, 97, 1978). The present version, GOR IV, uses all possible pair frequencies within a window of 17 amino acid residues and is reported by J. Garnier, J.-F. Gibrat and B. Robson in Methods in Enzymology, vol 266, p 540-553 (1996). After crossvalidation on a data base of 267 proteins, version IV of GOR has a mean accuracy of 64.4% for a three-state prediction (Q3). The program gives two outputs: the first, eye-friendly, gives the sequence and the predicted secondary structure in rows (H=helix, E=extended or beta strand, C=coil); the second gives the probability values for each secondary structure at each amino acid position. The predicted secondary structure is the one of highest probability compatible with a predicted helix segment of at least four residues and a predicted extended segment of at least two residues.
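As an illustration of the three-state accuracy (Q3) and the per-residue probability output described above, here is a minimal sketch. It is not the GOR IV program: the function names are hypothetical, the input format (one tuple of H, E, C probabilities per residue) is an assumption, and the minimum segment length rule is not applied.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state (H, E or C) matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not observed:
        raise ValueError("empty sequences have no defined Q3")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

def states_from_probabilities(probabilities):
    """Pick the most probable state per residue from (pH, pE, pC) tuples.

    Simplification: the minimum helix and strand segment lengths are ignored here.
    """
    states = "HEC"
    return "".join(
        states[max(range(3), key=lambda k: row[k])] for row in probabilities
    )
```

For example, `q3("HHHHCCEE", "HHHCCCEE")` gives 87.5, since seven of the eight residues agree.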
Helix-turn-helix DNA-binding motifs prediction
Improved detection of helix-turn-helix DNA-binding motifs in protein sequences.
Nucleic Acids Res 1990 Sep 11;18(17):5019-5026
Dodd IB, Egan JB
Department of Biochemistry, University of Adelaide, Australia.

We present an update of our method for systematic detection and evaluation of potential helix-turn-helix DNA-binding motifs in protein sequences [Dodd, I. and Egan, J. B. (1987) J. Mol. Biol. 194, 557-564]. The new method is considerably more powerful, detecting approximately 50% more likely helix-turn-helix sequences without an increase in false predictions. This improvement is due almost entirely to the use of a much larger reference set of 91 presumed helix-turn-helix sequences. The scoring matrix derived from this reference set has been calibrated against a large protein sequence database so that the score obtained by a sequence can be used to give a practical estimation of the probability that the sequence is a helix-turn-helix motif.
MLRC
Improved Performance in Protein Secondary Structure Prediction by Inhomogeneous Score Combination
Bioinformatics vol. 15 no. 5 1999 pp 413-421
Guermeur Y, Geourjon C, Gallinari P, & Deleage G

Motivation: In many fields of pattern recognition, combination has proved efficient to increase the generalization performance of individual prediction methods. Numerous systems have been developed for protein secondary structure prediction, based on different principles. Finding better ensemble methods for this task may thus become crucial. In addition, efforts need to be made to help the biologist in the post-processing of the outputs.
Results: An ensemble method has been designed to post-process the outputs of protein secondary structure prediction methods, in order to obtain an improvement of prediction accuracy while generating class posterior probability estimates. Experimental results establish that it can increase the recognition rate of methods that provide inhomogeneous scores, even if their individual prediction successes are largely different. This combination thus constitutes a help for the biologist, who can use it confidently on top of any set of prediction methods. Furthermore, the resulting estimates can be used in various ways, for instance to determine which residues are predicted with a given high level of reliability.
Availability: Free availability over the internet on the Network Protein Sequence @nalysis (NPS@) WWW server at https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=NPSA/npsa_mlrc.html. The method is proposed as the default choice.
Multalin
Multiple sequence alignment with hierarchical clustering.
Nucleic Acids Res 1988 Nov 25;16(22):10881-10890
Corpet F
Laboratoire de Genetique Cellulaire, INRA Toulouse, France.

An algorithm is presented for the multiple alignment of sequences, either proteins or nucleic acids, that is both accurate and easy to use on microcomputers. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, a hierarchical clustering of the sequences is performed using the matrix of the pairwise alignment scores. The closest sequences are aligned creating groups of aligned sequences. Then close groups are aligned until all sequences are aligned in one group. The pairwise alignments included in the multiple alignment form a new matrix that is used to produce a hierarchical clustering. If it is different from the first one, iteration of the process can be performed. The method is illustrated by an example: a global alignment of 39 sequences of cytochrome c.
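The clustering step described above (closest sequences grouped first, then close groups joined) can be sketched as follows. This is an illustrative sketch only: it assumes average linkage over pairwise similarity scores, which the abstract does not specify, and the function name and score format are hypothetical. The alignment step itself and the iteration on a new score matrix are omitted.

```python
from itertools import combinations

def merge_order(names, score):
    """Return the order in which groups would be joined.

    names: list of sequence names.
    score: dict mapping frozenset({a, b}) to a pairwise similarity score
           (higher means closer).
    """
    clusters = [(name,) for name in names]
    merges = []

    def linkage(c1, c2):
        # Assumption: average similarity over all cross-group pairs.
        pairs = [score[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest groups, as progressive alignment would.
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges
```

The returned list plays the role of a guide tree: each entry names the two groups that would be aligned at that stage.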
NPS@
NPS@: Network Protein Sequence Analysis
TIBS 2000 March Vol. 25, No 3 [291]:147-150
Combet C., Blanchet C., Geourjon C. and Deléage G.

P-SEA
P-SEA: a new efficient assignment of secondary structure from C alpha trace of proteins.
Comput Appl Biosci 1997 Jun;13(3):291-5
Labesse G, Colloc'h N, Pothier J, Mornon JP

MOTIVATION: The secondary structure is a key element of architectural organization in proteins. Accurate assignment of the secondary structure elements (SSE) (helix, strand, coil) is an essential step for the analysis and modelling of protein structure. Various methods have been proposed to assign secondary structure. Comparative studies of their results have shown some of their drawbacks, pointing out the difficulties in the task of SSE assignment.
RESULTS: We have designed a new automatic method, named P-SEA, to assign efficiently secondary structure from the sole C alpha position. Some advantages of the new algorithm are discussed.
AVAILABILITY: The program P-SEA is available by anonymous ftp: ftp.lmcp.jussieu.fr directory: pub/.
PREDATOR
Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence.
Protein Eng 1996 Feb;9(2):133-142
Frishman D, Argos P
European Molecular Biology Laboratory, Heidelberg, Germany.

Existing approaches to protein secondary structure prediction from the amino acid sequence usually rely on the statistics of local residue interactions within a sliding window and the secondary structural state of the central residue. The practically achieved accuracy limit of such single residue and single sequence prediction methods is 65% in three structural stages (alpha-helix, beta-strand and coil). Further improvement in the prediction quality is likely to require exploitation of various aspects of three-dimensional protein architecture. Here we make such an attempt and present an accurate algorithm for secondary structure prediction based on recognition of potentially hydrogen-bonded residues in a single amino acid sequence. The unique feature of our approach involves database-derived statistics on residue type occurrences in different classes of beta-bridges to delineate interacting beta-strands. The alpha-helical structures are also recognized on the basis of amino acid occurrences in hydrogen-bonded pairs (i,i + 4). The algorithm has a prediction accuracy of 68% in three structural stages, relies only on a single protein sequence as input and has the potential to be improved by 5-7% if homologous aligned sequences are also considered.
PSI-BLAST
See BLAST
Secondary consensus prediction
Protein structure prediction. Implications for the biologist.
Biochimie 1997 Nov;79(11):681-686
Deleage G, Blanchet C, Geourjon C
Institute of Biology and Chemistry of Proteins, Lyon, France.

Recent improvements in the prediction of protein secondary structure are described, particularly those methods using the information contained into multiple alignments. In this respect, the prediction accuracy has been checked and methods that take into account multiple alignments are 70% correct for a three-state description of secondary structure. This quality is obtained by a 'leave-one out' procedure on a reference database of proteins sharing less than 25% identity. Biological applications such as 'protein domain design' and structural phylogeny are given. The biologist's point of view is also considered and joint predictions are encouraged in order to derive an amino acid based accuracy. All the tools described in this paper are available for biologists on the Web (http://www.ibcp.fr/predict.html).
SIMPA96
An algorithm for secondary structure determination in proteins based on sequence similarity.
FEBS Lett 1986 Sep 15;205(2):303-308
Levin JM, Robson B, Garnier J

A secondary structure prediction algorithm is proposed on the hypothesis that short homologous sequences of amino acids have the same secondary structure tendencies. Comparisons are made with the secondary structure assignments of Kabsch and Sander from X-ray data [(1983) Biopolymers 22, 2577-2637] and an empirically determined similarity matrix which assigns a sequence similarity score between any two sequences of 7 residues in length. This similarity matrix differs in many respects from that of the Dayhoff substitution matrix [(1978) in: Atlas of Protein Sequence and Structure, (Dayhoff, M.O. ed). vol. 5. suppl. 3, pp. 353-358, National Biochemical Research Foundation, Washington, DC]. This homologue method had a prediction accuracy of 62.2% over 3 states for 61 proteins and 63.6% for a new set of 7 proteins not in the original data base.

Exploring the limits of nearest neighbour secondary structure prediction.
Protein Eng 1997 Jul;10(7):771-776
J. LEVIN.

SIMPA is a nearest neighbour method for predicting secondary structures. It uses a similarity matrix (BLOSUM 62 in its latest version), an optimized similarity threshold, a window of 13 to 17 residues and a database of observed secondary structures. In the simpa96 version used here, the database contains circa 300 proteins and the window is 13 residues long. Its crossvalidated accuracy was a Q3 of 67.7% for a single sequence and 72.8% when using multiple alignments of homologous sequences.

Main references:
- J. LEVIN, B. ROBSON, J. GARNIER. An Algorithm for secondary structure determination in proteins based on sequence similarity. FEBS, 205, (1986) 303-308. This describes the basic algorithm.
- J. LEVIN, J. GARNIER. Improvements in a secondary structure prediction method based on a search for local sequence homologies and its use as a model building tool. Biochim. Biophys. Acta, (1988) 955, 283-295. Here the window and threshold are optimized and the results are crossvalidated by jack knife process.
- J. LEVIN. Exploring the limits of nearest neighbour secondary structure prediction. Protein Eng. (1997) 10(7), 771-776. This corresponds to simpa96.
SOPMA
SOPMA: significant improvements in protein secondary structure prediction by consensus prediction from multiple alignments.
Comput Appl Biosci 1995 Dec;11(6):681-684
Geourjon C, Deleage G
Institut de Biologie et de Chimie des Proteines, UPR 412-CNRS, Lyon, France.

Recently a new method called the self-optimized prediction method (SOPM) has been described to improve the success rate in the prediction of the secondary structure of proteins. In this paper we report improvements brought about by predicting all the sequences of a set of aligned proteins belonging to the same family. This improved SOPM method (SOPMA) correctly predicts 69.5% of amino acids for a three-state description of the secondary structure (alpha-helix, beta-sheet and coil) in a whole database containing 126 chains of non-homologous (less than 25% identity) proteins. Joint prediction with SOPMA and a neural networks method (PHD) correctly predicts 82.2% of residues for 74% of co-predicted amino acids.
Sov parameter
Identification of related proteins with weak sequence identity using secondary structure information.
Protein Sci 2001 Apr;10(4):788-97
Geourjon C, Combet C, Blanchet C, Deleage G

Molecular modeling of proteins is confronted with the problem of finding homologous proteins, especially when few identities remain after the process of molecular evolution. Using even the most recent methods based on sequence identity detection, structural relationships are still difficult to establish with high reliability. As protein structures are more conserved than sequences, we investigated the possibility of using protein secondary structure comparison (observed or predicted structures) to discriminate between related and unrelated proteins sequences in the range of 10%-30% sequence identity. Pairwise comparison of secondary structures have been measured using the structural overlap (Sov) parameter. In this article, we show that if the secondary structures likeness is >50%, most of the pairs are structurally related. Taking into account the secondary structures of proteins that have been detected by BLAST, FASTA, or SSEARCH in the noisy region (with high E-value), we show that distantly related protein sequences (even with <20% identity) can be still identified. This strategy can be used to identify three-dimensional templates in homology modeling by finding unexpected related proteins and to select proteins for experimental investigation in a structural genomic approach, as well as for genome annotation.
SSEARCH
Identification of common molecular subsequences.
J. Mol. Biol. (1981) 147:195-197
Smith TF, Waterman MS
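
The local alignment method of this reference is a dynamic programming recurrence in which cell scores never drop below zero. The sketch below computes only the best local score, under simplifying assumptions that are not taken from the paper or from SSEARCH: a fixed match/mismatch score instead of a substitution matrix, and a linear gap penalty. The function name and default values are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere, which makes it local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows of the matrix are kept, so memory grows with the length of one sequence; recovering the aligned residues would require storing the full matrix or traceback pointers.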

STRIDE
Knowledge-based secondary structure assignment
Proteins: structure, function and genetics (1995), 23, 566-579
Frishman D & Argos P





DATABASES


AlphaFold database
Highly accurate protein structure prediction with AlphaFold
Nature (2021), 596, 583-589
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli & Demis Hassabis

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the 'protein folding problem'—has been an important open research problem for more than 50 years. Despite recent progress, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14), demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.


AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models
Nucleic Acids Research (2022), 50, D439-D444
Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, Augustin Žídek, Tim Green, Kathryn Tunyasuvunakool, Stig Petersen, John Jumper, Ellen Clancy, Richard Green, Ankur Vora, Mira Lutfi, Michael Figurnov, Andrew Cowie, Nicole Hobbs, Pushmeet Kohli, Gerard Kleywegt, Ewan Birney, Demis Hassabis, Sameer Velankar

The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) is an openly accessible, extensive database of high-accuracy protein-structure predictions. Powered by AlphaFold v2.0 of DeepMind, it has enabled an unprecedented expansion of the structural coverage of the known protein-sequence space. AlphaFold DB provides programmatic access to and interactive visualization of predicted atomic coordinates, per-residue and pairwise model-confidence estimates and predicted aligned errors. The initial release of AlphaFold DB contains over 360,000 predicted structures across 21 model-organism proteomes, which will soon be expanded to cover most of the (over 100 million) representative sequences from the UniRef90 data set.

BCL-2 database
BCL2DB: database of BCL-2 family members and BH3-only proteins
Database (2014), 2014, bau013
Valentine Rech de Laval, Gilbert Deléage, Abdel Aouacheria, Christophe Combet

BCL2DB (http://bcl2db.ibcp.fr) is a database designed to integrate data on BCL-2 family members and BH3-only proteins. These proteins control the mitochondrial apoptotic pathway and probably many other cellular processes as well. This large protein group is formed by a family of pro-apoptotic and anti-apoptotic homologs that have phylogenetic relationships with BCL-2, and by a collection of evolutionarily and structurally unrelated proteins characterized by the presence of a region of local sequence similarity with BCL-2, termed the BH3 motif. BCL2DB is monthly built, thanks to an automated procedure relying on a set of homemade profile HMMs computed from seed reference sequences representative of the various BCL-2 homologs and BH3-only proteins. The BCL2DB entries integrate data from the Ensembl, Ensembl Genomes, European Nucleotide Archive and Protein Data Bank databases and are enriched with specific information like protein classification into orthology groups and distribution of BH motifs along the sequences. The Web interface allows for easy browsing of the site and fast access to data, as well as sequence analysis with generic and specific tools. BCL2DB provides a helpful and powerful tool to both 'BCL-2-ologists' and researchers working in the various fields of physiopathology.

Bacterial tyrosine kinase database
BYKdb: the Bacterial protein tYrosine Kinase database
Nucleic Acids Research (2012), 40, D321-D324
Fanny Jadeau, Christophe Grangeasse, Lei Shi, Ivan Mijakovic, Gilbert Deléage, Christophe Combet

Bacterial tyrosine-kinases share no resemblance with their eukaryotic counterparts and they have been unified in a new protein family named BY-kinases. These enzymes have been shown to control several biological functions in the bacterial cells. In recent years biochemical studies, sequence analyses and structure resolutions allowed the deciphering of a common signature. However, BY-kinase sequence annotations in primary databases remain incomplete. This prompted us to develop a specialized database of computer-annotated BY-kinase sequences: the Bacterial protein tyrosine-kinase database (BYKdb). BY-kinase sequences are first identified, thanks to a workflow developed in a previous work. A second workflow annotates the UniProtKB entries in order to provide the BYKdb entries. The database can be accessed through a web interface that allows static and dynamic queries and offers integrated sequence analysis tools. BYKdb can be found at http://bykdb.ibcp.fr.

ESM Metagenomic Atlas
Evolutionary-scale prediction of atomic level protein structure with a language model
bioRxiv (2022), 2022.07.20.500902
Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives

Artificial intelligence has the potential to open insight into the structure of proteins at the scale of evolution. It has only recently been possible to extend protein structure prediction to two hundred million cataloged proteins. Characterizing the structures of the exponentially growing billions of protein sequences revealed by large scale gene sequencing experiments would necessitate a breakthrough in the speed of folding. Here we show that direct inference of structure from primary sequence using a large language model enables an order of magnitude speed-up in high resolution structure prediction. Leveraging the insight that language models learn evolutionary patterns across millions of sequences, we train models up to 15B parameters, the largest language model of proteins to date. As the language models are scaled they learn information that enables prediction of the three-dimensional structure of a protein at the resolution of individual atoms. This results in prediction that is up to 60x faster than state-of-the-art while maintaining resolution and accuracy. Building on this, we present the ESM Metagenomic Atlas. This is the first large-scale structural characterization of metagenomic proteins, with more than 617 million structures. The atlas reveals more than 225 million high confidence predictions, including millions whose structures are novel in comparison with experimentally determined structures, giving an unprecedented view into the vast breadth and diversity of the structures of some of the least understood proteins on earth.

European Nucleotide Archive
The European Nucleotide Archive in 2022
Nucleic Acids Research (2023), gkac1051
Josephine Burgin, Alisha Ahamed, Carla Cummins, Rajkumar Devraj, Khadim Gueye, Dipayan Gupta, Vikas Gupta, Muhammad Haseeb, Maira Ihsan, Eugene Ivanov, Suran Jayathilaka, Vishnukumar Balavenkataraman Kadhirvelu, Manish Kumar, Ankur Lathi, Rasko Leinonen, Milena Mansurova, Jasmine McKinnon, Colman O'Cathail, Joana Paupério, Stéphane Pesant, Nadim Rahman, Gabriele Rinck, Sandeep Selvakumar, Swati Suman, Senthilnathan Vijayaraja, Zahra Waheed, Peter Woollard, David Yuan, Ahmad Zyoud, Tony Burdett, Guy Cochrane

The European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena), maintained by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), offers those producing data an open and supported platform for the management, archiving, publication, and dissemination of data; and to the scientific community as a whole, it offers a globally comprehensive data set through a host of data discovery and retrieval tools. Here, we describe recent updates to the ENA's submission and retrieval services as well as focused efforts to improve connectivity, reusability, and interoperability of ENA data and metadata.

European Hepatitis C virus database
euHCVdb: the European hepatitis C virus database
Nucleic Acids Research (2007), 35, D363-D366
Christophe Combet, Nicolas Garnier, Céline Charavay, Delphine Grando, Daniel Crisan, Julien Lopez, Alexandre Dehne-Garcia, Christophe Geourjon, Emmanuel Bettler, Chantal Hulo, Philippe Le Mercier, Ralf Bartenschlager, Helmut Diepolder, Darius Moradpour, Jean-Michel Pawlotsky, Charles M Rice, Christian Trépo, François Penin, Gilbert Deléage

The hepatitis C virus (HCV) genome shows remarkable sequence variability, leading to the classification of at least six major genotypes, numerous subtypes and a myriad of quasispecies within a given host. A database allowing researchers to investigate the genetic and structural variability of all available HCV sequences is an essential tool for studies on the molecular virology and pathogenesis of hepatitis C as well as drug design and vaccine development. We describe here the European Hepatitis C Virus Database (euHCVdb, http://euhcvdb.ibcp.fr), a collection of computer-annotated sequences based on reference genomes. The annotations include genome mapping of sequences, use of recommended nomenclature, subtyping as well as three-dimensional (3D) molecular models of proteins. A WWW interface has been developed to facilitate database searches and the export of data for sequence and structure analyses. As part of an international collaborative effort with the US and Japanese databases, the European HCV Database (euHCVdb) is mainly dedicated to HCV protein sequences, 3D structures and functional analyses.

Hepatitis B virus database protein sequence
HBVdb: a knowledge database for Hepatitis B Virus
Nucleic Acids Research (2013), 41, D566-D570
Juliette Hayer, Fanny Jadeau, Gilbert Deléage, Alan Kay, Fabien Zoulim, Christophe Combet

We have developed a specialized database, HBVdb (http://hbvdb.ibcp.fr), allowing the researchers to investigate the genetic variability of Hepatitis B Virus (HBV) and viral resistance to treatment. HBV is a major health problem worldwide with more than 350 million individuals being chronically infected. HBV is an enveloped DNA virus that replicates by reverse transcription of an RNA intermediate. HBV genome is optimized, being circular and encoding four overlapping reading frames. Indeed, each nucleotide of the genome takes part in the coding of at least one protein. However, HBV shows some genome variability leading to at least eight different genotypes and recombinant forms. The main drugs used to treat infected patients are nucleos(t)ides analogs (reverse transcriptase inhibitors). Unfortunately, HBV mutants resistant to these drugs may be selected and be responsible for treatment failure. HBVdb contains a collection of computer-annotated sequences based on manually annotated reference genomes. The database can be accessed through a web interface that allows static and dynamic queries and offers integrated generic sequence analysis tools and specialized analysis tools (e.g. annotation, genotyping, drug resistance profiling).

Protein Data Bank
Announcing the worldwide Protein Data Bank
Nature Structural & Molecular Biology (2003), 10, 980
Helen Berman, Kim Henrick & Haruki Nakamura

In recognition of the growing international and interdisciplinary nature of structural biology, three organizations have formed a collaboration to oversee the newly formed worldwide Protein Data Bank (wwPDB; http://www.wwpdb.org/). The Research Collaboratory for Structural Bioinformatics (RCSB), the Macromolecular Structure Database (MSD) at the European Bioinformatics Institute (EBI) and the Protein Data Bank Japan (PDBj) at the Institute for Protein Research in Osaka University will serve as custodians of the wwPDB, with the goal of maintaining a single archive of macromolecular structural data that is freely and publicly available to the global community.

Swiss-Prot
See UniProt Knowledgebase

UniProt Knowledgebase
UniProt: the Universal Protein Knowledgebase in 2023
Nucleic Acids Research (2023), gkac1052
The UniProt Consortium

The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication we describe enhancements made to our data processing pipeline and to our website to adapt to an ever-increasing information content. The number of sequences in UniProtKB has risen to over 227 million and we are working towards including a reference proteome for each taxonomic group. We continue to extract detailed annotations from the literature to update or create reviewed entries, while unreviewed entries are supplemented with annotations provided by automated systems using a variety of machine-learning techniques. In addition, the scientific community continues their contributions of publications and annotations to UniProt entries of their interest. Finally, we describe our new website (https://www.uniprot.org/), designed to enhance our users' experience and make our data easily accessible to the research community. This interface includes access to AlphaFold structures for more than 85% of all entries as well as improved visualisations for subcellular localisation of proteins.

Last modification time: Wed Dec 7 15:21:37 2022.

© 1998-2022 Centre de Recherche en Cancerologie de Lyon, Pole Rhone-Alpes de BioInformatique, Centre National de la Recherche Scientifique, Institut national de la sante et de la recherche medicale, Universite Claude Bernard Lyon 1