BLAST Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res. 1997 Sep 1;25(17):3389-3402
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health,
Bethesda, MD 20894, USA. firstname.lastname@example.org
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a
variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be
decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word
hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three
times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments
produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific
Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more
sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of
the BRCT superfamily.
CLUSTALW CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap
penalties and weight matrix choice.
Nucleic Acids Res 1994 Nov 11;22(22):4673-4680
Thompson JD, Higgins DG, Gibson TJ
European Molecular Biology Laboratory, Heidelberg, Germany.
The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of
divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight
near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different
alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally
reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure.
Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up
of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W which is freely available.
Coiled-coil prediction Predicting coiled coils from protein sequences.
Science 1991 May 24;252(5010):1162-1164
Lupas A, Van Dyke M, Stock J
Department of Molecular Biology, Princeton University, NJ 08544.
The probability that a residue in a protein is part of a coiled-coil structure was assessed by comparison of its flanking sequences
with sequences of known coiled-coil proteins. This method was used to delineate coiled-coil domains in otherwise globular proteins,
such as the leucine zipper domains in transcriptional regulators, and to predict regions of discontinuity within coiled-coil
structures, such as the hinge region in myosin. More than 200 proteins that probably have coiled-coil domains were identified in
GenBank, including alpha- and beta-tubulins, flagellins, G protein beta subunits, some bacterial transfer RNA synthetases, and members
of the heat shock protein (Hsp70) family.
DSC Identification and application of the concepts important for accurate and reliable protein secondary structure prediction
Protein Sci 1996 Nov;5(11):2298-310
King RD, Sternberg MJ
Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, London, United Kingdom.
A protein secondary structure prediction method from multiply aligned homologous sequences is presented with an overall
per residue three-state accuracy of 70.1%. There are two aims: to obtain high accuracy by identification of a set of concepts
important for prediction followed by use of linear statistics; and to provide insight into the folding process. The important
concepts in secondary structure prediction are identified as: residue conformational propensities, sequence edge effects,
moments of hydrophobicity, position of insertions and deletions in aligned homologous sequence, moments of conservation,
auto-correlation, residue ratios, secondary structure feedback effects, and filtering. Explicit use of edge effects, moments of
conservation, and auto-correlation are new to this paper. The relative importance of the concepts used in prediction was
analyzed by stepwise addition of information and examination of weights in the discrimination function. The simple and
explicit structure of the prediction allows the method to be reimplemented easily. The accuracy of a prediction is predictable
a priori. This permits evaluation of the utility of the prediction: 10% of the chains predicted were identified correctly as
having a mean accuracy of > 80%. Existing high-accuracy prediction methods are "black-box" predictors based on complex
nonlinear statistics (e.g., neural networks in PHD: Rost & Sander, 1993a). For medium- to short-length chains (> or = 90
residues and < 170 residues), the prediction method is significantly more accurate (P < 0.01) than the PHD algorithm
(probably the most commonly used algorithm). In combination with the PHD, an algorithm is formed that is significantly
more accurate than either method, with an estimated overall three-state accuracy of 72.4%, the highest accuracy reported for
any prediction method.
DSSP Dictionary of protein secondary structure : pattern recognition of hydrogen-bonded and geometrical features
Biopolymers 1983, 22: 2577-2637
Kabsch W & Sander C
FASTA Searching protein sequence libraries: comparison of the sensitivity and selectivity
of the Smith-Waterman and FASTA algorithms.
PNAS (1988) 85:2444-2448
Department of Biochemistry, University of Virginia, Charlottesville 22908.
GOR The GOR method is based on information theory and was developed by J. Garnier, D. Osguthorpe and B. Robson (J. Mol. Biol. 120, 97, 1978).
The present version, GOR IV, uses all possible pair frequencies within a window of 17 amino acid residues and is reported by J.
Garnier, J.F. Gibrat and B. Robson in Methods in Enzymology, vol 266, p 540-553 (1996). After crossvalidation on a data base of 267
proteins, the version IV of GOR has a mean accuracy of 64.4% for a three state prediction (Q3). The program gives two outputs, one
eye-friendly giving the sequence and the predicted secondary structure in rows, H=helix, E=extended or beta strand and C=coil; the
second gives the probability values for each secondary structure at each amino acid position. The predicted secondary structure is
the one of highest probability compatible with a predicted helix segment of at least four residues and a predicted extended segment
of at least two residues.
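The minimum-segment-length rule described above can be sketched as follows. This is only an illustration, not the GOR IV implementation: the function names, the input format (one dict of H/E/C probabilities per residue) and the fallback strategy (a residue in a too-short segment takes its next most probable state) are assumptions made for the example.

```python
# Minimum run lengths taken from the text: helix >= 4, extended >= 2.
MIN_LEN = {"H": 4, "E": 2, "C": 1}


def runs(states):
    """Yield (start, end_exclusive, state) for each maximal run of one state."""
    start = 0
    for i in range(1, len(states) + 1):
        if i == len(states) or states[i] != states[start]:
            yield start, i, states[start]
            start = i


def assign_states(probs):
    """probs: list of {'H': p, 'E': p, 'C': p}, one dict per residue.

    Hypothetical strategy: start from the most probable state, then
    re-assign residues of any too-short H or E run to their next most
    probable state, repeating until every run satisfies MIN_LEN.
    """
    banned = [set() for _ in probs]

    def best(i):
        allowed = {s: p for s, p in probs[i].items() if s not in banned[i]}
        return max(allowed, key=allowed.get)

    states = [best(i) for i in range(len(probs))]
    changed = True
    while changed:
        changed = False
        for start, end, state in list(runs(states)):
            if end - start < MIN_LEN[state]:
                for i in range(start, end):
                    banned[i].add(state)   # never retry the rejected state
                    states[i] = best(i)
                changed = True
                break                      # run boundaries moved; rescan
    return "".join(states)
```

Because coil has no minimum length and is never rejected, the loop always terminates.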
Helix-turn-helix DNA-binding motifs prediction Improved detection of helix-turn-helix DNA-binding motifs in protein sequences.
Nucleic Acids Res 1990 Sep 11;18(17):5019-5026
Dodd IB, Egan JB
Department of Biochemistry, University of Adelaide, Australia.
We present an update of our method for systematic detection and evaluation of potential helix-turn-helix DNA-binding motifs in
protein sequences [Dodd, I. and Egan, J. B. (1987) J. Mol. Biol. 194, 557-564]. The new method is considerably more powerful,
detecting approximately 50% more likely helix-turn-helix sequences without an increase in false predictions. This improvement is due
almost entirely to the use of a much larger reference set of 91 presumed helix-turn-helix sequences. The scoring matrix derived from
this reference set has been calibrated against a large protein sequence database so that the score obtained by a sequence can be used
to give a practical estimation of the probability that the sequence is a helix-turn-helix motif.
MLRC Improved Performance in Protein Secondary Structure Prediction by Inhomogeneous Score Combination
Bioinformatics vol. 15 no. 5 1999 pp 413-421
Guermeur Y, Geourjon C, Gallinari P, & Deleage G
Motivation: In many fields of pattern recognition, combination has proved efficient to increase the generalization performance of individual
prediction methods. Numerous systems have been developed for protein secondary structure prediction, based on different principles.
Finding better ensemble methods for this task may thus become crucial. In addition, efforts need to be made to help the biologist in
the post-processing of the outputs.
An ensemble method has been designed to post-process the outputs of protein secondary structure prediction methods, in order to obtain
an improvement of prediction accuracy while generating class posterior probability estimates. Experimental results establish that it
can increase the recognition rate of methods that provide inhomogeneous scores, even if their individual prediction successes are
largely different. This combination thus constitutes a help for the biologist, who can use it confidently on top of any set of
prediction methods. Furthermore, the resulting estimates can be used in various ways, for instance to determine which residues are
predicted with a given high level of reliability.
Free availability over the internet on the Network Protein Sequence @nalysis (NPS@) WWW server at https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=NPSA/npsa_mlrc.html. The method is proposed as the default choice.
Multalin Multiple sequence alignment with hierarchical clustering.
Nucleic Acids Res 1988 Nov 25;16(22):10881-10890
Laboratoire de Genetique Cellulaire, INRA Toulouse, France.
An algorithm is presented for the multiple alignment of sequences, either proteins or nucleic acids, that is both accurate and easy to
use on microcomputers. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, a
hierarchical clustering of the sequences is performed using the matrix of the pairwise alignment scores. The closest sequences are
aligned creating groups of aligned sequences. Then close groups are aligned until all sequences are aligned in one group. The pairwise
alignments included in the multiple alignment form a new matrix that is used to produce a hierarchical clustering. If it is different
from the first one, iteration of the process can be performed. The method is illustrated by an example: a global alignment of 39
sequences of cytochrome c.
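The clustering step described above (closest sequences are grouped first, then close groups are joined until one group remains) can be sketched as below. The linkage rule is an assumption: the example uses the average pairwise score between groups, which the abstract does not specify, and the function name is hypothetical. The alignment itself is not performed here, only the order in which groups would be merged.

```python
def cluster_order(scores):
    """Return the merge order for a symmetric matrix of pairwise scores.

    scores[i][j] is the pairwise alignment score of sequences i and j
    (higher means more similar). Each merge is a pair of tuples holding
    the sequence indices of the two groups being joined.
    """
    groups = [(i,) for i in range(len(scores))]
    merges = []

    def average_score(a, b):
        # Assumed linkage rule: mean score over all cross-group pairs.
        return sum(scores[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(groups) > 1:
        # Pick the pair of groups with the highest average score.
        _, x, y = max(
            (average_score(a, b), x, y)
            for x, a in enumerate(groups)
            for y, b in enumerate(groups)
            if x < y
        )
        merges.append((groups[x], groups[y]))
        merged = groups[x] + groups[y]
        groups = [g for k, g in enumerate(groups) if k not in (x, y)] + [merged]
    return merges
```

In the iterative scheme of the abstract, the scores taken from the resulting multiple alignment would be fed back into the same function, and the process repeated if the merge order changes.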
NPS@ NPS@: Network Protein Sequence Analysis
TIBS 2000 March Vol. 25, No 3 :147-150
Combet C., Blanchet C., Geourjon C. and Deléage G.
P-SEA P-SEA: a new efficient assignment of secondary structure from C alpha trace of proteins.
Comput Appl Biosci 1997 Jun;13(3):291-5
Labesse G, Colloc'h N, Pothier J, Mornon JP
MOTIVATION: The secondary structure is a key element of architectural organization in proteins. Accurate assignment of the secondary
structure elements (SSE) (helix, strand, coil) is an essential step for the analysis and modelling of protein structure. Various
methods have been proposed to assign secondary structure. Comparative studies of their results have shown some of their drawbacks,
pointing out the difficulties in the task of SSE assignment.
RESULTS: We have designed a new automatic method, named P-SEA, to assign efficiently secondary structure from the sole C alpha
position. Some advantages of the new algorithm are discussed.
AVAILABILITY: The program P-SEA is available by anonymous ftp: ftp.lmcp.jussieu.fr directory: pub/.
PREDATOR Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence.
Protein Eng 1996 Feb;9(2):133-142
Frishman D, Argos P
European Molecular Biology Laboratory, Heidelberg, Germany.
Existing approaches to protein secondary structure prediction from the amino acid sequence usually rely on the statistics of local
residue interactions within a sliding window and the secondary structural state of the central residue. The practically achieved
accuracy limit of such single residue and single sequence prediction methods is 65% in three structural states (alpha-helix,
beta-strand and coil). Further improvement in the prediction quality is likely to require exploitation of various aspects of
three-dimensional protein architecture. Here we make such an attempt and present an accurate algorithm for secondary structure
prediction based on recognition of potentially hydrogen-bonded residues in a single amino acid sequence. The unique feature of our
approach involves database-derived statistics on residue type occurrences in different classes of beta-bridges to delineate
interacting beta-strands. The alpha-helical structures are also recognized on the basis of amino acid occurrences in hydrogen-bonded
pairs (i,i + 4). The algorithm has a prediction accuracy of 68% in three structural states, relies only on a single protein sequence
as input and has the potential to be improved by 5-7% if homologous aligned sequences are also considered.
PSI-BLAST See BLAST
Secondary consensus prediction Protein structure prediction. Implications for the biologist.
Biochimie 1997 Nov;79(11):681-686
Deleage G, Blanchet C, Geourjon C
Institute of Biology and Chemistry of Proteins, Lyon, France.
Recent improvements in the prediction of protein secondary structure are described, particularly those methods using the information
contained in multiple alignments. In this respect, the prediction accuracy has been checked and methods that take into account
multiple alignments are 70% correct for a three-state description of secondary structure. This quality is obtained by a 'leave-one
out' procedure on a reference database of proteins sharing less than 25% identity. Biological applications such as 'protein domain
design' and structural phylogeny are given. The biologist's point of view is also considered and joint predictions are encouraged in
order to derive an amino acid based accuracy. All the tools described in this paper are available for biologists on the Web
SIMPA96 An algorithm for secondary structure determination in proteins based on sequence similarity.
FEBS Lett 1986 Sep 15;205(2):303-308
Levin JM, Robson B, Garnier J
A secondary structure prediction algorithm is proposed on the hypothesis that short homologous sequences of amino acids have the same
secondary structure tendencies. Comparisons are made with the secondary structure assignments of Kabsch and Sander from X-ray data
[(1983) Biopolymers 22, 2577-2637] and an empirically determined similarity matrix which assigns a sequence similarity score between
any two sequences of 7 residues in length. This similarity matrix differs in many respects from that of the Dayhoff substitution
matrix [(1978) in: Atlas of Protein Sequence and Structure, (Dayhoff, M.O. ed). vol. 5. suppl. 3, pp. 353-358, National Biochemical
Research Foundation, Washington, DC]. This homologue method had a prediction accuracy of 62.2% over 3 states for 61 proteins and 63.6%
for a new set of 7 proteins not in the original data base.
Exploring the limits of nearest neighbour secondary structure prediction.
Protein Eng. (1997),7, 771-776
SIMPA is a nearest neighbour method for predicting secondary structures using a similarity matrix (BLOSUM 62 in its latest
version), an optimized similarity threshold, a window of 13 to 17 residues and a database of observed secondary structures. In the
simpa96 version used here, the database contains circa 300 proteins and the window is 13 residues long. Its crossvalidated accuracy was a Q3
of 67.7% for a single sequence and 72.8% when using multiple alignments of homologous sequences.
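The nearest neighbour idea can be illustrated with the sketch below. It is not the SIMPA code: residue identity stands in for the similarity matrix, the window and threshold defaults are arbitrary illustration values, and the function name and database format (pairs of sequence and observed structure strings) are assumptions.

```python
from collections import defaultdict


def predict(query, database, window=7, threshold=5):
    """Nearest-neighbour secondary structure sketch.

    database: list of (sequence, observed_structure) string pairs of
    equal length. Every query window is compared with every database
    window; when the score reaches the threshold, the observed states of
    the database window vote, weighted by the score, for the matching
    query positions.
    """
    votes = [defaultdict(float) for _ in query]
    for q in range(len(query) - window + 1):
        query_window = query[q:q + window]
        for seq, structure in database:
            for d in range(len(seq) - window + 1):
                # Identity count: a stand-in for a similarity matrix score.
                score = sum(a == b for a, b in zip(query_window, seq[d:d + window]))
                if score >= threshold:
                    for k in range(window):
                        votes[q + k][structure[d + k]] += score
    # Positions that received no vote default to coil (an assumption).
    return "".join(max(v, key=v.get) if v else "C" for v in votes)
```

Replacing the identity count with a lookup in a substitution matrix and tuning the window and threshold would bring the sketch closer to the method described above.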
- J. LEVIN, B. ROBSON, J. GARNIER. An Algorithm for secondary structure determination in proteins based on sequence similarity.
FEBS, 205, (1986) 303-308. This describes the basic algorithm.
- J. LEVIN, J. GARNIER. Improvements in a secondary structure prediction method based on a search for local sequence homologies and
its use as a model building tool. Biochim. Biophys. Acta, (1988) 955, 283-295. Here the window and threshold are optimized and the
results are crossvalidated by a jackknife procedure.
- J. LEVIN. Exploring the limits of nearest neighbour secondary structure prediction. Protein Eng. (1997),7, 771-776 This
corresponds to simpa96.
SOPMA SOPMA: significant improvements in protein secondary structure prediction by consensus prediction from multiple alignments.
Comput Appl Biosci 1995 Dec;11(6):681-684
Geourjon C, Deleage G
Institut de Biologie et de Chimie des Proteines, UPR 412-CNRS, Lyon, France.
Recently a new method called the self-optimized prediction method (SOPM) has been described to improve the success rate in the
prediction of the secondary structure of proteins. In this paper we report improvements brought about by predicting all the sequences
of a set of aligned proteins belonging to the same family. This improved SOPM method (SOPMA) correctly predicts 69.5% of amino acids
for a three-state description of the secondary structure (alpha-helix, beta-sheet and coil) in a whole database containing 126 chains
of non-homologous (less than 25% identity) proteins. Joint prediction with SOPMA and a neural networks method (PHD) correctly predicts
82.2% of residues for 74% of co-predicted amino acids.
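The two kinds of figures quoted above, a three-state per-residue accuracy and a joint prediction reported as an accuracy over the fraction of co-predicted residues, can be computed as in the following sketch. The function names are hypothetical and the strings are made-up toy data, not results from the paper.

```python
def q3(predicted, observed):
    """Fraction of residues whose predicted state equals the observed one."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def joint_prediction(pred_a, pred_b, observed):
    """Return (coverage, accuracy) for residues where both methods agree.

    coverage: fraction of residues given the same state by both methods.
    accuracy: fraction of those agreed residues that match the observation.
    """
    agreed = [(a, o) for a, b, o in zip(pred_a, pred_b, observed) if a == b]
    coverage = len(agreed) / len(observed)
    accuracy = sum(a == o for a, o in agreed) / len(agreed) if agreed else 0.0
    return coverage, accuracy


# Toy example with invented strings.
observed = "HHHHCCEE"
method_a = "HHHCCCEE"
method_b = "HHHHCCCE"
print(q3(method_a, observed))
print(joint_prediction(method_a, method_b, observed))
```

Restricting the evaluation to residues on which both methods agree trades coverage for reliability, which is why the joint accuracy is higher than that of either method alone.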
Sov parameter Identification of related proteins with weak sequence identity using secondary structure information.
Protein Sci 2001 Apr;10(4):788-97
Geourjon C, Combet C, Blanchet C, Deleage G
Molecular modeling of proteins is confronted with the problem of finding homologous proteins, especially when few identities remain after the process of molecular evolution. Using even the most recent methods based on sequence identity detection, structural relationships are still difficult to establish with high reliability. As protein structures are more conserved than sequences, we investigated the possibility of using protein secondary structure comparison (observed or predicted structures) to discriminate between related and unrelated protein sequences in the range of 10%-30% sequence identity. Pairwise comparison of secondary structures has been measured using the structural overlap (Sov) parameter. In this article, we show that if the secondary structure likeness is >50%, most of the pairs are structurally related. Taking into account the secondary structures of proteins that have been detected by BLAST, FASTA, or SSEARCH in the noisy region (with high E-value), we show that distantly related protein sequences (even with <20% identity) can still be identified. This strategy can be used to identify three-dimensional templates in homology modeling by finding unexpected related proteins and to select proteins for experimental investigation in a structural genomic approach, as well as for genome annotation.
SSEARCH Identification of common molecular subsequences.
J. Mol. Biol. (1981) 147:195-197
Smith TF, Waterman MS
STRIDE Knowledge-based secondary structure assignment
Proteins: structure, function and genetics (1995), 23, 566-579
Frishman D & Argos P
AlphaFold database Highly accurate protein structure prediction with AlphaFold
Nature (2021), 596, 583-589
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli & Demis Hassabis
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the 'protein folding problem'—has been an important open research problem for more than 50 years. Despite recent progress, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14), demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.
AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models
Nucleic Acids Research (2022), 50, D439-D444
Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, Augustin Žídek, Tim Green, Kathryn Tunyasuvunakool, Stig Petersen, John Jumper, Ellen Clancy, Richard Green, Ankur Vora, Mira Lutfi, Michael Figurnov, Andrew Cowie, Nicole Hobbs, Pushmeet Kohli, Gerard Kleywegt, Ewan Birney, Demis Hassabis, Sameer Velankar
The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) is an openly accessible, extensive database of high-accuracy protein-structure predictions. Powered by AlphaFold v2.0 of DeepMind, it has enabled an unprecedented expansion of the structural coverage of the known protein-sequence space. AlphaFold DB provides programmatic access to and interactive visualization of predicted atomic coordinates, per-residue and pairwise model-confidence estimates and predicted aligned errors. The initial release of AlphaFold DB contains over 360,000 predicted structures across 21 model-organism proteomes, which will soon be expanded to cover most of the (over 100 million) representative sequences from the UniRef90 data set.
BCL-2 database BCL2DB: database of BCL-2 family members and BH3-only proteins
Database (2014), 2014, bau013
Valentine Rech de Laval, Gilbert Deléage, Abdel Aouacheria, Christophe Combet
BCL2DB (http://bcl2db.ibcp.fr) is a database designed to integrate data on BCL-2 family members and BH3-only proteins. These proteins control the mitochondrial apoptotic pathway and probably many other cellular processes as well. This large protein group is formed by a family of pro-apoptotic and anti-apoptotic homologs that have phylogenetic relationships with BCL-2, and by a collection of evolutionarily and structurally unrelated proteins characterized by the presence of a region of local sequence similarity with BCL-2, termed the BH3 motif. BCL2DB is built monthly, thanks to an automated procedure relying on a set of homemade profile HMMs computed from seed reference sequences representative of the various BCL-2 homologs and BH3-only proteins. The BCL2DB entries integrate data from the Ensembl, Ensembl Genomes, European Nucleotide Archive and Protein Data Bank databases and are enriched with specific information like protein classification into orthology groups and distribution of BH motifs along the sequences. The Web interface allows for easy browsing of the site and fast access to data, as well as sequence analysis with generic and specific tools. BCL2DB provides a helpful and powerful tool to both 'BCL-2-ologists' and researchers working in the various fields of physiopathology.
Bacterial tyrosine kinase database BYKdb: the Bacterial protein tYrosine Kinase database
Nucleic Acids Research (2012), 40, D321-D324
Fanny Jadeau, Christophe Grangeasse, Lei Shi, Ivan Mijakovic, Gilbert Deléage, Christophe Combet
Bacterial tyrosine-kinases share no resemblance with their eukaryotic counterparts and they have been unified in a new protein family named BY-kinases. These enzymes have been shown to control several biological functions in the bacterial cells. In recent years biochemical studies, sequence analyses and structure resolutions allowed the deciphering of a common signature. However, BY-kinase sequence annotations in primary databases remain incomplete. This prompted us to develop a specialized database of computer-annotated BY-kinase sequences: the Bacterial protein tyrosine-kinase database (BYKdb). BY-kinase sequences are first identified, thanks to a workflow developed in a previous work. A second workflow annotates the UniProtKB entries in order to provide the BYKdb entries. The database can be accessed through a web interface that allows static and dynamic queries and offers integrated sequence analysis tools. BYKdb can be found at http://bykdb.ibcp.fr.
ESM Metagenomic Atlas Evolutionary-scale prediction of atomic level protein structure with a language model
bioRxiv (2022), 2022.07.20.500902
Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives.
Artificial intelligence has the potential to open insight into the structure of proteins at the scale of evolution. It has only recently been possible to extend protein structure prediction to two hundred million cataloged proteins. Characterizing the structures of the exponentially growing billions of protein sequences revealed by large scale gene sequencing experiments would necessitate a breakthrough in the speed of folding. Here we show that direct inference of structure from primary sequence using a large language model enables an order of magnitude speed-up in high resolution structure prediction. Leveraging the insight that language models learn evolutionary patterns across millions of sequences, we train models up to 15B parameters, the largest language model of proteins to date. As the language models are scaled they learn information that enables prediction of the three-dimensional structure of a protein at the resolution of individual atoms. This results in prediction that is up to 60x faster than state-of-the-art while maintaining resolution and accuracy. Building on this, we present the ESM Metagenomic Atlas. This is the first large-scale structural characterization of metagenomic proteins, with more than 617 million structures. The atlas reveals more than 225 million high confidence predictions, including millions whose structures are novel in comparison with experimentally determined structures, giving an unprecedented view into the vast breadth and diversity of the structures of some of the least understood proteins on earth.
European Nucleotide Archive The European Nucleotide Archive in 2022
Nucleic Acids Research (2023), gkac1051
Josephine Burgin, Alisha Ahamed, Carla Cummins, Rajkumar Devraj, Khadim Gueye, Dipayan Gupta, Vikas Gupta, Muhammad Haseeb, Maira Ihsan, Eugene Ivanov, Suran Jayathilaka, Vishnukumar Balavenkataraman Kadhirvelu, Manish Kumar, Ankur Lathi, Rasko Leinonen, Milena Mansurova, Jasmine McKinnon, Colman O'Cathail, Joana Paupério, Stéphane Pesant, Nadim Rahman, Gabriele Rinck, Sandeep Selvakumar, Swati Suman, Senthilnathan Vijayaraja, Zahra Waheed, Peter Woollard, David Yuan, Ahmad Zyoud, Tony Burdett, Guy Cochrane
The European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena), maintained by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), offers those producing data an open and supported platform for the management, archiving, publication, and dissemination of data; and to the scientific community as a whole, it offers a globally comprehensive data set through a host of data discovery and retrieval tools. Here, we describe recent updates to the ENA's submission and retrieval services as well as focused efforts to improve connectivity, reusability, and interoperability of ENA data and metadata.
European Hepatitis C virus database euHCVdb: the European hepatitis C virus database
Nucleic Acids Research (2007), 35, D363-D366
Christophe Combet, Nicolas Garnier, Céline Charavay, Delphine Grando, Daniel Crisan, Julien Lopez, Alexandre Dehne-Garcia, Christophe Geourjon, Emmanuel Bettler, Chantal Hulo, Philippe Le Mercier, Ralf Bartenschlager, Helmut Diepolder, Darius Moradpour, Jean-Michel Pawlotsky, Charles M Rice, Christian Trépo, François Penin, Gilbert Deléage
The hepatitis C virus (HCV) genome shows remarkable sequence variability, leading to the classification of at least six major genotypes, numerous subtypes and a myriad of quasispecies within a given host. A database allowing researchers to investigate the genetic and structural variability of all available HCV sequences is an essential tool for studies on the molecular virology and pathogenesis of hepatitis C as well as drug design and vaccine development. We describe here the European Hepatitis C Virus Database (euHCVdb, http://euhcvdb.ibcp.fr), a collection of computer-annotated sequences based on reference genomes. The annotations include genome mapping of sequences, use of recommended nomenclature, subtyping as well as three-dimensional (3D) molecular models of proteins. A WWW interface has been developed to facilitate database searches and the export of data for sequence and structure analyses. As part of an international collaborative effort with the US and Japanese databases, the European HCV Database (euHCVdb) is mainly dedicated to HCV protein sequences, 3D structures and functional analyses.
Hepatitis B virus database protein sequence HBVdb: a knowledge database for Hepatitis B Virus
Nucleic Acids Research (2013), 41, D566-D570
Juliette Hayer, Fanny Jadeau, Gilbert Deléage, Alan Kay, Fabien Zoulim, Christophe Combet
We have developed a specialized database, HBVdb (http://hbvdb.ibcp.fr), allowing researchers to investigate the genetic variability of Hepatitis B Virus (HBV) and viral resistance to treatment. HBV is a major health problem worldwide with more than 350 million individuals being chronically infected. HBV is an enveloped DNA virus that replicates by reverse transcription of an RNA intermediate. The HBV genome is optimized, being circular and encoding four overlapping reading frames. Indeed, each nucleotide of the genome takes part in the coding of at least one protein. However, HBV shows some genome variability leading to at least eight different genotypes and recombinant forms. The main drugs used to treat infected patients are nucleos(t)ide analogs (reverse transcriptase inhibitors). Unfortunately, HBV mutants resistant to these drugs may be selected and be responsible for treatment failure. HBVdb contains a collection of computer-annotated sequences based on manually annotated reference genomes. The database can be accessed through a web interface that allows static and dynamic queries and offers integrated generic sequence analysis tools and specialized analysis tools (e.g. annotation, genotyping, drug resistance profiling).
Protein Data Bank Announcing the worldwide Protein Data Bank
Nature Structural & Molecular Biology (2003), 10, 980
Helen Berman, Kim Henrick & Haruki Nakamura
In recognition of the growing international and interdisciplinary nature of structural biology, three organizations have formed a collaboration to oversee the newly formed worldwide Protein Data Bank (wwPDB; http://www.wwpdb.org/). The Research Collaboratory for Structural Bioinformatics (RCSB), the Macromolecular Structure Database (MSD) at the European Bioinformatics Institute (EBI) and the Protein Data Bank Japan (PDBj) at the Institute for Protein Research in Osaka University will serve as custodians of the wwPDB, with the goal of maintaining a single archive of macromolecular structural data that is freely and publicly available to the global community.
Swiss-Prot See UniProt KnowledgeBase
UniProt Knowledge Base UniProt: the Universal Protein Knowledgebase in 2023
Nucleic Acids Research (2023), gkac1052
The UniProt Consortium
The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication we describe enhancements made to our data processing pipeline and to our website to adapt to an ever-increasing information content. The number of sequences in UniProtKB has risen to over 227 million and we are working towards including a reference proteome for each taxonomic group. We continue to extract detailed annotations from the literature to update or create reviewed entries, while unreviewed entries are supplemented with annotations provided by automated systems using a variety of machine-learning techniques. In addition, the scientific community continues their contributions of publications and annotations to UniProt entries of their interest. Finally, we describe our new website (https://www.uniprot.org/), designed to enhance our users' experience and make our data easily accessible to the research community. This interface includes access to AlphaFold structures for more than 85% of all entries as well as improved visualisations for subcellular localisation of proteins.
Last modification time : Wed Dec 7 15:21:37 2022.