Information

Blosum matrix with probabilities instead of the positive and negative scores

I am trying to find a version of the BLOSUM matrix that has the frequencies instead of the scaled log-odds scores, i.e., instead of the common version that tells us that the score for LEU/ASP is -4, I would like to know the probability of LEU being replaced by ASP.


Download the BLOSUM data and source code from here. Unzip the archive, which contains several files. The file called blosum'XX'.qij has the co-occurrence probabilities, and the substitution probabilities can be calculated from them.
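
For example, here is a minimal Python sketch that converts the co-occurrence probabilities q_ij from such a file into conditional substitution probabilities P(j | i). It assumes the blosum62.qij layout of '#' comment lines, one line of residue letters, and a lower-triangular matrix of pair probabilities; adjust the parser if your file differs.

    import numpy as np

    def load_qij(path):
        """Parse a blosumXX.qij file (assumed format: '#' comment lines, one line
        of residue letters, then a lower-triangular matrix of pair probabilities)."""
        alphabet, rows = None, []
        for line in open(path):
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if alphabet is None:
                alphabet = line.split()
            else:
                rows.append([float(x) for x in line.split()])
        n = len(alphabet)
        q = np.zeros((n, n))
        for i, row in enumerate(rows):
            for j, val in enumerate(row):
                q[i, j] = q[j, i] = val
        return alphabet, q

    alphabet, q = load_qij("blosum62.qij")
    # Marginal amino acid probabilities: p_i = q_ii + sum_{j != i} q_ij / 2
    p = (q.sum(axis=1) + np.diag(q)) / 2
    # Conditional substitution probabilities P(j | i)
    cond = q / (2 * p[:, None])
    np.fill_diagonal(cond, np.diag(q) / p)
    L, D = alphabet.index("L"), alphabet.index("D")
    print("P(D | L) =", cond[L, D])   # probability of Leu being replaced by Asp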

Also have a look at this article.


It is clear that not all sites in homologous proteins are conserved to the same extent. Those that are essential will be highly conserved (intolerant of change), whereas others that are less important for structure and function will be under less evolutionary constraint (tolerant of change). Here, Ng and Henikoff describe an algorithm, SIFT, a sequence homology-based method that sorts intolerant from tolerant amino-acid substitutions. By aligning multiple similar sequences and assessing the probability of substitution at any given position in the sequence, SIFT helps to assess the impact of an amino-acid replacement on the structure or function of a protein. This method might be useful in the following circumstances: during mutation screening, when the status of a mutation suspected to be pathogenic cannot be formally shown (for example, in the absence of parental DNA); to assess the impact of amino-acid substitutions on fitness at a genomic scale; and in population genetics, to avoid using markers that may be undergoing selective pressure.

SIFT takes a query sequence and searches for similar sequences using well-known tools (PSI-BLAST and MOTIF). Then, a multiple sequence alignment is obtained and the normalized probabilities for all possible substitutions at each position of the alignment are calculated (providing position-specific information). If the probability of the substitution is lower than a specified cutoff, the change is considered to be deleterious. The performance of SIFT was tested using three mutation data sets: the repressor of the lactose operon (LacI), the HIV-1 protease, and the bacteriophage T4 lysozyme. The prediction accuracy of SIFT is in the range of 60-80%, depending on the data set. In all cases, the performance of SIFT has been compared with the conclusions drawn from the look-up scoring matrix BLOSUM62 (Block substitution matrix), which is used, as are many others, to assess the significance of a protein sequence alignment (as in BLAST). BLOSUM62 helps to distinguish between a 'real' biological result and a sequence alignment obtained by chance. In BLOSUM, each possible amino-acid change is assigned a score, where positive scores are associated with conservative changes and negative scores with less conservative changes. Position-specific information is lost in the BLOSUM matrix but is retained by SIFT, so SIFT outperforms BLOSUM62-derived conclusions.
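
As a toy sketch of the core idea only (not the actual SIFT implementation, which adds PSI-BLAST searches, sequence weighting, and pseudocounts), the position-specific cutoff test might look like this; the alignment and cutoff below are made up for illustration:

    # Toy illustration of a position-specific tolerance test.
    alignment = ["MKTLLD", "MKSLLD", "MRTLLE", "MKTLLD"]   # hypothetical aligned homologs
    CUTOFF = 0.05                                          # illustrative probability cutoff

    def position_probabilities(column):
        """Normalized amino acid frequencies observed in one alignment column."""
        return {aa: column.count(aa) / len(column) for aa in set(column)}

    def predict(position, new_aa):
        column = [seq[position] for seq in alignment]
        prob = position_probabilities(column).get(new_aa, 0.0)
        return "tolerated" if prob >= CUTOFF else "deleterious"

    print(predict(1, "R"))   # R is observed at position 1 in one homolog: tolerated
    print(predict(3, "G"))   # G is never observed at position 3: deleterious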


Construction of substitution matrices

It is possible to measure sequence similarity in many different ways, such as counting the number of positions at which two sequences differ (Hamming distance), counting the number of insertions, deletions, and substitutions required to make two sequences identical (Levenshtein distance), computing percent identity, or simply using an arbitrary scoring system for matches, mismatches, insertions, and deletions. All these methods yield a measure of the relationship between the sequences, but none reflects any biological association between them.

In the realm of bioinformatics, we are interested in an evolutionary relationship of DNA and protein sequences, except in the case of sequence assembly where measuring sequencing errors and separating repeats are central.

Sequences may be more or less similar by sheer random chance, and consequently we need a method to distinguish random similarity from similarity caused by an evolutionary relationship. In other words, we wish to know whether sequences are homologous, i.e., have a common ancestor, and in particular whether they have the same function despite not having identical sequences. Being able to determine whether two sequences have the same function is useful in appraising the function of an unknown protein or gene by comparison to a known one.

Figure 1. A schematic description of the evolution of homologous gene sequences, i.e., sequences that have a common ancestor. Homologous sequences are further divided into paralogous and orthologous sequences.

The amino acid sequence of a protein is crucial in determining its structure, and in turn the function is profoundly dependent on the three-dimensional structure of the protein. Many mutations that replace an amino acid with one having similar physicochemical properties may not alter the protein structure in any functionally critical way. In contrast, a single amino acid change may alter the function. Note that we can only observe the cases where an altered function is not deleterious and thus does not result in the death of the organism. Further, changes that result in an altered function still produce homologous proteins, but the proteins are no longer orthologous since they do not have the same function (Figure 1).

Consequently, by observing mutations among orthologous protein sequences, we can determine which amino acid changes are possible without altering the function of a protein. Further, by enumerating the frequencies of these changes, we can construct scoring systems.

Research done first by Margaret Dayhoff and colleagues in the 1970s, and later by Henikoff and Henikoff in the early 1990s, resulted in the PAM and BLOSUM substitution matrices, which are the most widely used today. This tutorial describes their construction and use.

BLOSUM matrices

By studying a broad set of sequences from different species, known to be homologous and having the same function, i.e., orthologous sequences, we can observe changes in amino acids that preserve a function.

To measure the amino acid frequencies, Henikoff and Henikoff analyzed conserved regions of related protein sequences obtained from the BLOCKS database. In total, they examined 2,000 ungapped blocks from 500 groups of related proteins, counting the number of matches and mismatches of each type among the 20 different amino acids.

From these counts, Henikoff and Henikoff created a frequency table, and using these frequencies they computed the probability of each type of match and mismatch and then converted the probabilities into log-odds ratios. This way, the alignment score becomes zero if the observed frequencies are as expected, negative if the frequencies are less than expected, and positive if they are greater than expected.

However, these are not yet the final scores in the BLOSUM matrix. To get them, Henikoff and Henikoff further converted the log-odds ratios into bit units, multiplied each bit score by a scaling factor of two, and rounded to the nearest integer, producing the final scores in the BLOSUM matrix.

A family of matrices

Sequences in a whole protein family cluster may be quite divergent due to contributions from distant relatives. Therefore, Henikoff and Henikoff divided the family clusters into sub-clusters by their percentage of similarity to reduce multiple contributions to the amino acid pair frequencies. This division resulted in the BLOSUM family of matrices, where the associated number gives the clustering threshold: in BLOSUM65 the scores come from clusters of sequences that are at least 65% similar, in BLOSUM80 from clusters with at least 80% similarity, and so on.


Figure 3. Example column of sequence alignment of ten sequences of a conserved block. Nine Ds and one N.

The math

As an example, we consider a column consisting of nine Ds and one N (Figure 3). It contains nine D-N pairs and 36 (1 + 2 + 3 + … + 8) possible D-D pairs, for a total of 45 pairs.

To create a frequency table, we count the number of times \( f_{ij} \) each of the 210 (\( 20 + 19 + \dots + 1 \)) possible amino acid pairs occurs in the blocks. A block of \( w \) columns and a depth of \( d \) sequences contributes \( wd(d-1)/2 \) amino acid pairs to the count. In this example \( d = 10 \) and \( w = 1 \); thus, the block contributes \( 1 \times 10 \times (10-1)/2 = 45 \) amino acid pairs.

The observed probability of occurrence \( q_{ij} \) of each amino acid pair \( i, j \) is

\[ q_{ij} = \frac{f_{ij}}{\sum_{i=1}^{20} \sum_{j=i}^{20} f_{ij}} \]

where \( 1 \leq i \leq j \leq 20 \) and \( f_{ij} \) is the observed count of the pair. Inserting the numbers from our example (Figure 3) into the above equation, we get \( f_{DD} = 36 \), \( f_{DN} = 9 \), \( q_{DD} = 36/45 = 0.8 \), and \( q_{DN} = 9/45 = 0.2 \).

Subsequently, we estimate the probability of occurrence \( p_i \) of each amino acid. In our example, 36 sequence pairs have D in both positions and nine pairs have D in only a single position; thus, assuming that the observed frequencies are the same as in the population, the expected probability is \( p_D = \frac{36 + (9/2)}{45} = 0.9 \) and \( p_N = \frac{9/2}{45} = 0.1 \). The general formula for calculating the probability of occurrence \( p_i \) of the \( i \)th amino acid in an \( i, j \) pair is

\[ p_i = q_{ii} + \sum_{j \neq i} \frac{q_{ij}}{2} \]

The expected probability of occurrence \( e_{ij} \) of each amino acid pair is \( p_i p_j \) for \( i = j \) and \( p_i p_j + p_j p_i = 2 p_i p_j \) for \( i \neq j \). In our example this gives \( e_{DD} = 0.9 \times 0.9 = 0.81 \) and \( e_{DN+ND} = 2 \times (0.9 \times 0.1) = 0.18 \).

To get a handy score \( s_{ij} \), we first compute an odds-ratio table in which the entry for each amino acid pair is \( q_{ij}/e_{ij} \), and then take the base-two logarithm of each entry:

\[ s_{ij} = \log_{2}\left(\frac{q_{ij}}{e_{ij}}\right) \]

This scoring makes the alignment score \( s_{ij} \) zero if the observed frequencies are as expected, negative if the frequencies are less than expected, and positive if they are greater than expected.

We then multiply each score \( s_{ij} \) by two and round to the nearest integer to generate the final scores in the BLOSUM matrices (Figure 2).
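
A minimal Python sketch of the worked example above (the single column of nine Ds and one N; a real BLOSUM matrix is of course built from many blocks and columns):

    from math import log2

    # Pair counts from the example column of nine Ds and one N (Figure 3)
    f = {("D", "D"): 36, ("D", "N"): 9}
    total = sum(f.values())                                # 45 pairs

    q = {pair: n / total for pair, n in f.items()}         # observed pair probabilities q_ij
    p = {"D": q[("D", "D")] + q[("D", "N")] / 2,           # p_D = 0.8 + 0.2/2 = 0.9
         "N": q[("D", "N")] / 2}                           # p_N = 0.1

    for (i, j), q_ij in q.items():
        e_ij = p[i] * p[j] if i == j else 2 * p[i] * p[j]  # expected pair probability e_ij
        s_ij = log2(q_ij / e_ij)                           # log-odds score in bits
        print(i, j, round(2 * s_ij))                       # doubled and rounded, as in BLOSUM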

Why don't different identical amino acid pairings have the same score?

Looking at the BLOSUM62 scores, we can observe that the identity pairings of different amino acids do not all get the same score. The reason is that the observed abundances of the amino acids are not the same. For example, the Leucine-Leucine (Leu-Leu) pairing gets a score of 4 while the Tryptophan-Tryptophan (Trp-Trp) pairing gets a score of 11, because Leucine is observed to be more abundant in nature than Tryptophan; thus, a Trp-Trp pairing is less likely to be a random one.

Hypothesis testing

The above method of scoring is, in fact, hypothesis testing, and in general the score \( S(a,b) \) for a substitution of amino acid \( a \) with amino acid \( b \) is

\[ S(a,b) = \log_{2}\left(\frac{P_{ab}}{f_a f_b}\right) \]

In the above equation, \( P_{ab} \) is the likelihood of the hypothesis we want to test (the residues are correlated because they are homologous), and \( f_a f_b \) is the likelihood of the null hypothesis (the residues are unrelated).


Supplementary Notes

A program for taking a (possibly arbitrary) alignment score matrix and back-calculating the implied target frequencies \( p_{ab} \). (DOC 81 kb)

Doing this requires solving for a nonzero \( \lambda \) in \( \sum_{ab} f_a f_b e^{\lambda s_{ab}} = 1 \), and this is a good excuse to demo two methods of root-finding: bisection search and the Newton/Raphson method.

The program is ANSI C and should compile on any machine with a C compiler: % cc -o lambda lambda.c -lm. Any questions about this program should be addressed directly to the author.
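
A minimal Python sketch of the same root-finding step (this is not the author's ANSI C program; the score matrix, background frequencies, and bracketing interval are assumed inputs):

    import numpy as np

    def solve_lambda(scores, freqs, lo=1e-6, hi=2.0, tol=1e-9):
        """Bisection search for the nonzero lambda satisfying
        sum_ab f_a f_b * exp(lambda * s_ab) = 1."""
        def g(lam):
            return float(np.sum(np.outer(freqs, freqs) * np.exp(lam * scores))) - 1.0
        # For a valid scoring matrix (negative expected score), g(lo) < 0 < g(hi).
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            lo, hi = (lo, mid) if g(mid) > 0 else (mid, hi)
        return 0.5 * (lo + hi)

    # The implied target frequencies then follow as p_ab = f_a f_b * exp(lambda * s_ab).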


Triggering the Generation of Gapped Alignments

Figure 1 shows that even when using the original one-hit method with threshold parameter T = 13, there is generally no greater than a 4% chance of missing an HSP with score >38 bits. While this would appear sufficient for most purposes, the one-hit default T parameter has typically been set as low as 11, yielding an execution time nearly three times that for T = 13. Why pay this price for what appears at best marginal gains in sensitivity? The reason is that the original BLAST program treats gapped alignments implicitly by locating, in many cases, several distinct HSPs involving the same database sequence, and calculating a statistical assessment of the combined result ( 21, 22). This means that two or more HSPs with scores well below 38 bits can, in combination, rise to statistical significance. If any one of these HSPs is missed, so may be the combined result.

A gapped extension generated by BLAST for the comparison of broad bean leghemoglobin I ( 87) and horse β-globin ( 88). (a) The region of the path graph explored when seeded by the alignment of alanine residues at respective positions 60 and 62. This seed derives from the HSP generated by the leftward of the two ungapped extensions illustrated in Figure 2. The Xg dropoff parameter is the nominal score 40, used in conjunction with BLOSUM-62 substitution scores and a cost of 10 + k for gaps of length k. (b) The path corresponding to the optimal local alignment generated, superimposed on the hits described in Figure 2. The original BLAST program, using the one-hit heuristic with T = 11, is able to locate three of the five HSPs included in this alignment, but only the first and last achieve a score sufficient to be reported. (c) The optimal local alignment, with nominal score 75 and normalized score 32.4 bits. In the context of a search of SWISS-PROT ( 26), release 34 (21 219 450 residues), using the leghemoglobin sequence (143 residues) as query, the E-value is 0.54 if no edge-effect correction ( 22) is invoked. The original BLAST program locates the first and last ungapped segments of this alignment. Using sum-statistics with no edge-effect correction, this combined result has an E-value of 31 ( 21, 22). On the central lines of the alignment, identities are echoed and substitutions to which the BLOSUM-62 matrix ( 18) gives a positive score are indicated by a ‘+’ symbol.


The approach taken here allows BLAST to simultaneously produce gapped alignments and run significantly faster than previously. The central idea is to trigger a gapped extension for any HSP that exceeds a moderate score Sg, chosen so that no more than about one extension is invoked per 50 database sequences. (By equation 2, for a typical-length protein query, Sg should be set at ∼22 bits.) A gapped extension takes much longer to execute than an ungapped extension, but by performing very few of them the fraction of the total running time they consume can be kept relatively low.

By seeking a single gapped alignment, rather than a collection of ungapped ones, only one of the constituent HSPs need be located for the combined result to be generated successfully. This means that we may tolerate a much higher chance of missing any single moderately scoring HSP. For example, consider a result involving two HSPs, each with the same probability P of being missed at the hit-stage of the BLAST algorithm, and suppose that we desire to find the combined result with probability at least 0.95. The original algorithm, needing to find both HSPs, requires 2P − P² ≤ 0.05, or P less than ∼0.025. In contrast, the new algorithm requires only that P² ≤ 0.05, and thus can tolerate P as high as 0.22. This permits the T parameter for the hit-stage of the algorithm to be raised substantially while retaining comparable sensitivity—from T = 11 to T = 13 for the one-hit heuristic. (The two-hit heuristic described above lowers T back to 11.) As will be discussed below, the resulting increase in speed more than compensates for the extra time required for the rare gapped extension.
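
A quick numerical check of this argument (just the arithmetic, not part of BLAST):

    # P is the probability of missing each of two constituent HSPs at the hit stage.
    # Original algorithm must find both HSPs: misses the combined result with probability 2P - P**2.
    # New algorithm needs only one of them:  misses it with probability P**2.
    for P in (0.025, 0.22):
        print(P, "old:", round(2 * P - P**2, 3), "new:", round(P**2, 3))
    # P = 0.025 keeps 2P - P^2 near 0.05, while P = 0.22 already keeps P^2 near 0.05.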

In summary, the new gapped BLAST algorithm requires two non-overlapping hits of score at least T, within a distance A of one another, to invoke an ungapped extension of the second hit. If the HSP generated has normalized score at least Sg bits, then a gapped extension is triggered. The resulting gapped alignment is reported only if it has an E-value low enough to be of interest. For example, in the pairwise comparison of Figure 2, the ungapped extension invoked by the hit pair on the left produces an HSP with score 23.6 bits (calculated using λu and Ku). This is sufficient to trigger a gapped extension, which generates an alignment with score 32.4 bits (calculated using λg and Kg) and E-value of 0.5 ( Fig. 3). The original BLAST program locates only the first and last ungapped segments of this alignment ( Fig. 3c), and assigns them a combined E-value >50 times greater.


Pairwise sequence alignment

How similar are two sequences? This simple question drives much of bioinformatics, from the assembly of overlapping sequence fragments into contigs and the alignment of new sequences against reference genomes to BLAST searches of sequence databases, molecular phylogeny, and homology modeling of protein structures.

Answering this question requires finding the optimal alignment between two different sequences, scoring their similarity based on the optimal alignment, and then assessing the significance of this score. The optimal alignment, of course, depends on the scoring scheme.

Let’s consider 3 methods for pairwise sequence alignment: 1) dot plot, 2) global alignment, and 3) local alignment.

Dot Plot

The simplest method is the dot plot. One sequence is written out horizontally and the other vertically, along the top and side of an m x n grid, where m and n are the lengths of the two sequences. A dot is placed in a cell of the grid wherever the two sequences match. A diagonal line in the grid visually shows where the two sequences share sequence identity. Nucleic acid dot plot comparisons will show a very high level of background (a 25% chance of a random match), so the parameters must be modified to place a dot only if there is a nearly perfect match along a sliding “window” of 10 or more consecutive nucleotides (see tips below).
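
As an illustration of the windowed approach (a toy sketch, not one of the EMBOSS programs linked below), a dot is recorded only when a window of residues matches above a threshold:

    def dotplot(seq1, seq2, window=10, threshold=9):
        """Return (i, j) positions where the windows starting at i in seq1 and
        j in seq2 share at least `threshold` identical residues out of `window`."""
        dots = []
        for i in range(len(seq1) - window + 1):
            for j in range(len(seq2) - window + 1):
                matches = sum(a == b for a, b in zip(seq1[i:i + window], seq2[j:j + window]))
                if matches >= threshold:
                    dots.append((i, j))
        return dots

    # A perfect diagonal appears when a sequence is compared with itself.
    print(dotplot("ACGTACGTACGTACGT", "ACGTACGTACGTACGT")[:3])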

Web-based dot plot implementations can be found here:

http://emboss.bioinformatics.nl/cgi-bin/emboss/dotmatcher – for both nucleic acid and protein sequences, with standard EMBOSS scoring matrices

Stand-alone dot plot programs operable via either a GUI or command-line can be found in EMBOSS (JEMBOSS is the Java GUI)

Tips for DNA (nucleic acid sequence) dot plots:

  • Use a nucleic acid scoring matrix: ednafull in EMBOSS
  • Because there are only 4 nucleotides, increase window size and threshold score until the background disappears and you are left with clear signal.
  • Using too large a window, such as 100, with a low threshold will cause the diagonals to overlap and lose the resolution needed to see small repeats or inversions. Use a smaller window (less than 30) and raise the threshold score to favor almost exact matches.

Q: What will a dot plot show if there are

  1. insertions and deletions?
  2. An inversion?
  3. A sequence motif that is repeated?
  4. A homopolymeric stretch?
  5. A dot plot comparing two nucleotide sequences will have lots of background noise – how can this background noise be reduced or suppressed?

Global Alignment: Needleman-Wunsch

The algorithm published by Needleman and Wunsch in 1970 for alignment of two protein sequences was the first application of dynamic programming to biological sequence analysis. The Needleman-Wunsch algorithm finds the best-scoring global alignment between two sequences. A blog post by Chetan has a very clear explanation of how this works. Global alignments are most useful when the two sequences being compared are of similar lengths, and not too divergent.
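
A compact Python sketch of the Needleman-Wunsch recurrence, using a simple +1/-1 match/mismatch score and a linear gap penalty (illustrative parameters, not those of any particular tool), is shown below; it returns only the optimal score and omits the traceback:

    def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
        """Best global alignment score via dynamic programming (no traceback)."""
        n, m = len(a), len(b)
        F = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            F[i][0] = i * gap                       # leading gaps in b
        for j in range(1, m + 1):
            F[0][j] = j * gap                       # leading gaps in a
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                F[i][j] = max(F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                              F[i - 1][j] + gap,    # gap in b
                              F[i][j - 1] + gap)    # gap in a
        return F[n][m]

    print(needleman_wunsch("GATTACA", "GCATGCU"))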

Local Alignment: Smith-Waterman

Real life is often complicated, and we observe that genes, and the proteins they encode, have undergone exon shuffling, recombination, insertions, deletions, and even fusions. Many proteins exhibit modular architecture. In searching databases for similar sequences, it is useful to find sequences that have similar domains or functional motifs. Smith & Waterman (1981) published an application of dynamic programming to find optimal local alignments. The algorithm is similar to Needleman-Wunsch, but negative cell values are reset to zero, and the traceback procedure starts from the highest-scoring cell, anywhere in the matrix, and ends when the path encounters a cell with a value of zero.
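
The modification relative to the global recurrence above is small; a sketch with the same illustrative scoring:

    def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
        """Best local alignment score: cells are floored at zero and the optimum is
        the maximum anywhere in the matrix (a traceback would stop at a zero cell)."""
        n, m = len(a), len(b)
        F = [[0] * (m + 1) for _ in range(n + 1)]
        best = 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                F[i][j] = max(0,                      # negative values reset to zero
                              F[i - 1][j - 1] + s,
                              F[i - 1][j] + gap,
                              F[i][j - 1] + gap)
                best = max(best, F[i][j])
        return best

    print(smith_waterman("ACACACTA", "AGCACACA"))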

Scoring Matrices

The Needleman-Wunsch and Smith-Waterman algorithms require a scoring matrix. The scoring matrix assigns a positive score for a match, and a penalty for a mismatch. For nucleotide sequence alignments, the simplest scoring matrix awards +1 for a match, and -1 for a mismatch. The blastn algorithm at NCBI scores +5 for a match and -4 for a mismatch. These scoring matrices treat all mutations (mismatches) equally. In reality, transitions (pyrimidine -> pyrimidine and purine -> purine) occur much more frequently than transversions (pyrimidine -> purine and vice versa). For aligning non-protein coding DNA sequences, a transition/transversion scoring matrix may be more appropriate. For aligning DNA sequences that encode proteins, alignment of the protein amino acid sequences will almost always be more reliable.

Transitions and transversions, from Wikipedia
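
For example, a transition/transversion-aware nucleotide scoring scheme might look like the following sketch (the specific values are illustrative assumptions, not a published matrix):

    PURINES, PYRIMIDINES = {"A", "G"}, {"C", "T"}

    def dna_score(x, y, match=2, transition=-1, transversion=-3):
        """Score a pair of aligned nucleotides, penalizing transversions more
        heavily than transitions."""
        if x == y:
            return match
        same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
        return transition if same_class else transversion

    print(dna_score("A", "G"), dna_score("A", "C"))   # transition vs transversion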

For protein sequence alignments, the scoring matrices are more complicated. The goal is to reflect evolutionary processes. Some amino acid sequence changes can arise from a single nucleotide change, whereas other amino acid changes require two nucleotide changes. Some amino acid changes are less likely to affect protein structure or function than other amino acid changes. So how can we estimate the relative likelihood of specific amino acid changes?

Dayhoff used alignments of highly conserved proteins to assess which amino acid changes were likely to be accepted – Point Accepted Mutations (PAM). From these data she devised a 20 x 20 amino acid substitution matrix for PAM-1, a unit of evolutionary change corresponding to 1 accepted mutation per 100 amino acids. From there she calculated other matrices such as PAM-2, PAM-30, or PAM-250, where the PAM-n matrix is derived by multiplying the PAM-1 matrix by itself n times. The substitution matrices are converted to scoring matrices by converting the substitution probabilities to log-odds ratios for each cell.
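
A sketch of the PAM-n construction from a PAM-1 mutation probability matrix, with the log-odds conversion (the 3 x 3 matrix and background frequencies here are made-up stand-ins for the real 20 x 20 PAM-1 data):

    import numpy as np

    # Made-up 3-state stand-in for the 20x20 PAM-1 matrix: rows are the original
    # residue, columns the replacement; each row sums to 1.
    pam1 = np.array([[0.990, 0.007, 0.003],
                     [0.005, 0.990, 0.005],
                     [0.002, 0.008, 0.990]])
    background = np.array([0.40, 0.35, 0.25])        # assumed residue frequencies

    pam250 = np.linalg.matrix_power(pam1, 250)       # PAM-n = PAM-1 multiplied by itself n times

    # Scoring matrix: log-odds of the substitution probability versus background frequency
    scores = np.log10(pam250 / background[None, :])
    print(np.round(10 * scores).astype(int))         # scaled and rounded, Dayhoff-style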

The BLOSUM matrices (BLOcks SUbstitution Matrix) derive their amino acid substitution frequencies from the Blocks database of ungapped local multiple sequence alignments. BLOSUM62 is calculated from sequences with 62% identity or less; BLOSUM80 from sequences with 80% or less.

The Wikipedia article on substitution matrices gives a reasonably concise and accurate description of the PAM and BLOSUM matrices. http://en.wikipedia.org/wiki/Substitution_matrix

Gap penalty

Sequence alignments usually require insertion of gaps, reflecting insertion or deletion mutations. If a nucleotide or amino acid in one sequence is aligned to a gap in the target sequence, then this should be penalized as a mismatch. However, gaps at the ends of sequences should perhaps not incur any penalty. Moreover, a single insertion or deletion mutation could result in a contiguous gap of multiple residues. Therefore, a single gap that is 3 residues long should incur less penalty than 3 different gaps, of one residue each. An affine gap penalty scheme heavily penalizes opening a gap, but extending a pre-existing gap incurs a much lower penalty per additional residue.
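
A small illustration of the affine scheme (the gap-open and gap-extend values are illustrative; compare the BLAST example above, which uses a cost of 10 + k for a gap of length k):

    def affine_gap_cost(length, gap_open=10, gap_extend=1):
        """Cost of a single contiguous gap of `length` residues."""
        return gap_open + gap_extend * length if length > 0 else 0

    # One 3-residue gap is much cheaper than three separate 1-residue gaps.
    print(affine_gap_cost(3))        # 10 + 3 = 13
    print(3 * affine_gap_cost(1))    # 3 * 11 = 33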

Assessing the significance of an alignment

The Needleman-Wunsch and Smith-Waterman algorithms will always find the best alignment between two sequences, whether or not they are evolutionarily related.

Q: So how can we assess whether a given alignment between two sequences is significant, or indicative of homology (common ancestry)?

We need a way to estimate the statistical significance of a given alignment score. How likely is it that two random sequences of similar length and composition will align with a score equal to or better than our target alignment?

For global alignments, there is no adequate theory to predict the distribution of alignment scores from randomly generated sequences. One can simply generate scores from alignments of sequences that have been randomly shuffled many times. If 100 such shuffles all produce alignment scores that are lower than the observed alignment score, then one can say that the p-value is likely to be less than 0.01.

For local alignments, probability theory predicts that randomly shuffled sequences will produce alignment scores with an extreme value (type I maximum) distribution.


Materials and Methods

Reagents and Tools table

Reagent/Resource Reference or Source Identifier or Catalog Number
Software
python v3.7 https://www.python.org/
scanpy v1.4 https://pypi.org/project/scanpy/
tensorflow v2.0.1 https://pypi.org/project/tensorflow/

Methods and Protocols

General note on data sets

In this study, we worked on data sets from the public databases IEDB (Vita et al, 2019 ) and VDJdb (Shugay et al, 2018 ) and on a public data set from a single-cell pMHC-based T-cell specificity experiment (10x Genomics, 2019 ). IEDB and VDJdb contain pairs of binding T-cell receptors (TCRs) and antigens. In the single-cell experiment, cells were first treated with barcoded pMHCs and were then physically separated into droplets in a microfluidics setup. pMHCs captured in these droplets and the T-cell receptor sequences associated with the captured cells are barcoded with a droplet-specific sequence so that both can be mapped to a single observation after sequencing (10x Genomics, 2019 ). Accordingly, one can obtain not only a list of bound TCRs and antigens but also pMHC counts for each TCR. These counts can be discretized into binding events and “spurious” binding or can be directly modeled as proposed in the main text. Importantly, one can easily establish the identity of multiple binding antigens to a single TCR sequence based on such pMHC counts. Two of the four donors (donors 1 and 2) were HLA-A*02:01 (10x Genomics, 2019 ), which was also the HLA type selected for in the IEDB and VDJdb samples. A detailed description of the HLA types and pMHC types used in this study is provided elsewhere (10x Genomics, 2019 ).

Statistics

We present P-values for selected model performance comparisons. These P-values were computed on the comparison of two sets of performance metrics. We used Welch's t-test if we compared two sets of performance metrics from two separate cross-validation sets, which is equivalent to the case of both sets sharing all model hyper-parameters other than cross-validation partition. We used the Wilcoxon test if we compared metrics across sets of models that vary in hyper-parameters, as one would no longer expect a unimodal performance metric distribution in these cases.
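
A sketch of how such a comparison could be set up with SciPy (the metric values are made up, SciPy is an added dependency of this sketch, and whether the rank-sum or signed-rank Wilcoxon variant was used is an assumption here):

    from scipy import stats

    # Hypothetical AUC ROC values from two model settings (illustrative numbers only).
    metrics_a = [0.81, 0.79, 0.83]   # e.g., three cross-validation folds, setting A
    metrics_b = [0.76, 0.78, 0.74]   # e.g., three cross-validation folds, setting B

    # Same hyper-parameters, different cross-validation partitions: Welch's t-test.
    t_stat, p_welch = stats.ttest_ind(metrics_a, metrics_b, equal_var=False)

    # Models varying in hyper-parameters (no unimodality assumption): a rank-based test.
    w_stat, p_rank = stats.ranksums(metrics_a, metrics_b)
    print(p_welch, p_rank)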

Feed-forward network architectures

Here, we describe proposed architectures of the models that predict antigen specificity of a T-cell receptor (TCR) based on the CDR3 loop of both ɑ- and β-chains and on cell-specific covariates. Note that specificity-determining influences of CDR1 and CDR2 loops (Cole et al, 2009; Madura et al, 2013; Stadinski et al, 2014) and distal regions (Harris et al, 2016a, b) have also been demonstrated, but were not measured in the single-cell pMHC assay. All networks presented contain an initial amino acid embedding, a sequence data embedding block, and a final densely connected layer block.

Amino acid embedding

The choice of initial amino acid embedding may impact data and parameter efficiency of the model and therefore may impact the predictive power of models trained on data sets that are currently available. We used one-hot encoded amino acid embeddings, evolutionary substitution-inspired embeddings (BLOSUM), and learned embeddings. The learned embeddings were a 1 × 1 convolution on top of a BLOSUM encoding and were prepended to the sequence model layer stack. Here, channels are the initial amino acid embeddings (we chose BLOSUM50) and filters are the learned amino acid embedding. This learned embedding can reduce the parameter size of the sequence model layer stack. All fits presented in the manuscript other than in Appendix Fig S1 are based on such a learned embedding with five filters. We anticipate that sequence-based embeddings will gain relevance in the context of extrapolation across antigens in the future. Here, parameter efficiency in the sequence models will play an important role and the 1 × 1 convolution presented here is an intuitive first step in this direction.
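
A minimal TensorFlow sketch of such a learned embedding (the BLOSUM50 table is replaced by random values as a stand-in, and the shapes are illustrative):

    import numpy as np
    import tensorflow as tf

    N_AA, SEQ_LEN, N_FILTERS = 20, 40, 5

    # Stand-in for the BLOSUM50 embedding table: one row of substitution scores per
    # amino acid token (random values here; the real rows come from BLOSUM50).
    blosum = tf.constant(np.random.randn(N_AA, 20).astype("float32"))

    tokens = tf.random.uniform((8, SEQ_LEN), maxval=N_AA, dtype=tf.int32)       # toy batch
    x = tf.nn.embedding_lookup(blosum, tokens)            # (batch, positions, 20) BLOSUM channels
    learned = tf.keras.layers.Conv1D(N_FILTERS, kernel_size=1)(x)               # 1x1 convolution
    print(learned.shape)                                  # (8, 40, 5) learned embedding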

Sequence data embedding

We screened multiple layer types in the sequence data embedding block: recurrent layers (bidirectional GRU and LSTM), self-attention, convolutional layers (simple convolutions and inception-like), and densely connected layers as a reference. Recurrent layer types and self-attention layers were previously useful for modeling language (Vaswani et al, 2017) and epitope (Wu et al, 2019) data. Convolutional layer types have been useful for modeling epitope (Han & Kim, 2017; Vang & Xie, 2017) and image (Szegedy et al, 2015) data. The sequence model layers retain positional information in subsequent layers and can thereby build an increasingly abstract representation of the sequence. To achieve this on recurrent networks, we chose the output of a layer to be a position-wise network state, which results in an output tensor of size (batch, positions × 2, output dimension) for a bidirectional network. This position-wise encoding occurs naturally in self-attention and convolutional networks. We did not use feature transforms with positional signals (Vaswani et al, 2017) on the self-attention networks, so that the network has no knowledge of the original sequence structure but can still retain inferred structure in subsequent layers. We presented models fit on the CDR3 loop of both ɑ- and β-chains of the TCR (Fig 1B) and models fit on the CDR3 loop of the β-chain and the antigen sequence (Fig 3B). In both cases, we needed to integrate two sequences. To this end, we either used separate sequence-embedding layer stacks for each sequence (all models presented in Fig 1 and models indicated as “separate” in Fig 3) or appended the two padded sequences and used a single sequence-embedding layer stack (models indicated as “concatenated” in Fig 3). In the last sequence-embedding layer of recurrent networks, we reduced the positional encoding to a latent space of fixed dimensionality by using the state emitted by the model on the last element of the sequence in each direction. This last layer allows usage of the same final dense layers independent of input sequence length. Convolutional and self-attention networks were not built to be independent of sequence length. We did, however, pad the input sequences to mitigate this problem on the data handled in this paper. We used a residual connection across all sequence-embedding layers. Further layer-specific hyper-parameters can be extracted from the code supplied with this manuscript (Datasets EV1 and EV2).

Final densely connected layers

We fed the activation generated in the sequence-embedding block into a dense network that can integrate the sequence information with continuous or categorical donor- and cell-specific covariates. We modeled the binding event as a probability distribution over two states (bound and unbound) and computed the deviation of the model prediction from observed binding events via a cross-entropy loss. Firstly, one can use such models to predict binding events on a single antigen, represented as a single output node with a sigmoid activation function. Secondly, one can model a unique binding event among a panel of antigens with a vector of output nodes (one for each antigen and one node for non-binding), which are transformed with a softmax activation function.

Covariate processing

We set up a design matrix inspired by linear modeling to use as a covariate matrix. We modeled the donor as a categorical covariate, resulting in a one-hot encoding of the donor. We modeled total counts, negative-control pMHC counts, and surface protein counts as continuous covariates. We log(x + 1)-transformed negative-control pMHC counts and surface protein counts to increase the stability of training. We modeled total counts as the total count of mRNAs per cell divided by the mean total count.
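
A sketch of such a covariate matrix (the column names and values are illustrative, and pandas is an added dependency of this sketch):

    import numpy as np
    import pandas as pd

    # Hypothetical per-cell covariates.
    cells = pd.DataFrame({
        "donor": ["donor1", "donor2", "donor1"],
        "total_counts": [4000.0, 5200.0, 3100.0],
        "neg_ctrl_pmhc": [3.0, 0.0, 7.0],
        "surface_protein": [120.0, 88.0, 240.0],
    })

    design = pd.concat([
        pd.get_dummies(cells["donor"]),                         # one-hot donor covariate
        np.log1p(cells[["neg_ctrl_pmhc", "surface_protein"]]),  # log(x + 1) transform
        cells["total_counts"] / cells["total_counts"].mean(),   # scaled total counts
    ], axis=1)
    print(design)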

Training, validation, and test splits

We used training data to compute parameter updates, validation data to control overfitting, and test data to compare models across hyper-parameters. Model training was terminated once a maximum number of epochs was reached or if the validation loss was no longer decreasing. In the latter case, the model with the lowest validation loss in a sliding window of n epochs up to the last epoch was chosen; n is given in the grid search scripts (Dataset EV3). The model metrics presented in this manuscript are metrics evaluated on the test data for models selected on cross-entropy (categorical binding prediction) or mean-squared log error (dextramer count prediction) of the validation data. We provide training curves for all models that contributed to panels in this manuscript in Dataset EV3.

Optimization

We used the ADAM optimizer throughout the manuscript for all models. We used learning rate schedules that reduce the learning rate at the time of training once plateaus in the validation metric are reached. The initial learning rate and all remaining hyper-parameters (batch size, number of epochs, patience, steps per epoch) were varied as indicated in the grid search hyper-parameter list.

Model fitting objectives

We chose cross-entropy loss on sigmoid- or softmax-transformed output activation values to train models that predict binarized binding events and mean-squared logarithmic error (msle) on exponentiated output activation values for models that predict continuous (count) binding affinities.

Performance metrics

We used AUC ROC, F1 scores, false-negative rates, and false-positive rates in the study to evaluate models that predict binding probabilities. AUC ROC is useful if the observations cover the full range of classification thresholds, because it provides a measure that summarizes all scalar classification thresholds. F1 scores can always be used to evaluate a classifier but rely on a strict threshold. We used AUC ROC where possible but complemented it with F1 scores if the AUC ROC score may suffer from a disjointed support of the test data set on the classification threshold. False-negative and false-positive rates are used in Appendix Fig S4 to emphasize how models trained on single-cell data generalize to data from IEDB and VDJdb in both the negative and the positive classes separately. We used the R² to evaluate the performance of models that predicted pMHC counts (positive integer space).

Single-cell immune repertoire (CD8 + T cell) data processing

Primary data processing

We downloaded the full data of all four donors from another study (10x Genomics, 2019 ). All data processing for each model fit is documented in the package code (Dataset EV1) and grid search scripts (Dataset EV2). The number of T-cell clonotypes per antigen varied drastically between the order of 10^0 and 10^4 (Appendix Fig S3A and B). Subsequently, we selected the eight most common antigens (ELAGIGILTV, GILGFVFTL, GLCTLVAML, KLGGALQAK, RLRAEAQVK, IVTDFSVIK, AVFDRKSDAK, RAKFKQLL) for categorical panel model fits to avoid issues with class imbalances. We used the binarized binding event prediction by the authors of the data set (10x Genomics, 2019; labeled “*_binder” in the files “*_binarized_matrix.csv”) as a label for prediction. For the continuous case, in which we predicted pMHC counts, we chose the corresponding count data columns in the same file. Next, we performed multiple layers of observation filtering: (i) doublet removal, (ii) clonotype down-sampling, and (iii) class down-sampling. It was previously shown that doublets, namely, droplets containing two cells targeted with the same barcode, which cannot be distinguished in downstream analysis steps, tend to be enriched in subsets of transcriptome-derived clusters (Wolock et al, 2019 ). We propose using the number of reconstructed TCR chain alleles to identify potential doublets and demonstrate that the so characterized doublets are indeed enriched in a particular cluster in each donor (Appendix Fig S2A–D). There are cells that have two active alleles for either TCR chain, but these cannot be easily separated from doublets that arise in the cell separation process. To avoid bias of the presented results by potential cellular doublets, we chose to exclude all cells showing more than one allele for either the ɑ- or the β-chain. We further investigated the overall contribution of potentially ambient molecules that give rise to all observed T cells and found that high-frequency chains do not dominate the overall signal (Appendix Fig S2E and F). This analysis presents an upper bound to the impact of ambient molecules on this experiment as evolutionary effects probably also contribute to over-representation of particular chain sequences. Subsequently, we removed all cellular barcodes that contain more than one ɑ- or β-chain as mature CD8+ T cells are expected to only have a single functional ɑ- and β-chain allele. Next, we down-sampled each clonotype to a maximum of 10 observations to avoid biasing the training or test data to large clones. Here, we used clonotypes as defined by the authors of the data set in the files “*_clonotypes.csv” (10x Genomics, 2019 ). Lastly, we down-sampled the larger class to a maximum of twice the size of the smaller class when predicting a binary binding event for a single antigen. We did not perform this last step on multiclass and count prediction scenarios. We padded each CDR3 sequence to a length of 40 amino acids and concatenated these padded chain observations to a sequence of length 80 for models that were trained on both chains. We performed leave-one-donor-out cross-validation on models that did not take the donor identity as a covariate. We sampled 25% of the full data clonotypes and assigned all of the corresponding cells to the test set for all models that did use the donor covariate. The latter case yielded 68,716 clonotypes and 91,495 cells across all four donors.
All cross-validations shown across different models are based on threefold cross-validation with seeded test–train splits resulting in the same split across all hyper-parameters. We present an analysis of the clonotype diversity encountered in this data set in Appendix Fig S6.

Binarization of single-cell pMHC counts into bound and unbound states

We used the binarization described in the original publication (10x Genomics, 2019 ) for the raw counts to obtain binary outcome labels: A total pMHC UMI count larger than 10 and at least five times as high as the highest observed UMI count across all negative-control pMHCs was required for a binding event. If more than one pMHC passed these criteria, the pMHC with the largest UMI count was chosen as the single binder.
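
A sketch of this rule applied to one cell's pMHC UMI counts (the function and variable names, and the example counts, are illustrative):

    def call_binder(pmhc_counts, negative_control_counts):
        """Return the single binding pMHC for a cell, or None.
        Rule: UMI count > 10 and at least 5x the highest negative-control pMHC
        count; if several pMHCs pass, keep the one with the largest count."""
        floor = max(negative_control_counts) if negative_control_counts else 0
        passing = {p: c for p, c in pmhc_counts.items() if c > 10 and c >= 5 * floor}
        return max(passing, key=passing.get) if passing else None

    cell = {"GILGFVFTL": 86, "ELAGIGILTV": 12, "KLGGALQAK": 0}   # example UMI counts per pMHC
    print(call_binder(cell, negative_control_counts=[2, 1, 0]))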

Test set assembly for models fit on IEDB data

This section describes how the test described in Fig 3E and Appendix Fig S5C was prepared. The cells were filtered as described above. We then extracted one binding TCR-antigen pair per cell from this list. We used the remaining TCR-antigen pairs as validated negative examples and down-sampled these to the number of positive observations to maintain class balance. All cross-validations shown across different models are based on threefold cross-validation with seeded test–train splits resulting in the same split across all hyper-parameters.

IEDB data processing

Primary processing

We downloaded the data from the IEDB website (Vita et al, 2019 ) with the following filters: linear epitope, MHC restriction to HLA-A*02:01, and organism as human and only human. This yielded a list of matched TCRs (mostly β-chain CDR3s) with bound antigens. We assigned TCR sequences to a single clonotype if they were perfectly matched and down-sampled all clonotypes to a single observation. We extracted only the β-chain and padded the CDR3 sequences to a length of 40 amino acids. We padded the antigen sequences to a length of 25 amino acids. We sampled 10% of all observations as a test set. We generated negative samples for both training and test sets separately by generating unobserved pairs of TCRs and antigens. Here, we assumed that each TCR binds a unique antigen out of the set of all antigens present in the database, so that any other pairing would not result in a binding event. This procedure yielded 9,697 observations for both the positive and the negative sets before the train–test split, from 71 antigens.

Test set assembly for models fit on IEDB data

This section describes how the test depicted in Appendix Fig S5A was prepared. To explore the ability of antigen-embedding TcellMatch models to generalize to unseen antigens, we fit such a model on the subset of high-frequency antigens of IEDB with at least five unique TCR sequences and tested the models on the remaining antigens. All cross-validations shown across different models are based on threefold cross-validation with seeded test–train splits resulting in the same split across all hyper-parameters.

VDJdb data processing

Primary processing

We provided an exploratory analysis of this data set in Appendix Fig S3 (“exploration_vdjdb_data.*”). We downloaded the data from the VDJdb (Shugay et al, 2018 ) website with the following filters: Species: human; Gene (chain): TRB; MHC First chain allele(s): HLA-A*02:01. This yielded 3,964 records from 40 antigens. We assigned TCR sequences to a single clonotype if they were perfectly matched and down-sampled all clonotypes to a single observation. We extracted only the β-chain and padded the CDR3 sequences to a length of 40 amino acids. We padded the antigen sequences to a length of 25 amino acids.

Test set assembly from VDJdb for models fit on IEDB data

This section describes how the test depicted in Fig 3D and Appendix Fig S5B was prepared. We sub-selected observations with matching or non-matching antigens with respect to the training set depending on the application (described in the figure caption or main text). All cross-validations shown across different models are based on threefold cross-validation with seeded test–train splits resulting in the same split across all hyper-parameters.


Acknowledgements

The authors are thankful to Martin Hess for helpful discussions regarding CoverageCalculator tool. R.T. also gratefully acknowledges several stimulating discussions with his colleagues Mr. VA Ramesh, Mr. S Suryanarayana and Mr. Rohan Mishra during the course of this study. This work was supported by a grant to H.A.N (University Grants Commission -University with Potential for Excellence - II grant) and also by the core grant of Centre for DNA Fingerprinting and Diagnostics (CDFD). R.T. is a recipient of University Grants Commission (UGC) Junior and Senior Research Fellowships. We also thank the Department of Biotechnology, Government of India, sponsored Bioinformatics Infrastructure Facility (BIF) of School of Life Sciences, University of Hyderabad. Last but not the least, we gratefully acknowledge the INNO Indigo project grant to H.A.N from Department of Science and Technology (DST), Government of India, for its financial help toward article processing charges (APC).



Abstract Syntax Notation 1 (ASN.1)

ASN.1 is a standard data description language that is used for encoding structured data. ASN.1 allows both the content and the structure of the data to be read by and exchanged between a variety of computer programs and platforms. ASN.1 is the language used to store and manipulate data at the NCBI. All NCBI software reads and writes ASN.1.

The accession number is the most general identifier used in the NCBI sequence databases. This is the identifier that should be used when citing a database record in a publication. The accession number points to a sequence record and does not change when the sequence is modified. In the Entrez system, using the accession number as a query will retrieve the most recent version of the record. The update history of a particular sequence record is tracked by the accession.version number. Changes in version numbers occur only when the actual sequence of a record has been modified and do not reflect any changes in the annotation. The specific version of a record is also tracked by another identifier that is mainly for internal NCBI use called the GI number.

An algorithm is a formal stepwise path to solving a problem, for example the problem of finding high-scoring local alignments between two sequences. Algorithms are the basis of computer programs.

The alignment score is a number assigned to a pairwise or multiple alignment of sequences that provides a numerical value reflecting the quality of the alignment. Alignment scores are usually calculated by referring to some sort of substitution table or alignment scoring matrix and summing the values for each pair or column in the alignment. (See also raw score and bit score). With certain scoring matrices, high scores of local ungapped alignments between two random sequences have the special property of following the extreme value distribution. This property allows a significance level to be assigned to local alignment scores obtained from database searches using such tools as BLAST and FASTA. (See also Expect value.)

A scoring matrix is a table of values used to assign a numerical score to a pair or column of aligned residues in a sequence alignment. The simplest kind, an identity matrix, assigns a high value for a match and some low, often negative value, for a mismatch. The identity matrix is used in the NCBI's nucleotide-nucleotide BLAST program. Protein alignment scoring matrices are usually more complicated and take into account the relative abundance of the amino acids in real proteins and the observation that some amino acids substitute for each other more readily in related proteins (e.g., Phe and Tyr) and others do not (e.g., Phe and Asp). One way of generating such a matrix is to examine alignments of real proteins that are known to be homologous (see Homolog) and tabulate the substitution frequencies of the various amino acid pairs at all positions. The resulting frequency table is then converted to a log-odds additive matrix by taking the log of the ratio of the observed substitution frequency for a particular pair and the background substitution frequency. The PAM and the BLOSUM series are examples of widely used protein-scoring matrices that are derived in this way. The matrices described above do not take into account differences in substitution frequencies at different positions in the alignments. More sensitive position-specific scoring matrices can also be generated. Scores of local alignments of random sequences derived from these log-odds matrices are described by the extreme value distribution. Thus, significance levels can be assigned to results of database searches with these matrices using tools such as BLAST and FASTA. (See also Expect value.)

Alus are the most common class of short, interspersed, repetitive element (SINE) in the human genome. Alus may account for more than 10% of the human genome. They appear to be derived from a signal recognition particle pseudogene. The name Alu derives from the fact that these elements usually contain an AluI restriction enzyme recognition site.

A sequence assembly is a large sequence or ordered set of sequences that may be derived from overlapping smaller sequences and sometimes anchored to a genome or chromosome scale map using information from STS content and other evidence.

B

Bacterial Artificial Chromosome (BAC)

A BAC is a large insert cloning vector capable of handling large segments of cloned DNA, typically around 150 kb. BACs can be propagated in laboratory strains of Escherichia coli. These vectors are used in the construction of genomic libraries for genome scale sequencing projects including human, mouse, Arabidopsis, and rice.

BankIt is a Web form for submitting sequences to GenBank.

Basic Local Alignment Search Tool (BLAST)

BLAST is the NCBI's sequence similarity search tool. It finds high-scoring local alignments between a query sequence and nucleotide and protein database sequences. Although BLAST is less sensitive than the complete Smith-Waterman algorithm, it provides a useful compromise between speed and sensitivity, especially for searching large databases. Because BLAST reports back local alignment scores, it provides statistics that may allow biologically interesting alignments to be distinguished from chance alignments.

The bit score represents the information content in a sequence alignment. It is expressed in base 2 log units. The bit score is in essence a normalized score adjusted by database and matrix scaling parameters. Hence, bit scores for different searches may be compared, and only the search space size is needed to calculate the significance (Expect value) of the score. The relationship between the Expect value (E) and the bit score (S') is E = mn2^(−S'), where m and n give the search space size.

The BLock SUbstitution Matrices are a set of protein log-odds alignment scoring matrices calculated from substitution frequencies obtained from ungapped multiple alignments of real proteins. Each BLOSUM matrix is identified with a number that indicates the percent identity cut-off for inclusion in that matrix. For example, BLOSUM62 includes substitution information for proteins up to 62% identical in the alignment, and BLOSUM90 up to 90% identical. Each BLOSUM matrix works best at finding proteins at a particular level of similarity. Hence, BLOSUM90 is better at finding more closely related proteins, whereas BLOSUM62 is best at finding more distantly related ones. Experiments have shown that BLOSUM62 also works well at finding similar proteins. For this reason, BLOSUM62 is the default protein scoring matrix for NCBI BLAST.

C

In the molecular sense, a clone is a physical copy of a piece of DNA. The term is most often used to refer to the recombinant cloning vector DNA containing this copy such as a plasmid, BAC, or bacteriophage DNA that can be propagated in a bacterial or other microbial host.

A cluster is a group of sequences associated with each other, usually by some procedure that relies on sequence similarity. Such clusters of sequences are used to produce the UniGene datasets and the clusters of orthologous groups (COGS) dataset.

A COG is a group of related proteins or groups of proteins (paralogs) from different genomes that are thought to derive from a common ancestral gene. COGs are formed based on sequence similarity using a BLAST-based approach. COGs originally were made for the complete microbial genomes, but the dataset is expanding to include more complex organisms. The COGs data are very useful for annotating genes on microbial genomes and can be used to provide potential functional classification for uncharacterized proteins. (See also paralog and ortholog.)

Cn3D (pronounced "see in three dee") is NCBI's structure viewer. It reads Entrez structure data and renders either single structures or structural alignments from the NCBI's molecular modeling database (MMDB). Cn3D functions as a helper application to the Web browser and will launch automatically when the browser downloads NCBI structure data. Cn3D can also function as a stand-alone viewer and can act as a network client to download structures from NCBI. It also has a built-in BLAST and threading capability and can create sequence alignments to fit similar sequences to known structures.

CDART provides a graphical browser that allows one to find proteins with a similar domain architecture (content and arrangement) beginning with the results of a CDD search.

Conserved Domain Database (CDD) Search

CDD Search uses reverse position-specific BLAST (RPS-BLAST) to identify conserved domains contained in a protein query. CDD databases are position-specific scoring matrices (PSSMs) created from multiple sequence alignments from three domain databases: SMART, PFAM, and LOAD.

Contig is short for contiguous sequence. Contigs are assembled overlapping primary sequences. The term contig arises in two different contexts in the NCBI databases. Draft sequences (HTG division) will contain two or more contigs assembled from sequencing reads made from plasmid libraries for that clone. The NCBI also produces contigs made by assembling overlapping GenBank records from large-scale genome projects, such as the human genome project. These contigs are included in the NCBI RefSeq databases and are assigned accession numbers beginning with the prefix NT_.

A curated database is a derivative database containing molecular records that are compiled and edited from primary molecular data by experts who maintain and are responsible for the content of the records. The Swiss-Prot database is an important example of curated protein sequence database. The NCBI produces a curated non-redundant RefSeq dataset of transcripts and proteins for important organisms.

D

In molecular biology, a derivative database contains information derived and compiled from primary molecular data but includes some type of additional information provided by expert curators or automated computational procedures.

A primary nucleotide sequence database that is maintained as part of the Center for Information Biology and DNA Data Bank of Japan (CIB/DDBJ) under the National Institute of Genetics (NIG) in Japan. DDBJ began accepting DNA sequence submissions in 1986 and is a part of the International Nucleotide Sequence Database Collaboration that also includes GenBank and the EMBL nucleotide sequence database.

A domain is a discrete structural unit of a protein. In principle, protein domains are capable of folding independently from the rest of the protein. Domains can often be identified by non-structural approaches based on conserved amino acid sequences. The NCBI's CDD-search uses information from curated multiple sequence alignments to identify domains in protein sequences.

Draft sequence is unfinished genomic or cDNA sequence. See HTG and HTC.

E

e-PCR is an analysis tool that tests a DNA sequence for the presence of sequence tagged sites (STSs). e-PCR looks for STSs in DNA sequences by searching for subsequences that closely match the PCR primers and have the correct order, orientation, and spacing such that they could plausibly prime the amplification of a PCR product of the correct length.

European Molecular Biology Laboratory (EMBL) Database

A nucleotide sequence database produced and maintained at the European Bioinformatics Institute (EBI) in Hinxton, UK, that collaborates with GenBank and the DNA Database of Japan (DDBJ) to form the International Nucleotide Sequence Database Collaboration.

Ensembl is a joint project between EBI-EMBL and the Sanger Institute to provide automatic annotation of eukaryotic genomes.

Entrez is an integrated search and retrieval system that integrates information from various databases at NCBI, including nucleotide and protein sequences, 3D structures and structural domains, genomes, variation data (SNPs), gene expression data, genetic mapping data, population studies, OMIM, taxonomy, books online, and the biomedical literature.

A non-profit academic organization that performs research in bioinformatics and maintains the EMBL nucleotide sequence database.

A feature within the human genome Map Viewer that provides a graphical display of the molecular evidence supporting the existence of a gene model. ev displays reference sequences, GenBank mRNAs, annotated known or potential transcripts, and ESTs that align to the genomic area of interest.

In BLAST statistics, the Expect value (E) is the number of alignments with a particular score, or a better score, that are expected to occur by chance when comparing two random sequences. The relationship between the Expect value and the alignment score S is given by equation 1:

E = K m n e^(-λS)     (1)

In Equation 1, e is the base of the natural logarithm, n and m are the lengths of the two sequences (essentially the search space size for database searching), and K and lambda (λ) are scaling factors for the search space and the scoring system, respectively. The bit score incorporates lambda and K so that scores can be meaningfully compared when different databases and scoring systems are used.
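As a rough illustration of equation 1 and the bit-score conversion (the K and lambda values below are placeholders; the real values depend on the scoring system and are reported in the BLAST output):

    import math

    def expect_value(raw_score, m, n, K, lam):
        """E = K * m * n * exp(-lambda * S), i.e. equation 1."""
        return K * m * n * math.exp(-lam * raw_score)

    def bit_score(raw_score, K, lam):
        """Normalized score S' = (lambda * S - ln K) / ln 2."""
        return (lam * raw_score - math.log(K)) / math.log(2)

    # With the bit score, E can equivalently be written E = m * n * 2 ** (-S').
    S, m, n, K, lam = 48, 250, 1_000_000, 0.041, 0.267   # illustrative numbers only
    print(expect_value(S, m, n, K, lam), bit_score(S, K, lam))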

Expressed Sequence Tag (EST)

A short (300-1000 nucleotide), single-pass, single-read DNA sequence derived from a randomly picked cDNA clone. EST sequences comprise the largest GenBank division. There are numerous high-throughput sequencing projects that continue to produce large numbers of EST sequences for important organisms. Many ESTs are classified into gene-specific clusters in the UniGene data set.

F

A sequence similarity search tool developed by William Pearson and David Lipman. The term FASTA is also used to identify a widely used text format for sequences. A FASTA-formatted sequence file may contain multiple sequences; each sequence in the file is preceded by a single-line title beginning with the greater-than sign (">"). A short example is given below.
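A minimal illustration of the format and a sketch of parsing it (the records below are invented):

    # Each FASTA record starts with a ">" title line followed by one or more sequence lines.
    fasta_text = """>seq1 hypothetical example record
    MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
    >seq2 second example record
    ATGGCGTACGTTAGC
    """

    def parse_fasta(text):
        records, title = {}, None
        for line in text.splitlines():
            line = line.strip()
            if line.startswith(">"):
                title = line[1:].split()[0]   # first word of the title line as identifier
                records[title] = []
            elif title is not None and line:
                records[title].append(line)
        return {t: "".join(parts) for t, parts in records.items()}

    print(parse_fasta(fasta_text))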

The feature table is the portion of the GenBank record that provides information about the biological features that have been annotated on the nucleotide sequence, including coding and non-coding regions, genes, variations, and sequence tagged sites. The International Sequence Database Collaboration produces a document describing and identifying allowed features on GenBank, DDBJ, and EMBL records.

File Transfer Protocol (FTP)

FTP is a standard Internet protocol used to transfer files to and from a remote network site.

Fluorescence in Situ Hybridization (FISH) map

A FISH map is a cytogenetic map derived from the localization of fluorescently-labeled probes to chromosomes. Genes are mapped according to their cytogenetic (band position) location on the chromosome.

G

GenBank is a primary nucleotide sequence database produced and maintained at the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) in Bethesda, MD, USA. GenBank collaborates with EMBL and DDBJ to form the International Nucleotide Sequence Database Collaboration.

GenBank divisions are partitions of the GenBank data into categories based on the origin of the sequence. At first the GenBank divisions were established so that one division would be one file in the GenBank distribution. However, the number of GenBank divisions has not kept pace with the growth of the sequence data; the EST division now has over 150 files. There are currently 17 GenBank divisions.

GenBank Flatfile Format

This is the format of the sequence records in the GenBank flatfile release. This is a text-only format containing multiple entries or records. Each record in the large text file, also called a flatfile, begins with a LOCUS line and ends with a single line consisting of a pair of forward slashes ("//"). The term "GenBank format" is often used to refer to the format of individual records within the flatfile. Each record contains a header containing the database identifiers, the title of the record, references, and submitter information. The header is followed by the feature table and then the sequence itself. The GenBank flatfile is described in detail in the GenBank release notes. In the Entrez system, the GenBank format is the default display format for non-bulk sequence entries.

Gene Expression Omnibus (GEO)

GEO is a primary database at the NCBI that is an archived repository for gene expression data derived from different experimental platforms.

A gene model is a mapping of gene features, such as coding regions and exon-intron boundaries, onto the genomic DNA of an organism. Gene models typically provide a predicted transcript and protein sequence. A simple kind of gene model can be made by aligning an expressed sequence (cDNA) to the genomic DNA sequence. More precise exon-intron boundaries can be identified by constraining the aligned segments using consensus splicing signals. This type of alignment-based gene model is used to generate many of the NCBI RefSeq model transcripts for higher genomes. Gene features can also be predicted computationally in the absence of aligned expressed sequences. The simplest candidate gene predictions can be made on microbial genomic DNA by searching for long open reading frames. Database sequence similarity searches with the predicted translations of these ORFs are used to support these gene predictions. Computational gene prediction in higher eukaryotic genomes is complicated by the interruption of gene coding regions by intronic sequences. There are a number of methods used in eukaryotic gene prediction. The NCBI uses the program GenomeScan to annotate putative genes on the human, mouse, and rat genomes.

A linkage map is an ordered display of genetic information referenced to linkage groups (ultimately chromosomes) in a genome. The mapping units (centiMorgans) are based on recombination frequency between various polymorphic markers traced through a pedigree. One centiMorgan equals one recombination event in 100 meioses.

Genetics Computer Group (GCG)

The GCG is a bioinformatics software development group, originally at the Department of Genetics at the University of Wisconsin, later existing as a private company, and eventually merging with Oxford Molecular, MSI, and Synopsys Scientific Systems to form Accelrys. GCG is widely known for its sequence analysis software package, properly known as the Wisconsin Package, and the initials GCG are widely used as a synonym for that package.

Genome Survey Sequence (GSS)

GSS sequences comprise a bulk sequence division of GenBank. GSS sequences are in essence the genomic equivalent of the ESTs. The GSS division contains first pass, single reads of genomic DNA. Typical GSS records are initial sequencing surveys and end reads of large insert clones from genomic libraries, exon-trapped genomic sequences and Alu PCR sequences.

GenomeScan is a gene prediction program (algorithm) developed by Christopher Burge at the Massachusetts Institute of Technology. It is the algorithm used at the NCBI to produce gene models for higher genomes.

The GI number is an identifier assigned to all sequences at the NCBI. The GI number points to a specific version of a sequence record. For outside users this identifier has been largely superseded by the accession.version number. GI stands for GenInfo, a database system at NCBI that preceded the Entrez system.

A global alignment is a sequence alignment that extends the full-length of the sequences that are being compared. Global alignment procedures usually will produce an alignment that includes the entire length of all sequences including regions that are not similar, and can be made to produce meaningless alignments between unrelated sequences. Compare with local alignment.

The Golden Path refers to the human and mouse genome annotation and assembly projects at the University of California Santa Cruz (UCSC).

H

High Throughput Genomic Sequence (HTG)

HTG sequences comprise a GenBank division containing unfinished genomic sequence. HTG records are typically incomplete sequence assemblies of BAC or other large-insert clones. GenBank recognizes four stages of completion (phases) for these sequences. Phase 0 records contain one or a few single-pass reads of a given genomic clone. Phase 1 records contain two or more assembled contigs of the sequence data; however, the contigs are unordered and unoriented and there are still gaps in the sequence. Phase 2 records also contain two or more contigs with gaps, but the order and orientation are known. Once the sequence gaps are resolved and there is enough sequence coverage to give an accuracy of 99.99%, the record moves to phase 3 and leaves the HTG division for the appropriate taxonomic GenBank division: a human sequence would move to the primate division (PRI), a mouse sequence to the rodent division (ROD).

High Throughput cDNA (HTC)

HTC is a GenBank division containing draft cDNA sequences. HTC records are similar to ESTs, but often contain more information. Unlike ESTs but like the genomic draft (HTG) records, HTC sequences may be updated with additional sequence data and move to the appropriate traditional division of GenBank.

Two biological entities (structures or molecules) are said to be homologues (or to be homologous) if they are thought to descend from a common ancestral structure or molecule. Corresponding body parts and genes in different species, or in the same species, can be homologous, and the term has been extended to sequences as well. However, it is incorrect to report relative homology or percent homology, as is sometimes done for sequences; genes or sequences are either homologous or they are not. See also orthologue and paralogue.

Human Gene Nomenclature Committee (HGNC)

The HGNC is a non-profit organization located at University College London that assigns authoritative and unique gene names and symbols to all known human genes.

Human Mouse Homology Maps

The human mouse homology maps show the syntenic chromosome regions between the two organisms and allow the corresponding sequences and other related information to be retrieved from one organism given a gene or map location in the other. The data used to construct these homology maps are derived from UCSC and NCBI human genome assemblies and the mouse MGD genome map and Whitehead/MRC radiation hybrid maps.

I-L

The International Nucleotide Sequence Database Collaboration involves the three major primary nucleotide sequence repositories: GenBank, the DDBJ (DNA Data Bank of Japan), and the EMBL (European Molecular Biology Laboratory) databases. Each database has its own set of submission and retrieval tools, but the three exchange data daily and have shared standards for sequence submission and annotation, so all three contain the same set of sequence data.

Interspersed repetitive sequences are primarily degenerate copies of transposable elements - also called mobile elements - that, in humans, comprise over a third of the genome. The most common mobile elements are LINEs and SINEs (long and short interspersed nuclear elements, respectively). The Alu families of repeats are the primary SINEs in primates.

Long interspersed nuclear elements (LINEs) are a class of transposable element, also called an interspersed repeat, that constitutes about 20% of the human genome. A typical LINE is 6 kb long and encodes a reverse transcriptase and a DNA-nicking endonuclease, allowing it to move about the genome autonomously. LINEs are also called non-LTR retrotransposons.

LinkOut is a registry service used to create links from specific articles, journals, or biological data in Entrez to resources on external web sites. Third parties can provide a URL, resource name, brief description of their web site, and a specification of the NCBI data from which they would like to establish links.

LOAD is the Library of Ancient Domains, a small number of conserved domain alignments that add to the position-specific scoring matrices (PSSMs, or profiles) in the Conserved Domain Database (CDD) at NCBI. The majority of domains in CDD come from the databases SMART (Simple Modular Architecture Research Tool) and Pfam.

A local alignment is a high scoring alignment between sub-sequences of two or more longer sequences. Unlike a global alignment, there may be multiple high scoring local alignments between sequences. Local alignments are useful for database searches because their scores can be used to assess the biological significance of the alignments found. (See also Alignment Score and Expect Value.) Local alignments are produced by the popular sequence similarity search tools BLAST and FASTA.

LocusLink is an NCBI resource that provides a single query interface to curated sequence and descriptive information about genetic loci. It is a good place to begin a search for information about a particular gene. LocusLink currently contains human, mouse, rat, zebrafish, fruit fly, and HIV-1 loci.


Low Complexity Sequence

Low complexity sequence is a region of amino acid or nucleotide sequence with a biased residue composition. Low complexity sequence includes homopolymeric runs, short-period repeats, and subtler over-representation of one or a few residues. Such sequences often look very redundant, for example the protein sequence PADPPPDPPPP or the nucleotide sequence AAATTTAAAAAT. Low-complexity regions can produce misleadingly high scores in sequence similarity searches; these scores reflect compositional bias rather than significant position-by-position alignment. Filter programs are usually used to eliminate these potentially confusing matches from sequence similarity search results. The NCBI BLAST programs use filters that replace low complexity regions in the query sequence with an anonymous residue (n for nucleic acid, X for amino acid). These regions are effectively removed from the search because the anonymous residues are treated as mismatches by the BLAST programs.
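The sketch below is a toy illustration of the masking idea only; the filters actually used by BLAST (SEG for proteins, DUST for nucleotides) are more sophisticated, and the window size and threshold here are arbitrary.

    # Mark windows dominated by a single residue and replace them with the anonymous residue X.
    from collections import Counter

    def mask_low_complexity(seq, window=10, max_fraction=0.8, mask_char="X"):
        to_mask = set()
        for i in range(len(seq) - window + 1):
            counts = Counter(seq[i:i + window])
            if counts.most_common(1)[0][1] / window >= max_fraction:
                to_mask.update(range(i, i + window))
        return "".join(mask_char if i in to_mask else c for i, c in enumerate(seq))

    print(mask_low_complexity("MKTAYIAPPPPPPPPPPADEKLV"))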

M

The Map Viewer is a software component of the NCBI Entrez Genomes that provides special browsing capabilities for genomes of higher organisms. It allows one to view and search an organism's complete genome, display chromosome maps, and zoom into progressively greater levels of detail, down to the sequence data. If multiple maps are available for a chromosome, it displays them aligned to each other based on shared marker and gene names, and, for the sequence maps, based on a common sequence coordinate system. The number and types of available maps vary by organism, but include maps for: genes, contigs, BAC tiling path, STSs, FISH mapped clones, ESTs, GenomeScan models, and SNPs.

MEDLINE is the NLM's premier bibliographic database covering the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences. MEDLINE contains bibliographic citations and author abstracts from more than 4,600 biomedical journals published in the United States and 70 other countries. The file contains over 11 million citations dating back to the mid-1960s. Coverage is worldwide, but most records are from English-language sources or have English abstracts. MEDLINE is included in PubMed, which contains additional citations.

MegaBLAST is a local pairwise nucleotide alignment tool that is optimized for finding long alignments between nearly identical sequences. MegaBLAST is most useful for comparing sequences from the same species, and is particularly suited to tasks such as clustering ESTs, aligning genomic clones, or aligning cDNA sequences to genomic DNA. MegaBLAST can be up to 10 times faster than many standard sequence similarity programs, including standard nucleotide-nucleotide BLAST, and it efficiently handles much longer DNA sequences. MegaBLAST is the only BLAST program on the NCBI's web site that can perform batch searches.

Model Maker is a tool associated with the Map Viewer that allows one to view the evidence (mRNAs, ESTs, and gene predictions) that was aligned to assembled genomic sequence in order to build a gene model. Model Maker also allows editing the model by selecting or removing putative exons. Model Maker can then display the mRNA sequence and potential ORFs for the edited model, and save the mRNA sequence data for use in other programs. Model Maker is accessible from sequence maps displayed in the Map Viewer. To see an example, follow the "mm" link beside any gene annotated on the human "Gene_Sequence" map in the Map Viewer.

NCBI's structure database, MMDB, contains experimentally determined, three-dimensional, biomolecular structures obtained from the Protein Data Bank (PDB); the PDB's theoretical models are not imported. MMDB was designed for flexibility, and as such is capable of archiving conventional structural data as well as future descriptions of biomolecules, such as those generated by electron microscopy (surface models). Most 3D-structure data are obtained from X-ray crystallography and NMR spectroscopy.

A motif is a short, well-conserved nucleotide or amino acid sequence that represents a minimal functional domain. It is often a consensus for several aligned sequences. The PROSITE database is a popular collection of protein motifs, including motifs for enzyme catalytic sites, prosthetic group attachment sites (heme, biotin, etc), and regions involved in binding another protein. Examples of DNA motifs are transcription factor binding sites.

N

The NCBI is a division of the National Library of Medicine at the National Institutes of Health in Bethesda, MD. The NCBI was established in 1988 to create automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics; to support the use of such databases and software by the scientific community; to coordinate efforts to gather biotechnology information both nationally and internationally; and to perform research in computational biology. Currently the NCBI maintains the GenBank database along with several related databases.

The National Institute of Genetics (NIG) was established in 1949 in Mishima, Japan and reorganized in 1988 as an inter-university research institute in genetics. The Institute currently provides graduate education in genetics and also maintains the DNA Data Bank of Japan.

Nonredundant is a term used to describe nucleotide or amino acid sequence databases that contain only one copy of each unique sequence. Non-redundant databases have the advantage of smaller size and, therefore, shorter search times and more meaningful statistics. The default database on most protein BLAST web pages is labeled "nr". This is a nonredundant database in which multiple copies of the same sequence, such as the corresponding sequences of the same protein from SWISS-PROT, PIR, and GenPept, are combined to make one sequence entry. The default nucleotide database on the standard nucleotide-nucleotide BLAST web page is also labeled "nr", but is no longer a nonredundant database.

O

Online Mendelian Inheritance in Man (OMIM)

OMIM is a catalog of human genes and genetic disorders authored and edited by Dr. Victor A. McKusick and his colleagues at Johns Hopkins and elsewhere, and developed for the World Wide Web by NCBI. The database contains textual information, references, and copious links to MEDLINE and sequence records in the NCBI's Entrez system, plus links to additional related resources at NCBI and elsewhere.


Open Reading Frame (ORF)

An ORF is a DNA (or mRNA) sequence that is potentially able to encode a polypeptide. ORFs begin with a start codon (ATG) and are read in triplets until they end with a stop codon (TAA, TGA, or TAG in the standard code). The NCBI ORF Finder is useful for identifying ORFs in cDNA or in intron-less genomic DNA.
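A minimal sketch of the underlying idea (not the ORF Finder itself): scan each forward reading frame for an ATG and read triplets until a stop codon is reached. The reverse strand, alternative genetic codes, and realistic minimum-length defaults are omitted for brevity.

    STOP_CODONS = {"TAA", "TGA", "TAG"}

    def find_orfs(dna, min_codons=3):
        dna = dna.upper()
        orfs = []
        for frame in range(3):
            i = frame
            while i + 3 <= len(dna):
                if dna[i:i + 3] == "ATG":
                    for j in range(i + 3, len(dna) - 2, 3):
                        if dna[j:j + 3] in STOP_CODONS:
                            if (j + 3 - i) // 3 >= min_codons:
                                orfs.append((frame, i, j + 3, dna[i:j + 3]))
                            i = j          # resume scanning after this ORF
                            break
                i += 3
        return orfs

    print(find_orfs("CCATGAAAGGGTTTCCCAAAGGGTTTCCCAAATAGCC"))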

Orthologues are genes derived from a common ancestor through vertical descent. This is often stated as the same gene in different species. In contrast, paralogs are genes within the same genome that have evolved by duplication.

The hemoglobin genes are a good example. Two separate genes encode the two protein chains (alpha and beta) that make up the hemoglobin molecule. The alpha and beta DNA sequences are very similar, and it is believed that they arose from duplication of a single gene followed by separate evolution of each sequence. Alpha and beta are therefore considered paralogs, whereas the alpha hemoglobins in different species are considered orthologs.

P

The original Percent Accepted Mutation scoring matrix (see M.O. Dayhoff, ed., 1978, Atlas of Protein Sequence and Structure, Vol 15) was derived from observing how often different amino acids replace other amino acids in evolution, and was based on a relatively small dataset of 1,572 changes in 71 groups of closely related proteins. Further, matrix values are based on the model that one sequence is derived from the other by a series of independent mutations, each changing one amino acid in the first sequence to another amino acid in the second. PAM250 was a very popular matrix, but is often now replaced by the BLOSUM series of matrices, particularly when looking for more distantly related proteins. Lower number PAM matrices correspond roughly to higher numbered BLOSUM matrices.

Paralogs are usually described as genes within the same genome that have evolved by duplication. See Ortholog.

Pfam is a database of conserved protein regions or domains. It is one of three databases that make up the NCBI's Conserved Domain Database (CDD). The other two are SMART and LOAD.

A PopSet is a set of DNA sequences that have been collected to analyze the evolutionary relatedness of a population. The population could originate from different members of the same species, or from organisms from different species. They are submitted to GenBank via the program Sequin, often as a sequence alignment.


Pattern Hit Initiated BLAST (PHI-BLAST)

PHI-BLAST is a variation of BLAST that is designed to search for proteins that both contain a pattern specified by the user, and are similar to the query sequence in the vicinity of the pattern. This dual requirement is intended to reduce the number of database hits that contain the pattern and are likely to have no true homology to the query.


Position Specific Iterated BLAST (PSI-BLAST)

PSI-BLAST is a derivative of protein-protein BLAST that is more sensitive because it incorporates position-specific substitution rates in the scoring system. This makes PSI-BLAST useful for finding very distantly related proteins. PSI-BLAST works by first generating a position-specific score matrix (PSSM) from the sequences found by a standard BLAST search. The database is then searched with the PSSM. PSI-BLAST can be run in multiple iterations, with a new PSSM being made from the new information collected in the previous search.


Position Specific Scoring Matrix (PSSM)

A PSSM is an alignment scoring matrix that provides substitution scores for each position in a protein sequence. PSSMs are often based upon the frequencies of each amino acid substitution at each position of a protein sequence alignment. This gives rise to a scoring matrix that has the length of the alignment as one dimension and the possible substitutions as the other. In a PSSM a particular substitution, for example Ser substituting for Thr, can have a different score at different positions in the alignment. This is in contrast to a position-independent matrix like BLOSUM62, where the Ser/Thr substitution gets the same score no matter where it occurs in the protein. PSSMs are more realistic models for related protein sequences since substitution rates are expected to vary across the length of a protein; some aligned positions, such as active site residues, are more important than others.
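A toy sketch of the position-specific idea, scoring each residue at each alignment column by the log-odds of its observed frequency against a uniform background (real PSSMs, such as those built by PSI-BLAST, also use sequence weighting and more careful pseudocounts; the alignment here is invented):

    import math

    alignment = ["MKTLS", "MKTIS", "MRTLS", "MKSLS"]    # hypothetical aligned sequences
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    background = 1.0 / len(alphabet)

    pssm = []
    for column in zip(*alignment):
        scores = {}
        for aa in alphabet:
            freq = (column.count(aa) + 1) / (len(column) + len(alphabet))   # +1 pseudocount
            scores[aa] = round(math.log2(freq / background), 2)
        pssm.append(scores)

    # The same substitution receives different scores at different positions:
    print(pssm[1]["R"], pssm[2]["R"])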

In the context of alignments displayed in BLAST output, positives are those non-identical substitutions that receive a positive score in the underlying scoring matrix, BLOSUM62 by default. Most often, positives indicate a conservative substitution or substitutions that are often observed in related proteins.

A primary sequence database contains sequences submitted by the researchers who originally produced the data. In primary sequence databases the submitters of the sequence control the contents and disposition of the data. GenBank is an example of a primary database; the content, accuracy, and updating of GenBank sequences are largely the responsibility of the submitter. This is in contrast to a curated database, such as RefSeq or SWISS-PROT, where additional information is added to each record by the staff maintaining the database.

ProbeSet is a by-experiment view of NCBI's Gene Expression Omnibus (GEO), which is a gene expression and hybridization array repository. ProbeSet is intended to facilitate searches of the GEO database and link the search results to internal and external resources where possible.

Protein matches for ESTs (ProtEST) are the best protein matches to translations of EST sequences in UniGene. The nucleotide sequences (mRNAs as well as ESTs) are matched with possible translation products through sequence comparison using BLASTX with an Expect value cutoff of 1 x 10^-6. The sequences are compared with proteins from eight organisms and the best match in each organism is recorded. UniGene nucleotide sequences can thus have up to eight matches in ProtEST. To exclude protein sequences that are strictly conceptual translations or models, the proteins used in ProtEST are those originating from the SWISS-PROT, PIR, PDB, or PRF databases.

PDB is the repository for the processing and distribution of 3-D biological macromolecular structure data. As of April 2002, PDB contained almost 18,000 structures, including more than 1,000 nucleic acids and 400 theoretical models. Except for the theoretical models, the PDB data are used to produce the NCBI's structure database, MMDB, and are included in the default BLAST databases ("nr").

PIR is a curated protein sequence database produced and maintained by the National Biomedical Research Foundation at Georgetown University in Washington, D.C. PIR protein sequences are included in BLAST "nr" database and in the Entrez protein system. PIR contains more than 200,000 entries.

PRF is a protein sequence database maintained in Osaka, Japan, and is one of the protein databases included in BLAST "nr" database searches and in the Entrez protein system. Release 84, March 2002, included 195,660 entries.

PubMed, a service of the National Library of Medicine, provides access to over 11 million MEDLINE citations from more than 4,300 biomedical journals published in the United States and 70 other countries. Citations cover the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences, and date back to the mid-1960s. PubMed includes additional life science journals not found in MEDLINE, as well as links to many sites providing full-text articles and other related resources.

Q-R

Radiation Hybrid (RH) map

A radiation hybrid map is a STS-based physical genome map produced by first breaking chromosomes of a donor cell line with a lethal dose of radiation, and then rescuing the cells by fusion with a recipient cell line. Distances between markers are measured in centirays (cR), with 1 cR representing a 1% probability that a break occurred between two markers.

RasMol is a structure rendering software package produced at the University of Massachusetts. RasMol interprets the native format of structure files from PDB.

A raw score in BLAST output is the non-normalized score of an alignment of a query and target sequence. The raw score is derived directly from the scoring matrix by summing the individual substitution scores of the aligned residues in the alignment. For gapped BLAST the raw score also includes gap penalties.
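A sketch of how the raw score is accumulated from an alignment (the tiny scoring table and the gap penalties below are stand-ins; BLAST uses BLOSUM62 and its own gap costs by default):

    # Sum substitution scores for aligned residue pairs and subtract gap penalties.
    toy_matrix = {("A", "A"): 4, ("S", "S"): 4, ("W", "W"): 11,
                  ("A", "S"): 1, ("S", "A"): 1,
                  ("A", "W"): -3, ("W", "A"): -3, ("S", "W"): -3, ("W", "S"): -3}

    def raw_score(aligned_query, aligned_subject, matrix, gap_open=11, gap_extend=1):
        score, in_gap = 0, False
        for q, s in zip(aligned_query, aligned_subject):
            if q == "-" or s == "-":
                score -= gap_extend if in_gap else gap_open
                in_gap = True
            else:
                score += matrix[(q, s)]
                in_gap = False
        return score

    print(raw_score("AWS-A", "AWSSA", toy_matrix))   # 4 + 11 + 4 - 11 + 4 = 12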

Reference single nucleotide polymorphisms (refSNP) are curated dbSNP records that define a non-redundant set of markers used for annotation of reference genome sequence and integration with other NCBI resources. Each refSNP record provides a summary list of submitter records in dbSNP and a list of external resource and database links.

Reference Sequences are curated nucleotide or protein records developed by NCBI staff. They attempt to summarize the available information about a given sequence and to provide the most reliable and up-to-date sequence and annotation. RefSeqs include curated transcripts and proteins, noncoding RNAs, contig and supercontig assemblies, gene models, and chromosome records.

Reverse Position Specific BLAST (RPS-BLAST)

RPS-BLAST is a variation of BLAST in which a protein query sequence is searched against a database of pre-computed Position-Specific Score Matrices as used in PSI-BLAST. This kind of search forms the basis of the CD-Search.

S-T

A sequence alignment is a residue by residue comparison of two or more sequences. In the alignment, the relative positions of the sequences are adjusted to optimize (usually maximize) the alignment score derived by reference to some scoring matrix. In some cases gaps with associated penalties may be inserted into one or more sequences to optimize the alignment score.

Sequence Tagged Site (STS)

STSs are sequence records that contain a short sequence of genomic DNA that can be uniquely amplified by the polymerase chain reaction (PCR) using a pair of primers. The primer sequences and PCR conditions are usually included in the record. Sequence tagged sites comprise the STS GenBank division. These markers are used in linkage and radiation hybrid mapping techniques. They are useful for integrating these kinds of mapping data with each other and also with the assembled genomic sequence. The e-PCR tool is useful for identifying known STS markers in a DNA sequence.

Sequin is a stand-alone application package produced by NCBI that is a platform for preparing and annotating sequences for submission to GenBank.

Serial Analysis of Gene Expression (SAGE)

SAGE is an experimental method of generating a cDNA library that contains concatenated short (usually ten-base) fragments, called tags, of all cDNA species present in the library. These tags may be counted to give a quantitative measure of gene expression in the library. The NCBI SAGE Map resources match SAGE tag sequences to UniGene clusters to identify genes expressed in SAGE libraries and provide several mechanisms for exploring relative expression patterns in SAGE libraries.
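A minimal illustration of the counting step, with invented tag sequences:

    # Tally identical SAGE tags to estimate relative transcript abundance.
    from collections import Counter

    tags = ["CATGTTTGGG", "CATGAAACCC", "CATGTTTGGG", "CATGTTTGGG", "CATGAAACCC"]
    print(Counter(tags).most_common())
    # [('CATGTTTGGG', 3), ('CATGAAACCC', 2)]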

Shotgun sequencing is a sequencing method in which a large genomic clone is broken into small segments that are then subcloned and randomly sequenced. Once enough random clones have been sequenced, these random sub-sequences are then assembled to establish the large insert sequence. In some cases, an entire genome may be fragmented and cloned into small insert vectors without first being cloned and arrayed in large insert vectors. This latter technique is called whole genome shotgun sequencing and has been used successfully with many smaller genomes and has provided important preliminary assemblies for the human, mouse and rice genomes.

SINEs (short interspersed nuclear elements) are transposable repeat elements in the human genome that are typically 100-400 bp, harbor an internal polymerase III promoter, and encode no proteins.

Single Nucleotide Polymorphism (SNP)

Strictly speaking a SNP is a variation or polymorphism in the genome sequence involving a single nucleotide position. The NCBI maintains dbSNP as a primary repository of SNP data. The SNP data at the NCBI also includes some variations involving multiple positions such as repeat polymorphisms.

Spectral Karyotyping and Comparative Genomic Hybridization Database (SKY/CGH database)

SKY/CGH is a repository of publicly submitted data from SKY and CGH, which are complementary fluorescent molecular cytogenetic techniques. SKY facilitates the identification of chromosomal aberrations; CGH can be used to generate a map of DNA copy number changes in tumor genomes.

SMART (Simple Modular Architecture Research Tool) is a database of conserved domains that allows automatic identification and annotation of domains in user-supplied protein sequences. The SMART data are used to create one of the sets of PSSMs used in the CD-Search.

Smith-Waterman algorithm

The Smith-Waterman algorithm is a local alignment computational protocol that uses dynamic programming to find all possible high-scoring local alignments between a pair of sequences. This is the most sensitive local alignment algorithm but is computationally too expensive to be generally useful for high throughput searches of large sequence databases. The BLAST and FASTA programs are generally used in these kinds of applications.
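For illustration, a compact version of the recurrence with simple match/mismatch scores and linear gap penalties (production implementations use substitution matrices and affine gap costs):

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
        # H[i][j] holds the best local alignment score ending at a[i-1], b[j-1].
        H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best, best_pos = 0, (0, 0)
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
                if H[i][j] > best:
                    best, best_pos = H[i][j], (i, j)
        return best, best_pos   # traceback from best_pos would recover the alignment itself

    print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))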

SWISS-PROT is a highly curated database of protein sequences established in 1986 and currently maintained by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute (EBI).

The TaxBrowser is an aspect of the Entrez system that allows one to browse sequence, genome, and structure records based on the taxonomic classification of the source organism. The TaxBrowser allows access at all levels of the taxonomic hierarchy and can be used to acquire records at any taxonomic node.

TrEMBL (Translated EMBL) is a derivative protein dataset that is an automatically annotated supplement to SWISS-PROT. TrEMBL contains translations of all coding regions of EMBL nucleotide sequence entries. The TrEMBL data set serves as a source of proteins that may eventually be incorporated into SWISS-PROT.

U-Z

A database created and maintained at NCBI as an experimental system for automatically partitioning expressed nucleotide sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the map location and tissue types in which the gene has been expressed. UniGene is particularly important for reducing the redundancy and complexity of EST data and is an important resource for gene discovery.

A resource created and maintained at NCBI that reports information about Sequence Tagged Sites (STS). For each STS, UniSTS displays the primer sequences, product size, and mapping information, as well as cross references to other NCBI databases.

Vector Alignment Search Tool (VAST)

An algorithm created at NCBI that searches for three-dimensional structures that are geometrically similar to a query structure by first representing the secondary structure elements of each structure as vectors, and then attempting to align these sets of vectors. VAST is used at the NCBI to establish relationships between structures and create structural alignments in the Entrez system.

A parameter of the BLAST algorithm that determines the length of the residue segments (either nucleotides or amino acids) into which BLAST partitions the query sequence. The resulting dictionary of "words" is then used to search the selected sequence database.
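A small illustration of how a query is broken into overlapping words (the default word sizes are commonly 3 for protein BLAST and 11 for nucleotide BLAST):

    def query_words(query, w=3):
        return [query[i:i + w] for i in range(len(query) - w + 1)]

    print(query_words("MKTAYIAK"))   # ['MKT', 'KTA', 'TAY', 'AYI', 'YIA', 'IAK']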

Yeast Artificial Chromosome (YAC)

A YAC is a functional (self-replicating) artificial chromosome widely used as a vector for genomic clones in sequencing projects involving large genomes. As the name implies, YACs are propagated in yeast (Saccharomyces). A typical YAC clone can contain fragments up to 2 Mb. A major problem with YAC clones is their tendency to rearrange in the host. YAC technology has largely been replaced by BAC cloning vectors.


Results

We first describe the main features of the thus-estimated LG matrix, and then compare its performance in tree inference to several other replacement matrices with different options and data sets.

LG Replacement Matrix

As stated above, the LG matrix (as estimated using the above procedure) is defined by 3 components: the global rate (ρ), the amino acid equilibrium distribution (Π) and the exchangeability matrix (R). We describe each of these components in turn.
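As a sketch of how these components fit together, the following shows the standard way a time-reversible replacement matrix Q is assembled from an exchangeability matrix and an equilibrium distribution, here with a toy 3-state alphabet rather than the 20 amino acids; the numbers are invented, not the LG estimates, and the scaling to one expected replacement per unit time is one common normalization convention.

    import numpy as np

    R = np.array([[0.0, 2.0, 1.0],
                  [2.0, 0.0, 4.0],
                  [1.0, 4.0, 0.0]])      # symmetric exchangeabilities r_ij
    pi = np.array([0.5, 0.3, 0.2])       # equilibrium frequencies, summing to 1

    Q = R * pi                           # q_ij = r_ij * pi_j for i != j
    np.fill_diagonal(Q, -Q.sum(axis=1))  # each row sums to zero
    Q /= -np.dot(pi, np.diag(Q))         # scale to one expected replacement per unit time

    print(Q)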

The global rate (ρ) is equal to 1.11 and 1.07 for the first (LG1) and second (LG2) iterations, respectively. This indicates that LG is globally faster than WAG, but it is difficult to extrapolate the LG properties from these findings alone. To study the LG rate in tree inference, we therefore measure the tree length obtained with the normalized version of LG and with WAG, both used with 4 gamma categories and invariant sites. The results are displayed in table 1 for the Pfam and TreeBase test alignments. This table also provides a comparison between LG and WAG regarding the estimate of the gamma shape parameter (α). These results highlight a clear difference between LG and WAG: LG trees are ∼10–15% longer on average than WAG trees, and this is observed with almost all test alignments. We also observe that the variability of rates among sites is higher (α is lower) with LG than with WAG, and, again, this is observed with most alignments. Both findings are consistent, as evolutionary distances and branch lengths increase when the α value decreases. We shall see that LG trees also tend to be more likely than WAG trees. All of this means that LG characterizes the evolutionary patterns better than WAG and thus captures more hidden substitutions, which results in longer trees (for a discussion on tree length and likelihood value, see Pagel and Meade 2005).

Comparison of WAG and LG Regarding the Tree Length and Gamma Shape Parameter

NOTE.—LG and WAG are run with PHYML using the Γ4 + I option on TreeBase and Pfam test alignments. The tree length is the sum of all branch lengths; α denotes the gamma shape parameter; LG/WAG is the average of the ratios between LG and WAG values over all alignments. #LG > WAG counts the number of alignments where the LG value is larger than the WAG value, among 59 and 500 alignments for TreeBase and Pfam, respectively. The sign test indicates that all these counts reveal highly significant differences between LG and WAG (p-value ≈ 0.0).


