Overview
Family Pairwise Search (FPS) is a method for scoring a single protein sequence, GCG profile or BLAST checkpoint file against a family of sequences. FPS compares the query to each of a set of sequences and then combines the pairwise scores into an overall score for the match of the query to the family of sequences.
FPS operates in two modes. In single-sequence query mode, FPS compares a single query sequence to a library of sequence families. The result is a family classification for the query sequence. In family query mode, FPS compares a query set of related sequences to a database of single sequences. In this mode, FPS outputs a set of sequences from the database that are related to the query family.
The remainder of this document focuses on FPS operating in single-sequence query mode. This is the only mode currently supported by the FPS web server.
The FPS algorithm consists of three steps:
- computing effective family sizes,
- comparing pairs of sequences (where the first member of the pair may be a sequence, GCG profile or BLAST checkpoint file), and
- combining the pairwise scores.
Each of these steps is described below.
Computing effective family sizes
The goal of the FPS algorithm is to compute, for each target family in the library, an estimate of the significance (p-value) of the match between the query and the family. FPS computes the p-value of a query-to-family match by combining the p-values of the pairwise matches between the query and each target sequence.
FPS combines the pairwise matches by taking the product of their p-values. In order to compute the significance of this product, the degree of correlation among the individual p-values must be taken into account.
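As a point of reference, the significance of a product of p-values has a standard closed form when the p-values are independent and uniformly distributed under the null hypothesis. The sketch below implements only that uncorrected baseline; it is not the FPS formula, and the function name `product_pvalue_independent` is a hypothetical one chosen for illustration.

```python
import math


def product_pvalue_independent(pvalues):
    """P-value of the product of n independent, uniform p-values.

    Uses the standard closed form
        P(product <= t) = t * sum_{i=0}^{n-1} (-ln t)^i / i!
    This baseline assumes independence; it does not model correlation.
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("need at least one p-value")
    if any(not 0.0 < p <= 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    # x = -ln(product), accumulated in log space to avoid underflow.
    x = -sum(math.log(p) for p in pvalues)

    # Accumulate sum_{i=0}^{n-1} x^i / i! term by term.
    term = 1.0
    total = 1.0
    for i in range(1, n):
        term *= x / i
        total += term
    return math.exp(-x) * total
```

Applied to five identical p-values of 0.05, this baseline returns roughly 0.0009, even though five copies of the same match carry no more evidence than one match at 0.05. This is the overstatement that a correction for correlation has to remove.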
FPS compensates for the correlation among pairwise p-values by calculating an effective family size for each family in the target library. If the target family contains n completely unrelated sequences, then the effective family size will be n. On the other hand, if all the p-values are completely correlated (i.e., identical), then the effective family size will be 1. For real families, the effective family size will be between 1 and n.
The effective family size of each family in the library is pre-computed and stored. FPS uses the stored effective family size to estimate the distribution of the product of p-values for each family in the library. These estimated distributions provide p-values for the match of the query to the target families.
Computing pairwise sequence scores
The heart of FPS is a pairwise sequence comparison algorithm. FPS may employ any such algorithm that provides accurate p-values. The FPS server employs the Smith-Waterman algorithm as implemented on the Bioccelerator. The BLOSUM62 score matrix is used in computing the pairwise scores. A given query is compared to each sequence in the target library, and the p-values for these comparisons are stored.
Combining the pairwise scores
Once the pairwise scores between the query and each target sequence in the library have been computed, the scores for each family are combined by taking the product of the p-values of the pairwise matches between the query and the members of the family. The overall significance of the match to a family is then computed using the formula for the distribution of such products of correlated p-values. The effective family size and actual family size are the parameters of this distribution.
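This document does not state the formula itself. The sketch below shows one hypothetical way to obtain a family p-value from the product of pairwise p-values, the actual family size n, and an effective family size. It rescales -ln(product) by n_eff / n and evaluates the upper tail of a gamma distribution with shape n_eff. This interpolation is an assumption chosen only because it reproduces the two limiting cases described above (n_eff = n for independent p-values, n_eff = 1 for identical ones); the names `gamma_q`, `family_pvalue` and `n_eff` are illustrative and are not taken from FPS.

```python
import math


def gamma_q(a, x):
    """Regularized upper incomplete gamma function Q(a, x), for a > 0."""
    if x <= 0.0:
        return 1.0
    # Common prefactor x^a * exp(-x) / Gamma(a), computed in log space.
    prefactor = math.exp(-x + a * math.log(x) - math.lgamma(a))
    if x < a + 1.0:
        # Power series for the lower function P(a, x); Q = 1 - P.
        term = 1.0 / a
        total = term
        denom = a
        for _ in range(1000):
            denom += 1.0
            term *= x / denom
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return 1.0 - total * prefactor
    # Continued fraction for Q(a, x), evaluated with the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * prefactor


def family_pvalue(pvalues, n_eff):
    """Hypothetical family p-value from pairwise p-values and effective size.

    Assumption (not the FPS formula): rescale -ln(product) by n_eff / n and
    take the upper tail of a Gamma(shape=n_eff, scale=1) distribution.
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("family has no members")
    if not 1.0 <= n_eff <= n:
        raise ValueError("effective family size must lie in [1, n]")
    if any(not 0.0 < p <= 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    stat = -sum(math.log(p) for p in pvalues)  # -ln of the product
    return gamma_q(n_eff, stat * n_eff / n)
```

Under this assumption, five identical pairwise p-values of 0.05 with an effective family size of 1 yield a family p-value of 0.05, while the same product with an effective family size of 5 is treated as five independent matches and yields a much smaller value.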
Finally, FPS sorts the library families according to their p-values. A detailed description of the FPS output format can be found here.
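The final ranking step can be sketched as follows, assuming the family p-values have already been computed and are held in a dictionary keyed by family name. The function name `rank_families`, the tie-breaking rule and the family names in the usage note are hypothetical; they do not describe the actual FPS output format.

```python
def rank_families(family_pvalues):
    """Return (family, p-value) pairs sorted from most to least significant.

    Ties are broken alphabetically by family name so that the order is
    deterministic; this tie-breaking rule is an assumption.
    """
    return sorted(family_pvalues.items(), key=lambda item: (item[1], item[0]))
```

For example, `rank_families({"kinase": 1e-12, "globin": 0.3, "protease": 1e-4})` places `kinase` first and `globin` last, so the top entry is the proposed family classification for the query.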
FPS was developed by Timothy Bailey at the San Diego Supercomputer Center, and by William Grundy at the Department of Computer Science and Engineering, University of California, Santa Cruz. The FPS server is funded by the National Biomedical Computation Resource.
Please send comments and questions to Timothy Bailey at tbailey@sdsc.edu.