Inputs and outputs
The FPS server takes as input a single protein sequence, GCG profile or BLAST checkpoint file, and returns a list of the families to which the query most likely belongs. This page provides details about the input format required by the server and the output produced by the server.
Input to Family Pairwise Search
To use FPS you must specify
- a query, and
- a protein family library.
The Query
The query may be a sequence, a GCG profile, or a BLAST checkpoint file.
- Sequence
Although the FPS algorithm can be used to classify DNA or protein sequences, the FPS server currently accepts only proteins. The protein sequence should be written in the standard IUPAC alphabet, which includes the letters ACDEFGHIKLMNPQRSTVWY for amino acids and BUXZ for ambiguities. The sequence may be in uppercase or lowercase letters but may not include any punctuation other than spaces or line breaks. Optionally, the sequence may begin with a single FASTA header line, beginning with the ">" character.
Here is an example query sequence:
>ICYA_MANSE this is an example FASTA header line
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKYDGKKASVYNS
FVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVNLVPWVLATDYKNYAINYNCD
YHPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKTFSHLIDASKFISNDFSEAACQYSTT
YSLTGPDRH
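The input rules above can be checked before submitting a query. The following is a minimal sketch in Python, not code from the FPS server: the function name parse_query and its error messages are hypothetical, and the server's own checks may differ in detail.

```python
# Hypothetical helper (not part of FPS): checks a query sequence against
# the input rules described above.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AMBIGUITIES = "BUXZ"
ALLOWED = set(AMINO_ACIDS + AMBIGUITIES)


def parse_query(text):
    """Return (header, sequence); raise ValueError if the query is invalid.

    header is None when the query has no FASTA header line.
    """
    lines = text.strip().splitlines()
    header = None
    # An optional single FASTA header line begins with ">".
    if lines and lines[0].startswith(">"):
        header = lines[0][1:].strip()
        lines = lines[1:]
    # Spaces and line breaks are the only non-letter characters permitted,
    # so remove them and compare what is left against the alphabet.
    sequence = "".join(lines).replace(" ", "").upper()
    if not sequence:
        raise ValueError("empty sequence")
    bad = sorted(set(sequence) - ALLOWED)
    if bad:
        raise ValueError("characters not allowed: " + "".join(bad))
    return header, sequence
```

Lowercase input is folded to uppercase here only for convenience; the text above says either case is accepted.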
- Filtering query sequences for low complexity regions
By default, sequences are filtered for low-complexity regions using the "seg" program. This can be disabled by unchecking the filtering checkbox. Low-complexity regions commonly give spuriously high pairwise alignment scores that reflect compositional bias rather than significant position-by-position alignment. Filtering can eliminate these potentially confounding matches (e.g., hits against proline-rich regions). See "Statistics of local complexity in amino acid sequences and sequence databases" by Wootton, J. C. and S. Federhen (1993), Computers in Chemistry 17:149-163, for more information on filtering and the "seg" program.
- Scoring matrices for pairwise search
The pairwise search will use the scoring matrix chosen from the list.
- GCG profile
Alternatively, you may specify the name of a file containing a GCG v1.0 or v2.0 peptide profile as the query. The contents of the file will be uploaded and used to search the specified protein library. Profiles in v2.0 format are converted to v1.0 format automatically. Multiple profiles in v2.0 files are concatenated into a single v1.0 profile by FPS.
- BLAST v2.0 checkpoint file
Alternatively, you may specify the name of a file containing a BLAST v2.0 checkpoint file. The contents of the file will be uploaded, converted to a GCG v1.0 profile, and used to search the specified protein library. The conversion to a profile consists of dividing the position-specific amino acid frequencies contained in the checkpoint file by the amino acid background frequencies used by PSI-BLAST, and taking the logarithm. The amino acid frequencies used by PSI-BLAST are:
A     C     D     E     F     G     H     I     K     L
0.078 0.019 0.054 0.063 0.039 0.074 0.022 0.052 0.057 0.090

M     N     P     Q     R     S     T     V     W     Y
0.022 0.045 0.052 0.043 0.051 0.071 0.059 0.064 0.013 0.032

The Protein Family Library
In addition, FPS requires you to select a library of protein families to search. The FPS server currently supports the following protein libraries:
- SCOP -- Structural Classification of Proteins,
- PROSITE -- Database of protein families and domains (FPS uses the sequences of PROSITE family members, not the PROSITE family signatures), and
- PFAM -- Database of protein domain multiple alignments (FPS uses the sequences of the PFAM domains, not the PFAM multiple alignments).
The PROSITE and PFAM databases were both purged by removing sequences until no pair of remaining sequences had a BLAST similarity score greater than 250 using BLOSUM 62 scoring. This was done using the PURGE program described in: Neuwald, Liu, & Lawrence (1995) "Gibbs motif sampling: detection of bacterial outer membrane protein repeats" Protein Science 4, 1618-1632.
Output from Family Pairwise Search
An example FPS output page is available here.
The output consists of two sections. The first section lists the families that closely match the query, sorted by E-value. The E-value is the p-value of the query-to-family match multiplied by the number of families in the library. It therefore estimates how many families in the given library would be expected to match the query at least this well by chance.
Here is an example output table:
SCOP Superfamily                                        size  eff size  E-value
Lipocalins                                                11      4.63  1.34e-12
PDZ domain                                                 1      1.00  2.69e+00
Catalytic domain of malonyl-CoA ACP transacylase           1      1.00  3.90e+00
Protocatechuate-3,4-dioxygenase, alpha and beta chains     2      1.90  5.47e+00
A sigma70 subunit fragment from RNA polymerase             1      1.00  7.47e+00
RecA protein, C-terminal domain                            1      1.00  7.53e+00
Acylphosphatase                                            1      1.00  8.57e+00

The table of matches lists each family that matches the query with an E-value less than 10. Each row in the table gives the name of the family, the total number of sequences in the family ("size"), the effective family size ("eff size"), and the E-value of the query-to-family match. The effective family size is an estimate of the degree of similarity among the family members.
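The relationship between p-values, E-values, and the table cutoff can be illustrated with a short Python sketch. It is an illustration only and not the server's code: the function names and the library size passed in by the caller are assumptions, since this page does not state the number of families in any library.

```python
# Illustrative sketch (not FPS code): E-value = p-value * number of families,
# followed by the "E-value less than 10" cutoff used for the table of matches.

def e_value(p_value, num_families):
    """E-value of a query-to-family match in a library of num_families."""
    return p_value * num_families


def table_of_matches(matches, num_families, cutoff=10.0):
    """Return (family, E-value) pairs with E-value < cutoff, best first.

    matches is a list of (family_name, p_value) pairs; the names and the
    library size are supplied by the caller (hypothetical inputs).
    """
    scored = [(name, e_value(p, num_families)) for name, p in matches]
    kept = [row for row in scored if row[1] < cutoff]
    return sorted(kept, key=lambda row: row[1])
```

Because the E-value scales with the library size, the same p-value yields a larger E-value in a larger library.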
The second output section contains summaries of the pairwise alignments generated by the Smith-Waterman algorithm. Each summary shows pairwise alignments between the query and members of a single family. These summaries are intended to provide an explanation for the scores that FPS computes as well as insight into the similarities between the query and members of each family. No attempt has been made to produce true multiple alignments by employing information about similarities among family members. Therefore, the summaries are likely to contain more gaps than a true multiple alignment.
Here is part of an example pairwise alignment summary:
alignment of ICYA_MANSE vs. Lipocalins
family size: 11
effective family size: 4.63
p-value of match: 2.32e-15
E-value of match: 1.34e-12

ICYA_MANSE  -          2 DIFYPGYCPDVKPVNDFDLSAFAGAWHEI---A--------KLPLENENQG--KCTIAEY
d1bbpa_     3.84e-29   1 NVYHDGACPEVKPVDNFDWSNYHGKWWEV---A--------KYPNSVEKYG--KCGWAEY
d1hbp__     1.19e-05  14                NFDKARFAGTWYAM---A--------KKDPEGLFLQ--DNIVAEF
d1epba_     1.59e-04   1              VKDFDISKFLGFWYEI---AfaskmgtpGLAHKEEKMG---AMVVEL
d1beba_     9.97e-04   7                  DIQKVAGTWYSL---A----------------MA--ASDISLL
d1obpa_     7.34e-03   8                  NLSELSGPWRTVyigS--------TNPEKIQENGpfRTYFREL

ICYA_MANSE            49 KYDGK---------KASV--YNSFVSNGVKEYMEGD--LEI----------APD-AKYTK
d1bbpa_               48 TPEGK---------SVKV--SNYHVIHGKEYFIEGT--AYP----------VGD-SKI--
d1hbp__               46 SVDENghmsatakgRVRL--LNNW---DVCADMVGT--FTD----------TEDpAKF--
d1epba_               42 K--------------------ENLLALTTTYYSEDHcvLEK----------VTA-TEGDG
d1beba_               29 DAQSA---------PLRV--Y----VEELKPTPEGD--LEIllqkwengecAQK-KIIAE
d1obpa_               43 VFDDE---------KGTVdfYFSVKRDGKWKNVH---------------------VKATK

The alignment summary includes all of the information given previously in the query-to-family table. In addition, the first block of the summary lists the pairwise p-values between the query and each member of the family. Each block of the summary begins with the query. Subsequent lines in the block list the family member sequence identifier, the starting position of this block within the sequence, and the sequence alignment to the query. Amino acids in the alignment are colored according to their biochemical properties.
AMINO ACIDS  COLOR
AFILMPVW     RED
CGHNQSTY     LIME
DE           BLUE
KR           MAGENTA

Note that the alignment summary lists only the most significant alignments in the family. Thus, in the example alignment summary shown above, the lipocalin family contains eleven members, but the summary contains only the five most significant alignments.
FPS was developed by Timothy Bailey at the San Diego Supercomputer Center, and by William Grundy at the Department of Computer Science and Engineering, University of California, Santa Cruz. The FPS server is funded by the National Biomedical Computation Resource.
Please send comments and questions to Timothy Bailey at tbailey@sdsc.edu.