Inputs and outputs

The FPS server takes as input a single protein sequence, GCG profile or BLAST checkpoint file, and returns a list of the families to which the query most likely belongs. This page provides details about the input format required by the server and the output produced by the server.


Input to Family Pairwise Search

To use FPS you must specify a query and a protein family library.

The Query

The query may be a protein sequence, a GCG profile, or a BLAST checkpoint file.

The Protein Family Library

In addition, FPS requires you to select a library of protein families to search. The FPS server currently supports two protein libraries:

The PROSITE and PFAM databases were both purged: sequences were removed until no pair of remaining sequences had a BLAST similarity score greater than 250 under BLOSUM62 scoring. This was done with the PURGE program described in Neuwald, Liu, & Lawrence (1995), "Gibbs motif sampling: detection of bacterial outer membrane protein repeats", Protein Science 4, 1618-1632.

Output from Family Pairwise Search

An example FPS output page is available here.

The output consists of two sections. The first section lists the families that closely match the query, sorted by E-value. An E-value is the p-value of the query-to-family match multiplied by the number of families in the library. The E-value of a query-to-family match therefore approximates the number of families in the given library that would be expected to match the query at least this well by chance. Unlike a p-value, it can exceed 1.
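As a rough illustration of this relationship, the following Python sketch multiplies a p-value by the library size. The function name `e_value` and the library size of 578 are assumptions: the size is not stated on this page and is only inferred from the ratio of the E-value to the p-value reported for the Lipocalins match in the example further down.

```python
def e_value(p_value: float, num_families: int) -> float:
    """Scale a query-to-family p-value by the number of families searched."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if num_families < 1:
        raise ValueError("num_families must be positive")
    return p_value * num_families


# Assumed library size (hypothetical; inferred from 1.34e-12 / 2.32e-15).
ASSUMED_NUM_FAMILIES = 578

lipocalin_e = e_value(2.32e-15, ASSUMED_NUM_FAMILIES)
print(f"{lipocalin_e:.2e}")  # prints 1.34e-12
```

Because the result is a p-value scaled by a count, weak matches can have E-values well above 1.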

Here is an example output table:

SCOP Superfamily                                        size  eff size   E-value
Lipocalins                                                11      4.63  1.34e-12
PDZ domain                                                 1      1.00  2.69e+00
Catalytic domain of malonyl-CoA ACP transacylase           1      1.00  3.90e+00
Protocatechuate-3,4-dioxygenase, alpha and beta chains     2      1.90  5.47e+00
A sigma70 subunit fragment from RNA polymerase             1      1.00  7.47e+00
RecA protein, C-terminal domain                            1      1.00  7.53e+00
Acylphosphatase                                            1      1.00  8.57e+00

The table of matches lists each family that matches the query with an E-value less than 10. Each row in the table gives the name of the family, the total number of sequences in the family ("size"), the effective family size ("eff size"), and the E-value of the query-to-family match. The effective family size is an estimate of the degree of similarity among the family members.
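The selection and ordering rule described above (keep families with an E-value below 10, sort by E-value) can be sketched in Python as follows. This is not the server's code: the `FamilyMatch` type, the `report_rows` function, and the third input row are hypothetical and exist only to show the filter at work.

```python
from typing import NamedTuple


class FamilyMatch(NamedTuple):
    name: str
    size: int          # total number of sequences in the family
    eff_size: float    # effective family size
    e_value: float     # E-value of the query-to-family match


def report_rows(matches, threshold=10.0):
    """Keep matches with E-value below the threshold, best match first."""
    kept = [m for m in matches if m.e_value < threshold]
    return sorted(kept, key=lambda m: m.e_value)


matches = [
    FamilyMatch("PDZ domain", 1, 1.00, 2.69),
    FamilyMatch("Lipocalins", 11, 4.63, 1.34e-12),
    FamilyMatch("Hypothetical family", 3, 2.10, 42.0),  # made-up row, filtered out
]

for m in report_rows(matches):
    print(f"{m.name:<20} {m.size:>4} {m.eff_size:>8.2f} {m.e_value:>10.2e}")
```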

The second output section contains summaries of the pairwise alignments generated by the Smith-Waterman algorithm. Each summary shows pairwise alignments between the query and members of a single family. These summaries are intended to provide an explanation for the scores that FPS computes as well as insight into the similarities between the query and members of each family. No attempt has been made to produce true multiple alignments by employing information about similarities among family members. Therefore, the summaries are likely to contain more gaps than a true multiple alignment.

Here is part of an example pairwise alignment summary:


alignment of ICYA_MANSE vs. Lipocalins 
 family size:              11
 effective family size:  4.63
 p-value of match:   2.32e-15
 E-value of match:   1.34e-12
 
 ICYA_MANSE      -         2 DIFYPGYCPDVKPVNDFDLSAFAGAWHEI---A--------KLPLENENQG--KCTIAEY
 d1bbpa_      3.84e-29     1 NVYHDGACPEVKPVDNFDWSNYHGKWWEV---A--------KYPNSVEKYG--KCGWAEY
 d1hbp__      1.19e-05    14                NFDKARFAGTWYAM---A--------KKDPEGLFLQ--DNIVAEF
 d1epba_      1.59e-04     1              VKDFDISKFLGFWYEI---AfaskmgtpGLAHKEEKMG---AMVVEL
 d1beba_      9.97e-04     7                  DIQKVAGTWYSL---A----------------MA--ASDISLL
 d1obpa_      7.34e-03     8                  NLSELSGPWRTVyigS--------TNPEKIQENGpfRTYFREL
 
 ICYA_MANSE               49 KYDGK---------KASV--YNSFVSNGVKEYMEGD--LEI----------APD-AKYTK
 d1bbpa_                  48 TPEGK---------SVKV--SNYHVIHGKEYFIEGT--AYP----------VGD-SKI--
 d1hbp__                  46 SVDENghmsatakgRVRL--LNNW---DVCADMVGT--FTD----------TEDpAKF--
 d1epba_                  42 K--------------------ENLLALTTTYYSEDHcvLEK----------VTA-TEGDG
 d1beba_                  29 DAQSA---------PLRV--Y----VEELKPTPEGD--LEIllqkwengecAQK-KIIAE
 d1obpa_                  43 VFDDE---------KGTVdfYFSVKRDGKWKNVH---------------------VKATK
 

The alignment summary includes all of the information given previously in the query-to-family table. In addition, the first block of the summary lists the pairwise p-values between the query and each member of the family. Each block of the summary begins with the query. Subsequent lines in the block list the family member sequence identifier, the starting position of this block within the sequence, and the sequence alignment to the query. Amino acids in the alignment are colored according to their biochemical properties.

AMINO ACIDS  COLOR
AFILMPVW     RED
CGHNQSTY     LIME
DE           BLUE
KR           MAGENTA
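For readers who want to reproduce this coloring in their own tools, the table translates directly into a lookup. The sketch below is a minimal Python version. The names `COLOR_GROUPS`, `RESIDUE_COLOR`, and `color_of` are hypothetical, and returning `None` for gap characters and lowercase letters is an assumption, since this page does not say how the server colors them.

```python
# One entry per row of the color table above.
COLOR_GROUPS = {
    "AFILMPVW": "RED",
    "CGHNQSTY": "LIME",
    "DE": "BLUE",
    "KR": "MAGENTA",
}

# Expand the groups into a per-residue lookup table.
RESIDUE_COLOR = {
    residue: color
    for group, color in COLOR_GROUPS.items()
    for residue in group
}


def color_of(residue):
    """Return the color name for an uppercase residue letter, else None."""
    return RESIDUE_COLOR.get(residue)


print(color_of("W"), color_of("D"), color_of("-"))
```

The four groups together cover all twenty standard amino acids, so any other character in an alignment row falls through to `None`.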

Note that the alignment summary lists only the most significant alignments in the family. Thus, in the example alignment summary shown above, the lipocalin family contains eleven members, but the summary contains only the five most significant alignments.
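The row layout described above (identifier, p-value in the first block only, start position, aligned residues) can be read with a few lines of Python. This is a sketch under stated assumptions rather than a parser for the server's actual output format: `parse_row` is a hypothetical helper, it assumes the aligned segment contains no internal whitespace (true of the example shown), and it discards the leading column offset of each segment.

```python
def parse_row(line):
    """Split one alignment-summary row into its fields."""
    fields = line.split()
    if len(fields) == 4:
        # First block: identifier, p-value (or "-" for the query), start, residues.
        name, p_text, start, aligned = fields
        p_value = None if p_text == "-" else float(p_text)
    elif len(fields) == 3:
        # Later blocks omit the p-value column.
        name, start, aligned = fields
        p_value = None
    else:
        raise ValueError(f"unexpected row layout: {line!r}")
    return {"id": name, "p_value": p_value, "start": int(start), "aligned": aligned}


first_block = [
    " ICYA_MANSE      -         2 DIFYPGYCPDVKPVNDFDLSAFAGAWHEI---A--------KLPLENENQG--KCTIAEY",
    " d1bbpa_      3.84e-29     1 NVYHDGACPEVKPVDNFDWSNYHGKWWEV---A--------KYPNSVEKYG--KCGWAEY",
    " d1hbp__      1.19e-05    14                NFDKARFAGTWYAM---A--------KKDPEGLFLQ--DNIVAEF",
]

rows = [parse_row(line) for line in first_block]
query, members = rows[0], rows[1:]
print(query["id"], [m["id"] for m in members])
```

Counting the member rows in the first block and comparing the count with the reported family size shows how many family members were left out of the summary.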


FPS was developed by Timothy Bailey at the San Diego Supercomputer Center, and by William Grundy at the Department of Computer Science and Engineering, University of California, Santa Cruz. The FPS server is funded by the National Biomedical Computation Resource.

Please send comments and questions to Timothy Bailey at tbailey@sdsc.edu.