
Research Article

Enriching the sequence substitution matrix by structural information

Octavian Teodorescu¹, Tamara Galor¹, Jaroslaw Pillardy², Ron Elber^{1 *}

¹Department of Computer Science, Cornell University, Upson Hall 4130, Ithaca, New York 14853
²Cornell Theory Center, Cornell University, Upson Hall 4130, Ithaca, New York 14853

email: Ron Elber (ron@cs.cornell.edu)

^*Correspondence to Ron Elber, Department of Computer Science, Cornell University, Upson Hall 4130, Ithaca, NY 14853

The calculations were performed on the Dell Edge cluster of the Cornell Theory Center funded by the tri-institutional grant.

Funded by:
National Science Foundation; Grant Number: 9988519
NSERC Canadian fellowship
Cornell and Rockefeller Universities and Memorial Sloan Kettering Cancer Center

Keywords

sequence alignment • threading • fitness function • sequence-to-structure matching • energy function • Z-score

Abstract

A fundamental step in homology modeling is the comparison of two protein sequences: a probe sequence with unknown structure and function, and a template sequence for which the structure and function are known. The detection of protein similarities relies on a substitution matrix that scores the proximity of the aligned amino acids. Sequence-to-sequence alignments use symmetric substitution matrices, whereas threading protocols use asymmetric matrices that test the fitness of the probe sequence to the structure of the template protein. We propose a linear combination of the threading and sequence-alignment scoring functions to produce a single (mixed) scoring table. By fitting a single parameter (the relative contribution of the BLOSUM 50 matrix and the threading energy table of THOM2), we obtain a significant increase in prediction capacity in the twilight zone of homology modeling (detecting sequences with <25% sequence identity and with very similar structures). For a difficult test of 176 homologous pairs with no signal of sequence similarity, the mixed model detects between 40 and 100% more protein pairs than pure threading. Surprisingly, the linear combination of the two models performs better than either threading or sequence alignment alone when the percentage of sequence identity is low. Finally, we suggest that further enrichment of substitution matrices, combining more structural descriptors such as exposed surface area or secondary structure, is expected to enhance the signal as well. Proteins 2003. © 2003 Wiley-Liss, Inc.
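
As an illustration of the mixed scoring table described in the abstract, the following minimal Python sketch combines a substitution score and a threading energy with a single weight. It is not the authors' implementation. The function name `mixed_score_table`, the `weight` parameter, the site labels "buried" and "exposed", the sign convention (energies are negated so that higher scores are better), and all numerical values are assumptions made for this example; the tables are toy placeholders, not BLOSUM 50 or THOM2.

```python
from typing import Dict, List, Sequence, Tuple

# Toy placeholder values (hypothetical), NOT the real BLOSUM 50 entries.
TOY_SUBSTITUTION: Dict[Tuple[str, str], float] = {
    ("A", "A"): 5.0, ("A", "L"): -2.0,
    ("L", "A"): -2.0, ("L", "L"): 5.0,
}

# Toy placeholder energies (hypothetical), NOT the real THOM2 table.
# Indexed by structural site type of the template, then by probe residue.
# Lower energy means a better fit.
TOY_THREADING_ENERGY: Dict[str, Dict[str, float]] = {
    "buried": {"A": -0.5, "L": -1.5},
    "exposed": {"A": 0.2, "L": 1.0},
}


def mixed_score_table(
    probe: str,
    template_residues: str,
    template_sites: Sequence[str],
    weight: float,
) -> List[List[float]]:
    """Build table[i][j], the score for placing probe residue i at template position j.

    Assumed form: weight * substitution_score + (1 - weight) * (-threading_energy).
    weight = 1 gives pure sequence alignment scores; weight = 0 gives pure threading.
    """
    if len(template_residues) != len(template_sites):
        raise ValueError("template residues and site types must have equal length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    table: List[List[float]] = []
    for a in probe:
        row: List[float] = []
        for b, site in zip(template_residues, template_sites):
            seq_term = TOY_SUBSTITUTION[(a, b)]          # symmetric in (a, b)
            thread_term = -TOY_THREADING_ENERGY[site][a]  # depends on probe residue and site only
            row.append(weight * seq_term + (1.0 - weight) * thread_term)
        table.append(row)
    return table


if __name__ == "__main__":
    table = mixed_score_table("AL", "LA", ["buried", "exposed"], weight=0.5)
    for row in table:
        print(row)
```

The resulting table is asymmetric, because the threading term depends on the template site rather than on the template residue, and it could be passed to any standard dynamic-programming alignment routine in place of a pure substitution matrix. In this sketch the single fitted parameter of the abstract corresponds to `weight`.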

Received: 11 December 2002; Accepted: 25 March 2003

Digital Object Identifier (DOI)

10.1002/prot.10474

Proteins: Structure, Function, and Bioinformatics

Volume 54, Issue 1, Pages 41-48