The structure, function, and dynamics of proteins are determined by the physical and chemical properties of their amino acids. Unfortunately, the information encoded within a position or between positions is poorly understood. Multiple sequence alignments of protein families allow us to address these questions statistically. Here, we describe the characterization of bioinformatically designed variants of triosephosphate isomerase (TIM). First, we review the state of the art for engineering proteins with increased stability. We examine two methodologies that benefit from the availability of large numbers: high-throughput screening and sequence statistics of protein families. Second, we have deconvoluted which properties are encoded within a position (conservation) and between positions (correlations) by designing TIMs in which each position is occupied by the most common amino acid at that position in the multiple sequence alignment. We found that a consensus TIM from a raw sequence database performs the complex isomerization reaction with weak activity as a dynamic molten globule. Furthermore, we have confirmed that the monomeric species is the catalytically active conformation, even though the protein was designed from more than 600 dimeric proteins. A second consensus TIM from a curated dataset is well folded, has wild-type activity, and is dimeric, yet it differs from the raw consensus TIM at only 35 nonconserved positions. The datasets underlying these two TIMs differ in the fraction of sequences from eukaryotes and prokaryotes. These differences in distribution break and alter networks of statistical correlations at nonconserved positions, as we demonstrate with mutual information and subset perturbation calculations. Additionally, we show that the curated consensus TIM is an extremely thermostable enzyme. The protein remains half folded at 95 °C and may be the only TIM to completely refold after thermal denaturation.
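As a minimal illustration of the two statistics named above, the sketch below derives a consensus sequence from a toy alignment and computes the mutual information between two columns. The alignment, the function names, and the absence of any gap handling or sequence weighting are assumptions made for illustration; they are not the datasets or the procedure used in this work.

```python
from collections import Counter
from math import log2


def consensus(alignment):
    """Return the most common residue in each alignment column."""
    return "".join(
        Counter(col).most_common(1)[0][0] for col in zip(*alignment)
    )


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j."""
    n = len(alignment)
    p_i = Counter(seq[i] for seq in alignment)
    p_j = Counter(seq[j] for seq in alignment)
    p_ij = Counter((seq[i], seq[j]) for seq in alignment)
    return sum(
        (c / n) * log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in p_ij.items()
    )


# Hypothetical toy alignment: columns 1 and 2 covary (K with L, R with V),
# while column 0 is invariant and therefore carries no pairwise information.
toy_msa = ["AKLD", "AKLE", "AKLD", "ARVE", "ARVD"]

toy_consensus = consensus(toy_msa)
mi_coupled = mutual_information(toy_msa, 1, 2)
mi_invariant = mutual_information(toy_msa, 0, 1)
mi_weak = mutual_information(toy_msa, 1, 3)
```

In this toy case the covarying pair of columns has the highest mutual information, whereas a pair that includes the invariant column scores zero; real alignments would additionally need gap treatment and correction for finite sample size.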
Third, we sought to understand the determinants of protein stability, one of biochemistry's most difficult questions. It has been shown that consensus mutations improve the stability of native proteins approximately half the time, but there is no a priori technique to predict which consensus mutations will be stabilizing. We have developed a double-sieve filter that selects stabilizing mutations based on the extent of conservation and statistical independence from other positions within the multiple sequence alignment. These two mathematical tests predict stabilizing mutations with greater than 90% accuracy. The statistical algorithm was used to select 15 consensus mutations that, together, significantly improved the melting temperature of wild-type TIM.
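The logic of a double-sieve filter of this kind can be sketched as follows. This is an illustrative reading under stated assumptions, not the algorithm as implemented in this work: conservation is taken as the frequency of the consensus residue, independence is approximated by the largest mutual information with any other column, and the thresholds `min_conservation` and `max_mi` are hypothetical values chosen only for the toy data.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    p_i = Counter(seq[i] for seq in alignment)
    p_j = Counter(seq[j] for seq in alignment)
    p_ij = Counter((seq[i], seq[j]) for seq in alignment)
    return sum(
        (c / n) * log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in p_ij.items()
    )


def double_sieve(alignment, wild_type, min_conservation=0.8, max_mi=0.2):
    """Select consensus mutations that pass both sieves.

    Returns (position, wild-type residue, consensus residue) tuples,
    with 0-based positions. Both thresholds are hypothetical.
    """
    n = len(alignment)
    length = len(wild_type)
    selected = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        residue, count = Counter(column).most_common(1)[0]
        if residue == wild_type[i]:
            continue  # wild type already carries the consensus residue
        if count / n < min_conservation:
            continue  # sieve 1: position is not conserved enough
        strongest = max(
            (mutual_information(alignment, i, j)
             for j in range(length) if j != i),
            default=0.0,
        )
        if strongest > max_mi:
            continue  # sieve 2: position is coupled to another position
        selected.append((i, wild_type[i], residue))
    return selected


# Hypothetical toy alignment: column 0 is highly conserved and nearly
# independent; columns 1 and 2 covary; column 3 is invariant.
toy_msa = ["GKLT"] * 5 + ["SKLT"] + ["GRVT"] * 4
toy_wild_type = "SRVT"

predicted = double_sieve(toy_msa, toy_wild_type)
```

Here only the substitution at position 0 survives: positions 1 and 2 differ from the consensus but are neither conserved enough nor independent of one another, so both sieves reject them.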
Finally, we designed and characterized a model system for testing the effects of statistically correlated residues. The TIM knockout strain from the Keio Collection was engineered for T7 expression and tested for complementation of TIM activity. The single-gene knockout exhibits differential growth that correlates well with in vitro specific activities. We propose the design and characterization of two libraries to test the relationship between correlations and protein fitness.