Abstract
A variety of functionally important protein properties, such as
secondary structure, transmembrane topology and solvent accessibility,
can be encoded as a labeling of amino acids. Indeed, the prediction of
such properties from the primary amino acid sequence is one of the core
projects of computational biology. Accordingly, a panoply of approaches
has been developed for predicting such properties; however, most
focus on a single task at a time. Motivated by
recent, successful work in natural language processing, we propose to
use multitask learning to train a single, joint model that
exploits the dependencies among these various labeling tasks. We
describe a deep neural network architecture that, given a protein
sequence, outputs a host of predicted local properties, including
secondary structure, solvent accessibility, transmembrane topology,
signal peptides and DNA-binding residues. The network is trained jointly
on all these tasks in a supervised fashion, augmented with a novel form
of semi-supervised learning in which the model is trained to
distinguish between local patterns from natural and synthetic protein
sequences. The task-independent architecture of the network obviates the
need for task-specific feature engineering. We demonstrate that, for
all of the tasks that we considered, our approach leads to statistically
significant improvements in performance, relative to a single-task
neural network approach, and that the resulting model achieves
state-of-the-art performance.
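The abstract describes the joint architecture only in prose; as a concrete illustration, the following is a minimal PyTorch sketch of one way such a network could be organized: a shared trunk over a local sequence window, one output head per labeling task, and an extra binary head for the natural-versus-synthetic auxiliary task. Everything here, including the class name `JointLocalPredictor`, the task list, the layer sizes, the window length, and the use of PyTorch itself, is an assumption made for illustration; the paper's actual architecture and hyperparameters are not given in the abstract and likely differ.

```python
# Minimal sketch (assumed, not the authors' code) of a jointly trained
# local-property predictor with a natural-vs-synthetic auxiliary head.
import torch
import torch.nn as nn

AMINO_ACIDS = 21   # 20 standard residues + one unknown/padding token
WINDOW = 15        # local context window centered on each residue
EMBED_DIM = 32
HIDDEN_DIM = 128

# Assumed label counts per task: secondary structure (helix/strand/coil),
# solvent accessibility (buried/exposed), transmembrane topology
# (inside/membrane/outside), signal peptide (yes/no), DNA binding (yes/no).
TASKS = {"ss": 3, "acc": 2, "tm": 3, "sp": 2, "dna": 2}

class JointLocalPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared trunk, reused by every task: residue embeddings followed
        # by a hidden layer over the flattened window.
        self.embed = nn.Embedding(AMINO_ACIDS, EMBED_DIM)
        self.trunk = nn.Sequential(
            nn.Flatten(),
            nn.Linear(WINDOW * EMBED_DIM, HIDDEN_DIM),
            nn.Tanh(),
        )
        # One lightweight output head per supervised labeling task.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(HIDDEN_DIM, k) for name, k in TASKS.items()}
        )
        # Auxiliary semi-supervised head: score whether a window comes
        # from a natural sequence or a synthetic (e.g. shuffled) one.
        self.natural_head = nn.Linear(HIDDEN_DIM, 1)

    def forward(self, windows):
        # windows: LongTensor of shape (batch, WINDOW) of residue indices.
        h = self.trunk(self.embed(windows))
        out = {name: head(h) for name, head in self.heads.items()}
        out["natural"] = self.natural_head(h)
        return out

# One illustrative joint update: a labeled secondary-structure batch plus
# natural windows for the auxiliary task, sharing the same trunk.
model = JointLocalPredictor()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
ce, bce = nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss()

windows = torch.randint(0, AMINO_ACIDS, (8, WINDOW))  # stand-in data
ss_labels = torch.randint(0, TASKS["ss"], (8,))
out = model(windows)
# Natural windows get target 1; shuffled windows (not shown) would get 0.
loss = ce(out["ss"], ss_labels) + bce(out["natural"], torch.ones(8, 1))
opt.zero_grad()
loss.backward()
opt.step()
```

The design point mirrored here is that every task, including the unsupervised natural-versus-synthetic discrimination, backpropagates through the same shared trunk; that shared representation is what lets a joint model exploit dependencies among the labeling tasks without task-specific feature engineering.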
| Field | Value |
| --- | --- |
| Original language | English |
| Article number | e32235 |
| Number of pages | 11 |
| Journal | PLoS ONE |
| Volume | 7 |
| Issue number | 3 |
| DOIs | |
| Publication status | Published - 2012 |
| MoE publication type | A1 Journal article-refereed |
Funding
This work was funded by NIH awards R01 GM074257 and P41 RR0011823, and by the Academy of Finland and the Finnish Cultural Foundation.