Overview

Identifying the protein-recognition sites in DNA, or the DNA-recognition sites in proteins, helps in understanding a variety of cellular processes. Protein-DNA interactions have been studied both theoretically and experimentally. Many studies have focused on predicting DNA-binding residues in proteins, but the inverse problem (i.e., predicting protein-binding nucleotides in a DNA sequence) has received much less attention. Here we developed a web application called PNImodeler, which predicts protein-binding nucleotides in a DNA sequence using sequence information.

As of July 2013, we collected 1,584 protein-DNA complexes whose structures were determined by X-ray crystallography at a resolution of 3.0 Å or better. These 1,584 complexes contain 1,416 DNA sequences and 837 protein sequences. To determine binding sites in DNA, we used three types of interactions: hydrogen bonds, water bridges, and hydrophobic interactions.

PNImodeler provides two prediction models: one uses DNA sequence data alone, and the other uses both DNA and protein sequence data. The first model consists of 20,378 binding DNA sequence fragments and 23,950 non-binding DNA sequence fragments. The other model consists of 20,558 binding DNA sequence fragments and 27,630 non-binding DNA sequence fragments. The two models were tested on an independent data set whose DNA sequences differ from those in the models, using a sequence similarity cutoff of 80%. The first model achieved a sensitivity of 73.4%, a specificity of 64.8%, an accuracy of 68.9%, and a correlation coefficient of 0.382. The other model achieved a sensitivity of 67.6%, a specificity of 74.3%, an accuracy of 71.4%, and a correlation coefficient of 0.418.
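
The four performance measures reported above can be computed from the counts of a binary confusion matrix, where a positive is a nucleotide labeled as protein-binding. The following is a minimal sketch and not the PNImodeler code. It assumes that the correlation coefficient is the Matthews correlation coefficient, which is common in binding-site prediction but is not stated here. The function name binary_metrics and the example counts are hypothetical and are not taken from the PNImodeler test set.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy, and MCC from confusion counts.

    tp/fn: binding nucleotides predicted as binding / non-binding.
    tn/fp: non-binding nucleotides predicted as non-binding / binding.
    Assumption: "correlation coefficient" means the Matthews correlation
    coefficient (MCC); the source text does not specify this.
    """
    def ratio(num, den):
        # Avoid division by zero when a class is absent from the data.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=8, tn=6, fp=4, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting sensitivity and specificity alongside accuracy is useful here because the binding and non-binding classes are not the same size, and a correlation coefficient summarizes both error types in a single value.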