RNA-binding sites in protein (last update: 09/11/2013)

Training and testing data for predicting RNA-binding residues were obtained from protein-RNA complexes. Listed below are the PDB codes of 490 protein-RNA complexes from the Protein Data Bank (PDB), solved by X-ray crystallography at a resolution of 3.0 Å or better. The 490 protein-RNA complexes contain 372 RNA sequences and 306 protein sequences.


1A34 |1A9N |1AQ3 |1AQ4 |1ASY |1ASZ |1AV6 |1B23 |1B7F |1BMV |1C0A |1C9S |1CVJ |1CX0 |1DDL |1DI2
1DRZ |1DUL |1E7K |1EC6 |1EFW |1EUY |1EXD |1F8V |1FFY |1FXL |1G2E |1G59 |1GAX |1GTF |1GTN |1GTR
1GTS |1H3E |1H4Q |1H4S |1HQ1 |1IL2 |1IVS |1J1U |1JBR |1JBS |1JBT |1JID |1K8W |1KQ2 |1L9A |1LNG
1M5K |1M5O |1M5P |1M5V |1M8V |1M8W |1M8X |1M8Y |1N1H |1N35 |1N38 |1N77 |1N78 |1O0B |1O0C |1OOA
1PGL |1QF6 |1QRS |1QRT |1QRU |1QTQ |1QU2 |1QU3 |1R3E |1R9F |1RC7 |1RLG |1RPU |1S03 |1SDS |1SER
1SI3 |1SJ3 |1SJ4 |1SJF |1TFW |1TTT |1U0B |1U1Y |1URN |1UTD |1UVI |1UVJ |1UVN |1VBX |1VBY |1VBZ
1VC0 |1VC6 |1VC7 |1VFG |1WMQ |1WNE |1WPU |1WRQ |1WSU |1XOK |1Y39 |1YTU |1YTY |1YVP |1YYK |1YYO
1YYW |1YZ9 |1ZBH |1ZDH |1ZDI |1ZDJ |1ZDK |1ZE2 |1ZH5 |1ZHO |1ZJW |1ZL3 |1ZSE |2A8V |2AB4 |2ANN
2ANR |2ASB |2ATW |2AZ0 |2AZ2 |2AZX |2B2D |2B3J |2BBV |2BGG |2BH2 |2BNY |2BQ5 |2BS0 |2BS1 |2BTE
2BU1 |2BX2 |2C4Q |2C4Y |2C4Z |2C50 |2C51 |2CSX |2CT8 |2CV0 |2CV1 |2CV2 |2DB3 |2DLC |2DR2 |2DR5
2DR7 |2DR8 |2DR9 |2DRA |2DRB |2DU3 |2DU4 |2DVI |2DXI |2E9R |2E9T |2E9Z |2EC0 |2EZ6 |2F8K |2F8S
2FK6 |2FMT |2G4B |2GIC |2GJW |2GXB |2HVY |2HW8 |2HYI |2I91 |2IX1 |2IZ9 |2IZM |2IZN |2J0S |2JEA
2JLU |2JLV |2JLW |2JLX |2JLY |2JLZ |2NUE |2NUF |2NUG |2NZ4 |2OIH |2OJ3 |2OZB |2PJP |2PLY |2PO1
2PXB |2PXD |2PXE |2PXF |2PXK |2PXL |2PXP |2PXQ |2PXT |2PXU |2PXV |2PY9 |2QUX |2R7R |2R7T |2R7V
2R7W |2R7X |2R8S |2RD2 |2RE8 |2RFK |2UWM |2V3C |2VNU |2VOD |2VON |2VOO |2VPL |2XD0 |2XDB |2XLI
2XLJ |2XLK |2XNR |2XS2 |2XS7 |2XZO |2Y8W |2Y8Y |2Y9H |2YJY |2YKG |2ZH1 |2ZH2 |2ZH3 |2ZH4 |2ZH5
2ZH6 |2ZH7 |2ZH8 |2ZH9 |2ZHA |2ZI0 |2ZKO |2ZM5 |2ZUE |2ZUF |2ZXU |2ZZM |2ZZN |3ADB |3ADC |3ADD
3ADL |3AEV |3AGV |3AKZ |3AM1 |3AMT |3AVT |3AVU |3AVW |3AVX |3AVY |3BOY |3BSB |3BSN |3BSO |3BSX
3BT7 |3BX2 |3BX3 |3CUL |3CUN |3D2S |3DD2 |3DH3 |3EGZ |3EPH |3EQT |3EX7 |3FHT |3FOZ |3FTE |3FTF
3G0H |3G8T |3G9C |3G9Y |3GIB |3H5X |3H5Y |3HAX |3HHN |3HJW |3HL2 |3HSB |3I5X |3I5Y |3I61 |3I62
3IAB |3ICE |3IEV |3IRW |3K49 |3K5Q |3K5Y |3K5Z |3K61 |3K62 |3K64 |3KLV |3KMS |3KNA |3KOA |3KS8
3L25 |3L26 |3L3C |3LQX |3LRN |3LRR |3LWO |3LWP |3LWQ |3LWR |3LWV |3M7N |3M85 |3MDG |3MDI |3MJ0
3MOJ |3MQK |3MUM |3MUR |3MUT |3MXH |3NCU |3NDB |3NL0 |3NMR |3NMU |3NNA |3NNC |3NNH |3NVI |3O3I
3O7V |3O8C |3O8R |3OG8 |3OL6 |3OL7 |3OL8 |3OL9 |3OLB |3OUY |3OV7 |3OVA |3OVB |3OVS |3PEW |3PEY
3PF4 |3PF5 |3PTX |3PU4 |3Q0L |3Q0M |3Q0N |3Q0O |3Q0P |3Q0Q |3Q0R |3Q0S |3QG9 |3QGB |3QGC |3QJJ
3QJL |3QRP |3QSU |3R2C |3R2D |3R9W |3R9X |3RC8 |3RER |3RW6 |3SIU |3SN2 |3SNP |3SQW |3SQX |3T5N
3T5Q |3TMI |3TRZ |3TS0 |3TS2 |3U4M |3U56 |3UCU |3UCZ |3UD4 |3UMY |3V6Y |3V74 |3VJR |3VNV |3VYX
3VYY |3ZC0 |3ZD6 |3ZD7 |3ZGZ |4AL5 |4AL6 |4AL7 |4ALP |4AM3 |4AQ7 |4ARC |4ARI |4AS1 |4ATO |4AY2
4B3G |4BPB |4E78 |4ED5 |4EI1 |4EI3 |4ENN |4ERD |4F02 |4F3T |4FTB |4FVU |4GCW |4GHA |4H5P |4HT8
4HT9 |4IFD |4IG8 |4ILL |4IQX |4J1G |4JNG |4JVY |4JXX |4JXZ |4JYZ |4K4S |4K4T |4K4U |4K4W |4K4X
4K4Z |4K50 |4KRE |4KRF |4KXT |4L8R |4LG2 |5MSF |6MSF |7MSF



Training Data (last update: 09/11/2013)

Two datasets were constructed, one for training each of the two prediction models: (1) a training set for the prediction model that uses both protein and RNA sequences, and (2) a training set for the prediction model that uses protein sequence alone.


Dataset | Total number of RNA-binding residues in protein sequences | Total number of non-binding residues in protein sequences | Download
Training set for the prediction model that uses both protein and RNA sequences | 7,537 | 61,808 | Link
Training set for the prediction model that uses protein sequence alone | 3,488 | 27,994 | Link



Test Data (last update: 09/11/2013)

Two independent test datasets were constructed from sequences that were not used in training: (1) a test set for the prediction model that uses both protein and RNA sequences, and (2) a test set for the prediction model that uses protein sequence alone.


Dataset | Total number of RNA-binding residues in protein sequences | Total number of non-binding residues in protein sequences | Download
Test set for the prediction model that uses both protein and RNA sequences | 923 | 7,578 | Link
Test set for the prediction model that uses protein sequence alone | 1,349 | 11,217 | Link
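
The residue counts in the training and test tables imply a strong class imbalance, which is worth keeping in mind when reading the performance figures. The short sketch below only redoes that arithmetic; the dictionary keys and the function name are illustrative and do not come from the original page.

```python
# Residue counts copied from the training and test tables
# as (binding residues, non-binding residues).
DATASETS = {
    "train, protein + RNA": (7537, 61808),
    "train, protein only": (3488, 27994),
    "test, protein + RNA": (923, 7578),
    "test, protein only": (1349, 11217),
}


def binding_fraction(binding: int, non_binding: int) -> float:
    """Fraction of all residues that are labelled as RNA-binding."""
    return binding / (binding + non_binding)


for name, (pos, neg) in DATASETS.items():
    print(
        f"{name}: {binding_fraction(pos, neg):.1%} binding, "
        f"{neg / pos:.1f} non-binding residues per binding residue"
    )
```

In all four datasets roughly one residue in nine is labelled as binding, so a classifier can reach a high NPV while its PPV stays low.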



Binding Criteria (last update: 09/11/2013)

A binding site should be involved in at least one of the following interactions between RNA and protein: hydrogen bonds, water bridges and hydrophobic interactions.

We define a protein–RNA binding site by three types of interactions: hydrogen bonds, water bridges and hydrophobic interactions. An amino acid involved in at least one of these interactions is classified as an RNA-binding site in the protein. For each of the 542 protein–RNA complexes, we obtained the three types of interactions from NPIDB and incorporated them into the protein sequences of the protein–RNA complexes.
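
The "at least one of three interaction types" rule can be sketched as a small labelling function. This is an illustration under stated assumptions, not the authors' pipeline: the dictionary layout, the interaction-type keys and the 0-based residue indices are hypothetical and do not reflect the NPIDB file format.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical keys for the three interaction types named in the criteria.
INTERACTION_TYPES = ("hydrogen_bond", "water_bridge", "hydrophobic")


def label_binding_residues(
    seq_len: int, contacts: Dict[str, Iterable[int]]
) -> List[int]:
    """Return one label per residue: 1 if the residue takes part in at
    least one of the three interaction types, otherwise 0."""
    binding: Set[int] = set()
    for kind in INTERACTION_TYPES:
        binding.update(contacts.get(kind, ()))
    return [1 if i in binding else 0 for i in range(seq_len)]


# Made-up contacts for a 9-residue protein (0-based residue indices).
example_contacts = {
    "hydrogen_bond": [1, 4],
    "water_bridge": [4],
    "hydrophobic": [7],
}
labels = label_binding_residues(9, example_contacts)
print(labels)
```

Residue 4 appears under two interaction types but is still counted once, since the criterion only asks whether any interaction is present.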




Features

Feature | Type | Size | Meaning
C | Global | 20 | Amino acid compositions
M | Local | 1 * window size | Mass of amino acids
P | Local | 1 * window size | pKa value of amino acids
H | Local | 1 * window size | Hydropathy of amino acids
NP | Local | 1 * window size | Normalized position of amino acids
ASA | Local | 1 * window size | Accessible surface area of amino acids
IP | Local | 4 * window size | Interaction propensity of amino acid triplets
SNP | Partner | 4 | Sum of the normalized position of nucleotides

normalized position: position number of an amino acid divided by the sequence length
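
As an illustration of two of the simpler features, the sketch below computes the 20-dimensional amino acid composition (C) and the normalized position (NP) of each residue. It assumes 1-based position numbers and the 20 standard amino acids; the function names and the example sequence are made up for this sketch.

```python
from typing import List

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq: str) -> List[float]:
    """Global feature C: fraction of each of the 20 amino acids in seq."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def normalized_positions(seq: str) -> List[float]:
    """Local feature NP: position number divided by the sequence length.
    Positions are assumed to start at 1."""
    n = len(seq)
    return [(i + 1) / n for i in range(n)]


seq = "MKRRSAGEK"  # hypothetical 9-residue sequence
print(composition(seq))
print(normalized_positions(seq))
```

With 1-based numbering the last residue always has a normalized position of 1.0; with 0-based numbering the values would shift down by one over the sequence length.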




Feature Vector Encoding

Example of encoding a protein sequence of 9 amino acids by a sliding window of size 5
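
A minimal sketch of sliding-window encoding for one local feature is given below. It assumes that the window is centered on the residue being encoded and that positions falling outside the sequence are padded with zeros; the original figure may use a different edge treatment. The function name, the padding value and the example sequence are hypothetical.

```python
from typing import List, Sequence


def window_encode(
    values: Sequence[float], window: int = 5, pad: float = 0.0
) -> List[List[float]]:
    """Return one vector of length `window` per residue, centered on that
    residue. Positions outside the sequence are filled with `pad`
    (an assumption made for this sketch)."""
    if window % 2 == 0:
        raise ValueError("window size must be odd so that it has a center")
    half = window // 2
    padded = [pad] * half + list(values) + [pad] * half
    return [padded[i:i + window] for i in range(len(values))]


seq = "MKRRSAGEK"  # hypothetical sequence of 9 amino acids
# One local feature per residue; normalized position is used here.
np_values = [(i + 1) / len(seq) for i in range(len(seq))]
vectors = window_encode(np_values, window=5)

for residue, vector in zip(seq, vectors):
    print(residue, [round(v, 2) for v in vector])
```

Each residue yields 1 * window size values for a local feature, which matches the sizes listed in the feature table; a full feature vector would concatenate the windows of all local features with the global and partner features.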




Performance

10-fold cross validation

Model | Sensitivity | Specificity | Accuracy | PPV | NPV | MCC
PR | 85.3% | 68.8% | 70.6% | 25.0% | 97.5% | 0.35
PaR | 91.3% | 62.7% | 65.9% | 23.4% | 98.3% | 0.34

Independent test

Model | Sensitivity | Specificity | Accuracy | PPV | NPV | MCC
PR | 68.1% | 69.2% | 69.1% | 21.2% | 94.7% | 0.24
PaR | 64.0% | 66.1% | 65.9% | 18.5% | 93.9% | 0.19

PR: prediction of RNA-binding sites in protein from protein and RNA sequences
PaR: prediction of RNA-binding sites in protein from protein sequence alone
PPV: positive predictive value; NPV: negative predictive value; MCC: Matthews correlation coefficient
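
For reference, the six measures can be computed from a confusion matrix as sketched below, using their standard definitions. The counts in the example are not reported on this page: they are approximations reconstructed from the PR model's independent-test sensitivity and specificity and from the test-set residue counts (923 binding, 7,578 non-binding), so they should be read as an illustration only.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification measures from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Approximate counts reconstructed from the reported rates (assumption).
positives, negatives = 923, 7578
tp = round(0.681 * positives)   # sensitivity 68.1%
tn = round(0.692 * negatives)   # specificity 69.2%
fn = positives - tp
fp = negatives - tn

m = binary_metrics(tp, fp, tn, fn)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these reconstructed counts the derived accuracy, PPV, NPV and MCC land close to the values reported for the PR model in the independent test, which suggests that the low PPV is largely a consequence of the small share of binding residues.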