Volume 45

Issue W1

3 July 2017

ThreaDomEx: a unified platform for predicting continuous and discontinuous protein domains by multiple-threading and segment assembly

Yan Wang

1Key Laboratory of Molecular Biophysics of the Ministry of Education, School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China

2Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA

Jian Wang

1Key Laboratory of Molecular Biophysics of the Ministry of Education, School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China

Ruiming Li

3School of Software, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China

Qiang Shi

3School of Software, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China

Zhidong Xue

2Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA

3School of Software, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China

*To whom correspondence should be addressed. Tel: +1 734 647 1549; Fax: +1 734 615 6553; Email:[email protected]. Correspondence may also be addressed to Zhidong Xue. Tel: +86 27 8779 2254; Fax: +86 27 8779 2251; Email:[email protected]

Yang Zhang

2Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA

4Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA

Nucleic Acids Research, Volume 45, Issue W1, 3 July 2017, Pages W400–W407, https://doi.org/10.1093/nar/gkx410

Published:

11 May 2017

Article history

Received:

03 March 2017

Revision received:

20 April 2017

Accepted:

28 April 2017

Abstract

We develop a hierarchical pipeline, ThreaDomEx, for both continuous domain (CD) and discontinuous domain (DCD) structure predictions. Starting from a query sequence, ThreaDomEx first threads it through the PDB to identify multiple structure templates, from which a profile of domain conservation score (DC-score) is derived for domain-segment assignment. To further detect DCDs that consist of separated segments along the sequence, a boundary-clustering algorithm is used to refine the DCD-linker locations. When the templates do not contain DCDs, a domain-segment assembly process, guided by symmetry comparison, is applied for further DCD detection. ThreaDomEx was tested on a set of 1111 proteins and achieved a normalized domain overlap score of 89.3% compared to experimental data, which is significantly higher than other state-of-the-art methods. It also recalls 26.7% of DCDs with 72.7% precision on the proteins for which threading failed to detect any DCDs. The server provides facilities for users to interactively refine the domain models by adjusting the DC-score threshold, deleting and adding domain linkers, and assembling domain segments. These facilities are particularly helpful for hard targets, for which current methods have low accuracy and human-expert knowledge and experimental insights can be used to refine the models. The ThreaDomEx server is available at http://zhanglab.ccmb.med.umich.edu/ThreaDomEx.

INTRODUCTION

Protein domains are subunits that can fold and evolve independently. The majority of eukaryotic proteins consist of multiple domains, which together achieve various composite cellular functions (1). The identification of protein domains is thus an essential step in protein structure determination and functional annotation, and a variety of computational methods have been proposed to divide proteins into domains from their amino acid sequences (2).

One of the earliest and still widely used approaches is homology modeling, which predicts protein domain boundaries by identifying evolutionarily conserved sequence families from multiple homologous sequence alignments. Examples of this approach include Pfam (3,4), PASS (5,6), EVEREST (7,8), ADDA (9,10) and FIEFDom (11). Another popular approach obtains domain predictions through statistical modeling or machine learning trained on known domain structures in the PDB library. These include, for example, DGS (12), Armadillo (13), DPD (14), DomCut (15), CHOPnet (16), Dompro (17), DomNet (18), PPRODO (19), kemaDom (20), DLP-SVM (21), DROP (22), H-DROP (23), DOBO (24), and the methods proposed by Galzitskaya and Melnik (25) and Tanaka et al. (26). There are also methods, such as SnapDRAGON (27), RosettaDom (28) and OPUS-DOM (29), which first construct 3D models through ab initio folding and then extract domain information directly from the structure prediction. Recently, we proposed a new method, ThreaDom (30), which deduces the domain boundary information from multiple query-to-template alignments derived by meta-server threading programs (31).

Most of these methods can only generate predictions for continuous domains (CDs) that consist of continuous residues along the query sequence, while a large number of proteins contain discontinuous domains (DCDs). For example, about 18% of the proteins currently deposited in the PDB library have at least one DCD, which consists of two or more non-sequential sequence segments (32). The detection and correct division of DCDs remains a major challenge. The 3D-model based methods, such as SnapDRAGON (27), RosettaDom (28) and OPUS-DOM (29), can in principle split the predicted 3D models into DCDs, but the success rate of ab initio structure prediction is very low for proteins with DCDs, which usually have a medium to large size (33). ThreaDom (30) can overcome the size limit owing to its threading-based template recognition, but its success relies on the existence of a similar DCD structure in the threading template library. DomEx (32) was recently proposed to predict DCDs by reassembling DCD segments guided by homology and symmetry alignments; it is able to detect DCDs beyond the structures in the template library and therefore significantly increases the accuracy of DCD predictions.

Despite the large number of methods proposed, the number of on-line web servers available to the biological community for automated domain prediction is quite low. Among the limited on-line servers, DOBO (24) generates domain boundary predictions by integrating evolutionary signals and machine learning. DomCut (15) collects linker preference profiles along the sequence, with the domain boundary positions predicted as the minima of the profiles. Scooby-domain (34) creates domain number and boundary predictions from length and hydrophobicity analyses. The server by Galzitskaya and Melnik (25) generates fast domain boundary estimations based on chain entropy and amino acid composition. DLP-SVM (21), H-DROP (23) and DROP (22) provide domain linker positions and prediction probabilities based on SVM training. Finally, the ThreaDom server (30) uses multiple threading alignments and domain conservation score profiles to generate domain boundary assignments, which have been applied successfully in various 3D structure prediction (35–37) and structure-based function annotation studies (38–41).

Here, we propose a new domain prediction server, ThreaDomEx, which combines the ThreaDom and DomEx methods into a unified and user-friendly on-line server system for efficient structural domain prediction. Compared with many of the existing servers, one of the major novelties of ThreaDomEx is the enhanced ability to predict DCD structures, mainly through the incorporation of the DomEx algorithm, which assembles non-consecutive domain segments following multiple threading template alignments. In addition, considerable effort has been made to provide advanced visualization facilities, which, for the first time, allow users to conveniently integrate human-expert knowledge and experimental insight to improve the domain prediction results of the server pipeline.

IMPLEMENTATION

Method overview

The ThreaDomEx pipeline for protein domain prediction consists of three consecutive steps (Figure 1). First, when a user uploads an amino acid sequence, the query is threaded through a representative structure library of the PDB by LOMETS (31) to identify homologous and analogous structural templates. Meanwhile, PSSpred (42) and MUSTER (43) are employed to generate secondary structure and solvent accessibility predictions for further domain structure analyses. Second, a domain conservation score (DC-score) is calculated from the multiple threading template alignments to evaluate the conservation of each residue, and the initial boundaries of the domain segments are decided from the distribution of the DC-score along the sequence (30). Finally, a boundary-clustering based strategy is used to fine-tune the boundary positions, as well as to detect DCDs from the templates. If no DCD is detected from the LOMETS templates, a segment assembly process guided by symmetric motif comparison, as proposed in DomEx (32), is employed for further detection of DCD structures.

Figure 1.

Flowchart of ThreaDomEx. The query sequence is first threaded through a representative PDB structure library to search for structural templates. Domain Conservation score (DC-score) is then calculated to evaluate the conservativeness of each amino acid in the query, with the domain boundaries assigned around the valley of the DC-score profile. Next, a boundary clustering-based strategy is used to optimize the boundary predictions and to detect discontinuous domains (DCDs). Finally, a symmetric alignment based segment assembly algorithm is employed for further DCD detection when no DCD was detected in the threading templates.

The on-line server system is built on a three-layer architecture consisting of a front-end, a server-end and a business logic layer. The front-end is implemented with HTML, CSS, JavaScript, Bootstrap CSS and D3.js to preprocess the input data from user submissions and to display the domain prediction results. The server-end is developed with PHP, MySQL and Perl for data persistence and for constructing the output pages. The business logic layer implements the back-end modeling calculations based on the ThreaDom and DomEx programs.

Domain conservation score and initial domain boundary assignment

The DC-score of the query protein is calculated from a multiple sequence alignment matrix, which is created by matching all threading template sequences to the query according to the individual query-template alignments derived by LOMETS (31). For the ith query residue, the DC-score is calculated by

\begin{equation*}
\mathrm{DC\text{-}score}(i) = 1 - \frac{1}{T}\left[ w_1 \sum_{j=1}^{T_{good}} a_{ij} + w_2 \sum_{j=1}^{T_{bad}} a_{ij} + \sum_{j=1}^{T} \left( w_3 b_{ij} + w_4 c_{ij} \right) \right] \tag{1}
\end{equation*}

where |$a_{ij} = \lambda$| (or 0) if the ith residue of the query is (or is not) aligned with a linker region of the jth template; |$\lambda = 1$| (or 0.8) if the domains of the template structure are defined by the CATH database (44) (or by the DomainParser program (45)). |$b_{ij} = 1$| (or 0) if the ith residue is (or is not) aligned with a gap region of the jth template. |$c_{ij} = 1$| (or 0) if the ith residue is (or is not) aligned with a tail region of the jth template. |$T_{good}$| is the number of good (i.e. homologous) templates with a significance score (|$Z$|) above the threshold (|$Z_0$|) specified by LOMETS, while |$T_{bad}$| is the number of templates with |$Z < Z_0$|. |$T = T_{good} + T_{bad}$| is the total number of templates identified by LOMETS.
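
As an illustration of Eq. (1), the following is a minimal Python sketch, not the ThreaDomEx implementation. It assumes that the alignment indicators have already been extracted from the LOMETS alignments into nested lists; the function name `dc_score` and the argument layout are hypothetical.

```python
from typing import List, Sequence


def dc_score(a: Sequence[Sequence[float]],
             b: Sequence[Sequence[int]],
             c: Sequence[Sequence[int]],
             is_good: Sequence[bool],
             w: Sequence[float]) -> List[float]:
    """Per-residue DC-score following Eq. (1).

    a[i][j]: lambda (1.0 for CATH, 0.8 for DomainParser) if residue i is
             aligned with a linker region of template j, otherwise 0
    b[i][j]: 1 if residue i is aligned with a gap region of template j
    c[i][j]: 1 if residue i is aligned with a tail region of template j
    is_good[j]: True if template j has Z above Z0 (assumed precomputed)
    w: the four weights (w1, w2, w3, w4)
    """
    w1, w2, w3, w4 = w
    n_templates = len(is_good)  # T = T_good + T_bad
    scores = []
    for i in range(len(a)):
        linker_good = sum(a[i][j] for j in range(n_templates) if is_good[j])
        linker_bad = sum(a[i][j] for j in range(n_templates) if not is_good[j])
        gap_tail = sum(w3 * b[i][j] + w4 * c[i][j] for j in range(n_templates))
        penalty = w1 * linker_good + w2 * linker_bad + gap_tail
        scores.append(1.0 - penalty / n_templates)
    return scores
```

Because the weights can exceed 1, a residue that many templates align to a linker can receive a score below zero; residues with no linker, gap or tail evidence score exactly 1.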

A residue is assigned as a potential domain linker residue if its DC-score is below the cutoff, i.e. |$\mathrm{DC\text{-}score}(i) < \mathrm{DC\text{-}score}_0$|. Here, |$\mathrm{DC\text{-}score}_0$| and the weight parameters (|$w_{1-4}$|) in Eq. (1) have been systematically trained by maximizing the Normalized Domain Overlap score (NDO-score) (46) on a non-redundant set of 800 proteins. The training dataset contains 400 single-domain, 314 two-domain, 57 three-domain and 29 four- or higher-order-domain proteins, which are defined according to the CATH 3.4 database (44). To increase the specificity, the parameters have been trained separately for Easy and Hard proteins. Here, the protein type (Easy or Hard) is an estimate of how difficult it is for the threading programs to detect homologous templates in the PDB library: a protein is considered an Easy target if there are more than n homologous templates with |$Z > Z_0$|, where n = 7 is the number of threading programs in LOMETS (31); otherwise it is a Hard target. The values of the optimized parameters are summarized in Table 1.

Table 1.

Optimized parameters in ThreaDomEx

Parameters	Easy targets	Hard targets
w₁	2.0	2.0
w₂	0.6	0.5
w₃	0.8	1.4
w₄	0.1	0.5
DC-score₀	0.6	0.76
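
The target classification and the cutoff rule can be sketched as follows, using the values in Table 1. This is only a sketch: the names `PARAMS`, `target_type` and `linker_residues` are hypothetical, and passing one Z-score threshold per template is an assumption about how the LOMETS thresholds would be supplied.

```python
# Optimized parameters from Table 1, keyed by target type.
PARAMS = {
    "easy": {"w": (2.0, 0.6, 0.8, 0.1), "cutoff": 0.6},
    "hard": {"w": (2.0, 0.5, 1.4, 0.5), "cutoff": 0.76},
}


def target_type(z_scores, z0_values, n_programs=7):
    """Return "easy" if more than n_programs templates have Z > Z0, else "hard".

    z_scores[k] and z0_values[k] are the Z-score of template k and the
    threshold that applies to it (assumed to be supplied by the caller).
    """
    n_good = sum(1 for z, z0 in zip(z_scores, z0_values) if z > z0)
    return "easy" if n_good > n_programs else "hard"


def linker_residues(dc_scores, cutoff):
    """1-based positions whose DC-score is below the cutoff."""
    return [i + 1 for i, s in enumerate(dc_scores) if s < cutoff]


kind = target_type([10.0] * 8, [6.0] * 8)
linkers = linker_residues([0.9, 0.5, 0.7, 0.59], PARAMS[kind]["cutoff"])
```

In this toy call, eight templates exceed their thresholds, so the Easy cutoff of 0.6 is applied and positions 2 and 4 are flagged as potential linker residues.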

Template-based prediction of DCD structures

A DCD consists of two or more segments that are from non-consecutive regions of the query sequence. If >30% of LOMETS templates contain DCDs, the query protein will be considered by ThreaDomEx as having DCDs.

To predict the boundary locations of DCDs, ThreaDomEx first clusters all the templates that contain DCDs based on the similarity of their domain distributions, i.e. all templates that have the same number of domains, with each domain having similar boundaries, are grouped into the same cluster. Here, a ‘similar boundary’ means that the difference in boundary positions between two templates is within |$\pm 5$| residues after the structure alignment of the two templates.
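
The 30% rule and the boundary-based grouping can be illustrated with the sketch below. The text does not specify the clustering procedure, so the greedy "compare with the first member of each cluster" scheme used here is an assumption, as are the function names and the representation of a template as a pair of domain count and boundary positions already mapped onto a common coordinate frame.

```python
from typing import List, Sequence, Tuple

# (number of domains, sorted boundary positions in a common coordinate frame)
Template = Tuple[int, Tuple[int, ...]]


def query_has_dcd(dcd_flags: Sequence[bool], min_fraction: float = 0.30) -> bool:
    """True if more than min_fraction of the templates contain a DCD."""
    return len(dcd_flags) > 0 and sum(dcd_flags) / len(dcd_flags) > min_fraction


def similar(t1: Template, t2: Template, tol: int = 5) -> bool:
    """Same number of domains and every boundary within tol residues."""
    n1, b1 = t1
    n2, b2 = t2
    return (n1 == n2 and len(b1) == len(b2)
            and all(abs(x - y) <= tol for x, y in zip(b1, b2)))


def cluster_templates(templates: Sequence[Template], tol: int = 5) -> List[List[Template]]:
    """Greedy grouping: join the first cluster whose first member is similar."""
    clusters: List[List[Template]] = []
    for t in templates:
        for members in clusters:
            if similar(members[0], t, tol):
                members.append(t)
                break
        else:
            clusters.append([t])
    return clusters


def largest_cluster(clusters: Sequence[List[Template]]) -> List[Template]:
    """The cluster with the most members."""
    return max(clusters, key=len)
```

A greedy scheme depends on the order in which templates are visited; a real implementation might use a different clustering strategy.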

Following the clustering process, the domain structure of the largest template cluster is used to guide the refinement of the DC-score based domain predictions. For example, if the domain boundaries of the largest template cluster and those assigned from the DC-score profile differ by no more than |$\pm 20$| residues, the separated domain segments in the DC-score model are merged into a single DCD. In addition, if the number of domains from the DC-score assignment is >3 but less than the number of domains in the largest template cluster, ThreaDomEx replaces the DC-score domain assignment with the domain structure of the largest template cluster, provided that the domain boundaries in >50% of all the templates are consistent (i.e. with an average difference <20 residues).

Further DCD prediction from symmetric segment reassembly

If no DCD is found in the threading templates, a symmetric segment assembly method is used to further detect DCDs. Here, when the number of domain segments assigned by the DC-score profile is |$\ge 3$|, every pair of nonadjacent segments is assembled, from N-terminal to C-terminal, into a putative DCD. The likelihood that a putative domain is a true DCD is assessed by three scores.
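
The enumeration of putative DCDs can be sketched as follows. This is a minimal illustration assuming that segments are given as 1-based, inclusive (start, end) pairs in N-to-C order; the function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # 1-based, inclusive (start, end)


def putative_dcds(segments: Sequence[Segment]) -> List[Tuple[Segment, Segment]]:
    """All pairs of nonadjacent segments, in N-to-C order.

    Returns an empty list when fewer than three segments are available.
    """
    if len(segments) < 3:
        return []
    return [(segments[i], segments[j])
            for i, j in combinations(range(len(segments)), 2)
            if j - i > 1]


def assemble(sequence: str, pair: Tuple[Segment, Segment]) -> str:
    """Concatenate the two segments of a pair into one putative domain sequence."""
    (s1, e1), (s2, e2) = pair
    return sequence[s1 - 1:e1] + sequence[s2 - 1:e2]
```

With four segments, for instance, three candidate pairs are produced: (1st, 3rd), (1st, 4th) and (2nd, 4th).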

First, the putative domain sequence is searched by PSI-BLAST against a non-redundant domain library collected from the SCOP (47), CATH (44) and Pfam (48) databases. The template similarity score (TS-score) is calculated by

\begin{equation*}
\mathrm{TS\text{-}score} = s \times h \times l \tag{2}
\end{equation*}

where |$s$| is the sequence identity between the putative domain and template sequences; |$h = \min(10, -\lg E)/10$| is the normalized PSI-BLAST E-value (|$E$|); and |$l$| is a factor associated with the alignment coverage (32).
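
A small numerical sketch of Eq. (2) is given below. It assumes that the sequence identity is expressed as a fraction between 0 and 1 and that the coverage factor l has already been computed as described in (32); treating an E-value of exactly zero as the capped maximum is an additional assumption made here to avoid taking the logarithm of zero.

```python
import math


def ts_score(seq_identity: float, e_value: float, coverage_factor: float) -> float:
    """TS-score = s * h * l, with h = min(10, -log10(E)) / 10."""
    if e_value <= 0.0:
        h = 1.0  # assumption: treat a reported E-value of 0 as the capped maximum
    else:
        h = min(10.0, -math.log10(e_value)) / 10.0
    return seq_identity * h * coverage_factor
```

For example, an identity of 0.5, an E-value of 1e-5 and a coverage factor of 1.0 give h = 0.5 and a TS-score of about 0.25; any E-value at or below 1e-10 saturates h at 1.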

Second, a symmetry index score (SI-score) of the PSI-BLAST alignment is defined by

\begin{equation*}
\mathrm{SI\text{-}score} = \sqrt{\left( sid_1 - sid_2 \right)^2 + \left( c_1 - c_2 \right)^2} \tag{3}
\end{equation*}

where |$sid_{1,2}$| and |$c_{1,2}$| are, respectively, the sequence identity and alignment coverage of the two component segments compared to the PSI-BLAST templates.
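
Eq. (3) is the Euclidean distance between the (identity, coverage) pairs of the two segments, which can be sketched as follows. The function name is hypothetical, and the identities and coverages are assumed to be on the same scale (fractions between 0 and 1).

```python
import math


def si_score(sid1: float, sid2: float, c1: float, c2: float) -> float:
    """SI-score of Eq. (3): distance between (sid1, c1) and (sid2, c2)."""
    return math.hypot(sid1 - sid2, c1 - c2)
```

For instance, segments with identities 0.5 and 0.2 and coverages 0.9 and 0.5 give an SI-score of about 0.5.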

Finally, a profile-profile alignment search, assisted with the secondary structure predictions, is performed by MUSTER (43) through the Pfam domain library, with the alignment score,|${S_{PPA}}$|⁠, returned.

The putatively assembled domain sequence is predicted as a true DCD if the scores calculated above satisfy the conditions TS-score > TS-score₀, |$S_{PPA} > S_{PPA}^0$| and SI-score > SI-score₀. The threshold parameters TS-score₀, |$S_{PPA}^0$| and SI-score₀ have been determined by maximizing the Matthews correlation coefficient on an independent set of training proteins (32). This step of DCD detection and validation mainly involves running the PSI-BLAST search against a library of ∼500 million single-domain sequences and the profile-profile search through the Pfam database. It takes much longer (∼3–5 h) than the preceding step of template-based DCD detection through domain boundary clustering (<1 min).
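
The acceptance rule can be written as a simple conjunction of the three conditions. The sketch below is illustrative only: the trained threshold values are not given in this text, so the thresholds are passed in as arguments, and the candidate dictionary layout is hypothetical.

```python
from typing import Dict, List, Sequence


def is_true_dcd(ts: float, s_ppa: float, si: float,
                ts0: float, s_ppa0: float, si0: float) -> bool:
    """Accept a putative DCD only if all three scores exceed their thresholds."""
    return ts > ts0 and s_ppa > s_ppa0 and si > si0


def accept_candidates(candidates: Sequence[Dict[str, float]],
                      ts0: float, s_ppa0: float, si0: float) -> List[Dict[str, float]]:
    """Keep the candidates that pass the three-score test.

    Each candidate is a dict with keys "ts", "s_ppa" and "si" (hypothetical layout).
    """
    return [cand for cand in candidates
            if is_true_dcd(cand["ts"], cand["s_ppa"], cand["si"], ts0, s_ppa0, si0)]
```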

USING THE WEB SERVER

Input

To use the ThreaDomEx server, users need to upload the amino acid sequence of the query protein. Once the sequence is uploaded, a page containing the job ID and job status information is displayed and refreshed every 10 minutes. Users can retrieve the results later by bookmarking this page or by entering the job ID in the on-line ThreaDomEx system. The server also allows users to search for a job by the submitted sequence. In addition, a URL link to the result page is sent to the email address that the user can optionally provide on the input page.

Output page and user-based interactive adjustment facilities

The procedure for multi-threading alignments, domain boundary prediction, boundary optimization and DCD detection is fully automated. The entire process of ThreaDomEx, from job submission to output generation, takes ∼4 h for a protein of 1000 residues. The data on the output page include: (i) the DC-score profile; (ii) the predicted secondary structure; (iii) a histogram of the predicted solvent accessibility; (iv) the top 50 LOMETS templates and alignments; (v) the final domain model results. Most of the output data can be adjusted interactively by the user based on human-expert knowledge and experimental insights. A snapshot of the output page for an illustrative example, the chromodomain–helicase–DNA-binding protein (UniProt ID: Q86WJ1, 897 AA), is shown in Figure 2; it was taken from http://zhanglab.ccmb.med.umich.edu/ThreaDomEx/example.php. The output data for each target are retained for three months before being removed from the on-line server. The following paragraphs briefly describe the major items contained in the ThreaDomEx output.

Figure 2.

An illustration of ThreaDomEx output page from chromodomain–helicase–DNA-binding protein (UniProt ID: Q86WJ1, 897 AA). (A) DC-score distribution along the query sequence; (B) predicted secondary structure; (C) predicted solvent accessibility; (D) top 50 threading templates by LOMETS; (E) the domain boundaries assigned by DC-score; (F) optimized boundaries and discontinuous domain (DCD) detection after the clustering process; (G) domain models editable by user; (H) DCD models after segment assembly and refinement.

DC-score profile

ThreaDomEx uses the DC-score profile to make initial domain segment assignments. As shown in Figure 2A, the predicted domain boundaries are marked by orange vertical lines, which are determined by the DC-score cutoff marked by the green horizontal line. The default DC-score cutoff is given in Table 1, but users can manually adjust the cutoff by dragging the horizontal line. Users can also modify (or add/delete) domain boundaries by dragging (or right-clicking on) the vertical lines; the domain segment division, shown in the color bar above the DC-score profile, is updated simultaneously according to the user's edits.

Predicted secondary structure and solvent accessibility

Domain boundaries of proteins are often located in coils/loops and have a higher level of solvent accessibility than other regions (25). Figure 2B and C show the distributions of predicted secondary structure (SS) and solvent accessibility (SA), which users may use to fine-tune the domain boundary locations. To make the SS and SA data easier to consult, a thin vertical line is displayed across the SS and SA panels when the user moves the mouse over the DC-score profile. If the user moves the mouse along the SS panel, detailed local secondary structure information is displayed in an enlarged pop-up window.

The top 50 LOMETS templates

This section provides information on the 50 top-ranked threading templates collected by LOMETS, which are used by ThreaDomEx to calculate the DC-score profile (Figure 2D). It includes: (i) the names of the threading programs that identified the templates; (ii) the PDB ID and a link to the PDB entry for each template; (iii) a visualization of the query-template alignments. The domains along each alignment are marked by different colors based on the domain assignment of the template structure from CATH or DomainParser, where segments in the same DCD are marked by the same color. The gray bars are regions that are not within any domain defined by CATH or DomainParser. The thin gray lines mark alignment gaps on the templates, and the small red bulges above the domain bars mark insertions relative to the query sequence. When the user moves the mouse along an alignment, an enlarged window pops up to show the start and end residues of the segments or insertions for the query and template sequences.

The domain prediction results

The domain prediction results are summarized in the top-right panel of the ThreaDomEx output page. The data include the domain boundaries predicted from the DC-score (Figure 2E), and the refined domain boundaries and DCDs detected by the template clustering and segment assembly process (Figure 2F). Both results are generated automatically by ThreaDomEx. This section also allows the user to edit the boundaries with the ‘up to merge’ button, or to initialize, save and roll back the prediction results with the corresponding buttons (Figure 2G): the ‘initiate’ button restores the original prediction results, the ‘rollback’ button reverts the results to the last modification, and the ‘save’ button saves the results. After modifying and saving the boundaries, users can re-run the DCD detection by clicking on the ‘DCD detect’ button in Figure 2H.

Although the server provides a variety of options for users to adjust the modeling process and results, it is worth noting that the ThreaDomEx server is fully automated and that such manual refinement is not required for successful domain model generation. In fact, the ThreaDomEx pipeline has been extensively trained and benchmarked on large-scale datasets, with the aim of generating optimal domain models without external information or human intervention. Thus, users are not encouraged to manually modify the automated modeling results unless they have strong evidence, such as experimental data, biological functional analyses or reliable common sense, that differs from the automated models and is deemed able to improve them.

RESULTS

Datasets and assessment criteria

Two independent datasets (Test-I and Test-II) were constructed to evaluate the performance of ThreaDomEx on domain boundary prediction and DCD detection, respectively. The Test-I set contains 630 non-homologous proteins, which include 315 single-domain, 245 two-domain, 57 three-domain and 13 four- or higher-order-domain proteins; the Test-II set has 481 non-redundant multi-domain chains, which include 326 chains containing DCDs and 155 chains with only CDs. Each protein chain in Test-II has at least three segments, with segment length >40 residues. None of the test proteins is homologous to any of the proteins used for training the ThreaDomEx program. To further rule out homology contamination, all templates that have a sequence identity >30% to the query or are detectable by PSI-BLAST with an E-value <0.05 have been excluded from the threading template library.

The true domain boundaries of the test proteins are defined based on the CATH 3.4 database (44). A predicted boundary is considered correct if it is located within |$\pm 20$| residues of the true domain boundary in CATH 3.4; this assessment criterion is the same as that used in many previous studies (13,21). Moreover, a DCD detection is considered correct only if the boundaries of all segments in the DCD are correct and the segment assembly is consistent with the DCD structure in CATH 3.4.
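
The boundary criterion can be expressed as a short check. This is a sketch under the assumption that boundaries are given as single residue positions; the function names are hypothetical, and the DCD consistency check is not covered here.

```python
from typing import List, Sequence


def boundary_is_correct(predicted: int, true_boundaries: Sequence[int], tol: int = 20) -> bool:
    """True if the predicted boundary lies within tol residues of a true boundary."""
    return any(abs(predicted - t) <= tol for t in true_boundaries)


def correct_boundaries(predicted: Sequence[int], true_boundaries: Sequence[int],
                       tol: int = 20) -> List[int]:
    """The subset of predicted boundaries that satisfy the criterion."""
    return [p for p in predicted if boundary_is_correct(p, true_boundaries, tol)]
```

For example, with true boundaries at residues 115 and 300, a prediction at residue 100 counts as correct (15 residues away), whereas a prediction at residue 250 does not.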

Domain boundary prediction on Test-I

The domain boundary prediction ability is tested on the Test-I dataset, where ThreaDomEx correctly classifies the target sequences as single- or multi-domain proteins in 81% of the cases. For the 522 ‘Easy’ proteins, for which LOMETS has a higher confidence score, the average classification accuracy is 84.7%; for the remaining 108 ‘Hard’ proteins, the accuracy is 68.5%. We also used the NDO-score, which is defined as the normalized overlap rate of all predicted domain and linker regions with the true assignments of the query structure (46), to evaluate the domain boundary prediction. The average NDO-score for the ‘All’, ‘Easy’ and ‘Hard’ targets is 0.893, 0.905 and 0.832, respectively. Figure 3A compares the NDO-score of ThreaDomEx with those of five publicly available domain predictors, including FIEFDom (11), DomPro (17), DROP (22) and PPRODO (19), on the same set of test proteins. These programs form a representative set of homology and machine-learning based approaches, and the data demonstrate the advantage and efficiency of threading-based domain prediction in accurately assigning protein domain locations.
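
As a quick consistency check on the figures quoted above, the Easy and Hard averages can be pooled with weights given by the subset sizes. The sketch below uses only the numbers reported in the text; the helper name `pooled` is hypothetical.

```python
N_EASY, N_HARD = 522, 108  # subset sizes reported for Test-I


def pooled(value_easy: float, value_hard: float) -> float:
    """Size-weighted average over the Easy and Hard subsets."""
    return (N_EASY * value_easy + N_HARD * value_hard) / (N_EASY + N_HARD)


pooled_ndo = pooled(0.905, 0.832)       # NDO-scores of the two subsets
pooled_accuracy = pooled(0.847, 0.685)  # classification accuracies
print(f"{pooled_ndo:.3f} {pooled_accuracy:.3f}")
```

The pooled values come out at roughly 0.892 and 0.819, close to the reported overall NDO-score of 0.893 and classification accuracy of 81%; the small differences are plausibly due to rounding of the subset values.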

Figure 3.

Benchmark results of ThreaDomEx in comparison with other methods. (A) Average NDO-score on the first test set of 630 proteins, with dark, gray and white histograms for all, easy and hard proteins, respectively. (B) Average NDO-score on the second test set of 481 proteins. The dark and gray histograms mark the methods with and without the ability to predict discontinuous domains (DCDs), respectively. X+DomEx refers to a hybrid method combining Method-X with DomEx to detect DCDs.

We also compare our method with a random domain predictor, in which the region allowable for domain boundary assignment is obtained by removing 40 residues from each end of the sequence (we assume in this study that a domain is no shorter than 40 residues), following the approach of Dovidchenko et al. (14). For the 245 two-domain proteins in Test-I, which have an average sequence length of 272 residues, 61.8% (157 proteins) are correctly divided by ThreaDomEx into two domains within |$\pm 20$| residues of the true boundary, while the percentage for the random prediction is 32.6%. The probability that the random predictor generates the same or a greater number of correct domain boundaries is very small (8.2E–72) based on the Gaussian error distribution (14), showing that the ThreaDomEx predictions are highly non-random.

DCD detection on Test-II

Figure 3B summarizes the NDO-score of ThreaDomEx and the other control programs for the 481 proteins from the Test-II dataset. Here, only ThreaDomEx and ThreaDom have the ability to detect DCDs; the major difference between these two programs is that ThreaDomEx uses DomEx to detect DCDs when ThreaDom does not detect any. Although the NDO-score of ThreaDomEx (0.759) is only slightly higher than that of ThreaDom (0.750), ThreaDomEx could recall 26.7% of the DCDs with 72.7% precision on the subset for which ThreaDom failed to detect any DCDs, indicating that the segment alignment and assembly process of DomEx can indeed help identify DCDs beyond template-based transferals.

To further examine the impact of the segment-assembly based DCD detection on the domain boundary prediction, we constructed two hybrid methods that combine DomEx with the two best boundary predictors, ThrDm_Bdr and FIEFDom, denoted as ThrDm_Bdr+DomEx and FIEFDom+DomEx, respectively. ThrDm_Bdr is a truncated version of ThreaDom that uses the DC-score profile to predict the domain boundaries but turns off the clustering-based boundary optimization and DCD detection. The results in Figure 3B show that the inclusion of DomEx improves the NDO-score of both programs, demonstrating a positive impact of the segment-assembly based DCD detection on the overall domain boundary prediction. Nevertheless, the ThreaDomEx pipeline, which includes the entire process of multiple threading, domain refinement and segment-assembly based DCD detection, has a higher accuracy than all of these programs.

ThreaDomEx on a public dataset

In addition to the tests on the two internal datasets, we applied ThreaDomEx to a publicly available dataset at http://web.tuat.ac.jp/∼domserv/cgi-bin/LinkerList.txt, which was previously used by Ebina et al. (21). This dataset contains 206 proteins, with 174 two-domain, 24 three-domain and 8 four-domain proteins. Under a similar homologous template filter, i.e. excluding from the threading library all templates with a sequence identity >30% to the query or detectable by PSI-BLAST with an E-value <0.05, we obtained domain boundary predictions with an average sensitivity and specificity of 0.795 and 0.601, respectively, compared to the true domain assignments. These values are significantly higher than those of random domain assignments, which have a sensitivity and specificity of 0.221 and 0.262, respectively, as calculated by Ebina et al. by randomly selecting an 11-residue window from the non-terminal region as the domain linker (21). The ThreaDomEx results also compare favorably with those of several other sequence-based predictors reported by Ebina et al., including DLP-SVM (sensitivity/specificity = 0.597/0.436) (21) and Armadillo (0.486/0.342) (13). It should be noted, however, that the dataset used by Ebina et al. (containing 182 proteins) is slightly smaller than the one used here, which may account for part of the differences between the results of ThreaDomEx and these methods.

CONCLUSION

We developed a new on-line server system, ThreaDomEx, for efficient and user-friendly protein domain prediction, which is built on multiple threading template alignments followed by domain boundary clustering and segment reassembly. Compared to traditional sequence homology and machine-learning based approaches, the threading-based domain assignment, guided by a domain conservation score profile, achieves a significantly higher domain division accuracy, as shown in the large-scale benchmark tests. In particular, a segment assembly algorithm is introduced to enhance the accuracy of both domain boundary prediction and DCD detection, which makes ThreaDomEx one of the very few on-line systems able to model DCD structures beyond template-based domain transferals.

The ThreaDomEx pipeline is fully automated. However, the overall accuracy of domain prediction can be low for non-homologous hard proteins and those with DCDs, where human-expert knowledge and insights from experimental data or biological function analyses become valuable for further improving the automated domain prediction results. Considerable effort has therefore been made to enable users to interactively edit and refine the domain predictions; this includes facilities to manually tune the DC-score threshold, delete and add domain linkers, and merge and assemble different domain segments. In addition to the final modeling results, various intermediate modeling data, including the DC-score profile, secondary structure and solvation predictions, and multiple threading template alignments, have been made available and visualizable, which not only helps users to manually improve the domain predictions, but also provides valuable information to assist further structure prediction and function annotation studies for the submitted sequences.

FUNDING

National Institute of General Medical Sciences [GM083107, GM116960]; National Science Foundation [DBI1564756]; National Natural Science Foundation of China [30700162, 61073095]; Fundamental Research Funds for the Central Universities of China [HUST2014TS138, HUST2015QN101]; China Postdoctoral Science Foundation [2014M552043]. Funding for open access charge: Fundamental Research Funds for the Central Universities of China.

Conflict of interest statement. None declared.

REFERENCES

1. Han J.H., Batey, Nickson A.A., Teichmann S.A., Clarke. The folding and evolution of multidomain proteins. Nat. Rev. Mol. Cell. Biol. 2007; 319–330.

2. Kirillova, Kumar, Carugo. Protein domain boundary predictions: a structural biology perspective. Open Biochem. J. 2009.

3. Sonnhammer E.L., Eddy S.R., Durbin. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 1997; 405–420.

4. Punta, Coggill P.C., Eberhardt R.Y., Mistry, Tate, Boursnell, Pang, Forslund, Ceric, Clements et al. The Pfam protein families database. Nucleic Acids Res. 2012; D290–D301.

5. Kuroda, Tani, Matsuo, Yokoyama. Automated search of natively folded protein fragments for high-throughput structure determination in structural genomics. Protein Sci. 2000; 2313–2321.

6. Hondoh, Kato, Yokoyama, Kuroda. Computer-aided NMR assay for detecting natively folded structural domains. Protein Sci. 2006; 871–883.

7. Portugaly, Harel, Linial. EVEREST: automatic identification and classification of protein domains in all protein sequences. BMC Bioinformatics 2006; 277.

8. Portugaly, Linial. EVEREST: a collection of evolutionary conserved protein domains. Nucleic Acids Res. 2007; D241–D246.

9. Heger, Wilton C.A., Sivakumar, Holm. ADDA: a domain database with global coverage of the protein universe. Nucleic Acids Res. 2005; D188–D191.

10. Heger, Holm. Exhaustive enumeration of protein domain families. J. Mol. Biol. 2003; 328: 749–767.

11. Bondugula, Lee M.S., Wallqvist. FIEFDom: a transparent domain boundary recognition system using a fuzzy mean operator. Nucleic Acids Res. 2009; 452–462.

12. Wheelan S.J., Marchler-Bauer, Bryant S.H. Domain size distributions can predict domain boundaries. Bioinformatics 2000; 613–618.

13. Dumontier, Yao, Feldman H.J., Hogue C.W. Armadillo: domain boundary prediction by amino acid composition. J. Mol. Biol. 2005; 350: 1061–1073.

14. Dovidchenko N.V., Lobanov M.Y., Galzitskaya O.V. Prediction of number and position of domain boundaries in multi-domain proteins by use of amino acid sequence alone. Curr. Protein Peptide Sci. 2007; 189–195.

15. Suyama, Ohara. DomCut: prediction of inter-domain linker regions in amino acid sequences. Bioinformatics 2003; 673–674.

16. Liu, Rost. Sequence-based prediction of protein domains. Nucleic Acids Res. 2004; 3522–3530.

17. Cheng J.L., Sweredoski M.J., Baldi. DOMpro: Protein domain prediction using profiles, secondary structure, relative solvent accessibility, and recursive neural networks. Data Mining Knowledge Discov. 2006.

18. Yoo P.D., Sikder A.R., Taheri, Zhou B.B., Zomaya A.Y. DomNet: protein domain boundary prediction using enhanced general regression network and new profiles. IEEE Trans. Nanobiosci. 2008; 172–181.

19. Sim, Kim S.Y., Lee. PPRODO: prediction of protein domain boundaries using neural networks. Proteins 2005; 627–632.

20. Chen L.S., Wang, Ling S.P., Jia C.Y., Wang. KemaDom: a web server for domain prediction using kernel machine with local context. Nucleic Acids Res. 2006; W158–W163.

21. Ebina, Toh, Kuroda. Loop-length-dependent SVM prediction of domain linkers for high-throughput structural proteomics. Biopolymers 2009.

22. Ebina, Toh, Kuroda. DROP: an SVM domain linker predictor trained with optimal features selected by random forest. Bioinformatics 2011; 487–494.

23. Ebina, Suzuki, Tsuji, Kuroda. H-DROP: an SVM based helical domain linker predictor trained with features optimized by combining random forest and stepwise selection. J. Comput.-Aided Mol. Des. 2014; 831–839.

24. Eickholt, Deng, Cheng. DoBo: Protein domain boundary prediction by integrating evolutionary signals and machine learning. BMC Bioinformatics 2011.

25. Galzitskaya O.V., Melnik B.S. Prediction of protein domain boundaries from sequence alone. Protein Sci. 2003; 696–701.

26. Tanaka, Yokoyama, Kuroda. Improvement of domain linker prediction by incorporating loop-length-dependent characteristics. Biopolymers 2006; 161–168.

27. George R.A., Heringa. SnapDRAGON: a method to delineate protein structural domains from sequence data. J. Mol. Biol. 2002; 316: 839–851.

28. Kim D.E., Chivian, Malmstrom, Baker. Automated prediction of domain boundaries in CASP6 targets using Ginzu and RosettaDOM. Proteins 2005; 193–200.

29. Dousis A.D., Chen. OPUS-Dom: applying the folding-based method VECFOLD to determine protein domain boundaries. J. Mol. Biol. 2009; 385: 1314–1329.

30. Xue, Wang, Zhang. ThreaDom: extracting protein domain boundary information from multiple threading alignments. Bioinformatics 2013; i247–i256.

31. Zhang. LOMETS: A local meta-threading-server for protein structure prediction. Nucleic Acids Res. 2007; 3375–3382.

32. Xue, Jang, Govindarajoo, Huang, Wang. Extending protein domain boundary predictors to detect discontinuous domains. PLoS One 2015; e0141541.

33. Zhang. Progress and challenges in protein structure prediction. Curr. Opin. Struct. Biol. 2008; 342–348.

34. George R.A., Lin, Heringa. Scooby-domain: prediction of globular domains in protein sequence. Nucleic Acids Res. 2005; W160–W163.

35. Zhang, Yang, Jang, Zhang. GPCR-I-TASSER: a hybrid approach to G protein-coupled receptor structure modeling and the application to the human genome. Structure 2015; 1538–1549.

36. Zhang. Interplay of I-TASSER and QUARK for template-based and ab initio protein structure prediction in CASP10. Proteins 2014; 175–187.

37. Meng F.C., Kurgan. DFLpred: high-throughput prediction of disordered flexible linker regions in protein sequences. Bioinformatics 2016; 341–350.

38. Adam, Michael, Lutz, Oliver, Juri. Serum albumin domain structures in human blood serum by mass spectrometry and computational biology. Mol. Cell. Proteomics MCP 2015; 1105–1116.

39. Stojanoski, Sankaran, Prasad B.V., Poirel, Nordmann, Palzkill. Structure of the catalytic domain of the colistin resistance enzyme MCR-1. BMC Biol. 2016.

40. Menon, Panwar, Eksi, Kleer, Guan, Omenn G.S. Computational inferences of the functions of alternative/noncanonical splice isoforms specific to HER2+/ER−/PR− breast cancers, a chromosome 17 C-HPP study. J. Proteome Res. 2015; 3519.

41. Ding Y.H., Gong, Dong, Liu S.M., Dong M.Q., Tang. Modeling protein excited-state structures from ‘over-length’ chemical cross-links. J. Biol. Chem. 2017; 292: 1187–1196.

42. Yan, Yang, Walker, Zhang. A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction. Sci. Rep. 2013; 2619.

43. Zhang. MUSTER: improving protein sequence profile-profile alignments by using multiple sources of structure information. Proteins 2008; 547–556.

44. Orengo C.A., Michie A.D., Jones D.T., Swindells M.B., Thornton J.M. CATH—a hierarchic classification of protein domain structures. Structure 1997; 1093–1108.

45. Guo J.T., Kim. Improving the performance of DomainParser for structural domain partition using neural network. Nucleic Acids Res. 2003; 944–952.

46. Tai C.H., Lee W.J., Vincent J.J., Lee. Evaluation of domain prediction in CASP6. Proteins-Struct. Funct. Bioinformatics 2005; 183–192.

47. Murzin A.G., Brenner S.E., Hubbard, Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995; 247: 536–540.

48. Bateman, Coin, Durbin, Finn R.D., Hollich, Griffiths-Jones, Khanna, Marshall, Moxon, Sonnhammer E.L. et al. The Pfam protein families database. Nucleic Acids Res. 2004; D138–D141.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

Issue Section:

Web Server issue
