Correlating protein interaction network and phenotype network to predict disease genes




Deciphering the genetic basis of human diseases is an important goal of biomedical research. On the basis of the assumption that similar diseases are caused by functionally related genes, we propose a computational framework that integrates human protein-protein interactions, disease phenotype similarities, and known gene-phenotype associations in order to capture the complex relationships between phenotypes and genotypes. We develop a tool named CIPHER to predict and prioritize disease genes, and we show that the global concordance between the human protein network and the phenotype network reliably predicts disease genes. Our method is applicable to genetically uncharacterized phenotypes, effective in the genome-wide scan of disease genes, and also extendable to explore gene cooperativity in complex diseases. The predicted genetic landscape of over 1000 human phenotypes reveals the modular organization of phenotype-genotype relationships and contributes to the understanding of how genotype may determine phenotype. The genome-wide prioritization of candidate genes for over 5000 human phenotypes, including those with under-characterized disease loci or even those lacking known association, is publicly released to facilitate future discovery of disease genes.

This website hosts two versions of the predicted genetic landscape of human disease. One is predicted using a highly reliable protein interaction network, HPRD. The other is predicted on an extended network combining several high-quality protein interaction databases. The results based on the HPRD network may be more reliable, while those based on the extended network may be more powerful in discovering novel disease genes.

The top 1000 genes for each phenotype can be explored here. We are going to update this website to make it more user-friendly for biologists. We will integrate gene annotations, map the scores onto the genome, visualize the results on the protein interaction network, provide tools for retrieving specific diseases or genes, etc.


The predicted genetic landscape of human disease
Connecting 5080 human phenotypes with 14433 human genes
Data Description Format Download
Genome-wide prioritization results
Each column of 'rank' is a list of ranked genes for one phenotype, and the corresponding column of the 'score' matrix holds their scores. For example, for results based on the extended network, column 243 is breast cancer [MIM #114480]; the first three elements of this column are the IDs of the top 3 ranked genes, 102, 1370 and 1502, representing BRCA1, ATM and MDC1, respectively. Their scores are the first three elements of column 243 in 'score': 0.3600, 0.3380 and 0.3348. Phenotypes are placed in the same order as in the disease identifier file.
*_all: results for all genes
*_top1000: results for the top 1000 genes
Format: MAT file (MATLAB 7.0.1)
Download, based on the HPRD network: score_all 277Mb, score_top1000 32Mb, rank_all 80Mb, rank_top1000 8Mb
Download, based on the extended network: score_all 574Mb, score_top1000 31Mb, rank_all 135Mb, rank_top1000 9Mb
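The column lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `top_genes` and the list-of-rows layout are hypothetical, and the toy matrices are not the released data. The real matrices are MAT files, which could be read with a tool such as `scipy.io.loadmat`, but the variable names stored inside those files are not stated here. Note that MATLAB column numbers are 1-based, so they are shifted by one for Python indexing.

```python
def top_genes(rank, score, phenotype_col, k=3):
    """Return the top-k (gene_id, score) pairs for one phenotype.

    rank and score are matrices stored as lists of rows (an assumption made
    for this sketch); phenotype_col is the 1-based column number, following
    the MATLAB convention used in the description.
    """
    j = phenotype_col - 1  # convert the 1-based column number to a 0-based index
    return [(rank[i][j], score[i][j]) for i in range(k)]


# Toy 3x2 matrices (hypothetical): column 2 reuses the breast cancer values
# quoted in the description, column 1 is made-up filler.
rank = [[7, 102], [8, 1370], [9, 1502]]
score = [[0.9, 0.3600], [0.8, 0.3380], [0.7, 0.3348]]

breast_cancer_top3 = top_genes(rank, score, phenotype_col=2, k=3)
print(breast_cancer_top3)
```

With the released matrices, the same call would use `phenotype_col=243` for breast cancer in the extended-network results, and the returned gene IDs would then be translated through the gene identifier file.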

Protein interaction network
The HPRD network: 34364 non-redundant protein-protein interactions between 8919 human proteins. Downloaded from the HPRD database in May 2007. Tab-delimited.
The extended network: 72431 non-redundant protein-protein interactions between 14433 human proteins. Assembled from the HPRD, BIND, MINT and OPHID databases.
Gene-phenotype network
the HPRD network
These two files contain the gene-phenotype associations used in this study. After automatic gene identifier mapping, the original associations are arranged in the following format:
<MIM ID><tab><Gene ID 1><tab><Gene ID 2>.....

For example:
254200 249 2588 253 3642
indicates that the phenotype 'MIM:254200' has four known disease genes: 249, 2588, 253 and 3642.
the extended network
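A minimal sketch of reading the gene-phenotype format described above, one phenotype per line with the MIM ID followed by its gene IDs. The function name `parse_associations` is hypothetical, and treating gene IDs as integers is an assumption based on the example line.

```python
def parse_associations(lines):
    """Map each MIM ID (kept as text) to its list of gene IDs."""
    associations = {}
    for line in lines:
        fields = line.split()  # splits on tabs and any other whitespace
        if not fields:
            continue  # skip blank lines
        mim_id = fields[0]
        # Assumption: gene IDs are integers, as in the example line.
        associations[mim_id] = [int(g) for g in fields[1:]]
    return associations


# The example line from the description, with tabs restored.
example = parse_associations(["254200\t249\t2588\t253\t3642"])
print(example)
```

For a downloaded file, the same function could be given an open file handle, since iterating over a text file yields its lines.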
Gene Identifier
for the HPRD network
Gene IDs, five columns: Gene ID used in 'score' and 'rank'<tab>Gene ID in HPRD (or extended)<tab>UniProt/Swiss-Prot Accession<tab>RefSeq Peptide ID<tab>HGNC symbol. For example:

1 02995 Q13485 NP_005350.1 SMAD4

the extended network
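The five-column gene identifier format can be read as sketched below, for instance to translate the IDs found in 'rank' into HGNC symbols. The names `GeneRecord` and `parse_gene_line` are hypothetical. The network-specific ID is kept as text because the example value, 02995, has a leading zero that an integer conversion would drop.

```python
from typing import NamedTuple


class GeneRecord(NamedTuple):
    gene_id: int         # ID used in 'score' and 'rank'
    network_id: str      # ID in HPRD (or extended); text preserves leading zeros
    uniprot: str         # UniProt/Swiss-Prot accession
    refseq_peptide: str  # RefSeq peptide ID
    hgnc_symbol: str     # HGNC symbol


def parse_gene_line(line):
    """Parse one tab-delimited line of the gene identifier file."""
    # Splitting on "\t" only keeps empty fields in place, so a row with a
    # missing annotation would still yield five columns.
    gene_id, network_id, uniprot, refseq, symbol = line.rstrip("\n").split("\t")
    return GeneRecord(int(gene_id), network_id, uniprot, refseq, symbol)


# The example row from the description, with tabs restored.
record = parse_gene_line("1\t02995\tQ13485\tNP_005350.1\tSMAD4")
symbol_by_id = {record.gene_id: record.hgnc_symbol}
print(symbol_by_id)
```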
Disease Identifier
for both networks
Each row is a MIM ID.
Phenotype network
Phenotype similarity scores for 5080 MIM phenotypes used in this study. Please contact Dr Han G Brunner for the dataset.
Wu X, Jiang R, Zhang MQ, Li S (2008) Network-based global inference of human disease genes. Molecular Systems Biology, 4:189