Descriptron – A Large Scale System for Biomedical Data Integration

Daniel J McGoldrick Ph.D.

Center For Computational Pharmacology

University of Colorado Health Sciences Center.

Descriptron’s Relational Backend and Data Model:

Table overview: Data Integration Components.

The Namespace Table – Specific Types, Access, Global Connectivity Constraints.

Table description.

Example.

The Relevancy Table – Type Groupings Relevant to Semantic Types.

Table description.

Example.

The Fact Tables – Identifiers and Fact Registration. Specific Connectivity Constraints.

Table description.

Example.

The Map Tables – Method, Time of Linking, Source and Linkages to Registered Facts.

Table description.

Example.

The Graph Table – Normalized Fact Linkages Referenced by Method.

Table description.

Example.

Appendix 1. Registered Fact Types.

API Issues.

- Simplicity, flexibility.

- MySQL/Oracle compatible.

- Scalable for large systems.

- Speed.

- Language independent (accessible to all database APIs).

- Updating and search expressions that overcome combinatorial issues (State vs Path Functions, Two Step algorithm).

- Synchronized with sources.

Web Services.

WSDL (Web Services Description Language) – a complete description of a SOAP (Simple Object Access Protocol) service that provides access to a resource on the internet and can be used from local code.

SOAP Service – an interface to a remote resource that uses XML/SOAP RPC messaging with defined inputs and outputs.

SOAP Server/Processor – a local computer program that can communicate over the web via XML/SOAP calls.

Workflow – a connected set of processors that combine to perform computational tasks.


Descriptron’s Relational Backend and Data Model:

Table overview: Data Integration Components

mysql> show tables;
+-----------------+
| Tables_in_Dtron |
+-----------------+
| FlyFacts        |
| FlyGraphs       |
| FlyMaps         |
| HumanFacts      |
| HumanGraphs1    |
| HumanGraphs2    |
| HumanGraphs3    |
| HumanMaps       |
| MouseFacts      |
| MouseGraphs     |
| MouseMaps       |
| RatFacts        |
| RatGraphs       |
| RatMaps         |
| WormFacts       |
| WormGraphs      |
| WormMaps        |
| YeastFacts      |
| YeastGraphs     |
| YeastMaps       |
| namespace       |
+-----------------+
21 rows in set (0.00 sec)


The Namespace Table – Specific Types, Access, Global Connectivity Constraints.

Table description

mysql> describe namespace;
+------------------+-------------+------+-----+---------+-------+
| Field            | Type        | Null | Key | Default | Extra |
+------------------+-------------+------+-----+---------+-------+
| idtype           | varchar(40) | YES  |     | NULL    |       |
| typeddescription | text        | YES  |     | NULL    |       |
| webmethod        | varchar(10) | YES  |     | NULL    |       |
| terminalp        | char(3)     | YES  |     | NULL    |       |
+------------------+-------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

Example

mysql> select * from namespace order by idtype limit 5;
+-----------------------------------------+-------------------------------------------+------------+-----------+
| idtype                                  | typeddescription                          | webmethod  | terminalp |
+-----------------------------------------+-------------------------------------------+------------+-----------+
| affymetrix_netaffyx_affyprobe_uid       | Affymetrix affyprobe                      | wwwlnk0013 | nil       |
| affymetrix_netaffyx_chip_annot          | Affymetrix Chip                           | wwwlnk9999 | t         |
| biobase_transfac_domain_annot           | BIOBASE transfac domain                   | wwwlnk0014 | t         |
| biobase_transfac_domain_uid             | BIOBASE transfac domain                   | wwwlnk0014 | t         |
| chs-fitzimmons_vh-dissector_anatomy_uid | Center for Human Simulation Anatomy model | wwwlnk9999 | t         |
+-----------------------------------------+-------------------------------------------+------------+-----------+
5 rows in set (0.00 sec)
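
As a minimal sketch of how the namespace table might be queried from client code, the snippet below rebuilds the table in an in-memory SQLite database (a stand-in for MySQL, chosen only so the example runs anywhere) and lists the id types whose terminalp flag is 'nil'. The schema and rows are copied from the output above. Reading 'nil' and 't' as Lisp-style false and true is an assumption, and the helper name non_terminal_types is hypothetical.

```python
import sqlite3

# In-memory stand-in for the Dtron database; schema follows `describe namespace`.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE namespace ("
    " idtype VARCHAR(40), typeddescription TEXT,"
    " webmethod VARCHAR(10), terminalp CHAR(3))"
)

# Rows copied from the example output above.
rows = [
    ("affymetrix_netaffyx_affyprobe_uid", "Affymetrix affyprobe", "wwwlnk0013", "nil"),
    ("affymetrix_netaffyx_chip_annot", "Affymetrix Chip", "wwwlnk9999", "t"),
    ("biobase_transfac_domain_annot", "BIOBASE transfac domain", "wwwlnk0014", "t"),
    ("biobase_transfac_domain_uid", "BIOBASE transfac domain", "wwwlnk0014", "t"),
    ("chs-fitzimmons_vh-dissector_anatomy_uid",
     "Center for Human Simulation Anatomy model", "wwwlnk9999", "t"),
]
conn.executemany("INSERT INTO namespace VALUES (?, ?, ?, ?)", rows)


def non_terminal_types(connection):
    """Return idtypes whose terminalp flag is 'nil' (assumed to mean false)."""
    cur = connection.execute(
        "SELECT idtype FROM namespace WHERE terminalp = 'nil' ORDER BY idtype"
    )
    return [row[0] for row in cur.fetchall()]


print(non_terminal_types(conn))
```

The same SELECT statement should work unchanged against MySQL, since it uses only standard SQL.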


The Relevancy Table – Type Groupings Relevant to Semantic Types.

Table description

+---------+------+------+-----+---------+-------+
| Field   | Type | Null | Key | Default | Extra |
+---------+------+------+-----+---------+-------+
| concept | text | YES  |     | NULL    |       |
| idtypes | text | YES  |     | NULL    |       |
+---------+------+------+-----+---------+-------+

Example

| genesymbol     | (hugo_hgnc_officialsymbol_uid hugo_hgnc_aliassymbol_annot ncbi_ll_aliasprotien_annot ncbi_ll_aliassymbol_annot ncbi_ll_officialsymbol_annot ncbi_ll_preferredsymbol_annot stanford_spd_officialsymbol_annot |
| genename       | (hugo_hgnc_officialname_annot um_bbd_enzyme_annot ncbi_ll_aliasname_annot ncbi_ll_officialname_annot ncbi_ll_aliasname_annot ncbi_ll_preferredname_annot stanford_sgd_genename_annot) |
| gene           | (affymetrix_netaffyx_affyprobe_uid ebi-sib_trembl-sp_p_acc ebi-sib_trembl-sp_p_uid ebi_trembl_p_acc ebi_trembl_p_uid gu_pir_p_uid hugo_hgnc_officialsymbol_uid jax_mgd_gene_uid ncbi_genbank_p_acc ncbi_genbank_p_uid ncbi_ll_aliasprotien_annot ncbi_ll_aliassymbol_annot ncbi_ll_locus_uid ncbi_ll_officialsymbol_annot ncbi_ll_preferredsymbol_annot ncbi_refseq_np_uid ncbi_refseq_xm_acc ncbi_refseq_xm_uid ncbi_refseq_xp_acc ncbi_refseq_xp_uid
| officialsymbol | (hugo_hgnc_officialsymbol_uid ncbi_ll_officialsymbol_annot stanford_spd_officialsymbol_annot ut-ca_bind_officialsymbol_annot)
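
The idtypes column stores each grouping as a space-separated list wrapped in parentheses. A minimal sketch of turning one such cell into a list of id types is given below. The function name parse_idtypes is hypothetical, and the tolerance for a missing closing parenthesis or a trailing column border is an assumption made because some rows in the example above appear cut off.

```python
def parse_idtypes(cell):
    """Split a '(a b c)' style cell into a list of idtype strings.

    Tolerates a missing closing parenthesis and a trailing '|' column
    border, because some example rows appear to be cut off.
    """
    text = cell.strip().rstrip("|").strip()
    if text.startswith("("):
        text = text[1:]
    if text.endswith(")"):
        text = text[:-1]
    return text.split()


# The 'officialsymbol' row from the example above.
officialsymbol = parse_idtypes(
    "(hugo_hgnc_officialsymbol_uid ncbi_ll_officialsymbol_annot "
    "stanford_spd_officialsymbol_annot ut-ca_bind_officialsymbol_annot)"
)
print(officialsymbol)
```

A row that was cut off still parses, but the resulting list only contains the id types that survived the truncation.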


The Fact Tables - Identifiers and Fact Registration. Specific Connectivity Constraints.

Table description

mysql> describe HumanFacts;
+------------+-------------+------+-----+---------+-------+
| Field      | Type        | Null | Key | Default | Extra |
+------------+-------------+------+-----+---------+-------+
| nuid       | varchar(30) | YES  |     | NULL    |       |
| idvalue    | text        | YES  |     | NULL    |       |
| idtype     | varchar(50) | YES  |     | NULL    |       |
| authority  | varchar(30) | YES  |     | NULL    |       |
| deprecated | char(3)     | YES  |     | NULL    |       |
| tstamp     | bigint(20)  | YES  |     | NULL    |       |
| terminalp  | char(3)     | YES  |     | NULL    |       |
+------------+-------------+------+-----+---------+-------+

Example

+--------------+--------------------+--------------------------------+------------+------------+------------+-----------+
| nuid         | idvalue            | idtype                         | authority  | deprecated | tstamp     | terminalp |
+--------------+--------------------+--------------------------------+------------+------------+------------+-----------+
| nuid-2072786 | 7340969            | ncbi_genbank_n_uid             | NCBI-LL    | NULL       |          0 | nil       |
| nuid-2215370 | 22761563           | ncbi_genbank_n_uid             | NCBI-LL    | NULL       |          0 | nil       |
| nuid-767613  | Q8N9J3             | sib_swissprot_p_acc            | AFFYMETRIX | NULL       |          0 | nil       |
| nuid-2384655 | FLJ13556           | ncbi_ll_aliassymbol_annot      | NCBI-LL    | NULL       |          0 | nil       |
| nuid-2544940 | AAH09326           | ncbi_genbank_p_acc             | NCBI-LL    | NULL       |          0 | nil       |
| nuid-4428303 | 51475048           | ncbi_genbank_contig_uid        | NCBI-LL    | nil        | 3324241944 | t         |
| nuid-4428310 | 3610142            | ncbi_pubmed_literature_uid     | NCBI-LL    | nil        | 3324228940 | t         |
+--------------+--------------------+--------------------------------+------------+------------+------------+-----------+
7 rows in set (0.01 sec)
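
One plausible use of the fact tables, sketched here under stated assumptions, is to retrieve every registered fact whose idtype belongs to a semantic grouping such as 'genesymbol'. The snippet uses an in-memory SQLite database as a stand-in for MySQL, loads four of the example rows above, and filters them with an IN clause. The helper name facts_for_idtypes is hypothetical, and whether Descriptron itself resolves groupings this way is not stated in this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema follows `describe HumanFacts`.
conn.execute(
    "CREATE TABLE HumanFacts ("
    " nuid VARCHAR(30), idvalue TEXT, idtype VARCHAR(50),"
    " authority VARCHAR(30), deprecated CHAR(3), tstamp BIGINT,"
    " terminalp CHAR(3))"
)

# Four rows copied from the example output above.
conn.executemany(
    "INSERT INTO HumanFacts VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("nuid-2072786", "7340969", "ncbi_genbank_n_uid", "NCBI-LL", None, 0, "nil"),
        ("nuid-767613", "Q8N9J3", "sib_swissprot_p_acc", "AFFYMETRIX", None, 0, "nil"),
        ("nuid-2384655", "FLJ13556", "ncbi_ll_aliassymbol_annot", "NCBI-LL", None, 0, "nil"),
        ("nuid-2544940", "AAH09326", "ncbi_genbank_p_acc", "NCBI-LL", None, 0, "nil"),
    ],
)

# Part of the 'genesymbol' grouping listed in the Relevancy example.
GENESYMBOL_IDTYPES = [
    "hugo_hgnc_officialsymbol_uid",
    "hugo_hgnc_aliassymbol_annot",
    "ncbi_ll_aliassymbol_annot",
    "ncbi_ll_officialsymbol_annot",
]


def facts_for_idtypes(connection, idtypes):
    """Return (nuid, idvalue, idtype) for facts whose idtype is in idtypes."""
    placeholders = ", ".join("?" for _ in idtypes)
    cur = connection.execute(
        "SELECT nuid, idvalue, idtype FROM HumanFacts"
        f" WHERE idtype IN ({placeholders}) ORDER BY nuid",
        list(idtypes),
    )
    return cur.fetchall()


print(facts_for_idtypes(conn, GENESYMBOL_IDTYPES))
```

Only the placeholder list is built with string formatting. The id type values themselves are passed as bound parameters.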


The Map Tables – Method, Time of Linking, Source and linkages to Registered Facts.

Table description

mysql> describe HumanMaps;
+--------+-------------+------+-----+---------+-------+
| Field  | Type        | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| muid   | varchar(30) | YES  |     | NULL    |       |
| source | text        | YES  |     | NULL    |       |
| nuid   | varchar(30) | YES  |     | NULL    |       |
| linker | varchar(50) | YES  |     | NULL    |       |
| tstamp | bigint(20)  | YES  |     | NULL    |       |
+--------+-------------+------+-----+---------+-------+
5 rows in set (0.01 sec)

Example

+--------------+-------------+------+---------------------------+-----------+
| muid         | source      | nuid | linker                    | tstamp    |
+--------------+-------------+------+---------------------------+-----------+
| muid-2118263 | 4572462     | nil  | ncbi-ll-tmpl-parse-Hs     | 599223651 |
| muid-1255565 | 71285_at    | nil  | affymetrix-Hs-annot-parse | 535646377 |
| muid-1285564 | Hs.368007   | nil  | affymetrix-Hs-annot-parse | 535801118 |
| muid-1585604 | BAB14925    | nil  | ncbi-ll-tmpl-parse-Hs     | 597552731 |
| muid-1825598 | 1524068     | nil  | ncbi-ll-tmpl-parse-Hs     | 598140408 |
| muid-2718208 | NG_002676   | nil  | ncbi-ll-tmpl-parse-Hs     | 600301027 |
| muid-1224916 | 62274_at    | nil  | affymetrix-Hs-annot-parse | 535486389 |
| muid-1284913 | AA174142    | nil  | affymetrix-Hs-annot-parse | 535798748 |
+--------------+-------------+------+---------------------------+-----------+


The Graph Table – Normalized Fact linkages referenced by Method.

Table description

mysql> describe HumanGraphs1;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| muid  | varchar(30) | YES  |     | NULL    |       |
| nuid  | varchar(30) | YES  |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

Example

mysql> select * from HumanGraphs1 limit 15;
+--------------+--------------+
| muid         | nuid         |
+--------------+--------------+
| muid-2636134 | nuid-2636080 |
| muid-2636134 | nuid-2636081 |
| muid-2636134 | nuid-2636082 |
| muid-2636134 | nuid-2636083 |
| muid-2636134 | nuid-2636084 |
| muid-2636134 | nuid-2636085 |
| muid-2636134 | nuid-2636086 |
| muid-2636134 | nuid-2636087 |
| muid-2636134 | nuid-2636088 |
| muid-2636134 | nuid-2636089 |
+--------------+--------------+
15 rows in set (0.03 sec)
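
The graph table pairs each map record (muid) with the facts (nuid) linked to it, so the three table kinds can presumably be joined on those two columns. The sketch below illustrates such a join in an in-memory SQLite database standing in for MySQL. Treating muid and nuid as join keys is an assumption based on the column names, the helper name facts_for_map is hypothetical, and all inserted rows are invented placeholders rather than Dtron data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names follow the `describe` output for HumanFacts, HumanMaps and
# HumanGraphs1; only the columns needed for the join are created here.
conn.executescript("""
CREATE TABLE HumanFacts   (nuid VARCHAR(30), idvalue TEXT, idtype VARCHAR(50));
CREATE TABLE HumanMaps    (muid VARCHAR(30), source TEXT, linker VARCHAR(50));
CREATE TABLE HumanGraphs1 (muid VARCHAR(30), nuid VARCHAR(30));
""")

# HYPOTHETICAL rows for illustration only; they are not taken from Dtron.
conn.executemany("INSERT INTO HumanFacts VALUES (?, ?, ?)", [
    ("nuid-1", "EXAMPLE1", "ncbi_ll_officialsymbol_annot"),
    ("nuid-2", "0000001", "ncbi_ll_locus_uid"),
    ("nuid-3", "EXAMPLE2", "ncbi_ll_officialsymbol_annot"),
])
conn.executemany("INSERT INTO HumanMaps VALUES (?, ?, ?)", [
    ("muid-1", "0000001", "ncbi-ll-tmpl-parse-Hs"),
    ("muid-2", "0000002", "ncbi-ll-tmpl-parse-Hs"),
])
conn.executemany("INSERT INTO HumanGraphs1 VALUES (?, ?)", [
    ("muid-1", "nuid-1"),
    ("muid-1", "nuid-2"),
    ("muid-2", "nuid-3"),
])


def facts_for_map(connection, muid):
    """List (linker, idtype, idvalue) for every fact linked to one map record."""
    cur = connection.execute(
        "SELECT m.linker, f.idtype, f.idvalue"
        " FROM HumanGraphs1 g"
        " JOIN HumanMaps m ON m.muid = g.muid"
        " JOIN HumanFacts f ON f.nuid = g.nuid"
        " WHERE g.muid = ?"
        " ORDER BY f.nuid",
        (muid,),
    )
    return cur.fetchall()


for row in facts_for_map(conn, "muid-1"):
    print(row)
```

The describe output above shows no keys on any of these tables, so on tables of realistic size a join like this would likely need indexes on muid and nuid to perform acceptably.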


 

Appendix 1. Registered Fact Types.

+-----------------------------------------+------------------------------------------------------------------+
| idtype                                  | typeddescription                                                 |
+-----------------------------------------+------------------------------------------------------------------+
| affymetrix_netaffyx_affyprobe_uid       | Affymetrix affyprobe                                             |
| affymetrix_netaffyx_chip_annot          | Affymetrix Chip                                                  |
| biobase_transfac_domain_annot           | BIOBASE transfac domain                                          |
| biobase_transfac_domain_uid             | BIOBASE transfac domain                                          |
| chs-fitzimmons_vh-dissector_anatomy_uid | Center for Human Simulation Anatomy model                        |
| doe-mbi-ucla_dip_pxp_uid                | DOE-MBI-UCLA DIP protein interaction                             |
| doe-mbi-ucla_dip_p_uid                  | DOE-MBI-UCLA DIP protein id                                      |
| ebi-sib_trembl-sp_p_acc                 | EBI-SIB trembl-sp protein                                        |
| ebi-sib_trembl-sp_p_uid                 | EBI-SIB trembl-sp protein                                        |
| ebi_interpro_domain_uid                 | EBI domain                                                       |
| ebi_interpro_pfam_uid                   | EBI interpro protein family                                      |
| ebi_interpro_p_uid                      | EBI interpro protein                                             |
| ebi_trembl_p_acc                        | EBI trembl protein                                               |
| ebi_trembl_p_uid                        | EBI trembl protein                                               |
| embl_hssp_structure_uid                 | EMBL homology-structure id                                       |
| embl_smart_pfam_uid                     | EBI smart protein family                                         |
| flybase_flybase_gene_uid                | Flybase gene                                                     |
| germonline_germonline_pathway_uid       | UofBasel-ch germonline pathway                                   |
| goc_go-bp_concept_uid                   | GO consortium biological process uid                             |
| goc_go-bp_term_annot                    | GO consortium go biological process term                         |
| goc_go-cc_concept_uid                   | GO consortium cellular component uid                             |
| goc_go-cc_term_annot                    | GO consortium go cellular component term                         |
| goc_go-mf_concept_uid                   | GO consortium molecular function uid                             |
| goc_go-mf_term_annot                    | GO consortium go molecular function term                         |
| gu_pirsf_pfam_uid                       | UofGeorgetown protein super-family                               |
| gu_pir_p_uid                            | UofGeorgetown protein                                            |
| hugo_hgnc_aliassymbol_annot             | HUGO HGNC alias symbol                                           |
| hugo_hgnc_genbank_acc                   | HUGO HGNC genbank accession -mixed                               |
| hugo_hgnc_officialname_annot            | HUGO HGNC official gene name                                     |
| hugo_hgnc_officialsymbol_uid            | HUGO HGNC official symbol                                        |
| incyte_ypd_p_uid                        | Incyte yeast protein database                                    |
| inra-fr_prodom_domain_uid               | Prodom domain                                                    |
| iubmb_ec_p_acc                          | IUBMB EC protein                                                 |
| jax_mgd_gene_uid                        | MGD gene id                                                      |
| jax_mgd_phenotype_uid                   | Jax MGD phenotype                                                |
| jax_mgi_marker_uid                      | Jax MGI genetic marker                                           |
| jhu_gdb_orf_uid                         | Johns Hopkins GDB orf                                            |
| ku_kegg_ligand_annot                    | UofKyoto ligand                                                  |
| ku_kegg_ligand_uid                      | UofKyoto ligand                                                  |
| ku_kegg_pathway_annot                   | UofKyoto KEGG pathway                                            |
| ku_kegg_pathway_uid                     | UofKyoto KEGG pathway                                            |
| mips_mips_pxp_uid                       | Munich MIPS protein interaction                                  |
| ncbi_cdd_domain_acc                     | NCBI-conserved-domain                                            |
| ncbi_cdd_domain_annot                   | NCBI-conserved-domain                                            |
| ncbi_cog_pfam_acc                       | NCBI-COG protein family                                          |
| ncbi_genbank_contig_acc                 | NCBI-genbank contig                                              |
| ncbi_genbank_contig_uid                 | NCBI-genbank contig                                              |
| ncbi_genbank_n_acc                      | NCBI-genbank nucleotide                                          |
| ncbi_genbank_n_uid                      | NCBI-genbank nucleotide                                          |
| ncbi_genbank_p_acc                      | NCBI-genbank protein                                             |
| ncbi_genbank_p_uid                      | NCBI-genbank protein                                             |
| ncbi_gene_gene_uid                      | NCBI Entrezgene gene uid                                         |
| ncbi_grif_function_annot                | NCBI-LocusLink grif                                              |
| ncbi_ll_aliasname_annot                 | NCBI-LocusLink name alias                                        |
| ncbi_ll_aliasprotien_annot              | NCBI-LocusLink protein alias                                     |
| ncbi_ll_aliassymbol_annot               | NCBI-LocusLink symbol alias                                      |
| ncbi_ll_go_annot                        | NCBI-LocusLink gene ontology definition                          |
| ncbi_ll_grif_annot                      | NCBI-LocusLink gene reference into function                      |
| ncbi_ll_loc-map_annot                   | NCBI-LocusLink map location                                      |
| ncbi_ll_locus_uid                       | NCBI-LocusLink locus                                             |
| ncbi_ll_officialname_annot              | NCBI-LocusLink official name                                     |
| ncbi_ll_officialsymbol_annot            | NCBI-LocusLink official symbol                                   |
| ncbi_ll_organism_annot                  | NCBI-LocusLink organism                                          |
| ncbi_ll_phenotype_annot                 | NCBI-LocusLink phenotype                                         |
| ncbi_ll_phenotype_uid                   | NCBI-LocusLink phenotype                                         |
| ncbi_ll_preferredname_annot             | NCBI-LocusLink preferred name                                    |
| ncbi_ll_preferredsymbol_annot           | NCBI-LocusLink preferred symbol                                  |
| ncbi_ll_summary_annot                   | NCBI-LocusLink summary                                           |
| ncbi_mmdb_structure_annot               | NCBI-MMDB structure                                              |
| ncbi_mmdb_structure_uid                 | NCBI-MMDB structure id                                           |
| ncbi_omim_gene_uid                      | NCBI-OMIM                                                        |
| ncbi_pubmed_literature_uid              | NCBI-pubmed literature                                           |
| ncbi_refseq_nc_acc                      | NCBI-refseq complete genomic                                     |
| ncbi_refseq_nc_uid                      | NCBI-refseq complete genomic                                     |
| ncbi_refseq_ng_acc                      | NCBI-refseq genomic                                              |
| ncbi_refseq_ng_uid                      | NCBI-refseq genomic                                              |
| ncbi_refseq_nm_acc                      | NCBI-refseq mRNA                                                 |
| ncbi_refseq_nm_uid                      | NCBI-refseq mRNA                                                 |
| ncbi_refseq_np_acc                      | NCBI-refseq protein                                              |
| ncbi_refseq_np_uid                      | NCBI-refseq protein                                              |
| ncbi_refseq_xm_acc                      | NCBI-refseq model mRNA                                           |
| ncbi_refseq_xm_uid                      | NCBI-refseq model mRNA                                           |
| ncbi_refseq_xp_acc                      | NCBI-refseq model protein                                        |
| ncbi_refseq_xp_uid                      | NCBI-refseq model protein                                        |
| ncbi_refseq_xr_acc                      | NCBI-refseq non-coding RNA                                       |
| ncbi_refseq_xr_uid                      | NCBI-refseq non-coding RNA                                       |
| ncbi_taxonomy_taxon_uid                 | NCBI-taxonomy taxon                                              |
| ncbi_ug_orf_uid                         | NCBI-unigene orf                                                 |
| nlm_medline_literature_uid              | NLM medline reference                                            |
| null_null_null_null                     | null tag                                                         |
| pasteur-fr_candida_gene_uid             | Pasteur-fr candida gene                                          |
| sanger_pfam_pfam_uid                    | Sanger pfam protein family                                       |
| sanger_wormpep_p_uid                    | Sanger inst. protein id                                          |
| sib_prosite_domain_acc                  | SIB prosite domain acc                                           |
| sib_prosite_domain_uid                  | SIB prosite domain id                                            |
| sib_swissprot_p_acc                     | Swissprot protein acc                                            |
| sib_swissprot_p_uid                     | Swissprot protein id                                             |
| sri-carnegie_metacyc_pathway_acc        | Stanford-Carnegie metacyc pathway                                |
| sri-carnegie_metacyc_pathway_annot      | Stanford-Carnegie metacyc pathway description                    |
| stanford_sgd_genename_annot             | Stanford SGD genename                                            |
| stanford_sgd_gene_uid                   | Stanford SPD id                                                  |
| stanford_sgd_literature_uid             | Stanford SGD literature                                          |
| stanford_sgd_orf_uid                    | Stanford SGD orf                                                 |
| stanford_sgd_phenotype_annot            | Stanford SGD phenotype                                           |
| stanford_sgd_summary_annot              | Stanford SGD description                                         |
| stanford_smd_expt_uid                   | Stanford microarray experiment                                   |
| stanford_spd_aliassymbol_annot          | Stanford SPD alias gene symbol                                   |
| stanford_spd_genesymbol_annot           | Stanford SPD gene symbol                                         |
| stanford_spd_officialsymbol_annot       | Stanford SPD official gene symbol                                |
| tigr_cmr_microbial_uid                  | TIGR comprehensive microbial                                     |
| tigr_egad_m_uid                         | TIGR EGAD rna                                                    |
| tigr_tgi_orf_uid                        | TIGR orf                                                         |
| tigr_tigrfams_pfam_uid                  | TIGR pfam                                                        |
| ucsd_tc-db_p_uid                        | UCSD tc-db protein                                               |
| uf-de_euroscarf_strain_uid              | UofFrankfurt-de EUROSCARF strain                                 |
| umber-uk_prints_domain_uid              | UofManchester domain id                                          |
| umn_cgc_nomenclature_genename_annot     | University of Minnesota Caenorhabditus genetics center gene name |
| umn_cgc_nomenclature_genesymbol_annot   | University of Minnesota Caenorhabditus genetics center symbol    |
| um_bbd_enzyme_annot                     | UofMichigan biodegredation enzyme                                |
| um_bbd_enzyme_uid                       | UofMichigan biodegredation enzyme                                |
| um_bbd_pathway_annot                    | UofMichigan biodegredation pathway                               |
| uniprot_swissprot_genesymbol_annot      | Uniprot_swissprot_genesymbol                                     |
| uniprot_swissprot_keyword_annot         | Uniprot_swissprot keyword                                        |
| uniprot_swissprot_litdates_annot        | Uniprot citation dates                                           |
| uniprot_swissprot_plength_annot         | Uniprot protein length                                           |
| uniprot_swissprot_pmw_annot             | Uniprot molecular weight                                         |
| uniprot_swissprot_proteinname_annot     | Uniprot_swissprot protein name                                   |
| uniprot_swissprot_p_acc                 | Uniprot_swissprot protein                                        |
| uniprot_swissprot_p_seq                 | Uniprot_swissprot protein sequence                               |
| uniprot_swissprot_seqheader_annot       | Uniprot_swissprot sequence header                                |
| uniprot_trembl_genesymbol_annot         | Uniprot_trembl_genesymbol                                        |
| uniprot_trembl_go_uid                   | GO term id                                                       |
| uniprot_trembl_keyword_annot            | Uniprot_trembl keyword                                           |
| uniprot_trembl_litdates_annot           | Uniprot citation dates                                           |
| uniprot_trembl_plength_annot            | Uniprot protein length                                           |
| uniprot_trembl_pmw_annot                | Uniprot molecular weight                                         |
| uniprot_trembl_proteinname_annot        | Uniprot_trembl protein name                                      |
| uniprot_trembl_p_acc                    | Uniprot_trembl protein                                           |
| uniprot_trembl_p_seq                    | Uniprot_trembl protein sequence                                  |
| uniprot_trembl_seqheader_annot          | Uniprot_trembl sequence header                                   |
| UofW_foundation-anatomy_anatomy_uid     | UofW Foundation model of Anatomy id                              |
| ut-ca_bind_officialsymbol_annot         | University of Toronto BIND protein                               |
| ut-ca_bind_pxp_uid                      | University of Toronto BIND protein interactioon                  |
| whitehead-mit_human-snp-db_marker_uid   | Mit-whitehead Human SNPs                                         |
| whitehead-mit_mouse-snp-db_marker_uid   | Mit-whitehead mouse SNPs                                         |
| wusm-sanger_wormbase_genename_annot     | Sanger-WUSM C.elegans coding gene name                           |
| wusm-sanger_wormbase_genesymbol_annot   | Sanger-WUSM C.elegans gene symbol                                |
| wusm-sanger_wormbase_orf_uid            | Sanger-WUSM C.elegans coding sequence [CDS] id                   |
| wusm-sanger_wormbase_p_acc              | Sanger-WUSM C.elegans peptide accession                          |
| wusm-sanger_wormbase_p_pid              | Sanger-WUSM C.elegans peptide id                                 |
+-----------------------------------------+------------------------------------------------------------------+
150 rows in set (0.00 sec)