XMDR

eXtended MetaData Registry (XMDR) Project

XMDR Metadata/Terminology Content Datasets Priority List and Content Characterization
Frank Olken (LBNL)
Introduction

This is a list of proposed datasets which are being given priority for characterization and loading into the XMDR prototype.

We have also included links to the content characterization survey forms for most of the datasets.

Criteria for selection:

  • Importance to sponsors
  • Unclassified
  • Availability (via licensing, open source)
  • Permission to post access to the web (preferably unrestricted)
  • Structural (graph theoretic) diversity
  • Possibility of cross linking to other metadata datasets.
  • Opportunities to record evolution of the metadata dataset.
  • Desire to have metadata datasets from diverse subject domains.
  • Inclusion of code sets, and evolving code sets.
  • Availability of documentation
  • Importance to prospective users
  • Manageable size of the metadata set.
Comments

We have omitted the Alexandria Digital Library Gazetteer, and the NIMA Foreign Gazetteer. It is unclear to me [FO] what we would gain from adding these, and I think we have too many datasets already for the first year.

Installation of metadata datasets is being done by Harold Solbrig (Apelon, Inc.) via LexGrid and (currently) by Karlo Berket (LBNL). Initial loading at LBNL was done by Kevin Keck (LBNL).

Notation

DAG = directed acyclic graph.
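The graph structure terms used in the tables below (tree, forest, DAG, directed graph) can be told apart mechanically. The following is a minimal sketch, not part of the XMDR software: the function name `classify` and the (parent, child) edge-list input are assumptions made for illustration, and each distinct edge is assumed to appear only once.

```python
from collections import defaultdict

def classify(edges):
    """Classify a list of (parent, child) edges as
    'tree', 'forest', 'DAG', or 'directed graph' (hypothetical helper)."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # Kahn's algorithm: repeatedly remove nodes with no remaining incoming edges.
    remaining = {n: indegree[n] for n in nodes}
    queue = [n for n in nodes if remaining[n] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for c in children[node]:
            remaining[c] -= 1
            if remaining[c] == 0:
                queue.append(c)

    if visited < len(nodes):
        return "directed graph"   # some nodes lie on a cycle
    if any(indegree[n] > 1 for n in nodes):
        return "DAG"              # acyclic, but some node has several parents
    roots = sum(1 for n in nodes if indegree[n] == 0)
    return "tree" if roots == 1 else "forest"
```

For example, a thesaurus whose related-term links point in both directions contains a cycle, so `classify([("animal", "dog"), ("dog", "leash"), ("leash", "dog")])` reports a general directed graph rather than a tree.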

Content Priority List
Columns: Dataset Name | XMDR Contact | Graph Structure | Priority | Licensing Issues | Survey Form Status | LexGrid Loading Status | XMDR Loading Status | References and Comments
DTIC Thesaurus (Defense Technical Information Center Thesaurus) Gail Hodge (IIA for USGS) Directed graph (tree + related terms) 1 No DTIC Thesaurus survey form
Yes Yes This is an outdated version.
NCI Thesaurus (National Cancer Institute Thesaurus) Sherri De Coronado (NCI) Directed graph (tree + related terms) 1 No NCI Thesaurus survey form
Yes Yes .
NCI caDSR (National Cancer Institute Data Standards Repository) Sherri De Coronado (NCI) Directed graph 1 No survey form NA In progress .
ISO 3166 Country Codes Frank Olken List 1 No survey form
Yes, may need a language reload Yes ISO 3166 Country Codes Download Page (English and French) or extract from EPA EDR
GEMET (GEneral Multilingual Environmental Thesaurus) Gail Hodge (IIA for USGS)
and Linda Spencer (EPA)
Directed graph (trees + related terms) 1 No missing Yes Yes Bruce Bargmeyer has a new set of GEMET files from 2006/04. Nothing has been done with the new GEMET files.

Multilingual.

EPA Terms of the Environment Linda Spencer (EPA) List 1 No survey form for overall TRS. No No This is a glossary (with definitions). It is part of EPA's TRS (Terminology Reference System).

Gail Hodge will get a copy of text file from Linda Spencer.

EDR Data Elements (Environmental Data Registry Data Elements) Larry Fitzwater (EPA) Directed graph, trees, lists 1 No survey form NA In progress We have an Oracle dump of EDR. It has been loaded into an Oracle 10.1 DB at LBNL. We have ER diagrams of the schema from Doug Mann.
UMLS (non-proprietary portions) Harold Solbrig (Apelon, Inc.) Directed graph 1 No survey form Probably want to use NCI edit of the UMLS. Yes (NCI Metathesaurus version) No Kevin has downloaded the files from UMLS; we need to remove the copyrighted portions (i.e., run MetamorphoSys).

Some non-English content.

USGS Geographic Names Information System (GNIS) Gail Hodge (IIA for USGS) Tree / lists 1 No survey form
No No What portion (level of resolution) to get, how?

Formerly, Board of Geographic Names.

Foreign language content?

ITIS (Integrated Taxonomic Information System) Gail Hodge (IIA for USGS) Directed acyclic graph 1 No survey form
Postponed No Downloaded into SKOS. Harold Solbrig and Joel Saches have copies.
NBII Biocomplexity Thesaurus Gail Hodge (IIA for USGS) Directed graph, (tree + related terms) 1 No Survey form in Word - but needs lots of work. Yes Yes .
ISO Language Identifiers ISO 639-3 Part 3 Gail Hodge (IIA) List 1 No(?) Missing survey form. From TC37. Yes No From TC37. ISO 639-3 uses 3-letter codes. Includes both English and French names of languages. Unclear which part. Allows countries and regions as modifiers.

Jennifer DeCamp at MITRE is active in language code standards.

ISO Language Identifiers ISO 639-1,2 Gail Hodge (IIA for USGS) List 1 No missing Yes No From Library of Congress. ISO 639-1 = 2-letter language codes, ISO 639-2 = 3-letter language codes
ISO Currency Code - ISO 4217 ? List 1 No missing Yes Yes
Agrovoc Harold Solbrig (Apelon, Inc.) Multi-lingual Thesaurus 1 No missing Yes No Trying to become an ontology, but not the version we have.

Note that Agrovoc has some 17 languages.

Water Quality related ontology Kevin Keck (LBNL)

(Palle Haastrup, ISPRA, Italy)

Ontology 2 No missing No No. Discuss at Ecoterm meeting.
EPA Water Quality Terminology Gail Hodge (IIA) Taxonomy 2 No missing No No. For web browser use. G. Hodge will investigate.
EPA/CDC Air Quality Metadata Larry Fitzwater (EPA) ? 2 No missing NA No Via EDR (Environmental Data Registry). Originates from the Centers for Disease Control.

Geospatial metadata (?)

EEA (European Environment Agency) Data Dictionary Bruce Bargmeyer (LBNL) Metadata about data elements 2 No missing No No Partial implementation of ISO/IEC 11179
EPA Office of Research and Development Bruce Bargmeyer (LBNL) Ontology for metadata extraction, environmental domain ontology 2 No missing No No From EPA ORD research corpora. Possible use of Cheshire 3. Extract from PDF files. Possible use of GATE or UIMA or IXTI extraction frameworks. Goal is to index and search the research corpora.

Texts are all English.

Omega Linguistic Ontology Ed Hovy (ISI), Kevin Keck (LBNL) Repository of shallow semantic terms. 1 probably OK missing No No Derived from Mikrokosmos, WordNet.
Euro Wordnet ? Multi-lingual linguistic ontology 2 ? missing No No European languages.

Still active?

Dublin Core Multilingual Metadata Harold Solbrig (Apelon, Inc.) Vocabulary 2 No missing Yes No Multilingual = 25 languages.
Logical Observation Identifiers Names and Codes (LOINC) Harold Solbrig (Apelon, Inc.) Faceted Classification 2 No LOINC survey form No, Use original LOINC files, not NCI version. No This is a vocabulary related to laboratory tests.

Primarily in English, perhaps also Japanese and Chinese.

Adult Mouse Anatomical Dictionary Sherri De Coronado (NCI) DAG 2 Unknown, contact Jackson Labs. missing
Yes Yes English only.
Getty Thesaurus of Geographic Names (TGN) Frank Olken (LBNL) ? Forest of trees, some DAGs 2 Requires license from Getty. missing

Apparently can be downloaded if Getty okays. Available as either XML or relational tables.

No No We need to contact Getty concerning a license.
NASA Semantic Web Earth and Environmental Terminologies (SWEET) Ontologies Gail Hodge (IIA for USGS) Directed Graph, Axiomatized Ontology 2 No survey form No Partially loaded Encoded in OWL. Use Biosphere, Earth Realms.

English only.

SIC (Standard Industrial Classification System) Frank Olken (LBNL) Tree 2 No missing No No From US Census Bureau

Statistical Classification System.

NAICS (North American Industry Classification System) Frank Olken (LBNL) Tree 2 No missing No No From U.S. Census Bureau in English. French from Canada, Spanish from Mexico (?)

Statistical classification system.

NAICS - SIC mappings Frank Olken (LBNL) Bipartite Graph 2 No (?) missing No No From US Census Bureau

Statistical classification system.

UNSPSC (United Nations Standard Products and Services Codes) Harold Solbrig (Apelon, Inc.) and Frank Olken (LBNL) Tree 2 None (?), from U.N. survey form Yes No Available in English, French, German, Spanish, Italian, Japanese, Korean, Chinese, Portuguese.
IETF Language Identifiers RFC 3066 Harold Solbrig (Apelon, Inc.) List 2 No survey form NA No Supersedes IETF RFC 1766

Not a code set, specifies how to build a registry.

EPA Chemical Substance Registry System (SRS) Larry Fitzwater (EPA) and Linda Spencer (EPA) Classified by EPA legislation category 2 No missing No No Schema is under revision. Contact Larry Fitzwater (EPA). Estimated completion 2006-09.

Mappings from old to new?

Units Ontology Frank Olken (LBNL) 2 Unknown missing No No. From Kendall/ Wallace/ Gruber

Not publicly released yet.

HL7 Terminology Harold Solbrig (Apelon, Inc.) Unknown structure 3 No survey form
No No .
HL7 Data Elements Harold Solbrig (Apelon, Inc.) Unknown structure 3 Unknown survey form Yes No .
(V.A.) National Drug File Reference Terminology (NDF-RT) Sherri De Coronado (NCI) Directed Graph (?) (DAG + related terms) 3 None anticipated, check. missing Yes No VA National Drug File Reference Terminology: A Cross-Institutional Content Coverage Study ... Contact Mark Tuttle at Apelon, Inc.

Moribund at VA, awaiting action from VA.

GO (Gene Ontology) Kevin D. Keck (LBNL) and Frank Olken (LBNL) Directed acyclic graph (?) 3 No missing Yes No .
Foundational Model of Anatomy (FMA) Harold Solbrig DAG (?) 3 Need to contact Rosse (or perhaps Barry Smith) survey form
Yes No Awaiting licensing release from Cornelius Rosse.
EPA Browse Vocabulary Linda Spencer (EPA) Tree, small (100's of concepts), taxonomy 3 No We have a survey form, but it needs lots of work. Yes Yes Est. 32 KB text files, has NO DEFINITIONS.
Deferred Content Priority List

The following datasets are deferred until after the first set of content datasets is assimilated. The reasons for deferral are sponsors' priorities, licensing issues, or complexity. SNOMED has been deferred in preference to NCI Thesaurus, in part due to anticipated licensing problems.

Columns: Dataset | XMDR Contact | Graph Structure | Priority | Licensing | Survey Form Status | LexGrid Loading Status | XMDR Loading Status | References and Comments
SNOMED CT Sherri De Coronado (NCI) and Harold Solbrig (Apelon, Inc.) Directed Graph 4 No license for non-US users.

Hence it cannot be put on the web site.

survey form Yes No Start licensing discussion now, but defer loading.
EPA Terminology Reference System (TRS) Larry Fitzwater (EPA) and Linda Spencer (EPA) Tree 4 No survey form Overlaps greatly with GEMET, hence defer, except for the terms of the environment.
BioPAX Ontology Frank Olken (LBNL) Directed Graph 4 No missing ? No Deferred due to personnel constraints, limited use.

© 2007, Lawrence Berkeley National Laboratory
maintained by Karlo Berket
Credits: The research and development of the eXtended MetaData Registry is supported by a variety of participating organizations.