XMDR

eXtended MetaData Registry (XMDR) Project

XMDR: Proposed Prototype Architecture

Version 1.01, edited by Kevin D. Keck and John L. McCarthy, February 3, 2005

To see a graphic representation of the XMDR architecture, please refer to the architecture diagram [PDF] [SVG] [PNG].

Introduction

This is a draft architecture for the first prototype for the Extended Metadata Registry (XMDR) project. We highly recommend reading the Use Cases document before reading this one.

Because "architecture" is a scale-independent concept, as equally applicable to microchip design as to the design of the entire Internet, some confusion often arises when discussing "the architecture" of a system as to which scale or scales are being addressed. With respect to XMDR, we have identified architectures at two distinct scales which are of primary interest, which we will refer to as internal architecture and external architecture. The internal architecture describes the way in which components are assembled to create the prototype XMDR itself. The external architecture describes the way in which the XMDR is coupled to other systems to form an encompassing application, intranet, extranet, framework, grid, web, or other type of larger distributed "system". These two very different levels of architecture are addressed in turn below.

I. External Architecture

The most popular approach to an external architecture is currently being called Service-oriented Architecture, or SOA. It represents the latest generation of distributed architecture framework in the tradition of RMI, COM/DCOM, CORBA, RPC, etc. The favored invocation protocols are XML Web Services (SOAP, WSDL, UDDI, etc.), but the modern approach is intended to be open-ended enough to also work with any legacy or other less-common RPC-type protocol required in a particular setting.

It is well understood, if not widely appreciated, that SOAP is a marked departure from the so-called REST architectural style of the existing World Wide Web, which the World Wide Web Consortium (W3C) continues to favor within its very high-profile "Semantic Web Activity". The primary distinction is the emphasis in REST on interface genericity ("genericity" as in generic, not generat(iv)e), and most particularly on the very simple yet powerful "interface" embodied in URIs and HTTP "GET", "PUT", and "POST". For example, the RDF standard is self-consciously a document format, not a service specification. Within the W3C, the favored watchword is URI addressability. SOAP, by contrast, is a protocol designed to support a more object-oriented approach, wherein an address (in the case of SOAP, a URL) identifies an object (instance) which supports a more elaborate [object-oriented] "interface", often consisting of a rather diverse set of [strongly-typed] "methods".

Both architectural approaches arguably have distinct advantages and disadvantages, and in any case the choice of one over the other within a given organization or community will be driven by much more salient concerns than which one might be favored by a revised 11179 standard. In order to serve as a truly "global" registry, as chartered, it is therefore highly desirable that a Metadata Registry offer support for both encompassing architectural styles. That said, experience has shown that it is much easier to wrap a REST interface in a SOAP interface than vice versa. For this reason we will be implementing a REST interface first, with a SOAP wrapper to follow when time and resources permit.

The 11179 metadata registry standard is deliberately written to be data model independent, in two senses:

  1. It can be used to describe data elements and other constructs within any data model; and
  2. Its own abstract model (defined in UML) can be "implemented in" (i.e., described by means of) any sufficiently powerful data model

Within current 11179 Metadata Registries, the Classification_Scheme component frequently differs from other types of 11179 Administered_Items because it is often an extra-registry artifact, rather than merely an intra-registry item. Moreover, many Classification_Schemes may have more complex substructure than other current Administered_Items. We thus need to be able to represent external objects that may have arbitrarily complex internal structure of their own.

After extensive discussion and investigation, we propose to use XML as both a data model and data exchange format for the XMDR Prototype. In addition to supporting the REST architectural style, this approach will greatly simplify both specification and development of an exemplar XMDR system. While we gave some consideration to using some less conventional representations (such as dialects of XMI and SCL), none seemed to have distinct advantages over the two major W3C formats: XML and RDF (W3C's Resource Description Framework). The exchange format is not for exchange between the registry and the sources of concept systems (registry input); rather, it is for exchange between the XMDR and clients. Concept systems may be received in the form of their native relational, hierarchical, or other data model. Internally, the XMDR may store the concept system in both its native data model and the canonical XML data model.

We further concluded that XML would be preferable to RDF for our basic XMDR data representation because it is more general as well as simpler to understand and use. Although RDF is layered on top of XML and uses XML syntax, in practice it interposes a very different data model from the world of XML and XML Schemas—namely open-ended sets of subject-predicate-object triples. RDF is interpreted with "open world" semantics, which means, among other things, that a description in RDF can never be assumed to be complete (e.g., in RDF, assertion of four particular names for a resource does not imply that those are the only names for that resource). A more concrete difference (which is partly a consequence) is that an RDF "schema" only constrains what can be true, not what must actually be stated in any particular RDF graph (document). This last difference makes RDF a rather poor basis (by itself) for information exchange. In practice this may be mitigated by obtaining RDF data as a response to a very explicit query, since the association of the graph with a particular query provides additional information due to the more explicit operational semantics of the query language. Such association could be represented explicitly within an encompassing "envelope" message (as was done, perhaps not coincidentally, in ACL), but that broader data model is quite a bit more complex, particularly if the query language is not standardized (as was the case in ACL). This has led us to conclude that, while RDF certainly has its place, it is not (by itself) the most appropriate language for exchanging the highly constrained messages to which XML is perfectly suited, and any non-standard extended message format would largely defeat the purpose of using RDF at all. It should be possible, however, to design XML schemas for XMDR in such a way that conforming XML documents can also be parsed as RDF, if that seems to be a useful feature.

The important points are that each Administered_Item in the prototype XMDR will be contained in and/or described by a specific XML data structure, that each such XML data structure will be a separate XML document, and that each such XML document can be searched and retrieved as a single document file.



II. Internal Architecture

Our list of potential platforms and components is rather extensive, and includes components geared toward a very wide spectrum of system architectures. This section attempts to lay out the internal architecture we believe is optimal to adopt for the initial XMDR prototype, based on comparisons with a number of alternatives. This choice of architecture will in turn help us prune our list of components to be evaluated in depth, and help to clarify the criteria on which they should be evaluated.

We initially identified four candidate architectural approaches. The salient distinction is in where the application logic is incorporated.

  • The traditional approach is a typical server stack consisting of a persistence layer and an application layer. Client applications interact with the application layer through either a protocol-based API or a provided client-side "driver". Use of a standard API between the persistence layer and application layer (e.g., SQL/JDBC, JDO, or EJB Persistence) provides some degree of modularity, but the application logic is still tightly coupled to the data model of the persistence abstraction. This approach fits pretty well within SOA, but the result is relatively rigid.

  • The DBMS-centric approach moves most, if not all, of the "application logic" into declarative constraints, triggers, etc. within the DBMS. This can be implemented with an RDBMS back-end, with a logic-based database, or possibly with an XML- or RDF-based DBMS. One downside is that constraint and trigger specification are still much less standardized than data model and query interfaces, so this approach tends to lock the system in to a particular DBMS platform and presents a greater learning curve for any developer not already familiar with that specific platform. It is also often difficult to override the default exception-handling behavior when integrity violations do occur.

  • The model-driven approach is rather similar to the DBMS-centric approach, but the constraints and logic are specified in a platform-independent modeling language which is translated into application logic at compile time; that is, all the application code is generated from the model. This clearly avoids the (runtime) platform lock-in associated with the DBMS-centric approach, but it is not as mature nor as familiar to the current XMDR developers, and much of the available software tooling may be beyond the budget of the XMDR project.

  • The fully modular approach is exemplified by some highly successful open-source software such as the Apache Web Server, the Eclipse IDE, and the Protégé Ontology Editor. The idea is to start from an abstract model, as in the model-driven approach, but to use that to merely specify a module (or "plug-in") framework rather than a complete, monolithic system. Numerous modules should then be relatively easy to implement, resulting in a very flexible system with a very clean separation of concerns and high reusability and portability; and the tooling support required is minimal. The main caveat is that the application functionality needs to be well enough understood at the outset to make it possible to design a module framework that is both straight-forward and stable.

At the last quarterly meeting, there was a consensus that the modular approach is the most appropriate for XMDR, given the requirements and the current state of the art in development tools.


An aggressive development schedule and limited resources make it particularly important for us to prioritize functionality and make some reasonable compromises of initial functionality in order to expedite development. We have identified three kinds of functionality that we think can be postponed for the time being, as follows:

Transaction Management
The transaction management requirements of metadata registration are generally pretty minimal, and sophisticated transaction management (TM) requirements would greatly increase the complexity of the system through all phases of development. We therefore plan to use the simplest possible transaction model, based on just atomic registrations, with no support for "in-place" updates and no support for database or data element locking (relying instead on optimistic concurrency control).
Classification (in DL sense)
The primary distinction of description logics (DLs) is that they provide a decidable semantics which is ideally also tractable, so that a DL model can be both checked for consistency and satisfiability and also canonicalized (the/a transitive reduction of its transitive closure can be computed). In discussions with our collaborators, however, we have generally confirmed our intuition that such computation will much more typically be done in a separate development environment than in a deployed XMDR context. For the time being, therefore, we will not incorporate this kind of functionality in the initial prototype XMDR.
Integrated Query Language
As detailed in the RetrievalIndex subsection below, we have identified two different varieties of query language which have not yet been integrated within any one system. Unfortunately the XMDR project does not yet have the resources to implement a remedy, particularly as we have not yet identified any use case(s) which appear to run up against this potential problem. Our initial prototype will therefore simply support two distinct query facilities, with no substantial integration between them.
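The simple transaction model described under Transaction Management above can be illustrated with an append-only version history per item, where a registration succeeds only if the caller has seen the latest version. This is a minimal sketch under stated assumptions: the class `VersionedItemLog` and its methods are hypothetical names invented for illustration, and reading "optimistic concurrency control" as a version check is our interpretation rather than a settled design.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical append-only store: each registration adds a version; nothing is updated in place. */
class VersionedItemLog {
    private final Map<String, List<String>> versions = new HashMap<>();

    /**
     * Atomically registers a new version of an item.
     * expectedVersion is the number of versions the caller last saw (0 for a new item).
     * Returns the new version number, or -1 if another registration got there first.
     */
    synchronized int register(String itemId, int expectedVersion, String document) {
        List<String> history = versions.computeIfAbsent(itemId, k -> new ArrayList<>());
        if (history.size() != expectedVersion) {
            return -1; // optimistic check failed: caller must re-read and retry
        }
        history.add(document);
        return history.size();
    }

    /** Returns a specific version (1-based), or null if it does not exist. */
    synchronized String get(String itemId, int version) {
        List<String> history = versions.get(itemId);
        if (history == null || version < 1 || version > history.size()) {
            return null;
        }
        return history.get(version - 1);
    }

    /** Returns the number of versions registered so far (0 if the item is unknown). */
    synchronized int latestVersion(String itemId) {
        List<String> history = versions.get(itemId);
        return history == null ? 0 : history.size();
    }
}
```

No locks are held between a client's read and its later registration; a stale client simply receives -1 and retries, which is the trade-off that makes this model so much simpler than full transaction management.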

Five modules have been proposed for the initial XMDR prototype:

  1. RegistryStore (Persistence and Versioning)
  2. MetadataValidator
  3. RetrievalIndex
  4. MappingEngine
  5. AuthenticationService

These are each described in more detail below, and depicted in an Architecture Diagram [PDF] [SVG] [PNG]. Note that in some cases the functionality of more than one of these logical components may be provided by a single software component; for example, DBMS's typically provide some facilities to support nearly all of them. However, presuming that any particular constellation of facilities will all be provided by a monolithic DBMS would have two disadvantages: it would tend to lock the XMDR in to a small subset of the available DBMS's, and it would preclude innovative implementations of one or another of those bundled facilities from being used in place of the implementation provided by the DBMS. Ideally, the logical modularization defined here will permit any or all of these "bundled" functionalities to be made available more or less independently.

1. RegistryStore

The most minimal persistence engine is a simple file- or document-based repository, accessible either locally or through a web server. Given the wide range of intended users outlined in the high-level use cases, it would appear desirable to support at least the web server approach as one option. Alternative persistence modules can then be made to fit the same HTTP-based (GET/PUT) API.
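To illustrate how alternative persistence modules could fit one GET/PUT-style API, here is a minimal sketch with two interchangeable implementations. The interface names follow the architecture diagram, but the method signatures, the `InMemoryRegistryStore` and `FileRegistryStore` classes, and the use of plain strings for documents are all assumptions made for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Read side, mirroring HTTP GET (signature is an assumption). */
interface RegistryStore {
    Optional<String> get(String name);
}

/** Write side, mirroring HTTP PUT (signature is an assumption). */
interface WritableRegistryStore extends RegistryStore {
    void put(String name, String xmlDocument);
}

/** Simplest possible module: documents kept in memory. */
class InMemoryRegistryStore implements WritableRegistryStore {
    private final Map<String, String> documents = new ConcurrentHashMap<>();

    public Optional<String> get(String name) {
        return Optional.ofNullable(documents.get(name));
    }

    public void put(String name, String xmlDocument) {
        documents.put(name, xmlDocument);
    }
}

/** Alternative module: one file per document under a root directory. */
class FileRegistryStore implements WritableRegistryStore {
    private final Path root;

    FileRegistryStore(Path root) {
        this.root = root;
    }

    public Optional<String> get(String name) {
        Path file = root.resolve(name);
        if (!Files.isRegularFile(file)) {
            return Optional.empty();
        }
        try {
            return Optional.of(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public void put(String name, String xmlDocument) {
        try {
            Files.write(root.resolve(name), xmlDocument.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Client code written against `WritableRegistryStore` does not need to know which module is bound. The sketch does not sanitize document names, which a real file-backed module would have to do.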

The versioning model required for metadata registration appears to be somewhat different from that supported by, e.g., WebDAV, but considerable work remains to be done in specifying the XMDR versioning requirements.

2. MetadataValidator

Whether for XML, RDF, or some proprietary format, it will be desirable to support validation of uploaded documents against specified schemas. We wish to wrap validators for XML Schema, RDF Schema, and OWL in a way that will allow documents to be validated via a uniform interface. An open question is whether (at this time) to define and implement an extended validation service which will also check, e.g., Value Domain constraints specified in the registry rather than in schema files. (An alternative approach in some cases might be to provide a service to combine a static schema file with Value Domain information from the registry to produce a new schema file which expresses both, and then do standard validation against that elaborated schema.)
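As one illustration of wrapping an existing validator behind a uniform interface, the sketch below uses the JDK's standard `javax.xml.validation` API for XML Schema. The `MetadataValidator` interface shown here, with a single `isValid` method, is an assumed shape for the example only; wrappers for RDF Schema or OWL validators would implement the same interface.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Hypothetical uniform validation interface; the name and signature are assumptions. */
interface MetadataValidator {
    boolean isValid(String document);
}

/** Wraps the JDK's built-in XML Schema validator behind the uniform interface. */
class XmlSchemaValidator implements MetadataValidator {
    private final Schema schema;

    XmlSchemaValidator(String xsdText) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            this.schema = factory.newSchema(new StreamSource(new StringReader(xsdText)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("Schema could not be compiled", e);
        }
    }

    public boolean isValid(String document) {
        try {
            // A Validator is not thread-safe, so create a fresh one per call.
            schema.newValidator().validate(new StreamSource(new StringReader(document)));
            return true;
        } catch (SAXException e) {
            return false; // document violates the schema or is not well-formed
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The alternative mentioned above, generating an elaborated schema from registry Value Domain information, would fit this sketch without interface changes: the generated schema text would simply be passed to the constructor.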

3. RetrievalIndex

Even though persistence modules might provide some indexing functionality natively, it will be desirable to support the creation of additional indexes which either differ from the natively-supported functionality or span multiple data stores. A stand-alone indexing module will provide an interface for adding potentially arbitrary content and an interface for querying that index. The indexing functionality of a persistence module can be exposed through the same generic query interface. The query interface envisioned will describe the query language supported by a given module in generic terms, will support basic schema (searchable field) interrogation, and will provide a uniform interface to query results.

We have identified two very different query models which are of primary interest for XMDR:

Textual Information Retrieval
The primary focus of XMDR is extensions to better support the use of thesauri, taxonomies, and ontologies (in addition to, and integrated with, management of data elements, conceptual domains, etc.). An important query functionality in such contexts is a modern text query functionality, supporting word-based search of textual parts of the database. Text query interfaces such as those commonly found in electronic library information systems (LIS) also typically have respectable support for the use of hierarchical classification schemes as another component. Most modern text retrieval systems also support searching within particular fields (basic structural query functionality) and comparison based on whole-string match or lexicographic or numeric inequality. Mature, off-the-shelf text retrieval systems therefore seem like a natural fit for a metadata registry. (This should probably not be surprising, as an LIS is also a "metadata repository" in a loose sense of the phrase, but oddly enough the two don't seem to have been connected together previously.) We further feel that the text-searching functionality will prove quite useful even for automated information gathering systems (spiders and other software "agents"), as a common fall-back wherever formal models are either incomplete or disconnected from one another.
Logic-based Systems
One ambition of ontologies and other formal knowledge capture is to support automated reasoning. In a query context, this traditionally means logic-based queries posed in the form of a predicate with some number of variables, with replies in the form of a set of variable bindings. Just a few examples of such query languages are: RDQL, TRIPLE, the RACER Query Language, and the (unnamed?) Ontology Works KS query language. The DIG query interface is also subsumed within the query-answering model (it supports a small set of named, parameterized queries). A concrete example of an RDQL query is given below. Most of these languages are descended from Prolog and/or SQL, in which the underlying data model consists of predicates or relations. Even though the data models of XML and RDF have evolved beyond the relational abstraction (toward trees or, more generally, graphs), the vast majority of query systems which support inference have not yet evolved their query models to match.

There is great variability in the performance (speed, completeness, and robustness) of logic-based systems stemming from various trade-offs between performance dimensions, so it is unlikely that any single choice of system will please everyone. Instead, we wish to provide a generic interface and binding machinery for any such system(s) to be installed.

Both of the query models outlined above encode a query as a simple String and return the results in a List or Set. In the case of text queries, the List contains a number of objects, each with a title, a selection of other identifying attributes, in most cases a score, and often an excerpt of the text of the document where the query matched (with the match highlighted). In the case of question-answering systems, the List contains tuples each corresponding to a binding of the query variables to values. For example, the simple example RDQL query below (taken from the W3C submission of RDQL):

SELECT ?family , ?given
WHERE  (?vcard  vcard:FN "John Smith")
       (?vcard  vcard:N  ?name)
       (?name   vcard:Family  ?family)
       (?name   vcard:Given  ?given)
USING  vcard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>

would ostensibly return a single result, i.e., a List containing the single tuple (headings shown for clarity):

?family ?given
"Smith" "John"
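To make the "List of variable bindings" result shape concrete, the following toy matcher evaluates a conjunction of triple patterns against an in-memory list of triples. It is an illustration only, not an RDQL implementation: the class `TriplePatternQuery`, the use of `String[3]` for triples, and the convention that terms beginning with `?` are variables are all assumptions of this sketch, and no inference is performed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy conjunctive triple-pattern matcher; an illustration, not an RDQL implementation. */
class TriplePatternQuery {

    /** Each triple and each pattern is a String[3]; pattern terms starting with '?' are variables. */
    static List<Map<String, String>> match(List<String[]> triples, List<String[]> patterns) {
        List<Map<String, String>> results = new ArrayList<>();
        results.add(new HashMap<>()); // start with one empty binding
        for (String[] pattern : patterns) {
            List<Map<String, String>> extended = new ArrayList<>();
            for (Map<String, String> binding : results) {
                for (String[] triple : triples) {
                    Map<String, String> candidate = unify(pattern, triple, binding);
                    if (candidate != null) {
                        extended.add(candidate);
                    }
                }
            }
            results = extended; // only bindings satisfying every pattern so far survive
        }
        return results;
    }

    /** Returns the binding extended to cover this triple, or null if they conflict. */
    private static Map<String, String> unify(String[] pattern, String[] triple,
                                             Map<String, String> binding) {
        Map<String, String> out = new HashMap<>(binding);
        for (int i = 0; i < 3; i++) {
            String term = pattern[i];
            if (term.startsWith("?")) {
                String bound = out.get(term);
                if (bound == null) {
                    out.put(term, triple[i]);
                } else if (!bound.equals(triple[i])) {
                    return null; // variable already bound to a different value
                }
            } else if (!term.equals(triple[i])) {
                return null; // constant term does not match
            }
        }
        return out;
    }
}
```

Each returned map binds every variable in the patterns, including intermediate ones such as `?vcard` and `?name`; projecting onto the SELECT variables would be a final step that this sketch omits.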

For an example of a text-based query, in the query syntax supported by the Lucene text retrieval engine out-of-the-box (see http://jakarta.apache.org/lucene/docs/queryparsersyntax.html), the closest analog to the example RDQL query above would be:

FN:["John Smith" TO "John Smith"]

Note that this is actually a range query, but a very simple preprocessor could be used to derive this query string from something like:

FN="John Smith"

In either case, note that the query can only return the entire vcard, not just the ?family,?given tuple. This is one of the big differences between the two query types.
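The preprocessing step mentioned above, from the `FIELD="value"` form to the range-query form, could be as small as the following sketch. It only rewrites strings and calls no Lucene API; the class name `EqualityQueryPreprocessor` is hypothetical, and values containing embedded double quotes are deliberately rejected rather than escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical preprocessor: rewrites FIELD="value" into an inclusive range query string. */
class EqualityQueryPreprocessor {
    // Field name, '=', then a double-quoted value without embedded quotes.
    private static final Pattern EQUALITY =
            Pattern.compile("^\\s*(\\w+)\\s*=\\s*\"([^\"]*)\"\\s*$");

    static String toRangeQuery(String query) {
        Matcher m = EQUALITY.matcher(query);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a FIELD=\"value\" query: " + query);
        }
        String field = m.group(1);
        String value = m.group(2);
        // Same value as both lower and upper bound, so only exact matches qualify.
        return field + ":[\"" + value + "\" TO \"" + value + "\"]";
    }
}
```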

Another example text query that illustrates selection based on a single element of the target item might be:

Family="Smith"

which would return all the vCard objects where family name is "Smith."

4. MappingEngine

The part of the MDR specification that requires the most work for XMDR is the support for registration and use of mappings between pairs of classification systems, ontologies, schemas, and value domains. Thus far, we have identified four general approaches to mapping being used today:

Translation Tables
The simplest approach is to build a table of pairs (an unlabeled bipartite graph) between the classification scheme items, concepts, or values which have corresponding or overlapping meaning. Such tables are also sometimes called "correspondence tables"; see for example the mappings provided with NAICS 2002. A one-to-many matching indicates ambiguity. Translation tables are sometimes qualified with a confidence and/or completeness scale or measure, which is necessarily direction-dependent.
DL-based Translation
A more powerful approach is to use a description logic (such as OWL) to express mappings, which provides more precision. However, some non-trivial tool is still needed to "apply" a DL mapping as a transformation.
FOL-based Translation
First-order logic (FOL) is more powerful than description logics, and so it supports the definition of more complicated mappings. The trade-off is that DLs are typically decidable and tractable, while full FOL is neither. Still, some communities have been using full FOL for some time and found that these theoretical problems rarely (if ever) materialize in practice. Additionally, there is a great variety of DLs, many of which cannot be combined without breaking the decidability and tractability conditions which motivate the use of a DL in the first place, and so any system attempting to leverage knowledge which is expressed in two different DLs will generally be forced to use a substantial subset of FOL anyway. An emerging ISO standard for exchange of FOL axiom sets is called Simple Common Logic (SCL).
Rule- or Query-based Translation
In the relational database community translations are commonly described as views. Euzenat observes that an equivalent level of expressivity is provided by SWRL for OWL/RDF [Euzenat, 2004]. It is not immediately apparent whether rule-based translation is (in theory) any less powerful (or more tractable) than FOL-based translation.
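The first of these approaches, the translation table, is simple enough to sketch directly. The example below models one direction of a table, with a confidence value attached to each pair and a check for one-to-many ambiguity. The class `TranslationTable`, its methods, and the numeric confidence scale are assumptions made for illustration; because confidence is direction-dependent, a reverse table would be built separately with its own values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical one-directional translation (correspondence) table between two schemes. */
class TranslationTable {

    /** A target code plus the confidence of the source-to-target correspondence. */
    static final class Match {
        final String target;
        final double confidence;

        Match(String target, double confidence) {
            this.target = target;
            this.confidence = confidence;
        }
    }

    private final Map<String, List<Match>> bySource = new LinkedHashMap<>();

    /** Records that a source item corresponds to (or overlaps with) a target item. */
    void add(String source, String target, double confidence) {
        bySource.computeIfAbsent(source, k -> new ArrayList<>())
                .add(new Match(target, confidence));
    }

    /** Returns all recorded targets for a source item (empty if none). */
    List<Match> translate(String source) {
        return bySource.getOrDefault(source, Collections.emptyList());
    }

    /** A source item with more than one target cannot be translated unambiguously. */
    boolean isAmbiguous(String source) {
        return translate(source).size() > 1;
    }
}
```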

5. AuthenticationService

Sophisticated authentication support is not an immediate priority for the XMDR project. Nonetheless, we wish to cleanly encapsulate authentication within a distinct module in order to minimize the effort required to adapt the XMDR software for genuine deployment within a sizable enterprise.


The architecture diagram [PDF] [SVG] [PNG] only conveys the structure of the XMDR at a very high level; not depicted are the details of how the modules are to be connected together.

The first system-level decision we have made is to implement the core in Java. There are a great many reasons one could cite for this choice, but the one we find most compelling is the much greater number of state-of-the-art parsers, validators, processors, and other APIs and components which are available for Java, and their relative maturity, in comparison with all other development platforms. Of particular note are the Protégé and Eclipse applications, both of which include a plug-in framework which may facilitate some component reuse in XMDR, and the Jena RDF toolkit (from HP Labs).

Upon this platform, we plan to use two patterns of interaction to tie the individual modules together:

1. AOP/Interceptor

Aspect-Oriented Programming (AOP) is a methodology developed to support the modular implementation of cross-cutting concerns. Authentication is often cited as an example of such a concern, as it needs to be enforced uniformly but across a heterogeneous set of operations. The clean solution is to use advices (also known as interceptors) to interpose execution of some common task before all invocations of some arbitrary set of methods (known collectively as a pointcut). For example, an AOP framework will enable a single declaration to dispatch an authentication method before any method of the RegistryStore interface is invoked, thereby guaranteeing that no access to the RegistryStore will be permitted without successful authentication. Not only is the resulting codebase much more robust, it is also quite a bit smaller and much easier to modify.

A good example of an existing AOP system is the JBoss application server, which includes its own AOP framework. Another popular Java AOP framework is the Spring Framework.

In the case of XMDR, the most important pointcuts are before and after the registration of a new RegisteredItem, or new version thereof. The MetadataValidator and AuthenticationService components will typically be invoked before such registration, and the RetrievalIndex post method will typically be invoked immediately afterwards. Through the use of an AOP framework, however, this wiring can easily be changed in a configuration file. For example, authentication could (optionally) also be invoked before the getDescription() method of the Registry interface (which provides basic information about the particular services available through the registry) if that operation is deemed either to reveal confidential information or to present a significant denial of service (DoS) vulnerability.
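The interceptor idea can be demonstrated without any AOP framework by using the JDK's own dynamic proxies (`java.lang.reflect.Proxy`), which is the technique swapped in for this sketch. The `DocumentStore` interface stands in for the not-yet-specified RegistryStore interface, and `MapDocumentStore` and `AuthenticationInterceptor` are likewise hypothetical names; a framework such as those named above would express the same "before" advice declaratively rather than in code.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical store interface; the real RegistryStore signatures are not yet specified. */
interface DocumentStore {
    String get(String name);
    void put(String name, String document);
}

/** Plain implementation with no knowledge of authentication. */
class MapDocumentStore implements DocumentStore {
    private final Map<String, String> documents = new HashMap<>();
    public String get(String name) { return documents.get(name); }
    public void put(String name, String document) { documents.put(name, document); }
}

/** "Before" advice: runs an authentication check ahead of every intercepted method. */
class AuthenticationInterceptor implements InvocationHandler {
    private final Object target;
    private final BooleanSupplier authenticated;

    AuthenticationInterceptor(Object target, BooleanSupplier authenticated) {
        this.target = target;
        this.authenticated = authenticated;
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        if (!authenticated.getAsBoolean()) {
            throw new SecurityException("Authentication required for " + method.getName());
        }
        try {
            return method.invoke(target, args);
        } catch (InvocationTargetException e) {
            throw e.getCause(); // rethrow the target's own exception
        }
    }

    /** Wraps a store so that every interface method is guarded (the "pointcut"). */
    static DocumentStore guard(DocumentStore store, BooleanSupplier authenticated) {
        return (DocumentStore) Proxy.newProxyInstance(
                DocumentStore.class.getClassLoader(),
                new Class<?>[] { DocumentStore.class },
                new AuthenticationInterceptor(store, authenticated));
    }
}
```

The store implementation contains no authentication code at all, and callers hold only a `DocumentStore` reference, so the check cannot be bypassed or forgotten on a newly added method.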

2. The Strategy Pattern

MetadataValidator and MappingEngine will be services for which multiple implementations will often be available. How will a particular implementation be selected in a given context?

This is an instance of the Strategy Design Pattern, and is seen often in Java libraries; in particular consider JAXP and JDBC, both of which have mechanisms for "registering" drivers/providers and for selecting one for use in a given context. As the above referenced article makes clear, what we do not want to do is write code like:

switch(getDocumentType()) {
  case RDF: jenaWrapper.validate(document);
    break;
  case XML: xercesWrapper.validate(document);
    break;
  case SCL: sclWrapper.validate(document);
    break;
  default: throw new ValidationException
    ("Can't validate document of type "+
     getDocumentTypeName());
}

Instead, we want to be able to register additional implementations dynamically, we want the implementations to each describe their capabilities (what document types they provide validation for), and we want the selection of an implementation to be done with the minimal dependency (and burden) possible placed on the client. Since the analogy to JDBC seems particularly natural, we will use the same proven and familiar mechanisms as are used there.
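A minimal sketch of what a JDBC-style arrangement might look like follows. The names `ValidatorProvider` and `ValidatorManager` are hypothetical, chosen to echo `java.sql.Driver` and `java.sql.DriverManager`; the `accepts` method plays the role that `Driver.acceptsURL` plays in JDBC. The toy provider performs only a trivial check and stands in for a real wrapper.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical provider interface, analogous to java.sql.Driver. */
interface ValidatorProvider {
    /** Declares whether this provider handles the given document type. */
    boolean accepts(String documentType);

    boolean validate(String document);
}

/** Hypothetical registry, analogous to java.sql.DriverManager. */
final class ValidatorManager {
    private static final List<ValidatorProvider> PROVIDERS = new CopyOnWriteArrayList<>();

    private ValidatorManager() { }

    /** Providers can be added at any time, e.g. when a plug-in is loaded. */
    static void register(ValidatorProvider provider) {
        PROVIDERS.add(provider);
    }

    /** Returns the first registered provider that accepts the document type. */
    static ValidatorProvider forType(String documentType) {
        for (ValidatorProvider provider : PROVIDERS) {
            if (provider.accepts(documentType)) {
                return provider;
            }
        }
        throw new IllegalArgumentException("Can't validate document of type " + documentType);
    }
}

/** Toy provider standing in for a real wrapper such as one around an XML parser. */
class NonEmptyXmlProvider implements ValidatorProvider {
    public boolean accepts(String documentType) {
        return "XML".equals(documentType);
    }

    public boolean validate(String document) {
        return document.trim().startsWith("<"); // placeholder check only
    }
}
```

A client then writes `ValidatorManager.forType(type).validate(document)` and never names a concrete implementation, so supporting a new document type means registering one more provider rather than editing a switch statement.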

RetrievalIndex may also be implemented along the Strategy pattern, if doing so does not prove too cumbersome. (One potential compromise, if it significantly simplifies anything, would be to support such a pattern only for QuestionAnsweringModule, while more directly binding FullTextIndex.)



Appendix

This document is accompanied by an architecture diagram [PDF] [SVG] [PNG] in UML. This appendix provides some additional explanation of what the diagram is intended to show.

At the bottom of the diagram is a brief key to the UML symbols used to indicate different kinds of associations, for those who are not intimately familiar with UML. It is also worth noting that the "+" at the beginning of all the methods and named association ends is to indicate "public" access.

At the top right is the "Registry" interface, which is meant to represent the external interface of XMDR. The other interfaces in the diagram are all internal component interfaces, through which the XMDR core will interact with the internal components.

Throughout the diagram, "getDescription" methods are included to provide information about the specific capabilities of, or services provided by, an instance of that interface. In all cases the method has no parameters and returns an "RDFGraph"; we have not yet decided on an ontology to use for this function, but it is likely to be an extension of the OWL-S ontology for web services. In the case of the Registry, "getDescription" is intended to return, more or less, the union of the RDFGraphs obtained from each of its bound components which provide such a method.

The "get", "put", and "post" methods of the RegistryStore, WritableRegistryStore, and RetrievalIndex interfaces are named after the HTTP methods of the same names (just lower-cased), to indicate their RESTful semantics.

The other methods should all be pretty self-explanatory, and in any case are at this point only provided for illustration, and still subject to significant refinement in the design and implementation phases of the project.



XMDR Project Home Page: http://xmdr.org/ | Maintained by Kevin D. Keck, Lawrence Berkeley National Laboratory, kdkeck@lbl.gov | Last updated: February 3, 2005.

© 2007, Lawrence Berkeley National Laboratory
maintained by Karlo Berket
Credits: The research and development of the eXtended MetaData Registry is supported by a variety of participating organizations.