| XMDR: Proposed Prototype Architecture |
Version 1.01, edited by Kevin D. Keck and John L. McCarthy,
February 3, 2005
To see a graphic representation of the XMDR architecture, please refer to the architecture diagram [PDF] [SVG] [PNG].
| Introduction |
This is a draft architecture for the first prototype for the
Extended Metadata Registry (XMDR)
project. We highly recommend reading the Use Cases document before reading this
one.
Because "architecture" is a scale-independent concept, as
applicable to microchip design as to the design of the entire
Internet, discussions of "the architecture" of a system are often
confused about which scale or scales are being addressed. For XMDR,
two distinct scales are of primary interest; we refer to them as the
internal architecture and the external architecture. The
internal architecture describes how components are assembled to create the prototype
XMDR itself. The external architecture describes how the XMDR is
coupled to other systems to form an encompassing
application, intranet, extranet, framework, grid, web, or other
type of larger distributed "system". These two very different
levels of architecture are addressed in turn below.
| I. External Architecture |
The most popular approach to an external architecture is
currently being called Service-oriented Architecture, or
SOA. It represents the latest generation of distributed
architecture framework in the tradition of RMI, COM/DCOM, CORBA,
RPC, etc. The favored invocation protocols are XML Web Services
(SOAP, WSDL, UDDI, etc.), but the modern approach is intended to be
open-ended enough to also work with any legacy or other less-common
RPC-type protocol required in a particular setting.
It is well understood, if not widely appreciated, that SOAP is a
marked departure from the so-called REST
architectural style of the existing World Wide Web, which
the World Wide Web Consortium
(W3C) continues to favor within its very high-profile "Semantic Web Activity". The
primary distinction is the emphasis in REST on interface
genericity ("genericity" as in generic, not
generat(iv)e), and most particularly on the very simple
yet powerful "interface" embodied in URIs and HTTP "GET", "PUT",
and "POST". For example, the RDF standard is self-consciously a
document format, not a service
specification. Within the W3C, the favored watchword is
URI
addressability. SOAP, by
contrast, is a protocol designed to support a more object-oriented
approach, wherein an address (in the case of SOAP, a URL)
identifies an object (instance) which supports a more elaborate
[object-oriented] "interface", often consisting of a rather diverse
set of [strongly-typed] "methods".
Both architectural approaches arguably have distinct advantages
and disadvantages, and in any case the choice of one over the other
within a given organization or community will be driven by much
more salient concerns than which one might be favored by a revised
11179 standard. In order to serve as a truly "global" registry, as
chartered, it is therefore highly desirable that a Metadata
Registry offer support for both encompassing architectural
styles. That said, experience has shown that it is much easier to
wrap a REST interface in a SOAP interface than vice versa. For this
reason we will be implementing a REST interface first, with a SOAP
wrapper to follow when time and resources permit.
The 11179 metadata registry standard is deliberately written to
be data model independent, in two senses:
- It can be used to describe data elements and other constructs
within any data model; and
- Its own abstract model (defined in UML) can be "implemented in"
(i.e., described by means of) any sufficiently powerful data model.
Within current 11179 Metadata Registries, the
Classification_Scheme component frequently differs from other types
of 11179 Administered_Items because it often is an
extra-registry artifact, rather than merely an
intra-registry item. Moreover, many Classification_Schemes
may have more complex substructure than other current
Administered_Items. We thus need to be able to represent external
objects that may have arbitrarily complex internal structure of
their own.
After extensive discussion and investigation, we propose to use
XML as both a data model and data exchange format for the XMDR
Prototype. In addition to supporting the REST architectural style,
this approach will greatly simplify both specification and
development of an exemplar XMDR system. While we gave some
consideration to using some less conventional representations (such
as dialects of XMI and SCL), none seemed to have distinct
advantages over the two major W3C formats: XML and RDF (W3C's
Resource Description Framework). The exchange format is not
intended for exchange between the registry and the sources of
concept systems (registry input); rather, it is for exchange between
the XMDR and its clients. Concept systems may be received in their
native relational, hierarchical, or other data model. Internally,
the XMDR may store a concept system in both its native data
model and the canonical XML data model.
We further concluded that XML would be preferable to RDF for our
basic XMDR data representation because it is more general as well
as simpler to understand and use. Although RDF is layered on top of
XML and uses XML syntax, in practice it interposes a very different
data model from the world of XML and XML Schemas—namely open-ended
sets of subject-predicate-object triples. RDF is interpreted with
"open world" semantics, which means, among other things, that a
description in RDF can never be assumed to be complete (e.g., in
RDF, assertion of four particular names for a resource does not
imply that those are the only names for that resource). A
more concrete difference (which is partly a consequence) is that an
RDF "schema" only constrains what can be true, not what
must actually be stated in any particular RDF graph
(document). This last difference makes RDF a rather poor basis (by
itself) for information exchange. In practice this may be mitigated
by obtaining RDF data as a response to a very explicit query, since
the association of the graph with a particular query
provides additional information due to the more explicit
operational semantics of the query language. Such association could
be represented explicitly within an encompassing "envelope" message
(as was done, perhaps not coincidentally, in ACL), but that broader
data model is quite a bit more complex, particularly if the query
language is not standardized (as was the case in ACL). This has led
us to conclude that, while RDF certainly has its place, it is not
(by itself) the most appropriate language for exchanging the highly
constrained messages to which XML is perfectly suited, and any
non-standard extended message format would largely defeat the
purpose of using RDF at all. It should be possible, however, to
design XML schemas for XMDR in such a way that conforming XML
documents can also be parsed as RDF, if that seems to be a useful
feature.
The important points are that each Administered_Item in the
prototype XMDR will be contained in and/or described by a specific
XML data structure, that each such XML data structure will be a
separate XML document, and that each such XML document can be
searched and retrieved as a single document file.
| II. Internal Architecture |
Our list of potential platforms and
components is rather extensive, and includes components geared
toward a very wide spectrum of system architectures. This section
attempts to lay out the internal architecture we believe is optimal
to adopt for the initial XMDR prototype, based on comparisons with
a number of alternatives. This choice of architecture will in turn
help us prune our list of components to be evaluated in depth, and
help to clarify the criteria on which they should be evaluated.
We initially identified four candidate architectural approaches.
The salient distinction is in where the application logic is
incorporated.
- The traditional approach is a typical server stack
consisting of a persistence layer and an application layer. Client
applications interact with the application layer through either a
protocol-based API or a provided client-side "driver". Use of a
standard API between the persistence layer and application layer
(e.g., SQL/JDBC, JDO, or EJB Persistence) provides some degree of
modularity, but the application logic is still tightly coupled to
the data model of the persistence abstraction. This approach fits
pretty well within SOA, but the result is relatively rigid.
- The DBMS-centric approach moves most, if not all, of
the "application logic" into declarative constraints, triggers,
etc. within the DBMS. This can be implemented with an RDBMS
back-end, with a logic-based database, or possibly with an XML- or
RDF-based DBMS. One downside is that constraint and trigger
specification are still much less standardized than data model and
query interfaces, so this approach tends to lock the system in to a
particular DBMS platform and presents a greater learning curve for
any developer not already familiar with that specific platform. It
is also often difficult to override the default exception-handling
behavior when integrity violations do occur.
- The model-driven approach is rather similar to the
DBMS-centric approach, but the constraints and logic are specified
in a platform-independent modeling language which is translated
into application logic at compile time; i.e., all the application
code is generated from the model. This clearly avoids the
(runtime) platform lock-in associated with the DBMS-centric
approach, but it is not as mature nor as familiar to the current XMDR
developers, and much of the available software tooling may be
beyond the budget of the XMDR project.
- The fully modular approach is exemplified by some
highly successful open-source software such as the Apache Web
Server, the Eclipse IDE, and the Protégé Ontology Editor. The idea
is to start from an abstract model, as in the model-driven
approach, but to use that to merely specify a module (or "plug-in")
framework rather than a complete, monolithic system. Numerous
modules should then be relatively easy to implement, resulting in a
very flexible system with a very clean separation of concerns and
high reusability and portability; and the tooling support required
is minimal. The main caveat is that the application functionality
needs to be well enough understood at the outset to make it
possible to design a module framework that is both straight-forward
and stable.
At the last quarterly meeting, there was a consensus that the
modular approach is the most appropriate for XMDR, given the
requirements and the current state of the art in development
tools.
An aggressive development schedule and limited resources make it
particularly important for us to prioritize functionality and make
some reasonable compromises of initial functionality in order to
expedite development. We have identified three kinds of
functionality that we think can be postponed for the time being, as
follows:
- Transaction Management
- The transaction management requirements of metadata
registration are generally pretty minimal, and sophisticated
transaction management (TM) requirements would greatly increase the
complexity of the system through all phases of development. We
therefore plan to use the simplest possible transaction model,
based on just atomic registrations, with no support for "in-place"
updates and no support for database or data element locking
(relying instead on optimistic concurrency control).
- Classification (in DL sense)
- The primary distinction of description logics (DLs) is that
they provide a decidable semantics which is ideally also tractable,
so that a DL model can be both checked for consistency and
satisfiability and also canonicalized (the/a transitive reduction
of its transitive closure can be computed). In discussions with our
collaborators, however, we have generally confirmed our intuition
that such computation will be much more typically done in a
separate development environment than it will be in a deployed XMDR
context. For the time being, therefore, we will not incorporate
this kind of functionality for the initial prototype XMDR.
- Integrated Query Language
- As detailed in the RetrievalIndex subsection below, we have identified two
different varieties of query language which have not yet been
integrated within any one system. Unfortunately the XMDR project
does not yet have the resources to implement a remedy, particularly
as we have not yet identified any use case(s) which appear to run
up against this potential problem. Our initial prototype will
therefore simply support two distinct query facilities, with no
substantial integration between them.
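The simple transaction model described under Transaction Management above can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: the class name `VersionedStore`, its method names, and the use of a version counter as the concurrency check are hypothetical and are not part of any XMDR specification.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical append-only store: every registration adds a new version. */
class VersionedStore {
    private final Map<String, List<String>> versions = new HashMap<>();

    /** Number of versions currently registered for an item (0 if unknown). */
    synchronized int latestVersion(String id) {
        List<String> v = versions.get(id);
        return v == null ? 0 : v.size();
    }

    /**
     * Atomically registers a new version. The caller states which version its
     * work was based on; if another registration arrived first, nothing is
     * written and false is returned. No lock is held between the caller's
     * read and this write, which is the optimistic part.
     */
    synchronized boolean register(String id, int basedOnVersion, String xml) {
        List<String> v = versions.computeIfAbsent(id, k -> new ArrayList<>());
        if (v.size() != basedOnVersion) {
            return false; // stale: caller must re-read and retry
        }
        v.add(xml); // earlier versions are never modified in place
        return true;
    }

    /** Returns the given version (1-based), or null if it does not exist. */
    synchronized String get(String id, int version) {
        List<String> v = versions.get(id);
        return (v == null || version < 1 || version > v.size()) ? null : v.get(version - 1);
    }
}
```

A client whose registration is rejected simply re-reads the latest version and tries again, so no lock manager is needed.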
Five modules have been proposed for the initial XMDR
prototype:
- RegistryStore (Persistence and Versioning)
- MetadataValidator
- RetrievalIndex
- MappingEngine
- AuthenticationService
These are each described in more detail below, and depicted in
an Architecture Diagram [PDF] [SVG] [PNG]. Note
that in some cases the functionality of more than one of these
logical components may be provided by a single software component;
for example, DBMS's typically provide some facilities to support
nearly all of them. However, presuming that any particular
constellation of facilities will all be provided by a monolithic
DBMS would have two disadvantages: it would tend to lock the XMDR in
to a small subset of the available DBMS's, and it would preclude
substituting an innovative implementation of one or another of those
bundled facilities for the implementation
provided by the DBMS. Ideally, the logical modularization defined
here will permit any or all of these "bundled" functionalities to
be made available more or less independently.
1. RegistryStore
The most minimal persistence engine is a simple file- or
document-based repository, accessible either locally or through a
web server. Given the wide range of intended users outlined in the
high-level use cases, it would appear
desirable to support at least the web server approach as one
option. Alternative persistence modules can then be made to fit the
same HTTP-based (GET/PUT) API.
The versioning model required for metadata registration appears
to be somewhat different from that supported by, e.g., WebDAV, but
considerable work remains to be done in specifying the XMDR
versioning requirements.
2. MetadataValidator
Whether for XML, RDF, or some proprietary format, it will be
desirable to support validation of uploaded documents against
specified schemas. We wish to wrap validators for XML Schema, RDF
Schema, and OWL in a way that will allow documents to be validated
via a uniform interface. An open question is whether (at this time)
to define and implement an extended validation service which will
also check, e.g., Value Domain constraints specified in the
registry rather than in schema files. (An alternative approach in
some cases might be to provide a service to combine a static schema
file with Value Domain information from the registry to produce a
new schema file which expresses both, and then do standard
validation against that elaborated schema.)
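As a rough sketch of what a uniform validation interface might look like, the following wraps the JDK's standard XML Schema validator (`javax.xml.validation`) behind a one-method interface. The interface name `MetadataValidator` comes from this document, but the `isValid` signature and the `XmlSchemaValidator` class are assumptions made for illustration; validators for RDF Schema or OWL would need their own wrappers around other libraries.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Hypothetical uniform interface; the actual XMDR signature is not yet fixed. */
interface MetadataValidator {
    boolean isValid(String document);
}

/** Wraps the JDK's XML Schema validator behind the uniform interface. */
class XmlSchemaValidator implements MetadataValidator {
    private final Schema schema;

    XmlSchemaValidator(String xsd) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            this.schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema could not be compiled", e);
        }
    }

    @Override
    public boolean isValid(String document) {
        try {
            // A Validator is not thread-safe, so create one per call.
            schema.newValidator().validate(new StreamSource(new StringReader(document)));
            return true;
        } catch (SAXException | IOException e) {
            return false; // not well-formed, or violates the schema
        }
    }
}
```

Callers depend only on `MetadataValidator`, so a wrapper for another schema language can be substituted without changing them.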
3. RetrievalIndex
Even though persistence modules might provide some indexing
functionality natively, it will be desirable to support the
creation of additional indexes which either differ from the
natively-supported functionality or span multiple data stores. A
stand-alone indexing module will provide an interface for adding
potentially arbitrary content and an interface for querying that
index. The indexing functionality of a persistence module can be
exposed through the same generic query interface. The query
interface envisioned will describe the query language supported by
a given module in generic terms, will support basic schema
(searchable field) interrogation, and will provide a uniform
interface to query results.
We have identified two very different query models which are of
primary interest for XMDR:
- Textual Information Retrieval
- The primary focus of XMDR is extensions to better support the
use of thesauri, taxonomies, and ontologies (in addition to—and
integrated with—management of data elements, conceptual domains,
etc.). An important query functionality in such contexts is a
modern text query functionality, supporting word-based search of
textual parts of the database. Text query interfaces such as those
commonly found in electronic library information systems (LIS) also
typically have respectable support for the use of hierarchical
classification schemes as another component. Most modern text
retrieval systems also support searching within particular fields
(basic structural query functionality) and comparison based on
whole-string match or lexicographic or numeric inequality. Mature,
off-the-shelf text retrieval systems therefore seem like a natural
fit for a metadata registry. (This should probably not be
surprising, as an LIS is also a "metadata repository" in a loose
sense of the phrase, but oddly enough the two don't seem to have
been connected together previously.) We further feel that the
text-searching functionality will prove quite useful even for
automated information gathering systems (spiders and other software
"agents"), as a common fall-back wherever formal models are either
incomplete or disconnected from one another.
- Logic-based Systems
- One ambition of ontologies and other formal knowledge capture
is to support automated reasoning. In a query context, this
traditionally means logic-based queries posed in the form
of a predicate with some number of variables, with replies in the
form of sets of variable bindings. Just a few examples
of such query languages are: RDQL,
TRIPLE, the
RACER Query Language, and the (unnamed?) Ontology Works KS query
language. The DIG
query interface is also subsumed within the query-answering model
(it supports a small set of named, parameterized queries). A
concrete example of an RDQL query is given below. Most of these
languages are descended from Prolog and/or SQL, in which the
underlying data model consists of predicates or relations. Even
though the data models of XML and RDF have evolved beyond the
relational abstraction (toward trees or, more generally, graphs),
the vast majority of query systems which support inference have not
yet evolved their query models to match.
There is great variability in the performance (speed, completeness,
and robustness) of logic-based systems stemming from various
trade-offs between performance dimensions, so it is unlikely that
any single choice of system will please everyone. Instead, we wish
to provide a generic interface and binding machinery for any such
system(s) to be installed.
Both of the query models outlined above encode a query as a
simple String and return the results in a List or
Set. In the case of text queries, the List contains a
number of objects, each with a title, a selection of other
identifying attributes, in most cases a score, and often an excerpt
of the text of the document where the query matched (with the match
highlighted). In the case of question-answering systems, the
List contains tuples each corresponding to a binding of
the query variables to values. For example, the simple example RDQL
query below (taken from the W3C
submission of RDQL):
SELECT ?family , ?given
WHERE (?vcard vcard:FN "John Smith")
(?vcard vcard:N ?name)
(?name vcard:Family ?family)
(?name vcard:Given ?given)
USING vcard FOR <http://www.w3.org/2001/vcard-rdf/3.0#>
would ostensibly return a single result, i.e., a List
containing the single tuple below (headings shown for clarity):
| ?family | ?given |
| "Smith" | "John" |
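To make the "list of variable bindings" result shape concrete, here is a toy conjunctive triple-pattern matcher. It is only a sketch under stated assumptions: it is not an RDQL engine and does not use the Jena API, the class name `TripleQuery` is hypothetical, triples are plain three-element string arrays, and the projection step (the SELECT clause) is omitted, so every bound variable appears in each result.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TripleQuery {
    /** Matches all patterns conjunctively; terms starting with '?' are variables. */
    static List<Map<String, String>> query(List<String[]> triples, List<String[]> patterns) {
        List<Map<String, String>> results = new ArrayList<>();
        results.add(new LinkedHashMap<>()); // start from one empty binding
        for (String[] pattern : patterns) {
            List<Map<String, String>> next = new ArrayList<>();
            for (Map<String, String> binding : results) {
                for (String[] triple : triples) {
                    Map<String, String> extended = unify(pattern, triple, binding);
                    if (extended != null) {
                        next.add(extended);
                    }
                }
            }
            results = next; // only bindings that satisfy every pattern so far survive
        }
        return results;
    }

    /** Extends a binding so that pattern matches triple, or returns null on mismatch. */
    private static Map<String, String> unify(String[] pattern, String[] triple,
                                             Map<String, String> binding) {
        Map<String, String> out = new LinkedHashMap<>(binding);
        for (int i = 0; i < 3; i++) {
            String term = pattern[i];
            if (term.startsWith("?")) {
                String bound = out.putIfAbsent(term, triple[i]);
                if (bound != null && !bound.equals(triple[i])) {
                    return null; // variable already bound to a different value
                }
            } else if (!term.equals(triple[i])) {
                return null; // constant does not match
            }
        }
        return out;
    }
}
```

Each element of the returned list corresponds to one row of the result table, keyed by variable name.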
For an example of a text-based query, in the query syntax
supported by the Lucene text retrieval engine out-of-the-box (see
http://jakarta.apache.org/lucene/docs/queryparsersyntax.html), the
closest analog to the example RDQL query above would be:
FN:["John Smith" TO "John Smith"]
Note that this is actually a range query, but a very
simple preprocessor could be used to derive this query string from
something like:
FN="John Smith"
In either case, note that the query can only return the entire
vcard, not just the ?family,?given tuple. This is one of the big
differences between the two query types.
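The "very simple preprocessor" mentioned above could be as small as a single regular-expression rewrite. The sketch below is an assumption about how it might be done, not part of Lucene: the class name `EqualityQueryPreprocessor` is hypothetical, field names are assumed to be simple word characters, and values are assumed to contain no embedded quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EqualityQueryPreprocessor {
    // field="value": a simple field name, then a quoted value without embedded quotes
    private static final Pattern EQUALS = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    /** Rewrites each field="value" term as the range query field:["value" TO "value"]. */
    static String toRangeQuery(String query) {
        Matcher m = EQUALS.matcher(query);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String field = m.group(1);
            String quoted = "\"" + m.group(2) + "\"";
            String range = field + ":[" + quoted + " TO " + quoted + "]";
            // quoteReplacement keeps '$' and '\' in values from being treated specially
            m.appendReplacement(out, Matcher.quoteReplacement(range));
        }
        m.appendTail(out); // copy whatever follows the last match unchanged
        return out.toString();
    }
}
```

Text outside the matched terms passes through untouched, so other query-syntax operators can still be combined with the rewritten terms.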
Another example text query that illustrates selection based on a
single element of the target item might be:
Family="Smith"
which would return all the vCard objects where family name is
"Smith."
4. MappingEngine
The part of the MDR specification that requires the most work
for XMDR is the support for registration and use of mappings
between pairs of classification systems, ontologies, schemas, and
value domains. Thus far, we have identified four general
approaches to mapping being used today:
- Translation Tables
- The simplest approach is to build a table of pairs (an unlabeled
bipartite graph) between the classification scheme items, concepts,
or values which have corresponding or overlapping meaning. They are
also sometimes called "correspondence tables"; see for example the
mappings provided
with NAICS 2002. A one-to-many matching indicates ambiguity.
Translation tables are sometimes qualified with a confidence and/or
completeness scale or measure, which is necessarily
direction-dependent.
- DL-based Translation
- A more powerful approach is to use a description logic (such as
OWL) to express mappings, which provides more precision. However,
some non-trivial tool is still needed to "apply" a DL mapping as a
transformation.
- FOL-based Translation
- First-order logic (FOL) is more powerful than description
logics, and so it supports the definition of more complicated
mappings. The trade-off is that DLs are typically decidable and
tractable, while full FOL is neither. Still, some communities have
been using full FOL for some time and found that these theoretical
problems rarely (if ever) materialize in practice. Additionally,
there is a great variety of DLs, many of which cannot be combined
without breaking the decidability and tractability conditions which
motivate the use of a DL in the first place, and so any system
attempting to leverage knowledge which is expressed in two
different DLs will generally be forced to use a substantial subset
of FOL anyway. An emerging ISO standard for exchange of FOL axiom
sets is called Simple Common Logic (SCL).
- Rule- or Query-based Translation
- In the relational database community translations are commonly
described as views. Euzenat observes that an equivalent level of
expressivity is provided by SWRL for OWL/RDF [Euzenat, 2004]. It is
not immediately apparent whether rule-based translation is (in
theory) any less powerful (or more tractable) than FOL-based
translation.
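Of the approaches listed above, the translation table is simple enough to sketch directly. The following is a minimal illustration only: the class name `TranslationTable` and its methods are hypothetical, and confidence or completeness qualifiers are left out.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical one-directional translation (correspondence) table. */
class TranslationTable {
    private final Map<String, Set<String>> forward = new LinkedHashMap<>();

    /** Records that a source item corresponds to (or overlaps with) a target item. */
    void addPair(String source, String target) {
        forward.computeIfAbsent(source, k -> new LinkedHashSet<>()).add(target);
    }

    /** All target items paired with the source item; empty if it is unmapped. */
    Set<String> translate(String source) {
        return forward.getOrDefault(source, Collections.emptySet());
    }

    /** A one-to-many match means the translation is ambiguous. */
    boolean isAmbiguous(String source) {
        return translate(source).size() > 1;
    }
}
```

Because any qualifier on a pair would be direction-dependent, a reverse table would be kept as a separate instance rather than derived by inverting this one.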
5. AuthenticationService
Sophisticated authentication support is not an immediate
priority for the XMDR project. Nonetheless, we wish to cleanly
encapsulate authentication within a distinct module in order to
minimize the effort required to adapt the XMDR software for genuine
deployment within a sizable enterprise.
The architecture diagram [PDF] [SVG] [PNG] conveys the structure of
the XMDR only at a very high level; it does not depict the details of
how the modules are to be connected together.
The first system-level decision we have made is to implement the
core in Java. There are a great many reasons one could cite for
this choice, but the one we find most compelling is the much
greater number of state-of-the-art parsers, validators, processors,
and other APIs and components which are available for Java, and
their relative maturity, in comparison with all other development
platforms. Of particular note are the Protégé and Eclipse
applications, both of which include a plug-in framework which may
facilitate some component reuse in XMDR, and the Jena RDF toolkit
(from HP Labs).
Upon this platform, we plan to use two patterns of interaction
to tie the individual modules together:
1. AOP/Interceptor
Aspect-Oriented Programming (AOP) is a methodology developed to
support the modular implementation of cross-cutting
concerns. Authentication is often cited as an example of such
a concern, as it needs to be enforced uniformly but across a
heterogeneous set of operations. The clean solution is to use
advices (also known as interceptors) to interpose
execution of some common task before all invocations of some
arbitrary set of methods (known collectively as a
pointcut). For example, an AOP framework will enable a
single declaration to dispatch an authentication method before any
method of the RegistryStore interface is invoked, thereby
guaranteeing that no access to the RegistryStore will be
permitted without successful authentication. Not only is the
resulting codebase much more robust, it is also quite a bit smaller
and much easier to modify.
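The interception idea can be approximated without any AOP framework by using the JDK's dynamic proxies (`java.lang.reflect.Proxy`). The sketch below is an illustration under assumptions, not the planned XMDR wiring: the one-method `RegistryStore` interface is a much-reduced stand-in, and `AuthInterceptor` and its `secure` method are hypothetical names. A real AOP framework would declare the pointcut in configuration instead of code.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.function.BooleanSupplier;

/** Hypothetical, much-reduced stand-in for the RegistryStore interface. */
interface RegistryStore {
    String get(String uri);
}

class AuthInterceptor {
    /** Returns a proxy that runs the authentication check before every interface method. */
    @SuppressWarnings("unchecked")
    static <T> T secure(Class<T> iface, T target, BooleanSupplier authenticated) {
        return (T) Proxy.newProxyInstance(
            iface.getClassLoader(),
            new Class<?>[] { iface },
            (proxy, method, args) -> {
                // The "advice": executed before any method of the interface.
                if (!authenticated.getAsBoolean()) {
                    throw new SecurityException(
                        "authentication required for " + method.getName());
                }
                try {
                    return method.invoke(target, args);
                } catch (InvocationTargetException e) {
                    throw e.getCause(); // rethrow what the target itself threw
                }
            });
    }
}
```

The wrapped store has no knowledge of authentication, and the check cannot be bypassed by callers that only hold the proxy.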
A good example of an existing AOP system is the JBoss
application server, which includes its own AOP framework. Another
popular Java AOP framework is the Spring Framework.
In the case of XMDR, the most important pointcuts are before and
after the registration of a new RegisteredItem, or new version
thereof. The MetadataValidator and AuthenticationService components
will typically be invoked before such registration, and the
RetrievalIndex post method will typically be invoked
immediately afterwards. With an AOP framework, however, this
wiring can easily be changed in a configuration file. For example,
authentication could (optionally) also be invoked before the
getDescription() method of
the Registry interface (which provides basic information about the
particular services available through the registry) if that
operation is deemed to either reveal confidential information or to
present a significant denial of service (DoS) vulnerability.
2. The Strategy Pattern
MetadataValidator and MappingEngine will be services for which
multiple implementations will often be available. How will a
particular implementation be selected in a given context?
This is an instance of the
Strategy Design Pattern, and is seen often in Java libraries;
in particular consider JAXP and JDBC, both of which have mechanisms
for "registering" drivers/providers and for selecting one for use
in a given context. As the above referenced article makes clear,
what we do not want to do is write code like:
// Anti-pattern: every new document type requires editing this switch.
switch (getDocumentType()) {
    case RDF:
        jenaWrapper.validate(document);
        break;
    case XML:
        xercesWrapper.validate(document);
        break;
    case SCL:
        sclWrapper.validate(document);
        break;
    default:
        throw new ValidationException(
            "Can't validate document of type " + getDocumentTypeName());
}
Instead, we want to be able to register additional
implementations dynamically, we want the implementations to each
describe their capabilities (what document types they provide
validation for), and we want the selection of an implementation to
be done with the minimal dependency (and burden) possible placed on
the client. Since the analogy to JDBC seems particularly natural,
we will use the same proven and familiar mechanisms as are used
there.
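A minimal sketch of such a registration mechanism follows, modeled loosely on how JDBC's `DriverManager.registerDriver` and `Driver.acceptsURL` cooperate. The names `ValidatorProvider`, `ValidatorManager`, `accepts`, and `forType` are hypothetical and chosen only for illustration.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical provider interface: each implementation describes what it can validate. */
interface ValidatorProvider {
    boolean accepts(String documentType);
    boolean validate(String document);
}

/** Hypothetical registry, playing the role DriverManager plays for JDBC drivers. */
class ValidatorManager {
    private static final List<ValidatorProvider> PROVIDERS = new CopyOnWriteArrayList<>();

    /** Implementations register themselves dynamically, e.g. when a plug-in loads. */
    static void register(ValidatorProvider provider) {
        PROVIDERS.add(provider);
    }

    /** Selects the first registered provider that claims the document type. */
    static ValidatorProvider forType(String documentType) {
        for (ValidatorProvider p : PROVIDERS) {
            if (p.accepts(documentType)) {
                return p;
            }
        }
        throw new IllegalArgumentException(
            "Can't validate document of type " + documentType);
    }
}
```

Client code then reduces to `ValidatorManager.forType(type).validate(document)`, and supporting a new document type means registering a new provider rather than editing a switch statement.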
RetrievalIndex may also be implemented along the Strategy
pattern, if doing so does not prove too cumbersome. (One potential
compromise, if it significantly simplifies anything, would be to
support such a pattern only for QuestionAnsweringModule, while more
directly binding FullTextIndex.)
| Appendix |
This document is accompanied by an
architecture diagram [PDF] [SVG] [PNG] in UML. This
appendix provides some additional explanation of what the diagram
is intended to show.
At the bottom of the diagram is a brief key to the UML symbols
used to indicate different kinds of associations, for those who are
not intimately familiar with UML. It is also worth noting that the
"+" at the beginning of all the methods and named association ends
is to indicate "public" access.
At the top right is the "Registry" interface, which is meant to
represent the external interface of XMDR. The other interfaces in
the diagram are all internal component interfaces, through which
the XMDR core will interact with the internal components.
Throughout the diagram, "getDescription" methods supply
information about the specific capabilities of, or services
provided by, an instance of that interface. In all cases it has no
parameters and returns an "RDFGraph"; we have not yet decided on an
Ontology to use for this function, but it is likely to be an
extension of the OWL-S ontology for web services. In the case of
the Registry, the "getDescription" is intended to more or less
return the union of the RDFGraphs obtained from each of its bound
components which provide such a method.
The "get", "put", and "post" methods of the RegistryStore,
WritableRegistryStore, and RetrievalIndex interfaces are named
after the HTTP methods of the same name (just lower-cased), to
indicate their RESTful semantics.
The other methods should all be pretty self-explanatory, and in
any case are at this point only provided for illustration, and
still subject to significant refinement in the design and
implementation phases of the project.
XMDR Project Home Page: http://xmdr.org/ | Maintained by Kevin D. Keck, Lawrence Berkeley
National Laboratory, kdkeck@lbl.gov | Last updated: February
3, 2005.