CMD2RDF: Building a Bridge from CLARIN to Linked Open Data

Metadata can be represented in many different ways. CLARIN’s Component Metadata Infrastructure (CMDI) uses the eXtensible Markup Language (XML) as the representation format for metadata records. However, the Resource Description Framework (RDF) as used by Linked Open Data (LOD) is gaining popularity. RDF has interesting potential for queries that involve both metadata about and the content of linguistic resources. This chapter describes the implementation of a mapping for records in CMDI from XML to RDF and experiments to assess the potential of this representation.


Introduction
Metadata has always been a key issue for libraries and archives and thus has a long history (M-Files, 2016). Throughout the ages the physical form and, more recently, the digital representation of metadata has changed, i.e., adapted to the standard current at that time. When the CLARIN preparatory phase started in 2007, the eXtensible Markup Language (XML; Bray et al., 2008) was the current standard. CLARIN's metadata standard as implemented in the Component Metadata Infrastructure (CMDI; Broeder et al., 2012; CLARIN, 2016a) is thus also based on XML as the representation format for metadata. However, the Resource Description Framework (RDF; Cyganiak, Wood and Lanthaler, 2014) as used, for example, by the Linguistic Linked Open Data (LLOD) cloud (Chiarcos et al., 2012; LIDER project, 2016) is gaining popularity. RDF offers interesting potential for queries that involve both metadata about and the content of linguistic resources, as both metadata and content can be collected and queried in a set of connected graphs. In the CMD2RDF project (CLARIN-NL, 2016), CLARIN-NL sponsored the actual implementation of the mapping from Component Metadata (CMD) to RDF, which was proposed by Durco and Windhouwer (2014a), and of the services that provide access to the resulting RDF. This enables the CLARIN community to experiment with RDF representations of the CMD records, and to get a sense of their potential and the opportunities for cross-fertilisation with other Linked Data resources like those found in the LLOD cloud. The results of this project are described in the main part of this chapter. The first two sections provide a short summary of both CMDI and the Linked Data paradigm, and the chapter ends with the current status of CMD2RDF and future plans for it.

The Component Metadata Infrastructure
The basic building blocks of CMDI are, not surprisingly, components. A component focuses on a specific aspect of a (linguistic) resource and groups together metadata elements, which can be used to capture information, and other components. For example, an address component contains the elements street, city and country. This component could be reused by a contact person or an organisation component. The infrastructure provides a Component Registry for metadata modellers to share and reuse components. The registry is accompanied by an editor, which allows adapting components to specific needs or creating completely new ones. In the end a modeller creates metadata profiles, i.e., collections of metadata components targeted at a specific resource type, e.g., a historic text or an audio recording of an endangered language. A CMD profile is a tree-based structure in which the nodes are components, one of which is the root of the tree, and the leaves are elements. This tree can be mapped very naturally to XML, and thus an XML Schema (XSD; Gao, Sperberg-McQueen and Thompson, 2012) can be used to validate whether a CMD record is compliant with a specific profile. In CLARIN various tools, e.g. online and offline editors, have been developed to create and maintain valid CMD records (also known as metadata descriptions). This core of CMDI, the Component Metadata model, is visualised in Figure 8. CLARIN centres offer the CMD records they create for harvesting via the Open Archives Initiative's Protocol for Metadata Harvesting (OAI-PMH; Lagoze and Van de Sompel, 2015). Central CLARIN services, like the Virtual Language Observatory (VLO; CLARIN, 2016b), provide access to the full set of harvested CMD records.

Linked Open Data
Linked Data, open or closed, has become increasingly popular. In this paradigm graphs are constructed out of triples consisting of a subject, a predicate and an object. The object of a triple can be the subject of another triple, thus building the graph. All parts of the triple can be identified (nodes, i.e., subjects or objects) or typed (nodes or edges, i.e., predicates) with an Internationalized Resource Identifier (IRI; Dürst and Suignard, 2005), most commonly a Uniform Resource Locator (URL; Berners-Lee, Masinter and McCahill, 1994). A coherent vocabulary of types is commonly described in an RDF Schema (RDFS; Brickley and Guha, 2014) or extensions thereof. Many RDF vocabularies (Open Knowledge Foundation, 2016) exist and some are frequently reused. Graphs are linked with each other when they share an IRI. In this way large graphs like the Linked Open Data (LOD) cloud (Cyganiak and Jentzsch, 2016) and the Linguistic Linked Open Data (LLOD) cloud (LIDER project, 2016) can be identified.
Access to (parts of) these graphs is mostly provided in two ways: 1) as downloads in one or more of the various RDF serialisations, and/or 2) via SPARQL (W3C SPARQL Working Group, 2013) query endpoints. In the latter case the graphs are in general stored in a triple store, i.e., a system for managing (large) sets of triples, comparable to a Relational DataBase Management System (RDBMS) for structured data.
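To make the triple model concrete, the following minimal Python sketch represents a graph as a set of (subject, predicate, object) tuples and shows how two graphs become linked once they share an IRI. The example.org IRIs and the helper name `objects` are illustrative assumptions and are not taken from any actual dataset.

```python
# A triple is a (subject, predicate, object) tuple; a graph is a set of triples.
# The example.org IRIs are hypothetical and serve only as an illustration.
RECORD = "http://example.org/record/1"
DUTCH = "http://example.org/lang/nld"
DC_LANGUAGE = "http://purl.org/dc/terms/language"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

record_graph = {(RECORD, DC_LANGUAGE, DUTCH)}
language_graph = {(DUTCH, RDFS_LABEL, "Dutch")}


def objects(graph, subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return {o for (s, p, o) in graph if s == subject and p == predicate}


# The union of both sets is the merged graph; the shared IRI DUTCH links them,
# because it is the object of one triple and the subject of another.
merged = record_graph | language_graph

# Follow two edges: record -> language -> label.
labels = set()
for language in objects(merged, RECORD, DC_LANGUAGE):
    labels |= objects(merged, language, RDFS_LABEL)
print(labels)
```

A triple store does essentially this at scale, with indexes and a query language (SPARQL) in place of the hand-written loop.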

The CMD2RDF Bridge
The aim of the CMD2RDF project has been to bring all of the CLARIN CMD record collection to the Linked Data cloud. For this the XML-based records have to be transformed into RDF without loss of information (note that this goal is different from the approach taken by an aggregator like LingHub (McCrae et al., 2015), where only a subset of the information, i.e. in the case of LingHub the set already mapped to Dublin Core (DC; Dublin Core Metadata Initiative, 2016) by the OAI-PMH provider, is transformed to RDF). The flexibility of CMDI also means that in such a generic transformation a fixed metadata RDF Schema, like the Data Catalog Vocabulary (DCAT; Maali and Erickson, 2014), is not directly applicable, as it would require hand-crafted and maintained mappings to the fixed schema for every CMD profile encountered. But as shown below, more generic RDF vocabularies do play a role in the transformation. The resulting graphs should also be accessible, either as a download or via a SPARQL endpoint. The next subsections describe the approaches taken to tackle these issues.

The Component Model and RDF
A CMD record is an instance of a CMD profile, which in its turn is an instance of the CMD model. Next to the profile-specific part each record also uses a generic envelope, e.g. to provide information on the resources involved. For all these levels and parts an RDF equivalent has to be created. The following description is short: it highlights some issues and the design choices made to resolve them, and consists mainly of examples. Durco and Windhouwer (2014a) give a full description of the mapping of all these levels and parts.
In the CMD model the main building block, the CMD component, naturally corresponds to an RDFS class. A CMD profile can be seen as a specialisation of a component, so it is a subclass of the RDFS class for component. It seems natural to map a CMD element to an RDF property. However, a CMD element is more complex than an RDF property, i.e., it can carry additional information in the form of attributes. To be able to retain this information in the mapping a CMD element also has to be mapped to an RDFS class. In RDF, as opposed to XML, the nesting of CMD components or elements in a CMD component needs a predicate. For this the very generic contains predicate is introduced. To retain consistency attributes are modelled in a similar way as elements. In the resulting RDF Schema, prefixes such as cmd1: and cmd2: are bound to component-specific IRIs, i.e., the URL of the component specification in the CMDI Component Registry.
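As an illustration of this mapping, the sketch below derives RDFS triples from a small nested component specification. It is one possible reading of the description, not the mapping as defined by Durco and Windhouwer (2014a): the `CMDM` namespace IRI, the class names `Component` and `Element`, the example.org base IRI and the dictionary layout of the specification are all assumptions made for this example.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
CMDM = "http://example.org/cmdm#"  # hypothetical namespace for the CMD model


def schema_triples(component, base):
    """Return RDFS triples for a nested component specification.

    component: {"name": str, "elements": [str], "components": [dict, ...]}
    base: hypothetical IRI prefix standing in for the Component Registry URL
    """
    triples = []
    iri = base + component["name"]
    # A component becomes an RDFS class specialising the generic component class.
    triples.append((iri, RDF_TYPE, RDFS + "Class"))
    triples.append((iri, RDFS + "subClassOf", CMDM + "Component"))
    for name in component.get("elements", []):
        # Elements become classes too, so that their attributes can be retained.
        triples.append((base + name, RDF_TYPE, RDFS + "Class"))
        triples.append((base + name, RDFS + "subClassOf", CMDM + "Element"))
    for child in component.get("components", []):
        triples.extend(schema_triples(child, base))
    return triples


address = {"name": "Address", "elements": ["street", "city", "country"]}
contact = {"name": "Contact", "elements": ["name"], "components": [address]}
schema = schema_triples(contact, "http://example.org/components/")
for triple in schema:
    print(triple)
```

The sketch names each element by its bare name only, which is sufficient here because no name occurs twice in the example specification.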
A complicating matter is that although a component or element has a unique name among its siblings, within a single component specification a name can very well be ambiguous, so context has to be taken into account. This is done by adding the context to the IRI of a component or element; e.g., cmd2:Actor Languages Language represents a Language element nested in a Languages component which itself is nested in a reusable Actor component. Now that a CMD profile can be transformed into an RDF Schema, an actual CMD record can also be transformed. The core of such a record is formed by its instantiation of the component hierarchy allowed by the profile. This hierarchy can be instantiated using RDF blank nodes, but the IRI of a record extended with a locally unique identifier can also be used.
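The following sketch illustrates both ideas: context-qualified class names and the instantiation of a component hierarchy with blank nodes. The separator used to join the context path (a dot), the `contains` and `hasElementValue` IRIs, the example.org base IRI and the dictionary layout of the record are assumptions made for this example only.

```python
import itertools

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
CONTAINS = "http://example.org/cmdm#contains"      # hypothetical IRI
VALUE = "http://example.org/cmdm#hasElementValue"  # hypothetical IRI


def record_triples(node, base, path=(), counter=None):
    """Return (node_id, triples) for a nested record instance.

    node: {"name": str, "value": str (elements only), "children": [node, ...]}
    """
    if counter is None:
        counter = itertools.count(1)
    path = path + (node["name"],)
    node_id = "_:b%d" % next(counter)  # blank node label
    # The class IRI carries the full context, so equal names stay distinct.
    triples = [(node_id, RDF_TYPE, base + ".".join(path))]
    if "value" in node:
        triples.append((node_id, VALUE, node["value"]))
    for child in node.get("children", []):
        child_id, child_triples = record_triples(child, base, path, counter)
        # Nesting is made explicit with the generic contains predicate.
        triples.append((node_id, CONTAINS, child_id))
        triples.extend(child_triples)
    return node_id, triples


actor = {
    "name": "Actor",
    "children": [
        {"name": "Languages",
         "children": [{"name": "Language", "value": "nld"}]},
    ],
}
root, triples = record_triples(actor, "http://example.org/components/")
for triple in triples:
    print(triple)
```

Replacing the blank node labels by the record IRI extended with a locally unique identifier would give the alternative instantiation mentioned above.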
In a CMD record the profile-specific payload is placed inside a generic CMD envelope, which contains information about the resources involved and metadata about the records themselves, e.g. who has created them and when. This part is also mapped to RDF. As it is more generic, it was possible to reuse existing RDF vocabularies: Dublin Core for the metadata, Open Annotation (OA; W3C Web Annotation Working Group, 2016) for the relation between the profile-specific part and the resources, and the Open Archives Initiative's Object Reuse and Exchange vocabulary (ORE; Open Archives Initiative, 2016) for the relationships of the record with other CMD records.

From Harvesting CMD to Providing RDF
Using the RDF mapping described above any CMD record can be transformed. However, to be of actual use the continuously evolving CLARIN-wide collection of CMD records would have to become available in the Linked Data cloud. To achieve this goal the system architecture depicted in Figure 8.2 was implemented in the CMD2RDF project.
CMD records provided by the CLARIN centres are regularly harvested by the CLARIN OAI-PMH harvester. As the harvester currently does not support incremental harvests, and since even if it did not all centres would necessarily support them, the CMD2RDF conversion pipeline determines which records are new or updated and transforms those into RDF. These RDF records and the RDFS of the components and profiles involved are stored in the Virtuoso triple store (OpenLink Software, 2016). Virtuoso supports a SPARQL endpoint and RESTful access to the RDF graphs, each of which corresponds to a CMD record. CMD2RDF puts a proxy in front of those to be able to (potentially) control the access, e.g. to prevent overly heavy SPARQL queries. The resulting service is available at: catalog.clarin.eu/ds/cmd2rdf
Another important aspect of the CMD2RDF conversion pipeline is the ability to also enrich the CMD or RDF representations. This makes it possible to introduce links to other datasets, i.e., determine the place of a CMD record in the Linked Open Data cloud and especially in the Linguistic Linked Open Data cloud.
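How the pipeline decides which records are new or updated is not detailed here. One simple way to do so after a full harvest, sketched below purely as an assumption, is to compare a checksum of each harvested record with the checksum stored during the previous run. The function name `changed_records` and the record identifiers are hypothetical.

```python
import hashlib


def changed_records(harvested, known_digests):
    """Return the identifiers of records that are new or whose content changed.

    harvested: {record_id: xml_bytes} from the latest full harvest
    known_digests: {record_id: sha256 hex digest} from the previous run;
                   updated in place so that it is ready for the next run
    """
    changed = []
    for record_id, xml_bytes in harvested.items():
        digest = hashlib.sha256(xml_bytes).hexdigest()
        if known_digests.get(record_id) != digest:
            changed.append(record_id)
            known_digests[record_id] = digest
    return changed


known = {}
first = changed_records({"a": b"<CMD>1</CMD>", "b": b"<CMD>2</CMD>"}, known)
second = changed_records({"a": b"<CMD>1</CMD>", "b": b"<CMD>2b</CMD>"}, known)
print(first, second)
```

Only the records returned by such a check would need to be transformed and reloaded into the triple store. Records that disappeared from the harvest are not handled by this sketch.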

CMD2RDF and LLOD
In the CMD2RDF system architecture CMD records can be enriched with links to other LLOD datasets. The main linking pins for linguistic datasets are of course languages. The most prominent set of language codes is ISO 639-3 (Summer Institute for Linguistics, 2016), which is represented by DBpedia (2016) IRIs in the LOD cloud. Due to the heterogeneous nature of CMDI these codes can appear anywhere in a CMD record. However, due to the semantic network (Durco and Windhouwer, 2014b) that overlays the CLARIN collection of CMD records these places can be identified. Currently CMD2RDF uses the approach used for the VLO facet mapping (Van Uytvanck, Stehouwer and Lampen, 2012) and includes the resulting facets explicitly. To retain the original value next to the IRI identified by the enrichment process the cmdm:hasElementEntity predicate (which is specialised by specific enrichments like the VLO facets) was introduced:
    <hdl:123/456>
        vlo:hasFacetISO6393ElementValue "nld" ;
        vlo:hasFacetISO6393ElementEntity <http://dbpedia.org/resource/ISO_639:nld> .
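The enrichment step in the example above can be sketched as a small function that keeps the original value and adds the linked entity. The DBpedia IRI pattern is taken from the example. The namespace IRI bound to the vlo: prefix and the function name `enrich_language` are assumptions, and the sketch only checks the shape of a code rather than validating it against the ISO 639-3 code list.

```python
DBPEDIA_ISO_639 = "http://dbpedia.org/resource/ISO_639:"  # pattern from the example
VLO = "http://example.org/vlo#"  # hypothetical namespace IRI for the vlo: prefix


def enrich_language(record_iri, value):
    """Return two triples: the original value and the linked DBpedia entity."""
    code = value.strip().lower()
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError("not shaped like an ISO 639-3 code: %r" % value)
    return [
        # The value is kept exactly as found in the CMD record.
        (record_iri, VLO + "hasFacetISO6393ElementValue", value),
        # The entity is the IRI that links the record into the LOD cloud.
        (record_iri, VLO + "hasFacetISO6393ElementEntity", DBPEDIA_ISO_639 + code),
    ]


enriched = enrich_language("hdl:123/456", "nld")
for triple in enriched:
    print(triple)
```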
As a showcase the WALS dataset (Dryer and Haspelmath, 2013) was also loaded into Virtuoso. Now SPARQL queries can be issued that involve both CMD records and linguistic content, i.e., WALS. The following query is an example of this:
    SELECT DISTINCT ?resource ?mimetype ?language ?value WHERE {
      ?feature dcterms:references wals:9A .
      ?feature dcterms:hasPart/rdfs:label ?value .
      ?feature ^dcterms:isReferencedBy/owl:sameAs ?language
      GRAPH ?g {
        ?cmd vlo:hasFacetISO6393ElementEntity ?language .
        ?cmd oa:hasTarget ?resource .
        ?resource cmdm:hasMimeType ?mimetype .
      }
    }
This query returns the locations (?resource) of multimedia (?mimetype) resources for languages (?language, taken from the RDF graph ?g, which represents the CMD record ?cmd) where WALS contains information (?value) on a typological feature (?feature), i.e., the distribution of the sound ŋ (the velar nasal, which is WALS feature 9A). The example SPARQL queries at catalog.clarin.eu/ds/cmd2rdf include this query, so its current result can be inspected there.
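A query like the one above can be sent to any endpoint that implements the SPARQL 1.1 Protocol by passing it as the `query` parameter of an HTTP GET request. The sketch below only builds such a request with the Python standard library. The endpoint URL is a placeholder, as the exact path of the CMD2RDF endpoint is not given here, and the short query merely stands in for the full example.

```python
from urllib.parse import urlencode
from urllib.request import Request


def sparql_get_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


ENDPOINT = "https://example.org/sparql"  # placeholder, not the actual endpoint
QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"

request = sparql_get_request(ENDPOINT, QUERY)
print(request.full_url)
# Against a real endpoint the request could be executed with
# urllib.request.urlopen(request) and the response parsed with json.load().
```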
Similar queries that cross (multiple times) the boundaries between metadata and content can easily be envisioned. For example, the new Lexicon Model for Ontologies (Ontolex; Cimiano, McCrae and Buitelaar, 2016), which is an RDF-based model, would enable one to query for the word for a concept, e.g., peace or love, in a specific language, and via CMD2RDF time segments in annotated media could be found where this word is uttered. Several lexica are available in Ontolex or its RDF-based predecessors, but the use of RDF for time-based annotations is not so common.
The example query also shows that quite intimate knowledge of the usage of specific RDF vocabularies by the involved datasets is still needed, but this is to be expected for structured queries where one has to know the structure, as opposed to full text or faceted search. Writing a SPARQL query like this is a task for a technically savvy and adventurous user, so for the average user easier interfaces will need to be provided. The CMD2RDF service does include a general RDF browser, which allows some basic interaction with the SPARQL endpoint, but for more domain-specific interaction expert user interfaces with more built-in knowledge of the used vocabularies are needed.

Current Status and Future Plans
For a while the CMD2RDF service was hosted by the Max Planck Institute for Psycholinguistics, but due to strategic decisions by this CLARIN centre the service had to be moved and, as a medium-term solution, is now hosted by the Meertens Institute. However, the generic CLARIN URL redirect at catalog.clarin.eu/ds/cmd2rdf will take any user to the current host.
In the new Dutch CLARIAH (2016) project, which covers both linguistics and the broader Digital Humanities, there is an agreement to use RDF as a lingua franca and to merge information obtained from different sources. The CLARIAH approach for the linguistics work package will be based on the CMD Infrastructure for compatibility with CLARIN; however, it will also offer Linked Data via the CMD2RDF service for use by others.
To also enable the discovery and use of interesting resources created within non-linguistic work packages in CLARIAH, an inverse procedure, i.e., RDF2CMD, is required, which, if sufficiently scalable, will also make the Linked Data for Language Resources (LR) outside CLARIAH available for CLARIN.
With respect to the procedure to facilitate this transformation of RDF-encoded LR metadata, the plan is to investigate a number of different strategies. All strategies will start with a PID (Persistent IDentifier) or URI (Uniform Resource Identifier) of an LR and then search a suitable source, e.g. a SPARQL endpoint or an RDF data set, for statements related to this resource. The collected RDF statements are aggregated and processed. The RDF2CMD mapping can then use, for example, the following strategies:
• Comparison strategy: the collected RDF is compared to a number of RDF templates that were derived from a set of records, which instantiate recommended CMD profiles. A suitable proximity measure will then select the closest template, after which the original CMD profile can be instantiated with the correct values.
• Building strategy: the collected RDF is inspected and every triple is considered for implying a component or element in a dedicated CMD profile. The generated profile may be unique and can be 'shaved' of linguistically uninteresting non-linguistic adornments.
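The comparison strategy leaves the choice of proximity measure open. As a minimal sketch of how such a selection could work, the example below uses the Jaccard similarity between the set of predicates found in the collected RDF and the predicate set of each template. The choice of Jaccard similarity, the template names and the abbreviated predicate names are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def closest_template(collected, templates):
    """Return the name of the template that best matches the collected RDF.

    collected: iterable of (subject, predicate, object) triples
    templates: {profile_name: set of predicate names}
    """
    predicates = {p for (_, p, _) in collected}
    return max(templates, key=lambda name: jaccard(predicates, templates[name]))


# Hypothetical templates, each derived from records of a recommended profile.
templates = {
    "DublinCore": {"dc:title", "dc:creator", "dc:language"},
    "Media": {"dc:title", "ex:duration", "ex:mimeType"},
}
collected = [
    ("ex:resource", "dc:title", "A title"),
    ("ex:resource", "dc:creator", "A creator"),
]
best = closest_template(collected, templates)
print(best)
```

A measure used in practice would probably also have to treat predicates as equal when a semantic registry relates them to the same concept, rather than comparing names literally as this sketch does.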
As minimal functionality, RDF2CMD should support round-trip conversion from a CMD record to RDF and back to CMD without loss of information, but the 'perfect' translation from Dublin Core RDF statements to the CMD Dublin Core profile should also be mandatory, a requirement which can be extended to some other popular metadata schemas.
In the proximity measure the semantic registries, e.g. the CLARIN Concept Registry (Schuurman et al., 2016), the Dublin Core metadata elements and terms, and special Linked Data repositories like Schema.org (2016) and sameas.org (2016), will play an important role.

Conclusion
This first full-fledged implementation of the mapping of Component Metadata to Linked Data already enables powerful queries that cross the line between metadata and content, a line that is in general prominent in the traditional metadata domain but less so in Linked Data. The future plans outlined will make it possible to switch back and forth more easily between these XML-based and RDF-based approaches, making the information on language resources in the CLARIN infrastructure more widely available.