FoLiA in Practice. The Infrastructure of a Linguistic Annotation Format

We present an overview of the software and data infrastructure for FoLiA, a Format for Linguistic Annotation developed within the scope of the CLARIN-NL project and other projects. FoLiA aims to provide a single unified file format accommodating a wide variety of linguistic annotation types, preventing the proliferation of different formats for different annotation types. FoLiA is being developed in a bottom-up and practice-driven fashion. We have invested mainly in the creation of a rich infrastructure of tools that enable developers and end-users to work with the format. This work will present the current state of this infrastructure.


Introduction
CLARIN's aim is to deliver an infrastructure for researchers that work with language data and tools. This is impossible without agreeing on standards with regard to data formats. Standardisation is an important prerequisite for good interoperability between the many language tools that have emerged within and outside of the scope of the CLARIN project, and to ensure the various datasets released are usable in practice.
In the field, however, we often encounter an abundance of ad-hoc formats. We define ad-hoc formats to be data formats that are characterised by most of the following traits:
• They are only used once, often by one specific tool or for just one specific purpose;
• They are poorly formalised or not formalised at all, i.e. there is a lack of a formal schema and semantics;
• They are poorly documented;
• They are often rigid and hard to extend.
The use of such ad-hoc formats can be considered the opposite of proper standardisation and is to be avoided in any large infrastructure project.
CLARIN adheres to the following principles when it comes to standardisation:
• Open standards are preferred over proprietary standards;
• Formats and protocols should be well documented, verifiable, and proven (being used in practice);
• Text-based formats are (where possible) preferred over binary formats.
Fortunately, there are various initiatives for standardisation resulting in annotation formats that transcend the ad-hoc level, each with their own merit, ours being one of them. At the outset of CLARIN-NL, however, the Dutch and Flemish Natural Language Processing (NLP) community lacked such a proper standard with respect to linguistically annotated text, and ad-hoc formats were prevalent in the field. In the scope of CLARIN-NL project TTNWW (see chapter 7), the NWO project DutchSemCor, and the STEVIN project SoNaR, FoLiA (Format for Linguistic Annotation) was developed as a solution to accommodate the representational needs of these projects.
The aim of FoLiA is to provide a practical standard, following a generic paradigm, for the linguistic annotation of primarily written text. For this purpose, a wide variety of linguistic annotation types is supported.
In the current chapter, we focus on the practical nature of the format, or rather, on the infrastructure that is built around the format, the software that supports it, and the ways in which it has been put to use in CLARIN and beyond. Section 6.3 will explain the philosophy behind FoLiA and its infrastructure.
Earlier work (van Gompel and Reynaert, 2013) addresses the motivation for the creation of FoLiA. In summary, FoLiA sprang from a limited corpus format used in the Dutch and Flemish NLP communities (Apperloo, 2006), at a time and place where a more comprehensive format was needed for various corpora and tools in development. Existing solutions often did not sufficiently meet the needs at the time, were not mature yet, or were simply not well known.
FoLiA currently represents one of various possible solutions. We claim that its merit is best decided on its practical usability with respect to the user's specific purpose. Focus on the practical dimension, i.e. the availability of hands-on tools and libraries, was in fact a key reason for the creation of yet another format. The tools, libraries and existing FoLiA-delivered corpora described in the current work are intended to help people assess whether it is an appropriate solution for their tasks.
The aforementioned work (van Gompel and Reynaert, 2013) presents a comparison with similar initiatives such as the D-Spin Text Corpus Format (TCF), PAULA XML, and XCES, as well as with more abstract frameworks such as LAF (Linguistic Annotation Framework) and comprehensive text-encoding formats such as TEI. In summary, the prior study observes that LAF (Ide and Romary, 2004) is not a format but an abstract framework, offering a greater level of abstraction and genericness than FoLiA, whereas FoLiA is more specific and aims at the practical level. This makes FoLiA more readily adoptable in software tools. In the comparison with TEI (Burnard and Bauman, 2007), it was observed that TEI is very extensive and specific when it comes to encoding text structure, but that FoLiA is more specific when it comes to linguistic annotation types, for which TEI only offers more abstract solutions. TEI's extensiveness also makes it fairly complex; schemas may come in various flavours, as elements can be adapted by users in many ways. FoLiA instead offers one single specific solution: the format is a given, and the flexibility to customise is deliberately limited to the data categories or tagsets, in the form of set definitions. Initiatives such as TCF (Heid et al., 2010), PAULA XML (Zeldes et al., 2013), and NAF (Fokkens et al., 2014) are more similar to FoLiA, as they are less abstract and provide practical usability in software. Differences come down to paradigm choices, sustainability, tool availability and documentation maturity, and especially to variation in coverage of available linguistic annotation types and text structure elements.
Full documentation of FoLiA is available elsewhere (van Gompel, 2014). It offers a reference guide to all elements and attributes that FoLiA defines. A brief summary of key features will be repeated in Section 6.2. Section 6.4 subsequently presents the currently available software infrastructure for FoLiA. Section 6.5 presents some corpora that have been delivered in FoLiA.

Overview
FoLiA is an XML-based format and defines specific XML elements for structure annotation (e.g. paragraphs, sentences, word tokens, lists, figures, etc.) and linguistic annotation (e.g. part-of-speech, dependency relations, syntax, named entities, etc.). FoLiA makes use of a combination of inline and stand-off annotation, making proper use of the hierarchical nature of XML and facilitating the job for parsers where possible. FoLiA does not define any linguistic categories; the format is fully language and tagset independent, as tagsets are defined separately in FoLiA Set Definitions by users and never prescribed by FoLiA itself. These tagsets can in turn be related to data category registries. Validation can proceed on a shallow level, against a RelaxNG schema, as well as on a deep level, which validates the used tagsets against the set definition files.
The sets are at the core of the FoLiA paradigm; annotation elements take a generic attribute named 'class'. These classes pertain to a set and are defined by whatever set definition the user decides to use. The set definition defines all allowed classes and allows for links with data category registries for formal semantic closure.
Other generic attributes besides 'class' are attributes to denote the annotator of a particular annotation, the annotator type (human or machine), the confidence level of the annotation, the time of the annotation, and more.
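The class/set paradigm and the generic attributes can be made concrete with a small sketch. The fragment below is a simplified, hand-written FoLiA-like snippet, not a complete document (a real FoLiA document also carries a metadata header with annotation declarations). The set URLs and annotator names are made up for illustration, and the helper describe_annotations is hypothetical; it uses only Python's standard XML parser rather than the FoLiA library.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"  # FoLiA's XML namespace

# Simplified fragment; set URLs and annotator names are invented for this example.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1">
        <t>house</t>
        <pos set="https://example.org/sets/pos" class="N"
             annotator="tagger-x" annotatortype="auto" confidence="0.93"/>
        <lemma set="https://example.org/sets/lemma" class="house"
               annotator="jane" annotatortype="manual"/>
      </w>
    </s>
  </text>
</FoLiA>"""


def describe_annotations(xml_text):
    """List every class-bearing annotation found directly on a word token."""
    root = ET.fromstring(xml_text)
    rows = []
    for word in root.iter(f"{{{FOLIA_NS}}}w"):
        for child in word:
            if "class" not in child.attrib:
                continue  # skip elements without a class, such as <t> here
            confidence = child.get("confidence")
            rows.append({
                "type": child.tag.split("}", 1)[1],  # local element name
                "set": child.get("set"),
                "class": child.get("class"),
                "annotator": child.get("annotator"),
                "annotatortype": child.get("annotatortype"),
                "confidence": float(confidence) if confidence else None,
            })
    return rows


if __name__ == "__main__":
    for row in describe_annotations(SAMPLE):
        print(row)
```

Note that the element name (pos, lemma) identifies the annotation type, while the meaning of the class value is left entirely to the set that the element points to.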
FoLiA also allows for various types of higher-order annotation, such as the ability to include alternative annotations, as well as extensive support for corrections on annotations. Moreover, there is the possibility to link other modalities, such as imagery or audio fragments of speech, to structural elements. So, even though FoLiA is primarily a format to annotate text documents, speech transcripts are supported as well.
For metadata, CLARIN-NL was committed to the CMDI standard (Broeder et al., 2011). Although FoLiA has simple native support for metadata, we see no sense in reinventing the wheel, and FoLiA is ideally used in combination with an external metadata format such as CMDI whenever extensive metadata is desired. A reference to the metadata file can be made in the header of the FoLiA document.
The FoLiA paradigm laid out here is schematically illustrated in Figure 6.1 (van Gompel, 2014). A more in-depth treatise is beyond the scope of this current chapter.

Our philosophy
In keeping with the CLARIN principle that a format should be proven and used in practice, FoLiA has been designed in a bottom-up manner that takes this principle especially to heart. Our focus is to solve real problems people face in the field with regard to their linguistic representation needs, and to do so in a generic manner. The ambition is to deliver a single unified file format that can effectively handle a multitude of annotation needs in a generic way. The main motivation is to prevent the need to switch formats whenever an extra annotation type is introduced, and to prevent the scenario in which a plethora of different formats are used for different annotation types. It is nevertheless always conceivable that a user's particular need is not yet covered by the latest version of FoLiA; in such cases we gladly hear from the user and expand FoLiA where necessary, in collaboration with the user. The development of FoLiA has already proceeded for several years in such a collaborative workflow, and various annotation types have been added in close contact with end-users both from within CLARIN and from beyond.
In our philosophy, the creation of a file format is useless if an infrastructure of tools to work with said format is not simultaneously created. This has therefore been our main focus over the years and will be the subject of the next section.

Software Infrastructure
When we speak of a FoLiA software infrastructure we refer to a published set of software, from whatever sources and for whatever architecture, that enables people to work with FoLiA. Such an infrastructure in simple terms encompasses anything that can either process or deliver the data in the format. We can subdivide it into the following components:
1. programming libraries;
2. tools for validation;
3. tools for conversion from and to other formats;
4. tools for visualisation;
5. tools for searching/querying;
6. editing tools;
7. special-purpose tools, i.e. specialised tools that use the format but are not necessarily focused on it. In the case of FoLiA, this includes Natural Language Processing or Information Retrieval tools that use the format as input and/or output.
The programming libraries and tools that are purely designed to visualise, manipulate, or convert the format in basic ways can be considered part of a core layer of the infrastructure, whereas the special-purpose tools can be considered to constitute an outer layer.
As FoLiA is an XML-based format, the rich and well-established XML infrastructure is open to its users as well. In fact, almost all FoLiA tools effectively rely on the existing software infrastructure available for XML.
It is possible to not use any of the FoLiA-specific tools and use the infrastructure offered by XML directly. For instance, one can use XPath to query a FoLiA document and XSL to transform it. To do so effectively, however, the user/developer needs to be more familiar with the intricacies of FoLiA than when using a tool from the FoLiA infrastructure that abstracts over this for the benefit of the user/developer.
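As an illustration of querying with generic XML tooling, and of one intricacy that such a query has to account for, the sketch below uses the limited XPath support in Python's standard library. The sample fragment is simplified and hand-written, and the helper words_with_pos is hypothetical. The point it shows is that a naive descendant query also matches annotations wrapped in an alternative (alt) element, which are not the authoritative annotation of the word.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}  # prefix "f" bound to the FoLiA namespace

# Simplified fragment: "walk" is tagged N, with V recorded only as an alternative.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc">
  <text xml:id="doc.text">
    <s xml:id="doc.s.1">
      <w xml:id="doc.s.1.w.1"><t>the</t><pos class="DET"/></w>
      <w xml:id="doc.s.1.w.2"><t>walk</t><pos class="N"/>
        <alt xml:id="doc.s.1.w.2.alt.1"><pos class="V"/></alt>
      </w>
      <w xml:id="doc.s.1.w.3"><t>helps</t><pos class="V"/></w>
    </s>
  </text>
</FoLiA>"""


def words_with_pos(root, pos_class, include_alternatives=False):
    """Return the text of all words carrying the given part-of-speech class.

    By default only <pos> elements that are direct children of <w> count;
    with include_alternatives=True, nested ones (e.g. inside <alt>) match too.
    """
    path = ".//f:pos" if include_alternatives else "f:pos"
    query = f"{path}[@class='{pos_class}']"
    hits = []
    for word in root.iterfind(".//f:w", NS):
        if word.find(query, NS) is not None:
            hits.append(word.findtext("f:t", default="", namespaces=NS))
    return hits


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(words_with_pos(root, "V"))
    print(words_with_pos(root, "V", include_alternatives=True))
```

With the direct-child query only "helps" is reported as a verb; the descendant query additionally reports "walk", because it picks up the alternative annotation. A FoLiA-aware library makes this distinction for the user.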
Many of the tools of the core layer are available as command-line tools and are bundled in two software packages: there is a Python-based FoLiA Tools package 1 and a FoLiA Utilities package 2 consisting of tools written in C++. Both are built on the respective libraries. There is some overlap in tools, but each also offers distinct tools the other does not. It is therefore recommended to install both.
These packages, and all other tools pertaining to the FoLiA infrastructure which have been developed at Radboud University, are bundled in our LaMachine distribution. 3 LaMachine greatly facilitates installation of this software and is a recommended starting point if you work with FoLiA. It is available as a Virtual Machine, a Docker package or a local compilation & installation script.
We subscribe strongly to the CLARIN principle that standards should be open and place a similar requirement on the infrastructure components we build.

Programming Libraries
At the heart of the FoLiA infrastructure are the programming libraries that enable developers to work with documents in the format in their software. We ourselves offer libraries for both Python and C++.
Python is a widely popular high-level programming language in the academic world, and the NLP world in particular. The Python library for FoLiA enables developers to quickly integrate support for FoLiA in their scripts. The library is part of the larger PyNLPl library 4 and is also available from the Python Package Index. 5 It is extensively documented and comes with tutorials for users.
The Python library suffers from the performance drawback that any high-level interpreted language has. Whenever faster processing is required, or integration in high-performance tools is desired, libfolia, 6 the FoLiA library for C++, offers a better solution. The library is modelled after the Python library, so both are similarly structured and employ a similar syntax, and the respective authors try their best to keep the libraries in sync.
A third popular language in the field is Java, but to our knowledge no Java-based FoLiA library is available yet. A number of Java-based tools in the FoLiA infrastructure have nevertheless been developed, without a common underlying FoLiA library.

Validation
We already touched upon the notion of shallow and deep validation. FoLiA's syntax is formalised in a RelaxNG schema, and shallow validation can therefore be done using any XML validator with support for RelaxNG.
The tools foliavalidator and folialint 7 also perform shallow validation, and their usage is strongly recommended, or should even be considered mandatory, for anybody producing FoLiA documents. Moreover, the former tool can optionally perform deep validation as well, i.e. it can validate the used classes against the set definitions.
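The idea behind deep validation can be sketched in a few lines. This is not how foliavalidator is implemented: the set definitions are replaced here by a hypothetical in-memory dictionary mapping a set URL to its allowed classes, the sample fragment is simplified, and only elements that carry an explicit set attribute are checked. Shallow validation against the RelaxNG schema is not covered, as Python's standard library has no RelaxNG support.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for set definition files: set URL -> allowed classes.
SET_DEFINITIONS = {
    "https://example.org/sets/pos": {"N", "V", "DET"},
}

# Simplified fragment with one undefined class and one unknown set.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc">
  <text xml:id="doc.text">
    <s xml:id="doc.s.1">
      <w xml:id="doc.s.1.w.1"><t>walk</t>
        <pos set="https://example.org/sets/pos" class="N"/></w>
      <w xml:id="doc.s.1.w.2"><t>helps</t>
        <pos set="https://example.org/sets/pos" class="VERB"/></w>
      <w xml:id="doc.s.1.w.3"><t>us</t>
        <pos set="https://example.org/sets/other" class="PRON"/></w>
    </s>
  </text>
</FoLiA>"""


def check_classes(xml_text, set_definitions):
    """Report every class that its declared set does not define."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed XML
    problems = []
    for elem in root.iter():
        set_url, cls = elem.get("set"), elem.get("class")
        if set_url is None or cls is None:
            continue  # nothing to check on this element
        name = elem.tag.split("}", 1)[-1]  # local element name
        if set_url not in set_definitions:
            problems.append(f"<{name}>: no definition loaded for set {set_url}")
        elif cls not in set_definitions[set_url]:
            problems.append(f"<{name}>: class '{cls}' is not defined in set {set_url}")
    return problems


if __name__ == "__main__":
    for problem in check_classes(SAMPLE, SET_DEFINITIONS):
        print(problem)
```

A document like the sample would pass a purely structural check, since every element is in an allowed position; only the comparison against the set definition reveals that 'VERB' is not a valid class.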

Conversion
The FoLiA tools and utilities collections contain tools for conversion from and to various other formats:
• Conversion to plaintext
• Conversion to HTML
• Conversion to simple columned data or to CSV
• Conversion from/to reStructuredText 8
• Conversion from/to DCOI XML format (Apperloo, 2006)
• Conversion from the Alpino XML format
• Conversion from ALTO XML format 9
• Conversion from hOCR HTML format (Breuel, 2007)
• Conversion from PAGE XML format 10
Conversions may be limited by the source or target format. Conversion to FoLiA's predecessor DCOI XML, for instance, is only possible for the subset of elements that DCOI supports. Similarly, conversion to reStructuredText is limited to text, its structure and markup, and does not include linguistic annotations.
Besides the in-house developed FoLiA tools, third parties also make available converters from or to FoLiA. A notable case is OpenConvert, 11 developed by the former Institute for Dutch Lexicology (INL), now Institute for the Dutch Language (INT), which can convert from TEI, plaintext, ALTO, Microsoft Word, and HTML to FoLiA.
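To give an impression of what a conversion to simple columned data involves, and of why such a conversion is lossy, here is a minimal sketch using Python's standard library. It is not the converter shipped with the FoLiA tools; the function folia_to_columns, the column layout and the "_" placeholder are assumptions made for this example, and the input is a simplified fragment.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}

# Simplified two-sentence fragment; the last word has no lemma annotation.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc">
  <text xml:id="doc.text">
    <s xml:id="doc.s.1">
      <w xml:id="doc.s.1.w.1"><t>The</t><pos class="DET"/><lemma class="the"/></w>
      <w xml:id="doc.s.1.w.2"><t>cat</t><pos class="N"/><lemma class="cat"/></w>
    </s>
    <s xml:id="doc.s.2">
      <w xml:id="doc.s.2.w.1"><t>Sleeps</t><pos class="V"/></w>
    </s>
  </text>
</FoLiA>"""


def folia_to_columns(xml_text, missing="_"):
    """Render one token per line as text, pos and lemma, separated by tabs.

    Sentences are separated by a blank line. Anything that does not fit the
    three columns (sets, annotators, other annotation types) is dropped.
    """
    root = ET.fromstring(xml_text)
    blocks = []
    for sentence in root.iterfind(".//f:s", NS):
        rows = []
        for word in sentence.iterfind("f:w", NS):
            text = word.findtext("f:t", default=missing, namespaces=NS)
            pos = word.find("f:pos", NS)
            lemma = word.find("f:lemma", NS)
            rows.append("\t".join([
                text,
                pos.get("class", missing) if pos is not None else missing,
                lemma.get("class", missing) if lemma is not None else missing,
            ]))
        blocks.append("\n".join(rows))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(folia_to_columns(SAMPLE))
```

The target format determines what survives: the three columns keep token text and two class values, while identifiers, sets and any further annotation layers have no place to go.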

Visualisation
An XSL stylesheet is available to visualise FoLiA documents. It renders documents and unobtrusively pops up with annotation information when hovering over structural items such as words. A major advantage is that this form of visualisation can be conducted entirely client-side in nearly every web browser. The folia2html conversion tool also employs the same stylesheet.
7 Part of respectively FoLiA Tools and FoLiA Utilities
8 http://docutils.sourceforge.net/rst.html
9 http://www.loc.gov/standards/alto/
10 http://www.primaresearch.org/tools
11 https://github.com/INL/OpenConvert

Searching
Tools for searching and querying FoLiA documents can be divided into two categories: 1. In-document search and 2. Document retrieval systems / corpus search tools.
At a low level, in-document search can be conducted with the command-line tool foliaquery, part of the FoLiA tools. This tool reads one or more FoLiA documents in memory (sequentially), executes a search query, and presents the matching results. This, however, is not a solution that scales to large numbers of documents as it takes a fair amount of time and memory to process a document.
Full document retrieval systems do not rely on such costly real-time processing of the FoLiA documents, but construct smart indices from the original documents and operate on these indices. The corpus retrieval engine BlackLab 12, based on Apache Lucene, and the front-end WhiteLab (Reynaert et al., 2014) (see chapter 19) are examples of this. WhiteLab was developed in the CLARIN-NL project OpenSoNaR 13 and can operate on FoLiA documents, as can BlackLab. So far, these engines have typically supported only a simpler subset of the annotation types supported by FoLiA, such as Part-of-Speech tags and lemmas. At the time of writing, there is collaboration, and some competition, between the various developers in the Netherlands to support span annotation types such as dependency relations, syntax and named entities. Another FoLiA-capable search and retrieval system, Multi-Tier Annotation Search (MTAS), has been promised by the Meertens Institute and builds upon Solr and Lucene. It is being developed in the scope of the Nederlab project and the CLARIAH project. This system, however, is still in the early stages of development and has not been released yet.
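The difference between parsing documents at query time and operating on an index can be illustrated with a toy inverted index. This sketch says nothing about how BlackLab, WhiteLab or MTAS work internally; the functions build_index and search are hypothetical, the documents are simplified fragments, and only token-level part-of-speech and lemma annotations are indexed.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id

DOCUMENTS = {
    "a": """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="a"><text xml:id="a.text">
      <w xml:id="a.w.1"><t>cat</t><pos class="N"/><lemma class="cat"/></w>
      <w xml:id="a.w.2"><t>sleeps</t><pos class="V"/><lemma class="sleep"/></w>
    </text></FoLiA>""",
    "b": """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="b"><text xml:id="b.text">
      <w xml:id="b.w.1"><t>cats</t><pos class="N"/><lemma class="cat"/></w>
      <w xml:id="b.w.2"><t>walk</t><pos class="N"/><lemma class="walk"/></w>
    </text></FoLiA>""",
}


def build_index(documents):
    """Parse each document once; map (annotation type, class) to word positions."""
    index = defaultdict(set)
    for doc_name, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for word in root.iterfind(".//f:w", NS):
            word_id = word.get(XML_ID)
            for tag in ("pos", "lemma"):
                elem = word.find(f"f:{tag}", NS)
                if elem is not None and elem.get("class") is not None:
                    index[(tag, elem.get("class"))].add((doc_name, word_id))
    return index


def search(index, **criteria):
    """Return sorted (document, word id) pairs matching all criteria,
    e.g. search(index, pos="N", lemma="cat"). No XML is parsed here."""
    postings = [index.get(key, set()) for key in criteria.items()]
    return sorted(set.intersection(*postings)) if postings else []


if __name__ == "__main__":
    index = build_index(DOCUMENTS)
    print(search(index, pos="N", lemma="cat"))
```

The cost of reading the XML is paid once, at indexing time; each query afterwards is a set intersection. It also hints at why span annotations are harder to support: they do not attach to a single token and so do not fit this simple token-keyed layout.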
As FoLiA is a highly expressive format, the need arose for a query language tuned specifically to the idiosyncrasies of FoLiA. Although FoLiA can be perfectly searched with XPath, formulating a robust query is not always trivial and may require more in-depth knowledge of FoLiA. The FoLiA Query Language (FQL) was designed as a higher-level query language, covering all of FoLiA, to make querying FoLiA documents easier. FQL is implemented alongside the FoLiA Python library in PyNLPl. It is documented as part of the FoLiA documentation (van Gompel, 2014).
FQL is a new and expressive query language specifically attuned to the FoLiA paradigm. People in the field are likely more accustomed to the simpler and established query languages such as CQL, the Corpus Query Language (Christ, 1994), developed at the Corpora and Lexicons group, IMS, at the University of Stuttgart in the early 1990s. For this reason, PyNLPl includes a library that converts CQL to the more expressive but verbose FQL. The low-level query tool makes use of both these libraries. In the next section we will discuss FQL further and introduce higher-level tools in the FoLiA infrastructure that make use of it.

Editing
FQL has been designed in such a way that it is not just a language for passive querying, but a language that allows active manipulation of FoLiA documents. In other words, FQL is to FoLiA as SQL is to relational database tables. Therefore, the foliaquery command-line tool and the FQL library it relies on can be used not just to passively retrieve information, but also to actively edit documents.
A FoLiA document server 14 has been constructed as a back-end for the editing of FoLiA documents. It is implemented as a RESTful webservice, with a simple human interface to manually enter queries, and takes care of on-demand loading and unloading of documents in memory and serialising them to disk. It maintains a browsable document repository, which features git version control support.
Neither the command-line tool nor the document server offers an interface adequate for human end-users to easily work with. To provide such an environment, we have been developing the FoLiA Linguistic Annotation Tool (FLAT) 15. It is a modern web-application that offers an interface for the visualisation and editing of FoLiA documents. Under the hood, user-interface interactions are translated to FQL queries and communicated to the aforementioned FoLiA document server. The motivation for the creation of FLAT, as opposed to the adaptation of existing web annotation environments, was the desire for a solution that seamlessly integrates with FoLiA and adopts the same paradigm. Different design choices implied it would be easier to build this from the ground up.
Although not yet supporting all of FoLiA at the current stage, FLAT has already been used successfully in several annotation projects with student assistants at Radboud University. Further development of FLAT is planned for the CLARIN-NL successor project CLARIAH, with the aim of providing a mature editing environment covering all of FoLiA. FLAT is intended to be deployed as a platform for crowd-sourcing annotation tasks in CLARIAH and other projects. 16

Special-purpose tools
The previous sections discussed tools that can be considered part of the core layer. In this section we will discuss the outer layer of tools; these are tools that either take FoLiA as their input or deliver it as their output to perform a specific and specialised task, usually an NLP (annotation) task. It is an essential layer of the infrastructure and consists of tools such as:
• Ucto 17 - An advanced rule-based tokeniser and sentence-splitter for a variety of languages. Supports FoLiA input and output. Can be used to bootstrap plaintext to tokenised FoLiA.
• Frog 18 - An NLP suite for Dutch, implementing tokenisation (through Ucto), Part-of-Speech tagging, Lemmatisation, Dependency Parsing, Named Entity Recognition, Shallow Parsing and Morphological Analysis. Supports FoLiA input and output.
• CLAM 19 - Turns command-line NLP tools into RESTful webservices with an interface for human end-users. It integrates the FoLiA viewer to visualise FoLiA documents (van Gompel, 2012).
• TICCL 20 - Text-Induced Corpus Clean-up (Reynaert, 2010). Supports FoLiA input and output. Used in the CLARIN-NL projects TICCLops 21 and @PhilosTEI 22 (Reynaert, 2014), see chapter 32.

Conclusion
In this chapter we have described the rich infrastructure that has been developed around the Format for Linguistic Annotation (FoLiA). We emphasised the need for a practical and proven format, in line with CLARIN's standardisation principles, and hence placed the focus for this chapter on the software and data infrastructure. A more extensive overview of FoLiA itself and of the motivation for its inception was presented in earlier work (van Gompel and Reynaert, 2013). Continued efforts in the CLARIN-NL successor project CLARIAH ensure that the developments on the infrastructure surrounding FoLiA will continue in the foreseeable future. FoLiA XML is the pivot format in the project 'Philosophical Integrator of Computational and Corpus Libraries', or PICCL (Reynaert et al., 2015), which is part of CLARIAH.