TTNWW to the Rescue: No Need to Know How to Handle Tools and Resources

‘But I don’t know how to work with [name of tool or resource]’ is something one often hears when researchers in the Human and Social Sciences (HSS) are confronted with language technology, be it written or spoken, tools or resources. The TTNWW project shows that these researchers do not need to be experts in language or speech technology, or to know all kinds of details about the tools involved. In principle they only need to make clear what they want.


Introduction
The idea behind the Flemish/Dutch CLARIN project TTNWW 1 ('TST Tools voor het Nederlands als Webservices in een Workflow', or 'NLP Tools for Dutch as Web services in a Workflow') was that many end users of resources and tools offered by CLARIN will not know how to use them, just as they will not know where they are located. With respect to the location, the CLARIN policy is that the Human and Social Sciences (HSS) researcher does not need to know this, as the infrastructure will take care of that: the only thing the user needs to do is to indicate what (s)he is interested in.
The same should hold for the use of tools and resources: users do not need to know which (other) tools are to be used in order to obtain the data one is looking for. Once more, the infrastructure has to take care of that.
For the Dutch language TTNWW served as a pilot project, trying to provide this service for a whole range of existing resources (both text and speech) and tools. The envisaged end users in TTNWW were researchers in social history, literary onomastics and archaeology. Of course, the web service can also be useful for researchers in other domains, such as linguistics, media studies, political science, communication technology, and sociology. Currently, the requirement is that the resources be in Dutch (spoken or written). 2 Once resources have been handled by (some of) the services described below, it becomes much easier for researchers to find the data they are looking for. This holds especially for audio resources, where the gain in time can be tremendous, that is, if anything can be found at all without the data being transcribed. Suppose one needs data about 'lead paint', nowadays considered hazardous but commonly used in the past. In metadata such a concept will only be mentioned when the document is about lead paint, not when artists are discussed and remarks about the paint they commonly used are made in passing. A specific document about, say, Rembrandt could easily escape notice, while it contains just the data one is looking for. When the transcription and the original resource are time-synchronous, the user can listen to the parts of the resource (s)he is interested in. In originally written documents it is easier to find such data once a resource is available in machine-readable format, but even in such cases the gain in time can be huge, as one can search in a much more goal-oriented manner.
As shown in Figure 7.1, two main types of input are possible in TTNWW: written or spoken. The transcribed audio resources can be used as such, or they can be inserted in the pipeline for written texts. In the following sections we will first discuss the workflow for written texts, followed by the workflow for audio recordings. In the remainder of the chapter the TTNWW web service will be explained.

Formats and Web Service Support
Some necessary conditions for building a text workflow based on existing linguistic tools are that the tools need to be able to communicate and that they need to share a particular text annotation format rich enough to accommodate all the components in the workflow. FoLiA (Format for Linguistic Annotation), cf. van Gompel and Reynaert (2013), was explicitly developed to this end in the scope of both TTNWW and other projects. The format proposes a flexible and generic paradigm covering a wide variety of linguistic annotation types. FoLiA aims at practical usage, and the focus has been on the development of a rich infrastructure of tools to work with the format. Although many of the tools employed in the TTNWW project have adopted FoLiA either as input or output format, it should also be noted that other formats have been used as well, most notably the Alpino XML format for syntactic processing, but also other formats for more complex annotation structures. This emphasises the need for more convergence amongst these formats. In this respect FoLiA aims to provide a single comprehensive solution supporting a multitude of annotation types, and its ongoing development offers the possibility to extend it towards any annotation layers not provided yet. Such extensions can be informed by similar initiatives in this area, such as the German Text Corpus Format (TCF) or the NLP Annotation Format (NAF); these may also provide alternatives in their own right, and the availability of good converters is therefore desirable for projects such as TTNWW. On a more practical level, interoperability should also address more ordinary issues, such as common tokenisation methods, to provide the opportunity to truly interrelate different annotation layers.
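The layered design can be pictured with a schematic, much-simplified fragment. The element names below are only loosely modelled on FoLiA; real FoLiA documents carry namespaces, identifiers and set declarations that are omitted here:

```python
# Schematic illustration of layered token annotation in the spirit of
# FoLiA (simplified; not valid FoLiA as such).
import xml.etree.ElementTree as ET

def annotated_sentence(tokens):
    # Each word element bundles the token text with any number of
    # annotation layers (here: part-of-speech and lemma).
    s = ET.Element("s")
    for text, pos, lemma in tokens:
        w = ET.SubElement(s, "w")
        ET.SubElement(w, "t").text = text           # the token itself
        ET.SubElement(w, "pos", {"class": pos})     # one annotation layer
        ET.SubElement(w, "lemma", {"class": lemma}) # another layer
    return s

sent = annotated_sentence([("Rembrandt", "N(eigen)", "Rembrandt"),
                           ("schilderde", "WW(pv)", "schilderen")])
xml_string = ET.tostring(sent, encoding="unicode")
```

Because every layer attaches to the same word element, tools that add, say, named-entity information can do so without disturbing the layers already present.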
For linguistic enrichment to be effective in the web services/workflow paradigm, most already existing command-line tools had to be transformed into web services. In fact, the road towards this had already been paved in the prior CLARIN-NL (Odijk, 2010) demonstrator project TICCLops (Reynaert, 2014b), which not only turned an existing spelling correction system into a web application and service, but in fact delivered a generic solution for turning linguistic applications with a command-line interface into web applications and RESTful services.
The generic solution for turning any linguistic application into a web application/service, the Computational Linguistics Application Mediator, or CLAM (van Gompel, 2014; van Gompel and Reynaert, 2014), 3 was readily adopted by the TTNWW consortium to prepare their own linguistic applications for integration into the TTNWW workflow.
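The principle that CLAM generalises can be illustrated with a small hypothetical sketch (this is not the CLAM API itself): a command-line tool is described once, and a wrapper feeds it input files and collects its output files, ready to be exposed behind a RESTful interface:

```python
# Hypothetical sketch of wrapping a command-line tool as a service
# back-end; CLAM's real mechanism additionally describes parameters,
# profiles and metadata declaratively.
import os
import subprocess
import sys
import tempfile

class ToolService:
    """Wrap a command-line tool; '{infile}'/'{outfile}' are substituted."""

    def __init__(self, argv_template):
        self.argv_template = argv_template

    def run(self, input_text):
        with tempfile.TemporaryDirectory() as d:
            infile = os.path.join(d, "in.txt")
            outfile = os.path.join(d, "out.txt")
            with open(infile, "w") as f:
                f.write(input_text)
            argv = [a.format(infile=infile, outfile=outfile)
                    for a in self.argv_template]
            subprocess.run(argv, check=True)
            with open(outfile) as f:
                return f.read()

# A stand-in 'tool' that uppercases its input file; a real service
# would invoke a tagger, parser, or corpus clean-up tool instead.
upper = ToolService([sys.executable, "-c",
    "import sys; open(sys.argv[2], 'w')"
    ".write(open(sys.argv[1]).read().upper())",
    "{infile}", "{outfile}"])
```

A thin HTTP layer on top of such a wrapper then turns any command-line tool into a web service, which is essentially the convenience CLAM provides out of the box.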

Text Preprocessing
As a primary input TTNWW accepts digital texts that are either 'born digital' or the result of a digitisation process. To reduce the amount of Optical Character Recognition (OCR) noise in digitised texts TTNWW offers a corpus clean-up tool. The spelling and OCR post-correction system Text-Induced Corpus Clean-up (TICCL) was turned into the 'online processing system' TICCLops. 4 The approach is based on anagram hashing, which was first fully described and evaluated on English and Dutch in Reynaert (2005). In Reynaert (2010) it was applied to OCR post-correction of large corpora. Two efficient modi operandi for obtaining the same end result, i.e. the set of vocabulary neighbours differing up to a specified number of characters, were presented. In a naive implementation based only on edit or Levenshtein distance (LD), each and every item in the vocabulary has to be compared to every other item. Anagram hashing typically reduces the number of comparisons required by several orders of magnitude, depending on the size of the vocabulary involved. Automatic correction of the Early Dutch Books Online corpus, which has a vocabulary of nearly 20 million items, is described in Reynaert (2014a).
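The core idea of anagram hashing can be sketched in a few lines (a deliberately simplified toy, not TICCL's actual implementation): each word is mapped to a numerical key that is identical for all its anagrams, and candidate variants are retrieved by key arithmetic rather than by exhaustive pairwise comparison:

```python
# Toy illustration of anagram hashing for retrieving vocabulary
# neighbours (simplified; TICCL differs in scale and detail).

def anagram_key(word):
    # Raise each character's code point to a high power so that
    # different multisets of characters rarely collide; the order of
    # characters is irrelevant, so anagrams share one key.
    return sum(ord(c) ** 5 for c in word)

def build_index(vocabulary):
    # Group all words that are anagrams of each other under one key.
    index = {}
    for w in vocabulary:
        index.setdefault(anagram_key(w), set()).add(w)
    return index

def neighbours_by_deletion(word, index):
    # Variants reachable by deleting one character are found by
    # subtracting that character's value from the word's key --
    # no pairwise edit-distance computation over the vocabulary.
    key = anagram_key(word)
    found = set()
    for c in set(word):
        found |= index.get(key - ord(c) ** 5, set())
    return found
```

Insertions and substitutions work analogously by adding, or adding and subtracting, character values; only the handful of keys so derived need to be looked up, instead of comparing the word against every vocabulary item.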

Linguistic and Semantic Layers in TTNWW
To understand a text, key information can be inferred from the linguistic structure apparent in and across the sentences of the text. To determine who does what, to whom, when, where, why, and how, it is vital that the syntactic roles of words and word groups be identified, that entities be properly detected, and that different references to the same entities be linked.
TTNWW offers a number of tools that automatically identify this information. Of the following tools, tools 1 to 3 were developed and integrated into Frog, an Open Source natural language processing toolkit for the Dutch language 5 (van den Bosch et al., 2007). Almost all tools were integrated into TTNWW through the web service shell software package CLAM. We briefly discuss the tools independently:
1. Part-of-speech tagging and lemmatisation: identifying the syntactic roles of individual wordforms (e.g. 'paints' in 'Rembrandt used lead white paints for flesh tones' is a plural noun), and linking these wordforms to their standard dictionary lemma ('paint, noun'). The particular machine learning approach to part-of-speech tagging adopted for TTNWW, MBT (memory-based tagger), was originally introduced by Daelemans et al. (1996). Frog lemmatises words and also performs a morphological analysis using a machine learning approach introduced in van den Bosch and .
2. Chunking: grouping words into syntactic phrases (e.g. 'lead white paints' and 'flesh tones' are noun phrases). Chunking can be used for different purposes, for example for identifying salient units in term extraction ('flesh tones' makes more sense as a term than 'flesh' or 'tones' individually) and for identifying the units for answering the 'who did what to whom. . . ' questions ('Rembrandt' is the subject who 'used' 'lead white paints' as an object). The chunking approach in TTNWW, also based on the use of machine learning algorithms, was introduced by Daelemans et al. (1999). As training material, the Lassy Small Corpus was used, which is a syntactic treebank; tree structures from Lassy were converted into chunks with a rule-based script, and a memory-based tagger was trained on the chunked sentences.
3. Named entity recognition (NER): identifying proper names as names of people ('Rembrandt'), places, organisations, or other types of entities. For the system delivered for TTNWW, the developers experimented with a classifier ensemble in which a genetic algorithm was used for the weighted voting of the output of different classifiers (see Desmet and Hoste (2013) for more information). Since it performed as well as the meta-learning approach, we opted for a single classifier based on the conditional random fields algorithm (Lafferty et al., 2001) as the final NER classifier, which was delivered as a CLAM web service.
4. Coreference resolution: linking references to the same entities. For instance, if 'Rembrandt' is later referred to as 'He', the latter pronominal reference should be linked to Rembrandt and not to any other entity mentioned in the text. For TTNWW, an existing mention-pair approach to coreference resolution (Hoste, 2005; de Clercq et al., 2011), which was further refined in the framework of the STEVIN projects COREA (Hendrickx et al., 2012) and SoNaR (Oostdijk et al., 2008; Schuurman et al., 2009), was adapted to the pipeline of tools developed in the other work packages in TTNWW (e.g. the construction of markables was derived from Alpino output, cf. below). The resulting system was delivered as a CLAM web service.
5. Automated syntactic analysis is made available as a web service, by providing an interface to the Alpino parser for Dutch. Researchers can upload their texts to a web service which takes care of the required preprocessing and of running the Alpino parser. The result, syntactic dependency structures in the standard format developed in CGN (Schuurman et al., 2003), D-Coi (van Noord et al., 2006) and Lassy (van Noord et al., 2012), is made available to researchers in a simple XML format. Named entity recognition and classification, part-of-speech tagging and lemmatisation are integrated in the output of the parser. The underlying Alpino parser (van Noord, 2006; de Kok et al., 2011) is the de facto standard syntactic parser for Dutch. It is a stochastic attribute value grammar in which a handwritten grammar and lexicon for Dutch are coupled with a maximum entropy statistical disambiguation component. The parser is fairly accurate, with a labeled dependency accuracy of around 90% on newspaper text. The speed of the parser varies with sentence length and ambiguity, but is about 2 seconds per sentence on average for typical newspaper text on standard hardware.
6. Spatiotemporal analysis: the STEx tool (SpatioTemporal Expressions) for spatiotemporal analysis used in TTNWW enables researchers to deal with incomplete information and to analyse geospatial and temporal information the way the intended reader would have interpreted it, taking into account the relevant temporal and cultural information (using the metadata coming with the resource). Information presented in a text is never complete (Schuurman, 2007). Much of what is meant can be resolved by knowing where (and when) a text originally appeared. This information is stored in the metadata coming with a resource (Schuurman and Vandeghinste, 2010, 2011).
In 'Hij doet opgravingen in het Turkse Sagalassos' (E: 'He is excavating in Sagalassos in Turkey' . De Morgen, 22-10-2011), 'Sagalassos' would be annotated as being situated in the Asian part of Turkey, where in 2011 the province of Antalya was located, Sagalassos having coordinates '37.678,30.519' . It was part of the region of Pisidia, and existed more or less from 10,000 BC until 600 AD. As input, STEx uses fully parsed sentences as provided by Alpino (cf. above).

Alignment
Alignment is a bit of an outsider in the TTNWW project, as it is the only task involving a language other than Dutch. Within the STEVIN project DPC (Dutch Parallel Corpus) an alignment tool chain was developed to arrive at a high-quality, sentence-aligned parallel corpus for the language pairs Dutch-English and Dutch-French, with Dutch as the central language (Paulussen et al., 2012). Within TTNWW this task included creating a web service for the alignment and the annotation of parallel texts (Dutch and English). The constraints of the alignment task involved a number of challenges not encountered elsewhere in TTNWW, due to the fact that more than one language is involved. The existing flow of the web service tool supposes the processing of just one input file (or a set of similar input files using the same processing chain), whereas an alignment task requires at least two input files. For the time being, the alignment service in TTNWW opts for a provisional solution.
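The shape of the sentence-alignment task itself can be conveyed with a very small length-based aligner in the spirit of Gale and Church (1993); the cost model below is deliberately naive and is only meant as an illustration, not as a rendering of the DPC tool chain:

```python
# Tiny dynamic-programming sentence aligner: segments of similar
# character length align cheaply; merges, splits and omissions are
# penalised (illustrative cost model only).

def align(src, tgt):
    INF = float("inf")
    # D[i][j]: best cost of aligning the first i source and
    # j target sentences; back stores the chosen predecessor.
    D = [[INF] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    back = {}
    D[0][0] = 0
    # (delta_src, delta_tgt, penalty): 1-1, 2-1, 1-2, 1-0, 0-1 moves.
    moves = [(1, 1, 0), (2, 1, 2), (1, 2, 2), (1, 0, 4), (0, 1, 4)]
    for i in range(len(src) + 1):
        for j in range(len(tgt) + 1):
            if D[i][j] == INF:
                continue
            for di, dj, penalty in moves:
                ni, nj = i + di, j + dj
                if ni > len(src) or nj > len(tgt):
                    continue
                a = sum(len(s) for s in src[i:ni])
                b = sum(len(t) for t in tgt[j:nj])
                cost = D[i][j] + penalty + abs(a - b)
                if cost < D[ni][nj]:
                    D[ni][nj] = cost
                    back[(ni, nj)] = (i, j)
    # Walk back from the final cell to recover aligned segment pairs.
    pairs, pos = [], (len(src), len(tgt))
    while pos != (0, 0):
        prev = back[pos]
        pairs.append((src[prev[0]:pos[0]], tgt[prev[1]:pos[1]]))
        pos = prev
    return list(reversed(pairs))
```

Even this miniature makes the two-input-file nature of the task visible: unlike the other TTNWW services, the aligner cannot operate on a single document at a time.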

Additions and Some Use Cases
Several other tools can be added, for example dealing with sentiment analysis, summarisation, semantic role labelling, information extraction, etc. TTNWW is designed to enable further extensions. Some of the tools described above have been put into practice in large-scale follow-up projects. TICCL, for example, has been used as a standard preprocessing step in the Nederlab project (Brugman et al., 2016) for the Early Dutch Books Online corpus (Reynaert, 2016). Work in the Nederlab project also involves POS tagging using Frog to produce linguistically annotated corpora. Alpino is used in a broad range of projects; for HSS, GrETEL (Augustinus et al., 2013) and Poly-GrETEL (Augustinus et al., 2016) are especially relevant, making it much easier to search in treebanks.

Tools Included in TTNWW
Speech recognition software provides HSS researchers with the possibility to transform audio signals into machine-readable text formats. The speech recognition output can be reused as input for the text analysis processes, provided that the recognition rate is sufficiently high. Speech recognition systems are complex pieces of software requiring a fair amount of expertise to install and maintain. To make life easier for HSS users, several web services were incorporated in TTNWW in which submitted audio files are automatically transcribed or where related tasks are performed. Several of these web services have been combined, resulting in ready-to-use workflows available to the HSS end user, see Pelemans et al. (2012). The speech recognition web services are based on the SPRAAK software, see Demuynck et al. (2008).
1. Converter: extracts or converts speech files to the required .wav format for the Transcriber web service from a variety of other formats, including MP3 and video. This service is described in more detail in Pelemans et al. (2014).
2. Segmentation: within the TTNWW project, an audio segmentation tool was further improved and was made available via an easily accessible web service through CLAM. The provided audio segmentation tool first analyses the audio to find intervals which contain foreground speech without long interruptions, a process called speech/non-speech segmentation. Next, the speech intervals are divided into shorter segments uttered by a single speaker (speaker segmentation), and the speech fragments belonging to the same speaker are grouped (speaker clustering). These steps basically solve the "who-speaks-when" problem. Finally, the system identifies the language being spoken by each speaker (Dutch vs non-Dutch), enriches every audio fragment with extra non-verbal meta-information (e.g. is this music or telephone speech or dialect speech etc.), and detects the gender of every speaker. See Desplanques and Martens (2013), Desplanques et al. (2015), and Desplanques et al. (2014).
3. Diarisation: automatic speaker diarisation is the task of automatically determining "who spoke when". On reception of an audio file, the web service labels each speaker in the recording ("SPK01", "SPK02", etc.), finds all speech segments, and assigns a speaker label to each segment. The result of the web service can be used as a preprocessing step in most state-of-the-art automatic speech recognition systems. The system is described in Hain et al. (2010) and Wooters and Huijbregts (2008).
4. Dutch Transcriber: uploads and transcribes Dutch broadcast news style of speech. Users have to answer some questions about the audio input so that the best recognition models are chosen from a set of existing ones. More information on the transcription service may be found in Pelemans et al. (2014).
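As an illustration of the speech/non-speech step described under item 2, a toy energy-based segmenter might look as follows (the actual SPRAAK-based segmenter uses far more sophisticated acoustic models):

```python
# Toy energy-based speech/non-speech segmentation: frames whose mean
# absolute amplitude exceeds a threshold count as speech, and
# consecutive speech frames are merged into intervals (sample offsets).

def segment(samples, frame_len=4, threshold=0.5):
    labels = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(abs(x) for x in frame) / len(frame)
        labels.append(energy > threshold)
    intervals, begin = [], None
    # Append a trailing non-speech frame so an open interval is closed.
    for idx, is_speech in enumerate(labels + [False]):
        if is_speech and begin is None:
            begin = idx * frame_len
        elif not is_speech and begin is not None:
            intervals.append((begin, idx * frame_len))
            begin = None
    return intervals
```

The speaker segmentation and clustering steps then subdivide and group such intervals, which is where the real modelling effort lies.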

Additions and Some Use Cases
In addition to the services described above, several other useful speech services have been made available. Due to their experimental character they have not been incorporated into the standard workflows of the TTNWW project. Some end users may however find some of them useful for their purposes. They are available as CLAM-enabled services and can be found on the www.spraak.org/webservice website. These include the following:
1. Dutch phoneme recogniser: this recogniser returns a phonetic transcription for the given audio input.
2. Grapheme to Phoneme Converter (g2p): this web service takes a list of (orthographic) Dutch words and returns a phonetic transcription for each of them.
3. Dutch speech and text aligner: takes as its input both an audio file and a text file and tries to align them. The output file contains the same words as the input, but with added timing information for every word. Optionally a speech segmentation file can also be given that contains speech/non-speech, male/female and speaker ID information as obtained from the speech segmenter described above.
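The behaviour of a g2p service (item 2 above) can be mimicked with a toy rule table; the rules and phoneme symbols below are invented for illustration and do not reflect the actual converter:

```python
# Toy rule-table grapheme-to-phoneme conversion for Dutch-like input
# (rules and symbols illustrative; real g2p systems are typically
# data-driven and handle stress, syllabification and exceptions).

# Multi-character graphemes are listed before single letters so that
# the longest listed match wins.
RULES = [("sch", "sx"), ("oe", "u"), ("ij", "EI"), ("aa", "a:"), ("g", "x")]

def g2p(word):
    phonemes, i = [], 0
    while i < len(word):
        for graph, phon in RULES:
            if word.startswith(graph, i):
                phonemes.append(phon)
                i += len(graph)
                break
        else:
            phonemes.append(word[i])  # fall back to the letter itself
            i += 1
    return " ".join(phonemes)
```

Such a converter is what lets an application like the STON subtitling lexicon (see the use cases below) obtain a pronunciation for newly entered words without any audio.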
These web services have already been put to use by several HSS users, as demonstrated by some use cases:
• A test dataset of nine interviews from the KDC (Catholic Documentation Centre) at RU Nijmegen was prepared to be processed by the TTNWW speech chain. The interviews (total duration of 22.5 hours) were a small subset of the KomMissieMemoires series (KMM 873-880). All interviews obtained a CMDI metadata file which followed the OralHistoryDans profile (see https://catalog.clarin.eu/ds/ComponentRegistry) used in van den Heuvel et al. (2012).
• Currently, about 50 users have registered for the SPRAAK-based web services. Many users of the services want to check the potential performance of speech recognition on their specific task (often interview transcription and transcription of broadcast material) and find this a fast and flexible way to do so.
• Some applications and projects need existing tools, and instead of installing and maintaining these locally, prefer to call them over the web as a RESTful service. One such example is the STON project (about semi-automated subtitling based on speech recognition), where a g2p converter is needed to provide phonetic transcriptions when new words are entered in the lexicon of the subtitling application, cf. Verwimp et al. (2016).

Web Service Delivery
All linguistic processing modules were required to be made available as web services. Web service deployment allows a single service to be used by many non-technical users by lowering the barriers of installation and maintenance requirements. However, most modules had been constructed as command-line tools as a result of previous projects. CLAM (cf. Section 7.1.1) allows any command-line tool to be wrapped as a web service; only parameters, input formats and output formats need to be described. Many of TTNWW's web services have been constructed in this manner. To facilitate the transfer of web services from technology providers to CLARIN centres, providers were requested to deliver services as fully installed virtual images. This reduces the installation overhead for CLARIN centres and ensures that web services are delivered according to the technology provider's recommended operating system. Images were deployed in an OpenNebula High Performance Cloud environment made available by SURFsara through a parallel project.

Combining Web Services in a Work ow
Depending on the end user's requirements concerning the desired linguistic annotation output, web services may need to be combined into pipelines. For example, to obtain coreference annotations the process entails tagging of textual input through Frog, followed by coreference annotation using the COREA service. To facilitate the full process, rather than just delivering an individual process, web services may be combined into workflows (Kemps-Snijders et al., 2012). In the CLARIN community two approaches were proposed for this. One approach allows end users to construct their own workflows by matching input/output requirements of individual services. Possible service combinations are determined using a generic chaining algorithm. This approach has been used in the WebLicht application (Hinrichs et al., 2010), created as part of the German CLARIN D-SPIN project. An alternative approach is to preconstruct complete workflows and provide these to the end user to perform a specific task. This has the advantage that end users can concentrate on task execution rather than task construction. Given the limited number of services and possible combinations for the available TTNWW services, this approach was selected for this project. Incidentally, the WebLicht project now also offers predefined processing chains as an Easy Mode. For TTNWW, Taverna was selected as a workflow construction and execution framework. Upon selection of a specific task, the corresponding workflow definition is sent to a Taverna server, which monitors execution and data transfer between the contributing annotation services running in the HPC cloud environment. End users are shielded from workflow definitions, web services and the execution environment through an easy-to-use user interface allowing them to upload their textual/audio data, to select the annotation task to perform and to collect the results afterwards.
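The preconstructed-workflow approach boils down to composing services in a fixed order, each consuming the previous service's output. The sketch below uses local function stand-ins; in TTNWW the steps are remote web services orchestrated by a Taverna server:

```python
# Schematic sketch of a preconstructed workflow: a fixed chain of
# processing steps applied in order (stand-in functions only; the
# real TTNWW steps are remote annotation services).

def run_workflow(steps, data):
    for step in steps:
        data = step(data)
    return data

# Stand-ins for e.g. tokenisation and tagging services; a coreference
# step would be chained on in the same way.
tokenise = lambda text: text.split()
tag = lambda tokens: [(t, "N" if t[0].isupper() else "X") for t in tokens]

coref_workflow = [tokenise, tag]
```

Because the chain is fixed per task, the end user only picks a task and supplies input; which services run, and in what order, is decided once by the workflow designer.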

Related Work
The web services and workflow paradigm has also been adopted by other projects to deliver processing services to the end user community. D-SPIN's WebLicht project mentioned before was an initiative of the German CLARIN community. The Danish CLARIN-DK project (Offersgaard et al., 2013) pursued a similar line with respect to automatic chaining of services into a workflow. The European PANACEA project (Poch et al., 2012), on the other hand, used the Taverna workbench and associated service registry to allow end users to construct and execute workflows in the NLP domain. Another recent workflow system is Galaxy, used by CLARINO 6 and LAPPS 7 , amongst others.

Conclusions and Further Work
The TTNWW project delivers a suite of web services for the Dutch language domain. The CLAM service packaging software was broadly adopted by many teams to turn their shell-oriented software systems into web services. It has been demonstrated in the project that these services can be successfully combined into workflows. The resulting workflows are task-oriented in the sense that a series of web services is combined to deliver a specific end-user-oriented task. End users only need to select a task and upload their resources, audio or text, after which execution and orchestration of the services is handled by the system. The TTNWW system is currently being revised as part of the ongoing CLARIAH project. Here, a new user workspace based on ownCloud 8 is expected to be added, as well as new features allowing the end user to search the resulting annotation files directly. As far as alignment (cf. Section 7.2.3) is concerned, future work would involve splitting up the original tasks into subtasks (i.e. cleaning, tokenisation and tagging) and restricting the web service to its main task: the alignment of parallel texts. In this way, the other web services can be used to handle the preparatory tasks, giving more flexibility in the development of tools and in administrating workflows. This will imply that all the other tasks require an extra language flag, so that language-specific modules can be used whenever necessary. Another improvement would consist in adapting the input format to the FoLiA format for input and output, so that the data format matches the requirements of the other tools in the web services chain.