Enriching a Scientific Grammar with Links to Linguistic Resources: The Taalportaal

Scientific research within the humanities is different from what it was a few decades ago. For instance, new sources of information, such as digital grammars, lexical databases and large corpora of real-language data, offer new opportunities for linguistics. The Taalportaal grammatical database, with its links to other linguistic resources via the CLARIN infrastructure, is a prime example of a new type of tool for linguistic research.


Introduction
This chapter focuses on the ways in which the digital Taalportaal grammar is enriched with links to language corpora and other digital linguistic resources. We first give an introduction to the goals and architecture of the Taalportaal, a new type of online scientific grammar that covers the syntax, morphology and phonology of Dutch and Frisian, the two official languages of the Netherlands. In the second part, we elaborate on why and how the Taalportaal's grammatical information is enriched with links to corpora and other linguistic resources.

The Taalportaal
Language is everywhere. The working linguist is confronted with linguistic data every time they read a newspaper, talk to their neighbour, watch television, or switch on the computer. To overcome the volatility of many of these data, digitised corpora have been compiled for languages all around the globe since the 1960s. These days, there is therefore no lack of natural language resources, at least not for commonly studied languages like English and Dutch. Large corpora and databases of linguistic data are amply available, both in raw form and enriched with various types of annotation, and often free of charge or for a very modest fee.
There is no lack of linguistic descriptions either: linguistics is a very lively field, producing tens of dissertations and thousands of scholarly articles even in a country as small as the Netherlands. An enormous body of linguistic knowledge, however, is stored in paper form only: in grammars, dissertations and other publications, be they aimed at scholarly or lay audiences. The digitisation of linguistic knowledge is only beginning, and online grammatical knowledge is relatively scarce in comparison with all the treasures that are hidden on the bookshelves of libraries and studies.
Of course, there are notable exceptions. One such exception is the Taalportaal (Language Portal) project, an online portal containing a comprehensive and fully searchable digitised reference grammar, i.e. an electronic reference of Dutch and Frisian phonology, morphology and syntax. Information about the Afrikaans language is currently being added as well. With English as its meta-language, the Taalportaal aims at serving the international scientific community by organising, integrating and completing the grammatical knowledge of the Dutch and Frisian languages, as well as of Afrikaans.
The Taalportaal (www.taalportaal.org) is a collaboration project of the Meertens Institute, the Fryske Akademy, the Institute of Dutch Lexicology and Leiden University, funded, to a large extent, by the Netherlands Organisation for Scientific Research (NWO). The project is aimed at the development of a comprehensive and authoritative scientific grammar for Dutch and Frisian in the form of a virtual language institute (cf. Landsbergen et al., 2014).
The Taalportaal is built around an interactive knowledge base of the current grammatical knowledge of Dutch and Frisian. Its prime intended audience is the international scientific community, which is why English is chosen as the language used to describe the language facts. The Taalportaal aims to provide an exhaustive collection of the currently known data relevant for grammatical research, as well as an overview of the currently established insights about these data. This is an important step forward compared to presenting the same material in the traditional form of (paper) handbooks. For example, the three sub-disciplines of syntax, morphology and phonology are traditionally often studied in isolation, but, by presenting the results of these sub-disciplines on a single digital platform and internally linking these results, the Taalportaal contributes to the integration of the results reached within these disciplines.
This can be illustrated by means of a simple example concerning diminutive formation in Dutch. At first sight, this may look like a strictly morphological phenomenon, but upon closer inspection there are certainly also phonological and syntactic aspects to it. For example, the form of the diminutive morpheme depends on the phonological structure of the preceding noun: hond-je 'dog.dim', kam-metje 'comb.dim', konin-kje 'king.dim', etc. There is also a syntactic effect of diminutive formation in that it changes the gender of the input noun; diminutives are all neuter and thus select the definite singular article het 'the' (cf. de hond 'the dog' versus het hondje 'the dog.dim') and may also trigger different forms of agreement (cf. een oude hond 'an old dog' versus een oud hondje 'an old dog.dim'). Semantically, many morphological diminutives carry a (positive or negative) emotional load. Thus, the usage possibilities of hondje '(cute) doggy' are different from those of kleine hond 'small dog'. The Taalportaal makes visible these and less obvious cases of grammatical phenomena that are not restricted to one of the traditional sub-disciplines, to the benefit of each of the three disciplines and thus to the study of grammar in general.
The Netherlands is not the only country considering a linguistic knowledge base like the Taalportaal. Recently, South Africa has started building a virtual language institute called VivA (http://viva-afrikaans.org/) that aims at developing a digital infrastructure for the Afrikaans language. Among its goals are the study and description of Afrikaans, as well as the development of tools and resources for written and spoken Afrikaans, including digital dictionaries and corpora; language advice is also supplied. The cornerstone of the VivA portal is a comprehensive grammar of Afrikaans, which is inspired by and based on the Taalportaal architecture, and is currently being added to the Taalportaal infrastructure.
As of January 2016, the first release of the Taalportaal is online. Figure 24.1 shows an instance of the portal's opening screen.
Technically, the Taalportaal is built as a number of XML files, organised as DITA topics (see note 1). It is freely accessible on the Internet through any standard web browser. The organisation and structure of much of the linguistic information is reminiscent of, and to a certain extent inspired by, Wikipedia and comparable online information sources. An important difference, however, is that Wikipedia's democratic (anarchistic) model is avoided by restricting the right to edit the Taalportaal information to authorised experts.
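To give an impression of what such a topic-based organisation involves, the following minimal sketch parses a DITA-style topic and lists its cross-references. The elements used (topic, title, body, p, xref) belong to the general DITA vocabulary, but the topic itself, its id and its href values are invented for illustration and are not taken from the actual Taalportaal files.

```python
import xml.etree.ElementTree as ET

# Hypothetical DITA-style topic. The id, the text and the href values are
# invented for this sketch; they do not come from the Taalportaal sources.
TOPIC_XML = """\
<topic id="diminutive-formation">
  <title>Diminutive formation</title>
  <body>
    <p>The form of the diminutive suffix depends on the
       <xref href="phonology/syllable-structure.dita">phonological structure</xref>
       of the noun; diminutives are neuter and select the
       <xref href="syntax/articles.dita">definite article</xref> het.</p>
  </body>
</topic>
"""


def describe_topic(xml_text):
    """Return the id, the title and the outgoing links of one topic."""
    root = ET.fromstring(xml_text)
    links = []
    for xref in root.iter("xref"):
        # Join the link text and normalise internal whitespace.
        label = " ".join("".join(xref.itertext()).split())
        links.append((label, xref.get("href")))
    return {"id": root.get("id"), "title": root.findtext("title"), "links": links}


if __name__ == "__main__":
    info = describe_topic(TOPIC_XML)
    print(info["id"], "-", info["title"])
    for label, target in info["links"]:
        print(f"  {label} -> {target}")
```

In this sketch a morphology topic points to a phonology topic and a syntax topic, which is the kind of internal linking between sub-disciplines described above.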
Note 1: DITA, the Darwin Information Typing Architecture, is an XML data model for authoring and publishing. According to https://en.wikipedia.org/wiki/Darwin_Information_Typing_Architecture (as of 17 June 2016), 'the name derives from the following components: Darwin: it uses the principles of specialization and inheritance, which is in some ways analogous to the naturalist Charles Darwin's concept of evolutionary adaptation, Information typing, which means each topic has a defined primary objective (procedure, glossary entry, troubleshooting information) and structure, Architecture: DITA is an extensible set of structures'.
Figure 24.2 shows a small, introductory fragment of the portal concerning Dutch phonology. Among other things, the information about the vowels contains data about their distribution, phonetic details, and links to sound files exemplifying the realisation of the sound in several positions within the word. This is illustrated in figure 24.3.
The Taalportaal grammars were not built from scratch. One of their main components is an online version of the Syntax of Dutch (SoD; Broekhuis et al., 2012-2016), a descriptive grammar that goes well beyond the level of detail provided by other sources, including reference grammars. Although the SoD grammar is descriptive in nature, the emphasis in the selection and presentation of the phenomena discussed is clearly guided by discussions in the (generative) theoretical literature (Broekhuis, 2013; Hoeksema, 2013; Bouma et al., 2015); by implication, the same holds for the Taalportaal's treatment of Dutch syntax. For Dutch morphology, the Taalportaal has been built, among many other sources, upon the first volume of Haeseryn et al. (1997), as well as on morphological handbooks such as those of De Haas and Trommelen (1993), Booij (2002) and Smessaert (2013). The parts on Dutch phonology are indebted to Booij (1995) and Kooij and van Oostendorp (2003). For Frisian there was no lack of studies that could be profited from either, for instance Visser (1997), Hoekstra (1998), Tiersma (1999), Popkema (2006).
As an internet grammar portal, the Taalportaal is somewhat comparable to the grammis portal of the German Language Institute (IDS; http://hypermedia.ids-mannheim.de/call/public/sysgram.ansicht). grammis, however, covers only one language (German) and is not aimed at the international scientific community but primarily at a German audience. These differences explain both the choice of the metalanguage (English for Taalportaal, German for grammis) and the differences in depth of analysis. Moreover, grammis is far less connected to other data sources than Taalportaal.
Besides the grammar modules, the Taalportaal contains an extensive ontology of linguistic terms (recently recast in the CLARIN Concept Registry; cf. Schuurman, 2015) and a large bibliography. Many of the words and phrases in the texts are marked: they can be clicked on, which results in sounds being played, definitions popping up and/or related topics being opened, a feature that will be elaborated upon in the following sections.

Enriching the Taalportaal
It is becoming more and more common for 21st-century linguists to want to check whether and to what extent the linguistic facts as they are presented in the linguistics literature correspond to the linguistic reality. In this context, '[c]reating a link between a descriptive grammar and a syntactically annotated corpus can be valuable for various reasons. Illustrating a given construction with corpus examples may help to get a better understanding of the variation of the construction and the frequency of these variants. Corpus data may also convince a reader that a given variant actually occurs in (well-formed) text, or in some cases may illustrate that examples judged ungrammatical by the authors of the descriptive grammar do occur with some frequency in actual text' (Bouma et al., 2015).
Searching for realistic language data becomes easier by the day, thanks to joint efforts such as CLARIN (www.clarin.eu) that seek to enhance the scientific research infrastructure by, among many other things, linking and making available large existing corpora and other linguistic resources under a single user licence.
The (syntactically annotated part of the) Spoken Dutch Corpus (manually verified, speech from various situations, 1M words; Oostdijk, 2000; van der Wouden et al., 2003), the Lassy Small treebank (manually verified, written material from various genres, 1M words, 65,200 sentences) and the Lassy Large treebank (automatically created, written material from various genres, 700M words, 8.6M sentences; van Noord et al., 2013) are all suitable corpora for this kind of application. The first two resources provide high-quality data for a limited amount of text, while the last resource provides wide-coverage, but noisy, data. All treebanks follow (with minor modifications) the same annotation standard (Van Eynde, 2003 for lemmatisation and POS tagging; Schuurman et al., 2003 for syntax), which has become a de facto standard for Dutch corpus annotation, allowing for the re-use of the queries on these new data.
Taalportaal has been enriched with a range of queries that search for relevant constructions in these corpora. Queries are linked to:
• Linguistic examples;
• Linguistic terms; or
• Names or descriptions of constructions.
The queries are embedded in the Taalportaal texts as standard hyperlinks. Clicking these links brings the user to a corpus query interface where the specified query is executed. If it can be foreseen that executing a query will take a long time, the link may instead point to a web page containing the stored result of the query.
Syntactic annotations are a complex type of data, at present usually encoded in XML in accordance with a well-defined schema. Sometimes these syntactically annotated corpora come with a search interface, but to search these complex data efficiently and optimally, one needs a command of an XML search language such as XPath. Many researchers in linguistics lack these skills. Although the basics of the XPath language are not difficult, interesting queries often become very complex. Moreover, one has to know every particular detail of the encoding of constructions in a particular treebank.
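As a rough illustration of what such a search involves, the sketch below looks for adjectival phrases whose head adjective has a prepositional complement. The XML fragment is a simplified, hand-written parse. Its attribute names (rel, cat, pt, word) and values are modelled on Alpino-style treebank XML, but they are assumptions made for this example rather than an exact rendering of the Lassy encoding. Python's standard library supports only a subset of XPath, so part of the condition is checked in code; in full XPath the query might look roughly like //node[@cat="ap" and node[@rel="hd" and @pt="adj"] and node[@rel="pc" and @cat="pp"]].

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written parse of "Jan is niet boos over die opmerking".
# Attribute names and values are assumptions modelled on Alpino-style XML.
PARSE_XML = """\
<alpino_ds>
  <node rel="top" cat="top">
    <node rel="--" cat="smain">
      <node rel="su" pt="n" word="Jan"/>
      <node rel="hd" pt="ww" word="is"/>
      <node rel="mod" pt="bw" word="niet"/>
      <node rel="predc" cat="ap">
        <node rel="hd" pt="adj" word="boos"/>
        <node rel="pc" cat="pp">
          <node rel="hd" pt="vz" word="over"/>
          <node rel="obj1" cat="np">
            <node rel="det" pt="vnw" word="die"/>
            <node rel="hd" pt="n" word="opmerking"/>
          </node>
        </node>
      </node>
    </node>
  </node>
</alpino_ds>
"""


def adjectives_with_pp_complement(xml_text):
    """Return (adjective, preposition) pairs for APs with a PP complement."""
    root = ET.fromstring(xml_text)
    hits = []
    # Step 1: find every adjectival phrase anywhere in the tree.
    for ap in root.iterfind(".//node[@cat='ap']"):
        # Step 2: among its direct children, find the head adjective and
        # any prepositional complement (sisters of the head).
        heads = ap.findall("node[@rel='hd'][@pt='adj']")
        complements = ap.findall("node[@rel='pc'][@cat='pp']")
        for head in heads:
            for pp in complements:
                prep = pp.find("node[@rel='hd']")
                hits.append((head.get("word"),
                             prep.get("word") if prep is not None else None))
    return hits


if __name__ == "__main__":
    print(adjectives_with_pp_complement(PARSE_XML))
```

Even this small case shows why knowledge of the treebank's encoding matters: the query only works if one knows which labels are used for heads, complements and phrase categories.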

Automatic Links
Many of the links have been generated automatically: all examples in the Taalportaal can be clicked on, which will open a 'pop-up' window like the one in figure 24.4. By clicking the links, the example sentence Jan is niet boos (over die opmerking) can be searched in a number of resources, as the screenshot illustrates: in this case the choices are Google, DBNL, GrETEL, CHN, OpenSoNar, and TaalPortaal. Suppose we choose the third option, the GrETEL web application (http://portal.clarin.nl/node/1967; cf. Augustinus et al., 2013 and chapter 22 in this volume); we can then search for linguistic structures in the most user-friendly way (that is, without having to learn a corpus query language) in a number of large annotated corpora of Dutch (cf. Augustinus et al., 2012). The sentence is parsed using the Alpino parser (cf. van Noord, 2006). The resulting parse is shown in figure 24.5.
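How search links of this kind might be generated automatically from an example sentence can be sketched as follows. This is only an illustration under stated assumptions: the Google search address is real, but the second resource, its address and its parameter name are hypothetical placeholders, and the decision to drop parenthesised optional material before searching is our own simplification, not a documented property of the Taalportaal.

```python
import re
from urllib.parse import urlencode

# Search endpoints keyed by resource name. Only the Google address is a
# real one; "ExampleCorpus" and its URL are hypothetical placeholders.
RESOURCES = {
    "Google": "https://www.google.com/search",
    "ExampleCorpus": "https://corpus.example.org/search",
}


def strip_optional_parts(example):
    """Drop parenthesised optional material (an assumption of this sketch)."""
    without = re.sub(r"\s*\([^)]*\)", "", example)
    return " ".join(without.split())


def search_links(example):
    """Build one search URL per resource for the given example sentence."""
    query = strip_optional_parts(example)
    return {name: f"{base}?{urlencode({'q': query})}"
            for name, base in RESOURCES.items()}


if __name__ == "__main__":
    links = search_links("Jan is niet boos (over die opmerking)")
    for name, url in links.items():
        print(f"{name}: {url}")
```

Because every example in the grammar has the same plain-text form, a single routine of this kind can produce links for all of them, which is what makes automatic generation feasible for examples but not for prose descriptions.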
The query can be edited via a menu, for example by replacing specific lexical items by syntactic categories, as illustrated in figure 24.6. Once the user has made their choices, the structure can be searched for. Part of the result is given in figure 24.7. The example sentences show copula sentences with an adjective that has a prepositional complement: Peking is niet tevreden met zijn groeiende economische macht 'Beijing is not satisfied with its growing economic power' and Rotterdam is ook bezig met zo'n plan 'Rotterdam is busy with such a plan as well'.

Manually Prepared Queries
Whereas it is relatively easy to automatically translate example words or sentences into corpus queries, this usually does not hold for grammatical descriptions meant for human readers. Still, these readers might be interested in checking the grammarian's claims against corpus data.
CLARIN-NL made it possible to enrich Taalportaal fragments, most of them dealing with Dutch syntax, with more sophisticated queries in annotated corpora (cf. Bouma et al., 2015).
The translation of a linguistic example, a linguistic term, or a name or description of a construction into a corpus query is not an easy task, nor one with deterministic results that could be implemented in an algorithm (cf. Bouma et al., 2015). Therefore, the queries were formulated by experts, who got selections of the Taalportaal texts to read, interpret, and enrich with queries where appropriate. The queries were amply annotated with explanations concerning the choices made in translating the grammatical term or description or linguistic example into the corpus query. When necessary, warnings about possible false hits, etc., were added as well. The results were checked by senior linguists.
Consider the small section from Dutch syntax depicted in figure 24.8. The sentence about adjectives has been translated into two radically different queries: the first one searches for adjectives with a prepositional complement as sister, the second one for predicatively used modified adjectives without a PP-complement or any kind of modifier. Clicking the (blue) link 'Show results of this query in lassysmall with PaQu' will open a new browser window to the PaQu interface (http://portal.clarin.nl/node/4182; cf. Odijk, 2015). The first query results in a number of hits from the Lassy Small corpus; the first ones are given in figure 24.9. We twice see the adjective afkomstig 'originating' followed by a prepositional phrase headed by uit 'from', and once the adjective geschikt 'suitable' with a prepositional phrase with voor 'for'.
The second query results in the sentences, among many others, given in figure 24.10. Here we see the three adjectives duidelijk 'clear', welkom 'welcome', and werkzaam 'active', modified with niet zo 'not so', niet 'not', and als arts 'as a doctor', respectively.
If the user clicks the small plus sign following the result sentence, a parse tree is shown. (PaQu offers corpus statistics as well, but that is beyond the scope of this chapter.)
As the corpora dealt with so far offer little or no morphological or phonological annotation, they cannot be used for the formulation of queries to accompany the Taalportaal texts on morphology and phonology. There is, however, a linguistic resource that is in principle extremely useful for precisely these types of queries, namely the CELEX lexical database (cf. Baayen et al., 1995), which offers morphological and phonological analyses for more than 100,000 Dutch lexical items. This database is currently being transferred from the Nijmegen Max Planck Institute for Psycholinguistics (MPI) to the Leiden Institute for Dutch Lexicology (INL). It has its own query language, which implies that Taalportaal queries that address CELEX have to have another format, but again, the Taalportaal user will not be bothered with the small details.
As was mentioned above, the Frisian language (the other official language of the Netherlands, alongside Dutch) is described in the Taalportaal as well, in parallel to Dutch. Although there is no lack of digital linguistic resources for Frisian, internet accessibility of these resources is lagging behind. This makes it difficult at this point to enrich the Frisian parts of the Taalportaal with queries. It is to be hoped that this CLARIN project will stimulate further efforts to integrate Frisian language data in the research infrastructure.

Concluding Remarks
In the first part of this chapter, we have introduced the Taalportaal grammar portal, a digital scientific grammar of Frisian and Dutch, covering the syntax, morphology and phonology of the two official languages of the Netherlands. In the second part of the chapter, we focused on the dynamic links from the grammatical descriptions to other linguistic resources of various sorts, something that is of course impossible in traditional paper grammars. By this extension, the Taalportaal functions as a hub within the scientific infrastructure supplied by CLARIN. This is relevant for the Taalportaal users in at least two ways:
• it increases the value of the Taalportaal as a research tool; and
• it lowers the threshold to use the linguistic resources involved.
The Taalportaal's open architecture allows for extension with new languages (Afrikaans is well under way), but also with new language varieties (dialectal data, historical data, etc.). Moreover, the CLARIN network allows for extension with links to new and so far largely unexplored linguistic resources, such as the huge digital dictionaries of the INL and semantically organised lexical databases such as Open Dutch WordNet (http://wordpress.let.vupr.nl/odwn/; cf. Postma et al., 2016), which may make linguists' practical work even easier and, at the same time, even more exciting.
It is to be foreseen that future corpora of Dutch (and hopefully of Frisian as well) will be embedded in the very same CLARIN infrastructure, using the same architecture, the same type of interface, and the same kind of linguistic annotation.

Figure 24.1: The opening page of the Taalportaal site.

Figure 24.10: Second Lassy output example.