GrETEL: A Tool for Example-Based Treebank Mining

This chapter describes the use of GrETEL for linguistic research. GrETEL is a linguistic search tool that enables users to look up constructions in syntactically annotated corpora or treebanks. It provides online access to the data, allowing users to query a treebank using either an example sentence or an XPath expression in order to look for similar constructions. A major asset of GrETEL is that it enables non-technical users to consult treebanks in a user-friendly way, which is also in line with the main CLARIN goal of applying the results of speech and language technology to research in the humanities and the social sciences. Besides a description of the querying procedure in GrETEL, this chapter presents a selection of research in Dutch syntax and semantics that has been carried out using GrETEL. Finally, it gives an overview of further developments.


Introduction
The construction of syntactically annotated corpora or treebanks has created exciting opportunities for the empirical investigation of syntax. For Dutch, several treebanks are available, e.g. the CGN treebank (van der Wouden et al., 2002) for spoken Dutch, and LASSY (van Noord et al., 2013) and SoNaR (Oostdijk et al., 2013) for written Dutch. While treebanks have the potential to be an added value for descriptive and theoretical linguistics, the exploitation of such treebanks usually requires that the user have in-depth knowledge of the annotation guidelines and master a formal query language. Some users are not deterred by this, but many are, so that the potential of the treebanks will not be realised. To make the treebanks useful for the computationally less inclined we have developed GrETEL, a user-friendly search engine for treebanks (Augustinus et al., 2012; Augustinus et al., 2013). It offers the possibility to provide the system with an example sentence in order to collect relevant corpus data. The development of GrETEL thus paves the way for combining treebank mining with descriptive and theoretical linguistics.

What is GrETEL?
GrETEL stands for Greedy Extraction of Trees for Empirical Linguistics. It is a linguistic search engine that enables users to extract information from treebanks in a user-friendly way. Instead of a formal search instruction, it takes a natural language example as input. This provides a convenient way for novice and non-technical users to use treebanks with a limited knowledge of the underlying syntax and formal query languages.
Since linguists tend to start their research from example sentences, example-based querying allows them to use those examples as a starting point for treebank search. Work related to our approach is the Linguist's Search Engine (Resnik and Elkiss, 2005), a tool that also made use of example-based querying, but is no longer available, and the TIGER Corpus Navigator (Hellmann et al., 2010), which is a Semantic Web system used to classify and retrieve sentences from the TIGER corpus on the basis of abstract linguistic concepts.
The system we present here is an online system, which shares the advantages of tools like TüNDRA (Martens, 2013) and INESS-Search (Meurer, 2012): they are platform-independent and no local installation of the treebanks is needed. This is especially attractive for (very) large parsed corpora which require a lot of disk space. Another related tool is the more recently constructed PaQu application (Odijk, 2015, see chapter 23). In addition to an online search interface, PaQu also offers the possibility to upload and parse a locally installed corpus.
To present the way in which GrETEL works, we first focus on the basic search mode of example-based querying (section 22.2.1) and then turn to more advanced modes of querying (section 22.2.2).

Example-Based Querying
The example-based querying procedure consists of six steps.

Example
The user provides an example sentence containing the syntactic construction (s)he is looking for. For instance, in colloquial Dutch the complementizer van 'of' is sometimes used in constructions reflecting direct speech (Coppen, 2010; Hoekstra, 2010). An example is given in (1).
(1) Hij dacht van ik zal dat morgen wel doen.
    he thought of I will that tomorrow rather do
    'He thought: I will do that tomorrow.'

Parse
GrETEL automatically parses the input construction using the Alpino parser (van Noord, 2006), and returns it as a syntax tree (see Figure 22.1). The user can verify the parse tree. If Alpino returns an erroneous parse, the user is advised to choose another input example.

Selection matrix
In the selection matrix, shown in Figure 22.2, the user indicates which parts of the entered example are relevant for the construction under investigation, as well as their level of abstraction. We have indicated lemma for van 'of', and word class for the verbs dacht 'thought' and zal 'will', as we want to abstract over verb forms. The other words in the example are not relevant for the construction under investigation, so those words are indicated as 'optional in search'.
The dependency relation and the word class (POS tag) of all selected items are automatically included in the search instruction. For instance, it will be taken into account that the word van is a preposition (tagged as vz) functioning as a complementizer (cmp).
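The mapping from selection matrix to search constraints can be sketched as follows. This is a minimal illustration, not GrETEL's implementation: the function name, the dictionary layout and the attribute names `rel`, `pt` and `lemma` are assumptions, and all tag values except vz and cmp for van are assumed for the sake of the example.

```python
# Abstraction levels of the selection matrix, as named in the text.
LEVELS = ("lemma", "word class", "optional in search")


def constraints(token, level):
    """Return the attribute constraints a token contributes to the search.

    Relation and word class are always included for selected tokens;
    the lemma is added only when the user asks for it.
    """
    if level not in LEVELS:
        raise ValueError(f"unknown abstraction level: {level}")
    if level == "optional in search":
        return None  # the token places no constraint on the search
    attrs = {"rel": token["rel"], "pt": token["pt"]}
    if level == "lemma":
        attrs["lemma"] = token["lemma"]
    return attrs


# Part of example (1); tag values other than vz/cmp for 'van' are assumed.
tokens = [
    {"word": "dacht", "lemma": "denken", "pt": "ww", "rel": "hd"},
    {"word": "van", "lemma": "van", "pt": "vz", "rel": "cmp"},
    {"word": "zal", "lemma": "zullen", "pt": "ww", "rel": "hd"},
    {"word": "morgen", "lemma": "morgen", "pt": "bw", "rel": "mod"},
]
choices = ["word class", "lemma", "word class", "optional in search"]

selected = [constraints(t, c) for t, c in zip(tokens, choices)]
```

In this sketch the two verbs end up with identical constraints, which is what abstracting over verb forms amounts to, while morgen contributes nothing to the search.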

Treebank selection
In the next step the user can choose which treebank(s) to query. Currently one can choose between the CGN treebank for spoken Dutch, and LASSY Small and the SoNaR-500 treebank for written Dutch. It is possible to query the CGN and LASSY Small treebanks as a whole, or one can select one or more treebank components, for instance to compare data from different genres. Because of its size (500 million words, ca 41 million sentences), it is only possible to query SoNaR per component. For this example we have chosen the part of SoNaR containing discussion lists (WR-P-E-A, 50 million words, ca 4.5 million sentences).

Query
Based on the information provided in the selection matrix, GrETEL extracts a query tree from the parse tree (Figure 22.3). Besides the lexical information indicated in the selection matrix, the dependency relation (rel) and the phrasal category (cat) of the relevant nodes are included in the query tree, see (5). GrETEL automatically converts the query tree into an XPath expression, which is used to search the treebank.
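The conversion of a query tree into an XPath expression can be illustrated with a short sketch. The class `QueryNode`, the functions `render` and `to_xpath`, and the shape of the example tree are hypothetical; the tree is invented for illustration and is not the query tree of Figure 22.3, and the generated string is not necessarily identical to what GrETEL produces.

```python
from dataclasses import dataclass, field


@dataclass
class QueryNode:
    """One node of a query tree: attribute constraints plus child nodes."""
    attrs: dict
    children: list = field(default_factory=list)


def render(node):
    """Render a node and its subtree as an XPath step with one predicate."""
    parts = [f'@{name}="{value}"' for name, value in sorted(node.attrs.items())]
    parts += [render(child) for child in node.children]
    return f"node[{' and '.join(parts)}]" if parts else "node"


def to_xpath(root):
    """Search for the query tree anywhere in the treebank."""
    return "//" + render(root)


# Invented query tree: a clause with a verbal head and a complement
# that is introduced by the complementizer 'van'.
query = QueryNode(
    {"cat": "smain"},
    [
        QueryNode({"rel": "hd", "pt": "ww"}),
        QueryNode({"rel": "vc"}, [QueryNode({"rel": "cmp", "pt": "vz", "lemma": "van"})]),
    ],
)
xpath = to_xpath(query)
```

Because child nodes are expressed as nested predicates rather than as positional steps, an expression of this kind says nothing about the linear order or adjacency of the words involved.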

Results
The results of the query are presented to the user as a list of sentences, with the matching part emphasised. The user can click on any of these sentences in order to visualise the results as syntax trees. For the query in Figure 22.3 GrETEL finds 175 results in the WR-P-E-A component of SoNaR. Some are presented in (2-4).
The results show the greedy nature of GrETEL: it not only returns constructions in which the parts of the construction indicated in the matrix are adjacent, but also returns examples in which those elements are discontinuous. For instance, the finite verb zegt 'says' in (2) is not adjacent to van 'of'. Because of this discontinuity, looking for similar constructions in a flat (raw or POS-tagged) corpus would be much harder.
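Why a tree-based search copes with discontinuity can be shown on a toy example. The sketch below is an assumption-laden illustration: the sentence and its tree are invented (they are not taken from SoNaR), the attribute names follow the Alpino-style conventions mentioned above, and `find_hits` is a hypothetical helper rather than part of GrETEL.

```python
import xml.etree.ElementTree as ET

# Invented toy tree for "hij zegt dan van ik kom"; not corpus data.
TOY = """
<node cat="smain" rel="--">
  <node rel="su" pt="vnw" word="hij" begin="0"/>
  <node rel="hd" pt="ww" word="zegt" begin="1"/>
  <node rel="mod" pt="bw" word="dan" begin="2"/>
  <node rel="vc" cat="cp">
    <node rel="cmp" pt="vz" lemma="van" word="van" begin="3"/>
    <node rel="body" cat="smain">
      <node rel="su" pt="vnw" word="ik" begin="4"/>
      <node rel="hd" pt="ww" word="kom" begin="5"/>
    </node>
  </node>
</node>
"""


def matches(elem, wanted):
    """True if the element carries every required attribute value."""
    return all(elem.get(name) == value for name, value in wanted.items())


def find_hits(root, head, marker):
    """Return (head word, marker word, distance) for every clause that has a
    matching head as a child and a matching marker anywhere below it."""
    hits = []
    for clause in root.iter("node"):
        heads = [child for child in clause if matches(child, head)]
        markers = [
            desc
            for child in clause
            for desc in child.iter("node")
            if matches(desc, marker)
        ]
        for h in heads:
            for m in markers:
                distance = int(m.get("begin")) - int(h.get("begin"))
                hits.append((h.get("word"), m.get("word"), distance))
    return hits


root = ET.fromstring(TOY.strip())
hits = find_hits(root, {"rel": "hd", "pt": "ww"}, {"rel": "cmp", "lemma": "van"})
```

The single hit pairs zegt with van at a distance of two positions: the intervening adverb does not affect the match, because the search is stated over dominance relations and not over word order.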
If we run the same query on less informal data, such as the component of SoNaR containing periodicals and magazines (WR-P-P-H), we only find 42 hits even though the corpus is larger in size (ca 5.5 million sentences) than WR-P-E-A (ca 4.5 million sentences). This confirms the colloquial nature of the construction.
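The comparison becomes more explicit when the hit counts are normalised by corpus size. The following sketch uses the approximate sentence counts given above, so the resulting figures are rough estimates; the function name is an assumption.

```python
def hits_per_million(hits, sentences):
    """Relative frequency: hits per million sentences."""
    return hits / sentences * 1_000_000


# Sentence counts are the approximate ("ca") figures quoted in the text.
discussion_lists = hits_per_million(175, 4_500_000)  # WR-P-E-A
periodicals = hits_per_million(42, 5_500_000)        # WR-P-P-H
ratio = discussion_lists / periodicals

print(f"{discussion_lists:.1f} vs {periodicals:.1f} hits per million sentences")
print(f"ratio: {ratio:.1f}")
```

On these figures the construction is roughly five times as frequent per sentence in the discussion lists as in the periodicals and magazines.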
Example-based querying has the advantage that the user does not need to be familiar with XPath, nor with the exact syntactic structure of the XML in which the trees are represented, nor with the exact grammar implementation that is used by the parser or the annotators.

XPath Search
In the advanced mode of example-based search, users can inspect not only the query tree (Figure 22.3), but also the corresponding XPath expression, spelled out in (5).
For users who are thoroughly familiar with XPath and with the details of the annotation there is also the possibility to directly formulate an XPath query describing the syntactic pattern the user is looking for. This query is then processed in the same way as the automatically generated query in the first approach.
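As an illustration of what a hand-written query does, the sketch below runs a simple path expression over a two-sentence toy treebank with Python's standard library. The `<treebank>` wrapper, the sentence identifiers and the `search` helper are assumptions made for the example, and the data are invented. Note that `xml.etree.ElementTree` supports only a limited subset of XPath (chained attribute predicates, but no `and` or nested path predicates), so realistic treebank queries require a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Invented two-sentence treebank in an Alpino-like layout.
TREEBANK = """
<treebank>
  <alpino_ds id="s1">
    <node cat="top" rel="top">
      <node rel="hd" pt="ww" lemma="denken" word="dacht"/>
      <node rel="cmp" pt="vz" lemma="van" word="van"/>
    </node>
  </alpino_ds>
  <alpino_ds id="s2">
    <node cat="top" rel="top">
      <node rel="hd" pt="ww" lemma="slapen" word="slaapt"/>
    </node>
  </alpino_ds>
</treebank>
"""

# Any node that is a complementizer realised by the preposition 'van'.
QUERY = ".//node[@rel='cmp'][@pt='vz'][@lemma='van']"


def search(treebank, query):
    """Return the ids of all sentences containing at least one match."""
    return [
        sentence.get("id")
        for sentence in treebank.iter("alpino_ds")
        if sentence.find(query) is not None
    ]


treebank = ET.fromstring(TREEBANK.strip())
matching_ids = search(treebank, QUERY)
```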

Using GrETEL for Research and Education
GrETEL has been used for linguistic research on various topics within Dutch syntax and semantics (section 22.3.1). In addition, it has been used for teaching, and it has been presented at several conferences and guest lectures (section 22.3.2).

Research on Dutch Syntax and Semantics
While GrETEL has been used to investigate several linguistic topics, two strands of research have received considerable attention: the investigation of verb clusters and of copular constructions.

Verb Clusters
Augustinus (2015) provides both a theoretical and a treebank-based account of Dutch verb clusters, i.e. constructions in which multiple verbs group together. She shows how such constructions can be extracted from the treebanks using GrETEL, and how the treebank observations serve as an empirical basis to verify the claims made by the theory. She conducted several case studies, such as word order variation in verb clusters, the occurrence of Infinitivus pro Participio (a.k.a. the IPP effect), and interruption of the cluster by nonverbal elements.
Dutch verb clusters are characterised by an unusual type of word order variation, i.e. one that does not entail a change of meaning, as shown by the examples in (6).
Augustinus (2015) investigates which types of word order variation occur in non-dialectal varieties of Dutch, i.e. in the CGN and LASSY Small treebanks included in GrETEL. Barbiers and Schuurman (2015) compare the word order variation in three-verb clusters encountered in those treebanks to data obtained from MIMORE, a tool for investigating morphosyntactic variation in Dutch dialects.
Infinitivus pro Participio or IPP refers to constructions in which an infinitive occurs instead of a past participle, as in (7).
Augustinus and Van Eynde (2017) compare the set of Dutch IPP verbs to the German IPP verbs. In order to add this cross-linguistic perspective, they queried two German treebanks using the TüNDRA treebank search tool (Martens, 2013). The case study not only illustrates how the results obtained by GrETEL can be complemented by using additional resources, but also shows how the treebank data can be employed to evaluate theoretical accounts of IPP.
A third case study on IPP using GrETEL investigates the choice of the auxiliary of the perfect in Dutch IPP constructions, i.e. the choice between hebben 'have' and zijn 'be'. Canonically the choice of the auxiliary in IPP constructions is determined by the IPP verb, as in (8a). However, one also encounters constructions in which the auxiliary is determined by the main verb, as in (8b).
While this variation has been reported in the literature, no large-scale corpus study was available pointing out the frequency and the distribution of the phenomenon. Van Eynde et al. (2016a) investigate the choice between hebben 'have' and zijn 'be' in IPP constructions by means of GrETEL and OpenSoNaR. The corpus study provides insight into the set of verbs that allow this alternation. For the verbs moeten 'must' and kunnen 'can' the distribution of the canonical and the alternative construction is investigated in more detail.
Besides word order variation and the IPP effect, Augustinus and Van Eynde (2014) and Augustinus (2015) investigate the occurrence of cluster interruption. Canonical verb clusters cannot be interrupted by nonverbal elements (9). There are some exceptions though, such as cluster creeping by separable verb particles, predicative adjectives, and stranded adpositions (10).
The treebank investigations conducted in Augustinus and Van Eynde (2014) and Augustinus (2015) show that the set of cluster creepers is larger than the literature suggests. This illustrates once more how a treebank-based investigation can provide additional insights into syntactic phenomena.

Copular Constructions
In addition to the research on verb clusters, GrETEL has been used for research on copular constructions. Van Eynde et al. (2014) illustrate that the set of copular verbs discussed in traditional grammars is incomplete. Typically those grammars mention a set of 10 to 15 verbs, adding, as an afterthought, that the list is not complete. By means of GrETEL, treebank data were collected in order to get a more complete and empirically motivated typology of Dutch copular constructions, which consists of at least 40 verbs. As the typology is based on linguistically motivated criteria, it can be used to complete the list of verbs by investigating a larger dataset.
Van Eynde et al. (2016b) deal with number agreement in copular constructions. Canonically, there is number agreement between the subject and the predicate nominal in Dutch copular constructions, as in (11). Mismatches are not excluded, however, as shown in (12).
This research demonstrates how the data obtained from the treebanks not only provide information with respect to the frequency and the distribution of number agreement in copular constructions, but also serve as an empirical basis for a theoretical analysis. In addition, the treebank data were employed to define under which circumstances mismatches between the subject and the predicate nominal are allowed.

Dissemination
GrETEL is currently used in courses on descriptive linguistics, syntax and semantics, corpus analysis, and computational linguistics in order to teach students how to look up syntactic constructions and their frequencies in a treebank without requiring them to familiarise themselves with the specifics of XPath or the specific syntax of the treebank. It teaches students about syntactic parses and treebanks by providing them easy online access to large amounts of data.
As GrETEL has a focus on user-friendliness and is freely available online, it is an example of how Digital Humanities applications disclose datasets and computational tools, without requiring the user to have a technical background.
GrETEL was presented to a technical audience at several conferences within the field of computational linguistics and to an audience of potential users at general linguistic conferences and doctoral schools in Flanders and the Netherlands. Those lectures typically include a tutorial demonstrating the functionality and use of GrETEL, followed by a hands-on session. In addition, some case studies are discussed, showing how the results obtained from the treebanks in GrETEL can serve as an empirical basis for research in linguistics. One of the case studies is the combined use of GrETEL and MIMORE (Barbiers and Schuurman, 2015). It illustrates how GrETEL and MIMORE can be used as complementary tools for studies on Dutch syntax.

Further Developments
GrETEL has been designed in such a way that it can also be used for treebanks in other languages, even if they have different annotation schemes compared to the Dutch treebanks. In the AfriBooms project (Augustinus et al., 2016a), a treebank for Afrikaans has been developed, which is also included in GrETEL (section 22.4.1). In the context of the SCATE project (Vandeghinste et al., 2016), GrETEL was adapted to query parallel treebanks (section 22.4.2). The tool is also included in Taalportaal (Landsbergen et al., 2014), an online descriptive grammar of Dutch (section 22.4.3).

GrETEL for Afrikaans
In comparison to Dutch, Afrikaans is a low-resource language, so until recently no treebanks for Afrikaans were available. In the AfriBooms project a (small) treebank containing ca 50K words has been developed, based on the corpus of the South African National Centre for Human Language Technologies (NCHLT). The annotations of the treebank are manually corrected, which makes it a reliable resource for linguistic research. In addition, a first parser for Afrikaans was developed. Both the treebank and the parser are included in a version of GrETEL for Afrikaans (Augustinus et al., 2016a).

Querying Parallel Treebanks with Poly-GrETEL
In the context of the SCATE project, large-scale parallel treebanks are constructed which are used for syntax-based machine translation. Since parallel treebanks are a valuable resource for translators and linguists as well, Poly-GrETEL was developed, i.e. an extension of GrETEL for querying parallel treebanks (Augustinus et al., 2016b). Currently it contains the (automatically annotated) Europarl parallel treebank for Dutch and English.

The Europarl parallel treebank
We have updated the treebank described in Kotzé et al. (2016): we used the data from Europarl version 7 (Koehn, 2005) and extracted the Dutch and English sentence-aligned data from www.statmt.org. The Dutch side was parsed with the Alpino parser and the English side with the Stanford parser (Klein and Manning, 2003) with added dependencies (de Marneffe et al., 2006). The phrase structure output of the Stanford parser is converted into an XML tree, analogous to the XML output of Alpino, as shown in Figure 22.4. Besides the syntactic annotations the parallel treebank contains node alignments.

Poly-GrETEL
In combination with the example-based query functionality, Poly-GrETEL avoids the need for users to be familiar with the query language and the structure of the trees in the source and target language, thus facilitating the use of parallel corpora for comparative linguistics and translation studies.
The user can query the treebanks in a similar way as in the monolingual GrETEL environment, i.e. example-based or by means of an XPath query. The main difference is that the user can choose between a bilingual and a monolingual input. In the bilingual search option the user provides two input constructions: one in English and one in Dutch. Poly-GrETEL returns two parses, and the user can indicate the relevant parts of both the English and the Dutch input examples. Poly-GrETEL automatically extracts a search instruction in a similar fashion as the monolingual GrETEL, but provides the option to return only the constructions in which the English and the Dutch query trees are aligned. It is a syntactic concordancer for parallel treebanks, as it shows how a Dutch syntactic construction is translated into English (or vice versa). One could, for instance, investigate how the Dutch van-construction presented in section 22.2.1 is translated into English. This makes the tool interesting not only for research in (comparative) linguistics and translation studies, but also as a computer-aided translation tool for translators and language learners.
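The option to return only aligned constructions can be thought of as a filter over the node alignments. The sketch below is a simplified illustration under stated assumptions: the node identifiers, the data and the function `aligned_pairs` are hypothetical, and the actual matching in Poly-GrETEL may work differently.

```python
def aligned_pairs(nl_hits, en_hits, alignments):
    """Keep the alignment links whose Dutch end is a Dutch hit and whose
    English end is an English hit."""
    nl, en = set(nl_hits), set(en_hits)
    return [(src, tgt) for src, tgt in alignments if src in nl and tgt in en]


# Hypothetical node identifiers of the form (sentence id, node id).
nl_hits = [("s1", 4), ("s2", 7)]   # nodes matching the Dutch query tree
en_hits = [("s1", 3), ("s3", 2)]   # nodes matching the English query tree
alignments = [
    (("s1", 4), ("s1", 3)),        # both ends are hits: kept
    (("s2", 7), ("s2", 9)),        # English end is not a hit: dropped
]

pairs = aligned_pairs(nl_hits, en_hits, alignments)
```

Only the first link survives the filter, so only sentence pair s1 would be reported as an aligned result in this toy setting.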
Adding the parallel English-Dutch treebank furthermore implies that GrETEL also includes English data.Since it is possible to query the English side of the parallel treebank in a monolingual way, one can use these data for a monolingual treebank investigation of syntactic phenomena in English.

Link with Taalportaal
Recently, GrETEL was linked to Taalportaal, a website that contains online descriptive grammars for Dutch, Frisian and Afrikaans (Landsbergen et al., 2014, see chapter 24). By means of intelligent links, users can look up linguistic phenomena described in Taalportaal in a variety of online corpora, amongst others the treebanks included in GrETEL (van der Wouden et al., 2015).
The link with Taalportaal enhances the visibility of GrETEL, and encourages its use, alone or in combination with other corpus tools. Bouma et al. (2015) mention how they have used the example-based input method of GrETEL to facilitate query formulation. It turns out to be particularly useful if one does not know exactly how certain phenomena are annotated in the treebanks. In addition, the authors mention how they have used GrETEL's example-based querying functionality to become aware of differences between the treebank annotations and the analyses of the descriptive grammar included in Taalportaal (Bouma et al., 2015: 18).

Conclusion and Future Work
We have described GrETEL, a user-friendly search tool for treebanks. It originated in the context of a CLARIN-Flanders project, which aimed at the creation of tools for the exploitation of Dutch treebanks. In follow-up research, GrETEL was extended to other languages (Afrikaans and English), and other types of treebanks, i.e. parallel ones. The extensions make the tool also useful for a larger (CLARIN) audience, i.e. researchers who are not (only) working on Dutch.
Future work includes adding more languages to GrETEL, such as German and French, as for those languages we also have high-quality parsers and treebanks available.
In the framework of the Dutch CLARIAH infrastructure project and the Anncor project (University of Utrecht), there are plans to further extend the functionality of GrETEL. An upload function will be added, enabling researchers to upload their own corpus and metadata, supporting multiple formats. Another extension concerns adding options for data analysis, and creating possibilities to sort, group, and filter search results and metadata.

Figure 22.3: Query tree based on the input example.
Augustinus (2015) [...] Germanic languages, such as Dutch, German and Afrikaans. These languages differ, however, with respect to the set of verbs that can appear as IPP verbs, and with respect to whether the phenomenon occurs obligatorily or optionally. For some verbs, the literature is not conclusive on whether they can occur in IPP constructions or not. Augustinus and Van Eynde (2012) and Augustinus (2015) describe how a treebank-supported investigation of Dutch IPP verbs using GrETEL results in a more exhaustive and empirically valid typology of Dutch IPP verbs than the lists available in the literature.