Namescape: Named Entity Recognition from a Literary Perspective

The project Namescape: Mapping the Landscape of Names in Modern Dutch Literature (2012–2013) was a demonstrator project granted in the third CLARIN-NL call. Partners in the project were the Huygens Institute for the History of the Netherlands, the University of Amsterdam, and the Dutch Language Institute (CLARIN centre). The project dealt with Named Entity Recognition (NER) for modern Dutch fiction and delivered two new NER tools for this purpose. It also addressed Named Entity Resolution and focused on a set of visualisations of names in individual texts from the corpus. This chapter gives an overview of the results of the project, starting with a description of the background of the research questions in the discipline of comparative literary onomastics. It then goes on to describe the tools that were delivered, and which can be found on the project website, http://www.namescape.nl/.


Introduction
The research discipline dealing with name studies, onomastics, includes a subdiscipline in which scholars aim to analyse and compare the usage and the function of names in literary works. In this kind of research, which can be called comparative literary onomastics, the scholar assumes that patterns and trends can be discovered in the way in which literary authors make use of proper names in their work (van Dalen-Oskam, 2005; van Dalen-Oskam, 2016). Comparative literary onomastic analysis deals not only with quantitative issues, such as the number of names in a work, but also with a more qualitative evaluation of the functions of the names that have been used. Names almost always have an identifying function, to discriminate one place or person from another, but in some literary cases names are also used to do the opposite, that is, to hide a person's identity or the location of a place. And, to give just a few examples of other functions, personal names may also be used to describe the personality of a certain character, and place names clearly help to situate a story in a specific geographical area or to emphasise that area as imaginary. The problem that the Namescape project wanted to address is that this type of literary onomastics could until now deal with only one text or a very small corpus, due to a lack of specialised tools for named entity recognition and classification in literary texts. The most efficient approach available was to privately scan texts and manually annotate them, whereas the need clearly was to be able to compare name usage and name functions in much larger corpora of literary works. The direct incentive for the Namescape project was a pilot performed by literary onomastician van Dalen-Oskam on a collection of 22 Dutch and 22 English novels, in which the use of proper names was analysed (van Dalen-Oskam, 2013). Tagging this corpus, using a combination of semi-automatic and manual tagging, took around 12 months.
The literary named entity recognition applied in the pilot differs from usual named entity recognition in two respects: (1) Personal names, place names, and other names were tagged. Personal names were also tagged as being a first name, a family name, or a nickname. This was necessary from the perspective of literary onomastic analysis to be able to test the hypothesis that first names and family names may be used with different effects and different functions. Furthermore, the literary onomastician needs to view these as separate instances of separate names and not lumped together as one name. References to a character with only a first name, only a family name, or a combination of both may each have a different stylistic effect and a different (set of) function(s). (2) All names were further labelled with information on whether they were purely fictional, referring to 'plot internal' entities (e.g. Harry Potter), or referred to 'plot external', really existing, named entities (e.g. Churchill, London). This was done to be able to test the hypothesis that 'plot internal names' and 'plot external names' have a different set of functions.
The conclusion of the pilot was that a much larger corpus of literary works was needed to confirm or correct the observations that were made in the pilot project, so that the scholar could perform statistically significant quantitative analysis of the use of names. Also needed was a set of tools for the researcher to tag, search and analyse the corpus, including insightful visualisations. The Namescape project set out to do just that.

Namescape Research Environment Components
The main tasks for the Namescape project were: to create a larger corpus, to perform good quality Named Entity (NE) recognition on literary material and try to perform NE resolution so as to determine whether names in literary works are plot internal or plot external, and finally to make the data available in an environment in which the researcher can search and visualise search results (technical details can be found in van .

Namescape Corpora and Annotation Scheme
For the core NE corpus, the project took the Dutch part of the corpus that van Dalen-Oskam used (the 'Huygens corpus') in the above-mentioned pilot, consisting of 22 Dutch novels and containing ca 1.5 million tokens, and extended it with a collection of 550 OCRed Dutch books from the period 1970-2009, containing ca 28 million tokens.
Random paragraphs were selected from that extended core corpus in order to create a manually annotated gold standard corpus for NE recognition, consisting of about 1 million tokens. There were two reasons to compose the gold standard corpus in this way. First, annotating a limited selection of complete works would have severely limited the number and variety of name mentions in the training corpus. Furthermore, by choosing snippets of annotated texts instead of complete texts as training material we hoped to circumvent IPR issues, so as to be able to distribute the training corpus for research purposes. For evaluation of NER performance, we used a fixed random split of the corpus into a training and a testing partition.
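A fixed random split of this kind can be sketched as follows. This is a minimal illustration, not the project's actual procedure: the function name `fixed_split`, the 10% test fraction, and the seed value are all assumptions, since the chapter does not report them.

```python
import random


def fixed_split(paragraph_ids, test_fraction=0.1, seed=42):
    """Return (train, test) lists of paragraph ids.

    The fraction and seed are illustrative assumptions; reusing the same
    seed always reproduces the same partition.
    """
    ids = sorted(paragraph_ids)      # sort first so input order does not matter
    rng = random.Random(seed)        # private generator, independent of global state
    rng.shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# Hypothetical paragraph identifiers 0..99
train, test = fixed_split(range(100))
```

Because the generator is seeded and the identifiers are sorted before shuffling, every run yields the same training and testing partitions, which is what makes evaluation results comparable across tagger versions.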
In the course of the project, three additional corpora were collected and curated: a corpus of eBooks (over 7,000 books and ca 500 million tokens), a subselection of the SoNaR Corpus (over 100 books and ca 11 million tokens) and a corpus with the Dutch books from the Gutenberg project (530 books and ca 30 million tokens). The last corpus contains books from the 17th to the 20th century, which is a challenge for NE recognition because of the historical Dutch spelling.
The XML encoding was done in TEI P5. We made a simple extension to TEI (Text Encoding Initiative) to tag the named entity properties. We also chose to use a single tag for named entities and a different tag for entity parts to avoid nested name tags. For the basic principles for NE recognition, we have followed the 1999 Named Entity Recognition Task Definition (Chinchor et al., 1999). The basic definitions (quoted from section 30.3 of the Task Definition) are:
PERSON: named person, family, or certain designated non-human individuals
ORGANIZATION: named corporate, governmental, or other organizational entity
LOCATION: name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.) and astronomical locations (Chinchor et al., 1999).
We refer to the Task Definition document for further details. We added a MISC category for name occurrences which did not clearly fit in any of the three basic types.

Named Entity Recognition
We have used our Namescape training corpus and trained two named entity taggers: the Namescape-trained instance of the Conditional Random Field-based Stanford tagger (with the default settings), and a Support Vector Machine-based (SVM) tagger, which has been designed to improve performance by making use of information derived in an unsupervised way from a corpus (in this case the extended core Namescape corpus). Both taggers are fairly standard supervised machine learning applications with slightly different, but similar, feature sets, consisting of a set of context features and a set of word shape features. The SVM tagger had a slightly better performance (cf. van . Table 30.1 gives an evaluation of NER performance, using a fixed random split of the gold standard corpus. The application of NER to the other corpora has not been evaluated in a strict way; manual inspection shows a roughly similar accuracy on the eBooks and SoNaR corpora, and, as was to be expected given the differences in language, a much worse accuracy on the historical Gutenberg data. It should also be mentioned that, taking advantage of the annotation of the pilot corpus, we endeavoured to go beyond the standard NER annotation categories by distinguishing between first names, family names and nicknames, thus accommodating the wish of the literary name scholar to compare, for example, the usage and functions of first names with those of family names, instead of heaping them all together as personal names.

Web Service and Application
Since we wanted to enable non-technical users to do named entity recognition on their own texts, we created a small lab environment which has both NE taggers implemented as a web service and is easy to use (http://ner.namescape.nl/namescape/tagger). Text can be uploaded in several formats (plain text, HTML, EPUB, Word, TEI) from the user's own computer or directly from the web by supplying a URL. The result of the tagging process is a TEI file with the inline annotation, delivered to the user either as is ('raw output') or formatted and displayed with NEs highlighted. The formatted display also includes overviews of names per category, snippets per name and a co-occurrence graph allowing the user to explore the relations between the named entity mentions.

Named Entity Resolution
To establish whether a name is plot internal or plot external, NE resolution has been performed by means of the ILPS semanticiser (Odijk et al., 2013; cf. http://semanticize.uva.nl/doc/). The tool tries to link named entities in the texts to entries in Wikipedia (a process also known as wikification). A name is considered to be plot-external when the entry in Wikipedia describes a non-fictitious entity. The application of the method does not require a manually annotated training corpus. For evaluation, the pilot corpus was used, in which plot-internality and plot-externality are manually tagged. For the 3,862 distinct name types and 35,852 name tokens, we have obtained a type accuracy of 74.5% and a token accuracy of 79.8%. The most prominent type of error is perhaps over-resolution: plot-internal entity mentions are often resolved to an apparently unconnected Wikipedia entry. This is understandable when we take into account that most proper names in a novel are expected to be plot-internal, and that the semanticiser has been designed to optimise the choice between different possible resolutions, rather than the decision between resolution and non-resolution (for more details see van .
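The difference between type accuracy and token accuracy can be illustrated with a small sketch. The chapter does not specify how a decision per name type was derived, so the majority vote used in `type_accuracy` below is an assumption, as are the function names and the toy data.

```python
from collections import Counter, defaultdict


def token_accuracy(mentions):
    """Fraction of individual name mentions whose predicted label is correct."""
    correct = sum(1 for _, gold, pred in mentions if gold == pred)
    return correct / len(mentions)


def type_accuracy(mentions):
    """Fraction of distinct names whose majority predicted label matches
    the majority gold label (majority voting is an assumption)."""
    gold_votes = defaultdict(Counter)
    pred_votes = defaultdict(Counter)
    for name, gold, pred in mentions:
        gold_votes[name][gold] += 1
        pred_votes[name][pred] += 1
    correct = sum(
        1
        for name in gold_votes
        if gold_votes[name].most_common(1)[0][0]
        == pred_votes[name].most_common(1)[0][0]
    )
    return correct / len(gold_votes)


# Toy data: (name, gold label, predicted label); not taken from the corpus
mentions = [
    ("Harry", "internal", "internal"),
    ("Harry", "internal", "internal"),
    ("Harry", "internal", "internal"),
    ("London", "external", "external"),
    ("Anna", "internal", "external"),   # over-resolution: wrongly linked
]
tok_acc = token_accuracy(mentions)   # 4 of 5 mentions correct
typ_acc = type_accuracy(mentions)    # 2 of 3 distinct names correct
```

Frequent names weigh more heavily in the token score, which is one way a token accuracy can end up higher than the corresponding type accuracy.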

Search Interface
To enable the user to search and browse the texts, a search interface was built using XQuery on an eXist XML database (http://search.namescape.nl; see Figure 30.1). Unfortunately, IPR restrictions forced us to limit access to the full texts for part of the corpus.

Visualisations
Visualisations of NEs in a single text are enabled through the Namescape visualiser. Visualisation of NEs in a corpus is done via the barcode browser.

Namescape Visualiser
The Namescape Visualizer (http://visualizer.namescape.nl/) gives an overview of the names in a text and shows the co-occurrence of names in paragraphs. To create a picture of the onymic landscape of proper names in the novel De vergaderzaal by A. Alberts, you may select the novel in the tool, and then an overview is given of the top twenty most frequent names (as recognised by an earlier tagger), with an automatically generated link to Wikipedia. The network of named entities in the novels is visualised in three ways: two different representations of the co-occurrence network, and a dispersion plot (see Figure 30.2).
Figure 30.1: Components of the Namescape search interface. Left: search form (above) and hits (texts in which the name 'Michiel' occurs). Right: detailed view of the second hit, with an overview of all names (left) and part of the text with names highlighted (right). The three colours represent the three main name types: personal names (pink), place names (green, no example in the paragraphs shown), and names of organisations (blue). The example also shows that the tagger does not yield complete accuracy.

Network of Characters
Each book contains a network of named entities, usually characters. Two named entities are considered connected if they are both mentioned in the same paragraph. Clustering is performed according to the Louvain method (Blondel et al., 2008) for finding communities in social networks. This is a fast algorithm that optimises a modularity criterion (the modularity of a partition is a measure that compares the density of links inside a cluster to links between clusters).
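The two ingredients described here, paragraph-level co-occurrence and the modularity of a partition, can be sketched in a few lines. This is not the Louvain algorithm itself and not the project's code: it only builds the weighted co-occurrence edges and scores a given partition with the standard weighted modularity, Q = Σ_c [w_in(c)/m − (deg(c)/2m)²], where m is the total edge weight. The function names and the toy paragraphs are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(paragraphs):
    """Count, for each pair of names, the paragraphs mentioning both."""
    edges = Counter()
    for names in paragraphs:
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges


def modularity(edges, community):
    """Weighted modularity of a partition (dict: name -> community id)."""
    total = sum(edges.values())           # m: total edge weight
    if total == 0:
        return 0.0
    degree = Counter()                    # weighted degree per name
    internal = Counter()                  # edge weight inside each community
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
        if community[a] == community[b]:
            internal[community[a]] += weight
    comm_degree = Counter()               # summed degree per community
    for name, k in degree.items():
        comm_degree[community[name]] += k
    return sum(
        internal[c] / total - (comm_degree[c] / (2 * total)) ** 2
        for c in comm_degree
    )


# Toy text: each set holds the names mentioned in one paragraph
paragraphs = [{"A", "B"}, {"A", "B"}, {"C", "D"}, {"B", "C"}]
edges = cooccurrence_edges(paragraphs)
q_two = modularity(edges, {"A": 0, "B": 0, "C": 1, "D": 1})
q_one = modularity(edges, {"A": 0, "B": 0, "C": 0, "D": 0})
```

A method such as Louvain searches for the partition that maximises this score; in the toy example the two-cluster partition scores higher than putting all names in one cluster, whose modularity is zero.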

Figure 30.2: Character bundle, Matrix graph, and Dispersion graph. The Character bundle and Matrix graph are different visualisations of co-occurrence of names in the same paragraphs; the more co-occurrences, the thicker the lines (Character bundle) or the darker the colour (Matrix graph). The Dispersion graph shows in which paragraphs the names occur, linearly through the text from left to right.
The character bundle and the matrix graph are different ways of displaying the network. The colours in the matrix graph correspond to the clusters, and the intensity of the colour indicates the frequency of the name co-occurrence in the book.

Dispersion Graph (Barcode Graph)
The dispersion graph shows which character is mentioned in which paragraph. The horizontal axis represents paragraphs in the book, from the first on the left to the last on the right; the vertical axis represents characters. A coloured bar at (x, y) means that character y is mentioned in paragraph x. The dispersion measure (cf. Juilland et al., 1970), based on the frequency and the distribution of occurrences, is believed to be a good indicator for the prominence of a character in a novel (cf. Karsdorp et al., 2012 for this point in the context of folk tales).
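As an illustration of such a measure, the sketch below computes Juilland's D in its common formulation: the text is divided into n roughly equal parts, V is the coefficient of variation of the name's frequency over those parts, and D = 1 − V/√(n − 1). Whether the Visualizer uses exactly this variant is an assumption; the function name, the choice of ten parts, and the example data are hypothetical.

```python
import math


def juilland_d(paragraph_hits, n_parts=10):
    """Juilland's D for one name.

    paragraph_hits: mention counts (or 0/1 flags) per paragraph, in text
    order. Returns a value between 0 (all mentions in one part) and
    1 (mentions spread evenly over all parts).
    """
    n_par = len(paragraph_hits)
    freqs = [0] * n_parts
    for i, hits in enumerate(paragraph_hits):
        freqs[i * n_parts // n_par] += hits   # assign paragraph to its part
    mean = sum(freqs) / n_parts
    if mean == 0:
        return 0.0                            # name never occurs
    # population standard deviation over the parts
    sd = math.sqrt(sum((f - mean) ** 2 for f in freqs) / n_parts)
    return 1 - (sd / mean) / math.sqrt(n_parts - 1)


# Hypothetical novels of 100 paragraphs
everywhere = juilland_d([1] * 100)            # mentioned in every paragraph
opening_only = juilland_d([1] * 10 + [0] * 90)  # only in the first tenth
```

A protagonist mentioned throughout the book scores close to 1, while a character confined to a single episode scores close to 0, even if both are mentioned equally often.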

Barcode Browser
The Barcode Browser (http://barcode-browser.namescape.nl/index.xql) gives an overview of the search results for a collection of documents (see Figure 30.3). Each document in the search result is a column; the lines represent the paragraphs in the document. Paragraphs matching the search query are highlighted with a colour ranging from yellow (low relevance) to red (highly relevant).

Figure 30.3: The Namescape Barcode Browser (part of the search interface). The example shows a search for the name 'Jan' in one of the subcorpora, with the search form above, and part of the results below. Each text in which the name was found is shown as a vertical column, with a bar for each paragraph. The paragraphs containing the name Jan are highlighted, from yellow (low relevance) to red (high relevance). Hovering over a coloured bar shows the text of the paragraph with all names in red and the search query in red and bold.

Evaluation
The most important results of the Namescape project are the specialised tagger for Dutch literary texts, and the visualisation tools to explore the landscape of names in individual texts. The search option is a great help to check the occurrence of certain names in a large corpus. Of course the corpus is still too small for really ambitious overviews, and it is unfortunate that we cannot make the full text of many works available, but it still gives a much broader view than ever before. Furthermore, although the scholar can get a nice overview on screen, it is not possible yet to download statistical reports or to view them in more detail. Another wish for the future is the option to submit privately owned texts to the tagger and download the results. The Visualizer is a great tool to help explore the data. This is truly inspirational and may lead to many new hypotheses that could be tested in future projects. A drawback of the current Visualizer is that it does not have an upload function and only works on a static set of annotated novels which, furthermore, have been annotated with an older version of the NER tagger. There is therefore a need to update the tagging in the files underlying the Visualizer to make the explorations more reliable as well as more detailed (recall that the final Namescape NER distinguishes first names, family names and nicknames).
New onomastic research was not part of the Namescape project, but is in preparation. The new possibilities were not tested outside the project team. In a paper for the International Congress of Onomastic Sciences in Glasgow in August 2014, we gave an overview of the project, focusing on a problem many mainstream literary onomasticians still have when looking at the results of software. For the human eye, the results still show a lot of mistakes. Even now, many scholars conclude that it is therefore better not to use the tools at all. We think it would be better to find ways to deal with all this noise, and we described a couple of these potential solutions in an earlier paper (van Dalen-Oskam and de Does, 2016). Still, one of the ways we do not mention and which would certainly be useful is to try to improve the new tagger of literary texts.

What Came After
After the Namescape project, we have been able to enhance the performance of the SVM NER tagger by adding distributional word vectors (cf. Turian et al., 2010), produced by the word2vec program (Mikolov et al., 2013), as features to the classifier. This has yielded a significant improvement of tagging accuracy on the Namescape training corpus (cf. The project also inspired a project called Beyond the Book; the ultimate aim of the researchers in this project was to examine whether knowledge of names in a novel could help predict whether a book would be found interesting by readers in another language. They assumed that a novel that mentions a lot of culture-specific information may be less interesting for readers from other cultures, unless there is a special hype around literature focusing on the exotic. They thought that a tool that can show how exotic a novel is could be useful for publishers in helping them decide which novels to push for translation into which languages. To make a very first step towards this possible goal, Beyond the Book focused on names and applied the Semanticizer for named entity resolution. Names were linked to the most probable Wikipedia entry. For each of these Wikipedia entries the researchers calculated the number of contributors and their background (country of origin) and the number of edits. Then they compared these with the mean number of contributions from a certain country to the whole of Wikipedia. The difference between the two then showed whether the editors from a certain country made more changes than average to this entry, or fewer. If they made more, the scholars assumed that in the country of origin of these editors, the named entity was well-known and found culturally relevant. They explored several ways of visualising the results of such analysis for individual names and individual novels (Martinez-Ortiz et al., 2015).
A tool that could be used by publishers and translators to get suggestions for the selection of novels for translation is still far away, however.
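The comparison that Beyond the Book made per Wikipedia entry can be sketched roughly as follows. This is only an illustration of the idea of comparing a country's contribution to one entry with its average contribution to Wikipedia as a whole; the use of edit shares, the function name `relative_interest`, and all numbers are assumptions and do not come from the project.

```python
def relative_interest(entry_edits, baseline_share):
    """Per country: share of edits to this entry minus the country's
    average share of edits across Wikipedia.

    A positive value suggests above-average editing interest in the entry.
    """
    total = sum(entry_edits.values())
    return {
        country: entry_edits.get(country, 0) / total - share
        for country, share in baseline_share.items()
    }


# Invented figures for a single hypothetical entry
entry_edits = {"NL": 60, "US": 30, "DE": 10}           # edits per country
baseline_share = {"NL": 0.05, "US": 0.60, "DE": 0.10}  # average share overall
scores = relative_interest(entry_edits, baseline_share)
```

In this invented example, editors from the Netherlands contribute far more to the entry than their overall share would predict, which under the project's assumption would mark the named entity as culturally relevant there.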

Future Work
Apart from finding more ways of dealing with noise, we have several other wishes for next steps in this research. Obviously, one would want to optimise the tools for automatic tagging. As we have seen, progress in the field of NE recognition is possible; for NE resolution, a first step is to develop more gold standard data. Furthermore, to truly turn the Namescape interactive environment into a virtual research environment that enables researchers to tag, explore, refine, and publish their data, we need to implement additional functionality.
After uploading documents to the NE tagger, researchers should be able to use the exploration and visualisation tools on their own data. To be able to deal with the noise problem described above, scholars should have the option to correct the markup after automatic tagging in a user-friendly way. Finally, we would like to have options to publish tagged material: users should at least be able to download not only their tagged texts (and, if there are no IPR issues at stake, to make them available to other users), but also statistical overviews of names and name co-occurrences.