Stylene: An Environment for Stylometry and Readability Research for Dutch

We describe an educational demonstration interface and tools for stylometry (authorship attribution and profiling) and readability research for Dutch. The Stylene system consists of a popularisation interface for learning about stylometric analysis, and of web-based interfaces to software for readability and stylometry research, aimed at researchers from the humanities and social sciences who do not want to develop or install such software themselves.


Introduction
The last decade has seen a marked increase in research on computational stylometry, the subarea of natural language processing that concerns itself with the categorisation of texts according to the psychological and sociological properties of their authors. Also called text profiling, this research tries to develop systems, mostly based on text analytics techniques that combine natural language processing and machine learning methods. These systems are trained to determine whether the author of a text is male or female, their education level, region of origin, personality, and even mental health, whether they are a native speaker or not, and many other potentially useful attributes. Of course, authorship attribution research has existed for a long time, and is in a sense the limit case of computational stylometry: supposing that everyone has a unique combination of demographic, psychological and idiosyncratic style properties, this would be their idiolect or 'stylome' (van Halteren et al., 2005; Coulthard, 2004), and it should be possible to assign texts of unknown authorship to specific authors provided that models of their stylome exist.
Another useful type of information that can be extracted from text using natural language processing and text categorisation is the readability of a text. Readability research and the automatic prediction of readability have a very long and rich tradition (see surveys by Klare 1976; DuBay 2004; Benjamin 2012; and Collins-Thompson 2014). Whereas superficial text characteristics leading to on-the-spot readability formulas were popular until the 1990s (Flesch 1948; Gunning 1952; Kincaid et al. 1975), recent advances in the fields of computer science and natural language processing have triggered the inclusion of more intricate characteristics in present-day readability research (Si and Callan 2001; Schwarm and Ostendorf 2005; Collins-Thompson and Callan 2005; Heilman et al. 2008; Feng et al. 2010).
Current approaches model lexical, syntactic, semantic and discourse complexity, while also considering shallow traditional text characteristics. Furthermore, the focus has shifted from using the formulas to select reading material for children or L2 language learners to assessing the readability of a variety of text types with other user groups or applications in mind.
This chapter introduces the results of a CLARIN Flanders project on the development of practical tools for stylometry and readability.1 The goal of that project was to implement a robust, modular system for stylometry and readability research on the basis of existing methods, and to develop a web service that would allow researchers in the humanities and social sciences to analyse texts with this system.
The website has three sub-interfaces: (i) a popularisation interface intended to provide basic insight into what stylometry can do; (ii) a readability interface that allows the input of texts and provides elementary and more advanced feedback on the readability of the text; and (iii) a machine learning interface that allows basic experiments in computational stylometry.
In this chapter, we will describe the underlying methods and approaches in the backend of the interfaces. We also developed a stand-alone system for machine-learning-based stylometry that underlies the third interface, but which allows more options and flexibility as a stand-alone system than can be accessed from the interface. The stand-alone system may eventually replace the corresponding interface on the website.

The Stylometry Popularisation Interface
Computational stylometry is not yet well known outside computational linguistics and the specialised digital humanities research community. In order to educate interested lay persons and colleagues from the humanities and social sciences about the possibilities (and limitations) of the approach, an interface was designed to help a general audience understand computational stylometry in an easy and fun way. An early version was tested successfully during the 2011 Flemish 'Wetenschapsweek' (Science Week) with secondary school pupils, and was afterwards extended. There has been considerable interest in the interface (around 50 visitors per month) and some media attention. Figure 16.1 shows the start screen of the interface. Input can be provided either by cut and paste or through file upload. In both cases the input should be raw text (uploaded files should have a .txt extension). The demo will only give complete output in browsers that are HTML5-compatible and that allow JavaScript. For practical reasons, cut-and-paste input is limited to 4,000 characters and file upload to 300 sentences.
After the user enters a text and clicks on 'analyse', the software returns a screen with didactic information about the general approach taken in stylometry and about different stylometric aspects of the text provided. Figure 16.2 shows the introductory information that is given about the analysis (users can click through to sample linguistic analyses of the data) and a visual (colour) representation of the distribution of words in the text for some of the features used in the system (the full feature representation can be clicked on as well). Darker colours represent more frequent features. For the linguistic analysis, the software package Frog was used (Van den Bosch et al., 2007). As features, token unigrams and the LIWC features (Pennebaker and Francis, 1996; Pennebaker et al., 2001; Pennebaker et al., 2007) were used. The latter features group vocabulary associated with specific cognitive and emotional styles and themes, as well as grammatical categories (for example personal pronouns) associated with differences in demographic and psychological properties of authors.
Figures 16.3-16.5 show the additional information that is provided by the demo system: a guess of the gender of the author (based on a model learned by a support vector machine algorithm using all features and trained on part of the Corpus Gesproken Nederlands data (CGN 2004); Figure 16.3); a guess of the genre of the text (based on a small corpus that was collected solely for this demo; Figure 16.4); and a representation of the closeness to samples of the works of a random selection of different Dutch and Flemish authors (based on, on average, the first 11,000 tokens of one of their novels; Figure 16.5). The gender infobox is the result of assessing the proportion of male and female labels of the 71 vectors that are closest to the vector representing the input, using cosine similarity. The other two infoboxes are based on the Dice coefficient (a metric that measures similarity) between the different (normalised) vectors that represent the style of the authors.
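The two similarity measures mentioned above can be sketched as follows. This is a minimal illustration, not the Stylene implementation; how the feature vectors themselves are built is described elsewhere in this chapter.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def dice(features_a, features_b):
    """Dice coefficient between two sets of (e.g. stylistic) features."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))
```

The gender infobox corresponds to a nearest-neighbour vote over cosine similarities; the genre and author infoboxes compare Dice scores between the input vector and each reference vector.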
It should be noted that the models used are simplified, and that we do not make any scientific claims about the selection of genres and authors, or about the meaning of the output of the system. The only goal of this interface is to show what computational stylometry is, and to provide a feeling for the type of information it uses and the type of output it produces.

The Readability Interface
Automatic readability prediction has a long and rich tradition. Research in the 20th century, fuelled especially by educational purposes, resulted in a large number of readability formulas. Typically, these yield either an absolute score (Flesch, 1948; Brouwer, 1963) or a grade level at which a text is deemed appropriate (Dale and Chall, 1948; Gunning, 1952; Kincaid et al., 1975), and they are based on shallow text characteristics such as average word and sentence length and word familiarity.
Over the years, many objections have been raised against these traditional formulas: their lack of absolute value (Bailin and Grafstein 2001), the fact that they are solely based on superficial text characteristics (DuBay 2004; DuBay 2007; Davison and Kantor 1982; Feng et al. 2009; Kraf and Pander Maat 2009), the underlying assumption of a regression between readability and the modelled text characteristics (Heilman et al. 2008), and so on. Furthermore, there seems to be a remarkably strong correspondence between the readability formulas themselves. When evaluating the performance of 12 readability formulas, of which seven were designed for English, five for Dutch and one for Swedish, van Oosten et al. (2010) found strong correlations between the formulas, not only within a given language but also across languages.
These objections have led to new quantitative approaches to readability prediction which adopt a machine learning perspective on the task. Advancements in these fields have introduced more intricate prediction methods, such as Naïve Bayes classifiers (Collins-Thompson and Callan 2004), logistic regression (François 2009) and support vector machines (Schwarm and Ostendorf 2005; Feng et al. 2010; Tanaka-Ishii et al. 2010), and especially more complex features. Rather than relying solely on superficial text characteristics, the added value of features measuring lexical complexity based on n-gram modelling (Schwarm and Ostendorf 2005; Pitler and Nenkova 2008; Kate et al. 2010) or of features relying on deep syntactic parsing (Schwarm and Ostendorf 2005) has been corroborated repeatedly in the computational approaches to readability prediction that have surfaced in the last decade (Heilman et al. 2007; Petersen and Ostendorf 2009; Nenkova et al. 2010). Features relating to semantics and discourse processing have proven more difficult to corroborate. While Pitler and Nenkova (2008) have clearly demonstrated the usefulness of discourse relations, their predictive power was not corroborated by Feng et al. (2010), for example. Especially for those features requiring deep linguistic processing, a lot still has to be explored (Collins-Thompson 2014).
In the readability interface, we present a re-implementation of several readability formulas and propose a new readability prediction system which takes into account not only these superficial text characteristics, but also features capturing lexical complexity based on n-gram modelling and syntactic complexity based on deep syntactic dependency parsing.

General Text Characteristics
Once a text is provided, either by cut and paste or through file upload, we first present the user with some of the more general characteristics of the text. We include three length-related features that have proven successful in previous work (Nenkova et al. 2010; Feng et al. 2010; François and Miltsakaki 2012): the average word and sentence lengths and the percentage of polysyllabic words (i.e. words containing more than three syllables). We also incorporate two traditional lexical features: on the one hand, we provide the percentage of words also found in a Dutch word list with a cumulative frequency of 77% (or 'freq77');2 on the other hand, we calculate the type-token ratio (TTR) to measure the level of lexical complexity within a text.
All these characteristics are obtained after processing the text with a state-of-the-art Dutch preprocessor, Frog (Van den Bosch et al. 2007), and a designated classification-based syllabifier (van Oosten et al. 2010). Figure 16.6 illustrates how these general characteristics are presented to the user. Note that we also allow the user to highlight those words that contain more than three syllables or that are infrequent in Dutch.
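The general characteristics above can be sketched as follows. This is a simplified illustration: the real system derives tokens and syllable counts from Frog and the classification-based syllabifier, whereas here pre-tokenised sentences and a per-word syllable lookup are simply assumed as input.

```python
def general_characteristics(sentences, syllables):
    """sentences: list of token lists; syllables: dict token -> syllable count."""
    tokens = [t for sent in sentences for t in sent]
    n = len(tokens)
    polysyl = [t for t in tokens if syllables.get(t, 1) > 3]
    return {
        "avgsentencelen": n / len(sentences),           # words per sentence
        "avgwordlen": sum(len(t) for t in tokens) / n,  # characters per word
        "ppolysylword": 100 * len(polysyl) / n,         # % words > 3 syllables
        "TTR": len({t.lower() for t in tokens}) / n,    # type-token ratio
    }
```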

Readability Judgement Based on Classical Formulas
Though many objections have been raised against the classical readability formulas, they remain popular and are still the go-to solution in many disciplines where a reader or author desires a first insight into text readability, e.g. corporate communication (Dempsey et al. 2012) or legislation (van Boom 2014). This is why, in a second step, we apply a number of readability formulas to the text entered by the user in the interface.
In essence, a readability formula is a mathematical formula intended to indicate the difficulty of a particular text. The formula typically consists of a number of variables, which are characteristics of the text (as displayed in Figure 16.6), and constant weights. Besides the five general text characteristics that were introduced earlier (the average sentence length, avgsentencelen; the average word length, avgwordlen; the percentage of polysyllabic words, ppolysylword; freq77; and TTR), five additional variables are required for calculating all of the different formulas presented in the interface. These variables were derived using the same preprocessing toolkits mentioned above and are listed below:

• avgnumsyl: average word length in number of syllables.
• psw: percentage of sentences per word.
• freq3000: percentage of words not on the Dale-Chall (1948) word list.3
• avgpolysylsent: average number of words with three or more syllables per sentence.
• ratiolongword: ratio of words with more than six characters.

These additional variables are not presented as such to the user. Instead, we display the results of the different formulas for the text entered by the user (see Figure 16.7). These formulas have been designed for Dutch (Douma, 1960; Brouwer, 1963; Staphorsius, 1994), English (Dale and Chall, 1948; Flesch, 1948; Gunning, 1952; Senter and Smith, 1967; McLaughlin, 1969; Coleman, 1975; Kincaid et al., 1975) or Swedish (Björnsson, 1968). As van Oosten et al. (2010) have shown that there is a strong correspondence between the readability formulas intended for different languages, all readability formulas are displayed in the interface independently of the language they aim to model. The following readability formulas are displayed in the interface:

Dutch-language formulas: Flesch-Douma, Leesindex Brouwer, CILT and CLIB.

English-language formulas:
• Flesch Reading Ease: 207 − avgsentencelen − 85 × avgnumsyl
• Dale-Chall Reading Grade Score: 0.16 × freq3000 + 0.05 × avgsentencelen + 3.6
• Coleman-Liau Index: 5.9 × avgwordlen − 0.3 × avgsentencelen − 16
• Flesch-Kincaid Grade Level: 0.39 × avgsentencelen + 12 × avgnumsyl − 16
• Gunning Fog Index: 0.4 × (avgsentencelen + ppolysylword)
• ARI (Automated Readability Index): 4.7 × avgwordlen + 0.5 × avgsentencelen − 21
• SMOG (Simple Measure of Gobbledygook): √(30 × avgpolysylsent) + 3.1

Swedish-language formula:
• Läsbarhetsindex Björnsson: avgsentencelen + ratiolongword

The last column in Figure 16.7 gives information on the scale on which the formulas are calculated. For some formulas (all English formulas except Flesch Reading Ease, the Swedish formula, and the Dutch CLIB and CILT), a higher score corresponds to a more difficult text and a lower score to a more readable text; their slope is considered positive. For the other formulas, viz. Flesch Reading Ease, Flesch-Douma and Leesindex Brouwer, the situation is exactly the opposite and the slope is considered negative.
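Given the variables above, the formulas can be applied directly. The sketch below implements four of them with the constants exactly as displayed in the interface (note that these are rounded variants of the originally published coefficients):

```python
import math

def flesch_reading_ease(avgsentencelen, avgnumsyl):
    # Negative slope: a higher score means a more readable text.
    return 207 - avgsentencelen - 85 * avgnumsyl

def gunning_fog(avgsentencelen, ppolysylword):
    # Positive slope: a higher score means a more difficult text.
    return 0.4 * (avgsentencelen + ppolysylword)

def smog(avgpolysylsent):
    # Simple Measure of Gobbledygook.
    return math.sqrt(30 * avgpolysylsent) + 3.1

def lix(avgsentencelen, ratiolongword):
    # Björnsson's Läsbarhetsindex.
    return avgsentencelen + ratiolongword
```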

Readability Prediction Based on Supervised Machine Learning
Given the many objections raised against the classical formulas, we also judge the readability of the entered text using the corpus-based readability prediction system developed by De Clercq et al. (2014). To compile the gold standard underlying this system, a general-purpose corpus consisting of a large variety of text genres was first assembled and then assessed for readability. For the actual assessments, two web applications were designed to collect readability assessments for Dutch and English texts: one intended exclusively for language experts and one open to the general public. Both applications are available at the following link: http://www.lt3.ugent.be/en/tools/ Figure 16.8 gives an overview of the scores assigned to texts by the experts and the crowd. The red line in both figures shows how our corpus-based readability prediction system scores the text compared to the other texts in the corpus.
Two flavours of the Hendi system have been integrated into this interface: a system which mainly relies on traditional and lexical text characteristics, and a second system which also integrates information representing the syntactic complexity of the text.
The former readability prediction system relies on a feature space of the traditional features mentioned above and lexical n-gram features which have proven to be good predictors of readability in previous work. Since we wanted to avoid presuppositions about the various levels of complexity in our corpus, a generic language model for Dutch was built based on a subset of the SoNaR corpus (Oostdijk et al. 2013). This subset contains only newspaper, magazine and Wikipedia material and should qualify as a generic representation of standard written Dutch. The language model was built up to an order of 5 (n = 5) with Kneser-Ney smoothing using the SRILM toolkit (Stolcke 2002). As features, we calculated the perplexity of a given text with respect to this reference data and also normalised this score by taking the document length into account, as in Kate et al. (2010). For more information on this system we refer the reader to De Clercq et al. (2014) and De Clercq and Hoste (2016).
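The lexical features can be illustrated as follows. The sketch assumes the language model toolkit returns per-token log10 probabilities (as SRILM does); the exact form of the length normalisation used in the system follows our reading of Kate et al. (2010) and is an assumption here.

```python
def lm_features(log10_probs):
    """Perplexity of a document given per-token log10 probabilities
    under a reference language model, plus a length-normalised variant."""
    n = len(log10_probs)
    perplexity = 10 ** (-sum(log10_probs) / n)
    return {
        "perplexity": perplexity,
        "norm_perplexity": perplexity / n,  # normalised by document length
    }
```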
In the latter system, syntactic information as displayed in Figure 16.9 is also taken into account. To this purpose we incorporated the parse tree features first introduced by Schwarm and Ostendorf (2005), which have proven successful in many other readability prediction studies (Pitler and Nenkova 2008; Petersen and Ostendorf 2009; Nenkova et al. 2010; Feng et al. 2010). We calculate the parse tree height, the number of subordinating conjunctions and the ratios of noun, verb and prepositional phrases. As an additional feature, we also include the average number of passive constructions in a text. The parser underlying these features is the Alpino parser (van Noord et al. 2013), a state-of-the-art dependency parser for Dutch.

Figure 16.9: Syntactic information calculated on the basis of the dependency tree of the sentence under consideration.
As the parsing of the text may take some time, this calculation is performed offline and a PDF report is sent to the user as soon as the text has been fully processed.

The Stylometry Interface
The Stylometry Machine Learning (ML) interface makes it possible to run experiments following the full text-categorisation approach to stylometry: it allows the linguistic analysis of Dutch-language documents, the extraction of features used regularly in the research literature, the creation of instances for ML experiments using these features, and the ML experiments themselves. We will describe the different steps to use it in turn. The interface itself contains helpful hints, examples, and information as well. The system available through the interface has reduced functionality compared to the full stand-alone system, which is also available from the authors.
To use the interface on their data, users must first provide an email address in the appropriate field of the interface so that results can be sent to that address. Then the following procedure must be applied.

Step 1. Preparing and Uploading Data for Training
The goal of a supervised ML experiment is to use examples of some mapping to learn a model that generalises to independent similar data. For example, on the basis of a number of texts we know to have been written by Willem Elsschot and other texts written by other authors, we train a machine learning method to learn a model of the style of Elsschot. Afterwards we can test the accuracy of this model by applying it to texts that we did not use for training. The interface therefore makes a distinction between a Training run and a Testing run, and the user starts by uploading data for training.
Suppose we want to run a stylometry experiment predicting the gender of the authors of tweets. We create a directory with two subdirectories (one for male, one for female), and put the 'train' tweets each in a separate file in their corresponding subdirectory. All files should be .txt files with UTF-8 encoding. After creating a .zip file by compressing this directory (a directory that has as many subdirectories as classes, here two, with the texts belonging to each class in their corresponding subdirectory), this archive can be uploaded for training. After uploading, a result screen is presented indicating successful uploading and providing an identity number for further use.

Step 2. De ning the Experimental Parameters
To set up the way the uploaded data will be treated in building a model of style, several types of information have to be provided. First of all, a name has to be provided for the corpus (i.e. the data) that has been uploaded, e.g. 'Elsschot-1' or 'Gender-twitter'. This could, for example, be the name of the top directory in which you provided the subdirectories with training texts.
Next, up to three 'analyses' can be defined. An 'analysis' in this context is a specific definition of the information that will be used to represent the text for the ML algorithm (the so-called document representation or instance definition). To define an analysis, the user selects a type, an n-gram size, and a frequency counting method. The analysis types supported are token (the tokenised words occurring in the text), character (the characters occurring in the text), lemma (the lemmatised tokens in the text), and pos (the part of speech, or grammatical category, of the words in the text). The n-gram size refers to the length of the sequences that we take into account; e.g. for characters, 'n' set to 3 would select all the character trigrams occurring in the text. A sentence such as 'Give me a break!' would result in the following character trigrams: '=Gi, Giv, ive, ve=, =me, me=, =a=, =br, bre, rea, eak, ak=, =!='. Analogously, selecting n = 2 with tokens would result for the same sentence in the token bigrams '= Give, Give me, me a, a break, break !, ! ='. Additional information to be provided for each analysis is the frequency count type, which can be absolute (how many times a particular feature, for example the character trigram '=!=', occurs in the document) or relative (the proportion of the occurrences of this feature in all the occurrences of all features in the document).
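The boundary-padded n-gram extraction and the relative counting in the example can be sketched as follows (a minimal reconstruction of the behaviour shown, with '=' as the padding symbol; not the Stylene code itself):

```python
from collections import Counter

def char_ngrams(tokens, n=3, pad="="):
    """Character n-grams per token, with a padding symbol marking
    token boundaries, as in the '=Gi, Giv, ...' example."""
    grams = []
    for tok in tokens:
        padded = pad + tok + pad
        grams.extend(padded[i:i + n] for i in range(max(1, len(padded) - n + 1)))
    return grams

def relative_counts(grams):
    """Relative frequency of each feature among all feature occurrences."""
    total = len(grams)
    return {g: count / total for g, count in Counter(grams).items()}
```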
For each analysis specified, two datasets will be generated: one where document representations consist of binary vectors, and one where they consist of numeric vectors (where the numeric values are absolute or relative as selected by the user). In addition, a dictionary is provided with the selected features for that experiment, their position in the document vector, and their frequency.

Step 3. Selecting the Features
The document representation defined in the previous step can be very large. In the 'filter' step, this set of features can be reduced to a manageable number on the basis of frequency, informativeness, or a combination of both.
There are three filters that can optionally be selected. If none is selected, all features will be used. The total set filter allows the definition of a frequency band. For example, we might be interested in selecting the 10% most frequent features (set the upper percentage to ten and leave the lower percentage at 0), the 50% least frequent features (set the upper percentage to 0 and the lower percentage to 50), or a middle band, in case one wants the features that are neither very frequent nor very infrequent; in this last case both thresholds could be set to 20, for example. (Recall that the term 'features' in this context refers to the items generated for the document representation, such as character trigrams or lemma bigrams.) Not all features are equally relevant for distinguishing between classes. Statistical and information-theoretic methods such as chi-squared and information entropy can be used to analyse the degree to which a particular feature (e.g. the character trigram '=!=') can differentiate between the classes. The two remaining filters order the features according to relevance as defined by these methods and allow the selection of a percentage of these most relevant features.
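The relevance-based filters rank features by a score such as chi-squared. For a two-class problem this score can be computed from a 2x2 contingency table per feature, as in this sketch (an illustration of the statistic itself, not the interface code):

```python
def chi_squared(n11, n10, n01, n00):
    """Chi-squared statistic for one feature and one class:
    n11 = class docs containing the feature, n10 = class docs without it,
    n01 = other docs containing it,          n00 = other docs without it.
    Higher scores mean the feature better differentiates the classes."""
    n = n11 + n10 + n01 + n00
    denominator = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denominator == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denominator
```

Features would then be sorted by this score, and the user-selected percentage of the most relevant ones kept.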
All that remains to be done at this stage is to indicate whether one wants document features (average word length, average sentence length, average number of syllables, number of hapax legomena, number of hapax dis legomena, and readability) to be computed, which machine learning algorithm one wants to use, and which document representation (binary or numeric). Clicking 'start' activates the whole process and generates an ID.
The user will receive by email a zip file that contains all instance vectors for the analyses and filters chosen for the current training run. The email will also contain a unique identifier that serves as a link between the training run and any test run the user may want to perform in relation to this training run.

Step 4. Testing
With the identifier provided, the user can enter the Stylene machine learning interface again, this time with a test dataset submitted in the same format as the training data. The trained model will be applied to the test data provided and an analysis will be returned.
The user will receive by email a zip file that contains all the instance vectors that have been generated for this test run.

Using the Interface for Text Analysis Only (Optional)
If the user is interested only in parsing their text(s), it is possible to go to the Frog parser interface and submit an archive of texts (again as a zip archive, now with one directory of files to be analysed), which will then only be parsed. No ML models will be built in that case, and for each input file in the archive a Frog output file will be produced with the parsed input text. The Frog parser is also accessible from the Readability interface. The Frog parser used for this project is frozen at version 0.12.15 (c) ILK 1998-2012 to prevent future compatibility issues.

Conclusion
The Stylene project, funded by the Department of Economy, Science and Innovation (EWI) of the Flemish government and executed by the CLiPS4 and LT35 research groups, resulted in several resources, collected behind a single interface, that we hope will prove useful for different categories of users. People interested in the computational linguistics applications of stylometry and readability can analyse texts and be educated about the types of analysis that these research fields apply. Users in the digital humanities can test the automatic text categorisation approach to stylometry in a user-friendly interface suited for exploratory research. Whereas the first stylometry interface is based on simplified models, the readability interface and the machine learning of stylometry interface rely on state-of-the-art software for Dutch. In addition, the interface provides easy access to the state-of-the-art Dutch text analysis software package Frog.

Figure 16.6: General text characteristics of the entered text.

Figure 16.8: Hendi readability score of the text under consideration in comparison to the expert and crowd readability assessments of all texts in the training corpus.