Stylometry and Women Writers Online

By Molly Nebiolo, Research and Encoding Specialist

I was able to fly to Victoria, British Columbia, to attend DHSI 2018 thanks to a course waiver awarded by DHSI and a NuLab Seedling Grant that funded my transportation and housing for the class. Details on my experience with DHSI can be read here.

Figure 1: Stylo-generated image analyzing the most used words in each text and comparing them to one another. This image shows 14 of the over 400 WWO texts to clearly illustrate the relationship between texts. The shorter the vertical line between two texts, the more similar their lists of most used words are, and thus the more similar their writing styles. Note how for Askew, Astell and Adams, the authors’ styles are distinct from one another, but are less so for the others.

Stylometry is a way to compare the similarity of texts in vector space and visualize those connections or changes between authors, over time, or across genres. The algorithms that make up the Stylo package weigh the most used words in each text, then compare those lists across the textual corpus. The visual product in R illustrates how closely related these lists of most used words are between texts. For example, with the WWO corpus, many of the texts were clustered by author (see Figure 1). This is because the tool helps to determine whether authors keep a consistent writing style across their writing careers, a question also called stylochronometry, and which authors write similarly to one another. By transferring the CSV file produced by the Stylo package in R to Gephi, I was able to create better visualizations of the similarities and differences between texts in a corpus. Gephi allows for a sharper vector visualization of the relatedness of texts. Instead of a tree using vertical distance between each point to represent nearness in writing style, Gephi produces a network between the texts (see Figure 2).
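To make the idea of comparing lists of most used words more concrete, here is a minimal sketch in Python. It is not Stylo's own code (Stylo is an R package and offers several distance measures of its own). The toy texts, the function names, and the choice of a plain Manhattan distance on raw relative frequencies are all assumptions made for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    # Lowercase and keep runs of letters and apostrophes; a deliberately crude tokenizer.
    return re.findall(r"[a-z']+", text.lower())


def most_frequent_words(texts, n):
    # Rank words by their total count across the whole corpus and keep the top n.
    corpus_counts = Counter()
    for text in texts.values():
        corpus_counts.update(tokenize(text))
    return [word for word, _ in corpus_counts.most_common(n)]


def relative_frequencies(text, vocabulary):
    # Share of the text's tokens taken up by each word in the shared vocabulary.
    tokens = tokenize(text)
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def manhattan_distance(a, b):
    # Sum of absolute differences; smaller means more similar word usage.
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical toy texts standing in for documents in a corpus.
texts = {
    "text_a": "the cat sat on the mat and the dog sat by the door",
    "text_b": "the dog sat on the rug and the cat sat by the door",
    "text_c": "of love and of loss and of hope I write and of grief",
}

mfw = most_frequent_words(texts, 5)
profiles = {name: relative_frequencies(text, mfw) for name, text in texts.items()}

dist_ab = manhattan_distance(profiles["text_a"], profiles["text_b"])
dist_ac = manhattan_distance(profiles["text_a"], profiles["text_c"])
print(mfw, dist_ab, dist_ac)
```

In this toy corpus, text_a and text_b use the shared high-frequency words in nearly the same proportions, so they end up closer to each other than either is to text_c. A tree or network visualization is built from a full table of such pairwise distances.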

Figure 2: A visualization of the WWO corpus using Gephi. From afar, one can see some authors, including Margaret Cavendish and Eleanor Davies, who wrote quite differently from the rest of the corpus (indicated as almost separate clusters of texts). Others were more similar to one another, indicated in the central body of the image. The thicker the lines, the more related the texts were, based on most used words. To see a full-resolution and keyword-searchable version of this image click here.

Several results stand out in these two images. For example, Figure 1 shows that some of the female authors in the corpus write very distinctly from one another, particularly Davies and Cavendish. This results in the few separate clusters of texts seen in Figure 2. Yet other authors, like X, Y and Z, have such strong similarities among their texts that they are quite interconnected, which produced the larger mass of texts seen in the second figure.

But what is actually being compared between the texts? Stylo runs an algorithm that finds the most used words in each text and then compares the lists to one another. The results, which consist of an image (Figure 1) and a CSV file of the word list and data, show a rough tree of the interconnections between the texts and authors. Transferring the CSV file of the Stylo results to Gephi then produces the sharper vector image of textual relationships seen in Figure 2. Depending on how the corpus is constructed, Stylo, paired with Gephi, can produce analyses of different authors, the evolution of one author over time, or even the evolution of literature over centuries. The process is not restricted to comparing texts written by different authors.
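For readers curious what a Gephi-ready file might look like, the sketch below turns a small table of pairwise distances into an edge list with Source, Target and Weight columns, which Gephi's spreadsheet import recognizes. This is an illustration in Python rather than the file Stylo itself writes. The distances, the cutoff, and the rule of converting a distance to a weight as one minus the distance (which assumes distances scaled between 0 and 1) are all hypothetical.

```python
import csv
import io

# Hypothetical pairwise distances between texts (smaller = more similar),
# assumed here to be scaled between 0 and 1.
distances = {
    ("text_a", "text_b"): 0.10,
    ("text_a", "text_c"): 0.90,
    ("text_b", "text_c"): 0.80,
}


def edge_rows(distances, max_distance):
    # Keep only pairs that are close enough, and turn distance into a
    # similarity weight so that thicker edges mean more related texts.
    rows = []
    for (source, target), d in sorted(distances.items()):
        if d <= max_distance:
            rows.append({"Source": source, "Target": target, "Weight": round(1.0 - d, 3)})
    return rows


def write_edge_csv(rows, handle):
    # Write the edge list with the column names Gephi looks for on import.
    writer = csv.DictWriter(handle, fieldnames=["Source", "Target", "Weight"])
    writer.writeheader()
    writer.writerows(rows)


rows = edge_rows(distances, max_distance=0.85)
buffer = io.StringIO()  # stands in for a file opened with open("edges.csv", "w", newline="")
write_edge_csv(rows, buffer)
edge_csv = buffer.getvalue()
print(edge_csv)
```

Dropping the most distant pairs keeps the network readable: without a cutoff every text is linked to every other text, and the clusters are much harder to see.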

Figure 3: Screenshot of results when using Classify to determine the author most similar to an anonymously written text. In this example, the result is that the anonymous author is, or writes very similarly to, Sir Arthur Conan Doyle. MFW indicates the number of most frequently used words in the text, The Brand of Silence, that are compared to the MFW in the primary and secondary corpora of Doyle’s work and Christie’s work.

Stylo can also be used to identify an anonymous author, using the Classify or Rolling Classify process. Using two corpora of texts, we can determine how closely an anonymous author’s writing relates to that of potential writers who may be using an alias. For example, I found an anonymous mystery novel written in the 1920s. This was my secondary corpus. My primary corpus consisted of the Sherlock Holmes series by Doyle and a couple of mysteries by Agatha Christie, chosen based on which files of the two authors’ books were available on Project Gutenberg. The output is a visual that shows the relatedness of the anonymous text to the contenders. In my example, the unknown text was most like Doyle’s works (see Figure 3). I later found out that this anonymous author (later determined to be Harrington Strong) was a male mystery novelist, which could be a gendered reason why his writing was most like Doyle’s.
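The logic of attributing an unknown text to the nearest candidate can be sketched in a few lines of Python. This is a simplified stand-in, not Stylo's Classify itself, which offers several classification methods. The author labels, sample sentences, and the nearest-profile rule below are assumptions for illustration only.

```python
from collections import Counter
import re


def tokenize(text):
    # Lowercase and keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def profile(text, vocabulary):
    # Relative frequency of each vocabulary word in the text.
    tokens = tokenize(text)
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def distance(a, b):
    # Manhattan distance between two frequency profiles.
    return sum(abs(x - y) for x, y in zip(a, b))


def classify(unknown_text, training_texts, n_mfw):
    # Build the most-frequent-word list from the known authors' samples only,
    # then attribute the unknown text to the author with the closest profile.
    counts = Counter()
    for sample in training_texts.values():
        counts.update(tokenize(sample))
    vocabulary = [word for word, _ in counts.most_common(n_mfw)]
    unknown = profile(unknown_text, vocabulary)
    scores = {
        author: distance(unknown, profile(sample, vocabulary))
        for author, sample in training_texts.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical samples; a real training set would hold whole books per author.
training_texts = {
    "author_one": "the game is afoot and the case is not yet closed",
    "author_two": "she said that she knew what she had seen that night",
}
unknown_text = "the case is closed and the game is over"

best, scores = classify(unknown_text, training_texts, n_mfw=4)
print(best, scores)
```

With real data, the number of most frequent words is a setting worth varying, since an attribution that holds across many values is more convincing than one that appears at a single value.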

Rolling Classify is an algorithm that shows how an author’s style compares to that of other writers across the length of a text. It analyzes the text in segments (usually of 100 or 1,000 words each), and each segment is compared with the style of writing in Doyle’s corpus and Christie’s corpus. In Figure 4, we can see the results. The unknown author was most like Doyle overall because his style of writing was most like Doyle’s in most of the text (green), while very few parts of the book were written most like Agatha Christie (red). This last technique in the course seemed most useful, because it can be used to compare the writing styles of authors at a finer grain than just placing whole texts in vector space: we can dissect a text into segments to make these comparisons. For the Women Writers Project, Rolling Classify can provide a closer analysis of a text compared to similar authors. Rather than looking at the top ten most used words to create a network of similarity for the text as a whole, this process works within segments of the text.
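The segment-by-segment comparison can be sketched as follows. Again this is a Python illustration under stated assumptions, not Stylo's Rolling Classify: the windows here do not overlap, the vocabulary is fixed by hand, and the token lists and author labels are hypothetical.

```python
from collections import Counter


def segments(tokens, size):
    # Cut the token list into consecutive, non-overlapping windows.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def profile(tokens, vocabulary):
    # Relative frequency of each vocabulary word within the tokens.
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def distance(a, b):
    # Manhattan distance between two frequency profiles.
    return sum(abs(x - y) for x, y in zip(a, b))


def rolling_classify(unknown_tokens, author_tokens, vocabulary, size):
    # Label each segment of the unknown text with its nearest candidate author.
    author_profiles = {a: profile(t, vocabulary) for a, t in author_tokens.items()}
    labels = []
    for segment in segments(unknown_tokens, size):
        seg_profile = profile(segment, vocabulary)
        nearest = min(author_profiles, key=lambda a: distance(seg_profile, author_profiles[a]))
        labels.append(nearest)
    return labels


# Hypothetical data: a tiny vocabulary and two candidate authors.
vocabulary = ["the", "she"]
author_tokens = {
    "author_one": "the the the she".split(),
    "author_two": "she she she the".split(),
}
unknown_tokens = "the the the the she she she she the the".split()

labels = rolling_classify(unknown_tokens, author_tokens, vocabulary, size=4)
overall = Counter(labels).most_common(1)[0][0]
print(labels, overall)
```

The list of labels corresponds to the colored bands in a plot like Figure 4, and the most common label gives the overall attribution.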

Figure 4: Rolling Classify analysis of The Brand of Silence compared with the writing styles of Agatha Christie and Sir Arthur Conan Doyle.

Stylometry opens up many doors for visualizing corpus data. While literature-based analyses can benefit the most from this digital tool, intellectual historians and other scholars wanting to study the relationship between written corpora can benefit from it as well. Particularly for the Women Writers Online corpus, the results from using Stylo and Gephi can be helpful for looking at a large corpus from a broader perspective to find connections that might not have been seen before. My own experience learning stylometry at DHSI 2018 was different from how most people may learn the tool, because I dedicated a full week to a course on the toolset, but plenty of tutorials and articles on stylometry are out there for people to learn it and use it in their own work. The possibilities are endless for this type of visualization tool.