This is a post in a series authored by our encoding team on the Intertextual Networks project. For more information, see here.
By Adam Mazel, WWP and DSG Research Collaborator, Northeastern University
What are some of the challenges of interpreting computer-generated literary statistics? In this blog post, I respond to this question by reflecting on my process of computationally analyzing textual citations in Women Writers Online (WWO), a collection of digitized writing in English by women between 1526 and 1850. These citations are encoded in an XML bibliography that catalogues the cited author’s name, the title of the work and the publication date, as well as the author’s gender and the genre of their work. This bibliography makes it possible to digitally trace trends and relationships via XPath and XQuery, computer languages that find and extract encoded data. The results of some of these digital analyses are outlined below, followed by an overview of challenges that come with interpreting this data specifically and computer-generated textual analyses more generally.
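As a rough illustration of the kind of query described above, the sketch below uses Python's standard-library ElementTree, which supports a limited subset of XPath, rather than XQuery itself. The markup is hypothetical: the element and attribute names (`entry`, `author`, `title`, `genre`, `gender`, `year`) and the sample records are invented for illustration and are not the actual WWO bibliography schema or data.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography markup. Element names, attribute names, and
# records are invented placeholders, not the real WWO schema or data.
SAMPLE = """
<bibliography>
  <entry genre="novel" year="1700">
    <author gender="F">Author A</author>
    <title>Title One</title>
  </entry>
  <entry genre="history" year="1750">
    <author gender="M">Author B</author>
    <title>Title Two</title>
  </entry>
  <entry genre="novel" year="1800">
    <author gender="M">Author C</author>
    <title>Title Three</title>
  </entry>
</bibliography>
"""

root = ET.fromstring(SAMPLE)


def titles_by_genre(root, genre):
    """Return the titles of all entries whose genre attribute matches."""
    # ElementTree understands simple XPath attribute predicates like this one.
    return [e.findtext("title") for e in root.findall(f"./entry[@genre='{genre}']")]


def titles_by_author_gender(root, gender):
    """Return the titles of entries whose author has the given gender."""
    return [
        e.findtext("title")
        for e in root.iter("entry")
        if e.find("author").get("gender") == gender
    ]


print(titles_by_genre(root, "novel"))
print(titles_by_author_gender(root, "M"))
```

A real query against the bibliography would need the project's actual element names and would more likely be written in XQuery, but the shape is the same: select entries by an encoded property, then read off the fields of interest.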
Numeric data can generate insights; it can also mislead. For instance, XQuery shows that between 1640 and 1653, the text most frequently cited by WWO authors is Revelation (155 citations), followed by Isaiah (124) and, much more distantly, by Psalms (52). Before and after that period, Revelation was cited far less frequently.[66] Initially, I interpreted this data as suggesting that interest in Revelation was catalyzed by the contemporaneous English Civil War (1642–51). That citations of Revelation spiked during wartime seemed appropriate for a number of reasons. First, Revelation’s portrayal of intense conflict echoed the wartime violence. Second, the book’s thematic emphasis on prophecy may have offered readers hope and clarity during the horror and confusion of the war. Third, its representation of God’s judgment of the corrupt may have been used to justify one’s cause and the bloodshed. Indeed, I was set to conclude that this spike in interest reaffirms that history and literary production and consumption relate dynamically, each stimulating the other.
However, I soon had to reconsider my hypothesis that the English Civil War stimulated interest in Revelation, for a project colleague pointed out to me that the majority of the citations of Revelation between 1630 and 1650 were not by a representative array of WWO authors responding to wartime events but rather by a single author: Eleanor Davies, who related Revelation to events in the early 1600s.[67] The English Civil War was not reviving interest in Revelation (at least according to this data). Rather, a lone mystic was skewing the results. This type of error can be difficult to avoid in purely distant reading: a prominent Victorianist once admitted to me that she had been struggling to explain why newspaper references to “prosody” suddenly spiked in the 1890s, only to find that “Prosody” was the name of a racehorse popular at that time! (Though that in itself suggests the contemporary popularity of prosody-the-concept.)
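One way to catch this kind of skew before interpreting a spike is to break a citation count down by citing author. The following is a minimal sketch under stated assumptions: the `(citing_author, cited_text)` tuples are invented toy records, not WWO data, and `top_citer_share` is a hypothetical helper name.

```python
from collections import Counter

# Invented toy records of the form (citing_author, cited_text). Not WWO data.
citations = [
    ("Author A", "Text X"),
    ("Author A", "Text X"),
    ("Author A", "Text X"),
    ("Author A", "Text X"),
    ("Author B", "Text X"),
    ("Author B", "Text Y"),
    ("Author C", "Text Y"),
]


def top_citer_share(citations, cited_text):
    """Return (author, share) for the author who cites cited_text most often.

    share is that author's fraction of all citations of cited_text.
    Returns (None, 0.0) if the text is never cited.
    """
    by_author = Counter(a for a, t in citations if t == cited_text)
    total = sum(by_author.values())
    if total == 0:
        return None, 0.0
    author, count = by_author.most_common(1)[0]
    return author, count / total


author, share = top_citer_share(citations, "Text X")
print(f"{author} accounts for {share:.0%} of citations of Text X")
```

A high share for a single author does not settle the interpretation, but it flags an aggregate count that deserves a closer look at the underlying texts.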
Likewise, how should we interpret the fact that when WWO authors cite novels, 260 of those cited novels were written by women and only 73 by men, a ratio of roughly 3.6:1 in favor of women? Similarly, cited children’s literature has a nearly 3:1 ratio of female to male authors, and cited gender commentary a 6:1 ratio of women to men. Conversely, when WWO authors cite literature in the classical languages, those cited classical texts were written by men exclusively, just as cited histories have a 10:1 ratio of male to female authors, and cited biographies have a 5:1 ratio in favor of men.[68] This data can be interpreted in at least two ways. On the one hand, it suggests that these genres were gendered: i.e. in this period, the novel, like children’s literature and gender commentary, may have been written by women more than by men, just as the classics, history, and biography may have been primarily written by male authors. But on the other hand, WWO authors may have been more interested in novels that were written by women rather than by men and histories written by men rather than by women. In short, interpreting numbers is challenging: it is very difficult to know what they are saying without examining the data in detail.
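For readers who want to check the arithmetic behind such ratios, here is a small sketch using the two novel counts reported above (260 and 73). The function name `ratio_and_share` is a hypothetical helper, and the calculation says nothing about which of the two interpretations is correct.

```python
def ratio_and_share(majority, minority):
    """Return (majority-to-minority ratio, majority share of the total)."""
    return majority / minority, majority / (majority + minority)


# Counts of cited novels by women and by men, as reported in the text.
ratio, share = ratio_and_share(260, 73)
print(f"{ratio:.2f}:1, {share:.1%} of cited novels by women")
```

Reporting the share alongside the ratio makes it easier to compare genres whose totals differ widely.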
Analyzing WWO data raises further questions. For instance, how much can one generalize from WWO citations? While the corpus is extensive, totaling over 400 texts, it represents only a tiny fraction of the texts published between 1526 and 1850. Likewise, genre definitions are subjective and based on interpretation, and the genre labels of the texts in the WWO bibliography are all the more so, given that some of those genre label decisions were based on minimal information (a reading of merely the title, rather than the text itself, as some of the texts have been lost to time).
Thus, rather than draw firm conclusions about this data here, it may be more appropriate to consider the methods of interpreting numerical data itself. As the above examples suggest, macroanalysis—large-scale computational analysis of literary trends—is insufficient on its own for drawing accurate conclusions. Rather, macroanalysis needs to be complemented with microanalysis—the close examination of literary data—in order to interpret what the numbers are saying. Like the abovementioned authors who cite each other, each method needs the other to succeed.