Are the romantics to blame? Exploring the WWP WordVectors Code as a Word Vectors Novice

By Emily Miller

Leading up to the July 2021 Word Vectors for the Thoughtful Humanist workshop, I’d been doing some work with the WWP website and incubating a fledgling word2vec project as part of my work on the Graduate Certificate in Digital Humanities here at Northeastern. The workshop approached, and the R files were prepared and polished. All that was left was to give them a road test. As someone new-ish to DH and learning these tools for my own project, I was the perfect guinea pig. My task was to run through the files critically, sitting in any discomfort, noticing it, taking notes about it. And, as someone with no prior coding experience, it was also my job to see if I could begin to do meaningful text analysis work using the WWP’s word2vec code share. The following post outlines what it was like to use the code and begin training and querying models on my very own corpus, as someone brand new to the toolset.

I hesitated for a long time before writing this, because there are so many beautiful, intelligent, comprehensive posts out there on word vector analysis, and I, an absolute neophyte, felt that I had very little to add. But then I remembered that around this time last year, as a first-year MA student who’d just learned that this cool thing called DH existed, I spent a long time falling through dead-end Google searches, just looking for how you get started bringing digital methods into your literary studies research. What did I need to download? Would I be able to figure out how to use it? Would I need a one-on-one teacher to understand it? I’m an English major. One of those amazing coding English majors? Well, I learned some basic HTML in a mandatory high school class, and then forgot all of it. The point is, when I began this file-testing work, I had downloaded RStudio less than 24 hours beforehand. So this post is for the people like me, the English majors who downloaded RStudio yesterday, or might download it tomorrow. It’s about the kinds of things you can do to get started.

Here you’ll find me exploring new software, with the WWP WordVectors code holding my hand all the way. And, as I get to know the code and what kinds of questions I can ask with it, I’ll also begin to investigate my own corpus of nineteenth-century novels.

Corpus and Project

The WordVectors tutorials come with some small sample corpora for the first couple of runs through, when you’re acquainting yourself with the fundamentals of running code and all the new vocabulary. But they also get you ready to train a model on your own corpus.

My corpus is relatively small (about 2,000,000 words; you want at least 1,000,000 for meaningful results with this tool) and it began, at Sarah Connell’s suggestion, with my physical bookshelf. It helps to be familiar with your texts when you start to see shards of them spat back at you by a computer. So, while there are almost certainly better examples out there of a long nineteenth-century fiction corpus for exploring attitudes about the natural world as the industrial revolution progressed, my project’s prototype corpus is built entirely of books I read in undergrad, or in the last several years. The corpus spans 1796 to 1899 and contains mixed fictional genres, all published in England, Ireland, and Scotland. The most important selection criterion I kept in mind was relative temporal distribution: of the 20 novels, the largest cluster is near the mid-1800s, with sparser pickings before 1840 and after 1880. This way, I’m hoping my results don’t trend too far towards the early or late 1800s until I begin splitting the time periods at a later stage of the project.

Something else I always wondered, and was embarrassed to wonder: how do digital humanists “build” a corpus? You can find tons of tutorials for scraping Gutenberg for huge projects, but if you are just getting started, here is what I did: I lovingly hand-highlighted the texts from Project Gutenberg, saved them as txt files, and manually highlight-deleted most of the front and back matter. Not glamorous, but digital methods do not have to be, especially at this scale. (Related: Laura Johnson has this wonderful Data Preparation Checklist for word vector text analysis projects!)
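For anyone who would rather not highlight and delete by hand, the same cleanup can be sketched in a few lines of Python. This is only an illustration, not part of the WWP files (which are written in R), and it rests on an assumption: that each Project Gutenberg text file marks where the book begins and ends with lines starting "*** START OF THE PROJECT GUTENBERG EBOOK" and "*** END OF THE PROJECT GUTENBERG EBOOK". The wording varies between files, so check each one. The function names are hypothetical.

```python
import re

# Marker lines that Project Gutenberg plain-text files typically use.
# The exact wording differs between files, so these patterns are assumptions.
START_MARKER = re.compile(
    r"^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_MARKER = re.compile(
    r"^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    If a marker is not found, nothing is cut on that side.
    """
    start = START_MARKER.search(text)
    end = END_MARKER.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


def count_words(text):
    """Rough word count: split on whitespace."""
    return len(text.split())


# A tiny stand-in for a downloaded file (invented for this example).
sample = (
    "Title page and license notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "First line of the novel.\n"
    "Last line of the novel.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "More license text\n"
)

body = strip_gutenberg_boilerplate(sample)
print(body)
print(count_words(body))
```

The word count is there because corpus size matters for this tool: summing it across every file is a quick way to see whether a corpus clears the 1,000,000-word mark. Title pages, prefaces, and tables of contents that sit inside the markers would still need trimming by hand.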

With this corpus, I hope to explore the developments of language around, and attitudes towards, the natural world as industrialization and urban expansion progressed in the nineteenth century. It’s now looking like the Anthropocene’s start date is more likely to be placed around the nuclear age in the 1900s, but we can first see warming temperatures in ice core samples beginning in the 1700s, with the start of the industrial revolution in Europe—this was another reason to choose nineteenth-century novels (Crutzen 17). So I want to look at those novels, the ways they represent the natural world: as land for the taking? Vessel for the soul? Resource for extraction? Really it’s all of those, and so much more nuanced, but I’m interested in the way novels as a form of popular entertainment can both show us prevailing attitudes towards the natural environment, and impact/perpetuate contemporary attitudes.

Word vectors were suggested to me as an optimal way of doing this research (at first I was thinking topic modeling) because I can go in and search for certain semantic neighborhoods. That’s exactly what happens later in this post.

In this corpus, the monster novels are overrepresented. The gothic is overrepresented. That will impact my results, as we’ll see later on, because the way they talk about nature and use metaphor is distinct. But it also gives me questions, meaty questions, to keep checking in on as the project progresses.

The idea is that the WordVectors toolkit gets the new researcher acquainted with word2vec in R, so that, as you begin to learn how the number of “iterations” or “negative samples” will impact the model you make of a corpus and dig further in your own research, you have a firm foot in the door. As you understand it more, you can begin tailoring your code to your own needs. I stick pretty close to the defaults here because, as above: neophyte. But this work will teach me what steps I may choose to take as my project moves forward.

Training my first model

The results in this post are from a model trained on the WWP files’ default settings. There are two reasons for this: it is easy to follow if you are coding along with me, and these settings produce meaningful results with my corpus. A good way to check if you have meaningful results is to run some clusters and see how reliably related the words in each cluster are.
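To make the idea of "running some clusters" concrete, here is a toy sketch in Python. It uses k-means, one common clustering method; whether the WWP files cluster in exactly this way is an assumption, and the two-dimensional vectors below are invented for illustration (a trained model has many more dimensions). The check is the same one described above: read each cluster and ask whether the words belong together.

```python
import math


def distance(a, b):
    """Straight-line (Euclidean) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Average a list of vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, centers, rounds=10):
    """Group words by their nearest center, then move each center
    to the average of its group, and repeat."""
    clusters = []
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for word, vec in vectors.items():
            nearest = min(
                range(len(centers)), key=lambda i: distance(vec, centers[i])
            )
            clusters[nearest].append(word)
        # An empty cluster keeps its old center.
        centers = [
            mean_vector([vectors[w] for w in members]) if members else center
            for members, center in zip(clusters, centers)
        ]
    return clusters


# Invented toy vectors, not output from a real model.
toy_vectors = {
    "gloom": [0.9, 0.1],
    "melancholy": [0.8, 0.2],
    "darkness": [0.85, 0.05],
    "bright": [0.1, 0.9],
    "shining": [0.2, 0.8],
    "sun": [0.15, 0.95],
}

# Start one center on "gloom" and one on "bright" so the run is repeatable.
clusters = kmeans(toy_vectors, [toy_vectors["gloom"], toy_vectors["bright"]])
for members in clusters:
    print(members)
```

With a real model the starting centers are usually chosen at random, so the clusters differ a little from run to run. That variation is part of why it is worth looking at several sets before judging a model.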

Figure: The cluster results for my corpus with the WWP default settings

The files walk through using RStudio, and what you need to know about R for humanities researchers, like how to run a line of code, or a chunk of code. They give you a tour of your working environment (Introduction-to-R-and-RStudio), and guide you through installing the packages (Word-Vectors-Installation-Training-Querying-and-Validation). At the end of the Installation and Training file, you’ll have trained your very first model, either on one of the built-in sample corpora, or your own corpus. You can begin evaluating your model using the information at the end of the file, learn to read in (translation: open and use) an existing model that you’ve already trained, and—best part—begin exploring your corpus.

I would describe the feeling of having finally trained a model as complete euphoria. Every time my end-user fingers hit command+enter to run a line of code, I was ready for something to go wrong. But this code is difficult to break and easy to tinker with once you start to become comfortable with it. Modifying the code can be as simple as showing more or fewer words in the clusters, or on a plot.

Initial Results

So, with a model trained and ready for querying, I progressed to the Word-Vectors-Visualization file to really begin learning the contours of this model. There are a number of visualizations that the file walks you through, explaining what each setting means and how to read the results.

What we’re going to look at here are related word pairs. This visualization is a simple and powerful way to explore relationships between concepts in your corpus, by plotting some number of words with the closest cosine similarities to words on the x-axis and y-axis. Where they are on the chart indicates their proximity, in this model, to the x- and y-axis words.
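The arithmetic behind such a plot can be sketched in Python. This is a minimal illustration rather than the WWP code (which is in R): the three-dimensional vectors are invented, and the function names are hypothetical. Cosine similarity measures how closely two vectors point in the same direction, and each plotted word gets two scores, one against the x-axis word and one against the y-axis word.

```python
import math


def cosine_similarity(a, b):
    """Dot product of two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)


def pair_coordinates(vectors, x_word, y_word):
    """Give every other word an (x, y) position: its similarity to
    x_word and its similarity to y_word."""
    coordinates = {}
    for word, vec in vectors.items():
        if word in (x_word, y_word):
            continue
        coordinates[word] = (
            cosine_similarity(vec, vectors[x_word]),
            cosine_similarity(vec, vectors[y_word]),
        )
    return coordinates


# Invented toy vectors, not output from a real model.
toy_vectors = {
    "gloom": [0.9, 0.1, 0.2],
    "bright": [0.1, 0.9, 0.2],
    "melancholy": [0.8, 0.0, 0.3],
    "shining": [0.0, 0.8, 0.1],
}

coordinates = pair_coordinates(toy_vectors, "gloom", "bright")
for word, (x, y) in coordinates.items():
    print(f"{word}: similarity to gloom = {x:.2f}, to bright = {y:.2f}")
```

In this toy data "melancholy" lands far along the "gloom" axis and low on the "bright" axis, while "shining" does the reverse. That is the pattern to look for when reading the real plot.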

Figure: Code for visualizing related word pairs

Figure: Related Word Pairs visualization of “gloom” plotted against “bright”

The guidelines encourage you to try plotting some opposites (e.g., “man” and “woman”). Thinking about ways I could look for the words and attitudes about the natural world, I tried plotting “day” by “night” and “summer” by “winter.” I kept seeing the word “gloom,” so I decided to plot it against something that could be thought of as its opposite. I plotted it against “bright.”

I was expecting to find romantic nature words, because “gloom” comes up so often in the forest scenes of the gothic and romantic novels in this corpus (think Shelley and Radcliffe). I was also definitely expecting states of light and darkness, and found those, too. But also here—and this is what’s great about word vector analysis—are the metaphorical associations with those states of light and darkness. Because while these are not attitudes specifically towards nature, they are associations from the time period, which seems like a good place to start.

Words with a greater cosine similarity to “gloom” tended more towards expressions of human emotion than words with similarity to “bright.” In fact, across the chart, “bright” relates more to colors (“yellow,” “blue”), the outdoors (“landscape,” “sun”), and predictably, states of light (“clear,” “shining,” “brilliant”). That is, “bright” seems to be used overall more literally in this corpus.

By contrast, the words with greatest cosine similarity to “gloom” are “melancholy,” “gloomy,” “grandeur,” “stillness,” and “darkness.” Hovering just below 0.6 is the romantic cluster of “profound,” “solitude,” and “anguish.” What’s really interesting about this (we all know gloom can be an emotion word evoking the literal state of darkness) is its relationship to romantic attitudes about nature. I visited the OED to see at what point the figurative meaning rose to popularity.

“Gloom” has been in use since the late 1500s, primarily defined: “An indefinite degree of darkness or obscurity, the result of night, clouds, deep shadow, etc,” though the OED definition is careful to note that the word is “Originally poetic, and still somewhat rhetorical in use. By association with the figurative sense, the word has latterly tended to denote a painful or depressing darkness…” (OED “Gloom,” emphasis original).

The first use-example of “gloom” in the figurative sense is from 1744, and reads, “The Face of Nature, said he, will perhaps dispel these Glooms” (OED “Gloom”). So the figurative meaning was in vogue by the era of my corpus, as was this idea, above, that nature can alleviate human sadness, or gloom specifically. This is an illustrative example of a prominent attitude in at least the earlier works in the corpus.

In Captain Walton’s letters, in Frankenstein, Shelley writes that the wilderness should have the power to elevate Frankenstein’s melancholy soul:

“…but many things will appear possible in these wild and mysterious regions which would provoke the laughter of those unacquainted with the ever-varied powers of nature…” (n.p. Letter from August 19th, 17–).

One of the early projects of ecocriticism was to point out the ways in which we (humans) project our meanings and needs onto the natural world around us, as above in Frankenstein, disappearing into the woods in order to find solace, clarity, escape from society. So, a vessel for emotion, and a place of solace (from an increasingly urban world) are some of the early (1818) ways that this corpus uses the idea of the natural world. What other ways does this corpus think about the relationship of the human and the nonhuman?

This is only very early work for my project, but in the relationship of “gloom” to “bright,” a transactional attitude is brought to the fore: the character enters the natural world in hopes of shedding some of their human suffering. What is the relationship of these literary attitudes to the ongoing industrialization of the 1800s?

And more importantly, what are the effects of literary/cultural attitudes towards the natural world? In other words, why does it matter? Historically, attitudes towards nature have been dynamic and ultimately extractive. But they can also be actively harmful, especially to peoples who are already marginalized, and cultural products like novels and movies perpetuate these ideas. William Cronon has written that, in western thought, nature has shifted from a wasteland to a place of romantic sublime self-discovery. In America, it became a vessel for “rugged individualism”: the frontier. It is also, he writes, an escape from tiring, urban modernity, a tourist destination, a mirror for the self-image of the wealthy and middle classes. Cronon raises the issue that, with the creation of Glacier National Park as “the illusion of [the tourists’] nation in its pristine original state,” the Blackfeet people were shut out of that land, which was never ceded by treaty (36, 38). The idea of wilderness shifts, and is bound up in dominant philosophies of extractivism, capitalism, colonialism. In exploring the ways that these novels represent the natural world (in the models I’m training), I hope to trace some of the ideological shifts that went on; the novelistic movements that spread and bred the next movements. Essentially: what is the role of the novel in early climate change? How do these novels engage with the natural environment and extractive mentalities? And what is their connection/implication with popular socioeconomic ideologies that bred early climate change?

These are all questions I’m excited to engage with as I become more comfortable in the RStudio work environment. Going forward, I have some more cleaning to do (that word2vec helped me find), and I’d like to round out my genres. But I am so excited about the running start these walkthroughs gave me.

Works Cited

Cronon, William. “The Trouble With Wilderness: Or, Getting Back to the Wrong Nature.” Out of the Woods, edited by Char Miller and Hal Rothman, University of Pittsburgh Press, 1997, pp. 28-50. JSTOR, https://www.jstor.org/stable/j.ctt7zw9qw.8. Accessed 27 Nov. 2020.

Crutzen, Paul, and Eugene Stoermer. “The ‘Anthropocene.’” Newsletter of the International Geosphere-Biosphere Programme, May 2000, pp. 17-18. http://www.igbp.net/download/18.316f18321323470177580001401/1376383088452/NL41.pdf. Accessed 1 Dec. 2020.

“gloom, n.1.” OED Online, Oxford University Press, September 2021, www.oed.com/view/Entry/79084. Accessed 17 September 2021.

Shelley, Mary. Frankenstein. Project Gutenberg, prepared by Judith Boss, Christy Phillips, Lynn Hanninen, and David Meltzer. https://www.gutenberg.org/files/84/84-h/84-h.htm. Accessed 1 Dec. 2020.