
By Hayley C. Stefan (she/her)

This post is part of a series we are publishing with projects from the WWP’s Institutes Series: Word Vectors for the Thoughtful Humanist.

In May 2021, seemingly a lifetime ago, I had the opportunity to attend Word Vectors for the Thoughtful Humanist, a week-long pedagogy-focused workshop put on by the Women Writers Project at Northeastern University, funded by an NEH Institutes for Advanced Topics in the Digital Humanities grant. One of our goals for the workshop was to think seriously about how we might use word vectors in our teaching, from digital humanities-specific courses to those that might integrate digital humanities & media studies tools or methods. Even though we met online due to COVID-19, the workshop & community were both still interactive, and I came away from the week having met new collegial friends, feeling intrigued & invigorated by the possibilities of word-embedding models. I remain excited about the research I am now able to do because of the patient and thorough guidance of the WWP team. In many ways the workshop feels as though it’s helped me find ways of approaching some of the questions I hope to ask about my corpus, some of which I discuss in the first part of this post. Despite the advances the workshop has helped me make in my research, though, I have not yet found a feasible way to incorporate word-embedding models in the courses I teach. 

In this post, I walk through some of my research successes with word vectors and my failure to bring them into the classroom. It’s a dilemma that is no fault of the organizers, who asked every question and offered funds (actual funds!) to help folks develop ways and means of teaching with word vectors. We spent several parts of the workshop discussing opportunities for teaching with word-embedding models, from brainstorming to working in small groups to plot out how one person’s imagined assignment or exercise might work. The problem I’m currently facing—and one of the reasons it’s taken me so long to produce a blog post that addresses the question of teaching (with) word-embedding models—is closer instead to the big DH/small DH discourse. The resources and labor needed to bring word vectors into my classroom are currently too much for me to consider as a contingent, early-career scholar. The benefits of the tool are many, but my barriers to teaching them have left me with some concerns about their accessibility and our goals for using them.

First: Succeeding in My Research with Word Vectors 

I was looking forward to attending for several reasons: developing my repertoire of DH tools I could teach & use, finding alternate ways to ask questions about a growing corpus in my research, and getting to chat with folks in DH, including Sarah Connell, with whom I’d been on a panel for the Digital Americanists Society at the American Literature Association in 2018. To learn from the process, I took along a selection from a corpus I presented on at that panel concerning fiction about school shootings. While I had initial goals of just identifying texts missed by the Library of Congress catalog, I learned rather quickly (and then kept relearning) how much time the actual data collection takes up. The project went from initial plans of conventional, though no less significant, text mining (interested in how often the names of real-life shootings or perpetrators appeared in the texts) to collocation (interested in how words like “crazy” were employed in the books adjacent to gender and race markers) to questions about the demographics of writers, the publication dates in reference to actual shootings, and the popularity of these books. To say that I have answered any of these definitively would not only be false, but would also ignore the realities of a project like this. Mine is not a fixed corpus. Many of the books are self-published, and classifying a book as about school shootings relies on several different queries, sites, and organizations: Has the author sold the book in the last ten years? Is it accessible in digital form? Did they seek copyright? Is it in English? Has it been reviewed online? As the project has changed shape, I have tried out different tools and methods to examine this developing collection of texts.

I approached the WWP workshop and word-embedding models with some initial questions and with many assumptions about how the texts would be represented in smaller chunks. To prepare for the workshop, I gathered digital copies of 20 fiction books from my list of identified texts, ranging in publication date from 1979 to 2020, with the majority of the texts published in the first two decades of the 21st century. The texts also ranged quite a bit in size: the shortest (Todd Strasser’s Give a Boy a Gun) clocked in at 35,183 words, while the longest (Wally Lamb’s The Hour I First Believed) totaled 257,384 words. Altogether, the corpus had 1,686,851 words. For this first foray into word-embedding models, I did very little cleaning and adjustment of the texts. I purchased digital copies and removed front matter, author notes, and any supplemental materials, such as lists of additional resources for readers (not uncommon for these texts because they discuss substantial trauma for young adults), previews of new books, or afterwords. The staff at the Women Writers Project helpfully did some additional work to prepare the corpus, and when assembled into one data folder & run through Word2Vec, the texts were summarily lower-cased and most punctuation was removed.
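
As a rough illustration of that preparation step, here is a minimal, stdlib-only sketch of the kind of lower-casing and punctuation removal the texts went through before being fed to Word2Vec. The tokenization rule and sample text are my assumptions for illustration, not the WWP team’s actual pipeline:

```python
import re

def prepare(text):
    """Lowercase raw text and drop most punctuation, returning one
    token list per line, the shape Word2Vec-style tools expect."""
    sentences = []
    for line in text.lower().split("\n"):
        # Keep letters, digits, and internal apostrophes; drop other punctuation.
        tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", line)
        if tokens:
            sentences.append(tokens)
    return sentences

sample = "Chapter One.\n8:45 a.m. The hallway was quiet."
print(prepare(sample))
# → [['chapter', 'one'], ['8', '45', 'a', 'm', 'the', 'hallway', 'was', 'quiet']]
```

Note how aggressively this flattens features like timestamps (“8:45 a.m.” becomes four separate tokens), which is one reason chapter headings and digits surface so visibly in the clusters discussed below.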

[Image: a set of term clusters from the Women Writers Vector Toolkit]
Fig. 1. Clusters generated via the Women Writers Project sandbox created for Word Vectors for the Thoughtful Humanist.

Figure 1 shows some of these results. The “cluster” function identifies words based on their semantic similarity, using K-means clustering (Douglas Duhaime and Ben Schmidt each have accessible explanations of this process). Essentially, researchers specify how many clusters we’d like the algorithm to form from the model’s word vectors, and how large, without giving it specific directions about the content of each grouping, rather than, for instance, searching for a particular term. Each of the clusters above has ten words (or jumbles of letters, given my poor initial cleaning of the mini corpus). The words are close in that they commonly feature in similar semantic positions, occupying what the WWP team called similar “neighborhoods.” Cluster 10, for instance, is filled with names, which would understandably have similar semantic roles in a text. (Despite the Community joke where characters use Britta’s name as a verb to mean someone messed up, we mostly use names as nouns.) 
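
The clustering step itself can be sketched in a few lines, assuming word vectors have already been trained. The toy two-dimensional vectors, the word list, and the use of scikit-learn’s KMeans are all my illustrative assumptions, not the implementation behind the WWP sandbox:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for Word2Vec vectors: name-like words and time/structure
# words placed in two regions of a 2-D space for illustration.
words = ["josie", "peter", "matt", "chapter", "april", "a.m."]
vectors = np.array([
    [0.9, 0.1], [1.0, 0.2], [0.8, 0.0],   # a name-like "neighborhood"
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],   # a time/structure "neighborhood"
])

# Ask for two clusters without saying anything about their content.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

# Group the words by the cluster label K-means assigned them.
clusters = {}
for word, label in zip(words, km.labels_):
    clusters.setdefault(int(label), []).append(word)
print(list(clusters.values()))
```

The researcher chooses only the number of clusters; which words end up grouped together (here, names in one cluster, time words in the other) falls out of the geometry of the vectors themselves.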

The words in cluster 4 show how the results that were specific to my corpus helped me think through how I might analyze these texts. The cluster opens with the word “chapter,” before including several digits, the names of two months, and “a.m.” Because I knew my corpus, it was easier for me to interpret and find meaning in the grouping. Based on thematic and structural patterns across individual books that I had read in the conventional sense, I had previously hypothesized that there was a pattern to how fiction about school shootings was narrated. Several texts, including Jodi Picoult’s Nineteen Minutes and Marieke Nijkamp’s This Is Where It Ends, rely on timestamps or dates as chapter breaks, moving the reader forward or backward in time, pivoting around the beginning of the shooting. I had not read all of the books in the corpus, but I suspected that these were not novel features of Picoult’s or Nijkamp’s writing, and looking at the cluster results from Word2Vec reinforced this idea. While that information might be relatively easy to identify by digitally or physically flipping through the 20 books individually, the task would get more difficult as the corpus grew (as it has—today it totals 116 books). The clustering indicated that, rather than outliers, time and dates are significant methods of structuring school shooting fiction, which, for my purposes, indicates that we have more to learn about how we understand temporality with regard to gun violence, childhood, distress, and collective trauma. 

Conversations with the workshop organizers and participants helped me determine how to proceed when shifting from the mini corpus to later research. When discussing other ways we might shape or clean our corpora for future research, we realized that unlike others, I would want to keep chapter titles and numbers intact in my corpus, something that might have been easily overlooked if I assumed these were narrative anomalies or that I could do the work “by hand.” For me, examining the books using word-embedding models reaffirmed the complex significance of temporality in works around gun violence and generated new questions about the way readers are intended to experience time while reading the novels. I find myself now with increasing plans for further research — additions and adjustments to my corpus, different analogies to query, brainstorms about how to analyze my results. I am particularly excited by the queries I am doing about disability, distress, and violence. 

Then: Failing to Bring Word Vectors into My Classroom

Even this smidgen of my findings shows how word-embedding models and the visualizations offered by word vectors could provide interesting, novel ways of encountering literature that I think my students and I could engage with together. The seemingly easy success led me to think that bringing these methods into the classroom would generate equally felicitous finds and new ways to help students uncover the relevance and importance of examining literature. I spent lots of time during the workshop and after considering what shape word-embedding models could take in my classes. During the workshop, I primarily imagined assignments students might undertake in survey-type literature classes or semester-long projects designed for digital humanities-centered courses. Following the workshop, I shifted my brainstorming to the courses that I currently teach.

Right now, one of my regular course assignments is “Intro to Literary Study,” a standing class in the English department for first- and second-year students, often taken by majors and non-majors alike to fulfill general education requirements. I have built the class around the concept of “reading bodies,” so that students and I ask how our own embodiment influences our reading and learning, why and how we assume that our ways of knowing or analyzing content change based on their form, how the body is invoked in literature and visual media, and how we might unbuild the assumptions we’ve long carried about our bodies and how they “should” or “could” operate. My department frames the class as a way for students to hone their close reading skills, begin working toward strong habits of college writing, and practice engaging in academic discourse at the class and small-group discussion levels. Beyond that, I have my own learning goals for my students to: think critically about how and what we read, watch, listen to; examine how we assign value and merit to arts; build critical awareness of ableism and access; add structure to their visual literacy skills; and really exercise their information literacy.  

Students and I could work toward these goals using word-embedding models. We might look at boilerplate syllabus language about disability and access across higher-education institutions, for instance, and see what we learn about the disabled subject (like Tara Wood and Shannon Madden’s work on accessibility statements or Allison C. Carey and Cheryl Najarian Souza’s analysis of disability & sociology syllabi). Or we might curate a corpus using the Disability in Kids Lit website and together develop queries that help us approach the texts’ ideas about childhood and bodies in alternate ways. I am committed to making sure every part of my course is accessible to students, so I would have to consider how the materials we use work with screen readers, for instance, or design additional processes or activities that let students work toward the same learning goals. An exciting class could do all of this, perhaps starting by learning about disability rhetorics and reading one or two of the texts from the DKL website, before practicing with word vectors—first on institutional discourse on disability as guided class exercises and then in more extended analyses on a corpus of children’s and young adult books featuring disabled protagonists, which might take shape as a final project of sorts. This would be an incredibly exhilarating class and one that I think would foster important conversations about technology, how “close” close reading can get (whether we stop at the paragraph or word), and what that shift from “book” to “corpus” does to ideas and narrative. 

For all of the excitement this imagined class brings me, questions remain for me about how to balance my course requirements, my department’s needs, and my own limitations with what would be a cool new digital humanities lesson (and no sarcasm here, this stuff really is neat). I have meditated on what resources my institution currently has, my own grasp of how word-embedding models work, and how much additional labor I would be asking myself and others to undertake to meet the gap between what exists and what I would foresee needing to successfully use word vectors in my current classes. I am a visiting faculty member at a private liberal arts college on a limited-term contract, hired to teach specific courses in an English department and affiliated with the first-year student program. While I have creative leeway to design my assignments and content, I question the point at which the class I’ve imagined above stops fitting the parameters of an “intro” course. I am unsure how I can rationalize this project to my students and department chair given the relative conservativeness of the department’s idea of an English major and the institutional supervision of contingent faculty, whose reappointment involves reviewing course materials and student evaluations alongside departmental norms. I worry about increasing workloads for students and myself in efforts to devise a class that would enable us to use Word2Vec, rather than finding spaces in my existing course load to use word-embedding models.

My difficulties designing exercises or assignments using word-embedding models are, I fear, not just personal to me, but rather institutional and inherent within broader conversations about how and by whom digital humanities are taught and used. At small liberal arts colleges like the College of the Holy Cross, DH features predominantly in individual classes, which might use digital tools, rather than themselves being classified as “digital humanities” courses. Such schools often have interdisciplinary faculty-driven research projects and engage in trans-institutional collaboration. While we are a private institution, my school does not have a set Digital Humanities degree program or center, though we have an active Educational Design & Digital Media Services group. Several faculty employ digital media and DH skills in their courses, and a few students have self-designed Digital Media Studies majors. As far as I can tell from my own research of the course catalog, there has not yet been a DH course housed in the English department. This is a not-uncommon diffusion of DH work, as Bryan Alexander and Rebecca Frost Davis have written (in a piece that actually nods to process-driven student research and learning that happened at Holy Cross in 2010).

That gap between what I myself know or have access to at Holy Cross and what I think I would need in order to comfortably teach with word-embedding models feels too big for the small DH I can do in the classroom right now. I fall instead into Roopika Risam’s and Susan Edwards’s “Micro DH” category, sharing open-access tools and developing lessons with minimal “learning curves and barriers to entry that do not require affiliation with centers, access to expensive technologies, or substantial resources.” Like the sometimes unconnected DH presence at smaller institutions, my journey toward digital research and pedagogy has meant spending personal money and time outside of institutional paths. If I am a DH scholar at all, I am a sort of pastiche of shared knowledge from fellow graduate students, generous librarians, conference talks, workshops like this one, and helpful social media threads. Word-embedding models, for all of their potential and open-access options, are still, I think, big DH work. Teaching them and teaching with them require knowledge, support, and course design that I and many other early career scholars do not have access to or control over. Because of this, word vectors have been placed, at least for me, solely in the research box. 

While not every DH method or tool needs to be relevant for the first-year classroom or every teaching scenario, the workshop’s questions about how we might teach with word vectors have left me wondering about the implications of doing so. Over a year after our energizing week together, I’m still thinking about what the relative inaccessibility of word-embedding models means for me as someone who purports to “do” or “teach” digital humanities. I am also thinking about both the potential value and gatedness of DH, about who learns what, what they do with that knowledge, and who they share it with. These aren’t new thoughts for DH or higher education more broadly, but they are questions that feel pertinent to the workshop and the continuing (important and valuable) work that the Women Writers Project and others are doing.

 As in my research, I know I am at my best teaching in a way that is designed to call attention to inequity, empower others, and generate critical awareness of the world around us. In hopes of encouraging other, “bigger” DH scholars to ask what we’re hoping to do with word vectors and the tools to come, I leave us with this: What can we do to make using word-embedding models more accessible—in cost, tech literacy, and user ability? How are we using these methods to call attention to and break down inequity? How does such work embolden curiosity in us or our students? And if we can’t answer these, then we ought to think seriously about why and how we want to use them in the classroom.

About the Author

Hayley C. Stefan ([email protected]) is a Visiting Assistant Professor of English at the College of the Holy Cross and the 2021-2022 president of the Digital Americanists Society. Her teaching and research focus on the relationship between disability and race in contemporary, children’s, and young adult literature & digital media. More about her teaching and research is available at hayleystefan.org. For the work behind this blog post, she is especially grateful for the funding offered by the Women Writers Project and the NEH, as well as the kind guidance offered by Sarah Connell, Julia Flanders, Syd Bauman, Juniper Johnson, and her fellow participants in the May 2021 Word Vectors for the Thoughtful Humanist workshop.

Works Cited

Alexander, Bryan, and Rebecca Frost Davis. “Should Liberal Arts Campuses Do Digital Humanities? Process and Products in the Small College World.” Debates in the Digital Humanities, edited by Matthew K. Gold, U of Minnesota P, https://dhdebates.gc.cuny.edu/read/untitled-88c11800-9446-469b-a3be-3fdb36bfbd1e/section/d61cb020-c749-405a-ab59-63f8da332425#ch21

Carey, Allison C., and Cheryl Najarian Souza. “Constructing the Sociology of Disability: An Analysis of Syllabi.” Teaching Sociology, vol. 49, no. 1, 2021, pp. 17-31. SAGE Journals, doi: 10.1177/0092055X20972163. 

Disability in Kidlit. Edited and coordinated by Corinne Duyvis, et al., https://disabilityinkidlit.com/

Duhaime, Douglas. “Clustering Semantic Vectors with Python.” DouglasDuhaime.com, 12 Sept. 2015, https://douglasduhaime.com/posts/clustering-semantic-vectors.html

Nijkamp, Marieke. This Is Where It Ends. Sourcebooks, 2016. 

Picoult, Jodi. Nineteen Minutes. Atria, 2007.

Risam, Roopika, and Susan Edwards. “Micro DH: Digital Humanities at the Small Scale.” Digital Humanities 2017: Abstracts, https://dh2017.adho.org/abstracts/196/196.pdf. Alliance of Digital Humanities Organizations Conference, 8-11 Aug. 2017. 

Schmidt, Ben. “Vector Space Models for the Digital Humanities.” Ben’s Bookworm Blog, 25 Oct. 2015, http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html.

Svensson, Patrik. “Beyond the Big Tent.” Debates in the Digital Humanities, edited by Matthew K. Gold, U of Minnesota P, https://dhdebates.gc.cuny.edu/read/untitled-88c11800-9446-469b-a3be-3fdb36bfbd1e/section/38531431-5bd6-4eb1-95f5-fa49c025322d#ch04

“To ‘Britta.’” YouTube, uploaded by stayasleep, 31 Dec. 2011, https://www.youtube.com/watch?v=1qS85kvq8Xo

Wood, Tara, and Shannon Madden. “Syllabus Accessibility Statements: Revealing Ethos & Supporting Learners.” TILT: Techniques in Learning & Teaching, 16 Jan. 2016, https://uminntilt.com/2016/01/16/suggested-practices-for-syllabus-accessibility-statements/
