The WWP Begins Research into Word Vector Analysis

We are thrilled to announce that the Women Writers Project has begun work on a new project, “Word Vector Analysis for TEI/XML: A User-Friendly Toolkit,” funded under a Tier 1 grant from Northeastern University awarded to project co-PIs Professors Julia Flanders, Elizabeth Dillon, and Cody Dunne.

“Word Vector Analysis for TEI/XML” brings together two major digital humanities methodologies: text encoding and text analysis. Our aim is to develop an exploratory web interface as part of the WWO Lab, which will allow users to visualize vector-space models with data from the Women Writers Online (WWO) TEI corpus and its sister corpus from the Victorian Women Writers Project (VWWP). For this collaborative project, a team of faculty, graduate students, and WWP staff will develop mechanisms to transform TEI documents for analysis with word embedding models, using the WWO/VWWP corpus as a test case. We will also publish a prototype of the web interface for exploring these models. The interface will enable users to input terms of interest and discover similar terms, locate analogies, and explore thematic clusters of terms. The WWP is especially suited to this type of text analysis because our collection of TEI-encoded documents is information-rich and relatively free from digitization errors. We will thus be able to create text-analysis-friendly data from TEI documents without losing the significant informational content of the markup.

Our prototype will integrate the word2vec text analysis tool (using skip-gram with negative sampling), as well as other experimental word vector model training methods, into an information pipeline. The pipeline takes in TEI/XML data, performs a variety of preliminary processing steps that take advantage of information in the markup, passes the resulting text file on for analysis, and then sends the results to an appropriate visualization tool. The user interface will enable users to make selections based on information in the markup (e.g., to analyze texts within a certain time period or of a certain genre, or to focus on specific textual components such as background narration or direct speech), and to choose the visualization options for the output. The tool set will take advantage of TEI markup to improve tokenization in text analysis (for example, encoding with <orgName> removes any ambiguity about whether the string “Massachusetts Historical Society” refers to a single organization) and to enable comparison within corpora based on the semantic features represented by the markup. The prototype will initially focus on word vector analysis, but other text analysis routines such as topic modeling could be included as options in the pipeline as well.
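As a rough illustration of the markup-aware tokenization described above, here is a minimal Python sketch that keeps the contents of an <orgName> element together as a single token. It is not the project's actual pipeline: the function name `tokens_from_tei`, the sample sentence, and the convention of joining words with underscores are all assumptions made for this example, and it uses only Python's standard library.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ORG_NAME = f"{{{TEI_NS}}}orgName"

# Invented sample fragment, not taken from the WWO/VWWP corpus.
SAMPLE = (
    f'<p xmlns="{TEI_NS}">She wrote to the '
    "<orgName>Massachusetts Historical Society</orgName> in Boston.</p>"
)


def tokens_from_tei(xml_string):
    """Return lowercase tokens, treating each <orgName> as one token."""
    root = ET.fromstring(xml_string)
    tokens = []

    def walk(element):
        if element.tag == ORG_NAME:
            # Collapse whitespace, then join the name with underscores
            # (a hypothetical convention chosen for this sketch).
            name = " ".join("".join(element.itertext()).split())
            tokens.append(name.replace(" ", "_"))
        else:
            # Text before the first child element.
            tokens.extend(re.findall(r"\w+", element.text or ""))
            for child in element:
                walk(child)
                # Text that follows the child inside this element.
                tokens.extend(re.findall(r"\w+", child.tail or ""))

    walk(root)
    return [token.lower() for token in tokens]


print(tokens_from_tei(SAMPLE))
```

In this sketch the organization's name reaches the model as one vocabulary item rather than three separate words, which is the kind of ambiguity the markup resolves.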

As we move forward in our research, we will be publishing use cases and sample research projects using the word2vec text analysis tool. For example, a scholar of early American literature might be interested in the ways in which “America” as a concept is represented in early women’s writing. Using our tool on the WWO/VWWP corpus, she might start by finding the words nearest to terms like “America,” “England,” “country,” and “nation.” Or she might test out the analogy function (as in the often-cited [king] – [man] + [woman] = [queen] word2vec example), experimenting with relationships between words like “England” and “New England” or “Virginia” and “Massachusetts.”
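To make the nearest-word and analogy queries more concrete, here is a small Python sketch of the underlying arithmetic. The three-dimensional vectors are invented for illustration and were not learned from any corpus; the function names `nearest` and `analogy` are likewise hypothetical and are not the interface of our tool or of word2vec itself. Real models typically use vectors with hundreds of dimensions.

```python
import math

# Toy vectors, invented for this example only.
VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "country": [0.7, 0.4, 0.4],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(target, exclude=(), n=3):
    """Return up to n words whose vectors are most similar to target."""
    scored = [
        (cosine(target, vector), word)
        for word, vector in VECTORS.items()
        if word not in exclude
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:n]]


def analogy(a, b, c, n=1):
    """Answer 'a is to b as c is to ?' by computing b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])
    ]
    # The query words themselves are excluded from the results.
    return nearest(target, exclude={a, b, c}, n=n)


print(analogy("man", "king", "woman"))
```

With these toy values the analogy query returns “queen,” mirroring the familiar example; on a real corpus the results depend on the texts and on how the model was trained.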

For further reading on word embedding models and word2vec, we’ve provided some suggested blog posts below:

“Vector Space Models for the Digital Humanities” by Ben Schmidt

“Pride and Prejudice and Word Embedding Distance” by Lynn Cherny

“Word Vectors in the Eighteenth Century” by Ryan Heuser

“‘Numberless Degrees of Similitude’: A Response to Ryan Heuser’s ‘Word Vectors in the Eighteenth Century, Part 1’” by Gabriel Recchia