Breaking Down Markup Revision Projects: An Approach for Adding Line Breaks to Encoded Documents

In this blog post, we describe the WWP’s solution to a problem that other projects may well face: inserting encoding for line breaks after a text has been transcribed. Here is a lightly edited transcript of an interview Kyle Wholey (Outreach Coordinator at the WWP) conducted with Syd Bauman (Senior XML Programmer-Analyst) and Sarah Connell (Assistant Director).

So what exactly was the problem you worked on here?

Sarah: We recently discovered an issue with a text that had been started by one of our encoders several years ago. This encoder had transcribed the first 90 pages of a novel, but forgot to include all of the line beginning (<lb>) elements, which are important for our proofreading practices and for accurately encoding phenomena such as words that break over lines. To fix this problem, Syd and I worked together, since it would be an extremely time-consuming task for one person to take on. I read the text aloud (indicating where the line beginnings needed to be) while Syd used a program he wrote to add in the missing <lb> elements.

Can you describe your approach a bit more?

Syd: What we did was pretty consistent with our general approach to solving problems. We automated everything as much as possible. One of our main goals was to minimize the amount of time a person spent rereading the document.

Sarah: With certain kinds of issues, it’s quicker and easier to fix minor errors automatically; for example, if someone consistently uses the wrong element, we can use a simple find-and-replace to change all instances to the correct element. But in this case, there was no way to resolve this issue without consulting the original text to see where we needed to insert the missing <lb> elements. So, we took an approach that would let us automate things as much as possible, using a bit of teamwork and a single-purpose program.

Syd: It was important that the command to insert the <lb> element and move the cursor forward be only one keystroke (it’s significantly less time-consuming this way). In this case I used a slash for inserting the encoding for just a missing <lb> element, or a hyphen for inserting the encoding for cases where a word broke over the line.

So what is this shortcut? How did you develop a quicker way of going through this document?

Syd: Luckily, a large portion of the text did already have line breaks encoded. I did some initial XSLT-based analysis to calculate the number of words per line in the part of the text that already had line breaks encoded, and found an average of 8 to 9 words, something like 8.45 words per line if you want to get precise. So I developed an Emacs Lisp program that would insert the missing <lb> tag and a new line and then move the cursor forward 8 words automatically. For words that break over lines, I set up a version that inserts not only the <lb> but also the end-of-line hyphen (or “soft hyphen”) that marks the break in the word. One of the advantages of using Emacs here is that any program can be bound to any key. So I bound the former program to ‘/’ and the latter to ‘-’. So in one keystroke the right thing got entered, and “pop”, my cursor was at or pretty close to where it needed to be for the next one.
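As a rough illustration of the words-per-line estimate Syd describes, here is a minimal Python sketch. It is not the WWP's XSLT analysis. The function name `words_per_line`, the sample string, and the assumption that the text between markers is plain prose with no other elements are all hypothetical.

```python
import re


def words_per_line(encoded_text: str) -> float:
    """Average number of whitespace-separated words following each <lb/> marker."""
    # Split on <lb/> (or <lb>); each chunk after a marker is treated as one printed line.
    chunks = re.split(r"<lb\s*/?>", encoded_text)
    # Ignore whatever precedes the first marker, and skip empty lines.
    lines = [chunk.split() for chunk in chunks[1:]]
    lines = [words for words in lines if words]
    if not lines:
        raise ValueError("no <lb/> markers found")
    return sum(len(words) for words in lines) / len(lines)


# Hypothetical sample: three lines of 4, 6, and 2 words.
sample = (
    "<lb/>one two three four\n"
    "<lb/>five six seven eight nine ten\n"
    "<lb/>eleven twelve"
)
average = words_per_line(sample)
```

An average like this only needs to be good enough to pick a jump distance; in the interview, the figure was rounded down to 8 words.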

Sarah: After that, Syd and I sat down together to work on this. I would read out the first word in each new line from the text (or read out where the new line began in the middle of a word) and he moved the cursor where it needed to be and hit the right key to insert the missing encoding and keep moving forward.

Syd: And, generally speaking, I only had to move the cursor back or forward one word [before inserting the next <lb> each time] throughout the entire text. Our estimate that a line break occurred every 8–9 words really made the correction process easier and quicker.
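The insert-and-jump behavior described above can be sketched in Python as a function over a string and a cursor offset. This is only an illustration of the idea, not Syd's Emacs Lisp program: the name `insert_break`, the exact marker text, and the choice to land the cursor at the start of the next word are assumptions.

```python
import re
from typing import Tuple

WORDS_PER_JUMP = 8        # rounded down from the roughly 8.45 words per line
SOFT_HYPHEN = "\u00AD"    # Unicode soft hyphen, used when a word breaks over a line
WORD_AND_SPACE = re.compile(r"\S+\s*")


def insert_break(text: str, cursor: int, broken_word: bool = False) -> Tuple[str, int]:
    """Insert a line-break marker at `cursor`, then advance WORDS_PER_JUMP words.

    Returns the updated text and the new cursor offset.
    """
    # Assumed marker shape: optional soft hyphen, a newline, then <lb/>.
    marker = (SOFT_HYPHEN if broken_word else "") + "\n<lb/>"
    text = text[:cursor] + marker + text[cursor:]
    cursor += len(marker)
    # Skip forward word by word; stop early if the text runs out.
    for _ in range(WORDS_PER_JUMP):
        match = WORD_AND_SPACE.match(text, cursor)
        if match is None:
            break
        cursor = match.end()
    return text, cursor


# Hypothetical text of twenty placeholder words: "w0 w1 ... w19".
demo = " ".join(f"w{i}" for i in range(20))
new_text, new_cursor = insert_break(demo, 0)

# A word broken over the line: cursor sits inside "beta" (between "be" and "ta").
broken_text, broken_cursor = insert_break("alpha beta gamma", 8, broken_word=True)
```

In Emacs, each variant would be bound to a single key, so the person typing only nudges the cursor by a word or so before pressing it again.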

How exactly does this relate to your proofreading processes?

Sarah: WWO’s interface doesn’t show line breaks, but we transcribe them because it makes proofreading so much easier. Our proofreading process involves having an encoder read line by line, checking the transcription against our copy of the original text. So, without these <lb>s, it would be extremely difficult to locate where you are in the text to make sure that you’re proofing each line properly.

I’ve heard you talk about soft hyphen encoding a few times. Could you explain that a bit more?

Syd: We use this encoding for words that break over lines. Soft hyphens are particularly common in prose paragraphs, where the typesetter will often break words over the line to fit as much text on the page as possible. At the WWP, we encode these by using a Unicode character [U+00AD, see here for more details]. But sometimes this encoding can create more complications. I’ve written about this problem in an article: “The Hard Edges of Soft Hyphens.”
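To make the soft hyphen encoding concrete, here is a small Python sketch that recovers the unbroken word from a line-broken one. It assumes the soft hyphen (U+00AD) is followed directly by optional whitespace and an <lb/> marker; that layout, the function name `rejoin_broken_words`, and the sample text are illustrative assumptions rather than the WWP's actual processing.

```python
import re

SOFT_HYPHEN = "\u00AD"  # Unicode SOFT HYPHEN

# Soft hyphen, optional whitespace (including the newline), then <lb/> or <lb>.
BROKEN_WORD_JOIN = re.compile(SOFT_HYPHEN + r"\s*<lb\s*/?>")


def rejoin_broken_words(encoded: str) -> str:
    """Remove soft-hyphen line breaks so each broken word reads as one word."""
    return BROKEN_WORD_JOIN.sub("", encoded)


# Hypothetical sample: "remarkable" broken over a line after "remark".
encoded = "a remark\u00AD\n<lb/>able story"
rejoined = rejoin_broken_words(encoded)
```

Because the soft hyphen is a distinct character from an ordinary hyphen, a pattern like this leaves genuinely hyphenated words untouched.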

What was the final outcome of this process? Was it a success?

Sarah: It was a long process (we had four sessions to get through it all), but we accomplished a lot. Syd can tell you a bit more about the specifics.

Syd: Yes, here’s a list of our accomplishments. At the end of our sessions we: inserted 3,140 <lb>s, added 482 soft hyphens, added ~152 additional words that had been missing, and fixed 28 typos.

Wow, that’s impressive! So what’s the takeaway from all this?

Syd: I would say that the take-home lesson is this: it’s often the case that you cannot write programs to do this kind of work, but you can write programs that make it easier for humans to do the work. And that can still be very powerful.

Screenshots of the code can be found here and here. See a GitHub repository with the full code for download here.