A very large forest

This week and next are text mining and topic modeling in the class on programming for historians. I’ve been reading around on both topics (I present next week, on topic modeling), and I keep shifting back and forth from “okay, this makes sense” to “wait, what?”

It is the problem of forest and trees. Individual trees I can identify: oh, look, a sugar maple, or oh, a short line of Python. It’s when I start trying to understand the whole forest (just what is this Text Mining thing, anyway? How can I use topic modeling when I don’t have texts yet?) that I run into trouble.

Further, I admit to being a little daunted by the process, largely because my era is late 18th and early 19th century. Which means I’ll come up against the problems described and solved by Ted Underwood. It’s fantastic that he’s found a solution which works, but 4,600 rules for solving spelling errors is rather overwhelming. I’m trying not to lose sight of the trees in the forest.

What I really need is to play with the tools, and for that I need some texts. Does anyone know of a friendly, downloadable corpus? Preferably one from the late 18th or early 19th century?


  1. If you’re working in English, you could try the TCP-ECCO collection, of about 2200 volumes. It’s public, and hand-keyed, so OCR correction isn’t an issue. I think they’re available here:

    or just google around.

    The files probably come as XML, though. If that’s a problem, you could try our 19c documents, available with metadata at the end of this article,

    I think there are a few minor glitches in that version, but for playing around with, it should work.

    • Megan

      Oh, wow, texts from ECCO. That’s wonderful, thank you so much!
