This week and next, the class on programming for historians covers text mining and topic modeling. I’ve been reading around on both topics (I present next week, on topic modeling), and I keep shifting back and forth from “okay, this makes sense” to “wait, what?”
It is the problem of forest and trees. Individual trees I can identify: oh, look, a sugar maple, or oh, a short line of Python. It’s when I start trying to understand the whole forest (just what is this Text Mining thing, anyway? How can I use topic modeling when I don’t have texts yet?) that I run into trouble.
Further, I admit to being a little daunted by the process, largely because my era is the late 18th and early 19th century, which means I’ll come up against the problems described and solved by Ted Underwood. It’s fantastic that he’s found a solution which works, but 4,600 rules for correcting spelling errors are rather overwhelming. I’m trying not to lose sight of the trees in the forest.
What I really need is to play with the tools, and for that I need some texts. Does anyone know of a friendly, downloadable corpus? Preferably one from the late 18th or early 19th century?