Week 10: Data Mining and Distant Reading

This week Jeri and I are leading the discussion. She has already posted an excellent overview of the readings, so I thought I would look at the sites and tools.

With Criminal Intent was a response to the Digging into Data challenge in 2009. It combines a specialized API with a personal research environment and visualization tools. The data is all from the records of the Old Bailey.

Let me just say that Dan Cohen is right about the importance of a good API. It makes a huge difference. I mucked around with the Old Bailey website when I was working on my Master's in Edinburgh – we talked about its utility in a class on material culture in 18th-century Britain. It was fun to poke around but hard to get anywhere. The API developed for With Criminal Intent is so much more useful, because you can drill down so quickly.

Compare the two search pages:

Old Bailey search

The old search page (top left) was oriented more towards punishments, verdicts, and specific persons. The API (bottom left), on the other hand, looks more towards general categories and helps you narrow down to subcategories of punishment or offence. Moreover, once you've started the search you can further narrow by the existing categories, based on what the results are.

Old Bailey API

To explain: I ran a search for offence category Theft, subcategory shoplifting, where the victim is female. I was then able to see the breakdown of punishments for qualifying crimes – the most common being transportation, with 144 sentences. From here I can further narrow my search, view results, or move the data into Zotero or Voyeur.

What this API allows me to do that the old search did not is to generalize while still narrowing down. Not only did the creators of the API make gender a category for analysis, but they also defined for the users the subcategories of offences, verdicts, and punishments.
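The faceted drill-down described above can be sketched in a few lines of Python. To be clear, the base URL and parameter names (`offcat`, `offsubcat`, `vicgen`) are placeholders of my own invention, not the Old Bailey API's documented interface; the point is only the pattern of narrowing by pre-defined categories and then tallying the results.

```python
from collections import Counter
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names -- illustrative only,
# not the actual Old Bailey API interface.
BASE_URL = "https://www.oldbaileyonline.org/obapi/ob"

def build_query(**facets):
    """Build a faceted search URL from category/subcategory filters."""
    return BASE_URL + "?" + urlencode(sorted(facets.items()))

def tally_punishments(records):
    """Count punishment categories across matching trial records."""
    return Counter(r["punishment"] for r in records)

# Mock records standing in for what the API might return.
sample = [
    {"offence": "theft:shoplifting", "victim_gender": "female",
     "punishment": "transportation"},
    {"offence": "theft:shoplifting", "victim_gender": "female",
     "punishment": "transportation"},
    {"offence": "theft:shoplifting", "victim_gender": "female",
     "punishment": "whipping"},
]

url = build_query(offcat="theft", offsubcat="shoplifting", vicgen="female")
counts = tally_punishments(sample)
```

Each facet is a pre-defined category, so every query the researcher can ask was anticipated by the API's designers – which is exactly the trade-off raised below.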

With Criminal Intent is, in my opinion, a good model for data mining in history. Note that from the API you can directly access the raw source, the actual entries. While historians using the site can look at larger trends, they can also zoom in on each and every instance if they want.

Compare that functionality with Google Ngrams or Moretti's graphs of novels. As Moretti points out, on the graphs individual works become "tiny dots in the graph of figure 2, indistinguishable from all the others." ((Franco Moretti, Graphs, Maps, Trees: Abstract Models for a Literary History (London; New York: Verso, 2005), 8.)) From Google Ngrams you can move to book search for a year or set of years, probably best done by opening a new window. You cannot, however, narrow the search beyond the date and the general language corpus.
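To make concrete what an Ngram series aggregates (and what it hides), here is a toy sketch, not Google's actual pipeline, that computes per-year relative frequencies of a word from a tiny hand-made corpus. Once the counts are normalized, the individual texts vanish into the yearly numbers, just as Moretti's works vanish into dots.

```python
# Toy corpus: year -> list of texts. Purely illustrative data.
corpus = {
    1800: ["the trial of the prisoner", "the verdict was guilty"],
    1810: ["transportation for theft", "the prisoner was whipped"],
}

def ngram_frequency(corpus, word):
    """Relative frequency of `word` per year, normalized by token count."""
    series = {}
    for year, texts in corpus.items():
        tokens = [tok for text in texts for tok in text.split()]
        series[year] = tokens.count(word) / len(tokens)
    return series

freq = ngram_frequency(corpus, "the")
```

The resulting series can be graphed, but there is no path back from a point on the curve to a specific trial or novel; that reverse path is precisely what the Old Bailey API provides.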

What do we make of these sites? What do they make of history? Which are tools and which are methodologies? Any advanced search option gives you choices of which parameters to narrow, but those parameters are pre-defined.

Do these tools, or methodologies, change the way we formulate and ask questions of our historical data? If nothing else, they certainly alter what we can discover, and how quickly.

4 Replies to “Week 10: Data Mining and Distant Reading”

  1. Excellent! Connecting the data-mining to “Maps,” the Spatial History project at Stanford has this really cool visualization of similar information on prostitution arrests in Philadelphia. The information gathered using tools like the search API can be turned into graphics to reveal patterns and questions.

Comments are closed.