After last week, I was convinced that web scraping (especially with wget) was a nifty tool, but I wasn’t sure how useful it would be to me. After all, most of the data I’m working with and putting into my database is coming from an archive collection which doesn’t even have a detailed finding aid. The names, dates, summaries, and everything else are created by me as I go through the hundreds of photos I take each time I visit the archive.
Most of the data – but not all. I knew that there were other collections of family letters in other Virginia repositories. It turns out some of these not only have finding aids but have digitized a few letters and put them online as pdfs! The finding aids are also available as pdfs. Which led me to wonder… can I just scrape the pdfs?
The answer is yes, and I didn’t even have to try to write a python script! I’m currently running wget with the option -A pdf. A stands for accept, and it means “keep only the files whose names match this pattern”. The downside is that wget still has to fetch every HTML page in order to follow its links (deleting the non-matching files afterward), so the process takes a while.
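For reference, the full command looks something like this. The URL is a placeholder, and the depth and wait values are just sensible defaults, not what the repository actually requires:

```shell
# Recursively crawl the site, keeping only PDFs.
# -r     : recursive retrieval
# -l 2   : limit recursion to 2 levels deep
# -np    : never ascend to the parent directory
# -w 2   : wait 2 seconds between requests (be polite to the server)
# -A pdf : accept only files ending in .pdf
wget -r -l 2 -np -w 2 -A pdf https://example-library.edu/special-collections/
```

The -w pause matters for archival sites: hammering a small library server with rapid-fire requests is a quick way to get your IP blocked.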
Another problem is that the library website seems to redirect robots (and tools like wget) from individual pages to the general special collections homepage. So I have not yet successfully snagged the pdfs. At this point, it might be easier to right click and save each one (after all, there are maybe ten). Still, I did try. Plus, if I can figure out how to get to the pages I want to scrape, I can also grab the one page of the finding aid they’ve made into html.
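One workaround I may try next is making wget look less like a robot: sending a browser-style User-Agent header and telling it not to honor robots.txt. This is a sketch with a placeholder URL, and whether it actually gets past this particular library’s redirect is an open question; ignoring robots.txt should be done sparingly and politely:

```shell
# --user-agent  : identify as a regular browser instead of "Wget/x.y"
# -e robots=off : don't honor the site's robots.txt exclusions
# -w 2          : still wait between requests so we aren't a nuisance
wget -r -l 1 -np -w 2 -A pdf \
     --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
     -e robots=off \
     https://example-library.edu/special-collections/finding-aid/
```

If the redirect is based on something other than the User-Agent (a cookie or a referer check, say), this won’t help, and right-clicking ten files starts to look pretty reasonable.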