I have a potentially unhealthy obsession with information graphics. It took me a good chunk of the day to make the graphics in this post and while part of the motivation behind making them was to have something visual to post on this blog (which could have been accomplished in *much* less time), there was also a methodological motivation. I wanted to find a way to make infographics that were not only good at illustrating a point for readers, but also analytically helpful.
Here’s the challenge: the project I’m working on this summer uses a web crawler to scrape the connections among tens of thousands of food blogs from ye olde interwebs. This is a methodology that I have not used in the past. As anyone who has done something like this before has told me, the web crawler will pretty much just keep going and going even with the constraints that are built into it. It never sits back on its haunches and pops out a message that says something like, “That’s it, lady, the whole network is yours for the examining”. It does not tell me when it is done; I tell it when to stop. If I cut it off too soon, I risk performing analysis on an incomplete network, one that I erroneously believe to be more or less all there. If I cut it off too late, I risk building a noisier dataset, wasting time, and generating a larger database that could crash my software or my computer altogether. (I speak from experience. I have already run Excel into the ground repeatedly, something that made me proud considering how robust Excel is.)
When thinking about how to tell when the web crawler is done, I find it is helpful to think of asymptotes.
So, thinking of asymptotes, I realized that there must be some function that I could plot, a graph I could make based on the information I have about the ongoing behavior of the crawler, that would help me visualize its state of approaching done-ness. [Pardon my free rein in the word creation department. I prefer the incorrect ‘done-ness’ over the grammatically superior ‘completeness’.] My first attempt at graphing was not the burgeoning eggs you see above; it was the top half of the graph below. The eggs just turned out to be more interesting to look at and they provide just about as much useful information as the top graph (i.e., not that much useful information from an analytical standpoint).
The top graph shows cumulative growth in the number of nodes gathered over time – the total number of nodes is around 32,000 as of 19 July. For the eggs, I used the size (in MB) of the output file. Honestly, folks, it doesn’t really matter if I’m looking at megabytes of storage or number of nodes. This absolute size approach is analytically vacant – it does not help me determine done-ness of crawling activity. It tells me that the crawler is still gathering new nodes that pass the food blog test. Yes, I already knew that.
How can I tell if I have to spend another 3 days, 10 days, 3 weeks getting up at 5:00 to fiddle with it?
But what I really want to know is whether or not the crawler is slowing down. True enough, math wizards, I can examine the slope of the segments in the top graph and deduce that flatter slopes mean the crawler isn’t adding as many nodes. That is not satisfying enough for me. Visually, it’s not as easy to detect precise slope changes as I would like. I found it was more useful to take the number of new nodes added per crawl session and divide that by the number of hours in that crawl session. That gave me the number of new nodes added per hour of crawl time. The hourly growth rate varies from day to day (sure, it varies from hour to hour but I’m not a stickler here – the average hourly rate for an entire day is sufficiently precise).
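For anyone who wants to do the same arithmetic outside of a spreadsheet, here is a minimal sketch in Python. The function name `nodes_per_hour` and the sample numbers are made up for illustration (they are not my actual crawl logs), and I am assuming each session is recorded as a cumulative node total plus the hours the crawler ran.

```python
def nodes_per_hour(cumulative_totals, hours, start_total=0):
    """Return new nodes added per hour of crawl time, one value per session.

    cumulative_totals: running node count at the end of each session
    hours: hours of crawl time in each session (same order, same length)
    start_total: node count before the first session (assumed 0 here)
    """
    if len(cumulative_totals) != len(hours):
        raise ValueError("need one hours value per session")

    rates = []
    previous = start_total
    for total, session_hours in zip(cumulative_totals, hours):
        if session_hours <= 0:
            raise ValueError("session hours must be positive")
        new_nodes = total - previous          # nodes gained in this session
        rates.append(new_nodes / session_hours)
        previous = total
    return rates


# Hypothetical sessions: totals after each crawl and hours crawled.
example_rates = nodes_per_hour([1500, 3500, 4100], [20, 25, 10])
print(example_rates)
```

Note that a session where nodes were removed after the fact would show up as an unusually low (or even negative) rate, so the numbers still need a human reading them.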
I went ahead and plotted these hourly rates. They bounced around more than I thought they would, though they pretty much stayed somewhere between 60 and 100 new nodes per hour. The day they dipped to 6 was not a true low; it was an artificial low. On that day I went back and retroactively removed all of the alcohol blogs from the existing collection of nodes because I am not studying wine, beer, or cocktail blogs. I am studying food blogs. So the boozy blogs got the boot and that made it look like the crawler spent that particular day picking its nose or otherwise dawdling. Not the case. The thing about bots is that they are never caught with their fingers up their noses. On the other hand, they may have to be taught to stay away from the swill.
The bottom graph will be more helpful as I try to figure out how the crawler is doing on any given day and whether it is starting to approach an asymptote: a rate of new nodes per hour that has dropped into the single digits.
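If I wanted to turn that eyeballing into a rule, it might look something like the sketch below. This is only one possible stopping check, not something I have settled on: the name `looks_done`, the threshold of 10 nodes per hour, and the three-session window are all assumptions for illustration. Requiring several low sessions in a row is meant to keep a one-off artificial dip from triggering a false "done".

```python
def looks_done(hourly_rates, threshold=10.0, window=3):
    """Return True if the last `window` sessions all fell below `threshold`.

    hourly_rates: new nodes per hour for each session, oldest first
    threshold: rate below which a session counts as "slow" (assumed 10)
    window: how many consecutive slow sessions to require (assumed 3)
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(hourly_rates) < window:
        return False  # not enough history to judge
    recent = hourly_rates[-window:]
    return all(rate < threshold for rate in recent)


# Hypothetical rates: a single dip surrounded by busy days is not "done".
print(looks_done([80, 6, 75]))      # one low day only
print(looks_done([12, 8, 7, 5]))    # three low days in a row
```

A stricter version could also ignore sessions flagged as cleanup days, but that would need the flags to be recorded somewhere in the first place.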
If this post had a moral it would be: don’t be afraid to try new methods both at the macro-scale (like adding social network analysis to your methodological quiver) and at the micro-scale (like trying to use infographics to help guide your research).