About A Megabyte


Being a pronoiac meme-broker is a constant burn of future shock - he has to assimilate more than a megabyte of text and several gigs of AV content every day just to stay current. – Charles Stross, Accelerando

I was recently re-reading Accelerando– the book that inspired the Mnikr project– and this line struck me. While I’m not as accretive as the main character in the book, I do make a point of keeping up to date on my feeds each day, both through Google Reader (which I love) and through my incoming Twitter feed. I know I don’t take in anywhere near that much AV content each day (partially because viewing YouTube at work is frowned upon; my employers don’t check my Internet usage, but hearing Numa Numa blasting from my speakers would no doubt make some sort of impression), but it occurred to me that I might well take in something like a megabyte of text in a day– or at least an amount interesting enough to be worth calculating.

How, then, to go about calculating it? Well, I divided the task into two parts: Twitter and Google Reader.

For Twitter, the task is relatively easy. From their API documentation, I quickly found the call to get my incoming tweets. While the API indicates I can fetch the last 3200 tweets, I was only able to get the last 800; that’s fine, as I just need enough to get a reasonably long-term average.

I just used a quick-and-dirty method to get the tweets from the API, as all I really needed was the most recent tweet, and the 800th most recent. So then, the following commands got me what I wanted:


# Getting the most recent, and 800th most recent, Tweets from my timeline

wget "http://twitter.com/statuses/friends_timeline.atom?count=200&page=1" --http-user=XXX --http-password=XXX
wget "http://twitter.com/statuses/friends_timeline.atom?count=200&page=4" --http-user=XXX --http-password=XXX

From there, using the calculator provided by Time and Date, I found that there were 473022 seconds between my 1st and 800th incoming tweets; that means I get approximately 146.124282 tweets per day, which (at 160 bytes per tweet– a bit over the 140-character limit, to account for the username and the like) equates to 22.831 kilobytes/day.
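
If you’d rather not do the date math by hand, here’s a quick sketch that reproduces those numbers in Perl. It’s only a sketch: it assumes the two wget responses were saved as page1.atom and page4.atom (hypothetical names; wget’s default filenames include the query string), and it just reuses the 800-tweet and 160-bytes-per-tweet figures from above.


# Sketch: recompute the tweets/day and kilobytes/day figures from the two
# Atom pages, rather than using the Time and Date calculator by hand.

use strict;
use warnings;
use XML::Feed;

# Hypothetical filenames for the two wget downloads above
my $newest_page = XML::Feed->parse("page1.atom") or die XML::Feed->errstr;
my $oldest_page = XML::Feed->parse("page4.atom") or die XML::Feed->errstr;

my $newest = ($newest_page->entries)[0]->issued;   # most recent tweet
my $oldest = ($oldest_page->entries)[-1]->issued;  # 800th most recent tweet
my $seconds = $newest->epoch - $oldest->epoch;     # 473022 in my case

my $tweets_per_day = 800 / $seconds * 86400;       # ~146.12
my $kb_per_day = $tweets_per_day * 160 / 1024;     # ~22.83
printf("%d seconds, %.2f tweets/day, %.2f KB/day\n",
	$seconds, $tweets_per_day, $kb_per_day);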

So, not that close to the megabyte, but hey– I’m just getting started!

The Trends feature in Google Reader provides numbers that, at first, seem to answer the question. According to Trends, I’ve read 4418 items in the last 30 days, across a total of 83 subscriptions. Neat!

Unfortunately, that doesn’t tell me what I want to know. Some of those subscriptions, like XKCD, are just a single image three times per week. Others, like Philosophy Walker’s Blog (she’s a friend from high school), can run to multiple printed pages (if I printed them), posted irregularly.

What to do, then? Well, being a good little hacker, I wrote a Perl script. First, I exported my Google Reader feeds into an OPML file, so I could easily get at the feed URLs. Then I scanned through all the feeds, grabbed the articles in each one, figured out how much content was there and over what time period, and came up with a bytes-per-day figure. There are a few caveats. The count includes all the HTML tags, but I figured that’s not a terrible thing– it helps recover some of what I lose from the images not being counted, and the blog posts themselves don’t contain that much HTML anyway (as opposed to grabbing the entire HTML page). Also, one of my feeds only keeps one item in the feed at a time; since I know that’s a daily post, I just have the script assume a one-day duration in that case.

This isn’t the nicest Perl code ever, but it’s relatively readable. (And I like Perl for this kind of task; it’s good with text munging, which is all this is.)


# Script to calculate how many bytes of text I eat per day from my Google Reader
# subscriptions.

use strict;
use warnings;

use XML::Feed;
use URI;

# Read in the OPML export so we can pull the feed URLs out of it
open(my $opml, '<', 'google-reader-subscriptions.xml')
	or die "Can't open OPML export: $!";
my @lines = <$opml>;
close($opml);

my @urls;
foreach (@lines)
{
	my $line = $_;
	if ($line =~ m/xmlUrl\=\"(.+?)\"/)
	{
		push(@urls, $1);
	}
}

my $overall = 0.0;
foreach (@urls)
{
	print("URL: $_\n");
	my $feed = XML::Feed->parse(URI->new($_));
	if ($feed)
	{
		print("Blog: ".$feed->title."\n");
		my $entries = scalar $feed->entries;
		if ($entries > 0)
		{
			my $bytes = 0;
			foreach ($feed->entries)
			{
				# Some entries carry no content body; skip them rather than warn on undef
				my $body = $_->content->body;
				$bytes += length($body) if defined $body;
			}
			if ($bytes > 0)
			{
				print("$bytes bytes in $entries entries\n");
				# Compare the first and last entries in the feed (typically newest and oldest)
				my $firstdate = ($feed->entries)[0]->issued;
				# Some feeds (notably XKCD) don't set the issued field, oddly
				if (!$firstdate)
				{
					$firstdate = ($feed->entries)[0]->modified;
				}
				my $lastdate = ($feed->entries)[$entries-1]->issued;
				if (!$lastdate)
				{
					$lastdate = ($feed->entries)[$entries-1]->modified;
				}
				my $duration = $firstdate->delta_ms($lastdate)->delta_minutes;
				if ($duration == 0) # Happens when the feed only has one item
				{
					$duration = 24*60; # Assume one day
				}
				my $perday = ($bytes/$duration)*24*60;
				print("$bytes bytes / $duration minutes = $perday bytes/day\n");
				$overall += $perday;
			}
			else
			{
				print("No content, moving on.\n");
			}
		}
		else
		{
			print("No entries, moving on.\n");
		}
	}
	else
	{
		print("Couldn't parse feed (".XML::Feed->errstr."), moving on.\n");
	}
}

print("Total of $overall bytes per day consumed from these feeds.\n");

The output of this script? For me, 807592.584289259 bytes per day, or 788.664 kilobytes (using binary kilobytes of 1024 bytes, not the base-10 approximation).

So there you have it; to a first approximation, I consume 811.495 kilobytes of text per day (22.831 from Twitter plus 788.664 from my feeds). That’s not the megabyte I thought, but it’s not too far off, either– and it doesn’t take into account the linked articles I sometimes read: when a Boing Boing post discusses something on the New York Times, I’ll often read the original NYT article just to get a sense of what’s going on. Adding that in likely brings the total near a megabyte.
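
For completeness, the same addition as a quick one-liner (using binary kilobytes, as above); this is just a convenience check, not part of the scripts above.


# Summing the Twitter and Google Reader estimates

perl -e 'my $kb = 22.831 + 788.664; printf("%.3f KB/day, %.1f%% of a megabyte\n", $kb, $kb / 1024 * 100);'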

To me, this is what it takes to keep me grounded– to keep me aware of what’s going on outside the limited confines of the work I’m doing. It makes me a better hacker, I think, as well as a more amused one (when someone sends me jokes on Twitter, or the like). Good times. By the way, I welcome any suggestions for or updates to my method; I’d be happy to post a better approximation.