I collect internets

This is the story of how I almost got two copies of the World Wide Web stored on my web server. Yes, the entire World Wide Web. 2 copies. Now, I’m a geek so I’m not sure how much of this “normal” people (like yourself) will be able to follow, so if you find yourself getting bored, please skip to another blog entry which will almost certainly have nothing more than ranting and swearing in it. For the rest of you, read on…

I had an email waiting for me this morning. Nothing unusual there – despite Spamassassin doing its best, I usually have the opportunity to enralge my mebmer, claim money from lotteries I’ve never entered or help wealthy Nigerian businessmen get rid of colossal quantities of money every morning. This one however, was telling me that the disk space on my server had just run out. So no enormous genitals or wads of cash for me this morning then.

Logging on, it became apparent that about 3 Gigs of hard drive space was being eaten up by something and heading over to the log files showed that the MySQL database had written 3GB of logs over the past couple of weeks. What was puzzling was that the database seemed to be filling up with copies of the European Parliament web site. In Slovakian. Argh! Rooted! Someone with a grudge has got control of my server and is using it to DOS attack the Slovakian language version of the European Parliament web site. Quick! Who is connected to the server? Hmm… Googlebot is getting a page and now there are a load of connections to other web servers. Hang on, what page is Google getting? Ah. It’s charredbadger.php from the sodwork.com site… Ding. Is that the sound of a light bulb appearing over my head? No, actually it wasn’t – it was a coworker stirring his tea. I guess that light bulb thing only happens in cartoons.

“But what is charredbadger.php?” I imagine you asking, in a manner that makes me look clever and you look stupid. Well, the short answer is that it’s a browser within a browser, designed to let the user pick an image from another web page – the “foreign” page. This is done by showing the foreign page in a frame with all the images extracted and shown underneath. The user can either click on an image to use it for nefarious deeds or click on a link in the foreign page to follow it. Click on that link up there to see what I mean. Try one of the 3 links on that page – you’ll get the idea.

Of course it’s not as simple as it first looks. Any links clicked on the foreign page have to point back to sodwork or the user would simply be navigated away from sodwork completely. So when a link is clicked on, rather than your browser fetching the page, what actually happens is that the sodwork server (disguising itself as an ordinary web browser) fetches the page from the foreign site. It then looks through the code in the page and replaces all the links in it. The links, which would normally look like “foreignpage.com”, are edited to point back to sodwork: each one gets “http://sodwork.com/gamepic/charredbadger.php?” stuck on the front, with the address of the foreign page tacked on after it.

So although it looks as if you are using the foreign site normally, everything goes through the sodwork server before it appears on your screen.
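For the geeks, here is a rough sketch of that link-swapping trick. It is not the actual charredbadger code (which is PHP and which I haven't shown you); it's a minimal Python illustration. The `url=` parameter name, the `rewrite_links` function and the regex approach are all assumptions made up for this example, and a simple regex like this will miss links written with single quotes or no quotes at all.

```python
import re
from urllib.parse import quote, urljoin

# The prefix is the one quoted in this post; the "url=" parameter
# name is an assumption made for this sketch.
PROXY_PREFIX = "http://sodwork.com/gamepic/charredbadger.php?url="

# Matches href="..." attributes written with double quotes only.
HREF_RE = re.compile(r'href="([^"]*)"', re.IGNORECASE)


def rewrite_links(html, base_url):
    """Point every double-quoted href in `html` back at the proxy script."""

    def to_proxy(match):
        # Resolve relative links against the foreign page's own address,
        # then percent-encode the result so it survives as a query value.
        absolute = urljoin(base_url, match.group(1))
        return 'href="' + PROXY_PREFIX + quote(absolute, safe="") + '"'

    return HREF_RE.sub(to_proxy, html)


page = '<a href="/about">About</a>'
print(rewrite_links(page, "http://foreignpage.com/"))
```

The important bit is that every rewritten link now starts with sodwork.com, whatever site it originally pointed to.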

OK, so what has all this got to do with 3GB of log files and multiple copies of the whole web? Well, calm down and I’ll tell you. Now then. The foreign page is stored in a database on the server so the scripts that produce both the frame with the foreign page and the outer page can extract what they need. The frame gets the links and the outer page extracts the images. This is a temporary database entry that gets cleaned up after it’s used.

Except it doesn’t. Mr. Lazy here (that’s me) didn’t get round to doing the clean-up code. And the logs that MySQL produces don’t get cleaned up between reboots either. So every page that charredbadger fetches is permanently stored in the database and the command that stores it (which includes the entire code of the page) is stored in the database logs. So every page loaded by charredbadger is stored twice on my server. This isn’t normally a problem. charredbadger is not used that much so the database doesn’t get that big and the log files are erased before they start taking up any space.

Until Googlebot comes along, that is. Hello Googlebot. Googlebot is a program used by Google. Googlebot gets a page from a server, stores it (“indexes” it) so that the words in it can be found by the Google search engine and then follows any links in that page to index those pages as well. It uses this method to index entire sites and hop from one site to the next, following the internal and external links, until it’s done the whole World Wide Web.

Now, unlike us dumb humans, when Googlebot looks at the frame within charredbadger, it is smart enough to see that all the links in the foreign site web page, as shown in the frame, are actually links to the sodwork website. Every link on the foreign site appears to Googlebot as an internal link on sodwork.com and following them leads to other pages with even more links which also look like internal links on sodwork.com.

So it follows them. All of them. They lead to other pages which have more links on them to other pages with even more links on them to other pages… You get the idea. Googlebot thinks it’s indexing my site because all these links start with “sodwork.com”, but thanks to the way my server fetches the foreign web pages and adds that “sodwork.com” on the front, it’s actually indexing whatever foreign site happens to be loaded into charredbadger. Remember that this could be any site on the whole WWW.
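To see why a crawler gets fooled, here is a tiny sketch of an "is this link on the same site?" check. I have no idea how Googlebot really decides this, so treat the `is_internal` function as a hypothetical stand-in that just compares host names. The rewritten URL uses an assumed `url=` parameter for illustration.

```python
from urllib.parse import urlparse


def is_internal(link, site_host="sodwork.com"):
    """Hypothetical check: a link is 'internal' if its host matches the site."""
    return urlparse(link).hostname == site_host


# A foreign link before and after the proxy has rewritten it.
original = "http://foreignpage.com/"
rewritten = (
    "http://sodwork.com/gamepic/charredbadger.php"
    "?url=http%3A%2F%2Fforeignpage.com%2F"
)

print(is_internal(original))
print(is_internal(rewritten))
```

Only the rewritten link passes the check, even though both lead to the same foreign content.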

So where does it stop? It doesn’t. There are supposedly 6 degrees of separation between any two web sites; i.e. 6 links will get from any one site to any other site on the internet. So in theory, Googlebot will keep following links in charredbadger until the whole of the World Wide Web is indexed. Again. Via my web server. Which, if you remember, is storing 2 copies of every page.

So Google gets another copy of the entire internet (well, the WWW bit of it), except with “http://sodwork.com/gamepic/charredbadger.php?” in front of it, and I get two copies of every web page in existence stored on my server – one in a database and one in the logs for that database. It’s handy to have a backup I suppose. Except what happens when the sub-internet indexing that Googlebot is doing gets round to charredbadger on sodwork again? Oh yes, it’s going to start indexing a sub-sub internet copy, with me getting 4 copies of the World Wide Web on my server. And on and on it goes in an endless loop until either Google or I run out of hard drive space.

So it turns out we can get up to the Slovakian language version of the European Parliament website before I run out of hard drive space to store my internets in. I wonder how much further Googlebot and I would have got if I had a bit more space available? I’ll never know – I’ve started tidying up the logs automatically and I’ve told Googlebot not to index anything starting in sodwork.com/gamepic.

Like I should have done to start with. That’ll teach me.
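For anyone wanting to avoid the same mess: the usual way to tell a crawler to keep out of a directory is a robots.txt file. The rules below are my guess at a minimal version, not a copy of the real file on sodwork.com, and the little Python check (using the standard library's `urllib.robotparser`) is only there to show what such rules would and wouldn't block. The `allowed` helper is a made-up name for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents; the real file is not shown in this post.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /gamepic/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="Googlebot"):
    """Return True if the assumed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


print(allowed("http://sodwork.com/gamepic/charredbadger.php"))
print(allowed("http://sodwork.com/index.html"))
```

With those rules, anything under /gamepic/ is off limits to Googlebot while the rest of the site stays crawlable.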