• 26 Oct 2007 /  Sites

    This is the story of how I almost got two copies of the World Wide Web stored on my web server. Yes, the entire World Wide Web. 2 Copies. Now, I’m a geek so I’m not sure how much of this “normal” people (like yourself) will be able to follow, so if you find yourself getting bored, please skip to another blog entry which will almost certainly have nothing more than ranting and swearing in it. For the rest of you, read on…

    I had an email waiting for me this morning. Nothing unusual there – despite Spamassassin doing its best, I usually have the opportunity to enralge my mebmer, claim money from lotteries I’ve never entered or help wealthy Nigerian businessmen get rid of colossal quantities of money every morning. This one however, was telling me that the disk space on my server had just run out. So no enormous genitals or wads of cash for me this morning then.

    Logging on, it became apparent that about 3 Gigs of hard drive space was being eaten up by something and heading over to the log files showed that the MySQL database had written 3GB of logs over the past couple of weeks. What was puzzling was that the database seemed to be filling up with copies of the European Parliament web site. In Slovakian. Argh! Rooted! Someone with a grudge has got control of my server and is using it to DOS attack the Slovakian language version of the European Parliament web site. Quick! Who is connected to the server? Hmm… Googlebot is getting a page and now there are a load of connections to other web servers. Hang on, what page is Google getting? Ah. It’s charredbadger.php from the sodwork.com site… Ding. Is that the sound of a light bulb appearing over my head? No, actually it wasn’t – it was a coworker stirring his tea. I guess that light bulb thing only happens in cartoons.

    But what is charredbadger.php?, I imagine you asking in a manner that makes me look clever and you look stupid. Well, the short answer is that it’s a browser within a browser, designed to let the user pick an image from another web page – the “foreign” page. This is done by showing the foreign page in a frame with all the images extracted and shown underneath. The user can either click on an image to use it for nefarious deeds or click on a link in the foreign page to follow it. Click on that link up there to see what I mean. Try one of the 3 links on that page – you’ll get the idea.

    Of course it’s not as simple as it first looks. Any links clicked on the foreign page have to point back to sodwork or the user would simply be navigated away from sodwork completely. So when a link is clicked on, rather than your browser fetching the page, what actually happens is that the sodwork server (disguising itself as an ordinary web browser) fetches the page from the foreign site. It then looks through the code in the page and replaces all the links in it. The links, which would normally look like “foreignpage.com”, are edited to point back to sodwork in the form:

    “sodwork.com/charredbadger.php?link=foreignpage.com”

    So although it looks as if you are using the foreign site normally, everything goes through the sodwork server before it appears on your screen.
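
    For the geeks still reading, the link rewriting can be sketched roughly like this. This is an illustrative Python sketch, not the actual charredbadger.php code: the `rewrite_links` function, the `PROXY_PREFIX` constant, the `http://` scheme and the URL-encoding of the target are all assumptions made for the example, and it only handles double-quoted `href` attributes.

```python
import re
from urllib.parse import quote, urljoin

# Assumed prefix: the post only gives the "sodwork.com/charredbadger.php?link=" form.
PROXY_PREFIX = "http://sodwork.com/charredbadger.php?link="

# Deliberately simple: matches only double-quoted href attributes.
HREF_PATTERN = re.compile(r'href="([^"]*)"', re.IGNORECASE)


def rewrite_links(html, page_url):
    """Return html with every matched href pointing back at the proxy script."""
    def replace(match):
        # Resolve relative links against the foreign page's own address.
        absolute = urljoin(page_url, match.group(1))
        # Encode the target so it survives as a single query parameter.
        return 'href="' + PROXY_PREFIX + quote(absolute, safe="") + '"'

    return HREF_PATTERN.sub(replace, html)


page = '<a href="/about.html">About</a> <a href="http://example.org/">Elsewhere</a>'
print(rewrite_links(page, "http://foreignpage.com/index.html"))
```

    Both links in the sample page, the relative one and the one pointing at a completely different site, come out starting with the sodwork address.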

    OK, so what has all this got to do with 3GB of log files and multiple copies of the whole web? Well, calm down and I’ll tell you. Now then. The foreign page is stored in a database on the server so the scripts that produce both the frame with the foreign page and the outer page can extract what they need. The frame gets the links and the outer page extracts the images. This is a temporary database entry that gets cleaned up after it’s used.

    Except it doesn’t. Mr. Lazy here (that’s me) didn’t get round to doing the clean-up code. And the logs that MySQL produces don’t get cleaned up between reboots either. So every page that charredbadger fetches is permanently stored in the database and the command that stores it (which includes the entire code of the page) is stored in the database logs. So every page loaded by charredbadger is stored twice on my server. This isn’t normally a problem. charredbadger is not used that much so the database doesn’t get that big and the log files are erased before they start taking up any space.

    Until Googlebot comes along, that is. Hello Googlebot. Googlebot is a program used by Google. Googlebot gets a page from a server, stores it (“indexes” it) so that the words in it can be found by the Google search engine and then follows any links in that page to index those pages as well. It uses this method to index entire sites and hop from one site to the next, following the internal and external links, until it’s done the whole World Wide Web.

    Now, unlike us dumb humans, when Googlebot looks at the frame within charredbadger, it is smart enough to see that all the links in the foreign site web page, as shown in the frame, are actually links to the sodwork website. Every link on the foreign site appears to Googlebot as an internal link on sodwork.com and following them leads to other pages with even more links which also look like internal links on sodwork.com.

    So it follows them. All of them. They lead to other pages which have more links on them to other pages with even more links on them to other pages… You get the idea. Googlebot thinks it’s indexing my site because all these links start with “sodwork.com”, but thanks to the way my server fetches the foreign web pages and adds that “sodwork.com” on the front, it’s actually indexing whatever foreign site happens to be loaded into charredbadger. Remember that this could be any site on the whole WWW.
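
    To see why a crawler gets fooled, here is a minimal Python sketch of a host check. It is a simplification for illustration and not a description of how Googlebot really decides anything: `looks_internal` and `real_target` are made-up helper names, and the example URL assumes the foreign address is URL-encoded into the `link` parameter.

```python
from urllib.parse import parse_qs, urlparse


def looks_internal(link, site_host):
    """True if the link's host is the site's own host."""
    return urlparse(link).hostname == site_host


def real_target(link):
    """Return the foreign address carried in the 'link' parameter, if any."""
    query = parse_qs(urlparse(link).query)
    return query.get("link", [None])[0]


rewritten = ("http://sodwork.com/charredbadger.php"
             "?link=http%3A%2F%2Fforeignpage.com%2Fabout.html")

print(looks_internal(rewritten, "sodwork.com"))  # True
print(real_target(rewritten))                    # http://foreignpage.com/about.html
```

    Judged by host alone the link belongs to sodwork.com, even though the page behind it lives somewhere else entirely.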

    So where does it stop? It doesn’t. There are supposedly 6 degrees of separation between any two web sites; i.e. 6 links will get from any one site to any other site on the internet. So in theory, Googlebot will keep following links in charredbadger until the whole of the World Wide Web is indexed. Again. Via my web server. Which, if you remember, is storing 2 copies of every page.

    So Google gets another copy of the entire internet (well, the WWW bit of it), except with “http://sodwork.com/gamepic/charredbadger.php?” in front of it, and I get two copies of every web page in existence stored on my server – one in a database and one in the logs for that database. It’s handy to have a backup I suppose. Except what happens when the sub-internet indexing that Googlebot is doing gets round to charredbadger on sodwork again? Oh yes, it’s going to start indexing a sub-sub internet copy, with me getting 4 copies of the World Wide Web on my server. And on and on it goes in an endless loop until either Google or I run out of hard drive space.

    So it turns out we can get up to the Slovakian language version of the European Parliament website before I run out of hard drive space to store my internets in. I wonder how much further Googlebot and I would have got if I had a bit more space available? I’ll never know – I’ve started tidying up the logs automatically and I’ve told Googlebot not to index anything starting in sodwork.com/gamepic.

    Like I should have done to start with. That’ll teach me.
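
    The usual way to tell a crawler to keep out is a robots.txt file, so here is a rough Python sketch of what such a rule might look like and how to check it with the standard library. The rule text is an assumption for illustration, not a copy of the real sodwork.com file, and the two test addresses are made up.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: keep Googlebot out of everything starting with /gamepic.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /gamepic
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked = "http://sodwork.com/gamepic/charredbadger.php?link=foreignpage.com"
allowed = "http://sodwork.com/index.html"

print(parser.can_fetch("Googlebot", blocked))  # False
print(parser.can_fetch("Googlebot", allowed))  # True
```

    A rule like this only stops well-behaved crawlers from wandering in. It does nothing about the missing clean-up code, which still needs fixing separately.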


  • 19 Oct 2007 /  Road rage

    So you think you might have what it takes to be a BMW driver, eh? You’ve got the money, you’ve got a hankering for some German metal, but do you have the right attitude? Not everyone is cut out to be in command of the Ultimate (Crap) Driving Machine ™ and the following questions will show if you are special enough to drive it in the manner everyone will expect you to.

    Question 1:

    How would you describe your job?

    1. I work in a shop
    2. I program computers
    3. I enable high-end enterprise solutions from synergistic paradigms.

    Question 2:

    How do you like your coffee?

    1. White, frothy and sweet.
    2. Black and strong – like my men.
    3. The temperature of molten lava, sipped out of a paper cup whilst hurtling down the fast lane of the M25 at 95MPH, two inches away from the bumper of the car in front, flashing my headlights and screaming with futile rage.

    Question 3:

    The thought of a BMW in the shape of an SUV makes you

    1. come out in a cold sweat at the thought of the sort of person who is going to want to buy something that’s a combination of the two most wankerish vehicles on the road.
    2. come to the conclusion that car manufacturers have given up even the slightest pretence that one of these fuck-ugly behemoths might actually be used off-road.
    3. come.

    Question 4:

    What is the correct procedure for driving in poor visibility conditions, such as fog or heavy rain?

    1. Always drive so that you can see the tail lights of the car in front. That way you won’t get lost.
    2. Drive as normal, peering myopically out of the windscreen. Grit teeth, cross fingers, pray.
    3. Stay in the fast lane, accelerate hard up to the car in front, slam on your brakes at the last minute, drive 2 inches away from the rear bumper flashing your headlights until they get out of the way. Look – fog isn’t a problem for people like me. I’m in a hurry and I’m in a fucking BMW – get out of the way.

    Question 5:

    What does that yellow hatching in a box on the ground at a junction signify?

    1. I don’t know.
    2. I don’t care.
    3. It’s an advanced stop box for BMW drivers to wait in until their exit is clear.

    Question 6:

    There are roadworks ahead and the outside lane is closing 1/2 a mile down the road. You are in the outside lane, sailing past the huge queue of cars. Why are you the only one doing this?

    1. Oh christ, is the lane closing? I didn’t realise. I wondered why all those cars were queuing.
    2. No-one else has thought of doing this. I’m so clever. Suckers.
    3. Look, I really am more important than you and, unlike you, I can’t afford to be late.

    Question 7:

    You are in a narrow road with oncoming traffic and have been stuck behind a cyclist for 15 seconds. It looks like it’ll be another 15 agonising seconds before you can get past without knocking him into the gutter. What are you thinking?

    1. I think I’ve stayed here long enough to show that I’m not the sort of person who just barges past, so I’ll squeeze past and hope I don’t knock them off. Easy does it…
    2. Bloody bikes. Don’t they know how much they hold me up? I’ve stayed here long enough, I’m going to overtake anyway. Sod him. Why doesn’t he drive a car like normal people? Out of the way peasant.
    3. What cyclist? You mean the one back there, in the pool of blood? I wondered what the noise was. I hope he didn’t fucking scratch the paintwork.

    Question 8:

    Why did that bloke just shout “WANKER!” at you?

    1. I accidentally carved him up. Oops. Sorry.
    2. I deliberately carved him up. Fuck him.
    3. He is so jealous of my superior driving skills it comes out as pure hatred. I love it when someone shouts at me – it shows how awesome I am.

    Results:

    Mostly 1s: Oh dear. You really aren’t cut out for a BMW and you probably never will be. You would be better off with something like a Prius, a Smart car or, god help you, a bicycle. You might even be a vegetarian. You make me sick.

    Mostly 2s: This is slightly better. While you aren’t there yet, there is hope for you. With a bit more aggression and a 1000 PSI ego inflation you might get there one day. Keep acting like you own the road and one day you’ll genuinely believe you do.

    Mostly 3s: You’ve made it. You top dog. Everyone else might think you are a wanker but you’ve got enough love for yourself to more than make up for their revulsion. You can barge people out of the way or push in with impunity because you really ARE more important than anyone else. Everyone knows this, they hate you for it and that makes you feel good. You are a natural BMW driver.
