Search Engines and Spiders

There seems to be a bit of confusion as to exactly what a Web Spider is and does, so let me explain! :)

Search Engines, how do they work? When you enter a search and click go, does it go out to all the sites on the Internet, or does it do something else? The answer is the latter. It would take far too long to search every site on the Internet whenever someone entered a query, not to mention that it would bring the Internet to a standstill :)

Search Engines are structured so that there is an enormous database in the middle, containing web pages in a special form that is very, very fast to access. Google, for example, has something like 1.3 billion pages and is, I believe, the biggest engine currently online.

Then on one end you have the web server. This is the bit that you see when you go to the search engine. It sends your query to the high-speed database and gets its responses back in a split second.
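To give a feel for what that "special form" might look like, here is a minimal sketch of an inverted index: a lookup from each word to the pages that contain it, so a query never has to scan the pages themselves. This is only an illustration, not how any particular engine is built. The page paths, their text, and the names build_index and search are all made up for the example.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first set so intersecting does not modify the index.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages, invented for this example.
PAGES = {
    "/diary/a": "my cat sat on the mat",
    "/diary/b": "the dog chased my cat",
    "/diary/c": "rain all day",
}
INDEX = build_index(PAGES)
```

Answering a query is then just a couple of dictionary lookups, which is why the web server can get its responses back so quickly.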

But what puts the web pages into the database? A Web Spider does. They’re called Spiders because of the way they work (‘crawler’ is another term for them). Essentially they start out with a URL, whether it’s entered by a webmaster using the ‘Add Site’ button found on most search engines or found by some other means such as newsgroup trawling. The spider downloads that page (just the HTML), analyses it and stores it in the database. It then finds all of the hyperlinks on that page, fetches each of those pages, analyses and stores them in the database too, and then visits all the links on those pages, and so on.
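The loop described above can be sketched in a few lines. This is a simplified illustration, not the code of any real spider: it assumes a breadth-first order, stores the raw HTML in a plain dictionary instead of a real database, and uses a tiny made-up "web" (FAKE_WEB) in place of actual downloads. The names LinkParser, crawl and fetch are invented for the example.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: html} for every page stored."""
    database = {}
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so nothing is fetched twice
    while queue and len(database) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # the page could not be downloaded
        database[url] = html  # "store it in the database"
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return database


# A hypothetical three-page "web" standing in for real HTTP downloads.
FAKE_WEB = {
    "/": '<a href="/diary/1">one</a> <a href="/diary/2">two</a>',
    "/diary/1": '<a href="/diary/2">next</a> <a href="/">home</a>',
    "/diary/2": '<a href="/diary/1">previous</a> <a href="/missing">?</a>',
}
visited = crawl("/", FAKE_WEB.get)
```

Starting from "/", the sketch ends up storing all three pages and quietly skips the link that leads nowhere.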

So if a Web Spider visits our front page it will immediately disappear off down all of the links that are on the front page, such as all the top ten diaries, all the recently updated diaries and so forth. From inside a diary it will visit all the ‘next’ entries, all the ‘previous’ entries and all the diaries that are on the menu bar, not to mention all of the other sites on the menu bar.

The only thing that a web spider does NOT generally follow is anything that looks like a ‘CGI’, that is, a page that is dynamic, much like the entirety of Dear Diary. What we’ve done is make our site search engine friendly, so that none of the links look like CGIs (hence when we added – that is a search engine friendly URL).
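As a rough illustration of what "looks like a CGI" could mean to a spider, here is a small heuristic sketch. The real rules vary from spider to spider and are not known to us. The checks below (a query string, a cgi-bin directory, a .cgi or .pl extension) are assumptions, and both example URLs are invented rather than being actual Dear Diary links.

```python
from urllib.parse import urlparse


def looks_dynamic(url):
    """Guess whether a URL points at a script rather than a static page."""
    parts = urlparse(url)
    path = parts.path.lower()
    return (
        bool(parts.query)                     # e.g. ?user=bob&entry=5
        or "/cgi-bin/" in path                # classic script directory
        or path.endswith((".cgi", ".pl"))     # assumed script extensions
    )


# Hypothetical URLs, invented for this example.
CGI_STYLE = "http://example.com/cgi-bin/diary.cgi?user=bob&entry=5"
FRIENDLY_STYLE = "http://example.com/diary/bob/5.html"
```

A spider using a check like this would skip the first URL and happily follow the second, even if both were served by the same script behind the scenes.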

Why have we done this? The reason is simple. The big problem with the current generation of sites online is that sites such as Dear Diary hide all the really interesting content behind CGIs, and this was starting to make Search Engines less useful. For example, you couldn’t get CNN articles, ZDNet articles, Dear Diary entries or basically anything behind an automated website (and most are now). This mechanism means that search engines can still find your entries, so if someone types in a keyword that’s on your page then you’ll be offered up as one of the three billion returns :)

What if you don’t want to be spidered? It’s not currently possible to opt out, but we realised today that some people might not want their diary on Search Engines, so as of the next release we will be putting in a feature that lets you prevent Web Spiders from going into your diary.
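How that feature will work is not decided in this post. One widely used convention that well-behaved spiders respect is a robots.txt file, and the sketch below shows how a spider would check it using Python's standard urllib.robotparser module. The rules, the diary paths and the spider name "ExampleSpider" are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep every spider out of one diary.
ROBOTS_TXT = """\
User-agent: *
Disallow: /diary/private-user/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, agent="ExampleSpider"):
    """Return True if the rules above allow this spider to fetch the URL."""
    return parser.can_fetch(agent, url)
```

Bear in mind that this is purely a courtesy mechanism: it only keeps out spiders that bother to ask.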

Now, quite why it’s accessed my diary 3500 times in 7 days, given that there are only about 400-450 entries in my diary, is beyond me!

By all means add a comment or drop us a note at support if you have any more questions or concerns about this!


It's not a conspiracy

We really are out to get you… uh, no hang on. :)

The observant among you may well have noticed that Steve (DeLancie), myself (NeutronIC) and this diary (deardiary1) have appeared in the top ten most read diaries. Furthermore, I’m in position 1 and Steve is in position 2.

I’m just writing to say that this is entirely correct and not because we’ve engineered something, before the conspiracy theorists start getting their theories in motion :)

What’s happening (still) is that there is a web spider crawling the entire website, including all entries. I believe it is the spider that powers Lycos, amongst a couple of others. It just happens to have spent rather too much time in our diaries, and as such I’ve amassed over 3500 hits this week.

We have removed all of the spider hits from the ‘reports’ and we are endeavouring to remove the spider hits from the top ten and from your personal stats, so that the fluff is removed and everything can return to normal, i.e. my diary being the least read diary on the system, not the (apparently) most read one :)
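For the curious, here is a minimal sketch of the general idea of leaving spider hits out of a "most read" count. It is not our actual reporting code. It assumes each hit is recorded with the visitor's User-Agent string and that spiders can be recognised by a few tell-tale substrings; the diary names, agent strings and hit counts are invented.

```python
from collections import Counter

# Assumed substrings that mark a User-Agent as a spider.
SPIDER_MARKERS = ("spider", "crawler", "bot")


def is_spider(user_agent):
    """Guess from the User-Agent string whether a hit came from a spider."""
    agent = user_agent.lower()
    return any(marker in agent for marker in SPIDER_MARKERS)


def top_diaries(hits, limit=10):
    """hits is a list of (diary_name, user_agent) pairs."""
    counts = Counter(diary for diary, agent in hits if not is_spider(agent))
    return counts.most_common(limit)


# Hypothetical hit log: one diary hammered by a spider, one read by people.
HITS = (
    [("diary_a", "ExampleSpider/1.0")] * 3500
    + [("diary_a", "Mozilla/4.0")] * 2
    + [("diary_b", "Mozilla/4.0")] * 5
)
```

With the spider hits filtered out, diary_b comes out on top with 5 reads and diary_a drops back to 2.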


Xedri's Comment

Hi Xedri,

Yeah, the front screen also uses a connection to the database for pretty much everything. The genre selector, which shows the different genres, requires a connection, and the Recent Updates requires a connection once every ten minutes (it’s a fairly long process, 5 seconds or so, to work out the latest updates, so it’s done offline every 10 minutes). The Top10 and the Choice of the Week don’t require a database connection, but because the Genre Selector does, and in fact the whole look of the front page is changeable based on information in the database, that screwed that page up too… Sorry.
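The idea behind only working out the Recent Updates every ten minutes can be sketched as a small timed cache. Note that this sketch recomputes lazily, on the first request after the result has gone stale, which is a simplification of the offline job described above and not our actual code. The TimedCache name and the injectable clock are assumptions made so the example is easy to try.

```python
import time


class TimedCache:
    """Keep the result of a slow computation for a fixed length of time."""

    def __init__(self, compute, max_age_seconds, clock=time.monotonic):
        self._compute = compute
        self._max_age = max_age_seconds
        self._clock = clock
        self._value = None
        self._computed_at = None

    def get(self):
        now = self._clock()
        stale = (
            self._computed_at is None
            or now - self._computed_at >= self._max_age
        )
        if stale:
            # Only here do we pay for the slow (5 seconds or so) query.
            self._value = self._compute()
            self._computed_at = now
        return self._value
```

A front page would build something like TimedCache(work_out_recent_updates, 600) once, where work_out_recent_updates is a hypothetical slow function, and call get() on every request. Only one request in every ten minutes ever touches the database.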

My information thus far indicates that the server first failed somewhere around 17:00 EST, for approximately half an hour. It then recovered somewhat until about 21:00 EST, when it failed again and remained dead until I discovered it at around 02:00 EST.

We currently believe the problem was caused by the database software itself not quitting when it had finished. Since the system is configured so that the database cannot request so many resources as to kill the machine, it eventually ran out of the resources it was allowed. It’s likely that a tweak we made that day (and tested successfully!) had the unfortunate side effect of tickling a bug in the database software that did not come to light until many hours later. The tweak has been reverted and hopefully the problem won’t recur. Either way, if it does, the system SHOULD now let you know that it’s currently too busy rather than just throwing an error like that.
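The "too busy" behaviour can be illustrated with a simple connection limiter. This is a sketch of the general technique, not the change we actually made. It assumes a fixed number of connection slots, and the names ConnectionLimiter, ServerBusy and handle_request, as well as the wording of the message, are invented.

```python
import threading

BUSY_MESSAGE = "The server is too busy right now, please try again shortly."


class ServerBusy(Exception):
    """Raised when every connection slot is already in use."""


class ConnectionLimiter:
    """Hand out a fixed number of slots and refuse politely when full."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)

    def __enter__(self):
        # Do not wait for a slot: fail straight away so the user gets
        # a clear message instead of a hung page or a server error.
        if not self._slots.acquire(blocking=False):
            raise ServerBusy(BUSY_MESSAGE)
        return self

    def __exit__(self, exc_type, exc, tb):
        self._slots.release()
        return False


def handle_request(limiter, work):
    """Run work() inside a slot, or return the busy message."""
    try:
        with limiter:
            return work()
    except ServerBusy as busy:
        return str(busy)
```

The important part is that the slot is always handed back when the work finishes, even if it fails, which is exactly what was not happening with the stuck database processes.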

More information as we find it out.


Down and Out!

Anyone trying to access the site at the site’s busiest time (1am GMT) will have noticed a rather annoying phenomenon… You couldn’t. Any time you turned around there was an Internal Server Error, or if you tried to log in you were constantly denied…

The overall reason for this is simply that, for some reason, the database server became too busy to serve any more connections. Normally this would be a good thing (though it would mean we need to invest in some more hardware!) since it would mean the server was being used. However, since it’s configured to accept more than double the normal usage, we’re still not sure what exactly happened.

The previous night we were spidered by Altavista in a MOST UNFRIENDLY manner, so if anyone from AV is reading, fix your goddamn spiders so they don’t request 1000 pages from us all at once. It’s not nice, it locks everyone else out…

Last night was NOT about that. The database failed for some reason, and wound itself up in a knot to the point where it couldn’t close any of its processes down properly, resulting in more and more being opened each time someone made an entry or read an entry. Pretty quickly that exhausted all the resources and everything else stopped. The database processes were still open this morning when I got here… They shouldn’t have been, but my efforts to kill them off gracefully were not rewarded either. Brute force almost didn’t work and the next step was a complete reboot… Fortunately that was not necessary in the end.

Investigations will proceed into what happened and we’ll let you know if we find anything concrete. Apologies to anyone who suffered as a result, and I will get to replying to all your emails today.



A script error has meant that new users have not been able to access their diaries properly since creating them. This has been the case since some time early this morning, but it is now fixed. If you had never customized your diary, you would have seen Internal Server Errors when trying to access it, and so would anyone else trying to access it.

I apologise for any inconvenience and stress caused to new users!!