Search Engines and Spiders

There seems to be a bit of confusion as to exactly what a Web Spider is and does, so let me explain! 🙂

Search Engines: how do they work? When you enter a search and click go, does it go out to all the sites on the Internet, or does it do something else? The answer is the latter. It would take far too long to search every site on the Internet whenever someone entered a query, not to mention that it would bring the Internet to a standstill 🙂

Search Engines are structured so that there is an enormous database in the middle, containing web pages in a special form that is very fast to access. Google, for example, has something like 1.3 billion pages and is, I believe, the biggest engine currently online.

Then on one end you have the web server. This is the bit that you see when you go to the search engine. It sends your query into the high speed database and gets its responses back in a split second.
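To make that a little more concrete, here is a minimal sketch of how a lookup like this can be so fast. I'm assuming the "special form" is something like an inverted index (a table from each word to the pages that contain it), which is a common choice but not necessarily what any particular engine uses. The page URLs and the function names `build_index` and `search` are made up for illustration.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the pages for the first word, then narrow down.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical pages standing in for what a spider has stored.
pages = {
    "http://example.com/a": "my diary about spiders",
    "http://example.com/b": "search engines and spiders",
}
index = build_index(pages)
matches = search(index, "search spiders")
```

The point is that answering a query only touches the index, never the original sites, which is why the response comes back in a split second.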

But what puts the web pages into the database? A Web Spider does. They're called Spiders because of the way they work ('crawler' is another term for the same thing). Essentially they start out with a URL, whether it's entered by a webmaster using the 'Add Site' button found on most search engines or found by some other means such as newsgroup trawling. The spider downloads that page (just the HTML), analyses it and stores it in the database. It then finds all of the hyperlinks on that page and fetches each one of those, analyses and stores each of them in the database, and then visits all the links on those pages in turn.
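The loop described above can be sketched in a few lines. This is only an illustration, not how any real spider is written: the `fetch` argument stands in for downloading a page, the `FAKE_WEB` dictionary is a pretend three-page web so the example runs without a network, and the `max_pages` limit is my own addition to keep it from running forever.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Visit start_url, then every page it links to, and so on."""
    database = {}               # url -> stored HTML
    queue = deque([start_url])  # pages waiting to be visited
    seen = {start_url}          # never queue the same URL twice
    while queue and len(database) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:        # page could not be downloaded
            continue
        database[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return database


# A pretend web, so the sketch runs offline.
FAKE_WEB = {
    "http://example.com/": '<a href="/one">1</a> <a href="/two">2</a>',
    "http://example.com/one": '<a href="/">home</a> <a href="/two">2</a>',
    "http://example.com/two": "<p>no links here</p>",
}
database = crawl("http://example.com/", FAKE_WEB.get)
```

Note the `seen` set: without it, two pages that link to each other would keep the spider going round in circles.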

So if a Web Spider visits our front page it will immediately disappear off down all of the links that are on the front page, such as all the top ten diaries, all the recently updated diaries and so forth. From inside a diary it will visit all the 'next' entries, all the 'previous' entries and all the diaries that are on the menu bar, not to mention all of the other sites on the menu bar.

The only thing that a web spider does NOT generally follow is anything that looks like a 'CGI'. That is, a page that is dynamic, much like the entirety of Dear Diary. What we've done is make our site search engine friendly so that none of the links look like CGIs (hence URLs like www.deardiary.org/show/diaries/NeutronIC, which is a search engine friendly URL).
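What "looks like a CGI" means varies from spider to spider, but a rough sketch of the kind of check involved might look like this. The three rules here (a query string, a `cgi-bin` directory, a `.cgi` extension) are my own guesses at typical signs, and the CGI-style URL in the comment is invented for comparison; it is not a real Dear Diary address.

```python
from urllib.parse import urlparse


def looks_like_cgi(url):
    """Guess whether a URL points at a dynamic (CGI-style) page."""
    parts = urlparse(url)
    path = parts.path.lower()
    return (
        bool(parts.query)          # e.g. ...?diary=NeutronIC
        or "/cgi-bin/" in path     # the traditional script directory
        or path.endswith(".cgi")   # a script file extension
    )


friendly = "http://www.deardiary.org/show/diaries/NeutronIC"
# Hypothetical CGI-style equivalent, for comparison only.
unfriendly = "http://www.deardiary.org/cgi-bin/show.cgi?diary=NeutronIC"
```

A spider using a check like this would follow `friendly` and skip `unfriendly`, even though both could be served by exactly the same program behind the scenes.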

Why have we done this? The reason is simple. The big problem with the current generation of sites online is that sites such as Dear Diary hide all the really interesting content behind CGIs, and this was starting to make Search Engines less useful. For example, you couldn't get CNN articles, ZDNet articles, Dear Diary entries or basically anything behind an automated website (and most are now). This mechanism means that search engines can still find your entries, so if someone types in a keyword that's on your page then you'll be offered up as one of the three billion returns 🙂

What if you don't want to be spidered? It's not currently possible, but we realised today that some people might not want their diary on Search Engines, so as of the next release we will be putting in a feature so that you can prevent Web Spiders from going into your diary.
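For the curious, one widely used way for a site to keep well-behaved spiders out of certain pages is a robots.txt file. I'm not saying that is how our feature will work; this is just a sketch of the general idea using Python's standard `urllib.robotparser`. The diary name `PrivateDiary` and the spider name `ExampleSpider` are made up.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt asking all spiders to stay out of one diary.
ROBOTS_TXT = """\
User-agent: *
Disallow: /show/diaries/PrivateDiary
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, user_agent="ExampleSpider"):
    """Return True if the rules above allow this spider to fetch url."""
    return parser.can_fetch(user_agent, url)


blocked = may_fetch("http://www.deardiary.org/show/diaries/PrivateDiary")
allowed = may_fetch("http://www.deardiary.org/show/diaries/NeutronIC")
```

Bear in mind that this only works for spiders that choose to check the rules before fetching a page.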

Now, quite why a spider has accessed my diary 3500 times in 7 days, given that there are only about 400-450 entries in it, is beyond me!

By all means add a comment or drop us a note at support if you have any more questions or concerns about this!

Matt.
