Dear *********:

The company I work for, Whizbang Labs, does extract information from the
web, as you already know. We crawl millions of web sites a week and
extract different pieces of information for different customers. In
your case, your site is in a list for company information extraction.
This simply involves crawling a site and figuring out the name of the
company owning the site, as well as contact information such as mail
address or phone number. We often have to crawl many pages before
finding this information, and so we take the shotgun approach of
crawling more pages up front than we may need to. We also don't know
(without looking) what sites have actual companies behind them or not.
In your case I would imagine that we extracted the name as deardiary.net
and didn't find any contact information other than an email address.
Our largest customer for this information is Dun & Bradstreet, who are a
leader in the business of deciding who's a company and collecting
information about them.

I can assure you that we have not targeted any information that is
particular to your site, other than who you are. Specifically, we have
not logged and extracted any information from the diary posts of your
customers. It is not in our best interest to target pieces of
information that are that specific; we only go after information that
can be found across thousands or millions of web sites.

I noticed that your site does not have a /robots.txt file. This is a
simple text file that tells robots where they shouldn't go. For
example, if your /robots.txt file contained:

User-agent: *
Disallow: /show

This would tell all robots to not visit URLs that have '/show' at the
start of their path.

While we may have hit you quite a few times over the course of a few
days, it is our intention to temper the crawl so that it doesn't pound
your site too quickly. Please let me know if that is not the case. If
you have other questions or concerns, feel free to respond to me
directly.

Matt Jacobsen
Software Engineer
Whizbang Labs
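For the curious, here is a minimal sketch of how the two robots.txt lines from the email would likely be interpreted, using Python's standard urllib.robotparser module. The user agent name "ExampleBot" and the sample paths are made up for illustration; they are not from the email, and a real crawler may apply its own matching rules.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted in the email above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /show
"""


def is_allowed(path, user_agent="ExampleBot"):
    """Return True if the rules above permit user_agent to fetch path.

    "ExampleBot" is a hypothetical robot name used only for this sketch.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths, chosen to show that the rule is a prefix match.
    for path in ("/", "/show", "/show/1234", "/showcase.html", "/about/show"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Because the rule is matched against the start of the path, something like /showcase.html would be blocked too, while /about/show would not. Keep in mind that robots.txt is only a request: it works when a robot chooses to honor it.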
And:

Hello -
I responded to this message yesterday and sent it to your
'*************' account. If you didn't receive the message please let
me know. In short, we do extract information from the web, but our
current projects involve extracting highly general information from
millions of websites. Things like company name and company contact
information. We did not target your site for anything specific to it
(i.e. the journal entries). Rest assured that we have not logged or
extracted any of that information, nor are we selling it to others. We
visit millions of sites per week and sometimes crawl quite deep before
stopping on a site. I hope that we did not hit the site too quickly.

To answer your question, we did not stop the hits due to your request;
it was just a matter of circumstance. Did it appear that we visited the
same pages over and over, or just a lot of pages on your site? If it is
the former, it could be that your site is somehow in a test list and I
will do what I can to remove it. Otherwise, you probably won't be hit
by our crawler for some time.

Let me know if there are other questions or comments.
Matt Jacobsen
Software Engineer
Whizbang Labs
Of course, I have no way of finding out what page(s) the spider hit, but I have to believe it was more than one, which means I can't even gripe about being on a test list and demand I be removed immediately. I am very curious why site meter doesn't register these hits, though. Maybe I'll shoot them a quick email and ask.
In somewhat unrelated news, I took the Internet Junkie test. (Hey, it's been a while since I did the useless quiz thing.)
The "Are you Addicted to the Internet?" Quiz at Stvlive.com!