What do people think of this?
http://perishablepress.com/press/2007/06/28/ultimate-htaccess-blacklist/
Worth implementing?
I use a .htaccess blacklist which is very similar to the "ultimate" blacklist.
http://en.linuxreviews.org/HOWTO_stop_automated_spam-bots_using_.htaccess
One thing I immediately noted regarding this "ultimate" list is that it includes robots like Archive.org (ia_archiver). This bot does have a public benefit (and it also respects robots.txt, so you don't really need to deny it via .htaccess).
It also misses bots like libghttp (a GNOME HTTP library used mainly by spam software).
A .htaccess blacklist IS a good idea, but copying and pasting lists like this "ultimate" blacklist (or the one I use, for that matter) isn't. For example, if someone posts a .zip or .tar.bz2 or even a large .avi file on their blog, then I'll most likely download it using Wget, which happens to be my favorite download manager(!). But I can't if Wget is in a .htaccess blacklist .. or can I? Yes I can, because my wget is an alias for wget -U "Mozilla". And this is why such blacklists are worth very little altogether: you can simply configure the software (including browsers..) to send whatever commonly used User-Agent string you want.
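To illustrate, here is a rough sketch of both sides (the user-agent patterns are just examples, not taken from either list):

    # .htaccess -- refuse any request whose User-Agent matches a blacklisted pattern
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} libghttp [NC]
    RewriteRule .* - [F,L]

    # ~/.bashrc -- and the one-liner that walks straight past it
    alias wget='wget -U "Mozilla/5.0"'

Apache only ever sees whatever string the client chooses to send, so rules like these match politeness, not software.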
Thanx, good answer!
I liked this quote from the article:
".htaccess can effectively block any spam-bot which admits to being one." :)
This also means roughly 50 conditions evaluated for every HTTP request the server receives. I'd rather use a bot trap, which won't bother regular users and will catch the spambots which don't admit to being one.
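In outline, the trap needs three small pieces. A minimal sketch follows; the trap URL, the script name and the .htaccess path are all hypothetical:

    # robots.txt -- honest bots are told to stay out of the trap
    User-agent: *
    Disallow: /bot-trap/

    # somewhere in each page, a link no human will ever follow:
    # <a href="/bot-trap/catch.py" style="display:none">&nbsp;</a>

    #!/usr/bin/env python3
    # catch.py -- a bot-trap CGI sketch. Only clients that ignored
    # robots.txt ever reach this script, so their address is appended
    # as a "Deny from" line to the site's .htaccess, and Apache will
    # refuse them from then on.
    import os

    HTACCESS = "/var/www/html/.htaccess"   # hypothetical site root

    ip = os.environ.get("REMOTE_ADDR")
    if ip:
        with open(HTACCESS, "a") as fh:
            fh.write("Deny from %s\n" % ip)

    print("Content-Type: text/plain")
    print()
    print("Nothing to see here.")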
One thing I immediately noted regarding this "ultimate" list is that it includes robots like Archive.org (ia_archiver).
Just a small correction. The Archive.org bot is not ia_archiver. ia_archiver is the Alexa Web bot. I don't think the "Ultimate Blacklist" blocks the Archive.org bot.
".htaccess can effectively block any spam-bot which admits to being one."
That's the issue. The first thing I would do if I were writing a scraper would be to ignore htaccess.
Scrapers can't ignore .htaccess; that's on the server side. Did you mean robots.txt, perhaps?
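To spell out the difference (an illustrative snippet; "SomeScraper" is a made-up name):

    # robots.txt is a plain text file the client is merely *asked* to honor:
    User-agent: ia_archiver
    Disallow: /

    # .htaccess rules, by contrast, are applied by Apache itself,
    # so the client never gets a say:
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^SomeScraper [NC]
    RewriteRule .* - [F,L]

A scraper can ignore the first file completely; it can only dodge the second by lying about its User-Agent, which brings us back to the quote above.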