FAS Talk

"When you go looking for anything at all, your chances of finding it are very good." -- Darryl Zero

January 09, 2008

The Web is a Messy Place

Even major players like Technorati don't get it right.

My company hosts hundreds of commercial blogsites, so we spend a lot of time dealing with the underbelly of the Web—spammers, hackers, orange alligators, and just plain bad programmers.

I'm used to seeing poorly designed and/or poorly implemented web sites and web services.  Just about anybody can toss something onto the web.  That's actually a great thing; it's what has made the Web succeed.  But tossing something onto the Web is a far cry from deploying a robust, scalable, maintainable, secure, and well-performing web application.

As far back as 1975, in his book The Mythical Man-Month, Fred Brooks described a fundamental issue in software development that is as true today as it was then.  The general idea is that once a piece of software appears to work, it is still a long, long way from being a solid, commercial-grade product.  In fact, depending on the nature of the application, you should plan on investing 3-9 times more to take the application the rest of the way to commercial-grade.

One of the reasons the Web is so messy today is that a very large number of applications are presented as commercial-grade applications long before they actually are.  This is not surprising, but as someone who deals daily with the dynamic world of Web 2.0—web services, integrations, content syndication, etc.—it surely can get frustrating.

Case in point... I was trying to figure out why some of our clients were having trouble "claiming" their blogsite under Technorati.  The basic concept is simple: claiming your blogsite lets you tell Technorati which blogs are yours so that Technorati can better serve its purpose as a massive cross-blog directory/search engine.

As a rudimentary security measure, the process for claiming a blog in Technorati requires that you place a special key in your blog content so that Technorati can confirm that you, indeed, have authoring permission to the blog (and therefore, probably are its owner).
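
To make the idea concrete, here is a minimal sketch of what such a check might look like on the verifying side.  I don't know how Technorati actually implements it; the function name and the sample key below are made up for illustration.

```python
def blog_contains_claim_key(page_html: str, claim_key: str) -> bool:
    """Return True if the claim key appears somewhere in the fetched blog page.

    Hypothetical helper: the real claim check is not documented here, so this
    only illustrates the "prove you can publish to the blog" idea.
    """
    # An empty key would match every page, so reject it outright.
    if not claim_key:
        return False
    return claim_key in page_html


# Example: the blogger has pasted the (made-up) key into a post.
page = "<html><body><p>Hello world. claim-key-abc123</p></body></html>"
print(blog_contains_claim_key(page, "claim-key-abc123"))
```

The point is only that anyone who can get the key to show up in the blog's content must be able to publish to it.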

So far, so good.  This is a simple yet reasonably reliable security check, much like the credit card company calling your home phone to see if you know the account number of the new card they mailed you.  But where this gets messy is that the Technorati program (known as a spider) that visits your blog is sloppily programmed and makes requests indistinguishable from those of the many undesirable applications (often called spambots) that inhabit the Web.

MyST Blogsite servers are protected against spambot traffic by software that automatically detects and manages spambots (and various other ne'er-do-wells).  With Technorati's spider looking like a spambot, it was being prevented from accessing the servers.

The solution is trivially easy and should have been implemented by Technorati engineers years ago.  Specifically, Technorati could identify its spider by passing a simple piece of data known as a user agent string.  This is a well-documented, trivial-to-implement Web standard (the HTTP User-Agent request header) that has long been accepted as a best practice for spider developers.  But web programmers can be sloppy, and many, in fact, are.
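
Just how trivial?  Here is a rough sketch in Python, using only the standard library.  The spider name and the info URL in the user agent string are placeholders I invented for illustration; they are not Technorati's.

```python
import urllib.request

# Hypothetical spider name and info page, for illustration only.
SPIDER_USER_AGENT = "ExampleSpider/1.0 (+http://www.example.com/spider.html)"


def build_spider_request(url: str) -> urllib.request.Request:
    """Build a request that identifies the spider via the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": SPIDER_USER_AGENT})


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Fetch a page while announcing who is asking."""
    with urllib.request.urlopen(build_spider_request(url), timeout=timeout) as response:
        return response.read()


# Inspect the header without touching the network.
# urllib stores header names capitalized as "User-agent".
request = build_spider_request("http://www.example.com/")
print(request.get_header("User-agent"))
```

One extra header per request gives a server operator something to recognize, log, and make allowances for.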

After much time spent trying to communicate with Technorati's engineering staff (without response), we deployed a simple workaround to make allowances for Technorati's poor programming.

I know our own software is not perfect; none ever is.  But if more web developers would spend just a little time cleaning up the sloppy little messes in their own applications, just imagine how many millions of lines of workaround code would become unnecessary.

