LJ, blog searches, datamining

Google’s new blog search is pretty nifty if you either like searching through people’s weblogs or are an egotist who likes to kiboze. I’m both. Since I’ve always been a shameless self-promoter and I ping all available services, index myself in search engines, etc., this is just peachy.

The way LJ did it was to provide a large-scale XML data feed of LiveJournal and Typepad blogs. The feed is explicitly intended for use by larger organizations that want to resyndicate or index this huge quantity of data. It’s not usable by end users; it’s an institutional service.

This is great if you’re Google, or AOL, or an MIT grad student doing a thesis on weblogging. However, if you’re an LJ user who checked the “please do not let search engines index me” button, it may be an unwelcome surprise. People who assumed a level of public presence that included friends and internet acquaintances, but not every coworker or family member who Googled them, have discovered that the verb “to Google” now includes a well-indexed stream of all their public entries since March.

I had a frustrating conversation about this with mendel yesterday (sorry I got ruffled there, Rich) in which I think we were both right about different things. He quite rightly pointed out that public LJ entries were subject to data mining and indexing in a number of ways already, and that the check box for blocking robots did not imply privacy to someone who understands the current state of the Internet. Certainly my personal expectation is that anything I post, even with the lock on it, could conceivably end up as the lead story on CNN, and I proceed with that risk in mind.

And of course many of the complaints received by Six Apart about this will be from people who are misinformed about technology or the law in various countries or any number of complicated issues. I actually have no idea what U.S. law would say about what a customer can reasonably expect in this situation, and since the technologies involved are about fifteen minutes old, it may be unknown anyway.

My concern was different. Providing a massive datastream only useful to large-scale operations is qualitatively different than allowing spidering, even. Marketers, credit agencies, insurance companies, and government agencies now have an optimized tool for data mining a huge chunk of weblogs. The amount of effort required to monitor and index all of LJ and Typepad just deflated tremendously.

I am reminded, for example, of FedEx providing a stream of their tracking information to the U.S. Department of Homeland Security, or of the supermarket loyalty card information being informally turned over to the government right after 9/11/01. A recent event I posted about in which auto repair records from dealers were aggregated and sold to Carfax comes to mind. I have been told by people in the email appliance business that spammers derive a good chunk of income these days by selling verified email addresses with names attached to insurers and credit reporting agencies as additional identifying information for their records (“appends”).

In short, Database Nation (Amazon link). To my mind these changes are inevitable, irresistible, and both exciting and frightening for different reasons.

But I also think that Six Apart failed their customers, at least in the customer satisfaction/PR department, by not providing a pre-launch opt-out or removing customers who checked that box from their institutional feed.
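
To make that second option concrete, here is a minimal sketch in Python of what leaving those customers out of the feed might look like. Everything in it is an assumption on my part: the `Entry` record, the `institutional_feed` function, and the set of opted-out usernames are hypothetical names for illustration, and I have not seen how Six Apart actually builds its feed.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for one journal entry; not Six Apart's schema."""
    user: str
    public: bool
    text: str


def institutional_feed(entries, no_index_users):
    """Yield public entries whose authors have not checked the no-index box.

    `no_index_users` is an assumed set of usernames that opted out of
    search engine indexing.
    """
    for entry in entries:
        if entry.public and entry.user not in no_index_users:
            yield entry
```

Locked entries and entries from anyone in the opt-out set never reach the feed; everything else passes through untouched.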

