LJ, blog searches, datamining

substitute Uncategorized September 15, 2005 2 Minutes

Google’s new blog search is pretty nifty if you either like searching through people’s weblogs or are an egotist who likes to kiboze. I’m both. Since I’ve always been a shameless self-promoter and I ping all available services, index myself in search engines, etc., this is just peachy.

The way LJ made this possible was to provide a large-scale XML data feed of Livejournal and Typepad blogs. The feed is explicitly intended for use by larger organizations who want to resyndicate or index this huge quantity of data. It’s not usable by end users; it’s an institutional service.

This is great if you’re Google, or AOL, or an MIT grad student doing a thesis on weblogging. However, if you’re an LJ user who checked the “please do not let search engines index me” button, it may be an unwelcome surprise. People who assumed a level of public presence that included friends and internet acquaintances, but not every coworker or family member who Googled them, have now discovered that the verb “to Google” includes a well-indexed stream of all their public entries since March.

I had a frustrating conversation about this with mendel yesterday (sorry I got ruffled there, Rich) in which I think we were both right about different things. He quite rightly pointed out that public LJ entries were subject to data mining and indexing in a number of ways already, and that the check box for blocking robots did not imply privacy to someone who understands the current state of the Internet. Certainly my personal expectation is that anything I post, even with the lock on it, could conceivably end up as the lead story on CNN, and I proceed with that risk in mind.

And of course many of the complaints received by Six Apart about this will be from people who are misinformed about technology or the law in various countries or any number of complicated issues. I actually have no idea what U.S. law would say about what a customer can reasonably expect in this situation, and since the technologies involved are about fifteen minutes old, it may be unknown anyway.

My concern was different. Providing a massive datastream only useful to large-scale operations is qualitatively different than allowing spidering, even. Marketers, credit agencies, insurance companies, and government agencies now have an optimized tool for data mining a huge chunk of weblogs. The amount of effort required to monitor and index all of LJ and Typepad just deflated tremendously.

I am reminded, for example, of FedEx providing a stream of their tracking information to the U.S. Department of Homeland Security, or of the supermarket loyalty card information being informally turned over to the government right after 9/11/01. A recent event I posted about in which auto repair records from dealers were aggregated and sold to Carfax comes to mind. I have been told by people in the email appliance business that spammers derive a good chunk of income these days by selling verified email addresses with names attached to insurers and credit reporting agencies as additional identifying information for their records (“appends”).

In short, Database Nation (Amazon link). To my mind these changes are inevitable, irresistible, and both exciting and frightening for different reasons.

But I also think that Six Apart failed their customers, at least in the customer satisfaction/PR department, by not providing a pre-launch opt-out or removing customers who checked that box from their institutional feed.

Published by substitute

Published September 15, 2005

26 thoughts on “LJ, blog searches, datamining”

potatohead says:

September 15, 2005 at 7:12 pm

ahh, not cool. on the one hand, i don’t have anything to hide and i’ve always been aware that i’m posting a JOURNAL on the INTERNET. on the other hand, I don’t need my cousins finding my journal that easily.

sachmet says:

September 15, 2005 at 7:14 pm

You’re right on the money. I unchecked that box recently, because I wanted to let my journal be spidered again. But I would expect that if I had left it checked, that nobody would be spidering my post, not even LJ itself.
Granted, robots.txt is only a suggestion. But most spiders are ethical enough to abide by it.
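To make the "only a suggestion" point concrete, here is a minimal sketch of what a well-behaved spider does before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the host name `journal.example`, the user path, and the bot name are all made up for illustration; nothing here reflects LJ's actual files. Nothing enforces the answer either: a spider that never asks simply fetches the page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the path and user name are invented.
ROBOTS_TXT = """\
User-agent: *
Disallow: /users/example_user/
"""


def may_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    blocked = "http://journal.example/users/example_user/1234.html"
    allowed = "http://journal.example/users/someone_else/"
    print(may_fetch(ROBOTS_TXT, "ExampleBot", blocked))
    print(may_fetch(ROBOTS_TXT, "ExampleBot", allowed))
```

The check is entirely on the spider's side, which is why the opt-out only works against crawlers that choose to honor it.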

1. nikolasco says:
  
  September 15, 2005 at 10:38 pm
  
  On LJ’s scale, robots.txt is impractical because it would have to list every opt-out directory. It is provided for the subdomain of paid accounts, but that’s <4%, last I saw. So, they use the HTML meta tag. The closest approximation for XML is the robots processing instruction, which needs more work and adoption.
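For readers who have not seen the per-page mechanism, this is a rough sketch of how an indexer might honor the HTML meta robots tag, written with Python's standard `html.parser`. The sample pages are invented, and the sketch only looks at the `noindex` directive; it is an illustration of the idea, not LJ's or any search engine's code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tags in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def may_index(html: str) -> bool:
    """Return False if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


if __name__ == "__main__":
    opted_out = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    public = "<html><head><title>My journal</title></head></html>"
    print(may_index(opted_out))
    print(may_index(public))
```

The limitation is visible in the sketch: the opt-out lives inside the HTML page, so an XML feed of the same entries carries no equivalent signal unless something like the robots processing instruction is adopted.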
  
odradak says:

September 15, 2005 at 7:15 pm

I’d be a bit more pissy about it if my comments to journals that aren’t protected weren’t already searchable.

1. ignatz says:
  
  September 15, 2005 at 7:21 pm
  
  Right, and that was the kind of point Rich was making; there isn’t any privacy anyway. I think a lot of the flak they’re going to get is from people who don’t understand that, and it’s going to be unreasonable and uninformed.
  
screed says:

September 15, 2005 at 7:20 pm

At least the UI is better than most blogs.

1. ignatz says:
  
  September 15, 2005 at 7:22 pm
  
  The only real problem with LJ is their customer relations. Everything else I really like.
  
  1. screed says:
    
    September 15, 2005 at 7:26 pm
    
    Customer relations? I thought the company was just a folder running 5 year old scripts. Like eBay.
    
  2. nikolasco says:
    
    September 15, 2005 at 10:41 pm
    
    Er, no. LJ is under development and has been since launch. See or .
    
  3. screed says:
    
    September 15, 2005 at 10:52 pm
    
    Yeah, I was kidding. Internet.
    
scromp says:

September 15, 2005 at 7:43 pm

Hmm, I’ve had [X] No robotz on since day 1 and my public posts do not appear in Google even today. Some references to them outside of my own journal, but nothing from my posts themselves.

1. sachmet says:
  
  September 15, 2005 at 7:48 pm
  
  Really?
  
  1. scromp says:
    
    September 15, 2005 at 7:53 pm
    
    Hm, ok. I tried some really obviously unique searches and got nothing. Oh well, nevarmind. matrix ahoy!
    
  2. rpkrajewski says:
    
    September 15, 2005 at 8:24 pm
    
    I think “No Robotz” should cover any kind of searchability. But how do you publish a feed without it being seen by Google ?
    
  3. sachmet says:
    
    September 15, 2005 at 8:26 pm
    
    It’s not my call; I was just pointing it out.
    And the answer, as it turns out, is just a bit down the page.
    
mendel says:

September 15, 2005 at 7:59 pm

For what it’s worth, there turns out to be a way to opt out of the full-site feed that the big aggregators are using (Google isn’t the only one, but I’m not sure what the others are): in the admin console, type set latest_optout yes and submit. (Replace “yes” with “no” to reverse the change.) That keeps your posts out of the XML and human-readable “latest posts” feeds as well as the “latest images” feed.

1. ignatz says:
  
  September 15, 2005 at 8:05 pm
  
  Thanks! I bet some people reading this thread will want to use that.
  
  1. rpkrajewski says:
    
    September 15, 2005 at 8:27 pm
    
    But isn’t that a different feature ? These two choices are orthogonal:
    
    Publish feeds – I can think of good reasons why someone would make this choice.
    Make my journal searchable (no matter how that’s accessed)
    All four combinations of these two features make sense.
    
brianenigma says:

September 15, 2005 at 8:00 pm

From the LJ point of view, it’s probably better to have a Google-friendly feed of what changed than it is to have spiders constantly hit all the pages–less CPU time devoted to page renders that the spiders ultimately don’t care about. It’s like RSS on a massive scale.
Still, I’m of the mindset that everything I type here is public, and I have a strong grasp of the technology involved; I’m not the kind of person who would be complaining to Six Apart. It is hard for me to imagine that there are people dumb enough not to know that everything they type on a public webpage on the internet is locatable by Google (or other search engines), but I guess they *do* exist somewhere.
Allowing people to opt out of the feed really isn’t a solution–in much the same way as XOR “encryption” or security through obscurity is not a solution. The data is still there and public and easily obtainable by someone with bandwidth. Other non-Google search engines that directly spider the site instead of using the feed will catch those entries. If the data were more valuable, a sort of gray-market rift would be created where official, fine, upstanding search engines would be missing data that more down-and-dirty spidering search engines would get–but we’re talking about blogs here, so nobody cares enough.
The only real solution is to let people delete their accounts, wait until Google catches up to the deletion (which, I imagine, is faster now that they have the feed), and then forget that things like archive.org’s Internet Wayback Machine exist.

1. ignatz says:
  
  September 15, 2005 at 8:10 pm
  
  Yes.
  I mostly agree.
  I still think that introducing a feature that dramatically reduces the cost and effort of datamining the blog stream, without offering people a pre-launch opt out, is a mistake. I obviously don’t think that the NSA is honoring robots.txt or that they won’t take the trouble to spider all of LJ, but having pretty much every marketing company with a few grand to spend on infrastructure snarf up the stream and munch happily on it is a bit too Panopticon for me.
  
rroseselavyoui says:

September 15, 2005 at 8:37 pm

I’m one of those pissed off amateur internet users. I selected that no spider/bots thing when I first created my LJ in July 2001, so I kinda thought I would be safe, you know?
There IS an option on the Google blog search page to prevent their bots indexing your pages, but I have no idea where to paste the code. Plus, it will take some time for it to disappear apparently.
I think Six Apart REALLY did some of us a disfavor with this one…no warning or anything?

1. ignatz says:
  
  September 15, 2005 at 8:45 pm
  
  Yeah.
  See elsewhere in this thread for the admin console command to turn off your participation in the feed, though! 🙂
  
  1. rroseselavyoui says:
    
    September 15, 2005 at 8:49 pm
    
    I did it, thanks so much for that!
    I know it’s foolish to expect that such things won’t happen on the internet, but you know…some folks tend to freeze the internet and technology at certain years.
    
nikolasco says:

September 15, 2005 at 10:33 pm

AFAIK, Google is still using regular feeds. There’s interest in updates.sixapart.com, but it’s far from done. The only change I see from the prototype posted about is that it no longer uses an unterminated AtomStream element. It doesn’t even respect opt_preformatted (“autoformatting”) or spit out valid Atom.
Currently, the only available opt-out is set synlevel level from the console or going friends-only. Back in April and May, I chatted with (a Six Apart VP) and drafted a fresh spec for the Robots Processing Instruction, which generalized the HTML meta robots tag to all XML.
Since people have brought this up again, I’m halfway through writing a spec for an X-Robots header and an XML-attribute approach. The former moves it to the HTTP level, so it works for anything sent over HTTP, and the latter provides finer (element-level) granularity.
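As a rough illustration of the HTTP-level idea, the sketch below shows how a crawler could consult such a header before indexing a response of any content type. The header name `X-Robots` is taken from the comment above; the directive vocabulary (`noindex` and friends) is an assumption borrowed from the HTML meta robots tag, since the spec being drafted is not quoted here. Treat the whole thing as hypothetical.

```python
from typing import Mapping, Set


def robots_directives(headers: Mapping[str, str], header_name: str = "X-Robots") -> Set[str]:
    """Return the lower-cased directives from the (hypothetical) robots header.

    HTTP header names are case-insensitive, so compare them that way.
    """
    for name, value in headers.items():
        if name.lower() == header_name.lower():
            return {part.strip().lower() for part in value.split(",") if part.strip()}
    return set()


def may_index_response(headers: Mapping[str, str]) -> bool:
    """Return False if the response asks not to be indexed."""
    return "noindex" not in robots_directives(headers)


if __name__ == "__main__":
    # Invented response headers for an Atom feed of an opted-out journal.
    feed_headers = {"Content-Type": "application/atom+xml", "x-robots": "noindex, nofollow"}
    print(may_index_response(feed_headers))
    print(may_index_response({"Content-Type": "text/html"}))
```

Because the signal travels with the HTTP response rather than inside the document, the same check would cover HTML pages, XML feeds, and images alike, which is the appeal of moving it to that level.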

1. nikolasco says:
  
  September 15, 2005 at 11:45 pm
  
  Now I see ‘s comment above. At least I now know why 1) it works and 2) I didn’t know about it. When you post, a check is run to see if latest_optout is set and, if it isn’t, the post is inserted into the queue. This queue is used by all of the latest-* things, so none of them have to worry about it. The implementation is lurking in ljcom, which isn’t GPL-ed and I stay away from.
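A toy version of that design, assuming the intent is that posts from opted-out users never enter the shared queue, might look like the sketch below. The real code lives in ljcom and is not shown in this thread, so every name here (`User`, `on_post`, `latest_queue`, the feed function, the queue size) is hypothetical; only the `latest_optout` property name and its yes/no values come from the comments above.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    # Per-user properties, e.g. {"latest_optout": "yes"} after the console command.
    props: dict = field(default_factory=dict)


# One shared queue of recent public posts; the size limit is arbitrary.
latest_queue = deque(maxlen=1000)


def on_post(user: User, entry_text: str) -> None:
    """Run at post time: enqueue the entry unless the user has opted out."""
    if user.props.get("latest_optout") == "yes":
        return
    latest_queue.append((user.name, entry_text))


def latest_posts_feed() -> list:
    """Any 'latest-*' consumer just reads the queue; no opt-out logic needed here."""
    return list(latest_queue)


if __name__ == "__main__":
    on_post(User("alice"), "public entry")
    on_post(User("bob", {"latest_optout": "yes"}), "entry from an opted-out user")
    print(latest_posts_feed())
```

Doing the check once at the point of insertion is what lets the XML feed, the human-readable "latest posts" page, and the "latest images" feed all respect the setting without each one having to know about it.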
  
flipzagging says:

September 17, 2005 at 1:00 am

Supposedly the problem is now fixed. That ‘scromp’ search now only shows journals which mention ‘scromp’, as far as I can tell.
I have an update from Google’s BlogSearch team over here.
