LJ, blog searches, datamining

Google’s new blog search is pretty nifty if you either like searching through people’s weblogs or are an egotist who likes to kiboze. I’m both. Since I’ve always been a shameless self-promoter and I ping all available services, index myself in search engines, etc., this is just peachy.

The way LJ did it was to provide a large-scale XML data feed of Livejournal and Typepad blogs. The feed is explicitly intended for use by larger organizations who want to resyndicate or index this huge quantity of data. It’s not usable by end users; it’s an institutional service.

This is great if you’re Google, or AOL, or an MIT grad student doing a thesis on weblogging. However, if you’re an LJ user who checked the “please do not let search engines index me” button, it may be an unwelcome surprise. People who assumed a level of public presence that included friends and internet acquaintances, but not every coworker or family member who Googled them, have discovered that the verb “to Google” now includes a well-indexed stream of all their public entries since March.

I had a frustrating conversation about this with mendel yesterday (sorry I got ruffled there, Rich) in which I think we were both right about different things. He quite rightly pointed out that public LJ entries were subject to data mining and indexing in a number of ways already, and that the check box for blocking robots did not imply privacy to someone who understands the current state of the Internet. Certainly my personal expectation is that anything I post, even with the lock on it, could conceivably end up as the lead story on CNN, and I proceed with that risk in mind.

And of course many of the complaints received by Six Apart about this will be from people who are misinformed about technology or the law in various countries or any number of complicated issues. I actually have no idea what U.S. law would say about what a customer can reasonably expect in this situation, and since the technologies involved are about fifteen minutes old, it may be unknown anyway.

My concern was different. Providing a massive datastream only useful to large-scale operations is qualitatively different than allowing spidering, even. Marketers, credit agencies, insurance companies, and government agencies now have an optimized tool for data mining a huge chunk of weblogs. The amount of effort required to monitor and index all of LJ and Typepad just deflated tremendously.

I am reminded, for example, of FedEx providing a stream of their tracking information to the U.S. Department of Homeland Security, or of the supermarket loyalty card information being informally turned over to the government right after 9/11/01. A recent event I posted about in which auto repair records from dealers were aggregated and sold to Carfax comes to mind. I have been told by people in the email appliance business that spammers derive a good chunk of income these days by selling verified email addresses with names attached to insurers and credit reporting agencies as additional identifying information for their records (“appends”).

In short, Database Nation (Amazon link). To my mind these changes are inevitable, irresistible, and both exciting and frightening for different reasons.

But I also think that Six Apart failed their customers, at least in the customer satisfaction/PR department, by not providing a pre-launch opt-out or removing customers who checked that box from their institutional feed.

26 thoughts on “LJ, blog searches, datamining”

  1. ahh, not cool. on the one hand, i don’t have anything to hide and i’ve always been aware that i’m posting a JOURNAL on the INTERNET. on the other hand, I don’t need my cousins finding my journal that easily.

  2. You’re right on the money. I unchecked that box recently, because I wanted to let my journal be spidered again. But I would expect that if I had left it checked, that nobody would be spidering my post, not even LJ itself.
    Granted, robots.txt is only a suggestion. But most spiders are ethical enough to abide by it.

    1. On LJ’s scale, robots.txt is impractical because it would have to list every opt-out directory. It is provided for the subdomain of paid accounts, but that’s <4%, last I saw. So, they use the HTML meta tag. The closest approximation for XML is the robots processing instruction, which needs more work and adoption.

    1. Right, and that was the kind of point Rich was making; there isn’t any privacy anyway. I think a lot of the flak they’re going to get is from people who don’t understand that, and it’s going to be unreasonable and uninformed.

  3. Hmm, I’ve had [X] No robotz on since day 1 and my public posts do not appear in Google even today. Some references to them outside of my own journal, but nothing from my posts themselves.

  4. For what it’s worth, there turns out to be a way to opt out of the full-site feed that the big aggregators are using (Google isn’t the only one, but I’m not sure what the others are): in the admin console, type set latest_optout yes and submit. (Replace “yes” with “no” to reverse the change.) That keeps your posts out of the XML and human-readable “latest posts” feeds as well as the “latest images” feed.

      1. But isn’t that a different feature? These two choices are orthogonal:

        Publish feeds – I can think of good reasons why someone would make this choice.
        Make my journal searchable (no matter how that’s accessed)
        All four combinations of these two features make sense.

  5. From the LJ point of view, it’s probably better to have a Google-friendly feed of what changed than it is to have spiders constantly hit all the pages–less CPU time devoted to page renders that the spiders ultimately don’t care about. It’s like RSS on a massive scale.
    Still, I’m of the mindset that everything I type here is public and have a strong grasp of the technology involved; not the kind of person who would be complaining to Six Apart. It is hard for me to imagine that there are people dumb enough to not know that everything they type on a public webpage on the internet is locatable by Google (or other search engines), but I guess they *do* exist somewhere.
    Allowing people to opt out of the feed really isn’t a solution–in much the same way as XOR “encryption” or security through obscurity is not a solution. The data is still there and public and easily obtainable by someone with bandwidth. Other non-Google search engines that directly spider the site instead of using the feed will catch those entries. If the data were more valuable, a sort of gray-market rift would be created where official, fine, upstanding search engines would be missing data that more down-and-dirty spidering search engines would get–but we’re talking about blogs here, so nobody cares enough.
    The only real solution is to let people delete their accounts, wait until Google catches up to the deletion (which, I imagine, is faster now that they have the feed), and then forget that things like archive.org’s Internet Wayback Machine exist.

    1. Yes.
      I mostly agree.
      I still think that introducing a feature that dramatically reduces the cost and effort of datamining the blog stream, without offering people a pre-launch opt out, is a mistake. I obviously don’t think that the NSA is honoring robots.txt or that they won’t take the trouble to spider all of LJ, but having pretty much every marketing company with a few grand to spend on infrastructure snarf up the stream and munch happily on it is a bit too Panopticon for me.

  6. I’m one of those pissed off amateur internet users. I selected that no spider/bots thing when I first created my LJ in July 2001, so I kinda thought I would be safe, you know?
    There IS an option on the Google blog search page to prevent their bots indexing your pages, but I have no idea where to paste the code. Plus, it will take some time for it to disappear apparently.
    I think Six Apart REALLY did some of us a disfavor with this one…no warning or anything?

      1. I did it, thanks so much for that!
        I know it’s foolish to expect that such things won’t happen on the internet, but you know…some folks tend to freeze the internet and technology at certain years.

  7. AFAIK, Google is still using regular feeds. There’s interest in updates.sixapart.com, but it’s far from done. The only change I see from the prototype posted about is that it no longer uses an unterminated AtomStream element. It doesn’t even respect opt_preformatted (“autoformatting”) or spit out valid Atom.
    Currently, the only available opt-out is set synlevel level from the console or going friends-only. Back in April and May, I chatted with (a Six Apart VP) and drafted a fresh spec for the Robots Processing Instruction, which generalized the HTML meta robots tag to all XML.
    Since people have brought this up again, I’m halfway through writing a spec for an X-Robots header and an XML-attribute approach. The former moves it to the HTTP level, so it works for anything sent over HTTP, and the latter provides finer (element-level) granularity.

    1. Now I see ‘s comment above. At least I now know why 1) it works and 2) I didn’t know about it. When you post, a check is run to see if latest_optout is present and, if it is, insert it into the queue. This queue is used by all of the latest-* things, so none of them have to worry about it. The implementation is lurking in ljcom, which isn’t GPL-ed and I stay away from.
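
The two crawler-side opt-outs discussed in the comments above (robots.txt, and the HTML meta robots tag that LJ uses for free accounts) can be sketched roughly as follows. This is only an illustration of what a well-behaved spider might check before indexing a page, not how Google, LJ, or Six Apart actually do it. The names MetaRobotsParser, may_index, and the "ExampleBot" user agent are made up for the example; the only library pieces used are Python's standard urllib.robotparser and html.parser modules.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta robots>), so normalise them.
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def may_index(robots_txt, page_html, url, user_agent="ExampleBot"):
    """Return True only if neither robots.txt nor the page's meta tag objects.

    robots_txt and page_html are passed in as strings, so this sketch does
    no network access; a real crawler would fetch both first.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        return False  # site-wide rule: the path is disallowed

    meta = MetaRobotsParser()
    meta.feed(page_html)
    # "none" is the conventional shorthand for "noindex, nofollow".
    return not ({"noindex", "none"} & meta.directives)
```

Note that both checks are voluntary on the crawler's side, which is the point several commenters make: a spider that skips them, or a consumer of a bulk feed that carries neither signal, sees the same public entries.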
