Horrible WP export data format

So, I’m writing blog software, and one of the obvious things I want to do is import from this blog. As a first step towards that, I exported the entire contents using WP’s export tool (after purging more than 17,000 spam comments that had piled up since my last manual purge) and this is what it looks like (this is the post previous to this one):

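<item>
<title>Interesting words in your OSX Dictionary</title>
<link>https://fukamachi.org/wp/2008/03/11/fake-words-in-your-osx-dictionary/</link>
<pubDate>Tue, 11 Mar 2008 03:03:46 +0000</pubDate>
<dc:creator>Sho</dc:creator>
        <category><![CDATA[...]]></category>
        <category><![CDATA[...]]></category>
        <category><![CDATA[...]]></category>
        <category domain="tag"><![CDATA[...]]></category>
        <category domain="tag"><![CDATA[...]]></category>
<guid isPermaLink="false">https://fukamachi.org/wp/2008/03/11/fake-words-in-your-osx-dictionary/</guid>
<description></description>
<content:encoded><![CDATA[...]]></content:encoded>
<wp:post_id>713</wp:post_id>
<wp:post_date>2008-03-11 12:03:46</wp:post_date>
<wp:post_date_gmt>2008-03-11 03:03:46</wp:post_date_gmt>
<wp:comment_status>open</wp:comment_status>
<wp:ping_status>open</wp:ping_status>
<wp:post_name>fake-words-in-your-osx-dictionary</wp:post_name>
<wp:status>publish</wp:status>
<wp:post_parent>0</wp:post_parent>
<wp:menu_order>0</wp:menu_order>
<wp:post_type>post</wp:post_type>
        </item>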

Jesus, that is *horrible*. Firstly, if the post_type is only defined towards the end, what’s with the post_id, post_date, post_name etc? It’s a post – of post_type post! Secondly, where’s the “updated at” field? Why is the “dc:” namespace used for the creator tag and nothing else? What’s with having an “isPermaLink” switch in the guid tag? The permalink is in the link tag, I presume. Why does it need to be content:encoded when the content is obviously CDATA – implying that WP somehow supports XML parsing inside some contents!? Why is pubDate camelCase while everything else is underscore_style? Man, I hate camelCase. Etc etc. What a mess.
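For anyone writing an importer against this format, the namespace mix is the main thing to get past. Below is a minimal sketch using Python's standard `xml.etree.ElementTree`, not the code I actually used. The namespace URIs, the trimmed `SAMPLE` item and the `parse_items` helper are all assumptions for illustration; check the `xmlns` declarations at the top of your own export file before trusting any of them.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs: copy the real ones from the xmlns declarations
# at the top of your own export file (the wp: URI varies between versions).
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wp": "http://wordpress.org/export/1.0/",
}

# Hypothetical, trimmed item that reuses values from the post above.
SAMPLE = """<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <item>
      <title>Interesting words in your OSX Dictionary</title>
      <pubDate>Tue, 11 Mar 2008 03:03:46 +0000</pubDate>
      <dc:creator>Sho</dc:creator>
      <description></description>
      <content:encoded><![CDATA[Using Leopard? Try this.]]></content:encoded>
      <wp:post_id>713</wp:post_id>
      <wp:post_name>fake-words-in-your-osx-dictionary</wp:post_name>
      <wp:post_type>post</wp:post_type>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text):
    """Flatten each <item> into a plain dict, hiding the namespace mix."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "id": int(item.findtext("wp:post_id", namespaces=NS)),
            "type": item.findtext("wp:post_type", namespaces=NS),
            "slug": item.findtext("wp:post_name", namespaces=NS),
            "title": item.findtext("title"),
            "author": item.findtext("dc:creator", namespaces=NS),
            # An empty <description></description> is stored as None.
            "description": item.findtext("description") or None,
            # The parser unwraps CDATA, so this is just the text inside it.
            "body": item.findtext("content:encoded", namespaces=NS),
        })
    return posts


if __name__ == "__main__":
    for post in parse_items(SAMPLE):
        print(post["id"], post["type"], post["title"])
```

Mapping everything into one flat dict up front means the rest of an importer never has to care which of the three namespaces a field came from.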

I know what you’re thinking: that’s just RSS format! Sure it’s ugly, it’s RSS! Well, no. The RSS for this post is similar but different – I examined the feed for that, too. Note that the description is empty here; in the RSS it isn’t. So they’re using a modified RSS format to store internal data. If they’re not going to store the description, but just generate it on the fly – why export empty description tags?!

Just for comparison, here’s the much nicer Atom feed. It obviously doesn’t have all the wp: internal data, but I much prefer the design:

<entry>
        <author>
                <name>Sho</name>
                <uri>https://fukamachi.org/</uri>
        </author>
        <title type="html"><![CDATA[...]]></title>
        <link rel="alternate" type="text/html" href="https://fukamachi.org/wp/2008/03/11/fake-words-in-your-osx-dictionary/" />
        <id>https://fukamachi.org/wp/2008/03/11/fake-words-in-your-osx-dictionary/</id>
        <updated>2008-03-11T03:04:31Z</updated>
        <published>2008-03-11T03:03:46Z</published>
        <category scheme="https://fukamachi.org/wp" term="Language" />
        <category scheme="https://fukamachi.org/wp" term="leopard" />
        <category scheme="https://fukamachi.org/wp" term="mac" />
        <category scheme="https://fukamachi.org/wp" term="dictionary" />
        <category scheme="https://fukamachi.org/wp" term="esquivalience" />
        <summary type="html"><![CDATA[...]]></summary>
        <content type="html" xml:base="https://fukamachi.org/wp/2008/03/11/fake-words-in-your-osx-dictionary/"><![CDATA[
                Using Leopard? Try this. Look up the word esquivalience by selecting it and choosing dictionary from the contextual menu. Read the dictionary definition, then the wikipedia one underneath : )

]]></content>
</entry>

Note the logical, consistent design, self-closing tags, and other innovations.
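To illustrate why the Atom layout is easier to consume, here is a rough sketch of reading one entry with Python's standard library. The `SAMPLE` entry is a trimmed, hypothetical version of the feed above (element names follow the Atom format, and the namespace URI is the standard Atom one), and `parse_entry` and `parse_time` are names made up for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Standard Atom namespace; the "a" prefix is just a local alias.
ATOM = {"a": "http://www.w3.org/2005/Atom"}

# Hypothetical, trimmed entry reusing values from the feed above.
SAMPLE = """<entry xmlns="http://www.w3.org/2005/Atom">
  <author>
    <name>Sho</name>
    <uri>https://fukamachi.org/</uri>
  </author>
  <link rel="alternate" type="text/html"
        href="https://fukamachi.org/wp/2008/03/11/fake-words-in-your-osx-dictionary/" />
  <updated>2008-03-11T03:04:31Z</updated>
  <published>2008-03-11T03:03:46Z</published>
  <category scheme="https://fukamachi.org/wp" term="Language" />
  <category scheme="https://fukamachi.org/wp" term="leopard" />
</entry>"""


def parse_time(text):
    """Parse the UTC 'Z' timestamps used in this feed into aware datetimes."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_entry(xml_text):
    """Pull the fields an importer would care about out of one <entry>."""
    entry = ET.fromstring(xml_text)
    return {
        "author": entry.findtext("a:author/a:name", namespaces=ATOM),
        "link": entry.find("a:link[@rel='alternate']", ATOM).get("href"),
        "published": parse_time(entry.findtext("a:published", namespaces=ATOM)),
        "updated": parse_time(entry.findtext("a:updated", namespaces=ATOM)),
        # Categories are self-closing tags; the value lives in "term".
        "tags": [c.get("term") for c in entry.findall("a:category", ATOM)],
    }


if __name__ == "__main__":
    print(parse_entry(SAMPLE))
```

One namespace, one timestamp format, and a separate updated field: there is not much left for the importer to guess at.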

UPDATE: Check out the comment format:

>
>3>
>>
>>
>http://nigger.org/>
>127.0.0.1>
>2005-07-16 10:23:48>
>2005-07-16 14:23:48>
>Hey, is this that new gay nigger cock website I've been hearing about?>
>1>
>>
>0>
>

The comment author is CDATA, but the content isn’t? WTF?
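For what it's worth, the inconsistency should be cosmetic from an importer's point of view, assuming the exporter escapes the non-CDATA text properly (which I have not verified). A conforming parser hands back the same string either way, as this small sketch shows. The element names, the payload and the `text_of` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names and payload, for illustration only.
AS_CDATA = (
    "<comment_author><![CDATA[Tom & Jerry <tj@example.com>]]></comment_author>"
)
AS_ESCAPED = (
    "<comment_content>Tom &amp; Jerry &lt;tj@example.com&gt;</comment_content>"
)


def text_of(xml_fragment):
    """Return the text of a single element, however it was encoded."""
    return ET.fromstring(xml_fragment).text


if __name__ == "__main__":
    # Both spellings decode to the same text once parsed.
    print(text_of(AS_CDATA) == text_of(AS_ESCAPED))
```

The catch is the other direction: if a comment body containing a raw `<` or `&` were ever written out without escaping or a CDATA wrapper, the whole file would stop being well-formed.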

2 Responses to “Horrible WP export data format”

  1. Wincent Colaiuta Says:

    Well, look on the bright side: this might be unpleasant, but it’s part of your liberation from WP so it’ll be worth it in the end.

    I am having to do the same kind of stuff with my own migration at the moment. Have you considered that it might be easier to just do the following?

    1. Use wget to make a static mirror of the site.

    2. Pull the data you need directly out of your database, and (I assume you are going to be importing this into a Rails app) create ActiveRecord objects from it using script/runner. Basically, working with XML is so painful that it’s often easier to just talk raw SQL with the database to get the data you want.

    This is what I am trying to do right now, in fact, but it’s a nasty job as I have to do it for a MediaWiki install, a UBB.threads install, a MovableType install, a Bugzilla install, a Mailman install etc…

  2. Sho Says:

    Good idea, and I did consider it – but I’d like to offer others the choice to import *their* blogs as well (this is part of a larger project), which kind of rules out direct DB access. I would also like the ability to loosely consume other sources of XML – for example, a subversion commit log or similar – without trying to tie it all into one mega-app.

    With those requirements in mind, I was going to have to write some XML importers anyway, despite the pain. And I found some pretty decent prior art so it actually didn’t take all that long. I have some working scripts now.

    I share your distaste for XML but to be fair, if it’s well-formed and you get it all in one piece it’s not all that hard. Tools like hpricot make working with it much more palatable, not to mention faster. And at least it’s fairly terse – just for fun I converted this blog into YAML, and it was over 200MB.
