Site menu:

Tags

Recent Comments

Site search

Categories

July 2009
M T W T F S S
« May    
 12345
6789101112
13141516171819
20212223242526
2728293031  

Tags

Blogroll

Generating CSV from XML

I was helping a friend out recently who wanted to import some XML data he got into a more useful format [ ed. WHAT? err, useful to him, 'kay?].  It seems like there are a few services out there that will give you data in some kind of home-grown XML format in a record-oriented structure, e.g.

<contacts>
    <contact>
        <id>...</id>
        <name>....</name>
        <email>...</email
    </contact>
    ... <!-- more contact elements -->
</contacts>

When you have data like this, what you’ve got is essentially a degenerate spreadsheet, easily represented as CSV.  But if the service doesn’t provide CSV export, you can get it fairly easily via XSLT.  The idea is, you want to output one row (the header) with the names of the elements in each record, and then output each row thereafter.  What matters, as far as the input, is that it has the structure mentioned above: the document consists of a root element with a number of child elements, each one of which represents a record in the data. Note that the following restriction applies: each record element must contain the same number of child elements in the same order. In order to make it a little more robust, I added some logic to quote non-numeric values, which should provide a reasonable amount of protection from values that contain commas.   For extra fun (and this was my friend’s idea, and I was too lazy to follow through the steps) you could register this XSLT as a filter in OpenOffice.org so you can (nearly) automatically import these files into oocalc. It’s not entirely elegant (the logic for outputting the header row is duplicated with the logic for outputting a normal row), but it gets the job done. So here it is, I place it in the public domain.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="text"
    encoding="iso-8859-1"/>

    <xsl:template match="/">
        <xsl:variable name="records" select="*/*"/>
        <xsl:call-template name="header-row">
            <xsl:with-param name="header" select="$records[1]"/>
        </xsl:call-template>
        <xsl:for-each select="*/*">
            <xsl:call-template name="output-row"/>
        </xsl:for-each>
    </xsl:template>

    <xsl:template name="output-row">
        <xsl:for-each select="child::*">
            <xsl:variable name="numeric" select="not(string(number(.)) = 'NaN')"/>
            <xsl:choose>
                <xsl:when test="$numeric">
                    <xsl:value-of select="normalize-space(.)"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:text>"</xsl:text>
                    <xsl:value-of select="normalize-space(.)"/>
                    <xsl:text>"</xsl:text>
                </xsl:otherwise>
            </xsl:choose>

        <xsl:choose>
            <xsl:when test="position() = last()">
                <xsl:text>&#13;&#10;</xsl:text>
            </xsl:when>
            <xsl:otherwise>
            <xsl:text>,</xsl:text>
            </xsl:otherwise>
        </xsl:choose>
        </xsl:for-each>
    </xsl:template>

    <xsl:template name="header-row">
        <xsl:param name="header"/>
        <xsl:for-each select="$header/*">
            <xsl:call-template name="quotevalue"/>
        </xsl:for-each>
    </xsl:template>

    <xsl:template name="quotevalue">
        <xsl:text>"</xsl:text>
        <xsl:value-of select="normalize-space(name(.))"/>
        <xsl:text>"</xsl:text>
        <xsl:choose>
            <xsl:when test="position() != last()">
                <xsl:text>,</xsl:text>
            </xsl:when>
            <xsl:otherwise>
                <xsl:text>&#13;&#10;</xsl:text>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>
</xsl:stylesheet>

A Few Minutes With Apache Sling

Apache Sling is almost painfully hip, in a way only a dedicated nerd could appreciate (or, ok, believe) — it provides a RESTful frontend to a Java Content Repository, and the whole thing is based on OSGi. Roughly, it gives you a content repository with customizable processing and presentation for different types of content, and the only ‘driver’ you need is a library that truly understands HTTP.

As part of evaluating it for the day job, I put together an s5 presentation with that other reST, and the result is Apache Sling Overview. I also dug into the codebase to figure out a bit more about Sling’s default POST processing servlet. I do hope I didn’t say too many materially false things.

Straight Outta Victoria

UVic’s Electronic Textual Cultures Lab encodes a song by some music guy in the Text Encoding Initiative XML format. There is, of course, a video.

What I want to know is, does this mean XML is cool or hopelessly pass&eacute;?  Discuss.

Goings On About Town

So, there’s this small office in downtown Chapel Hill that used to have a paper “Google” banner in the window.  Today, on a trip past the Cosmic Cantina, I noticed that the window now has a more permanent logo for Android. I’ll admit that the basic idea behind Android – a generally open cellphone platform mostly developed by the ‘net’s largest advertising distribution network — is really appealing, given how disappointing the current situation in the US is (e.g. I can take a picture on my current phone, but I can’t transfer it off the phone without emailing it, which means I’d have to pay a fee; and, equally important is the fact that the applications installed on the phone … well, they suck).  If it takes off, it could open up a range of possibilities for “mobile computing,” and I seriously hope it pulls the rest of the industry along with it.  We have these pocket communicators and the dominant business model for them is oriented around ringtones, fer gosh sakes.  Hm, on second thought, I don’t see how a modular, extensible platform’s going to change that, but maybe it will let me ignore it somewhat, which is good enough.

But why’s there an “Android” office in downtown Chapel Hill?  One related development suggests itself …

Fun With Copyright Renewal Records

Based on an enormous amount of work by contributors to Project Gutenberg and the Distributed Proofreaders, combined with healthy sourcing of the US copyright office’s records, Google has compiled a a list of works originally copyrighted between 1923 and 1963 which have been renewed at some point, the upshot being that if a given work published in that time span is not on the list, it’s likely in the public domain.

One problem with the list that the database is a 370+ megabyte XML file, which is hard to load up in an XML-aware editor and even caused eXist to choke.  So I broke it up into chunks with a shortish Groovy script, for neat ingestion into an XML database.  The heart of the script is a SAX handler that basically churns each record in the XML file into a Groovy object, and a closure (there’s that word again!) that handles each record as it is constructed.  As written, the script simply breaks the big file into a bunch of files, one for each year (you will of course have to edit the paths).  By supplying a different closure, you could do all sorts of different things with the records, e.g. stuff them into a relational database.

In the spirit of the thing, the script is in the public domain — but I make no representations as to the quality, idiomaticity or overall efficiency of the script; despite being SAX-based, it still manages to chew up quite a bit of memory, so watch out.  Note that you will need Apache Commons Lang (say, version 2.4) on the classpath (e.g. in $HOME/.groovy/lib) for this script to work. Developed with Groovy 1.5.6.

I’ve tried to stop wordpress from ‘prettyfying’ the output, which appears to mangle quotes. I hope to have that fixed soon …

import org.xml.sax.helpers.DefaultHandler
import org.xml.sax.Attributes
import org.xml.sax.helpers.XMLReaderFactory
import org.xml.sax.InputSource

import org.apache.commons.lang.StringEscapeUtils
import org.xml.sax.Locator

/**
 * Represents an individual <Record> element
 * in the document.
 **/
class Record {
    def file

    def lines

    def recno

    def title

   def copyrightYear

    def copyrights = []

    def renewalYear

    def renewals = [] 

    // where it was published
    def published

    // rare!
    def note

    // source of the copyright info
    def source
    def snippet
    def md5sum

    // contributors, holders, and pseudonyms
    def people = []

    /**
     * Get the XML representing this element.  Note
     * that proper functioning here depends on how the
     * handler builds the elements.
     * @return a string containing this record's XML.
     */
    def xml() {
        def buf = new StringBuffer()
        buf << """
<Record>
    <Title>${title}</Title>
    <File>${file}</File>
    <Lines>${lines}</Lines>
    <MD5Sum>${md5sum}</MD5Sum>
"""
        if (snippet) {
            buf << "\t<Snippet>${snippet}</Snippet>\n"
        }
        if (note) {
            buf <<"\t<Note>${note}</Note>\n"
        }
        buf <<
"""
    <Source>${source}</Source>
    <CopyrightYear>${copyrightYear}</CopyrightYear>
    <RenewalYear>${renewalYear}</RenewalYear>
"""
        copyrights.each() {
            buf << it.xml()
        }
        renewals.each() {
            buf << it.xml()
        }
        people.each() {
                buf << it.xml()
        }
        buf << "</Record>\n"
        return buf.toString()
    }
}

/**
 * An inelegant class representing the elements that denote
 * people (copyright holders, contributors, aliases, etc.)
 **/
class Person {

    static ELEMENTS = ["Holder" :   [ "Name", "Type" ],
                        "Contrib" : [ "Name", "Role" ],
                        "Pseudonym" : [ "Pseudo", "Real" ],
                        "Neenym" : [ "Nee", "Now" ],
                        "Aka" : [ "Alias", "Real" ] ]

    static ROLES = ELEMENTS.keySet()

    def role

    def name

    def honorific

    def type

    def xml() {
        def firstElement = ELEMENTS[role][0]
        def secondElement = ELEMENTS[role][1]
        def buf = new StringBuffer()

        buf << """
<${role}>
    <${firstElement}>${name}</${firstElement}>
    <${secondElement}>$type</${secondElement}>"""
    if ( honorific ) {
        buf << "\t<Hon>${honorific}</Hon>\n"
        }
    buf << "</${role}>\n"
    return buf.toString()
    }
}

/**
 * Represents copyright and renewal date elements.
 */
class RecordDate {

	static ELEMENTS = ["Copyright", "Renewal"]

    def role
    def date
    def id
    def xml() {
        return """<${role}>
    <Date>${date}</Date>
    <Id>${id}</Id>
</${role}>"""
    }
}

/**
 * SAX handler that turns each <code>Record</code> element
 * into a <code>Record</code> domain object.
 **/
class RecordHandler extends DefaultHandler {

    /**
     * Stack of strings that represents the current
     * element context.
     **/
    Stack context = new Stack()

    /**
     * the current record being built.
     **/
    Record currentRec

    /**
     * the current Person element being built.
     **/
    Person currentPerson

    /**
     * The current date information being collected.
     **/
    RecordDate currentRecDate

    /**
     * A closure which will be called as each record is
     * read in.
     **/
    def recordListener

    /**
     * a buffer to collect the current text, since SAX might
     * not report all contiguous chunks of text at once.
     **/
    StringBuilder currentText = new StringBuilder()

    def locator

    @Override
    public void setDocumentLocator(Locator locator)
    {    println "Got a locator: ${locator}"
        this.locator = locator
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
    {
        context << localName
        switch( localName ) {
            case "Record":
                currentRec = new Record()
                break
            case Person.ROLES:
                currentPerson = new Person()
                currentPerson.role = localName
                break
            case RecordDate.ELEMENTS:
                currentRecDate = new RecordDate()
                currentRecDate.role = localName
                break
        }
    }

    @Override
    public void characters(char [] ch, int start, int len)
    {
        currentText.append(ch,start,len)
    }

    @Override
    public void endElement(String uri, String localName, String qName)
    {
        String txt = StringEscapeUtils.escapeXml(currentText.toString().trim())
        switch(localName) {
            case Person.ROLES:
                currentRec.people << currentPerson
                break
            case ["Type", "Role", "Real", "Now"]:
                currentPerson.type = txt
                break
            case ["Name", "Pseudo", "Nee", "Alias"]:
                currentPerson.name = txt
                break
            case "Hon":
                currentPerson.honorific = txt
               break;
            case "CopyrightYear":
                currentRec.copyrightYear = Integer.parseInt(txt)
                break
            case "Date":
                currentRecDate.date = txt
                break
            case "Id":
                currentRecDate.id = txt
                break
            case "Copyright":
                currentRec.copyrights <<currentRecDate
                break
            case "RenewalYear":
                currentRec.renewalYear = Integer.parseInt(txt)
                break
            case "Renewal":
                currentRec.renewals << currentRecDate
                break
            case "Recno":
                currentRec.recno = txt
                break
            case "Source":
                currentRec.source = txt
                break
            case "Lines":
                currentRec.lines = txt
                break
            case "MD5Sum":
                currentRec.md5sum = txt
                break
            case "File":
                currentRec.file = txt
                break
            case "Snippet":
                currentRec.snippet = txt
                break
            case "Title":
                currentRec.title = txt
                break
            case "Published":
                currentRec.published = txt
                break
            case "Record":
                recordListener(currentRec)
                break
            case "Note":
                currentRec.note = txt
                break
            case "CopyrightRenewalRecords":
                break
            default:
                println "Unrecognized element '${localName}' at line ${locator.lineNumber}"
                System.exit(1)
            }
        currentText.length = 0
    }

}

def file = new File("input-dir/google-renewals-20080624/google-renewals-20080624.xml")

/**
 * A listener that will output each record into a different stream depending
 * on the CopyrightYear of the record.
 **/
def listenerBase = { Map streams, Record it ->
    if ( !streams.containsKey(it.copyrightYear) ) {
        def f = new File("/output/dir/copyright-${it.copyrightYear}.xml")
        println "creating ${f.absolutePath}"
        def stream = f.newWriter()
        streams[it.copyrightYear] = stream
        stream.append("<CopyrightRenewalRecords>")
    }
    Writer s = (Writer)streams[it.copyrightYear]
    s.append(it.xml())
    s.flush()
}

def reader = XMLReaderFactory.createXMLReader()
def handler = new RecordHandler()
def outputStreams = [:]
handler.recordListener = listenerBase.curry(outputStreams)
reader.setContentHandler( handler )

try {
    reader.parse( new InputSource( file.newInputStream() ) )
} catch (Exception x) {
    x.printStackTrace()
    println "Error at line ${handler.locator.lineNumber}"
}

outputStreams.each() {
    k, BufferedOutputStream v ->
        println "Closing ${k}"
        v.append("</CopyrightRenewalRecords>")
        v.flush()
        v.close()
}

Adding Shift-select with jQuery

I’ve been using jQuery a bit here and there to add some (I hope) usability enhancements and for light AJAJ work. Today I encountered a situation where I thought adding the “shift-select” feature on a longish list of checkboxes would be a good thing. This sort of feature pops up in webmail interfaces, where you tick off one box, scroll down through 750 spam messages, and then, while holding down shift on the 751st piece of spam in a row, click its checkbox to select all of the rows in between. It turns out that adding this with jQuery is pretty elegant, so here’s the code. I don’t for a moment think this is the best implementation of this idea, but I was struck by how concise the result was, while supporting — via a straightforward use of a closure, multiple instances on the same page. For giggles, I added a feature that allows you to de-select a range, although I’m not convinced it works in an intuitive way.

To use this, you’ll need jQuery (tested against 1.2.3) in your page, and a CSS selector that matches the checkboxes you want to enable shift-select on. Then call $(selector).shiftSelect(); and you’re done.

 jQuery.fn.shiftSelect = function() {
    var checkboxes = this;
    var lastSelected;
    jQuery(this).click( function(event) {

        if ( !lastSelected ) {
            lastSelected = this;
            return;
        }

        if ( event.shiftKey ) {
            var selIndex = checkboxes.index(this);
            var lastIndex = checkboxes.index(lastSelected);
            /*
             * if you find the "select/unselect" behavior unseemly,
             * remove this assignment and replace 'checkValue'
             * with 'true' below.
             */
            var checkValue = lastSelected.checked;
            if ( selIndex == lastIndex ) {
                return true;
            }

            var end = Math.max(selIndex,lastIndex);
            var start = Math.min(selIndex,lastIndex);
            for(i=start;i<=end;i++) {
                checkboxes[i].checked = checkValue;
            }
        }
        lastSelected = this;
    });
}

D2′d

Not that I want to generate the impression here that all I do is buy electronic gadgets and software, but I just got an 8GB Cowon D2 [pronounced "coe-on", btw], which costs about the same as an 8GB version of a somewhat more popular digital music player, except that it has a touch screen, accepts SD cards for extra storage, and, oh, yes, plays OGG Vorbis and FLAC files (and it has an FM tuner and can also act as a sound recorder).

One also, get this, does not generally interface with a D2 via a proprietary piece of software (although you have that option if you want it)– one attaches the player via a USB cable and accesses it like any other USB storage. OTOH, it’s a little bit bigger, has a somewhat clunkier interface, and it appears to be extremely sensitive to being disconnected before it’s ready, to the extent that it becomes unresponsive if you don’t “stop” (windows) or “unmount” (*nix) it; as a matter of fact, mine doesn’t cotton to being connected to OS X at all. That’s fine, though, because for some funny reason I don’t have much in the way of .ogg and .flac files on my OS X machine …

Plus: a scientific calculator!

XO’d

Shh, don’t tell the nephew, but he’s getting an XO laptop. It’s here, but I haven’t taken it out yet — I am personfully resisting the temptation until I have a good excuse in a few days.

If you haven’t already plumped for one, there’s still 12 days left!
Just sayin’ …

The Messagening

Now that the day job is looking like it’s going to be more and more SOA oriented, in the WS-* sense, my interest in less complex things that can play in that space is getting piqued. It’s been clear for a while that XMPP is going to be one of those things — there are a lot of things you can do on a foundation of durable messaging and presence.

There’s an extension to the protocol to support the Pub/Sub model which allows, among other things, pushing updates out to an XMPP server instead of requiring clients to poll a feed — with built in support for authentication and SSL. More experimental work concerns “shared editing” of documents.

Set up a private server, and run clients on the systems you administer. You can query, monitor,and even run commands on the them; support for presence lets you see which of them are currently online. Take a look at the list of Extensions to XMPP already available, or in progress.

So, I Broke Down

Just a little bit; I tried out the prerelease of IntelliJ IDEA 7.0, and I liked it enough after some rough handling in casual use to actually pony up for a license now that it’s officially released. Like Netbeans, a default install does a whole lot more things I’d like an IDE to do than Eclipse does, and like Netbeans and unlike Eclipse, it’s failed to crash randomly on me. Yep, the prerelease was pretty smooth. Now that I’ve actually fallen for that old trick, although I tried the release (build 7361) on three different operating systems with a total of five different JVMs, I can’t get it to run. If I were to run a guessing game, I would not award points for guessing which platform provides the least help in figuring out what might be the problem. Essentially, the startup sequence seems to be going into an infinite loop looking for an error handler that isn’t there.

Now, of course, this thing had better ought to be really good.

update: what appears to have solved my problem is changing the options for the JVM that IntelliJ runs in; namely, make sure to add -Doldlicenseserver to them (which file to edit depends on the platform).