w3c.recommend(xproc)

As an unabashed fan of the angle brackety type things, I’m chuffed to learn, via Norman Walsh, that XProc is now a W3C recommendation. Congratulations to all the people who put in all the work to get it there. Take a look at it if you need to run your XML documents through a bunch of steps and produce a bunch of results (and do other things along the way).

I’ve used XProc in a limited way to run a sort of enhanced XSLT process, and it was slow to get started, but once I wrapped my head around the central concepts, the rest went like butter.  Given that the specification provides for making HTTP requests, I’d think it could serve as an especially useful component in a RESTful document publishing architecture.  But then, I would say that.

Posted in Tools, Uncategorized, nerdination

On Bankers

Here’s some unsolicited advice for performing long-distance one-off transactions with financial institutions: if your transaction is at all unusual and requires that documentation of some sort or another be passed around and notarized and suchlike — get the procedure documented.  That person  you’re talking to on the phone, unless you’re really lucky, doesn’t know that the dark, cold hands of Institutional Policy are poised to strangle any vestiges of trust you have in corporate behavior if you follow the seemingly simple procedure.  The bureaucracy always wants a Very Serious Document prepared, and it’s quite easy to blow you off, because what are you but some disembodied, highly compressed voice coming out over a Very Small Speaker?

Of course, if banks weren’t somewhat risk averse, we wouldn’t put our money into them.  Know, however, that the risk aversion has little or nothing to do with you (the customer), and almost everything to do with the Organization.  And the rules and regulations governing this stuff are Very Complex indeed, and typically aren’t the sorts of things that are known by the type of employee who has to (eurk!) take phone calls from the public. If they knew them, they’d be too busy in meetings deciding what the next set of rules should be to take the time to deal with your piddly little problems.

The beauty of this whole setup is that it takes no active or deliberate malice on any individual’s part, and you, dear friend, will have to seek and pay for the advice of a lawyer.  Ain’t modern life grand?  Philly Joe Remarkable is not the only one looking on in disbelief.

Posted in Nonsense, Uncategorized

On Nostalgia

There’s a whiff of wistfulness out there on the ‘tubes for the passing of Sun Microsystems, and I’ve got to admit I’ve participated a bit in that; the absorption by Oracle of Sun’s assets certainly marks some kind of transition in the industry that helps me pay my bills, call it ‘maturity’ or ‘loss of innocence’, or “oh no, we’re all doomed!”

On the other hand, I was clearing out some gunk in my attic this evening, and I came across a pretty hefty printout that details how to write a very simple custom component for Java Server Faces 1.0; it clocks in at around 15 pages or so. And then, you know, maybe I’m not so surprised at what happened to Sun.

And then it hits me that what’s swallowing Sun is Oracle. And then I’m surprised again.

File under “Vendorprisey”

Posted in nerdination

The Third G Drops

I’ve been thinking of it as effectively a rumour up until now, but today, my Android phone started getting a 3G signal in Chapel Hill and Carrboro (that’s T-Mobile, in case you didn’t know). So, now I go from having been a double early-adopter sucker (64Kbits/s and the first generation phone) to a reasonably fast-browsin’ (750-880Kbits/s) early-adopter sucker.

It’s progress, I guess. I understand that it’s possible to write applications for these things. If only I knew how to use a computer.

Posted in carrboro, nerdination

In Which I Become a Food Blogger

Some time ago, a friend pointed me to this magical stuff that turns fats into powders. T’other day, I finally got my hands on some of this tapioca maltodextrin, as it’s called; it’s a starch, and there’s really not much more to turning a really fatty thing into a powder than mixing the two things together.

The starch is close enough to flavourless, but any statements you may have encountered to the effect that you put the powdered (olive oil/peanut butter/hazelnut-and-chocolate-spread) into your mouth and voilà! it’s the original stuff again! are not really operative. There’s a noticeable effect on the texture, and you’ve got a bunch of starch that wasn’t there before.

Still, it’s amazing to work with and it’s truly a surprise for your taste buds.

Posted in Uncategorized, food, nerdination

Generating CSV from XML

I was helping a friend out recently who wanted to import some XML data he got into a more useful format [ ed. WHAT? err, useful to him, 'kay?].  It seems like there are a few services out there that will give you data in some kind of home-grown XML format in a record-oriented structure, e.g.

<contacts>
    <contact>
        <id>...</id>
        <name>....</name>
        <email>...</email>
    </contact>
    ... <!-- more contact elements -->
</contacts>

When you have data like this, what you’ve got is essentially a degenerate spreadsheet, easily represented as CSV.  But if the service doesn’t provide CSV export, you can get it fairly easily via XSLT.  The idea is, you want to output one row (the header) with the names of the elements in each record, and then output each row thereafter.  What matters, as far as the input, is that it has the structure mentioned above: the document consists of a root element with a number of child elements, each one of which represents a record in the data. Note that the following restriction applies: each record element must contain the same number of child elements in the same order. In order to make it a little more robust, I added some logic to quote non-numeric values, which should provide a reasonable amount of protection from values that contain commas.   For extra fun (and this was my friend’s idea, and I was too lazy to follow through the steps) you could register this XSLT as a filter in OpenOffice.org so you can (nearly) automatically import these files into oocalc. It’s not entirely elegant (the logic for outputting the header row is duplicated with the logic for outputting a normal row), but it gets the job done. So here it is, I place it in the public domain.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="text"
    encoding="iso-8859-1"/>

    <xsl:template match="/">
        <xsl:variable name="records" select="*/*"/>
        <xsl:call-template name="header-row">
            <xsl:with-param name="header" select="$records[1]"/>
        </xsl:call-template>
        <xsl:for-each select="*/*">
            <xsl:call-template name="output-row"/>
        </xsl:for-each>
    </xsl:template>

    <xsl:template name="output-row">
        <xsl:for-each select="child::*">
            <xsl:variable name="numeric" select="not(string(number(.)) = 'NaN')"/>
            <xsl:choose>
                <xsl:when test="$numeric">
                    <xsl:value-of select="normalize-space(.)"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:text>"</xsl:text>
                    <xsl:value-of select="normalize-space(.)"/>
                    <xsl:text>"</xsl:text>
                </xsl:otherwise>
            </xsl:choose>

            <xsl:choose>
                <xsl:when test="position() = last()">
                    <xsl:text>&#13;&#10;</xsl:text>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:text>,</xsl:text>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:for-each>
    </xsl:template>

    <xsl:template name="header-row">
        <xsl:param name="header"/>
        <xsl:for-each select="$header/*">
            <xsl:call-template name="quotevalue"/>
        </xsl:for-each>
    </xsl:template>

    <xsl:template name="quotevalue">
        <xsl:text>"</xsl:text>
        <xsl:value-of select="normalize-space(name(.))"/>
        <xsl:text>"</xsl:text>
        <xsl:choose>
            <xsl:when test="position() != last()">
                <xsl:text>,</xsl:text>
            </xsl:when>
            <xsl:otherwise>
                <xsl:text>&#13;&#10;</xsl:text>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>
</xsl:stylesheet>
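
For comparison, here is a minimal Python sketch of the same idea. It is not a replacement for the stylesheet, just another way to look at the logic. It assumes the same record-oriented input (a root element whose children are records with identically ordered fields), and the names `csv_cell` and `records_to_csv` are made up for illustration. One difference from the stylesheet: it also doubles any embedded double quotes, which is the usual CSV convention.

```python
import re
import xml.etree.ElementTree as ET

# Roughly the XPath 1.0 notion of a number: optional minus, digits, optional fraction.
NUMBER = re.compile(r"-?(\d+(\.\d*)?|\.\d+)")


def csv_cell(text):
    """Collapse whitespace, then quote the value unless it looks like a number."""
    value = " ".join(text.split())  # roughly what normalize-space() does
    if NUMBER.fullmatch(value):
        return value
    # Double embedded quotes so the cell survives a CSV parser.
    return '"' + value.replace('"', '""') + '"'


def records_to_csv(xml_text):
    """Return CSV text: one header row of field names, then one row per record."""
    records = list(ET.fromstring(xml_text))  # children of the root element
    if not records:
        return ""
    # Header row: the element names of the first record, always quoted.
    rows = [['"' + field.tag + '"' for field in records[0]]]
    for record in records:
        rows.append([csv_cell("".join(field.itertext())) for field in record])
    return "".join(",".join(row) + "\r\n" for row in rows)
```

Calling `records_to_csv` on the text of a file shaped like the contacts example above gives you the whole CSV document as a string, header row first.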
Posted in nerdination

A Few Minutes With Apache Sling

Apache Sling is almost painfully hip, in a way only a dedicated nerd could appreciate (or, ok, believe) — it provides a RESTful frontend to a Java Content Repository, and the whole thing is based on OSGi. Roughly, it gives you a content repository with customizable processing and presentation for different types of content, and the only ‘driver’ you need is a library that truly understands HTTP.

As part of evaluating it for the day job, I put together an s5 presentation with that other reST, and the result is Apache Sling Overview. I also dug into the codebase to figure out a bit more about Sling’s default POST processing servlet. I do hope I didn’t say too many materially false things.
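
To give a flavour of the "all you need is HTTP" claim, here is a small Python sketch that builds (but does not send) the kind of request the default POST servlet handles, as I understand it: a form POST to a content path, where each form field becomes a property on the node. The host, port, path, property names and admin credentials are all assumptions for a local test instance, and `build_sling_post` is a made-up helper name.

```python
import base64
import urllib.parse
import urllib.request


def build_sling_post(base_url, path, properties, user, password):
    """Build a form POST asking Sling to create or update the node at `path`."""
    body = urllib.parse.urlencode(properties).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + path, data=body, method="POST")
    # HTTP Basic authentication; the credentials are assumed, not universal.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


# Hypothetical local instance and content path.
req = build_sling_post("http://localhost:8080", "/content/demo/hello",
                       {"title": "Hello", "body": "First post"},
                       "admin", "admin")
```

Passing `req` to `urllib.request.urlopen` would send it to a running instance; nothing in the sketch touches the network on its own.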

Posted in Tools, nerdination

Straight Outta Victoria

UVic’s Electronic Textual Cultures Lab encodes a song by some music guy in the Text Encoding Initiative XML format. There is, of course, a video.

What I want to know is, does this mean XML is cool or hopelessly passé?  Discuss.

Posted in nerdination, whodathunk

Goings On About Town

So, there’s this small office in downtown Chapel Hill that used to have a paper “Google” banner in the window.  Today, on a trip past the Cosmic Cantina, I noticed that the window now has a more permanent logo for Android. I’ll admit that the basic idea behind Android – a generally open cellphone platform mostly developed by the ‘net’s largest advertising distribution network — is really appealing, given how disappointing the current situation in the US is (e.g. I can take a picture on my current phone, but I can’t transfer it off the phone without emailing it, which means I’d have to pay a fee; and, equally important is the fact that the applications installed on the phone … well, they suck).  If it takes off, it could open up a range of possibilities for “mobile computing,” and I seriously hope it pulls the rest of the industry along with it.  We have these pocket communicators and the dominant business model for them is oriented around ringtones, fer gosh sakes.  Hm, on second thought, I don’t see how a modular, extensible platform’s going to change that, but maybe it will let me ignore it somewhat, which is good enough.

But why’s there an “Android” office in downtown Chapel Hill?  One related development suggests itself …

Posted in nerdination

Fun With Copyright Renewal Records

Based on an enormous amount of work by contributors to Project Gutenberg and the Distributed Proofreaders, combined with healthy sourcing of the US Copyright Office’s records, Google has compiled a list of works originally copyrighted between 1923 and 1963 which have been renewed at some point, the upshot being that if a given work published in that time span is not on the list, it’s likely in the public domain.

One problem with the list is that the database is a 370+ megabyte XML file, which is hard to load up in an XML-aware editor and even caused eXist to choke.  So I broke it up into chunks with a shortish Groovy script, for neat ingestion into an XML database.  The heart of the script is a SAX handler that basically churns each record in the XML file into a Groovy object, and a closure (there’s that word again!) that handles each record as it is constructed.  As written, the script simply breaks the big file into a bunch of files, one for each year (you will of course have to edit the paths).  By supplying a different closure, you could do all sorts of different things with the records, e.g. stuff them into a relational database.

In the spirit of the thing, the script is in the public domain — but I make no representations as to the quality, idiomaticity or overall efficiency of the script; despite being SAX-based, it still manages to chew up quite a bit of memory, so watch out.  Note that you will need Apache Commons Lang (say, version 2.4) on the classpath (e.g. in $HOME/.groovy/lib) for this script to work. Developed with Groovy 1.5.6.

I’ve tried to stop wordpress from ‘prettyfying’ the output, which appears to mangle quotes. I hope to have that fixed soon …

import org.xml.sax.helpers.DefaultHandler
import org.xml.sax.Attributes
import org.xml.sax.helpers.XMLReaderFactory
import org.xml.sax.InputSource

import org.apache.commons.lang.StringEscapeUtils
import org.xml.sax.Locator

/**
 * Represents an individual <Record> element
 * in the document.
 **/
class Record {
    def file

    def lines

    def recno

    def title

    def copyrightYear

    def copyrights = []

    def renewalYear

    def renewals = [] 

    // where it was published
    def published

    // rare!
    def note

    // source of the copyright info
    def source
    def snippet
    def md5sum

    // contributors, holders, and pseudonyms
    def people = []

    /**
     * Get the XML representing this element.  Note
     * that proper functioning here depends on how the
     * handler builds the elements.
     * @return a string containing this record's XML.
     */
    def xml() {
        def buf = new StringBuffer()
        buf << """
<Record>
    <Title>${title}</Title>
    <File>${file}</File>
    <Lines>${lines}</Lines>
    <MD5Sum>${md5sum}</MD5Sum>
"""
        if (snippet) {
            buf << "\t<Snippet>${snippet}</Snippet>\n"
        }
        if (note) {
            buf <<"\t<Note>${note}</Note>\n"
        }
        buf <<
"""
    <Source>${source}</Source>
    <CopyrightYear>${copyrightYear}</CopyrightYear>
    <RenewalYear>${renewalYear}</RenewalYear>
"""
        copyrights.each() {
            buf << it.xml()
        }
        renewals.each() {
            buf << it.xml()
        }
        people.each() {
                buf << it.xml()
        }
        buf << "</Record>\n"
        return buf.toString()
    }
}

/**
 * An inelegant class representing the elements that denote
 * people (copyright holders, contributors, aliases, etc.)
 **/
class Person {

    static ELEMENTS = ["Holder" :   [ "Name", "Type" ],
                        "Contrib" : [ "Name", "Role" ],
                        "Pseudonym" : [ "Pseudo", "Real" ],
                        "Neenym" : [ "Nee", "Now" ],
                        "Aka" : [ "Alias", "Real" ] ]

    static ROLES = ELEMENTS.keySet()

    def role

    def name

    def honorific

    def type

    def xml() {
        def firstElement = ELEMENTS[role][0]
        def secondElement = ELEMENTS[role][1]
        def buf = new StringBuffer()

        buf << """
<${role}>
    <${firstElement}>${name}</${firstElement}>
    <${secondElement}>${type}</${secondElement}>
"""
        // end the opening string with a newline so the next tag starts on its own line
        if ( honorific ) {
            buf << "\t<Hon>${honorific}</Hon>\n"
        }
        buf << "</${role}>\n"
        return buf.toString()
    }
}

/**
 * Represents copyright and renewal date elements.
 */
class RecordDate {

    static ELEMENTS = ["Copyright", "Renewal"]

    def role
    def date
    def id
    def xml() {
        return """<${role}>
    <Date>${date}</Date>
    <Id>${id}</Id>
</${role}>"""
    }
}

/**
 * SAX handler that turns each <code>Record</code> element
 * into a <code>Record</code> domain object.
 **/
class RecordHandler extends DefaultHandler {

    /**
     * Stack of strings that represents the current
     * element context.
     **/
    Stack context = new Stack()

    /**
     * the current record being built.
     **/
    Record currentRec

    /**
     * the current Person element being built.
     **/
    Person currentPerson

    /**
     * The current date information being collected.
     **/
    RecordDate currentRecDate

    /**
     * A closure which will be called as each record is
     * read in.
     **/
    def recordListener

    /**
     * a buffer to collect the current text, since SAX might
     * not report all contiguous chunks of text at once.
     **/
    StringBuilder currentText = new StringBuilder()

    def locator

    @Override
    public void setDocumentLocator(Locator locator)
    {
        println "Got a locator: ${locator}"
        this.locator = locator
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
    {
        context << localName
        switch( localName ) {
            case "Record":
                currentRec = new Record()
                break
            case Person.ROLES:
                currentPerson = new Person()
                currentPerson.role = localName
                break
            case RecordDate.ELEMENTS:
                currentRecDate = new RecordDate()
                currentRecDate.role = localName
                break
        }
    }

    @Override
    public void characters(char [] ch, int start, int len)
    {
        currentText.append(ch,start,len)
    }

    @Override
    public void endElement(String uri, String localName, String qName)
    {
        String txt = StringEscapeUtils.escapeXml(currentText.toString().trim())
        switch(localName) {
            case Person.ROLES:
                currentRec.people << currentPerson
                break
            case ["Type", "Role", "Real", "Now"]:
                currentPerson.type = txt
                break
            case ["Name", "Pseudo", "Nee", "Alias"]:
                currentPerson.name = txt
                break
            case "Hon":
                currentPerson.honorific = txt
                break
            case "CopyrightYear":
                currentRec.copyrightYear = Integer.parseInt(txt)
                break
            case "Date":
                currentRecDate.date = txt
                break
            case "Id":
                currentRecDate.id = txt
                break
            case "Copyright":
                currentRec.copyrights << currentRecDate
                break
            case "RenewalYear":
                currentRec.renewalYear = Integer.parseInt(txt)
                break
            case "Renewal":
                currentRec.renewals << currentRecDate
                break
            case "Recno":
                currentRec.recno = txt
                break
            case "Source":
                currentRec.source = txt
                break
            case "Lines":
                currentRec.lines = txt
                break
            case "MD5Sum":
                currentRec.md5sum = txt
                break
            case "File":
                currentRec.file = txt
                break
            case "Snippet":
                currentRec.snippet = txt
                break
            case "Title":
                currentRec.title = txt
                break
            case "Published":
                currentRec.published = txt
                break
            case "Record":
                recordListener(currentRec)
                break
            case "Note":
                currentRec.note = txt
                break
            case "CopyrightRenewalRecords":
                break
            default:
                println "Unrecognized element '${localName}' at line ${locator.lineNumber}"
                System.exit(1)
        }
        currentText.length = 0
    }

}

def file = new File("input-dir/google-renewals-20080624/google-renewals-20080624.xml")

/**
 * A listener that will output each record into a different stream depending
 * on the CopyrightYear of the record.
 **/
def listenerBase = { Map streams, Record it ->
    if ( !streams.containsKey(it.copyrightYear) ) {
        def f = new File("/output/dir/copyright-${it.copyrightYear}.xml")
        println "creating ${f.absolutePath}"
        def stream = f.newWriter()
        streams[it.copyrightYear] = stream
        stream.append("<CopyrightRenewalRecords>")
    }
    Writer s = (Writer)streams[it.copyrightYear]
    s.append(it.xml())
    s.flush()
}

def reader = XMLReaderFactory.createXMLReader()
def handler = new RecordHandler()
def outputStreams = [:]
handler.recordListener = listenerBase.curry(outputStreams)
reader.setContentHandler( handler )

try {
    reader.parse( new InputSource( file.newInputStream() ) )
} catch (Exception x) {
    x.printStackTrace()
    println "Error at line ${handler.locator.lineNumber}"
}

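// The map holds the Writers created by the listener above (File.newWriter()),
// so the closure parameter has to be a Writer, not a BufferedOutputStream.
outputStreams.each() {
    k, Writer v ->
        println "Closing ${k}"
        v.append("</CopyrightRenewalRecords>")
        v.flush()
        v.close()
}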
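
To illustrate the "supply a different closure" point without more Groovy, here is a stripped-down Python sketch of the same pattern: a SAX handler that builds one record at a time and hands it to whatever callback you give it. This one just counts records per copyright year instead of writing files. It only looks at the `Title` and `CopyrightYear` elements, and the class and function names are made up for the example; treat it as a sketch of the shape, not a port of the script above.

```python
import io
import xml.sax
from collections import Counter


class RecordHandler(xml.sax.ContentHandler):
    """Collect a few fields of each <Record> and pass the result to a callback."""

    def __init__(self, on_record):
        super().__init__()
        self.on_record = on_record  # called once per completed record
        self.record = None
        self.text = []

    def startElement(self, name, attrs):
        self.text = []
        if name == "Record":
            self.record = {}

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so buffer them.
        self.text.append(content)

    def endElement(self, name):
        if name == "Record":
            self.on_record(self.record)
            self.record = None
        elif self.record is not None and name in ("Title", "CopyrightYear"):
            self.record[name] = "".join(self.text).strip()
        self.text = []


def count_by_year(source):
    """Parse a file name or binary stream and count records per CopyrightYear."""
    counts = Counter()
    handler = RecordHandler(lambda rec: counts.update([rec.get("CopyrightYear")]))
    xml.sax.parse(source, handler)
    return counts
```

Swapping the lambda for a function that does an INSERT would be the relational database variant; the handler itself would not need to change.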
Posted in RDF, Tools, conferences, nerdination