A Few Minutes With Apache Sling

Apache Sling is almost painfully hip, in a way only a dedicated nerd could appreciate (or, ok, believe) — it provides a RESTful frontend to a Java Content Repository, and the whole thing is based on OSGi. Roughly, it gives you a content repository with customizable processing and presentation for different types of content, and the only ‘driver’ you need is a library that truly understands HTTP.

As part of evaluating it for the day job, I put together an s5 presentation with that other reST, and the result is Apache Sling Overview. I also dug into the codebase to figure out a bit more about Sling’s default POST processing servlet. I do hope I didn’t say too many materially false things.

Goings On About Town

So, there’s this small office in downtown Chapel Hill that used to have a paper “Google” banner in the window.  Today, on a trip past the Cosmic Cantina, I noticed that the window now has a more permanent logo for Android. I’ll admit that the basic idea behind Android – a generally open cellphone platform mostly developed by the ‘net’s largest advertising distribution network — is really appealing, given how disappointing the current situation in the US is (e.g. I can take a picture on my current phone, but I can’t transfer it off the phone without emailing it, which means I’d have to pay a fee; and, equally important is the fact that the applications installed on the phone … well, they suck).  If it takes off, it could open up a range of possibilities for “mobile computing,” and I seriously hope it pulls the rest of the industry along with it.  We have these pocket communicators and the dominant business model for them is oriented around ringtones, fer gosh sakes.  Hm, on second thought, I don’t see how a modular, extensible platform’s going to change that, but maybe it will let me ignore it somewhat, which is good enough.

But why’s there an “Android” office in downtown Chapel Hill?  One related development suggests itself …

Fun With Copyright Renewal Records

Based on an enormous amount of work by contributors to Project Gutenberg and the Distributed Proofreaders, combined with healthy sourcing of the US copyright office’s records, Google has compiled a a list of works originally copyrighted between 1923 and 1963 which have been renewed at some point, the upshot being that if a given work published in that time span is not on the list, it’s likely in the public domain.

One problem with the list that the database is a 370+ megabyte XML file, which is hard to load up in an XML-aware editor and even caused eXist to choke.  So I broke it up into chunks with a shortish Groovy script, for neat ingestion into an XML database.  The heart of the script is a SAX handler that basically churns each record in the XML file into a Groovy object, and a closure (there’s that word again!) that handles each record as it is constructed.  As written, the script simply breaks the big file into a bunch of files, one for each year (you will of course have to edit the paths).  By supplying a different closure, you could do all sorts of different things with the records, e.g. stuff them into a relational database.

In the spirit of the thing, the script is in the public domain — but I make no representations as to the quality, idiomaticity or overall efficiency of the script; despite being SAX-based, it still manages to chew up quite a bit of memory, so watch out.  Note that you will need Apache Commons Lang (say, version 2.4) on the classpath (e.g. in $HOME/.groovy/lib) for this script to work. Developed with Groovy 1.5.6.

I’ve tried to stop wordpress from ‘prettyfying’ the output, which appears to mangle quotes. I hope to have that fixed soon …

import org.xml.sax.helpers.DefaultHandler
import org.xml.sax.Attributes
import org.xml.sax.helpers.XMLReaderFactory
import org.xml.sax.InputSource

import org.apache.commons.lang.StringEscapeUtils
import org.xml.sax.Locator

/**
 * Represents an individual <Record> element
 * in the document.
 **/
class Record {
    def file

    def lines

    def recno

    def title

   def copyrightYear

    def copyrights = []

    def renewalYear

    def renewals = [] 

    // where it was published
    def published

    // rare!
    def note

    // source of the copyright info
    def source
    def snippet
    def md5sum

    // contributors, holders, and pseudonyms
    def people = []

    /**
     * Get the XML representing this element.  Note
     * that proper functioning here depends on how the
     * handler builds the elements.
     * @return a string containing this record's XML.
     */
    def xml() {
        def buf = new StringBuffer()
        buf << """
<Record>
    <Title>${title}</Title>
    <File>${file}</File>
    <Lines>${lines}</Lines>
    <MD5Sum>${md5sum}</MD5Sum>
"""
        if (snippet) {
            buf << "\t<Snippet>${snippet}</Snippet>\n"
        }
        if (note) {
            buf <<"\t<Note>${note}</Note>\n"
        }
        buf <<
"""
    <Source>${source}</Source>
    <CopyrightYear>${copyrightYear}</CopyrightYear>
    <RenewalYear>${renewalYear}</RenewalYear>
"""
        copyrights.each() {
            buf << it.xml()
        }
        renewals.each() {
            buf << it.xml()
        }
        people.each() {
                buf << it.xml()
        }
        buf << "</Record>\n"
        return buf.toString()
    }
}

/**
 * An inelegant class representing the elements that denote
 * people (copyright holders, contributors, aliases, etc.)
 **/
class Person {

    static ELEMENTS = ["Holder" :   [ "Name", "Type" ],
                        "Contrib" : [ "Name", "Role" ],
                        "Pseudonym" : [ "Pseudo", "Real" ],
                        "Neenym" : [ "Nee", "Now" ],
                        "Aka" : [ "Alias", "Real" ] ]

    static ROLES = ELEMENTS.keySet()

    def role

    def name

    def honorific

    def type

    def xml() {
        def firstElement = ELEMENTS[role][0]
        def secondElement = ELEMENTS[role][1]
        def buf = new StringBuffer()

        buf << """
<${role}>
    <${firstElement}>${name}</${firstElement}>
    <${secondElement}>$type</${secondElement}>"""
    if ( honorific ) {
        buf << "\t<Hon>${honorific}</Hon>\n"
        }
    buf << "</${role}>\n"
    return buf.toString()
    }
}

/**
 * Represents copyright and renewal date elements.
 */
class RecordDate {

	static ELEMENTS = ["Copyright", "Renewal"]

    def role
    def date
    def id
    def xml() {
        return """<${role}>
    <Date>${date}</Date>
    <Id>${id}</Id>
</${role}>"""
    }
}

/**
 * SAX handler that turns each <code>Record</code> element
 * into a <code>Record</code> domain object.
 **/
class RecordHandler extends DefaultHandler {

    /**
     * Stack of strings that represents the current
     * element context.
     **/
    Stack context = new Stack()

    /**
     * the current record being built.
     **/
    Record currentRec

    /**
     * the current Person element being built.
     **/
    Person currentPerson

    /**
     * The current date information being collected.
     **/
    RecordDate currentRecDate

    /**
     * A closure which will be called as each record is
     * read in.
     **/
    def recordListener

    /**
     * a buffer to collect the current text, since SAX might
     * not report all contiguous chunks of text at once.
     **/
    StringBuilder currentText = new StringBuilder()

    def locator

    @Override
    public void setDocumentLocator(Locator locator)
    {    println "Got a locator: ${locator}"
        this.locator = locator
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
    {
        context << localName
        switch( localName ) {
            case "Record":
                currentRec = new Record()
                break
            case Person.ROLES:
                currentPerson = new Person()
                currentPerson.role = localName
                break
            case RecordDate.ELEMENTS:
                currentRecDate = new RecordDate()
                currentRecDate.role = localName
                break
        }
    }

    @Override
    public void characters(char [] ch, int start, int len)
    {
        currentText.append(ch,start,len)
    }

    @Override
    public void endElement(String uri, String localName, String qName)
    {
        String txt = StringEscapeUtils.escapeXml(currentText.toString().trim())
        switch(localName) {
            case Person.ROLES:
                currentRec.people << currentPerson
                break
            case ["Type", "Role", "Real", "Now"]:
                currentPerson.type = txt
                break
            case ["Name", "Pseudo", "Nee", "Alias"]:
                currentPerson.name = txt
                break
            case "Hon":
                currentPerson.honorific = txt
               break;
            case "CopyrightYear":
                currentRec.copyrightYear = Integer.parseInt(txt)
                break
            case "Date":
                currentRecDate.date = txt
                break
            case "Id":
                currentRecDate.id = txt
                break
            case "Copyright":
                currentRec.copyrights <<currentRecDate
                break
            case "RenewalYear":
                currentRec.renewalYear = Integer.parseInt(txt)
                break
            case "Renewal":
                currentRec.renewals << currentRecDate
                break
            case "Recno":
                currentRec.recno = txt
                break
            case "Source":
                currentRec.source = txt
                break
            case "Lines":
                currentRec.lines = txt
                break
            case "MD5Sum":
                currentRec.md5sum = txt
                break
            case "File":
                currentRec.file = txt
                break
            case "Snippet":
                currentRec.snippet = txt
                break
            case "Title":
                currentRec.title = txt
                break
            case "Published":
                currentRec.published = txt
                break
            case "Record":
                recordListener(currentRec)
                break
            case "Note":
                currentRec.note = txt
                break
            case "CopyrightRenewalRecords":
                break
            default:
                println "Unrecognized element '${localName}' at line ${locator.lineNumber}"
                System.exit(1)
            }
        currentText.length = 0
    }

}

def file = new File("input-dir/google-renewals-20080624/google-renewals-20080624.xml")

/**
 * A listener that will output each record into a different stream depending
 * on the CopyrightYear of the record.
 **/
def listenerBase = { Map streams, Record it ->
    if ( !streams.containsKey(it.copyrightYear) ) {
        def f = new File("/output/dir/copyright-${it.copyrightYear}.xml")
        println "creating ${f.absolutePath}"
        def stream = f.newWriter()
        streams[it.copyrightYear] = stream
        stream.append("<CopyrightRenewalRecords>")
    }
    Writer s = (Writer)streams[it.copyrightYear]
    s.append(it.xml())
    s.flush()
}

def reader = XMLReaderFactory.createXMLReader()
def handler = new RecordHandler()
def outputStreams = [:]
handler.recordListener = listenerBase.curry(outputStreams)
reader.setContentHandler( handler )

try {
    reader.parse( new InputSource( file.newInputStream() ) )
} catch (Exception x) {
    x.printStackTrace()
    println "Error at line ${handler.locator.lineNumber}"
}

outputStreams.each() {
    k, BufferedOutputStream v ->
        println "Closing ${k}"
        v.append("</CopyrightRenewalRecords>")
        v.flush()
        v.close()
}

Adding Shift-select with jQuery

I’ve been using jQuery a bit here and there to add some (I hope) usability enhancements and for light AJAJ work. Today I encountered a situation where I thought adding the “shift-select” feature on a longish list of checkboxes would be a good thing. This sort of feature pops up in webmail interfaces, where you tick off one box, scroll down through 750 spam messages, and then, while holding down shift on the 751st piece of spam in a row, click its checkbox to select all of the rows in between. It turns out that adding this with jQuery is pretty elegant, so here’s the code. I don’t for a moment think this is the best implementation of this idea, but I was struck by how concise the result was, while supporting — via a straightforward use of a closure, multiple instances on the same page. For giggles, I added a feature that allows you to de-select a range, although I’m not convinced it works in an intuitive way.

To use this, you’ll need jQuery (tested against 1.2.3) in your page, and a CSS selector that matches the checkboxes you want to enable shift-select on. Then call $(selector).shiftSelect(); and you’re done.

 jQuery.fn.shiftSelect = function() {
    var checkboxes = this;
    var lastSelected;
    jQuery(this).click( function(event) {

        if ( !lastSelected ) {
            lastSelected = this;
            return;
        }

        if ( event.shiftKey ) {
            var selIndex = checkboxes.index(this);
            var lastIndex = checkboxes.index(lastSelected);
            /*
             * if you find the "select/unselect" behavior unseemly,
             * remove this assignment and replace 'checkValue'
             * with 'true' below.
             */
            var checkValue = lastSelected.checked;
            if ( selIndex == lastIndex ) {
                return true;
            }

            var end = Math.max(selIndex,lastIndex);
            var start = Math.min(selIndex,lastIndex);
            for(i=start;i<=end;i++) {
                checkboxes[i].checked = checkValue;
            }
        }
        lastSelected = this;
    });
}

D2′d

Not that I want to generate the impression here that all I do is buy electronic gadgets and software, but I just got an 8GB Cowon D2 [pronounced "coe-on", btw], which costs about the same as an 8GB version of a somewhat more popular digital music player, except that it has a touch screen, accepts SD cards for extra storage, and, oh, yes, plays OGG Vorbis and FLAC files (and it has an FM tuner and can also act as a sound recorder).

One also, get this, does not generally interface with a D2 via a proprietary piece of software (although you have that option if you want it)– one attaches the player via a USB cable and accesses it like any other USB storage. OTOH, it’s a little bit bigger, has a somewhat clunkier interface, and it appears to be extremely sensitive to being disconnected before it’s ready, to the extent that it becomes unresponsive if you don’t “stop” (windows) or “unmount” (*nix) it; as a matter of fact, mine doesn’t cotton to being connected to OS X at all. That’s fine, though, because for some funny reason I don’t have much in the way of .ogg and .flac files on my OS X machine …

Plus: a scientific calculator!

The Messagening

Now that the day job is looking like it’s going to be more and more SOA oriented, in the WS-* sense, my interest in less complex things that can play in that space is getting piqued. It’s been clear for a while that XMPP is going to be one of those things — there are a lot of things you can do on a foundation of durable messaging and presence.

There’s an extension to the protocol to support the Pub/Sub model which allows, among other things, pushing updates out to an XMPP server instead of requiring clients to poll a feed — with built in support for authentication and SSL. More experimental work concerns “shared editing” of documents.

Set up a private server, and run clients on the systems you administer. You can query, monitor,and even run commands on the them; support for presence lets you see which of them are currently online. Take a look at the list of Extensions to XMPP already available, or in progress.

So, I Broke Down

Just a little bit; I tried out the prerelease of IntelliJ IDEA 7.0, and I liked it enough after some rough handling in casual use to actually pony up for a license now that it’s officially released. Like Netbeans, a default install does a whole lot more things I’d like an IDE to do than Eclipse does, and like Netbeans and unlike Eclipse, it’s failed to crash randomly on me. Yep, the prerelease was pretty smooth. Now that I’ve actually fallen for that old trick, although I tried the release (build 7361) on three different operating systems with a total of five different JVMs, I can’t get it to run. If I were to run a guessing game, I would not award points for guessing which platform provides the least help in figuring out what might be the problem. Essentially, the startup sequence seems to be going into an infinite loop looking for an error handler that isn’t there.

Now, of course, this thing had better ought to be really good.

update: what appears to have solved my problem is changing the options for the JVM that IntelliJ runs in; namely, make sure to add -Doldlicenseserver to them (which file to edit depends on the platform).