<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Clowns In My Coffee &#187; RDF</title>
	<atom:link href="http://clownsinmycoffee.net/category/rdf/feed/" rel="self" type="application/rss+xml" />
	<link>http://clownsinmycoffee.net</link>
	<description>Inanity of the most cogent sort you can find.</description>
	<pubDate>Thu, 02 Oct 2008 11:41:56 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.2</generator>
	<language>en</language>
			<item>
		<title>Fun With Copyright Renewal Records</title>
		<link>http://clownsinmycoffee.net/2008/07/01/fun-with-copyright-renewal-records/</link>
		<comments>http://clownsinmycoffee.net/2008/07/01/fun-with-copyright-renewal-records/#comments</comments>
		<pubDate>Tue, 01 Jul 2008 03:32:08 +0000</pubDate>
		<dc:creator>adam</dc:creator>
		
		<category><![CDATA[RDF]]></category>

		<category><![CDATA[Tools]]></category>

		<category><![CDATA[conferences]]></category>

		<category><![CDATA[nerdination]]></category>

		<guid isPermaLink="false">http://clownsinmycoffee.net/?p=46</guid>
		<description><![CDATA[Based on an enormous amount of work by contributors to Project Gutenberg and the Distributed Proofreaders, combined with healthy sourcing of the US copyright office&#8217;s records, Google has compiled a list of works originally copyrighted between 1923 and 1963 which have been renewed at some point, the upshot being that if a given work [...]]]></description>
			<content:encoded><![CDATA[<p>Based on an enormous amount of work by contributors to <a href="http://www.gutenberg.org/wiki/Main_Page">Project Gutenberg</a> and the <a href="http://www.pgdp.net/c/">Distributed Proofreaders</a>, combined with healthy sourcing of the US <a href="http://www.copyright.gov/records/">copyright office&#8217;s records</a>, Google has compiled <a href="http://booksearch.blogspot.com/2008/06/us-copyright-renewal-records-available.html">a list of works originally copyrighted between 1923 and 1963</a> which have been renewed at some point, the upshot being that if a given work published in that time span is <em>not</em> on the list, it&#8217;s likely in the public domain.
</p>
<p>
One problem with the list is that the database is a 370+ megabyte XML file, which is hard to load in an XML-aware editor and even caused <a title="eXist open source XML database" href="http://exist.sourceforge.net">eXist</a> to choke.  So I broke it up into chunks with a shortish Groovy script, for neat ingestion into an XML database.  The heart of the script is a SAX handler that turns each record in the XML file into a Groovy object, and a closure (there&#8217;s that word again!) that handles each record as it is constructed.  As written, the script simply breaks the big file into a bunch of files, one for each copyright year (you will of course have to edit the paths).  By supplying a different closure, you could do all sorts of different things with the records, e.g. stuff them into a relational database.
</p>
<p>
In the spirit of the thing, the script is in the public domain &#8212; but I make no representations as to the quality, idiomaticity or overall efficiency of the script; despite being SAX-based, it still manages to chew up quite a bit of memory, so watch out.  Note that you will need <a href="http://commons.apache.org/lang">Apache Commons Lang</a> (say, version 2.4) on the classpath (e.g. in <code>$HOME/.groovy/lib</code>) for this script to work.  Developed with Groovy 1.5.6.
</p>
<p style="color: red">I&#8217;ve tried to stop wordpress from &#8216;prettyfying&#8217; the output, which appears to mangle quotes.  I hope to have that fixed soon &#8230;</p>
<pre style="border: 1px solid #000; padding: 1em; background-color: #ccc;">import org.xml.sax.helpers.DefaultHandler
import org.xml.sax.Attributes
import org.xml.sax.helpers.XMLReaderFactory
import org.xml.sax.InputSource

import org.apache.commons.lang.StringEscapeUtils
import org.xml.sax.Locator

/**
 * Represents an individual &lt;Record&gt; element
 * in the document.
 **/
class Record {
    def file

    def lines

    def recno

    def title

    def copyrightYear

    def copyrights = []

    def renewalYear

    def renewals = [] 

    // where it was published
    def published

    // rare!
    def note

    // source of the copyright info
    def source
    def snippet
    def md5sum

    // contributors, holders, and pseudonyms
    def people = []

    /**
     * Get the XML representing this element.  Note
     * that proper functioning here depends on how the
     * handler builds the elements.
     * @return a string containing this record's XML.
     */
    def xml() {
        def buf = new StringBuffer()
        buf &lt;&lt; """
&lt;Record&gt;
    &lt;Title&gt;${title}&lt;/Title&gt;
    &lt;File&gt;${file}&lt;/File&gt;
    &lt;Lines&gt;${lines}&lt;/Lines&gt;
    &lt;MD5Sum&gt;${md5sum}&lt;/MD5Sum&gt;
"""
        if (snippet) {
            buf &lt;&lt; "\t&lt;Snippet&gt;${snippet}&lt;/Snippet&gt;\n"
        }
        if (note) {
            buf &lt;&lt;"\t&lt;Note&gt;${note}&lt;/Note&gt;\n"
        }
        buf &lt;&lt;
"""
    &lt;Source&gt;${source}&lt;/Source&gt;
    &lt;CopyrightYear&gt;${copyrightYear}&lt;/CopyrightYear&gt;
    &lt;RenewalYear&gt;${renewalYear}&lt;/RenewalYear&gt;
"""
        copyrights.each() {
            buf &lt;&lt; it.xml()
        }
        renewals.each() {
            buf &lt;&lt; it.xml()
        }
        people.each() {
                buf &lt;&lt; it.xml()
        }
        buf &lt;&lt; "&lt;/Record&gt;\n"
        return buf.toString()
    }
}

/**
 * An inelegant class representing the elements that denote
 * people (copyright holders, contributors, aliases, etc.)
 **/
class Person {

    static ELEMENTS = ["Holder" :   [ "Name", "Type" ],
                        "Contrib" : [ "Name", "Role" ],
                        "Pseudonym" : [ "Pseudo", "Real" ],
                        "Neenym" : [ "Nee", "Now" ],
                        "Aka" : [ "Alias", "Real" ] ]

    static ROLES = ELEMENTS.keySet()

    def role

    def name

    def honorific

    def type

    def xml() {
        def firstElement = ELEMENTS[role][0]
        def secondElement = ELEMENTS[role][1]
        def buf = new StringBuffer()

        buf &lt;&lt; """
&lt;${role}&gt;
    &lt;${firstElement}&gt;${name}&lt;/${firstElement}&gt;
    &lt;${secondElement}&gt;$type&lt;/${secondElement}&gt;"""
    if ( honorific ) {
        buf &lt;&lt; "\t&lt;Hon&gt;${honorific}&lt;/Hon&gt;\n"
        }
    buf &lt;&lt; "&lt;/${role}&gt;\n"
    return buf.toString()
    }
}

/**
 * Represents copyright and renewal date elements.
 */
class RecordDate {

	static ELEMENTS = ["Copyright", "Renewal"]

    def role
    def date
    def id
    def xml() {
        return """&lt;${role}&gt;
    &lt;Date&gt;${date}&lt;/Date&gt;
    &lt;Id&gt;${id}&lt;/Id&gt;
&lt;/${role}&gt;"""
    }
}

/**
 * SAX handler that turns each &lt;code&gt;Record&lt;/code&gt; element
 * into a &lt;code&gt;Record&lt;/code&gt; domain object.
 **/
class RecordHandler extends DefaultHandler {

    /**
     * Stack of strings that represents the current
     * element context.
     **/
    Stack context = new Stack()

    /**
     * the current record being built.
     **/
    Record currentRec

    /**
     * the current Person element being built.
     **/
    Person currentPerson

    /**
     * The current date information being collected.
     **/
    RecordDate currentRecDate

    /**
     * A closure which will be called as each record is
     * read in.
     **/
    def recordListener

    /**
     * a buffer to collect the current text, since SAX might
     * not report all contiguous chunks of text at once.
     **/
    StringBuilder currentText = new StringBuilder()

    def locator

    @Override
    public void setDocumentLocator(Locator locator)
    {
        println "Got a locator: ${locator}"
        this.locator = locator
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
    {
        context &lt;&lt; localName
        switch( localName ) {
            case "Record":
                currentRec = new Record()
                break
            case Person.ROLES:
                currentPerson = new Person()
                currentPerson.role = localName
                break
            case RecordDate.ELEMENTS:
                currentRecDate = new RecordDate()
                currentRecDate.role = localName
                break
        }
    }

    @Override
    public void characters(char [] ch, int start, int len)
    {
        currentText.append(ch,start,len)
    }

    @Override
    public void endElement(String uri, String localName, String qName)
    {
        String txt = StringEscapeUtils.escapeXml(currentText.toString().trim())
        switch(localName) {
            case Person.ROLES:
                currentRec.people &lt;&lt; currentPerson
                break
            case ["Type", "Role", "Real", "Now"]:
                currentPerson.type = txt
                break
            case ["Name", "Pseudo", "Nee", "Alias"]:
                currentPerson.name = txt
                break
            case "Hon":
                currentPerson.honorific = txt
                break
            case "CopyrightYear":
                currentRec.copyrightYear = Integer.parseInt(txt)
                break
            case "Date":
                currentRecDate.date = txt
                break
            case "Id":
                currentRecDate.id = txt
                break
            case "Copyright":
                currentRec.copyrights &lt;&lt;currentRecDate
                break
            case "RenewalYear":
                currentRec.renewalYear = Integer.parseInt(txt)
                break
            case "Renewal":
                currentRec.renewals &lt;&lt; currentRecDate
                break
            case "Recno":
                currentRec.recno = txt
                break
            case "Source":
                currentRec.source = txt
                break
            case "Lines":
                currentRec.lines = txt
                break
            case "MD5Sum":
                currentRec.md5sum = txt
                break
            case "File":
                currentRec.file = txt
                break
            case "Snippet":
                currentRec.snippet = txt
                break
            case "Title":
                currentRec.title = txt
                break
            case "Published":
                currentRec.published = txt
                break
            case "Record":
                recordListener(currentRec)
                break
            case "Note":
                currentRec.note = txt
                break
            case "CopyrightRenewalRecords":
                break
            default:
                println "Unrecognized element '${localName}' at line ${locator.lineNumber}"
                System.exit(1)
            }
        currentText.length = 0
    }

}

def file = new File("input-dir/google-renewals-20080624/google-renewals-20080624.xml")

/**
 * A listener that will output each record into a different stream depending
 * on the CopyrightYear of the record.
 **/
def listenerBase = { Map streams, Record it -&gt;
    if ( !streams.containsKey(it.copyrightYear) ) {
        def f = new File("/output/dir/copyright-${it.copyrightYear}.xml")
        println "creating ${f.absolutePath}"
        def stream = f.newWriter()
        streams[it.copyrightYear] = stream
        stream.append("&lt;CopyrightRenewalRecords&gt;")
    }
    Writer s = (Writer)streams[it.copyrightYear]
    s.append(it.xml())
    s.flush()
}

def reader = XMLReaderFactory.createXMLReader()
def handler = new RecordHandler()
def outputStreams = [:]
handler.recordListener = listenerBase.curry(outputStreams)
reader.setContentHandler( handler )

try {
    reader.parse( new InputSource( file.newInputStream() ) )
} catch (Exception x) {
    x.printStackTrace()
    println "Error at line ${handler.locator.lineNumber}"
}

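// The map values are the Writers returned by f.newWriter() in the listener,
// so the closure parameter must be typed as Writer, not BufferedOutputStream.
outputStreams.each() {
    k, Writer v -&gt;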
        println "Closing ${k}"
        v.append("&lt;/CopyrightRenewalRecords&gt;")
        v.flush()
        v.close()
}</pre>
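<p>
The &#8220;different closure&#8221; idea mentioned above can be sketched in a few lines.  This is a minimal, self-contained illustration under stated assumptions, not part of the script itself: <code>MiniHandler</code>, the inline <code>sample</code> document and the <code>counts</code> map are hypothetical names made up for the example, and the handler only copes with flat records (no nested <code>Holder</code> or <code>Contrib</code> elements).  Instead of writing files, the closure tallies records per copyright year.
</p>
<pre style="border: 1px solid #000; padding: 1em; background-color: #ccc;">import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.InputSource
import org.xml.sax.helpers.DefaultHandler

// Hypothetical, simplified handler: gathers the text of each child of a
// &lt;Record&gt; into a map and passes the map to a listener closure.
class MiniHandler extends DefaultHandler {
    Closure listener
    Map current
    StringBuilder text = new StringBuilder()

    void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName == "Record") {
            current = [:]
        }
        text.setLength(0)
    }

    void characters(char[] ch, int start, int len) {
        text.append(ch, start, len)
    }

    void endElement(String uri, String localName, String qName) {
        if (qName == "Record") {
            // the record is complete: hand it to whatever closure was supplied
            listener.call(current)
            current = null
        } else if (current != null) {
            current[qName] = text.toString().trim()
        }
        text.setLength(0)
    }
}

// Made-up sample data standing in for the real 370+ megabyte file.
def sample = """&lt;CopyrightRenewalRecords&gt;
  &lt;Record&gt;&lt;Title&gt;A&lt;/Title&gt;&lt;CopyrightYear&gt;1930&lt;/CopyrightYear&gt;&lt;/Record&gt;
  &lt;Record&gt;&lt;Title&gt;B&lt;/Title&gt;&lt;CopyrightYear&gt;1930&lt;/CopyrightYear&gt;&lt;/Record&gt;
  &lt;Record&gt;&lt;Title&gt;C&lt;/Title&gt;&lt;CopyrightYear&gt;1941&lt;/CopyrightYear&gt;&lt;/Record&gt;
&lt;/CopyrightRenewalRecords&gt;"""

// The closure is the only part that decides what happens to a record.
def counts = new TreeMap()
def handler = new MiniHandler()
handler.listener = { rec -&gt;
    def year = rec["CopyrightYear"]
    counts[year] = (counts[year] ?: 0) + 1
}

def parser = SAXParserFactory.newInstance().newSAXParser()
parser.parse(new InputSource(new StringReader(sample)), handler)

counts.each { year, n -&gt; println "${year}: ${n}" }</pre>
<p>
The parsing code never needs to know what the closure does with a record, so swapping the tally for, say, a database insert should only mean changing the body of the closure.
</p>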
]]></content:encoded>
			<wfw:commentRss>http://clownsinmycoffee.net/2008/07/01/fun-with-copyright-renewal-records/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Start Your Query Engines</title>
		<link>http://clownsinmycoffee.net/2006/12/14/what-he-said/</link>
		<comments>http://clownsinmycoffee.net/2006/12/14/what-he-said/#comments</comments>
		<pubDate>Thu, 14 Dec 2006 03:59:58 +0000</pubDate>
		<dc:creator>adam</dc:creator>
		
		<category><![CDATA[RDF]]></category>

		<guid isPermaLink="false">http://clownsinmycoffee.net/2006/12/14/what-he-said/</guid>
		<description><![CDATA[A visit to googlefight suggests that the XQuery meme outcompetes the SPARQL meme in the parts of the universe Google knows about at a rate upwards of 4:1.  XQuery has clearly been the go-to standard for vendors of relational database engines that have recently added XML capabilities to their offerings.  [...]]]></description>
			<content:encoded><![CDATA[<p>A visit to googlefight suggests that the XQuery meme outcompetes the <acronym title="SPARQL Protocol And RDF Query Language">SPARQL </acronym> <sup><a title="SPARQL W3C Page" href="http://www.w3.org/TR/rdf-sparql-query/">[?]</a></sup> meme in the parts of the universe Google knows about<sup>1</sup> at a rate upwards of 4:1.  XQuery has clearly been the go-to standard for  vendors of relational database engines that have recently added XML capabilities to their offerings.  If your data has the structure, then it&#8217;s not a bad choice at all.  For example, having put together course packs from numerous scattered journal articles, I can really appreciate the power of something like <a href="https://www.safariu.com/">SafariU</a>, which is <a title="Jason Hunter: XQuery By Example: Making O'Reilly Books Sing and Dance" href="http://www.idealliance.org/proceedings/xml05/abstracts/paper128.HTML">built with XQuery</a>.</p>
<p>Reading over the accounts of XML 2006 sessions, I was kind of puzzled by the choice of XQuery as the underpinnings of a <a title="Kenneth Sall and Ronald Reck: Applying XQuery and OWL to The World Factbook, Wikipedia and Project Gutenberg" href="http://2006.xmlconference.org/programme/presentations/57.html">mashup</a> that combines no fewer than three web-based data sources (Project Gutenberg, CIA World Factbook, and Wikipedia).  Given the use of OWL, and the <a href="http://www.rrecktek.com/xml2006/">metadata-oriented nature of the queries</a>, I would have thought SPARQL was a natural fit for the project (<strong>update: </strong>Ronald P. Reck replies in comments: XQuery was chosen because it&#8217;s a more mature standard; he also provides handy links to the full proceedings and other papers).<br />
The fact is, while they&#8217;re both query languages, as Bob DuCharme points out, <a title="Bob DuCharme: RDF versus XQuery" href="http://www.snee.com/bobdc.blog/2006/12/rdf_versus_xquery.html">SPARQL and XQuery aren&#8217;t designed to solve the same problems</a>.  It&#8217;s really worth reading his post if you&#8217;ve encountered SPARQL but don&#8217;t yet see the point of it (or this RDF business).</p>
<p><sup>1</sup> Stealing a phrase from the younger Wittgenstein: the limits of my search index indicate the limits of my world.</p>
]]></content:encoded>
			<wfw:commentRss>http://clownsinmycoffee.net/2006/12/14/what-he-said/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Name-Calling</title>
		<link>http://clownsinmycoffee.net/2006/07/26/name-calling/</link>
		<comments>http://clownsinmycoffee.net/2006/07/26/name-calling/#comments</comments>
		<pubDate>Wed, 26 Jul 2006 15:45:48 +0000</pubDate>
		<dc:creator>adam</dc:creator>
		
		<category><![CDATA[RDF]]></category>

		<guid isPermaLink="false">http://clownsinmycoffee.net/2006/07/26/name-calling/</guid>
		<description><![CDATA[Names are fun to think about.  Coming up with a theory about what, if anything, a name means was a good inroad to the philosophy of language for me.
In the land of frameworks for describing resources, there&#8217;s some confusion about how to go about constructing names for those resources.  The actual morphology of [...]]]></description>
			<content:encoded><![CDATA[<p>Names are fun to think about.  Coming up with a theory about what, if anything, a name <em>means</em> was a good inroad to the philosophy of language for me.</p>
<p>In the land of <a title="W3C RDF page" href="http://www.w3.org/TR/2003/WD-rdf-concepts-20030123/">frameworks for describing resources</a>, there&#8217;s some confusion about how to go about constructing names for those resources.  The actual morphology of names comes up because the names are processed by computers, which abhor ambiguity almost as much as nature abhors a vacuum.  There is, thus, a temptation to sneak some kind of significant structure into names, where the structure is supposed to help reduce the ambiguity.  &#8220;Charlotte&#8221; is a perfectly good set of characters, but does it refer to a fictional spider, a city in North Carolina, the actress who played Mrs. Garrett on <em>Diff&#8217;rent Strokes</em>, or something else?  Humans have the conversational context to help figure out which of these is the case, and you can always add more characters when appropriate (&#8220;&#8230; the spider,&#8221; &#8220;&#8230; North Carolina,&#8221; &#8220;Rae&#8221;).  Computers, however, need a little more help, as things currently stand.  They&#8217;re not great at context, and they&#8217;re able to handle much longer names than humans without breaking a sweat, so disambiguation is largely solved by throwing more characters into the name.</p>
<p>What should those characters be, and what should their structure be?  Accepting the constraint that a name must be a URI, there&#8217;s still lots to argue about.  There&#8217;s <em>my</em> setup for linking to <a title="Norm Walsh on naming resources" href="http://norman.walsh.name/2006/07/25/namesAndAddresses">Names and addresses</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://clownsinmycoffee.net/2006/07/26/name-calling/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
