An XQuery script for listing the contents of collections in eXist-db

This will be a fairly straightforward post: a utility script for recursively listing all sub-collections and resources of collections in an eXist-db database (version 2.0). I often find myself looking for this, and banging my head against errors raised by insufficient resource permissions. Hence, this script has some built-in checks that should avoid permission-related errors, and it retrieves the following information for the available resources:

  • name
  • path inside the eXist db
  • (for collections) number of files or sub-collections
  • (for files) MIME type, file size
  • permission information: owner, group, permissions


Reverse Proxying Tomcat Web Applications Behind Apache

As far as sysadmin feelings go, I’ve been living happily with my Tomcat web application server, caged behind a hidden port and fronted with an Apache web server via mod_jk. While preventing direct web access to the Tomcat server (which is generally considered insecure), this setup still makes Tomcat web applications publicly available via requests to the Apache web server. It served well for deploying our (mostly eXist-driven) web applications. Yet, this all-embracing symbiotic happiness started to dwindle when I discovered that the latest generation of eXist code (the 2.x branch) seemed to have an issue with mod_jk. Apparently, when piped through mod_jk, eXist-2.x is unable to set or get session attributes (whereas all works fine when those same requests are directed straight to the Tomcat application server). After my efforts to switch application code from Cocoon-based sitemaps to eXist’s own MVC controller framework and catch up with the latest eXist versions, this observation felt like a cold shower, especially since session attributes have proven a great means to add state to my webapps and hence feature prominently in my webapp code logic.

Since I lack any Java skills myself, my hopes for seeing this issue fixed were low, as it clearly falls in between both technologies (eXist and mod_jk). Moreover, eXist developers tend to prefer the mod_proxy Apache module over mod_jk for the communication between Apache and Tomcat, which reduces the chances that this mod_jk-related issue will get much priority. Fortunately, there is some basic eXist documentation on how to configure Apache to act as a reverse proxy for eXist webapps, which provided a good basis for investigating how I could switch my current mod_jk configuration to a reverse proxy setup with mod_proxy.

In this post, I’ll try to explain the different pitfalls I came across in my scenario. I’ll start by explaining some specifics of our configuration, and work my way from functional to optional proxy configuration. Before I start, I want to make two disclaimers: first, I’m only an accidental sysadmin who started this ‘investigation’ without much prior knowledge about (reverse) proxying. Yet, I’ll try my best to explain things as understandably and accurately as possible. Second, although the issues I’ll describe apply to any setup where Tomcat apps are fronted with Apache via reverse proxy, I’ll illustrate them with some aspects of the eXist webapps I’m familiar with. Yet, the scope is definitely broader than eXist.


Internal URL Rewriting with eXist’s MVC Framework

Since version 1.4, the eXist native XML database has been equipped with a Model View Controller (MVC) framework designed to express the logic for request routing of eXist-based web applications in XQuery. In this post I’ll illuminate a (in my opinion) somewhat under-exposed feature of eXist’s MVC framework: internal URL rewriting. By this term, I mean the fact that a URL, say http://localhost:8080/exist/urltest/test.xql, is resolved internally to another URL like http://localhost:8080/exist/urltest/xquery/test.xql. ‘Internally’ means that the original request is not redirected to another one, and the user still sees the original URL in the browser address bar. As section 1 of this post will illustrate, this works like a charm for ‘simple’ rewrites like the previous one, but requires some thought if you would like to ‘chain’ multiple internal rewrite rules. In this post, I’ll try to provide a flexible coding pattern to achieve such internal rewriting with eXist’s MVC framework.
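
To make the idea of ‘chained’ internal rewrites a bit more tangible, here is a minimal conceptual sketch in Python (not XQuery, and not the coding pattern of this post). The rule list, the function name `resolve_internally` and the `max_hops` guard are all hypothetical; the paths are simply the example URLs above, stripped of host and `/exist` prefix. It only shows what chaining amounts to: a rewritten path is fed back into the rule set until no rule applies any more, while the client never sees the intermediate paths.

```python
import re

# Hypothetical rewrite rules as (pattern, replacement) pairs. They are
# illustration only and do not reflect eXist's controller API.
RULES = [
    (re.compile(r"^/urltest/([^/]+\.xql)$"), r"/urltest/xquery/\1"),
    (re.compile(r"^/urltest/search$"), r"/urltest/search.xql"),
]


def resolve_internally(path, rules=RULES, max_hops=10):
    """Apply rewrite rules repeatedly until no rule matches any more."""
    for _ in range(max_hops):
        for pattern, replacement in rules:
            if pattern.match(path):
                path = pattern.sub(replacement, path)
                break  # restart rule matching with the rewritten path
        else:
            return path  # no rule matched: resolution is complete
    # Guard against rule sets that keep rewriting each other's output.
    raise RuntimeError("rewrite chain too long (possible loop)")


# A 'simple' rewrite takes one hop; '/urltest/search' needs two chained hops.
print(resolve_internally("/urltest/test.xql"))
print(resolve_internally("/urltest/search"))
```

The hop limit is a design choice of this sketch: as soon as rules can feed each other, some protection against endless rewriting is needed.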


From KWIC display to KWIC(er) processing with eXist

The eXist XML database has a dedicated XQuery module for displaying search results in a fixed context window, a visualization that is commonly known as a KeyWord In Context (KWIC) view. Search results are presented with a preceding and following text context (referred to below as the left and right text context):

<p>
    <span class="previous">... s effect, sir; after what flourish your </span>
    <span class="hi">nature</span>
    <span class="following"> will.</span>
</p>

This formatting of search results invites further processing, such as sorting the search results according to their left or right contexts, or even according to the nth word preceding or following the search term. This is greatly facilitated by the XML representation of the KWIC search results, where all three parts are isolated in their own XML element. However, while eXist’s current KWIC display module (as it is consistently called) does its job in presenting a KWIC display, in my opinion it is too display-oriented:

  • it lacks performance on large result sets, and / or wide context widths, which is crucial for further processing, since sorting requires pre-computation of the entire result set
  • (though this is nitpicking:) the output is presentational HTML; while this is irrelevant from a processing point of view, I would prefer a semantically more ‘neutral’ format and defer presentational formatting to a later display phase

This post will address both objections and present alternatives. Additionally, ways for processing these KWIC results are discussed in the last section.
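
As a rough illustration of the kind of processing meant here, the following Python sketch sorts KWIC hits by the nth word to the left or right of the keyword. It is not the code of this post and not eXist’s KWIC module: the `KwicHit` class and the helper names are hypothetical, and apart from the first hit (taken from the example above) the sample hits are invented.

```python
from dataclasses import dataclass


@dataclass
class KwicHit:
    left: str     # text context preceding the match
    keyword: str  # the matched search term
    right: str    # text context following the match


def nth_word_left(hit, n=1):
    """Return the nth word before the keyword (1 = nearest), lower-cased."""
    words = hit.left.split()
    return words[-n].lower() if len(words) >= n else ""


def nth_word_right(hit, n=1):
    """Return the nth word after the keyword (1 = nearest), lower-cased."""
    words = hit.right.split()
    return words[n - 1].lower() if len(words) >= n else ""


# Sample data: only the first hit comes from the example above.
hits = [
    KwicHit("after what flourish your", "nature", "will."),
    KwicHit("the art itself is", "nature", "made better"),
    KwicHit("one touch of", "nature", "alone suffices"),
]

# Sorting needs the complete result set up front, which is why the
# performance of computing the contexts matters.
by_left = sorted(hits, key=nth_word_left)
by_right = sorted(hits, key=lambda h: nth_word_right(h, 1))
```

Because each hit keeps its three parts separate, switching the sort key to, say, the second word on the left is just `sorted(hits, key=lambda h: nth_word_left(h, 2))`.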


Venturing into versions: strategies for querying a TEI apparatus with eXist

When encoding a critical edition in XML, one of the challenges facing the text encoder is finding a way to represent multiple versions of a work in a sensible way. As usual when it comes to the electronic representation of texts in the field of the humanities, such a sensible way is provided by the Text Encoding Initiative (TEI). Actually, three ways are offered, though this post will focus on the so-called parallel-segmentation method (for extensive reference, the reader is directed to chapter 12: Critical Apparatus of the TEI Guidelines). In short: this method allows an encoder to represent all text versions of a work within a single XML source, where places with variant text are encoded as an inline apparatus (<app>), in which the distinct variants are identified as readings (<rdg wit="[sigil]">), whose @wit attribute links them to (an) identified version(s) of the work. At this point, a lot more could be said about both edition-theoretic and markup-theoretic aspects, but this won’t be the focus of this post.

Instead, this post will focus on a problem I was confronted with when developing an application (i.e. a web interface) for such an edition: how do you search within such ‘multiversion’ texts? Most probably, users of the edition would want to focus on one text version (or a selection of versions). Of course, when version 1 contains the word ‘hope’, which has been changed to ‘despair’ in version 2, only the right readings should be retrieved for the respective text version.
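
The ‘hope’ / ‘despair’ case can be sketched in a few lines of Python (rather than XQuery, and not as one of the strategies of this post). Everything in it is an assumption for illustration: the sample paragraph, the sigils `#v1` and `#v2`, and the function names. It also simplifies the encoding by only handling <rdg> readings inside <app>. The point is merely that a search limited to one version first has to reduce each apparatus to the readings of that version.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented sample: one paragraph with a single inline apparatus.
SAMPLE = """<p xmlns="http://www.tei-c.org/ns/1.0">There is
  <app>
    <rdg wit="#v1">hope</rdg>
    <rdg wit="#v2">despair</rdg>
  </app> in this line.</p>"""


def text_of_witness(elem, sigil):
    """Collect the text of elem, keeping only readings of one witness."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == TEI + "app":
            # Inside an apparatus, keep only readings attested by the sigil;
            # @wit may list several witnesses separated by whitespace.
            for rdg in child.findall(TEI + "rdg"):
                if sigil in rdg.get("wit", "").split():
                    parts.append(text_of_witness(rdg, sigil))
        else:
            parts.append(text_of_witness(child, sigil))
        parts.append(child.tail or "")
    return "".join(parts)


def version_text(xml, sigil):
    """Return the whitespace-normalized text of one version."""
    return " ".join(text_of_witness(ET.fromstring(xml), sigil).split())


def versions_containing(xml, word, sigils):
    """Return the sigils of all versions whose text contains the word."""
    return [s for s in sigils if word in version_text(xml, s).split()]
```

With this sample, `versions_containing(SAMPLE, "hope", ["#v1", "#v2"])` only reports the first version, which is exactly the behaviour a user of the edition would expect.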


XQuery Unit testing in eXist-1.4

[UPDATE 2011-01-19]: As of revisions 13587 and 13589, the XQuery Unit Testing framework has been backported from eXist-trunk to the eXist-1.4.x branch. Although this makes the XSLT stylesheet presented in this blog post obsolete, I’ll leave it here for the sake of documentation. eXist users who want to test XQueries in eXist-1.4 are now encouraged to use its built-in XQuery Unit Testing framework instead.

[UPDATE 2011-01-05]: The XSLT stylesheet has been extended with missing features:

  • [feature]: added @trace handling
  • [feature]: added <xpath> handling
  • [feature]: added <store-files> handling
  • [feature]: added context handling for util:eval()
  • [fix]: <![CDATA[ ]]> in output: spaces required…

[UPDATE 2010-12-09]: The XSLT stylesheet has been substantially reworked, to produce

  • more legible XQuery code
  • more reliable XQuery code, taking into account serialization options, and deriving the most sensible highlight-matches settings where necessary

Currently, I’m busy porting old XQuery code to the new Lucene FT index and search capabilities of the latest version of the eXist XML database. In doing so, I’m hitting a couple of bugs in this area that I’m trying to isolate, test and report as clearly as possible. This post discusses a means to use the same test files for both eXist-1.4 and eXist-trunk.


As a matter of fac(e)t: (mimicking) faceted searching in eXist

In hindsight, ever since I set out to develop search interfaces for XML text collections with the marvelous eXist XML database, I’ve been drawn to the concept of faceted search, even long before I knew it was called that. The recent integration of Lucene indexing and searching capabilities into eXist (since version 1.4) holds promise for efficient facet-oriented search features, such as integrating Lucene fields in search queries.


Full text queries in eXist: from Lucene to XML syntax

[UPDATE 2014-05-20]: The lucene2xml scripts have been modified:

  • [fix]: refined regex parsing
  • [feature]: added differentiation between ‘term’, ‘wildcard’, and ‘regex’ search terms, based on detection of metacharacters

[UPDATE 2011-08-09]: The lucene2xml scripts have been modified:

  • [feature]: added a couple of further conditions in $lucene2xml, in order to benefit from unified <exist:match> markers for adjacent phrase terms: differentiate between
    • phrase search: rewrite <near slop="<1"> to <phrase>
    • proximity search: copy <near slop=">=1">
  • [fix]: improved treatment of escaped parentheses inside proximity search expressions

Since version 1.4, the eXist native XML database implements a Lucene-based full text index. The main Lucene-aware search function, ft:query(), accepts queries expressed in two flavours: Lucene’s native query syntax, and an XML query syntax.

The XML query syntax was explicitly designed to allow for more expressive queries than is possible with the Lucene syntax. Most notably, eXist has extensions for:

  • fine-grained proximity searches with the <near> element (among other things, the possibility to specify that search terms can occur unordered)
  • regular expression searches with the <regex> element

This makes the XML syntax the more interesting option for developing a user search interface. A search interface could then allow users to input search queries in the (quite intuitive) Lucene fashion, while providing additional options for specifying extra search features (‘(un)ordered proximity search’, ‘regular expression search’). Behind the scenes, both pieces of user input (search query + additional parameters) can be translated to an XML expression of the search query.
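
The translation step can be sketched as follows. This is a toy in Python, not the lucene2xml scripts of this post: the function names, the set of characters treated as regex metacharacters, and the handling of the extra parameters are assumptions made for illustration. It only splits the input on whitespace, labels each token as ‘term’, ‘wildcard’ or ‘regex’, and wraps the result in either a <bool> or a <near> element, so it ignores quoted phrases, boolean operators and escaping.

```python
import re
import xml.etree.ElementTree as ET


def classify(token):
    """Guess the element name for a token from its metacharacters."""
    # Assumption: these characters mark a token as a regular expression.
    if re.search(r"[\[\]{}()|+\\^$]", token):
        return "regex"
    # Assumption: '*' and '?' on their own mark a wildcard search.
    if re.search(r"[*?]", token):
        return "wildcard"
    return "term"


def to_xml_query(user_input, slop=None, ordered=True):
    """Combine a query string and extra parameters into one XML query."""
    query = ET.Element("query")
    tokens = user_input.split()
    if slop is not None:
        # Proximity search: this sketch treats every token as a plain term.
        near = ET.SubElement(query, "near", slop=str(slop),
                             ordered="yes" if ordered else "no")
        for token in tokens:
            ET.SubElement(near, "term").text = token
    else:
        container = ET.SubElement(query, "bool")
        for token in tokens:
            ET.SubElement(container, classify(token)).text = token
    return ET.tostring(query, encoding="unicode")


# Plain input, plus an unordered proximity search requested via parameters.
print(to_xml_query("nature flour* wil[lt]"))
print(to_xml_query("snake heart", slop=5, ordered=False))
```

Keeping the extra options as separate parameters, rather than parsing them out of the query string, mirrors the interface idea above: the user types a Lucene-style query and ticks options, and only the translation layer needs to know the XML syntax.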


I’m so glad…

…my previous post has finally grown out-of-date!

FOP-1.0 has been released, which fixes the nasty bug where footnotes inside lists and tables got swallowed. There’s still an issue with overlapping content for footnotes inside columns, but I can live with that…

Hence, I could as well delete my previous entry but will leave it here for documentation’s sake.

Rendering footnotes in tables and lists with FOP

[UPDATE: Meanwhile, FOP-1.0 has been released, which fixes the bug that informed this post. The workaround described below thus is only relevant for users of FOP versions 0.92 to 0.95. For the happiest FOPping experience, stop reading here and grab your copy of FOP-1.0!]

[…or skip the discussion and just download the files]

The reason

During the past couple of years, I’ve gathered some experience working with XML and related standards (XSLT, XSL-FO, XQuery). Part of our professional document production chain involves rendering PDF output from XML sources. I’ve grown into a big fan of Apache’s open source FOP processor since its now ancient version 0.20.5. Although the FOP code has been substantially revised and improved long since, the versions up to version 0.95 were haunted by one serious bug, which kept me from switching to an up-to-date version of FOP: footnotes inside lists or table cells got swallowed in PDF output.

On the other hand, FOP’s XSL-FO compliance rate has risen substantially in recent versions, prompting me to find a way of dealing with this nasty show-stopper. Of course, I hope the FOP developers will be able to resolve this issue soon. In the meantime, I think I’ve found a way of circumventing (or at least alleviating) the problem (at stylesheet level; not at Java code level). Moreover, I think this approach might help other users as well, and other users might help improve this approach where it falls short.
