An XQuery script for listing the contents of collections in eXist-db

This ‘ll be a fairly straightforward post: an utility script for recursively listing all sub-collections and resources of collections in an eXist-db database (version 2.0). I often find myself looking for this, and stumbling my head against errors raised by insufficient resource permissions. Hence, this script has some checks built in that should avoid permission-related errors, and get following info for the resources available:

  • name
  • path inside the eXist db
  • (for collections) number of files or sub-collections
  • (for files) MIME type, file size
  • permission information: owner, group, permissions

Read more of this post


Internal URL Rewriting with eXist’s MVC Framework

Since version 1.4, the eXist native XML database has been equipped with a Model View Controller (MVC) framework designed to express the logic for request routing of eXist-based web applications in XQuery. In this post I’ll illuminate a (in my opinion) somewhat under-exposed feature of eXist’s MVC framework: internal URL rewriting. With this term, I mean the fact that a URL, say http://localhost:8080/exist/urltest/test.xql is resolved internally to another URL like http://localhost:8080/exist/urltest/xquery/test.xql. Internally, meaning that the original request is not redirected to another one, and the user still sees the original URL in the browser address bar. As section 1of this post will illustrate, this works like a charm for ‘simple’ rewrites, like the previous one, but requires some thought if you would like to ‘chain’ multiple internal rewrite rules. In this post, I’ll try to provide a flexible coding pattern to achieve such internal rewriting with eXist’s MVC framework.

Read more of this post

From KWIC display to KWIC(er) processing with eXist

The eXist XML database has a dedicated XQuery module for displaying search results in a fixed context window, a visualization that is commonly known as a KeyWord In Context view. Search results are presented with a preceding and following text context (called further in this text left and right text context):

    <span class="previous">... s effect, sir; after what flourish your </span>
    <span class="hi">nature</span>
    <span class="following"> will.</span>

This formatting of search results invites to exploit its particular features, such as sorting the search results according to their left or right contexts, or even according to the nth word preceding or following the search term. This is heavily facilitated by the XML representation of the KWIC search results, where all three parts are isolated in their own XML element. However, while eXist’s current KWIC display module (as it is consistently called) does its job in presenting a KWIC display, in my opinion it is too much display-oriented:

  • it lacks performance on large result sets, and / or wide context widths, which is crucial for further processing, since sorting requires pre-computation of the entire result set
  • (though this is nitpicking:) the output is presentational HTML; while this is irrelevant from a processing point of view, I would prefer a semantically more ‘neutral’ format and defer presentational formatting to a later display phase

This post will address both objections and present alternatives. Additionally, ways for processing these KWIC results are discussed in the last section.

Read more of this post

Venturing into versions: strategies for querying a TEI apparatus with eXist

When encoding a critical edition in XML, one of the challenges facing the text encoder is finding a way to represent multiple versions of a work in a sensible way. As usual when it comes to the electronic representation of texts in the field of the humanities, such a sensible way is provided by the Text Encoding Initiative (TEI). Actually, three ways are offered, though this post will focus on the so-called parallel-segmentation method (for extensive reference, the reader is directed to chapter 12: Critical Apparatus of the TEI Guidelines). In short: this method allows an encoder to represent all text versions of a work within a single XML source, where places with variant text are encoded as an inline apparatus (<app>), in which the distinct variants are identified as readings (<rdg wit=”[sigil]”>), whose @wit attribute links them to (an) identified version(s) of the work. At this point, a lot more could be said about both edition and markup theoretic aspects, but this won’t be the focus of this post.

Instead, this post will focus on a topic I saw myself confronted with when developing an application (i.e. a web interface) for such an edition: how do you search within such ‘multiversion’ texts? Most probably, users of the edition would want to focus on one (or a selection of) text version(s). Of course, when version 1 contains the word ‘hope’, which in version 2 had been changed to ‘despair’, (only) the right readings should be retrieved for the respective text version.

Read more of this post

As a matter of fac(e)t: (mimicking) faceted searching in eXist

In hindsight, since I set out developing search interfaces for XML text collections with the marvelous eXist XML database, I’ve been drawn to the concept of faceted search, even long before I knew it was called that way. The recent integration of Lucene indexing and searching capabilities into eXist (since version 1.4) holds promises for efficient facet-oriented search features such as integrating Lucene fields in search queries.

Read more of this post

Full text queries in eXist: from Lucene to XML syntax

[UPDATE 2014-05-20]: The lucene2xml scripts have been modified:

  • [fix]: refined regex parsing
  • [feature]: added differentiation between ‘term’, ‘wildcard’, and ‘regex’ search terms, based on detection of metacharacters

[UPDATE 2011-08-09]: The lucene2xml scripts have been modified:

  • [feature]: added a couple of further conditions in $lucene2xml, in order to benefit from unified <exist:match> markers for adjacent phrase terms: differentiate between
    • phrase search: rewrite <near slop="<1"> to <phrase>
    • proximity search: copy <near slop=">=1">
  • [fix]: improved treatment of escaped parentheses inside proximity search expressions

Since version 1.4, the eXist native XML database implements a Lucene-based full text index. The main Lucene-aware search function, ft:query() accepts queries expressed in two flavours:

The XML query syntax was explicitly designed to allow for more expressive queries than is possible with the Lucene syntax. Most notably, eXist has extensions for:

  • fine-grained proximity searches with the <near> element (a.o. the possibility to specify that search terms can occur unordered)
  • regular expression searches with the <regex> element

This makes the XML syntax the more interesting option for developing a user search interface. A search interface could then allow users to input search queries in the (quite intuitive) Lucene fashion, while providing additional options for specifying extra search features (‘(un)ordered proximity search’, ‘regular expression search’). Behind the scenes, both pieces of user input (search query + additional parameters) can be translated to an XML expression of the search query.

Read more of this post

%d bloggers like this: