Full text queries in eXist: from Lucene to XML syntax

[UPDATE 2014-05-20]: The lucene2xml scripts have been modified:

  • [fix]: refined regex parsing
  • [feature]: added differentiation between ‘term’, ‘wildcard’, and ‘regex’ search terms, based on detection of metacharacters

[UPDATE 2011-08-09]: The lucene2xml scripts have been modified:

  • [feature]: added a couple of further conditions in $lucene2xml, in order to benefit from unified <exist:match> markers for adjacent phrase terms: differentiate between
    • phrase search: rewrite <near slop="<1"> to <phrase>
    • proximity search: copy <near slop=">=1">
  • [fix]: improved treatment of escaped parentheses inside proximity search expressions

Since version 1.4, the eXist native XML database implements a Lucene-based full text index. The main Lucene-aware search function, ft:query(), accepts queries expressed in two flavours: Lucene’s own query syntax, and an eXist-specific XML syntax.

The XML query syntax was explicitly designed to allow for more expressive queries than is possible with the Lucene syntax. Most notably, eXist has extensions for:

  • fine-grained proximity searches with the <near> element (including, among other things, the possibility to specify that search terms can occur unordered)
  • regular expression searches with the <regex> element

This makes the XML syntax the more interesting option for developing a user search interface. A search interface could then allow users to input search queries in the (quite intuitive) Lucene fashion, while providing additional options for specifying extra search features (‘(un)ordered proximity search’, ‘regular expression search’). Behind the scenes, both pieces of user input (search query + additional parameters) can be translated to an XML expression of the search query.
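
To make the idea a bit more tangible, here is a minimal sketch in Python of what such a translation could look like. It is not the lucene2xml script discussed in this post: the function names (classify, to_xml_query), the metacharacter heuristics, and the choice to pass slop and ordering as separate parameters are all assumptions made for illustration. The element names (<query>, <bool>, <term>, <wildcard>, <regex>, <phrase>, <near>) are those of eXist’s XML query syntax as I understand it.

```python
import re
from xml.sax.saxutils import escape


def classify(token: str) -> str:
    """Guess an element name from the metacharacters in a token (heuristic)."""
    if re.search(r"[\[\]{}()|+\\^$]", token):
        return "regex"
    if re.search(r"[*?]", token):
        return "wildcard"
    return "term"


def to_xml_query(user_input: str, slop: int = 0, ordered: bool = True) -> str:
    """Translate bare tokens and "quoted phrases" into an XML query string.

    slop and ordered stand in for the extra options a search form might
    offer; they only affect quoted phrases.
    """
    parts = []
    # Each match is either a quoted phrase (group 1) or a bare token (group 2).
    for phrase, token in re.findall(r'"([^"]*)"|(\S+)', user_input):
        if phrase:
            if slop < 1:
                # No slop requested: a plain phrase search.
                parts.append(f"<phrase>{escape(phrase)}</phrase>")
            else:
                # Slop requested: a proximity search, ordered or not.
                order = "yes" if ordered else "no"
                parts.append(
                    f'<near slop="{slop}" ordered="{order}">{escape(phrase)}</near>'
                )
        elif token:
            name = classify(token)
            parts.append(f"<{name}>{escape(token)}</{name}>")
    return "<query><bool>" + "".join(parts) + "</bool></query>"


print(to_xml_query('"fillet steak" mac*'))
print(to_xml_query('"fillet steak"', slop=3, ordered=False))
```

The resulting string would still have to be parsed into an XML node before being handed to ft:query(). A real translator also needs to deal with boolean operators, grouping, and escaped characters, which this sketch ignores.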


I’m so glad…

…my previous post has finally grown out-of-date!

FOP-1.0 has been released, which fixes the nasty bug where footnotes inside lists and tables got swallowed. There’s still an issue with overlapping content for footnotes inside columns, but I can live with that…

Hence, I might as well delete my previous entry, but I will leave it here for documentation’s sake.

Rendering footnotes in tables and lists with FOP

[UPDATE: Meanwhile, FOP-1.0 has been released, which fixes the bug that informed this post. The workaround described below thus is only relevant for users of FOP versions 0.92 to 0.95. For the happiest FOPping experience, stop reading here and grab your copy of FOP-1.0!]

[…or skip the discussion and just download the files]

The reason

During the past couple of years, I’ve gathered some experience working with XML and related standards (XSLT, XSL-FO, XQuery). Part of our professional document production chain involves rendering PDF output from XML sources. I’ve been a big fan of Apache’s open source FOP processor ever since its now ancient version 0.20.5. Although the FOP code has been substantially revised and improved since then, all versions up to 0.95 were haunted by one serious bug, which kept me from switching to an up-to-date version of FOP: footnotes inside lists or table cells got swallowed in PDF output.

On the other hand, FOP’s XSL-FO compliance rate has risen substantially in recent versions, prompting me to find a way of dealing with this nasty show-stopper. Of course, I hope the FOP developers will be able to resolve this issue soon. In the meantime, I think I’ve found a way of circumventing (or at least alleviating) the problem at stylesheet level, not at Java code level. Moreover, I think this approach might help other users as well, and other users might help improve this approach where it falls short.
