Full text queries in eXist: from Lucene to XML syntax

[UPDATE 2014-05-20]: The lucene2xml scripts have been modified:

  • [fix]: refined regex parsing
  • [feature]: added differentiation between ‘term’, ‘wildcard’, and ‘regex’ search terms, based on detection of metacharacters

[UPDATE 2011-08-09]: The lucene2xml scripts have been modified:

  • [feature]: added a couple of further conditions in $lucene2xml, in order to benefit from unified <exist:match> markers for adjacent phrase terms: differentiate between
    • phrase search: rewrite <near slop="<1"> to <phrase>
    • proximity search: copy <near slop=">=1">
  • [fix]: improved treatment of escaped parentheses inside proximity search expressions

Since version 1.4, the eXist native XML database implements a Lucene-based full text index. The main Lucene-aware search function, ft:query() accepts queries expressed in two flavours:

The XML query syntax was explicitly designed to allow for more expressive queries than is possible with the Lucene syntax. Most notably, eXist has extensions for:

  • fine-grained proximity searches with the <near> element (a.o. the possibility to specify that search terms can occur unordered)
  • regular expression searches with the <regex> element

This makes the XML syntax the more interesting option for developing a user search interface. A search interface could then allow users to input search queries in the (quite intuitive) Lucene fashion, while providing additional options for specifying extra search features (‘(un)ordered proximity search’, ‘regular expression search’). Behind the scenes, both pieces of user input (search query + additional parameters) can be translated to an XML expression of the search query.

Obviously, the first step of such a translation involves parsing of a query in Lucene syntax and transforming it to its XML syntax equivalent.

This seems feasible in XQuery using a couple of eXist-specific extension functions. I’ll briefly sketch the approach below:

  1. translate a Lucene search string to an intermediate string mimicking the XML syntax, with some additions for later parsing of boolean operators
  2. parse the intermediary XML search string as XML with util:parse()
  3. transform the intermediary structures in the search query to full-fledged boolean expressions

The first step, transforming a Lucene search string to an intermediate string mimicking the XML syntax, can be done with an XQuery function that merely performs string replacement. Let’s call it local:parse-lucene():

declare function local:parse-lucene($string) { if (matches($string, '[^\\](\|{2}|&amp;{2}|!) ')) then let $rep := replace(replace(replace($string, '&amp;{2} ', 'AND '), '\|{2} ', 'OR '), '! ', 'NOT ') return local:parse-lucene($rep) else if (matches($string, '[^<](AND|OR|NOT) ')) then let $rep := replace($string, '(AND|OR|NOT) ', '<$1/>') return local:parse-lucene($rep) else if (matches($string, '(^|[^\w"'])\+[\w"'(]')) then let $rep := replace($string, '(^|[^\w"'])\+([\w"'(])', '$1<AND type=_+_/>$2') return local:parse-lucene($rep) else if (matches($string, '(^|[^\w"'])-[\w"'(]')) then let $rep := replace($string, '(^|[^\w"'])-([\w"'(])', '$1<NOT type=_-_/>$2') return local:parse-lucene($rep) else if (matches($string, '(^|[\W-[\\]]|>)\(.*?[^\\]\)(\^(\d+))?(<|\W|$)')) then let $rep := if (matches($string, '(^|\W|>)\(.*?\)(\^(\d+))(<|\W|$)')) then replace($string, '(^|\W|>)\((.*?)\)(\^(\d+))(<|\W|$)', '$1<bool boost=_$4_>$2</bool>$5') else replace($string, '(^|\W|>)\((.*?)\)(<|\W|$)', '$1<bool>$2</bool>$3') return local:parse-lucene($rep) else if (matches($string, '(^|\W|>)("|').*?\2([~^]\d+)?(<|\W|$)')) then let $rep := if (matches($string, '(^|\W|>)("|').*?\2([\^]\d+)?(<|\W|$)')) then replace($string, '(^|\W|>)("|')(.*?)\2([~^](\d+))?(<|\W|$)', '$1<near boost=_$5_>$3</near>$6') else replace($string, '(^|\W|>)("|')(.*?)\2([~^](\d+))?(<|\W|$)', '$1<near slop=_$5_>$3</near>$6') return local:parse-lucene($rep) else if (matches($string, '[\w-[<>]]+?~[\d.]*')) then let $rep := replace($string, '([\w-[<>]]+?)~([\d.]*)', '<fuzzy min-similarity=_$2_>$1</fuzzy>') return local:parse-lucene($rep) else concat('<query>', replace(normalize-space($string), '_', '"'), '</query>') };

[NOTE: The single and double quotation marks in the match() and replace() functions should be escaped with their respective character entities. Unfortunately, posting them to a HTML blog destroys this escaping, and doubly escaping the ampersands of the character entities ended up… escaped themselves! Should you want to try this function, you’ll have to make sure to escape these characters.]

This results in a string representing an intermediary XML version of the query. For example, this function would translate the Lucene search string ‘(fillet OR "mal(ic)e done"~1) AND snake^4’ to:

<query> <bool>fillet <OR/><near slop="1">mal(ic)e done</near></bool> <AND/>snake^4 </query>

This must be processed further, in order to get all boolean operators (and the occurrence indicators of their members) right. As booleans involve pairing of terms and hence (re)grouping, this conversion optimally operates on the corresponding XML structure of the preprocessed string. Therefore, it should be parsed to XML with eXist’s util:parse() function.

Option 1: conversion to final XML with XSLT

Since I’m most proficient in XSLT, I’ve implemented a first proof of concept for this conversion in an XSLT stylesheet that takes care of the boolean operators. This stylesheet can be applied to the intermediary XML search structure with the transform:transform() function:

let $lucene2xml := <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="#all" version="2.0"> <xsl:output indent="yes"/> <xsl:template match="query"> <xsl:copy> <xsl:copy-of select="@*"/> <bool> <xsl:apply-templates select="node()"/> </bool> </xsl:copy> </xsl:template> <xsl:template match="*[(following-sibling::*[1]|preceding-sibling::*[1])[self::AND or self::OR or self::NOT]]"> <xsl:variable name="name"> <xsl:choose> <xsl:when test="(self::phrase|self::near)[not(@slop &gt; 0)]">phrase</xsl:when> <xsl:otherwise><xsl:value-of select="name()"/></xsl:otherwise> </xsl:choose> </xsl:variable> <xsl:element name="{{$name}}"> <xsl:attribute name="occur"> <xsl:choose> <xsl:when test="preceding-sibling::*[1][self::AND]">must</xsl:when> <xsl:when test="preceding-sibling::*[1][self::NOT]">not</xsl:when> <xsl:otherwise> <xsl:choose> <xsl:when test="following-sibling::*[1][self::AND or self::OR or self::NOT][not(@type)]">should</xsl:when> <xsl:otherwise>should</xsl:otherwise> </xsl:choose> </xsl:otherwise> </xsl:choose> </xsl:attribute> <xsl:apply-templates select="@*|node()"/> </xsl:element> </xsl:template> <xsl:template match="query/text()[normalize-space()]|bool/text()[normalize-space()]"> <xsl:variable name="current" select="."/> <xsl:for-each select="tokenize(., '\s+')[normalize-space()]"> <!-- here is the place for further differentiation between term / wildcard / regex elements --> <xsl:variable name="el-name" select=" if (matches($tok, '(^|[^\\])[$^|+\p{{P}}-[,]]')) then if (matches($tok, '(^|[^\\.])[?*+]|\[!')) then 'wildcard' else 'regex' else 'term' "/> <xsl:element name="{{$el-name}}"> <xsl:attribute name="occur"> <xsl:choose> <xsl:when test="position() = 1 and $current/preceding-sibling::*[1][self::AND]">must</xsl:when> <xsl:when test="position() = 1 and $current/preceding-sibling::*[1][self::NOT]">not</xsl:when> <xsl:otherwise> <xsl:choose> <xsl:when test="position() = 1 and $current/following-sibling::*[1][self::AND or self::OR or self::NOT][not(@type)]">should</xsl:when> <xsl:otherwise>should</xsl:otherwise> </xsl:choose> </xsl:otherwise> </xsl:choose> </xsl:attribute> <xsl:if test="matches(., '(.*?)(\^(\d+))(\W|$)')"> <xsl:attribute name="boost"> <xsl:value-of select="replace(., '(.*?)(\^(\d+))(\W|$)', '$3')"/> </xsl:attribute> </xsl:if> <xsl:value-of select="normalize-space(replace(., '(.*?)(\^(\d+))(\W|$)', '$1'))"/> </xsl:element> </xsl:for-each> </xsl:template> <xsl:template match="AND|OR|NOT" priority="1"/> <xsl:template match="near/bool"> <xsl:value-of select="concat('(', ., ')')"/> </xsl:template> <xsl:template match="*"> <xsl:variable name="name"> <xsl:choose> <xsl:when test="(self::phrase|self::near)[not(@slop &gt; 0)]">phrase</xsl:when> <xsl:otherwise><xsl:value-of select="name()"/></xsl:otherwise> </xsl:choose> </xsl:variable> <xsl:element name="{{$name}}"> <xsl:apply-templates select="@*|node()"/> </xsl:element> </xsl:template> <xsl:template match="@*|node()" priority="-1"> <xsl:copy> <xsl:apply-templates select="@*|node()"/> </xsl:copy> </xsl:template> </xsl:stylesheet>

declare function local:lucene2xml($node) { typeswitch ($node) case element(query) return element { node-name($node)} { element bool { $node/node()/local:lucene2xml(.) } } case element(AND) return () case element(OR) return () case element(NOT) return () case element() return let $name := if (($node/self::phrase|$node/self::near)[not(@slop > 0)]) then 'phrase' else node-name($node) return element { $name } { $node/@*, if (($node/following-sibling::*[1]|$node/preceding-sibling::*[1])[self::AND or self::OR or self::NOT]) then attribute occur { if ($node/preceding-sibling::*[1][self::AND]) then 'must' else if ($node/preceding-sibling::*[1][self::NOT]) then 'not' else if ($node/following-sibling::*[1][self::AND or self::OR or self::NOT][not(@type)]) then 'should' (:'must':) else 'should' } else (), $node/node()/local:lucene2xml(.) } case text() return if ($node/parent::*[self::query or self::bool]) then for $tok at $p in tokenize($node, '\s+')[normalize-space()] (: here is the place for further differentiation between term / wildcard / regex elements :) let $el-name := if (matches($tok, '(^|[^\\])[$^|+\p{P}-[,]]')) then if (matches($tok, '(^|[^\\.])[?*+]|\[!')) then 'wildcard' else 'regex' else 'term' return element { $el-name } { attribute occur { if ($p = 1 and $node/preceding-sibling::*[1][self::AND]) then 'must' else if ($p = 1 and $node/preceding-sibling::*[1][self::NOT]) then 'not' else if ($p = 1 and $node/following-sibling::*[1][self::AND or self::OR or self::NOT][not(@type)]) then 'should' (:'must':) else 'should' }, if (matches($tok, '(.*?)(\^(\d+))(\W|$)')) then attribute boost { replace($tok, '(.*?)(\^(\d+))(\W|$)', '$3') } else (), lower-case(normalize-space(replace($tok, '(.*?)(\^(\d+))(\W|$)', '$1'))) } else normalize-space($node) default return $node };

Suppose (for quick prototyping’s sake) this XSLT stylesheet is stored in a variable $lucene2xml. Now all we have to do in the main query is process the input $string with local:parse-lucene(), parse its output with util:parse(), and transform it with transform:transform():

let $string := '(fillet OR "mal(ic)e done"~1) AND snake^4' let $luceneParse := local:parse-lucene($string) let $luceneXML := util:parse($luceneParse) return transform:transform($luceneXML, $lucene2xml, ())


For the input string ‘(fillet OR "mal(ic)e done"~1) AND snake^4’, this produces following XML query:

<query> <bool> <bool occur="should"> <term occur="should">fillet</term> <near occur="should" slop="1">mal(ic)e done</near> </bool> <term occur="must" boost="4">snake</term> </bool> </query>

Option 2: conversion to final XML with XQuery function

Since the XSLT stylesheet described above is quite simple, it can easily be formulated as an XQuery function as well. Apart from possible performance issues (I can’t make any sensible claims about this), a full XQuery solution is probably the more elegant option. Therefore, I’ve prepared an alternative version, expressing the XSLT stylesheet as an XQuery function:

declare function local:lucene2xml($node) { typeswitch ($node) case element(query) return element { node-name($node)} { element bool { $node/node()/local:lucene2xml(.) } } case element(AND) return () case element(OR) return () case element(NOT) return () case element(bool) return if ($node/parent::near) then concat("(", $node, ")") else element {node-name($node)} { $node/@*, $node/node()/local:lucene2xml(.) } case element() return let $name := if (($node/self::phrase|$node/self::near)[not(@slop > 0)]) then 'phrase' else node-name($node) return element { $name } { $node/@*, if (($node/following-sibling::*[1]|$node/preceding-sibling::*[1])[self::AND or self::OR or self::NOT]) then attribute occur { if ($node/preceding-sibling::*[1][self::AND]) then 'must' else if ($node/preceding-sibling::*[1][self::NOT]) then 'not' else if ($node/following-sibling::*[1][self::AND or self::OR or self::NOT][not(@type)]) then 'should' (:'must':) else 'should' } else (), $node/node()/local:lucene2xml(.) } case text() return if ($node/parent::*[self::query or self::bool]) then for $tok at $p in tokenize($node, '\s+')[normalize-space()] (: here is the place for further differentiation between term / wildcard / regex elements :) (: currently differentiating between term and regex, based on detection of metacharacters :) let $el-name := if (matches($tok, '(^|[^\\])[$^|+\p{P}-[,]]')) then 'regex' else 'term' return element { $el-name } { attribute occur { if ($p = 1 and $node/preceding-sibling::*[1][self::AND]) then 'must' else if ($p = 1 and $node/preceding-sibling::*[1][self::NOT]) then 'not' else if ($p = 1 and $node/following-sibling::*[1][self::AND or self::OR or self::NOT][not(@type)]) then 'should' (:'must':) else 'should' }, if (matches($tok, '(.*?)(\^(\d+))(\W|$)')) then attribute boost { replace($tok, '(.*?)(\^(\d+))(\W|$)', '$3') } else (), lower-case(normalize-space(replace($tok, '(.*?)(\^(\d+))(\W|$)', '$1'))) } else normalize-space($node) default return $node };


Instead of the XSLT transformation in the example above, this XQuery function can be called as the last step in the main query:

let $string := '(fillet OR "mal(ic)e done"~1) AND snake^4' let $luceneParse := local:parse-lucene($string) let $luceneXML := util:parse($luceneParse) return local:lucene2xml($luceneXML/node())

The output is (and should be) the same as above.

Conclusion

So far for this proof of concept. What’s missing still are the extra search parameters:

  • further refinements to the interpretation of search terms as ‘term’, ‘wildcard’, or ‘regex’
  • specify the ordered status of a proximity search in an @order attribute for <near>

In order to achieve this, those options could be passed as parameters to the XSLT stylesheet or the XQuery function. Both the XSLT stylesheet and XQuery function have comments at the place where the former option should be processed. In the current approach, the @occur attribute could already be added in the local:parse-lucene() function. Also, the best place for inserting the <fuzzy> tag is subject to reconsideration (maybe it’s more logical to treat it on a par with other ‘atomic’ search components <term>, <regex>, and <wildcard>).

However, eXist currently limits regex and wildcard searches to a boolean context (<bool>), where they can occur instead of <term>. Allowing them inside <near> (or <phrase>) will have to be implemented (and considered) first. I’ll wait for the outcome of this implementation before finishing the script, which then will either allow <regex> or <wildcard> anywhere, or have to restrict it to more limited contexts. To be continued…

In the mean time, a wrapped-up version of this proof of concept XQuery can be found:

  • here for the XSLT version
  • here for the XQuery version
About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: