Full text queries in eXist: from Lucene to XML syntax

[UPDATE 2011-08-09]: The lucene2xml scripts have been modified:

  • [feature]: added a couple of further conditions in $lucene2xml, in order to benefit from unified <exist:match> markers for adjacent phrase terms: differentiate between
    • phrase search: rewrite <near slop="<1"> to <phrase>
    • proximity search: copy <near slop=">=1">
  • [fix]: improved treatment of escaped parentheses inside proximity search expressions

Since version 1.4, the eXist native XML database implements a Lucene-based full text index. The main Lucene-aware search function, ft:query() accepts queries expressed in two flavours:

The XML query syntax was explicitly designed to allow for more expressive queries than is possible with the Lucene syntax. Most notably, eXist has extensions for:

  • fine-grained proximity searches with the <near> element (a.o. the possibility to specify that search terms can occur unordered)
  • regular expression searches with the <regex> element

This makes the XML syntax the more interesting option for developing a user search interface. A search interface could then allow users to input search queries in the (quite intuitive) Lucene fashion, while providing additional options for specifying extra search features (‘(un)ordered proximity search’, ‘regular expression search’). Behind the scenes, both pieces of user input (search query + additional parameters) can be translated to an XML expression of the search query.

Obviously, the first step of such a translation involves parsing of a query in Lucene syntax and transforming it to its XML syntax equivalent.

This seems feasible in XQuery using a couple of eXist-specific extension functions. I’ll briefly sketch the approach below:

  1. translate a Lucene search string to an intermediate string mimicking the XML syntax, with some additions for later parsing of boolean operators
  2. parse the intermediary XML search string as XML with util:parse()
  3. transform the intermediary structures in the search query to full-fledged boolean expressions

The first step, transforming a Lucene search string to an intermediate string mimicking the XML syntax, can be done with an XQuery function that merely performs string replacement. Let’s call it local:parse-lucene():

declare function local:parse-lucene($string) {
  (: replace all symbolic booleans with lexical counterparts :)
  if (matches($string, '[^\\](\|{2}|&amp;{2}|!) ')) then
    let $rep := replace(
                  replace(
                    replace($string, '&amp;{2} ', 'AND ')
                  , '\|{2} ', 'OR ')
                , '! ', 'NOT ')
    return local:parse-lucene($rep)                
  (: replace all booleans with '<AND/>|<OR/>|<NOT/>' :)
  else if (matches($string, '[^<](AND|OR|NOT) ')) then
    let $rep := replace($string,
                        '(AND|OR|NOT) ',
                        '<$1/>')
    return local:parse-lucene($rep)
  (: replace all '+' modifiers with '<AND/>' :)
  else if (matches($string, '(^|[^\w"'])\+[\w"'(]')) then
    let $rep := replace($string,
                        '(^|[^\w"'])\+([\w"'(])',
                        '$1<AND type=_+_/>$2')
    return local:parse-lucene($rep)
  (: replace all '-' modifiers with '<NOT/>' :)
  else if (matches($string, '(^|[^\w"'])-[\w"'(]')) then
    let $rep := replace($string,
                        '(^|[^\w"'])-([\w"'(])',
                        '$1<NOT type=_-_/>$2')
    return local:parse-lucene($rep)
  (: replace round brackets with '<bool></bool>' :)
  else if (matches($string, '(^|\W|>)\(.*?\)(\^(\d+))?(<|\W|$)')) then
    let $rep := 
      (: add @boost attribute when string ends in ^\d :)
      if (matches($string, '(^|\W|>)\(.*?\)(\^(\d+))(<|\W|$)')) then
        replace($string, 
                '(^|\W|>)\((.*?)\)(\^(\d+))(<|\W|$)', 
                '$1<bool boost=_$4_>$2</bool>$5')
      else 
        replace($string, 
                '(^|\W|>)\((.*?)\)(<|\W|$)', 
                '$1<bool>$2</bool>$3')
    return local:parse-lucene($rep)
  (: replace quoted phrases with '<near slop=""></bool>' :)
  else if (matches($string, '(^|\W|>)("|').*?\2([~^]\d+)?(<|\W|$)')) then
    let $rep := 
      (: add @boost attribute when phrase ends in ^\d :)
      if (matches($string, '(^|\W|>)("|').*?\2([\^]\d+)?(<|\W|$)')) then 
        replace($string, 
                '(^|\W|>)("|')(.*?)\2([~^](\d+))?(<|\W|$)', 
                '$1<near boost=_$5_>$3</near>$6')
      (: add @slop attribute in other cases :)
      else 
        replace($string, 
                '(^|\W|>)("|')(.*?)\2([~^](\d+))?(<|\W|$)', 
                '$1<near slop=_$5_>$3</near>$6')
    return local:parse-lucene($rep)
  (: wrap fuzzy search strings in '<fuzzy min-similarity=""></fuzzy>' :)
  else if (matches($string, '[\w-[<>]]+?~[\d.]*')) then
    let $rep := replace($string,
                        '([\w-[<>]]+?)~([\d.]*)',
                        '<fuzzy min-similarity=_$2_>$1</fuzzy>')
    return local:parse-lucene($rep)
  (: wrap resulting string in '<query></query>' :)
  else concat('<query>',
              replace(normalize-space($string), '_', '"'),
               '</query>')
};

[NOTE: The single and double quotation marks in the match() and replace() functions should be escaped with their respective character entities. Unfortunately, posting them to a HTML blog destroys this escaping, and doubly escaping the ampersands of the character entities ended up… escaped themselves! Should you want to try this function, you’ll have to make sure to escape these characters.]

This results in a string representing an intermediary XML version of the query. For example, this function would translate the Lucene search string ‘(fillet OR "mal(ic)e done"~1) AND snake^4’ to:

<query>
  <bool>fillet <OR/><near slop="1">mal(ic)e done</near></bool> 
  <AND/>snake^4
</query>

This must be processed further, in order to get all boolean operators (and the occurrence indicators of their members) right. As booleans involve pairing of terms and hence (re)grouping, this conversion optimally operates on the corresponding XML structure of the preprocessed string. Therefore, it should be parsed to XML with eXist’s util:parse() function.

Option 1: conversion to final XML with XSLT

Since I’m most proficient in XSLT, I’ve implemented a first proof of concept for this conversion in an XSLT stylesheet that takes care of the boolean operators. This stylesheet can be applied to the intermediary XML search structure with the transform:transform() function:

<xsl:stylesheet 
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema" 
     exclude-result-prefixes="#all" version="2.0">
    
  <xsl:output indent="yes"/>
    
  <xsl:template match="query">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
        <bool>
          <xsl:apply-templates select="node()"/>
        </bool>
      </xsl:copy>
  </xsl:template>
    
  <xsl:template match="*[(following-sibling::*[1]|preceding-sibling::*[1])
       [self::AND or self::OR or self::NOT]]">
    <xsl:variable name="name">
      <xsl:choose>
        <xsl:when test="(self::phrase|self::near)[not(@slop &gt; 0)]">phrase</xsl:when>
        <xsl:otherwise><xsl:value-of select="name()"/></xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <xsl:element name="{{$name}}">
      <xsl:attribute name="occur">
        <xsl:choose>
          <xsl:when test="preceding-sibling::*[1][self::AND]">must</xsl:when>
          <xsl:when test="preceding-sibling::*[1][self::NOT]">not</xsl:when>
          <xsl:otherwise>
            <xsl:choose>
              <xsl:when test="following-sibling::*[1]
                   [self::AND or self::OR or self::NOT][not(@type)]">should</xsl:when>
              <xsl:otherwise>should</xsl:otherwise>
            </xsl:choose>
          </xsl:otherwise>
        </xsl:choose>            
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>        
  </xsl:template>
    
  <xsl:template match="query/text()[normalize-space()]|bool/text()[normalize-space()]">
    <xsl:variable name="current" select="."/>
    <xsl:for-each select="tokenize(., '\s+')[normalize-space()]">
      <!-- here is the place for further differentiation between  
           term / wildcard / regex elements -->
      <term>
        <xsl:attribute name="occur">
          <xsl:choose>
            <xsl:when test="position() = 1 and $current/preceding-sibling::*[1]
                 [self::AND]">must</xsl:when>
            <xsl:when test="position() = 1 and $current/preceding-sibling::*[1]
                 [self::NOT]">not</xsl:when>
            <xsl:otherwise>
              <xsl:choose>
                <xsl:when test="position() = 1 and $current/following-sibling::*[1]
                     [self::AND or self::OR or self::NOT][not(@type)]">should</xsl:when>
                <xsl:otherwise>should</xsl:otherwise>
              </xsl:choose>
            </xsl:otherwise>
          </xsl:choose>            
        </xsl:attribute>
        <xsl:if test="matches(., '(.*?)(\^(\d+))(\W|$)')">
          <xsl:attribute name="boost">
            <xsl:value-of select="replace(., '(.*?)(\^(\d+))(\W|$)', '$3')"/>
          </xsl:attribute>
        </xsl:if>
        <xsl:value-of select="normalize-space(replace(., '(.*?)(\^(\d+))(\W|$)', '$1'))"/>
      </term>
    </xsl:for-each>
  </xsl:template>
        
  <xsl:template match="AND|OR|NOT" priority="1"/>
    
  <xsl:template match="near/bool">
    <xsl:value-of select="concat('(', ., ')')"/>
  </xsl:template>

  <xsl:template match="*">
    <xsl:variable name="name">
      <xsl:choose>
        <xsl:when test="(self::phrase|self::near)[not(@slop &gt; 0)]">phrase</xsl:when>
        <xsl:otherwise><xsl:value-of select="name()"/></xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <xsl:element name="{{$name}}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="@*|node()" priority="-1">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Suppose (for quick prototyping’s sake) this XSLT stylesheet is stored in a variable $lucene2xml. Now all we have to do in the main query is process the input $string with local:parse-lucene(), parse its output with util:parse(), and transform it with transform:transform():

let $string := '(fillet OR "mal(ic)e done"~1) AND snake^4'
let $luceneParse := local:parse-lucene($string)
let $luceneXML := util:parse($luceneParse)
return transform:transform($luceneXML, $lucene2xml, ())

For the input string ‘(fillet OR "mal(ic)e done"~1) AND snake^4’, this produces following XML query:

<query>
  <bool>
    <bool occur="must">
      <term occur="must">fillet</term>
      <near occur="should" slop="1">mal(ic)e done</near>
    </bool>
    <term occur="must" boost="4">snake</term>
  </bool>
</query>

Option 2: conversion to final XML with XQuery function

Since the XSLT stylesheet described above is quite simple, it can easily be formulated as an XQuery function as well. Apart from possible performance issues (I can’t make any sensible claims about this), a full XQuery solution is probably the more elegant option. Therefore, I’ve prepared an alternative version, expressing the XSLT stylesheet as an XQuery function:

declare function local:lucene2xml($node) {
  typeswitch ($node)
    case element(query) return 
      element { node-name($node)} {
        element bool {
          $node/node()/local:lucene2xml(.)
        }
      }
    case element(AND) return ()
    case element(OR) return ()
    case element(NOT) return ()
    case element(bool) return
      if ($node/parent::near) then
        concat("(", $node, ")") 
      else element {node-name($node)} {
        $node/@*,
        $node/node()/local:lucene2xml(.)
      }
    case element() return
      let $name := 
        if (($node/self::phrase|$node/self::near)[not(@slop > 0)]) then 
          'phrase' 
        else node-name($node)
      return 
        element { $name } {
          $node/@*,
          if (($node/following-sibling::*[1]|
               $node/preceding-sibling::*[1])
              [self::AND or self::OR or self::NOT]) then
            attribute occur { 
              if ($node/preceding-sibling::*[1][self::AND]) then 'must'
              else if ($node/preceding-sibling::*[1][self::NOT]) then 'not'
              else if ($node/following-sibling::*[1]
                       [self::AND or self::OR or self::NOT][not(@type)]) then 'should' 
              else 'should'
            }
          else (),
          $node/node()/local:lucene2xml(.)
        }
    case text() return 
      if ($node/parent::*[self::query or self::bool]) then
        for $tok at $p in tokenize($node, '\s+')[normalize-space()]
        (: here is the place for further differentiation 
           between  term / wildcard / regex elements :)
        (: using regex-regex detection (?): 
           matches($string, '((^|[^\\])[.?*+()\[\]\\^]|\$$)') :)
        let $el-name := 'term'
        return element { $el-name } {
          attribute occur {
            if ($p = 1 and $node/preceding-sibling::*[1]
                [self::AND]) then 'must'
            else if ($p = 1 and $node/preceding-sibling::*[1]
                [self::NOT]) then 'not'
            else if ($p = 1 and $node/following-sibling::*[1]
                [self::AND or self::OR or self::NOT][not(@type)]) then 'should'
            else 'should'
          },
          if (matches($tok, '(.*?)(\^(\d+))(\W|$)')) then
            attribute boost {
              replace($tok, '(.*?)(\^(\d+))(\W|$)', '$3')
            }
            else (),
          normalize-space(replace($tok, '(.*?)(\^(\d+))(\W|$)', '$1'))
        }
      else 
        normalize-space($node)
  default return
    $node
};

Instead of the XSLT transformation in the example above, this XQuery function can be called as the last step in the main query:

let $string := '(fillet OR "mal(ic)e done"~1) AND snake^4'
let $luceneParse := local:parse-lucene($string)
let $luceneXML := util:parse($luceneParse)
return local:lucene2xml($luceneXML/node())

The output is (and should be) the same as above.

Conclusion

So far for this proof of concept. What’s missing still are the extra search parameters:

  • wrap regex searches in <regex> instead of <term>
  • specify the ordered status of a proximity search in an @order attribute for <near>

In order to achieve this, those options could be passed as parameters to the XSLT stylesheet or the XQuery function. Both the XSLT stylesheet and XQuery function have comments at the place where the former option should be processed. In the current approach, the @occur attribute could already be added in the local:parse-lucene() function. Also, the best place for inserting the <fuzzy> tag is subject to reconsideration (maybe it’s more logical to treat it on a par with other ‘atomic’ search components <term>, <regex>, and <wildcard>).

However, eXist currently limits regex and wildcard searches to a boolean context (<bool>), where they can occur instead of <term>. Allowing them inside <near> (or <phrase>) will have to be implemented (and considered) first. I’ll wait for the outcome of this implementation before finishing the script, which then will either allow <regex> or <wildcard> anywhere, or have to restrict it to more limited contexts. To be continued…

In the mean time, a wrapped-up version of this proof of concept XQuery can be found:

  • here for the XSLT version
  • here for the XQuery version
About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: