As a matter of fac(e)t: (mimicking) faceted searching in eXist

In hindsight, since I set out developing search interfaces for XML text collections with the marvelous eXist XML database, I’ve been drawn to the concept of faceted search, even long before I knew it was called that way. The recent integration of Lucene indexing and searching capabilities into eXist (since version 1.4) holds promises for efficient facet-oriented search features such as integrating Lucene fields in search queries.

Although these features are not yet implemented, exchanges on the eXist-open mailing list suggest:

(Currently) eXist lacks optimised features for performant faceted searching (like e.g. the Lucene-based solr search engine). Yet, some index-based features of eXist can be used to add facets to XQuery search scripts already. Key to this approach is the util:index-keys() function, which returns indexed terms with some distribution statistics. Note, that the discussion in this post doesn’t claim any originality: the basic idea is already implemented in the eXist bibliographic demo web application, to which this discussion adds a small mechanism for defining and looping over multiple facets in an XQuery script.

I’ll illustrate this approach with an example based on the Shakespeare example XML example files that are shipped with eXist. This assumes that a) eXist-1.4 is installed, and b) the eXist examples have been set up with the admin web application. First, let’s add some more index definitions to the index configuration file that is stored in the database at ‘/db/system/config/db/shakespeare/collection.xconf’. Add two more range index definitions for the elements named ‘SPEAKER’, and ‘TITLE’, so the index configuration looks as follows:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <fulltext default="none" attributes="no"/>
        <lucene>
            <text qname="SPEECH">
                <ignore qname="SPEAKER"/>
            </text>
            <text qname="TITLE"/>
        </lucene>
        <ngram qname="SPEAKER"/>
        <!-- range indexes -->
        <!-- note: although path-based range indexes are liable
             to get deprecated (in favour of qname based ones),
             following range indexes are path-based in order to
             avoid a bug with util:index-keys() in eXist trunk
        -->
        <create path="//SPEAKER" type="xs:string"/>
        <create path="//TITLE" type="xs:string"/>
    </index>
</collection>

In order to actually put these index definitions to use, the ‘/db/shakespeare’ collection needs to be reindexed with the Java admin client.

A simple search

First, let’s issue a simple search on the Shakespeare plays:

let $coll := collection('/db/shakespeare') 
let $hits := $coll//SPEECH[ft:query(., 'lord')] 
return $hits

This will perform a straightforward search on the <SPEECH> nodes that contain the string ‘lord’. The ft:query() function operates on the Lucene full text index, which in this case was defined on all <SPEECH> elements. As the content of the <SPEAKER> elements has been ignored from the full text index (see the index definition above), this search will return 272 <SPEECH> elements whose <LINE> children contain the string ‘lord’:

<SPEECH>
  <SPEAKER>HAMLET</SPEAKER>
  <LINE>Not so, my lord; I am too much i' the sun.</LINE>
</SPEECH>
<SPEECH>
  <SPEAKER>HORATIO</SPEAKER>
  <LINE>The same, my lord, and your poor servant ever.</LINE>
</SPEECH>
<SPEECH>
  <SPEAKER>MARCELLUS</SPEAKER>
  <LINE>My good lord--</LINE>
</SPEECH>
<!-- ... -->

A simple facet with standard XQuery functions

This simple search can return more than just the speeches holding the word ‘lord’. Suppose that -besides the actual speeches- we are interested in the different dramatic characters that utter these speeches. We could thus add ‘speakers’ as a facet to the search. Speakers are encoded in a <SPEAKER> element per <SPEECH>. In the context of the previous XQuery snippet, they are easily accessible as $hits//SPEAKER.

Before implementing this in XQuery code, let’s consider what we’re after at this moment:

  • all speeches containing the word ‘lord’
  • all different speakers uttering those speeches

This is easy enough with XQuery’s distinct-values() function:

let $coll := collection('/db/shakespeare') 
let $hits := $coll//SPEECH[ft:query(., 'lord')] 
let $speakers := 
  for $a in distinct-values($hits/SPEAKER)
  return <speaker>{$a}</speaker>
return <results>{ 
  <speakers>{$speakers}</speakers>, 
  <hits>{$hits}</hits> 
}</results>

These results could then be presented to the user in an intelligible way. Typically, the facets (‘speakers’, in this case) will be offered as a kind of search refinement alongside the actual search results. However, often such facets are accompanied by an indication of the number of search results that correspond to that additional search filter. This information can be computed with standard XQuery functions as well, simply by counting how many times each distinct value occurs for a facet within the search results:

let $coll := collection('/db/shakespeare')
let $hits := $coll//SPEECH[ft:query(., 'lord')] 
let $speakers := 
  for $a in distinct-values($hits/SPEAKER) 
  return <speaker freq="{count($hits/SPEAKER[. eq $a])}">{$a}</speaker> 
return <results>{ 
  <speakers>{$speakers}</speakers>, 
  <hits>{$hits}</hits> 
}</results>

As the example illustrates, the construction of the $speakers facet now adds a @freq attribute that counts how many times each unique speaker name occurs within the set of speakers that utter the word ‘lord’:

<speakers>
  <speaker freq="10">LAERTES</speaker>
  <speaker freq="30">LORD POLONIUS</speaker>
  <speaker freq="13">HAMLET</speaker>
  <speaker freq="48">HORATIO</speaker>
  <!-- ... -->
</speakers>

However, this approach soon hits its limits: the performance of such queries directly correlates to a) the size of the search space (collection size and search hits), and b) the number of facets. The standard XQuery functions (‘distinct-values()’) and operators (‘eq’) don’t operate on eXist indexes and hence quickly suffer from degrading performance when complexity increases. Therefore, in the next section, an eXist-specific alternative will be proposed.

A simple facet with eXist-specific functions

The eXist XML database tries to optimise query performance, both by giving standard XQuery functions and operators access to its efficient indexes, and defining eXist-specific functions for querying those indexes. One such function is util:index-keys($node, $start-value, $function-reference, $max-number-returned), which looks up the different indexed values for a given node set ($node) that start with a specific start value ($start-value), delegates the processing of this index information to a dedicated function ($function-reference), and returns a maximal number of index entries ($max-number-returned).

The information returned to the helper function consists of:

  • the index key
  • the overall frequency of that index key within the specified node set ($node)
  • the number of documents in the specified node set ($node) in which the key occurs
  • the cardinal number of the index key in the total number of keys returned

Let’s illustrate this by reformulating the previous example. Note how in the facet generating part of the XQuery script the standard distinct-values() function is replaced with the eXist-specific util:index-keys() function.

declare function local:term-callback($term as xs:string, $data as xs:int+) as element() {
  <term freq="{$data[1]}" docs="{$data[2]}" n="{$data[3]}">{$term}</term> 
}; 
let $callback := util:function(xs:QName("local:term-callback"), 2)
let $coll := collection('/db/shakespeare') 
let $hits := $coll//SPEECH[ft:query(., 'lord')] 
let $speakers := util:index-keys($hits, '', $callback, 10000) 
return <results>{ 
  <speakers>{$speakers}</speakers>, 
  <hits>{$hits}</hits> 
}</results>

This produces the same list of speakers (only in alphabetical order by default), but this time they come enriched with basic statistical index information, e.g.:

<speakers>
  <!-- ... -->
  <term freq="7" docs="1" n="18">LADY MACBETH</term>
  <term freq="10" docs="1" n="19">LAERTES</term>
  <term freq="5" docs="1" n="20">LENNOX</term>
  <term freq="30" docs="1" n="21">LORD POLONIUS</term> 
  <!-- ... --> 
</speakers>

This call to the util:index-keys() function works directly on eXist indexes, and avoids expensive comparison and counting of the speakers for all unique speaker names. Hence, (when operating on range indexes – cf. infra) the util:index-keys() function can function as a ‘power’ version of the standard distinct-values() function.

There’s more facets to a gem

Now, let’s illustrate how the example query can be complicated by adding more facets. Suppose that besides the speakers, we are interested in the sections in which the search hits occur. In the Shakespeare documents, the distinct text divisions each have their own title:

    play /PLAY/TITLE
    act /PLAY/ACT/TITLE
    scene /PLAY/ACT/SCENE/TITLE

Remember how earlier we defined a range index on the <TITLE> element? This can be put to use for all of these facets. In order to keep the code minimal and easily extensible to other facets, facets can be defined as a set of XML elements:

<facets>
  <facet label="scenes">$hits/ancestor::SCENE/TITLE</facet>
  <facet label="acts">$hits/ancestor::ACT/TITLE</facet>
  <facet label="plays">$hits/ancestor::PLAY/TITLE</facet>
  <facet label="speakers">$hits/SPEAKER</facet>
</facets>

Each facet definition then consists of two pieces of information:

  • facet path: an XPath expressions for the facet, relative to the search hits (the text content of each <facet> element)
  • facet label: a string label to be used for identifying the different facets in the search results (the value for the @label attribute for each <facet> element)

The facet generation then loops over the defined facets, evaluates their XPath expressions using the eXist-specific util:eval() function, looks up the corresponding facet label , and generates the list of index entries for the facet:

declare function local:term-callback($term as xs:string, $data as xs:int+) as element() {
  <term freq="{$data[1]}" docs="{$data[2]}" n="{$data[3]}">{$term}</term> 
};
let $callback := util:function(xs:QName("local:term-callback"), 2) 
let $coll := collection('/db/shakespeare') 
let $hits := $coll//SPEECH[ft:query(., 'lord')] 
(: declare facets as XPath expressions, relative to the search hits :) 
let $facets := 
  <facets>
    <facet label="scenes">$hits/ancestor::SCENE/TITLE</facet>
    <facet label="acts">$hits/ancestor::ACT/TITLE</facet> 
    <facet label="plays">$hits/ancestor::PLAY/TITLE</facet>
    <facet label="speakers">$hits/SPEAKER</facet> 
  </facets> 
return <results> 
  <facets>{ 
    (: loop over facet XPaths, and evaluate them :)
    for $facet in $facets//facet 
    let $vals := util:eval($facet) 
    return <facet>{
      $facet/@label,
      util:index-keys($vals, '', $callback, 5)
    }</facet>
  }</facets>
  <hits>{$hits}</hits> 
</results>

The result looks as follows:

<results>
  <facets>
    <facet label="scenes">
      <term freq="1" docs="1" n="1">SCENE I.  A cavern. In the middle, a boiling cauldron.</term>
      <term freq="1" docs="1" n="2">SCENE I.  A churchyard.</term>
      <term freq="1" docs="1" n="3">SCENE I.  A room in POLONIUS' house.</term>
      <term freq="2" docs="1" n="4">SCENE I.  A room in the castle.</term>
      <term freq="1" docs="1" n="5">SCENE I.  Dunsinane. Ante-room in the castle.</term>
    </facet>
    <facet label="acts">
      <term freq="3" docs="3" n="1">ACT I</term>
      <term freq="3" docs="3" n="2">ACT II</term>
      <term freq="3" docs="3" n="3">ACT III</term>
      <term freq="3" docs="3" n="4">ACT IV</term>
      <term freq="3" docs="3" n="5">ACT V</term>
    </facet>
    <facet label="plays">
      <term freq="1" docs="1" n="1">The Tragedy of Hamlet, Prince of Denmark</term>
      <term freq="1" docs="1" n="2">The Tragedy of Macbeth</term>
      <term freq="1" docs="1" n="3">The Tragedy of Romeo and Juliet</term>
    </facet>
    <facet label="speakers">
      <term freq="1" docs="1" n="1">ATTENDANT</term>
      <term freq="1" docs="1" n="2">BALTHASAR</term>
      <term freq="4" docs="1" n="3">BANQUO</term>
      <term freq="3" docs="1" n="4">BERNARDO</term>
      <term freq="2" docs="1" n="5">Both Murderers</term>
    </facet>
  </facets> 
  <hits>
    <SPEECH>
      <SPEAKER>LAERTES</SPEAKER>
      <LINE>My dread lord,</LINE>
      <LINE>Your leave and favour to return to France;</LINE>
      <LINE>From whence though willingly I came to Denmark,</LINE>
      <LINE>To show my duty in your coronation,</LINE>
      <LINE>Yet now, I must confess, that duty done,</LINE>
      <LINE>My thoughts and wishes bend again toward France</LINE>
      <LINE>And bow them to your gracious leave and pardon.</LINE>
    </SPEECH>
    <!-- ... -->
  </hits>
</results>

Full text facets

Besides the values of the headings of the distinct text divisions that contain the search hits, it might be interesting to add yet another facet, listing the separate words of the speeches containing the search results. This will require a slightly different approach than for the facets already present. One thing of notice in the previous search results, is the fact that the index entries returned consist of whole phrases: ‘SCENE I. A cavern. In the middle, a boiling cauldron.’, ‘ACT I’. This is due to the way the nodes in those facets have been indexed. So far, the facets are constructed on the range indexes that have been defined on the <TITLE> and <SPEAKER> elements. Suppose we defined a range index on the <LINE> elements as well and scanned this with the util:index-keys() function, this would retrieve the entire contents of those lines, instead of the individual words. Which is clearly not very helpful for refining the search in a search interface.

This is where the Lucene full text index comes into play, which tokenises the text of the nodes for which it is defined, and indexes those tokens. The collection.xconf file above defines a Lucene full text index on the <SPEECH> nodes of the Shakespare documents (while ignoring the <SPEAKER> elements). In practice, this means that the index is populated with all tokenised words of the <LINE> elements that make up each <SPEECH>. In order to scan this index with util:index-keys(), we’ll have to be more specific about the index to be used. That is possible in an optional 5th parameter for that function, naming the index. In the case of the <SPEECH> elements, the keys of their full text index can be retrieved as follows:

util:index-keys(//SPEECH, '', $callback, 5, 'lucene-index')

Combined with the callback function defined elsewhere in the XQuery, this function call would then produce a list of index entries similar to this fragment:

<term freq="2" docs="2" n="1">abate</term>
<term freq="1" docs="1" n="2">abatements</term>
<term freq="1" docs="1" n="3">abbey</term>
<term freq="3" docs="3" n="4">abhorred</term>
<term freq="1" docs="1" n="5">abhors</term>

Note that when no index is specified for the util:index-keys() function, it seems to default to the range index. Only when other indexes are scanned, those have to be named in the last argument to util:index-keys(). (On the other hand, there seems no way to explicitly state the range index in the 5th parameter; ‘range-index’ won’t be recognised.)

Wrapping up, this full text facet can be integrated in the sample XQuery, with the provision that different versions of util:index-keys() must be used depending on the index on which the facet operates. In the following example, this is catered for by adding a @type attribute with value “FT” to the full text search facets, and using this information to determine which index util:index-keys() has to be applied to:

declare function local:term-callback($term as xs:string, $data as xs:int+) as element() {
  <term freq="{$data[1]}" docs="{$data[2]}" n="{$data[3]}">{$term}</term>
}; 
let $callback := util:function(xs:QName("local:term-callback"), 2)
let $coll := collection('/db/shakespeare') 
let $hits := $coll//SPEECH[ft:query(., 'lord')]
(: declare facets as XPath expressions, relative to the search hits :) 
let $facets := 
  <facets>
    <facet label="scenes">$hits/ancestor::SCENE/TITLE</facet> 
    <facet label="acts">$hits/ancestor::ACT/TITLE</facet> 
    <facet label="plays">$hits/ancestor::PLAY/TITLE</facet> 
    <facet label="speakers">$hits/SPEAKER</facet> 
    <facet label="fulltext" type="FT">$hits</facet> 
  </facets> 
return <results> 
  <facets>{ 
    (: loop over facet XPaths, and evaluate them :) 
    for $facet in $facets//facet
    let $vals := util:eval($facet) 
    let $index := if ($facet[@type eq 'FT']) then 'lucene-index' else () 
    return <facet>{ 
      $facet/@label, 
      if ($index) then 
        util:index-keys($vals, '', $callback, 5, $index) 
      else 
        util:index-keys($vals, '', $callback, 5) 
    }</facet> 
  }</facets>
  <hits>{$hits}</hits> 
</results>

Although at this stage, this XQuery script only provides a basic proof of concept, it illustrates how facets for certain information categories could be generated alongside search results. Of course, in order to reduce processing overhead, the $max-number-returned parameter of util:index-keys() can be reduced (as in these examples), so that only a limited number of facet entries are returned in a first phase. An option could then be offered to see more (or all) entries for these individual facets in a second step.

Conclusion

Although future integration of efficient Lucene search capabilities into eXist probably holds more promises for efficient faceted search implementations with eXist, some eXist-specific features (different indexes, util:index-keys()) provide a way to add search facets to query results already. Although the performance of the approach described in this post will probably decrease with very large data collections, the use of the util:index-keys() function seems to offer a reasonably stable option (performance-wise) for constructing search facets alongside a search.

About these ads

4 Responses to As a matter of fac(e)t: (mimicking) faceted searching in eXist

  1. Anne Schuth says:

    Hello Ron,

    We’ve ‘talked’ on the eXist mailing list. Just now I was trying to reconstruct the examples above, to compare the result to my own implementation of faceted search in eXist.
    It turns out that if I follow your example literally, I get no results for the approach listed under “A simple facet with eXist-specific functions”. I do get hits, but no statistics about speakers. Is it me, or is something missing?

    Thanks for looking into it!

    My implementation will be shared soon!

    Best,
    Anne

  2. rvdb says:

    Hi Anne,

    Thanks for spotting this. Apparently, there’s a bug in the current eXist trunk code (up to revision 12988, at least), where util:index-keys() fails to find range index keys when they’re defined on qnames. The problem has been reported on the eXist-open ML:

    For the time being, you can work around the issue by a) either experimenting with eXist-1.4.x (which doesn’t have the problem), or b) redefining the range indexes in this example as path-based ones. I’ll change the example in this post (had opted for qname indexes as path-based ones are liable to get deprecated in the future).

    I’m really curious about your implementation (which will undoubtedly be superior), and wish you good luck with your progress!

  3. Hi Anne,

    I would also be interested in your implementation of faceted search.

    For now I’m using a simple faceted search built with standard XQuery functions found in this article. However it’s too slow to handle the thousands documents I’m dealing with. I will try to use eXist-specific functions to see how much speedup I can get. If you have any advice, tip or trick please let me know!

    Thank you

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: