Venturing into versions: strategies for querying a TEI apparatus with eXist

When encoding a critical edition in XML, one of the challenges facing the text encoder is representing multiple versions of a work in a sensible way. As usual in the electronic representation of texts in the humanities, such a sensible way is provided by the Text Encoding Initiative (TEI). Actually, three ways are offered, though this post will focus on the so-called parallel-segmentation method (for extensive reference, the reader is directed to chapter 12: Critical Apparatus of the TEI Guidelines). In short: this method allows an encoder to represent all text versions of a work within a single XML source, where places with variant text are encoded as an inline apparatus (<app>), in which the distinct variants are identified as readings (<rdg wit="[siglum]">), whose @wit attribute links them to one or more identified versions of the work. At this point, a lot more could be said about both editorial and markup-theoretic aspects, but these won't be the focus of this post.

Instead, this post will focus on a topic I found myself confronted with when developing an application (i.e. a web interface) for such an edition: how do you search within such ‘multiversion’ texts? Most likely, users of the edition will want to focus on one (or a selection of) text version(s). Of course, when version 1 contains the word ‘hope’, which in version 2 has been changed to ‘despair’, only the right readings should be retrieved for the respective text version.

Enter XQuery and the eXist XML database. As with the scholarly-editing background, a decent familiarity with the basic concepts of indexing and searching documents in eXist will be assumed. The focus will lie on theoretical aspects of two key problems: 1) full-text searching and 2) index retrieval of terms in ‘multiversion’ texts. Two possible approaches are discussed and evaluated:

  1. Querying a single ‘multiversion’ source text
  2. Splitting up the text into separate source texts per text version and querying those

Note: though this is quite a technical discussion, you can download appSearch_db.zip, an eXist backup file containing all files discussed here. They can be installed in one click in an eXist database by choosing to restore this backup file.

1. Single source text

I’ve started exploring this approach from the desire to do something clever with a ‘multiversion’ text containing all variants occurring in all text versions included in the edition. It starts from a single indexed source text containing transcriptions for all distinct text versions. For example, let’s take a 2-paragraph text for which 4 text versions have been collated (which will be indexed as test.xml in a dedicated collection in the eXist database, namely /db/test):

<div xmlns="http://www.tei-c.org/ns/1.0">
  <listWit>
    <witness xml:id="w1"/>
    <witness xml:id="w2"/>
    <witness xml:id="w3"/>
    <witness xml:id="w4"/>
  </listWit>
  <p>This is a paragraph with common text.</p>
  <p>This paragraph has 
    <app>
      <rdg wit="#w1 #w3">variants</rdg>
      <rdg wit="#w2">variant text</rdg>
      <rdg wit="#w4">variant test</rdg>
    </app>.
  </p>
</div>

This ‘single source’ approach requires some consideration w.r.t. index configuration and the search and index lookup scripts. Assume the <p> elements will be the text unit we’re interested in for querying this document. Since the <rdg> elements containing text variants are contained by <p>, there’s no way to exclude the contents of irrelevant <rdg>s from the search space. Hence, without precautions, full-text searches on paragraphs will always include text variants from all text versions enclosed in <app> elements. In order to be able to address the contents of the embedded <rdg> elements separately, both <p> and <rdg> should be indexed separately. This can be specified in a dedicated configuration file at /db/system/config/db/test/collection.xconf:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <validation mode="auto"/>
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <fulltext default="none" attributes="no"/>
        <lucene>
            <text qname="tei:p">
                <ignore qname="tei:rdg"/>
            </text>
            <text qname="tei:rdg"/>
        </lucene>
        <create qname="@wit" type="xs:string"/>
    </index>
</collection>

This index definition excludes the content of <rdg> elements from searches focusing on <p>, like //tei:p[ft:query(., 'test')], while still allowing them to be queried directly with queries like //tei:rdg[ft:query(., 'test')].

1.1 Search script

The main concern for the search script is to separate the search contexts with common text (shared among all text versions) from version-specific search contexts (represented as <rdg> elements within <app>). The following XQuery script illustrates a possible approach:

declare namespace tei="http://www.tei-c.org/ns/1.0";
let $docs := doc('/db/test/test.xml')
let $rdgs := 
  for $rdg in $docs//tei:listWit//tei:witness/@xml:id 
  return concat('#', $rdg)
let $pool := $docs//tei:p
let $query := 'text'
return $pool[ft:query((.|.//tei:rdg[tokenize(string(@wit), '\s+') = $rdgs]), $query)]

Note how in this script the $rdgs variable is populated with the sigla of all text versions, as defined in the @xml:id attributes of the different <witness> elements in test.xml. This might seem to undermine the very point of wanting to address the text variants separately, but it is used here to illustrate the maximal query scenario (searching in all text versions). Narrower selections are possible, of course, and can be mimicked by replacing it with e.g. $rdgs := ('#w1', '#w4').

In this query, first the nodes to be queried are collected in a $pool variable (in this case, all <p> nodes inside test.xml). Note how, for real-life documents, this set of nodes could be expanded with all desired text elements one would want to query. The actual differentiation lies in the search expression, which performs a ft:query() full-text search on both this set of nodes and any embedded <rdg> elements whose @wit attributes contain references to (one of) the text version(s) selected for the search (by checking whether any of the tokens of the tokenized @wit attribute value equals any of the text versions defined in the $rdgs variable).
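The selection logic in this predicate — keep a <rdg> only when at least one whitespace-separated token of its @wit value matches a selected siglum — can be sketched outside XQuery as well. A minimal Python sketch (purely illustrative, not eXist code):

```python
# Selected sigla, as they would appear in the $rdgs variable.
selected = {"#w1", "#w4"}

def rdg_matches(wit_attr: str) -> bool:
    # Mimics the XQuery predicate tokenize(string(@wit), '\s+') = $rdgs:
    # true if any token of @wit equals any selected siglum.
    return any(token in selected for token in wit_attr.split())

print(rdg_matches("#w1 #w3"))  # True: #w1 is among the selected sigla
print(rdg_matches("#w2"))      # False: #w2 is not selected
```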

Now, the above query probably won’t yield very useful results: while all matching paragraphs are retrieved, they are presented as a bag of undifferentiated nodes without any duplicates. In other words: from these results it’s impossible to tell which search result occurs in which text version. This can be improved in the next version:

declare namespace tei="http://www.tei-c.org/ns/1.0";
let $docs := doc('/db/test/test.xml')
let $rdgs := for $rdg in $docs//tei:listWit//tei:witness/@xml:id return concat('#', $rdg)
let $pool := $docs//tei:p
let $query := 'text'

for $rdg in $rdgs
let $results := $pool[ft:query((.|.//tei:rdg[tokenize(string(@wit), '\s+') = $rdg]), $query)]
for $result in $results 
return <result wit="{$rdg}">{$result}</result>

By looping over the selected text versions defined in $rdgs, this script will repeat the full-text search for each text version and present all results per version, while identifying the associated text version by its siglum in a @wit attribute on the <result> element.

1.2 Index lookup script

The same differentiation technique can be applied in a script for lookup of indexed terms in a (selection of) text version(s):

declare namespace tei="http://www.tei-c.org/ns/1.0";
declare function local:term-callback($term, $data) {
  <term freq="{$data[1]}" docs="{$data[2]}" n="{$data[3]}">{$term}</term>
};

let $callback := util:function(xs:QName('local:term-callback'), 2)
let $docs := doc('/db/test/test.xml')
let $rdgs := for $rdg in $docs//tei:listWit//tei:witness/@xml:id return concat('#', $rdg)
let $pool := $docs//tei:p
let $query := 'text'
let $nodes := 
  for $a in $pool|$pool//tei:rdg[tokenize(string(@wit), '\s+') = $rdgs]
  (: here nodes can be refined first by querying, if required :)
  return $a(:[ft:query(., $query)]:)
for $term in util:index-keys($nodes, '', $callback, 15000, 'lucene-index')
(:order by $term/@freq/number() descending:) 
return $term

This script first selects the paragraphs and (only the) relevant <rdg> elements in the $nodes variable, using the technique sketched above. (Note how the comments are provided as hooks for further refinement of the search contexts and ordering of the search results.) Next, those selected nodes are passed to the eXist-specific util:index-keys() function, which collects their distinct index terms via the local:term-callback() function. When inspecting the results of the script above, some things catch the eye:

<term freq="1" docs="1" n="1">common</term>
<term freq="1" docs="1" n="2">has</term>
<term freq="2" docs="1" n="3">paragraph</term>
<term freq="1" docs="1" n="4">test</term>
<term freq="2" docs="1" n="5">text</term>
<term freq="2" docs="1" n="6">variant</term>
<term freq="1" docs="1" n="7">variants</term>

While this script does a decent job w.r.t. separating the common from the version-specific search contexts (try leaving out some sigla and compare the results), there is a problem with the statistics:

  • term frequency: all index terms in common text (i.e. text shared between all text versions) are counted only once

  • document count: the number of documents in which the terms occur is always 1 (as only one indexed document is queried)

Instead, the frequencies and document counts of terms in common text should be multiplied by the number of text versions being queried. For example, the word ‘common’ occurs once in all text versions and hence should amount to 4 occurrences in 4 documents. Yet, it’s less straightforward for index terms occurring in both common and version-specific text: the term ‘text’ occurs once within common text, and once within version ‘w2’, totalling 5 (4 + 1) occurrences in 4 documents. When statistics matter, this differentiation between the occurrences in the selected text versions should be accounted for in the index lookup script. This can be worked around (theoretically) by first generating separate index lists per version, and then adding up the separate statistics:
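The required arithmetic can be sketched in a few lines of Python (a hand-written illustration using the frequencies of the example text, not eXist code): common-term frequencies are multiplied by the number of versions queried, and per-version variant frequencies are added on top.

```python
num_versions = 4

# Term frequencies in the common text of the example (outside <app>).
common_freq = {"common": 1, "has": 1, "paragraph": 2, "text": 1}
# Term frequencies inside <rdg> elements, summed over the versions they belong to
# (e.g. 'variants' sits in a single <rdg> witnessed by both #w1 and #w3).
rdg_freq = {"text": 1, "variants": 2, "variant": 2, "test": 1}

# Common terms occur once per queried version; variant terms add per-version hits.
total = {term: freq * num_versions for term, freq in common_freq.items()}
for term, freq in rdg_freq.items():
    total[term] = total.get(term, 0) + freq

print(total["common"])  # 4: once in common text, counted for each of 4 versions
print(total["text"])    # 5: 4 common occurrences + 1 in version w2
```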

declare namespace tei="http://www.tei-c.org/ns/1.0";
declare function local:term-callback($term, $data) {
  <term freq="{$data[1]}" docs="{$data[2]}">{$term}</term>
};

let $callback := util:function(xs:QName('local:term-callback'), 2)
let $docs := doc('/db/test/test.xml')
let $rdgs := for $rdg in $docs//tei:listWit//tei:witness/@xml:id return concat('#', $rdg)
let $pool := $docs//tei:p
let $query := 'text'

let $terms := 
  for $rdg in $rdgs
  let $nodes := 
    for $a in $pool|$pool//tei:rdg[tokenize(string(@wit), '\s+') = $rdg]
    (: here nodes can be refined first by querying, if required :)
    return $a(:[ft:query(., $query)]:)
  return
    for $term in util:index-keys($nodes, '', $callback, 15000, 'lucene-index')
    return <term wit="{$rdg}">{
      $term/(@*, node())
    }</term>

let $conflateTerms := 
  for $term in distinct-values($terms)
  let $groupTerms := $terms[. eq $term]
  return <term>
  {
    for $att in $groupTerms[1]/@*[name() != 'wit']
    return attribute {$att/name()} {sum($groupTerms/@*[name() eq $att/name()])}
  }
  {$term}
  </term>
  
for $a in $conflateTerms
(:order by $a/@freq/number() descending:)
return $a

In this script, the $terms variable first collects all index terms per selected text version, by looping over the $rdgs variable and performing an index lookup on the nodes occurring in that text version. A next step conflates the distinct terms in the $conflateTerms variable, by grouping the unique terms in $terms and accumulating the statistics for their respective occurrences. This produces the correct statistics:

<term freq="4" docs="4">common</term>
<term freq="4" docs="4">has</term>
<term freq="8" docs="4">paragraph</term>
<term freq="5" docs="4">text</term>
<term freq="2" docs="2">variants</term>
<term freq="2" docs="2">variant</term>
<term freq="1" docs="1">test</term>

Due to the lack of a dedicated grouping mechanism in XQuery 1.0 (which should be addressed with the ‘group by’ clause in the upcoming XQuery 3.0 specification), the distinct-values() route used in the $conflateTerms variable is the only way to achieve this grouping, short of resorting to the undocumented ‘group by’ extension in eXist. This doesn’t scale well on large documents with many versions.
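The scaling problem is easy to see when both strategies are put side by side. A hypothetical Python sketch (not the XQuery itself): the distinct-values() route rescans the whole term list once per distinct term (quadratic), whereas hash-based grouping — which is what XSLT 2.0’s xsl:for-each-group and XQuery 3.0’s group by boil down to — needs only a single pass.

```python
# (term, freq) pairs as they might come out of the per-version index lookups.
terms = [("text", 4), ("text", 1), ("variants", 1), ("variants", 1)]

# distinct-values() style: one full scan of the list per distinct term -> O(n^2).
quadratic = {t: sum(f for t2, f in terms if t2 == t) for t, _ in terms}

# hash-grouping style: a single pass with constant-time lookups -> O(n).
linear = {}
for t, f in terms:
    linear[t] = linear.get(t, 0) + f

print(quadratic == linear)  # True: same totals, very different cost profiles
```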

One way of speeding this up a bit is delegating the conflation of the separate index terms to XSLT, using eXist’s transform:transform() function. Since version 2.0, XSLT has native grouping capabilities, which definitely outperform the distinct-values() approach in XQuery. Hence, the $conflateTerms variable could instead be computed via XSLT:

let $conflateXSLT := 
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">
    <xsl:template match="terms">
        <xsl:for-each-group select="term" group-by=".">
            <term>
                <xsl:for-each select="@*[name() != 'wit']">
                    <xsl:sort select="."/>
                    <xsl:variable name="attName" select="name()"/>
                    <xsl:attribute name="{{$attName}}">
                        <xsl:value-of select="sum(current-group()//@*[name() = $attName])"/>
                    </xsl:attribute>
                </xsl:for-each>
                <xsl:value-of select="current-grouping-key()"/>
            </term>
        </xsl:for-each-group>
    </xsl:template>    
</xsl:stylesheet>

let $conflateTerms := transform:transform(<terms>{$terms}</terms>, $conflateXSLT, ())

Still, this approach remains very fragile when it comes to performance: the main bottleneck is the requirement that full index lookups must be repeated for all selected text versions. Next, those (possibly huge) term collections must be ordered and their frequencies added. The performance of this approach hence depends entirely on:

  • the number of text versions selected for the search
  • the size of the node set to be searched

Again, further optimisation is possible, by restricting the number of index lookups as much as possible. This can be achieved by splitting up the previous $terms variable into two variables, containing only the common index terms and only the version-specific index terms, respectively:

declare namespace tei="http://www.tei-c.org/ns/1.0";
declare function local:term-callback($term, $data) {
 <term freq="{$data[1]}" docs="{$data[2]}">{$term}</term>
};

let $callback := util:function(xs:QName('local:term-callback'), 2)
let $docs := doc('/db/test/test.xml')
let $rdgs := for $rdg in $docs//tei:listWit//tei:witness/@xml:id return concat('#', $rdg)
let $pool := $docs//tei:p
let $query := 'text'

let $commonTerms := 
  let $nodes := 
    for $a in $pool
    (: here nodes can be refined first by querying, if required :)
    return $a(:[ft:query(., $query)]:)
  return
    for $term in util:index-keys($nodes, '', $callback, 15000, 'lucene-index')
    return <term>{
      (
      for $att in $term/@*
      return attribute {$att/name()} {$att * count($rdgs)}
      ,
      $term/node()
      )
    }</term>

let $rdgTerms := 
  for $rdg in $rdgs
  let $nodes := 
    for $a in $pool//tei:rdg[tokenize(string(@wit), '\s+') = $rdg]
    (: here nodes can be refined first by querying, if required :)
    return $a(:[ft:query(., $query)]:)
  return
    for $term in util:index-keys($nodes, '', $callback, 15000, 'lucene-index')
    return <term wit="{$rdg}">{
      $term/(@*, node())
    }</term>

let $conflateXSLT := 
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">    
    <xsl:template match="terms">
        <xsl:for-each-group select="term" group-by=".">
            <term>
                <xsl:for-each select="@*[name() != 'wit']">
                    <xsl:sort select="."/>
                    <xsl:variable name="attName" select="name()"/>
                    <xsl:attribute name="{{$attName}}">
                      <xsl:choose>
                        <xsl:when test="name() = 'docs'">
                          <xsl:value-of select="(current-group()[not(@wit)]//@*[name() = $attName],
                                                 sum(current-group()//@*[name() = $attName]))[1]"/>
                        </xsl:when>
                        <xsl:otherwise>
                          <xsl:value-of select="sum(current-group()//@*[name() = $attName])"/>
                        </xsl:otherwise>
                      </xsl:choose>
                    </xsl:attribute>
                </xsl:for-each>
                <xsl:value-of select="current-grouping-key()"/>
            </term>
        </xsl:for-each-group>
    </xsl:template>
</xsl:stylesheet>

let $conflateTerms := transform:transform(<terms>{$commonTerms, $rdgTerms}</terms>, $conflateXSLT, ())

for $a in $conflateTerms
(:order by $a/@freq/number() descending:)
return $a

The common terms (i.e. terms occurring outside of <rdg> elements, which can be assumed to be the majority of terms in a document) are now retrieved with a single util:index-keys() lookup within the $commonTerms variable. Note how the statistics are adjusted: since all of these terms occur in all selected text versions, their occurrence and document count numbers are multiplied by the number of text versions selected. This leaves the version-specific terms (i.e. those occurring within the relevant <rdg> elements) to be collected in the $rdgTerms variable. Again, the number of selected text versions determines the number of index lookups, but this time the node set on which the lookups are performed is cut down substantially, to only the relevant <rdg> nodes. Consequently, the subsequent conflation of the index terms has to deal with far fewer nodes and gains in efficiency. Note how the XSLT script had to be adapted for the computation of the total @docs metric: for terms occurring in both common and version-specific contexts, the total number of documents should serve as a cutoff point. This total is retrieved by selecting the @docs value of the <term> without a @wit attribute, when available. In plain language: this prevents such terms from being counted as occurring in more documents than the number of text versions selected.

Still, in a maximal scenario where full index scans are performed for large texts with many versions, this strategy could be prohibitively expensive. Before evaluating it, let’s have a look at another alternative for querying ‘multiversion’ texts, in the next section.

2. Index all versions separately

This approach simplifies on-the-fly computation of the different text versions enclosed in a parallel-segmented TEI text, by separating out all distinct text versions into complete XML source texts in their own right, and indexing and querying those ‘single-version’ texts separately in eXist.

Splitting up a ‘multiversion’ text into distinct text versions can be done easily enough via a batch XSLT script prior to indexing those texts, but this adds to the maintenance cost: if the ‘multiversion’ text changes, all derived versions have to be updated and reindexed as well. However, an appealing alternative may be found in eXist’s trigger facilities. Instead of prior batch processing, eXist can be made to apply the XSLT transformation automatically upon indexing of the ‘multiversion’ text.

2.1 Trigger setup

In order to configure triggers for the /db/test collection, a <trigger> section should be added to /db/system/config/db/test/collection.xconf:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <validation mode="auto"/>
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <fulltext default="none" attributes="no"/>
        <lucene>
            <text qname="tei:p">
                <ignore qname="tei:rdg"/>
            </text>
            <text qname="tei:rdg"/>
        </lucene>
    </index>

<!-- to be added to collection.xconf file of target collection -->
    <triggers>
        <trigger event="store,update" class="org.exist.collections.triggers.XQueryTrigger">
            <parameter name="url" value="xmldb:exist://localhost/db/test/triggers/splitRdgs.xql"/>
        </trigger>
    </triggers>
</collection>

This tells eXist to run the /db/test/triggers/splitRdgs.xql XQuery script for all documents that are added to or updated in the collection /db/test. Since this includes the subcollection /db/test/triggers that will hold the scripts needed for executing this trigger, it’s safest to prevent these scripts from activating the trigger when they are stored or updated themselves. This can be done by adding an empty collection.xconf file in the subcollection /db/system/config/db/test/triggers:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0"/>
</collection>

Now, let’s have a look at the XQuery script at /db/test/triggers/splitRdgs.xql (note: syntax is based on the trigger implementation in eXist-1.4.x, and subject to change after the rework of this implementation in the current development version):

declare namespace xmldb="http://exist-db.org/xquery/xmldb";
declare namespace tei="http://www.tei-c.org/ns/1.0";

declare variable $local:triggerEvent external;
declare variable $local:eventType external;
declare variable $local:collectionName external;
declare variable $local:documentName external;
declare variable $local:document external;
declare variable $local:triggersLogFile := "triggersLog.xml";

(: create the log file if it does not exist :)
let $logfile := 
  if (not(doc-available($local:triggersLogFile))) then
    xmldb:store("/db", $local:triggersLogFile, <triggers/>)
  else ()
let $doc := doc($local:documentName)
return
if ($local:eventType eq 'finish' and $doc//tei:rdg) then 
  let $xsl := doc(concat($local:collectionName, '/triggers/splitRdgs.xsl'))
  let $rdgs := $doc//tei:witness/@xml:id  
  for $rdg in $rdgs
  let $params :=
    <parameters>
      <param name="t" value="#{$rdg}"/>
    </parameters>
  let $transformDoc := transform:transform($doc, $xsl, $params)
  let $docName := replace(tokenize($local:documentName, '/')[last()], 
                          '(.+)(\.[^.]+)$', concat('$1', '_', $rdg, '$2'))
  let $storeDoc :=  xmldb:store($local:collectionName, $docName, $transformDoc) 
  return update 
    insert 
      <trigger event="{$local:triggerEvent}" eventType="{$local:eventType}"
               collectionName="{$local:collectionName}" documentName="{$local:documentName}" 
               timestamp="{current-dateTime()}">{$rdg}</trigger>  
    into doc("/db/triggersLog.xml")/triggers
else ()

Basically, what this script does is detect whether the document being stored or updated contains any <rdg> elements. If so, it applies a splitting XSLT stylesheet per identified text version, and stores the result of each transformation as a document whose name is based on the ‘multiversion’ filename, suffixed with an underscore and the siglum of that version. The XSLT stylesheet referred to lives at /db/test/triggers/splitRdgs.xsl and could look as follows:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="#all"
    version="2.0">
    
    <xsl:param name="t"/>
    
    <xsl:template match="tei:app">
        <xsl:apply-templates select="*[tokenize(@wit, '\s+') = $t]/node()" />
    </xsl:template>                
    
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
        
</xsl:stylesheet>

This effectively replaces all <app> elements with the content of the relevant text version, and copies all other content literally.
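For illustration, applying this stylesheet to test.xml with the parameter t set to ‘#w2’ should yield a single-version document along these lines (reconstructed by hand, modulo whitespace, rather than actual eXist output):

```xml
<div xmlns="http://www.tei-c.org/ns/1.0">
  <listWit>
    <witness xml:id="w1"/>
    <witness xml:id="w2"/>
    <witness xml:id="w3"/>
    <witness xml:id="w4"/>
  </listWit>
  <p>This is a paragraph with common text.</p>
  <p>This paragraph has variant text.</p>
</div>
```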

To summarise, this trigger setup will store the original document, as well as a separate text per text version, with the following characteristics:

  • document name: original document name suffixed with ‘_’ + siglum

  • all <app> elements are removed; only content of <rdg> elements relevant to that version is preserved

2.2 Search script

Separating all text versions into distinct texts greatly reduces the complexity of the search script. The only place where versions have to be taken into account is the determination of the file names to be included in the search:

declare namespace tei="http://www.tei-c.org/ns/1.0";
let $rdgs := for $a in doc('/db/test/test.xml')//tei:listWit//tei:witness/@xml:id return concat('#', $a)
let $docs := for $rdg in $rdgs return doc(concat('/db/test/test', replace($rdg, '#', '_'), '.xml'))
let $pool := $docs//tei:p
let $query := 'text'
for $hit in $pool[ft:query(., $query)]
return <result wit="{$hit/replace(substring-after(util:document-name(.), '_'), '\.[^.]+$', '')}">{
  $hit
}</result>

Note how the version sigla are used to determine the names of the documents to be included in the search. Like its ‘multiversion’ counterpart discussed in section 1.2 above, this will return all search hits while identifying the text version in which they occur in a @wit attribute on the <result> element.
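The filename manipulation behind the @wit value — substring-after() up to the first underscore, then stripping the file extension — can be sketched as follows (an illustrative Python transcription, not eXist code):

```python
import re

def siglum(document_name: str) -> str:
    # substring-after(util:document-name(.), '_')
    after = document_name.split("_", 1)[1]
    # replace(., '\.[^.]+$', ''): drop the file extension
    return re.sub(r"\.[^.]+$", "", after)

print(siglum("test_w2.xml"))  # w2
```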

2.3 Index lookup script

Likewise, the index lookup script is much simplified, in that only a single call to util:index-keys() needs to be made, irrespective of the number of text versions selected:

declare namespace tei="http://www.tei-c.org/ns/1.0";
declare function local:term-callback($term, $data) {
  <term freq="{$data[1]}" docs="{$data[2]}" n="{$data[3]}">{$term}</term>
};

let $callback := util:function(xs:QName('local:term-callback'), 2)
let $rdgs := for $a in doc('/db/test/test.xml')//tei:listWit//tei:witness/@xml:id return concat('#', $a)
let $docs := for $rdg in $rdgs return doc(concat('/db/test/test', replace($rdg, '#', '_'), '.xml'))
let $pool := $docs//tei:p
let $query := 'text'
let $nodes := 
  for $a in $pool
  (: here nodes can be refined first by querying, if required :)
  return $a(:[ft:query(., $query)]:)
for $term in util:index-keys($nodes, '', $callback, 15000, 'lucene-index')
(:order by $term/@freq/number() descending:) 
return $term

Note how no mention needs to be made of any <rdg> elements, since those have been filtered out when the single-version documents were generated.

3. Evaluation

This exercise started from the theoretical desire to be able to search ‘multiversion’ TEI source texts out of the box. eXist’s indexing implementation allows for index definitions flexible enough to construct XQuery scripts that neatly cut their way through the different versions encoded within the single source text. While searching is quite performant, there is an important bottleneck when it comes to index lookup, whose cost is heavily dependent on the number of text versions and the size of their (selected) node sets. Although there is room for optimisation, this approach clearly has its limits and is hardly defensible in a maximal scenario, where a complete index scan is requested for all versions of a real-life text (say, to produce a frequency list). Yet, it could be considered an option for more limited scenarios:

  • when the scope of the index lookup is restricted by requiring one or more starting letters, performance improves greatly
  • when statistics are not important (in, say, an autocomplete scenario), the initial index lookup script performs quite well

On the other hand, splitting up a ‘multiversion’ text into its constituent text versions at first sight looked unwieldy from a maintenance perspective. Yet, eXist’s trigger implementation relieves this burden. It is beyond doubt that this approach greatly reduces the complexity of search and index lookup scripts, and has superior performance. After all, the util:index-keys() function exists precisely for efficiently collecting index statistics; generating partial lists of index terms and conflating them via XQuery, while feasible, has its price performance-wise.

Download: appSearch_db.zip, an eXist backup file containing all files discussed here. Just download and restore this backup file in your eXist database.
