From KWIC display to KWIC(er) processing with eXist

The eXist XML database has a dedicated XQuery module for displaying search results in a fixed context window, a visualization commonly known as a KeyWord In Context (KWIC) view. Search results are presented with a preceding and following text context (referred to below as the left and right text context):

<p>
    <span class="previous">... s effect, sir; after what flourish your </span>
    <span class="hi">nature</span>
    <span class="following"> will.</span>
</p>

This formatting of search results invites further exploitation of its particular features, such as sorting the search results according to their left or right contexts, or even according to the nth word preceding or following the search term. This is greatly facilitated by the XML representation of the KWIC search results, where each of the three parts is isolated in its own XML element. However, while eXist’s current KWIC display module (as it is consistently called) does its job in presenting a KWIC display, in my opinion it is too display-oriented:

  • it lacks performance on large result sets and/or wide context widths, which is crucial for further processing, since sorting requires pre-computation of the entire result set
  • (though this is nitpicking:) the output is presentational HTML; while this is irrelevant from a processing point of view, I would prefer a semantically more ‘neutral’ format and defer presentational formatting to a later display phase

This post will address both objections and present alternatives. Additionally, ways for processing these KWIC results are discussed in the last section.

Disclaimer: this will be the most technical post so far, a non-programmer’s attempt to explain some algorithms. The last parts, however, return to practical grounds. Furthermore, a basic knowledge of the KWIC display function is assumed; see the eXist documentation and function reference for full documentation.

1. Strategies for improving the KWIC display module

In order to anchor this discussion of the KWIC display module, I will refer to the last changed revision (9892) in the SVN trunk of the eXist project, rather than duplicating it here. It currently has 2 main entry points:

  • kwic:summarize($node, $config, $callback?): a simplified function that takes a node containing search hits, a configuration element specifying display options such as the context width or display format, and an optional callback function for specific processing of the left and right text contexts
  • kwic:get-summary($root, $node, $config, $callback?): a more complex function, adding one more parameter: a root node that serves as a cutoff point for the left or right text contexts

Actually, the kwic:get-summary() function is the basic function, since all the kwic:summarize() function does is determine a $root element for the text contexts and push further processing to kwic:get-summary().

The meat of the kwic:get-summary() function is a routine that determines the right amount of left and right text context to be included in the KWIC display. This is done by 2 helper functions, kwic:truncate-previous() and kwic:truncate-following(). Both accept the following arguments:

  • $root: the root node that serves as a cutoff point for the text contexts
  • $node: the current search term
  • $truncated: the text context text constructed so far
  • $max: the maximum length of the text context
  • $chars: the length of the text context constructed so far
  • $callback: a reference to a specific function for further processing of the left or right text contexts

Starting from the current search term, these functions will test if there is a preceding or following text node. If so, they will test whether this text node and the currently constructed text context exceed the maximum context length. If so, the text context is expanded with a substring of this text node, and this result is returned. If not, the truncation function calls itself recursively, to test for the text node preceding or following the currently selected text node, until the maximum context length is reached.

Since these recursive functions are called for each search hit, any inefficiency in them is multiplied, and bottlenecks add up. There is some room for improvement, by reducing the number of choices to be made and functions to be called. Let’s examine a tuned version of kwic:truncate-following():

1 declare function kwic:truncate-following($root as node()?, $node as node()?, 
    $truncated as item()*, $width as xs:int, $callback as function?) 
2  {
3    let $nextProbe := $node/following::text()[1]
4    let $next := 
5      if ($root[not(. intersect $nextProbe/ancestor::*)]) then () 
6      else $nextProbe  
7    let $probe := 
8      if (exists($callback)) then 
9        concat($truncated, for $a in $next return kwic:callback($callback, $nextProbe, "after"))
10      else concat($truncated, ' ', $nextProbe)
11   return
12     if (string-length($probe) gt $width) then
13       let $norm := concat(' ', normalize-space($probe))
14       return 
15         if (string-length($norm) le $width and $next) then
16           kwic:truncate-following($root, $next, $norm, $width, $callback)
17         else if ($next) then
18           concat(substring($norm, 1, $width), '...')
19         else 
20           $norm
21     else if ($next) then 
22       kwic:truncate-following($root, $next, $probe, $width, $callback)
23     else for $str in normalize-space($probe)[.] return concat(' ', $str)
24  };

This version has one argument less: the $chars argument is abandoned, and the $max argument renamed to $width (line 1). Like the original version, this function starts by determining a next candidate text node ($nextProbe), by selecting the first following text() node. Next (line 4), the $next variable will check whether a $root argument was supplied, and whether it contains the candidate text node. If so, or if no $root argument was supplied, the $nextProbe node is copied; else it is emptied. Finally (line 7), the currently truncated text is updated by concatenating the previously truncated context ($truncated) with the current candidate node, in the $probe variable. When a $callback argument was supplied, the candidate node is first processed by this function.

Next, the string length of this concatenated string is tested (line 12). If it does not exceed the maximum context width ($width) and a following text() node exists (line 21), the kwic:truncate-following() function is called anew with the updated truncated context (line 22). If there’s no following text() node, the normalized $probe string is returned (line 23). If, on the other hand, the $probe string length exceeds the maximum context width, its whitespace is normalized first (line 13), to make sure that whitespace (which is mostly irrelevant in XML) doesn’t eat up the context width. If the normalized string fits within $width after all and a following text() node exists, a next iteration of kwic:truncate-following() is called (line 16). If it is still too long, a substring of exactly $width characters is returned, followed by ‘...’ to indicate that more text follows and has been truncated (line 18). If no following text() node is left, the normalized string is returned as is (line 20).

The kwic:truncate-previous() function is analogous, but obviously constructs the $truncated string in the opposite direction. Note how the start position of the substring is determined by subtracting $width from the length of the normalized candidate context ($norm):

concat('...', substring($norm, string-length($norm) - $width + 1))
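As a minimal, self-contained sketch of this substring arithmetic: the helper name local:truncate-left and the sample string are made up for illustration and are not part of the KWIC module, which works on text nodes rather than on a single string.

```xquery
xquery version "1.0";

(: Hypothetical helper, for illustration only: keep the last $width
   characters of a whitespace-normalized string and mark the cut
   with a leading '...'. :)
declare function local:truncate-left($text as xs:string, $width as xs:integer)
  as xs:string
{
  let $norm := normalize-space($text)
  let $len := string-length($norm)
  return
    if ($len gt $width) then
      (: $len - $width + 1 is the position of the first character
         of the last $width characters :)
      concat('...', substring($norm, $len - $width + 1))
    else
      $norm
};

local:truncate-left("To this effect, sir; after what flourish your", 20)
```

With a width of 20, the call should return the last 20 characters of the sample string, preceded by the ellipsis. Strings that already fit within the width are returned unchanged.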

This design of the truncation functions tries to minimize the processing, by:

  • minimizing the number of tests and reducing redundancy (only 2 branching levels instead of 3 in the original KWIC module)
  • isolating the functions to be performed on the truncated text in the relevant branches of the decision tree

One costly operation is the determination whether the candidate text node still belongs to the $root cutoff node (line 5). This involves scanning all ancestors of the text node and checking whether the $root node is among them:

if ($root[not(. intersect $nextProbe/ancestor::*)]) then () 

(Note how the intersect operator is used, instead of the lazy evaluation $root//$hits in the original version of this function, which can be considered a bug.) This test has to be performed for each new text() node the truncation functions iterate over, which quickly adds up when the number of search hits and/or the context $width is large. This is another area for improvement of the original KWIC functions, which expect a mandatory root argument. This means that even for simple kwic:summarize() calls, a $root node is determined, namely the current node containing the search hits. Actually, this is redundant: when those nodes are expanded with util:expand(), they are returned as root nodes anyway (without the wider document context), containing a copy of their internal structure, with <exist:match> elements injected around the matching text fragments. Having a $root argument in the truncation functions is more expensive than not having it (since its presence triggers the ancestor lookup for the current text node), so it is unwise, performance-wise, to pass it by default (when the default use case for a KWIC display probably won’t need it anyway).
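The behaviour of this test can be tried out on a small in-memory fragment. This is only a sketch: the sample <SPEECH> element and the variable names are invented for illustration; only the predicate itself is taken from the function above.

```xquery
xquery version "1.0";

(: Sample data, made up for illustration. :)
let $speech :=
  <SPEECH>
    <SPEAKER>HAMLET</SPEAKER>
    <LINE>To be, or not to be</LINE>
  </SPEECH>
let $root := $speech/LINE
let $inside := $speech/LINE/text()
let $outside := $speech/SPEAKER/text()
let $noRoot := ()
return (
  (: true: $root is not among the ancestors of the candidate,
     so the candidate would be rejected :)
  exists($root[not(. intersect $outside/ancestor::*)]),
  (: false: $root is an ancestor, so the candidate would be kept :)
  exists($root[not(. intersect $inside/ancestor::*)]),
  (: false: with an empty $root there is nothing to filter,
     so the candidate would be kept as well :)
  exists($noRoot[not(. intersect $inside/ancestor::*)])
)
```

The third case is the interesting one for performance: when no $root is passed, the predicate is never applied to anything, so no ancestor lookup is needed for the candidate text node.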

That’s why the truncation functions can be tuned further by making the $root argument optional (see the “$root as node()?” definition in line 1 above). In order to propagate this optional $root parameter to the higher-level functions, those have to be adapted as well. In the kwic:get-summary() function, this simply requires declaring the $root argument as “node()?” (the question mark indicating it can be empty). For the kwic:summarize() function it then suffices to pass an empty sequence as the $root argument in its call to the kwic:get-summary() function. The following overview shows the adapted functions:

declare function kwic:get-summary($root as node()?, $node as element(exist:match), 
  $config as element(config)) as element() 
{
  kwic:get-summary($root, $node, $config, ())
};

declare function kwic:get-summary($root as node()?, $node as element(exist:match), 
  $config as element(config), $callback as function?) as element() 
{
  let $width := xs:int($config/@width)
  let $format := $config/@format
  let $ps := $config/@preserve-space = ('yes', 'true')
  
  let $prevTrunc := if ($ps) then kwic:truncate-previous-ps($root, $node, (), $width, $callback)
    else kwic:truncate-previous($root, $node, (), $width, $callback)
  let $followingTrunc := if ($ps) then kwic:truncate-following-ps($root, $node, (), $width, $callback)
    else kwic:truncate-following($root, $node, (), $width, $callback)
  return
    if ($format eq 'p') then
      <p>
        <span class="previous">{$prevTrunc}</span>
        {
          if ($config/@link) then
            <a class="hi" href="{$config/@link}">{ $node/text() }</a>
          else
            <span class="hi">{ $node/text() }</span>
        }
        <span class="following">{$followingTrunc}</span>
      </p>
    else if ($format eq 'table') then
      <tr>
        <td class="previous">{$prevTrunc}</td>
        <td class="hi">
        {
          if ($config/@link) then
            <a href="{$config/@link}">{$node/text()}</a>
          else
            $node/text()
        }
        </td>
        <td class="following">{$followingTrunc}</td>
      </tr>
    else
      <KWIC xmlns="http://exist-db.org/xquery/kwic">
        <prev>{$prevTrunc}</prev>
        <hit>{$node/text()}</hit>
        <next>{$followingTrunc}</next>
      </KWIC>
};

declare function kwic:summarize($hit as element(), 
  $config as element(config), $callback as function?) as element()* 
{
  let $expanded := util:expand($hit, "expand-xincludes=no")
  for $match in $expanded//exist:match
  return
    kwic:get-summary((), $match, $config, $callback)
};

2. An improved KWIC display module

The strategies discussed above have been implemented in an updated KWIC module that you can download and import in any XQuery. This updated KWIC display module provides exactly the same functionality as the original KWIC module, with improvements on the following points:

  • performance: for large node sets, this version of the function performs 30 to 35% faster (tested for very broad searches on 2 different test collections (non-public), for a number of search scenarios: varying context widths, varying $root nodes)
  • accuracy: this version of the function normalizes the whitespace in the left and right text context
    Of course, the truncation functions could be simplified even further by avoiding whitespace normalization (which reduces both the number of operations and decisions when truncating the text contexts). While this definitely can gain a couple of seconds on large node sets with large context widths, I feel strongly for the whitespace normalization, since it does a slightly more accurate job. But then, there’s no reason why the simplified truncation functions couldn’t be included in the KWIC module as well, and have the kwic:get-summary() function decide what version to call based on an extra configuration parameter. In the updated KWIC module, the left and right text contexts are normalized by default. This can be overridden by passing the value ‘yes’ (or ‘true’) to an extra attribute @preserve-space on the <config> element:

kwic:summarize(., <config width="40" preserve-space="yes"/>)

kwic:get-summary((), ., <config width="40" preserve-space="true"/>, ())

Another (cosmetic) change concerns the output formatting. The original KWIC module directly outputs HTML fragments, either as paragraph or table. Probably due to my biased text encoding background, this seems too display-oriented to my taste. In the updated KWIC module, a third (default) output display is provided, structured as follows:

<KWIC xmlns="http://exist-db.org/xquery/kwic">
  <prev>{$prevTrunc}</prev>
  <hit>{$node/text()}</hit>
  <next>{$followingTrunc}</next>
</KWIC>

This format is taken as the default one, while both other formats can be output by providing appropriate values for the @format attribute on <config>: ‘p’ for paragraphs, ‘table’ for tables.

To wrap up, here is an overview of the configuration options in the updated KWIC module:

  • width: a number indicating the context width
  • link: a URL to which the hit is linked
  • format: output format
    • p: an HTML paragraph, containing <span class="previous">, <span class="hi">, and <span class="following">
    • table: HTML table rows, containing <td class="previous">, <td class="hi">, and <td class="following">
    • KWIC (default): a <KWIC> element, containing <prev>, <hit>, and <next> elements
  • preserve-space: whitespace normalization
    • yes|true: preserve original whitespace inside left and right text contexts 
    • no (default): normalize whitespace inside left and right text contexts

These improvements open the way for more powerful exploitation of the KWIC results, as will be illustrated in the next section.

3. Processing KWIC results

3.1 Contextual Sorting

A more performant KWIC module eases further processing of the KWIC search results, beyond mere display functionality. Traditionally, applications offering KWIC display provide the option to sort the search results by their left and right text context. In order to do so in XQuery, the entire search result set must be collected and formatted as KWIC results first. Next, the left and right text contexts can be used to add sort keys. Let’s take the following example, based on the ‘Keyword In Context with Callback’ example in the eXist Sandbox application:

import module namespace kwic="http://exist-db.org/xquery/kwic" 
  at "xmldb:exist:///db/modules/kwic.xql";

declare function local:filter($node as node(), $mode as xs:string) as xs:string? {
  if ($node/parent::SPEAKER or $node/parent::STAGEDIR) then 
      ()
  else if ($mode eq 'before') then 
      concat($node, ' ')
  else 
      concat(' ', $node)
};

let $config := <config width="80" />
for $hit in doc("/db/shakespeare/plays/hamlet.xml")//SPEECH[ft:query(., "nature king sir")]
order by ft:score($hit) descending
return
  kwic:summarize($hit, $config,
            util:function(xs:QName("local:filter"), 2))

For all <SPEECH> elements in Hamlet that contain “nature”, “king”, or “sir”, this query returns each of these search hits, together with its left and right text context. Since no additional configuration options were passed besides @width, the whitespace in these results is normalized, and they are presented as <kwic:KWIC> chunks:

<KWIC xmlns="http://exist-db.org/xquery/kwic">
    <prev>The </prev>
    <hit>king</hit>
    <next> , sir ,--</next>
</KWIC>
<KWIC xmlns="http://exist-db.org/xquery/kwic">
    <prev>The king , </prev>
    <hit>sir</hit>
    <next> ,--</next>
</KWIC>
<KWIC xmlns="http://exist-db.org/xquery/kwic">
    <prev>To this effect, sir ; after what flourish your </prev>
    <hit>nature</hit>
    <next> will.</next>
</KWIC>
<!-- ... -->

Note how the “order by” expression specifies that these results are ordered per <SPEECH> element, in decreasing relevance order (determined by ft:score()).

Let’s refactor this query, so it will sort the results by keyword:

import module namespace kwic="http://exist-db.org/xquery/kwic" 
  at "xmldb:exist:///db/modules/kwic.xql";

declare function local:filter($node as node(), $mode as xs:string) as xs:string? {
  if ($node/parent::SPEAKER or $node/parent::STAGEDIR) then 
      ()
  else if ($mode eq 'before') then 
      concat($node, ' ')
  else 
      concat(' ', $node)
};

let $config := <config width="80" />
let $hits := doc("/db/shakespeare/plays/hamlet.xml")//SPEECH[ft:query(., "nature king sir")]
for $KWIC in $hits/kwic:summarize(., $config,
            util:function(xs:QName("local:filter"), 2))
order by $KWIC/kwic:hit ascending
return $KWIC

This time, the KWIC results are collected in their own variable $KWIC, and sorted on their <kwic:hit> elements. This will put the hits for ‘king’ first, followed by the hits for ‘nature’, and ‘sir’.

Suppose we’re interested in the words following those hits. It’s just a matter of adding a second sort key to regroup them. We just need to replace the “order by” line with the following one:

order by lower-case($KWIC/kwic:hit) ascending, 
         lower-case(replace($KWIC/kwic:next, '\W', ''))

Since sorting is case sensitive, the order by expression above transforms all sort keys to lower case, and ignores all non-word characters in the right text context.

<!-- ... -->
<KWIC xmlns="http://exist-db.org/xquery/kwic">
    <prev>A man may fish with the worm that hath eat of a </prev>
    <hit>king</hit>
    <next> , and cat of the fish that hath fed of that worm.</next>
</KWIC>
<KWIC xmlns="http://exist-db.org/xquery/kwic">
    <prev>...ood liege, I hold my duty, as I hold my soul, Both to my God and to my gracious </prev>
    <hit>king</hit>
    <next> : And I do think, or else this brain of mine Hunts not the trail of policy so s...</next>
</KWIC>
<KWIC xmlns="http://exist-db.org/xquery/kwic">
    <prev>A bloody deed! almost as bad, good mother, As kill a </prev>
    <hit>king</hit>
    <next> , and marry with his brother.</next>
</KWIC>
<KWIC xmlns="http://exist-db.org/xquery/kwic">
    <prev>The </prev>
    <hit>king</hit>
    <next> and queen and all are coming down.</next>
</KWIC>
<!-- ... -->
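To make the effect of the second sort key visible, here is a small sketch that can be run without any indexed data. The three right contexts are copied from the output above; the <next> wrapper with its @key attribute is only an illustration device and not something the KWIC module produces.

```xquery
xquery version "1.0";

(: Right contexts copied from the sample output; @key exposes the
   computed sort key for inspection. :)
for $next in (
  " , and marry with his brother.",
  " and queen and all are coming down.",
  " , and cat of the fish that hath fed of that worm."
)
(: strip every non-word character (including spaces), then lower-case :)
let $key := lower-case(replace($next, '\W', ''))
order by $key
return <next key="{$key}">{$next}</next>
```

Because the leading punctuation and spaces are stripped, all three keys start with “and”, and the order is decided by the word that follows.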

For sorting on the left text context, some additional processing is required: we’re interested in the words immediately preceding the search term; not in the first word of the left text context. Therefore, the left text context has to be reversed before sorting. This can be achieved by first tokenizing the string, reversing this sequence with the XQuery reverse() function, and reassembling this sequence to a string with the XQuery string-join() function:


order by lower-case($KWIC/kwic:hit) ascending, 
         lower-case(string-join(reverse(tokenize($KWIC/kwic:prev, '\W+')), ' ')) 
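The reversal can be tried out on a single left context. This is a sketch on a literal string taken from Hamlet; note that it adds a [.] predicate, which is my own addition here, to drop the empty token that tokenize() returns when the string ends in a separator.

```xquery
xquery version "1.0";

(: A left context as returned by the KWIC module, with trailing space. :)
let $prev := "That can I; At least, the whisper goes so. Our last "
(: [.] filters out empty strings, such as the one produced
   by the trailing space :)
let $words := tokenize($prev, '\W+')[.]
return lower-case(string-join(reverse($words), ' '))
```

This should yield “last our so goes whisper the least at i can that”, so that the word immediately preceding the search term comes first in the sort key.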

Likewise, instead of the entire left or right text, individual words at a certain position can be used for sorting. For example, suppose we want to sort the results on the third word preceding the search term (in descending order), and then on the second one following it (ascending). This can be achieved by the XQuery tokenize() function:


order by lower-case($KWIC/kwic:hit) ascending,
         lower-case(reverse(tokenize($KWIC/kwic:prev, '\W+')[.])[3]) descending,
         lower-case(tokenize($KWIC/kwic:next, '\W+')[.][2])

<!-- ... -->
<KWIC xmlns="http://exist-db.org/xquery/kwic">
    <prev>He that plays the </prev>
    <hit>king</hit>
    <next> shall be welcome; his majesty shall have tribute of me; the adventurous knight ...</next>
</KWIC>
<KWIC xmlns="http://exist-db.org/xquery/kwic">
    <prev>Of all the days i' the year, I came to't that day that our last </prev>
    <hit>king</hit>
    <next> Hamlet overcame Fortinbras.</next>
</KWIC>
<KWIC xmlns="http://exist-db.org/xquery/kwic">
    <prev>...es, to be demanded of a sponge! what replication should be made by the son of a </prev>
    <hit>king</hit>
    <next> ?</next>
</KWIC>
<KWIC xmlns="http://exist-db.org/xquery/kwic">
    <prev>That can I; At least, the whisper goes so. Our last </prev>
    <hit>king</hit>
    <next> , Whose image even but now appear'd to us, Was, as you know, by Fortinbras of N...</next>
</KWIC>
<!-- ... -->

Of course, while these sorting examples are still fairly basic, they could be useful already for linguistic exploration of documents indexed with eXist. A full linguistic tool kit would require more advanced features for deriving collocation information (what words occur with other words?), and statistics. That’s probably future music, but it sets one dreaming…

3.2 An (Experimental) Collocation Table

As an experiment, let’s see how far we get with a collocation table (caution: highly experimental!). Starting from KWIC search results, it is possible to compose a limited window of n words preceding and following the search term. When for each of these context slots all distinct words occurring at that slot are collected, a table can be composed listing all of these words-per-position. Let’s examine the steps in constructing such a collocation table, starting from KWIC search results (the full XQuery code can be downloaded here) :

import module namespace kwic="http://exist-db.org/xquery/kwic" at "xmldb:exist:///db/modules/kwic.xql";

declare function local:filter($node as node(), $mode as xs:string) as xs:string? {
  if ($node/parent::SPEAKER or $node/parent::STAGEDIR) then
      ()
  else if ($mode eq 'before') then
      concat($node, ' ')
  else
      concat(' ', $node)
};

(: context scope: number of preceding / following words :)
let $scope := 5
(: determine context width for KWIC results: 10 characters per context word, minimally 40 :)
let $cutoff := max(($scope * 10, xs:int(40)))
let $config := <config width="{$cutoff}" />
let $hits := doc("/db/shakespeare/plays/hamlet.xml")//SPEECH[ft:query(., "nature king sir")]
let $KWIC := $hits/kwic:summarize(., $config,
            util:function(xs:QName("local:filter"), 2))

This will produce a $KWIC variable with the search results formatted in a KWIC display (note: unordered, this time, since the results will be ordered later on):

<KWIC xmlns="http://exist-db.org/xquery/kwic">
    <prev>Long live the </prev>
    <hit>king</hit>
    <next> !</next>
</KWIC>
<KWIC xmlns="http://exist-db.org/xquery/kwic">
    <prev>In the same figure, like the </prev>
    <hit>king</hit>
    <next> that's dead.</next>
</KWIC>
<KWIC xmlns="http://exist-db.org/xquery/kwic">
    <prev>Looks it not like the </prev>
    <hit>king</hit>
    <next> ? mark it, Horatio.</next>
</KWIC>
<!-- ... -->

Further processing is split per distinct search term, in order to keep their collocations nicely separated. In a next step, both left and right contexts are prepared for further processing:

(: split up collocations per search term :)
for $term in distinct-values($KWIC//kwic:hit/lower-case(normalize-space(.)))
let $KWIChits := $KWIC[kwic:hit/lower-case(normalize-space(.)) eq $term]
order by $term
return
  (: prepare entire left / right contexts for tokenization: 
        -lower case
        -normalize whitespace
        -reverse $prev
  :)
  let $prev := 
    for $a in $KWIChits/kwic:prev/lower-case(.) 
    for $tok at $pos in reverse(tokenize($a, '\W+')[matches(., '[a-zA-Z]')])[position() <= $scope] 
    return <w p="{$pos}" w="{$tok}"/>
  let $next := 
    for $a in $KWIChits/kwic:next/lower-case(.) 
    for $tok at $pos in tokenize($a, '\W+')[matches(., '[a-zA-Z]')][position() <= $scope] 
    return <w p="{$pos}" w="{$tok}"/>

Both contexts are tokenized up to the number of context words defined in the $scope variable, and stored in the $prev and $next variables. Both contain a ‘bag’ of all tokenized words occurring either before or after the search term, each represented by a <w> element that holds its position relative to the search term in a @p attribute, and the lower-cased word itself in a @w attribute. For example, at this stage, the $prev variable will look as follows:

<w p="3" w="long"/>
<w p="2" w="live"/>
<w p="1" w="the"/>
<w p="5" w="the"/>
<w p="4" w="same"/>
<w p="3" w="figure"/>
<w p="2" w="like"/>
<w p="1" w="the"/>
<w p="5" w="looks"/>
<w p="4" w="it"/>
<w p="3" w="not"/>
<w p="2" w="like"/>
<w p="1" w="the"/>
<!-- ... -->

In the next stage, these ‘bags’ of single words occurring at n positions before or after the search term are regrouped per context position (re-using the @p value of the distinct <w> tags). Updated versions of the $prev and $next variables are created, containing a sorted list of all unique word forms per context position:

(: per context position, retrieve all distinct words :)
  let $prev := 
    for $context in reverse(1 to $scope)
    let $words := 
      let $tok := $prev[@p = $context]/@w
      for $b in distinct-values($tok) order by $b return <w>{$b}</w>
    return <context pos="-{$context}">{$words}</context>
  let $next := 
    for $context in (1 to $scope)
    let $words := 
      let $tok := $next[@p = $context]/@w
      for $b in distinct-values($tok) order by $b return <w>{$b}</w>
    return <context pos="{$context}">{$words}</context>

For example, the left context is broken into 5 positions before the search term, running from the 5th word before it up to the word immediately preceding it, each listing the distinct words that occur at that position. This will produce an updated $prev variable that looks as follows:

<context pos="-5">
    <w>an</w>
    <w>and</w>
    <w>battlements</w>
    <!-- ... -->
</context>
<context pos="-4">
    <w>and</w>
    <w>aside</w>
    <w>body</w>
    <!-- ... -->
</context>
<context pos="-3">
    <w>alone</w>
    <w>as</w>
    <w>be</w>
    <!-- ... -->
</context>
<context pos="-2">
    <w>a</w>
    <w>and</w>
    <w>before</w>
    <!-- ... -->
</context>
<context pos="-1">
    <w>a</w>
    <w>be</w>
    <w>bloat</w>
    <!-- ... -->
</context>

Finally, after these $prev and $next results have been collected, they are presented in an HTML table, where each position in the left and right contexts is represented in a column. Each word occurring in that position is then presented in its own row.

  let $max := max(($next|$prev)/count(w))

  (: spread out words-per-context over table rows :)
  return 
    <table border="1">{
      <tr>
        <th/>
        {
          for $a in $prev
          return <th>{$a/@pos/string()}</th>,
          <th>term</th>,
          for $a in $next
          return <th>{$a/@pos/string()}</th>
        }
      </tr>,
      for $i in (1 to $max)
      return 
        <tr>{
          <td>{$i}</td>,
          for $a in $prev
          return
            <td>{$a/w[$i]/text()}</td>,
          <th>{$term}</th>,
          for $a in $next
          return
            <td>{$a/w[$i]/text()}</td>
        }</tr>
    }</table>

This produces the following collocation tables (split per search term) for our query (note: this is only a summary; a full version can be found here):

#  -5           -4      -3           -2      -1      term  1     2        3      4       5
1  an           and     alone        a       a       king  and   a        a      a       and
2  and          aside   as           and     be      king  as    be       again  all     applaud
3  battlements  body    be           before  bloat   king  best  but      and    and     are
4  but          can     business     body    danish  king  but   cat      be     beggar  bed
5  by           clouds  conjuration  but     fat     king  caps  denmark  can    but     begin
6

#  -5        -4      -3           -2        -1     term    1       2        3          4        5
1  action    after   a            capital   baser  nature  and     and      and        absurd   alone
2  audience  and     am           days      for    nature  are     any      as         compell  and
3  can       are     and          fault     hast   nature  as      between  bear       devil    awake
4  crimeful  as      canker       flourish  in     nature  cannot  burnt    ever       evil     away
5  done      change  commendable  fools     of     nature  come    by       exception  grow     by
6

#  -5       -4       -3       -2   -1         term  1      2       3          4    5
1  a        a        a        a    ay         sir   after  a       are        an   a
2  against  away     approve  all  but        sir   an     aery    as         and  ambassador
3  dies     fell     are      am   cannot     sir   and    all     did        any  and
4  great    foolish  be       and  carriages  sir   are    answer  diligence  as   can
5  head     forget   but      ay   come       sir   but    away    done       but  clay
6

Note how the value of these collocation tables is fairly limited, though. Since the context words have been ordered alphabetically, all these tables can provide is an alphabetical list of words occurring per position before or after the search term (when read per column). It would of course be more interesting to have access to statistical information that can indicate the significance of the individual words per position. That would allow ordering them by statistical salience, so the tables could provide an overview of the most frequent words at their respective positions. Unfortunately, as such data is not (yet?) available in eXist, the meaningfulness of the collocation data presented above is quite questionable, whence the experimental character of this illustration. Anyway, if you’d like to experiment with it, feel free to download the updated KWIC module (kwic.xql), the full XQuery script (collocation_table.xq) and sample output (collocationSample.htm).
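Raw frequencies per position, at least, can be derived from the ‘bag’ of <w> elements collected earlier. The following is a rough sketch only, not part of the downloadable script: the sample data are made up, the @n attribute is my own addition, and a plain count is of course no substitute for a proper measure of statistical significance.

```xquery
xquery version "1.0";

(: Made-up bag of <w> elements in the shape used earlier:
   @p is the position, @w the lower-cased word. :)
let $prev := (
  <w p="1" w="the"/>, <w p="1" w="the"/>, <w p="1" w="a"/>,
  <w p="2" w="like"/>, <w p="2" w="live"/>, <w p="2" w="like"/>
)
for $context in distinct-values($prev/@p)
order by xs:integer($context)
return
  <context pos="-{$context}">{
    let $tok := $prev[@p = $context]/@w
    for $b in distinct-values($tok)
    (: number of occurrences of this word at this position :)
    let $n := count($tok[. = $b])
    (: most frequent first, alphabetical among equals :)
    order by $n descending, $b
    return <w n="{$n}">{$b}</w>
  }</context>
```

With this sample data, “the” and “like” come first in their respective positions, since each occurs twice. The resulting <context> elements have the same shape as before, so the table construction could presumably consume them unchanged.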
