Reverse Proxying Tomcat Web Applications Behind Apache

As far as sysadmin feelings go, I’ve been living happily with my Tomcat web application server, caged behind a hidden port and fronted with an Apache web server via mod_jk. While preventing direct web access to the Tomcat server (which is generally considered insecure), this setup still makes Tomcat web applications publicly available via requests to the Apache web server. It served well for deploying our (mostly eXist-driven) web applications. Yet this all-embracing symbiotic happiness started to dwindle when I discovered that the latest generation of eXist code (the 2.x branch) seemed to have an issue with mod_jk. When piped through mod_jk, eXist-2.x doesn’t seem to be able to set or get session attributes (whereas all works fine when those same requests are sent straight to the Tomcat application server). After my efforts to switch application code from Cocoon-based sitemaps to eXist’s own MVC controller framework and to catch up with the latest eXist versions, this observation felt like a cold shower, especially since session attributes have proven a great means of adding state to my webapps and hence feature prominently in my webapp code logic.

Lacking any Java skills myself, my hopes for seeing this issue fixed were low, as it clearly falls in between both technologies (eXist and mod_jk). Moreover, eXist developers tend to prefer the mod_proxy Apache module over mod_jk for the communication between Apache and Tomcat, which reduces the chances that this mod_jk-related issue will have much priority. Fortunately, there is some basic eXist documentation on how to configure Apache to act as a reverse proxy for eXist webapps, which provided a good basis for investigating how I could switch my current mod_jk configuration to a reverse proxy setup with mod_proxy.

In this post, I’ll try to explain the different cliffs I came across in my scenario. I’ll set out by explaining some specifics of our configuration, and work my way from functional to optional proxy configuration. Before I start, I want to make two disclaimers. First, I’m only an accidental sysadmin who started this ‘investigation’ without much prior knowledge about (reverse) proxying. Still, I’ll try my best to explain things as understandably and accurately as possible. Second, although the issues I’ll describe apply to any setup where Tomcat apps are fronted with Apache via a reverse proxy, I’ll illustrate them with some aspects of the eXist webapps I’m familiar with. The scope is definitely broader than eXist, though.

1. Some Assumptions

Before starting this discussion, I will briefly explain my setup:

  • an Apache-2.2 web server running on port 80, configured for the domain http://mydomain/, with the modules mod_proxy and mod_proxy_http enabled
  • a Tomcat-7.0.27 web application server running on (hidden) port 8082 that is accessible on http://localhost:8082/ on the remote server
  • a Tomcat web application ‘myExistApp’, located at ${catalina.base}\webapps\myExistApp, and accessible via http://localhost:8082/myExistApp/ on the remote server
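
For reference, the module assumption in the first bullet might translate into something like the following lines in httpd.conf. This is only a sketch: the modules/ paths are an assumption that depends on how Apache was installed, and many distributions enable modules through their own mechanism instead. The two commented-out modules only come into play for the mod_rewrite variant in section 4 and the AJP connection in section 5.

```apache
# Paths are hypothetical; adjust them to the local Apache installation.
LoadModule proxy_module      modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so

# Only needed for the mod_rewrite variant and the AJP connection discussed later:
# LoadModule rewrite_module   modules/mod_rewrite.so
# LoadModule proxy_ajp_module modules/mod_proxy_ajp.so
```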

In this discussion I’ll try to stick to unambiguous terminology. Therefore, I’ll refer to the Apache web server as the proxy frontend, which passes requests on to the Tomcat web application server, the proxy backend.

2. Reverse Proxying Specific Webapps

It only takes a few tweaks to Apache’s httpd.conf file to get the example setup from the eXist proxy documentation working for the above configuration:

<VirtualHost *:80>
  ServerName mydomain 
  ServerAlias *.mydomain 
  ProxyRequests off 
  ProxyPass / http://localhost:8082/ 
  ProxyPassReverse / http://localhost:8082/ 
</VirtualHost>

This would cause all requests for http://mydomain/ to be forwarded to the Tomcat web application server running at http://localhost:8082/ on the remote server. The ‘myExistApp’ web application could then be accessed with http://mydomain/myExistApp/.

Voilà, sounds good, no? There is a gotcha, however: the root path (‘/’) specified as the first argument of the ProxyPass directive will cause the entire URI space of mydomain to be proxied, even http://mydomain/static.htm. Of course, such paths can be excluded explicitly, by passing ‘!’ instead of a backend URL in another ProxyPass directive, placed before the general rule:

ProxyRequests off 
ProxyPass /static.htm !
ProxyPass / http://localhost:8082/ 
ProxyPassReverse / http://localhost:8082/ 

Yet, this won’t work out very well unless there are some specific characteristics that distinguish paths that need proxying from paths that don’t. Before investigating this a bit closer, there’s a second issue that needs to be tackled. In some cases, web applications can generate so-called “self-referential URLs”: absolute URLs that refer to the same server that generated them. For example, the eXist Java webstart client, which allows administrative access to a remote database, needs access to the remote eXist instance. It will do so by generating absolute links to the eXist instance it lives in, by using the location headers it has access to. Yet, with these proxy settings, a request for http://mydomain/myExistApp/webstart/exist.jnlp (which should start the webstart client) will fail. Instead of the proxy frontend host name http://mydomain/, the webstart client will use the proxy backend host name and try to locate the eXist source code at http://localhost:8082/myExistApp/webstart/. The same goes for all links containing self-referential URLs that the ‘myExistApp’ web application will generate: those will all point to the proxy backend at http://localhost:8082/myExistApp/, instead of the frontend domain. Of course, a web browser encountering such links won’t be able to resolve them in any useful way (unless the client computer happens to run a web application of that name on that port): there’s no way to connect to the remote web application with such links.

Fortunately, the mod_proxy module offers a directive that passes the original frontend host name on to the proxy backend, instead of replacing it with the rewritten backend host name: ProxyPreserveHost. The original proxy settings can be improved by switching this setting ‘on’:

ProxyRequests     off 
ProxyPreserveHost on
ProxyPass /static.htm !
ProxyPass / http://localhost:8082/ 
ProxyPassReverse / http://localhost:8082/ 

This will make the name of the proxy frontend host http://mydomain/ available to the ‘myExistApp’ application, so that it can properly construct self-referential URLs to http://mydomain/myExistApp/webstart/.

In order to further improve these proxy settings, I’d like to do away with the global proxying rule, and rather limit proxying to the URI paths for the Tomcat web applications. This can easily be done by changing the proxy configuration to:

ProxyRequests     off
ProxyPreserveHost on
ProxyPass /myExistApp/ http://localhost:8082/myExistApp/ 
ProxyPassReverse /myExistApp/ http://localhost:8082/myExistApp/ 

These settings will leave requests for http://mydomain/static.htm alone, and only proxy those requests whose path starts with ‘/myExistApp/’. That’s fine, but what if I want to add a second Tomcat web application? The answer is quite straightforward: add a second rule for –say– ‘myExistApp2’:

ProxyRequests     off
ProxyPreserveHost on

ProxyPass /myExistApp/ http://localhost:8082/myExistApp/ 
ProxyPassReverse /myExistApp/ http://localhost:8082/myExistApp/ 

ProxyPass /myExistApp2/ http://localhost:8082/myExistApp2/ 
ProxyPassReverse /myExistApp2/ http://localhost:8082/myExistApp2/ 

The drawback is that each additional Tomcat web application will need a new proxy rule in Apache’s httpd.conf file, and hence require the Apache server to be restarted. Instead, I’d rather minimize configuration efforts and reserve a specific prefix in my URL space for reverse proxying. For example, I would like to introduce the path prefix ‘/apps/’ as a ‘flag’ for Apache so it knows it should reverse proxy those (and only those) requests whose path starts with ‘/apps/’. I’ll discuss this in the next section.

3. Introduce a ‘proxy path prefix’ in the URI space

At first glance, this is an easy one: add an ‘/apps/’ path prefix to be proxied, and direct it to the root of the proxy backend:

ProxyRequests     off
ProxyPreserveHost on

ProxyPass /apps/ http://localhost:8082/
ProxyPassReverse /apps/ http://localhost:8082/

These settings will make the ‘myExistApp’ web application publicly available at http://mydomain/apps/myExistApp/index.xml (mind the ‘/apps/’ path prefix in the URI). Yet, something goes wrong when the eXist web application issues a redirection. For example, a request for http://mydomain/apps/myExistApp/, which issues a redirect to the ‘index.xml’ page, will end up at http://mydomain/myExistApp/index.xml, without the ‘/apps/’ prefix. Since Apache has no proxy rules for this path, this redirection will fail.

Apparently, something is wrong with the ProxyPassReverse directive, which should normally make sure that location headers are rewritten appropriately for redirections issued by the proxy backend. It took me some hair pulling and a couple of posts to the Apache-users mailing list to get a solution for this problem. The kind folks over there pointed out that, due to ProxyPreserveHost being switched on, the response headers returned by the proxy backend will contain http://mydomain/; therefore the second argument of the ProxyPassReverse directive should match this URL. Only then does the rule take effect and, in this case, add the ‘/apps/’ prefix to the response headers. These settings fixed the issue with internal redirections issued by the proxied web application:

ProxyRequests     off
ProxyPreserveHost on

ProxyPass /apps/ http://localhost:8082/
ProxyPassReverse /apps/ http://mydomain/

Yet, while this ProxyPassReverse directive made those ‘internal’ redirections work, there remained an issue with self-referential URLs. Apparently, what the proxied web application sees of the original request is this: http://mydomain/myExistApp/, without the ‘/apps/’ path prefix. To this day, I haven’t found any fix for this issue. If I understand the mod_proxy documentation for the ProxyPassReverse directive correctly, there just isn’t any straightforward way to pass this prefix to the backend web applications:

Only the HTTP response headers specifically mentioned above [i.e. Location, Content-Location and URI headers on HTTP redirect responses] will be rewritten. Apache will not rewrite other response headers, nor will it rewrite URL references inside HTML pages. This means that if the proxied content contains absolute URL references, they will by-pass the proxy. A third-party module that will look inside the HTML and rewrite URL references is Nick Kew’s mod_proxy_html.
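
For what it’s worth, two pieces of this puzzle can presumably be patched up on the Apache side. The sketch below is untested and rests on several assumptions: that the session cookie issued by Tomcat carries the backend path (e.g. ‘/myExistApp’), that mod_proxy_html is installed and loaded, and that the links to be fixed occur in HTML responses. ProxyPassReverseCookiePath is a standard mod_proxy directive; SetOutputFilter proxy-html and ProxyHTMLURLMap are how mod_proxy_html 3.0 was typically switched on with Apache 2.2 (later versions offer a ProxyHTMLEnable directive instead). Neither changes what the webapp itself sees of the request, so URLs generated in non-HTML content (such as the webstart .jnlp file) remain unaffected.

```apache
ProxyRequests     off
ProxyPreserveHost on

ProxyPass /apps/ http://localhost:8082/
ProxyPassReverse /apps/ http://mydomain/

# Rewrite the Path of cookies set by the backend (e.g. the session cookie),
# so the browser sends them back for URLs under /apps/.
ProxyPassReverseCookiePath / /apps/

<Location /apps/>
  # Assumes mod_proxy_html is loaded: run HTML responses through its filter...
  SetOutputFilter proxy-html
  # ...and add the prefix to links that point at the webapp's backend path.
  ProxyHTMLURLMap http://mydomain/myExistApp/ http://mydomain/apps/myExistApp/
  ProxyHTMLURLMap /myExistApp/ /apps/myExistApp/
</Location>
```

Note that the ProxyHTMLURLMap rules name the webapp explicitly, so each additional webapp would need its own pair of rules.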

So it seems impossible to introduce a path prefix flagging a ‘proxy zone’ in the URI space if that prefix does not occur in the URI space of the web application on the proxy backend. An obvious approach to this problem, therefore, is to introduce the same path prefix for the Tomcat web applications. As it turns out, Tomcat (since version 6) provides a fairly easy way to add URI prefixes to webapps. This feature makes use of a specific naming convention for the folders or WAR files containing Tomcat web applications: when the webapp name is prefixed with the desired path prefix, and prefix and webapp name are separated with a hash (‘#’), Tomcat will interpret the parts separated by hashes as path components. For example, renaming the ‘myExistApp’ folder as follows:

${catalina.base}/webapps/apps#myExistApp

…would make the web application accessible at http://localhost:8082/apps/myExistApp/, without any further Tomcat configuration. This feature is explained somewhat cryptically in the Tomcat Context Container reference, but this kind message on the Tomcat-users mailing list helped me a lot.

Yet, there appeared to be a catch: Cocoon-2.1 chokes on webapps whose folder or file names contain a hash (see https://issues.apache.org/jira/browse/COCOON-2270). Still, there is a workaround for prefixing Cocoon-based web applications as well, as shown in the following configuration steps:

  1. move the folder or WAR file containing the webapp outside of the host’s appBase path, e.g.: F:\cocoonApps\myExistApp2
  2. add a file ${catalina.base}\conf\Catalina\[host name]\[prefix]#[app name].xml, e.g.: ${catalina.base}\conf\Catalina\localhost\apps#myExistApp2.xml, with the following content:

    <Context docBase="F:/cocoonApps/myExistApp2"/>

Using this workaround, even Cocoon webapps are happy when accessed at e.g. http://localhost:8082/apps/myExistApp2/. Of course, this will require that all Cocoon-based web applications be physically moved outside the normal Tomcat appBase location, and a <Context> be explicitly declared in a separate configuration file at the prescribed location. Still, this could allow for relatively flexible management of Tomcat web applications:

  • non-Cocoon-based webapps: just add them in the Host’s appBase, prefixing the name of the folder or WAR file with the desired prefix(es), separated with a hash (#). Adding new non-Cocoon webapps requires no further steps than storing them with the desired URI prefix.
  • Cocoon-based webapps: store them outside of the Host’s appBase, with just the unprefixed webapp name. Additionally, add a context file for each Cocoon-based webapp, specifying the docBase of the webapp as explained above. This additional step is only needed for Cocoon-based webapps.

With these steps in place, both Tomcat applications are accessible on the remote machine at http://localhost:8082/apps/myExistApp/, and http://localhost:8082/apps/myExistApp2/, respectively. In order to make them accessible via Apache, this single proxy rule for paths starting with the ‘/apps/’ prefix suffices:

ProxyRequests     off
ProxyPreserveHost on

ProxyPass /apps/ http://localhost:8082/apps/ 
ProxyPassReverse /apps/ http://localhost:8082/apps/ 

These Apache proxy settings make it possible to flexibly add Tomcat apps and have them reverse proxied behind Apache using the ‘/apps/’ URI prefix.

Still, this adds some overhead on the Tomcat side for maintaining the web applications: the naming conventions must be adhered to, and a difference is introduced for webapps using Cocoon, and others. Therefore I decided to look for another approach, which I’ll discuss in the next section.

4. Making life easier: replacing ‘proxy path prefixes’ with domain prefixes

In order to maximize flexibility while reducing the configuration burden, I decided to ‘merge’ both approaches discussed so far, in a solution that:

  • adopts a ‘proxy prefix’ in the URI space
  • still passes on the correct path to the proxy backend

This can be done by replacing the ‘proxy prefix’ from the URI path (which seems to cause fundamental problems) with a domain prefix. Instead of http://mydomain/apps/myExistApp/, that would become http://apps.mydomain/myExistApp/. In order to do so, I configured my DNS settings so the domain name http://apps.mydomain/ resolves to the same IP address as http://mydomain/.

Then it’s only a matter of adding a second <VirtualHost> section to the Apache httpd.conf file. This virtual host definition then can just proxy its root path to the Tomcat server:

NameVirtualHost *:80

<VirtualHost *:80>
  ServerName apps.mydomain
  ServerAlias *.apps.mydomain 
  ProxyRequests     off
  ProxyPreserveHost on
  ProxyPass / http://localhost:8082/
  ProxyPassReverse / http://localhost:8082/
</VirtualHost>

<VirtualHost *:80>
  ServerName mydomain 
  ServerAlias *.mydomain 
</VirtualHost>

Should you, for one reason or another, not be able to add another virtual host, this can be worked around. For example, our web server is controlled by Plesk, which only allows one <VirtualHost> per domain, and lets users introduce Apache configurations for those domains in a separate vhost.conf file. Still, within a single <VirtualHost> section, the variant with the proxy prefix can be specified as an alias using the ServerAlias directive. In order to restrict proxying to this domain alias, the mod_rewrite Apache module comes in handy. With the following RewriteRule directive, guarded by a specific condition in RewriteCond, proxying can be limited to requests starting with http://apps.mydomain/:

<VirtualHost *:80>
  ServerName mydomain
  ServerAlias apps.mydomain
  RewriteEngine     on
  ProxyRequests     off
  ProxyPreserveHost on

  # Proxy only those requests that are addressed to the apps.mydomain alias
  RewriteCond %{HTTP_HOST} ^apps\.mydomain(:80)?$
  RewriteRule ^/(.*)$ http://localhost:8082/$1 [P]
</VirtualHost>

These rewrite rules first test whether the host part of the request matches ‘apps.mydomain’. For those (and only those) requests, the host part is replaced with that of the proxy backend. By using the [P] flag, the result of the rewrite rule is passed on internally to mod_proxy. Of course, the use of rewrite rules implies that the module mod_rewrite is loaded in the Apache web server.

If needed, rewrite rules can be introduced as well that rewrite requests with a proxy path prefix to a proxy domain prefix:

# For requests addressed to mydomain, redirect /apps/... to the apps.mydomain host
RewriteCond %{HTTP_HOST} ^(mydomain(:80)?)$
RewriteRule ^/(apps)/(.*)    http://$1.%1/$2 [R]

5. Summary

In hindsight, my existing mod_jk configuration could quite straightforwardly be replaced with mod_proxy. Actually, mod_proxy requires even less configuration than mod_jk: all configuration work can happen inside Apache’s httpd.conf file, without additional configuration in workers.properties and uriworkermap.properties files. Moreover, separate mod_proxy rules for each Tomcat web application can be avoided by introducing a ‘proxy prefix’ in the URI space:

  • a proxy path prefix: due to apparent limitations in passing the full path of the original request (including the path prefix) to the proxy backend, this will require the introduction of this path prefix in the URI space of the Tomcat applications themselves
  • a proxy domain prefix: if you’re able to configure a new domain that points to your web server, this allows for the most flexible proxy configuration: only requests for this specific prefixed domain are proxied, without the need to introduce a path prefix in the URI space of either the proxy frontend or the proxy backend

To come full circle, I was pleased to discover that the initial problem which triggered my mod_proxy quest is limited to mod_jk. If the submodule mod_proxy_ajp is enabled in Apache, connections to Tomcat via AJP can just as easily be set up (provided the Tomcat server exposes an AJP connector at port 8009):

NameVirtualHost *:80

<VirtualHost *:80>
  ServerName apps.mydomain
  ServerAlias *.apps.mydomain 
  ProxyRequests     off
  ProxyPreserveHost on
  ProxyPass / ajp://localhost:8009/
  ProxyPassReverse / ajp://localhost:8009/
</VirtualHost>

<VirtualHost *:80>
  ServerName mydomain 
  ServerAlias *.mydomain 
</VirtualHost>

…or, within the same virtual host:

<VirtualHost *:80>
  ServerName mydomain
  ServerAlias apps.mydomain
  RewriteEngine     on
  ProxyRequests     off
  ProxyPreserveHost on

  # Proxy only those requests that are addressed to the apps.mydomain alias
  RewriteCond %{HTTP_HOST} ^apps\.mydomain(:80)?$
  RewriteRule ^/(.*)$ ajp://localhost:8009/$1 [P]
</VirtualHost>

Via a mod_proxy_ajp connection, eXist-2.x applications do have access to the session attributes. This way, the flexible configurability of mod_proxy can be coupled with the allegedly more performant AJP connection between Apache and Tomcat.

Internal URL Rewriting with eXist’s MVC Framework

Since version 1.4, the eXist native XML database has been equipped with a Model View Controller (MVC) framework designed to express the logic for request routing of eXist-based web applications in XQuery. In this post I’ll illuminate an (in my opinion) somewhat under-exposed feature of eXist’s MVC framework: internal URL rewriting. By this term, I mean that a URL, say http://localhost:8080/exist/urltest/test.xql, is resolved internally to another URL like http://localhost:8080/exist/urltest/xquery/test.xql. Internally, meaning that the original request is not redirected to another one, and the user still sees the original URL in the browser address bar. As section 1 of this post will illustrate, this works like a charm for ‘simple’ rewrites like the previous one, but requires some thought if you would like to ‘chain’ multiple internal rewrite rules. In this post, I’ll try to provide a flexible coding pattern to achieve such internal rewriting with eXist’s MVC framework.

From KWIC display to KWIC(er) processing with eXist

The eXist XML database has a dedicated XQuery module for displaying search results in a fixed context window, a visualization that is commonly known as a KeyWord In Context view. Search results are presented with a preceding and following text context (called further in this text left and right text context):

<p>
    <span class="previous">... s effect, sir; after what flourish your </span>
    <span class="hi">nature</span>
    <span class="following"> will.</span>
</p>

This formatting of search results invites exploiting its particular features, such as sorting the search results according to their left or right contexts, or even according to the nth word preceding or following the search term. This is greatly facilitated by the XML representation of the KWIC search results, where all three parts are isolated in their own XML element. However, while eXist’s current KWIC display module (as it is consistently called) does its job in presenting a KWIC display, in my opinion it is too display-oriented:

  • it lacks performance on large result sets, and / or wide context widths, which is crucial for further processing, since sorting requires pre-computation of the entire result set
  • (though this is nitpicking:) the output is presentational HTML; while this is irrelevant from a processing point of view, I would prefer a semantically more ‘neutral’ format and defer presentational formatting to a later display phase

This post will address both objections and present alternatives. Additionally, ways for processing these KWIC results are discussed in the last section.

Venturing into versions: strategies for querying a TEI apparatus with eXist

When encoding a critical edition in XML, one of the challenges facing the text encoder is finding a way to represent multiple versions of a work in a sensible way. As usual when it comes to the electronic representation of texts in the field of the humanities, such a sensible way is provided by the Text Encoding Initiative (TEI). Actually, three ways are offered, though this post will focus on the so-called parallel-segmentation method (for extensive reference, the reader is directed to chapter 12: Critical Apparatus of the TEI Guidelines). In short: this method allows an encoder to represent all text versions of a work within a single XML source, where places with variant text are encoded as an inline apparatus (<app>), in which the distinct variants are identified as readings (<rdg wit="[sigil]">), whose @wit attribute links them to (an) identified version(s) of the work. At this point, a lot more could be said about both edition-theoretic and markup-theoretic aspects, but this won’t be the focus of this post.

Instead, this post will focus on a topic I saw myself confronted with when developing an application (i.e. a web interface) for such an edition: how do you search within such ‘multiversion’ texts? Most probably, users of the edition would want to focus on one (or a selection of) text version(s). Of course, when version 1 contains the word ‘hope’, which in version 2 had been changed to ‘despair’, (only) the right readings should be retrieved for the respective text version.

XQuery Unit testing in eXist-1.4

[UPDATE 2011-01-19]: As of revisions 13587 and 13589, the XQuery Unit Testing framework has been ported back from eXist-trunk to the eXist-1.4.x branch. While this obviates the need for the XSLT stylesheet presented in this blog post, I’ll leave the latter here for the sake of documentation. eXist users who want to test XQueries in eXist-1.4 are now encouraged to use its built-in XQuery Unit Testing framework instead.

[UPDATE 2011-01-05]: The XSLT stylesheet has been extended with missing features:

  • [feature]: added @trace handling
  • [feature]: added <xpath> handling
  • [feature]: added <store-files> handling
  • [feature]: added context handling for util:eval()
  • [fix]: <![CDATA[ ]]> in output: spaces required…

[UPDATE 2010-12-09]: The XSLT stylesheet has been substantially reworked, to produce

  • more legible XQuery code
  • more reliable XQuery code, taking into account serialization options, and deriving the most sensible highlight-matches settings where necessary

Currently, I’m heavily porting old XQuery code to the latest version of the eXist XML database’s new Lucene FT index and search capabilities. In doing so, I’m hitting a couple of bugs in this area, that I’m trying to isolate, test and report as clearly as possible. This post discusses a means to use the same test files for both eXist-1.4 and eXist-trunk.

As a matter of fac(e)t: (mimicking) faceted searching in eXist

In hindsight, since I set out developing search interfaces for XML text collections with the marvelous eXist XML database, I’ve been drawn to the concept of faceted search, even long before I knew it was called that way. The recent integration of Lucene indexing and searching capabilities into eXist (since version 1.4) holds promises for efficient facet-oriented search features such as integrating Lucene fields in search queries.

Full text queries in eXist: from Lucene to XML syntax

[UPDATE 2011-08-09]: The lucene2xml scripts have been modified:

  • [feature]: added a couple of further conditions in $lucene2xml, in order to benefit from unified <exist:match> markers for adjacent phrase terms: differentiate between
    • phrase search: rewrite <near slop="<1"> to <phrase>
    • proximity search: copy <near slop=">=1">
  • [fix]: improved treatment of escaped parentheses inside proximity search expressions

Since version 1.4, the eXist native XML database implements a Lucene-based full text index. The main Lucene-aware search function, ft:query(), accepts queries expressed in two flavours: Lucene’s native query syntax, and an XML query syntax specific to eXist.

The XML query syntax was explicitly designed to allow for more expressive queries than is possible with the Lucene syntax. Most notably, eXist has extensions for:

  • fine-grained proximity searches with the <near> element (a.o. the possibility to specify that search terms can occur unordered)
  • regular expression searches with the <regex> element

This makes the XML syntax the more interesting option for developing a user search interface. A search interface could then allow users to input search queries in the (quite intuitive) Lucene fashion, while providing additional options for specifying extra search features (‘(un)ordered proximity search’, ‘regular expression search’). Behind the scenes, both pieces of user input (search query + additional parameters) can be translated to an XML expression of the search query.
