XSLT Batch Processing

Overview

Every XSLT stylesheet you have written so far takes a single XML file as its input. Batch processing means applying the same transformation to an entire directory of XML files at once. This page covers the pure XSLT approach: declaring the input corpus from within the stylesheet itself using the collection() function, with no external configuration required.

There are two output shapes to know:

Many-to-one: many XML input files produce a single HTML output. Useful for aggregate analysis — statistics, indexes, summaries across a whole corpus.
Many-to-many: many XML input files each produce their own HTML output file. Useful for generating a set of individual pages — one per document — from a corpus.

Both share the same three-part infrastructure: a corpus variable using collection(), an xsl:initial-template entry point, and xsl:result-document to write output to named files. The only structural difference between the two shapes is whether xsl:result-document fires once or inside a loop.

Before running any batch stylesheet in oXygen, set the XML input dropdown to (None). If a document is selected there, it overrides the collection() variable declared inside the stylesheet. This is the most common source of errors in batch processing.

Variables and `collection()`

xsl:variable

A variable stores a value for reuse elsewhere in the stylesheet. XSLT variables are immutable — unlike Python variables, once declared the value cannot be changed. The basic syntax is:

<xsl:variable name="my-variable" as="xs:integer" select="42"/>

The name= attribute gives the variable its identifier; as= declares its type; select= provides its value. You reference a variable elsewhere in the stylesheet with a $ prefix: $my-variable.
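As a small illustration of immutability, the sketch below never reassigns my-variable; it derives a second variable from it instead. The name doubled is hypothetical, and the root-match template assumes a single input document, as in your earlier stylesheets.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0">
    <xsl:output method="text"/>

    <!-- Declared once; its value cannot be changed afterwards -->
    <xsl:variable name="my-variable" as="xs:integer" select="42"/>

    <!-- "doubled" is a hypothetical second variable: instead of updating
         my-variable, compute a new value from it -->
    <xsl:variable name="doubled" as="xs:integer" select="$my-variable * 2"/>

    <xsl:template match="/">
        <!-- Both variables are referenced with the $ prefix -->
        <xsl:value-of select="$my-variable"/>
        <xsl:text> doubled is </xsl:text>
        <xsl:value-of select="$doubled"/>
    </xsl:template>
</xsl:stylesheet>
```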

For a corpus of XML documents the type is document-node()+:

document-node() — the root node of a parsed XML document. This is the node above the root element — the same thing / refers to in XPath.
+ — one or more (the same quantifier as in RelaxNG schemas).

The collection() function

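The collection() function takes a URI and returns the documents it identifies as a sequence of document nodes. In Saxon, the processor oXygen uses by default, a URI that points to a directory returns the documents in that directory, and parameters after a ? control which files are read. A relative path such as ./my-collection is resolved against the location of the stylesheet.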

Flat directory (no subdirectories)

<xsl:variable name="my-corpus" as="document-node()+"
    select="collection('./my-collection?select=*.xml')"/>

The ?select=*.xml filter is important. Without it, any non-XML file in the directory — a .DS_Store, a README.md — produces a cryptic error. The ? signals that options follow; *.xml matches any filename ending in .xml.

Recursive directory (includes nested subdirectories)

<xsl:variable name="my-corpus" as="document-node()+"
    select="collection('./my-collection?recurse=yes;select=*.xml')"/>

recurse=yes tells collection() to descend into subdirectories. Multiple options are separated by semicolons. This is useful when a corpus is organized into subfolders by author, date, genre, or any other grouping.

Querying across the corpus

Once the corpus is in a variable, XPath expressions can run across all documents at once. Reference the variable with a $ prefix:

<xsl:value-of select="count($my-corpus//item)"/>

The //item steps down through all document nodes in the sequence simultaneously — no loop needed for a simple aggregate count. For output that must be computed or written per document, use xsl:for-each to iterate over the corpus variable.
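A few more corpus-wide aggregates can be written the same way. The following is a minimal sketch, assuming the ./my-collection directory and the item element used on this page, and the TEI default namespace used in the skeletons further down; the choice of statistics is illustrative only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:output method="text"/>

    <xsl:variable name="my-corpus" as="document-node()+"
        select="collection('./my-collection?select=*.xml')"/>

    <xsl:template name="xsl:initial-template">
        <!-- Number of documents in the sequence -->
        <xsl:text>Documents: </xsl:text>
        <xsl:value-of select="count($my-corpus)"/>

        <!-- Number of item elements across every document -->
        <xsl:text>&#10;Items: </xsl:text>
        <xsl:value-of select="count($my-corpus//item)"/>

        <!-- Number of different item texts, ignoring whitespace differences -->
        <xsl:text>&#10;Distinct item texts: </xsl:text>
        <xsl:value-of
            select="count(distinct-values($my-corpus//item/normalize-space()))"/>

        <!-- Highest per-document item count; the ! operator evaluates
             count(.//item) once for each document in the sequence -->
        <xsl:text>&#10;Most items in one document: </xsl:text>
        <xsl:value-of select="max($my-corpus ! count(.//item))"/>
    </xsl:template>
</xsl:stylesheet>
```

None of these expressions needs xsl:for-each, because each one reduces the whole corpus to a single value.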

`xsl:initial-template`

Every stylesheet you have written so far begins by matching the root node of a single input document:

<xsl:template match="/">
    ...
</xsl:template>

That works because oXygen supplies a default input document via the XML dropdown. With collection(), there is no default input document — the corpus is declared inside the stylesheet, and the XML dropdown is set to (None). The root-match template therefore never fires.

The solution is xsl:initial-template: a named template the processor runs first, before any document matching:

<xsl:template name="xsl:initial-template">
    <!-- entry point: query the corpus variable and write output here -->
</xsl:template>

Note that the attribute is name= and its value is xsl:initial-template, with the xsl: prefix included. This is intentional XSLT 3.0 syntax, not a typo. Named templates are not triggered by template matching; they normally run only when explicitly called. This one is the exception: because of its reserved name, the processor calls it automatically at startup.
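To see the difference between the two entry points, a stylesheet can declare both. This is a minimal sketch with placeholder output text, not part of the skeletons below; it assumes the transformation is started without an input document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">
    <xsl:output method="text"/>

    <!-- Needs an input document to match against; with no input
         document there is nothing for it to match -->
    <xsl:template match="/">
        <xsl:text>root-match template fired</xsl:text>
    </xsl:template>

    <!-- Needs no input document; the processor starts here -->
    <xsl:template name="xsl:initial-template">
        <xsl:text>initial template fired</xsl:text>
    </xsl:template>
</xsl:stylesheet>
```

In oXygen, starting without an input document means leaving the XML input set to (None). From the Saxon command line, the comparable setup is, as far as I know, the -it option with no source document.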

`xsl:result-document`

Normally a stylesheet writes all output to the primary output tree — the single file configured in the oXygen output dropdown. With batch processing you often want to control exactly where output goes. xsl:result-document lets you write to any named file:

<xsl:result-document href="output/myfile.html" method="html">
    <!-- content to write to this file -->
</xsl:result-document>

The href is an AVT — the curly-brace attribute value template syntax you already know — so the filename can be computed dynamically. Saxon will create intermediate directories (such as output/) if they do not already exist.

In a many-to-one stylesheet, xsl:result-document appears once with a hard-coded filename. In a many-to-many stylesheet, it appears inside an xsl:for-each loop so it fires once per input document.

Many-to-one output

The many-to-one pattern collects data from all documents in the corpus and writes a single HTML output — a summary page, a statistics table, an index. The xsl:result-document fires once, outside any loop, with a hard-coded filename.

Skeleton

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:output method="html" indent="yes"/>

    <!-- 1. Declare the corpus variable -->
    <xsl:variable name="my-corpus" as="document-node()+"
        select="collection('./my-collection?select=*.xml')"/>

    <!-- 2. Named entry point -->
    <xsl:template name="xsl:initial-template">

        <!-- 3. Write to a single named file -->
        <xsl:result-document href="output/summary.html" method="html">
            <html>
                <head><title>Corpus Summary</title></head>
                <body>
                    <h1>Corpus Summary</h1>

                    <!-- Corpus-level aggregate: no loop needed -->
                    <p>Total documents: <xsl:value-of
                        select="count($my-corpus)"/></p>
                    <p>Total items: <xsl:value-of
                        select="count($my-corpus//item)"/></p>

                    <!-- Per-document breakdown: iterate with for-each -->
                    <table>
                        <tr><th>Document</th><th>Item count</th></tr>
                        <xsl:for-each select="$my-corpus">
                            <!-- A sort key must be a single value, so take the first title -->
                            <xsl:sort select="(.//title)[1]"/>
                            <tr>
                                <td><xsl:value-of select=".//title"/></td>
                                <td><xsl:value-of select="count(.//item)"/></td>
                            </tr>
                        </xsl:for-each>
                    </table>
                </body>
            </html>
        </xsl:result-document>
    </xsl:template>

</xsl:stylesheet>

Key points

Corpus-level counts query $my-corpus directly — //item steps across all document nodes in the sequence simultaneously. No loop needed.
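Inside xsl:for-each, . is the current document node. Use .//item to query within that document only. A bare //item also starts from the root of the current document's tree, so it does not search the entire corpus, but the leading . makes the scope explicit. To query across all documents, start the path from the variable: $my-corpus//item.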
xsl:sort takes select= (an expression), not match= (a pattern).
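The two scopes can be combined in one table. The sketch below contrasts a path scoped to the current document (.//item) with one that starts from the corpus variable ($my-corpus//item); the variable names total and here and the output filename shares.html are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

    <xsl:variable name="my-corpus" as="document-node()+"
        select="collection('./my-collection?select=*.xml')"/>

    <!-- Corpus-wide total, computed once from the corpus variable -->
    <xsl:variable name="total" as="xs:integer"
        select="count($my-corpus//item)"/>

    <xsl:template name="xsl:initial-template">
        <xsl:result-document href="output/shares.html" method="html">
            <html>
                <head><title>Item shares</title></head>
                <body>
                    <table>
                        <tr><th>Document</th><th>Items</th><th>Share of corpus</th></tr>
                        <xsl:for-each select="$my-corpus">
                            <!-- .//item counts items in the current document only -->
                            <xsl:variable name="here" as="xs:integer"
                                select="count(.//item)"/>
                            <tr>
                                <td><xsl:value-of select="(.//title)[1]"/></td>
                                <td><xsl:value-of select="$here"/></td>
                                <!-- Guard against dividing by zero in an item-free corpus -->
                                <td><xsl:value-of select="
                                    if ($total gt 0)
                                    then format-number($here div $total, '0.0%')
                                    else 'n/a'"/></td>
                            </tr>
                        </xsl:for-each>
                    </table>
                </body>
            </html>
        </xsl:result-document>
    </xsl:template>
</xsl:stylesheet>
```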

Many-to-many output

The many-to-many pattern produces one output file per input document. xsl:result-document moves inside an xsl:for-each loop so it fires once per iteration. The corpus variable and xsl:initial-template are identical to the many-to-one pattern.

Skeleton

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:output method="html" indent="yes"/>

    <!-- 1. Declare the corpus variable -->
    <xsl:variable name="my-corpus" as="document-node()+"
        select="collection('./my-collection?select=*.xml')"/>

    <!-- 2. Named entry point -->
    <xsl:template name="xsl:initial-template">

        <!-- 3. Loop over the corpus -->
        <xsl:for-each select="$my-corpus">

            <!-- 4. Write one file per document -->
            <xsl:result-document href="output/doc_{position()}.html" method="html">
                <html>
                    <head>
                        <title><xsl:value-of select=".//title"/></title>
                    </head>
                    <body>
                        <h1><xsl:value-of select=".//title"/></h1>
                        <xsl:apply-templates select=".//item"/>
                    </body>
                </html>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>

    <!-- 5. Match templates work exactly as in a single-document stylesheet -->
    <xsl:template match="item">
        <p><xsl:apply-templates/></p>
    </xsl:template>

</xsl:stylesheet>

Key points

xsl:result-document is inside the loop: one call per iteration, one output file per document.
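XPath expressions inside the loop should start with a leading . (for example .//title) so that they are explicitly scoped to the current document. A path that starts from $my-corpus queries the whole corpus, even inside the loop.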
Match templates work exactly as they do in a single-document stylesheet. The push model is unchanged.
Nothing is written to the primary output tree when all content goes through xsl:result-document. oXygen may warn about an empty result — this is not an error.
If a document in the corpus does not contain the element you are querying, its output file is still produced but the relevant portion of the body will be empty. Batch processing does not guarantee every document contains what you are looking for.
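If an empty body is not acceptable, the loop can test for the element first. The following is a minimal sketch, assuming the same corpus and item element as the skeleton above; the fallback sentence is a placeholder.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

    <xsl:variable name="my-corpus" as="document-node()+"
        select="collection('./my-collection?select=*.xml')"/>

    <xsl:template name="xsl:initial-template">
        <xsl:for-each select="$my-corpus">
            <xsl:result-document href="output/doc_{position()}.html" method="html">
                <html>
                    <head><title><xsl:value-of select=".//title"/></title></head>
                    <body>
                        <h1><xsl:value-of select=".//title"/></h1>
                        <xsl:choose>
                            <!-- The current document has at least one item -->
                            <xsl:when test="exists(.//item)">
                                <xsl:apply-templates select=".//item"/>
                            </xsl:when>
                            <!-- Otherwise say so instead of leaving the body empty -->
                            <xsl:otherwise>
                                <p>No items found in this document.</p>
                            </xsl:otherwise>
                        </xsl:choose>
                    </body>
                </html>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>

    <xsl:template match="item">
        <p><xsl:apply-templates/></p>
    </xsl:template>
</xsl:stylesheet>
```

To skip such documents entirely rather than write a placeholder, the same test could wrap the whole xsl:result-document in an xsl:if.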

Dynamic filenames

Using position() — simple and transparent

position() returns the position of the current node in the sequence being iterated. Inside an xsl:for-each over the corpus, it gives each document a unique number. Use it in an AVT inside the href:

<xsl:result-document href="output/doc_{position()}.html" method="html">

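This produces doc_1.html, doc_2.html, etc. The filenames are not descriptive of their content, but the mechanism is completely transparent and easy to reason about. Keep in mind that the order in which collection() returns documents is implementation-dependent, so a given number is not guaranteed to correspond to the same input file on every system.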
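With more than nine documents, unpadded numbers sort awkwardly in a directory listing (doc_10.html lands before doc_2.html). One possible variation, sketched here under the assumption that the corpus has fewer than 1000 documents, pads the number with format-number():

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

    <xsl:variable name="my-corpus" as="document-node()+"
        select="collection('./my-collection?select=*.xml')"/>

    <xsl:template name="xsl:initial-template">
        <xsl:for-each select="$my-corpus">
            <!-- The picture string '000' pads to three digits:
                 doc_001.html, doc_002.html, and so on -->
            <xsl:result-document
                href="output/doc_{format-number(position(), '000')}.html"
                method="html">
                <html>
                    <head><title><xsl:value-of select=".//title"/></title></head>
                    <body>
                        <h1><xsl:value-of select=".//title"/></h1>
                    </body>
                </html>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```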

Derived from the input filename — descriptive

To name each output file after its input file — so that myfile.xml produces myfile.html — declare a variable inside the loop that computes the filename from the input document's URI:

<xsl:variable name="filename"
    select="substring-before(
        tokenize(base-uri(), '/')[last()],
        '.xml') || '.html'"/>

Then reference it in the href with an AVT:

<xsl:result-document href="output/{$filename}" method="html">

Breaking down the expression:

base-uri(): returns the URI of the current document, typically a file: URI that ends in the input filename
tokenize(..., '/')[last()]: splits the URI on / and selects the last segment, giving the bare filename
substring-before(..., '.xml'): returns everything before the first occurrence of .xml, stripping the extension without a regular expression
|| '.html': concatenates the new extension (|| is the XPath string concatenation operator)

Because this variable is declared inside the xsl:for-each loop, it is re-evaluated for each document in turn — each iteration gets its own value of $filename.
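Put together, the variable and the href sit inside the loop like this. It is a minimal sketch that reuses the corpus declaration and the expression from this page; the body content is a placeholder.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

    <xsl:variable name="my-corpus" as="document-node()+"
        select="collection('./my-collection?select=*.xml')"/>

    <xsl:template name="xsl:initial-template">
        <xsl:for-each select="$my-corpus">

            <!-- Declared inside the loop, so it is computed afresh for
                 each document: myfile.xml becomes myfile.html -->
            <xsl:variable name="filename"
                select="substring-before(
                    tokenize(base-uri(), '/')[last()],
                    '.xml') || '.html'"/>

            <!-- The AVT inserts the computed name into the output path -->
            <xsl:result-document href="output/{$filename}" method="html">
                <html>
                    <head><title><xsl:value-of select=".//title"/></title></head>
                    <body>
                        <h1><xsl:value-of select=".//title"/></h1>
                    </body>
                </html>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

With a recursive corpus, two input files in different subfolders can share a filename, in which case both would map to the same output path. That case needs a different naming scheme.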

Quick reference

The two output patterns compared

	Many-to-one	Many-to-many
Entry point	`xsl:initial-template`	`xsl:initial-template`
Corpus variable	`collection()`	`collection()`
`xsl:result-document`	once, hard-coded filename	inside loop, dynamic filename
Primary output tree	empty (warning, not error)	empty (warning, not error)
`xsl:for-each` purpose	one row or section per document	one output file per document

Common errors

Symptom	Cause	Fix
Cryptic `collection()` error	Non-XML file in the directory	Add `?select=*.xml` to the path
Wrong output / stylesheet ignores corpus	XML input not set to (None) in oXygen	Set XML dropdown to (None)
XPath inside loop queries whole corpus	Path starts from `$my-corpus` instead of the current document	Use `.//element`, not `$my-corpus//element`
Empty result warning in oXygen	Nothing written to primary output tree	Not an error; expected when all output goes through `xsl:result-document`
Output file produced but body is empty	Document doesn't contain the queried element	Expected batch behavior; add `xsl:if` guard if needed

`collection()` syntax

Use case	Syntax
Flat directory, XML files only	`collection('./my-dir?select=*.xml')`
Recursive (nested subdirectories), XML only	`collection('./my-dir?recurse=yes;select=*.xml')`
