XSLT Batch Processing

Overview

Every XSLT stylesheet you have written so far takes a single XML file as its input. Batch processing means applying the same transformation to an entire directory of XML files at once. This page covers the pure XSLT approach: declaring the input corpus from within the stylesheet itself using the collection() function, with no external configuration required.

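There are two output shapes to know:

Many-to-one: every document in the corpus contributes to a single output file, such as a summary page or an index.
Many-to-many: each input document produces its own output file.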

Both share the same three-part infrastructure: a corpus variable using collection(), an xsl:initial-template entry point, and xsl:result-document to write output to named files. The only structural difference between the two shapes is whether xsl:result-document fires once or inside a loop.

Before running any batch stylesheet in oXygen, set the XML input dropdown to (None). If a document is selected there, oXygen runs the stylesheet against that document instead of starting from xsl:initial-template, and the corpus declared with collection() is effectively ignored. This is the most common source of errors in batch processing.

Variables and collection()

xsl:variable

A variable stores a value for reuse elsewhere in the stylesheet. XSLT variables are immutable — unlike Python variables, once declared the value cannot be changed. The basic syntax is:

<xsl:variable name="my-variable" as="xs:integer" select="42"/>

The name= attribute gives the variable its identifier; as= declares its type; select= provides its value. You reference a variable elsewhere in the stylesheet with a $ prefix: $my-variable.
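As a minimal sketch of declaring and referencing variables, the stylesheet below defines one global and one local variable. The name doubled is made up for illustration; the stylesheet can be run against any XML input, since it never reads the document's content.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0">
    <xsl:output method="text"/>

    <!-- Global variable: declared at the top level, visible in every template -->
    <xsl:variable name="my-variable" as="xs:integer" select="42"/>

    <xsl:template match="/">
        <!-- Local variable (hypothetical name) computed from the global one.
             Neither variable can be reassigned after this point. -->
        <xsl:variable name="doubled" as="xs:integer" select="$my-variable * 2"/>
        <xsl:value-of select="$doubled"/>
    </xsl:template>
</xsl:stylesheet>
```

The output should be the text 84. To get a different value you declare a new variable, as doubled does here, rather than changing an existing one.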

For a corpus of XML documents the type is document-node()+, meaning a sequence of one or more document nodes.

The collection() function

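The collection() function takes a URI and returns a sequence of resources. With Saxon, a URI that points to a directory returns the XML documents in that directory as a sequence of document nodes, and options after a ? control which files are included. The ?select= and ?recurse= options shown below are Saxon conventions, not part of the XPath standard itself.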

Flat directory (no subdirectories)

<xsl:variable name="my-corpus" as="document-node()+"
    select="collection('./my-collection?select=*.xml')"/>

The ?select=*.xml filter is important. Without it, any non-XML file in the directory — a .DS_Store, a README.md — produces a cryptic error. The ? signals that options follow; *.xml matches any filename ending in .xml.

Recursive directory (includes nested subdirectories)

<xsl:variable name="my-corpus" as="document-node()+"
    select="collection('./my-collection?recurse=yes;select=*.xml')"/>

recurse=yes tells collection() to descend into subdirectories. Multiple options are separated by semicolons. This is useful when a corpus is organized into subfolders by author, date, genre, or any other grouping.

Querying across the corpus

Once the corpus is in a variable, XPath expressions can run across all documents at once. Reference the variable with a $ prefix:

<xsl:value-of select="count($my-corpus//item)"/>

The //item steps down through all document nodes in the sequence simultaneously — no loop needed for a simple aggregate count. For output that must be computed or written per document, use xsl:for-each to iterate over the corpus variable.
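The difference between a corpus-wide query and a per-document query can be sketched in one small stylesheet. This is an illustration, not a required pattern: it assumes the ./my-collection directory and the item element used elsewhere on this page, assumes the documents are in the TEI namespace, and uses the xsl:initial-template entry point that the next section explains.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:output method="text"/>

    <xsl:variable name="my-corpus" as="document-node()+"
        select="collection('./my-collection?select=*.xml')"/>

    <xsl:template name="xsl:initial-template">
        <!-- Aggregate: one path expression spans every document in the sequence -->
        <xsl:text>Total items: </xsl:text>
        <xsl:value-of select="count($my-corpus//item)"/>
        <xsl:text>&#10;</xsl:text>

        <!-- Per document: each iteration has one document node as its context -->
        <xsl:for-each select="$my-corpus">
            <!-- Last segment of the document URI, i.e. the filename -->
            <xsl:value-of select="tokenize(base-uri(), '/')[last()]"/>
            <xsl:text>: </xsl:text>
            <!-- The leading . restricts the count to the current document -->
            <xsl:value-of select="count(.//item)"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

The per-document counts should add up to the total on the first line.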

xsl:initial-template

Every stylesheet you have written so far begins by matching the root node of a single input document:

<xsl:template match="/">
    ...
</xsl:template>

That works because oXygen supplies a default input document via the XML dropdown. With collection(), there is no default input document — the corpus is declared inside the stylesheet, and the XML dropdown is set to (None). The root-match template therefore never fires.

The solution is xsl:initial-template: a named template the processor runs first, before any document matching:

<xsl:template name="xsl:initial-template">
    <!-- entry point: query the corpus variable and write output here -->
</xsl:template>

Note that the value of the name attribute is itself xsl:initial-template, prefix included. This is intentional XSLT 3.0 syntax, not a typo. Named templates are not triggered by matching nodes in a document; they run when called by name. This one is special: because the name is reserved, the processor invokes it automatically at startup when no input document is supplied.
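To see which entry point actually runs, a stylesheet can contain both kinds of template. The sketch below is only a diagnostic; the message texts are invented. It assumes the transformation is started without a source document, either with the XML dropdown set to (None) in oXygen or, as far as I know, with Saxon's -it option on the command line.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">
    <xsl:output method="text"/>

    <!-- Runs when the transformation starts with no input document -->
    <xsl:template name="xsl:initial-template">
        <xsl:text>Started from the initial template; no input document was read.</xsl:text>
    </xsl:template>

    <!-- Runs only when an input document is supplied and its root is matched -->
    <xsl:template match="/">
        <xsl:text>Started from an input document.</xsl:text>
    </xsl:template>
</xsl:stylesheet>
```

If the second message appears when you expected the first, an input document is still selected somewhere in the transformation scenario.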

xsl:result-document

Normally a stylesheet writes all output to the primary output tree — the single file configured in the oXygen output dropdown. With batch processing you often want to control exactly where output goes. xsl:result-document lets you write to any named file:

<xsl:result-document href="output/myfile.html" method="html">
    <!-- content to write to this file -->
</xsl:result-document>

The href is an AVT — the curly-brace attribute value template syntax you already know — so the filename can be computed dynamically. Saxon will create intermediate directories (such as output/) if they do not already exist.

In a many-to-one stylesheet, xsl:result-document appears once with a hard-coded filename. In a many-to-many stylesheet, it appears inside an xsl:for-each loop so it fires once per input document.
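The AVT in href can be tried without any corpus at all. The following sketch loops over two made-up strings, alpha and beta, and writes one file for each. It assumes relative href values resolve against the output location configured for the transformation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

    <xsl:template name="xsl:initial-template">
        <!-- Two hypothetical page names; the context item is a string here -->
        <xsl:for-each select="('alpha', 'beta')">
            <!-- {.} inserts the current string, giving alpha.html and beta.html -->
            <xsl:result-document href="output/{.}.html" method="html">
                <html>
                    <head><title><xsl:value-of select="."/></title></head>
                    <body>
                        <p>Page for <xsl:value-of select="."/></p>
                    </body>
                </html>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Each href must resolve to a different file; writing two result documents to the same URI in one transformation is an error.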

Many-to-one output

The many-to-one pattern collects data from all documents in the corpus and writes a single HTML output — a summary page, a statistics table, an index. The xsl:result-document fires once, outside any loop, with a hard-coded filename.

Skeleton

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:output method="html" indent="yes"/>

    <!-- 1. Declare the corpus variable -->
    <xsl:variable name="my-corpus" as="document-node()+"
        select="collection('./my-collection?select=*.xml')"/>

    <!-- 2. Named entry point -->
    <xsl:template name="xsl:initial-template">

        <!-- 3. Write to a single named file -->
        <xsl:result-document href="output/summary.html" method="html">
            <html>
                <head><title>Corpus Summary</title></head>
                <body>
                    <h1>Corpus Summary</h1>

                    <!-- Corpus-level aggregate: no loop needed -->
                    <p>Total documents: <xsl:value-of
                        select="count($my-corpus)"/></p>
                    <p>Total items: <xsl:value-of
                        select="count($my-corpus//item)"/></p>

                    <!-- Per-document breakdown: iterate with for-each -->
                    <table>
                        <tr><th>Document</th><th>Item count</th></tr>
                        <xsl:for-each select="$my-corpus">
                            <!-- [1] keeps the sort key to a single title, even if a document contains several -->
                            <xsl:sort select="(.//title)[1]"/>
                            <tr>
                                <td><xsl:value-of select="(.//title)[1]"/></td>
                                <td><xsl:value-of select="count(.//item)"/></td>
                            </tr>
                        </xsl:for-each>
                    </table>
                </body>
            </html>
        </xsl:result-document>
    </xsl:template>

</xsl:stylesheet>

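Key points

The corpus variable is declared at the top level of the stylesheet, so every template can reference it.
xsl:result-document appears once, outside any loop, with a hard-coded filename.
Corpus-level aggregates such as count($my-corpus//item) need no loop.
The xsl:for-each builds one table row per document; inside it, paths start with . so that they query only the current document.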

Many-to-many output

The many-to-many pattern produces one output file per input document. xsl:result-document moves inside an xsl:for-each loop so it fires once per iteration. The corpus variable and xsl:initial-template are identical to the many-to-one pattern.

Skeleton

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:output method="html" indent="yes"/>

    <!-- 1. Declare the corpus variable -->
    <xsl:variable name="my-corpus" as="document-node()+"
        select="collection('./my-collection?select=*.xml')"/>

    <!-- 2. Named entry point -->
    <xsl:template name="xsl:initial-template">

        <!-- 3. Loop over the corpus -->
        <xsl:for-each select="$my-corpus">

            <!-- 4. Write one file per document -->
            <xsl:result-document href="output/doc_{position()}.html" method="html">
                <html>
                    <head>
                        <title><xsl:value-of select="(.//title)[1]"/></title>
                    </head>
                    <body>
                        <h1><xsl:value-of select="(.//title)[1]"/></h1>
                        <xsl:apply-templates select=".//item"/>
                    </body>
                </html>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>

    <!-- 5. Match templates work exactly as in a single-document stylesheet -->
    <xsl:template match="item">
        <p><xsl:apply-templates/></p>
    </xsl:template>

</xsl:stylesheet>

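Key points

xsl:result-document sits inside the xsl:for-each, so it fires once per input document.
The href must be different for every document; here position() supplies a unique number.
xsl:apply-templates inside the loop hands the current document's item elements to ordinary match templates.
Nothing is written to the primary output tree, so oXygen may show an empty-result warning; this is expected.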

Dynamic filenames

Using position() — simple and transparent

position() returns the position of the current node in the sequence being iterated. Inside an xsl:for-each over the corpus, it gives each document a unique number. Use it in an AVT inside the href:

<xsl:result-document href="output/doc_{position()}.html" method="html">

This produces doc_1.html, doc_2.html, etc. The filenames are not descriptive of their content, but the mechanism is completely transparent and easy to reason about.
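One optional refinement, sketched below, is to zero-pad the number so the files sort correctly in a file browser, and to sort the corpus by URI so that a given input gets the same number on every run. Both are my additions, not requirements of the pattern. The sketch assumes the ./my-collection directory used above and assumes that the order collection() returns is not necessarily alphabetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0">

    <xsl:variable name="my-corpus" as="document-node()+"
        select="collection('./my-collection?select=*.xml')"/>

    <xsl:template name="xsl:initial-template">
        <xsl:for-each select="$my-corpus">
            <!-- Sort by document URI; position() then reflects the sorted order -->
            <xsl:sort select="string(base-uri())"/>
            <!-- The picture '000' pads to three digits: doc_001.html, doc_002.html, ... -->
            <xsl:result-document
                href="output/doc_{format-number(position(), '000')}.html" method="html">
                <html>
                    <head><title>Document <xsl:value-of select="position()"/></title></head>
                    <body>
                        <p>Source: <xsl:value-of select="base-uri()"/></p>
                    </body>
                </html>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Printing the source URI in each page makes it easy to check which input ended up under which number.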

Derived from the input filename — descriptive

To name each output file after its input file — so that myfile.xml produces myfile.html — declare a variable inside the loop that computes the filename from the input document's URI:

<xsl:variable name="filename"
    select="substring-before(
        tokenize(base-uri(), '/')[last()],
        '.xml') || '.html'"/>

Then reference it in the href with an AVT:

<xsl:result-document href="output/{$filename}" method="html">

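Breaking down the expression:

base-uri() returns the URI of the current input document, for example file:/path/to/my-collection/myfile.xml.
tokenize(..., '/') splits that URI into its path segments at each slash.
[last()] keeps only the final segment, which is the filename: myfile.xml.
substring-before(..., '.xml') strips the extension, leaving myfile.
|| '.html' appends the new extension, giving myfile.html.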

Because this variable is declared inside the xsl:for-each loop, it is re-evaluated for each document in turn — each iteration gets its own value of $filename.
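Put together, the filename variable and the href might sit in the loop as follows. This is a sketch under the same assumptions as the skeletons above (the ./my-collection directory, TEI-namespaced title and item elements). It splits the expression across two variables for readability; the intermediate name input-name is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

    <xsl:variable name="my-corpus" as="document-node()+"
        select="collection('./my-collection?select=*.xml')"/>

    <xsl:template name="xsl:initial-template">
        <xsl:for-each select="$my-corpus">
            <!-- Step 1: last path segment of the document URI, e.g. myfile.xml -->
            <xsl:variable name="input-name" as="xs:string"
                select="tokenize(base-uri(), '/')[last()]"/>
            <!-- Step 2: swap the extension, e.g. myfile.html -->
            <xsl:variable name="filename" as="xs:string"
                select="substring-before($input-name, '.xml') || '.html'"/>

            <xsl:result-document href="output/{$filename}" method="html">
                <html>
                    <head><title><xsl:value-of select="$input-name"/></title></head>
                    <body>
                        <h1><xsl:value-of select="(.//title)[1]"/></h1>
                        <p>Items: <xsl:value-of select="count(.//item)"/></p>
                    </body>
                </html>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Both variables live inside the xsl:for-each, so each output file gets the name computed from its own input document.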

Quick reference

The two output patterns compared

| | Many-to-one | Many-to-many |
|---|---|---|
| Entry point | xsl:initial-template | xsl:initial-template |
| Corpus variable | collection() | collection() |
| xsl:result-document | once, hard-coded filename | inside loop, dynamic filename |
| Primary output tree | used by result-document | empty (warning, not error) |
| xsl:for-each purpose | one row or section per document | one output file per document |

Common errors

| Symptom | Cause | Fix |
|---|---|---|
| Cryptic collection() error | Non-XML file in the directory | Add ?select=*.xml to the path |
| Wrong output / stylesheet ignores corpus | XML input not set to (None) in oXygen | Set XML dropdown to (None) |
| XPath inside loop returns more than the current context | Path starts with // (the document root) or with $my-corpus instead of . | Use .//element, not //element or $my-corpus//element |
| Empty result warning in oXygen | Nothing written to primary output tree | Not an error; expected in many-to-many |
| Output file produced but body is empty | Document doesn't contain the queried element | Expected batch behavior; add xsl:if guard if needed |
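The xsl:if guard mentioned in the last row could look like the sketch below, which skips any document that contains no item element, so no empty page is written for it. It assumes the same ./my-collection directory and TEI-namespaced item elements as the skeletons above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

    <xsl:variable name="my-corpus" as="document-node()+"
        select="collection('./my-collection?select=*.xml')"/>

    <xsl:template name="xsl:initial-template">
        <xsl:for-each select="$my-corpus">
            <!-- Guard: only write a file when the document has at least one item -->
            <xsl:if test="exists(.//item)">
                <xsl:result-document href="output/doc_{position()}.html" method="html">
                    <html>
                        <head><title>Items</title></head>
                        <body>
                            <xsl:apply-templates select=".//item"/>
                        </body>
                    </html>
                </xsl:result-document>
            </xsl:if>
        </xsl:for-each>
    </xsl:template>

    <xsl:template match="item">
        <p><xsl:apply-templates/></p>
    </xsl:template>
</xsl:stylesheet>
```

Because position() still counts the skipped documents, the output numbering will have gaps wherever a document was filtered out.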

collection() syntax

| Use case | Syntax |
|---|---|
| Flat directory, XML files only | collection('./my-dir?select=*.xml') |
| Recursive (nested subdirectories), XML only | collection('./my-dir?recurse=yes;select=*.xml') |