Rewrite HTML attributes after parsing

To rewrite anchors, inject attributes, normalize URLs, or strip sentinels in already-rendered HTML, implement IHtmlResponseRewriter. All registered rewriters share a single AngleSharp parse and work on the same IDocument, so the page is parsed once no matter how many rewriters run. For non-HTML response types (JSON, plain text), or for work that needs the final byte stream, see Transform the response body on every page instead.

The recipe references examples/ExtensibilityLabExample/AnchorLowercaseRewriter.cs, which exercises both phases of the contract against a bare AddPennington host.

Before you begin

An existing Pennington site that renders HTML pages. If you do not have one yet, see Create your first Pennington site.
A clear sense of which phase fits the edit: a non-HTML token (something not valid HTML structure, like <xref:uid> or a sentinel comment) belongs in PreParseAsync; anything queryable by selectors belongs in ApplyAsync.

Write the rewriter

Implement Pennington.Infrastructure.IHtmlResponseRewriter as a sealed class. Three methods carry the contract:

ShouldApply runs per-response; return false to skip both phases when the content-type, path, or headers mean there is nothing to do. The example narrows to text/html responses so non-HTML endpoints (search index JSON, llms.txt) bypass the rewriter entirely.
PreParseAsync receives the raw HTML string and returns the string to parse. Use it only when the target construct is not valid HTML structure — raw <xref:uid> tags are the canonical shipped example. Return the input unchanged when there is nothing to do.
ApplyAsync receives the already-parsed IDocument shared by every rewriter — query with QuerySelectorAll, mutate attributes and text, and return. Do not re-serialize or reparse.

csharp

namespace ExtensibilityLabExample;
  
using AngleSharp.Dom;
using AngleSharp.Html.Dom;
using Microsoft.AspNetCore.Http;
using Pennington.Infrastructure;
  
/// <summary>
/// Implements <see cref="IHtmlResponseRewriter"/> and demonstrates both
/// halves of the contract:
/// <list type="bullet">
/// <item><description><see cref="PreParseAsync"/> runs a cheap string
///   replace over the raw HTML before AngleSharp parses it. We use it to
///   strip the <c>&lt;!--LOWERCASE-SENTINEL--&gt;</c> comment — the kind
///   of pre-parse cleanup a real rewriter does for non-HTML tokens like
///   <c>&lt;xref:uid&gt;</c>.</description></item>
/// <item><description><see cref="ApplyAsync"/> walks the parsed document
///   and lowercases the text content of every <c>&lt;a&gt;</c> tag
///   marked <c>data-lowercase</c>.</description></item>
/// </list>
/// <para>
/// <see cref="Order"/> is 500 — after the shipped xref (10), locale (20),
/// and base-URL (30) rewriters so our pass sees already-resolved hrefs.
/// </para>
/// <para>
/// Backs how-to 2.3.50 <c>/how-to/extensibility/html-rewriter</c>.
/// </para>
/// </summary>
public sealed class AnchorLowercaseRewriter : IHtmlResponseRewriter
{
    public int Order => 500;
  
    public bool ShouldApply(HttpContext context)
    {
        var contentType = context.Response.ContentType;
        return contentType is not null
               && contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase);
    }
  
    /// <summary>
    /// Pre-parse pass. Strip the sentinel comment so it is gone before
    /// AngleSharp runs. A string replace is the right tool when the
    /// target construct is not valid HTML structure (raw <c>&lt;xref&gt;</c>
    /// tags are the canonical example shipped with Pennington).
    /// </summary>
    public Task<string> PreParseAsync(string html, HttpContext context)
    {
        if (!html.Contains("<!--LOWERCASE-SENTINEL-->", StringComparison.Ordinal))
        {
            return Task.FromResult(html);
        }
  
        return Task.FromResult(html.Replace("<!--LOWERCASE-SENTINEL-->", string.Empty, StringComparison.Ordinal));
    }
  
    /// <summary>
    /// DOM pass. Walk the parsed document, find every <c>&lt;a&gt;</c>
    /// with <c>data-lowercase</c>, lowercase its text content.
    /// </summary>
    public Task ApplyAsync(IDocument document, HttpContext context)
    {
        foreach (var element in document.QuerySelectorAll("a[data-lowercase]"))
        {
            if (element is not IHtmlAnchorElement anchor)
            {
                continue;
            }
  
            if (string.IsNullOrEmpty(anchor.TextContent))
            {
                continue;
            }
  
            anchor.TextContent = anchor.TextContent.ToLowerInvariant();
        }
  
        return Task.CompletedTask;
    }
}
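
The example above changes text content. For the attribute case, the following is a minimal sketch of a second rewriter that adds rel="noopener noreferrer" to external links. ExternalLinkRelRewriter, its Order of 510, and the rule that any absolute http(s) href counts as external are illustrative assumptions, not part of Pennington or the example project. The sketch also assumes the interface has only the members the example above implements.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp.Dom;
using Microsoft.AspNetCore.Http;
using Pennington.Infrastructure;

namespace ExtensibilityLabExample;

// Hypothetical rewriter for illustration; not shipped with Pennington.
public sealed class ExternalLinkRelRewriter : IHtmlResponseRewriter
{
    // Above 60 so the shipped link transforms have already produced final hrefs.
    public int Order => 510;

    public bool ShouldApply(HttpContext context)
    {
        var contentType = context.Response.ContentType;
        return contentType is not null
               && contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase);
    }

    // Nothing needs fixing before the parse, so return the markup unchanged.
    public Task<string> PreParseAsync(string html, HttpContext context)
        => Task.FromResult(html);

    public Task ApplyAsync(IDocument document, HttpContext context)
    {
        foreach (var anchor in document.QuerySelectorAll("a[href]"))
        {
            var href = anchor.GetAttribute("href");
            if (href is null || !IsExternal(href))
            {
                continue;
            }

            // Mutate the attribute on the shared document; no reparse or re-serialize.
            anchor.SetAttribute("rel", "noopener noreferrer");
        }

        return Task.CompletedTask;
    }

    // Simplifying assumption: only absolute http(s) URLs are treated as external.
    private static bool IsExternal(string href) =>
        href.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
        || href.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
}
```

Because the check runs on the href as it stands when ApplyAsync is called, a value above 60 is a reasonable default here: the rewriter then sees links after the shipped transforms have resolved them.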

Pick an Order value

The shipped rewriters occupy Order values from 10 (xref resolution) through 60 (the last built-in transform); xref resolution, locale prefixing, and base-URL prefixing run in that relative order because each produces the link form the next one consumes. Pick above 60 to run after every shipped transform, below 10 to run before xref resolution, or between the built-ins only when that placement is deliberate. For the exact Order of each shipped rewriter, see Pennington.Infrastructure.IHtmlResponseRewriter. The example uses 500 so anchors are lowercased after every shipped transform has run.
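
To see where a chosen value lands, the sketch below sorts the Order values mentioned on this page together with one hypothetical custom value. It assumes rewriters run in ascending Order, which is what the example's placement at 500 implies. The names are descriptive labels, not Pennington type names, and the value 5 belongs to a made-up rewriter.

```csharp
using System;
using System.Linq;

// Order values as stated on this page. The names are descriptive labels only.
var rewriters = new (string Name, int Order)[]
{
    ("AnchorLowercaseRewriter (this page's example)", 500),
    ("base-URL prefixing (shipped)", 30),
    ("hypothetical rewriter that must see raw xref tags", 5),
    ("xref resolution (shipped)", 10),
    ("last built-in transform (shipped)", 60),
    ("locale prefixing (shipped)", 20),
};

// Assumption: the pipeline runs rewriters in ascending Order,
// regardless of the order in which they were registered.
var ordered = rewriters.OrderBy(r => r.Order).ToArray();

foreach (var (name, order) in ordered)
{
    Console.WriteLine($"{order,3}  {name}");
}
```

The listing prints the hypothetical value 5 first and the example's 500 last, with the shipped transforms in between.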

Register the rewriter

Pennington picks up every registered IHtmlResponseRewriter and runs them in ascending Order, so a single registration next to the host wiring is sufficient. Use the lifetime that matches your dependencies: AddSingleton for stateless rewriters, AddTransient (or AddFileWatched) when the rewriter captures file-watched state.

csharp

builder.Services.AddSingleton<IHtmlResponseRewriter, AnchorLowercaseRewriter>();

Configure the shipped word-break rewriter

One shipped rewriter you configure rather than implement is the word-break rewriter. AddWordBreak turns it on; it inserts <wbr> break opportunities into long identifiers so dotted namespaces and PascalCase names wrap inside narrow columns instead of overflowing.

csharp

builder.Services.AddWordBreak(options =>
{
    options.CssSelector = "h1, h2, h3, h4, h5, h6, span, .text-break";
    options.MinimumCharacters = 20;
});

A heading like Pennington.Infrastructure.WordBreakOptions then renders with breaks after each dot and before each interior case boundary:

Before:

html

<h3>Pennington.Infrastructure.WordBreakOptions</h3>

After:

html

<h3>Pennington.<wbr>Infrastructure.<wbr>Word<wbr>Break<wbr>Options</h3>

For every option and its default, see Pennington.Infrastructure.WordBreakOptions.
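
To make the break rule concrete, here is a rough approximation in plain C# of the rule as worded above: a break after each dot and before each interior case boundary. It is not Pennington's implementation. InsertBreaks and its regular expressions are hypothetical, the sketch ignores CssSelector scoping, and it assumes MinimumCharacters means that shorter text is left untouched.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Pennington: approximates the break rule described above.
static string InsertBreaks(string text, int minimumCharacters)
{
    // Assumed meaning of MinimumCharacters: shorter text is returned unchanged.
    if (text.Length < minimumCharacters)
    {
        return text;
    }

    // Break opportunity after each dot that is followed by another character.
    var afterDots = Regex.Replace(text, @"\.(?=.)", ".<wbr>");

    // Break opportunity before an uppercase letter that directly follows a lowercase letter.
    return Regex.Replace(afterDots, @"(?<=[a-z])(?=[A-Z])", "<wbr>");
}

Console.WriteLine(InsertBreaks("Pennington.Infrastructure.WordBreakOptions", 20));
// Pennington.<wbr>Infrastructure.<wbr>Word<wbr>Break<wbr>Options

Console.WriteLine(InsertBreaks("Short.Name", 20));
// Short.Name
```

A real rewriter would apply this kind of rule to text nodes inside the elements matched by CssSelector rather than to raw strings.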

Result

Anchors marked data-lowercase render with lowercased text content, and the sentinel comment no longer appears in the page source.