
HermitCrab FST acceleration: sound, fast, verify-by-re-analysis analyzer (+ grammar advisor) #441

Open
johnml1135 wants to merge 51 commits into master from fst-advisor


johnml1135 commented Jun 26, 2026

Collaborator

What

Accelerates HermitCrab morphological analysis with a precompiled FST, behind a caching front end
that keeps the engine as the source of truth. No second morphology engine, no reimplemented constraints.

Entry point — CachingMorphologicalAnalyzer (fast + slow + cache):

  • Default AnalyzeWord = guaranteed complete (backwards-compatible). On a certified grammar
    (FST-closed per the census and FST==engine set-parity over a corpus) the FST is proven complete,
    so this runs FST-only with no full search. Otherwise it returns the cached engine result, or runs
    the engine on a miss and caches it. Either way: complete.
  • AnalyzeWordFast = opt-in immediate. Cached-complete if warm (or if the grammar is certified),
    else a sound but possibly under-generating verified-FST result, flagged IsComplete=false. Never
    runs the engine.
  • Warm(corpus) fills the cache in parallel; AnalysisCacheSerializer persists it across
    sessions (fixed corpora), keyed by MorphemeRegistry and guarded by a grammar-version string
    (stale cache rejected → re-warm). Confirmed non-words are cached too.
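
The three entry points above can be pictured with a minimal sketch. Everything here is a hypothetical stand-in, not the PR's code: `CachingAnalyzerSketch`, the delegate-based engine and verified-FST inputs, and the use of plain strings for analyses are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for the caching front end; analyses are plain strings here.
public sealed class CachingAnalyzerSketch
{
    private readonly Func<string, IReadOnlyList<string>> _engine;      // complete but slow
    private readonly Func<string, IReadOnlyList<string>> _verifiedFst; // sound, may under-generate
    private readonly bool _certified;                                  // FST proven complete for this grammar
    private readonly ConcurrentDictionary<string, IReadOnlyList<string>> _cache =
        new ConcurrentDictionary<string, IReadOnlyList<string>>();
    private int _engineCalls;

    public CachingAnalyzerSketch(
        Func<string, IReadOnlyList<string>> engine,
        Func<string, IReadOnlyList<string>> verifiedFst,
        bool certified)
    {
        _engine = engine;
        _verifiedFst = verifiedFst;
        _certified = certified;
    }

    public int EngineCalls => _engineCalls;

    // Default path: always complete. A certified grammar skips the engine entirely.
    public IReadOnlyList<string> AnalyzeWord(string word)
    {
        if (_certified)
            return _verifiedFst(word);
        // An empty result (a confirmed non-word) is cached like any other result.
        // Under contention GetOrAdd may invoke the factory more than once.
        return _cache.GetOrAdd(word, RunEngine);
    }

    // Opt-in fast path: never runs the engine; reports whether the result is known complete.
    public (IReadOnlyList<string> Analyses, bool IsComplete) AnalyzeWordFast(string word)
    {
        if (_certified)
            return (_verifiedFst(word), true);
        if (_cache.TryGetValue(word, out IReadOnlyList<string> cached))
            return (cached, true);
        return (_verifiedFst(word), false);
    }

    private IReadOnlyList<string> RunEngine(string word)
    {
        Interlocked.Increment(ref _engineCalls);
        return _engine(word);
    }
}
```

In this sketch a cold `AnalyzeWordFast` call returns the verified-FST result with `IsComplete` false; once `AnalyzeWord` has cached the engine result for the same word, the fast call returns it with `IsComplete` true.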

The FST pipeline behind the fast path: FstTemplateAnalyzer (proposer; immutable, shared;
derivation depth tunable) → VerifiedFstAnalyzer (confirms each candidate by restricted re-analysis,
FstReplay, against HC's own engine from a MorpherPool; emits the genuine HC analysis) →
CompleteHybridMorpher (certified→FST / else→engine, with per-word AnalyzeWord(word, useFst)).
GrammarFstAdvisor + GrammarFstClosure are the grammar census/linter (this PR's original core).
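
The propose-then-verify step can be sketched as a filter. This is an illustration under assumptions: `ProposeAndVerifySketch` and both delegates are hypothetical, candidates are bare morpheme-id arrays, and `reanalyzeRestricted` stands in for an engine run restricted to one candidate's morphemes.

```csharp
using System;
using System.Collections.Generic;

public static class ProposeAndVerifySketch
{
    // propose: fast candidate generator that may over-generate.
    // reanalyzeRestricted: hypothetical engine run limited to one candidate's morphemes;
    // it returns the engine's own analyses, or nothing if the candidate is invalid.
    public static List<string> Analyze(
        string word,
        Func<string, IEnumerable<string[]>> propose,
        Func<string, string[], IEnumerable<string>> reanalyzeRestricted)
    {
        var confirmed = new List<string>();
        var seen = new HashSet<string>();
        foreach (string[] candidate in propose(word))
        {
            foreach (string analysis in reanalyzeRestricted(word, candidate))
            {
                // Emit the engine's analysis, not the proposer's guess, and deduplicate.
                if (seen.Add(analysis))
                    confirmed.Add(analysis);
            }
        }
        return confirmed;
    }
}
```

Because only engine-confirmed analyses are emitted, a wrong proposal costs time but cannot produce a false positive; a proposal the FST never makes is simply absent, which is why the result can under-generate.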

Guarantees

  • Correctness equals the engine — the cache never invents or hides an analysis; the default path is
    always complete. The fast path is sound (0 false positives on 50 generated non-words), a yes-only
    detector for "is this a word" (can under-generate, even to zero on un-built constructs), which the
    cache/certification corrects.
  • Proven-complete → no full search. A certified grammar never runs the engine; on the 60-word Sena
    corpus that certifies, the default complete path runs at ~18 ms/word (~11×).
  • Fast — ~13× on Sena's FST path; 98% of words covered by the fast path (the rest resolve via
    engine/cache, no silent miss).
  • Thread-safe — shared proposer + pooled engine + concurrent cache; 0 parallel-vs-sequential
    mismatches on 200 words (and a CI test).

Tests

CI unit tests on the in-repo toy grammar cover the proposer, the verify chain, the caching/persistence
layer (default==engine, provisional→complete after warm, certified-skip, round-trip + version guard),
soundness/negatives, the category fix, per-word opt-out, and thread-safety; plus advisor/closure. An
[Explicit] benchmark measures speed/parity/soundness/concurrency/certification on an external grammar.
Full unit suite green (93).

Honest limitations / out of scope

  • ~2% of Sena words use constructs the FST proposer doesn't build yet — compounding (the main one),
    depth-3 derivation, one suffix-order case. They resolve via the engine/cache (no silent miss) and keep
    the full corpus from certifying. Compounding is the highest-value next coverage build (an additive,
    shared-root-chain design) and is scoped as a focused follow-on.
  • The per-stem completeness proof (proving the fast path complete without the engine) was explored
    and abandoned; completeness is delivered by certification + cache + engine.
  • Deferred: the generator (reverse/synthesis direction); compounding; a 2-way-FST treatment of the
    residual.

Design + research record: docs/HERMITCRAB_FST_PLAN.md (§13 = caching front end); advisor:
docs/HERMITCRAB_FST_ADVISOR.md.

🤖 Generated with Claude Code



johnml1135 and others added 6 commits June 25, 2026 20:13
Tech stack: build on SIL.Machine's own Fst (already has Compose/Determinize/Minimize/
Intersect + unification arcs; RootAllomorphTrie precedent) rather than external OpenFst/Foma
(interop + no native feature-structure support). Graceful degradation via census-chosen
tiers: fully-FS grammars -> transducer-only; partial -> FST + per-word search fallback at
non-FS escapes; pervasively-non-FS -> existing search (no regression). Soundness contract +
verification mode. Phased plan gated on a Sena compile-and-verify spike.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…te FST

A grammar evolves; one new rule can quietly push it from the fast finite-state
path into the slow combinatorial search. GrammarFstAdvisor.Analyze(Language)
walks every rule and emits per-rule advisories with severity (Escape = breaks
FST, Cost = inflates search, Info), a one-line issue, and an actionable
write-up (how to constrain it / what to try instead), plus an overall tier
verdict. This is the "one new rule blew up the grammar" guard: a new Escape
that flips the tier names the offending rule and explains the fix.

Classifier: reduplication (a part copied >=2x via CopyFromInput) = Escape;
stem-split/infixation (>=2 copies of different parts) = Escape; unbounded
rewrite environment (Quantifier MaxOccur == Infinite) = Escape; deletion
(LHS longer than RHS) = Cost; many allomorphs = Cost; ModifyFromInput,
bounded rewrite rule, metathesis, compounding = Info. The report also states how
many affix/phonological/compounding rules were examined (clean ones produce no
advisory) so "fully FST-able" is backed by inspection counts.
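
The classifier table above maps directly onto a switch. The following is a minimal sketch, not the advisor's code: `SeveritySketch`, `ConstructSketch`, and `AdvisorSketch` are hypothetical names, and the two-way `Tier` function is an assumption that only models the "any Escape means Tier 2" behaviour described in the tests.

```csharp
using System.Collections.Generic;
using System.Linq;

public enum SeveritySketch { Info, Cost, Escape }

// Hypothetical labels for the constructs the commit message names.
public enum ConstructSketch
{
    Reduplication, Infixation, UnboundedRewriteEnvironment,
    Deletion, ManyAllomorphs,
    ModifyFromInput, BoundedRewrite, Metathesis, Compounding
}

public static class AdvisorSketch
{
    public static SeveritySketch Classify(ConstructSketch construct)
    {
        switch (construct)
        {
            case ConstructSketch.Reduplication:
            case ConstructSketch.Infixation:
            case ConstructSketch.UnboundedRewriteEnvironment:
                return SeveritySketch.Escape; // breaks the FST path
            case ConstructSketch.Deletion:
            case ConstructSketch.ManyAllomorphs:
                return SeveritySketch.Cost;   // inflates the search
            default:
                return SeveritySketch.Info;
        }
    }

    // Assumed verdict: Tier 1 when nothing escapes, Tier 2 otherwise.
    public static int Tier(IEnumerable<ConstructSketch> constructs) =>
        constructs.Any(c => Classify(c) == SeveritySketch.Escape) ? 2 : 1;
}
```

The sketch leaves out the "pervasively non-FS" tier because the source does not state the threshold at which a grammar drops to it.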

Validated on real Sena grammar: examined 19 affix + 8 compounding, 0
phonological -> Tier 1, 0 escapes (matches the grammar census; no false
positives). Tests: concatenative grammar -> Tier 1; add a reduplication rule
-> flagged Escape with write-up + tier downgrade to Tier 2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The infixation check flagged any allomorph with >=2 CopyFromInput of different
parts as Escape, but a plain suffix/circumfix over a split stem (copy "1",
copy "2", insert) has contiguous copies and is fully FST-able. True infixation
is signalled by inserted material BETWEEN two copies (copy...insert...copy);
HasInfixedCopy now detects exactly that. Added tests: a contiguous split-stem
suffix stays Tier 1 (no false escape) and a real copy-insert-copy infix is
flagged Escape.
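
The copy...insert...copy test can be sketched as a single pass over an allomorph's output actions. This is an illustration only: `ActionKindSketch` and `InfixCheckSketch` are hypothetical, and the real `HasInfixedCopy` works on HermitCrab's own action types rather than this two-value enum.

```csharp
using System.Collections.Generic;

public enum ActionKindSketch { Copy, Insert }

public static class InfixCheckSketch
{
    // True only when inserted material sits between two copies of input material.
    public static bool HasInfixedCopy(IReadOnlyList<ActionKindSketch> rhs)
    {
        bool seenCopy = false;
        bool insertAfterCopy = false;
        foreach (ActionKindSketch action in rhs)
        {
            if (action == ActionKindSketch.Copy)
            {
                if (insertAfterCopy)
                    return true; // copy ... insert ... copy
                seenCopy = true;
            }
            else if (seenCopy)
            {
                insertAfterCopy = true;
            }
        }
        return false;
    }
}
```

Under this check a split-stem suffix (copy, copy, insert) and a circumfix (insert, copy, insert) are not flagged, while copy, insert, copy is.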

Also label each advisory with its stratum (rules can appear in more than one),
which clarifies the Sena report: its 8 compounding rules (mrule1-8, 4 names
reused in pairs) all live in the 'Morphology' stratum -- genuine distinct
rules, not a re-walk. Sena verdict unchanged: Tier 1, 0 escapes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…opaque

An infix/reduplication escape can be un-applied per word by a cheap
strip-and-reparse probe (remove the candidate affix, re-parse the residue with
the FST) ONLY if nothing downstream rewrites the affixed span. Add the static
soundness test: an escape in stratum i is "probe-able" iff no phonological rule
runs at stratum i or later (surface-invariant); otherwise "opaque" and the
search backstop is required. Sound-conservative: presence of any later
phonological rule => opaque.
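
The static probe-ability test reduces to a scan over strata. A minimal sketch, assuming strata are indexed in application order (0 = innermost) and that the caller supplies a per-stratum count of phonological rules; `ProbeSketch` and its parameter names are hypothetical.

```csharp
using System.Collections.Generic;

public static class ProbeSketch
{
    // phonologicalRuleCounts[s] = number of phonological rules in stratum s (assumed 0 = innermost).
    // An escape is probe-able iff no phonological rule runs at its stratum or any later one.
    public static bool IsProbeable(int escapeStratum, IReadOnlyList<int> phonologicalRuleCounts)
    {
        for (int s = escapeStratum; s < phonologicalRuleCounts.Count; s++)
        {
            if (phonologicalRuleCounts[s] > 0)
                return false; // something may rewrite the affixed span: opaque
        }
        return true;
    }
}
```

The check is deliberately conservative: it does not ask whether the later rule could actually touch the affixed span, only whether one exists.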

GrammarAdvisory.Probeable (bool?) records it; the report counts
ProbeableEscapeCount / OpaqueEscapeCount and, when every escape is probe-able,
reports a "Tier 2+" verdict (a per-word probe recovers the fast path,
effectively Tier 1 with no search backstop). Escape advice now spells out the
probe and why it is or isn't sound.

Tests: reduplication with no later phonology => probe-able (Tier 2+); the same
rule with a later-stratum rewrite rule => opaque (plain Tier 2 hybrid). Sena
unchanged (Tier 1, 0 escapes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…arning

Add GrammarAdvisory.Regular (bool?): does an FST exist for this construct in
principle? By Kaplan & Kay (1994) a directional context-sensitive rewrite rule
is a regular relation however long its environment, so harmony/spreading and
bounded reduplication and infixation are regular and FST-reclaimable; only
whole-stem (unbounded) copy is genuinely non-regular.

Crucially this is kept ORTHOGONAL to severity: the FST compiler that turns
"regular" into "fast" is not built yet, so severity still means "slow in
today's engine" and is UNCHANGED -- every current escape stays an escape. A
harmony rule still warns (escape present, not Tier 1); Regular only adds a
separate reclaim-path note ("FST-reclaimable once the compiler exists; slow
today"). The report prints RegularEscapeCount / NonRegularEscapeCount and a
reclaim-path line; the tier verdict is NOT upgraded by regularity.

Detection: reduplication regularity from the copied part's Lhs pattern
boundedness (unbounded/unresolved -> non-regular, conservative); infix regular
(pattern-defined slot); unbounded-environment rewrite regular iff its own
Lhs/Rhs are bounded. Also fixed a latent tier bug (Probeable==null phonological
escapes were counted as "all probe-able") and removed the present-tense
"effectively Tier 1" claim from the Tier 2+ string -- the probe runtime is also
unbuilt, so both reclaim axes now read "would recover ... once it exists; slow
today".

Tests: harmony rewrite stays Escape + Regular (headline still warns);
unbounded-copy redup => non-regular; bounded reduplicant + infix => regular.
Sena unchanged (Tier 1). 6 advisor + 69 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Adds a new static “FST-readiness” grammar linter for the HermitCrab parser (GrammarFstAdvisor.Analyze(Language)) that walks a compiled grammar, emits per-rule advisories (Escape/Cost/Info + reclaim notes like Regular/Probeable), and produces an overall tier verdict intended for authoring-time/CI use. This lays groundwork for future FST compilation work by making “what blocks FST / what is slow today” visible and actionable.

Changes:

  • Introduces GrammarFstAdvisor, GrammarFstReport, and GrammarAdvisory to classify expensive/non-FST-able constructs across morphological and phonological rules.
  • Adds NUnit tests covering concatenative cases, reduplication (bounded/unbounded), infixation, rewrite-rule harmony behavior, and opacity/probe-ability.
  • Adds planning docs for the advisor and the broader HermitCrab FST acceleration roadmap, plus an explicit local benchmark test for running the advisor on an external grammar.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarFstAdvisorTests.cs | Adds coverage for the advisor’s tiering and key escape classifications (reduplication/infix/harmony/opacity). |
| tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarFstAdvisorBenchmark.cs | Adds an [Explicit] helper test to run and print the advisor report on an external HC XML grammar. |
| src/SIL.Machine.Morphology.HermitCrab/GrammarFstAdvisor.cs | Implements the advisor, report model, and the core static analyses for affix and phonological rules. |
| HERMITCRAB_FST_PLAN.md | Documents the planned FST compiler/runtime approach, tiered hybrid design, and decision gate. |
| fst.md | Documents the advisor’s classification rules, tier model, and the orthogonal Regular/Probeable axes. |


Comment on lines +111 to +129
Advisories = advisories;
AffixRulesExamined = affixRulesExamined;
PhonologicalRulesExamined = phonologicalRulesExamined;
CompoundingRulesExamined = compoundingRulesExamined;
EscapeCount = advisories.Count(a => a.Severity == GrammarAdvisorySeverity.Escape);
CostCount = advisories.Count(a => a.Severity == GrammarAdvisorySeverity.Cost);
InfoCount = advisories.Count(a => a.Severity == GrammarAdvisorySeverity.Info);
ProbeableEscapeCount = advisories.Count(a =>
    a.Severity == GrammarAdvisorySeverity.Escape && a.Probeable == true
);
OpaqueEscapeCount = advisories.Count(a =>
    a.Severity == GrammarAdvisorySeverity.Escape && a.Probeable == false
);
RegularEscapeCount = advisories.Count(a =>
    a.Severity == GrammarAdvisorySeverity.Escape && a.Regular == true
);
NonRegularEscapeCount = advisories.Count(a =>
    a.Severity == GrammarAdvisorySeverity.Escape && a.Regular != true
);
Comment on lines +270 to +277
foreach (IMorphologicalRule mrule in stratum.MorphologicalRules)
{
    switch (mrule)
    {
        case AffixProcessRule affix:
            affixExamined++;
            AnalyzeAffix(affix, stratum.Name, surfaceInvariant, advisories, manyAllomorphsThreshold);
            break;
johnml1135 and others added 20 commits June 25, 2026 20:51
The analyzer transducer must emit the structured derivation (ordered morphemes
+ root), not just accept/reject, or it is a recognizer not an analyzer. Define
the compact output token: high 8 bits = MorphOp (role/operation: Root, Prefix,
Suffix, Infix, Reduplication, Circumfix*, Compound, Clitic, Process, Null), low
24 bits = morpheme index into the grammar's morpheme table. An accepting path's
output is the uint[] of these tokens, which IS the analysis and is
self-describing: Morphemes = indices in array order; RootMorphemeIndex = the
Root token's position (no separate field).

Verdict on the proposed 8+24 packing: sound and the right compactness choice
(4 bytes/morph, hashable, columnar). 24-bit ceiling = 16,777,215 morphemes
(ample; compiler asserts). Refinement baked into the schema: keep the 32-bit
word as the pure (op, morpheme) derivation and DON'T overload it with surface
segmentation or allomorph identity -- those are optional parallel channels.

MorphToken codec (Encode/GetOp/GetMorphemeId/RootIndex) + bounds check, plus
HERMITCRAB_FST_PLAN.md section 8 documenting the schema. 5 tests (round-trip,
out-of-range throw, distinctness, self-describing derivation array, root
recovery).
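
The 8+24 packing can be sketched in a few lines. This is an illustration, not the PR's `MorphToken`: the type names are hypothetical, only a subset of the listed operations is included, and the numeric values of the enum members are assumptions since the source names the operations but not their codes.

```csharp
using System;

// Numeric values are assumed; the source lists the operations but not their codes.
public enum MorphOpSketch : byte { Root = 1, Prefix, Suffix, Infix, Reduplication, Compound }

public static class MorphTokenSketch
{
    public const int MaxMorphemeId = 0x00FFFFFF; // 16,777,215: the 24-bit ceiling

    // High 8 bits carry the operation, low 24 bits the morpheme index.
    public static uint Encode(MorphOpSketch op, int morphemeId)
    {
        if (morphemeId < 0 || morphemeId > MaxMorphemeId)
            throw new ArgumentOutOfRangeException(nameof(morphemeId));
        return ((uint)op << 24) | (uint)morphemeId;
    }

    public static MorphOpSketch GetOp(uint token) => (MorphOpSketch)(token >> 24);

    public static int GetMorphemeId(uint token) => (int)(token & (uint)MaxMorphemeId);

    // The root position is recovered from the op channel alone; -1 if no Root token exists.
    public static int RootIndex(uint[] tokens)
    {
        for (int i = 0; i < tokens.Length; i++)
        {
            if (GetOp(tokens[i]) == MorphOpSketch.Root)
                return i;
        }
        return -1;
    }
}
```

A derivation is then just a `uint[]` in morpheme order, and the root index needs no separate field because it is the position of the token whose op is `Root`.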

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nalyses

The packed-token tests alone proved bit-packing, not schema fidelity. Add the
reference encoder MorphTokenCodec (Word -> uint[]), which mirrors
Morpher.CreateWordAnalysis (same AllomorphsInMorphOrder iteration + RootAllomorph
check) and populates the op channel from the actual rule: head root -> Root,
other stems -> Compound, affixes classified from their output actions
(reduplication / infix / prefix / suffix / process).

Round-trip tests on real parsed words now MEASURE soundness rather than assert
it:
- suffix word: decoded morphemes reproduce WordAnalysis.Morphemes in order, and
  RootIndex (recovered purely from the Root op code) == WordAnalysis.RootMorphemeIndex;
- compound (two stems): the flat array keeps both morphemes with exactly one
  Root + one Compound, matched to WordAnalysis by morpheme sequence with root
  index at parity -- confirming the flat array is at parity with WordAnalysis's
  own compound flattening (not lossy);
- ClassifyOp populates reduplication/infix/prefix/suffix from real output actions.

Resolves the two open risks on the schema: the op channel is now populated from
a real Word (not asserted), and multi-root/compound handling is verified. 75 HC
tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
FstMorpher hand-builds one acceptor (mirroring RootAllomorphTrie -- no
Compose/Determinize/Minimize, which are unsafe for HC's underspecified-feature
arcs): root segment chains from the start state, with each fixed-segment suffix
appended after every root-accepting state; accepting states map to packed
MorphToken arrays. Analysis is a single nondeterministic Transduce walk of the
surface word -- no Word clones, no generate-and-test.
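
A hand-built acceptor of this shape can be sketched without any FST library. The code below is an assumption-laden toy, not `FstMorpher`: `AcceptorSketch` is hypothetical, segments are single characters, analyses are strings instead of packed token arrays, and suffix segments are assumed non-empty.

```csharp
using System.Collections.Generic;

public sealed class AcceptorSketch
{
    // _arcs[state][segment] = target states; state 0 is the start state.
    private readonly List<Dictionary<char, List<int>>> _arcs =
        new List<Dictionary<char, List<int>>> { new Dictionary<char, List<int>>() };
    private readonly Dictionary<int, string> _accept = new Dictionary<int, string>();

    // Appends a fresh chain of states spelling `segments`; returns the last state.
    private int AddChain(int from, string segments)
    {
        int current = from;
        foreach (char c in segments)
        {
            _arcs.Add(new Dictionary<char, List<int>>());
            int next = _arcs.Count - 1;
            if (!_arcs[current].TryGetValue(c, out List<int> targets))
            {
                targets = new List<int>();
                _arcs[current][c] = targets;
            }
            targets.Add(next);
            current = next;
        }
        return current;
    }

    // Root chain from the start state, with every suffix appended after the root-accepting state.
    public void AddRoot(string segments, string rootId, IEnumerable<(string Segments, string Id)> suffixes)
    {
        int rootEnd = AddChain(0, segments);
        _accept[rootEnd] = rootId;
        foreach (var suffix in suffixes)
            _accept[AddChain(rootEnd, suffix.Segments)] = rootId + "+" + suffix.Id;
    }

    // One nondeterministic walk: keep the set of live states, advance it per segment.
    public List<string> Analyze(string word)
    {
        var active = new List<int> { 0 };
        foreach (char c in word)
        {
            var next = new List<int>();
            foreach (int state in active)
            {
                if (_arcs[state].TryGetValue(c, out List<int> targets))
                    next.AddRange(targets);
            }
            active = next;
        }
        var analyses = new List<string>();
        foreach (int state in active)
        {
            if (_accept.TryGetValue(state, out string analysis))
                analyses.Add(analysis);
        }
        return analyses;
    }
}
```

Because each root gets its own chain, two homographic roots yield two live states at the end of the walk and therefore two analyses, which is the set-equality behaviour the parity tests check.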

Verified against Morpher.AnalyzeWord on the concatenative fragment:
- bare root and root+suffix ("sags") round-trip to the same morphemes + root;
- COMPLETENESS as analysis-set equality, not "found one": homographs (dat ->
  entries 8 and 9 both found), and the negative case (no path -> both empty)
  agree with the search engine;
- an [Explicit] allocation comparison (FST walk vs search engine).

Caveat documented in FstMorpher: arcs match segments only, not the affix's
syntactic/MPR/stratum constraints, so on grammars where letters match but
constraints exclude an analysis it would over-generate -- closed by
feature-unification arcs (HERMITCRAB_FST_PLAN.md section 8). 78 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e contract)

FstMorpher now implements IMorphologicalAnalyzer.AnalyzeWord -> IEnumerable<WordAnalysis>,
the same interface Morpher implements and that consumers (FieldWorks et al.)
depend on. Morphemes and root index come from the token walk; Category is null
because this slice does not yet track syntactic features (arrives with the
unification arcs). A test drives both engines through the IMorphologicalAnalyzer
contract and asserts the WordAnalysis sets match (homographs included) on the
concatenative fragment, so the FST analyzer is a drop-in for the search engine
at the interface level.

Still scoped to the clean concatenative fragment built from explicit root/suffix
lists; full-grammar Compose compilation, phonology/allomorphy, the
feeding-closure completeness certificate, and the FieldWorks adapter remain. 79
HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…grammar

FstMorpher.FromLanguage(Language) builds the analyzer by introspecting the
grammar: every root allomorph plus every single-allomorph suffix rule
(detected via MorphTokenCodec.ClassifyOp == Suffix). It THROWS
NotSupportedException on any construct outside the concatenative root+suffix
fragment (prefixes, infixes, reduplication, compounding, templates,
multi-allomorph affixes) so it never silently under-generates — a caller learns
exactly what this slice cannot cover.

This closes the "not driven from a compiled Language" gap: the analyzer now
consumes a real Language object, not explicit root/suffix lists. Verified at
parity with Morpher (through the IMorphologicalAnalyzer contract) on the
concatenative fragment, plus a guard test that a prefix rule makes FromLanguage
refuse rather than produce a quietly incomplete analyzer. 81 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ffix?)

The acceptor now prepends a prefix segment chain before the root (mirroring the
appended suffix chains), so it covers an optional fixed-segment prefix + root +
optional fixed-segment suffix. FromLanguage auto-detects prefix rules
(ClassifyOp == Prefix) alongside suffixes and still throws on anything outside
the concatenative fragment (reduplication, infix, compounding, templates,
multi-allomorph affixes).

Verified at parity with Morpher via the IMorphologicalAnalyzer contract:
"disag" = di-(PST) + sag, and the bare root through the no-prefix branch. The
throw-guard test now uses reduplication (a genuinely non-concatenative rule).
82 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tificate)

Document the completeness/closure analysis: an FST analyzer is trustworthy only
if its silence is a proof. Completeness has two parts — "no escape applies here"
(easy) and "no normal FST step reachable from the input can FEED an escape" (the
kicker, Kiparsky feeding). The universal question is undecidable, but per grammar
it is usually decidable:

- decidable feeding-closure: for each FST-able rule F and escape E, test
  range(F) ∩ trigger(E) = ∅ via Fst.Intersect; all empty ⇒ the fragment is
  closed ⇒ "no path" is a complete certificate; non-empty + regular ⇒ fold in;
  non-empty + non-regular ⇒ those words fall to the search backstop;
- stratal containment as the practical guarantee (escapes innermost, not fed by
  the FST fragment);
- homograph completeness = all accepting paths returned, contingent on closure +
  never unsafely determinizing/minimizing unification arcs;
- the search backstop's "done" rests on a true derivation-depth bound (finite iff
  no unbounded self-feeding cycle);
- the work: a static feeding-closure pass extending GrammarFstAdvisor + corpus
  closure verification (set parity) as the gate before the FST may replace the
  search engine.

Wired into the phased plan (Phase 3 gated on §9), the risks table, and the
decision flow. Until closure is confirmed for a grammar, the FST runs in
shadow/verification mode.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…, invariant completeness)

Design in from the start a tunable partition that bounds the compiled automaton
without sacrificing completeness. Three buckets: A precompiled (eager, fast
walk, costs states), B on-the-fly (lazy on-demand composition, bounded memory),
C search/probe fallback (non-FS escapes, set by section 9 closure not the knob).
The A<->B boundary is the knob, with a safe floor (everything lazy = bounded +
complete).

Why completeness is INVARIANT under the knob: composition is associative, so
precompiling A.B vs applying B lazily after A denotes the same relation — the
split changes when work happens, never which analyses exist; the walk enumerates
all paths in either bucket; and closure (section 9) is computed on the full
A.B relation. So the knob is a pure space/time dial; the analysis set does not
move.

The policy is per-language (yes, it differs): rank layers by state-multiplier x
corpus hotness, precompile cheap-and-hot, keep expensive-and-cold lazy, auto-
demote A->B under a state/memory budget. Same construct can be eager in one
project and lazy in another -> pluggable policy + optional auto-tuner.

Designed-in requirements: compiler is a pipeline of self-contained composable
layers (each with state-multiplier/hotness/closure metadata) behind one
eager-or-lazy interface; analyzer walks the eager core and lazily expands B
layers, emitting the same MorphToken outputs; state budget is a first-class
compile input (auto-demotion logged, never silent truncation); the corpus
set-parity gate runs against the chosen partition. Wired into the risks table
(state-blowup) and the phased plan (Phase 1-2 architecture).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ty gate

FstVerification.Compare runs a candidate analyzer (FstMorpher) beside the
sound+complete reference (Morpher) over a corpus and reports, per word, where
their analysis SETS differ: MissingFromCandidate (completeness failures) and
ExtraInCandidate (soundness/over-generation failures). AnalysisComparison.IsComplete
is the gate (HERMITCRAB_FST_PLAN.md §9.5/§10.4) that must pass before the FST may
replace the search engine — until then the FST runs in shadow mode.

This operationalizes the completeness question: it measures both "did we find
them all" and "did we invent any" at once, against the proven engine. Tests:
FstMorpher.FromLanguage vs Morpher is IsComplete over a concatenative corpus
(inflected, bare root, homograph, non-word), and the harness flags a
deliberately-empty candidate as incomplete (proving it is not vacuous). 84 HC
tests green.
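
The set-parity comparison can be sketched as two set differences per word. This is a hedged illustration: `ComparisonSketch` and `VerificationSketch` are hypothetical names, analyses are compared as strings, and treating `IsComplete` as "no missing and no extra analyses" is an assumption based on the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class ComparisonSketch
{
    // Keyed by word; only words with a divergence get an entry.
    public Dictionary<string, List<string>> MissingFromCandidate { get; } =
        new Dictionary<string, List<string>>();
    public Dictionary<string, List<string>> ExtraInCandidate { get; } =
        new Dictionary<string, List<string>>();

    // Assumed gate: the candidate neither loses nor invents any analysis.
    public bool IsComplete => MissingFromCandidate.Count == 0 && ExtraInCandidate.Count == 0;
}

public static class VerificationSketch
{
    public static ComparisonSketch Compare(
        IEnumerable<string> corpus,
        Func<string, IEnumerable<string>> reference,
        Func<string, IEnumerable<string>> candidate)
    {
        var result = new ComparisonSketch();
        foreach (string word in corpus)
        {
            var expected = new HashSet<string>(reference(word));
            var actual = new HashSet<string>(candidate(word));
            List<string> missing = expected.Except(actual).ToList(); // completeness failures
            List<string> extra = actual.Except(expected).ToList();   // over-generation
            if (missing.Count > 0)
                result.MissingFromCandidate[word] = missing;
            if (extra.Count > 0)
                result.ExtraInCandidate[word] = extra;
        }
        return result;
    }
}
```

Including a non-word in the corpus matters: both sides must return the empty set for it, so a candidate that accepts too much is caught as well as one that accepts too little.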

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extend the acceptor with compound paths: for each compounding subrule, build
root×root chains (head segs + non-head segs), tagging the head Root and the
non-head Compound per the subrule's surface head position (head-first or
head-last detected from the first CopyFromInput in the Rhs). FromLanguage now
collects CompoundingRules alongside prefixes/suffixes; single head + single
non-head subrules only (throws otherwise).

This is root×root in state count — exactly the layer §10 flags as a lazy-bucket
candidate at lexicon scale — built eagerly here for the parity check. Verified
against Morpher via the IMorphologicalAnalyzer contract: "pʰutdat" = pʰut(5) +
dat, returning both homographic non-heads (5+8, 5+9) exactly as the search
engine, and the bare root through the non-compound path. 85 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…morph

BuildAffixChains now iterates ALL allomorphs of an affix rule, building a
segment chain for each, all sharing the rule's morpheme token. Environment-
conditioned allomorphy is handled by the surface: only the allomorph whose
segments match the input accepts. FromLanguage no longer restricts affixes to a
single allomorph (throws only if an allomorph lacks a fixed-segment
InsertSegments).

This rounds out the concatenative Tier-1 fragment — roots, prefixes, suffixes,
bounded compounding, and multi-allomorph affixes. Verified at parity with
Morpher via the IMorphologicalAnalyzer contract: a plural with -s/-t allomorphs
analyzes "sags" and "sagt" exactly as the search engine, the surface selecting
the right allomorph. 86 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GrammarFstClosure.Analyze decides, per non-regular escape (reduplication /
infixation), whether any FST-able rule could apply before it and FEED it
(Kiparsky feeding). Sound stratal pre-filter: an escape is CLOSED only if no
FST-able rule (concatenative affix, compounding, or any phonological rule)
applies at or before its stratum — same-stratum rules count too, since unordered
application could place them first. Never falsely reports closed.

ClosureReport.FstClosed is true iff every escape is closed (vacuously, none):
then the FST built over the FST-able fragment is closed and its "no path" is a
proof for words showing no escape signature — subject to the per-word surface
check and the corpus parity gate. This is the static half of "confirming FST
closure"; the empirical half is FstVerification. The precise refinement that
reclaims over-flagged cases is range(F) ∩ trigger(E) = ∅ via Fst.Intersect.

Tests: no escapes -> closed (vacuous); innermost reduplication with nothing
before it -> closed; a suffix in the same unordered stratum -> potentially fed.
89 HC tests green.
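
The stratal pre-filter can be sketched as follows. It is a simplification under assumptions: `ClosureSketch` is hypothetical, strata are indexed in application order (0 = innermost), and the caller provides a per-stratum count of FST-able rules instead of the real grammar objects.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ClosureSketch
{
    // fstAbleRuleCounts[s] = FST-able rules in stratum s (assumed 0 = innermost).
    // An escape is closed only if no FST-able rule sits at or before its stratum;
    // same-stratum rules count, since unordered application could place them first.
    public static bool IsClosed(int escapeStratum, IReadOnlyList<int> fstAbleRuleCounts)
    {
        for (int s = 0; s <= escapeStratum && s < fstAbleRuleCounts.Count; s++)
        {
            if (fstAbleRuleCounts[s] > 0)
                return false; // potentially fed
        }
        return true;
    }

    // True iff every escape is closed; vacuously true when there are no escapes.
    public static bool FstClosed(IEnumerable<int> escapeStrata, IReadOnlyList<int> fstAbleRuleCounts) =>
        escapeStrata.All(e => IsClosed(e, fstAbleRuleCounts));
}
```

This pre-filter can over-flag, never under-flag: a rule that exists at or before the escape's stratum but cannot actually feed it still marks the escape as potentially fed.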

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e 3)

HybridMorpher (IMorphologicalAnalyzer) wires the three pieces together: the
precompiled FST handles the FST-able fragment as a fast, allocation-light walk,
and words that could involve a non-regular escape fall back to the
sound+complete search engine. FstMorpher.FromLanguage gains ignoreEscapes to
build the FST over just the FST-able fragment.

Routing is sound by construction (only ever sends MORE words to search): the
fast path is taken iff the grammar has no escapes, or every escape is CLOSED
(GrammarFstClosure) AND is total reduplication (the surface signature this
router detects, XX) AND the word shows no such signature. Otherwise the search
runs.
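
The routing rule can be sketched as a pure function. The names are hypothetical, and checking the XX signature on the whole surface word is an assumption; the source does not say whether the router also looks for a reduplicated stem inside a longer word.

```csharp
using System;

public static class HybridRoutingSketch
{
    // Whole-word XX check: the second half repeats the first (assumed signature).
    public static bool LooksTotallyReduplicated(string word)
    {
        if (word.Length == 0 || word.Length % 2 != 0)
            return false;
        int half = word.Length / 2;
        return string.CompareOrdinal(word, 0, word, half, half) == 0;
    }

    // Fast path iff there are no escapes, or every escape is a closed total
    // reduplication and this word shows no reduplication signature.
    public static bool UseFastPath(string word, bool hasEscapes, bool allEscapesClosedTotalRedup)
    {
        if (!hasEscapes)
            return true;
        return allEscapesClosedTotalRedup && !LooksTotallyReduplicated(word);
    }
}
```

Every uncertain case falls through to the search engine, so a looser signature check can only cost speed, not analyses.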

Verified: with a closed total-reduplication escape, "sag" takes the FST fast
path and "sagsag" falls back to search, and the combined analysis set is
verified COMPLETE against the pure search engine via FstVerification. This is
the Phase-3 Tier-2 hybrid: the closure pass decides safety, the verification
gate proves parity, the runtime routes per word. 90 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
FstGenerator implements IMorphologicalGenerator for the concatenative fragment:
generation is the inverse of the analyzer's walk — ordered concatenation of each
morpheme's surface representation (prefix before root, suffix after, compound
stems concatenated), Cartesian over allomorph choices. Mirrors Morpher's
morpheme inventory so it is a drop-in IMorphologicalGenerator.

Verified against Morpher.GenerateWords on the concatenative fragment, and the
analyze→generate round-trip recovers the input word ("sag", "sags"). Scope
matches FstMorpher (roots + fixed-segment affixes); phonology/reduplication
defer to the search generator. 92 HC tests green.
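
The "Cartesian over allomorph choices" step can be sketched as an iterated product. This is an illustration with hypothetical names: each slot is the list of surface forms one morpheme may take, and the caller is assumed to pass the slots already in surface order (prefix, root, suffix).

```csharp
using System.Collections.Generic;

public static class GeneratorSketch
{
    // slots: one entry per morpheme in surface order, each listing its allomorphs.
    public static List<string> Generate(IEnumerable<IReadOnlyList<string>> slots)
    {
        var results = new List<string> { "" };
        foreach (IReadOnlyList<string> allomorphs in slots)
        {
            var next = new List<string>();
            foreach (string prefix in results)
            {
                foreach (string allomorph in allomorphs)
                    next.Add(prefix + allomorph);
            }
            results = next;
        }
        return results;
    }
}
```

A root with one form and a plural with two allomorphs therefore produces two candidate surface forms; this sketch does no environment filtering, so any such selection would have to happen elsewhere.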

This completes the in-repo Phase-4 surface: both directions (FstMorpher analyzer
+ FstGenerator generator), the Tier-2 hybrid, closure confirmation, verification
gate, and the grammar census all exist and are verified. The remaining Phase-4
item — the FieldWorks adapter — lives in a separate repository.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…iming + parity

FstSenaBenchmark loads a real grammar (HC_GRAMMAR/HC_WORDS), runs the census and
closure pass, attempts to build FstMorpher/HybridMorpher (reporting the concrete
WALL if a construct is out of fragment), and times search vs FST vs hybrid with
FstVerification parity.

Run on the real Sena grammar this surfaces the concrete remaining wall:
- census: Tier 1, fully FST-able, 0 escapes, FST-CLOSED;
- but FstMorpher.FromLanguage hits "stratum 'Morphology' has affix templates" —
  templates (position classes) are FST-able in principle but not yet built by
  FstMorpher, so the FST/hybrid cannot yet be constructed from Sena;
- search baseline (unlimited unapplications): ~206 ms/word, true parses.

So affix-template support is the next build to get the FST over the wall on a
real FLEx grammar.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…umulating walk

FstTemplateAnalyzer handles affix templates (position classes) — the real-grammar
case. Two design points from the advisor review: (1) build-time CATEGORY GATING —
a template attaches only to roots whose category unifies with its
RequiredSyntacticFeatureStruct, which kills over-generation AND lets same-category
roots share one copy of the template's slot-automaton (states = roots +
Σ template automata, not roots × combos); (2) TOKEN ACCUMULATION along the path
(a state carries the morpheme token emitted on entry) via a custom DFS walk,
since the shared automaton is reached by many roots so an accept-id map won't do.
A maxStates budget (§10 knob) aborts before a blowup.

New additive class (the 92 existing tests + the accept-id acceptor are untouched).
Verified on a toy: a V-only suffix template with two optional slots reproduces
the search engine's analyses for sag/sagd/sagdv, AND the category gate correctly
blocks the verb template on an A-category root (gab/gabd) — FstVerification parity.

Scope this slice: suffixing templates + category gating; prefix-slot templates,
cross-stratum gating, and phonology are next. Wired into FstSenaBenchmark.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…k — Sena PARSES

FstTemplateAnalyzer now handles prefix AND suffix template slots (prefixes
surface in reverse template order), gated by BOTH category (root features unify
with the template's RequiredSyntacticFeatureStruct) and stratum (root at the
template's stratum or inner — the 'datd' lesson). The walk is a proper NFA
simulation (active config-set per segment, deduped by (state, tokens)) instead
of the exponential recursive DFS, and guards InvalidShapeException (out-of-table
phonemes) like the search engine.

Result on the real Sena grammar (sena-hc.xml, 24 templates):
- FstTemplateAnalyzer BUILDS in ~0.5 s (gating shares automata → no state blowup);
- parses at ~6.4 ms/word vs the search engine's ~178 ms/word — about 28x faster;
- 14 of 16 analyses match the search engine across the sample, with one MISSING
  analysis on 'mafuta' (a two-prefix form) — a coverage gap, not over-generation.

Toy tests (parity-verified): suffix template + category gate; prefix+suffix
template (bare / prefixed / suffixed / both) + gate. 94 HC tests green.

Sena now parses through the FST; closing the last divergence to full
FstVerification parity is the remaining refinement.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Sena ~100x

Two coverage fixes for real Sena: (1) slot rules of type
RealizationalAffixProcessRule (a sibling of AffixProcessRule, same
IList<AffixProcessAllomorph>) are now included, not skipped; (2) every affix is
entered through a token-bearing state, so a zero/empty-segment morpheme still
emits its token (previously the token was placed on the first segment state and
a zero morph emitted nothing -> a missing analysis).

Result on real Sena: FstTemplateAnalyzer parses at ~3 ms/word vs the search
engine's ~337 ms/word (~100x) and matches the search engine's analysis set on
the 8-word sample exactly; on 30 words 26 match, with residual divergences in
both directions (one under-generation; over-generation where a constraint the
FST does not yet enforce -- obligatory affixation / MPR / co-occurrence -- would
exclude an analysis). Sena PARSES through the FST; full FstVerification parity at
scale is the remaining constraint-enforcement refinement. 94 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
VerifiedFstAnalyzer wraps the FST proposer and verifies each candidate by
re-synthesis (keep it only if generating from it reproduces the surface). In
principle this makes the FST sound — "matched a state" becomes "valid
derivation" (completeness condition 3) enforced outside the automaton — and
eliminates over-generation without encoding constraints in the FST.

Empirical finding on real Sena: Morpher.GenerateWords is NOT a clean validity
oracle here: it over-rejects (regenerates only ~1/4 of valid analyses, worsening
30-word parity from 4 to 11 divergences), because realizational affixes need
their realizational feature structure to synthesize and a bare morpheme-list
WordAnalysis does not carry it. NFD-normalizing the comparison did not help, so
it is not a normalization artifact.

Conclusion (documented in the class): propose-and-verify needs either a richer
candidate (carrying the realizational FS) or a different soundness mechanism
(constraints-on-arcs, or the stratal left/right handoff). The sound system today
remains the hybrid (FST proposes, search backstop). Mechanism + finding kept;
94 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nd+complete

SoundHybridMorpher is sound+complete BY CONSTRUCTION: the FST proposes, each
candidate is verified by REPLAYING its own derivation (synthesize from the
candidate's root + affixes with the realizational FS its realizational affixes
imply), and any candidate that can't be confirmed routes the whole word to the
search engine. Verified on real Sena: "sound vs search: IDENTICAL".
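
The routing rule above can be sketched as follows. This is an illustration under stated assumptions, not the SoundHybridMorpher code: `propose`, `verify`, and `search` are hypothetical callables, and treating an empty proposal list as a fallback is an assumption of this sketch.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Analysis = Tuple[str, ...]


def sound_hybrid(word: str,
                 propose: Callable[[str], Iterable[Analysis]],
                 verify: Callable[[str, Analysis], bool],
                 search: Callable[[str], List[Analysis]]) -> Tuple[List[Analysis], bool]:
    """Return (analyses, used_fallback)."""
    candidates = list(propose(word))
    # Keep the fast result only if every proposed candidate is confirmed.
    if candidates and all(verify(word, c) for c in candidates):
        return candidates, False
    # Any unconfirmed candidate (or no candidate at all) sends the whole word to search.
    return search(word), True


# Hypothetical toy data: "sagd" is a real word, "gabd" is not.
TRUTH: Dict[str, List[Analysis]] = {"sagd": [("sag", "d")], "gabd": []}


def toy_search(word: str) -> List[Analysis]:
    return TRUTH.get(word, [])


def toy_propose(word: str) -> List[Analysis]:
    # Over-eager proposer: always splits after the third segment.
    return [(word[:3], word[3:])] if len(word) > 3 else []


def toy_verify(word: str, analysis: Analysis) -> bool:
    return analysis in TRUTH.get(word, [])
```

The fallback flag is what the measured fallback rate counts: every `True` is a word that paid the full search cost.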

But the measured fallback is high (~87% on 30 words), so the speedup is lost:
re-synthesis is not yet a clean validity oracle. Supplying the realizational
feature-structure (vs the empty one Morpher.GenerateWords(WordAnalysis) uses)
recovered some words (100%->87% fallback), confirming the cause is missing
derivational features — but other feature determinants (syntactic agreement,
MPR, allomorph conditioning) the (op, morpheme) token list does NOT carry are
still absent, so most valid candidates fail to regenerate and fall back.

Finding: propose-and-verify-by-resynthesis is sound+complete but cannot reach
the <5% fallback / >=10x target while the candidate under-determines the
derivation. The clean-oracle path is a richer candidate (carry the feature/
allomorph state) or feature-bearing FST arcs (way A). 94 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
johnml1135 and others added 23 commits June 26, 2026 07:35
…g table

The data structure for a feature-aware, O(1) FST walk: ids in bits, objects in
tables.

- Interner<T>: maps values to dense small ids and back in O(1), so a feature
  structure / MPR set is referenced by a few bits instead of stored per state.
- MorphStateLayout: a tunable bit-packed state over an array of uint. Every
  field is one shift+mask, never straddling a 32-bit word, greedily packed in
  declaration order; WordCount falls out of the chosen widths. Field widths are
  the tuning knob (constructor args).

Default layout (3 words): Op8|Morpheme24 (the analysis token) ; Category12|
Realizational12|Slot8 ; Mpr32 (bitmask). A validity transition is shift/mask +
an O(1) interned-table / transition-table lookup — no runtime unification — so
an accepting state means a valid derivation (completeness condition 3, in O(1)).
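
A minimal sketch of the packing rule described above (greedy, declaration order, no field straddling a 32-bit word), written in Python for illustration. The class name `StateLayout` and its methods are hypothetical and are not the MorphStateLayout API; only the field names and widths come from the default layout listed here.

```python
from typing import Dict, List, Tuple

WORD_BITS = 32  # each state word is one unsigned 32-bit integer


class StateLayout:
    """Greedy bit packing in declaration order; a field never straddles a word."""

    def __init__(self, fields: List[Tuple[str, int]]) -> None:
        self._slots: Dict[str, Tuple[int, int, int]] = {}  # name -> (word, shift, width)
        word, used = 0, 0
        for name, width in fields:
            if not 1 <= width <= WORD_BITS:
                raise ValueError(f"{name}: width must be 1..{WORD_BITS}")
            if used + width > WORD_BITS:  # would straddle, so start a new word
                word, used = word + 1, 0
            self._slots[name] = (word, used, width)
            used += width
        self.word_count = word + 1  # falls out of the chosen widths

    def max_value(self, name: str) -> int:
        return (1 << self._slots[name][2]) - 1

    def new_state(self) -> List[int]:
        return [0] * self.word_count

    def set(self, state: List[int], name: str, value: int) -> None:
        word, shift, width = self._slots[name]
        mask = (1 << width) - 1
        if not 0 <= value <= mask:
            raise OverflowError(f"{name}: {value} does not fit in {width} bits")
        state[word] = (state[word] & ~(mask << shift)) | (value << shift)

    def get(self, state: List[int], name: str) -> int:
        word, shift, width = self._slots[name]
        return (state[word] >> shift) & ((1 << width) - 1)


# The default layout from the text: three 32-bit words.
DEFAULT = StateLayout([("Op", 8), ("Morpheme", 24), ("Category", 12),
                       ("Realizational", 12), ("Slot", 8), ("Mpr", 32)])
```

Narrowing `Morpheme` to 22 bits gives a per-field ceiling of 4,194,303, which matches the ~4.19M figure quoted for the re-tuned width.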

FST_STATE_PACKING.md documents each field (stored vs looked-up), the tradeoffs
(bits-headroom vs density vs words-to-hash; baked-transition vs runtime-unify),
and the bit tuning (Morpheme 24->22 frees 2 bits / ~4.19M morphemes; collapsing
to 2 words; per-field ceilings via MaxValue). Tests: pack/unpack round-trip +
field independence + overflow + re-tuned width; interner id-sharing/density.
101 HC tests green.
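
For the interner half, a small sketch of dense id assignment, assuming hashable values. The class below is a hypothetical Python stand-in for Interner<T>, with feature structures modelled as frozensets purely for illustration.

```python
from typing import Dict, Generic, Hashable, List, TypeVar

T = TypeVar("T", bound=Hashable)


class Interner(Generic[T]):
    """Maps equal values to one dense small id and back, both in O(1) on average."""

    def __init__(self) -> None:
        self._ids: Dict[T, int] = {}
        self._values: List[T] = []

    def intern(self, value: T) -> int:
        existing = self._ids.get(value)
        if existing is not None:
            return existing  # id sharing: equal values get the same id
        new_id = len(self._values)  # density: ids are 0, 1, 2, ... with no gaps
        self._ids[value] = new_id
        self._values.append(value)
        return new_id

    def value(self, id_: int) -> T:
        return self._values[id_]

    def __len__(self) -> int:
        return len(self._values)
```

A packed state would then store only the id, with the object itself held in the interner's table.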

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sue)

Wired the first morphosyntactic constraint into FstTemplateAnalyzer: a slot rule
whose RequiredSyntacticFeatureStruct cannot unify with the template's category is
omitted at build time — HC's Required.Unify(stem) gate, hoisted to compile time.
This avoids the surface-vs-derivation-order obstacle that blocks threading the
category through the surface-order walk (prefixes are encountered before the root
whose category they need): for inflectional templates the category is ~constant,
so the build-time gate is faithful and needs no per-path state. Test:
N-requiring suffix in a V-template is pruned, so "sagz" is not over-generated and
the FST stays IDENTICAL to search.
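
A sketch of the build-time gate, assuming flat feature dictionaries. Real HC feature structures are richer, and `unifiable`, `prune_slot_rules`, and the rule records are hypothetical names for this illustration only.

```python
from typing import Dict, List

Features = Dict[str, str]


def unifiable(required: Features, category: Features) -> bool:
    """Flat unification check: no feature present in both may carry different values."""
    return all(category[key] == value
               for key, value in required.items() if key in category)


def prune_slot_rules(template_category: Features,
                     slot_rules: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Drop, at build time, every slot rule that could never apply in this template."""
    return [rule for rule in slot_rules
            if unifiable(rule["required"], template_category)]


# Toy V-template with one suffix that requires an N stem.
V_TEMPLATE: Features = {"pos": "V"}
SLOT_RULES: List[Dict[str, object]] = [
    {"name": "-d", "required": {"pos": "V"}},
    {"name": "-z", "required": {"pos": "N"}},  # pruned: cannot unify with V
    {"name": "-v", "required": {}},            # unconstrained: always kept
]
KEPT = [rule["name"] for rule in prune_slot_rules(V_TEMPLATE, SLOT_RULES)]
```

Because the pruned rule never becomes an arc, the walk needs no per-path category state to reject a form like "sagz".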

Empirical finding: this did NOT change Sena's divergences — so Sena's
over-generation is NOT category-based. The actual causes, by word:
- mbale [:0]      : bare-root acceptance (Bantu obligatory inflection),
- kulemba [+++..] : MPR / morpheme-co-occurrence (a forbidden affix combination),
- aikhane/angwera : coverage gaps (deeper morpheme stacks not yet built).
Category is one transition; MPR/obligatoriness/co-occurrence are the next ones.
102 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tion

A root is marked bare-accepting only if synthesizing it with no affixes yields
its own surface (HC's finality/validity check). In an always-inflected grammar
(Bantu) a root that must take a class/agreement affix returns nothing from
bare synthesis, so the spurious bare reading is suppressed. Default constructor
keeps "every root may stand bare" (toy grammars); the new (Language, Morpher)
constructor enables the check, wired into SoundHybridMorpher and the Sena
benchmark.

Result on real Sena: the raw FST-vs-search divergences drop 4 -> 3 — the
"mbale [:0]" bare-root over-generation is gone. Remaining: kulemba (MPR /
co-occurrence over-gen — the next in-FST guard) and aikhane/angwera (coverage
under-gen — deeper morpheme stacks). 102 HC tests green; build ~0.6s.

Note: the SoundHybridMorpher verify-fallback rate is still high because the
re-synthesis verify over-rejects (under-determined candidate); the in-FST guards
(category, obligatoriness, next MPR) are the path that makes the FST sound
directly so the verify becomes unnecessary — tracked by the raw FST-vs-search
divergence count, now 3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…11 findings

The benchmark forced MaxUnapplications=3 on the reference Morpher, whereas =0 means
UNLIMITED, so the capped ground truth was crippled and every prior "divergence" was an
artifact. With the corrected unlimited oracle the FST template analyzer is
IDENTICAL to search on the curated set at ~74x; on the broader set the only real
residuals are over-generation (cleanly removed by verify-discard) and
under-generation that is entirely unbuilt-B derivational affixes (REC/APPLIC/REV/
NZR/NEU/PAS) — not bucket C.

- FstReplay: shared re-synthesis verifier (supplies the candidate's realizational FS).
- VerifiedFstAnalyzer: discard mode — installs all over-gen gates via HC synthesis.
- SoundHybridMorpher: delegates replay to FstReplay.
- Benchmark: gloss diagnostic + obligatoriness-gate re-validation (gate still
  load-bearing under the correct oracle: 5->4 divergences).
- HERMITCRAB_FST_PLAN §11: the corrected picture, the sharpened bucket-C framing
  (B∘C∘B thin local core; Sena has 0 reduplication rules / FST-CLOSED), the three
  C release valves, and the path to a full solution.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Shared, bounded (depth 2) derivational-suffix layer between root and the
inflectional suffix slots, shared across a template's roots (tokens accumulate on
the walk path, so no roots×derivations blowup). This CLOSES the under-generation:
the raw FST now proposes the correct REC/APPLIC/REV/NEU/PAS analyses
(aikhane/angwera/paoneke/ikoyiwe go from missing to proposed).

FstReplay now tries BOTH re-synthesis strategies (native WordAnalysis replay with
permutation; explicit realizational-FS form) and accepts if either reproduces.

Key finding (benchmark round-trip self-test): HC's analysis->synthesis is LOSSY for
derivational verb forms — search's OWN analyses for aikhane/angwera/kunduli/ikoyiwe
do NOT round-trip through GenerateWords (they return NO), while noun/simple forms do
(kulemba/mbalira OK). The rich analysis Word carries rule-chain features the lean
WordAnalysis (op+morpheme tokens) loses. So re-synthesis verification cannot be the
soundness oracle for this grammar's realizational morphology — it rejects valid
analyses. Soundness needs a different mechanism (richer tokens / build-time gates).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Deep probe + grammar inspection confirm WHY re-synthesis verification failed: it is
NOT fundamental loss. Morpher has two synthesis doors — internal Synthesize (rich
analysis Word: template/slot structure + features → faithful) and public
GenerateWords (flat morpheme bag applied as FREE rules → lossy). The inflectional
affixes (3P+2, SBJV) are template-slot rules (mrule26+ inside <Slot>), only
compounding/derivation are free (mrule1-25); GenerateWords applies slot rules as
free rules so feature-dependent verb combos never synthesize — while simple nouns do
(why nouns round-trip, verbs don't).

The under-determination is self-inflicted: the FST walk knows the template/slot path
and discarded it in the lean (op,morpheme) token. Fix = preserve that path and verify
through HC's faithful door (template-aware directed synthesis), making verify sound
AND lossless and collapsing the 90% false-rejection fallback.

Corpus (200 Sena words, unlimited oracle): raw FST 49/200 diverge (~19% over-gen,
~7% under-gen) at ~64x; documented in plan §11.5/§11.6.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d, 99% complete)

Status check: the SoundHybridMorpher path has ZERO over-generation across 200 words
(both residual divergences miwiri/mitemo are pure under-gen). So the system is
already SOUND and ~99% COMPLETE — correctness is effectively done. The one open axis
is SPEED: the 90% fallback is the lossy GenerateWords verify false-rejecting valid
words, not genuine errors.

Pinpointed why GenerateWords fails (probe of aikhane): stem shape = citation shape =
ikh (not a stem-shape issue); its rules a-5/-e/-an mix template-slot inflection with a
free derivational rule across DIFFERENT STRATA, and GenerateWords' flat permuted rule
pool cannot reconstruct the cross-stratum order that HC's internal Synthesize reads
off the rich analysis Word. The FST walk knows that structure.

Plan §11.7/§11.8: the one remaining build is a faithful (template-aware,
stratum-ordered) directed-synthesis verify, which makes VerifiedFstAnalyzer sound AND
lossless with no fallback at full FST speed. §11.4 path updated with what's done.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…wo routes

GenerateWords(WordAnalysis) permutes rule order (tries the correct order) and still
fails — so the missing ingredient is analysis-derived Word state, not rule ordering.
A cheap faithful verify is therefore harder than "right order". §11.8 now records two
viable routes to the speed unlock: (A) faithful reconstruction verify driving HC's
internal Synthesize, or (B) build-time constraint gates (subcategorization etc.) that
make the FST faithful with no per-word verify. Route B is likely the surer win for
this grammar.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…aints)

Per design review: Route B duplicates HC's constraint logic as parallel FST arc-gates
(a second engine to keep aligned) — the anti-pattern. Route A reuses HC. Sharpened
Route A = "directed un-application, then Synthesize": the FST replaces only HC's slow
backward SEARCH (it knows the exact path), then HC's real forward Synthesize confirms.
Faithful by construction, cost ~(rules in path) not full fan-out → the >=10x.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ureStruct)

AnalysisAffixTemplateRule unifies the template's RequiredSyntacticFeatureStruct and
writes it onto the word (SyntacticFeatureStruct.Add(fs)). That populated feature
structure is the synthesis precondition the inflectional rules check — exactly what
from-citation GenerateWords never establishes (why correct rule order still fails
there). Directed un-application calls the same CompileAnalysisRule objects along the
FST path, reconstructing that state for free, then Synthesize succeeds. Confirms
Route A is reuse, not reimplementation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the lossy GenerateWords re-synthesis verify with restricted re-analysis: run
HC's own AnalyzeWord with LexEntrySelector/RuleSelector pinned to the candidate's root
and rules. That prunes HC's combinatorial fan-out to the single path the FST found
(a few ms, not the full search) while reusing HC's real analysis+synthesis validation
end to end — no reimplemented constraints.

This is the thin wrapper the design wanted: faithful AND lossless, where GenerateWords
was neither (it re-synthesized from a flat permuted rule list off the citation form and
never re-established the SyntacticFeatureStruct that AnalysisAffixTemplateRule sets
during un-application, so it false-rejected valid verb forms).
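
The idea of restricted re-analysis can be sketched on a toy engine. Everything below is hypothetical (the lexicon, the suffix table, the `analyze` and `confirm` functions); it only illustrates how pinning root and rule selectors narrows the engine's own search to the candidate's path while the engine's own constraint check still decides.

```python
from typing import Callable, Dict, List, Optional, Tuple

Analysis = Tuple[str, ...]

ROOTS: Dict[str, str] = {"sag": "V", "gab": "A"}  # root -> category
# suffix id -> (surface shape, category it requires)
SUFFIXES: Dict[str, Tuple[str, str]] = {"PAST": ("d", "V"), "PL": ("v", "V"), "CMP": ("d", "A")}


def analyze(word: str,
            root_ok: Callable[[str], bool] = lambda root: True,
            rule_ok: Callable[[str], bool] = lambda rule: True) -> List[Analysis]:
    """Toy exhaustive engine: strip suffixes right to left, then match a root."""
    results: List[Analysis] = []

    def unapply(rest: str, suffixes: List[str]) -> None:
        if rest in ROOTS and root_ok(rest):
            # The engine's own validity check (here: a category requirement).
            if all(SUFFIXES[s][1] == ROOTS[rest] for s in suffixes):
                results.append((rest,) + tuple(suffixes))
        for rule_id, (shape, _) in SUFFIXES.items():
            if rule_ok(rule_id) and rest.endswith(shape) and len(rest) > len(shape):
                unapply(rest[:-len(shape)], [rule_id] + suffixes)

    unapply(word, [])
    return results


def confirm(word: str, candidate: Analysis) -> Optional[Analysis]:
    """Re-run the engine with selectors pinned to the candidate's root and rules."""
    root, *rules = candidate
    allowed = set(rules)
    hits = analyze(word, root_ok=lambda r: r == root, rule_ok=lambda s: s in allowed)
    return candidate if candidate in hits else None
```

The verifier contains no constraint logic of its own: an over-generated candidate such as `("sag", "CMP")` is rejected by the engine's category check, not by the wrapper.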

Measured (200 Sena words, unlimited oracle):
- verify-discard: 48 -> 14 divergences (186/200 match); ALL over-generation removed,
  remaining 14 are pure under-gen (category-changing NZR derivation + miwiri/mitemo).
- verify speed 14.4 ms/word vs 263 ms/word oracle (~18x), no fallback.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ound-trip probe

The harness re-ran the slow search oracle ~5x (once per parity comparison). Wrap it
in a CachingAnalyzer so the oracle runs one pass and the comparisons reuse it; wall
time drops from ~6min to ~2.4min (remainder = one 47s oracle pass + the sound variant's
86%-fallback search). Removed the now-obsolete GenerateWords round-trip probe (its
investigation is concluded; the deep probe stays gated on HC_PROBE_WORD).

Corpus result with the Route A verify (200 words): verify(discard) 14 divergences
(186/200 match) at 15.6 ms/word (~15x), sound, lossless, no fallback. The 14 are pure
under-gen (category-changing NZR derivation + miwiri/mitemo) to be closed in the proposer.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…art-2 coverage

§11.8 marks the restricted-re-analysis verify implemented and measured (48->14
divergences, 186/200 set-match, ~15x, no fallback, lossless). §11.4 updates the path:
the only remaining work is closing the 14 under-gen in the PROPOSER — category-changing
derivation (verb->noun NZR feeding a noun template) + a few prefixal/deeper cases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…stems

Relax template attachment: a template attaches to a root if the root's category
matches OR can be DERIVED to the template's category via <=DerivDepth derivational
suffixes (DerivableToCategory, using each rule's OutSyntacticFeatureStruct). This lets
a noun-class template sit over a nominalized verb stem (vencer[verb]+NZR -> noun +
class prefix), closing the category-changing under-gen (kunduli/cidzo/khalani). The
category-changing suffix rides the existing shared derivation layer; verify-discard
removes any resulting over-gen.
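
A sketch of the reachability test behind the relaxed attachment, assuming each derivational rule is reduced to an (input category, output category) pair. That reduction and the function name are assumptions of this illustration; the commit only states that the rule's OutSyntacticFeatureStruct is used.

```python
from typing import List, Set, Tuple


def derivable_to_category(root_category: str, target_category: str,
                          rules: List[Tuple[str, str]], max_depth: int = 2) -> bool:
    """True if target is the root's category or reachable in <= max_depth derivations."""
    if root_category == target_category:
        return True
    frontier: Set[str] = {root_category}
    for _ in range(max_depth):
        # Categories reachable with exactly one more derivational rule.
        frontier = {out for (inp, out) in rules if inp in frontier}
        if target_category in frontier:
            return True
        if not frontier:
            break
    return False


# Toy rule set: an applicative keeps V, a nominalizer maps V to N, and A -> V -> N needs two steps.
TOY_RULES: List[Tuple[str, str]] = [("V", "V"), ("V", "N"), ("A", "V")]
```

With `max_depth=2` a noun template can sit over a verb root (one step) or an adjective root (two steps); raising the bound widens attachment at the price of more candidates to verify.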

Measured (200 Sena words): verify-discard 14 -> 6 divergences (194/200 set-match) at
17.2 ms/word (~13x), sound + lossless, no fallback. Remaining 6 are diverse proposer
coverage gaps: prefixal derivation (nyari/cawo), depth-3 derivation (miwiri), and
copula/TAM constructions (ndico/ndimwe/kuumadi).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Depth 3 gains 1 word (miwiri) but roughly doubles verify cost, so keep DerivDepth=2: 194/200
set-match at 17.2 ms/word (~13x), sound + lossless, no fallback. Plan §11.4 records the
achieved target and characterizes the last 6 (diverse proposer coverage gaps: prefixal
derivation, depth-3, copula/TAM) — not a verify or soundness issue.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t is real

The parity signature was join(Morpheme.Id)+":"+rootIndex, but affix Id is empty here,
so it encoded only (morpheme count, root position) — collapsing same-shape affix
variants (3P+2 / 3S+1 / 6) and hiding same-shape under-generation. Replace with
per-morpheme object identity (both analyzers reference the same Morpheme instances, so
identity is a faithful shared key). FstReplay's candidate-match signature sharpened too.
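
A small illustration of why the old signature collapsed affix variants. The `Morpheme` class, the "+" separator, and the sample objects are hypothetical; the sketch assumes, as the commit states, that affix ids are empty and that both analyzers reference the same morpheme instances.

```python
from typing import List, Tuple


class Morpheme:
    def __init__(self, id_: str, gloss: str) -> None:
        self.id = id_
        self.gloss = gloss


def shape_signature(morphemes: List[Morpheme], root_index: int) -> str:
    """Old key: joined ids plus root position. Empty affix ids make variants collide."""
    return "+".join(m.id for m in morphemes) + ":" + str(root_index)


def identity_signature(morphemes: List[Morpheme], root_index: int) -> Tuple[Tuple[int, ...], int]:
    """New key: per-morpheme object identity, valid because instances are shared."""
    return tuple(id(m) for m in morphemes), root_index


ROOT = Morpheme("root-1", "toy root")
AFFIX_3P = Morpheme("", "3P+2")  # affix ids are empty in this scenario
AFFIX_3S = Morpheme("", "3S+1")

FIRST = [AFFIX_3P, ROOT]
SECOND = [AFFIX_3S, ROOT]
```

`FIRST` and `SECOND` share a shape signature but differ under the identity signature, which is how a shape-only metric can hide both over-generation and under-generation.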

Re-measured (200 words, strict signature): raw FST 44 -> 90 divergences (shape-parity
HAD hidden raw over-gen), but verify-discard stayed at 6 (194/200), ALL pure under-gen.
So the soundness/lossless/~13x result is real, not a shape artifact.

Plan §11.9 records the metric fix and two productionization caveats: (1) verify mutates
shared Morpher selectors -> needs per-thread morpher/pool for parallel verify; (2) the
~13x is vs the unlimited-unapp oracle (the correct sound+complete baseline).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… derivation

Records and implements the §12 completeness certificate: completeness is a property of
the grammar's rule structure, certified once, not a per-word check. Two halves —
closure (regular B-side / valid cut, via GrammarFstClosure) and coverage (the FST emits
every B-side construct). Certified ⇒ FST-only is provably complete; else the engine
backstop guarantees it and the certificate names the gaps.

- FstCompletenessCertificate / FstCompletenessReport: closure + affix coverage (from
  the codec's covered-morpheme set) + compounding count + verdict + uncovered list.
- FstTemplateAnalyzer: prefix-derivation layer (mirror of suffix; closed 4 of 7 prefix
  gaps, coverage 125->129/132), category reachability over prefix+suffix derivation,
  CoversAnalysis (sound structural predicate: single root, covered, depth<=2, canonical
  morph order), StateCount.
- MorphTokenCodec: CoveredMorphemes / Covers accessors.
- CompleteHybridMorpher: provably-complete analyzer (certified->FST, else->engine).

PROOF (Prove_CertificateCompleteness, 200 hard Sena words, unlimited oracle):
complete-system misses = 0 — returns every true analysis. Sena reports NOT certified
(129/132 affix, 8 compounding), so 4 out-of-class words route to the engine; the FST's
certified in-class set reaches 468 analyses.

Key finding (the stress test working as intended): a predictive per-analysis predicate
is whack-a-mole (broke on depth, then order, then a template-less prefix cawo). So
soundness rests on the grammar-level certificate + engine backstop, not a predicate —
100% complete today, faster as coverage closes, never a silent miss. FST size: 50,673
states from 1,463 roots (additive, not multiplicative — §12.5).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…y is the gate

Advisor review caught two real defects:
1. The "proof" was VACUOUS: CompleteHybridMorpher for uncertified Sena == search, so
   complete==search was search==search; the certified FST branch was never exercised.
2. The static certificate was UNSOUND: cawo (coisa+d'eles) has every morpheme covered
   yet the FST can't build it (prefix on a template-less pronoun root) — so filling the
   other gaps would flip IsCertified=true while still silently dropping cawo. Coverage
   is necessary, not sufficient; completeness is about paths (attachments), not symbols.

Fix: certification is the EMPIRICAL set-parity gate (CertifyEmpirically: FST==search at
morpheme-identity over a corpus, §9.5), which is path-level and catches cawo. The static
coverage check is demoted to a fast pre-filter/gap-namer (PreFilterPasses, no longer a
gate). CompleteHybridMorpher now takes the empirical-certified bool.
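
The empirical gate amounts to a set-parity check over a corpus. A minimal sketch, with hypothetical callables standing in for the verified FST and the engine, and analyses assumed to be hashable keys at morpheme-identity granularity:

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Analyzer = Callable[[str], Iterable[Hashable]]


def certify_empirically(corpus: Iterable[str], fst: Analyzer,
                        engine: Analyzer) -> Tuple[bool, List[str]]:
    """Certified only if the FST's analysis set equals the engine's for every word."""
    divergent = [word for word in corpus if set(fst(word)) != set(engine(word))]
    return (not divergent, divergent)


# Toy data: the FST misses one analysis of "cawo", so the gate must refuse.
ENGINE: Dict[str, List[str]] = {"sagd": ["sag+PAST"], "cawo": ["c+awo"]}
FST: Dict[str, List[str]] = {"sagd": ["sag+PAST"], "cawo": []}
CERTIFIED, DIVERGENT = certify_empirically(
    ["sagd", "cawo"], lambda w: FST.get(w, []), lambda w: ENGINE.get(w, []))
```

Unlike a symbol-coverage check, this compares whole analyses, so a word whose morphemes are all covered but whose attachment path is unbuilt still shows up as divergent. The result is only as strong as the corpus it was run on.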

Non-vacuous proof (200 hard Sena words): FST path tested DIRECTLY produces 467/480
search analyses (13 → engine); the static check is shown UNSOUND on cawo (in-class yet
FST-missed); empirical gate refuses to certify Sena (5 divergent words); complete-system
misses = 0. System is 100% complete today via the engine; path to FST-only is driving
set-parity divergences to 0. Plan §12.6 corrected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Soundness_NegativeExamples: generate 50 plausible non-words by perturbing real words
(over-prefix, over-suffix, prefix-swap, fake partial reduplication, fake compound),
keep only TRUE negatives (search oracle = ∅), and require the verified FST to also
return ∅. Result on Sena: 50 true negatives, 11 exercise the verify (raw FST
over-proposes), 0 false positives — the verify (restricted re-analysis) rejects every
non-word it is offered. Writes negatives to HC_NEG_OUT.
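
A sketch of the negative-example harness, simplified to three deterministic perturbations. The function names, the affix lists, and the toy lexicon are hypothetical; the actual test uses more perturbation kinds.

```python
from typing import Callable, Iterable, List, Tuple


def perturb(word: str, prefixes: Iterable[str], suffixes: Iterable[str]) -> List[str]:
    """Plausible non-words built from a real word."""
    out = [p + word for p in prefixes]   # over-prefix
    out += [word + s for s in suffixes]  # over-suffix
    out.append(word[:2] + word)          # fake partial reduplication
    return out


def false_positives(words: Iterable[str],
                    oracle: Callable[[str], list],
                    verified: Callable[[str], list],
                    prefixes: Iterable[str],
                    suffixes: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (true negatives, those the verified analyzer wrongly accepts)."""
    prefixes, suffixes = list(prefixes), list(suffixes)
    candidates = {c for w in words for c in perturb(w, prefixes, suffixes)}
    negatives = sorted(c for c in candidates if not oracle(c))  # keep oracle-empty only
    return negatives, [n for n in negatives if verified(n)]


LEXICON = {"sag", "sagd", "sagdv"}


def toy_oracle(word: str) -> list:
    return [word] if word in LEXICON else []
```

Filtering by the oracle first matters: a perturbation that happens to be a real word ("sagd" here) is dropped rather than counted as a negative.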

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sound on corpus

Add a template-less stem path (optional derivational prefixes + root + optional
derivational suffixes, no inflectional template) for roots that derive/associate
without inflecting — e.g. a pronoun taking an associative prefix (cawo = coisa+d'eles).
Shared prefix/suffix derivation layers keep it additive (50,673 -> 59,022 states).

Result (200 words): divergent words 5 -> 4 (cawo closed); FST-direct 467 -> 468/480;
and static-unsound cases 1 -> 0 — the cawo analysis that broke CoversAnalysis is now
produced by the FST, so the structural predicate is sound on this corpus. Negatives
still 0 false positives (verify filters the added over-gen); verify 19.8 ms/word (~12x).

Remaining 4 divergent (engine-handled, all under-gen): miwiri (depth-3 derivation),
ndimwe/ndico (copula e+pronoun), kuumadi (derivational suffix outside inflection).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ead code

Acting on the cross-review landing plan + three decisions (keep advisor; multithread;
per-word FST opt-out).

Multithreading (fixes the Critical thread-safety bug):
- MorpherPool: thread-safe pool of Morphers. The verify pins a Morpher's selectors per
  candidate (mutable instance state), so a shared Morpher can't be used across threads.
  Each parse now rents/returns its own; Morphers are built once and reused.
- FstTemplateAnalyzer (proposer) is immutable post-build → shared safely.
- VerifiedFstAnalyzer + CompleteHybridMorpher now use the pool → corpus can be parsed in
  parallel. New Concurrent_MatchesSequential benchmark test: 0 mismatches.
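
A minimal rent/return pool in Python, to illustrate why per-candidate selector state forces one Morpher per concurrent parse. `MorpherPoolSketch` and `ToyMorpher` are hypothetical stand-ins; eager construction of all instances is an assumption of this sketch.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager
from typing import Callable, Iterator, Optional, Tuple


class ToyMorpher:
    """Stand-in for an engine instance with mutable per-candidate selector state."""

    def __init__(self) -> None:
        self.pinned_root: Optional[str] = None

    def verify(self, word: str) -> Tuple[str, str]:
        self.pinned_root = word[:3]  # mutation that must not be shared across threads
        return (self.pinned_root, word[3:])


class MorpherPoolSketch:
    def __init__(self, factory: Callable[[], ToyMorpher], size: int) -> None:
        self._idle: "queue.Queue[ToyMorpher]" = queue.Queue()
        for _ in range(size):  # built once, then reused
            self._idle.put(factory())

    @contextmanager
    def rent(self) -> Iterator[ToyMorpher]:
        morpher = self._idle.get()  # blocks until an instance is free
        try:
            yield morpher
        finally:
            self._idle.put(morpher)  # always returned, even on error


POOL = MorpherPoolSketch(ToyMorpher, size=4)


def parse(word: str) -> Tuple[str, str]:
    with POOL.rent() as morpher:
        return morpher.verify(word)


WORDS = ["sagd", "sagdv", "gabd", "sag"] * 25
SEQUENTIAL = [parse(w) for w in WORDS]
with ThreadPoolExecutor(max_workers=4) as executor:
    PARALLEL = list(executor.map(parse, WORDS))
```

Comparing `PARALLEL` with `SEQUENTIAL` is the same shape of check as the Concurrent_MatchesSequential test: identical results in identical order.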

M2 fix: FstReplay.Confirm returns the MATCHED HC WordAnalysis (real category, genuine
engine object) instead of bool; VerifiedFstAnalyzer yields it. Parity now sees category.

Per-word opt-out: CompleteHybridMorpher.AnalyzeWord(word, bool useFst) forces FST on/off
for one word; UseFstFor policy hook; plain AnalyzeWord uses the certified default.

Cuts (dead/superseded/abandoned, ~62% reduction): FstMorpher, FstGenerator (defer to a
generation PR), HybridMorpher, SoundHybridMorpher, MorphStateLayout, Interner,
FstCompletenessCertificate, FstLatticeAnalyzer + their 5 tests. CoversAnalysis (per-stem
proof, abandoned) removed from FstTemplateAnalyzer; codec Covers/CoveredMorphemes removed;
CertifyEmpirically inlined into CompleteHybridMorpher. KEPT the advisor (GrammarFstAdvisor
+ GrammarFstClosure) per decision 1. Retargeted FstVerificationTests off FstMorpher;
rewrote FstSenaBenchmark spine-only.

Validated (Sena, 60 words): ~11.5x (220->19 ms/word), verified==search IDENTICAL
(certified), 0 parallel mismatches, 0 false positives on 50 negatives. Unit suite 83 green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
VerifiedFstAnalyzerTests on the in-repo toy grammar (no external data) — closes the gap
where the propose-and-verify spine was exercised only by the [Explicit] benchmark:
- Verified_MatchesSearch_OnConcatenativeCorpus: set parity vs the engine.
- Verified_RejectsNonWord_NoFalsePositive: soundness (no false positive).
- Verified_YieldsGenuineEngineAnalyses_WithCategory: guards the M2 fix (yields HC
  analyses with category, not category-less FST candidates).
- CompleteHybrid_PerWordOptOut_EngineMatchesSearch: per-word useFst on/off both correct.
- Verified_ParallelMatchesSequential: thread-safety (pooled Morpher) in CI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…VP header

- Delete FST_PR_PLAN.md (scaffolding) and FST_STATE_PACKING.md (documented the cut
  MorphStateLayout).
- Move HERMITCRAB_FST_PLAN.md and the advisor plan (fst.md → HERMITCRAB_FST_ADVISOR.md)
  under docs/.
- Add a "Shipped MVP" header to the plan: the verify + engine-backstop + per-word
  opt-out design that landed, and an explicit note that the per-stem completeness proof
  (§11.5+/§12.3+) was explored then abandoned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@johnml1135 johnml1135 changed the title HermitCrab FST acceleration: grammar FST-readiness advisor (groundwork) HermitCrab FST acceleration: sound, fast, verify-by-re-analysis analyzer (+ grammar advisor) Jun 26, 2026
johnml1135 and others added 2 commits June 26, 2026 14:51
… cache)

The FST fast path is sound but not guaranteed complete, so it only makes sense paired
with the proven engine behind a cache. Adds the shipped front end:

- AnalysisCache: thread-safe store of complete (engine) analyses per word; FastAnalysisResult
  carries IsComplete (cached-complete vs provisional-FST).
- CachingMorphologicalAnalyzer: default AnalyzeWord = guaranteed complete (cached, or engine
  on miss then cached) — backwards-compatible; AnalyzeWordFast = cached-complete if warm else
  provisional FST (never blocks); Warm(corpus) fills the cache in parallel.
- MorphemeRegistry + AnalysisCacheSerializer: persist the cache across sessions (fixed corpora)
  with a grammar-version guard that rejects a stale cache; confirmed non-words cached too.

Correctness equals the engine (cache never invents/hides analyses); the FST removes cold-start
latency; a warmed corpus resolves fast AND complete. Per-word fast/slow choice preserved.
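
The front end's three entry points can be sketched as below. This is an illustration under stated assumptions, not the CachingMorphologicalAnalyzer code: `engine` and `fast` are hypothetical callables, and the fast result is returned as an `(analyses, is_complete)` pair in place of FastAnalysisResult.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Analyses = Tuple[Tuple[str, ...], ...]


class CachingFrontEnd:
    def __init__(self, engine: Callable[[str], Iterable[Tuple[str, ...]]],
                 fast: Callable[[str], Iterable[Tuple[str, ...]]]) -> None:
        self._engine, self._fast = engine, fast
        self._cache: Dict[str, Analyses] = {}
        self._lock = threading.Lock()

    def _lookup(self, word: str) -> Optional[Analyses]:
        with self._lock:
            return self._cache.get(word)

    def analyze_word(self, word: str) -> Analyses:
        """Guaranteed complete: cached engine result, or engine on a miss, then cached."""
        hit = self._lookup(word)
        if hit is None:  # an empty tuple is a cached non-word, not a miss
            hit = tuple(self._engine(word))
            with self._lock:
                self._cache[word] = hit
        return hit

    def analyze_word_fast(self, word: str) -> Tuple[Analyses, bool]:
        """Never runs the engine: cached-complete if warm, else provisional."""
        hit = self._lookup(word)
        if hit is not None:
            return hit, True
        return tuple(self._fast(word)), False

    def warm(self, corpus: Iterable[str], workers: int = 4) -> None:
        with ThreadPoolExecutor(max_workers=workers) as executor:
            list(executor.map(self.analyze_word, corpus))


# Toy engine and an under-generating fast path (returns only the first analysis).
LEXICON: Dict[str, List[Tuple[str, ...]]] = {
    "sagdv": [("sag", "PAST", "PL"), ("sagd", "PL")],
}
ENGINE_CALLS: List[str] = []


def toy_engine(word: str) -> List[Tuple[str, ...]]:
    ENGINE_CALLS.append(word)
    return LEXICON.get(word, [])


def toy_fast(word: str) -> List[Tuple[str, ...]]:
    return LEXICON.get(word, [])[:1]
```

The cache only ever stores engine output, which is why correctness equals the engine; the fast path is the only place a partial answer can appear, and it is always flagged.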

CI tests (toy grammar): default==engine + caches; fast provisional→complete after warm; warm
fills corpus; persistence round-trips and version-guard rejects stale. Full unit suite 92 green.
Plan §13 documents the design.
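
A sketch of persistence with a version guard, assuming analyses have already been reduced to stable string ids (the role MorphemeRegistry plays). The JSON layout and function names are hypothetical and do not describe AnalysisCacheSerializer's actual format.

```python
import json
from typing import Dict, List, Optional, Tuple

Cache = Dict[str, List[Tuple[str, ...]]]


def save_cache(cache: Cache, grammar_version: str) -> str:
    """Serialize the cache together with the grammar version it was built from."""
    entries = {word: [list(a) for a in analyses] for word, analyses in cache.items()}
    return json.dumps({"version": grammar_version, "entries": entries})


def load_cache(text: str, grammar_version: str) -> Optional[Cache]:
    """Return the cache, or None if it was built for a different grammar version."""
    data = json.loads(text)
    if data.get("version") != grammar_version:
        return None  # stale: the caller should re-warm
    return {word: [tuple(a) for a in analyses]
            for word, analyses in data["entries"].items()}


# Confirmed non-words are stored too, as an empty analysis list.
SAMPLE: Cache = {"sagd": [("sag", "PAST")], "xyz": []}
```

Rejecting on version mismatch is the conservative choice: a cache from an older grammar could hide analyses the current grammar would produce.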

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…epth

Answers "say a word is fully analyzed without full search": a grammar that is FST-closed
(census) AND passes set-parity vs the engine is CERTIFIED — the FST is then proven
complete for every word, so the default guaranteed-complete path runs FST-only with no
search and AnalyzeWordFast reports IsComplete=true without warming.

- CachingMorphologicalAnalyzer: grammarCertified flag; FromLanguage(corpus) computes it
  (closed && set-parity). Certified → FST-only (no engine, no cache); else engine/cache
  backstop. New GrammarCertified property.
- FstTemplateAnalyzer: derivation depth is now a tunable ctor param (default 2) instead
  of a const: a per-grammar knob that trades fast-path completeness against verify cost.
- CI test: certified grammar never runs the engine and its fast result is proven complete.
- Benchmark: reports certified status; on the 60-word Sena corpus (certifies) the default
  complete path is FST-only at ~18 ms/word (~11x), no full search.
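
The certified dispatch reduces to one boolean computed up front. A minimal sketch with hypothetical callables; `closed` stands for the census verdict and `parity_ok` for the corpus set-parity result.

```python
from typing import Callable, List, Tuple

Analyzer = Callable[[str], List[str]]


def make_default_analyzer(closed: bool, parity_ok: bool,
                          verified_fst: Analyzer, engine: Analyzer) -> Tuple[Analyzer, bool]:
    """Return (default analyze function, certified flag)."""
    certified = closed and parity_ok  # both halves are required

    def analyze(word: str) -> List[str]:
        # Certified grammars take the FST-only path; otherwise the engine backstops.
        return verified_fst(word) if certified else engine(word)

    return analyze, certified


ENGINE_HITS: List[str] = []


def toy_engine(word: str) -> List[str]:
    ENGINE_HITS.append(word)
    return [word + ":engine"]


def toy_fst(word: str) -> List[str]:
    return [word + ":fst"]
```

With `certified` true the engine is never invoked, which is the property the CI test on the certified grammar checks.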

Thread-safety reconfirmed: 0 parallel-vs-sequential mismatches on 200 words. Unit suite 93 green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>