HermitCrab FST acceleration: sound, fast, verify-by-re-analysis analyzer (+ grammar advisor) #441
Open
johnml1135 wants to merge 51 commits into
Tech stack: build on SIL.Machine's own Fst (it already has Compose/Determinize/Minimize/Intersect plus unification arcs, and RootAllomorphTrie is a precedent) rather than external OpenFst/Foma (interop cost, and no native feature-structure support).

Graceful degradation via census-chosen tiers:
- fully-FS grammars -> transducer-only;
- partial -> FST + per-word search fallback at non-FS escapes;
- pervasively-non-FS -> existing search (no regression).

Soundness contract + verification mode. Phased plan gated on a Sena compile-and-verify spike.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…te FST

A grammar evolves; one new rule can quietly push it from the fast finite-state path into the slow combinatorial search. GrammarFstAdvisor.Analyze(Language) walks every rule and emits per-rule advisories with a severity (Escape = breaks FST, Cost = inflates search, Info), a one-line issue, and an actionable write-up (how to constrain it / what to try instead), plus an overall tier verdict. This is the "one new rule blew up the grammar" guard: a new Escape that flips the tier names the offending rule and explains the fix.

Classifier:
- reduplication (a part copied >=2x via CopyFromInput) = Escape;
- stem-split/infixation (>=2 copies of different parts) = Escape;
- unbounded rewrite environment (Quantifier MaxOccur == Infinite) = Escape;
- deletion (LHS longer than RHS) = Cost;
- many allomorphs = Cost;
- ModifyFromInput, bounded rewrite rule, metathesis, compounding = Info.

The report also states how many affix/phonological/compounding rules were examined (clean ones produce no advisory), so "fully FST-able" is backed by inspection counts.

Validated on the real Sena grammar: examined 19 affix + 8 compounding + 0 phonological rules -> Tier 1, 0 escapes (matches the grammar census; no false positives).

Tests: concatenative grammar -> Tier 1; adding a reduplication rule -> flagged Escape with write-up + tier downgrade to Tier 2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The infixation check flagged any allomorph with >=2 CopyFromInput of different parts as Escape, but a plain suffix/circumfix over a split stem (copy "1", copy "2", insert) has contiguous copies and is fully FST-able. True infixation is signalled by inserted material BETWEEN two copies (copy...insert...copy); HasInfixedCopy now detects exactly that. Added tests: a contiguous split-stem suffix stays Tier 1 (no false escape) and a real copy-insert-copy infix is flagged Escape. Also label each advisory with its stratum (rules can appear in more than one), which clarifies the Sena report: its 8 compounding rules (mrule1-8, 4 names reused in pairs) all live in the 'Morphology' stratum -- genuine distinct rules, not a re-walk. Sena verdict unchanged: Tier 1, 0 escapes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…opaque An infix/reduplication escape can be un-applied per word by a cheap strip-and-reparse probe (remove the candidate affix, re-parse the residue with the FST) ONLY if nothing downstream rewrites the affixed span. Add the static soundness test: an escape in stratum i is "probe-able" iff no phonological rule runs at stratum i or later (surface-invariant); otherwise "opaque" and the search backstop is required. Sound-conservative: presence of any later phonological rule => opaque. GrammarAdvisory.Probeable (bool?) records it; the report counts ProbeableEscapeCount / OpaqueEscapeCount and, when every escape is probe-able, reports a "Tier 2+" verdict (a per-word probe recovers the fast path, effectively Tier 1 with no search backstop). Escape advice now spells out the probe and why it is or isn't sound. Tests: reduplication with no later phonology => probe-able (Tier 2+); the same rule with a later-stratum rewrite rule => opaque (plain Tier 2 hybrid). Sena unchanged (Tier 1, 0 escapes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
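The probe-ability rule in this commit can be sketched as follows. This is a minimal illustration, not the PR's code: `StratumSketch` and `ProbeabilitySketch` are hypothetical names, and a stratum is reduced to the count of phonological rules it contains.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a stratum: only the number of phonological rules matters here.
public sealed record StratumSketch(string Name, int PhonologicalRuleCount);

public static class ProbeabilitySketch
{
    // An escape at strata[escapeIndex] is probe-able only if no phonological rule
    // runs at that stratum or at any later one (sound-conservative: the mere
    // presence of a later phonological rule makes the escape opaque).
    public static bool IsProbeable(IReadOnlyList<StratumSketch> strata, int escapeIndex)
    {
        return strata.Skip(escapeIndex).All(s => s.PhonologicalRuleCount == 0);
    }
}
```

The check is deliberately coarse: it never inspects what a later rule rewrites, so it can only over-report "opaque", which sends more words to the search backstop rather than fewer.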
…arning
Add GrammarAdvisory.Regular (bool?): does an FST exist for this construct in
principle? By Kaplan & Kay (1994) a directional context-sensitive rewrite rule
is a regular relation however long its environment, so harmony/spreading and
bounded reduplication and infixation are regular and FST-reclaimable; only
whole-stem (unbounded) copy is genuinely non-regular.
Crucially this is kept ORTHOGONAL to severity: the FST compiler that turns
"regular" into "fast" is not built yet, so severity still means "slow in
today's engine" and is UNCHANGED -- every current escape stays an escape. A
harmony rule still warns (escape present, not Tier 1); Regular only adds a
separate reclaim-path note ("FST-reclaimable once the compiler exists; slow
today"). The report prints RegularEscapeCount / NonRegularEscapeCount and a
reclaim-path line; the tier verdict is NOT upgraded by regularity.
Detection: reduplication regularity from the copied part's Lhs pattern
boundedness (unbounded/unresolved -> non-regular, conservative); infix regular
(pattern-defined slot); unbounded-environment rewrite regular iff its own
Lhs/Rhs are bounded. Also fixed a latent tier bug (Probeable==null phonological
escapes were counted as "all probe-able") and removed the present-tense
"effectively Tier 1" claim from the Tier 2+ string -- the probe runtime is also
unbuilt, so both reclaim axes now read "would recover ... once it exists; slow
today".
Tests: harmony rewrite stays Escape + Regular (headline still warns);
unbounded-copy redup => non-regular; bounded reduplicant + infix => regular.
Sena unchanged (Tier 1). 6 advisor + 69 HC tests green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pull request overview
Adds a new static “FST-readiness” grammar linter for the HermitCrab parser (GrammarFstAdvisor.Analyze(Language)) that walks a compiled grammar, emits per-rule advisories (Escape/Cost/Info + reclaim notes like Regular/Probeable), and produces an overall tier verdict intended for authoring-time/CI use. This lays groundwork for future FST compilation work by making “what blocks FST / what is slow today” visible and actionable.
Changes:
- Introduces GrammarFstAdvisor, GrammarFstReport, and GrammarAdvisory to classify expensive/non-FST-able constructs across morphological and phonological rules.
- Adds NUnit tests covering concatenative cases, reduplication (bounded/unbounded), infixation, rewrite-rule harmony behavior, and opacity/probe-ability.
- Adds planning docs for the advisor and the broader HermitCrab FST acceleration roadmap, plus an explicit local benchmark test for running the advisor on an external grammar.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarFstAdvisorTests.cs | Adds coverage for the advisor’s tiering and key escape classifications (reduplication/infix/harmony/opacity). |
| tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarFstAdvisorBenchmark.cs | Adds an [Explicit] helper test to run and print the advisor report on an external HC XML grammar. |
| src/SIL.Machine.Morphology.HermitCrab/GrammarFstAdvisor.cs | Implements the advisor, report model, and the core static analyses for affix and phonological rules. |
| HERMITCRAB_FST_PLAN.md | Documents the planned FST compiler/runtime approach, tiered hybrid design, and decision gate. |
| fst.md | Documents the advisor’s classification rules, tier model, and the orthogonal Regular/Probeable axes. |
Comment on lines +111 to +129

    Advisories = advisories;
    AffixRulesExamined = affixRulesExamined;
    PhonologicalRulesExamined = phonologicalRulesExamined;
    CompoundingRulesExamined = compoundingRulesExamined;
    EscapeCount = advisories.Count(a => a.Severity == GrammarAdvisorySeverity.Escape);
    CostCount = advisories.Count(a => a.Severity == GrammarAdvisorySeverity.Cost);
    InfoCount = advisories.Count(a => a.Severity == GrammarAdvisorySeverity.Info);
    ProbeableEscapeCount = advisories.Count(a =>
        a.Severity == GrammarAdvisorySeverity.Escape && a.Probeable == true
    );
    OpaqueEscapeCount = advisories.Count(a =>
        a.Severity == GrammarAdvisorySeverity.Escape && a.Probeable == false
    );
    RegularEscapeCount = advisories.Count(a =>
        a.Severity == GrammarAdvisorySeverity.Escape && a.Regular == true
    );
    NonRegularEscapeCount = advisories.Count(a =>
        a.Severity == GrammarAdvisorySeverity.Escape && a.Regular != true
    );
Comment on lines +270 to +277

    foreach (IMorphologicalRule mrule in stratum.MorphologicalRules)
    {
        switch (mrule)
        {
            case AffixProcessRule affix:
                affixExamined++;
                AnalyzeAffix(affix, stratum.Name, surfaceInvariant, advisories, manyAllomorphsThreshold);
                break;
The analyzer transducer must emit the structured derivation (ordered morphemes + root), not just accept/reject; otherwise it is a recognizer, not an analyzer.

Define the compact output token: high 8 bits = MorphOp (role/operation: Root, Prefix, Suffix, Infix, Reduplication, Circumfix*, Compound, Clitic, Process, Null), low 24 bits = morpheme index into the grammar's morpheme table. An accepting path's output is the uint[] of these tokens, which IS the analysis and is self-describing: Morphemes = indices in array order; RootMorphemeIndex = the Root token's position (no separate field).

Verdict on the proposed 8+24 packing: sound and the right compactness choice (4 bytes/morph, hashable, columnar). The 24-bit ceiling is a maximum morpheme index of 16,777,215 (ample; the compiler asserts it). Refinement baked into the schema: keep the 32-bit word as the pure (op, morpheme) derivation and DON'T overload it with surface segmentation or allomorph identity; those are optional parallel channels.

Adds the MorphToken codec (Encode/GetOp/GetMorphemeId/RootIndex) with a bounds check, plus HERMITCRAB_FST_PLAN.md section 8 documenting the schema. 5 tests (round-trip, out-of-range throw, distinctness, self-describing derivation array, root recovery).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
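The 8+24 packing described above can be illustrated with a small sketch. The method names follow the commit message (Encode/GetOp/GetMorphemeId/RootIndex), but the type names `MorphOpSketch` and `MorphTokenSketch` and the numeric values assigned to each operation are assumptions made for this example; the PR's actual enum values are not shown here.

```csharp
using System;

// Operation codes named in the commit message; the numeric values are assumed.
public enum MorphOpSketch : byte
{
    Root = 1, Prefix, Suffix, Infix, Reduplication, Circumfix, Compound, Clitic, Process, Null
}

public static class MorphTokenSketch
{
    // Largest morpheme index that fits in the low 24 bits (16,777,215).
    public const uint MaxMorphemeId = 0x00FF_FFFF;

    // High 8 bits = operation, low 24 bits = morpheme index.
    public static uint Encode(MorphOpSketch op, uint morphemeId)
    {
        if (morphemeId > MaxMorphemeId)
            throw new ArgumentOutOfRangeException(nameof(morphemeId));
        return ((uint)op << 24) | morphemeId;
    }

    public static MorphOpSketch GetOp(uint token) => (MorphOpSketch)(token >> 24);

    public static uint GetMorphemeId(uint token) => token & MaxMorphemeId;

    // The root position is recovered from the op codes alone; -1 if no Root token exists.
    public static int RootIndex(uint[] tokens) =>
        Array.FindIndex(tokens, t => GetOp(t) == MorphOpSketch.Root);
}
```

Because the root position is derived from the token array itself, a derivation such as prefix + root + suffix needs no side field to say which morpheme is the root.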
…nalyses The packed-token tests alone proved bit-packing, not schema fidelity. Add the reference encoder MorphTokenCodec (Word -> uint[]), which mirrors Morpher.CreateWordAnalysis (same AllomorphsInMorphOrder iteration + RootAllomorph check) and populates the op channel from the actual rule: head root -> Root, other stems -> Compound, affixes classified from their output actions (reduplication / infix / prefix / suffix / process). Round-trip tests on real parsed words now MEASURE soundness rather than assert it: - suffix word: decoded morphemes reproduce WordAnalysis.Morphemes in order, and RootIndex (recovered purely from the Root op code) == WordAnalysis.RootMorphemeIndex; - compound (two stems): the flat array keeps both morphemes with exactly one Root + one Compound, matched to WordAnalysis by morpheme sequence with root index at parity -- confirming the flat array is at parity with WordAnalysis's own compound flattening (not lossy); - ClassifyOp populates reduplication/infix/prefix/suffix from real output actions. Resolves the two open risks on the schema: the op channel is now populated from a real Word (not asserted), and multi-root/compound handling is verified. 75 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
FstMorpher hand-builds one acceptor (mirroring RootAllomorphTrie -- no
Compose/Determinize/Minimize, which are unsafe for HC's underspecified-feature
arcs): root segment chains from the start state, with each fixed-segment suffix
appended after every root-accepting state; accepting states map to packed
MorphToken arrays. Analysis is a single nondeterministic Transduce walk of the
surface word -- no Word clones, no generate-and-test.
Verified against Morpher.AnalyzeWord on the concatenative fragment:
- bare root and root+suffix ("sags") round-trip to the same morphemes + root;
- COMPLETENESS as analysis-set equality, not "found one": homographs (dat ->
entries 8 and 9 both found), and the negative case (no path -> both empty)
agree with the search engine;
- an [Explicit] allocation comparison (FST walk vs search engine).
Caveat documented in FstMorpher: arcs match segments only, not the affix's
syntactic/MPR/stratum constraints, so on grammars where letters match but
constraints exclude an analysis it would over-generate -- closed by
feature-unification arcs (HERMITCRAB_FST_PLAN.md section 8). 78 HC tests green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
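The acceptor construction described in this commit (root chains from the start state, every suffix appended after every root-accepting state, accepting states mapped to analyses) might look roughly like the sketch below. It is a simplification under stated assumptions: segments are plain characters rather than HC feature-structure arcs, so the walk here is deterministic, and analyses are arrays of hypothetical integer morpheme ids rather than packed MorphToken values. `AcceptorSketch` is an illustrative name, not the PR's FstMorpher.

```csharp
using System.Collections.Generic;

public sealed class AcceptorSketch
{
    private sealed class State
    {
        public readonly Dictionary<char, State> Next = new Dictionary<char, State>();
        // Every analysis that ends at this state; homographs share one state.
        public readonly List<int[]> Analyses = new List<int[]>();
    }

    private readonly State _start = new State();

    public AcceptorSketch(
        IEnumerable<(string Form, int Id)> roots,
        IEnumerable<(string Form, int Id)> suffixes)
    {
        var suffixList = new List<(string Form, int Id)>(suffixes);
        foreach (var (form, id) in roots)
        {
            State rootEnd = Extend(_start, form);
            rootEnd.Analyses.Add(new[] { id });
            // Append each fixed-segment suffix after this root-accepting state.
            foreach (var (suffixForm, suffixId) in suffixList)
                Extend(rootEnd, suffixForm).Analyses.Add(new[] { id, suffixId });
        }
    }

    // Follows existing arcs, creating states only where the chain is new.
    private static State Extend(State from, string segments)
    {
        State state = from;
        foreach (char c in segments)
        {
            if (!state.Next.TryGetValue(c, out State next))
            {
                next = new State();
                state.Next[c] = next;
            }
            state = next;
        }
        return state;
    }

    // One pass over the surface word; returns every analysis, or none if there is no path.
    public IReadOnlyList<int[]> Analyze(string word)
    {
        State state = _start;
        foreach (char c in word)
        {
            if (!state.Next.TryGetValue(c, out state))
                return new List<int[]>();
        }
        return state.Analyses;
    }
}
```

Like the slice described above, this sketch matches segments only. It carries none of the syntactic, MPR, or stratum constraints, so it shares the over-generation caveat noted in the commit.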
…e contract) FstMorpher now implements IMorphologicalAnalyzer.AnalyzeWord -> IEnumerable<WordAnalysis>, the same interface Morpher implements and that consumers (FieldWorks et al.) depend on. Morphemes and root index come from the token walk; Category is null because this slice does not yet track syntactic features (arrives with the unification arcs). A test drives both engines through the IMorphologicalAnalyzer contract and asserts the WordAnalysis sets match (homographs included) on the concatenative fragment, so the FST analyzer is a drop-in for the search engine at the interface level. Still scoped to the clean concatenative fragment built from explicit root/suffix lists; full-grammar Compose compilation, phonology/allomorphy, the feeding-closure completeness certificate, and the FieldWorks adapter remain. 79 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…grammar FstMorpher.FromLanguage(Language) builds the analyzer by introspecting the grammar: every root allomorph plus every single-allomorph suffix rule (detected via MorphTokenCodec.ClassifyOp == Suffix). It THROWS NotSupportedException on any construct outside the concatenative root+suffix fragment (prefixes, infixes, reduplication, compounding, templates, multi-allomorph affixes) so it never silently under-generates — a caller learns exactly what this slice cannot cover. This closes the "not driven from a compiled Language" gap: the analyzer now consumes a real Language object, not explicit root/suffix lists. Verified at parity with Morpher (through the IMorphologicalAnalyzer contract) on the concatenative fragment, plus a guard test that a prefix rule makes FromLanguage refuse rather than produce a quietly incomplete analyzer. 81 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ffix?) The acceptor now prepends a prefix segment chain before the root (mirroring the appended suffix chains), so it covers an optional fixed-segment prefix + root + optional fixed-segment suffix. FromLanguage auto-detects prefix rules (ClassifyOp == Prefix) alongside suffixes and still throws on anything outside the concatenative fragment (reduplication, infix, compounding, templates, multi-allomorph affixes). Verified at parity with Morpher via the IMorphologicalAnalyzer contract: "disag" = di-(PST) + sag, and the bare root through the no-prefix branch. The throw-guard test now uses reduplication (a genuinely non-concatenative rule). 82 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tificate)

Document the completeness/closure analysis: an FST analyzer is trustworthy only if its silence is a proof. Completeness has two parts: "no escape applies here" (easy) and "no normal FST step reachable from the input can FEED an escape" (the kicker, Kiparsky feeding). The universal question is undecidable, but per grammar it is usually decidable:
- decidable feeding-closure: for each FST-able rule F and escape E, test range(F) ∩ trigger(E) = ∅ via Fst.Intersect; all empty ⇒ the fragment is closed ⇒ "no path" is a complete certificate; non-empty + regular ⇒ fold in; non-empty + non-regular ⇒ those words fall to the search backstop;
- stratal containment as the practical guarantee (escapes innermost, not fed by the FST fragment);
- homograph completeness = all accepting paths returned, contingent on closure + never unsafely determinizing/minimizing unification arcs;
- the search backstop's "done" rests on a true derivation-depth bound (finite iff no unbounded self-feeding cycle);
- the work: a static feeding-closure pass extending GrammarFstAdvisor + corpus closure verification (set parity) as the gate before the FST may replace the search engine.

Wired into the phased plan (Phase 3 gated on §9), the risks table, and the decision flow. Until closure is confirmed for a grammar, the FST runs in shadow/verification mode.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…, invariant completeness)

Design in from the start a tunable partition that bounds the compiled automaton without sacrificing completeness. Three buckets: A precompiled (eager, fast walk, costs states), B on-the-fly (lazy on-demand composition, bounded memory), C search/probe fallback (non-FS escapes, set by section 9 closure, not by the knob). The A<->B boundary is the knob, with a safe floor (everything lazy = bounded + complete).

Why completeness is INVARIANT under the knob: composition is associative, so precompiling A.B vs applying B lazily after A denotes the same relation. The split changes when work happens, never which analyses exist; the walk enumerates all paths in either bucket; and closure (section 9) is computed on the full A.B relation. So the knob is a pure space/time dial; the analysis set does not move.

The policy is per-language (yes, it differs): rank layers by state-multiplier x corpus hotness, precompile cheap-and-hot, keep expensive-and-cold lazy, auto-demote A->B under a state/memory budget. The same construct can be eager in one project and lazy in another -> pluggable policy + optional auto-tuner.

Designed-in requirements:
- the compiler is a pipeline of self-contained composable layers (each with state-multiplier/hotness/closure metadata) behind one eager-or-lazy interface;
- the analyzer walks the eager core and lazily expands B layers, emitting the same MorphToken outputs;
- the state budget is a first-class compile input (auto-demotion logged, never silent truncation);
- the corpus set-parity gate runs against the chosen partition.

Wired into the risks table (state-blowup) and the phased plan (Phase 1-2 architecture).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ty gate FstVerification.Compare runs a candidate analyzer (FstMorpher) beside the sound+complete reference (Morpher) over a corpus and reports, per word, where their analysis SETS differ: MissingFromCandidate (completeness failures) and ExtraInCandidate (soundness/over-generation failures). AnalysisComparison.IsComplete is the gate (HERMITCRAB_FST_PLAN.md §9.5/§10.4) that must pass before the FST may replace the search engine — until then the FST runs in shadow mode. This operationalizes the completeness question: it measures both "did we find them all" and "did we invent any" at once, against the proven engine. Tests: FstMorpher.FromLanguage vs Morpher is IsComplete over a concatenative corpus (inflected, bare root, homograph, non-word), and the harness flags a deliberately-empty candidate as incomplete (proving it is not vacuous). 84 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
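The set-parity comparison can be sketched as below. This is an illustration only: analyzers are modelled as plain functions from a word to a set of analysis strings, and `VerificationSketch`, `ComparisonSketch`, and `SetsMatch` are hypothetical names. Whether the PR's AnalysisComparison.IsComplete also requires an empty ExtraInCandidate is not stated in the commit message, so the sketch exposes both conditions separately.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record ComparisonSketch(
    IReadOnlyList<string> MissingFromCandidate,  // completeness failures
    IReadOnlyList<string> ExtraInCandidate)      // soundness / over-generation failures
{
    public bool NothingMissing => MissingFromCandidate.Count == 0;
    public bool SetsMatch => NothingMissing && ExtraInCandidate.Count == 0;
}

public static class VerificationSketch
{
    // Runs both analyzers over the corpus and records, per word, the set differences.
    public static ComparisonSketch Compare(
        Func<string, IEnumerable<string>> reference,
        Func<string, IEnumerable<string>> candidate,
        IEnumerable<string> corpus)
    {
        var missing = new List<string>();
        var extra = new List<string>();
        foreach (string word in corpus)
        {
            var expected = new HashSet<string>(reference(word));
            var actual = new HashSet<string>(candidate(word));
            missing.AddRange(expected.Except(actual).Select(a => $"{word}: {a}"));
            extra.AddRange(actual.Except(expected).Select(a => $"{word}: {a}"));
        }
        return new ComparisonSketch(missing, extra);
    }
}
```

Comparing sets rather than "found at least one" is what lets a deliberately empty candidate fail the gate, as in the commit's non-vacuity test.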
Extend the acceptor with compound paths: for each compounding subrule, build root×root chains (head segs + non-head segs), tagging the head Root and the non-head Compound per the subrule's surface head position (head-first or head-last detected from the first CopyFromInput in the Rhs). FromLanguage now collects CompoundingRules alongside prefixes/suffixes; single head + single non-head subrules only (throws otherwise). This is root×root in state count — exactly the layer §10 flags as a lazy-bucket candidate at lexicon scale — built eagerly here for the parity check. Verified against Morpher via the IMorphologicalAnalyzer contract: "pʰutdat" = pʰut(5) + dat, returning both homographic non-heads (5+8, 5+9) exactly as the search engine, and the bare root through the non-compound path. 85 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…morph BuildAffixChains now iterates ALL allomorphs of an affix rule, building a segment chain for each, all sharing the rule's morpheme token. Environment- conditioned allomorphy is handled by the surface: only the allomorph whose segments match the input accepts. FromLanguage no longer restricts affixes to a single allomorph (throws only if an allomorph lacks a fixed-segment InsertSegments). This rounds out the concatenative Tier-1 fragment — roots, prefixes, suffixes, bounded compounding, and multi-allomorph affixes. Verified at parity with Morpher via the IMorphologicalAnalyzer contract: a plural with -s/-t allomorphs analyzes "sags" and "sagt" exactly as the search engine, the surface selecting the right allomorph. 86 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GrammarFstClosure.Analyze decides, per non-regular escape (reduplication / infixation), whether any FST-able rule could apply before it and FEED it (Kiparsky feeding). Sound stratal pre-filter: an escape is CLOSED only if no FST-able rule (concatenative affix, compounding, or any phonological rule) applies at or before its stratum; same-stratum rules count too, since unordered application could place them first. It never falsely reports closed.

ClosureReport.FstClosed is true iff every escape is closed (vacuously true when there are no escapes). In that case the FST built over the FST-able fragment is closed and its "no path" is a proof for words showing no escape signature, subject to the per-word surface check and the corpus parity gate. This is the static half of "confirming FST closure"; the empirical half is FstVerification. The precise refinement that reclaims over-flagged cases is range(F) ∩ trigger(E) = ∅ via Fst.Intersect.

Tests: no escapes -> closed (vacuous); innermost reduplication with nothing before it -> closed; a suffix in the same unordered stratum -> potentially fed. 89 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e 3) HybridMorpher (IMorphologicalAnalyzer) wires the three pieces together: the precompiled FST handles the FST-able fragment as a fast, allocation-light walk, and words that could involve a non-regular escape fall back to the sound+complete search engine. FstMorpher.FromLanguage gains ignoreEscapes to build the FST over just the FST-able fragment. Routing is sound by construction (only ever sends MORE words to search): the fast path is taken iff the grammar has no escapes, or every escape is CLOSED (GrammarFstClosure) AND is total reduplication (the surface signature this router detects, XX) AND the word shows no such signature. Otherwise the search runs. Verified: with a closed total-reduplication escape, "sag" takes the FST fast path and "sagsag" falls back to search, and the combined analysis set is verified COMPLETE against the pure search engine via FstVerification. This is the Phase-3 Tier-2 hybrid: the closure pass decides safety, the verification gate proves parity, the runtime routes per word. 90 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
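The per-word routing rule can be sketched as follows. The commit message does not say how the XX signature is detected, so `HasCopySignature` is an assumption: it flags any adjacent repeated substring of at least `minCopyLength` characters, which errs on the side of sending more words to search. `RouterSketch` and its parameters are hypothetical names.

```csharp
public static class RouterSketch
{
    // Assumed detector: true if the word contains some substring X immediately
    // followed by a second copy of X, with |X| >= minCopyLength.
    public static bool HasCopySignature(string word, int minCopyLength = 2)
    {
        for (int len = minCopyLength; 2 * len <= word.Length; len++)
        {
            for (int start = 0; start + 2 * len <= word.Length; start++)
            {
                if (string.CompareOrdinal(word, start, word, start + len, len) == 0)
                    return true;
            }
        }
        return false;
    }

    // Fast path iff the grammar has no escapes, or every escape is a closed
    // total-reduplication escape AND this word shows no copy signature.
    public static bool UseFastPath(
        bool grammarHasEscapes,
        bool allEscapesClosedTotalReduplication,
        string word)
    {
        return !grammarHasEscapes
            || (allEscapesClosedTotalReduplication && !HasCopySignature(word));
    }
}
```

A false positive from the detector (a word that merely looks reduplicated) only costs speed, since the search engine still returns the full analysis set for it.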
FstGenerator implements IMorphologicalGenerator for the concatenative fragment:
generation is the inverse of the analyzer's walk — ordered concatenation of each
morpheme's surface representation (prefix before root, suffix after, compound
stems concatenated), Cartesian over allomorph choices. Mirrors Morpher's
morpheme inventory so it is a drop-in IMorphologicalGenerator.
Verified against Morpher.GenerateWords on the concatenative fragment, and the
analyze→generate round-trip recovers the input word ("sag", "sags"). Scope
matches FstMorpher (roots + fixed-segment affixes); phonology/reduplication
defer to the search generator. 92 HC tests green.
This completes the in-repo Phase-4 surface: both directions (FstMorpher analyzer
+ FstGenerator generator), the Tier-2 hybrid, closure confirmation, verification
gate, and the grammar census all exist and are verified. The remaining Phase-4
item — the FieldWorks adapter — lives in a separate repository.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
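Generation as "ordered concatenation, Cartesian over allomorph choices" can be sketched like this. It assumes the caller has already put the morphemes in surface order (prefix, root, suffix) and reduces each morpheme to a list of surface strings; `GeneratorSketch` is an illustrative name and not the PR's FstGenerator.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class GeneratorSketch
{
    // Each inner list holds the allomorphs of one morpheme, in surface order.
    // The result is every concatenation that picks one allomorph per morpheme.
    public static IReadOnlyList<string> Generate(
        IReadOnlyList<IReadOnlyList<string>> allomorphsInSurfaceOrder)
    {
        List<string> forms = new List<string> { "" };
        foreach (IReadOnlyList<string> choices in allomorphsInSurfaceOrder)
        {
            forms = forms
                .SelectMany(prefix => choices.Select(allomorph => prefix + allomorph))
                .ToList();
        }
        return forms;
    }
}
```

For a root "sag" with a plural that has the allomorphs -s and -t, this yields "sags" and "sagt". Choosing between them by environment is outside the sketch.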
…iming + parity

FstSenaBenchmark loads a real grammar (HC_GRAMMAR/HC_WORDS), runs the census and closure pass, attempts to build FstMorpher/HybridMorpher (reporting the concrete WALL if a construct is out of fragment), and times search vs FST vs hybrid with FstVerification parity.

Run on the real Sena grammar, this surfaces the concrete remaining wall:
- census: Tier 1, fully FST-able, 0 escapes, FST-CLOSED;
- but FstMorpher.FromLanguage hits "stratum 'Morphology' has affix templates": templates (position classes) are FST-able in principle but not yet built by FstMorpher, so the FST/hybrid cannot yet be constructed from Sena;
- search baseline (unlimited unapplications): ~206 ms/word, true parses.

So affix-template support is the next build to get the FST over the wall on a real FLEx grammar.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…umulating walk FstTemplateAnalyzer handles affix templates (position classes) — the real-grammar case. Two design points from the advisor review: (1) build-time CATEGORY GATING — a template attaches only to roots whose category unifies with its RequiredSyntacticFeatureStruct, which kills over-generation AND lets same-category roots share one copy of the template's slot-automaton (states = roots + Σ template automata, not roots × combos); (2) TOKEN ACCUMULATION along the path (a state carries the morpheme token emitted on entry) via a custom DFS walk, since the shared automaton is reached by many roots so an accept-id map won't do. A maxStates budget (§10 knob) aborts before a blowup. New additive class (the 92 existing tests + the accept-id acceptor are untouched). Verified on a toy: a V-only suffix template with two optional slots reproduces the search engine's analyses for sag/sagd/sagdv, AND the category gate correctly blocks the verb template on an A-category root (gab/gabd) — FstVerification parity. Scope this slice: suffixing templates + category gating; prefix-slot templates, cross-stratum gating, and phonology are next. Wired into FstSenaBenchmark. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…k — Sena PARSES

FstTemplateAnalyzer now handles prefix AND suffix template slots (prefixes surface in reverse template order), gated by BOTH category (root features unify with the template's RequiredSyntacticFeatureStruct) and stratum (root at the template's stratum or inner; the 'datd' lesson). The walk is a proper NFA simulation (active config-set per segment, deduped by (state, tokens)) instead of the exponential recursive DFS, and guards InvalidShapeException (out-of-table phonemes) like the search engine.

Result on the real Sena grammar (sena-hc.xml, 24 templates):
- FstTemplateAnalyzer BUILDS in ~0.5 s (gating shares automata → no state blowup);
- parses at ~6.4 ms/word vs the search engine's ~178 ms/word, about 28x faster;
- 14 of 16 analyses match the search engine across the sample, with one MISSING analysis on 'mafuta' (a two-prefix form): a coverage gap, not over-generation.

Toy tests (parity-verified): suffix template + category gate; prefix+suffix template (bare / prefixed / suffixed / both) + gate. 94 HC tests green. Sena now parses through the FST; closing the last divergence to full FstVerification parity is the remaining refinement.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…Sena ~100x Two coverage fixes for real Sena: (1) slot rules of type RealizationalAffixProcessRule (a sibling of AffixProcessRule, same IList<AffixProcessAllomorph>) are now included, not skipped; (2) every affix is entered through a token-bearing state, so a zero/empty-segment morpheme still emits its token (previously the token was placed on the first segment state and a zero morph emitted nothing -> a missing analysis). Result on real Sena: FstTemplateAnalyzer parses at ~3 ms/word vs the search engine's ~337 ms/word (~100x) and matches the search engine's analysis set on the 8-word sample exactly; on 30 words 26 match, with residual divergences in both directions (one under-generation; over-generation where a constraint the FST does not yet enforce -- obligatory affixation / MPR / co-occurrence -- would exclude an analysis). Sena PARSES through the FST; full FstVerification parity at scale is the remaining constraint-enforcement refinement. 94 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
VerifiedFstAnalyzer wraps the FST proposer and verifies each candidate by re-synthesis (keep it only if generating from it reproduces the surface). In principle this makes the FST sound ("matched a state" becomes "valid derivation", completeness condition 3, enforced outside the automaton) and eliminates over-generation without encoding constraints in the FST.

Empirical finding on real Sena: Morpher.GenerateWords is NOT a clean validity oracle here. It over-rejects: it regenerates only ~1/4 of valid analyses, which worsens 30-word parity from 4 to 11 divergences. The cause is that realizational affixes need their realizational feature structure to synthesize, and a bare morpheme-list WordAnalysis does not carry it. NFD-normalizing the comparison did not help, so it is not a normalization artifact.

Conclusion (documented in the class): propose-and-verify needs either a richer candidate (carrying the realizational FS) or a different soundness mechanism (constraints-on-arcs, or the stratal left/right handoff). The sound system today remains the hybrid (FST proposes, search backstop). Mechanism + finding kept; 94 HC tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nd+complete SoundHybridMorpher is sound+complete BY CONSTRUCTION: the FST proposes, each candidate is verified by REPLAYING its own derivation (synthesize from the candidate's root + affixes with the realizational FS its realizational affixes imply), and any candidate that can't be confirmed routes the whole word to the search engine. Verified on real Sena: "sound vs search: IDENTICAL". But the measured fallback is high (~87% on 30 words), so the speedup is lost: re-synthesis is not yet a clean validity oracle. Supplying the realizational feature-structure (vs the empty one Morpher.GenerateWords(WordAnalysis) uses) recovered some words (100%->87% fallback), confirming the cause is missing derivational features — but other feature determinants (syntactic agreement, MPR, allomorph conditioning) the (op, morpheme) token list does NOT carry are still absent, so most valid candidates fail to regenerate and fall back. Finding: propose-and-verify-by-resynthesis is sound+complete but cannot reach the <5% fallback / >=10x target while the candidate under-determines the derivation. The clean-oracle path is a richer candidate (carry the feature/ allomorph state) or feature-bearing FST arcs (way A). 94 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…g table The data structure for a feature-aware, O(1) FST walk: ids in bits, objects in tables. - Interner<T>: maps values to dense small ids and back in O(1), so a feature structure / MPR set is referenced by a few bits instead of stored per state. - MorphStateLayout: a tunable bit-packed state over an array of uint. Every field is one shift+mask, never straddling a 32-bit word, greedily packed in declaration order; WordCount falls out of the chosen widths. Field widths are the tuning knob (constructor args). Default layout (3 words): Op8|Morpheme24 (the analysis token) ; Category12| Realizational12|Slot8 ; Mpr32 (bitmask). A validity transition is shift/mask + an O(1) interned-table / transition-table lookup — no runtime unification — so an accepting state means a valid derivation (completeness condition 3, in O(1)). FST_STATE_PACKING.md documents each field (stored vs looked-up), the tradeoffs (bits-headroom vs density vs words-to-hash; baked-transition vs runtime-unify), and the bit tuning (Morpheme 24->22 frees 2 bits / ~4.19M morphemes; collapsing to 2 words; per-field ceilings via MaxValue). Tests: pack/unpack round-trip + field independence + overflow + re-tuned width; interner id-sharing/density. 101 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
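The greedy, non-straddling field packing can be sketched as below. `LayoutSketch` is a hypothetical name, fields are addressed by declaration index, and each field is placed at the lowest free bits of its word; that bit position is an arbitrary choice for this example and need not match the PR's MorphStateLayout, which describes Op as the high 8 bits.

```csharp
using System;

public sealed class LayoutSketch
{
    private readonly int[] _word;   // which uint holds each field
    private readonly int[] _shift;  // bit offset of each field inside its uint
    private readonly int[] _width;  // bit width of each field

    public int WordCount { get; }

    // Packs fields greedily in declaration order; a field that would cross a
    // 32-bit boundary starts a new word instead of straddling two.
    public LayoutSketch(params int[] widths)
    {
        _word = new int[widths.Length];
        _shift = new int[widths.Length];
        _width = new int[widths.Length];
        int word = 0, used = 0;
        for (int i = 0; i < widths.Length; i++)
        {
            if (widths[i] < 1 || widths[i] > 32)
                throw new ArgumentOutOfRangeException(nameof(widths));
            if (used + widths[i] > 32) { word++; used = 0; }
            _word[i] = word;
            _shift[i] = used;
            _width[i] = widths[i];
            used += widths[i];
        }
        WordCount = widths.Length == 0 ? 0 : word + 1;
    }

    // Per-field ceiling; the 32-bit case is special because 1u << 32 does not yield 0 in C#.
    public uint MaxValue(int field) =>
        _width[field] == 32 ? uint.MaxValue : (1u << _width[field]) - 1;

    public uint Get(uint[] state, int field) =>
        (state[_word[field]] >> _shift[field]) & MaxValue(field);

    public void Set(uint[] state, int field, uint value)
    {
        uint mask = MaxValue(field);
        if (value > mask)
            throw new ArgumentOutOfRangeException(nameof(value));
        int w = _word[field];
        state[w] = (state[w] & ~(mask << _shift[field])) | (value << _shift[field]);
    }
}
```

With the default widths from the commit message (8, 24, 12, 12, 8, 32) this packing yields three words. Narrowing a field changes only the constructor arguments, which is the tuning knob the message describes.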
…sue) Wired the first morphosyntactic constraint into FstTemplateAnalyzer: a slot rule whose RequiredSyntacticFeatureStruct cannot unify with the template's category is omitted at build time — HC's Required.Unify(stem) gate, hoisted to compile time. This avoids the surface-vs-derivation-order obstacle that blocks threading the category through the surface-order walk (prefixes are encountered before the root whose category they need): for inflectional templates the category is ~constant, so the build-time gate is faithful and needs no per-path state. Test: N-requiring suffix in a V-template is pruned, so "sagz" is not over-generated and the FST stays IDENTICAL to search. Empirical finding: this did NOT change Sena's divergences — so Sena's over-generation is NOT category-based. The actual causes, by word: - mbale [:0] : bare-root acceptance (Bantu obligatory inflection), - kulemba [+++..] : MPR / morpheme-co-occurrence (a forbidden affix combination), - aikhane/angwera : coverage gaps (deeper morpheme stacks not yet built). Category is one transition; MPR/obligatoriness/co-occurrence are the next ones. 102 HC tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tion A root is marked bare-accepting only if synthesizing it with no affixes yields its own surface (HC's finality/validity check). In an always-inflected grammar (Bantu) a root that must take a class/agreement affix returns nothing from bare synthesis, so the spurious bare reading is suppressed. Default constructor keeps "every root may stand bare" (toy grammars); the new (Language, Morpher) constructor enables the check, wired into SoundHybridMorpher and the Sena benchmark. Result on real Sena: the raw FST-vs-search divergences drop 4 -> 3 — the "mbale [:0]" bare-root over-generation is gone. Remaining: kulemba (MPR / co-occurrence over-gen — the next in-FST guard) and aikhane/angwera (coverage under-gen — deeper morpheme stacks). 102 HC tests green; build ~0.6s. Note: the SoundHybridMorpher verify-fallback rate is still high because the re-synthesis verify over-rejects (under-determined candidate); the in-FST guards (category, obligatoriness, next MPR) are the path that makes the FST sound directly so the verify becomes unnecessary — tracked by the raw FST-vs-search divergence count, now 3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…11 findings The benchmark forced MaxUnapplications=3 on the reference Morpher, but =0 means UNLIMITED — so the ground truth was crippled and every prior "divergence" was an artifact. With the corrected unlimited oracle the FST template analyzer is IDENTICAL to search on the curated set at ~74x; on the broader set the only real residuals are over-generation (cleanly removed by verify-discard) and under-generation that is entirely unbuilt-B derivational affixes (REC/APPLIC/REV/ NZR/NEU/PAS) — not bucket C. - FstReplay: shared re-synthesis verifier (supplies the candidate's realizational FS). - VerifiedFstAnalyzer: discard mode — installs all over-gen gates via HC synthesis. - SoundHybridMorpher: delegates replay to FstReplay. - Benchmark: gloss diagnostic + obligatoriness-gate re-validation (gate still load-bearing under the correct oracle: 5->4 divergences). - HERMITCRAB_FST_PLAN §11: the corrected picture, the sharpened bucket-C framing (B∘C∘B thin local core; Sena has 0 reduplication rules / FST-CLOSED), the three C release valves, and the path to a full solution. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Shared, bounded (depth 2) derivational-suffix layer between root and the inflectional suffix slots, shared across a template's roots (tokens accumulate on the walk path, so no roots×derivations blowup). This CLOSES the under-generation: the raw FST now proposes the correct REC/APPLIC/REV/NEU/PAS analyses (aikhane/angwera/paoneke/ikoyiwe go from missing to proposed). FstReplay now tries BOTH re-synthesis strategies (native WordAnalysis replay with permutation; explicit realizational-FS form) and accepts if either reproduces. Key finding (benchmark round-trip self-test): HC's analysis->synthesis is LOSSY for derivational verb forms — search's OWN analyses for aikhane/angwera/kunduli/ikoyiwe do NOT round-trip through GenerateWords (they return NO), while noun/simple forms do (kulemba/mbalira OK). The rich analysis Word carries rule-chain features the lean WordAnalysis (op+morpheme tokens) loses. So re-synthesis verification cannot be the soundness oracle for this grammar's realizational morphology — it rejects valid analyses. Soundness needs a different mechanism (richer tokens / build-time gates). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Deep probe + grammar inspection confirm WHY re-synthesis verification failed: it is NOT fundamental loss. Morpher has two synthesis doors — internal Synthesize (rich analysis Word: template/slot structure + features → faithful) and public GenerateWords (flat morpheme bag applied as FREE rules → lossy). The inflectional affixes (3P+2, SBJV) are template-slot rules (mrule26+ inside <Slot>), only compounding/derivation are free (mrule1-25); GenerateWords applies slot rules as free rules so feature-dependent verb combos never synthesize — while simple nouns do (why nouns round-trip, verbs don't). The under-determination is self-inflicted: the FST walk knows the template/slot path and discarded it in the lean (op,morpheme) token. Fix = preserve that path and verify through HC's faithful door (template-aware directed synthesis), making verify sound AND lossless and collapsing the 90% false-rejection fallback. Corpus (200 Sena words, unlimited oracle): raw FST 49/200 diverge (~19% over-gen, ~7% under-gen) at ~64x; documented in plan §11.5/§11.6. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d, 99% complete) Status check: the SoundHybridMorpher path has ZERO over-generation across 200 words (both residual divergences miwiri/mitemo are pure under-gen). So the system is already SOUND and ~99% COMPLETE — correctness is effectively done. The one open axis is SPEED: the 90% fallback is the lossy GenerateWords verify false-rejecting valid words, not genuine errors. Pinpointed why GenerateWords fails (probe of aikhane): stem shape = citation shape = ikh (not a stem-shape issue); its rules a-5/-e/-an mix template-slot inflection with a free derivational rule across DIFFERENT STRATA, and GenerateWords' flat permuted rule pool cannot reconstruct the cross-stratum order that HC's internal Synthesize reads off the rich analysis Word. The FST walk knows that structure. Plan §11.7/§11.8: the one remaining build is a faithful (template-aware, stratum-ordered) directed-synthesis verify, which makes VerifiedFstAnalyzer sound AND lossless with no fallback at full FST speed. §11.4 path updated with what's done. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…wo routes GenerateWords(WordAnalysis) permutes rule order (tries the correct order) and still fails — so the missing ingredient is analysis-derived Word state, not rule ordering. A cheap faithful verify is therefore harder than "right order". §11.8 now records two viable routes to the speed unlock: (A) faithful reconstruction verify driving HC's internal Synthesize, or (B) build-time constraint gates (subcategorization etc.) that make the FST faithful with no per-word verify. Route B is likely the surer win for this grammar. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…aints) Per design review: Route B duplicates HC's constraint logic as parallel FST arc-gates (a second engine to keep aligned) — the anti-pattern. Route A reuses HC. Sharpened Route A = "directed un-application, then Synthesize": the FST replaces only HC's slow backward SEARCH (it knows the exact path), then HC's real forward Synthesize confirms. Faithful by construction, cost ~(rules in path) not full fan-out → the >=10x. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ureStruct) AnalysisAffixTemplateRule unifies the template's RequiredSyntacticFeatureStruct and writes it onto the word (SyntacticFeatureStruct.Add(fs)). That populated feature structure is the synthesis precondition the inflectional rules check — exactly what from-citation GenerateWords never establishes (why correct rule order still fails there). Directed un-application calls the same CompileAnalysisRule objects along the FST path, reconstructing that state for free, then Synthesize succeeds. Confirms Route A is reuse, not reimplementation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the lossy GenerateWords re-synthesis verify with restricted re-analysis: run HC's own AnalyzeWord with LexEntrySelector/RuleSelector pinned to the candidate's root and rules. That prunes HC's combinatorial fan-out to the single path the FST found (a few ms, not the full search) while reusing HC's real analysis+synthesis validation end to end — no reimplemented constraints. This is the thin wrapper the design wanted: faithful AND lossless, where GenerateWords was neither (it re-synthesized from a flat permuted rule list off the citation form and never re-established the SyntacticFeatureStruct that AnalysisAffixTemplateRule sets during un-application, so it false-rejected valid verb forms). Measured (200 Sena words, unlimited oracle): - verify-discard: 48 -> 14 divergences (186/200 match); ALL over-generation removed, remaining 14 are pure under-gen (category-changing NZR derivation + miwiri/mitemo). - verify speed 14.4 ms/word vs 263 ms/word oracle (~18x), no fallback. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
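The restricted re-analysis idea can be illustrated with a toy suffix-stripping engine. This is a Python sketch under stated assumptions, not HermitCrab code: `analyze`, `verify`, `root_ok`, and `rule_ok` are hypothetical names standing in for the engine's `AnalyzeWord` and its `LexEntrySelector` / `RuleSelector` hooks. The point it shows is that pinning the selectors to one candidate prunes the search to that candidate's path, while acceptance is still decided by the engine itself.

```python
def analyze(word, roots, suffixes, root_ok=lambda r: True, rule_ok=lambda n: True):
    """Toy engine: strip suffixes right to left until a known root remains.

    Returns (root, rules) pairs with rules listed innermost first.
    """
    results = []

    def strip(rest, applied):
        if rest in roots and root_ok(rest):
            results.append((rest, tuple(reversed(applied))))
        for name, shape in suffixes.items():
            if rule_ok(name) and rest.endswith(shape) and len(rest) > len(shape):
                strip(rest[:-len(shape)], applied + [name])

    strip(word, [])
    return results


def verify(word, candidates, roots, suffixes):
    """Confirm each proposed candidate by re-running the engine with the
    selectors pinned to that candidate's root and rules."""
    confirmed = []
    for root, rules in candidates:
        pinned = analyze(
            word, roots, suffixes,
            root_ok=lambda r, root=root: r == root,
            rule_ok=lambda n, rules=rules: n in rules,
        )
        # Emit the engine's own analysis, and only if it matches the proposal.
        confirmed.extend(a for a in pinned if a == (root, rules))
    return confirmed


ROOTS = {"lemb"}
SUFFIXES = {"APPLIC": "er", "FV": "a"}
PROPOSED = [
    ("lemb", ("APPLIC", "FV")),  # valid
    ("lemb", ("FV", "APPLIC")),  # over-generated: wrong order
    ("lem", ("FV",)),            # over-generated: unknown root
]
print(verify("lembera", PROPOSED, ROOTS, SUFFIXES))
```

Because the verifier only filters what the proposer offers, it can remove over-generation but cannot repair under-generation, which is consistent with the remaining divergences reported above being pure under-gen.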
…ound-trip probe The harness re-ran the slow search oracle ~5x (once per parity comparison). Wrap it in a CachingAnalyzer so the oracle runs one pass and the comparisons reuse it; wall time drops from ~6min to ~2.4min (remainder = one 47s oracle pass + the sound variant's 86%-fallback search). Removed the now-obsolete GenerateWords round-trip probe (its investigation is concluded; the deep probe stays gated on HC_PROBE_WORD). Corpus result with the Route A verify (200 words): verify(discard) 14 divergences (186/200 match) at 15.6 ms/word (~15x), sound, lossless, no fallback. The 14 are pure under-gen (category-changing NZR derivation + miwiri/mitemo) to be closed in the proposer. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…art-2 coverage §11.8 marks the restricted-re-analysis verify implemented and measured (48->14 divergences, 186/200 set-match, ~15x, no fallback, lossless). §11.4 updates the path: the only remaining work is closing the 14 under-gen in the PROPOSER — category-changing derivation (verb->noun NZR feeding a noun template) + a few prefixal/deeper cases. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…stems Relax template attachment: a template attaches to a root if the root's category matches OR can be DERIVED to the template's category via <=DerivDepth derivational suffixes (DerivableToCategory, using each rule's OutSyntacticFeatureStruct). This lets a noun-class template sit over a nominalized verb stem (vencer[verb]+NZR -> noun + class prefix), closing the category-changing under-gen (kunduli/cidzo/khalani). The category-changing suffix rides the existing shared derivation layer; verify-discard removes any resulting over-gen. Measured (200 Sena words): verify-discard 14 -> 6 divergences (194/200 set-match) at 17.2 ms/word (~13x), sound + lossless, no fallback. Remaining 6 are diverse proposer coverage gaps: prefixal derivation (nyari/cawo), depth-3 derivation (miwiri), and copula/TAM constructions (ndico/ndimwe/kuumadi). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
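The relaxed attachment test amounts to a bounded reachability check over category-changing rules. The sketch below is in Python and simplifies each derivational rule to an (input category, output category) pair. That simplification is an assumption, since the real check works on feature structures via `OutSyntacticFeatureStruct`, and `derivable_to_category` is a hypothetical name echoing `DerivableToCategory`.

```python
def derivable_to_category(root_cat, target_cat, rules, max_depth=2):
    """True if target_cat is reachable from root_cat by applying at most
    max_depth derivational rules, each given as (in_cat, out_cat)."""
    frontier = {root_cat}
    if target_cat in frontier:
        return True
    for _ in range(max_depth):
        # Categories reachable with exactly one more rule application.
        frontier = {out for (src, out) in rules if src in frontier}
        if target_cat in frontier:
            return True
    return False


# A verb root can sit under a noun template via one nominalizing rule.
RULES = [("V", "N")]
print(derivable_to_category("V", "N", RULES))  # verb -> NZR -> noun
print(derivable_to_category("N", "V", RULES))  # no rule derives V from N
```

Raising `max_depth` widens coverage but also the number of candidates the verifier must check, which is the trade-off measured in the next commit.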
Depth 3 gains 1 word (miwiri) but ~2x verify cost, so keep DerivDepth=2: 194/200 set-match at 17.2 ms/word (~13x), sound + lossless, no fallback. Plan §11.4 records the achieved target and characterizes the last 6 (diverse proposer coverage gaps: prefixal derivation, depth-3, copula/TAM) — not a verify or soundness issue. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t is real The parity signature was join(Morpheme.Id)+":"+rootIndex, but affix Id is empty here, so it encoded only (morpheme count, root position) — collapsing same-shape affix variants (3P+2 / 3S+1 / 6) and hiding same-shape under-generation. Replace with per-morpheme object identity (both analyzers reference the same Morpheme instances, so identity is a faithful shared key). FstReplay's candidate-match signature sharpened too. Re-measured (200 words, strict signature): raw FST 44 -> 90 divergences (shape-parity HAD hidden raw over-gen), but verify-discard stayed at 6 (194/200), ALL pure under-gen. So the soundness/lossless/~13x result is real, not a shape artifact. Plan §11.9 records the metric fix and two productionization caveats: (1) verify mutates shared Morpher selectors -> needs per-thread morpher/pool for parallel verify; (2) the ~13x is vs the unlimited-unapp oracle (the correct sound+complete baseline). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
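A small Python illustration of why the old signature hid divergences. The `Morpheme` class and both signature functions are hypothetical; the separator used by the original join is not stated in the commit, so a space is assumed here. When affix ids are empty, an id-based key keeps only the morpheme count and the root position, whereas an identity-based key separates same-shape variants.

```python
class Morpheme:
    def __init__(self, ident, gloss):
        self.ident = ident  # empty for affixes in the scenario described above
        self.gloss = gloss


def shape_signature(morphemes, root_index):
    # Old key: joined ids plus root index (separator assumed).
    return " ".join(m.ident for m in morphemes) + ":" + str(root_index)


def identity_signature(morphemes, root_index):
    # New key: object identity, valid because both analyzers share instances.
    return (tuple(id(m) for m in morphemes), root_index)


root = Morpheme("lemb", "write")
agr_a = Morpheme("", "3P+2")
agr_b = Morpheme("", "3S+1")

first = [agr_a, root]
second = [agr_b, root]

print(shape_signature(first, 1) == shape_signature(second, 1))        # collapsed
print(identity_signature(first, 1) == identity_signature(second, 1))  # distinct
```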
… derivation Records and implements the §12 completeness certificate: completeness is a property of the grammar's rule structure, certified once, not a per-word check. Two halves — closure (regular B-side / valid cut, via GrammarFstClosure) and coverage (the FST emits every B-side construct). Certified ⇒ FST-only is provably complete; else the engine backstop guarantees it and the certificate names the gaps. - FstCompletenessCertificate / FstCompletenessReport: closure + affix coverage (from the codec's covered-morpheme set) + compounding count + verdict + uncovered list. - FstTemplateAnalyzer: prefix-derivation layer (mirror of suffix; closed 4 of 7 prefix gaps, coverage 125->129/132), category reachability over prefix+suffix derivation, CoversAnalysis (sound structural predicate: single root, covered, depth<=2, canonical morph order), StateCount. - MorphTokenCodec: CoveredMorphemes / Covers accessors. - CompleteHybridMorpher: provably-complete analyzer (certified->FST, else->engine). PROOF (Prove_CertificateCompleteness, 200 hard Sena words, unlimited oracle): complete-system misses = 0 — returns every true analysis. Sena reports NOT certified (129/132 affix, 8 compounding), so 4 out-of-class words route to the engine; the FST's certified in-class set reaches 468 analyses. Key finding (the stress test working as intended): a predictive per-analysis predicate is whack-a-mole (broke on depth, then order, then a template-less prefix cawo). So soundness rests on the grammar-level certificate + engine backstop, not a predicate — 100% complete today, faster as coverage closes, never a silent miss. FST size: 50,673 states from 1,463 roots (additive, not multiplicative — §12.5). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…y is the gate Advisor review caught two real defects: 1. The "proof" was VACUOUS: CompleteHybridMorpher for uncertified Sena == search, so complete==search was search==search; the certified FST branch was never exercised. 2. The static certificate was UNSOUND: cawo (coisa+d'eles) has every morpheme covered yet the FST can't build it (prefix on a template-less pronoun root) — so filling the other gaps would flip IsCertified=true while still silently dropping cawo. Coverage is necessary, not sufficient; completeness is about paths (attachments), not symbols. Fix: certification is the EMPIRICAL set-parity gate (CertifyEmpirically: FST==search at morpheme-identity over a corpus, §9.5), which is path-level and catches cawo. The static coverage check is demoted to a fast pre-filter/gap-namer (PreFilterPasses, no longer a gate). CompleteHybridMorpher now takes the empirical-certified bool. Non-vacuous proof (200 hard Sena words): FST path tested DIRECTLY produces 467/480 search analyses (13 → engine); the static check is shown UNSOUND on cawo (in-class yet FST-missed); empirical gate refuses to certify Sena (5 divergent words); complete-system misses = 0. System is 100% complete today via the engine; path to FST-only is driving set-parity divergences to 0. Plan §12.6 corrected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Soundness_NegativeExamples: generate 50 plausible non-words by perturbing real words (over-prefix, over-suffix, prefix-swap, fake partial reduplication, fake compound), keep only TRUE negatives (search oracle = ∅), and require the verified FST to also return ∅. Result on Sena: 50 true negatives, 11 exercise the verify (raw FST over-proposes), 0 false positives — the verify (restricted re-analysis) rejects every non-word it is offered. Writes negatives to HC_NEG_OUT. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
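The negative-example procedure can be sketched as follows, in Python with a toy word list. The function names, the two-character reduplication window, and the sample affixes are all assumptions made for illustration; only the five perturbation kinds and the "keep true negatives, then require an empty verified result" logic come from the commit message.

```python
def perturbations(word, other, prefixes, suffixes):
    """Generate plausible non-words from a real word."""
    out = []
    for p in prefixes:
        out.append(p + word)                       # over-prefix
        for q in prefixes:
            if q != p and word.startswith(p):
                out.append(q + word[len(p):])      # prefix-swap
    for s in suffixes:
        out.append(word + s)                       # over-suffix
    out.append(word[:2] + word)                    # fake partial reduplication
    out.append(word + other)                       # fake compound
    return out


def true_negatives(candidates, oracle):
    """Keep only strings the search oracle analyzes as nothing."""
    return sorted({c for c in candidates if not oracle(c)})


def false_positives(negatives, verified):
    """Any negative the verified analyzer still accepts is a soundness bug."""
    return [w for w in negatives if verified(w)]


# Toy stand-ins: the oracle and the verified analyzer are set-membership tests.
LEXICON = {"kulemba", "mulemba"}
oracle = lambda w: [w] if w in LEXICON else []
verified = lambda w: [w] if w in LEXICON else []

negs = true_negatives(
    perturbations("kulemba", "mbale", ["ku", "mu"], ["a", "ni"]), oracle
)
print(len(negs), false_positives(negs, verified))
```

Note that the prefix-swap of `kulemba` produces `mulemba`, a real word in the toy lexicon, so the oracle filter drops it. That is the reason for keeping only true negatives before testing the verifier.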
…sound on corpus Add a template-less stem path (optional derivational prefixes + root + optional derivational suffixes, no inflectional template) for roots that derive/associate without inflecting — e.g. a pronoun taking an associative prefix (cawo = coisa+d'eles). Shared prefix/suffix derivation layers keep it additive (50,673 -> 59,022 states). Result (200 words): divergent words 5 -> 4 (cawo closed); FST-direct 467 -> 468/480; and static-unsound cases 1 -> 0 — the cawo analysis that broke CoversAnalysis is now produced by the FST, so the structural predicate is sound on this corpus. Negatives still 0 false positives (verify filters the added over-gen); verify 19.8 ms/word (~12x). Remaining 4 divergent (engine-handled, all under-gen): miwiri (depth-3 derivation), ndimwe/ndico (copula e+pronoun), kuumadi (derivational suffix outside inflection). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ead code Acting on the cross-review landing plan + three decisions (keep advisor; multithread; per-word FST opt-out). Multithreading (fixes the Critical thread-safety bug): - MorpherPool: thread-safe pool of Morphers. The verify pins a Morpher's selectors per candidate (mutable instance state), so a shared Morpher can't be used across threads. Each parse now rents/returns its own; Morphers are built once and reused. - FstTemplateAnalyzer (proposer) is immutable post-build → shared safely. - VerifiedFstAnalyzer + CompleteHybridMorpher now use the pool → corpus can be parsed in parallel. New Concurrent_MatchesSequential benchmark test: 0 mismatches. M2 fix: FstReplay.Confirm returns the MATCHED HC WordAnalysis (real category, genuine engine object) instead of bool; VerifiedFstAnalyzer yields it. Parity now sees category. Per-word opt-out: CompleteHybridMorpher.AnalyzeWord(word, bool useFst) forces FST on/off for one word; UseFstFor policy hook; plain AnalyzeWord uses the certified default. Cuts (dead/superseded/abandoned, ~62% reduction): FstMorpher, FstGenerator (defer to a generation PR), HybridMorpher, SoundHybridMorpher, MorphStateLayout, Interner, FstCompletenessCertificate, FstLatticeAnalyzer + their 5 tests. CoversAnalysis (per-stem proof, abandoned) removed from FstTemplateAnalyzer; codec Covers/CoveredMorphemes removed; CertifyEmpirically inlined into CompleteHybridMorpher. KEPT the advisor (GrammarFstAdvisor + GrammarFstClosure) per decision 1. Retargeted FstVerificationTests off FstMorpher; rewrote FstSenaBenchmark spine-only. Validated (Sena, 60 words): ~11.5x (220->19 ms/word), verified==search IDENTICAL (certified), 0 parallel mismatches, 0 false positives on 50 negatives. Unit suite 83 green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
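The pooling pattern behind the thread-safety fix can be sketched in Python. `Pool`, `ToyMorpher`, and `verify_all` are hypothetical names, and a blocking fixed-size pool is an assumption; the commit only states that each parse rents and returns its own Morpher and that Morphers are built once and reused. The sketch shows why exclusive rental matters when verification mutates instance state.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class Pool:
    """Fixed-size pool: each caller holds an instance exclusively."""

    def __init__(self, factory, size):
        self._items = queue.Queue()
        for _ in range(size):
            self._items.put(factory())

    @contextmanager
    def rent(self):
        item = self._items.get()      # blocks until an instance is free
        try:
            yield item
        finally:
            self._items.put(item)     # always return it, even on error


class ToyMorpher:
    """Stand-in for an analyzer whose verify step pins mutable selectors."""

    def __init__(self):
        self.pinned_root = None

    def verify(self, word, root):
        self.pinned_root = root       # mutable per-candidate state
        return word.startswith(self.pinned_root)


def verify_all(pool, pairs, workers=4):
    def one(pair):
        word, root = pair
        with pool.rent() as morpher:
            return morpher.verify(word, root)

    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(one, pairs))


PAIRS = [("lemba", "lemb"), ("lemba", "ikh"), ("ikhane", "ikh")] * 50
pool = Pool(ToyMorpher, size=4)
sequential = [ToyMorpher().verify(w, r) for w, r in PAIRS]
print(verify_all(pool, PAIRS) == sequential)
```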
VerifiedFstAnalyzerTests on the in-repo toy grammar (no external data) — closes the gap where the propose-and-verify spine was exercised only by the [Explicit] benchmark: - Verified_MatchesSearch_OnConcatenativeCorpus: set parity vs the engine. - Verified_RejectsNonWord_NoFalsePositive: soundness (no false positive). - Verified_YieldsGenuineEngineAnalyses_WithCategory: guards the M2 fix (yields HC analyses with category, not category-less FST candidates). - CompleteHybrid_PerWordOptOut_EngineMatchesSearch: per-word useFst on/off both correct. - Verified_ParallelMatchesSequential: thread-safety (pooled Morpher) in CI. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…VP header - Delete FST_PR_PLAN.md (scaffolding) and FST_STATE_PACKING.md (documented the cut MorphStateLayout). - Move HERMITCRAB_FST_PLAN.md and the advisor plan (fst.md → HERMITCRAB_FST_ADVISOR.md) under docs/. - Add a "Shipped MVP" header to the plan: the verify + engine-backstop + per-word opt-out design that landed, and an explicit note that the per-stem completeness proof (§11.5+/§12.3+) was explored then abandoned. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… cache) The FST fast path is sound but not guaranteed complete, so it only makes sense paired with the proven engine behind a cache. Adds the shipped front end: - AnalysisCache: thread-safe store of complete (engine) analyses per word; FastAnalysisResult carries IsComplete (cached-complete vs provisional-FST). - CachingMorphologicalAnalyzer: default AnalyzeWord = guaranteed complete (cached, or engine on miss then cached) — backwards-compatible; AnalyzeWordFast = cached-complete if warm else provisional FST (never blocks); Warm(corpus) fills the cache in parallel. - MorphemeRegistry + AnalysisCacheSerializer: persist the cache across sessions (fixed corpora) with a grammar-version guard that rejects a stale cache; confirmed non-words cached too. Correctness equals the engine (cache never invents/hides analyses); the FST removes cold-start latency; a warmed corpus resolves fast AND complete. Per-word fast/slow choice preserved. CI tests (toy grammar): default==engine + caches; fast provisional→complete after warm; warm fills corpus; persistence round-trips and version-guard rejects stale. Full unit suite 92 green. Plan §13 documents the design. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
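A compact Python sketch of the front-end contract described in this commit. The class and method names are hypothetical analogues of `CachingMorphologicalAnalyzer`, `AnalyzeWord`, `AnalyzeWordFast`, and `Warm`; warming is sequential here for simplicity, whereas the commit says the real one runs in parallel. It shows that the default path always reflects the engine, and that the fast path never calls the engine and reports whether its answer is complete.

```python
import threading


class CachingAnalyzer:
    def __init__(self, engine, fst):
        self._engine = engine          # slow, complete
        self._fst = fst                # fast, sound, may under-generate
        self._cache = {}
        self._lock = threading.Lock()
        self.engine_calls = 0

    def analyze_word(self, word):
        """Guaranteed complete: cached result, or engine on a miss."""
        with self._lock:
            if word in self._cache:
                return self._cache[word]
        result = frozenset(self._engine(word))
        with self._lock:
            self.engine_calls += 1
            # An empty set is cached too: a confirmed non-word.
            self._cache[word] = result
        return result

    def analyze_word_fast(self, word):
        """Never runs the engine. Returns (analyses, is_complete)."""
        with self._lock:
            hit = self._cache.get(word)
        if hit is not None:
            return hit, True
        return frozenset(self._fst(word)), False

    def warm(self, corpus):
        for word in corpus:
            self.analyze_word(word)


# Toy engine and FST: the FST misses one analysis of "kulemba".
ENGINE = {"kulemba": {"INF+write", "15+write"}, "mbale": {"brother"}}
FST = {"kulemba": {"INF+write"}, "mbale": {"brother"}}
analyzer = CachingAnalyzer(lambda w: ENGINE.get(w, set()),
                           lambda w: FST.get(w, set()))

print(analyzer.analyze_word_fast("kulemba"))  # provisional
analyzer.warm(["kulemba", "xyz"])
print(analyzer.analyze_word_fast("kulemba"))  # complete after warming
```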
…epth Answers "say a word is fully analyzed without full search": a grammar that is FST-closed (census) AND passes set-parity vs the engine is CERTIFIED — the FST is then proven complete for every word, so the default guaranteed-complete path runs FST-only with no search and AnalyzeWordFast reports IsComplete=true without warming. - CachingMorphologicalAnalyzer: grammarCertified flag; FromLanguage(corpus) computes it (closed && set-parity). Certified → FST-only (no engine, no cache); else engine/cache backstop. New GrammarCertified property. - FstTemplateAnalyzer: derivation depth is now a tunable ctor param (default 2) instead of a const — battle-ready knob to trade fast-path completeness vs verify cost per grammar. - CI test: certified grammar never runs the engine and its fast result is proven complete. - Benchmark: reports certified status; on the 60-word Sena corpus (certifies) the default complete path is FST-only at ~18 ms/word (~11x), no full search. Thread-safety reconfirmed: 0 parallel-vs-sequential mismatches on 200 words. Unit suite 93 green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
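The certification rule stated here (FST-closed per the census AND set-parity against the engine) reduces to a short check. This is a Python sketch with a hypothetical `certify` function, and analyses are represented as plain strings rather than morpheme-identity keys. Returning the divergent words alongside the verdict mirrors the way the benchmark names what keeps a corpus from certifying.

```python
def certify(corpus, fst, engine, fst_closed):
    """Return (certified, divergent_words).

    A grammar certifies only if the census says it is FST-closed and the
    FST's analysis set equals the engine's for every corpus word.
    """
    if not fst_closed:
        return False, []
    divergent = [w for w in corpus if set(fst(w)) != set(engine(w))]
    return not divergent, divergent


ENGINE_SETS = {"kulemba": {"INF+write"}, "miwiri": {"4+two"}}
FST_SETS = {"kulemba": {"INF+write"}, "miwiri": set()}  # one under-generation

engine_fn = lambda w: ENGINE_SETS.get(w, set())
fst_fn = lambda w: FST_SETS.get(w, set())

print(certify(["kulemba"], fst_fn, engine_fn, fst_closed=True))
print(certify(["kulemba", "miwiri"], fst_fn, engine_fn, fst_closed=True))
```

In this sketch, as in the description above, the verdict is only as strong as the corpus it was checked against, so a larger corpus can fail to certify where a smaller one passes.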
What
Accelerates HermitCrab morphological analysis with a precompiled FST, behind a caching front end
that keeps the engine as the source of truth. No second morphology engine, no reimplemented constraints.
Entry point: CachingMorphologicalAnalyzer (fast + slow + cache):
- AnalyzeWord = guaranteed complete (backwards-compatible). On a certified grammar (FST-closed per the census and FST==engine set-parity over a corpus) the FST is proven complete, so this runs FST-only with no full search. Otherwise it returns the cached engine result, or runs the engine on a miss and caches it. Either way: complete.
- AnalyzeWordFast = opt-in immediate. Cached-complete if warm (or if the grammar is certified), else a sound but possibly under-generating verified-FST result, flagged IsComplete=false. Never runs the engine.
- Warm(corpus) fills the cache in parallel; AnalysisCacheSerializer persists it across sessions (fixed corpora), keyed by MorphemeRegistry and guarded by a grammar-version string (stale cache rejected → re-warm). Confirmed non-words are cached too.
The FST pipeline behind the fast path: FstTemplateAnalyzer (proposer; immutable, shared; derivation depth tunable) → VerifiedFstAnalyzer (confirms each candidate by restricted re-analysis, FstReplay, against HC's own engine from a MorpherPool; emits the genuine HC analysis) → CompleteHybridMorpher (certified→FST / else→engine, with per-word AnalyzeWord(word, useFst)). GrammarFstAdvisor + GrammarFstClosure are the grammar census/linter (this PR's original core).
Guarantees
always complete. The fast path is sound (0 false positives on 50 generated non-words), a yes-only
detector for "is this a word" (can under-generate, even to zero on un-built constructs), which the
cache/certification corrects.
corpus that certifies, the default complete path runs at ~18 ms/word (~11×).
engine/cache, no silent miss).
mismatches on 200 words (and a CI test).
Tests
CI unit tests on the in-repo toy grammar cover the proposer, the verify chain, the caching/persistence
layer (default==engine, provisional→complete after warm, certified-skip, round-trip + version guard),
soundness/negatives, the category fix, per-word opt-out, and thread-safety; plus advisor/closure. An
[Explicit] benchmark measures speed/parity/soundness/concurrency/certification on an external grammar. Full unit suite green (93).
Honest limitations / out of scope
depth-3 derivation, one suffix-order case. They resolve via the engine/cache (no silent miss) and keep
the full corpus from certifying. Compounding is the highest-value next coverage build (an additive,
shared-root-chain design) and is scoped as a focused follow-on.
and abandoned; completeness is delivered by certification + cache + engine.
residual.
Design + research record: docs/HERMITCRAB_FST_PLAN.md (§13 = caching front end); advisor: docs/HERMITCRAB_FST_ADVISOR.md.
🤖 Generated with Claude Code