Skip to content

Narrow regex digit sequences to numeric-string instead of decimal-int-string when a leading zero is possible#5823

Closed
phpstan-bot wants to merge 2 commits into
phpstan:2.2.xfrom
phpstan-bot:create-pull-request/patch-qi1ainu
Closed

Narrow regex digit sequences to numeric-string instead of decimal-int-string when a leading zero is possible#5823
phpstan-bot wants to merge 2 commits into
phpstan:2.2.xfrom
phpstan-bot:create-pull-request/patch-qi1ainu

Conversation

@phpstan-bot
Copy link
Copy Markdown
Collaborator

Summary

PHPStan narrowed the type of a series of digits matched by a regex (e.g. preg_match('/^\d+$/', $s, $m)) to decimal-int-string. That is incorrect: a digit sequence such as \d+ / [0-9]+ / \d{2} also matches strings with a leading zero like "007", and -?\d+ matches "-0". Neither "007" nor "-0" is a canonical decimal integer string (both stay strings when used as an array key), so they are merely numeric-string.

This change keeps decimal-int-string only when the match is guaranteed to be canonical and falls back to numeric-string otherwise.

Changes

  • src/Type/Regex/RegexGroupWalkResult.php
    • Added leading-zero tracking fields (decimalLeadingResolved, decimalSeenDigit, decimalLeadCanBeZero, decimalBad) plus transient quantifier flags (decimalAtomRepeats, decimalAtomOptional).
    • Added isDecimalIntegerLeadingZeroSafe(), which is false when a leading-zero-and-more (like "00") or a negative zero ("-0") is possible.
  • src/Type/Regex/RegexGroupParser.php
    • The quantification walk now records whether the quantified atom may repeat (max >= 2 / unbounded) and whether it is optional.
    • Digit positions are tracked through the concatenation/quantification/class/alternation walk via applyDecimalDigitPosition(). Character classes are analysed as a single position via getClassDecimalInfo() / getClassBoundChar() (so [0-9] is one zero-able position and [1-9] is one non-zero position), instead of relying on the individual range-bound tokens.
    • Alternation merges the leading-zero state across branches (a branch that is internally non-canonical, or a zero-able lead followed by more digits, makes the whole result non-canonical).
    • When isDecimalInteger() holds, the result is now decimal-int-string only if isDecimalIntegerLeadingZeroSafe(); otherwise it is numeric-string. This is applied both to the subject base type and to capturing-group types (createGroupType).
  • Updated existing regex-shape expectations in tests/PHPStan/Analyser/nsrt/preg_match_shapes.php, preg_match_all_shapes.php, preg_match_shapes_php80.php, preg_match_shapes_php82.php, bug-11293.php, bug-11311.php, bug-14177.php, bug-14750.php, bug-14784.php where \d+-style patterns now correctly infer numeric-string.

Root cause

A pure digit sequence is a canonical decimal integer only if it cannot have a leading zero in a multi-digit number, and is not a negative zero. The old code treated every all-digit (optionally minus-prefixed) match as decimal-int-string, ignoring repetition and multi-digit length. The pattern is centralized in RegexGroupParser::walkGroupAst(), so the same flaw affected both the subject narrowing reported in the issue and every capturing group ((\d+), \d{1,6}, etc.). The fix tracks, per character position, whether the leading digit can be 0, whether more digits can follow, and whether a sign is present, and only keeps decimal-int-string when none of those make the value non-canonical.

Patterns that stay decimal-int-string (verified): single digit \d / [0-9], literal 0, leading non-zero class [1-9][0-9]* / [1-9]+, -?[1-9]\d*, [3-9]+, and decimal alternations like ^(?:0|-?[1-9][0-9]*)$.
Patterns now correctly numeric-string: \d+, [0-9]+, \d{2}, \d{1,6}, \d\d, 0\d+, -?\d+, -?\d.

Probed analogous case

ctype_digit() has a theoretically identical issue (ctype_digit("007") is true), but its narrowing to decimal-int-string is intentionally relied upon by tests/PHPStan/Rules/Arrays/data/bug-14758.php (offset checks on string-keyed arrays), so it was left unchanged to keep this PR focused on the reported regex behavior.

Test

  • Added tests/PHPStan/Analyser/nsrt/bug-14791.php asserting that subject and capturing-group narrowing yields numeric-string for leading-zero-possible patterns (\d+, [0-9]+, \d{2}, \d{1,6}, \d\d, 0\d+, -?\d+, -?\d) and keeps decimal-int-string for canonical patterns (\d, [0-9], 0, [1-9][0-9]*, [1-9]+[0-9]*, -?[1-9]\d*, [1-5]).
  • Confirmed the new test fails before the fix (it inferred decimal-int-string) and passes after.
  • Full test suite (make tests) and PHPStan self-analysis (make phpstan) are green.

Fixes phpstan/phpstan#14791

…int-string` when a leading zero is possible

- `RegexGroupParser` previously narrowed any pure digit sequence (e.g. `\d+`, `[0-9]+`, `\d{2}`) to `decimal-int-string`, but those match leading-zero strings like "007" (and `-?\d+` matches "-0"), which stay strings as array keys and are therefore not canonical decimal integers.
- Track, while walking the regex group AST, whether a leading zero can be followed by more digits or whether a negative zero is possible (`RegexGroupWalkResult::isDecimalIntegerLeadingZeroSafe()`). A digit position is analysed for whether it can be "0", whether it is mandatory, and whether it may repeat; classes are analysed as a single character position via `getClassDecimalInfo()`.
- Only emit `AccessoryDecimalIntegerStringType` when the match is guaranteed canonical (single digit `\d`, leading non-zero digit class like `[1-9]\d*`, literal "0", decimal alternations like `(?:0|-?[1-9][0-9]*)`); otherwise emit `AccessoryNumericStringType`. This applies to both the subject base type and capturing-group types.
- Updated the existing regex shape tests whose `\d+`/`\d{n}`/`-?[0-9]+` patterns now correctly infer `numeric-string`.
Copy link
Copy Markdown
Contributor

@staabm staabm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the approach seems very complicated.

could it be simplified if we track whether within the current group we already saw a 0 (or a pattern which allows zero). as soon as we see another pattern which matches on 0 we should forget about decimal-int-string.

…ject

The leading-zero detection was spread across six extra fields on
RegexGroupWalkResult plus transient quantifier flags and two helper
methods, which made the walk hard to follow.

Move the whole "is this still a canonical decimal integer" automaton into
RegexDecimalIntegerState: it tracks whether the leading digit can be zero
and, as soon as another digit follows such a zero-able lead, drops the
decimal-int-string narrowing in favour of numeric-string. The alternation
merge becomes a single method on the state, and RegexGroupWalkResult holds
one field instead of six.

Dropping the redundant "seen digit" bookkeeping (a zero-able lead already
implies a digit was seen) and folding the alternation case into the same
rule shrinks the implementation by ~270 lines with identical behaviour.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@phpstan-bot
Copy link
Copy Markdown
Collaborator Author

Pushed. Here's a summary of how I addressed the review.

Response to @staabm

You were right that the approach was too complicated. I've restructured it around exactly the model you described — track whether the leading digit can be zero, and as soon as another digit follows that zero-able lead, drop decimal-int-string.

What changed

  • New RegexDecimalIntegerState value object now owns the whole "is this still a canonical decimal integer" automaton. The core rule is one method (withDigit): once a zero-able lead exists, any further digit makes the match non-canonical → numeric-string.
  • RegexGroupWalkResult went from 12 constructor params back to 6 — it holds a single decimalState field instead of six scattered booleans (decimalLeadingResolved, decimalSeenDigit, decimalLeadCanBeZero, decimalBad, decimalAtomRepeats, decimalAtomOptional) each with its own wither.
  • RegexGroupParser lost the two helper methods (consumeDecimalAtomQuantification, applyDecimalDigitPosition); the walk now just calls named transitions on the state.
  • Removed the redundant seenDigit bookkeeping: a zero-able lead already implies a digit was seen, so the first-digit guard collapses to leadCanBeZero. The alternation case folds into the same single rule (a zero-able lead before an alternation that contributes digits is non-canonical; if the branches contain non-digits the match isn't a decimal integer anyway, so it stays safe).

Net: +31 / −297 lines across the two source files (plus the focused new class).

Correctness preserved

  • Full NodeScopeResolverTest (1639 assertions), regex unit + rule tests, and make phpstan are all green — no test-expectation changes were needed, so behaviour is identical.
  • Added two cases to bug-14791.php that specifically pin the simplified paths: 0(?:1|2) (zero-able lead + alternation → numeric-string) and (?:\d)+ (quantified group forwarding → numeric-string). Canonical patterns like (?:0|[1-9][0-9]*) and -?[1-9]\d* still narrow to decimal-int-string.

The safe-fallback invariant holds throughout: the simplification can only ever make a pattern more conservative (numeric-string), never wrongly emit decimal-int-string.

@staabm staabm closed this Jun 7, 2026
@staabm staabm deleted the create-pull-request/patch-qi1ainu branch June 7, 2026 17:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

regexp: incorrect narrowing to decimal-int-string for series of digits beginning with 0

2 participants