Narrow regex digit sequences to numeric-string instead of decimal-int-string when a leading zero is possible#5823
Conversation
…int-string` when a leading zero is possible
- `RegexGroupParser` previously narrowed any pure digit sequence (e.g. `\d+`, `[0-9]+`, `\d{2}`) to `decimal-int-string`, but those match leading-zero strings like "007" (and `-?\d+` matches "-0"), which stay strings as array keys and are therefore not canonical decimal integers.
- Track, while walking the regex group AST, whether a leading zero can be followed by more digits or whether a negative zero is possible (`RegexGroupWalkResult::isDecimalIntegerLeadingZeroSafe()`). A digit position is analysed for whether it can be "0", whether it is mandatory, and whether it may repeat; classes are analysed as a single character position via `getClassDecimalInfo()`.
- Only emit `AccessoryDecimalIntegerStringType` when the match is guaranteed canonical (single digit `\d`, leading non-zero digit class like `[1-9]\d*`, literal "0", decimal alternations like `(?:0|-?[1-9][0-9]*)`); otherwise emit `AccessoryNumericStringType`. This applies to both the subject base type and capturing-group types.
- Updated the existing regex shape tests whose `\d+`/`\d{n}`/`-?[0-9]+` patterns now correctly infer `numeric-string`.
staabm
left a comment
There was a problem hiding this comment.
the approach seems very complicated.
could it be simplified if we track whether within the current group we already saw a 0 (or a pattern which allows zero). as soon as we see another pattern which matches on 0 we should forget about decimal-int-string.
…ject The leading-zero detection was spread across six extra fields on RegexGroupWalkResult plus transient quantifier flags and two helper methods, which made the walk hard to follow. Move the whole "is this still a canonical decimal integer" automaton into RegexDecimalIntegerState: it tracks whether the leading digit can be zero and, as soon as another digit follows such a zero-able lead, drops the decimal-int-string narrowing in favour of numeric-string. The alternation merge becomes a single method on the state, and RegexGroupWalkResult holds one field instead of six. Dropping the redundant "seen digit" bookkeeping (a zero-able lead already implies a digit was seen) and folding the alternation case into the same rule shrinks the implementation by ~270 lines with identical behaviour. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Pushed. Here's a summary of how I addressed the review. Response to @staabmYou were right that the approach was too complicated. I've restructured it around exactly the model you described — track whether the leading digit can be zero, and as soon as another digit follows that zero-able lead, drop What changed
Net: +31 / −297 lines across the two source files (plus the focused new class). Correctness preserved
The safe-fallback invariant holds throughout: the simplification can only ever make a pattern more conservative ( |
Summary
PHPStan narrowed the type of a series of digits matched by a regex (e.g.
preg_match('/^\d+$/', $s, $m)) todecimal-int-string. That is incorrect: a digit sequence such as\d+/[0-9]+/\d{2}also matches strings with a leading zero like"007", and-?\d+matches"-0". Neither"007"nor"-0"is a canonical decimal integer string (both stay strings when used as an array key), so they are merelynumeric-string.This change keeps
decimal-int-stringonly when the match is guaranteed to be canonical and falls back tonumeric-stringotherwise.Changes
src/Type/Regex/RegexGroupWalkResult.phpdecimalLeadingResolved,decimalSeenDigit,decimalLeadCanBeZero,decimalBad) plus transient quantifier flags (decimalAtomRepeats,decimalAtomOptional).isDecimalIntegerLeadingZeroSafe(), which isfalsewhen a leading-zero-and-more (like"00") or a negative zero ("-0") is possible.src/Type/Regex/RegexGroupParser.phpmax >= 2/ unbounded) and whether it is optional.applyDecimalDigitPosition(). Character classes are analysed as a single position viagetClassDecimalInfo()/getClassBoundChar()(so[0-9]is one zero-able position and[1-9]is one non-zero position), instead of relying on the individual range-bound tokens.isDecimalInteger()holds, the result is nowdecimal-int-stringonly ifisDecimalIntegerLeadingZeroSafe(); otherwise it isnumeric-string. This is applied both to the subject base type and to capturing-group types (createGroupType).tests/PHPStan/Analyser/nsrt/preg_match_shapes.php,preg_match_all_shapes.php,preg_match_shapes_php80.php,preg_match_shapes_php82.php,bug-11293.php,bug-11311.php,bug-14177.php,bug-14750.php,bug-14784.phpwhere\d+-style patterns now correctly infernumeric-string.Root cause
A pure digit sequence is a canonical decimal integer only if it cannot have a leading zero in a multi-digit number, and is not a negative zero. The old code treated every all-digit (optionally minus-prefixed) match as
decimal-int-string, ignoring repetition and multi-digit length. The pattern is centralized inRegexGroupParser::walkGroupAst(), so the same flaw affected both the subject narrowing reported in the issue and every capturing group ((\d+),\d{1,6}, etc.). The fix tracks, per character position, whether the leading digit can be0, whether more digits can follow, and whether a sign is present, and only keepsdecimal-int-stringwhen none of those make the value non-canonical.Patterns that stay
decimal-int-string(verified): single digit\d/[0-9], literal0, leading non-zero class[1-9][0-9]*/[1-9]+,-?[1-9]\d*,[3-9]+, and decimal alternations like^(?:0|-?[1-9][0-9]*)$.Patterns now correctly
numeric-string:\d+,[0-9]+,\d{2},\d{1,6},\d\d,0\d+,-?\d+,-?\d.Probed analogous case
ctype_digit()has a theoretically identical issue (ctype_digit("007")istrue), but its narrowing todecimal-int-stringis intentionally relied upon bytests/PHPStan/Rules/Arrays/data/bug-14758.php(offset checks on string-keyed arrays), so it was left unchanged to keep this PR focused on the reported regex behavior.Test
tests/PHPStan/Analyser/nsrt/bug-14791.phpasserting that subject and capturing-group narrowing yieldsnumeric-stringfor leading-zero-possible patterns (\d+,[0-9]+,\d{2},\d{1,6},\d\d,0\d+,-?\d+,-?\d) and keepsdecimal-int-stringfor canonical patterns (\d,[0-9],0,[1-9][0-9]*,[1-9]+[0-9]*,-?[1-9]\d*,[1-5]).decimal-int-string) and passes after.make tests) and PHPStan self-analysis (make phpstan) are green.Fixes phpstan/phpstan#14791