NotificationsYou must be signed in to change notification settings
Fork380
Star4.6k

Matcher: pass _regionStart to RE2, pos 0 special#4199

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Open

kitbellew wants to merge3 commits intoscala-native:main

base:main

Choose a base branch

fromkitbellew:4198

Open

Matcher: pass _regionStart to RE2, pos 0 special#4199

kitbellew wants to merge3 commits intoscala-native:mainfromkitbellew:4198

Conversation

Copy link

Contributor

kitbellew commentedFeb 1, 2025

Instead of telling RE2 "machine" we start at pos > 0, simply modify the input slice and use position relative to region start.Fixes#4198.

kitbellew added2 commits

January 31, 2025 18:59

MatcherTest: test invalid handling of regions

4be70bf

Matcher: use _lastMatchXxx/_regionEnd, not _groups

cfeddd2

- instead of _hasMatch, use _lastMatchXXX as sufficient- in find(), also use _lastMatchEnd instead of _groups- instead of _inputLength, use _regionEnd for matching

kitbellew force-pushed the4198 branch frome0a9ec3 to3a0464dCompare

February 1, 2025 05:02

Matcher: pass _regionStart to RE2, pos 0 special

93bdbd2

Instead of telling RE2 "machine" we start at pos > 0, simply modify theinput slice and use position relative to region start.

kitbellew force-pushed the4198 branch from3a0464d to93bdbd2Compare

February 1, 2025 05:12

Copy link

ContributorAuthor

kitbellew commentedFeb 1, 2025

@WojciechMazur @LeeTibbert sorry for multiple attempts to run checks, couldn't figure out how to test locally. helps improve#2324.

WojciechMazur reviewed

Feb 1, 2025

View reviewed changes

Copy link

Contributor

WojciechMazur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM,@LeeTibbert, you're the most experience with RE2, what do you think about these changes?

Copy link

Contributor

LeeTibbert commentedFeb 1, 2025

kitbellew,

Thank you for the Issue and, especially, for this PR. It is encouraging to meet a first time
contributor to the Scala Native repository.

Please do not be discouraged or put off by the review process. I find it generally
results in a better product, where "better" means both correct and maintainable
over the long haul.

TheMatcherTest.scala has helped me understand the Issue and looks useful/essential
to an eventual fix.

Please be understanding of my limitations. I speak a number of variants of regex but
none of them are my native language.

Copy link

Contributor

LeeTibbert commentedFeb 1, 2025•
edited
Loading

I'm 6+ hours into looking at the Issue and this PR and have run out of time for today. Let
me report what I have found so far in my archeological studies:

"region" seems to be a javalib concept, not an RE2J or RE2 (C++) concept.
In a hasty search of both upstream repositories I found no Release Notes
or Issues reporting "region" Issues. I traced RE2j Matcher.java back
to the RE2j version, 1.3, which was used as the basis of the Scala Native port.
No code relating to "region" popped out at me.
"region" support appears to have been introduced in SN PRFix #2608, #1543: Correct/complete regex Matcher region support #2619, from June 2022.
That PR has my fingerprints all over it. Oops!
There is a comment in PRFix #2608, #1543: Correct/complete regex Matcher region support #2619Matcher.scala which provides context. I was reluctant to changeRE2.scala` then. I need to check that
reluctance against the needs of the presenting issue.

  // Design Note:                                                                 //   Implement region support in a way that it can not break                    //   prior art. Ugly and paranoid, but contains introduced defects.             //                                                                              //   To aid understanding of the description of "entire input"                  //   in matches(), lookingAt(), and find() below, mentally substitute           //   "entire input region".

My current thinking about my next steps:
- RunMatcherTest.scala of this PR against current GitHub tip and
  document the SN failure results in the Issue.
- Trace the Git tip Matcher.scala "entire input region" concept mentioned above.
- Compare & contrast to the changes toMatcher.scala of this PR.
- Looks like I'm not going to be hanging out in the pub anytime soon.

More as it develops (days).

Copy link

ContributorAuthor

kitbellew commentedFeb 1, 2025

I'm 6+ hours into looking at the Issue and this PR and have run out of time for today. Let me report what I have found so far in my archeological studies:
"region" seems to be a javalib concept, not an RE2J or RE2 (C++) concept.
In a hasty search of both upstream repositories I found no Release Notes
or Issues reporting "region" Issues. I traced RE2j Matcher.java back
to the RE2j version, 1.3, which was used as the basis of the Scala Native port.
No code relating to "region" popped out at me.
"region" support appears to have been introduced in SN PRFix #2608, #1543: Correct/complete regex Matcher region support #2619, from June 2022.
That PR has my fingerprints all over it. Oops!
There is a comment in PRFix #2608, #1543: Correct/complete regex Matcher region support #2619Matcher.scala which provides context. I was reluctant to changeRE2.scala` then. I need to check that
reluctance against the needs of the presenting issue.
  // Design Note:                                                                 //   Implement region support in a way that it can not break                    //   prior art. Ugly and paranoid, but contains introduced defects.             //                                                                              //   To aid understanding of the description of "entire input"                  //   in matches(), lookingAt(), and find() below, mentally substitute           //   "entire input region".
My current thinking about my next steps:
RunMatcherTest.scala of this PR against current GitHub tip and
document the SN failure results in the Issue.
Trace the Git tip Matcher.scala "entire input region" concept mentioned above.
Compare & contrast to the changes toMatcher.scala of this PR.
Looks like I'm not going to be hanging out in the pub anytime soon.
More as it develops (days).

Thank you for detailed, careful consideration.

Separately, I was thinking that we could find a suite of unit tests used for the original JVM version of matcher and see what this implementation (not just region) fails on.

Copy link

ContributorAuthor

kitbellew commentedFeb 1, 2025

I have another commit I prepared as an option, I like it a little better but didn't test it as much:

adb9a81

Copy link

Contributor

LeeTibbert commentedFeb 1, 2025

kitbellew

helps improve#2324.
Are you working on scalafmt? Is that where this particular regex popped up?
scalafmt was wicked fast when I had it running privately with Scala Native a few years ago.
It would be wonderful get scalafmt working on Scala Native again. Issue#2324 went
quiet and fell off the bottom of the priority list.

sorry for multiple attempts to run checks, couldn't figure out how to test locally

I believe that running locally means cloning the Scala Native repository, making the
changes, running sbt (I usually use '$ sbt "++3.6.3; shell"', then "sbt> tests3/testOnly *.FooTest"
or to run using JVM "sbt> testJVM3/testOnly *.FooTest"

I know of no easy way to restart Continuous Integration, but then again, that is not local.

Call out if I you can not get what you want/need in SN build done.

Copy link

ContributorAuthor

kitbellew commentedFeb 1, 2025

kitbellew
helps improve#2324.
Are you working on scalafmt? Is that where this particular regex popped up?
scalafmt was wicked fast when I had it running privately with Scala Native a few years ago.
It would be wonderful get scalafmt working on Scala Native again. Issue#2324 went
quiet and fell off the bottom of the priority list.

Yes, scala-native is close to being ready but I am struggling with speed. See comments inthis attempt to run scala-native on massive number of files. It's barely faster than JVM, if we exclude the compilation time.

Copy link

Contributor

LeeTibbert commentedFeb 1, 2025

Good idea but may be yak shaving. Depends on your time & interests.

Separately, I was thinking that we could find a suite of unit tests used for the original JVM version of matcher and see what this implementation (not just region) fails on.

As a Scala Native contributor, one must beexquisitely careful with license conditions and what you
look at, even on the Web. I not a license lawyer, here is formal documentation in the SN project docs which describe
what is usable. I do not remember where but could cite chapter & verse if that would give better guidelines.

Unfortunately, the SNContributor's Guide is not-up-to-date overall. There may be a license section there.

Incomplete, short story:

Apache code and those, such as RE2, mentioned in the SN project LICENSE.md (sp) are all OK. Other
liberally licensed Open Source software can be used proper attribution, usually a reference at the top of the
file to a full license text in LICENSE.md.
GPL & Oracle non-public domain code can not be read or included. I believe that the OpenJDK code
is carries a GPLmumble license.

A whack of RE2mumble tests where ported when RE2 Scala Native support what added. Those engine
tests were meant to be manual 'advanced developer only' tests, not CI. I suspect that through a series
of changes, they were swept up into CI and are run currently.

At the same time, I believe that some SN javalib tests were created. The 2022 PR I mentioned above
may/should have added some 'region' tests. Obviously they did not test the case of the current Issue.

Those SN javalib tests may or may not have Scala.js tests as ancestors. These days I usually
add the Scala.js commit ID and date at the top of files I~~filch~~ port, but the current
javalib Tests may have come from less experienced days.

Copy link

Contributor

LeeTibbert commentedFeb 1, 2025

I have another commit I prepared as an option, I like it a little better but didn't test it as much:adb9a81

A friend of mine who lives in New Hampshire (USA) often uses the local expression to express
the reason for (awkward) silence: Not saying, not knowing.

It takes time but I'd like to see (and capture) both the current SN failure and understand
both RE2 and javalibMatcher.scala possible, feasible, & infeasible issues.

I realize that you are trying to fix this real time and delay must be frustrating.

As part of my archeology, I noticed that RE2j is now at its version 1.8, of approximately
January 2025. It looks like there are some changes there (UTF-8 binary bytes?) since
the version 1.3 (approx) used by Scala Native. To pick up these changes would
mean re-porting or, probably extensive, edits to the files under discussion.

A fix to SN javalib code: Matcher.scala, etc would probably be a good outcome, because
it is isolated to SN. Such a fix may not be feasible.
A fix to RE2 Matcher.scala would probably be a less good outcome, but would also be
easier to carry forward if other parts of RE2 were re-ported.

Copy link

Contributor

LeeTibbert commentedFeb 1, 2025•
edited
Loading

Yes, scala-native is close to being ready

Both the current work and the closeness are good news.

but I am struggling with speed.

Ah the culprits caughtin flagrante delicito Good data, good catch.

I took a quick look at the scalameta url you provided. I'll certainly take two minutes wall-clock time off a
CI run.

I think that earlier reports from scalafmt about speed/correctness issues withWalkFileTree (?)
and large number of files have lead directly to improvements in Scala Native.
IIRC, a lot of that WFT code was written for confidence correctness and not benchmarked
or profiled for speed.

I suggest that you create a new SN Issue with that URL and we use this thread to fix
the Issue of this PR.

Looks like there, viewed from SN, are two separate concerns in that new Issue.

Build speed. I did not look at compilation flags, modes (Release-fast, etc).
Test execution speed. A focused effort to identify time consuming spots
might be useful to both projects. The first place that I would probably
look at is the regex expressions used. Once measured, it may be possible
to both use regexen which are correct but less costly and to make the
chosen regexen more efficient.
Unfortunately, these are both distractions from the current defect. Understood
and appreciated that hope of improvement there might provide better motivation
for a good fix to the current issue.

Copy link

ContributorAuthor

kitbellew commentedFeb 1, 2025

It takes time but I'd like to see (and capture) both the current SN failure and understand both RE2 and javalibMatcher.scala possible, feasible, & infeasible issues.

Regarding the "see and capture", I just want to make sure you are aware: my commits should be reviewed one at a time; there are three, and each does something separate, including the first one whichshows the failure and the last one which shows how it is fixed. If you ask me to fix something, I will not just add another commit on top; I will, if necessary, amend the existing ones.

Copy link

ContributorAuthor

kitbellew commentedFeb 1, 2025

I took a quick look at the scalameta url you provided. I'll certainly take two minutes wall-clock time off a CI run.

For this case, native is slower overall ; build time is 8 minutes and run time is another 8 (16 altogether); whereas for JVM build time is 1.5 minutes and run time about 10 (so 11.5 minutes together).

If you mean for people using a pre-built binary, then native might get a 20% speed up; it's good but I wouldn't call it blazing fast, especially since the formatting step for most builds I have seen is a separate task and it usually finishes in a fraction of the time anything else does.

This was referencedFeb 2, 2025

regex Matcher: using region might fail to match at start of said region#4198

Open

Matcher.region() should clear lastmatch#4207

Open

Copy link

Contributor

LeeTibbert commentedFeb 5, 2025•
edited
Loading

kitbellew@WojciechMazur

I'm 4+ days into studying this PR, an alternate commit in this PR, the existing SN javalib and RE2 code,
various release levels of the upstream RE2j code and the upstream RE2 C code.

Let me float an approach for your consideration.

Phase 1

Create a draft PR which applies the "region offset" bookkeeping approach of this PR
to javalib Matcher.scala.
- The javalib region(start, end) and two reset() methods keep track of the region
  and a "currentlyActiveInput" which is, if no region, the original input. When
  a region is established either the original input is restored or else a sub-string is taken¹.
  This bookeeping might be encapsulated into a private class.
  find(), lookingAt, and matches() pass the curentlyActiveInput. I believe that
  find(index) resets() the region to [0, originalEnd).
  Working all this out & exercising this is the reason for a draft PR.
- The current RE2 region code introduced in PRFix #2608, #1543: Correct/complete regex Matcher region support #2619 would never get
  called by javalib Matcher and would become dead code.
- The MatcherTest code of this PR would get moved to
  unit-tests/native/src/test/scala/scala/scalanative/regex/MatcherTest.scala
  This would allow testing on both SN & JVM, making it easier to find
  and document the perennial question: What does JVM do?

Phase 2

If Phase 2 above proves out, then in a later submission, PR#2619 could be
manually reverted. I say manually because I think there is one small
but useful/necessaryj.l.StringBuilder commit which should be preserved.

Discussion

My main concerns are one of grand tactics, not of point edits. They are
of Software Engineering, and especially of medium term (decade) long
maintainability.

From what I have been able to tell this time around, RE2j has no concept of
"region", I am not sure it tries to be a "drop in" replacement for JDK. The
proposal above implements the Java concept of "region" in the Java domain
(javalib code) and tests it there; easy SN/JDK A/B comparisons.
Better separation of concerns.
The current SN RE2 code was last aligned to RE2j version 1.3. There may have
been a RE2j commit after that rolled into SN. RE2j is now at v1.8. Most of the
commits are not relevant to SN, but a few look like they would either repair
defects in the code upon which SN RE2 is based or make that code more efficient.
I believe there are some character handling changes which look intriguing.
Keeping the underlying SN RE2 as close to feasible to the RE2J 1.3 base would
make porting RE2J 1.8 changes economically_possible and/or easier.
Yes, the consideration is benefit of fixing a defect at hand balanced against the
possibility of a future port which may never happen. Roll the dice.
Keeping the two code bases reasonably aligned will also make it easier for
future maintainers to trace the code and rule out where a defect might exist.
Instead of "What would JVM do?" one can ask "What does RE2j do?".
Of course, my present concern with "minimal changes" goes against my
approach in PRFix #2608, #1543: Correct/complete regex Matcher region support #2619.

Footnotes

Yes, creating sub-strings has a runtime cost but that cost is
only for regexen which either use new regions heavily or which
use large regions.
I considered but rejected as too complex, a hybrid approach, where
the existing RE2 code is used for regex patterns which do not either
start with '^' or end with "$" and create a sub-string for only for
patterns which use those anchors.
The underlying problem with SN regex support across the board
is that we have no good way to profile and/or benchmark it.
I suspect that the Character lookup handling code isway more
expensive that the allocation & copy cost considered in the base
proposal. Then again, I suspected that I was going to win the'
last large lottery; did not happen.↩

Copy link

Contributor

LeeTibbert commentedFeb 5, 2025

@kitbellew

If you are open to the approach I described and to creating a separate draft PR based on it, let me know
if I can help.

If you do not like the approach or have run out of time, I can do the work described. In my
earliest engineering days I was taught early on: Those who propose, dispose. This keep
some some balance in the design & implementation processes.

My intent is to do collaborative development & iterative improvement of Scala Native. I hope
my proposal does not put you off or discourage you. Projects, especially Open Source, live or die by their
ability to attract & retain talented developers.

Copy link

ContributorAuthor

kitbellew commentedJun 9, 2025

Forgot about this, my apologies,@LeeTibbert .

kitbellew@WojciechMazur
I'm 4+ days into studying this PR, an alternate commit in this PR, the existing SN javalib and RE2 code, various release levels of the upstream RE2j code and the upstream RE2 C code.
...
Footnotes
Yes, creating sub-strings has a runtime cost but that cost is
only for regexen which either use new regions heavily or which
use large regions.

I would call it a deal-breaker. regex libraries are not expected to duplicate any input, which is why there are so many methods returning indices to input allowing the user to extract matches without copying. If you really wanted to go this route, you can wrap the existing input (CharSequence) in a java.nio.CharBuffer (also a CharSequence).

However, it would appear and in order to support transparent or anchoring bounds, the underlying (RE2) implementation would have to know something these positions, so pretending region covers the entire original input would be misleading.

I understand your comment about not deviating from RE2 too much, but I would say the proposed change is minor. As I mentioned earlier, I would consider choosing between this implementation and the one withone extra commit.

Copy link

Contributor

LeeTibbert commentedJun 10, 2025

Kitbellew

Thank you for reviving this discussion. I have been looking at this on my work list and it has been falling
further & further back on my list.

I'll have to study your observation "regex libraries are not expected to duplicate any input," and
the " java.nio.CharBuffer" (lesser) alternative.

I am pretty certain I will not be able to do that before the snow flies. I do not want to stand in the
way of progress.

Is this holding up any of yourscalafmt or other work?

I do not mean to malign the current RE2 implementation. It has served well for a number of years after
being forced fit into Scala Native.

For awhile now I have been getting the feeling that it might be worth spending at least a few hours of looking
around the Web to see if anything more recent has emerged. Not recency for recency's sake alone, but
a better match. We/SN seems to be living without the Java backtracing features.

Fear and horror are the wrong words but the right direction. I live in dread of the day when I or someone
else takes a close look at the SN RE2 implementation in terms of memory used (and orphened) and runtime cost.

The wonder is not that the dog sings well, it is that the dog sings at all, and for so long.

Labels

None yet

Movatterモバイル変換

Matcher: pass _regionStart to RE2, pos 0 special#4199

Are you sure you want to change the base?

Matcher: pass _regionStart to RE2, pos 0 special#4199

Uh oh!

Conversation

kitbellew commentedFeb 1, 2025

Uh oh!

kitbellew commentedFeb 1, 2025

Uh oh!

WojciechMazur left a comment

Choose a reason for hiding this comment

Uh oh!

LeeTibbert commentedFeb 1, 2025

Uh oh!

LeeTibbert commentedFeb 1, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

kitbellew commentedFeb 1, 2025

Uh oh!

kitbellew commentedFeb 1, 2025

Uh oh!

LeeTibbert commentedFeb 1, 2025

Uh oh!

kitbellew commentedFeb 1, 2025

Uh oh!

LeeTibbert commentedFeb 1, 2025

Uh oh!

LeeTibbert commentedFeb 1, 2025

Uh oh!

LeeTibbert commentedFeb 1, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

kitbellew commentedFeb 1, 2025

Uh oh!

kitbellew commentedFeb 1, 2025

Uh oh!

LeeTibbert commentedFeb 5, 2025• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Phase 1

Phase 2

Discussion

Footnotes

Uh oh!

LeeTibbert commentedFeb 5, 2025

Uh oh!

kitbellew commentedJun 9, 2025

Footnotes

Uh oh!

LeeTibbert commentedJun 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

LeeTibbert commentedFeb 1, 2025•
edited
Loading

LeeTibbert commentedFeb 1, 2025•
edited
Loading

LeeTibbert commentedFeb 5, 2025•
edited
Loading