Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Matcher: pass _regionStart to RE2, pos 0 special#4199

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Open
kitbellew wants to merge3 commits intoscala-native:main
base:main
Choose a base branch
Loading
fromkitbellew:4198

Conversation

@kitbellew
Copy link
Contributor

Instead of telling RE2 "machine" we start at pos > 0, simply modify the input slice and use position relative to region start.Fixes#4198.

He-Pin reacted with thumbs up emoji
- instead of _hasMatch, use _lastMatchXXX as sufficient- in find(), also use _lastMatchEnd instead of _groups- instead of _inputLength, use _regionEnd for matching
Instead of telling RE2 "machine" we start at pos > 0, simply modify theinput slice and use position relative to region start.
@kitbellew
Copy link
ContributorAuthor

@WojciechMazur@LeeTibbert sorry for multiple attempts to run checks, couldn't figure out how to test locally. helps improve#2324.

Copy link
Contributor

@WojciechMazurWojciechMazur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM,@LeeTibbert, you're the most experience with RE2, what do you think about these changes?

@LeeTibbert
Copy link
Contributor

kitbellew,

Thank you for the Issue and, especially, for this PR. It is encouraging to meet a first time
contributor to the Scala Native repository.

Please do not be discouraged or put off by the review process. I find it generally
results in a better product, where "better" means both correct and maintainable
over the long haul.

TheMatcherTest.scala has helped me understand the Issue and looks useful/essential
to an eventual fix.

Please be understanding of my limitations. I speak a number of variants of regex but
none of them are my native language.

@LeeTibbert
Copy link
Contributor

LeeTibbert commentedFeb 1, 2025
edited
Loading

I'm 6+ hours into looking at the Issue and this PR and have run out of time for today. Let
me report what I have found so far in my archeological studies:

  • "region" seems to be a javalib concept, not an RE2J or RE2 (C++) concept.
    In a hasty search of both upstream repositories I found no Release Notes
    or Issues reporting "region" Issues. I traced RE2j Matcher.java back
    to the RE2j version, 1.3, which was used as the basis of the Scala Native port.
    No code relating to "region" popped out at me.

  • "region" support appears to have been introduced in SN PRFix #2608, #1543: Correct/complete regex Matcher region support #2619, from June 2022.
    That PR has my fingerprints all over it. Oops!

  • There is a comment in PRFix #2608, #1543: Correct/complete regex Matcher region support #2619Matcher.scala which provides context. I was reluctant to changeRE2.scala` then. I need to check that
    reluctance against the needs of the presenting issue.

  // Design Note:                                                                 //   Implement region support in a way that it can not break                    //   prior art. Ugly and paranoid, but contains introduced defects.             //                                                                              //   To aid understanding of the description of "entire input"                  //   in matches(), lookingAt(), and find() below, mentally substitute           //   "entire input region".
  • My current thinking about my next steps:

    • RunMatcherTest.scala of this PR against current GitHub tip and
      document the SN failure results in the Issue.

    • Trace the Git tip Matcher.scala "entire input region" concept mentioned above.

    • Compare & contrast to the changes toMatcher.scala of this PR.

    • Looks like I'm not going to be hanging out in the pub anytime soon.

More as it develops (days).

@kitbellew
Copy link
ContributorAuthor

I'm 6+ hours into looking at the Issue and this PR and have run out of time for today. Let me report what I have found so far in my archeological studies:

  • "region" seems to be a javalib concept, not an RE2J or RE2 (C++) concept.
    In a hasty search of both upstream repositories I found no Release Notes
    or Issues reporting "region" Issues. I traced RE2j Matcher.java back
    to the RE2j version, 1.3, which was used as the basis of the Scala Native port.
    No code relating to "region" popped out at me.
  • "region" support appears to have been introduced in SN PRFix #2608, #1543: Correct/complete regex Matcher region support #2619, from June 2022.
    That PR has my fingerprints all over it. Oops!
  • There is a comment in PRFix #2608, #1543: Correct/complete regex Matcher region support #2619Matcher.scala which provides context. I was reluctant to changeRE2.scala` then. I need to check that
    reluctance against the needs of the presenting issue.
  // Design Note:                                                                 //   Implement region support in a way that it can not break                    //   prior art. Ugly and paranoid, but contains introduced defects.             //                                                                              //   To aid understanding of the description of "entire input"                  //   in matches(), lookingAt(), and find() below, mentally substitute           //   "entire input region".
  • My current thinking about my next steps:

    • RunMatcherTest.scala of this PR against current GitHub tip and
      document the SN failure results in the Issue.
    • Trace the Git tip Matcher.scala "entire input region" concept mentioned above.
    • Compare & contrast to the changes toMatcher.scala of this PR.
    • Looks like I'm not going to be hanging out in the pub anytime soon.

More as it develops (days).

Thank you for detailed, careful consideration.

Separately, I was thinking that we could find a suite of unit tests used for the original JVM version of matcher and see what this implementation (not just region) fails on.

@kitbellew
Copy link
ContributorAuthor

I have another commit I prepared as an option, I like it a little better but didn't test it as much:

adb9a81

@LeeTibbert
Copy link
Contributor

kitbellew

helps improve#2324.
Are you working on scalafmt? Is that where this particular regex popped up?
scalafmt was wicked fast when I had it running privately with Scala Native a few years ago.
It would be wonderful get scalafmt working on Scala Native again. Issue#2324 went
quiet and fell off the bottom of the priority list.

sorry for multiple attempts to run checks, couldn't figure out how to test locally

I believe that running locally means cloning the Scala Native repository, making the
changes, running sbt (I usually use '$ sbt "++3.6.3; shell"', then "sbt> tests3/testOnly *.FooTest"
or to run using JVM "sbt> testJVM3/testOnly *.FooTest"

I know of no easy way to restart Continuous Integration, but then again, that is not local.

Call out if I you can not get what you want/need in SN build done.

@kitbellew
Copy link
ContributorAuthor

kitbellew

helps improve#2324.
Are you working on scalafmt? Is that where this particular regex popped up?
scalafmt was wicked fast when I had it running privately with Scala Native a few years ago.
It would be wonderful get scalafmt working on Scala Native again. Issue#2324 went
quiet and fell off the bottom of the priority list.

Yes, scala-native is close to being ready but I am struggling with speed. See comments inthis attempt to run scala-native on massive number of files. It's barely faster than JVM, if we exclude the compilation time.

@LeeTibbert
Copy link
Contributor

Good idea but may be yak shaving. Depends on your time & interests.

Separately, I was thinking that we could find a suite of unit tests used for the original JVM version of matcher and see what this implementation (not just region) fails on.

As a Scala Native contributor, one must beexquisitely careful with license conditions and what you
look at, even on the Web. I not a license lawyer, here is formal documentation in the SN project docs which describe
what is usable. I do not remember where but could cite chapter & verse if that would give better guidelines.

Unfortunately, the SNContributor's Guide is not-up-to-date overall. There may be a license section there.

Incomplete, short story:

  • Apache code and those, such as RE2, mentioned in the SN project LICENSE.md (sp) are all OK. Other
    liberally licensed Open Source software can be used proper attribution, usually a reference at the top of the
    file to a full license text in LICENSE.md.

  • GPL & Oracle non-public domain code can not be read or included. I believe that the OpenJDK code
    is carries a GPLmumble license.

A whack of RE2mumble tests where ported when RE2 Scala Native support what added. Those engine
tests were meant to be manual 'advanced developer only' tests, not CI. I suspect that through a series
of changes, they were swept up into CI and are run currently.

At the same time, I believe that some SN javalib tests were created. The 2022 PR I mentioned above
may/should have added some 'region' tests. Obviously they did not test the case of the current Issue.

Those SN javalib tests may or may not have Scala.js tests as ancestors. These days I usually
add the Scala.js commit ID and date at the top of files Ifilch port, but the current
javalib Tests may have come from less experienced days.

@LeeTibbert
Copy link
Contributor

I have another commit I prepared as an option, I like it a little better but didn't test it as much:adb9a81

A friend of mine who lives in New Hampshire (USA) often uses the local expression to express
the reason for (awkward) silence: Not saying, not knowing.

It takes time but I'd like to see (and capture) both the current SN failure and understand
both RE2 and javalibMatcher.scala possible, feasible, & infeasible issues.

I realize that you are trying to fix this real time and delay must be frustrating.

As part of my archeology, I noticed that RE2j is now at its version 1.8, of approximately
January 2025. It looks like there are some changes there (UTF-8 binary bytes?) since
the version 1.3 (approx) used by Scala Native. To pick up these changes would
mean re-porting or, probably extensive, edits to the files under discussion.

  • A fix to SN javalib code: Matcher.scala, etc would probably be a good outcome, because
    it is isolated to SN. Such a fix may not be feasible.

  • A fix to RE2 Matcher.scala would probably be a less good outcome, but would also be
    easier to carry forward if other parts of RE2 were re-ported.

@LeeTibbert
Copy link
Contributor

LeeTibbert commentedFeb 1, 2025
edited
Loading

Yes, scala-native is close to being ready

Both the current work and the closeness are good news.

but I am struggling with speed.

Ah the culprits caughtin flagrante delicito Good data, good catch.

I took a quick look at the scalameta url you provided. I'll certainly take two minutes wall-clock time off a
CI run.

I think that earlier reports from scalafmt about speed/correctness issues withWalkFileTree (?)
and large number of files have lead directly to improvements in Scala Native.
IIRC, a lot of that WFT code was written for confidence correctness and not benchmarked
or profiled for speed.

I suggest that you create a new SN Issue with that URL and we use this thread to fix
the Issue of this PR.

Looks like there, viewed from SN, are two separate concerns in that new Issue.

  • Build speed. I did not look at compilation flags, modes (Release-fast, etc).

  • Test execution speed. A focused effort to identify time consuming spots
    might be useful to both projects. The first place that I would probably
    look at is the regex expressions used. Once measured, it may be possible
    to both use regexen which are correct but less costly and to make the
    chosen regexen more efficient.

  • Unfortunately, these are both distractions from the current defect. Understood
    and appreciated that hope of improvement there might provide better motivation
    for a good fix to the current issue.

@kitbellew
Copy link
ContributorAuthor

It takes time but I'd like to see (and capture) both the current SN failure and understand both RE2 and javalibMatcher.scala possible, feasible, & infeasible issues.

Regarding the "see and capture", I just want to make sure you are aware: my commits should be reviewed one at a time; there are three, and each does something separate, including the first one whichshows the failure and the last one which shows how it is fixed. If you ask me to fix something, I will not just add another commit on top; I will, if necessary, amend the existing ones.

@kitbellew
Copy link
ContributorAuthor

I took a quick look at the scalameta url you provided. I'll certainly take two minutes wall-clock time off a CI run.

For this case, native is slower overall ; build time is 8 minutes and run time is another 8 (16 altogether); whereas for JVM build time is 1.5 minutes and run time about 10 (so 11.5 minutes together).

If you mean for people using a pre-built binary, then native might get a 20% speed up; it's good but I wouldn't call it blazing fast, especially since the formatting step for most builds I have seen is a separate task and it usually finishes in a fraction of the time anything else does.

@LeeTibbert
Copy link
Contributor

LeeTibbert commentedFeb 5, 2025
edited
Loading

kitbellew@WojciechMazur

I'm 4+ days into studying this PR, an alternate commit in this PR, the existing SN javalib and RE2 code,
various release levels of the upstream RE2j code and the upstream RE2 C code.

Let me float an approach for your consideration.

Phase 1
  1. Create a draft PR which applies the "region offset" bookkeeping approach of this PR
    to javalib Matcher.scala.

    • The javalib region(start, end) and two reset() methods keep track of the region
      and a "currentlyActiveInput" which is, if no region, the original input. When
      a region is established either the original input is restored or else a sub-string is taken1.

      This bookeeping might be encapsulated into a private class.

      find(), lookingAt, and matches() pass the curentlyActiveInput. I believe that
      find(index) resets() the region to [0, originalEnd).

      Working all this out & exercising this is the reason for a draft PR.

    • The current RE2 region code introduced in PRFix #2608, #1543: Correct/complete regex Matcher region support #2619 would never get
      called by javalib Matcher and would become dead code.

    • The MatcherTest code of this PR would get moved to
      unit-tests/native/src/test/scala/scala/scalanative/regex/MatcherTest.scala
      This would allow testing on both SN & JVM, making it easier to find
      and document the perennial question: What does JVM do?

Phase 2

If Phase 2 above proves out, then in a later submission, PR#2619 could be
manually reverted. I say manually because I think there is one small
but useful/necessaryj.l.StringBuilder commit which should be preserved.

Discussion

My main concerns are one of grand tactics, not of point edits. They are
of Software Engineering, and especially of medium term (decade) long
maintainability.

  • From what I have been able to tell this time around, RE2j has no concept of
    "region", I am not sure it tries to be a "drop in" replacement for JDK. The
    proposal above implements the Java concept of "region" in the Java domain
    (javalib code) and tests it there; easy SN/JDK A/B comparisons.
    Better separation of concerns.

  • The current SN RE2 code was last aligned to RE2j version 1.3. There may have
    been a RE2j commit after that rolled into SN. RE2j is now at v1.8. Most of the
    commits are not relevant to SN, but a few look like they would either repair
    defects in the code upon which SN RE2 is based or make that code more efficient.
    I believe there are some character handling changes which look intriguing.

    Keeping the underlying SN RE2 as close to feasible to the RE2J 1.3 base would
    make porting RE2J 1.8 changes economically_possible and/or easier.

    Yes, the consideration is benefit of fixing a defect at hand balanced against the
    possibility of a future port which may never happen. Roll the dice.

    Keeping the two code bases reasonably aligned will also make it easier for
    future maintainers to trace the code and rule out where a defect might exist.
    Instead of "What would JVM do?" one can ask "What does RE2j do?".

    Of course, my present concern with "minimal changes" goes against my
    approach in PRFix #2608, #1543: Correct/complete regex Matcher region support #2619.

Footnotes

  1. Yes, creating sub-strings has a runtime cost but that cost is
    only for regexen which either use new regions heavily or which
    use large regions.

    I considered but rejected as too complex, a hybrid approach, where
    the existing RE2 code is used for regex patterns which do not either
    start with '^' or end with "$" and create a sub-string for only for
    patterns which use those anchors.

    The underlying problem with SN regex support across the board
    is that we have no good way to profile and/or benchmark it.

    I suspect that the Character lookup handling code isway more
    expensive that the allocation & copy cost considered in the base
    proposal. Then again, I suspected that I was going to win the'
    last large lottery; did not happen.

@LeeTibbert
Copy link
Contributor

@kitbellew

If you are open to the approach I described and to creating a separate draft PR based on it, let me know
if I can help.

If you do not like the approach or have run out of time, I can do the work described. In my
earliest engineering days I was taught early on: Those who propose, dispose. This keep
some some balance in the design & implementation processes.

My intent is to do collaborative development & iterative improvement of Scala Native. I hope
my proposal does not put you off or discourage you. Projects, especially Open Source, live or die by their
ability to attract & retain talented developers.

@kitbellew
Copy link
ContributorAuthor

Forgot about this, my apologies,@LeeTibbert .

kitbellew@WojciechMazur

I'm 4+ days into studying this PR, an alternate commit in this PR, the existing SN javalib and RE2 code, various release levels of the upstream RE2j code and the upstream RE2 C code.
...

Footnotes

  1. Yes, creating sub-strings has a runtime cost but that cost is
    only for regexen which either use new regions heavily or which
    use large regions.

I would call it a deal-breaker. regex libraries are not expected to duplicate any input, which is why there are so many methods returning indices to input allowing the user to extract matches without copying. If you really wanted to go this route, you can wrap the existing input (CharSequence) in a java.nio.CharBuffer (also a CharSequence).

However, it would appear and in order to support transparent or anchoring bounds, the underlying (RE2) implementation would have to know something these positions, so pretending region covers the entire original input would be misleading.

I understand your comment about not deviating from RE2 too much, but I would say the proposed change is minor. As I mentioned earlier, I would consider choosing between this implementation and the one withone extra commit.

@LeeTibbert
Copy link
Contributor

Kitbellew

Thank you for reviving this discussion. I have been looking at this on my work list and it has been falling
further & further back on my list.

I'll have to study your observation "regex libraries are not expected to duplicate any input," and
the " java.nio.CharBuffer" (lesser) alternative.

I am pretty certain I will not be able to do that before the snow flies. I do not want to stand in the
way of progress.

Is this holding up any of yourscalafmt or other work?


I do not mean to malign the current RE2 implementation. It has served well for a number of years after
being forced fit into Scala Native.

For awhile now I have been getting the feeling that it might be worth spending at least a few hours of looking
around the Web to see if anything more recent has emerged. Not recency for recency's sake alone, but
a better match. We/SN seems to be living without the Java backtracing features.

Fear and horror are the wrong words but the right direction. I live in dread of the day when I or someone
else takes a close look at the SN RE2 implementation in terms of memory used (and orphened) and runtime cost.

The wonder is not that the dog sings well, it is that the dog sings at all, and for so long.

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@WojciechMazurWojciechMazurWojciechMazur left review comments

At least 1 approving review is required to merge this pull request.

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

regex Matcher: using region might fail to match at start of said region

3 participants

@kitbellew@LeeTibbert@WojciechMazur

[8]ページ先頭

©2009-2025 Movatter.jp