Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
/jqPublic
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Revamp sub/3 to resolve most issues with gsub (and sub with "g")#2641

Merged
itchyny merged 6 commits intojqlang:masterfrompkoppstein:gsub
Jul 3, 2023

Conversation

pkoppstein
Copy link
Contributor

@pkoppsteinpkoppstein commentedJun 29, 2023
edited by itchyny
Loading

The primary purpose of this commit (which supercedes PR#2624) is to rectify most problems withgsub (and alsosub with the "g" option), in particularfix#1425 ('\b'),fix#2354 (lookahead), andfix#2532 (regex == "^(?!cd ).*$|^cd ";"")).

This commit also partlyresolves#2148 and alsoresolves#1206 in thatgsub no longer loops infinitely; however, because the newgsub depends critically on match(_;"g"), the behavior when regex == "" is sometimes non-standard. [*1]

Since the new sub/3 relies on uniq/1, that has been added as well [*2].

The documentation has been updated to reflect the fact thatsub andgsub are intended to be regular in the second argument. [*3]

Also, _nwise/1 has been tweaked to take advantage of TCO.

Footnotes:

[*1] Using the new gsub, '"a" | gsub( ""; "a")' emits "aa" rather than "aaa" as would be standard. This is nevertheless better than the infinite loop behavior of jq 1.6 in this case.

With one exception (as explained in [*2]), the new gsub is implemented as though match/2 behavior is correct. That is, bugs ingsub behavior will most likely have their origin inmatch/2.

[*2]uniq/1 adopts the Unix/Linux name and semantics; it is needed for the following test case:

gsub("(?=u)"; "u")
"qux"
"quux"

Without this functionality:

Test#23: 'gsub("(?=u)"; "u")' at line number 100
*** Expected "quux", but got "quuux" for test at line number 102: gsub("(?=u)"; "u")

The root of the problem here ismatch: ifmatch is fixed, then gsub would not needuntie.

The addition ofuniq as a top-level function should be a non-issue relative to general concern about builtins.jq bloat: the line count of the new builtin.jq is significantly reduced overall, and the number of defs is actually reduced by 1 (from 111 (ignoring a redundant def) to 110).

[*3] See e.g.#513 (comment)

… uniq(stream)The primary purpose of this commit (which supercedes PRjqlang#2624) is to rectify most problemswith `gsub` (and also `sub` with the "g" option), in particularjqlang#1425('\b'),jqlang#2354 (lookahead), andjqlang#2532 (regex == "^(?!cd ).*$|^cd ";"")).This commit also partlyresolvesjqlang#2148 andjqlang#1206 in that `gsub` nolonger loops infinitely; however, because the new `gsub` dependscritically on match(_;"g"), the behavior when regex == "" is sometimesnon-standard. [*1]Since the new sub/3 relies on uniq/1, that has been added as well [*2].The documentation has been updated to reflect the fact that `sub` and`gsub` are intended to be regular in the second argument. [*3]Also, _nwise/1 has been tweaked to take advantage of TCO.Footnotes:[*1] Using the new gsub, '"a" | gsub( ""; "a")' emits "aa" rather than"aaa" as would be standard.  This is nevertheless better than theinfinite loop behavior of jq 1.6 in this case.With one exception (as explained in [*2]), the new gsub is implementedas though match/2 behavior is correct.  That is, bugs in `gsub`behavior will most likely have their origin in `match/2`.[*2] `uniq/1` adopts the Unix/Linux name and semantics; it is needed for the following test case:gsub("(?=u)"; "u")"qux""quux"Without this functionality:Testjqlang#23: 'gsub("(?=u)"; "u")' at line number 100*** Expected "quux", but got "quuux" for test at line number 102: gsub("(?=u)"; "u")The root of the problem here is `match`: if `match` is fixed, then gsub would not need `untie`.The addition of `uniq` as a top-level function should be a non-issuerelative to general concern about builtins.jq bloat: the line count ofthe new builtin.jq is significantly reduced overall, and the number ofdefs is actually reduced by 1 (from 111 (ignoring a redundant def) to 110).[*3] See e.g.jqlang#513 (comment)
@pkoppstein
Copy link
ContributorAuthor

@itchyny - Please note that the failing checks almost certainly have nothing to do with the changes in this PR.

Please also note that in the case of "no match" examples such as"p" | gsub("q"; "a","b"), this PR currently follows gojq, but that it would only require a small change to one line to produce what I would argue is actually the "correct" behavior (in the sense of regularity in the second position).

@pkoppstein
Copy link
ContributorAuthor

@itchyny - as expected, the failing tests had nothing to do with this PR. Since “all checks have passed”, it would be a good time to pull, before merging becomes very complex, wouldn’t it?

named captures. The named captures are, in effect, presented
as a JSON object (as constructed by `capture`) to
`tostring`, so a reference to a captured variable named "x"
would take the form: `"\(.x)"`.

example:
- program: 'sub("^[^a-z]*(?<x>[a-z]*).*")'
input: '"123abc456"'
output: '"ZabcZabc"'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This example is invalid and somehow not included in man.test. There is nosub/1 andoutput should always be an array. Since you added examples, would you rebuild man.test to check if the CI passes?

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

As per notes elsewhere, this example is very old, and was always wrong. It has been replaced.

@@ -1428,6 +1428,30 @@ sections:
input: '[{"foo":1, "bar":14}, {"foo":2, "bar":3}]'
output: ['{"foo":2, "bar":3}']

- title: "`uniq(stream)`"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'm still disagree with this naming, I'm very worried about this confuses the users. Consistency within the builtin functions is much important than external command name conventions. We need to remember thatuniq is a stream oriented andunique is an array oriented. There're already filters having both array and stream versions likefirst so that's good.

body: |

`gsub` is like `sub` but all the non-overlapping occurrences of the regex are
replaced by the string, after interpolation.
replaced by `tostring`, after interpolation. If the second argument is a stream
of jq strings, then `gsub` will produce a corresponding stream of JSON strings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Adding an example will help users understand this sentence.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Example added in5373191

@@ -75,6 +75,45 @@ gsub( "(.*)"; ""; "x")
""
""

gsub( ""; "a"; "g")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Nitpick but inconsistent spacing.

# in mind that s could be a stream
def sub($re; s; $flags):
. as $in
| (reduce uniq(match($re; $flags)) as $edit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Following patch isn't enough to fix the duplicate bug?

diff --git a/src/builtin.c b/src/builtin.cindex 9b2d9a2..03ab4d5 100644--- a/src/builtin.c+++ b/src/builtin.c@@ -930,7 +930,7 @@ static jv f_match(jq_state *jq, jv input, jv regex, jv modifiers, jv testmode) {         match = jv_object_set(match, jv_string("string"), jv_string(""));         match = jv_object_set(match, jv_string("captures"), jv_array());         result = jv_array_append(result, match);-        start += 1;+        start = (const UChar*)(input_string+region->end[0]+1);         continue;       }
 $ ./jq -nc'"qqqqqqux" | match("(?=u)"; "g")'{"offset":6,"length":0,"string":"","captures":[]}

def _nwise(a; $n): if a|length <= $n then a else a[0:$n] , _nwise(a[$n:]; $n) end;
def _nwise($n): _nwise(.; $n);
def _nwise($n):
def n: if length <= $n then . else .[0:$n] , (.[$n:] | n) end;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We can use omittingelse . by flipping the condition.

def_nwise($n):def_nwise:iflength>$nthen.[:$n], (.[$n:]|_nwise)end;_nwise;

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Being able to omitelse is not in jq 1.6, so my reason for sticking with the 1.6 syntax is that it should be easy for someone familiar with 1.6 to understand code that might well be "copied and pasted" into a jq program, or indeed a jaq program.

| [reduce ( $edit | .captures | .[] | select(.name != null) | { (.name) : .string } ) as $pair
({}; . + $pair) | s ] as $inserts
| reduce range(0; $inserts|length) as $ix (.; .result[$ix] += $gap + $inserts[$ix])
| .previous = ($edit | .offset + .length ) )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Inconsistent tab indentation.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Hmmm. This seems to me to be a case of apples and oranges. The firstreduce occurs in the form[reduce ... ] as $_. Anyway, in my view, it's fine to put areduce (or anif..then..else..end) on one line if its fits. But feel free to tweak :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I just noticed you're mixing space and tab indentations...

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Sorry about that. Two tabs in builtin.jq removed inpkoppstein@3015ec8

| reduce range(0; $inserts|length) as $ix (.; .result[$ix] += $gap + $inserts[$ix])
| .previous = ($edit | .offset + .length ) )
| .result[] + $in[.previous:] )
// $in;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Removing// $in passes the tests, but this is necessary. Would you add a test that doesn't match?

Copy link
Contributor

@itchynyitchynyJul 2, 2023
edited
Loading

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

↑ I thought this but this is actually unnecessary!

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Please see detailed notes elsewhere regarding the three variants of this line. Assuming you want gsub fixed in the next release, I'd suggest merging and then tweaking, otherwise this "branch" will fall too far behind and be quite difficult to manage later as there are so many files involved, two or three of which are the subject of many commits.

@itchynyitchyny added this to the1.7 release milestoneJul 2, 2023
builtin.c: f_match: Zero-width match : ensure '"qux" | match("(?=u)"; "g")' matches just oncerm uniq/1 as it is no longer neededIn manual.yml, replace nonsensical sub/1 example
Correct one gsub example, and add another
Remove two tab characters
@itchynyitchyny changed the titlerevamp sub/3 to resolve most issues with gsub (and sub with "g"); add…Revamp sub/3 to resolve most issues with gsub (and sub with "g")Jul 2, 2023
src/builtin.c Outdated
@@ -930,7 +930,8 @@ static jv f_match(jq_state *jq, jv input, jv regex, jv modifiers, jv testmode) {
match = jv_object_set(match, jv_string("string"), jv_string(""));
match = jv_object_set(match, jv_string("captures"), jv_array());
result = jv_array_append(result, match);
start += 1;
// ensure '"qux" | match("(?=u)"; "g")' matches just once
start = (const UChar*)(input_string+region->end[0]+1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

There is tab indentation here but we prefer spaces.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Fixed (in832734b); also fixed the editor setting.

itchyny reacted with thumbs up emoji
Copy link
Contributor

@itchynyitchyny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

LGTM!

@pkoppstein
Copy link
ContributorAuthor

pkoppstein commentedJul 3, 2023
edited
Loading

@itchyny wrote:

LGTM!

Thanks. I manually forced the restarting of the checking process, and now there's a sign "All checks have passed" again. But github also tells me:

You’re notauthorized to merge this pull request

@itchyny
Copy link
Contributor

@pkoppstein Can you kick the ci again (by rebasing again?) to make sure you can merge?

@pkoppstein
Copy link
ContributorAuthor

pkoppstein commentedJul 3, 2023
edited
Loading

@itchyny - As expected, my git/github skills are not up to the task.

In fact, I may have made things worse. Using GitHub Desktop, I clicked on "Update from upstream/master"
which resulted in two new commits that I didn't intend.

However, GitHub tells me now that "This branch [gsub] is not behind the upstream jqlang:master".
That looks promising, doesn't it?

@itchyny
Copy link
Contributor

@pkoppstein That's good. However , if you want to drop the merge commit, you can usegit reset --hard HEAD^ andgit rebase upstream/master.

@pkoppstein
Copy link
ContributorAuthor

pkoppstein commentedJul 3, 2023
edited
Loading

@itchyny wrote:

That's good.

Apparently, good enough for you to merge into jqlang:

image

I'll try to avoid messing things up in the meantime.

If something needs adjusting, please feel free to make changes: "Maintainers are allowed to edit this pull request."

@itchyny
Copy link
Contributor

If you don't mind, I'll merge this PR. Would you rephrase the PR description (which contains the outdated descriptions about uniq)? Note thatfix #XXX links the issue toe be closed by this PR so the that's important. Thanks.

@pkoppstein
Copy link
ContributorAuthor

@itchyny wrote:

If you don't mind, I'll merge this PR

Since I don’t have the permissions to do so, that would be great. I think it would be best if you made any adjustments when doing so. I will undoubtedly just make a mess of things. Tx.

itchyny reacted with thumbs up emoji

@emanuele6
Copy link
Member

emanuele6 commentedJul 18, 2023
edited
Loading

I think there is a bug in this implementation: If you use the"g" flag, and the pattern matches up to the end of the string, it will replace both that match and the empty string at the end of the string if the pattern is able to match the empty string:

"1"|gsub("1*";"2")# "22"

EDIT: actually, that is a bug with_match_impl/3

nicowilliams reacted with thumbs up emoji

@pkoppstein
Copy link
ContributorAuthor

I think there is a bug in this implementation ... "1" | gsub("1*"; "2")

Interestingly, 7 of the 8 regex engines at regex101.com indicate that "1" ~ /1*/ twice, evidently because
/1*/ eats the 1, and then matches the EOS as well. I wonder why it's Rust that's the odd one out here.

Perhaps the the moral of the story is that users should avoid using "g" in cases like this.

pkoppstein added a commit to pkoppstein/jq that referenced this pull requestJul 29, 2023
@pkoppsteinpkoppstein mentioned this pull requestJul 31, 2023
@itchynyitchyny linked an issueJul 31, 2023 that may beclosed by this pull request
pkoppstein added a commit to pkoppstein/jq that referenced this pull requestAug 1, 2023
@pkoppsteinpkoppstein mentioned this pull requestAug 1, 2023
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@itchynyitchynyitchyny approved these changes

Assignees
No one assigned
Labels
None yet
Projects
None yet
Milestone
1.7 release
3 participants
@pkoppstein@itchyny@emanuele6

[8]ページ先頭

©2009-2025 Movatter.jp