Movatterモバイル変換


[0]ホーム

URL:


RFC 9485I-RegexpOctober 2023
Bormann & BrayStandards Track[Page]
Stream:
Internet Engineering Task Force (IETF)
RFC:
9485
Category:
Standards Track
Published:
ISSN:
2070-1721
Authors:
C. Bormann
Universität Bremen TZI
T. Bray
Textuality

RFC 9485

I-Regexp: An Interoperable Regular Expression Format

Abstract

This document specifies I-Regexp, a flavor of regular expression that islimited in scope with the goal of interoperation across many differentregular expression libraries.

Status of This Memo

This is an Internet Standards Track document.

This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 7841.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained athttps://www.rfc-editor.org/info/rfc9485.

Copyright Notice

Copyright (c) 2023 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.

Table of Contents

1.Introduction

This specification describes an interoperable regular expression (abbreviated as "regexp") flavor, I-Regexp.

I-Regexp does not provide advanced regular expression features such as capture groups, lookahead, or backreferences.It supports only a Boolean matching capability, i.e., testing whether a given regular expression matches a given piece of text.

I-Regexp supports the entire repertoire of Unicode characters (Unicodescalar values); both the I-Regexp strings themselves and the stringsthey are matched against are sequences of Unicode scalar values (oftenrepresented in UTF-8 encoding form[STD63] for interchange).

I-Regexp is a subset of XML Schema Definition (XSD) regular expressions[XSD-2].

This document includes guidance for converting I-Regexps for use with several well-known regular expression idioms.

The development of I-Regexp was motivated by the work of the JSONPath Working Group (WG). The WG wanted to include support for the use of regular expressions in JSONPath filters in its specification[JSONPATH-BASE], but was unable to find a usefulspecification for regular expressions that would be interoperable across the popular libraries.

1.1.Terminology

This document uses the abbreviation "regexp" for what is usuallycalled a "regular expression" in programming.The term "I-Regexp" is used as a noun meaning a character string (sequence ofUnicode scalar values) that conforms to the requirementsin this specification; the plural is "I-Regexps".

This specification uses Unicode terminology; a good entry point is provided by[UNICODE-GLOSSARY].

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14[RFC2119][RFC8174] when, and only when, they appear in all capitals, as shown here.

The grammatical rules in this document are to be interpreted as ABNF,as described in[RFC5234] and[RFC7405], where the "characters" ofSection 2.3 of [RFC5234] are Unicode scalar values.

2.Objectives

I-Regexps should handle the vast majority of practical cases where amatching regexp is needed in a data-model specification or a query-language expression.

At the time of writing, an editor of this document conducted a survey of the regexp syntaxused in recently published RFCs. All examples found there should be covered by I-Regexps,both syntactically and with their intended semantics.The exception is the use of multi-character escapes, for whichworkaround guidance is provided inSection 5.

3.I-Regexp Syntax

An I-RegexpMUST conform to the ABNF specification inFigure 1.

i-regexp = branch *( "|" branch )branch = *piecepiece = atom [ quantifier ]quantifier = ( "*" / "+" / "?" ) / range-quantifierrange-quantifier = "{" QuantExact [ "," [ QuantExact ] ] "}"QuantExact = 1*%x30-39 ; '0'-'9'atom = NormalChar / charClass / ( "(" i-regexp ")" )NormalChar = ( %x00-27 / "," / "-" / %x2F-3E ; '/'-'>' / %x40-5A ; '@'-'Z' / %x5E-7A ; '^'-'z' / %x7E-D7FF ; skip surrogate code points / %xE000-10FFFF )charClass = "." / SingleCharEsc / charClassEsc / charClassExprSingleCharEsc = "\" ( %x28-2B ; '('-'+' / "-" / "." / "?" / %x5B-5E ; '['-'^' / %s"n" / %s"r" / %s"t" / %x7B-7D ; '{'-'}' )charClassEsc = catEsc / complEsccharClassExpr = "[" [ "^" ] ( "-" / CCE1 ) *CCE1 [ "-" ] "]"CCE1 = ( CCchar [ "-" CCchar ] ) / charClassEscCCchar = ( %x00-2C / %x2E-5A ; '.'-'Z' / %x5E-D7FF ; skip surrogate code points / %xE000-10FFFF ) / SingleCharEsccatEsc = %s"\p{" charProp "}"complEsc = %s"\P{" charProp "}"charProp = IsCategoryIsCategory = Letters / Marks / Numbers / Punctuation / Separators /    Symbols / OthersLetters = %s"L" [ ( %s"l" / %s"m" / %s"o" / %s"t" / %s"u" ) ]Marks = %s"M" [ ( %s"c" / %s"e" / %s"n" ) ]Numbers = %s"N" [ ( %s"d" / %s"l" / %s"o" ) ]Punctuation = %s"P" [ ( %x63-66 ; 'c'-'f' / %s"i" / %s"o" / %s"s" ) ]Separators = %s"Z" [ ( %s"l" / %s"p" / %s"s" ) ]Symbols = %s"S" [ ( %s"c" / %s"k" / %s"m" / %s"o" ) ]Others = %s"C" [ ( %s"c" / %s"f" / %s"n" / %s"o" ) ]
Figure 1:I-Regexp Syntax in ABNF

As an additional restriction,charClassExpr is not allowed tomatch[^], which, according to this grammar, would parse as apositive character class containing the single character^.

This is essentially an XSD regexp without:

An I-Regexp implementationMUST be a complete implementation of thislimited subset.In particular, full support for the Unicode functionality defined in this specification isREQUIRED. The implementation:

3.1.Checking Implementations

Achecking I-Regexp implementation is one that checks a suppliedregexp for compliance with this specification and reports any problems.Checking implementations give their users confidence that they didn'taccidentally insert syntax that is not interoperable, so checking isRECOMMENDED.Exceptions to this rule may be made for low-effort implementationsthat map I-Regexp to another regexp library by simple steps such asperforming the mapping operations discussed inSection 5. Here, theeffort needed to do full checking might dwarf the rest of theimplementation effort.ImplementationsSHOULD document whether or not they are checking.

Specifications that employ I-Regexp may want to define in whichcases their implementations can work with a non-checking I-Regexpimplementation and when full checking is needed, possibly in theprocess of defining their own implementation classes.

4.I-Regexp Semantics

This syntax is a subset of that of[XSD-2].Implementations that interpret I-RegexpsMUSTyield Boolean results as specified in[XSD-2].(See alsoSection 5.2.)

5.Mapping I-Regexp to Regexp Dialects

The material in this section is not normative; it is provided as guidanceto developers who want to use I-Regexps in the context of otherregular expression dialects.

5.1.Multi-Character Escapes

I-Regexp does not support common multi-character escapes (MCEs) and character classes built around them. These can usuallybe replaced as shown by the examples inTable 1.

Table 1:Example Substitutes for Multi-Character Escapes
MCE/class:Replace with:
\S[^ \t\n\r]
[\S ][^\t\n\r]
\d[0-9]

Note that the semantics of\d in XSD regular expressions is that of\p{Nd}; however, this would include all Unicode characters that are digits in various writing systems, which is almost certainly not what is required in IETF publications.

The construct\p{IsBasicLatin} is essentially a reference to legacyASCII; it can be replaced by the character class[\u0000-\u007f].

5.2.XSD Regexps

Any I-Regexp is also an XSD regexp[XSD-2], so the mapping is an identity function.

Note that a few errata for[XSD-2] have been fixed in[XSD-1.1-2]; therefore, itis also included in theNormative References (Section 9.1).XSD 1.1 is less widely implemented than XSD 1.0, and implementationsof XSD 1.0 are likely to include these bugfixes; for the intentsand purposes of this specification, an implementation of XSD 1.0regexps is equivalent to an implementation of XSD 1.1 regexps.

5.3.ECMAScript Regexps

Perform the following steps on an I-Regexp to obtain an ECMAScriptregexp[ECMA-262]:

  • For any unescaped dots (.) outside character classes (first alternative ofcharClass production), replace the dot with[^\n\r].
  • Envelope the result in^(?: and)$.

The ECMAScript regexp is to be interpreted as a Unicode pattern ("u"flag; see Section 21.2.2 "Pattern Semantics" of[ECMA-262]).

Note that where a regexp literal is required,the actual regexp needs to be enclosed in/.

5.4.PCRE, RE2, and Ruby Regexps

To obtain a valid regexp in Perl Compatible Regular Expressions (PCRE)[PCRE2], the Go programming language's RE2 regexp library[RE2], and the Rubyprogramming language, perform the same steps as inSection 5.3, except that the last step is:

  • Enclose the regexp in\A(?: and)\z.

6.Motivation and Background

While regular expressions originally were intended to describe aformal language to support a Boolean matching function, theyhave been enhanced with parsing functions that support the extractionand replacement of arbitrary portions of the matched text. With thisaccretion of features, parsing-regexp libraries have becomemore susceptible to bugs and surprising performance degradations thatcan be exploited in denial-of-service attacks byan attacker who controls the regexp submitted forprocessing. I-Regexp is designed to offer interoperability and to beless vulnerable to such attacks, with the trade-off that its onlyfunction is to offer a Boolean response as to whether a charactersequence is matched by a regexp.

6.1.Implementing I-Regexp

XSD regexps are relatively easy to implement or map to widelyimplemented parsing-regexp dialects, with these notableexceptions:

  • Character class subtraction. This is a very useful feature in many specifications, but it is unfortunately mostly absent from parsing-regexp dialects. Thus, it is omitted from I-Regexp.
  • Multi-character escapes.\d,\w,\s and their uppercase complement classes exhibit a large amount of variation between regexp flavors. Thus, they are omitted from I-Regexp.
  • Not all regexp implementations support access to Unicode tables that enable executing constructs such as\p{Nd}, although the\p/\P feature in general is now quite widely available. While, in principle, it is possible to translate these into character-class matches, this also requires access to those tables. Thus, regexp libraries in severely constrained environments may not be able to support I-Regexp conformance.

7.IANA Considerations

This document has no IANA actions.

8.Security Considerations

While technically out of the scope of this specification, Section10 ("Security Considerations") of RFC 3629[STD63] applies to implementations.Particular note needs to be taken of the last paragraph of Section3 ("UTF-8 definition") of RFC 3629[STD63]; an I-Regexp implementation may need tomitigate limitations of the platform implementation in this regard.

As discussed inSection 6, more complex regexp libraries maycontain exploitable bugs, which can lead to crashes and remote codeexecution. There is also the problem that such libraries often haveperformance characteristics that are hard to predict, leading to attacksthat overload an implementation by matching against an expensiveattacker-controlled regexp.

I-Regexps have been designed to allow implementation in a way that isresilient to both threats; this objective needs to be addressedthroughout the implementation effort. Non-checking implementations (seeSection 3.1) are likely to exposesecurity limitations of any regexp engine they use, which may be lessproblematic if that engine has been built with security considerationsin mind (e.g.,[RE2]). In any case, a checking implementation is stillRECOMMENDED.

Implementations that specifically implement the I-Regexp subset can,with care, be designed to generally run in linear time and space inthe input and to detect when that would not be the case (see below).

Existing regexp engines should be able to easily handle most I-Regexps(after the adjustments discussed inSection 5), but may consumeexcessive resources for some types of I-Regexps or outright rejectthem because they cannot guarantee efficient execution.(Note that different versions of the same regexp library may be more or less vulnerable to excessive resource consumption for these cases.)

Specifically, range quantifiers (as ina{2,4}) provide particularchallenges for both existing and I-Regexp focused implementations.Implementations may therefore limit range quantifiers in composability(disallowing nested range quantifiers such as(a{2,4}){2,4}) orrange (disallowing very large ranges such asa{20,200000}), or detectand reject any excessive resource consumption caused by range quantifiers.

I-Regexp implementations that are used to evaluate regexps fromuntrusted sources need to be robust in these cases.Implementers using existing regexp libraries are encouraged:

9.References

9.1.Normative References

[RFC2119]
Bradner, S.,"Key words for use in RFCs to Indicate Requirement Levels",BCP 14,RFC 2119,DOI 10.17487/RFC2119,,<https://www.rfc-editor.org/info/rfc2119>.
[RFC5234]
Crocker, D., Ed. andP. Overell,"Augmented BNF for Syntax Specifications: ABNF",STD 68,RFC 5234,DOI 10.17487/RFC5234,,<https://www.rfc-editor.org/info/rfc5234>.
[RFC7405]
Kyzivat, P.,"Case-Sensitive String Support in ABNF",RFC 7405,DOI 10.17487/RFC7405,,<https://www.rfc-editor.org/info/rfc7405>.
[RFC8174]
Leiba, B.,"Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words",BCP 14,RFC 8174,DOI 10.17487/RFC8174,,<https://www.rfc-editor.org/info/rfc8174>.
[XSD-1.1-2]
Peterson, D., Ed.,Gao, S., Ed.,Malhotra, A., Ed.,Sperberg-McQueen, C. M., Ed.,Thompson, H., Ed., andP. Biron, Ed.,"W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes",W3C REC REC-xmlschema11-2-20120405,W3C REC-xmlschema11-2-20120405,,<https://www.w3.org/TR/2012/REC-xmlschema11-2-20120405/>.
[XSD-2]
Biron, P., Ed. andA. Malhotra, Ed.,"XML Schema Part 2: Datatypes Second Edition",W3C REC REC-xmlschema-2-20041028,W3C REC-xmlschema-2-20041028,,<https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/>.

9.2.Informative References

[ECMA-262]
Ecma International,"ECMAScript 2020 Language Specification",Standard ECMA-262, 11th Edition,,<https://www.ecma-international.org/wp-content/uploads/ECMA-262.pdf>.
[JSONPATH-BASE]
Gössner, S., Ed.,Normington, G., Ed., andC. Bormann, Ed.,"JSONPath: Query expressions for JSON",Work in Progress,Internet-Draft, draft-ietf-jsonpath-base-20,,<https://datatracker.ietf.org/doc/html/draft-ietf-jsonpath-base-20>.
[PCRE2]
"Perl-compatible Regular Expressions (revised API: PCRE2)",<http://pcre.org/current/doc/html/>.
[RE2]
"RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.",commit 73031bb,<https://github.com/google/re2>.
[RFC7493]
Bray, T., Ed.,"The I-JSON Message Format",RFC 7493,DOI 10.17487/RFC7493,,<https://www.rfc-editor.org/info/rfc7493>.
[STD63]
Yergeau, F.,"UTF-8, a transformation format of ISO 10646",STD 63,RFC 3629,.
<https://www.rfc-editor.org/info/std63>
[UNICODE-GLOSSARY]
Unicode, Inc.,"Glossary of Unicode Terms",<https://unicode.org/glossary/>.

Acknowledgements

Discussion in the IETF JSONPATH WG about whether to include a regexp mechanism into the JSONPath query expression specification and previous discussions about the YANGpattern and Concise Data Definition Language (CDDL).regexp features motivated this specification.

The basic approach for this specification was inspired by "The I-JSON Message Format"[RFC7493].

Authors' Addresses

Carsten Bormann
Universität Bremen TZI
Postfach 330440
D-28359Bremen
Germany
Phone:+49-421-218-63921
Email:cabo@tzi.org
Tim Bray
Textuality
Canada
Email:tbray@textuality.com

[8]ページ先頭

©2009-2025 Movatter.jp