Strings are finite sequences of characters. Of course, the real trouble comes when one asks what a character is. The characters that English speakers are familiar with are the letters A, B, C, etc., together with numerals and common punctuation symbols. These characters are standardized together with a mapping to integer values between 0 and 127 by the ASCII standard. There are, of course, many other characters used in non-English languages, including variants of the ASCII characters with accents and other modifications, related scripts such as Cyrillic and Greek, and scripts completely unrelated to ASCII and English, including Arabic, Chinese, Hebrew, Hindi, Japanese, and Korean. The Unicode standard tackles the complexities of what exactly a character is, and is generally accepted as the definitive standard addressing this problem.

Depending on your needs, you can either ignore these complexities entirely and just pretend that only ASCII characters exist, or you can write code that can handle any of the characters or encodings that one may encounter when handling non-ASCII text. Julia makes dealing with plain ASCII text simple and efficient, and handling Unicode is as simple and efficient as possible. In particular, you can write C-style string code to process ASCII strings, and they will work as expected, both in terms of performance and semantics. If such code encounters non-ASCII text, it will gracefully fail with a clear error message, rather than silently introducing corrupt results. When this happens, modifying the code to handle non-ASCII data is straightforward.
There are a few noteworthy high-level features about Julia's strings:
- The built-in concrete string type is String. This supports the full range of Unicode characters via the UTF-8 encoding. (A transcode function is provided to convert to/from other Unicode encodings.)
- All string types are subtypes of the abstract type AbstractString, and external packages define additional AbstractString subtypes (e.g. for other encodings). If you define a function expecting a string argument, you should declare the type as AbstractString in order to accept any string type.
- A single character is represented by a value whose type is a subtype of AbstractChar. The built-in Char subtype of AbstractChar is a 32-bit primitive type that can represent any Unicode character (and which is based on the UTF-8 encoding).
- Strings are immutable: the value of an AbstractString object cannot be changed. To construct a different string value, you construct a new string from parts of other strings.

A Char value represents a single character: it is just a 32-bit primitive type with a special literal representation and appropriate arithmetic behaviors, and which can be converted to a numeric value representing a Unicode code point. (Julia packages may define other subtypes of AbstractChar, e.g. to optimize operations for other text encodings.) Here is how Char values are input and shown (note that character literals are delimited with single quotes, not double quotes):
julia> c = 'x'
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)

julia> typeof(c)
Char

You can easily convert a Char to its integer value, i.e. code point:

julia> c = Int('x')
120

julia> typeof(c)
Int64

On 32-bit architectures, typeof(c) will be Int32. You can convert an integer value back to a Char just as easily:

julia> Char(120)
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)

Not all integer values are valid Unicode code points, but for performance, the Char conversion does not check that every character value is valid. If you want to check that each converted value is a valid code point, use the isvalid function:

julia> Char(0x110000)
'\U110000': Unicode U+110000 (category In: Invalid, too high)

julia> isvalid(Char, 0x110000)
false

As of this writing, the valid Unicode code points are U+0000 through U+D7FF and U+E000 through U+10FFFF. These have not all been assigned intelligible meanings yet, nor are they necessarily interpretable by applications, but all of these values are considered to be valid Unicode characters.
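Because Char conversion does not validate its input, code that receives integers from an untrusted source may want to check them first. The following is a minimal sketch of that idea; safe_char is a hypothetical helper name, not a Base function.

```julia
# `safe_char` is a hypothetical helper, not part of Base.
# It returns a Char only when `code` is a valid Unicode code point,
# and `nothing` otherwise.
function safe_char(code::Integer)
    # Negative values can never be code points; check that before isvalid.
    return (code >= 0 && isvalid(Char, code)) ? Char(code) : nothing
end

safe_char(0x78)       # 'x'
safe_char(0x2200)     # '∀'
safe_char(0xD800)     # nothing (surrogate range, U+D800 through U+DFFF)
safe_char(0x110000)   # nothing (above U+10FFFF)
```

Returning nothing rather than throwing is a design choice made for this sketch; a caller that prefers an exception could throw an ArgumentError instead.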
You can input any Unicode character in single quotes using \u followed by up to four hexadecimal digits or \U followed by up to eight hexadecimal digits (the longest valid value only requires six):

julia> '\u0'
'\0': ASCII/Unicode U+0000 (category Cc: Other, control)

julia> '\u78'
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)

julia> '\u2200'
'∀': Unicode U+2200 (category Sm: Symbol, math)

julia> '\U10ffff'
'\U10ffff': Unicode U+10FFFF (category Cn: Other, not assigned)

Julia uses your system's locale and language settings to determine which characters can be printed as-is and which must be output using the generic, escaped \u or \U input forms. In addition to these Unicode escape forms, all of C's traditional escaped input forms can also be used:

julia> Int('\0')
0

julia> Int('\t')
9

julia> Int('\n')
10

julia> Int('\e')
27

julia> Int('\x7f')
127

julia> Int('\177')
127

You can do comparisons and a limited amount of arithmetic with Char values:

julia> 'A' < 'a'
true

julia> 'A' <= 'a' <= 'Z'
false

julia> 'A' <= 'X' <= 'Z'
true

julia> 'x' - 'a'
23

julia> 'A' + 1
'B': ASCII/Unicode U+0042 (category Lu: Letter, uppercase)
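Character comparison and arithmetic combine naturally for simple ASCII manipulations. The sketch below assumes lowercase ASCII input only; alphabet_position and shift_letter are hypothetical names chosen for illustration.

```julia
# Position of a lowercase ASCII letter in the alphabet (a = 1, ..., z = 26).
# `alphabet_position` is a hypothetical helper, not part of Base.
function alphabet_position(c::Char)
    'a' <= c <= 'z' || throw(ArgumentError("expected a lowercase ASCII letter"))
    return c - 'a' + 1          # Char - Char yields an Int
end

# Shift a lowercase ASCII letter by `k` places, wrapping around after 'z'.
# `shift_letter` is likewise hypothetical. Char + Int yields a Char.
shift_letter(c::Char, k::Integer) = 'a' + mod(c - 'a' + k, 26)

alphabet_position('x')   # 24
shift_letter('y', 3)     # 'b'
shift_letter('a', -1)    # 'z'
```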
String literals are delimited by double quotes or triple double quotes (not single quotes):

julia> str = "Hello, world.\n"
"Hello, world.\n"

julia> """Contains "quote" characters"""
"Contains \"quote\" characters"

Long lines in strings can be broken up by preceding the newline with a backslash (\):

julia> "This is a long \
       line"
"This is a long line"
If you want to extract a character from a string, you index into it:
julia> str[begin]
'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)

julia> str[1]
'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)

julia> str[6]
',': ASCII/Unicode U+002C (category Po: Punctuation, other)

julia> str[end]
'\n': ASCII/Unicode U+000A (category Cc: Other, control)

Many Julia objects, including strings, can be indexed with integers. The index of the first element (the first character of a string) is returned by firstindex(str), and the index of the last element (character) with lastindex(str). The keywords begin and end can be used inside an indexing operation as shorthand for the first and last indices, respectively, along the given dimension. String indexing, like most indexing in Julia, is 1-based: firstindex always returns 1 for any AbstractString. As we will see below, however, lastindex(str) is not in general the same as length(str) for a string, because some Unicode characters can occupy multiple "code units".

You can perform arithmetic and other operations with end, just like a normal value:

julia> str[end-1]
'.': ASCII/Unicode U+002E (category Po: Punctuation, other)

julia> str[end÷2]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

Using an index less than begin (1) or greater than end raises an error:

julia> str[begin-1]
ERROR: BoundsError: attempt to access 14-codeunit String at index [0]
[...]

julia> str[end+1]
ERROR: BoundsError: attempt to access 14-codeunit String at index [15]
[...]
You can also extract a substring using range indexing:
julia> str[4:9]
"lo, wo"

Notice that the expressions str[k] and str[k:k] do not give the same result:

julia> str[6]
',': ASCII/Unicode U+002C (category Po: Punctuation, other)

julia> str[6:6]
","

The former is a single character value of type Char, while the latter is a string value that happens to contain only a single character. In Julia these are very different things.

Range indexing makes a copy of the selected part of the original string. Alternatively, it is possible to create a view into a string using the type SubString. More simply, using the @views macro on a block of code converts all string slices into substrings. For example:

julia> str = "long string"
"long string"

julia> substr = SubString(str, 1, 4)
"long"

julia> typeof(substr)
SubString{String}

julia> @views typeof(str[1:4]) # @views converts slices to SubStrings
SubString{String}

Several standard functions like chop, chomp or strip return a SubString.
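A practical consequence is that the result of strip and similar functions is a view, so an explicit String(...) call is needed when an independent copy is wanted. A small sketch, with variable names chosen only for illustration:

```julia
line = "  value = 42  \n"

trimmed = strip(line)        # a SubString view into `line`; the text is not copied
owned   = String(trimmed)    # an independent String copy of the same characters

typeof(trimmed)              # SubString{String}
typeof(owned)                # String
trimmed == owned             # true: equality compares contents, not types
```

This is also why functions that accept text are usually declared with AbstractString arguments: both the view and the copy can then be passed without conversion.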
Julia fully supports Unicode characters and strings. As discussed above, in character literals, Unicode code points can be represented using Unicode \u and \U escape sequences, as well as all the standard C escape sequences. These can likewise be used to write string literals:

julia> s = "\u2200 x \u2203 y"
"∀ x ∃ y"

Whether these Unicode characters are displayed as escapes or shown as special characters depends on your terminal's locale settings and its support for Unicode. String literals are encoded using the UTF-8 encoding. UTF-8 is a variable-width encoding, meaning that not all characters are encoded in the same number of bytes ("code units"). In UTF-8, ASCII characters (i.e. those with code points less than 0x80 (128)) are encoded as they are in ASCII, using a single byte, while code points 0x80 and above are encoded using multiple bytes, up to four per character.

String indices in Julia refer to code units (= bytes for UTF-8), the fixed-width building blocks that are used to encode arbitrary characters (code points). This means that not every index into a String is necessarily a valid index for a character. If you index into a string at such an invalid byte index, an error is thrown:

julia> s[1]
'∀': Unicode U+2200 (category Sm: Symbol, math)

julia> s[2]
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'∀', [4]=>' '
Stacktrace:
[...]

julia> s[3]
ERROR: StringIndexError: invalid index [3], valid nearby indices [1]=>'∀', [4]=>' '
Stacktrace:
[...]

julia> s[4]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

In this case, the character ∀ is a three-byte character, so the indices 2 and 3 are invalid and the next character's index is 4; this next valid index can be computed by nextind(s,1), and the next index after that by nextind(s,4) and so on.

Since end is always the last valid index into a collection, end-1 references an invalid byte index if the second-to-last character is multibyte.

julia> s[end-1]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

julia> s[end-2]
ERROR: StringIndexError: invalid index [9], valid nearby indices [7]=>'∃', [10]=>' '
Stacktrace:
[...]

julia> s[prevind(s, end, 2)]
'∃': Unicode U+2203 (category Sm: Symbol, math)

The first case works, because the last character y and the space are one-byte characters, whereas end-2 indexes into the middle of the ∃ multibyte representation. The correct way for this case is using prevind(s, lastindex(s), 2) or, if you're using that value to index into s, you can write s[prevind(s, end, 2)] and end expands to lastindex(s).
Extraction of a substring using range indexing also expects valid byte indices or an error is thrown:
julia> s[1:1]
"∀"

julia> s[1:2]
ERROR: StringIndexError: invalid index [2], valid nearby indices [1]=>'∀', [4]=>' '
Stacktrace:
[...]

julia> s[1:4]
"∀ "

Because of variable-length encodings, the number of characters in a string (given by length(s)) is not always the same as the last index. If you iterate through the indices 1 through lastindex(s) and index into s, the sequence of characters returned when errors aren't thrown is the sequence of characters comprising the string s. Thus length(s) <= lastindex(s), since each character in a string must have its own index. The following is an inefficient and verbose way to iterate through the characters of s:

julia> for i = firstindex(s):lastindex(s)
           try
               println(s[i])
           catch
               # ignore the index error
           end
       end
∀
 
x
 
∃
 
y

The blank lines actually have spaces on them. Fortunately, the above awkward idiom is unnecessary for iterating through the characters in a string, since you can just use the string as an iterable object, no exception handling required:

julia> for c in s
           println(c)
       end
∀
 
x
 
∃
 
y

If you need to obtain valid indices for a string, you can use the nextind and prevind functions to increment/decrement to the next/previous valid index, as mentioned above. You can also use the eachindex function to iterate over the valid character indices:

julia> collect(eachindex(s))
7-element Vector{Int64}:
  1
  4
  5
  6
  7
 10
 11
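When both the byte index and the character are needed, stepping with nextind avoids the invalid indices entirely. The following sketch uses a hypothetical helper, chars_with_indices, to make the traversal explicit; it also uses thisind, which maps an arbitrary byte index back to the start of its character.

```julia
s = "∀ x ∃ y"

# `chars_with_indices` is a hypothetical helper, not part of Base.
# It walks the string by explicit index, advancing with nextind rather than i + 1.
function chars_with_indices(str::String)
    result = Tuple{Int,Char}[]
    i = firstindex(str)
    while i <= lastindex(str)
        push!(result, (i, str[i]))
        i = nextind(str, i)      # jump to the start of the next character
    end
    return result
end

chars_with_indices(s)   # (1, '∀'), (4, ' '), (5, 'x'), (6, ' '), (7, '∃'), (10, ' '), (11, 'y')

length(s)               # 7 characters
lastindex(s)            # 11, the byte index at which the last character starts
thisind(s, 3)           # 1, the start of the character that byte 3 belongs to
```

In everyday code, iterating with `for c in s` or over `eachindex(s)` is shorter; the explicit loop is shown only to illustrate how the indices advance.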
To access the raw code units (bytes for UTF-8) of the encoding, you can use the codeunit(s,i) function, where the index i runs consecutively from 1 to ncodeunits(s). The codeunits(s) function returns an AbstractVector{UInt8} wrapper that lets you access these raw code units (bytes) as an array.
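As a brief illustration of these functions on a short string (the variable names are arbitrary):

```julia
s = "∀ x"

ncodeunits(s)            # 5: three bytes for '∀', one each for ' ' and 'x'
codeunit(s, 1)           # 0xe2, the first byte of the UTF-8 encoding of '∀'

bytes = codeunits(s)     # read-only byte view of the string
collect(bytes)           # UInt8[0xe2, 0x88, 0x80, 0x20, 0x78]
```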
Strings in Julia can contain invalid UTF-8 code unit sequences. This convention allows any byte sequence to be treated as a String. In such situations, the rule is that when parsing a sequence of code units from left to right, characters are formed by the longest sequence of 8-bit code units that matches the start of one of the following bit patterns (each x can be 0 or 1):

- 0xxxxxxx;
- 110xxxxx 10xxxxxx;
- 1110xxxx 10xxxxxx 10xxxxxx;
- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx;
- 10xxxxxx;
- 11111xxx.

In particular this means that overlong and too-high code unit sequences and prefixes thereof are treated as a single invalid character rather than multiple invalid characters. This rule may be best explained with an example:

julia> s = "\xc0\xa0\xe2\x88\xe2|"
"\xc0\xa0\xe2\x88\xe2|"

julia> foreach(display, s)
'\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space)
'\xe2\x88': Malformed UTF-8 (category Ma: Malformed, bad data)
'\xe2': Malformed UTF-8 (category Ma: Malformed, bad data)
'|': ASCII/Unicode U+007C (category Sm: Symbol, math)

julia> isvalid.(collect(s))
4-element BitArray{1}:
 0
 0
 0
 1

julia> s2 = "\xf7\xbf\xbf\xbf"
"\U1fffff"

julia> foreach(display, s2)
'\U1fffff': Unicode U+1FFFFF (category In: Invalid, too high)

We can see that the first two code units in the string s form an overlong encoding of the space character. It is invalid, but is accepted in a string as a single character. The next two code units form a valid start of a three-byte UTF-8 sequence. However, the fifth code unit \xe2 is not its valid continuation. Therefore code units 3 and 4 are interpreted together as a single malformed character in this string. Similarly, code unit 5 forms a malformed character because | is not a valid continuation to it. Finally, the string s2 contains one code point that is too high.
Julia uses the UTF-8 encoding by default, and support for new encodings can be added by packages. For example, the LegacyStrings.jl package implements UTF16String and UTF32String types. Additional discussion of other encodings and how to implement support for them is beyond the scope of this document for the time being. For further discussion of UTF-8 encoding issues, see the section below on byte array literals. The transcode function is provided to convert data between the various UTF-xx encodings, primarily for working with external data and libraries.
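For instance, a library that expects UTF-16 data can be fed the output of transcode, and its results converted back. A minimal round-trip sketch (variable names are illustrative):

```julia
s = "∀ x"

utf16 = transcode(UInt16, s)           # Vector{UInt16} holding the UTF-16 code units
roundtrip = transcode(String, utf16)   # back to a UTF-8 encoded String

length(utf16)      # 3: each of these characters fits in one UTF-16 code unit
roundtrip == s     # true
```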
One of the most common and useful string operations is concatenation:
julia> greet = "Hello"
"Hello"

julia> whom = "world"
"world"

julia> string(greet, ", ", whom, ".\n")
"Hello, world.\n"

It's important to be aware of potentially dangerous situations such as concatenation of invalid UTF-8 strings. The resulting string may contain different characters than the input strings, and its number of characters may be lower than the sum of the numbers of characters of the concatenated strings, e.g.:

julia> a, b = "\xe2\x88", "\x80"
("\xe2\x88", "\x80")

julia> c = string(a, b)
"∀"

julia> collect.([a, b, c])
3-element Vector{Vector{Char}}:
 ['\xe2\x88']
 ['\x80']
 ['∀']

julia> length.([a, b, c])
3-element Vector{Int64}:
 1
 1
 1

This situation can happen only for invalid UTF-8 strings. For valid UTF-8 strings, concatenation preserves all characters in the strings and the additivity of string lengths.
Julia also provides * for string concatenation:

julia> greet * ", " * whom * ".\n"
"Hello, world.\n"

While * may seem like a surprising choice to users of languages that provide + for string concatenation, this use of * has precedent in mathematics, particularly in abstract algebra.

In mathematics, + usually denotes a commutative operation, where the order of the operands does not matter. An example of this is matrix addition, where A + B == B + A for any matrices A and B that have the same shape. In contrast, * typically denotes a noncommutative operation, where the order of the operands does matter. An example of this is matrix multiplication, where in general A * B != B * A. As with matrix multiplication, string concatenation is noncommutative: greet * whom != whom * greet. As such, * is a more natural choice for an infix string concatenation operator, consistent with common mathematical use.

More precisely, the set of all finite-length strings S together with the string concatenation operator * forms a free monoid (S, *). The identity element of this set is the empty string, "". Whenever a free monoid is not commutative, the operation is typically represented as ⋅, *, or a similar symbol, rather than +, which as stated usually implies commutativity.
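The monoid properties can be checked directly at the prompt. The sketch below also uses the ^ operator on strings, which is not discussed above but repeats a string in keeping with the multiplicative notation.

```julia
greet = "Hello"
whom  = "world"

noncommutative = greet * whom != whom * greet                    # true
associative    = (greet * ", ") * whom == greet * (", " * whom)  # true
has_identity   = "" * greet == greet == greet * ""               # true: "" is the identity
power_matches  = greet^3 == greet * greet * greet                # true: ^ is repeated *
```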
Constructing strings using concatenation can become a bit cumbersome, however. To reduce the need for these verbose calls to string or repeated multiplications, Julia allows interpolation into string literals using $, as in Perl:

julia> greet = "Hello"; whom = "world";

julia> "$greet, $whom.\n"
"Hello, world.\n"

This is more readable and convenient and equivalent to the above string concatenation: the system rewrites this apparent single string literal into the call string(greet, ", ", whom, ".\n").

The shortest complete expression after the $ is taken as the expression whose value is to be interpolated into the string. Thus, you can interpolate any expression into a string using parentheses:

julia> "1 + 2 = $(1 + 2)"
"1 + 2 = 3"

Both concatenation and string interpolation call string to convert objects into string form. However, string actually just returns the output of print, so new types should add methods to print or show instead of string.
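To make the last point concrete, here is a minimal sketch of a user-defined type that customizes its textual form by adding a show method. Point and its angle-bracket format are hypothetical, chosen only so that the custom output is easy to tell apart from the default.

```julia
# `Point` is a hypothetical type used only for this illustration.
struct Point
    x::Int
    y::Int
end

# Adding a `show` method is enough: `print`, `string`, and interpolation
# fall back to it when no dedicated `print` method exists.
Base.show(io::IO, p::Point) = print(io, "<", p.x, ", ", p.y, ">")

p = Point(1, 2)
string(p)          # "<1, 2>"
"origin: $p"       # "origin: <1, 2>"
```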
Most non-AbstractString objects are converted to strings closely corresponding to how they are entered as literal expressions:

julia> v = [1,2,3]
3-element Vector{Int64}:
 1
 2
 3

julia> "v: $v"
"v: [1, 2, 3]"

string is the identity for AbstractString and AbstractChar values, so these are interpolated into strings as themselves, unquoted and unescaped:

julia> c = 'x'
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)

julia> "hi, $c"
"hi, x"

To include a literal $ in a string literal, escape it with a backslash:

julia> print("I have \$100 in my account.\n")
I have $100 in my account.
When strings are created using triple-quotes ("""...""") they have some special behavior that can be useful for creating longer blocks of text.

First, triple-quoted strings are dedented to the level of the least-indented line. This is useful for defining strings within code that is indented. For example:

julia> str = """
           Hello,
           world.
         """
"  Hello,\n  world.\n"

In this case the final (empty) line before the closing """ sets the indentation level.

The dedentation level is determined as the longest common starting sequence of spaces or tabs in all lines, excluding the line following the opening """ and lines containing only spaces or tabs (the line containing the closing """ is always included). Then for all lines, excluding the text following the opening """, the common starting sequence is removed (including lines containing only spaces and tabs if they start with this sequence), e.g.:

julia> """    This
         is
           a test"""
"    This\nis\n  a test"

Next, if the opening """ is followed by a newline, the newline is stripped from the resulting string.

"""hello"""

is equivalent to

"""
hello"""

but

"""

hello"""

will contain a literal newline at the beginning.

Stripping of the newline is performed after the dedentation. For example:

julia> """
         Hello,
         world."""
"Hello,\nworld."

If the newline is removed using a backslash, dedentation will be respected as well:

julia> """
         Averylong\
         word"""
"Averylongword"
Trailing whitespace is left unaltered.
Triple-quoted string literals can contain " characters without escaping.

Note that line breaks in literal strings, whether single- or triple-quoted, result in a newline (LF) character \n in the string, even if your editor uses a carriage return \r (CR) or CRLF combination to end lines. To include a CR in a string, use an explicit escape \r; for example, you can enter the literal string "a CRLF line ending\r\n".
You can lexicographically compare strings using the standard comparison operators:
julia> "abracadabra" < "xylophone"
true

julia> "abracadabra" == "xylophone"
false

julia> "Hello, world." != "Goodbye, world."
true

julia> "1 + 2 = 3" == "1 + 2 = $(1 + 2)"
true

You can search for the index of a particular character using the findfirst and findlast functions:

julia> findfirst('o', "xylophone")
4

julia> findlast('o', "xylophone")
7

julia> findfirst('z', "xylophone")

You can start the search for a character at a given offset by using the functions findnext and findprev:

julia> findnext('o', "xylophone", 1)
4

julia> findnext('o', "xylophone", 5)
7

julia> findprev('o', "xylophone", 5)
4

julia> findnext('o', "xylophone", 8)
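Chaining findfirst and findnext gives every position of a character. The following is a sketch under the assumption that collecting all indices into a vector is acceptable; all_positions is a hypothetical helper name. It advances with nextind so that the search also works when the string contains multibyte characters.

```julia
# `all_positions` is a hypothetical helper, not part of Base.
# It collects every index at which the character `c` occurs in `s`.
function all_positions(c::AbstractChar, s::AbstractString)
    positions = Int[]
    i = findfirst(c, s)
    while i !== nothing
        push!(positions, i)
        # Resume the search at the character following the current hit.
        i = findnext(c, s, nextind(s, i))
    end
    return positions
end

all_positions('o', "xylophone")   # [4, 7]
all_positions('z', "xylophone")   # Int64[] (no occurrences)
```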
You can use the occursin function to check if a substring is found within a string:

julia> occursin("world", "Hello, world.")
true

julia> occursin("o", "Xylophon")
true

julia> occursin("a", "Xylophon")
false

julia> occursin('o', "Xylophon")
true

The last example shows that occursin can also look for a character literal.

Two other handy string functions are repeat and join:

julia> repeat(".:Z:.", 10)
".:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:..:Z:."

julia> join(["apples", "bananas", "pineapples"], ", ", " and ")
"apples, bananas and pineapples"
Some other useful functions include:
- firstindex(str) gives the minimal (byte) index that can be used to index into str (always 1 for strings, not necessarily true for other containers).
- lastindex(str) gives the maximal (byte) index that can be used to index into str.
- length(str) gives the number of characters in str.
- length(str, i, j) gives the number of valid character indices in str from i to j.
- ncodeunits(str) gives the number of code units in a string.
- codeunit(str, i) gives the code unit value in the string str at index i.
- thisind(str, i), given an arbitrary index into a string, finds the first index of the character into which the index points.
- nextind(str, i, n=1) finds the start of the nth character starting after index i.
- prevind(str, i, n=1) finds the start of the nth character starting before index i.

There are situations when you want to construct a string or use string semantics, but the behavior of the standard string construct is not quite what is needed. For these kinds of situations, Julia provides non-standard string literals. A non-standard string literal looks like a regular double-quoted string literal, but is immediately prefixed by an identifier, and may behave differently from a normal string literal.

Regular expressions, byte array literals, and version number literals, as described below, are some examples of non-standard string literals. Users and packages may also define new non-standard string literals. Further documentation is given in the Metaprogramming section.
Sometimes you are not looking for an exact string, but a particular pattern. For example, suppose you are trying to extract a single date from a large text file. You don’t know what that date is (that’s why you are searching for it), but you do know it will look something like YYYY-MM-DD. Regular expressions allow you to specify these patterns and search for them.

Julia uses version 2 of Perl-compatible regular expressions (regexes), as provided by the PCRE library (see the PCRE2 syntax description for more details). Regular expressions are related to strings in two ways: the obvious connection is that regular expressions are used to find regular patterns in strings; the other connection is that regular expressions are themselves input as strings, which are parsed into a state machine that can be used to efficiently search for patterns in strings. In Julia, regular expressions are input using non-standard string literals prefixed with various identifiers beginning with r. The most basic regular expression literal without any options turned on just uses r"...":

julia> re = r"^\s*(?:#|$)"
r"^\s*(?:#|$)"

julia> typeof(re)
Regex

To check if a regex matches a string, use occursin:

julia> occursin(r"^\s*(?:#|$)", "not a comment")
false

julia> occursin(r"^\s*(?:#|$)", "# a comment")
true

As one can see here, occursin simply returns true or false, indicating whether a match for the given regex occurs in the string. Commonly, however, one wants to know not just whether a string matched, but also how it matched. To capture this information about a match, use the match function instead:

julia> match(r"^\s*(?:#|$)", "not a comment")

julia> match(r"^\s*(?:#|$)", "# a comment")
RegexMatch("#")

If the regular expression does not match the given string, match returns nothing, a special value that does not print anything at the interactive prompt. Other than not printing, it is a completely normal value and you can test for it programmatically:

m = match(r"^\s*(?:#|$)", line)
if m === nothing
    println("not a comment")
else
    println("blank or comment")
end

If a regular expression does match, the value returned by match is a RegexMatch object. These objects record how the expression matches, including the substring that the pattern matches and any captured substrings, if there are any. This example only captures the portion of the substring that matches, but perhaps we want to capture any non-blank text after the comment character. We could do the following:

julia> m = match(r"^\s*(?:#\s*(.*?)\s*$)", "# a comment ")
RegexMatch("# a comment ", 1="a comment")
When calling match, you have the option to specify an index at which to start the search. For example:

julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",1)
RegexMatch("1")

julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",6)
RegexMatch("2")

julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",11)
RegexMatch("3")

You can extract the following info from a RegexMatch object:

- the entire substring matched: m.match
- the captured substrings as an array of strings: m.captures
- the offset at which the whole match begins: m.offset
- the offsets of the captured substrings as a vector: m.offsets

When a capture doesn't match, m.captures contains nothing in that position instead of a substring, and m.offsets has a zero offset (recall that indices in Julia are 1-based, so a zero offset into a string is invalid). Here is a pair of somewhat contrived examples:

julia> m = match(r"(a|b)(c)?(d)", "acd")
RegexMatch("acd", 1="a", 2="c", 3="d")

julia> m.match
"acd"

julia> m.captures
3-element Vector{Union{Nothing, SubString{String}}}:
 "a"
 "c"
 "d"

julia> m.offset
1

julia> m.offsets
3-element Vector{Int64}:
 1
 2
 3

julia> m = match(r"(a|b)(c)?(d)", "ad")
RegexMatch("ad", 1="a", 2=nothing, 3="d")

julia> m.match
"ad"

julia> m.captures
3-element Vector{Union{Nothing, SubString{String}}}:
 "a"
 nothing
 "d"

julia> m.offset
1

julia> m.offsets
3-element Vector{Int64}:
 1
 0
 2

It is convenient to have captures returned as an array so that one can use destructuring syntax to bind them to local variables. As a convenience, the RegexMatch object implements iterator methods that pass through to the captures field, so you can destructure the match object directly:

julia> first, second, third = m; first
"a"

Captures can also be accessed by indexing the RegexMatch object with the number or name of the capture group:

julia> m=match(r"(?<hour>\d+):(?<minute>\d+)","12:45")
RegexMatch("12:45", hour="12", minute="45")

julia> m[:minute]
"45"

julia> m[2]
"45"
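Putting match, the nothing check, and named captures together, the date-extraction task mentioned at the start of this section could be sketched as follows. find_date is a hypothetical helper, and the pattern assumes dates are written as YYYY-MM-DD; it does not check that the month and day are in range.

```julia
# `find_date` is a hypothetical helper, not part of Base.
# It returns the first YYYY-MM-DD date in `text` as a named tuple, or `nothing`.
function find_date(text::AbstractString)
    m = match(r"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})", text)
    m === nothing && return nothing          # no date-like substring found
    return (year  = parse(Int, m[:year]),
            month = parse(Int, m[:month]),
            day   = parse(Int, m[:day]))
end

find_date("released on 2021-03-24, patched later")   # (year = 2021, month = 3, day = 24)
find_date("no date here")                            # nothing
```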
Captures can be referenced in a substitution string when using replace by using \n to refer to the nth capture group and prefixing the substitution string with s. Capture group 0 refers to the entire match object. Named capture groups can be referenced in the substitution with \g<groupname>. For example:

julia> replace("first second", r"(\w+) (?<agroup>\w+)" => s"\g<agroup> \1")
"second first"

Numbered capture groups can also be referenced as \g<n> for disambiguation, as in:

julia> replace("a", r"." => s"\g<0>1")
"a1"
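As a slightly larger illustration, numbered captures can reorder the parts of every match in a string. The input text and the target format here are made up for the example.

```julia
iso = "Due 2021-03-24, shipped 2021-04-02."

# Swap the capture groups to rewrite each YYYY-MM-DD as DD/MM/YYYY.
european = replace(iso, r"(\d{4})-(\d{2})-(\d{2})" => s"\3/\2/\1")
# european == "Due 24/03/2021, shipped 02/04/2021."
```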
You can modify the behavior of regular expressions by some combination of the flags i, m, s, and x after the closing double quote mark. These flags have the same meaning as they do in Perl, as explained in this excerpt from the perlre manpage:

i   Do case-insensitive pattern matching. If locale matching rules are in effect, the case map is taken from the current locale for code points less than 255, and from Unicode rules for larger code points. However, matches that would cross the Unicode rules/non-Unicode rules boundary (ords 255/256) will not succeed.

m   Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of the string to matching the start or end of any line anywhere within the string.

s   Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match. Used together, as r""ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.

x   Tells the regular expression parser to ignore most whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The '#' character is also treated as a metacharacter introducing a comment, just as in ordinary code.

For example, the following regex has the first three flags (i, s, and m) turned on:

julia> r"a+.*b+.*d$"ism
r"a+.*b+.*d$"ims

julia> match(r"a+.*b+.*d$"ism, "Goodbye,\nOh, angry,\nBad world\n")
RegexMatch("angry,\nBad world")

The r"..." literal is constructed without interpolation and unescaping (except for quotation mark " which still has to be escaped). Here is an example showing the difference from standard string literals:

julia> x = 10
10

julia> r"$x"
r"$x"

julia> "$x"
"10"

julia> r"\x"
r"\x"

julia> "\x"
ERROR: syntax: invalid escape sequence

Triple-quoted regex strings, of the form r"""...""", are also supported (and may be convenient for regular expressions containing quotation marks or newlines).
The Regex() constructor may be used to create a valid regex string programmatically. This permits using the contents of string variables and other string operations when constructing the regex string. Any of the regex codes above can be used within the single string argument to Regex(). Here are some examples:

julia> using Dates

julia> d = Date(1962,7,10)
1962-07-10

julia> regex_d = Regex("Day " * string(day(d)))
r"Day 10"

julia> match(regex_d, "It happened on Day 10")
RegexMatch("Day 10")

julia> name = "Jon"
"Jon"

julia> regex_name = Regex("[\"( ]\\Q$name\\E[\") ]") # interpolate value of name
r"[\"( ]\QJon\E[\") ]"

julia> match(regex_name, " Jon ")
RegexMatch(" Jon ")

julia> match(regex_name, "[Jon]") === nothing
true

Note the use of the \Q...\E escape sequence. All characters between the \Q and the \E are interpreted as literal characters. This is convenient for matching characters that would otherwise be regex metacharacters. However, caution is needed when using this feature together with string interpolation, since the interpolated string might itself contain the \E sequence, unexpectedly terminating literal matching. User inputs need to be sanitized before inclusion in a regex.
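One possible way to sanitize such input is sketched below. It relies on a common PCRE idiom rather than on anything provided by Base, and literal_regex is a hypothetical helper name: every \E inside the input closes the quoted region, is matched as a literal backslash followed by E, and then the quoted region is reopened.

```julia
# `literal_regex` is a hypothetical helper, not part of Base.
# It builds a Regex that matches `text` literally, even if `text` contains \E.
function literal_regex(text::AbstractString)
    # Replace each \E with: \E (close quoting), \\E (literal backslash + E), \Q (reopen).
    sanitized = replace(text, "\\E" => "\\E\\\\E\\Q")
    return Regex("\\Q" * sanitized * "\\E")
end

occursin(literal_regex("a.b"), "a.b")   # true
occursin(literal_regex("a.b"), "axb")   # false: the dot is matched literally
```

Input that itself contains \E, such as "x\\E.*", is still matched character for character instead of ending the quoted region early.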
Another useful non-standard string literal is the byte-array string literal: b"...". This form lets you use string notation to express read-only literal byte arrays, i.e. arrays of UInt8 values. The type of those objects is CodeUnits{UInt8, String}. The rules for byte array literals are the following:

- ASCII characters and ASCII escapes produce a single byte.
- \x and octal escape sequences produce the byte corresponding to the escape value.
- Unicode escape sequences produce a sequence of bytes encoding that code point in UTF-8.

There is some overlap between these rules since the behavior of \x and octal escapes less than 0x80 (128) is covered by both of the first two rules, but here these rules agree. Together, these rules allow one to easily use ASCII characters, arbitrary byte values, and UTF-8 sequences to produce arrays of bytes. Here is an example using all three:

julia> b"DATA\xff\u2200"
8-element Base.CodeUnits{UInt8, String}:
 0x44
 0x41
 0x54
 0x41
 0xff
 0xe2
 0x88
 0x80

The ASCII string "DATA" corresponds to the bytes 68, 65, 84, 65. \xff produces the single byte 255. The Unicode escape \u2200 is encoded in UTF-8 as the three bytes 226, 136, 128. Note that the resulting byte array does not correspond to a valid UTF-8 string:

julia> isvalid("DATA\xff\u2200")
false
As it was mentionedCodeUnits{UInt8, String}
type behaves like read only array ofUInt8
and if you need a standard vector you can convert it usingVector{UInt8}
:
julia> x = b"123"3-element Base.CodeUnits{UInt8, String}: 0x31 0x32 0x33julia> x[1]0x31julia> x[1] = 0x32ERROR: CanonicalIndexError: setindex! not defined for Base.CodeUnits{UInt8, String}[...]julia> Vector{UInt8}(x)3-element Vector{UInt8}: 0x31 0x32 0x33
Also observe the significant distinction between \xff and \uff: the former escape sequence encodes the byte 255, whereas the latter escape sequence represents the code point 255, which is encoded as two bytes in UTF-8:
julia> b"\xff"
1-element Base.CodeUnits{UInt8, String}:
 0xff

julia> b"\uff"
2-element Base.CodeUnits{UInt8, String}:
 0xc3
 0xbf
Character literals use the same behavior.
For code points less than \u80, it happens that the UTF-8 encoding of each code point is just the single byte produced by the corresponding \x escape, so the distinction can safely be ignored. For the escapes \x80 through \xff as compared to \u80 through \uff, however, there is a major difference: the former escapes all encode single bytes, which – unless followed by very specific continuation bytes – do not form valid UTF-8 data, whereas the latter escapes all represent Unicode code points with two-byte encodings.
If this is all extremely confusing, try reading "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets". It's an excellent introduction to Unicode and UTF-8, and may help alleviate some confusion regarding the matter.
Version numbers can easily be expressed with non-standard string literals of the form v"...". Version number literals create VersionNumber objects which follow the specifications of semantic versioning, and therefore are composed of major, minor and patch numeric values, followed by pre-release and build alphanumeric annotations. For example, v"0.2.1-rc1+win64" is broken into major version 0, minor version 2, patch version 1, pre-release rc1 and build win64. When entering a version literal, everything except the major version number is optional, therefore e.g. v"0.2" is equivalent to v"0.2.0" (with empty pre-release/build annotations), v"2" is equivalent to v"2.0.0", and so on.
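The decomposition described above can be inspected through the fields of a VersionNumber. This is a brief sketch; the pre-release and build annotations are stored as tuples of identifiers.

```julia
v = v"0.2.1-rc1+win64"

v.major == 0 && v.minor == 2 && v.patch == 1   # true
v.prerelease == ("rc1",)                       # true
v.build == ("win64",)                          # true

# Omitted components default to zero and to empty annotations.
v"0.2" == v"0.2.0"                             # true
v"2" == v"2.0.0"                               # true

# A version held in a runtime string can be parsed with the constructor.
VersionNumber("1.4.2") == v"1.4.2"             # true
```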
VersionNumber objects are mostly useful to easily and correctly compare two (or more) versions. For example, the constant VERSION holds the Julia version number as a VersionNumber object, and therefore one can define some version-specific behavior using simple statements such as:
if v"0.2" <= VERSION < v"0.3-"
    # do something specific to 0.2 release series
end
Note that in the above example the non-standard version number v"0.3-" is used, with a trailing hyphen (-): this notation is a Julia extension of the standard, and it's used to indicate a version which is lower than any 0.3 release, including all of its pre-releases. So in the above example the code would only run with stable 0.2 versions, and exclude such versions as v"0.3.0-rc1". In order to also allow for unstable (i.e. pre-release) 0.2 versions, the lower bound check should be modified like this: v"0.2-" <= VERSION.
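To make the effect of the trailing - concrete, the following sketch wraps the two range checks in small functions and applies them to a few literal versions instead of VERSION. The function names in_02_series and in_02_series_or_prerelease are hypothetical and exist only for this illustration.

```julia
# Hypothetical helpers for illustration; `v` stands in for VERSION.
in_02_series(v::VersionNumber)               = v"0.2"  <= v < v"0.3-"
in_02_series_or_prerelease(v::VersionNumber) = v"0.2-" <= v < v"0.3-"

in_02_series(v"0.2.5")                       # true
in_02_series(v"0.3.0-rc1")                   # false: v"0.3-" sorts below every 0.3 pre-release
in_02_series(v"0.2.0-rc1")                   # false: pre-releases sort below v"0.2"
in_02_series_or_prerelease(v"0.2.0-rc1")     # true: v"0.2-" sorts below every 0.2 pre-release
```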
Another non-standard version specification extension allows one to use a trailing + to express an upper limit on build versions, e.g. VERSION > v"0.2-rc1+" can be used to mean any version above 0.2-rc1 and any of its builds: it will return false for version v"0.2-rc1+win64" and true for v"0.2-rc2".
It is good practice to use such special versions in comparisons (particularly, the trailing - should always be used on upper bounds unless there's a good reason not to), but they must not be used as the actual version number of anything, as they are invalid in the semantic versioning scheme.
Besides being used for the VERSION constant, VersionNumber objects are widely used in the Pkg module, to specify package versions and their dependencies.
Raw strings without interpolation or unescaping can be expressed with non-standard string literals of the form raw"...". Raw string literals create ordinary String objects which contain the enclosed contents exactly as entered with no interpolation or unescaping. This is useful for strings which contain code or markup in other languages which use $ or \ as special characters.
The exception is that quotation marks still must be escaped, e.g. raw"\"" is equivalent to "\"". To make it possible to express all strings, backslashes then also must be escaped, but only when appearing right before a quote character:
julia> println(raw"\\ \\\"")
\\ \"
Notice that the first two backslashes appear verbatim in the output, since they do not precede a quote character. However, the next backslash character escapes the backslash that follows it, and the last backslash escapes a quote, since these backslashes appear before a quote.
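A few comparisons between raw string literals and their ordinary-literal equivalents may help illustrate these rules. The sample contents (a Windows-style path, shell-style variables, a regex fragment) are arbitrary examples.

```julia
# Backslashes and dollar signs are taken verbatim.
raw"C:\Users\name" == "C:\\Users\\name"     # true
raw"$HOME/$user" == "\$HOME/\$user"         # true
raw"\d+\.\d+" == "\\d+\\.\\d+"              # true
length(raw"\n") == 2                        # true: a backslash and the letter n

# Quotes must still be escaped, and backslashes only right before a quote.
raw"\"" == "\""                             # true
raw"\\ \\\"" == "\\\\ \\\""                 # true
```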
The API for AnnotatedStrings is considered experimental and is subject to change between Julia versions.
It is sometimes useful to be able to hold metadata relating to regions of a string. An AnnotatedString wraps another string and allows for regions of it to be annotated with labelled values (:label => value). All generic string operations are applied to the underlying string. However, when possible, styling information is preserved. This means you can manipulate an AnnotatedString (taking substrings, padding them, concatenating them with other strings) and the metadata annotations will "come along for the ride".
This string type is fundamental to the StyledStrings stdlib, which uses :face-labelled annotations to hold styling information.
When concatenating an AnnotatedString, take care to use annotatedstring instead of string if you want to keep the string annotations.
julia> str = Base.AnnotatedString("hello there", [(1:5, :word, :greeting), (7:11, :label, 1)])
"hello there"

julia> length(str)
11

julia> lpad(str, 14)
"   hello there"

julia> typeof(lpad(str, 7))
Base.AnnotatedString{String}

julia> str2 = Base.AnnotatedString(" julia", [(2:6, :face, :magenta)])
" julia"

julia> Base.annotatedstring(str, str2)
"hello there julia"

julia> str * str2 == Base.annotatedstring(str, str2) # *-concatenation still works
true
The annotations of an AnnotatedString can be accessed and modified via the annotations and annotate! functions.
This document was generated with Documenter.jl version 1.8.0 on Wednesday 9 July 2025. Using Julia version 1.11.6.