A byte string library for Rust

Sep 7, 2022

bstr is a byte string library for Rust and its 1.0 version has just been released! It provides string oriented operations on arbitrary sequences of bytes, but is most useful when those bytes are UTF-8. In other words, it provides a string type that is UTF-8 by convention, whereas Rust’s built-in string types are guaranteed to be UTF-8.

This blog will briefly describe the API, do a deep dive on the motivation for creating bstr, show some short example programs using bstr and conclude with a few thoughts.

Target audience: Rust programmers with some background knowledge of UTF-8.

Brief API overview

The bstr crate works primarily by defining two extension traits, ByteSlice and ByteVec, that add string oriented methods to the standard library &[u8] and Vec<u8> types.

Since methods are added to existing types, bstr does not require you to use any new string types to get access to its APIs. However, bstr does provide its own BStr and BString string types that mirror the standard library &str and String string types. The main purpose of these types is for use in public APIs, for communicating intent, to gain access to string oriented Debug impls and for optional integration with Serde.

That’s pretty much it. The extension traits are where most of the APIs live. Most of those APIs look very similar if not identical to the APIs provided by the standard library &str and String types. The main difference is that bstr’s APIs don’t require valid UTF-8. For some APIs like substring search, UTF-8 validity isn’t a concern at all. For other APIs like iterators over chars, invalid UTF-8 is handled by substituting the Unicode replacement codepoint (U+FFFD).

One last thing to mention is the B function, which I’ll use occasionally in this blog. See the API docs for a complete explanation, but it makes it slightly more convenient to write byte slices. Namely, while "foo" has type &'static str, the corresponding byte string b"foo" has type &'static [u8; 3]. In some cases, the use of an array leads to annoyances, for example, vec!["a", "ab"] compiles but vec![b"a", b"ab"] does not.

Quick examples

If you want to follow along at home with the examples, then a simple binary Rust program is sufficient:

$ mkdir bstrblog
$ cd bstrblog
$ touch main.rs
$ cargo init --bin
$ cargo add bstr

Then open main.rs in your favorite editor and paste examples in there. Run your program with cargo run --release.

First up is an example demonstrating substring search. Both the needle and the haystack can be arbitrary bytes, just like the classic C memmem routine:

use bstr::ByteSlice;

fn main() {
    let haystack = b"foo bar foo\xFF\xFFfoo quux foo";
    let mut matches = vec![];
    for start in haystack.find_iter("foo") {
        matches.push(start);
    }
    assert_eq!(matches, [0, 8, 13, 22]);
}

This makes use of the ByteSlice::find_iter method. Unlike the standard library, bstr doesn’t define polymorphic substring search APIs and instead keeps things a little more concrete. For example, to search for a char, you can use the ByteSlice::find_char method.

Note that if all you need is substring search on arbitrary bytes, or even just a SIMD accelerated substring search which the standard library doesn’t yet have, then you can avoid bringing in all of bstr and just use the memmem sub-module of the memchr crate instead. It’s the same substring search that powers ripgrep.

Here’s another example that demonstrates Unicode-aware uppercasing, but on text that is not valid UTF-8.

use bstr::{B, ByteSlice};

fn main() {
    // \xCE\xB2 is the UTF-8 encoding of β.
    let lower = b"\xFF hello \xCE\xB2";
    let upper = lower.to_uppercase();
    // \xCE\x92 is the UTF-8 encoding of Β.
    assert_eq!(B(b"\xFF HELLO \xCE\x92"), upper);
    // Why use 'B' here? Because otherwise its type is &[u8; N] and
    // there is no PartialEq impl for it and Vec<u8>. Another way to
    // write it would have been &b"\xFF HEL..."[..].
}

The above example demonstrates that invalid UTF-8 doesn’t actually prevent one from applying Unicode-aware algorithms on the parts of the string that are valid UTF-8. The parts that are invalid UTF-8 are simply ignored.

Iterating over chars also works just fine even when the byte string is not entirely valid UTF-8. The parts that are invalid UTF-8 simply get substituted with the Unicode replacement codepoint:

use bstr::ByteSlice;

fn main() {
    // We write out the raw encoding here because it isn't possible to
    // write a string literal in Rust that has both Unicode literals
    // and invalid UTF-8.
    let bytes = b"\xE2\x98\x83\xFF\xF0\x9D\x9E\x83\xE2\x98\x61";
    let chars: Vec<char> = bytes.chars().collect();
    assert_eq!(vec!['☃', '\u{FFFD}', '𝞃', '\u{FFFD}', 'a'], chars);
}

This next example shows one of the most useful aspects of bstr: the ability to get nice Debug representations of byte strings. The downside is that you need to convert your byte string to a BStr first, because there is no way to override the standard library Debug impl for &[u8]. (Which just prints each byte as a decimal number.)

use bstr::ByteSlice;

fn main() {
    // \xCE\xB2 is the UTF-8 encoding of β.
    let bytes = b"\xFF hello \xCE\xB2";
    println!("{:?}", bytes);
    // Output: [255, 32, 104, 101, 108, 108, 111, 32, 206, 178]
    println!("{:?}", bytes.as_bstr());
    // Output: "\xFF hello β"
}

Motivation based on concepts

Rust’s primary string types (&str/String) are perfectly fine for nearly everything. And having the property that strings are guaranteed to be valid UTF-8 can be quite useful in many contexts. That is, it lets you worry about whether your data is malformed or not at the edges, and everything downstream from it can know it’s clean UTF-8 without ever having to worry about what to do when it’s not valid UTF-8.

In other words, if Rust’s primary string types work for your use case, then you should probably ignore bstr altogether and continue using them.

So why have a byte string library? The simplest way to explain it is to point at the std::io::Read trait. How does it work? Well, it says “anything implementing std::io::Read can take a writable slice of bytes, read from its underlying source and put the bytes from the source into the writable slice given.” Do you see anything missing? There’s no guarantee whatsoever about what those bytes are. They can be anything. They might be an image. They might be a video. Or a PDF. Or a plain text file.

In other words, the fundamental API we use to interact with data streams doesn’t make any guarantees about the nature of that stream. This is by design and it isn’t a Rust problem. On most mainstream operating systems, this is how files themselves are represented. They are just sequences of bytes. The format of those bytes usually exists at some other layer or is determined through some additional context.

Let’s try to make this concrete by considering how a grep program works: it reads lines from stdin and prints lines that match a literal (so no regex support). We’ll write it using strings, because that’s what you’re supposed to do… right?

use std::io::Write;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let needle = "Affiliate";
    for result in std::io::stdin().lines() {
        let line = result?;
        if line.contains(needle) {
            writeln!(std::io::stdout(), "{}", line)?;
        }
    }
    Ok(())
}

The question is, what do you expect the behavior of this program to be if stdin doesn’t contain valid UTF-8? And does that expectation line up with what you want the behavior of the program to be?

Well, let’s try it out. And there’s no need to make up our own data either, because invalid UTF-8 is more common than you might think. Whether it’s because of test data, or just not using UTF-8 (perhaps latin-1 instead) or just outright errors. The Linux kernel as of a few years ago used to have plenty of C source files that weren’t valid UTF-8. That has since been fixed. But gecko-dev (the source code for Firefox) has plenty of files that aren’t valid UTF-8. Let’s try running our program above on one:

$ path/to/bstrblog < ./third_party/aom/PATENTS
1.3. Defensive Termination. If any Licensee, its Affiliates, or its agents
Error: Error { kind: InvalidData, message: "stream did not contain valid UTF-8" }

(Tip: Check out my little find-invalid-utf8 utility for how I quickly discover files that contain invalid UTF-8. It also doubles as a nice example usage of bstr APIs that would be pretty annoying to write using Rust’s standard library string types.)

Now, to be clear, it is perfectly reasonable for you to say, “I’m okay with this.” Indeed, plain text files that are also not valid UTF-8 are pretty rare. And if you’re building a purpose driven tool for data you control, it’s probably pretty likely that you don’t care about your tool barfing on invalid UTF-8.

On the other hand, if you’re building a general purpose tool, invalid UTF-8 is not quite rare enough to declare you don’t care about it. Rust’s own file path handling, for example, goes to great pains to ensure it can handle all possible file paths, even when they don’t conform to any UTF encoding.

So… what do you do? There are a few choices I can think of:

  1. Skip any line that contains invalid UTF-8 and possibly print a warning message to stderr.
  2. If a line isn’t valid UTF-8, lossily decode it and search that. This does mean you won’t be able to search for a needle that itself contains invalid UTF-8, but it will handle a lot of cases. This is maybe not so bad for our little literal searching tool, but is probably less ideal if your grep program supported regexes. Do you or don’t you want something like .+ to match through invalid UTF-8?
  3. Use byte strings pretty much like you’d use &str/String, but have it work automatically even when there’s invalid UTF-8. And it will work even if the needle contains invalid UTF-8.
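For a sense of what option (2) looks like, here is a minimal sketch using the standard library's lossy decoding; the input bytes are made up for illustration:

```rust
fn main() {
    // A line that is mostly text, plus one invalid byte.
    let line: &[u8] = b"its Affiliates \xFF or its agents";
    // from_utf8_lossy replaces invalid bytes with U+FFFD, so the
    // search itself runs on a valid &str.
    let decoded = String::from_utf8_lossy(line);
    assert!(decoded.contains("Affiliates"));
    assert_eq!(decoded, "its Affiliates \u{FFFD} or its agents");
}
```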

The first two options can be done with Rust’s standard library pretty easily, but the third option cannot be. So if the third option is the choice you want to make, then bstr is probably exactly what you’re looking for. While the standard library provides the &[u8]/Vec<u8> types, there is effectively almost no support for treating them as byte strings. For example, the only substring search that the standard library provides is defined on &str. You can’t use the standard library to do a substring search on &[u8]. So unless you write your own substring search implementation (you probably shouldn’t), you’re probably going to be looking for some kind of crate to do (3).

So what does this program look like with bstr? (It’s about the same size, but I’ve added some comments explaining a few things in the likely event that you’re unfamiliar with bstr.)

use std::io::Write;

// BufReadExt is an extension trait to std::io::BufRead
// that defines a number of byte string oriented methods,
// primarily for iterating over lines. These are useful
// because most of the line iterators in the standard
// library require valid UTF-8!
use bstr::{io::BufReadExt, ByteSlice};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let needle = "Affiliate";
    for result in std::io::stdin().lock().byte_lines() {
        let line = result?;
        // '[T]::contains' is already defined in std and is not
        // substring search, so bstr defines it under a new name and
        // makes it generic so that it accepts anything that implements
        // AsRef<[u8]>.
        if line.contains_str(needle) {
            // We can't use 'writeln!' any more because we
            // want to output exactly what we read, and the
            // 'writeln!' family of macros can only write
            // valid UTF-8.
            std::io::stdout().write_all(&line)?;
            std::io::stdout().write_all(b"\n")?;
        }
    }
    Ok(())
}

Running it gives:

$ bstrblog < ./third_party/aom/PATENTS
1.3. Defensive Termination. If any Licensee, its Affiliates, or its agents
2.1. Affiliate.  Affiliate means an entity that directly or indirectly

In summary, when a program is given data, unless there is some other mechanism for describing what that data is, the program simply does not know how to interpret it. In most cases, this is a bad thing. You really want to have some kind of expected format and barf when the data does not conform.

But for general purpose Unix-like tooling on plain text files, what is your expected format? It’s probably something like “valid UTF-8 with a reasonably small number of bytes between newline characters.” (And an honorable mention for UTF-16 on Windows.) But when there are so many files in practice that just aren’t valid UTF-8 but are still mostly plain text, it winds up being important for your general purpose tool to handle them by simply skipping over those invalid UTF-8 bytes. But crucially, when it comes time for your tool to print its output, like a grep, it’s important for it to print exactly what was read. Doing this with string types that are guaranteed to be valid UTF-8 is often difficult and sometimes just impossible.

(Note: a grep tool doesn’t just search plain text. It can also search binary data. But most grep tools have heuristics for detecting binary data. In those cases, the data is still searched but output is often suppressed unless a special flag is given.)

But this is only half of the story. The other reason why a byte string libraryis useful is performance.

Motivation based on performance

The high level idea for why byte strings might be faster than strings that are guaranteed to be valid UTF-8 is relatively simple. Namely, when you’re expecting to see plain text, in most cases and in most contexts, that plain text is going to be valid UTF-8. If you’re using a byte string library, how much does it cost to build the string in memory? It costs exactly as much as it takes to load the data from the file and into memory. But how much does it cost if your string types are guaranteed to be valid UTF-8? Well, the cost relative to byte strings is UTF-8 validation, which requires a full scan over the string.

That’s pretty much it. Byte strings optimistically assume your strings are UTF-8 and deal with invalid UTF-8 by defining some reasonable behavior on all of their APIs for when invalid UTF-8 is encountered. In contrast, strings that are guaranteed to be valid UTF-8 have to pessimistically assume the data might be invalid UTF-8. Thus, UTF-8 validation must run during construction. Strings that are guaranteed to be valid UTF-8 do have some performance upsides on usage, for example, iterating over chars in a string can be faster because code that knows and can rely on valid UTF-8 is likely to be faster than code that needs to deal with error conditions.
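The construction cost difference shows up directly in the standard library's own conversion APIs; here is a minimal sketch:

```rust
fn main() {
    let data = b"some plain text".to_vec();
    // A byte string is "built" by just taking ownership of the bytes.
    let byte_string: Vec<u8> = data.clone();
    assert_eq!(byte_string.len(), 15);
    // A String must be validated first: from_utf8 scans every byte
    // and returns an error if any of them aren't valid UTF-8.
    let string = String::from_utf8(data).unwrap();
    assert_eq!(string, "some plain text");
}
```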

(Musing: it is perhaps possible to build a string type between these two design points, but Rust’s &str API as it is certainly requires validation to be run before permitting the construction of a &str. Otherwise it’s not really possible to call it a type that guarantees valid UTF-8.)

(Fun fact: str types merely have a safety invariant that they are always valid UTF-8. They used to have a language level validity invariant, but this was relaxed some time ago. This doesn’t have much practical impact, but the short story is that building a str that isn’t valid UTF-8 isn’t instantly undefined behavior, but using almost any API on str will probably result in undefined behavior.)

Let’s continue with our grep example in the previous section. We’ll start with the grep program that uses &str/String for string types with a couple tweaks:

  • We use a needle that we know occurs a small number of times in the haystack. This simplifies our benchmarking model somewhat by declaring that all we care about is raw throughput. (Remember, all models are wrong, but some are useful. This is not the only model, but it’s a decent one that balances real world use cases with simplicity.) We could instead choose a needle that never matches, but it’s good sense to pick one that matches sometimes to know that we’re actually solving the problem we set out to solve: to print matching lines. When it comes to throughput of a grep program, there is no practical difference between printing a small number of matching lines and printing no matching lines.
  • Despite chiding one against implementing a substring search algorithm above, we do exactly that here to ensure we are comparing apples to apples. In particular, bstr’s substring search (which just uses the memchr::memmem implementation) is oodles faster than the standard library’s due to its SIMD acceleration. We’ll do a quick bonus round at the end of this section to show the difference.
  • We do our best to eliminate “obvious” performance problems by amortizing allocations. That is, we don’t create a whole new allocation for every line. However, we resist the urge to “write a fast grep.” I cover that elsewhere, and instead stick to a decently simple program.
  • We change our haystack to something a bit bigger. If we use a haystack that’s too small, then our benchmarking model likely needs to become more complicated to account for noise. But if the haystack is big enough, noise is unlikely to meaningfully impact our measurements.

To generate the haystack, clone the Rust repo and concatenate all Rust files into a single file. For this blog post, I used commit b44197abb0b3ffe4908892e1e08ab1cd721ff3b9. Note also that the files are sorted before concatenating, so that the result is guaranteed to be deterministic. I’ve also duplicated the result 5 times to make it just a little bigger. Here are the precise commands to generate the haystack:

$ git clone https://github.com/rust-lang/rust
$ cd rust
$ git checkout b44197abb0b3ffe4908892e1e08ab1cd721ff3b9
$ find ./ -regex '[^ ]+\.rs' | LC_ALL=C sort | xargs cat > /tmp/rust.rs
$ for ((i=0; i<5; i++)); do cat /tmp/rust.rs; done > /tmp/rust.5x.rs

So with that, here’s our revised grep program:

use std::io::{BufRead, Write};

type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

fn main() -> Result<()> {
    let shiftor = ShiftOr::new("Sushi")?;
    let mut rdr = std::io::stdin().lock();
    let mut line = String::new();
    loop {
        line.clear();
        let nread = rdr.read_line(&mut line)?;
        if nread == 0 {
            break;
        }
        if shiftor.find(&line).is_some() {
            // 'read_line' doesn't strip the line
            // terminator, so no need to write our own.
            std::io::stdout().write_all(line.as_bytes())?;
        }
    }
    Ok(())
}

#[derive(Debug)]
struct ShiftOr {
    masks: [u8; 256],
    needle_len: usize,
}

impl ShiftOr {
    fn new<T: AsRef<[u8]>>(needle: T) -> Result<ShiftOr> {
        let needle = needle.as_ref();
        let needle_len = needle.len();
        if needle_len > 7 {
            // A match is found when bit 7 is set in 'result' in the search
            // routine below. So our needle can't be bigger than 7. We could
            // permit bigger needles by using u16, u32 or u64 for our mask
            // entries. But this is all we need for this example.
            return Err("needle exceeds 7 bytes, too long".into());
        }
        let mut searcher = ShiftOr { masks: [!0; 256], needle_len };
        for (i, &byte) in needle.iter().enumerate() {
            searcher.masks[usize::from(byte)] &= !(1 << i);
        }
        Ok(searcher)
    }

    fn find<T: AsRef<[u8]>>(&self, haystack: T) -> Option<usize> {
        let haystack = haystack.as_ref();
        let mut result = !1u8;
        for (i, &byte) in haystack.iter().enumerate() {
            result |= self.masks[usize::from(byte)];
            result <<= 1;
            if result & (1 << self.needle_len) == 0 {
                return Some(i - self.needle_len + 1);
            }
        }
        None
    }
}

(Note: I chose to use the Shift-Or algorithm because it’s simple and visits each byte in the haystack at most once. It’s not going to come close to something that is SIMD accelerated in terms of performance, but it keeps our benchmark model simple without making it totally naive. The other thing to notice here is that our substring search implementation works on &[u8]. There is absolutely nothing about it that requires &str. And this is indeed true for just about all substring search algorithms. It’s just dealing with bytes and doesn’t care about UTF-8 at all. The key thing to remember is that if it weren’t for trying to do an apples-to-apples comparison here, we would of course be using the standard library str::contains method. We have to kind of hold our nose a little bit here and acknowledge that the Shift-Or code is purely a property of our measurement model. We’ll explore what “real world” code performance looks like later.)

Now let’s build our code and run it. Since we chose a needle that occurs exactly three times, we should see three lines of output on the smaller haystack (which will be duplicated 5 times in the bigger haystack):

$ cargo build --release
$ cp ./target/release/bstrblog grep-str
$ ./grep-str < /tmp/rust.rs
        repo: "https://github.com/BurntSushi/ripgrep",
        repo: "https://github.com/BurntSushi/xsv",
//! @BurntSushi.

Now let’s write the byte string version. If we did everything right, the only operational difference between this program and the previous one is that this program doesn’t do UTF-8 validation. But everything else should be the same. Note that we omit the ShiftOr type from this code listing since it remains unchanged.

use std::io::{BufRead, Write};

type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

fn main() -> Result<()> {
    let shiftor = ShiftOr::new("Sushi")?;
    let mut rdr = std::io::stdin().lock();
    let mut line = Vec::new();
    loop {
        line.clear();
        let nread = rdr.read_until(b'\n', &mut line)?;
        if nread == 0 {
            break;
        }
        if shiftor.find(&line).is_some() {
            // 'read_until' doesn't strip the line
            // terminator, so no need to write our own.
            std::io::stdout().write_all(&line)?;
        }
    }
    Ok(())
}

Now build the new program and save the binary to a different name like we did above. Also do our quick test to ensure it’s doing the same work:

$ cp ./target/release/bstrblog ./grep-bytes
$ ./grep-bytes < /tmp/rust.rs
        repo: "https://github.com/BurntSushi/ripgrep",
        repo: "https://github.com/BurntSushi/xsv",
//! @BurntSushi.

Now let’s bake them off with Hyperfine:

$ hyperfine --warmup 5 "./grep-str < /tmp/rust.5x.rs" "./grep-bytes < /tmp/rust.5x.rs"
Benchmark #1: ./grep-str < /tmp/rust.5x.rs
  Time (mean ± σ):     573.0 ms ±   5.1 ms    [User: 531.1 ms, System: 41.3 ms]
  Range (min … max):   567.1 ms … 583.5 ms    10 runs
Benchmark #2: ./grep-bytes < /tmp/rust.5x.rs
  Time (mean ± σ):     449.2 ms ±   2.0 ms    [User: 407.5 ms, System: 41.2 ms]
  Range (min … max):   446.6 ms … 452.6 ms    10 runs
Summary
  './grep-bytes < /tmp/rust.5x.rs' ran
    1.28 ± 0.01 times faster than './grep-str < /tmp/rust.5x.rs'

There we go. The byte string version of the program is 1.28 times faster. It’s not Earth shattering by any means, but it’s also nothing to sneeze at either.

Before popping up a level to discuss our findings, I think it would be fun to compare the performance of similar programs, but without our Shift-Or implementation. At the very least, it should tell us what kind of mistake we would have made if we had just assumed that the standard library’s substring search had the same performance characteristics as the one found in bstr.

So here’s the program using standard library routines:

use std::io::{BufRead, Write};

type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

fn main() -> Result<()> {
    let needle = "Sushi";
    let mut rdr = std::io::stdin().lock();
    let mut line = String::new();
    loop {
        line.clear();
        let nread = rdr.read_line(&mut line)?;
        if nread == 0 {
            break;
        }
        if line.contains(needle) {
            std::io::stdout().write_all(line.as_bytes())?;
        }
    }
    Ok(())
}

And now the program using bstr:

use std::io::{BufRead, Write};

use bstr::ByteSlice;

type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

fn main() -> Result<()> {
    let needle = "Sushi";
    let mut rdr = std::io::stdin().lock();
    let mut line = Vec::new();
    loop {
        line.clear();
        let nread = rdr.read_until(b'\n', &mut line)?;
        if nread == 0 {
            break;
        }
        if line.contains_str(needle) {
            std::io::stdout().write_all(line.as_bytes())?;
        }
    }
    Ok(())
}

And now let’s bake them off with Hyperfine again:

$ hyperfine --warmup 5 "./grep-simple-str < /tmp/rust.5x.rs" "./grep-simple-bytes < /tmp/rust.5x.rs"
Benchmark #1: ./grep-simple-str < /tmp/rust.5x.rs
  Time (mean ± σ):     710.2 ms ±   7.7 ms    [User: 667.6 ms, System: 41.9 ms]
  Range (min … max):   704.1 ms … 731.1 ms    10 runs
Benchmark #2: ./grep-simple-bytes < /tmp/rust.5x.rs
  Time (mean ± σ):     549.6 ms ±   2.9 ms    [User: 509.5 ms, System: 39.6 ms]
  Range (min … max):   546.5 ms … 556.1 ms    10 runs
Summary
  './grep-simple-bytes < /tmp/rust.5x.rs' ran
    1.29 ± 0.02 times faster than './grep-simple-str < /tmp/rust.5x.rs'

Errrmmm… Wait, the byte string version is 1.29 times faster, which is almost identical to our apples-to-apples version above. And on top of that, both programs are slower than our Shift-Or programs despite supposedly both using much fancier and faster substring search algorithms. Indeed, it turns out that a significant portion of the runtime of our program (~25% from a quick glance at a profile) is the construction of the substring searcher. Ouch! This means we’re really not measuring just throughput, but some combination of search throughput and searcher construction.

Indeed, if you look back to our apples-to-apples comparison above, you’ll notice that the ShiftOr searcher is constructed once at the start of the program because our needle is invariant throughout the lifetime of the program. Rebuilding it for every line doesn’t make sense.

So now what? Well, the standard library doesn’t really give us any options. You can’t build a substring searcher once like we did for ShiftOr. You just have to call contains every time and eat the overhead.

bstr does provide a way to build the searcher once, though, via its Finder API:

use std::io::{BufRead, Write};

use bstr::ByteSlice;

type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

fn main() -> Result<()> {
    let searcher = bstr::Finder::new("Sushi");
    let mut rdr = std::io::stdin().lock();
    let mut line = Vec::new();
    loop {
        line.clear();
        let nread = rdr.read_until(b'\n', &mut line)?;
        if nread == 0 {
            break;
        }
        if searcher.find(&line).is_some() {
            std::io::stdout().write_all(line.as_bytes())?;
        }
    }
    Ok(())
}

And now let’s bake this one off against our standard library routine:

$ hyperfine --warmup 5 "./grep-simple-str < /tmp/rust.5x.rs" "./grep-opt1-bytes < /tmp/rust.5x.rs"
Benchmark #1: ./grep-simple-str < /tmp/rust.5x.rs
  Time (mean ± σ):     708.1 ms ±   2.4 ms    [User: 663.6 ms, System: 43.9 ms]
  Range (min … max):   704.5 ms … 710.8 ms    10 runs
Benchmark #2: ./grep-opt1-bytes < /tmp/rust.5x.rs
  Time (mean ± σ):     329.6 ms ±   2.9 ms    [User: 290.8 ms, System: 38.5 ms]
  Range (min … max):   326.7 ms … 335.1 ms    10 runs
Summary
  './grep-opt1-bytes < /tmp/rust.5x.rs' ran
    2.15 ± 0.02 times faster than './grep-simple-str < /tmp/rust.5x.rs'

This is now the fastest grep program we’ve written. There is simply not an apples-to-apples comparison we can do with the standard library here, because the API is not available.

There is one more bstr API that helps things: its BufReadExt extension trait. It provides internal iterators over lines in a buffered reader. In effect, it lets one avoid an additional copy of the bytes into our caller provided line buffer in the code above. In exchange, we have to provide a closure and invent our own protocol for stopping iteration early:

use std::io::Write;

use bstr::{io::BufReadExt, ByteSlice};

type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;

fn main() -> Result<()> {
    let searcher = bstr::Finder::new("Sushi");
    let mut rdr = std::io::stdin().lock();
    rdr.for_byte_line_with_terminator(|line| {
        if searcher.find(line).is_some() {
            std::io::stdout().write_all(line.as_bytes())?;
        }
        Ok(true)
    })?;
    Ok(())
}

Baking this one off against our standard library routine gives us our best run yet:

$ hyperfine --warmup 5 "./grep-simple-str < /tmp/rust.5x.rs" "./grep-opt2-bytes < /tmp/rust.5x.rs"
Benchmark #1: ./grep-simple-str < /tmp/rust.5x.rs
  Time (mean ± σ):     707.7 ms ±   3.0 ms    [User: 665.3 ms, System: 41.7 ms]
  Range (min … max):   702.4 ms … 712.7 ms    10 runs
Benchmark #2: ./grep-opt2-bytes < /tmp/rust.5x.rs
  Time (mean ± σ):     252.9 ms ±   1.1 ms    [User: 211.2 ms, System: 41.3 ms]
  Range (min … max):   251.3 ms … 254.5 ms    11 runs
Summary
  './grep-opt2-bytes < /tmp/rust.5x.rs' ran
    2.80 ± 0.02 times faster than './grep-simple-str < /tmp/rust.5x.rs'

So where does this leave us? Arguably, this section was just as much about benchmarking methodology as it was about byte strings versus Rust’s default strings. But the benchmarking methodology is critical because it’s important to know what we’re measuring. It also lets us examine some of the strengths and weaknesses of our model.

One strength is that it lets us precisely characterize the performance we’re leaving on the table by doing UTF-8 validation for every line we iterate over. This isn’t a micro-benchmark either. It’s a real program or part of a program that someone might conceivably write, with perhaps a few alterations. Iterating over lines and doing something with each matching line is a common task.

But, this does expose a weakness: the model is simplistic. By being simplistic, we are inherently leaving some performance on the table. For example, we could rearchitect our program to decrease the granularity with which we run UTF-8 validation. Instead of doing it once per line, we might try to do it once per 64KB buffer. Since it’s likely that UTF-8 validation might have some non-zero overhead, that could begin to add up when it’s called for every line. And it’s likely that 64KB is a lot of lines. So it would effectively eliminate that overhead cost.
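A sketch of what coarser-grained validation might look like. Note this simplification validates the entire input in one pass; a real incremental version would also have to deal with a UTF-8 sequence that straddles a buffer boundary:

```rust
fn main() {
    // Pretend this buffer was filled by a reader.
    let buf: Vec<u8> = b"line one\nline two\n".to_vec();
    // One validation pass over the whole buffer instead of one per line.
    let text = std::str::from_utf8(&buf).expect("stream was not valid UTF-8");
    let mut matches = 0;
    for line in text.lines() {
        if line.contains("two") {
            matches += 1;
        }
    }
    assert_eq!(matches, 1);
}
```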

This results in a more complex program, and while it might eliminate the overhead of UTF-8 validation, it does not eliminate the cost of UTF-8 validation itself. That is, regardless of how you architect your program to make UTF-8 validation faster, it will always have a relative disadvantage compared to a program that uses byte strings and might not ever need to care about UTF-8 at all. It is perhaps conceivable that UTF-8 validation could be made to run so fast that it is nearly unnoticeable given the other work the program is doing. But you’ll still need to architect your program around it to ensure you aren’t getting bitten by overhead.

Program rearchitecture can actually make a very significant difference in the grep problem domain. Consider baking off our fastest variant so far with ripgrep (GNU grep achieves a similar speed up):

$ hyperfine --warmup 5 "./grep-opt2-bytes < /tmp/rust.5x.rs" "rg Sushi < /tmp/rust.5x.rs"
Benchmark #1: ./grep-opt2-bytes < /tmp/rust.5x.rs
  Time (mean ± σ):     253.6 ms ±   2.2 ms    [User: 210.1 ms, System: 43.2 ms]
  Range (min … max):   250.6 ms … 257.8 ms    11 runs
Benchmark #2: rg Sushi < /tmp/rust.5x.rs
  Time (mean ± σ):      13.3 ms ±   0.9 ms    [User: 4.6 ms, System: 8.9 ms]
  Range (min … max):    11.3 ms …  16.9 ms    178 runs
Summary
  'rg Sushi < /tmp/rust.5x.rs' ran
   19.01 ± 1.35 times faster than './grep-opt2-bytes < /tmp/rust.5x.rs'

As I said, I’m not going to do a deep dive on how to write a fast grep in this blog, but it’s likely the main reason why ripgrep is so much faster here is that it doesn’t actually iterate over the lines of the input. While line iteration itself can be quite fast, what isn’t fast is calling the substring search implementation over and over again for every line. The substring search implementation has overhead and that overhead adds up. Since matches are rare, ripgrep spends the vast majority of its time in substring search, whereas grep-opt2-bytes spends its time ping-ponging between line iteration and substring search.
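The idea can be sketched as follows. The naive windows-based `find` here stands in for a real SIMD-accelerated searcher, since the point of the sketch is the architecture (search the whole buffer first, recover line boundaries only around matches), not the search itself:

```rust
// Naive substring search; a stand-in for a fast searcher.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

fn main() {
    let haystack: &[u8] = b"nothing\nSushi here\nmore nothing\n";
    let mut out = Vec::new();
    let mut pos = 0;
    while let Some(i) = find(&haystack[pos..], b"Sushi") {
        let abs = pos + i;
        // Only now compute the line boundaries around the match.
        let start = haystack[..abs]
            .iter()
            .rposition(|&b| b == b'\n')
            .map_or(0, |j| j + 1);
        let end = haystack[abs..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(haystack.len(), |j| abs + j + 1);
        out.extend_from_slice(&haystack[start..end]);
        pos = end;
    }
    assert_eq!(out, b"Sushi here\n");
}
```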

I leave it as an exercise to the reader to compare these programs when matchesare more frequent.

Example: counting characters, words and lines

In this example, we’re going to write a stripped down version of wc, which counts things like lines, words and characters. We aren’t going to try to be POSIX compliant or even match GNU’s wc behavior (or performance) precisely, but we will make use of Unicode’s grapheme and word segmentation algorithms. Moreover, our program will work even when the input contains invalid UTF-8.

Let’s initialize our project and set up some dependencies:

$ mkdir wc
$ cd wc
$ touch main.rs
$ cargo init --bin
$ cargo add anyhow bstr lexopt

(Shout out to the lexopt crate, which provides what I think is the best minimalist argument parser in Rust. Note that, by design, it doesn’t do --help generation. But it gets all the corner cases correct and is just perfect for short little programs like these.)

Speaking of lexopt, let’s start with the argument parsing part of this program:

/// A configuration that says what we should count.
///
/// If no options are selected via arg parsing, then all options are
/// enabled.
#[derive(Clone, Debug, Default)]
struct Config {
    chars: bool,
    words: bool,
    lines: bool,
}

impl Config {
    /// Parse the given OS string args into a `wc` configuration.
    fn parse<I>(args: I) -> anyhow::Result<Config>
    where
        I: IntoIterator<Item = std::ffi::OsString> + 'static,
    {
        use lexopt::Arg::*;

        let mut config = Config::default();
        // lexopt is just the bee's knees for small little
        // programs like this!
        let mut parser = lexopt::Parser::from_iter(args);
        while let Some(arg) = parser.next()? {
            match arg {
                Short('m') | Long("chars") => config.chars = true,
                Short('l') | Long("lines") => config.lines = true,
                Short('w') | Long("words") => config.words = true,
                Short('h') | Long("help") => {
                    anyhow::bail!("Usage: wc [-m/--chars -l/--lines -w/--words]");
                }
                _ => return Err(arg.unexpected().into()),
            }
        }
        // If nothing is asked for, we do them all.
        if !config.chars && !config.words && !config.lines {
            config.chars = true;
            config.words = true;
            config.lines = true;
        }
        Ok(config)
    }
}

There isn’t too much to see here. We set the fields on Config based on the flags we see. If we don’t see any flags, then we enable all of them.

And now for the main part of our program:

```rust
use std::io::{self, Write};

use bstr::{io::BufReadExt, ByteSlice};

/// Usage:
///   wc [options] < stdin
///   foo ... | wc [options]
///
/// Where 'options' is zero or more flags:
///   -m, --chars   Counts grapheme clusters.
///   -l, --lines   Counts lines, terminated by \n.
///   -w, --words   Counts words, using Unicode Word Segmentation.
fn main() -> anyhow::Result<()> {
    let config = Config::parse(std::env::args_os())?;
    let (mut chars, mut words, mut lines) = (0, 0, 0);
    let mut bufrdr = io::stdin().lock();
    bufrdr.for_byte_line_with_terminator(|line| {
        lines += 1;
        if config.chars {
            chars += line.graphemes().count();
        }
        if config.words {
            words += line.words().count();
        }
        Ok(true)
    })?;

    let mut toprint = vec![];
    if config.lines {
        toprint.push(lines.to_string());
    }
    if config.words {
        toprint.push(words.to_string());
    }
    if config.chars {
        toprint.push(chars.to_string());
    }
    writeln!(io::stdout(), "{}", toprint.join("\t"))?;
    Ok(())
}
```

The parts of this program that actually make use of bstr are quite brief, but this simplicity comes in part because of how byte strings optimistically assume valid UTF-8. Those parts are:

  • BufReadExt::for_byte_line_with_terminator gives us a super fast way of iterating over lines. We do have to provide a closure, but in exchange, we don’t need to amortize allocations ourselves and there is no extra copying.
  • ByteSlice::graphemes gives us an iterator over all grapheme clusters in a byte string. Grapheme clusters are Unicode’s answer to how to approximate what an end user might think of as a character. When invalid UTF-8 is encountered, it is substituted with the Unicode replacement codepoint and yielded as its own grapheme.
  • ByteSlice::words gives us an iterator over all Unicode words in a byte string. Like with graphemes, when invalid UTF-8 is encountered, it is substituted with the Unicode replacement codepoint and yielded as its own word.

This program could be written using &str/String with the unicode-segmentation crate. But, in order to use that crate, you need a &str. This runs into similar issues we faced when writing our grep program above. You could error out completely if invalid UTF-8 is seen, skip lines that are invalid UTF-8, or try to lossily decode lines that contain invalid UTF-8. Depending on your requirements, any of these options are workable, but they come with extra code complexity and probably additional runtime overhead. (I say “probably” because the grapheme and word counting might dwarf UTF-8 validation, but this is very hand-wavy. Optimizing a Unicode-aware wc is a deep rabbit hole that is beyond the scope of this blog.)

Example: windowing grep

In this example, we’re going to adapt our grep program above so that it prints a window of grapheme clusters around each match on a line. This is especially useful when searching files with large lines (like minified JavaScript). This way, you can still see a bit of context for each match, but without dumping a bunch of gibberish to your terminal.

We’ll get started similarly to the previous example, except we’ll add termcolor so that we can colorize our matches.

```shell
$ mkdir window-grep
$ cd window-grep
$ touch main.rs
$ cargo init --bin
$ cargo add anyhow bstr lexopt termcolor
```

Let’s again start with the argument parsing aspect of the program:

```rust
use bstr::ByteVec;

/// A configuration that says what we should look for and how big the
/// window to print around each match should be.
#[derive(Clone, Debug)]
struct Config {
    /// The needle we want to search for.
    needle: Vec<u8>,
    /// A window size bigger than 255 kind of defeats the purpose.
    window: u8,
}

impl Config {
    /// Parse the given OS string args into a `window-grep`
    /// configuration.
    fn parse<I>(args: I) -> anyhow::Result<Config>
    where
        I: IntoIterator<Item = std::ffi::OsString> + 'static,
    {
        use lexopt::{Arg::*, ValueExt};

        const USAGE: &str =
            "Usage: window-grep [-w/--window SIZE] <needle>";

        let mut config = Config { needle: vec![], window: 10 };
        let mut saw_needle = false;
        let mut parser = lexopt::Parser::from_iter(args);
        while let Some(arg) = parser.next()? {
            match arg {
                Short('w') | Long("window") => {
                    config.window = parser.value()?.parse()?;
                }
                Short('h') | Long("help") => {
                    anyhow::bail!(USAGE);
                }
                Value(v) => {
                    anyhow::ensure!(!saw_needle, USAGE);
                    saw_needle = true;
                    // This is a bstr API that is a no-op on Unix and
                    // returns an error on Windows if the OS string
                    // wasn't originally valid UTF-16. Such things are
                    // rare on Windows and we don't care to support
                    // them.
                    config.needle =
                        Vec::from_os_string(v).map_err(|_| {
                            anyhow::anyhow!(
                                "needle is not valid UTF-16 on Windows",
                            )
                        })?;
                }
                _ => return Err(arg.unexpected().into()),
            }
        }
        anyhow::ensure!(saw_needle, USAGE);
        anyhow::ensure!(
            !config.needle.is_empty(),
            "needle must be non-empty"
        );
        Ok(config)
    }
}
```

Now let’s write a couple routines to extract the leading and trailing grapheme clusters from an arbitrary byte string:

```rust
use bstr::ByteSlice;

/// Return a slice of the `size` leading grapheme clusters from `slice`.
fn leading_graphemes(slice: &[u8], size: u8) -> &[u8] {
    slice
        .grapheme_indices()
        .take(usize::from(size))
        .last()
        .map_or(&[], |(_, end, _)| &slice[..end])
}

/// Return a slice of the `size` trailing grapheme clusters from `slice`.
fn trailing_graphemes(slice: &[u8], size: u8) -> &[u8] {
    slice
        .grapheme_indices()
        .rev()
        .take(usize::from(size))
        .last()
        .map_or(&[], |(start, _, _)| &slice[start..])
}
```

These routines make use of the ByteSlice::grapheme_indices API in bstr, which not only provides the grapheme cluster itself, but also the byte offsets at which the cluster started and ended in the original byte string. (It’s similar to the standard library str::char_indices API, but for grapheme clusters.)

(Note: A similar approach is used in ripgrep to implement a similar windowing feature.)

Consider how one might implement something like this without an API to decode graphemes from byte strings. Even if you had a substring search implementation that works on byte strings, how would you go about finding the surrounding grapheme cluster window? Since crates like unicode-segmentation require a &str, you’d have to do some kind of UTF-8 validation. In that case, you would in turn have to either validate the entire line, or find some way to write an incremental UTF-8 validator (which is difficult/annoying to do with just standard library routines). But even if you had that, how would you know when to stop? The key problem here is that you don’t know when to stop decoding grapheme clusters without actually decoding them. The unicode-segmentation crate could potentially help you by exposing an API that works on an Iterator<Item=char>, but it’s not totally clear if that’s a good idea, and it could require big internal refactoring.

Okay, let’s move on to a function that prints and colorizes the match:

```rust
use std::io;

use termcolor::WriteColor;

/// Write the given slice as a colored match.
fn write_match<W: WriteColor>(mut wtr: W, slice: &[u8]) -> io::Result<()> {
    use termcolor::{Color, ColorSpec};

    let mut color = ColorSpec::new();
    color.set_fg(Some(Color::Red)).set_bold(true);
    wtr.set_color(&color)?;
    wtr.write_all(slice)?;
    wtr.reset()?;
    Ok(())
}
```

There isn’t much of interest to note here. So now, finally, let’s look at the meat of the program that looks for matches and does the printing:

```rust
use std::io::{self, Write};

use bstr::io::BufReadExt;
use termcolor::ColorChoice;

/// Usage:
///   window-grep [options] <needle> < stdin
///   foo ... | window-grep [options] <needle>
///
/// Where 'options' is zero or more flags:
///   -w SIZE, --window SIZE   The window size in graphemes. Default is 10.
fn main() -> anyhow::Result<()> {
    let config = Config::parse(std::env::args_os())?;
    let searcher = bstr::Finder::new(&config.needle);
    let mut bufrdr = io::stdin().lock();
    let mut wtr = termcolor::StandardStream::stdout(ColorChoice::Auto);

    let mut lineno = 0;
    bufrdr.for_byte_line(|line| {
        lineno += 1;
        // Contains the offset of the last printed byte. This ensures
        // we don't print overlapping windows if the span between
        // matches is less than our window size.
        let mut printed = 0;
        let (mut start, mut end) = match searcher.find(line) {
            None => return Ok(true),
            Some(i) => (i, i + config.needle.len()),
        };
        write!(wtr, "{}:", lineno)?;
        loop {
            let before = &line[printed..start];
            wtr.write_all(trailing_graphemes(before, config.window))?;
            write_match(&mut wtr, &config.needle)?;
            printed = end;
            match searcher.find(&line[end..]) {
                None => break,
                Some(i) => {
                    start = end + i;
                    end = start + config.needle.len();
                }
            }
        }
        wtr.write_all(leading_graphemes(&line[end..], config.window))?;
        write!(wtr, "\n")?;
        Ok(true)
    })?;
    Ok(())
}
```

The offset management is a bit dense, but it’s mostly to avoid printing overlapping windows when the span between matches is smaller than our configured window size. Otherwise, our leading_graphemes and trailing_graphemes functions are doing the most interesting work. And that’s really where byte strings keep things simple in this program. If you were forced to work with &str/String for this, then you’re likely either paying some non-trivial additional cost or adding some complexity to your code. Having a grapheme cluster segmenter that works directly on byte strings ends up being a nice convenience!

Example: detecting invalid UTF-8

In this very short example, we’re going to demonstrate how to detect invalid UTF-8 by using bstr’s decode-one-codepoint-at-a-time API. We’ll do this by writing a program that prints only the lines from stdin that contain invalid UTF-8. It will print the invalid UTF-8 bytes in their hexadecimal form and colorize them to make them easier to see.

We’ll start like we did with the previous examples. We won’t need lexopt for this one, but we keep termcolor around for colorizing.

```shell
$ mkdir badutf8
$ cd badutf8
$ touch main.rs
$ cargo init --bin
$ cargo add anyhow bstr termcolor
```

First, let’s write a helper function that will be responsible for printing the invalid UTF-8 bytes. That is, it should print them in hexadecimal form and colorize them.

```rust
/// Write each byte in the slice in its hexadecimal form,
/// and with bold coloring.
fn write_invalid_utf8<W: WriteColor>(
    mut wtr: W,
    slice: &[u8],
) -> io::Result<()> {
    use termcolor::{Color, ColorSpec};

    let mut color = ColorSpec::new();
    color.set_fg(Some(Color::Red)).set_bold(true);
    wtr.set_color(&color)?;
    for &byte in slice.iter() {
        write!(wtr, r"\x{:X}", byte)?;
    }
    wtr.reset()?;
    Ok(())
}
```

And now let’s see our main function that is responsible for iterating over all lines, looking for invalid UTF-8 and printing the line:

```rust
use std::io::{self, Write};

use bstr::{io::BufReadExt, ByteSlice};
use termcolor::{ColorChoice, WriteColor};

/// Usage:
///   badutf8 < stdin
///   foo ... | badutf8
fn main() -> anyhow::Result<()> {
    let mut bufrdr = io::stdin().lock();
    let mut wtr = termcolor::StandardStream::stdout(ColorChoice::Auto);

    let mut lineno = 0;
    bufrdr.for_byte_line(|mut line| {
        lineno += 1;
        if line.is_utf8() {
            return Ok(true);
        }
        write!(wtr, "{}:", lineno)?;
        loop {
            let (ch, size) = bstr::decode_utf8(line);
            if size == 0 {
                break;
            } else if ch.is_some() {
                wtr.write_all(&line[..size])?;
            } else {
                write_invalid_utf8(&mut wtr, &line[..size])?;
            }
            line = &line[size..];
        }
        write!(wtr, "\n")?;
        Ok(true)
    })?;
    Ok(())
}
```

This makes use of the bstr::decode_utf8 API. It permits incrementally decoding one codepoint at a time from a byte string. It is occasionally useful when you just want to pluck out a codepoint from somewhere in a byte string, and have complete control over how invalid UTF-8 is handled.

Here’s an example of how this program is used:

```shell
$ echo 'foo\xFFbar\xE2\x98quux' | badutf8
1:foo\xFFbar\xE2\x98quux

$ badutf8 < gecko-dev/third_party/aom/PATENTS
60:2.1. Affiliate.  \x93Affiliate\x94 means an entity that directly or indirectly
63:2.2. Control. \x93Control\x94 means direct or indirect control of more than 50% of
73:2.5. Final Deliverable.  \x93Final Deliverable\x94 means the final version of a
82:2.7. License. \x93License\x94 means this license.
84:2.8. Licensee. \x93Licensee\x94 means any person or entity who exercises patent
101:2.11. Reference Implementation. \x93Reference Implementation\x94 means an Encoder
105:2.12. Specification. \x93Specification\x94 means the specification designated by
```

You can’t see the color here, but all of the hexadecimal numbers are bolded and colored in red when printed to a terminal.

Other crates that support byte strings

Almost every crate I publish that deals with text works on both &str and &[u8]. Some examples:

  • The bytes submodule of the regex crate provides a Regex that can search a &[u8] instead of a &str. A bytes::Regex is also permitted to match invalid UTF-8 (or even split a codepoint if you want it to), whereas the top-level Regex can never match invalid UTF-8 no matter what.
  • The aho-corasick crate provides APIs that work on anything that implements AsRef<[u8]>. This includes both the needles and the haystack.
  • The memchr crate works exclusively on &[u8].

The byte string support in these crates is absolutely critical for tools like ripgrep to exist at all. Writing ripgrep using strings that are guaranteed to be valid UTF-8 everywhere is flatly infeasible for many of the reasons discussed earlier in this blog. But it’s worth discussing one other reason as well: file backed memory maps.

While searching files via memory maps isn’t necessarily faster, it can be a bit faster in some cases. ripgrep tries to use memory maps in those cases. A memory map effectively exposes the contents of a file as a &[u8]. The full slice might not actually be loaded into memory, but as your program accesses it, page faults occur and the operating system loads data from the file into the &[u8] for you automatically. It’s all transparent and awesome. (But also, there are pitfalls, and it’s beyond the scope of this blog to explore them.)

So that &[u8] you get back might be huge. It might be bigger than available memory. Now let’s say none of the crates above had byte string support. How do you run a regex search on a &[u8] when all you have are APIs that work on &str? The regex crate doesn’t provide any way to incrementally feed it data. You could run it line-by-line, but that is quite slow and doesn’t work in the case of multi-line searches. You’re kind of stuck. Your choices are:

  • Don’t use memory maps and miss out on the optimization.
  • Modify the regex crate to support searching streams of some kind.
  • UTF-8 validate the entire &[u8] and convert it to a &str.

(Note: A similar problem exists with running regex or glob searches on file paths. You can get the underlying bytes of a file path to search on Unix without any additional cost, but on Windows, you’re pretty much forced to pay some kind of cost because the internal WTF-8 representation used by Path is hidden.)
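To illustrate the Unix half of that note, here’s a small standalone sketch using only the standard library (the path itself is made up), showing that viewing a path as bytes is free on Unix, even for paths that aren’t valid UTF-8:

```rust
// Unix-only: on Unix, paths are arbitrary bytes, so we can view
// them as byte strings at zero cost.
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

fn main() {
    // Not valid UTF-8, but a perfectly legal Unix file name.
    let path = Path::new(OsStr::from_bytes(b"logs/\xFF.txt"));
    let bytes: &[u8] = path.as_os_str().as_bytes();
    assert_eq!(bytes, b"logs/\xFF.txt");

    // Now any byte string search applies, e.g. a glob-like check:
    assert!(bytes.ends_with(b".txt"));
}
```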

These are all pretty bad choices. The last one in particular will force you to do two passes over the data, which in turn is likely to dramatically slow things down for large files. It also won’t let you deal with files that have even a little invalid UTF-8, nor will it let you deal with binary data at all.

And that’s why these crates support byte strings. It is by far the easiest alternative. More to the point, none of these crates really need or benefit much from using &str internally anyway. So the only cost of exposing a byte string API is the API surface itself. A small price to pay when compared to the alternatives.

Should byte strings be added to std?

Some folks have expressed a desire for bstr, or something like it, to be put into the standard library. I’m not sure how I feel about wholesale adopting bstr as it is. bstr is somewhat opinionated in that it provides several Unicode operations (like grapheme, word and sentence segmentation) that std has deliberately chosen to leave to the crate ecosystem.

Moreover, adding the BStr and BString API is likely to confuse matters and add to the “string fatigue” that Rust programmers sometimes experience. Adding new byte string types is likely to cause at least some decision paralysis when it comes to choosing between, say, Vec<u8> and BString. It’s worth pointing out that the primary advantage of the BStr and BString types is to serve as a target for trait impls like std::fmt::Debug and serde::{Deserialize, Serialize}. The standard library could help with the Debug impl by perhaps providing a debug() method on &[u8], similar to the display() method on Path.

Otherwise, I think the highest value addition that std could adopt is substring search where the needle and haystack are permitted to be &[u8].

Other API additions are likely useful too (I’m a big fan of bstr::decode_utf8, for example), but I’m not sure whether they belong in std. It might be wise to let bstr 1.0 bake for a bit and see how it’s used in the ecosystem after a few years.

Acknowledgments

Big thanks to Thom Chiovoloni and Ryan Lopopolo for not only their code contributions to bstr, but also for participating in API design discussions. They were extremely helpful in fleshing out the current API and catching mistakes.