A string type for Rust that is not required to be valid UTF-8.

bstr

This crate provides extension traits for &[u8] and Vec<u8> that enable their use as byte strings, where byte strings are conventionally UTF-8. This differs from the standard library's String and str types in that they are not required to be valid UTF-8, but may be fully or partially valid UTF-8.

Documentation

https://docs.rs/bstr

When should I use byte strings?

See this part of the documentation for more details: https://docs.rs/bstr/1.*/bstr/#when-should-i-use-byte-strings.

The short story is that byte strings are useful when it is inconvenient or incorrect to require valid UTF-8.
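As a minimal sketch of the inconvenience, using only the standard library: a single invalid byte makes `std::str::from_utf8` reject an entire line, even though a byte-level substring search on the same data works fine. The helper `contains_bytes` is a hypothetical name written for this illustration and is not part of bstr or the standard library.

```rust
/// Naive byte-level substring test (hypothetical helper for illustration).
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    // An empty needle matches everywhere; `windows(0)` would panic.
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}

fn main() {
    // Mostly ASCII, with one stray byte that is not valid UTF-8.
    let line: &[u8] = b"Dimension: 3\xFF\n";

    // Requiring valid UTF-8 rejects the whole line.
    assert!(std::str::from_utf8(line).is_err());

    // Treating the line as a byte string still lets us search it.
    assert!(contains_bytes(line, b"Dimension"));
}
```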

Usage

cargo add bstr

Examples

The following examples exhibit both the API features of byte strings and the I/O convenience functions provided for reading line-by-line quickly.

This first example simply shows how to efficiently iterate over lines in stdin, and print out lines containing a particular substring:

    use std::{error::Error, io::{self, Write}};
    use bstr::{ByteSlice, io::BufReadExt};

    fn main() -> Result<(), Box<dyn Error>> {
        let stdin = io::stdin();
        let mut stdout = io::BufWriter::new(io::stdout());

        stdin.lock().for_byte_line_with_terminator(|line| {
            if line.contains_str("Dimension") {
                stdout.write_all(line)?;
            }
            Ok(true)
        })?;
        Ok(())
    }

This example shows how to count all of the words (Unicode-aware) in stdin, line-by-line:

    use std::{error::Error, io};
    use bstr::{ByteSlice, io::BufReadExt};

    fn main() -> Result<(), Box<dyn Error>> {
        let stdin = io::stdin();
        let mut words = 0;
        stdin.lock().for_byte_line_with_terminator(|line| {
            words += line.words().count();
            Ok(true)
        })?;
        println!("{}", words);
        Ok(())
    }

This example shows how to convert a stream on stdin to uppercase without performing UTF-8 validation and while amortizing allocation. On standard ASCII text, this is quite a bit faster than what you can (easily) do with standard library APIs. (N.B. Any invalid UTF-8 bytes are passed through unchanged.)

    use std::{error::Error, io::{self, Write}};
    use bstr::{ByteSlice, io::BufReadExt};

    fn main() -> Result<(), Box<dyn Error>> {
        let stdin = io::stdin();
        let mut stdout = io::BufWriter::new(io::stdout());

        // Reused across lines so the allocation is amortized.
        let mut upper = vec![];
        stdin.lock().for_byte_line_with_terminator(|line| {
            upper.clear();
            line.to_uppercase_into(&mut upper);
            stdout.write_all(&upper)?;
            Ok(true)
        })?;
        Ok(())
    }

This example shows how to extract the first 10 visual characters (as grapheme clusters) from each line, where invalid UTF-8 sequences are generally treated as a single character and are passed through correctly:

    use std::{error::Error, io::{self, Write}};
    use bstr::{ByteSlice, io::BufReadExt};

    fn main() -> Result<(), Box<dyn Error>> {
        let stdin = io::stdin();
        let mut stdout = io::BufWriter::new(io::stdout());

        stdin.lock().for_byte_line_with_terminator(|line| {
            // Byte offset just past the 10th grapheme (or the whole line).
            let end = line
                .grapheme_indices()
                .map(|(_, end, _)| end)
                .take(10)
                .last()
                .unwrap_or(line.len());
            stdout.write_all(line[..end].trim_end())?;
            stdout.write_all(b"\n")?;
            Ok(true)
        })?;
        Ok(())
    }

Cargo features

This crate comes with a few features that control standard library, serde and Unicode support.

std - Enabled by default. This provides APIs that require the standard library, such as Vec<u8> and PathBuf. Enabling this feature also enables the alloc feature.
alloc - Enabled by default. This provides APIs that require allocations via the alloc crate, such as Vec<u8>.
unicode - Enabled by default. This provides APIs that require sizable Unicode data compiled into the binary. This includes, but is not limited to, grapheme/word/sentence segmenters. When this is disabled, basic support such as UTF-8 decoding is still included. Note that currently, enabling this feature also requires enabling the std feature. It is expected that this limitation will be lifted at some point.
serde - Enables implementations of serde traits for BStr, and also BString when alloc is enabled.
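As a sketch of how these features might be selected in a Cargo.toml, the fragment below turns off the defaults and opts back in to std only, which leaves out the Unicode data tables. The version requirement "1" is an assumption based on the 1.x documentation link above; adjust it to the release you actually depend on.

```toml
[dependencies]
# Disable default features (std, alloc, unicode), then re-enable std.
# Per the list above, std also enables alloc.
bstr = { version = "1", default-features = false, features = ["std"] }

# Alternative: keep the defaults and additionally enable serde support.
# bstr = { version = "1", features = ["serde"] }
```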

Minimum Rust version policy

This crate's minimum supported rustc version (MSRV) is 1.73.

In general, this crate will be conservative with respect to the minimum supported version of Rust. MSRV may be bumped in minor version releases.

Future work

Since it is plausible that some of the types in this crate might end up in your public API (e.g., BStr and BString), we will commit to being very conservative with respect to new major version releases. It's difficult to say precisely how conservative, but unless there is a major issue with the 1.0 release, I wouldn't expect a 2.0 release to come out any sooner than some period of years.

A large part of the API surface area was taken from the standard library, so from an API design perspective, a good portion of this crate should be on solid ground. The main differences from the standard library are in how the various substring search routines work. The standard library provides generic infrastructure for supporting different types of searches with a single method, whereas this library prefers to define new methods for each type of search and drop the generic infrastructure.

Some probable future considerations for APIs include, but are not limited to:

Unicode normalization.
More sophisticated support for dealing with Unicode case, perhaps by combining the use cases supported by caseless and unicase.

Here are some examples that are probably out of scope for this crate:

Regular expressions.
Unicode collation.

The exact scope isn't quite clear, but I expect we can iterate on it.

In general, as stated below, this crate brings lots of related APIs together into a single crate while simultaneously attempting to keep the total number of dependencies low. Indeed, every dependency of bstr, except for memchr, is optional.

High level motivation

Strictly speaking, the bstr crate provides very little that can't already be achieved with the standard library Vec<u8>/&[u8] APIs and the ecosystem of library crates. For example:

The standard library's Utf8Error can be used for incremental lossy decoding of &[u8].
The unicode-segmentation crate can be used for iterating over graphemes (or words), but is only implemented for &str types. One could use Utf8Error above to implement grapheme iteration with the same semantics as what bstr provides (automatic Unicode replacement codepoint substitution).
The twoway crate can be used for fast substring searching on &[u8].
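The first point can be sketched with the standard library alone. The function below walks a byte slice, using Utf8Error::valid_up_to and Utf8Error::error_len to copy each valid run and substitute U+FFFD for each invalid sequence. The name decode_lossy is hypothetical; this is one possible way to do it, not how bstr itself is implemented.

```rust
use std::str;

/// Lossily decode `bytes`, emitting U+FFFD for each invalid sequence.
/// (Hypothetical helper; the standard library's `String::from_utf8_lossy`
/// already offers this behavior in one call.)
fn decode_lossy(mut bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len());
    loop {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                // This prefix was just validated, so decoding cannot fail.
                out.push_str(str::from_utf8(valid).expect("validated prefix"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep decoding.
                    Some(len) => bytes = &rest[len..],
                    // Input ended in the middle of a sequence.
                    None => return out,
                }
            }
        }
    }
}

fn main() {
    println!("{}", decode_lossy(b"foo\xFFbar"));
}
```

This already takes a fair amount of care to get right, and extending it to grapheme iteration would take more, which is the kind of glue the crate aims to absorb.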

So why create bstr? Part of the point of the bstr crate is to provide a uniform API of coupled components instead of relying on users to piece together loosely coupled components from the crate ecosystem. For example, if you wanted to perform a search and replace in a Vec<u8>, then writing the code to do that with the twoway crate is not that difficult, but it's still additional glue code you have to write. This work adds up depending on what you're doing. Consider, for example, trimming and splitting, along with their different variants.
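For a sense of what that glue code looks like, here is a rough standard-library-only sketch of search and replace on bytes. The function name replace_all is hypothetical, and a naive scan stands in for a fast searcher such as twoway, so this illustrates the shape of the code rather than a tuned implementation.

```rust
/// Replace every non-overlapping occurrence of `needle` in `haystack`.
/// (Hypothetical helper; uses a naive scan rather than a fast searcher.)
fn replace_all(haystack: &[u8], needle: &[u8], replacement: &[u8]) -> Vec<u8> {
    // Treat an empty needle as "nothing to replace" to avoid looping forever.
    if needle.is_empty() {
        return haystack.to_vec();
    }
    let mut out = Vec::with_capacity(haystack.len());
    let mut i = 0;
    while i < haystack.len() {
        if haystack[i..].starts_with(needle) {
            out.extend_from_slice(replacement);
            i += needle.len();
        } else {
            out.push(haystack[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    // Works on arbitrary bytes, including invalid UTF-8.
    let replaced = replace_all(b"foo\xFFfoo", b"foo", b"quux");
    assert_eq!(replaced, b"quux\xFFquux".to_vec());
}
```

Even this small helper has to decide how to treat an empty needle and overlapping matches, and the same decisions recur for trimming and splitting.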

In other words, bstr is partially a way of pushing back against the micro-crate ecosystem that appears to be evolving. Namely, it is a goal of bstr to keep its dependency list lightweight. For example, serde is an optional dependency because there is no feasible alternative. In service of this philosophy, currently, the only required dependency of bstr is memchr.