A string type for Rust that is not required to be valid UTF-8.
BurntSushi/bstr
This crate provides extension traits for `&[u8]` and `Vec<u8>` that enable their use as byte strings, where byte strings are conventionally UTF-8. This differs from the standard library's `String` and `str` types in that they are not required to be valid UTF-8, but may be fully or partially valid UTF-8.
See this part of the documentation for more details: https://docs.rs/bstr/1.*/bstr/#when-should-i-use-byte-strings.
The short story is that byte strings are useful when it is inconvenient or incorrect to require valid UTF-8.
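As a rough illustration of that inconvenience, here is a minimal sketch that uses only the standard library (not `bstr`). The input bytes and the helper `contains_bytes` are made up for this example; the point is only that strict conversion rejects the whole line, lossy conversion changes the data, and the raw bytes remain perfectly searchable.

```rust
use std::str;

/// Hypothetical helper: naive byte-wise substring test on raw bytes.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    // `windows(0)` would panic, so handle the empty needle first.
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}

fn main() {
    // Made-up input: ASCII text with one byte (0xFF) that is never valid UTF-8.
    let line: &[u8] = b"Dimension: 3\xFFx4\n";

    // Strict conversion rejects the entire line because of a single byte.
    assert!(str::from_utf8(line).is_err());

    // Lossy conversion succeeds, but the original byte is replaced by U+FFFD.
    assert_eq!(String::from_utf8_lossy(line), "Dimension: 3\u{FFFD}x4\n");

    // Treating the data as a byte string sidesteps both problems.
    assert!(contains_bytes(line, b"Dimension"));
}
```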
cargo add bstr
The following examples exhibit both the API features of byte strings and the I/O convenience functions provided for reading line-by-line quickly.
This first example simply shows how to efficiently iterate over lines in stdin, and print out lines containing a particular substring:

```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        if line.contains_str("Dimension") {
            stdout.write_all(line)?;
        }
        Ok(true)
    })?;
    Ok(())
}
```
This example shows how to count all of the words (Unicode-aware) in stdin, line-by-line:

```rust
use std::{error::Error, io};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut words = 0;

    stdin.lock().for_byte_line_with_terminator(|line| {
        words += line.words().count();
        Ok(true)
    })?;
    println!("{}", words);
    Ok(())
}
```
This example shows how to convert a stream on stdin to uppercase without performing UTF-8 validation and while amortizing allocation. On standard ASCII text, this is quite a bit faster than what you can (easily) do with standard library APIs. (N.B. Any invalid UTF-8 bytes are passed through unchanged.)

```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    // Reused across lines so that allocation is amortized.
    let mut upper = vec![];
    stdin.lock().for_byte_line_with_terminator(|line| {
        upper.clear();
        line.to_uppercase_into(&mut upper);
        stdout.write_all(&upper)?;
        Ok(true)
    })?;
    Ok(())
}
```
This example shows how to extract the first 10 visual characters (as grapheme clusters) from each line, where invalid UTF-8 sequences are generally treated as a single character and are passed through correctly:

```rust
use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        let end = line
            .grapheme_indices()
            .map(|(_, end, _)| end)
            .take(10)
            .last()
            .unwrap_or(line.len());
        stdout.write_all(line[..end].trim_end())?;
        stdout.write_all(b"\n")?;
        Ok(true)
    })?;
    Ok(())
}
```
This crate comes with a few features that control standard library, serde and Unicode support.

- `std` - Enabled by default. This provides APIs that require the standard library, such as `Vec<u8>` and `PathBuf`. Enabling this feature also enables the `alloc` feature.
- `alloc` - Enabled by default. This provides APIs that require allocations via the `alloc` crate, such as `Vec<u8>`.
- `unicode` - Enabled by default. This provides APIs that require sizable Unicode data compiled into the binary. This includes, but is not limited to, grapheme/word/sentence segmenters. When this is disabled, basic support such as UTF-8 decoding is still included. Note that currently, enabling this feature also requires enabling the `std` feature. It is expected that this limitation will be lifted at some point.
- `serde` - Enables implementations of serde traits for `BStr`, and also `BString` when `alloc` is enabled.
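As a sketch of how these features might be selected, the `Cargo.toml` fragment below turns off the defaults and opts back in to `alloc` only. The version requirement and the particular feature choice are assumptions made for illustration, not a recommendation from the crate's authors.

```toml
[dependencies]
# Assumed setup: allocation support only, without `std`, `unicode` or `serde`.
bstr = { version = "1", default-features = false, features = ["alloc"] }
```

With this selection, the Unicode-dependent APIs (such as the grapheme and word segmenters) would not be available, while basic UTF-8 decoding would be.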
This crate's minimum supported `rustc` version (MSRV) is `1.73`.
In general, this crate will be conservative with respect to the minimum supported version of Rust. MSRV may be bumped in minor version releases.
Since it is plausible that some of the types in this crate might end up in your public API (e.g., `BStr` and `BString`), we will commit to being very conservative with respect to new major version releases. It's difficult to say precisely how conservative, but unless there is a major issue with the `1.0` release, I wouldn't expect a `2.0` release to come out any sooner than some period of years.
A large part of the API surface area was taken from the standard library, so from an API design perspective, a good portion of this crate should be on solid ground. The main differences from the standard library are in how the various substring search routines work. The standard library provides generic infrastructure for supporting different types of searches with a single method, whereas this library prefers to define new methods for each type of search and drop the generic infrastructure.
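To illustrate the standard library side of that contrast, the sketch below calls the single `str::find` method with four different kinds of needle. The function name `find_variants` and the sample haystack are made up for this example. In `bstr`, the corresponding searches are spelled as separately named methods rather than one generic method; consult the `ByteSlice` documentation for the exact set.

```rust
/// Hypothetical helper: runs one generic method, `str::find`, with four needle kinds.
fn find_variants(haystack: &str) -> [Option<usize>; 4] {
    [
        haystack.find("bar"),               // substring
        haystack.find('4'),                 // single char
        haystack.find(char::is_whitespace), // predicate on chars
        haystack.find(&['a', 'o'][..]),     // any char from a set
    ]
}

fn main() {
    let hits = find_variants("foo bar 42");
    assert_eq!(hits, [Some(4), Some(8), Some(3), Some(1)]);
}
```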
Some probable future considerations for APIs include, but are not limited to:
- Unicode normalization.
- More sophisticated support for dealing with Unicode case, perhaps by combining the use cases supported by `caseless` and `unicase`.
Here are some examples that are probably out of scope for this crate:
- Regular expressions.
- Unicode collation.
The exact scope isn't quite clear, but I expect we can iterate on it.
In general, as stated below, this crate brings lots of related APIs together into a single crate while simultaneously attempting to keep the total number of dependencies low. Indeed, every dependency of `bstr`, except for `memchr`, is optional.
Strictly speaking, the `bstr` crate provides very little that can't already be achieved with the standard library `Vec<u8>`/`&[u8]` APIs and the ecosystem of library crates. For example:
- The standard library's `Utf8Error` can be used for incremental lossy decoding of `&[u8]`.
- The `unicode-segmentation` crate can be used for iterating over graphemes (or words), but is only implemented for `&str` types. One could use `Utf8Error` above to implement grapheme iteration with the same semantics as what `bstr` provides (automatic Unicode replacement codepoint substitution).
- The `twoway` crate can be used for fast substring searching on `&[u8]`.
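The first point can be sketched with the standard library alone. The function below, whose name `decode_lossy` is made up for this example, repeatedly validates the remaining bytes and uses `Utf8Error::valid_up_to` and `Utf8Error::error_len` to skip each invalid sequence, substituting U+FFFD for it. It is a minimal illustration of the technique, not how `bstr` itself is implemented.

```rust
use std::str;

/// Hypothetical helper: lossily decodes `bytes`, replacing each invalid
/// sequence with the Unicode replacement codepoint (U+FFFD).
fn decode_lossy(mut bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len());
    loop {
        match str::from_utf8(bytes) {
            Ok(valid) => {
                // Everything that remains is valid UTF-8.
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                // Bytes before `valid_up_to` are guaranteed to be valid.
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                out.push_str(str::from_utf8(valid).expect("prefix was validated"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and continue decoding.
                    Some(len) => bytes = &rest[len..],
                    // The input ended in the middle of a sequence.
                    None => return out,
                }
            }
        }
    }
}

fn main() {
    assert_eq!(decode_lossy(b"a\xFFb"), "a\u{FFFD}b");
}
```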
So why create `bstr`? Part of the point of the `bstr` crate is to provide a uniform API of coupled components instead of relying on users to piece together loosely coupled components from the crate ecosystem. For example, if you wanted to perform a search and replace in a `Vec<u8>`, then writing the code to do that with the `twoway` crate is not that difficult, but it's still additional glue code you have to write. This work adds up depending on what you're doing. Consider, for example, trimming and splitting, along with their different variants.
In other words, `bstr` is partially a way of pushing back against the micro-crate ecosystem that appears to be evolving. Namely, it is a goal of `bstr` to keep its dependency list lightweight. For example, `serde` is an optional dependency because there is no feasible alternative. In service of this philosophy, currently, the only required dependency of `bstr` is `memchr`.
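To give a sense of the glue code in question, here is a minimal sketch of search and replace over raw bytes. A naive standard library scan stands in for the `twoway` search so that the example stays self-contained; the function names `find` and `replace_all` are made up for illustration, and the treatment of an empty needle is an arbitrary simplification.

```rust
/// Hypothetical helper: naive search for the first occurrence of `needle`.
/// A real implementation would call a fast substring searcher here instead.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Hypothetical helper: replaces every non-overlapping occurrence of `needle`.
fn replace_all(haystack: &[u8], needle: &[u8], with: &[u8]) -> Vec<u8> {
    // Simplification: an empty needle leaves the input unchanged.
    if needle.is_empty() {
        return haystack.to_vec();
    }
    let mut out = Vec::with_capacity(haystack.len());
    let mut rest = haystack;
    while let Some(i) = find(rest, needle) {
        out.extend_from_slice(&rest[..i]); // bytes before the match
        out.extend_from_slice(with);       // the replacement
        rest = &rest[i + needle.len()..];  // continue after the match
    }
    out.extend_from_slice(rest);
    out
}

fn main() {
    // The invalid byte 0xFF is carried through untouched.
    let input = b"foo \xFF foo".to_vec();
    assert_eq!(replace_all(&input, b"foo", b"quux"), b"quux \xFF quux");
}
```

None of this is hard, but each such helper (and its edge cases) has to be written and tested again wherever it is needed.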
This project is licensed under either of
- Apache License, Version 2.0, (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option.
The data in `src/unicode/data/` is licensed under the Unicode License Agreement (LICENSE-UNICODE), although this data is only used in tests.