Fast C++ function "is_utf8": checks if the input is valid UTF-8. Made of a single source file. Optimized for ARM NEON, x64 SSE, AVX2 and AVX-512.
simdutf/is_utf8
Most strings online are in Unicode using the UTF-8 encoding. Validating strings quickly before accepting them is important.
This is a simple one-source file library to validate UTF-8 strings at high speeds using SIMD instructions. It works on all platforms (ARM, x64).
Build and link is_utf8.cpp with your project. Code usage:
```cpp
#include "is_utf8.h"

char * mystring = ...
bool is_it_valid = is_utf8(mystring, thestringlength);
```
It should be able to validate strings using less than 1 cycle per input byte.
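To make concrete what "valid UTF-8" means, here is a minimal scalar sketch of the rules a validator has to enforce: correct leading and continuation bytes, no overlong encodings, no surrogates, and no code point above U+10FFFF. The name `is_utf8_scalar` is hypothetical and the code is not taken from this library, which uses SIMD instructions instead. A byte-at-a-time check like this one could serve as a slow reference when cross-checking results.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference check (an assumption for illustration,
// not the SIMD algorithm used by is_utf8).
bool is_utf8_scalar(const char *data, std::size_t len) {
  assert(data != nullptr || len == 0);
  const unsigned char *s = reinterpret_cast<const unsigned char *>(data);
  std::size_t i = 0;
  while (i < len) {
    const unsigned char b = s[i];
    if (b < 0x80) {  // ASCII: a single byte
      ++i;
      continue;
    }
    std::size_t extra;  // number of continuation bytes expected
    std::uint32_t cp;   // code point being decoded
    std::uint32_t min;  // smallest code point allowed for this length
    if ((b & 0xE0) == 0xC0) {
      extra = 1; cp = b & 0x1F; min = 0x80;
    } else if ((b & 0xF0) == 0xE0) {
      extra = 2; cp = b & 0x0F; min = 0x800;
    } else if ((b & 0xF8) == 0xF0) {
      extra = 3; cp = b & 0x07; min = 0x10000;
    } else {
      return false;  // stray continuation byte or invalid leading byte
    }
    if (len - i <= extra) return false;  // sequence is truncated
    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char c = s[i + k];
      if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min) return false;                      // overlong encoding
    if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point
    i += extra + 1;
  }
  return true;
}
```

Calling `is_utf8_scalar("1234", 4)` returns `true`, while an overlong sequence such as the two bytes `0xC0 0x80` is rejected.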
- C++11 compatible compiler. We support LLVM clang, GCC, Visual Studio. (Our optional benchmark tool requires C++17.)
- For high speed, you should have a recent 64-bit system (e.g., ARM or x64).
- If you rely on CMake, you should use a recent CMake (at least 3.15).
- AVX-512 support requires a processor with AVX512-VBMI2 (Ice Lake or better) and a recent compiler (GCC 8 or better, Visual Studio 2019 or better, LLVM clang 6 or better). You need a correspondingly recent assembler such as gas (2.30+) or nasm (2.14+): recent compilers usually come with recent assemblers. If you mix a recent compiler with an incompatible or old assembler (e.g., when using a recent compiler with an old Linux distribution), you may get errors at build time because the compiler produces instructions that the assembler does not recognize. In that case, update your assembler to match your compiler (e.g., upgrade binutils to version 2.30 or better under Linux) or use an older compiler matching the capabilities of your assembler.
```shell
cmake -B build
cmake --build build
cd build
ctest .
```
Visual Studio users must specify whether they want to build the Release or Debug version.
To run benchmarks, build and execute the bench command.
```shell
cmake -B build
cmake --build build
./build/benchmarks/bench
```
Instructions are similar for Visual Studio users.
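If you would rather get a rough throughput figure from inside your own program, a small timing harness along the following lines may help. This is only a sketch: `time_validator`, `timing_result` and the `ascii_only` stand-in are hypothetical names introduced here so that the code compiles without the library, and it assumes the function under test is callable as `bool(const char *, size_t)`, as the usage of `is_utf8` suggests. It is not the bench tool shipped with this repository, and a careful benchmark needs warm-up runs and repeated measurements.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Assumed call signature for the function being timed.
using validator_fn = bool (*)(const char *, std::size_t);

struct timing_result {
  bool all_valid;     // true if every call returned true
  double seconds;     // wall-clock time for all repetitions
  double gb_per_sec;  // throughput; 0 if the elapsed time was not measurable
};

// Hypothetical stand-in that accepts only ASCII input.
// Pass is_utf8 instead when the library is linked in.
bool ascii_only(const char *data, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i) {
    if (static_cast<unsigned char>(data[i]) >= 0x80) return false;
  }
  return true;
}

// Calls fn on the same input `repeats` times and reports the throughput.
timing_result time_validator(validator_fn fn, const std::string &input,
                             int repeats) {
  bool all_valid = true;
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < repeats; ++r) {
    // fn is evaluated first so that it runs on every repetition.
    all_valid = fn(input.data(), input.size()) && all_valid;
  }
  const auto stop = std::chrono::steady_clock::now();
  const double seconds = std::chrono::duration<double>(stop - start).count();
  const double bytes = static_cast<double>(input.size()) * repeats;
  const double gbps = seconds > 0.0 ? bytes / seconds / 1e9 : 0.0;
  return {all_valid, seconds, gbps};
}
```

For example, `time_validator(ascii_only, std::string(1 << 20, 'a'), 100)` times one hundred passes over a 1 MiB ASCII buffer; the numbers you get depend on your machine and compiler flags.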
This C++ library is part of the JavaScript package utf-8-validate. The utf-8-validate package is routinely downloaded more than a million times per week.
If you are using Node.js (19.4.0 or better), you already have access to this function as buffer.isUtf8(input).
- John Keiser, Daniel Lemire, Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice & Experience 51 (5), 2021
If you want a wide range of fast Unicode functions for production use, you can rely on the simdutf library. It is as simple as the following:
```cpp
#include <cstdlib>   // EXIT_FAILURE
#include <iostream>  // std::cout, std::cerr

#include "simdutf.cpp"
#include "simdutf.h"

int main(int argc, char *argv[]) {
  const char *source = "1234";
  // 4 == strlen(source)
  bool validutf8 = simdutf::validate_utf8(source, 4);
  if (validutf8) {
    std::cout << "valid UTF-8" << std::endl;
  } else {
    std::cerr << "invalid UTF-8" << std::endl;
    return EXIT_FAILURE;
  }
}
```
See https://github.com/simdutf/
This library is distributed under the terms of any of the following licenses, at your option:
- Apache License (Version 2.0) LICENSE-APACHE,
- Boost Software License LICENSE-BOOST, or
- MIT License LICENSE-MIT.