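| Defined in header <codecvt> | |
| template< class Elem, unsigned long Maxcode = 0x10ffff, std::codecvt_mode Mode = (std::codecvt_mode)0 > class codecvt_utf8 : public std::codecvt<Elem, char, std::mbstate_t>; | (since C++11) (deprecated in C++17) (removed in C++26) |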
std::codecvt_utf8 is a std::codecvt facet which encapsulates conversion between a UTF-8 encoded byte string and a UCS-2 or UTF-32 character string (depending on the type of Elem). This std::codecvt facet can be used to read and write UTF-8 files, both text and binary.
UCS-2 is an archaic encoding that is a subset of UTF-16, which encodes scalar values in the range U+0000-U+FFFF (Basic Multilingual Plane) only.
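As a minimal sketch of the UTF-8 / UTF-32 case, the facet can be plugged into std::wstring_convert with Elem set to char32_t. The helper names utf8_to_utf32 and utf32_to_utf8 are hypothetical and only serve the illustration. The facet is deprecated since C++17, so a compiler may emit deprecation warnings.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// std::wstring_convert throws std::range_error if the conversion fails.
std::u32string utf8_to_utf32(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}

// Hypothetical helper: encode UTF-32 code points as UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& text)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(text);
}
```

With char32_t every code point, including those outside the Basic Multilingual Plane, occupies exactly one element of the result.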
| Template parameter | Description |
|---|---|
| Elem | either char16_t, char32_t, or wchar_t |
| Maxcode | the largest value of Elem that this facet will read or write without error |
| Mode | a constant of type std::codecvt_mode |
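The effect of Maxcode and Mode can be sketched as follows. The helper names decode_skipping_bom and fits_in_latin1 are hypothetical, and the limit 0xff is an arbitrary value chosen for the illustration; the behavior described in the comments is what common implementations are expected to do.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Mode = std::consume_header: a leading UTF-8 byte order mark (EF BB BF)
// is skipped on input rather than being returned as part of the text.
std::u32string decode_skipping_bom(const std::string& bytes)
{
    std::wstring_convert<
        std::codecvt_utf8<char32_t, 0x10ffff, std::consume_header>,
        char32_t> conv;
    return conv.from_bytes(bytes);
}

// Maxcode = 0xff: code points above U+00FF are treated as conversion errors,
// which std::wstring_convert reports by throwing std::range_error.
// Note that malformed UTF-8 also makes this helper return false.
bool fits_in_latin1(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf8<char32_t, 0xff>, char32_t> conv;
    try
    {
        conv.from_bytes(bytes);
        return true;
    }
    catch (const std::range_error&)
    {
        return false;
    }
}
```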
| Member function | Description |
|---|---|
| (constructor) | constructs a new codecvt_utf8 facet (public member function) |
| (destructor) | destroys a codecvt_utf8 facet (public member function) |

explicit codecvt_utf8( std::size_t refs = 0 );

Constructs a new std::codecvt_utf8 facet and passes the initial reference counter refs to the base class.

| Parameter | Description |
|---|---|
| refs | the number of references that link to the facet |

~codecvt_utf8();
Destroys the facet. Unlike the locale-managed facets, this facet's destructor is public.
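Because the destructor is public, a codecvt_utf8 object can be created with automatic storage and used without installing it in a std::locale. The sketch below assumes nothing beyond the inherited public members max_length and always_noconv; the two helper function names are hypothetical.

```cpp
#include <codecvt>
#include <locale>

// Hypothetical helper: ask a stack-allocated facet for the maximum number of
// UTF-8 bytes it may consume to produce one char32_t.
int utf8_max_bytes_per_code_point()
{
    std::codecvt_utf8<char32_t> facet; // OK: the destructor is public
    // A local std::codecvt<char32_t, char, std::mbstate_t> object would not
    // compile, because that facet's destructor is protected.
    return facet.max_length();
}

// Hypothetical helper: UTF-8 <-> UTF-32 is never an identity conversion,
// so this is expected to return false.
bool utf8_facet_is_identity()
{
    std::codecvt_utf8<char32_t> facet;
    return facet.always_noconv();
}
```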
| Type | Definition |
|---|---|
| intern_type | internT |
| extern_type | externT |
| state_type | stateT |

| Member | Description |
|---|---|
| std::locale::id id [static] | the identifier of the facet |
| Member function | Description |
|---|---|
| out | invokes do_out (public member function of std::codecvt<InternT,ExternT,StateT>) |
| in | invokes do_in (public member function of std::codecvt<InternT,ExternT,StateT>) |
| unshift | invokes do_unshift (public member function of std::codecvt<InternT,ExternT,StateT>) |
| encoding | invokes do_encoding (public member function of std::codecvt<InternT,ExternT,StateT>) |
| always_noconv | invokes do_always_noconv (public member function of std::codecvt<InternT,ExternT,StateT>) |
| length | invokes do_length (public member function of std::codecvt<InternT,ExternT,StateT>) |
| max_length | invokes do_max_length (public member function of std::codecvt<InternT,ExternT,StateT>) |

| Protected member function | Description |
|---|---|
| do_out [virtual] | converts a string from InternT to ExternT, such as when writing to file (virtual protected member function of std::codecvt<InternT,ExternT,StateT>) |
| do_in [virtual] | converts a string from ExternT to InternT, such as when reading from file (virtual protected member function of std::codecvt<InternT,ExternT,StateT>) |
| do_unshift [virtual] | generates the termination character sequence of ExternT characters for incomplete conversion (virtual protected member function of std::codecvt<InternT,ExternT,StateT>) |
| do_encoding [virtual] | returns the number of ExternT characters necessary to produce one InternT character, if constant (virtual protected member function of std::codecvt<InternT,ExternT,StateT>) |
| do_always_noconv [virtual] | tests if the facet encodes an identity conversion for all valid argument values (virtual protected member function of std::codecvt<InternT,ExternT,StateT>) |
| do_length [virtual] | calculates the length of the ExternT string that would be consumed by conversion into the given InternT buffer (virtual protected member function of std::codecvt<InternT,ExternT,StateT>) |
| do_max_length [virtual] | returns the maximum number of ExternT characters that could be converted into a single InternT character (virtual protected member function of std::codecvt<InternT,ExternT,StateT>) |
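The public members can also be called directly, without std::wstring_convert. The following is a minimal sketch, assuming Elem is char32_t, that sizes the output buffer with max_length() and encodes with out(). The helper name encode_utf8 and the error message are hypothetical.

```cpp
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode UTF-32 text as UTF-8 by calling out() directly.
std::string encode_utf8(const std::u32string& text)
{
    if (text.empty())
        return std::string();

    std::codecvt_utf8<char32_t> facet;
    std::mbstate_t state = std::mbstate_t();

    // max_length() bounds the number of bytes per code point, so this buffer
    // is large enough for the whole input.
    std::string bytes(text.size() * static_cast<std::size_t>(facet.max_length()), '\0');

    const char32_t* from_next = nullptr;
    char* to_next = nullptr;
    const std::codecvt_base::result r = facet.out(
        state,
        text.data(), text.data() + text.size(), from_next,
        &bytes[0], &bytes[0] + bytes.size(), to_next);

    if (r != std::codecvt_base::ok)
        throw std::range_error("UTF-32 to UTF-8 conversion failed");

    // Keep only the bytes that were actually written.
    bytes.resize(static_cast<std::size_t>(to_next - &bytes[0]));
    return bytes;
}
```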
| Nested type | Definition |
|---|---|
| enum result { ok, partial, error, noconv }; | Unscoped enumeration type |

| Enumeration constant | Definition |
|---|---|
| ok | conversion was completed with no error |
| partial | not all source characters were converted |
| error | encountered an invalid character |
| noconv | no conversion required, input and output types are the same |
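To illustrate how these result values show up in practice, the sketch below calls in() directly and returns the result code to the caller. The helper name decode_utf8 is hypothetical. The behavior for truncated input (partial) and for an invalid lead byte (error) reflects what common implementations are expected to report and is an assumption rather than a guarantee quoted from the standard.

```cpp
#include <codecvt>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: decode UTF-8 into `out` and return the facet's result.
// On return, `out` holds the code points converted before the call stopped.
std::codecvt_base::result decode_utf8(const std::string& bytes, std::u32string& out)
{
    std::codecvt_utf8<char32_t> facet;
    std::mbstate_t state = std::mbstate_t();

    // Each code point needs at least one byte, so bytes.size() elements suffice.
    out.assign(bytes.size(), U'\0');

    const char* from_next = bytes.data();
    char32_t* to_next = &out[0];
    const std::codecvt_base::result r = facet.in(
        state,
        bytes.data(), bytes.data() + bytes.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return r;
}
```

A caller would typically treat ok as success, supply more input on partial, and reject the data on error.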
Although the standard requires that this facet works with UCS-2 when the size of Elem is 16 bits, some implementations use UTF-16 instead. The term "UCS-2" was deprecated and removed from ISO 10646.
The following example demonstrates the difference between UCS-2/UTF-8 and UTF-16/UTF-8 conversions: the third character in the string is not a valid UCS-2 character.
```cpp
#include <codecvt>
#include <cstdint>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // UTF-8 data. The character U+1d10b, musical sign segno, does not fit in UCS-2
    std::string utf8 = "z\u6c34\U0001d10b";

    // the UTF-8 / UTF-16 standard conversion facet
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> utf16conv;
    std::u16string utf16 = utf16conv.from_bytes(utf8);
    std::cout << "UTF-16 conversion produced " << utf16.size() << " code units:\n"
              << std::showbase << std::hex;
    for (char16_t c : utf16)
        std::cout << static_cast<std::uint16_t>(c) << ' ';

    // the UTF-8 / UCS-2 standard conversion facet
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> ucs2conv;
    try
    {
        std::u16string ucs2 = ucs2conv.from_bytes(utf8);
    }
    catch (const std::range_error& e)
    {
        // convert only the prefix that was successfully consumed
        std::u16string ucs2 = ucs2conv.from_bytes(utf8.substr(0, ucs2conv.converted()));
        std::cout << "\nUCS-2 failed after producing " << std::dec << ucs2.size()
                  << " characters:\n" << std::showbase << std::hex;
        for (char16_t c : ucs2)
            std::cout << static_cast<std::uint16_t>(c) << ' ';
        std::cout << '\n';
    }
}
```
Output:
```
UTF-16 conversion produced 4 code units:
0x7a 0x6c34 0xd834 0xdd0b
UCS-2 failed after producing 2 characters:
0x7a 0x6c34
```
The following behavior-changing defect reports were applied retroactively to previously published C++ standards.
| DR | Applied to | Behavior as published | Correct behavior |
|---|---|---|---|
| LWG 2229 | C++11 | the constructor and destructor were not specified | specifies them |
| Character conversions | locale-defined multibyte (UTF-8, GB18030) | UTF-8 | UTF-16 |
|---|---|---|---|
| UTF-16 | mbrtoc16 / c16rtomb (with C11's DR488) | codecvt<char16_t,char,mbstate_t> | N/A |
| UCS-2 | c16rtomb (without C11's DR488) | codecvt_utf8<char16_t> | codecvt_utf16<char16_t> |
| UTF-32 | mbrtoc32 / c32rtomb | codecvt<char32_t,char,mbstate_t> | codecvt_utf16<char32_t> |
| system wchar_t: UTF-32 (non-Windows) | mbsrtowcs / wcsrtombs | codecvt_utf8<wchar_t> | codecvt_utf16<wchar_t> |
| See also | Description |
|---|---|
| codecvt | converts between character encodings, including UTF-8, UTF-16, UTF-32 (class template) |
| codecvt_mode (C++11)(deprecated in C++17)(removed in C++26) | tags to alter behavior of the standard codecvt facets (enum) |
| codecvt_utf16 (C++11)(deprecated in C++17)(removed in C++26) | converts between UTF-16 and UCS-2/UCS-4 (class template) |
| codecvt_utf8_utf16 (C++11)(deprecated in C++17)(removed in C++26) | converts between UTF-8 and UTF-16 (class template) |