String encoding conversion in Julia using iconv


JuliaStrings/StringEncodings.jl


This Julia package provides support for decoding and encoding texts between multiple character encodings. It is currently based on the iconv interface, and supports all major platforms using GNU libiconv.

Encoding and Decoding Strings

Encoding refers to the process of converting a string (of any AbstractString type) to a sequence of bytes represented as a Vector{UInt8}. Decoding refers to the inverse process.

    julia> using StringEncodings

    julia> encode("café", "UTF-16")
    10-element Array{UInt8,1}:
     0xff
     0xfe
     0x63
     0x00
     0x61
     0x00
     0x66
     0x00
     0xe9
     0x00

    julia> decode(ans, "UTF-16")
    "café"

Use the encodings function to get the list of all supported encodings on the current platform:

    julia> encodings()
    411-element Array{String,1}:
     "850"
     "862"
     "866"
     "ANSI_X3.4-1968"
     "ANSI_X3.4-1986"
     "ARABIC"
     "ARMSCII-8"
     "ASCII"
     "ASMO-708"
     ⋮
     "WINDOWS-1257"
     "windows-1258"
     "WINDOWS-1258"
     "windows-874"
     "WINDOWS-874"
     "WINDOWS-936"
     "X0201"
     "X0208"
     "X0212"

(Note that many of these are aliases for standard names.)

The Encoding type

In the examples above, the encoding was specified as a standard string. However, to avoid ambiguities in multiple dispatch and to increase performance via type specialization, the package offers a special Encoding parametric type. Each parameterization of this type represents a character encoding. The non-standard string literal enc can be used to create an instance of this type, like so: enc"UTF-16".

Since there is no ambiguity, the encode and decode functions accept either a string or an Encoding object. On the other hand, the other functions presented below only support the latter, to avoid creating conflicts with other packages extending Julia Base methods.
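As a minimal sketch of the two calling styles, the snippet below encodes the same string once with a plain string and once with an Encoding object. It assumes that "UTF-16LE" is among the names returned by encodings() on the current platform; that encoding is chosen here only because it writes no byte order mark, which makes the resulting bytes easy to compare.

```julia
using StringEncodings

# Same conversion, with the encoding given as a String and as an Encoding object.
# Assumption: "UTF-16LE" is supported by the iconv library on this platform.
bytes_from_string = encode("café", "UTF-16LE")
bytes_from_object = encode("café", enc"UTF-16LE")

# Both forms are expected to yield the same bytes
same_bytes = bytes_from_string == bytes_from_object

# Decoding with the Encoding object should give back the original string
round_trip = decode(bytes_from_object, enc"UTF-16LE")
```

The string form is convenient when the encoding name is only known at run time, for example when it comes from a file header or user input.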

In future versions, the Encoding type will allow getting information about character encodings, and will be used to improve the performance of conversions.

Reading from and Writing to Encoded Text Files

The package also provides several simple methods to deal with files containing encoded text. They extend the equivalent functions from Julia Base, which only support text stored in the UTF-8 encoding.

A method for open is provided to write a string in encoded form to a file:

    julia> path = tempname();

    julia> f = open(path, enc"UTF-16", "w");

    julia> write(f, "café\nnoël");

    julia> close(f); # Essential to complete encoding
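Because closing the stream is what completes the encoding, one way to avoid forgetting it is the do-block form of open. The sketch below assumes that Julia's generic open(f, args...) method forwards to the encoding-aware open shown above, as the function-first calls elsewhere in this README suggest; the variable names are illustrative.

```julia
using StringEncodings

path = tempname()

# The do-block form closes the stream when the block exits, even if an
# error is thrown, so the final flush that completes the encoding happens.
open(path, enc"UTF-16", "w") do f
    write(f, "café\nnoël")
end

# Read the file back, passing the same encoding
text = read(path, String, enc"UTF-16")
```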

The contents of the file can then be read back using read(path, String):

    julia> read(path, String) # Standard function expects UTF-8
    "\xfe\xff\0c\0a\0f\0\xe9\0\n\0n\0o\0\xeb\0l"

    julia> read(path, String, enc"UTF-16") # Works when passing the correct encoding
    "café\nnoël"

Other variants of standard convenience functions are provided:

    julia> readline(path, enc"UTF-16")
    "café"

    julia> readlines(path, enc"UTF-16")
    2-element Array{String,1}:
     "café"
     "noël"

    julia> for l in eachline(path, enc"UTF-16")
               println(l)
           end
    café
    noël

    julia> readuntil(path, enc"UTF-16", "ë")
    "café\nno"

    julia> String(read(path, enc"UTF-16"))
    "café\nnoël"

    julia> String(read(path, 5, enc"UTF-16"))
    "café"

When performing more complex operations on an encoded text file, it will often be easier to specify the encoding only once when opening it. The resulting I/O stream can then be passed to functions that are unaware of encodings (i.e. that assume UTF-8 text):

    julia> io = open(path, enc"UTF-16");

    julia> read(io, String)
    "café\nnoël"

In particular, this method allows reading encoded comma-separated values (CSV) and other character-delimited text files:

    julia> using DelimitedFiles

    julia> open(io -> readdlm(io, ','), path, enc"UTF-16") # readcsv was removed in Julia 1.0; readdlm with ',' replaces it
    2x1 Array{Any,2}:
     "café"
     "noël"

Advanced Usage: StringEncoder and StringDecoder

The convenience functions presented above are based on the StringEncoder and StringDecoder types, which wrap I/O streams and offer on-the-fly character encoding conversion facilities. They can be used directly if you need to work with encoded text on an already existing I/O stream. This can be illustrated using an IOBuffer:

    julia> b = IOBuffer();

    julia> s = StringEncoder(b, "UTF-16");

    julia> write(s, "café"); # Encoding happens automatically here

    julia> close(s); # Essential to complete encoding

    julia> seek(b, 0); # Move to start of buffer

    julia> s = StringDecoder(b, "UTF-16");

    julia> read(s, String) # Decoding happens automatically here
    "café"

Do not forget to call close on StringEncoder and StringDecoder objects to finish the encoding process. For StringEncoder, this function calls flush, which writes any characters still in the buffer, and possibly some control sequences (for stateful encodings). For both StringEncoder and StringDecoder, close checks that there are no incomplete sequences left in the input stream, and raises an IncompleteSequenceError if that is the case. It also frees iconv resources immediately, instead of waiting for garbage collection.
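One possible way to make sure close is always reached is a try/finally block around the writes. This is only a sketch: the variable names are illustrative, and "UTF-16LE" is assumed to be available on the current platform (it is used because it emits no byte order mark).

```julia
using StringEncodings

buf = IOBuffer()
encoder = StringEncoder(buf, "UTF-16LE")
try
    write(encoder, "noël")
finally
    # close flushes pending output and releases iconv resources,
    # even if the write above throws
    close(encoder)
end

# As in the IOBuffer example above, rewind the buffer before reading it back
seek(buf, 0)
encoded = read(buf)
```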

Conversion currently raises an error if an invalid byte sequence is encountered in the input, or if some characters cannot be represented in the target encoding. It is not yet possible to ignore such characters or to replace them with a placeholder.
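Since a failed conversion raises an error, callers who want to fall back to another behavior can wrap the call themselves. The sketch below uses a hypothetical helper named try_encode, which is not part of the package. It catches any error rather than a specific type, because the concrete error type thrown in each case is not assumed here.

```julia
using StringEncodings

# Hypothetical helper (not part of StringEncodings): returns the encoded
# bytes, or `nothing` when the conversion raises an error.
function try_encode(s::AbstractString, encoding::AbstractString)
    try
        return encode(s, encoding)
    catch
        # Deliberately broad: the exact error type is not assumed
        return nothing
    end
end

ok = try_encode("cafe", "ASCII")      # every character exists in ASCII
failed = try_encode("café", "ASCII")  # "é" cannot be represented in ASCII
```

A caller can then test the result against nothing and, for instance, retry with a wider target encoding such as UTF-8.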
