Movatterモバイル変換
[0]ホーム
{-# LANGUAGE Trustworthy #-}{-# LANGUAGE NoImplicitPrelude #-}------------------------------------------------------------------------------- |-- Module : GHC.IO.Encoding.Failure-- Copyright : (c) The University of Glasgow, 2008-2011-- License : see libraries/base/LICENSE---- Maintainer : libraries@haskell.org-- Stability : internal-- Portability : non-portable---- Types for specifying how text encoding/decoding fails-------------------------------------------------------------------------------moduleGHC.IO.Encoding.Failure(CodingFailureMode(..),codingFailureModeSuffix,isSurrogate,recoverDecode,recoverEncode)whereimportGHC.IOimportGHC.IO.BufferimportGHC.IO.ExceptionimportGHC.BaseimportGHC.CharimportGHC.WordimportGHC.ShowimportGHC.NumimportGHC.Real(fromIntegral)--import System.Posix.Internals-- | The 'CodingFailureMode' is used to construct 'System.IO.TextEncoding's,-- and specifies how they handle illegal sequences.dataCodingFailureMode=ErrorOnCodingFailure-- ^ Throw an error when an illegal sequence is encountered|IgnoreCodingFailure-- ^ Attempt to ignore and recover if an illegal sequence is-- encountered|TransliterateCodingFailure-- ^ Replace with the closest visual match upon an illegal-- sequence|RoundtripFailure-- ^ Use the private-use escape mechanism to attempt to allow-- illegal sequences to be roundtripped.deriving(Show-- ^ @since 4.4.0.0)-- This will only work properly for those encodings which are-- strict supersets of ASCII in the sense that valid ASCII data-- is also valid in that encoding. This is not true for-- e.g. UTF-16, because ASCII characters must be padded to two-- bytes to retain their meaning.-- Note [Roundtripping]-- ~~~~~~~~~~~~~~~~~~~~---- Roundtripping is based on the ideas of PEP383.---- We used to use the range of private-use characters from 0xEF80 to-- 0xEFFF designated for "encoding hacks" by the ConScript Unicode Registery-- to encode these characters.---- However, people didn't like this because it means we don't get-- guaranteed roundtripping for byte sequences that look like a UTF-8-- encoded codepoint 0xEFxx.---- So now like PEP383 we use lone surrogate codepoints 0xDCxx to escape-- undecodable bytes, even though that may confuse Unicode processing-- software written in Haskell. This guarantees roundtripping because-- unicode input that includes lone surrogate codepoints is invalid by-- definition.------ When we used private-use characters there was a technical problem when it-- came to encoding back to bytes using iconv. The iconv code will not fail when-- it tries to encode a private-use character (as it would if trying to encode-- a surrogate), which means that we wouldn't get a chance to replace it-- with the byte we originally escaped.---- To work around this, when filling the buffer to be encoded (in-- writeBlocks/withEncodedCString/newEncodedCString), we replaced the-- private-use characters with lone surrogates again! Likewise, when-- reading from a buffer (unpack/unpack_nl/peekEncodedCString) we had-- to do the inverse process.---- The user of String would never see these lone surrogates, but it-- ensured that iconv will throw an error when encountering them. We-- used lone surrogates in the range 0xDC00 to 0xDCFF for this purpose.codingFailureModeSuffix::CodingFailureMode->StringcodingFailureModeSuffix :: CodingFailureMode -> StringcodingFailureModeSuffixCodingFailureModeErrorOnCodingFailure=String""codingFailureModeSuffixCodingFailureModeIgnoreCodingFailure=String"//IGNORE"codingFailureModeSuffixCodingFailureModeTransliterateCodingFailure=String"//TRANSLIT"codingFailureModeSuffixCodingFailureModeRoundtripFailure=String"//ROUNDTRIP"-- | In transliterate mode, we use this character when decoding-- unknown bytes.---- This is the defined Unicode replacement character:-- <http://www.fileformat.info/info/unicode/char/0fffd/index.htm>unrepresentableChar::CharunrepresentableChar :: CharunrepresentableChar=Char'\xFFFD'-- It is extraordinarily important that this series of-- predicates/transformers gets inlined, because they tend to be used-- in inner loops related to text encoding. In particular,-- surrogatifyRoundtripCharacter must be inlined (see #5536)-- | Some characters are actually "surrogate" codepoints defined for-- use in UTF-16. We need to signal an invalid character if we detect-- them when encoding a sequence of 'Char's into 'Word8's because they-- won't give valid Unicode.---- We may also need to signal an invalid character if we detect them-- when encoding a sequence of 'Char's into 'Word8's because the-- 'RoundtripFailure' mode creates these to round-trip bytes through-- our internal UTF-16 encoding.{-# INLINEisSurrogate#-}isSurrogate::Char->BoolisSurrogate :: Char -> BoolisSurrogateCharc=(Int0xD800Int -> Int -> Boolforall a. Ord a => a -> a -> Bool<=IntxBool -> Bool -> Bool&&IntxInt -> Int -> Boolforall a. Ord a => a -> a -> Bool<=Int0xDBFF)Bool -> Bool -> Bool||(Int0xDC00Int -> Int -> Boolforall a. Ord a => a -> a -> Bool<=IntxBool -> Bool -> Bool&&IntxInt -> Int -> Boolforall a. Ord a => a -> a -> Bool<=Int0xDFFF)wherex :: Intx=Char -> IntordCharc-- Bytes (in Buffer Word8) --> lone surrogates (in Buffer CharBufElem){-# INLINEescapeToRoundtripCharacterSurrogate#-}escapeToRoundtripCharacterSurrogate::Word8->CharescapeToRoundtripCharacterSurrogate :: Word8 -> CharescapeToRoundtripCharacterSurrogateWord8b|Word8bWord8 -> Word8 -> Boolforall a. Ord a => a -> a -> Bool<Word8128=Int -> Charchr(Word8 -> Intforall a b. (Integral a, Num b) => a -> bfromIntegralWord8b)-- Disallow 'smuggling' of ASCII bytes. For roundtripping to-- work, this assumes encoding is ASCII-superset.|Boolotherwise=Int -> Charchr(Int0xDC00Int -> Int -> Intforall a. Num a => a -> a -> a+Word8 -> Intforall a b. (Integral a, Num b) => a -> bfromIntegralWord8b)-- Lone surrogates (in Buffer CharBufElem) --> bytes (in Buffer Word8){-# INLINEunescapeRoundtripCharacterSurrogate#-}unescapeRoundtripCharacterSurrogate::Char->MaybeWord8unescapeRoundtripCharacterSurrogate :: Char -> Maybe Word8unescapeRoundtripCharacterSurrogateCharc|Int0xDC80Int -> Int -> Boolforall a. Ord a => a -> a -> Bool<=IntxBool -> Bool -> Bool&&IntxInt -> Int -> Boolforall a. Ord a => a -> a -> Bool<Int0xDD00=Word8 -> Maybe Word8forall a. a -> Maybe aJust(Int -> Word8forall a b. (Integral a, Num b) => a -> bfromIntegralIntx)-- Discard high byte|Boolotherwise=Maybe Word8forall a. Maybe aNothingwherex :: Intx=Char -> IntordCharcrecoverDecode::CodingFailureMode->BufferWord8->BufferChar->IO(BufferWord8,BufferChar)recoverDecode :: CodingFailureMode-> Buffer Word8 -> Buffer Char -> IO (Buffer Word8, Buffer Char)recoverDecodeCodingFailureModecfminput :: Buffer Word8input@Buffer{bufRaw :: forall e. Buffer e -> RawBuffer ebufRaw=RawBuffer Word8iraw,bufL :: forall e. Buffer e -> IntbufL=Intir,bufR :: forall e. Buffer e -> IntbufR=Int_}output :: Buffer Charoutput@Buffer{bufRaw :: forall e. Buffer e -> RawBuffer ebufRaw=RawBuffer Charoraw,bufL :: forall e. Buffer e -> IntbufL=Int_,bufR :: forall e. Buffer e -> IntbufR=Intow}=do--puts $ "recoverDecode " ++ show ircaseCodingFailureModecfmofCodingFailureModeErrorOnCodingFailure->IO (Buffer Word8, Buffer Char)forall a. IO aioe_decodingErrorCodingFailureModeIgnoreCodingFailure->(Buffer Word8, Buffer Char) -> IO (Buffer Word8, Buffer Char)forall (m :: * -> *) a. Monad m => a -> m areturn(Buffer Word8input{bufL :: IntbufL=IntirInt -> Int -> Intforall a. Num a => a -> a -> a+Int1},Buffer Charoutput)CodingFailureModeTransliterateCodingFailure->doIntow'<-RawBuffer Char -> Int -> Char -> IO IntwriteCharBufRawBuffer CharorawIntowCharunrepresentableChar(Buffer Word8, Buffer Char) -> IO (Buffer Word8, Buffer Char)forall (m :: * -> *) a. Monad m => a -> m areturn(Buffer Word8input{bufL :: IntbufL=IntirInt -> Int -> Intforall a. Num a => a -> a -> a+Int1},Buffer Charoutput{bufR :: IntbufR=Intow'})CodingFailureModeRoundtripFailure->doWord8b<-RawBuffer Word8 -> Int -> IO Word8readWord8BufRawBuffer Word8irawIntirIntow'<-RawBuffer Char -> Int -> Char -> IO IntwriteCharBufRawBuffer CharorawIntow(Word8 -> CharescapeToRoundtripCharacterSurrogateWord8b)(Buffer Word8, Buffer Char) -> IO (Buffer Word8, Buffer Char)forall (m :: * -> *) a. Monad m => a -> m areturn(Buffer Word8input{bufL :: IntbufL=IntirInt -> Int -> Intforall a. Num a => a -> a -> a+Int1},Buffer Charoutput{bufR :: IntbufR=Intow'})recoverEncode::CodingFailureMode->BufferChar->BufferWord8->IO(BufferChar,BufferWord8)recoverEncode :: CodingFailureMode-> Buffer Char -> Buffer Word8 -> IO (Buffer Char, Buffer Word8)recoverEncodeCodingFailureModecfminput :: Buffer Charinput@Buffer{bufRaw :: forall e. Buffer e -> RawBuffer ebufRaw=RawBuffer Chariraw,bufL :: forall e. Buffer e -> IntbufL=Intir,bufR :: forall e. Buffer e -> IntbufR=Int_}output :: Buffer Word8output@Buffer{bufRaw :: forall e. Buffer e -> RawBuffer ebufRaw=RawBuffer Word8oraw,bufL :: forall e. Buffer e -> IntbufL=Int_,bufR :: forall e. Buffer e -> IntbufR=Intow}=do(Charc,Intir')<-RawBuffer Char -> Int -> IO (Char, Int)readCharBufRawBuffer CharirawIntir--puts $ "recoverEncode " ++ show ir ++ " " ++ show ir'caseCodingFailureModecfmofCodingFailureModeIgnoreCodingFailure->(Buffer Char, Buffer Word8) -> IO (Buffer Char, Buffer Word8)forall (m :: * -> *) a. Monad m => a -> m areturn(Buffer Charinput{bufL :: IntbufL=Intir'},Buffer Word8output)CodingFailureModeTransliterateCodingFailure->doifCharcChar -> Char -> Boolforall a. Eq a => a -> a -> Bool==Char'?'then(Buffer Char, Buffer Word8) -> IO (Buffer Char, Buffer Word8)forall (m :: * -> *) a. Monad m => a -> m areturn(Buffer Charinput{bufL :: IntbufL=Intir'},Buffer Word8output)elsedo-- XXX: evil hack! To implement transliteration, we just-- poke an ASCII ? into the input buffer and tell the caller-- to try and decode again. This is *probably* safe given-- current uses of TextEncoding.---- The "if" test above ensures we skip if the encoding fails-- to deal with the ?, though this should never happen in-- practice as all encodings are in fact capable of-- reperesenting all ASCII characters.Int_ir'<-RawBuffer Char -> Int -> Char -> IO IntwriteCharBufRawBuffer CharirawIntirChar'?'(Buffer Char, Buffer Word8) -> IO (Buffer Char, Buffer Word8)forall (m :: * -> *) a. Monad m => a -> m areturn(Buffer Charinput,Buffer Word8output)-- This implementation does not work because e.g. UTF-16-- requires 2 bytes to encode a simple ASCII value--writeWord8Buf oraw ow unrepresentableByte--return (input { bufL=ir' }, output { bufR=ow+1 })CodingFailureModeRoundtripFailure|JustWord8x<-Char -> Maybe Word8unescapeRoundtripCharacterSurrogateCharc->doRawBuffer Word8 -> Int -> Word8 -> IO ()writeWord8BufRawBuffer Word8orawIntowWord8x(Buffer Char, Buffer Word8) -> IO (Buffer Char, Buffer Word8)forall (m :: * -> *) a. Monad m => a -> m areturn(Buffer Charinput{bufL :: IntbufL=Intir'},Buffer Word8output{bufR :: IntbufR=IntowInt -> Int -> Intforall a. Num a => a -> a -> a+Int1})CodingFailureMode_->IO (Buffer Char, Buffer Word8)forall a. IO aioe_encodingErrorioe_decodingError::IOaioe_decodingError :: IO aioe_decodingError=IOException -> IO aforall a. IOException -> IO aioException(Maybe Handle-> IOErrorType-> String-> String-> Maybe CInt-> Maybe String-> IOExceptionIOErrorMaybe Handleforall a. Maybe aNothingIOErrorTypeInvalidArgumentString"recoverDecode"String"invalid byte sequence"Maybe CIntforall a. Maybe aNothingMaybe Stringforall a. Maybe aNothing)ioe_encodingError::IOaioe_encodingError :: IO aioe_encodingError=IOException -> IO aforall a. IOException -> IO aioException(Maybe Handle-> IOErrorType-> String-> String-> Maybe CInt-> Maybe String-> IOExceptionIOErrorMaybe Handleforall a. Maybe aNothingIOErrorTypeInvalidArgumentString"recoverEncode"String"invalid character"Maybe CIntforall a. Maybe aNothingMaybe Stringforall a. Maybe aNothing)
[8]ページ先頭