Movatterモバイル変換
[0]ホーム
Efficient implementation of Normalization checking for XML 1.1
We now have two versions, one for UTF-16 and one for UTF-8.
- Modifiedcharlint to generate the necessary data file in C format (option -nf16check): Seeversion 0.53 or higher
- Files: The source files have been moved topublic CVS
- nf16check.h: Include file for programs that want to use checking for UTF-16
- nf16check.c: File that implements checking object/algorithm for UTF-16
- nf8check.h: Include file for programs that want to use checking for UTF-8
- nf8check.c: File that implements checking object/algorithm for UTF-8
- nf16data.h: Include file for checking data structure (no need to include in application programs).NOTE: both nf8check and nf16check currently use nf16data.
- nf16data.c: Actual data file. The C source is huge (700K), but compiled, this should be less than 50k. This can be linked statically; no data in this file is ever changed.
NOTE: The code compiles, but has not really been tested.
- How to use:
- For each thread/entity/whatever that you want to check concurrently, create a checker with
nf8checkNew
ornf16checkNew
- If you want to check just for NFC (without the specialW3C Character Modelstarting conditions), call
nf8checkInit
ornf16checkInit
. - To check data, pass it as 0-delimited blocks to
nf8check
ornf16check
. For the first block, set startP to check W3C Character Model starting conditions. Block boundaries can be at any point (e.g. between high and low surrogate, or between base and combining character or between two combining characters). However, for UTF-16, do not pass in a string with only a high surrogate on the first call (i.e. when startP is set); this will give wrong results. For UTF-8, there are no restrictions on block boundaries at all. Also, it is assumed that the data is clean, i.e. for UTF-16, it is assumed that surrogates always appear in high-low pairs, and for UTF-8, it is assumed that the leading/trailing patterns are correct, or are checked elsewhere. - You can reuse a checker to check something completely different as soon as it is no longer needed. If you want to change whether the checker checks for existing characters, call
nf8checkExists
ornf16checkExists
. - Once a checker is no longer needed, delete with
nf16checkDelete
ornf16checkDelete
- Data used:
Overall, on a Win2000 box with cygwin gcc, this takes 45604 bytes. Storage allocation is straightforward, so this shouldn't differ much on other systems.
- Algorithm sketch:
- At the start of a stretch, check whether first character is allowed. If not, stop.
- For each character, check whether allowed. If not, stop. If yes, store one character in a buffer.
- If 'trailer' (combining) is detected, go into extended mode, using a list of characters including:
- starter/base of combining sequence
- last previously found combining character
- For each new trailer (including the first one):
- compare for adequate combining class sorting with previous combining character
- Check for potential combination with starter. The list of recombinations is calculated so that any combining character that would lead to a change (full combination, recombination so that the combining character gets combined in but another combining character is separated, complete decomposition,...) is listed. 2176 such pairs have been found.
- Ideas for higher efficiency:
- Combine the code with some other code, e.g. code checking correct UTF-8/UTF-16, or doing some lookups.
- Use 'register' declarations in
nfXcheck
, pretty much in the order the variables appear. - Create versions of
nfXcheck
that deal with special calling conditions only. - Create more inner loops, similar to the inner loop for
c <= noPROBLEMmax
. This can eliminate assignements forstartP
,lastclass
, andstarterflag
. - Change binary sorts to register arithmetic.
- Change flag table to use one byte per flag to speed up lookup.
Started 2003-06-06 by Martin Dürst. Additions 2003-06-09 (completed UTF-16code) and 2003-06-14 (completed UTF-8 code).
$Id: Overview.html,v 1.2 2003/08/29 21:23:24 duerst Exp $
[8]ページ先頭