Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Commite1e133f

Browse files
committed
Add to locale TODO.detail.
1 parenta9d2cc4 commite1e133f

File tree

1 file changed

+377
-0
lines changed

1 file changed

+377
-0
lines changed

‎doc/TODO.detail/locale

Lines changed: 377 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2662,3 +2662,380 @@ TIP 9: In versions below 8.0, the planner will ignore your desire to
26622662
choose an index scan if your joining column's datatypes do not
26632663
match
26642664

2665+
From pgsql-hackers-owner+M77964=pgman=candle.pha.pa.us@postgresql.org Tue Dec 27 06:30:14 2005
2666+
Return-path: <pgsql-hackers-owner+M77964=pgman=candle.pha.pa.us@postgresql.org>
2667+
Received: from ams.hub.org (ams.hub.org [200.46.204.13])
2668+
by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id jBRDUET16504
2669+
for <pgman@candle.pha.pa.us>; Tue, 27 Dec 2005 08:30:14 -0500 (EST)
2670+
Received: from postgresql.org (postgresql.org [200.46.204.71])
2671+
by ams.hub.org (Postfix) with ESMTP id 3BCBA67A57F
2672+
for <pgman@candle.pha.pa.us>; Tue, 27 Dec 2005 09:30:14 -0400 (AST)
2673+
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
2674+
Received: from localhost (av.hub.org [200.46.204.144])
2675+
by postgresql.org (Postfix) with ESMTP id 1DB3B9DC859
2676+
for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Tue, 27 Dec 2005 09:29:47 -0400 (AST)
2677+
Received: from postgresql.org ([200.46.204.71])
2678+
by localhost (av.hub.org [200.46.204.144]) (amavisd-new, port 10024)
2679+
with ESMTP id 89101-04
2680+
for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
2681+
Tue, 27 Dec 2005 09:29:50 -0400 (AST)
2682+
X-Greylist: from auto-whitelisted by SQLgrey-
2683+
Received: from svana.org (svana.org [203.20.62.76])
2684+
by postgresql.org (Postfix) with ESMTP id 319839DC8A5
2685+
for <pgsql-hackers@postgresql.org>; Tue, 27 Dec 2005 09:29:44 -0400 (AST)
2686+
Received: from kleptog by svana.org with local (Exim 3.35 #1 (Debian))
2687+
id 1ErEte-0001lm-00; Wed, 28 Dec 2005 00:29:46 +1100
2688+
Date: Tue, 27 Dec 2005 14:29:46 +0100
2689+
From: Martijn van Oosterhout <kleptog@svana.org>
2690+
To: pgsql-hackers@postgresql.org
2691+
Subject: [HACKERS] Proposed COLLATE implementation
2692+
Message-ID: <20051227132941.GA32404@svana.org>
2693+
Reply-To: Martijn van Oosterhout <kleptog@svana.org>
2694+
MIME-Version: 1.0
2695+
Content-Type: multipart/signed; micalg=pgp-sha1;
2696+
protocol="application/pgp-signature"; boundary="W/nzBZO5zC0uMSeA"
2697+
Content-Disposition: inline
2698+
User-Agent: Mutt/1.3.28i
2699+
X-PGP-Key-ID: Length=1024; ID=0x0DC67BE6
2700+
X-PGP-Key-Fingerprint: 295F A899 A81A 156D B522 48A7 6394 F08A 0DC6 7BE6
2701+
X-PGP-Key-URL: <http://svana.org/kleptog/0DC67BE6.pgp.asc>
2702+
X-Virus-Scanned: by amavisd-new at hub.org
2703+
X-Spam-Status: No, score=0.08 required=5 tests=[AWL=0.080]
2704+
X-Spam-Score: 0.08
2705+
X-Mailing-List: pgsql-hackers
2706+
List-Archive: <http://archives.postgresql.org/pgsql-hackers>
2707+
List-Help: <mailto:majordomo@postgresql.org?body=help>
2708+
List-Id: <pgsql-hackers.postgresql.org>
2709+
List-Owner: <mailto:pgsql-hackers-owner@postgresql.org>
2710+
List-Post: <mailto:pgsql-hackers@postgresql.org>
2711+
List-Subscribe: <mailto:majordomo@postgresql.org?body=sub%20pgsql-hackers>
2712+
List-Unsubscribe: <mailto:majordomo@postgresql.org?body=unsub%20pgsql-hackers>
2713+
Precedence: bulk
2714+
Sender: pgsql-hackers-owner@postgresql.org
2715+
Status: OR
2716+
2717+
2718+
--W/nzBZO5zC0uMSeA
2719+
Content-Type: text/plain; charset=us-ascii
2720+
Content-Disposition: inline
2721+
Content-Transfer-Encoding: quoted-printable
2722+
2723+
Greetings all,
2724+
2725+
If you're not interested in COLLATE, operator classes or related
2726+
things, stop now, this is quite a long email.
2727+
2728+
Firstly, status. PostgreSQL doesn't really support collations at all.
2729+
The order of strings is defined at initdb time by the locale then and
2730+
cannot be changed later. We allow lists to be sorted in either
2731+
ascending or decending order but that's about it. Whatever order there
2732+
is is deduced from b-tree operator classes.
2733+
2734+
The purpose of this patch is to raise collations to (reasonably) first
2735+
class object. The idea is that you can define a collation across any
2736+
type and that you will then be able to ORDER BY, GROUP BY and INDEX
2737+
using that collation. A collation defines both order and equality.
2738+
2739+
The SQL standard does define COLLATE although they only apply that to
2740+
character strings. There are no predefined collations in the standard.
2741+
There are rules about how collations and collation states propegate
2742+
from the leaves of the parse tree all the way to the root. In its
2743+
simplest form, columns and constants have defined collations which
2744+
modify the behaviour of functions using these values. At any point in
2745+
the parse tree the user can override the collation with an explicit
2746+
<collate clause>. If there is ambiguity about what collation applies at
2747+
any point for a function that needs to know, this is a error.
2748+
2749+
All this is parse-time analysis.
2750+
2751+
Proposed Implementation:
2752+
2753+
NODES
2754+
2755+
To implement the above, two new node types are created: CollateClause
2756+
which represents the <collate clause> in SQL syntax, and CollateState
2757+
which represents the actual state at any node. Currently the only nodes
2758+
expected to require these are OpExpr, FuncExpr, Var and Const. Although
2759+
I guess it may apply to any node that can be used in an expression.
2760+
2761+
CATALOG CHANGES
2762+
2763+
To track collations requires a new table in the catalog, which I have
2764+
named pg_collations. It contains the following fields:
2765+
2766+
Oid oid; -- OID for this collation
2767+
Name collname; -- Name of the collation (for collate clause)
2768+
bool collasc; -- Ascending or descending
2769+
Oid collopclass; -- Implementing Operator Class
2770+
int4 colltype; -- Currently, 0=3Dsimple, 1=3Duses locale
2771+
Oid colllocale; -- Locale in pg_locales
2772+
2773+
(Should we be identifying the type here? or is it ok to lookup the type
2774+
via the operator class).
2775+
2776+
The first few fields name the collation so it can be referred to by the
2777+
user. Then the collasc field determines how to use the operator class
2778+
as given in the collopclass field. If it indicates descending order, it
2779+
will invert the sense of the operator class. For example, asking for
2780+
the GT op for a reverse collation will actually return the LT operator
2781+
for the operator class.
2782+
2783+
The purpose of the colltype and colllocale fields are described further
2784+
down.
2785+
2786+
The important thing at this point is that by specifying a collation you
2787+
are also specifying an operator class. At the moment the ascending and
2788+
descending collations for each type are hard-coded for initdb. At the
2789+
moment they have been allocated OIDs starting at 2800, which is the
2790+
first large available block.
2791+
2792+
Each column of a table has a default collation, which defaults to the
2793+
default collation of the type but can be specified in the table
2794+
declaration. To store this requires an additional column in
2795+
pg_attribute (attcollate) which contains the OID of the collation for
2796+
that column. When it is referenced in a query, this collation is copied
2797+
to the CollateState node of the Var node, from whence it can affect the
2798+
query.
2799+
2800+
Finally, to allow the parser to complain about ambiguous CollateStates,
2801+
we need to indicate which functions actually need a senseble collate
2802+
state to function. To this end a single boolean field has been added to
2803+
pg_proc (proneedcollate). If this is true, the parser should error out
2804+
when the collation state is COLLATE_NONE.
2805+
2806+
INDEXES
2807+
2808+
Another place you will be allowed to use the collate clause is while
2809+
creating indexes. If you declare an index using a particular collation,
2810+
it can be used in queries that order by the same collation. Note, that
2811+
the collate clause indicates the operator class, so you can either
2812+
specify one or the other, but not both.
2813+
2814+
So each column of an index will also have a collation. However,
2815+
pg_attribute has already got an extra field to store the collation for
2816+
columns so it makes sense to store the collation here. In the process
2817+
the pg_index.indclass field becomes redundant as it can be inferred
2818+
from the pg_attribute rows associated with the index.
2819+
2820+
To make this work there also needs to be a notion of compatability
2821+
between collations. For example, two collations which are the reverse
2822+
of eachother are compatable in the sense that an index defined with one
2823+
collation would be usable for the other simply by scanning in reverse.
2824+
2825+
FUNCTIONS
2826+
2827+
In particular for string comparison but also possibly for user-defined
2828+
types, a function will need to know what collation it is operating
2829+
under. For this purpose an extra field (fn_collate) is added to
2830+
FmgrInfo which is filled in with the collation from the expression tree
2831+
(if any) or wherever relevent (eg. from the pg_attribute column when
2832+
doing statistics or creating indexes).
2833+
2834+
A PG_GETCOLLATE() macro is added to facilitate user-functions
2835+
retreiving this data. This function throws an error when no collation
2836+
has been defined. This shouldn't happen in practice as issues should
2837+
have been weeded out at parse-time.
2838+
2839+
This macro returns the OID of the collation but in many cases it will
2840+
not be necessary. In particular, functions should NOT invert their
2841+
result if the collation is inverted. It is considered the
2842+
responsibility of the caller to invert the result if necessary. The
2843+
reasons for this are:
2844+
2845+
1. In most cases that matter (order comparison) the issue can be dealt
2846+
with at parse time by the NEGATOR or COMMUTATOR options.
2847+
2. For index scans, we would just do a reverse scan instead (or forward
2848+
if the index is inverted)
2849+
3. Requiring every function to check the collation for inversion is
2850+
wasteful, since in many cases the case can be dealt with statically.
2851+
2852+
DEFAULT COLLATIONS
2853+
2854+
At this point I'm inclined to define a few collations to be built in or
2855+
specially handled:
2856+
2857+
COLLATE ASC - default collation for type, ascending (ie, what we do now)
2858+
COLLATE DESC - default collation for type, inverted
2859+
COLLATE POSIX - For strings, define a simple bytewise string comparison.
2860+
2861+
Indeed, it is expected that by default, all columns involving strings
2862+
in the system catalog will always use COLLATE POSIX. Additionaly, type
2863+
"name" will always use that collation, even if the user changes the
2864+
default (by a method to be specified). This is straightforwardly done
2865+
at initdb time.
2866+
2867+
The purpose of COLLATE DESC is to simplify index declarations. Saying
2868+
2869+
CREATE INDEX foo ON bar( a COLLATE ASC, b COLLATE DESC );
2870+
2871+
would allow it to be used in a query using ORDER BY a, b DESC, without
2872+
the user having to lookup the name of the collation.
2873+
2874+
When it comes to naming collations, the question arises whether
2875+
ascending/descending collations need to have different names. Or
2876+
should there be two collations with the same name with ASC/DESC as a
2877+
modifier? Do collations have to be unique across different types; for
2878+
example, can varchar and text both have a collation "ignorecase"?
2879+
2880+
Another issue is that a column could be declared with a descending
2881+
collation by default. Say it was an integer column, then (a < 5) would
2882+
return FALSE for a =3D 1. While technically correct, I'm thinking of
2883+
ruling it out on the basis of being too confusing, and only allow
2884+
descending collations at query time or in index specifications.
2885+
2886+
Another strange point at the moment is how to determine the default
2887+
collation of a type. At the moment it is done by finding the default
2888+
operator class and looking up the ascending version of that. However,
2889+
we may want to provide the user a way of specifying it directly,
2890+
perhaps by:
2891+
2892+
ALTER TYPE text SET DEFAULT COLLATION ignorecase;
2893+
2894+
PATHKEYS
2895+
2896+
Currently during planning, pathkeys are indicated by an operator of the
2897+
operator class. Here we would simply replace that with the oid of the
2898+
collation, which can be matched directly with the collation defined by
2899+
the index.
2900+
2901+
USER DEFINED TYPES
2902+
2903+
None of this is interesting unless it can be applied to user-defined
2904+
types also. Fortunatly this is easy, when the user declares a b-tree
2905+
operator class, we can generate the collations automatically. We may
2906+
even allow the user to specify the name of the collation. However, if
2907+
the user wanted to create multiple collations based on the same
2908+
operator class (by using the PG_GETCOLLATE() macro above, we may want
2909+
to provide them a way of creating them directly.
2910+
2911+
COLLATIONS USING LOCALES
2912+
2913+
For strings, collation can be done in many different ways defined by
2914+
what is referred to as a locale. As indicated above in the definition
2915+
of pg_collations, there is a colltype field. For most collations this
2916+
will be 0 (simple collation). However, for strings the intention is to
2917+
use a type 1 (using locales). In this case the last column refers to
2918+
the OID of the locale, so you can many collations using the same
2919+
operator class, but different locale oids. On a system level it changes
2920+
nothing, but inside the functions implementing it, they should use
2921+
PG_GETLOCALE(). This will return an opaque pg_locale_t (see below)
2922+
handle which can then be used to implement the specifics.
2923+
2924+
In principle, user-defined types need to be able to use this also,
2925+
perhaps by using the clause COLLATE USING LOCALE in the operator class.
2926+
In theory there should a collation for each combination of
2927+
locale-dependant datatype, locale and order ascending/descending.
2928+
How/when these are created has not yet be determined.
2929+
2930+
MORE TYPES OF COLLATION
2931+
2932+
Another collation type I've speculated about but not thought about
2933+
implementing is a "mapping collation", in which you map the values
2934+
through a function and then collate that. The obvious example would be
2935+
a case-insensetive mapping where lower is applied before collation.
2936+
2937+
Implementation could be pretty much done by simply substituting the
2938+
functions into the parse tree. For example, if you defined something
2939+
like:
2940+
2941+
CREATE COLLATION ignorecase ON text USING lower($1) COLLATE defaulttext;
2942+
2943+
Then anytime you did a comparison with that collation, you would simply
2944+
insert those function calls into the parse-tree and then collate with
2945+
"defaulttext". When declaring an index you would just make it a
2946+
functional index. The rules for functional indexes should make it work
2947+
out-of-the-box.
2948+
2949+
OTHER TECHNICAL ISSUES
2950+
2951+
- Applying a COLLATE clause to an unknown literal causes it to be
2952+
coerced to the type that collation is based on. But what about if we
2953+
have something like COLLATE DESC?
2954+
2955+
- This requires some changes in the bootstrap procedures given that we
2956+
need to be able to do lookups on the operator class for each type
2957+
fairly early on. Some are predefined but it does require moving the
2958+
opclass setup further up the list. However, if we store a default
2959+
collation in pg_type, we wouldn't need to do that.
2960+
2961+
- Sorting arrays. Should they get their own collations, or should
2962+
they use the collations of their base types.
2963+
2964+
LOCALES
2965+
2966+
I've left this to the end because I don't want people distracted by
2967+
what is essentially a side-issue. While this would be needed to
2968+
implement COLLATE the way the SQL spec intended, it can actually be
2969+
implemented and dealt with as a seperate patch. The main reason a basic
2970+
implementation exists is that it provides a great way of finding places
2971+
that didn't define a collation, since any comparison involving "text"
2972+
requires one.
2973+
2974+
To deal with locales I created another table in the catalog,
2975+
pg_locales. This provides an OID which can be referenced from
2976+
elsewhere, such as the pg_collations table.
2977+
2978+
The design is intended to provide some pluggability, so locale
2979+
information can come from multiple sources. Also, each locale will be
2980+
referenced by an identifier which is unrelated to any external
2981+
identifier, so we're not bound by them.
2982+
2983+
The columns defined currently are:
2984+
2985+
Namelocname - Identifier used by postgresql
2986+
Namelocsysname - String identifying the locale for the locale provider
2987+
int4locsource - System providing this locale
2988+
int4locencoding - Encoding expected by provider
2989+
2990+
It is expected that the list of sources for locale data will be short,
2991+
probably hard-coded into the backend (currenty internal/system/icu).
2992+
The only locale defined at startup is POSIX, which is implemented
2993+
internally. The intention is for any other locales to be defined at the
2994+
end of initdb. The expected syntax is something like:
2995+
2996+
CREATE LOCALE hungarian AS 'hu_HU' USING glibc;
2997+
2998+
This should use the provider to check the locale exists and has a
2999+
conpatible encoding. If so it is entered into the table ready for use.
3000+
3001+
In the backend, there will be implementations of functions like
3002+
pg_strcoll_l, pg_localeconv_l, which work like the C system library
3003+
versions only they take an extra pg_locale_t argument. This is used to
3004+
dispatch the call to the right place. There will be a function to
3005+
quickly determine if a locale is C to shortcircuit complexity where it
3006+
is not needed.
3007+
3008+
STATUS
3009+
3010+
Implementation so far is available here:
3011+
3012+
http://svana.org/kleptog/temp/collate-current.patch.gz
3013+
3014+
This patch isn't "clean" and changes a few things that are not strictly
3015+
necessary. It won't finish initdb right now because it gets an error in
3016+
ANALYSE (the array issue above).
3017+
3018+
Feedback, help, comments: please reply.
3019+
3020+
Have a nice day,
3021+
--=20
3022+
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
3023+
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
3024+
> tool for doing 5% of the work and then sitting around waiting for someone
3025+
> else to do the other 95% so you can sue them.
3026+
3027+
--W/nzBZO5zC0uMSeA
3028+
Content-Type: application/pgp-signature
3029+
Content-Disposition: inline
3030+
3031+
-----BEGIN PGP SIGNATURE-----
3032+
Version: GnuPG v1.0.6 (GNU/Linux)
3033+
Comment: For info see http://www.gnupg.org
3034+
3035+
iD8DBQFDsUHDIB7bNG8LQkwRAnh0AJ0YUNLkVaSY3u0jWBPdlaq+9dujZACfTmis
3036+
JR1mF60lKx14Ih850p3lpVk=
3037+
=/Ghs
3038+
-----END PGP SIGNATURE-----
3039+
3040+
--W/nzBZO5zC0uMSeA--
3041+

0 commit comments

Comments
 (0)

[8]ページ先頭

©2009-2025 Movatter.jp