On 31 May 2004 08:09:08 -0700 in comp.software.international,
Lenny Domnitser wrote:
I am working on a wiki engine that I hope to fully internationalize.
In creating paths to pages, I want to convert all spans of
non-essential characters to single hyphens. Though I can browse
through character descriptions, I need somebody well-versed in each
language supported by Unicode to help determine which characters carry
essential meaning and which are mainly punctuation, style, etc., that
need not appear in the paths.
[(0x002D, 0x002D),
(0x0030, 0x0039),
(0x0041, 0x005A),
(0x0061, 0x007A),
(0x00C0, 0x0259),
(0x0386, 0x04E9),
(0x05D0, 0x05F2),
(0x0621, 0x064A),
(0x0660, 0x0669),
(0x0670, 0x06D3),
(0x06D5, 0x06D5),
(0x06F0, 0x1EF9),
(0xFB01, 0xFC62),
(0xFDF2, 0xFEFC)]
Ranges are inclusive. Some parts of it may well be wrong (though I am
pretty confident about the ASCII ones).
Thanks in advance for any contributions, be they full or for specific
languages.
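For illustration, here is a minimal sketch of how the range list above might be applied when building page paths. The names ESSENTIAL_RANGES, is_essential and page_path are hypothetical, not taken from the wiki engine itself. It also assumes that a literal hyphen in a title should merge with the neighbouring separator rather than produce a double hyphen, and it leaves leading and trailing hyphens in place, since the post does not say how those should be handled.

```python
import bisect

# Inclusive code point ranges, copied from the list in the post above.
ESSENTIAL_RANGES = [
    (0x002D, 0x002D),
    (0x0030, 0x0039),
    (0x0041, 0x005A),
    (0x0061, 0x007A),
    (0x00C0, 0x0259),
    (0x0386, 0x04E9),
    (0x05D0, 0x05F2),
    (0x0621, 0x064A),
    (0x0660, 0x0669),
    (0x0670, 0x06D3),
    (0x06D5, 0x06D5),
    (0x06F0, 0x1EF9),
    (0xFB01, 0xFC62),
    (0xFDF2, 0xFEFC),
]

# The list is sorted by range start, so a binary search finds the candidate range.
_STARTS = [lo for lo, _ in ESSENTIAL_RANGES]


def is_essential(ch):
    """Return True if ch falls inside one of the inclusive ranges."""
    cp = ord(ch)
    i = bisect.bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= ESSENTIAL_RANGES[i][1]


def page_path(title):
    """Collapse every run of non-essential characters into a single hyphen.

    Assumption: a literal hyphen is treated like a separator, so that
    "a - b" becomes "a-b" instead of "a--b".
    """
    out = []
    for ch in title:
        if is_essential(ch) and ch != "-":
            out.append(ch)
        elif not out or out[-1] != "-":
            out.append("-")
    return "".join(out)
```

With these definitions, page_path("a - b") gives "a-b", and the trailing "!" in "Hello, World!" becomes a trailing hyphen. Whether to trim such hyphens is a separate design decision.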
The following is Annex D (normative) of the Committee Draft N869
(unchanged in the final version) of the C Standard by WG14. It was
available on the ISO WG14 web site.
As it says, it is a copy from WG20 TR 10176, which was available on
the ISO WG20 web site IIRC.
It lists all the characters that are valid in identifiers. Some of the
special characters at the bottom may not be required for your
purposes, but a little checking of the Unicode character descriptions
will let you decide that for yourself.
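One possible way to do that checking is with Python's standard unicodedata module, which reports the name and general category of a code point. This is only a sketch: the helper name describe is made up for the example, and the four code points are taken from the "Special characters" line of the listing. The results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(code_point):
    """Return (name, general category) for a code point."""
    ch = chr(code_point)
    # The second argument to name() is a fallback for unnamed code points.
    return unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch)


# A few entries from the "Special characters" line of the listing.
for cp in (0x00B5, 0x00B7, 0x203F, 0x2126):
    name, category = describe(cp)
    print(f"U+{cp:04X} {category} {name}")
```

Categories starting with "L" are letters and those starting with "P" are punctuation, which gives a rough first cut at what carries meaning in a path.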
I've crunched it down so that I could compare it to the final version,
and take up less space in the posting.
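The compacted listing below uses bare hexadecimal values and hyphenated inclusive ranges, separated by commas. A minimal sketch of turning one such line back into (low, high) tuples, in the same shape as the range list in the original post, might look like this. The function names parse_ranges and contains are hypothetical, and the Hebrew line is copied from the listing.

```python
def parse_ranges(spec):
    """Parse 'XXXX, XXXX-YYYY, ...' into sorted inclusive (lo, hi) tuples."""
    ranges = []
    # Entries may wrap across lines, so treat newlines as spaces.
    for item in spec.replace("\n", " ").split(","):
        item = item.strip()
        if not item:
            continue
        lo, _, hi = item.partition("-")
        lo_cp = int(lo, 16)
        # A single value is a range of length one.
        hi_cp = int(hi, 16) if hi else lo_cp
        ranges.append((lo_cp, hi_cp))
    return sorted(ranges)


def contains(ranges, code_point):
    """Return True if code_point lies in any of the inclusive ranges."""
    return any(lo <= code_point <= hi for lo, hi in ranges)


HEBREW = parse_ranges(
    "05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2, 05D0-05EA, 05F0-05F2"
)
```

As a usage example, contains(HEBREW, 0x05EB) is False, whereas the single range (0x05D0, 0x05F2) in the original post does cover 0x05EB. Differences of that kind are what a comparison of the two lists would turn up.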
"
Universal character names for identifiers
1 This clause lists the hexadecimal code values that are valid in
universal character names in identifiers.
2 This table is reproduced unchanged from ISO/IEC TR 10176, produced
by ISO/IEC JTC1/SC22/WG20, except for the omission of ranges that are
part of the required character set.
Latin: 00AA, 00BA, 00C0-00D6, 00D8-00F6, 00F8-01F5, 01FA-0217,
0250-02A8, 1E00-1E9B, 1EA0-1EF9, 207F
Greek: 0386, 0388-038A, 038C, 038E-03A1, 03A3-03CE, 03D0-03D6, 03DA,
03DC, 03DE, 03E0, 03E2-03F3, 1F00-1F15, 1F18-1F1D, 1F20-1F45,
1F48-1F4D, 1F50-1F57, 1F59, 1F5B, 1F5D, 1F5F-1F7D, 1F80-1FB4,
1FB6-1FBC, 1FC2-1FC4, 1FC6-1FCC, 1FD0-1FD3, 1FD6-1FDB, 1FE0-1FEC,
1FF2-1FF4, 1FF6-1FFC
Cyrillic: 0401-040C, 040E-044F, 0451-045C, 045E-0481, 0490-04C4,
04C7-04C8, 04CB-04CC, 04D0-04EB, 04EE-04F5, 04F8-04F9
Armenian: 0531-0556, 0561-0587
Hebrew: 05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2, 05D0-05EA, 05F0-05F2
Arabic: 0621-063A, 0640-0652, 0670-06B7, 06BA-06BE, 06C0-06CE,
06D0-06DC, 06E5-06E8, 06EA-06ED
Devanagari: 0901-0903, 0905-0939, 093E-094D, 0950-0952, 0958-0963
Bengali: 0981-0983, 0985-098C, 098F-0990, 0993-09A8, 09AA-09B0, 09B2,
09B6-09B9, 09BE-09C4, 09C7-09C8, 09CB-09CD, 09DC-09DD, 09DF-09E3,
09F0-09F1
Gurmukhi: 0A02, 0A05-0A0A, 0A0F-0A10, 0A13-0A28, 0A2A-0A30, 0A32-0A33,
0A35-0A36, 0A38-0A39, 0A3E-0A42, 0A47-0A48, 0A4B-0A4D, 0A59-0A5C,
0A5E, 0A74
Gujarati: 0A81-0A83, 0A85-0A8B, 0A8D, 0A8F-0A91, 0A93-0AA8, 0AAA-0AB0,
0AB2-0AB3, 0AB5-0AB9, 0ABD-0AC5, 0AC7-0AC9, 0ACB-0ACD, 0AD0, 0AE0
Oriya: 0B01-0B03, 0B05-0B0C, 0B0F-0B10, 0B13-0B28, 0B2A-0B30,
0B32-0B33, 0B36-0B39, 0B3E-0B43, 0B47-0B48, 0B4B-0B4D, 0B5C-0B5D,
0B5F-0B61
Tamil: 0B82-0B83, 0B85-0B8A, 0B8E-0B90, 0B92-0B95, 0B99-0B9A, 0B9C,
0B9E-0B9F, 0BA3-0BA4, 0BA8-0BAA, 0BAE-0BB5, 0BB7-0BB9, 0BBE-0BC2,
0BC6-0BC8, 0BCA-0BCD
Telugu: 0C01-0C03, 0C05-0C0C, 0C0E-0C10, 0C12-0C28, 0C2A-0C33,
0C35-0C39, 0C3E-0C44, 0C46-0C48, 0C4A-0C4D, 0C60-0C61
Kannada: 0C82-0C83, 0C85-0C8C, 0C8E-0C90, 0C92-0CA8, 0CAA-0CB3,
0CB5-0CB9, 0CBE-0CC4, 0CC6-0CC8, 0CCA-0CCD, 0CDE, 0CE0-0CE1
Malayalam: 0D02-0D03, 0D05-0D0C, 0D0E-0D10, 0D12-0D28, 0D2A-0D39,
0D3E-0D43, 0D46-0D48, 0D4A-0D4D, 0D60-0D61
Thai: 0E01-0E3A, 0E40-0E5B
Lao: 0E81-0E82, 0E84, 0E87-0E88, 0E8A, 0E8D, 0E94-0E97, 0E99-0E9F,
0EA1-0EA3, 0EA5, 0EA7, 0EAA-0EAB, 0EAD-0EAE, 0EB0-0EB9, 0EBB-0EBD,
0EC0-0EC4, 0EC6, 0EC8-0ECD, 0EDC-0EDD
Tibetan: 0F00, 0F18-0F19, 0F35, 0F37, 0F39, 0F3E-0F47, 0F49-0F69,
0F71-0F84, 0F86-0F8B, 0F90-0F95, 0F97, 0F99-0FAD, 0FB1-0FB7, 0FB9
Georgian: 10A0-10C5, 10D0-10F6
Hiragana: 3041-3093, 309B-309C
Katakana: 30A1-30F6, 30FB-30FC
Bopomofo: 3105-312C
CJK Unified Ideographs: 4E00-9FA5
Hangul: AC00-D7A3
Digits: 0660-0669, 06F0-06F9, 0966-096F, 09E6-09EF, 0A66-0A6F,
0AE6-0AEF, 0B66-0B6F, 0BE7-0BEF, 0C66-0C6F, 0CE6-0CEF, 0D66-0D6F,
0E50-0E59, 0ED0-0ED9, 0F20-0F33
Special characters: 00B5, 00B7, 02B0-02B8, 02BB, 02BD-02C1, 02D0-02D1,
02E0-02E4, 037A, 0559, 093D, 0B3D, 1FBE, 203F-2040, 2102, 2107,
210A-2113, 2115, 2118-211D, 2124, 2126, 2128, 212A-2131, 2133-2138,
2160-2182, 3005-3007, 3021-3029
"
--
Thanks. Take care, Brian Inglis Calgary, Alberta, Canada
***@CSi.com (Brian dot Inglis at SystematicSw dot ab dot ca)
fake address use address above to reply