Unicode All Letter/Number Ranges (No Punctuation, Etc.)

Lenny Domnitser

2004-05-31 15:09:08 UTC

I am working on a wiki engine that I hope to fully internationalize.
In creating paths to pages, I want to convert all spans of
non-essential characters to single hyphens. Though I can browse
through character descriptions, I need somebody well-versed in each
language supported by Unicode to help determine which characters carry
essential meaning and which are mainly punctuation, style, etc., that
need not appear in the paths.

What I have come up with so far is this set:

[(0x002D, 0x002D),
(0x0030, 0x0039),
(0x0041, 0x005A),
(0x0061, 0x007A),
(0x00C0, 0x0259),
(0x0386, 0x04E9),
(0x05D0, 0x05F2),
(0x0621, 0x064A),
(0x0660, 0x0669),
(0x0670, 0x06D3),
(0x06D5, 0x06D5),
(0x06F0, 0x1EF9),
(0xFB01, 0xFC62),
(0xFDF2, 0xFEFC)]

Ranges are inclusive. Some parts of it may well be wrong (though I am
pretty confident about the ASCII ones).
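
To make the goal concrete, here is a minimal sketch of the path conversion in Python 3, using the ranges above as they stand. The names ESSENTIAL_RANGES, in_ranges, and page_path are made up for illustration. Two assumptions go beyond the description: the hyphen is treated as a separator so that "a - b" collapses to one hyphen, and leading and trailing hyphens are stripped.

```python
import bisect

# Inclusive (first, last) code point ranges, copied from the list above.
# The list must stay sorted by first code point for the bisect lookup.
ESSENTIAL_RANGES = [
    (0x002D, 0x002D), (0x0030, 0x0039), (0x0041, 0x005A), (0x0061, 0x007A),
    (0x00C0, 0x0259), (0x0386, 0x04E9), (0x05D0, 0x05F2), (0x0621, 0x064A),
    (0x0660, 0x0669), (0x0670, 0x06D3), (0x06D5, 0x06D5), (0x06F0, 0x1EF9),
    (0xFB01, 0xFC62), (0xFDF2, 0xFEFC),
]
_STARTS = [first for first, _ in ESSENTIAL_RANGES]


def in_ranges(ch):
    """Return True if ch falls inside one of the inclusive ranges."""
    cp = ord(ch)
    # Find the last range whose first code point is <= cp.
    i = bisect.bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= ESSENTIAL_RANGES[i][1]


def page_path(title):
    """Replace every span of non-essential characters with one hyphen."""
    out = []
    in_gap = False
    for ch in title:
        # Assumption: "-" itself counts as a separator, so it joins the span.
        if in_ranges(ch) and ch != "-":
            out.append(ch)
            in_gap = False
        elif not in_gap:
            out.append("-")
            in_gap = True
    # Assumption: paths should not begin or end with a hyphen.
    return "".join(out).strip("-")
```

With this sketch, page_path("Hello, World!") gives "Hello-World".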

Thanks in advance for any contributions, be they full or for specific
languages.

Brian Inglis

2004-05-31 21:36:48 UTC

On 31 May 2004 08:09:08 -0700 in comp.software.international,

Post by Lenny Domnitser
I am working on a wiki engine that I hope to fully internationalize.
In creating paths to pages, I want to convert all spans of
non-essential characters to single hyphens. Though I can browse
through character descriptions, I need somebody well-versed in each
language supported by Unicode to help determine which characters carry
essential meaning and which are mainly punctuation, style, etc., that
need not appear in the paths.
[(0x002D, 0x002D),
(0x0030, 0x0039),
(0x0041, 0x005A),
(0x0061, 0x007A),
(0x00C0, 0x0259),
(0x0386, 0x04E9),
(0x05D0, 0x05F2),
(0x0621, 0x064A),
(0x0660, 0x0669),
(0x0670, 0x06D3),
(0x06D5, 0x06D5),
(0x06F0, 0x1EF9),
(0xFB01, 0xFC62),
(0xFDF2, 0xFEFC)]
Ranges are inclusive. Some parts of it may well be wrong (though I am
pretty confident about the ASCII ones).
Thanks in advance for any contributions, be they full or for specific
languages.

The following is Annex D (normative) of the Committee Draft N869
(unchanged in the final version) of the C Standard by WG14. It was
available on the ISO WG14 web site.
As it says, it is a copy from WG20 TR 10176, which was available on
the ISO WG20 web site IIRC.
It lists all the characters that are valid in identifiers. Some of the
special characters at the bottom may not be required for your
purposes, but a little checking against the Unicode code charts will
let you decide that for yourself.
I've crunched it down so that I could compare it to the final version,
and take up less space in the posting.

"
Universal character names for identifiers
1 This clause lists the hexadecimal code values that are valid in
universal character names in identifiers.
2 This table is reproduced unchanged from ISO/IEC TR 10176, produced
by ISO/IEC JTC1/SC22/WG20, except for the omission of ranges that are
part of the required character set.
Latin: 00AA, 00BA, 00C0-00D6, 00D8-00F6, 00F8-01F5, 01FA-0217,
0250-02A8, 1E00-1E9B, 1EA0-1EF9, 207F
Greek: 0386, 0388-038A, 038C, 038E-03A1, 03A3-03CE, 03D0-03D6, 03DA,
03DC, 03DE, 03E0, 03E2-03F3, 1F00-1F15, 1F18-1F1D, 1F20-1F45,
1F48-1F4D, 1F50-1F57, 1F59, 1F5B, 1F5D, 1F5F-1F7D, 1F80-1FB4,
1FB6-1FBC, 1FC2-1FC4, 1FC6-1FCC, 1FD0-1FD3, 1FD6-1FDB, 1FE0-1FEC,
1FF2-1FF4, 1FF6-1FFC
Cyrillic: 0401-040C, 040E-044F, 0451-045C, 045E-0481, 0490-04C4,
04C7-04C8, 04CB-04CC, 04D0-04EB, 04EE-04F5, 04F8-04F9
Armenian: 0531-0556, 0561-0587
Hebrew: 05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2, 05D0-05EA, 05F0-05F2
Arabic: 0621-063A, 0640-0652, 0670-06B7, 06BA-06BE, 06C0-06CE,
06D0-06DC, 06E5-06E8, 06EA-06ED
Devanagari: 0901-0903, 0905-0939, 093E-094D, 0950-0952, 0958-0963
Bengali: 0981-0983, 0985-098C, 098F-0990, 0993-09A8, 09AA-09B0, 09B2,
09B6-09B9, 09BE-09C4, 09C7-09C8, 09CB-09CD, 09DC-09DD, 09DF-09E3,
09F0-09F1
Gurmukhi: 0A02, 0A05-0A0A, 0A0F-0A10, 0A13-0A28, 0A2A-0A30, 0A32-0A33,
0A35-0A36, 0A38-0A39, 0A3E-0A42, 0A47-0A48, 0A4B-0A4D, 0A59-0A5C,
0A5E, 0A74
Gujarati: 0A81-0A83, 0A85-0A8B, 0A8D, 0A8F-0A91, 0A93-0AA8, 0AAA-0AB0,
0AB2-0AB3, 0AB5-0AB9, 0ABD-0AC5, 0AC7-0AC9, 0ACB-0ACD, 0AD0, 0AE0
Oriya: 0B01-0B03, 0B05-0B0C, 0B0F-0B10, 0B13-0B28, 0B2A-0B30,
0B32-0B33, 0B36-0B39, 0B3E-0B43, 0B47-0B48, 0B4B-0B4D, 0B5C-0B5D,
0B5F-0B61
Tamil: 0B82-0B83, 0B85-0B8A, 0B8E-0B90, 0B92-0B95, 0B99-0B9A, 0B9C,
0B9E-0B9F, 0BA3-0BA4, 0BA8-0BAA, 0BAE-0BB5, 0BB7-0BB9, 0BBE-0BC2,
0BC6-0BC8, 0BCA-0BCD
Telugu: 0C01-0C03, 0C05-0C0C, 0C0E-0C10, 0C12-0C28, 0C2A-0C33,
0C35-0C39, 0C3E-0C44, 0C46-0C48, 0C4A-0C4D, 0C60-0C61
Kannada: 0C82-0C83, 0C85-0C8C, 0C8E-0C90, 0C92-0CA8, 0CAA-0CB3,
0CB5-0CB9, 0CBE-0CC4, 0CC6-0CC8, 0CCA-0CCD, 0CDE, 0CE0-0CE1
Malayalam: 0D02-0D03, 0D05-0D0C, 0D0E-0D10, 0D12-0D28, 0D2A-0D39,
0D3E-0D43, 0D46-0D48, 0D4A-0D4D, 0D60-0D61
Thai: 0E01-0E3A, 0E40-0E5B
Lao: 0E81-0E82, 0E84, 0E87-0E88, 0E8A, 0E8D, 0E94-0E97, 0E99-0E9F,
0EA1-0EA3, 0EA5, 0EA7, 0EAA-0EAB, 0EAD-0EAE, 0EB0-0EB9, 0EBB-0EBD,
0EC0-0EC4, 0EC6, 0EC8-0ECD, 0EDC-0EDD
Tibetan: 0F00, 0F18-0F19, 0F35, 0F37, 0F39, 0F3E-0F47, 0F49-0F69,
0F71-0F84, 0F86-0F8B, 0F90-0F95, 0F97, 0F99-0FAD, 0FB1-0FB7, 0FB9
Georgian: 10A0-10C5, 10D0-10F6
Hiragana: 3041-3093, 309B-309C
Katakana: 30A1-30F6, 30FB-30FC
Bopomofo: 3105-312C
CJK Unified Ideographs: 4E00-9FA5
Hangul: AC00-D7A3
Digits: 0660-0669, 06F0-06F9, 0966-096F, 09E6-09EF, 0A66-0A6F,
0AE6-0AEF, 0B66-0B6F, 0BE7-0BEF, 0C66-0C6F, 0CE6-0CEF, 0D66-0D6F,
0E50-0E59, 0ED0-0ED9, 0F20-0F33
Special characters: 00B5, 00B7, 02B0-02B8, 02BB, 02BD-02C1, 02D0-02D1,
02E0-02E4, 037A, 0559, 093D, 0B3D, 1FBE, 203F-2040, 2102, 2107,
210A-2113, 2115, 2118-211D, 2124, 2126, 2128, 212A-2131, 2133-2138,
2160-2182, 3005-3007, 3021-3029
"
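
The table format above is easy to turn into the same inclusive (first, last) tuples used in the original post. A minimal sketch in Python 3, assuming each entry is a comma-separated list of hex code points or hyphenated hex ranges; parse_ranges and in_table are hypothetical helper names, and only two of the entries are shown.

```python
def parse_ranges(spec):
    """Parse '00AA, 00C0-00D6' into inclusive (first, last) tuples."""
    ranges = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # A single code point such as '05BF' becomes the range (05BF, 05BF).
        first, _, last = item.partition("-")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


def in_table(ch, ranges):
    """Return True if ch lies inside any of the inclusive ranges."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in ranges)


# Two entries copied from the table above.
ARMENIAN = parse_ranges("0531-0556, 0561-0587")
HEBREW = parse_ranges(
    "05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2, 05D0-05EA, 05F0-05F2"
)
```

For example, in_table("\u05d0", HEBREW) is true for the letter alef, while U+05BE falls in a gap of the Hebrew entry and is rejected.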

--
Thanks. Take care, Brian Inglis Calgary, Alberta, Canada

***@CSi.com (Brian dot Inglis at SystematicSw dot ab dot ca)
fake address use address above to reply

Lenny Domnitser

2004-06-01 21:04:07 UTC

Thank you. I believe that these ranges include all characters related
to each language, such as special commas, etc.

Also, I found the Python function unicode.isalnum, which should cover
most cases of "essential characters". If anybody knows of languages in
which characters that might not be considered alphanumeric are
nevertheless critical to understanding, please speak up.
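
One case of this kind that can be checked directly is combining marks. The sketch below is Python 3, where unicode.isalnum is spelled str.isalnum. It shows that the vowel signs and virama in a Devanagari word are not alphanumeric, and tries a looser test based on Unicode general categories. The helper keeps_meaning is hypothetical, and keeping all of L*, N*, and M* is only one possible rule.

```python
import unicodedata


def keeps_meaning(ch):
    """Keep letters (L*), numbers (N*), and combining marks (M*)."""
    return unicodedata.category(ch)[0] in "LNM"


# The Devanagari spelling of "Hindi": three consonants, two vowel signs,
# and a virama. The vowel signs and virama are combining marks.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

# isalnum() is False for the marks, so a path built with it would break
# the word apart at every mark.
alnum_flags = [ch.isalnum() for ch in word]

# The category-based test keeps every character of the word.
category_flags = [keeps_meaning(ch) for ch in word]
```

Here alnum_flags alternates True and False, while category_flags is True throughout.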

Brian Inglis

2004-06-02 02:30:30 UTC

On 1 Jun 2004 14:04:07 -0700 in comp.software.international,

Post by Lenny Domnitser
Thank you. I believe that these ranges include all characters related
to each language, such as special commas, etc.

ISTM the character set specific sections should include only letters,
and that all numerals and special characters should be in the last two
sections. Examination of WG20 TR 10176 should confirm this.
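
Short of reading the TR, one rough check is to tally the Unicode general categories inside a script section. A sketch in Python 3, using the Hebrew entry from the quoted table; category_counts is a hypothetical helper, and the result depends on the Unicode data shipped with the Python in use. With current data the Hebrew entry reports Mn (nonspacing marks, the vowel points) alongside Lo, so the script sections may hold combining marks as well as letters.

```python
import unicodedata
from collections import Counter

# The Hebrew entry from the Annex D table quoted earlier in the thread,
# written as inclusive (first, last) code point ranges.
HEBREW = [
    (0x05B0, 0x05B9), (0x05BB, 0x05BD), (0x05BF, 0x05BF),
    (0x05C1, 0x05C2), (0x05D0, 0x05EA), (0x05F0, 0x05F2),
]


def category_counts(ranges):
    """Count Unicode general categories over inclusive ranges."""
    counts = Counter()
    for first, last in ranges:
        for cp in range(first, last + 1):
            counts[unicodedata.category(chr(cp))] += 1
    return counts


counts = category_counts(HEBREW)
```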

Post by Lenny Domnitser
Also, I found the Python function unicode.isalnum, which should cover
most cases of "essential characters". If anybody knows of languages in
which characters that might not be considered alphanumeric are
nevertheless critical to understanding, please speak up.

Some cross-checking of the code against the quoted references, and
examination of those references, is always advisable in case the code
has errors (no, never!), or the references have been updated since the
code was written.
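
As a sketch of that kind of cross-check in Python 3: list the code points that a hand-written range accepts but str.isalnum rejects. The helper non_alnum_in_ranges is hypothetical, and the range tested is the (0x00C0, 0x0259) entry from the original list. The result includes U+00D7 (multiplication sign) and U+00F7 (division sign), which the Annex D Latin entry avoids by splitting its ranges at 00D6/00D8 and 00F6/00F8.

```python
def non_alnum_in_ranges(ranges):
    """Return code points inside the inclusive ranges that fail isalnum()."""
    return [
        cp
        for first, last in ranges
        for cp in range(first, last + 1)
        if not chr(cp).isalnum()
    ]


# One entry from the hand-built list in the original post.
LATIN_GUESS = [(0x00C0, 0x0259)]

suspects = non_alnum_in_ranges(LATIN_GUESS)
for cp in suspects:
    # Print each disagreement in the usual U+XXXX notation.
    print(f"U+{cp:04X}")
```

The same function can be run over every range in the list. Disagreements are not necessarily errors in the ranges, only places worth looking up.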