Discussion:
Definition of Unicode
v***@gmail.com
2005-07-02 01:16:13 UTC
Hi,

I am a little confused about the definition of Unicode.

My understanding is that Unicode defines the set of characters (which
is a superset of most character repertoires), but it doesn't define
the encoding scheme.

UTF-8, UTF-16 and UTF-32 are all possible encodings for Unicode. But I
keep hearing that Unicode will use 2 bytes per character. That's not
always true, is it? Because UTF-8 is smart enough to use 1 byte for
Latin characters, I thought.
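
For example, I'd expect a quick test along these lines (just a sketch
on my part; String.getBytes takes a charset name) to show the
difference:

    public class EncodingLengths {
        public static void main(String[] args) throws Exception {
            String ascii = "A";       // U+0041, in the ASCII range
            String cjk = "\u4e2d";    // U+4E2D, a CJK ideograph

            // UTF-8: 1 byte for ASCII, 3 bytes for most CJK ideographs.
            System.out.println(ascii.getBytes("UTF-8").length);   // 1
            System.out.println(cjk.getBytes("UTF-8").length);     // 3

            // UTF-16BE (no byte order mark): 2 bytes for each of these.
            System.out.println(ascii.getBytes("UTF-16BE").length); // 2
            System.out.println(cjk.getBytes("UTF-16BE").length);   // 2
        }
    }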

Please help.
Victor
Jim Kingdon
2005-07-02 03:36:11 UTC
Post by v***@gmail.com
UTF-8, UTF-16 and UTF-32 are all possible encodings for Unicode. But I
keep hearing that Unicode will use 2 bytes per character.
Well, 2 bytes per character (what has sometimes been known as UCS-2)
used to be enough. So back then, before UTF-8 and such existed,
"Unicode" meant UCS-2, but this is an obsolete usage (although one
might still find it, as you have seen).

http://www.unicode.org/faq/basic_q.html#23

So old interfaces that are tied to 2-byte characters (e.g. Java, the
Win32 APIs) now generally use UTF-16 (which is two bytes for most
characters, but can be 4 bytes - a pair of 2-byte units called a
surrogate pair - for the less common ones).
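
A quick way to see the 4-byte case from Java (a sketch; codePointCount
is a JDK 1.5 addition, so treat that as an assumption):

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic
            // Multilingual Plane, so UTF-16 stores it as a surrogate
            // pair: the two 2-byte units below.
            String clef = "\uD834\uDD1E";

            System.out.println(clef.length()); // 2 UTF-16 code units
            System.out.println(clef.codePointCount(0, clef.length())); // 1 character
        }
    }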
Post by v***@gmail.com
UTF-8 is smart enough to use 1 byte for Latin characters I thought.
For ASCII characters, yes. But there are other Latin characters which
are two bytes in UTF-8 (for example, LATIN CAPITAL LETTER A WITH ACUTE
from Latin-1). See Latin-1, Latin Extended-A, Latin Extended-B, etc.,
at http://www.unicode.org/charts/
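
You can check that particular character directly (another sketch):

    public class LatinBytes {
        public static void main(String[] args) throws Exception {
            String aAcute = "\u00C1"; // LATIN CAPITAL LETTER A WITH ACUTE

            System.out.println(aAcute.getBytes("ISO-8859-1").length); // 1 byte in Latin-1
            System.out.println(aAcute.getBytes("UTF-8").length);      // 2 bytes in UTF-8
        }
    }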
v***@gmail.com
2005-07-02 16:08:40 UTC
Thanks Jim.

I have also heard a lot that a 3-tier web application should ideally
have all three tiers Unicode-enabled to support multiple languages. I
am trying to understand the details.

In particular, from the Java mid-tier to the DB: Java keeps characters
in Unicode, so if the DB is only ISO Latin-1, does that mean that when
writing Java strings through JDBC into a VARCHAR/CLOB column, the data
gets "converted" to ISO Latin-1 and causes problems?

The part I am missing is this: such a DB won't be able to "interpret"
the characters, but it won't "modify" the actual bytes, right? So when
Java retrieves the data again through JDBC, the characters are
interpreted fine again in the JVM as Unicode, right? Why would it have
been a problem?

Thanks a lot.
Jim Kingdon
2005-07-02 16:28:07 UTC
Post by v***@gmail.com
In particular, from the Java mid-tier to the DB: Java keeps characters
in Unicode, so if the DB is only ISO Latin-1, does that mean that when
writing Java strings through JDBC into a VARCHAR/CLOB column, the data
gets "converted" to ISO Latin-1 and causes problems?
Well, yes: at the JDBC level you pass text to and from the database as
either a String or a Reader/Writer (via the Clob class), all of which
deal in characters (that is, Unicode), not bytes.
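
Concretely, both paths hand the driver characters rather than bytes; a
sketch (the notes table and its columns are made up for illustration):

    import java.io.Reader;
    import java.io.StringReader;
    import java.sql.*;

    class ClobDemo {
        // Assumes a table like: CREATE TABLE notes (id INT, body CLOB)
        static void writeAndRead(Connection conn, String text) throws Exception {
            PreparedStatement ps = conn.prepareStatement(
                "UPDATE notes SET body = ? WHERE id = 1");
            // The driver receives a stream of chars; any conversion
            // to bytes happens on its side of the fence.
            ps.setCharacterStream(1, new StringReader(text), text.length());
            ps.executeUpdate();

            ResultSet rs = conn.createStatement()
                .executeQuery("SELECT body FROM notes WHERE id = 1");
            rs.next();
            Reader r = rs.getCharacterStream(1); // characters back out
        }
    }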

What I don't know is what happens in the JDBC driver, the wire
protocol to the database (which is almost invariably proprietary), and
the database itself. I vaguely remember Oracle bugs in character
handling, but I don't remember the nature of the bugs or which
versions of Oracle were affected (I think it depended on the JDBC
driver and such), and I could easily believe the same of other
databases.
Post by v***@gmail.com
The part I am missing is this: such a DB won't be able to "interpret"
the characters, but it won't "modify" the actual bytes, right?
That might be logical, but, although I don't claim to know a whole lot
about databases, I wouldn't assume that VARCHAR or CLOB passes
arbitrary binary data through untouched. Only BLOB is really designed
for that. And putting character data in a BLOB, although it might be a
decent workaround, is hardly appealing.
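
If you did resort to that, the workaround amounts to doing the
character-to-byte conversion yourself on both sides, so the database
only ever sees opaque bytes. A rough sketch (table and column names
made up):

    import java.sql.*;

    class BlobWorkaround {
        // Assumes a table like: CREATE TABLE notes (id INT, body BLOB)
        static void roundTrip(Connection conn, String text) throws Exception {
            PreparedStatement ins = conn.prepareStatement(
                "UPDATE notes SET body = ? WHERE id = 1");
            // Encode to UTF-8 ourselves, so no tier reinterprets the data.
            ins.setBytes(1, text.getBytes("UTF-8"));
            ins.executeUpdate();

            ResultSet rs = conn.createStatement()
                .executeQuery("SELECT body FROM notes WHERE id = 1");
            rs.next();
            // Decode with the same charset on the way back.
            String back = new String(rs.getBytes(1), "UTF-8");
            System.out.println(back.equals(text)); // true if bytes survive
        }
    }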

It wouldn't be hard to write a short test program to see what your
database does: call PreparedStatement.setString with "\u1234" (an
Ethiopic syllable, well outside Latin-1), then read the data back and
see what you get.
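
Something along these lines (the table, column, driver and connection
URL are all placeholders to substitute for your setup):

    import java.sql.*;

    public class UnicodeRoundTrip {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                "jdbc:yourdriver://host/db", "user", "password");

            // Assumes a table like: CREATE TABLE t (c VARCHAR(10))
            PreparedStatement ins =
                conn.prepareStatement("INSERT INTO t (c) VALUES (?)");
            ins.setString(1, "\u1234"); // not representable in Latin-1
            ins.executeUpdate();

            ResultSet rs = conn.createStatement()
                .executeQuery("SELECT c FROM t");
            rs.next();
            String back = rs.getString(1);

            // A lossy conversion through Latin-1 typically comes back
            // as '?' or some other substitute character.
            System.out.println("\u1234".equals(back)
                ? "round-tripped intact" : "mangled: " + back);
        }
    }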
