Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2000 Forums
 SQL Server Development (2000)
 Sql Server not recognizing unicode characters

Author  Topic 

esowell
Starting Member

4 Posts

Posted - 2003-11-09 : 08:33:10
Here's my issue, which I've queried about in various places and with various people, though with no success. The only way that I can describe the phenomenon that I'm seeing is to say that it seems that Sql Server is ignoring some of my characters. Here's what I'm seeing. I run a sql statement like this

SET @lex_form_id = (SELECT lex_form_id FROM morph_lex_form WHERE lex_form LIKE @lex_form)

where @lex_form_id is an INT and column lex_form and variable @lex_form are NVARCHAR. If I do

SET @lex_form_id = (SELECT lex_form_id FROM morph_lex_form WHERE lex_form = @lex_form)

I get the same issue. Okay, so that's my sql. Here's the problem. When I try to match small words that have lots of diacritical markings it doesn't work. For example, "apo" and "epi" give me problems. Say that the variable @lex_form is the word "apo". If I run the statement and "epi" is in the table morph_lex_form, they match. And here's why I think so. Both of the vowels in both words have diacriticals over them. If it ignores those characters, then all you're matching is "p" and "p", which obviously match.

In connection with that, when I do an order by on the table it is also clear that the vowels that have diacritical marks are being ignored.

I do have the collation set on that column to Greek, as you can see here:
[lex_form] [nvarchar] (50) COLLATE Greek_CS_AS NOT NULL

And I have the Greek settings for xp professional turned on as well.

Any idea how to solve this problem? Thanks.

esowell

nr
SQLTeam MVY

12543 Posts

Posted - 2003-11-09 : 10:46:36
try WHERE lex_form = @lex_form collate Greek_CS_AS


what are the ascii codes for the o and i that you are having problems with.

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.
Go to Top of Page

esowell
Starting Member

4 Posts

Posted - 2003-11-09 : 23:43:10
Thanks for the hint, but unfortunately it didn't work. The characters for "a" and "o" are transliterations of the Greek characters alpha and omicron. In the real situation the alpha and omicrons would both have diacriticals. I am also using the Greek extended set of characters for some of the characters, and the regular Greek set for the others, the code charts for which can be seen here.

Regular - http://www.unicode.org/charts/PDF/U0370.pdf
Extended - http://www.unicode.org/charts/PDF/U1F00.pdf

The two words are make up of the following characters, which can all be found at the address above:

"apo"
1F00, 03c0, 1f79

"epi"
1f10, 03c0, 1f77

Does that help you diagnose?

Go to Top of Page

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2003-11-10 : 05:26:02
Greek_CS_AS doesn't collate Unicode characters in the Greek Extended range properly.


CREATE VIEW Greek_chars AS
SELECT n, NCHAR(n) COLLATE Greek_CS_AS AS c
FROM Numbers
WHERE (n BETWEEN 0x0374 AND 0x03FB
OR n BETWEEN 0x1F00 AND 0x1FFE)
AND n NOT IN (0x0376, 0x0377, 0x0378, 0x0379, 0x037B, 0x037C, 0x037D,
0x037F, 0x0380, 0x0381, 0x0382, 0x0383, 0x038B, 0x038D, 0x03A2, 0x03CF,
0x1F16, 0x1F17, 0x1F1E, 0x1F1F, 0x1F46, 0x1F47, 0x1F4E, 0x1F4F, 0x1F58,
0x1F5A, 0x1F5C, 0x1F5E, 0x1F58, 0x1F7E, 0x1F7F, 0x1FB5, 0x1FC5, 0x1FD4,
0x1FD5, 0x1FDC, 0x1FF0, 0x1FF1, 0x1FF5)

GO

SELECT CAST(UNICODE(A.c) AS binary(2)), A.c, CAST(UNICODE(B.c) AS binary(2)), ct
FROM Greek_chars AS A
INNER JOIN (
SELECT c, COUNT(*) AS ct
FROM Greek_chars
GROUP BY c
) AS B
ON A.c = B.c
ORDER BY UNICODE(A.c), UNICODE(B.c)


Edit: Hmm, Given that Unicode comparisons are done with a Windows API call, I wonder whether this is operating-system dependent. Anyone running Windows Server 2003 want to give this a try?
Go to Top of Page

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-11-10 : 05:26:33
How does this work:

WHERE lex_form = NCHAR(7936) + NCHAR(960) + NCHAR(8057) -- this is "apo"

???
Go to Top of Page

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-11-10 : 05:29:51
31 seconds... Not so bad.
Go to Top of Page

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-11-10 : 07:06:26
Is it silly workaround:

WHERE cast(lex_form as varbinary(800)) = cast(@lex_form as varbinary(800))

???
Go to Top of Page

esowell
Starting Member

4 Posts

Posted - 2003-11-11 : 14:58:58
Thanks for your responses guys. I'm not sure if your suggestion will work or not, Stoad. I'll try it later.

As for your surely time-intensive response Mr. Fribble, I did get that working but I wasn't sure exactly how to apply that to my query to solve my problem. I'm interested in your thoughts on that.

But, for the time being, I've got the functionality I need. When I originally set the collation for the field to the Greek collation (GREEK_CS_AS) I did so through a sql statement rather than through enterprise manager. Now, I was poking around in another table that had Greek characters but still had the default Latin collation. I decided to change it to Greek. I didn't remember the sql statement to change the collation, so I decided to open up the table in design view to set the collation. When I opened up the dialog I went first to the sql collation combo box and was surprised to find the GREEK_CS_AS collation missing, which I thought was a little odd. But, in the spirit of adventure, I decided to try using a Windows collation, chose the one for Greek, and set it on Binary sort. Suddenly, my queries worked! Then I checked the old table that I had previously set using a sql query and saw that it too was set to the Windows Greek collation, except that it was set to Dictionary sort rather than binary sort. Then I saw what the problem was. Binary is just zero's and one's, so of course that would work. Comparisons of that kind should work with no problem. But, if the collation is set to dictionary sort, it doesn't sort by the binary value but by the internal dictionary that Sql Server uses to deal with case sensitivity and accents in other languages. So when you say that GREEK_CS_AS doesn't support the Greek extended set you mean that the dictionary that Sql Server uses doesn't support that set. Ahhh!!!!!! Problem solved. Of course, this means for the time being that I can't use the dictionary and thus get case and accent insensitivity when I want it, which is a feature I will need at some point. But, for the moment, what I was trying to do before works great because of the nature of the words I was dealing with.

So, a temporary fix, though I'll need to deal with being able to use dictionary sort at some point. So, any more explanation would be greatly appreciated.
Go to Top of Page

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-11-11 : 16:02:58
How comparisons for strict equality can be affected by collations?

It's above my head...

The whole idea of the Unicode looks rotten in that.
Go to Top of Page

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2003-11-12 : 06:03:23
If = didn't use the same collation as > and < then indexes wouldn't work.

What the Unicode Consortium think about collation can be found here:
http://www.unicode.org/reports/tr10
Go to Top of Page

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2003-11-12 : 06:17:31
To respond to esowell, I think using GREEK_BIN collation would be preferable to converting the strings to binary -- it's less noisy.

Sorry for not saying so, but my code was really just to illustrate that there is a problem with collation for all of the Greek Extended range of Unicode characters in SQL Server 2000 (at least, running on Windows 2000 and XP).

I'm afraid I don't have any good suggestions for how you can get sensible dictionary collation for strings containing Greek Extended characters.
Go to Top of Page

esowell
Starting Member

4 Posts

Posted - 2003-11-12 : 06:30:16
Well, thanks for your help anyway. I suppose I'll just have to find some way of dealing with case and accent sensitivity some other way. Do you know if a dictionary could be created that could then be used with the Greek collation? If this is possible, is it a super huge project that I wouldn't want to get into?
Go to Top of Page

Stoad
Freaky Yak Linguist

1983 Posts

Posted - 2003-11-12 : 08:26:45
Funny reading on that link :) Plus, I think that there are only few of
those (e.g.) greeks who know how exactly their words should be sorted...

1.3 Contextual Sensitivity
Beyond the concept of levels, there are additional complications in certain languages, whereby the comparison is context sensitive: it depends on more than just single characters compared directly against one another.

First are contractions, where two (or more) characters sort as if they were a single base character. In the table below, CH acts like a character after C. Second are expansions, where a single character sorts as if it were two (or more) characters in sorting. In the table below, an OE ligature sorts as if it were O + E. Both of these can be combined: that is, two (or more) characters may sort as if they were a different sequence of two (or more) characters. In the example below, for Japanese, a length mark sorts like the vowel of the previous syllable: as an A after KA and as an I after KI.

H < Z, but
CH > CZ

There are some further oddities in the ways that languages work. Normally, all differences in sorting are assessed going from the start to the end of the string. If all of the base characters are the same, the first accent difference determines the final order. In row 1 of the example below, the first accent difference is on the o, so that is what determines the order. In French and a few other languages, however, it is the last accent difference that determines the order, as in row 2.
Go to Top of Page
   

- Advertisement -