how to check for double byte chars in a string

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2000 Forums

SQL Server Development (2000)

how to check for double byte chars in a string

Author

Topic

donpolix
Yak Posting Veteran

97 Posts

Posted - 2005-06-09 : 07:28:27

If I have a field with type nvarchar(4000), I want to query this field so that if the data contains any double byte characters, it will return 'NON ENGLISH CHARS'.

for example: fieldname = MessageField

MessageField
--------------
asdfq wer
rtu im äµ�¬äŽ
z2xcv asdf
12 ‹X . hello

I want to return it this way:

MessageField
--------------
asdfq wer
'NON ENGLISH CHARS'
z2xcv asdf
'NON ENGLISH CHARS'

Appreciate any help on this.

Thanks!

Donn Policarpio

donpolix
Yak Posting Veteran

97 Posts

Posted - 2005-06-09 : 07:40:52

any help.

Donn Policarpio

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2005-06-09 : 08:22:30

Assuming that you don't stray outside the Unicode baseplane, all characters in an nvarchar column are stored as two bytes. (It's probably not a good idea to expect SQL Server 2000 to work well with characters outside the baseplane, since nvarchar was based on Unicode 2.0 and that didn't have any.)

So presumably what you meant was characters that would require more than 1 byte if stored as UTF-8. In that case, it's just a question of finding characters c where UNICODE(c) >= 128.
Can't think of any quick ways of doing that. Two slow ways are:
1. write a UTF that returns whether an nvarchar value contains any characters with a Unicode codepoint >= 128.
2. use a tally table that replaces the WHILE loop in the UTF and do the same check in the query itself.
Neither is very appealing!

donpolix
Yak Posting Veteran

97 Posts

Posted - 2005-06-09 : 20:55:26

quote:
Originally posted by Arnold Fribble

Assuming that you don't stray outside the Unicode baseplane, all characters in an nvarchar column are stored as two bytes. (It's probably not a good idea to expect SQL Server 2000 to work well with characters outside the baseplane, since nvarchar was based on Unicode 2.0 and that didn't have any.)

So presumably what you meant was characters that would require more than 1 byte if stored as UTF-8. In that case, it's just a question of finding characters c where UNICODE(c) >= 128.
Can't think of any quick ways of doing that. Two slow ways are:
1. write a UTF that returns whether an nvarchar value contains any characters with a Unicode codepoint >= 128.
2. use a tally table that replaces the WHILE loop in the UTF and do the same check in the query itself.
Neither is very appealing!

Thanks Arnold!
>128 it is. I actually wrote a UDF to scan each characters, however, this is painful since I am about to go through a bunch of text (as in a body of an email)..But since, if anybody else might also think of going with this crunching approach (like what you suggested), I will be fine with it. I guess I have no other choice =(.

Donn Policarpio

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2005-06-10 : 03:37:26

I'm being an idiot! You only want to know whether they contain characters where UNICODE(c) >= 128, you don't need to know what the characters are. Consequently, this will work:


CASE WHEN MessageField COLLATE Latin1_General_BIN
  LIKE N'%[' + NCHAR(128) + '-' + NCHAR(65535) + N']%'
THEN N'''NON ENGLISH CHARS'''
ELSE MessageField
END

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-06-10 : 03:49:37

>> I'm being an idiot!

Gee, that must be by Your standard.
What does that make the rest of us, imbecils ?

rockmoose

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources