Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
| Author |
Topic |
|
Kristen
Test
22859 Posts |
Posted - 2006-02-22 : 14:18:21
|
I've generating some XML out of SQL on one server.I'm wrapping that (in a BATCH file) with<?xml version="1.0" encoding="UTF-8"?><root>...</root> and then trying to get it back in with the XML Bulk Import object.Its complaining about some weird characters that my editor is coughing on too! Looks like a degree (i.e. temperature) symbolAny ideas how I persuade FOR XML EXPLICIT to encode those characters? or some other way to handle them please?Edit: In case it helps my "well-formedness" checker says:Error An exception occured! Type:TranscodingException, Message:An invalid multi-byte source text sequence was encounteredThanksKristen |
|
|
my_aro
Yak Posting Veteran
50 Posts |
Posted - 2006-02-22 : 16:27:33
|
| mam can i see the complete detail of that xml tags please..? |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2006-02-23 : 03:19:32
|
| Well, I don't suppose that the characters will display here, but the tag in question is:<sys_html_Text><b>xx xxxxxxxxxxx xxxxxxxxxxx: +25#56130;#57184;0#56154;#57140;;/b><br>xxxx ... xxxx</sys_html_Text>The actual HTML is:<b>xx xxxxxxxxxxx xxxxxxxxxxx: +25°C - 0°C</b><br>and in case the degree symbol is lost in transmission I'll substitute it here:<b>xx xxxxxxxxxxx xxxxxxxxxxx: +25[[DEGREE]]C - 0[[DEGREE]]C</b><br>Kristen |
 |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2006-02-23 : 03:30:15
|
| Your problem is that you're telling the XML parser that the contents of the XML file is encoded as UTF-8 and it isn't.If it's choking on a non-ascii character then the output's presumably not UTF-16 (otherwise it wouldn't make it past the first few characters). So I'd guess (and I stress it's a guess) it's going to be whatever code page is used by the default database collation for varchar. Probably CP 1252.You could try changing it to say encoding="windows-1252", but XML parsers aren't obliged to accept this encoding and not all do. If you're sure that there are no characters that encode into CP 1252 in the range 128-159 then you can get away with encoding="ISO-8859-1" (actually, XML parsers aren't obliged to accept that either, but I think they pretty much all do). But check that it does the right thing, because -- unlike UTF-8 -- it won't be able to detect bogus octet sequences with either of those encodings.Edit: Hmm... given your second post, I'm not so sure that the above is what you want. |
 |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2006-02-23 : 03:42:42
|
quote: Originally posted by Kristen Well, I don't suppose that the characters will display here, but the tag in question is:<sys_html_Text><b>xx xxxxxxxxxxx xxxxxxxxxxx: +25#56130;#57184;0#56154;#57140;;/b><br>xxxx ... xxxx</sys_html_Text>
I think something went tragically wrong with that when you posted it! We seem to have lost some ampersands in there somewhere, and I can't see the C's either, or the hyphen, or the start of the </b>But these numbers look particularly fishy:56130 = 0xDB4257184 = 0xDF6056154 = 0xDB5A57140 = 0xDF34They look to me like 2 sets of UTF-16 surrogates pairs being expressed as if they were character references. That should never happen.Edit: the trouble is, if were UTF-16 surrogate pairs they'd representU+E0B60U+E6B34respectively, which makes no sense: AFAIK, no codepoints in plane 14 are assigned beyond U+E01EF. |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2006-02-23 : 05:18:22
|
I was really surprised with it too. The thing I pasted it from had each "[Degree]C" as a square box. Goodness knows why/what it apparently concatenated the symbol PLUS the 'C' into a single Unicode character ...Is there some simple Hex viewer I can point at the original file to see what it actually is? I'd prefer not to pre-edit the file to extract a snippet in case the editor mucks it up - so I need a Hex viewer that can handle a 30MB file Kristen |
 |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2006-02-23 : 05:55:58
|
quote: Originally posted by Kristen I was really surprised with it too. The thing I pasted it from had each "[Degree]C" as a square box. Goodness knows why/what it apparently concatenated the symbol PLUS the 'C' into a single Unicode character ...
Right, that's U+2103 (DEGREE CELCIUS). The thought that it could be passed briefly through my head but didn't seem to agree with the numbers you posted. It's only in Unicode for the purposes of round-trip compatibility with some east Asian encodings (e.g. JIS X 0212), and it has a compatibility decomposition to U+00B0 U+0043 (i.e. the degree character followed by a capital C). So really, it's not a character you'd expect people to be using normally.If you're looking at a UTF-8 encoded file containing U+2103 you'd expect to see the 3 octet sequence 0xE2 0x84 0x83. |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2006-02-23 : 10:22:14
|
Arnold, I have no idea how you know this stuff, but I'm rally glad that you do and that you hang around these parts. Many thanks.Can you help me with a) why FOR XML EXPLICIT didn't properly encode this character and b) how I can force it to - or for it to use a more appropriate encoding?Some sample code (nicked from something you did earlier, ta!) below. The only things I'm getting encoded are:" (34)& (38)' (39)< (60)> (62)which doesn't seem to be very much  DROP TABLE XMLCollationTestGOCREATE TABLE XMLCollationTest( n int PRIMARY KEY, c1 varchar(20) COLLATE Latin1_General_BIN NOT NULL, c2 varchar(20) COLLATE Latin1_General_BIN NOT NULL)DECLARE @n intSET @n = 32 -- Start with a space-- fill varchar columns with English and CHAR versions of each characterSET NOCOUNT ONBEGIN TRANSACTIONWHILE @n < 256 BEGIN INSERT INTO XMLCollationTest (n, c1, c2) SELECT [n] = @n, [c1] = 'Char-' + CONVERT(varchar(20), @n), [c2] = '[' + CHAR(@n) + ']' SET @n = @n + 1ENDCOMMIT TRANSACTIONSET NOCOUNT OFFGO/**-- DEBUG : Check what was storedSELECT *FROM XMLCollationTestORDER BY n**/SELECT [Tag] = 10, [Parent] = NULL, [XMLCollationTest!10!n!element] = n, [XMLCollationTest!10!c1!element] = c1, [XMLCollationTest!10!c2!element] = c2FROM dbo.XMLCollationTestORDER BY [XMLCollationTest!10!n!element], [Tag]FOR XML EXPLICIT Kristen |
 |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2006-02-23 : 14:25:22
|
| From the output of that, and what you get when you change the columns to nvarchar, CHAR() to NCHAR() and extend the numbers up to, say, 500, it certainly looks like it never escapes characters as character references (i.e. sequences of the form [ampersand] [hash] [digits] [semicolon]).If you were outputing the data into a file with bcp in "-c" character mode and the data included characters that couldn't be represented in the codepage of the output (usually CP1252), then those characters would get downconverted. The only way I can see to keep those characters would be to use the "-w" Unicode character mode which would mean it's encoded as UTF-16. The only wrinkle being that you appear to get a BOM (byte order mark U+FEFF) at the start of the output in that case, so prepending your XML declaration is a bit tricky -- although possibly not: U+FEFF in its other guise as zero-width no-break space is probably harmless between the root element start tag and the first start tag of its content.This would be less problematic if it turns out that there's some secret option in "FOR XML" that allows you to specify an encoding character set for the output so that all characters outside that set get replaced with character references. "FOR XML" is not something I know much about.If you're consuming the output of your SELECT ... FOR XML some other way, I don't really have any advice! |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2006-02-23 : 14:46:19
|
As a wild guess I tried addingFOR XML EXPLICIT, BINARY BASE64it didn't make any difference Thanks for your advice and input Arnold, but I've given up. I'm pre-converting using:1) A test to find any "rogue" characters. This is repeated for each "potential" column-- Check for rogue characters before XML exportDECLARE @strPattern nvarchar(1000)SELECT @strPattern = '%[^-+_(),.:;!&*#@?%=<>/"'' 0-9A-Za-z' + CHAR(13) + CHAR(10) + CHAR(9) + CHAR(96) -- Permitted -- Rogue characters found so far + CHAR(145) + CHAR(146) + CHAR(147) + CHAR(148) + CHAR(150) + CHAR(151) + CHAR(153) + CHAR(163) -- ... + ']%'-- Show a few sample values, not already found.-- Convert CR LF and TAB to single character to make character position indication easierSELECT TOP 100 MyPK, [Offset] = PATINDEX(@strPattern, MyStringColumn), [ASCII] = ASCII(SUBSTRING(MyStringColumn, PATINDEX(@strPattern, MyStringColumn COLLATE Latin1_General_BIN), 1)), CHAR(13)+CHAR(10)+' ' + REPLICATE('.', PATINDEX(@strPattern, MyStringColumn COLLATE Latin1_General_BIN)-1)+'v', CHAR(13)+CHAR(10)+' ' + REPLACE(REPLACE(REPLACE(MyStringColumn, CHAR(13), '¬'), CHAR(10), '¬'), CHAR(9), '¬')FROM dbo.TEMP_ImportDataWHERE MyStringColumn LIKE @strPattern COLLATE Latin1_General_BINand then a wrapper-function to replace any rogue characters that does: RETURN REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE (REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE( @strData-- See http://www.ascii.cl/htmlcodes.htm-- NOTE: Space added after each "&" to prevent forum displaying as character! , CHAR(145), '& #8216;') -- ‘ (See note 1) , CHAR(146), '& #8217;') -- ’ (See note 1) , CHAR(147), '& #8220;') -- “ (See note 1) , CHAR(148), '& #8221;') -- ” (See note 1) , CHAR(150), '& ndash;') -- – , CHAR(151), '& mdash;') -- — , CHAR(153), '& #8482;') -- ™ (See note 1) , CHAR(163), '& pound;') -- £ , CHAR(174), '& reg;') -- ® , CHAR(176), '& deg;') -- ° , CHAR(178), '& sup2;') -- ² , CHAR(180), '& acute;') -- ´ , CHAR(183), '& middot;') -- · Might be intended to be a bullet , CHAR(186), '& ordm;') -- º , CHAR(188), '& frac14;') -- ¼ , CHAR(189), '& frac12;') -- ½ , CHAR(190), '& frac34;') -- ¾ , CHAR(191), '& iquest;') -- ¿ , CHAR(239), '& iuml;') -- ï-- Notes: (1) HTML 4.01, ISO 10646, ISO 8879, Latin extended A and B, Browser support: Internet Explorer > 4, Netscape > 4 Kristen |
 |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2006-02-23 : 17:47:06
|
quote:
[...] , CHAR(150), '–') -- – , CHAR(151), '—') -- —[...] , CHAR(163), '£') -- £ , CHAR(174), '®') -- ® , CHAR(176), '°') -- ° , CHAR(178), '²') -- ² , CHAR(180), '´') -- ´[...]
Sorry Kristen, I don't understand this bit at all. These replaces won't do anything, will they? Which means that you still need to be able to represent those character as encoded characters (and character references) in the output. And since they include characters with CP1252 codes in the range 128-159, you're probably going to need output it as CP1252 (or UTF-16) and treat that output accordingly. |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2006-02-24 : 00:35:26
|
| Hehehe ... Snitz has replace my &xxx with the actual character!I'll go back and doctor the original ... thanks.Kristen |
 |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2006-02-24 : 03:59:10
|
| Ah, that would explain it!However, those HTML character entities are not an intrinsic part of XML. If your output isn't referencing a DTD that defines those character entities then an XML parser will still show them as errors (though it's likely to be a better explained error than you were getting with bad UTF-8 sequences). If you use the character reference then it will work in any XML context (e.g. & #x2014; or & #8212; for an em dash, depending on whether you like hex or decimal) |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2006-02-24 : 12:47:54
|
Someone's out to make this XML malarkey really hard to cope with, aren't they!!There was I thinking how easy "& mdash;" was to read in the code ... and now I need to replace it with "& #8212;" Thanks for the headsup though ... definitely a big hole I was heading to fall into. I bet it will work fine in have the MS stuff I am using to test with, and then fail in some Open Source equivalent that the client will, no doubt!, be using.Kristen |
 |
|
|
aiken
Aged Yak Warrior
525 Posts |
Posted - 2006-02-24 : 19:36:56
|
| You can also define the entities at the top of your XML document; something like <!ENTITY mdash "& #x2014;">-b |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2006-02-25 : 01:34:23
|
| Thanks aiken, I didn't know that."those HTML character entities are not an intrinsic part of XML"Arnold: what is happening, in practice, is that I convert the original MDash character into "& mdash;" in the column [in my staging table].The FOR XML EXPLICIT output I am doing encodes this as "& amp;mdash;" - so the mdash does not have to be known, per se.This transports through the XML "layer" of the task, and then successfully un-encodes on import back to "& mdash;".So for my purpose of getting some data from A to B this appears to do nicely!ThanksKristen |
 |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2006-02-25 : 04:38:27
|
| Ah, I see, so all the actual HTML is just text with lots of less-than and greater-than characters in it as far as XML is concerned.I'm still a bit mystified about what happened at the start of this thread, though. Where did this U+2103 come from? Was it stored in an nvarchar column? Did you just decide to replace it with "& #176;C"? |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2006-02-25 : 07:58:26
|
| "so all the actual HTML is just text with lots ..."Well not quite! The client has actually put ASCII codes in for the various extended characters. They ought to have used "& xxx;" character entities instead.So first off I exported their data using FOR XML EXPLICIT.Then I tried to BulkXMLImport it and I get "invalid character" messages.So I then opened it in my trusty Programmer's Editor (Visual Slick Edit) and saw the rogue characters.At this point I don't really know whether SQL's FOR XML EXPLICIT or my Editor mucked up and interpreted the character as U+2103.(So I then did some UPDATE REPLACE(...REPLACE(... to substitute character entities instead of the original Extended ASCII codes - which will hopefully make the web pages display better anyway!!)Kristen |
 |
|
|
|
|
|
|
|