Keith.Macdonald
Starting Member
5 Posts |
Posted - 2006-10-03 : 06:39:57
|
I know that BULK LOAD does not like Unix UTF-8 files. This is a great shame because our old email system (IMail) was running on an MS server, so it gave us log files that bulk load would accept. Now we're moving to CommuniGate on a Unix platform, so the logs are UTF-8 files. The actual problem is that our old log processing relies on stored procedures that start with a bulk load of the logfile before running queries to separate out things like POP3 connections, SMTP messages sent & received, etc. Has anyone else had a similar problem, and what did they use instead of bulk load? All suggestions gratefully received. |
|
Kristen
Test
22859 Posts |
Posted - 2006-10-03 : 06:53:02
|
I don't know what a Unix UTF-8 file looks like, but isn't it just a question of telling Bulk Load what the Delimiter and Row Separators are? Or, failing that, running the file through something that converts them to whatever Bulk Load wants?

Kristen |
|
|
Keith.Macdonald
Starting Member
5 Posts |
Posted - 2006-10-03 : 07:09:59
|
Re << isn't it just a question of telling Bulk Load what the Delimiter and Row Separators >>

I've tried that:

BULK INSERT CMailBulkLoadRaw_tbl FROM "T:\CMailLog\Test\2006-09-16.log" WITH ( ROWTERMINATOR = '\n' )

I've tried ROWTERMINATOR = '\r' as well. I've found that if I open the file in TextPad, I can Save As and then choose PC instead of Unix as the format. Then the BULK INSERT works fine. Of course, that's OK for hacking in development, but no good for a production system running 365x24 where the stored procedure is run at 03:00 from a SQL Agent job.

I did a Google search for "convert UTF-8" and found TextPipe, but functionally that does nothing more for me than TextPad does for free. My options (so far) seem to be:
1) use something else in the stored procedures instead of bulk load (probably DTS?)
2) find a freeware utility that can convert the file format before the SQL job runs
3) write a converter utility
4) umm? |
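Option 3 (writing a converter utility) needn't be a big job. Here is a rough sketch in Python, assuming the log is plain UTF-8 with LF line endings and that the target is UTF-16 with CRLF endings, suitable for loading with BULK INSERT's 'widechar' mode. The script name and file paths are made up for illustration:

```python
# utf8_to_widechar.py -- hypothetical converter: turn a Unix UTF-8 log
# into a UTF-16 (BOM + CRLF) file that a widechar bulk load can read.
import sys

def convert(src_path, dst_path):
    # Read the raw UTF-8 text without translating line endings.
    with open(src_path, "r", encoding="utf-8", newline="") as src:
        text = src.read()
    # Normalise bare LF to CRLF (existing CRLF pairs are left alone).
    text = text.replace("\r\n", "\n").replace("\n", "\r\n")
    # Python's "utf-16" codec writes a BOM followed by UTF-16LE data.
    with open(dst_path, "w", encoding="utf-16", newline="") as dst:
        dst.write(text)

if __name__ == "__main__" and len(sys.argv) == 3:
    convert(sys.argv[1], sys.argv[2])
```

This could be called from the same batch file or SQL Agent job step that fetches the log, e.g. `python utf8_to_widechar.py 2006-09-16.log 2006-09-16.txt`, before the BULK INSERT runs.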
|
|
Kristen
Test
22859 Posts |
Posted - 2006-10-03 : 08:08:39
|
Might be worth having a look at the file before/after saving with TextPad and seeing what it has actually changed. I would expect that it has CHAR(10) for linebreaks before, and CHAR(13)+CHAR(10) after, but maybe there are other changes too.

Perhaps a FORMATFILE would help you more precisely define the file than just using ROWTERMINATOR?

Kristen |
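For reference, a non-XML format file for a single-column staging table might look something like this. This is only a sketch: the column name LogLine, the 8000-character length, the collation, and the assumption that the staging table has one raw-text column are illustrative, not from the thread (and the "\n" terminator assumes the file still has Unix line endings):

```
8.0
1
1   SQLCHAR   0   8000   "\n"   1   LogLine   SQL_Latin1_General_CP1_CI_AS
```

It would be referenced as BULK INSERT ... WITH (FORMATFILE = '...\log.fmt').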
|
|
QuantumBunny
Starting Member
2 Posts |
Posted - 2006-10-06 : 18:42:12
|
quote: Originally posted by Keith.Macdonald I know that BULK LOAD does not like Unix UTF-8 files. This is a great shame because our old email system (Imail) was running on an MS server and so it gave us log files that bulk load would accept. Now, we're moving to Communigate on a Unix platform. So the logs are in UTF-8 files. The actual problem is our old log processing relies on stored procedures that start with a bulk load of the logfile before running queries to seperate out things like POP3 connections, SMTP messages sent & received, etc. Has anyone else had a similar problem and what did they use instead of bulk load? All suggestions gratefully received.
Okay, the only benefit, as I am sure you are aware, of UTF-8 over ASCII is the ability to handle "special characters" like accented letters in French, Chinese characters, etc. The only benefit of UTF-8 over full Unicode is character size (and thus, over large files, file size). With DATAFILETYPE = 'widechar' in your bulk insert, you should be able to upload Unicode files perfectly. However, UTF-8 still didn't work perfectly for me, so the best way for me to get the file in without losing anything was to convert the file to full Unicode and then Bulk Insert it.

PS: you shouldn't open flat text files you intend to Bulk Insert with Notepad; it can change EOL characters. I recommend getting a file editor like TextPad or UltraEdit, which will not change the basic aspects of the file. |
|
|
Keith.Macdonald
Starting Member
5 Posts |
Posted - 2006-10-11 : 04:30:50
|
QuantumBunny, re your comment << I recommend getting a file editor like TextPad or UltraEdit, which will not change the basic aspects of the file. >>

Thanks for the suggestion, but I recommend you read what I've already posted, i.e. << I've found that if I open the file in TextPad, I can Save As and then choose PC instead of Unix as the format. Then the BULK INSERT works fine. >> |
|
|
Keith.Macdonald
Starting Member
5 Posts |
Posted - 2006-10-11 : 07:35:05
|
Kristen, I've found a solution. In the batch file that pulls the daily log from the email server to the SQL Server ready for uploading, I've added this one-line command:

TYPE yyyymmdd.log | FIND "" /V > yyyymmdd.txt

which streams the entire log file to a new version in a PC format.

Keith |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2006-10-11 : 08:02:58
|
quote: Originally posted by Keith.Macdonald

TYPE yyyymmdd.log | FIND "" /V > yyyymmdd.txt

which streams the entire log file to a new version in a PC format.
While this converts the line endings from Unix to DOS/Windows (i.e. prefixes each LF character with a CR), it doesn't change the character encoding from UTF-8 to something else, does it? You still have to either:
(a) convert it from UTF-8 to UTF-16 and import it with the DATAFILETYPE = 'widechar' setting;
(b) convert it to Windows-1252 (down-converting any characters that are not representable) and import it with the CODEPAGE = 'ACP' setting;
(c) leave it as UTF-8 and lie about the encoding when you import it. If you do this, any non-ASCII characters in the original text will be misrepresented as a (nonsensical-looking) sequence of characters in the database. |
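Option (b) is only a few lines in a scripting language. A rough Python sketch (the function and file paths are hypothetical), replacing anything Windows-1252 cannot represent with '?':

```python
# Hypothetical sketch of option (b): down-convert a UTF-8 file to
# Windows-1252 so it can be imported with CODEPAGE = 'ACP'.
def utf8_to_cp1252(src_path, dst_path):
    with open(src_path, encoding="utf-8") as src:
        text = src.read()
    # errors="replace" substitutes '?' for unrepresentable characters.
    with open(dst_path, "w", encoding="cp1252",
              errors="replace", newline="") as dst:
        dst.write(text)
```

Note the trade-off Arnold describes: this path is lossy for anything outside Windows-1252, whereas option (a) preserves every character.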
|
|
Kristen
Test
22859 Posts |
Posted - 2006-10-11 : 08:21:26
|
"will be misrepresented as a (nonsensical-looking) sequence of characters"

... which you might be able to convert to something more visually correct with

REPLACE(MyColumn, CHAR(111), CHAR(222))

(Obviously not the real numbers for the CHAR() function!)

Kristen |
|
|
Keith.Macdonald
Starting Member
5 Posts |
Posted - 2006-10-11 : 08:33:48
|
Re << it doesn't change the character encoding from UTF-8 to something else, does it? >>

I might be wrong, but as a completely new file is being created on the SQL Server, I believe the answer might be yes. Without making any changes to the BULK INSERT command,

BULK INSERT CMailBulkLoadRaw_tbl FROM "T:\CMailLog\Test\Unix\2006-09-29.txt" WITH ( ROWTERMINATOR = '\n' )

it can now read the file and bulk load data that is not only readable but also produces meaningful results when queried. |
|
|
Arnold Fribble
Yak-finder General
1961 Posts |
Posted - 2006-10-11 : 09:51:57
|
quote: Originally posted by Kristen

"will be misrepresented as a (nonsensical-looking) sequence of characters"

... which you might be able to convert to something more visually correct with

REPLACE(MyColumn, CHAR(111), CHAR(222))

(Obviously not the real numbers for the CHAR() function!)

Kristen
It's a little more involved than that: a single Unicode character can be represented by up to 4 bytes in UTF-8. But if you do get into that situation, this might be useful: http://www.sqlteam.com/Forums/topic.asp?TOPIC_ID=62406 |
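As an illustration of why a character-by-character REPLACE falls short: each multi-byte UTF-8 sequence mis-read as Windows-1252 becomes two to four junk characters, and in many cases the whole mess can be reversed in one step by undoing the mis-decode. A Python sketch (this assumes every byte of the UTF-8 text happens to be a defined Windows-1252 character, which is not guaranteed):

```python
# 'é' is two bytes in UTF-8 (0xC3 0xA9); read as Windows-1252, those
# bytes display as the two characters 'Ã' and '©'.
mojibake = "café".encode("utf-8").decode("cp1252")   # 'cafÃ©'
# Reversing the mis-decode recovers the original text.
repaired = mojibake.encode("cp1252").decode("utf-8")
```

If the text contains bytes undefined in Windows-1252 (0x81, 0x8D, 0x8F, 0x90, 0x9D), the round trip fails, which is another argument for importing the data correctly in the first place.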
|
|
ciaran
Starting Member
40 Posts |
Posted - 2006-10-11 : 10:50:28
|
Keith, yes, I had this problem. It was a pain. Basically you'll need to convert the file to Unicode, or any non-ASCII characters will appear translated as something else. I wrote a .NET console application that does the conversion for me, and I called it from my DTS package. Bit of a hack, I know, but this was the only way I could think of getting around it. |
|
|
Kristen
Test
22859 Posts |
Posted - 2006-10-11 : 14:07:43
|
If your only tool is a hammer, all your problems are nails! |
|
|
|