Lumbago
Norsk Yak Master
3271 Posts
Posted - 2007-12-07 : 04:55:54
Hi all, on a daily basis I import about 500 undelimited, fixed-width text files into SQL Server tables using a 2-step bulk import process. Step 1 is to bulk insert all rows in the text file into a "staging table" which has only one single wide column ("AllColumns" varchar(4000)) that holds all the characters from each row in the file. Step 2 is to insert into the real tables using SQL like this:

INSERT INTO table1
SELECT Column1 = SUBSTRING(AllColumns, 1, 5),
       Column2 = SUBSTRING(AllColumns, 6, 2),
       ...
FROM staging_table1

This works great and is quite fast compared to other methods I've tried, and I got great help in deciding to use this method (http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=68124). I do however suspect that this method is really memory intensive for the large files, and with some of them being over 700MB I suspect that performance degrades quite a bit. There are also several instances of this job running at the same time. My question is this: do you guys think I could see a performance gain by dividing the load of these large tables into several smaller steps, say batches of 100k rows at a time? (Step 1 already works in batches, but step 2 does not.)

--Lumbago
"Real programmers don't document, if it was hard to write it should be hard to understand"
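To make the setup concrete, step 1 looks roughly like this (table name, file path and options are just placeholders):

-- one wide column holds the whole fixed-width row
CREATE TABLE staging_table1 (AllColumns varchar(4000))

-- load the raw rows from the flat file
BULK INSERT staging_table1
FROM 'D:\import\file1.txt'
WITH (ROWTERMINATOR = '\n')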
rmiao
Master Smack Fu Yak Hacker
7266 Posts
Posted - 2007-12-08 : 00:09:13
It may speed up the process and prevent log file growth, but I don't think it will use less memory.
jezemine
Master Smack Fu Yak Hacker
2886 Posts
Posted - 2007-12-08 : 01:32:23
To keep the log file down, you can use the -b param to bcp so that you commit rows 100k at a time or whatever, rather than committing all the rows in the file as one transaction. This is particularly important if you have many, many rows in the file, like 100 million or so.

Are you by any chance using -h "TABLOCK"? That can speed things up a lot, but it will cause blocking if you have simultaneous bulk inserts into the same table.

To make this really fast, though, you could parse the files with a custom app written in compiled code to produce a CSV which you could then bulk insert directly into the final destination table. SQL is slow at string parsing compared to languages like C/C++/C#, so if the parsing is the bottleneck, you could do that part in compiled code and then just bulk insert the result.

elsasoft.org
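Since step 1 above uses BULK INSERT rather than bcp, the equivalent knobs there are BATCHSIZE and TABLOCK. A rough sketch (path and table name are placeholders):

-- commit every 100k rows and take a table lock for the duration of the load
BULK INSERT staging_table1
FROM 'D:\import\file1.txt'
WITH (ROWTERMINATOR = '\n', BATCHSIZE = 100000, TABLOCK)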
Lumbago
Norsk Yak Master
3271 Posts
Posted - 2007-12-10 : 04:16:06
I actually did some simple tests by adding an identity field to the staging table and doing a loop with 50k rows at a time, but I didn't see any difference at all. I just measured the time it took, and the two approaches were only a few seconds apart on a 4-minute job. The bulk insert part takes only a fraction of the total time; it's the SUBSTRING work and the rest of the insert that takes the time. My problem is that the loading times are highly inconsistent: sometimes one full job can take 3 hours, other times it can take up to 5 hours. I have only just begun to monitor the server for other types of activity, so even though there shouldn't be any, I hope I find something...

--Lumbago
"Real programmers don't document, if it was hard to write it should be hard to understand"
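For reference, the batched version I tested looked roughly like this (RowId is the identity column I added; column positions are placeholders):

DECLARE @lo int, @hi int, @batch int, @max int
SELECT @batch = 50000, @lo = 1, @hi = 50000
SELECT @max = MAX(RowId) FROM staging_table1

WHILE @lo <= @max
BEGIN
    -- same substring insert as before, restricted to one slice of rows
    INSERT INTO table1 (Column1, Column2)
    SELECT SUBSTRING(AllColumns, 1, 5),
           SUBSTRING(AllColumns, 6, 2)
    FROM staging_table1
    WHERE RowId BETWEEN @lo AND @hi

    SELECT @lo = @hi + 1, @hi = @hi + @batch
END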
nr
SQLTeam MVY
12543 Posts
Posted - 2007-12-10 : 08:27:13
Try exporting the data via a view on the staging table, then importing it into the production table (or a view on it) using bulk insert. You might find this faster depending on the recovery model and the table.

Another option is to use a format file to do the import (although I tend to think of that as a last resort).

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.
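The view just wraps the parsing that is currently done in the insert, something along these lines (the column positions and data types are placeholders and have to match the real file layout):

-- does the substring/typing work once, so the result can be bcp'd out as proper columns
CREATE VIEW v_staging_table1
AS
SELECT CAST(SUBSTRING(AllColumns, 1, 5) AS int)     AS Column1,
       CAST(SUBSTRING(AllColumns, 6, 2) AS char(2)) AS Column2
FROM staging_table1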
Lumbago
Norsk Yak Master
3271 Posts
Posted - 2007-12-10 : 08:42:46
Hm... so are you actually saying that I should create views of my selects, bcp them all to csv files or something (which format did you have in mind?), and then re-import into the production table using bulk insert? Just wanna make sure I get you right...

--Lumbago
"Real programmers don't document, if it was hard to write it should be hard to understand"
nr
SQLTeam MVY
12543 Posts
Posted - 2007-12-10 : 08:57:12
You should be able to bcp out and in using native format (if you get the data types correct in the view), which is lots faster. It depends on your system, but this can be a lot faster than inserts using substrings - at the expense of complication.

How much data are you importing? 5 hours is an awfully long time.

Are you importing from the server or a remote machine? Consider copying the files to the server before the import.

How long does the import into the staging table take? It should be insignificant compared to the inserts the way you are doing it.

Are you replacing all the data in the production tables? If so, drop all indexes and recreate them after the inserts. If not, consider loading into another table and using views to access them (maybe partitioned views - would be a partitioned table in v2005).

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.
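In command form that round trip would look roughly like this - the server, database and file names are made up, and the -b/-h switches are the ones mentioned earlier in the thread:

rem export the typed view in native format
bcp MyDb.dbo.v_staging_table1 out D:\load\table1.dat -n -T -S MYSERVER

rem load the production table from the native file, batched, with a table lock
bcp MyDb.dbo.table1 in D:\load\table1.dat -n -T -S MYSERVER -b 100000 -h "TABLOCK"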
Lumbago
Norsk Yak Master
3271 Posts
Posted - 2007-12-10 : 09:34:00
Hm, I never even thought of this option but it could be fun to try it out! For this particular system I import about 3.7GB of flat .txt files every day, and using this method I import about 10GB on a daily basis, so it's definitely worth trying to optimize even more. I don't actually know how long the import into the staging table takes (I only time the total for both staging and import to the prod table), but a qualified guess would be about 10% of total loading time. I truncate each table every time and the recovery model is simple, so there shouldn't be any logging causing significant overhead. I do have one clustered index on each table, but the input from the files is sorted in the same order as the index, so this shouldn't cause much overhead either. But I will definitely try out the view/bcp method and see how it goes. If it works well I hope I'm able to automate it properly too, since I'm doing about 500 files every day.

--Lumbago
"Real programmers don't document, if it was hard to write it should be hard to understand"
nr
SQLTeam MVY
12543 Posts
Posted - 2007-12-10 : 09:52:01
Would definitely log the start and end of each process so you can see where the time is used up and where to concentrate your efforts.

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.
Lumbago
Norsk Yak Master
3271 Posts
Posted - 2007-12-12 : 06:36:50
Alright, I did try the view/bcp/bulk insert method you proposed, but it didn't give quite the performance gain I was hoping for. It seemed to be faster, but I also tested during the daytime, which is actually off-peak on this server, so the test results are not 100% reliable. However, I did leave out one vital piece of information (on purpose) that is responsible for the majority of the performance hit, simply because I didn't see any real alternative but to keep it as it is: to get the right datatypes from the text files I run a few really simple UDFs on basically all the substrings, and all together they add up to some serious overhead. I have experimented a bit with the larger files (one by one) and tried to remove the UDFs from some of the columns, and that helps out quite a bit. So I guess I have to go through all of the large files and tune them individually, which will be tedious but will hopefully save me loading time...

--Lumbago
"Real programmers don't document, if it was hard to write it should be hard to understand"
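To give an idea of what that tuning looks like, here's a before/after on one made-up date column - udf_ToDate stands in for one of my little helper functions:

-- today: a scalar UDF wraps the conversion
INSERT INTO table1 (Column1, SomeDate)
SELECT CAST(SUBSTRING(AllColumns, 1, 5) AS int),
       dbo.udf_ToDate(SUBSTRING(AllColumns, 6, 8))
FROM staging_table1

-- tuned: the same conversion done inline (style 112 = yyyymmdd)
INSERT INTO table1 (Column1, SomeDate)
SELECT CAST(SUBSTRING(AllColumns, 1, 5) AS int),
       CONVERT(datetime, SUBSTRING(AllColumns, 6, 8), 112)
FROM staging_table1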
nr
SQLTeam MVY
12543 Posts
Posted - 2007-12-12 : 19:33:08
A UDF will cause row-by-row processing and kill performance. You're better off running an update on the column - or, better still, adding an identity, inserting the converted column into another table, indexing it, then joining for the insert.

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.
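A rough sketch of that identity/side-table approach, reusing the made-up names from earlier in the thread (RowId is the identity on the staging table):

-- 1. do the awkward conversion once, keyed by the staging identity
SELECT RowId,
       CONVERT(datetime, SUBSTRING(AllColumns, 6, 8), 112) AS SomeDate
INTO   staging_dates
FROM   staging_table1

-- 2. index the side table so the join is cheap
CREATE UNIQUE CLUSTERED INDEX ix_staging_dates ON staging_dates (RowId)

-- 3. join it back in for the final insert
INSERT INTO table1 (Column1, SomeDate)
SELECT CAST(SUBSTRING(s.AllColumns, 1, 5) AS int),
       d.SomeDate
FROM   staging_table1 s
JOIN   staging_dates d ON d.RowId = s.RowId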