ACushen
Starting Member
29 Posts |
Posted - 2006-09-22 : 23:40:55
Hi all! I am in need of some help importing a .CSV file into a SQL Server 2005 Express database. I can't use SSIS/DTS because it's SQL Server Express, and the Data Import Wizard doesn't work either, for reasons stated below. The operation needs to run as a parameterized stored proc which I will call from ASP.NET.

The problem is that I need to get the only 2 columns from the text file, then "tack on" 3 more columns whose data changes and comes in as parameters to the T-SQL stored proc. This prevents me from using a straight BULK INSERT. The data in each of these columns will be the same for each record in that column - that is, every record in column 3, for example, will be the same.

Some of the files will have 4 million records and up, so speed is of the essence here. I tried using BULK INSERT to dump the data into a temporary table - which took 38 seconds and was acceptably fast - but then my next step was to dump the additional data into the other columns using UPDATE ... SET. I gave up on this after the query ran for THIRTY MINUTES! My next step was going to be to move the data from the temporary table into the target permanent table somehow, but I never got that far when I saw how long the previous step took...

It's a little odd, because I can do the same thing in MS Access in under 5 minutes by using a SQL statement like this in my ASP.NET code:

INSERT INTO [Target_Table] (field1, field2, field3, field4)
SELECT F1, F2, strSource, Now() AS DateTimeStamp
FROM [Text;HDR=NO;DATABASE=c:\test;].[test.CSV]

strSource is a string that the user enters (properly vetted for security); strPath and strFilename are strings holding the path and filename of the .CSV file. I create a unique filename from the file the user uploads. This works in under 5 minutes for several million records in MS Access, as I said above. I've had no luck getting anything similar to work in SQL Server, though.

Can anyone suggest a way to do this at a reasonable speed? Is there a way to speed up the following approach:

1) Use BULK INSERT or similar to dump the .CSV data into a temp table
2) Use UPDATE to get the additional data into the temp table
3) Use INSERT ... SELECT to get the data from the temp table into the target permanent db table

Is there a faster way to do any of the above steps? Using ALTER TABLE perhaps?

HELLLLLLPPPP.......

Thanks in advance,
-Andrew
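As a rough illustration of step 1 (the part that took 38 seconds), the staging load might look something like this - a sketch only, with invented table, column, and path names:

-- Staging table holding just the two columns that come from the .CSV file
CREATE TABLE #ImportStaging (F1 varchar(50), F2 varchar(50))

-- TABLOCK helps the load qualify for minimal logging
BULK INSERT #ImportStaging
FROM 'c:\test\test.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK)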
snSQL
Master Smack Fu Yak Hacker
1837 Posts |
Posted - 2006-09-23 : 01:59:16
There's no need to update the temp table before you insert into the target table - you can put those three variable values into the INSERT, much like you did with Access. Assuming your three variables coming into the stored proc are @var1, @var2, and @var3, just do this:

INSERT [Target_Table] (col1, col2, col3, col4, col5)
SELECT tempCol1, tempCol2, @var1, @var2, @var3
FROM #TempTable

You should try dropping any indexes and constraints from the target table before you start and then adding them back again afterwards.
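Putting the staging load and the INSERT ... SELECT together, the whole operation could be wrapped in a parameterized procedure roughly like the sketch below. The procedure, table, and column names are invented for illustration, not the poster's actual schema, and because BULK INSERT will not accept a variable for the file name, that part has to go through dynamic SQL:

CREATE PROC dbo.ImportCsv
    @FilePath nvarchar(260),
    @Source   varchar(100),
    @Var2     varchar(100),
    @Var3     varchar(100)
AS
BEGIN
    -- Staging table for the two columns that come from the file
    CREATE TABLE #ImportStaging (F1 varchar(50), F2 varchar(50))

    -- BULK INSERT needs a literal file name, so build the statement dynamically;
    -- the temp table created above is still visible inside the dynamic batch
    DECLARE @sql nvarchar(1000)
    SET @sql = N'BULK INSERT #ImportStaging FROM ''' + @FilePath
             + N''' WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'', TABLOCK)'
    EXEC sp_executesql @stmt = @sql

    -- Tack the three per-run values on while copying into the permanent table
    INSERT Target_Table (field1, field2, field3, field4, field5)
    SELECT F1, F2, @Source, @Var2, @Var3
    FROM #ImportStaging
END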
ACushen
Starting Member
29 Posts |
Posted - 2006-09-23 : 11:15:53
Wow. I've been struggling with this so long that I didn't see what was in front of my face... Thanks!! My 21MB file now imports in 6:40 (min:sec), which is MUCH better...

I will try dropping any indexes and constraints from the target table. Does anyone have any other ideas for speeding this monster up?

I have already added DATAFILETYPE = 'native' and TABLOCK to the BULK INSERT statement, and they helped. I tried SET RECOVERY BULK_LOGGED, but that seemed to actually increase the time the query took. Perhaps I need to run that test again...

Thanks again,
-Andrew
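For anyone trying the recovery-model change mentioned above, the full statements run around the load would be something like this - a sketch assuming a database called MyDb that normally runs in the FULL recovery model:

ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED
-- ... run the BULK INSERT here ...
ALTER DATABASE MyDb SET RECOVERY FULL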
nr
SQLTeam MVY
12543 Posts |
Posted - 2006-09-23 : 11:33:41
You can also choose the columns to bulk insert by using a format file - see
http://www.nigelrivett.net/SQLTsql/BCP_quoted_CSV_Format_file.html

This would make the bulk insert faster, and also the insert into the final table, because it has to read fewer pages from disk. It won't make much difference with a small file but it will be appreciable with a large one.

If you are going to play with the table structure then you could set a default on the extra columns and bulk insert directly into the final table. If you don't want to do that, then faster than the insert would probably be:

bulk insert to table
bcp out to file from query including the new values
bulk insert to final table

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.
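As a concrete illustration of the format-file idea, a non-XML format file that loads only the first two fields of a comma-delimited file might look like the sketch below. The column names are invented, the 9.0 version number assumes SQL Server 2005, and a field mapped to server column 0 is simply skipped (quoted fields need the more elaborate terminators described in the linked article):

9.0
3
1   SQLCHAR   0   0   ","      1   F1         SQL_Latin1_General_CP1_CI_AS
2   SQLCHAR   0   0   ","      2   F2         SQL_Latin1_General_CP1_CI_AS
3   SQLCHAR   0   0   "\r\n"   0   Unwanted   ""

It would then be referenced from the load with something like BULK INSERT ... WITH (FORMATFILE = 'c:\test\test.fmt').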
ACushen
Starting Member
29 Posts |
Posted - 2006-09-23 : 12:05:45
NR - Thanks, I am already using a format file.

As far as setting a default on the target table columns, the problem is that the data going into those columns is coming from a user-entered string. If I set a default then bulk insert the 2 columns from the .CSV directly to the target db table, how do I then get my user-entered value into just those columns without considerable gymnastics, or slowing the whole thing down?

I would like to avoid using bcp if possible.

Thanks,
-Andrew
nr
SQLTeam MVY
12543 Posts |
Posted - 2006-09-23 : 12:10:51
Just add the defaults before you start the insert and remove them afterwards.

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.
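In other words, something along these lines around the load - a minimal sketch, assuming the extra column is called SourceName on Target_Table (both names invented here):

ALTER TABLE Target_Table ADD CONSTRAINT DF_SourceName DEFAULT 'some value' FOR SourceName
-- ... BULK INSERT directly into Target_Table ...
ALTER TABLE Target_Table DROP CONSTRAINT DF_SourceName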
Kristen
Test
22859 Posts |
Posted - 2006-09-23 : 12:50:21
"Does anyone have any other ideas for speeding this monster up?"

6 minutes sounds like a very long time for a 21MB import to me ...

If the import file is pre-sorted in the order of the clustered index on the table, you can use the "ORDER" bulk copy hint - which alerts SQL Server to the fact that the file is in the same order as the clustered index. (If the ORDER clause specifies an order different from the clustered index, SQL Server will just ignore the hint.)

Kristen
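With the hint in place, the load might look like this - a sketch assuming the table being loaded has a clustered index on (F1, F2) and the file is sorted the same way (names invented, as before):

BULK INSERT dbo.Target_Table
FROM 'c:\test\test.csv'
WITH (FIELDTERMINATOR = ',',
      ROWTERMINATOR = '\n',
      TABLOCK,
      ORDER (F1 ASC, F2 ASC))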
ACushen
Starting Member
29 Posts |
Posted - 2006-09-26 : 11:45:59
Thanks all, I will try a few of your ideas.

-Andrew
ACushen
Starting Member
29 Posts |
Posted - 2006-09-27 : 14:19:36
quote: Originally posted by nr

Just add the defaults before you start the insert and remove them afterwards.

UPDATE: I have changed the table so that I now only need to add one value to the target table. I am trying to set that value as a default, so I can BULK INSERT directly into the target table. But...

nr: Remember I am calling the stored proc which does the .CSV import from an ASP.NET website, so I can't manually add & remove the defaults. It is my understanding that using

ALTER TABLE tableName
ALTER COLUMN colName
SET DEFAULT defaultValue

requires a constant value for defaultValue; I need to use a user-entered value. I have tried building the above statement dynamically with EXEC and sp_executesql, but I get an error like this:

"Procedure sp_executesql, Line 1
Procedure expects parameter '@statement' of type 'ntext/nchar/nvarchar'."

Is it possible to set a user-entered value as a default??

-Andrew
snSQL
Master Smack Fu Yak Hacker
1837 Posts |
Posted - 2006-09-27 : 14:45:12
Yes you can, here's an example:

CREATE TABLE Test (one int CONSTRAINT ColOneDefault DEFAULT 0)
GO
CREATE PROC AddsDefault (@def int)
AS
DECLARE @sql nvarchar(1000)
SET @sql = 'ALTER TABLE Test ' +
    'ADD CONSTRAINT ColOneDefault DEFAULT ' + cast(@def AS varchar(10)) + ' FOR one'
ALTER TABLE Test
DROP CONSTRAINT ColOneDefault
EXEC sp_executesql @stmt = @sql
GO
EXEC AddsDefault 85
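Since the value in this thread is a user-entered string rather than an int, the dynamic SQL would need the value quoted. A sketch of that variant is below - the table, constraint, and column names are invented, and note that QUOTENAME only accepts input up to 128 characters:

DECLARE @strSource varchar(100), @sql nvarchar(1000)
SET @strSource = 'value typed in by the user'

-- QUOTENAME with '''' wraps the value in single quotes and doubles any embedded ones
SET @sql = 'ALTER TABLE Target_Table ADD CONSTRAINT DF_SourceName DEFAULT '
         + QUOTENAME(@strSource, '''') + ' FOR SourceName'
EXEC sp_executesql @stmt = @sql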
ACushen
Starting Member
29 Posts |
Posted - 2006-09-27 : 18:06:33
snSQL: Thanks, I had finally gotten that to work using EXEC(), and forgotten to post. When I have a few minutes I will take your code and try to convert my proc to work with it.

To all - I have determined that doing a BULK INSERT directly to the target table will be too slow; I already have data in the table, and a clustered index exists, so setting the recovery mode to BULK_LOGGED has no effect. As a result, the direct BULK INSERT is painfully slow. Not to mention that my log file balloons from 128MB to over 10GB...

I have discovered that, using a combination of methods suggested by helpful folks on this and other forums, I can get my 4-million-row test file into a temporary table in approximately 5 seconds, which is "plenty fast". Unfortunately, my INSERT INTO ... SELECT method of getting the data from the temp table into the target permanent table is taking upwards of 10 minutes...

Any advice on the fastest way to get all the records from the temp table into the target table?

Thanks again...
-Andrew
ACushen
Starting Member
29 Posts |
Posted - 2006-09-27 : 23:05:47
UPDATE: "Things are really getting weird" dept....

OK. So I have played around with a few different options and done some tests. As a test file, I used one .CSV file that is about 21.5 MB, containing 3.68 million records. In between each test, I wiped the transaction log back to its starting size. I also deleted the table at the end of each run, re-creating it from the same script.

I tested the following 3 scenarios:

- Run 1 - BULK INSERT into temp table, then INSERT ... SELECT into target table; recovery set to BULK_LOGGED temporarily
- Run 2 - BULK INSERT directly into target table, setting a temporary default on one column; recovery set to BULK_LOGGED temporarily
- Run 3 - BULK INSERT directly into target table, setting a temporary default on one column; recovery set to SIMPLE temporarily

Here are the results (min:sec):

Run 1: 03:58
Run 2: 09:57
Run 3: 04:19

I have to say that I am really surprised that dumping the data into a temp table and then moving it to the target table was so much faster than the BULK_LOGGED version that dumped the data directly into the target table! Anyone have any thoughts?

Also, I am still in search of a faster way to get the data from the temp table into the target table. INSERT ... SELECT doesn't seem like it's that quick... and while the 3:58 import is fast enough, some of the files I will need to import are more than 3 times the size of the test file I used above, and fifteen minutes for a single import is too long...

Thanks again everyone,
-Andrew
Kristen
Test
22859 Posts |
Posted - 2006-09-28 : 00:06:21
Have you tried with your CSV file pre-sorted in clustered-index order? (You need to provide a hint to BULK INSERT too.)

Is it the 15 minutes for a single transaction that's the problem, or the fact that it takes 15 minutes to get the data in? I mean: could you transfer the data from the staging table to the target table in batches, so that it wasn't a single lock-out transaction?

Kristen
snSQL
Master Smack Fu Yak Hacker
1837 Posts |
Posted - 2006-09-28 : 00:23:52
Are you dropping indexes and dropping or turning off constraints on the target before you INSERT ... SELECT? Is there a clustered index on the target table? If so, try setting the fillfactor very low on that index before the INSERT ... SELECT.
ACushen
Starting Member
29 Posts |
Posted - 2006-09-28 : 15:24:32
Kristen - Unfortunately the target table is not sorted at all - records are entered as the users enter them, pretty much randomly, and the clustered index is set on an identity column. The .CSV file is sorted by the 2 columns that are in it, but there is no column corresponding to the identity column in the target table (if that makes any sense--?)

As far as the 15 minutes - the entire import is happening via an ASP.NET web page. The user selects a .CSV file from his/her PC, and enters some text to identify it. The file is then uploaded to the server, my code re-names it with a unique name, and then the import starts via stored procedure. I am using a background thread in ASP.NET to do the upload/stored proc call, and the web page keeps re-loading via an HTML META REFRESH, checking the thread to see if it's finished. So I don't see how I can do the upload in batches yet still keep the user informed - it's a complex situation and perhaps not the best design, but the client wanted a web interface, so...

Basically, I want to get the time of the entire operation down to as short a period as possible, so the users don't have to sit there with a browser window open any longer than necessary.

snSQL: I have not yet tried your suggestions. I will tackle them next and get back to you.

Thanks,
-Andrew
Kristen
Test
22859 Posts |
Posted - 2006-09-28 : 15:37:48
OK, then "Option One" from your benchmarks is probably your best bet. Organising your temp table to have a PK / clustered index on the CSV's 2 sorted columns (and providing the appropriate BULK INSERT hint) may help get it into the temp staging table a bit quicker.

If the transfer to the main table is then too long (by which I mean it is locking other people out for too long), consider transferring it in batches - which won't work if the job needs to be atomic, of course. Add an identity column to your TEMP table, and then you can do something like:

DECLARE @intID int, @intMaxID int

SELECT @intID = 0
SELECT @intMaxID = MAX(MyIdentity) FROM TempTable

WHILE @intID < @intMaxID
BEGIN
    SELECT @intID = @intID + 10000

    INSERT INTO MainTable
    SELECT * FROM TempTable WHERE MyIdentity <= @intID

    DELETE TempTable WHERE MyIdentity <= @intID

    WAITFOR DELAY '000:00:05'
END

Kristen
ACushen
Starting Member
29 Posts |
Posted - 2006-09-28 : 16:03:58
snSQL: There is only one index on the table - the clustered index. I will try dropping constraints.

How do I programmatically change the fillfactor from a stored proc??

-Andrew
snSQL
Master Smack Fu Yak Hacker
1837 Posts |
Posted - 2006-09-28 : 16:43:29
quote: How do I programmatically change the fillfactor from a Stored Proc??

You use DBCC DBREINDEX, but you don't want to do that in the sp that runs every time - it rebuilds the entire index, which will itself take a while. You want to just run it periodically, so that the index pages have more free space in them for when the new records are added.
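For reference, a periodic rebuild with a low fillfactor might look like this - a sketch only, with an invented index name (a fillfactor of 50 leaves half of each page free for incoming rows):

-- Rebuild the clustered index on the target table with 50% free space per page
DBCC DBREINDEX ('dbo.Target_Table', 'PK_Target_Table', 50)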
ACushen
Starting Member
29 Posts |
Posted - 2006-09-28 : 16:57:12
UPDATE: I think I am now going to mark this one closed, with many thanks to all who participated.

I have now got the insert going directly into the target table to run in 1:28 (min:sec). That's fast enough for me! Plus all the time I've spent tuning the insert comes out of MY wallet - I can't charge the client.

The key was noticing that the only index on the table was the clustered index on the identity column. I played around with the indexes and found that putting a single index on BOTH of the 2 columns I'm inserting did the trick (the index was placed on the target table in the db). Interestingly, the no-temp-table approach worked faster than the temp-table approach - the opposite of everything I saw up to this point.

I will continue playing around with the indexes, but I think I can do that without any assistance.

Thanks again to all who took time from their personal lives to help me with this one. It was really driving me crazy!

-Andrew
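For completeness, the kind of index described above would be created along these lines - a sketch with invented table and column names, since the actual schema isn't shown in the thread:

-- Single composite index covering both of the columns loaded from the .CSV file
CREATE INDEX IX_Target_Table_F1_F2 ON dbo.Target_Table (field1, field2)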
ACushen
Starting Member
29 Posts |
Posted - 2006-09-28 : 17:26:01
snSQL: Ahhh, thanks. I was wondering, because the info I found in Books Online made it sound like something not to be done lightly...

BTW, is there any way to "officially" mark a thread closed in this forum, other than just saying so in a post?

Thanks,
-Andrew
Kristen
Test
22859 Posts |
Posted - 2006-09-29 : 01:26:34
"BTW, is there any way to "officially" mark a thread closed in this Forum, other than just saying so in a post?"

Some people re-edit the subject and append "[SOLVED]" - that probably helps Googlers.

Glad you got it solved - the timings now look like what I would expect for that size of import.

Kristen