Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 Database Design and Application Architecture
 Index effects on insert

Author  Topic 

mciasco
Starting Member

5 Posts

Posted - 2009-03-10 : 07:49:20
I have to load data from a txt file, so I use the BULK INSERT command.
But the question is more general.

I have to create the table from scratch and then populate it.
I think there are only 2 options:
1) create table - create indexes - load data
2) create table - load data - create indexes

Which is the best command sequence?

Is different if there is a primary key in the table? And if so, when it must be created for better performance of the loading operation?

sodeep
Master Smack Fu Yak Hacker

7174 Posts

Posted - 2009-03-10 : 08:44:32
Load the data and then implement index so that even fragmentation and statistics is taken care.
Go to Top of Page

mciasco
Starting Member

5 Posts

Posted - 2009-03-10 : 12:44:55
quote:
Originally posted by sodeep

Load the data and then implement index so that even fragmentation and statistics is taken care.



Mmm, I'm thinking about the "exact" behaviour of the 3 options...

1) create table - create indexes - load data
What happens when I create indexes in an empty table? I think that the DBMS just create the root node for each B-tree index. Right?
So there are no pages allocated for data.
Then, during the data loading, for each row the DBMS must allocates page memory space, store phisically the row and update all the indexes.

2) create table - load data - create indexes
Here, loading data in an empty-not-indexed table, each row is stored sequentially, in the reading order. Again, for each row must be allocated some page space. But each insertion does not require indexes updates. But at the next step, when creating indexes on the populated table, each index is created reading the entire set of rows in sequential order, so the index must be ordered again, as a normal update.

Well, why a solution is better than the other? In both cases, there are no pages pre-allocated for data, and in both cases there's the cost of updating all the indexes for each row.
Go to Top of Page

sodeep
Master Smack Fu Yak Hacker

7174 Posts

Posted - 2009-03-10 : 13:07:00
The question is how big data you are inserting?
Go to Top of Page

mfemenel
Professor Frink

1421 Posts

Posted - 2009-03-10 : 13:19:13
Since you're creating the table from scratch, Option 2. If you create the index first you incur overhead by writing the data to the table and the index. Loading just the table writes it once to disk and it's loaded. You can then create the index after the load has completed.

Mike
"oh, that monkey is going to pay"
Go to Top of Page

mciasco
Starting Member

5 Posts

Posted - 2009-03-10 : 13:22:42
quote:
Originally posted by sodeep

The question is how big data you are inserting?



Actually I can't answer. I'm testing with a data file of 40 MB, but data file dimension could go from a few KB to hundreds of MB.

However, at this stage, I'm not really interested in finding a particular solution that works for me, but instead, I'm interested in the correct approach (even if it's just theory), cause I need to understand also the mechanisms behind the scene.
Go to Top of Page

mfemenel
Professor Frink

1421 Posts

Posted - 2009-03-10 : 13:26:32
Will it always be the case that you're loading a clean table or will you be appending. The 2 appraoches are somewhat different.

Mike
"oh, that monkey is going to pay"
Go to Top of Page

mciasco
Starting Member

5 Posts

Posted - 2009-03-10 : 14:01:17
quote:
Originally posted by mfemenel

Since you're creating the table from scratch, Option 2. If you create the index first you incur overhead by writing the data to the table and the index. Loading just the table writes it once to disk and it's loaded. You can then create the index after the load has completed.



Mmm, can you tell me why this happens? I'm not saying it's wrong, I really want to understand how it works!
If I create indexes after the load, I think that DBMS needs to write data to the index too, so the storing time should be the same.
I can't understand the difference of behaviour between pre-creation and post-creation of indexes.

quote:
Originally posted by mfemenel

Will it always be the case that you're loading a clean table or will you be appending. The 2 appraoches are somewhat different.



Tables are always created from scratch and populated reading rows from a non-empty file. The appending of single row during the normal table's life is not a problem (for the moment).
Go to Top of Page

sodeep
Master Smack Fu Yak Hacker

7174 Posts

Posted - 2009-03-10 : 14:24:36
quote:
Originally posted by mciasco

quote:
Originally posted by mfemenel

Since you're creating the table from scratch, Option 2. If you create the index first you incur overhead by writing the data to the table and the index. Loading just the table writes it once to disk and it's loaded. You can then create the index after the load has completed.



Mmm, can you tell me why this happens? I'm not saying it's wrong, I really want to understand how it works!
If I create indexes after the load, I think that DBMS needs to write data to the index too, so the storing time should be the same.
I can't understand the difference of behaviour between pre-creation and post-creation of indexes.

quote:
Originally posted by mfemenel

Will it always be the case that you're loading a clean table or will you be appending. The 2 appraoches are somewhat different.



Tables are always created from scratch and populated reading rows from a non-empty file. The appending of single row during the normal table's life is not a problem (for the moment).




Option 1-

You are dealing with double work with data pages and index page.This will be slow because data being inserted has to go in data pages as well map in index pages Which will lead to term called Fragmentation.It deals with External and internal fragmentation and when query on it, it has to do more work due to page splits/more page scans and not good statistics. you might again have to Rebuild or Reorganize index.

Option 2-

You are directly dealing with Data pages and it is faster as you are not taking overhead of index pages.when you apply indexes later on,you are mapping correctly to index pages as well as removing fragmentation,keeping statistics upto date.
Go to Top of Page

mciasco
Starting Member

5 Posts

Posted - 2009-03-11 : 06:07:05
Ok I made some tests.

I tested 3 DBMS: MySQL, PostgreSQL and SQLServer with the same data bulk file.
1) Each DBMS is installed locally (no network delays)
2) The bulk file is 50MB and contains 120.000 rows
3) each opeation (create table, load bulk and create indexes) is done separately and require a new connection, so for each time value is included the connection time
4) the values showed are result of 3 runs of the same test, so they are average values.

Sequence: CREATE TABLE - LOAD BULK - CREATE INDEXES
MYSQL
- create table (s): 7
- load bulk (s): 26
- create indexes(s): 3667
- total (s): 3700

POSTGRES
- create table (s): 5
- load bulk (s): 50
- create indexes(s): 83
- total (s): 509

SQLSERVER
- create table (s): 12
- load bulk (s): 21
- create indexes(s): 24
- total (s): 426

----------------
Sequence: CREATE TABLE - CREATE INDEXES - LOAD BULK
MYSQL
- create table (s): 1
- load bulk (s): 0
- create indexes(s): 565
- total (s): 893

POSTGRES
- create table (s): 1
- load bulk (s): 1
- create indexes(s): 499
- total (s): 883

SQLSERVER
- create table (s): 11
- load bulk (s): 2
- create indexes(s): 100
- total (s): 462


So, it seems that creating indexes after the bulk load is better for Postgres and SQLServer (Postgres takes more advantages from this solution, while SQLServer performances are almost the same in both cases).
But this approach in MySQL is totally wrong! The time needed for create indexes after the bulk load grows enormously! For MySQL is better to create indexes first, and then load data. And the difference between the two approaches is huge!
This seems strange if compared to the behaviour of Postgres and SqlServer.

Overall, SqlServer is the fastest DBMS in both cases, while MySQL is the slowest one.

What do you think about?
Go to Top of Page

darkdusky
Aged Yak Warrior

591 Posts

Posted - 2009-03-11 : 07:52:42
The "correct" solution depends on the situation.
For instance while the overall time might be the same for the two approaches using SQL Server (using this file size), there may be cases where you want the load to occur as fast as possible and are happy to build the indexes overnight where the overhead is less noticed. Also datatype of indexes will also affect time to build index.
Also I ran a test of creating a large table with no indexes and then tried to create indexes. The create index failed due to memory error (I ran this on a my local PC - so it may succeed on better spec server).
So there are limits to how large of a table indexes can be created afterwards.

CREATE TABLE [dbo].[testA](
[id] [int] IDENTITY(1,1) NOT NULL,
[t] [varchar](2) NULL

) ON [PRIMARY]


INSERT testA
DEFAULT VALUES
GO 20000000
Go to Top of Page
   

- Advertisement -