Author |
Topic |
mciasco
Starting Member
5 Posts |
Posted - 2009-03-10 : 07:49:20
|
I have to load data from a txt file, so I use the BULK INSERT command.But the question is more general.I have to create the table from scratch and then populate it.I think there are only 2 options:1) create table - create indexes - load data2) create table - load data - create indexesWhich is the best command sequence?Is different if there is a primary key in the table? And if so, when it must be created for better performance of the loading operation? |
|
sodeep
Master Smack Fu Yak Hacker
7174 Posts |
Posted - 2009-03-10 : 08:44:32
|
Load the data and then implement index so that even fragmentation and statistics is taken care. |
|
|
mciasco
Starting Member
5 Posts |
Posted - 2009-03-10 : 12:44:55
|
quote: Originally posted by sodeep Load the data and then implement index so that even fragmentation and statistics is taken care.
Mmm, I'm thinking about the "exact" behaviour of the 3 options...1) create table - create indexes - load dataWhat happens when I create indexes in an empty table? I think that the DBMS just create the root node for each B-tree index. Right?So there are no pages allocated for data.Then, during the data loading, for each row the DBMS must allocates page memory space, store phisically the row and update all the indexes.2) create table - load data - create indexesHere, loading data in an empty-not-indexed table, each row is stored sequentially, in the reading order. Again, for each row must be allocated some page space. But each insertion does not require indexes updates. But at the next step, when creating indexes on the populated table, each index is created reading the entire set of rows in sequential order, so the index must be ordered again, as a normal update.Well, why a solution is better than the other? In both cases, there are no pages pre-allocated for data, and in both cases there's the cost of updating all the indexes for each row. |
|
|
sodeep
Master Smack Fu Yak Hacker
7174 Posts |
Posted - 2009-03-10 : 13:07:00
|
The question is how big data you are inserting? |
|
|
mfemenel
Professor Frink
1421 Posts |
Posted - 2009-03-10 : 13:19:13
|
Since you're creating the table from scratch, Option 2. If you create the index first you incur overhead by writing the data to the table and the index. Loading just the table writes it once to disk and it's loaded. You can then create the index after the load has completed.Mike"oh, that monkey is going to pay" |
|
|
mciasco
Starting Member
5 Posts |
Posted - 2009-03-10 : 13:22:42
|
quote: Originally posted by sodeep The question is how big data you are inserting?
Actually I can't answer. I'm testing with a data file of 40 MB, but data file dimension could go from a few KB to hundreds of MB.However, at this stage, I'm not really interested in finding a particular solution that works for me, but instead, I'm interested in the correct approach (even if it's just theory), cause I need to understand also the mechanisms behind the scene. |
|
|
mfemenel
Professor Frink
1421 Posts |
Posted - 2009-03-10 : 13:26:32
|
Will it always be the case that you're loading a clean table or will you be appending. The 2 appraoches are somewhat different.Mike"oh, that monkey is going to pay" |
|
|
mciasco
Starting Member
5 Posts |
Posted - 2009-03-10 : 14:01:17
|
quote: Originally posted by mfemenel Since you're creating the table from scratch, Option 2. If you create the index first you incur overhead by writing the data to the table and the index. Loading just the table writes it once to disk and it's loaded. You can then create the index after the load has completed.
Mmm, can you tell me why this happens? I'm not saying it's wrong, I really want to understand how it works!If I create indexes after the load, I think that DBMS needs to write data to the index too, so the storing time should be the same.I can't understand the difference of behaviour between pre-creation and post-creation of indexes.quote: Originally posted by mfemenel Will it always be the case that you're loading a clean table or will you be appending. The 2 appraoches are somewhat different.
Tables are always created from scratch and populated reading rows from a non-empty file. The appending of single row during the normal table's life is not a problem (for the moment). |
|
|
sodeep
Master Smack Fu Yak Hacker
7174 Posts |
Posted - 2009-03-10 : 14:24:36
|
quote: Originally posted by mciasco
quote: Originally posted by mfemenel Since you're creating the table from scratch, Option 2. If you create the index first you incur overhead by writing the data to the table and the index. Loading just the table writes it once to disk and it's loaded. You can then create the index after the load has completed.
Mmm, can you tell me why this happens? I'm not saying it's wrong, I really want to understand how it works!If I create indexes after the load, I think that DBMS needs to write data to the index too, so the storing time should be the same.I can't understand the difference of behaviour between pre-creation and post-creation of indexes.quote: Originally posted by mfemenel Will it always be the case that you're loading a clean table or will you be appending. The 2 appraoches are somewhat different.
Tables are always created from scratch and populated reading rows from a non-empty file. The appending of single row during the normal table's life is not a problem (for the moment). Option 1-You are dealing with double work with data pages and index page.This will be slow because data being inserted has to go in data pages as well map in index pages Which will lead to term called Fragmentation.It deals with External and internal fragmentation and when query on it, it has to do more work due to page splits/more page scans and not good statistics. you might again have to Rebuild or Reorganize index.Option 2-You are directly dealing with Data pages and it is faster as you are not taking overhead of index pages.when you apply indexes later on,you are mapping correctly to index pages as well as removing fragmentation,keeping statistics upto date. |
|
|
mciasco
Starting Member
5 Posts |
Posted - 2009-03-11 : 06:07:05
|
Ok I made some tests.I tested 3 DBMS: MySQL, PostgreSQL and SQLServer with the same data bulk file.1) Each DBMS is installed locally (no network delays)2) The bulk file is 50MB and contains 120.000 rows3) each opeation (create table, load bulk and create indexes) is done separately and require a new connection, so for each time value is included the connection time4) the values showed are result of 3 runs of the same test, so they are average values.Sequence: CREATE TABLE - LOAD BULK - CREATE INDEXESMYSQL- create table (s): 7- load bulk (s): 26- create indexes(s): 3667- total (s): 3700POSTGRES- create table (s): 5- load bulk (s): 50- create indexes(s): 83- total (s): 509SQLSERVER- create table (s): 12- load bulk (s): 21- create indexes(s): 24- total (s): 426----------------Sequence: CREATE TABLE - CREATE INDEXES - LOAD BULKMYSQL- create table (s): 1- load bulk (s): 0- create indexes(s): 565- total (s): 893POSTGRES- create table (s): 1- load bulk (s): 1- create indexes(s): 499- total (s): 883SQLSERVER- create table (s): 11- load bulk (s): 2- create indexes(s): 100- total (s): 462So, it seems that creating indexes after the bulk load is better for Postgres and SQLServer (Postgres takes more advantages from this solution, while SQLServer performances are almost the same in both cases).But this approach in MySQL is totally wrong! The time needed for create indexes after the bulk load grows enormously! For MySQL is better to create indexes first, and then load data. And the difference between the two approaches is huge! This seems strange if compared to the behaviour of Postgres and SqlServer.Overall, SqlServer is the fastest DBMS in both cases, while MySQL is the slowest one.What do you think about? |
|
|
darkdusky
Aged Yak Warrior
591 Posts |
Posted - 2009-03-11 : 07:52:42
|
The "correct" solution depends on the situation. For instance while the overall time might be the same for the two approaches using SQL Server (using this file size), there may be cases where you want the load to occur as fast as possible and are happy to build the indexes overnight where the overhead is less noticed. Also datatype of indexes will also affect time to build index. Also I ran a test of creating a large table with no indexes and then tried to create indexes. The create index failed due to memory error (I ran this on a my local PC - so it may succeed on better spec server).So there are limits to how large of a table indexes can be created afterwards.CREATE TABLE [dbo].[testA]( [id] [int] IDENTITY(1,1) NOT NULL, [t] [varchar](2) NULL) ON [PRIMARY]INSERT testA DEFAULT VALUESGO 20000000 |
|
|
|