Marcel de Groot
Starting Member
3 Posts
Posted - 2010-06-04 : 02:13:00
We are looking for the best way to enter a lot of data into SQL tables for a "long" period of time. I will try to describe the issue we are facing.

We have a machine that checks products at a speed of 100 products per second (10 lanes doing 10 products per second each). Each product has 750 "measured values", which together determine a kind of quality and size. In the past we only stored the major outcomes (about 10 values per product), but for analysis we currently want to store all data of each product. This means we want to build a table containing all values of all products (the table contains 750 columns) and we want to enter 100 records per second into it. The records should be entered from an external application. The solution has to be able to run for a long period of time without data purging (30 days).

Our questions are:
Is it possible to use SQL Server for this?
What is best practice for this?
In what way can the data best be entered?

We have a solution available with a third-party "linked server", but in that manner we enter 750 records per product instead of 1 record containing 750 columns. So we can say that the capacity of our computer environment is capable of handling the data fast enough.

Feel free to give any answer that helps us forward in this.

With kind regards,
Marcel de Groot
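[A rough sketch of the kind of wide table being described, for concreteness. Every name here is illustrative, not from the thread, and the real table would carry all 750 value columns, abbreviated below.]

    CREATE TABLE dbo.ProductMeasurements (
        ProductID  bigint   NOT NULL,   -- hypothetical per-product identifier
        MeasuredAt datetime NOT NULL,   -- hypothetical measurement timestamp
        LaneNo     tinyint  NOT NULL,   -- 1..10, one per lane described above
        Value001   float    NULL,
        Value002   float    NULL,
        -- ... Value003 through Value749 omitted for brevity ...
        Value750   float    NULL
    );

At 750-odd columns this stays under SQL Server's 1024-column limit for a regular table, so a single row per product is feasible.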
Kristen
Test
22859 Posts
Posted - 2010-06-04 : 02:38:14
Is it 24/7? If not, you can use the downtime to purge stale data and do housekeeping tasks like rebuilding indexes etc. to keep the performance on track.

You can use horizontal partitioning - e.g. storing each month in a separately named table, and then combining the tables in a VIEW for querying (there are specific requirements about CHECK constraints for this to work properly - see the sketch after this post). That makes it easy to then drop the oldest month for housekeeping.

What are you wanting to query? Do the queries need to be real-time? What if the database is not available - can you tolerate some data loss?

I think there is a serious risk that the queries will interfere with the writing if it's all in the same database. If you were to "copy" the database to a different server for querying, then a batch-processing operation would be equivalent, and much lower-tech.

My inclination would be to log the data to some form of delimited file (CSV, for example) and then bulk load it into the database - either once a day or every few minutes, depending on how rapidly you need to be able to query the data. This approach has the benefit that the delimited file is "queueing" the data, so if the database doesn't import it, the only outcome is that the delimited files get larger (I'll assume you have enough disk space for them to last you a "reasonable" period) and the next database import takes longer than normal.

I favour delimited files because the complexity, and therefore the likelihood that something irreversible goes wrong, is very low. You could use Service Broker or some sort of queue manager, which would just feed the data to the database and queue it up if the database is offline/busy, but to my mind that just adds complexity.

Using bulk import you could rearrange the data as you save it - e.g. using one row per attribute if you prefer (which may make the data easier to report on), or "warehouse" the data in multiple different formats for the benefit of reporting.
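[A minimal sketch of the partitioned-view idea, assuming a cut-down schema. Table, view, and column names are illustrative, not from the thread; the real tables would carry the 750 value columns.]

    CREATE TABLE dbo.Measurements_201005 (
        MeasuredAt datetime NOT NULL
            CONSTRAINT CK_Measurements_201005
            CHECK (MeasuredAt >= '20100501' AND MeasuredAt < '20100601'),
        ProductID  bigint   NOT NULL,
        Value001   float    NULL   -- real table: Value001 .. Value750
    );

    CREATE TABLE dbo.Measurements_201006 (
        MeasuredAt datetime NOT NULL
            CONSTRAINT CK_Measurements_201006
            CHECK (MeasuredAt >= '20100601' AND MeasuredAt < '20100701'),
        ProductID  bigint   NOT NULL,
        Value001   float    NULL
    );
    GO

    -- The non-overlapping CHECK constraints let the optimiser skip any
    -- table that cannot contain rows for a queried date range, and the
    -- oldest month can be removed with a plain DROP TABLE.
    CREATE VIEW dbo.Measurements
    AS
    SELECT MeasuredAt, ProductID, Value001 FROM dbo.Measurements_201005
    UNION ALL
    SELECT MeasuredAt, ProductID, Value001 FROM dbo.Measurements_201006;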
Marcel de Groot
Starting Member
3 Posts
Posted - 2010-06-04 : 04:13:22
Thanks, Kristen, for your fast and valuable response. The system runs about 16 hours out of 24 and 5 days out of 7, if you know what I mean by that.

We do not need to be able to query during the logging sessions, so we can do database manipulations prior to any queries being run on the data.

Currently the data is available in one or more CSV files. The only question left is: is it possible to dump the CSV data into SQL tables fast enough? This conversion needs to be faster than the data collection itself, because otherwise we face an ever-increasing number of CSV files that still need to be "converted to SQL". We can split the data using partitioning.

I'm very happy with your answer. Thanks again!

With kind regards,
Marcel de Groot
Kristen
Test
22859 Posts
Posted - 2010-06-04 : 05:09:11
Yes, CSV can be imported fast enough (i.e. much faster than single-record inserts). Of course you need hardware that can cope with the expected data insertion rate, but the 16/5 hours/days gives respite if it falls behind for any reason (unless you need real-time query ability).

It would help if the CSV is in clustered-index order - e.g. there is an ascending date/time column. But that is not critical (it just improves performance, and the daylight-saving switch might screw that up anyway ...). An ever-ascending ID number would be ideal in that regard.

My approach would be:
1. Rename the CSV file (so that the production line then starts writing a new one).
2. Move the renamed CSV file to an IMPORT folder.
3. Have a batch job that moves files from IMPORT to STAGING, imports all files in STAGING, and then moves completed files to DONE. (A sketch of the import step follows this post.)

This file-moving takes care of making sure that files are only processed once. But if you have a unique ID or somesuch in the CSV files, that can be used to double-insure that there are no DUPs (and none missing either, I guess?!).

"We do not need to be able to query during the logging sessions, so we can do database manipulations prior to any queries being run on the data."

Can you clarify a bit what queries you need to run, and will these be on the logging database or a copy of it? (I see no reason to have a copy database if you are using batch import - big queries may delay the import a bit, but a copy database will delay the real-time-ness of the data A LOT!!)
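[A hedged sketch of the import step, reusing the illustrative dbo.Measurements_201006 table from the earlier sketch. The file path, terminators, and option values are assumptions, not details from the thread.]

    -- Load one staged file; run once per file found in the STAGING folder.
    BULK INSERT dbo.Measurements_201006
    FROM 'D:\Import\STAGING\measure_20100604.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        TABLOCK,                 -- permits a minimally-logged load
                                 -- (depends on the recovery model)
        BATCHSIZE = 100000,      -- commit in chunks so a failure loses
                                 -- less work
        ORDER (MeasuredAt ASC)   -- declares the file is already sorted on
                                 -- the (assumed) clustered key, letting
                                 -- the load skip a sort
    );

The IMPORT/STAGING/DONE file shuffling itself can live in whatever scheduled script launches this; only the load needs to touch SQL Server.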
Marcel de Groot
Starting Member
3 Posts
Posted - 2010-06-04 : 13:38:50
Hi Kristen,

Thanks for your clarifications. What I meant by "the copy" is the raw data (the CSV data) that is "copied" into the database we run queries on. I agree 100% with you that we do not need a copy database - we have the CSV files anyway.

I can tell you we will test your proposal within a few weeks. We need to complete the CSV files first, and then we will start testing the "inserts" from the CSV files.

Thanks again and have a very fine weekend!

With kind regards,
Marcel de Groot