 huge amount of data


alejandrolepetittomas
Starting Member

9 Posts

Posted - 2011-02-08 : 09:50:03
Hello everyone! I'm dealing with a SQL Server table of about 5 million records (and it's getting bigger) with this structure:

date                | kwh  | period
2010-06-05 8:00:00  | 1222 | 1
2010-06-05 8:00:05  | 1222 | 1
2010-06-05 8:00:10  | 1223 | 1
2010-06-05 8:00:15  | 1224 | 1
2010-06-05 8:00:20  | 1224 | 5
2010-06-05 8:00:25  | 1227 | 5
2010-06-05 8:00:30  | 1227 | 3
2010-06-05 8:00:35  | 1233 | 3
2010-06-05 8:00:40  | 1233 | 1
2010-06-05 8:00:45  | 1237 | 1
2010-06-05 8:00:50  | 1239 | 2
...                 | ...  | ...

Well, the point is that I have to calculate the increment of kwh for a given period (1, 2, 3, 4, 5 or 6). In the table above, for period 1, it would be 1224 - 1222 + 1237 - 1233 = 2 + 4 = 6.

The problem is that the T-SQL queries I have designed take too much time to execute: I have used cursors (it takes about 3 minutes to return the value) and I have also tried splitting the table according to the changes of period (to reduce the number of iterations in the query); but, surprisingly, that's even slower.

I don't know... I trust the power of SQL Server, and I think there must be a better and faster way to do it. I feel like I'm missing something or forgetting a T-SQL tool that could be used. Any ideas are welcome. Thank you for your time!
Alejandro

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2011-02-08 : 10:56:27
Huh?

Here's the data in a sample table

DECLARE @dataTab TABLE (
[dateStamp] DATETIME
, [kwh] INT
, [period] INT

PRIMARY KEY ([period], [dateStamp])
)

INSERT @dataTab ([dateStamp], [kwh], [period])

SELECT '2010-06-05 8:00:00', 1222, 1
UNION SELECT '2010-06-05 8:00:05', 1222, 1
UNION SELECT '2010-06-05 8:00:10', 1223, 1
UNION SELECT '2010-06-05 8:00:15', 1224, 1
UNION SELECT '2010-06-05 8:00:20', 1224, 5
UNION SELECT '2010-06-05 8:00:25', 1227, 5
UNION SELECT '2010-06-05 8:00:30', 1227, 3
UNION SELECT '2010-06-05 8:00:35', 1233, 3
UNION SELECT '2010-06-05 8:00:40', 1233, 1
UNION SELECT '2010-06-05 8:00:45', 1237, 1
UNION SELECT '2010-06-05 8:00:50', 1239, 2


SELECT * FROM @dataTab ORDER BY [period], [dateStamp]

/*
well, the point is that i have to calculate the increment of kwh
for a given period (1, 2, 3, 4, 5 or 6). in the table above, for period 1,
it would be 1224 - 1222 + 1237 - 1233 = 2+4 = 6.
*/


I don't understand the rules. You mention only the values [1224 - 1222 + 1237 - 1233] for period 1 -- what happens to the other values?

Does period 1 model two distinct time segments? From [2010-06-05 08:00:00 to 2010-06-05 08:00:15] and from [2010-06-05 08:00:40 to 2010-06-05 08:00:45]?

Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION

alejandrolepetittomas
Starting Member

9 Posts

Posted - 2011-02-08 : 11:34:38
Thank you for your reply, Transact Charlie.
I have to calculate the kwh spent when period = 1, but the kwh values in the table are absolute (they refer to the kwh consumed since 2010-01-01), so I have to calculate the differences in groups:

In the example:
2 kwh were consumed in period 1,
then 3 kwh were consumed in period 5,
then 6 kwh were consumed in period 3,
then 4 kwh were consumed in period 1.

That makes 6 kwh consumed in period 1 (and 3 kwh in period 5, and 6 kwh in period 3).

The periods refer to the different prices that General Electric establishes per kwh depending on the season of the year and the time of day. So the goal is to know how many kwh have been consumed in period 1, in order to calculate the costs later.
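Just to illustrate that last step (the tariff table, its prices and all the names below are invented purely as an example; the real prices come from the supplier), the per-period totals could later be turned into costs with something like:

DECLARE @tariff TABLE ([period] INT PRIMARY KEY, [pricePerKwh] DECIMAL(9, 4))
DECLARE @kwhByPeriod TABLE ([period] INT PRIMARY KEY, [kwused] INT)

-- made-up prices
INSERT @tariff VALUES (1, 0.10)
INSERT @tariff VALUES (3, 0.15)
INSERT @tariff VALUES (5, 0.20)

-- the per-period totals from the sample data in this thread
INSERT @kwhByPeriod VALUES (1, 6)
INSERT @kwhByPeriod VALUES (3, 6)
INSERT @kwhByPeriod VALUES (5, 3)

SELECT k.[period]
     , k.[kwused]
     , k.[kwused] * t.[pricePerKwh] AS [cost]
FROM @kwhByPeriod AS k
JOIN @tariff AS t ON t.[period] = k.[period]
ORDER BY k.[period]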

I hope it's clearer now...



Ifor
Aged Yak Warrior

700 Posts

Posted - 2011-02-08 : 11:43:45
There are two techniques in this thread:

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=155770

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2011-02-08 : 12:10:08
OK -- I think this might work. It uses a quirky update, so it is the equivalent of black magic, but...

All the usual caveats apply -- you need to copy the information to a table you can completely lock and put the clustered index on.

There is almost certainly a better way to do this but I'm too tired to think of it.


DECLARE @dataTab TABLE (
[dateStamp] DATETIME
, [kwh] INT
, [period] INT

PRIMARY KEY ([period], [dateStamp])
)

INSERT @dataTab ([dateStamp], [kwh], [period])

SELECT '2010-06-05 8:00:00', 1222, 1
UNION SELECT '2010-06-05 8:00:05', 1222, 1
UNION SELECT '2010-06-05 8:00:10', 1223, 1
UNION SELECT '2010-06-05 8:00:15', 1224, 1
UNION SELECT '2010-06-05 8:00:20', 1224, 5
UNION SELECT '2010-06-05 8:00:25', 1227, 5
UNION SELECT '2010-06-05 8:00:30', 1227, 3
UNION SELECT '2010-06-05 8:00:35', 1233, 3
UNION SELECT '2010-06-05 8:00:40', 1233, 1
UNION SELECT '2010-06-05 8:00:45', 1237, 1
UNION SELECT '2010-06-05 8:00:50', 1239, 2


SELECT * FROM @dataTab ORDER BY [period], [dateStamp]

/*
well, the point is that i have to calculate the increment of kwh
for a given period (1, 2, 3, 4, 5 or 6). in the table above, for period 1,
it would be 1224 - 1222 + 1237 - 1233 = 2+4 = 6.
*/

IF OBJECT_ID('tempdb..#quirky') IS NOT NULL DROP TABLE #quirky
CREATE TABLE #quirky (
[dateStamp] DATETIME
, [kwh] INT
, [period] INT
, [diff] INT

PRIMARY KEY CLUSTERED (
[dateStamp]
, [kwh]
, [period]
)
)

INSERT #quirky ([dateStamp], [kwh], [period]) SELECT * FROM @dataTab

SELECT * FROM #quirky

DECLARE @periodMark INT SET @periodMark = -1
DECLARE @rnk BIGINT SET @rnk = 0

-- Rank the table (quirky update): bump the rank when the period changes,
-- then write it to [diff], so each run of consecutive rows with the same
-- period shares one [diff] value.
UPDATE q SET
  @rnk = CASE WHEN [period] = @periodMark THEN @rnk ELSE @rnk + 1 END
, [diff] = @rnk
, @periodMark = [period]
FROM
    #quirky AS q

-- Do the math
SELECT
[period]
, SUM([kwused]) AS [kwused]
FROM
(
SELECT
[period]
, MAX([kwh]) - MIN([kwh]) AS [kwused]
FROM
#quirky
GROUP BY
[period], [diff]
)
AS subParts
GROUP BY
[period]


Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2011-02-08 : 12:11:38
Final Results
===================
period  kwused
1       6
2       0
3       6
5       3

Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION

alejandrolepetittomas
Starting Member

9 Posts

Posted - 2011-02-09 : 05:52:46
Ifor, using a CTE is a good idea, but in this case I don't have a common id for each group. I only know that the group has changed because the next record has a different period (thus, at first sight, I can't partition the set, or at least I can't see a way to do it). The same goes for the OUTER APPLY shown in
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=155770.

In my case:

DECLARE @dataTab TABLE (
[dateStamp] DATETIME
, [kwh] INT
, [period] INT

PRIMARY KEY ([period], [dateStamp])
)

INSERT @dataTab ([dateStamp], [kwh], [period])

SELECT '2010-06-05 8:00:00', 1222, 1
UNION SELECT '2010-06-05 8:00:05', 1222, 1
UNION SELECT '2010-06-05 8:00:10', 1223, 1
UNION SELECT '2010-06-05 8:00:15', 1224, 1
UNION SELECT '2010-06-05 8:00:20', 1224, 5
UNION SELECT '2010-06-05 8:00:25', 1227, 5
UNION SELECT '2010-06-05 8:00:30', 1227, 3
UNION SELECT '2010-06-05 8:00:35', 1233, 3
UNION SELECT '2010-06-05 8:00:40', 1233, 1
UNION SELECT '2010-06-05 8:00:45', 1237, 1
UNION SELECT '2010-06-05 8:00:50', 1239, 2

It would be great if I could partition the table into 5 groups: one for period 1, one for period 5, one for period 3, another for period 1 and one for period 2. If that were achieved, I would only have to take the difference between the largest and the smallest kwh in each group. Using a CTE and partitioning seems to be the right way to do it; but, again, I feel like I'm missing something.

Ifor
Aged Yak Warrior

700 Posts

Posted - 2011-02-09 : 06:22:37
You just have to find two different orderings to make the groups:

;WITH dataGrp
AS
(
SELECT *
,ROW_NUMBER() OVER (ORDER BY dateStamp)
- ROW_NUMBER() OVER (PARTITION BY period ORDER BY dateStamp) AS Grp
FROM @dataTab
)
, PeriodGrp
AS
(
SELECT period, Grp
,MAX(kwh) - MIN(kwh) AS KWUsed
FROM dataGrp
GROUP BY period, Grp
)
SELECT period, SUM(KWUsed) AS KWUsed
FROM PeriodGrp
GROUP BY period
ORDER BY period

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2011-02-09 : 06:47:35
If you do implement that, I'd be interested to see the performance of the two ROW_NUMBER() ranking functions vs the quirky update (which is effectively doing a ranking over the data).
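If you do end up comparing them, a plain SET STATISTICS harness is probably enough (nothing specific to this problem; the comments just mark where each version would be pasted):

SET STATISTICS TIME ON
SET STATISTICS IO ON

-- paste the double ROW_NUMBER() query here, then run
-- paste the quirky update + aggregate here, then run

SET STATISTICS TIME OFF
SET STATISTICS IO OFF

The elapsed and CPU times show up in the Messages tab for each statement.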

Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION

Lumbago
Norsk Yak Master

3271 Posts

Posted - 2011-02-09 : 07:07:08
Wow, this grouping technique is really cool, Ifor! I've been looking at it for the past 30 minutes to fully understand what's going on (I'm not as sharp as I like to imagine) and I don't think I've seen anything like this before.

EDIT: I think it will be vastly outperformed by the quirky update but still a nice trick for the bag'o'tricks!

- Lumbago
My blog-> www.thefirstsql.com

Ifor
Aged Yak Warrior

700 Posts

Posted - 2011-02-09 : 09:43:26
quote:
Originally posted by Lumbago

Wow, this grouping technique was really cool Ifor!


That's what I thought when I first saw the technique.

To understand it, you only have to eyeball the results of the following query:

SELECT *
,ROW_NUMBER() OVER (ORDER BY dateStamp) AS TemporalOrder
,ROW_NUMBER() OVER (PARTITION BY period ORDER BY dateStamp) AS TemporalByPeriodOrder
,ROW_NUMBER() OVER (ORDER BY dateStamp)
- ROW_NUMBER() OVER (PARTITION BY period ORDER BY dateStamp) AS Grp
FROM @dataTab
ORDER BY dateStamp
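
For the eleven sample rows in this thread, working that query out by hand should give something like this (derived by hand from the sample data, so treat it as illustrative):

dateStamp            kwh   period  TemporalOrder  TemporalByPeriodOrder  Grp
2010-06-05 08:00:00  1222  1       1              1                      0
2010-06-05 08:00:05  1222  1       2              2                      0
2010-06-05 08:00:10  1223  1       3              3                      0
2010-06-05 08:00:15  1224  1       4              4                      0
2010-06-05 08:00:20  1224  5       5              1                      4
2010-06-05 08:00:25  1227  5       6              2                      4
2010-06-05 08:00:30  1227  3       7              1                      6
2010-06-05 08:00:35  1233  3       8              2                      6
2010-06-05 08:00:40  1233  1       9              5                      4
2010-06-05 08:00:45  1237  1       10             6                      4
2010-06-05 08:00:50  1239  2       11             1                      10

Each island of consecutive rows with the same period gets a constant (period, Grp) pair, so MAX(kwh) - MIN(kwh) per pair gives 2, 3, 6, 4 and 0, and summing per period gives 6, 0, 6 and 3 -- the same figures as the quirky update.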


I would also be interested in knowing how it compares with the quirky update.
I suspect the quirky update will be slower as it has to write a temp table and then update it.

alejandrolepetittomas
Starting Member

9 Posts

Posted - 2011-02-09 : 11:58:19
First of all, thank you both, Ifor and Transact Charlie, for the time that you have dedicated to this issue; I really appreciate it, sirs.

The method proposed by Transact Charlie has one problem: the update of the [diff] column in the quirky table (which establishes the groups) doesn't happen in order. It doesn't update the records in date order, so the number of groups created is bigger than it should be. We can easily check that by executing the queries: each execution returns different results for the same data. The time the system takes to execute the queries is approximately 15 seconds.

About the technique proposed by Ifor, I can only feel amazed. It's awesome how the two ROW_NUMBER() functions work together to determine the different groups, which at first sight had nothing in common: Grp in conjunction with period will always determine distinct groups, since the two row numbers are always increasing both within each period and over all the data.
Definitely, the CTE is the most efficient solution: everything is done in just one statement! And its execution only takes 5 seconds!! I have lots of places where I have to split data in a similar way, so it's really helpful for me. Thanks for sharing your knowledge, Ifor.

Alejandro


dataguru1971
Master Smack Fu Yak Hacker

1464 Posts

Posted - 2011-02-09 : 12:04:31
Quirky updates are wicked fast, but they can cause trouble if not done properly. The table HAS to have a clustered index, and you HAVE to be careful about MAXDOP: parallelism can cause the quirky update to misfire.
If it isn't updating in the correct order, you can add OPTION (MAXDOP 1) to the end of the statement, and you can also force the use of the index with WITH (INDEX(IX_NAME)). The data needs to be processed in the correct order. However, a lot of the time the CTE methods can be exploited to do similar things, combined with ranking/ROW_NUMBER functions.
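
Applied to the #quirky update from Transact Charlie's post, those safeguards would look roughly like this (a sketch only; PK_quirky is a made-up name, so the primary key would have to be declared with that name, e.g. CONSTRAINT PK_quirky PRIMARY KEY CLUSTERED, for the index hint to resolve):

-- Same update as above, with the table locked, the clustered index forced,
-- and parallelism switched off. Variables re-declared so the sketch stands alone.
DECLARE @periodMark INT SET @periodMark = -1
DECLARE @rnk BIGINT SET @rnk = 0

UPDATE q SET
  @rnk = CASE WHEN [period] = @periodMark THEN @rnk ELSE @rnk + 1 END
, [diff] = @rnk
, @periodMark = [period]
FROM
    #quirky AS q WITH (TABLOCKX, INDEX(PK_quirky))
OPTION (MAXDOP 1)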

Nice thread.



Poor planning on your part does not constitute an emergency on my part.

alejandrolepetittomas
Starting Member

9 Posts

Posted - 2011-02-09 : 12:15:04
In addition, the CTE methods are much faster and more efficient. In my case, a client is logged on to a web page and is waiting for the results while they are being calculated (thanks to those excellent queries). It wouldn't be right to make him wait longer than necessary.

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2011-02-09 : 16:51:50
Well - thanks for testing!

I'm surprised the dual ROW_NUMBER approach is sufficiently fast. I'm definitely (but very pleasantly) surprised that it outperforms a quirky update for ranking as well.

Just out of interest, how many rows are you running an average query over? I take it from the statement:

"In my case, a client is logged on to a web page and is waiting for the results while they are being calculated"

that the data set is some subset of your 5 million row table?



Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION

dataguru1971
Master Smack Fu Yak Hacker

1464 Posts

Posted - 2011-02-09 : 17:07:11
The CTE is really quick for the ROW_NUMBER function, even on millions of rows (I use it frequently). However, I suspect that the quirky update will fare far better on the actual full data set. I haven't found much that will outperform the quirky update for actual updates to large data sets. I have used it for lightning-fast updates with multiple anchors and conditional evaluations against tables of 90 million rows and above, and it is uber-fast (a rough sketch of what I mean is below).

The same operation using CTE ranking was massively slower (it was expected to be slower)... this was a while back, though. All I remember is that the quirky update did all 90 million records faster than the CTE did one year's worth of records (< 9 million); the 90 million was 15 years' worth of monthly data, for 600K accounts each month. Wish I still had those results. Moden's article here is quite good: http://www.sqlservercentral.com/articles/T-SQL/68467/
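
Roughly what I mean by multiple anchors and conditional evaluations, as a made-up sketch (the table, columns and data are invented just to show the shape of it; the same caveats about the clustered index, TABLOCKX and MAXDOP 1 apply):

IF OBJECT_ID('tempdb..#ledger') IS NOT NULL DROP TABLE #ledger
CREATE TABLE #ledger (
  [acct]         INT
, [postDate]     DATETIME
, [amount]       INT
, [runningTotal] INT

  PRIMARY KEY CLUSTERED ([acct], [postDate])
)

INSERT #ledger ([acct], [postDate], [amount]) SELECT 1, '2011-01-01', 10
INSERT #ledger ([acct], [postDate], [amount]) SELECT 1, '2011-01-02', 20
INSERT #ledger ([acct], [postDate], [amount]) SELECT 2, '2011-01-01', 5

-- @runTot and @prevAcct are the two "anchor" variables; the CASE is the
-- conditional evaluation (reset the running total when the account changes).
DECLARE @prevAcct INT SET @prevAcct = -1
DECLARE @runTot INT SET @runTot = 0

UPDATE l SET
  @runTot = CASE WHEN [acct] = @prevAcct THEN @runTot + [amount] ELSE [amount] END
, [runningTotal] = @runTot
, @prevAcct = [acct]
FROM
    #ledger AS l WITH (TABLOCKX)
OPTION (MAXDOP 1)

SELECT * FROM #ledger ORDER BY [acct], [postDate]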

The multiple ROW_NUMBER rankings/orderings I have seen used here on a number of problems, and I like that a lot... it comes in very, very handy when you can utilize it.



Poor planning on your part does not constitute an emergency on my part.

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2011-02-09 : 17:20:15
Sounds good. I love ROW_NUMBER() -- I think it's probably my favourite 2005+ feature.

This was a nice thread!



Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION

alejandrolepetittomas
Starting Member

9 Posts

Posted - 2011-02-10 : 02:40:49
Well, in my case the CTE combined with ROW_NUMBER() is about 3 times faster than the quirky update. In this case, the result shown to the client is a single value (I return just the amount of kwh consumed in the selected period), and the query has to go through the whole 5 million row table (5 million rows is approximately the number of rows we store in each table per year). The website I have designed allows the client to get subsets of the table, correlate them with other tables, have a subset sent periodically by email, export it to a .csv file...

Alejandro

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2011-02-10 : 06:23:06
Here is another approach (using the same sample data as provided above):

SELECT t1.Period
     , COALESCE(SUM(t1.kwh - t2.kwh), 0) AS Consumption
FROM @dataTab AS t1
LEFT JOIN @dataTab AS t2 ON t2.Period = t1.Period
    AND t2.dateStamp = DATEADD(SECOND, -5, t1.dateStamp)
GROUP BY ALL t1.Period
ORDER BY t1.Period



N 56°04'39.26"
E 12°55'05.63"

alejandrolepetittomas
Starting Member

9 Posts

Posted - 2011-02-10 : 10:46:48
Well, in my real table the interval between consecutive dateStamps isn't always 5 seconds; sometimes it's 4 or 6 seconds... maybe the sample wasn't as clear as it should have been.
But I guess a similar LEFT JOIN could be done by adding an integer auto-increment id field to the table, something like the sketch below. Thanks for the idea!
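
Just a sketch of that idea (#numbered and its columns are only illustrative, and @dataTab is the sample table from the posts above):

-- Copy the data into a table with an IDENTITY column assigned in dateStamp
-- order, then join each row to the immediately previous one by id.
IF OBJECT_ID('tempdb..#numbered') IS NOT NULL DROP TABLE #numbered
CREATE TABLE #numbered (
  [id]        INT IDENTITY(1, 1) PRIMARY KEY
, [dateStamp] DATETIME
, [kwh]       INT
, [period]    INT
)

INSERT #numbered ([dateStamp], [kwh], [period])
SELECT [dateStamp], [kwh], [period]
FROM @dataTab
ORDER BY [dateStamp]        -- identity values follow this order

SELECT t1.[period]
     , COALESCE(SUM(t1.[kwh] - t2.[kwh]), 0) AS Consumption
FROM #numbered AS t1
LEFT JOIN #numbered AS t2
    ON t2.[period] = t1.[period]
   AND t2.[id] = t1.[id] - 1
GROUP BY t1.[period]
ORDER BY t1.[period]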