Database Design for Site Data Statistics

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

Database Design and Application Architecture

Database Design for Site Data Statistics

Author

Topic

irs2k3
Starting Member

22 Posts

Posted - 2010-04-03 : 03:56:50

Hi Guys,

I have a table called News that stores all news fetched from RSS feeds now I want to record the number of clicks for each news item in the GUI.

I have created another Table called News_MIS that contains newsID and #clicks.

When I want to display the number of clicks I join the 2 tables to get the number of clicks for each news.

For the news table with over 5 million records is this the best approach of should I place the Clicks column inside the news Table that way I query a Table and not a view ?

Thanks

Kristen
Test

22859 Posts

Posted - 2010-04-03 : 04:20:08

I would put the clicks in a separate table. Its a very high update table, and I would not want to be updating the news table itself that often (it will have a TEXT or Varchar(MAX) column I presume, and may have Trigger to store old versions, etc., best not to share that with the high-volume updating columns.

So a 1:1 "parallel" table for the Clicks with the same PK as the News (which is hopefully an Integer ID, and thus very "narrow").

Kristen
Test

22859 Posts

Posted - 2010-04-03 : 04:21:18

P.S. I would create the CLICKS table row at the same time as the News Record (with a CLICK COUNT of zero), so that the "Clicked" action just needs to increment it using UPDATE, never have to INSERT a row.

irs2k3
Starting Member

22 Posts

Posted - 2010-04-03 : 04:24:46

Thanks for the heads up.

Because I heard that quering a table is better than querying a view or 2 tables?

Is this true for a table of like 5 milions record?

Kristen
Test

22859 Posts

Posted - 2010-04-03 : 05:20:27

Depends!

I think the high volume of updates is the thing you need to optimise for, rather than the Select query

If you want to find the number of clicks for News Item = 12345 then that is a single, clustered, primary key lookup - "instant", even if you have a huge number of rows.

If you want to find the top-10 most popular news items that is a more involved query on a big table. You could index by the number of clicks - but then EVERY update to the Clicks has to also update the Index.

I think it would be better to have a scheduled task than runs, say, every hour (or maybe only once a day) that gets the top 100 news stories, top 100 stories "new" in the last week / month, top 100 stories that have had the most increase in their Clicks in the last, say, week ... and store the results into a "TopNewsItems" table - and then query than whenever anyone wants the "Top 10" / "Top 100" stores / recent stories / biggest increasers in popularity etc. - kinda like a Cache.

The VIEW thing is moot. If you have a VIEW


CREATE VIEW MyView
AS
SELECT N.ID AS ID, N.NCol1, N.NCol2, ..., C.CCol1, C.CCol2, ...
FROM NewsTable AS N
    JOIN NewsClicks AS C
        ON C.ID = N.ID

then


SELECT ID, NCol1, CCol1 FROM MyView WHERE ID = 12345

is exactly the same as doing


SELECT N.ID, NCol1, CCol1
FROM NewsTable AS N
    JOIN NewsClicks AS C
        ON C.ID = N.ID
WHERE N.ID = 12345

and assuming that ID is the Clustered PK on both NewsTable and NewsClicks it will be "instant" (for a single, specific, ID)

Also


UPDATE U
SET NumberOfClicks = NumberOfClicks + 1
FROM NewsClicks
WHERE ID = 12345

to increment the number-of-clicks on a given News ID record will be very fast. [number-of-clicks] is a fixed length, integer column, so will not increase in width, so SQL can just replace it with the new value in-situ - no page splits etc. That operation is Atomic, so will not block other queries etc.

Finding top 100 popular news items is


SELECT TOP 100 ID
FROM NewsClicks
ORDER BY NumberOfClicks DESC

this will be slow (on a big table, if there is no index on NumberOfClicks - and I don't think an index on NumberOfClicks is a good idea because of the huge number of updates on that column), hence I suggest caching it.

However, because [NewsClicks] table is very "slim" (just News-ID and Number-of-Clicks columns) there will be lots of records oin each data page, so it will be very fast to do a table-scan - as compared to having the Clicks in the same table as the News Record / Text itself.

But to reduce the frequency of such queries I would save them in a "caching table" - as I described above, using a scheduled task every hour / day.

irs2k3
Starting Member

22 Posts

Posted - 2010-04-03 : 05:32:49

Man, I appreciate your help. Thank you.

I have a Table in the DB called TH_News_MIS with the following schema :

,[NewsID]
,[Clicks]
,[LastClickDate]
,[Trims]
,[LastTrimDate]
,[Comments]
,[LastCommentDate]

When A User clicks a news this table will be updated if the news exist else a new row will be created in the table.

Now to get the Top 100 News for example, I can Query this MIS table and get the ids of news and then query the news table.

What do you think?

Kristen
Test

22859 Posts

Posted - 2010-04-03 : 08:12:46

"When A User clicks a news this table will be updated if the news exist else a new row will be created in the table."

I would create a ROW in TH_News_MIS when you create a row in News table - so that there is always a row in this table.

Otherwise EVERY time you increment the count you have to do some sort of EXISTS test to decide whether to Insert or Update. Much faster to just do an UPDATE knowing that the row will exist.

"Now to get the Top 100 News for example, I can Query this MIS table and get the ids of news and then query the news table"

Yes, but that will be slow on a large table, hence my suggestion to have a Caching Table for the "TOP 100" type queries, and refresh the cache table periodically.

Then when a user checks the "Popular News" you just query the Cache Table and JOIN to the News Table by NewsID

irs2k3
Starting Member

22 Posts

Posted - 2010-04-03 : 09:40:18

Thanks a lot mate.

I will try to implement this technique.

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources