michael.appleton
Posting Yak Master
160 Posts |
Posted - 2008-05-14 : 06:02:36
|
I have a question about storing the history of particular objects in a database. For example, if I had a table of "People" with fields "PersonId", "Name", "PhoneNumber", "Height", "Weight" and "Profession", the data in every field stored for each person can change over time, except for the "PersonId", of course, which is why it is included. I would like to be able to view a person's attributes at any point in time and therefore need to maintain a history.

The current approach in place is to archive images of the whole table at certain points in time, which is unacceptable as it misses some changes, is not very accessible and also stores data which does not change.

My solution would be to create separate tables for each changing attribute, with a corresponding date for each change. For example, for phone numbers, have a table "PeoplePhoneNumbers" with fields "PersonId", "PhoneNumber" and "ChangeDate".

A few shortcomings I can see in this approach are that, firstly, there will be many tables, one for each changing attribute, and there can be far more attributes than those mentioned. Secondly, joins will have to be created between every attribute table to get the original single-table form, although I don't see this as a very important issue.

I am wondering: is there a more elegant way to structure objects of this changing nature, or is having separate tables for each changing attribute the best solution? I'm sure this is a very common issue.

Thanks very much for the help,
Mike |
|
pootle_flump
1064 Posts |
Posted - 2008-05-15 : 05:43:29
|
FYI: http://www.dbforums.com/showthread.php?p=6338419#post6338419 |
|
|
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2008-05-15 : 06:15:09
|
This is pretty common in data warehouses and is essential for auditing purposes in banks and such. The most common way to solve this is to just add 4 columns to your existing table: Active (bit), Valid_from (datetime), Valid_to (datetime) and Inserted (datetime). The Inserted column can be omitted; it's just "nice to know". It would look something like this:

DECLARE @Person table (
    ID int IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    PersonID int,
    Firstname varchar(200),
    Lastname varchar(200),
    Active bit,
    Valid_from datetime,
    Valid_to datetime,
    Inserted datetime
)

--> First there are two people in the table; both records are active,
--  meaning that this is the current record
INSERT INTO @Person
SELECT 1, 'John', 'McClain', 1, '1900-01-01 00:00:00', '9999-12-31 23:59:59', GETDATE() UNION ALL
SELECT 2, 'Hannah', 'Olsen', 1, '1900-01-01 00:00:00', '9999-12-31 23:59:59', GETDATE()

SELECT * FROM @Person

--> Then John and Hannah marry and Hannah takes John's last name
UPDATE @Person SET Valid_to = GETDATE(), Active = 0 WHERE PersonID = 2 AND Active = 1

INSERT INTO @Person
SELECT 2, 'Hannah', 'McClain', 1, DATEADD(ss, 1, GETDATE()), '9999-12-31 23:59:59', GETDATE()

Now you still have two active records, but you have preserved the old row where Hannah's last name was different. If you do this for all records you will have a full history of changes and you can query the table "back in time". This query will get the status at new year's this year:

SELECT * FROM @Person WHERE '2008-01-01 00:00:00' BETWEEN Valid_from AND Valid_to

--Lumbago |
|
|
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2008-05-15 : 06:18:10
|
And for performance reasons you can have a job or something that will populate only the active records to another table if you like... these history tables have a tendency to grow way out of proportion.

--Lumbago |
|
|
michael.appleton
Posting Yak Master
160 Posts |
Posted - 2008-05-15 : 07:21:18
|
Thanks! Since asking this question I have found some more information on the problem and found two main solutions:

1 - For a change in any attribute, create a new row with a new valid date/time, where the valid date and the PersonId are the primary key. For example:

People Table: | PersonId (PK) | DateValid (PK) | Name | PhoneNumber | Weight | BirthTown | Other attributes...

2 - Create a separate table for every attribute that may change over time and whose changes need to be logged. For example:

People Table: | PersonId (PK) | BirthTown | Other non-changing attributes...
PeopleNames Table: | PersonId (FK) | DateValid (PK) | Name (PK) |
PeoplePhoneNumbers Table: | PersonId (FK) | DateValid (PK) | PhoneNumber (PK) |
Other tables for changing attributes...

Lumbago, I'm guessing you're suggesting option 2, with added Valid_to and Active fields to simplify queries. I guess option 1 is better if changes are infrequent and can be on any field, and option 2 when a low number of fields change often. Does this sound correct? Also, does anyone know of another option that would cater for many fields changing often? |
|
|
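A minimal sketch of option 1 as DDL; the table and column names below are illustrative, not from an actual schema:

```sql
-- Option 1: one row per change; PersonId plus the change date form the key
CREATE TABLE People (
    PersonId    int          NOT NULL,
    DateValid   datetime     NOT NULL,
    Name        varchar(200) NULL,
    PhoneNumber varchar(50)  NULL,
    Weight      decimal(6,2) NULL,
    BirthTown   varchar(100) NULL,
    CONSTRAINT PK_People PRIMARY KEY (PersonId, DateValid)
)
```

Reading the current state of a person then means picking the row with the latest DateValid for that PersonId.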
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2008-05-15 : 07:35:39
|
I would never in my life choose option 2, as it will cause a massive amount of tables in the database and it would be a royal pain in the neck to get the full status of a person's attributes at a given point in time. Option 1 is in my opinion the only way to go, but instead of having one valid date/time, make two like in my suggestion (Valid_from/Valid_to). Even if changes are infrequent and the number of columns changed is small, I'd still go with duplicating the entire row of data and setting the Active flag/Valid_from/Valid_to. This method is being used in basically every dwh project that needs history preserving, and all major ETL tools out there have built-in support for it.

--Lumbago |
|
|
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2008-05-15 : 07:41:27
|
Just run my example and you'll see how it works. If you think a little bit about it, it's quite a powerful and yet really simple way to do it. And you can get the status of a person at any point in time knowing only the PersonID and the status date you want, not "a billion" different tables that may or may not have records in them for the person you seek.

--Lumbago |
|
|
michael.appleton
Posting Yak Master
160 Posts |
Posted - 2008-05-15 : 09:04:45
|
The main reason I'd use number two is I don't like the idea of replicating unchanged data; it seems wasteful, and it's also harder to track how many changes have been made to each field. I understand that if there are many fields changing then it would become difficult having a huge number of tables, but if only one field was changing, say "status" or something like that, why replicate the whole record every time?

Also, I know you've already said that the Active field is optional so I won't discuss it, but I don't see why you'd need a Valid_from and Valid_to when a single DateValidFrom date will do. The query is not that difficult to write. Say I wanted to find a record for 1/1/2000:

SELECT TOP 1 * FROM People WHERE DateValidFrom < '01/01/2000' ORDER BY DateValidFrom DESC

I appreciate your replies and answers even if I disagree; sometimes just discussing an issue is enough to make a decision. |
|
|
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2008-05-16 : 02:35:17
|
quote: SELECT TOP 1 * FROM People WHERE DateValidFrom < '01/01/2000' ORDER BY DateValidFrom DESC
Sure, but even the simplest query like the one in my example (all user data at new year's 2008) gets a lot more complicated, all because you saved yourself 8 bytes of data per row (one datetime column). What will you do when you need to run a complex query back in time? I'm not telling you what to do here, but I have worked professionally with this sort of logic for over 2 years, and all I'm saying is that it will make your life soooo much easier. And I'm not making this up all by myself here; the technique is a so-called "type 2 slowly changing dimension" and it is described in brief here by one of the two "gods" of data warehousing, Ralph Kimball: http://www.intelligententerprise.com/030422/607warehouse1_1.jhtml

--Lumbago |
|
|
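To make the comparison concrete, here is a sketch, assuming the single-date People table discussed above, of what the "all user data as of new year's 2008" query looks like with only a DateValidFrom column:

```sql
-- Latest row per person as of 2008-01-01 in the single-date schema:
-- each person needs a correlated subquery to find their newest qualifying row
SELECT p.*
FROM People p
WHERE p.DateValidFrom = (SELECT MAX(p2.DateValidFrom)
                         FROM People p2
                         WHERE p2.PersonId = p.PersonId
                           AND p2.DateValidFrom <= '2008-01-01 00:00:00')

-- The same question with Valid_from/Valid_to is a single range predicate:
-- SELECT * FROM People WHERE '2008-01-01 00:00:00' BETWEEN Valid_from AND Valid_to
```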
pootle_flump
1064 Posts |
Posted - 2008-05-16 : 03:51:33
|
quote: Originally posted by michael.appleton The main reason I'd use number two is I don't like the idea of replicating unchanged data; it seems wasteful, and it's also harder to track how many changes have been made to each field. I understand that if there are many fields changing then it would become difficult having a huge number of tables, but if only one field was changing, say "status" or something like that, why replicate the whole record every time?
Writing complex SQL to get the latest value of every attribute will waste developer time, I/O, CPU etc. Disk space is cheap and can be easily scaled up.

The status part you mention is a different matter. Maintaining a status history is totally different to (quote) "storing the history of particular objects in a database". So if only the status changes then yes, your status becomes a temporal entity. If you are basically auditing changes, save yourself the hassle and just save the entire row on insert, delete and update. |
|
|
michael.appleton
Posting Yak Master
160 Posts |
Posted - 2008-05-16 : 04:48:24
|
Thanks, point taken. I think I naturally find a minimalist approach to data storage more appealing. I'd like to think that if the data is stored in a logical way, which having only one ValidDate is, then querying and extracting simple information should be straightforward. As you've pointed out, this is not always the case when working with a relational database and SQL; limitations such as processing time and development time have to be taken into consideration and can have a large impact on structural design, where sometimes a compromise of storing redundant data is necessary to cater for these limitations.

Basically, designing a database seems like a balancing act between:
- Data integrity
- Development time
- Speed
- Space

Possibly in that order of importance. Obviously, data integrity should hardly ever be compromised, but storing redundant data makes constraints and checks on data a little harder to maintain, so in this minor way integrity is slightly compromised. Thanks for the help, and please comment on or correct anything you disagree with. |
|
|
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2008-05-16 : 05:05:56
|
My only advice if you're still not convinced would be to start developing some small-scale examples and create some simple reports. My bet is that you'll change opinions quite fast when you see what you're getting yourself into.

--Lumbago |
|
|
michael.appleton
Posting Yak Master
160 Posts |
Posted - 2008-05-16 : 05:10:03
|
I think you misinterpreted my last post. When I wrote I naturally find a minimalist approach more appealing, I meant before recognising the limitations this approach can have when stored in a relational database and the other problems it will bring. So actually, I do agree with your method after looking into it. Thank you very much for discussing this with me, it's been a great help. |
|
|
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2008-05-16 : 05:19:18
|
Ah, sorry for misunderstanding...it's good to know that you're on the right track --Lumbago |
|
|
pootle_flump
1064 Posts |
Posted - 2008-05-16 : 05:57:35
|
quote: Originally posted by michael.appleton Thanks, point taken. I think I naturally find a minimalist approach to data storage more appealing. I'd like to think that if the data is stored in a logical way, which having only one ValidDate is, then querying and extracting simple information should be straightforward. As you've pointed out, this is not always the case when working with a relational database and SQL; limitations such as processing time and development time have to be taken into consideration and can have a large impact on structural design, where sometimes a compromise of storing redundant data is necessary to cater for these limitations. Basically, designing a database seems like a balancing act between: data integrity, development time, speed and space. Possibly in that order of importance. Obviously, data integrity should hardly ever be compromised, but storing redundant data makes constraints and checks on data a little harder to maintain, so in this minor way integrity is slightly compromised. Thanks for the help, and please comment on or correct anything you disagree with.
Hi Michael

You might think this is splitting hairs, but I don't actually believe that this is storing redundant data. You are archiving the entity as a whole instead of treating each attribute of the entity as an entity in itself. Also, it is a myth that a goal of good relational design & normalisation is to remove redundant data. The goal of normalisation is to prevent update anomalies; it just so happens that achieving this goal almost always results in the removal of redundant data.

Actually, rereading your post, I think you are addressing the dual-date part rather than splitting attributes off all over the place. Personally I don't bother with the "effective_from" date, but it is a more minor point than having a single archiving table. |
|
|
michael.appleton
Posting Yak Master
160 Posts |
Posted - 2008-05-19 : 09:46:36
|
Hello again,

Just a question on your last post, pootle_flump. You said "Personally I don't bother with the "effective_from" date". How do you deal with dates in these history-storing tables?

Thanks again,
Mike |
|
|
pootle_flump
1064 Posts |
Posted - 2008-05-20 : 05:45:50
|
I mean I just have the one date. To have two dates means you must look up & write to the previous change record to "close" it. That's fine for data warehousing but IIRC we are talking about transactional auditing tables. They must impact as little as possible on the principal transaction. |
|
|
pootle_flump
1064 Posts |
Posted - 2008-05-20 : 06:29:25
|
Ok, I've read the thread again. I use two distinct tables: the "live" table, and then an audit/archive table where all inserts, modifications and deletes get written to (via triggers). I missed that Lumbago uses a single table (archiving off every so often). |
|
|
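A sketch of that trigger-based approach; the table, column and trigger names here are illustrative, and the audit table is assumed to mirror the live table plus a change type and date:

```sql
CREATE TRIGGER trg_People_Audit ON People
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON

    -- On UPDATE and DELETE, log the old row from the deleted pseudo-table
    INSERT INTO PeopleAudit (PersonId, Name, PhoneNumber, ChangeType, ChangeDate)
    SELECT d.PersonId, d.Name, d.PhoneNumber,
           CASE WHEN EXISTS (SELECT 1 FROM inserted) THEN 'U' ELSE 'D' END,
           GETDATE()
    FROM deleted d

    -- On INSERT (no rows in deleted), log the new row
    INSERT INTO PeopleAudit (PersonId, Name, PhoneNumber, ChangeType, ChangeDate)
    SELECT i.PersonId, i.Name, i.PhoneNumber, 'I', GETDATE()
    FROM inserted i
    WHERE NOT EXISTS (SELECT 1 FROM deleted)
END
```

The principal transaction only pays for the extra insert; no previous record has to be located and "closed", which is the point made above about impacting the transaction as little as possible.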
michael.appleton
Posting Yak Master
160 Posts |
Posted - 2011-03-24 : 10:49:19
|
Wrong thread!! |