Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
nlocklin
Yak Posting Veteran
69 Posts |
Posted - 2001-12-10 : 21:55:42
Well, I've come across ridiculous idea #2 at work. Here's the situation - I'm wondering if anyone else has encountered this before, and what you eventually decided?

I've been told that we have to store "different types of NULL values" in our databases. Basically, we need to be able to know whether a particular value is NULL because we're missing the information (and need to get it) or because that particular value will just never have a value.

For some fields (esp. fields using a lookup table), I'm thinking that although it's not the best idea, I could probably live with assigning a code to "Missing"... but it gets worse. Since new people will always be joining the research project, we have to come up with some kind of standard value for "Missing", like 9 (which is what they currently use... mostly). OK, that's a little more difficult, but still possible. But, I point out, what about fields where you've already assigned the value 9 to something else? "Then we'll use 99." It's a slippery slope.

But it gets worse! They also want to assign these values to numeric fields. For example, if we're tracking someone's weight as a decimal value, and the value is missing (but we need to get it), they want to store the value as 999.99! I can't wait for the first time someone doesn't understand how that works, and publishes an average in a report somewhere ("Wow, that seems high"). They also want to store dates as 99/99/9999, or just parts of dates as missing (i.e. if they know something was from this month, they'd store 12/99/2001 - which is why they currently use numeric values to store dates).

Has anyone else heard of this? It seems pretty ridiculous to me. I've already found many cases where this causes some big headaches in their data. I've been thinking about how else to do this, but haven't come up with much (other than a table to track missing values, which could be huge if they want to track many columns). Anyone have any thoughts? Or Aleve?

--"It's not that I'm lazy, it's that I just don't care."
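The "Wow, that seems high" scenario above is easy to reproduce. Here's a minimal sketch (the table and column names are invented for illustration, using SQLite in place of SQL Server) showing how a 999.99 sentinel silently inflates AVG(), while a value mapped back to NULL is skipped by the aggregate automatically:

```python
import sqlite3

# Hypothetical weights table: 999.99 is the proposed "missing" sentinel.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subjects (id INTEGER PRIMARY KEY, weight REAL)")
conn.executemany(
    "INSERT INTO subjects (weight) VALUES (?)",
    [(150.0,), (175.5,), (999.99,)],  # last row is really "missing"
)

# Naive average is skewed by the sentinel.
naive_avg = conn.execute("SELECT AVG(weight) FROM subjects").fetchone()[0]

# NULLIF maps the sentinel back to NULL, which AVG() ignores.
clean_avg = conn.execute(
    "SELECT AVG(NULLIF(weight, 999.99)) FROM subjects"
).fetchone()[0]

print(round(naive_avg, 2))  # 441.83 -- "Wow, that seems high"
print(round(clean_avg, 2))  # 162.75
```

The catch, of course, is that every single query has to remember the NULLIF; forget it once and the bad average ships in a report.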
Merkin
Funky Drop Bear Fearing SQL Dude!
4970 Posts |
Posted - 2001-12-10 : 22:09:54
I have seen this method used in market research. I did data analysis in that industry for a while.

It was standard for people assigning codes and data entry people to want things like 99 for "Not Answered". Sometimes you would have 98 for "Not Applicable" which, when you asked people, definitely needed its own code. Also, the data entry programs required an answer, so 99 was easy for the DE people to type.

At the end of it, you would just net all of those codes together and leave them out of your aggregates. The code-assigning people were happy, because they had more options to play with, but it didn't matter to the final results because you would generally only want stats for the people that answered.

So, in SQL (the data analysis software I used wasn't SQL), you would probably use a CASE statement to only do calculations on the good data. As for them being NULLs, well, that is up to you how the data is stored; they never really need to know.

CASE is your friend... good luck!

Damian
fisherman_jake
Slave to the Almighty Yak
159 Posts |
Posted - 2001-12-10 : 22:25:15
nlocklin, from my experience, as long as the developers agree on a particular method of flagging NULL, N/A, or Missing, then you shouldn't be tripping over yourselves and kicking somebody in the nuts. You could always implement a lookup table to interpret the code, which is what I've experienced with Thomson Financial's Worldscope. Basically they have something like 132 million records and 8 multi-use lookup tables. It is all very confusing, but every column has a type, and that type may have x number of lookups depending on other flags associated with that data. Apart from the data and the lookup tables, they also send you a 600+ page book of terms and glossary to make sense of the data.

So basically, unless it's absolutely necessary, work on an ALL or NOTHING basis (in the Quant world, it's either significant or insignificant). Why would you want to capture incomplete data, unless you want to count the number of incomplete entries... hehehe. If it's through an app, then your app should not accept the inputs. It sounds all too funny to me to be getting dates without days or months, unless you just want the year.

Sorry, I can ramble on forever, but I suppose different applications will require different specs. So just stick to your methodologies and you won't be kicking anybody in the nuts.

I need a
==================================================
World War III is imminent, you know what that means... No Bag limits!!!
Master Fisherman
izaltsman
A custom title
1139 Posts |
Posted - 2001-12-11 : 10:26:20
If your data structure and requirements do not change often, I would consider adding an int field to every table and using it as a bitmask of missing values (i.e. each bit in this field would represent a column, so whenever some column has a NULL due to a missing value, set the corresponding bitmask bit to 1).
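A minimal sketch of this bitmask idea (table, column, and bit assignments are all invented for illustration; SQLite stands in for SQL Server): the mask column disambiguates the two kinds of NULL, so "bit set" means "missing, go collect it" and "bit clear with a NULL" means "will never have a value".

```python
import sqlite3

# Hypothetical bit assignments: bit 0 = weight missing, bit 1 = height missing.
WEIGHT_BIT = 1 << 0
HEIGHT_BIT = 1 << 1

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE subjects (
        id INTEGER PRIMARY KEY,
        weight REAL,
        height REAL,
        missing_mask INTEGER NOT NULL DEFAULT 0
    )
""")
# Row 1: weight is NULL and still needs collecting; height is NULL
# but its bit is clear, meaning it will simply never apply.
conn.execute(
    "INSERT INTO subjects (weight, height, missing_mask) VALUES (NULL, NULL, ?)",
    (WEIGHT_BIT,),
)
# Row 2: fully populated, mask is 0.
conn.execute(
    "INSERT INTO subjects (weight, height, missing_mask) VALUES (180.0, 70.5, 0)"
)

# Work list: rows where weight still needs to be chased down.
todo = conn.execute(
    "SELECT id FROM subjects WHERE missing_mask & ? != 0", (WEIGHT_BIT,)
).fetchall()
print(todo)  # [(1,)]
```

The caveat izaltsman flags is real: every schema change that adds or drops a column forces you to renumber or extend the bit assignments, so this only stays manageable when the structure is stable.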
mono
Starting Member
36 Posts |
Posted - 2001-12-11 : 18:42:42
I'd go one further than izaltsman: add a value type code to every column that needs it. You may want to make it something like varchar(4) so that you capture user codes as text, allowing "NI" for "not intelligible" as well as "9" and "99". This avoids having one long list of codes where not all make sense for all fields and some may have varying interpretations between fields. You can prepopulate the lists with the truly common values.

The idea of capturing dates of the form d??m?yy? is interesting. You'd have a hard time extracting information consistently. Consider double counting, where 12??2001 counts as a value for every month, i.e. 12 times. To avoid that consistently you would have to exclude such values in your queries, which, as fisherman_jake points out, raises the question of why they are being stored in the first place.

This sort of thing applied to quantities other than dates, e.g. weight (float), would have you trying to interpret "?34.5?g". Now is that in the range 34 to 35 g, or is it 834.5 kg?

In my (not extensive) experience, such problems are best handled at the capture or input stage, i.e. validate on input and do not allow invalid data in. But there could be genuine reasons for capturing such data. In that case, treat it as text. The interpretation as a date is something that will have to be carefully defined. You might then want to record 2 values: the raw text input and the interpreted value, which will be NULL if the raw text value is not interpretable as a datetime. Systematically avoiding the invalid dates in queries then requires just an IS NOT NULL (effectively the same as izaltsman's mask).
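The two-column raw-plus-interpreted idea can be sketched like so (an illustrative example with invented table and column names; SQLite and Python's datetime stand in for SQL Server's date handling): the raw user entry is always preserved, the interpreted column is populated only when the entry parses as a complete date, and queries filter with a plain IS NOT NULL.

```python
import sqlite3
from datetime import datetime

def interpret(raw):
    """Return an ISO date string, or None if the raw entry
    is not a complete, valid mm/dd/yyyy date."""
    try:
        return datetime.strptime(raw, "%m/%d/%Y").date().isoformat()
    except ValueError:
        return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (raw_date TEXT, date_value TEXT)")
# One real date, one "day unknown" entry, one fully missing sentinel.
for raw in ["12/05/2001", "12/99/2001", "99/99/9999"]:
    conn.execute("INSERT INTO events VALUES (?, ?)", (raw, interpret(raw)))

# Valid dates only: a single IS NOT NULL, no sentinel bookkeeping.
good = conn.execute(
    "SELECT raw_date FROM events WHERE date_value IS NOT NULL"
).fetchall()
print(good)  # [('12/05/2001',)]
```

Nothing is lost: "12/99/2001" is still there in the raw column if someone later decides how a month-only date should be interpreted, but it can never leak into date arithmetic by accident.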
robvolk
Most Valuable Yak
15732 Posts |
Posted - 2001-12-11 : 19:06:38
Another 2¢ if you want it:

I agree with a lot of the suggestions here. Basically these are encoding schemes, which Joe Celko likes to promote. You're creating your own special codes to represent data either in shorthand, or to represent missing, not applicable, or other states of the data you have.

I'm not much of a believer in differentiating between missing and N/A data either, but that's just me. NULL works equally well for both 99% of the time. N/A is a little different; I almost never keep unlike things in the same table, where one column would apply to certain rows but not others. When I do need to distinguish, I'll usually put 'Unknown' in a string column or a -1 in a numeric column, if negative numbers aren't normal (weight is a good example).

The only problem I see with your situation is the bit about missing PART of a date, or PART of another column value. To me, it would make no difference to simply leave the date column empty than to fill in the month and year but leave the day out, or just to pick the first of that month. My feeling is that if someone is so lazy or stupid that they can't completely fill out a column, flagging that column with some encoding that says "go back and complete this" will make no difference. They'll never go back and complete it! Now you need to encode incomplete data (!), not just missing or N/A! I can't see any way for an encoding scheme to be consistent in this kind of atmosphere.

Columns should be all-or-nothing: you either have a value or you don't. If you can make a row all-or-nothing, even better. This sounds simplistic, but all you're doing is requiring data integrity. Providing someone with the flexibility to fill out only the data they want to isn't a benefit; it'll only create a bigger mess than what's already there. mono is 100% right about that; make them fill it out right, and don't allow them to put in bullshit data.