
 DB design for sparse integer tables


bharat1
Starting Member

2 Posts

Posted - 2008-11-28 : 14:42:23


Hello

I am attempting to design a database for computerizing the statistical/demographic information collection and reporting process for educational institutions. (I am new to DB design, so the following problem could be simpler than it appears to me. :-( )

The whole input process is a bunch of forms, heavily matrix/grid-style in nature. Apart from some basic attributes about a college, all other information collected is purely numeric (integers in the 10s or 100s). There is *no* individual information about students, instructors, etc. Everything is just counts. (For example, a typical form posted to the system for a college contains the count of students of various ethnicities (columns in the form) across their majors (rows in the form). Another form collects data regarding the age-groups of students across their majors. Yet another form presents a count of students who can speak a certain language. For each metric so collected there are separate fields for male & female.) Thus a typical form has 20 or so columns and 10 or so rows (or vice versa, depending on the presentation).

The primary goals of the system are two-fold:
(a) It shall be deployed at a data collection center for a particular area (of 10-20 colleges), and the numbers shall be fed into it on a college-by-college basis for the area.
(b) Data will then be aggregated up to the city level and then further up to the state/country level.
At each of the aggregation levels, the primary concern will be querying to generate various aggregate reports. The reports are expected to provide insight to guide decisions regarding various development initiatives.


In terms of size, a typical form (as viewed) contains about 200-300 small-ints when submitted for a college. There are 10 to 15 such forms per college and there can be as many as 100,000 colleges in the entire system when all data is aggregated for reporting.

However, for any one form, typically only 15-20 of the 200+ values input are non-zero. So, basically, more than half of the rows in a completed form are *all* zeros, and even in the remaining rows more than half of the entries are zeros. Overall, any one form has a very sparse appearance.

Given this, I am trying to determine what should be the right approach to designing the tables:

Option (A): tables like student_stats, instructor_stats, etc. with lots of columns (one count per attribute) that simply store the numbers as they are

OR

Option (B): Does it make sense to optimize for storage by
EITHER
(i) taking an EAV-type approach and storing only the non-zero tuples (college_id, attr_x, attr_y, value) -- a sketch of this follows below
OR
(ii) devising some kind of encoding that packs an entire row of a particular entry form into a single stored value: (college_id, attr_x, y_bitField, encoded_y_values)? Basically, attr_x represents the form row, y_bitField records which of a fixed set of columns have non-zero values for that row, and encoded_y_values is a large number in which every two digits, starting from the end, hold one of the non-zero values (y1 + 100*y4 + 10000*y8 + etc.).
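To make (i) concrete, this is roughly the shape of table I have in mind (all names here are just illustrative); only the 15-20 non-zero cells of a form would ever get rows:

CREATE TABLE form_values (
    college_id int      NOT NULL,
    form_id    int      NOT NULL,  -- which of the 10-15 forms
    attr_x     int      NOT NULL,  -- form row (e.g., major)
    attr_y     int      NOT NULL,  -- form column (e.g., ethnicity + sex)
    value      smallint NOT NULL,  -- the count itself
    PRIMARY KEY (college_id, form_id, attr_x, attr_y)
)

Storage would then be proportional to the 15-20 real values per form rather than the full 200-300 cells.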

This is to be supported on SQL-2005, so SPARSE columns are not an option.

While I find the options in (B) interesting from a storage-optimization perspective, I am worried that performance could become a major problem, because the majority of the system's use is querying and generating reports at the top level; definitely so with approach (B)(i).

Besides, I feel option (B)(ii) will complicate programming quite a bit and hurt maintainability. It will also make reporting cumbersome, because any aggregation will first need to decode the data for the specific attribute being sought. Approach (B)(ii) also worries me because it feels like I might be building something the DBMS ought to be doing. The decode arithmetic for even a simple total illustrates the problem; see the sketch below.
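Suppose I want a simple total of one form column (say column 4) across rows. Because the position of a column's two digits inside encoded_y_values depends on how many lower-numbered columns are non-zero in that particular row, even this needs per-row bit counting (table and column names purely illustrative):

-- total of form column 4; bit 3 of y_bitField marks column 4 as non-zero
SELECT SUM(
         (encoded_y_values
            / POWER(100,
                    (y_bitField & 1)        -- column 1 present?
                  + (y_bitField / 2 & 1)    -- column 2 present?
                  + (y_bitField / 4 & 1)))  -- column 3 present?
         % 100)
FROM   encoded_form_rows
WHERE  y_bitField & 8 <> 0   -- only rows where column 4 is non-zero

Under option (A), the same report is just SUM(col4).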

So, overall, I am leaning towards option (A) and letting SQL Server figure out how to store the numbers efficiently. (Will it?) I feel that performance would be better that way. If that is indeed the right choice, are there other tricks I could play, with nullable columns etc., so that SQL Server will recognize the opportunity to optimize even better? One variation I can think of is sketched below.
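Namely: make the count columns nullable and store NULL instead of 0 (all names illustrative). As far as I can tell, a fixed-length smallint still occupies its full 2 bytes even when NULL on SQL-2005, so this may help query semantics more than storage; but SUM() ignores NULLs, so reports work unchanged:

CREATE TABLE student_stats (
    college_id int NOT NULL,
    major_id   int NOT NULL,
    eth1_men   smallint NULL,   -- NULL = zero / not reported
    eth1_women smallint NULL,
    -- ... one Men/Women pair per form column ...
    PRIMARY KEY (college_id, major_id)
)

SELECT major_id, SUM(eth1_men) AS eth1_men_total
FROM   student_stats
GROUP BY major_id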

Appreciate your thoughts/suggestions.


Thanks,
Bharat

LoztInSpace
Aged Yak Warrior

940 Posts

Posted - 2008-11-29 : 08:05:20
Don't go the EAV route. (A) sounds the best unless you can model each question/form in a separate table, but without knowing the details it's hard to say if that is actually better. Generally, though, you want to store the data in its most natural form, so keeping your college data separate, as you have proposed, is good. You might also want to consider a table per form/group of questions to allow you to further group up the results.
At the end of the day though, if you need the answer to 300 questions per college, the best solution is 300 columns.

bharat1
Starting Member

2 Posts

Posted - 2008-12-02 : 13:31:04

Thanks for your response. I am leaning towards a table per form as well.

Please let me know if the following clarifies the issue enough to affirm that taking the one-table-per-form approach is correct.

If we use a couple of such forms as examples (hobbies of students by major, languages spoken by students by major), this is what the corresponding tables look like:

CREATE TABLE STUDENT_HOBBY_COUNTS (
    COLLEGE_ID    int      NOT NULL,
    MAJOR_ID      int      NOT NULL,  -- FK to the majors lookup
    Hobby1_Men    smallint NOT NULL,
    Hobby1_Women  smallint NOT NULL,
    Hobby2_Men    smallint NOT NULL,
    Hobby2_Women  smallint NOT NULL,
    -- ... Hobby3 through Hobby9 Men/Women pairs ...
    Hobby10_Men   smallint NOT NULL,
    Hobby10_Women smallint NOT NULL,
    PRIMARY KEY (COLLEGE_ID, MAJOR_ID)  -- one row per college per major
)
(Note: Hobby1 could actually be HorseRiding, Hobby2 Chess, etc.)
===================

CREATE TABLE STUDENT_LANGUAGE_COUNTS (
    COLLEGE_ID   int      NOT NULL,
    MAJOR_ID     int      NOT NULL,  -- FK to the majors lookup
    Lang1_Men    smallint NOT NULL,
    Lang1_Women  smallint NOT NULL,
    Lang2_Men    smallint NOT NULL,
    Lang2_Women  smallint NOT NULL,
    -- ... Lang3 through Lang14 Men/Women pairs ...
    Lang15_Men   smallint NOT NULL,
    Lang15_Women smallint NOT NULL,
    PRIMARY KEY (COLLEGE_ID, MAJOR_ID)
)

===================


As you can see, the data gathered from any one college consists of just 2-digit numbers, and each table will be very sparse. For instance, the hobby table would contain 15 rows (one for each major) and 20 count columns, totaling up to 300 integers. However, out of these only 15 to 20 will be non-zero.

Also, as mentioned earlier, data collection at the college level is already in the form of aggregates, so there's no normalization per se to be done. (It is not that we are collecting info about each individual student into the DB; rather, the data is fed in as counts of various categories of students, e.g., the count of Civil Engg major men who can speak German.) A query for exactly that kind of number is sketched below.
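That German example should map directly onto one column of the language table. Assuming, purely for illustration, that Lang3 happened to be German and MAJOR_ID 7 were Civil Engg:

SELECT SUM(Lang3_Men) AS civil_men_speaking_german
FROM   STUDENT_LANGUAGE_COUNTS
WHERE  MAJOR_ID = 7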


For each college there are 15 such tables keeping track of numbers for different aspects of interest (mostly by student majors).

The systems at the data input locations have to cater to up to 100 colleges each. However, once data is aggregated up to the state level, there will be 100,000 colleges from which overall reporting will be done. I expect most of those reports to be simple rollups, along the lines of the sketch below.
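For example, a per-major rollup across all colleges would be just (illustrative):

SELECT MAJOR_ID,
       SUM(Hobby1_Men + Hobby1_Women) AS hobby1_total
       -- ... one SUM per column of interest ...
FROM   STUDENT_HOBBY_COUNTS
GROUP BY MAJOR_ID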

Thanks,
Bharat
   
