performance of PK - int vs. varchar

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

Database Design and Application Architecture

performance of PK - int vs. varchar

Author

Topic

SQLforGirls
Starting Member

48 Posts

Posted - 2009-07-07 : 17:01:03

I am designing an SQL db for potentially large web app. I'm trying to decide between two formats for the generic table PK. I am open to other options, but right now I'm thinking about two integer values that together would uniquely identify a record in any table. First, system_id would identify the server node or other system that generated the record (web server, app server, db server etc, depending on final architecture). Second, user_id (or other table_name_id). We would most likely have a web server component that manages numeric key values on each node rather than using the IDENTITY function, so that we don't have to insert a record in order to obtain an ID.

So my question at this point is simply around which data type method will be more efficient, both in coding and in performance:

A. I compound key on 2 columns like this:
sys_id int
user_id int

(example values 10, 1001)

B. A key on a single concatenated column like this:

user_id varchar(50)

(example value 10-1001)

The question is basically just which data type provides faster joins ..... either joining on a single varchar(50) column made up of 9999-9999999 or joining on 2 INT (or BIGINT) columns made up of 9999 and 9999999.

Preliminary tests are inconclusive, with the varchar(50) seeming to win by a tiny margin.

I have found one MSDN forum where this question was asked 2 years ago, with lots of responses but no one addressing the actual issue. I'm confident I will get more thoughtful responses here!

thanks
SfG

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2009-07-07 : 17:27:59

Don't know how "thoughtfull" this is but here's my opinion :)

I don't like the idea of a concatenated "intelligent" string because you are bound to occasionally want to treat the components individually. If they are together in a string then you will need to perform string manipulation and possibly type conversions. All unnecessary if you have them in two int columns. As your tests showed the JOINs were not significantly faster with two columns. So the only disadvantage would be a little extra typing for your JOIN criteria. But I think the advantage of having the data stored at its most detailed level will far out weigh the bit of typing you'd save.

Be One with the Optimizer
TG

SQLforGirls
Starting Member

48 Posts

Posted - 2009-07-07 : 17:38:47

Nicely put. Thanks. I'm still very interested in hearing opinions about other aspects of the issue, but so far I am inclined to agree (all other things being equal, which I'm not convinced, but I think they are).

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2009-07-07 : 17:43:00

Another option you didn't mention - NOT that I'm recommending it but it is another option:
make a surrogate primary key as a single int column. Then your alternate key columns (now just attributes) could have a unique constraint.

Be One with the Optimizer
TG

SQLforGirls
Starting Member

48 Posts

Posted - 2009-07-07 : 17:52:58

Feh! (pbththbhpthth) and other non-verbal silliness retorts. These are already essentially surrogate keys. Besides, the surro-surrogate you suggest would need to follow all the same rules and requirements we're already discussing and we'd end up back at this same discussion, only with twice as many keys. Thanks for thinking of it though.

SQLforGirls
Starting Member

48 Posts

Posted - 2009-07-07 : 18:14:54

Oooh, sorry. My response sounded much more like light-hearted silliness in my head. It looks downright mean on screen. Not what I intended! (I'd better stick to the facts instead of trying to be cute).

SfG

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2009-07-07 : 19:18:01

light-hearted silliness percieved - and I liked the sound affects

Now I'm really not intending to advocate that surrogate key approach but I'm not sure I agree with you. They may be surrogate keys in their respective tables ([user] and [sys]) but for this table they would not be. One argumement could be why use surrogate keys in those table and not this one?

Be One with the Optimizer
TG

SQLforGirls
Starting Member

48 Posts

Posted - 2009-07-08 : 00:30:48

"This" table that I'm talking about would be the USER table. Its PK would be sys_id + user_id. The sys_id would be an identifier for the system that created the record. So if we end up with 10 SQL Servers replicating data back and forth (not what I expect, but it works for example) then sys_id would be a value of 1 thru 10 depending on the current server. User_id would be equivalent to the IDENTITY column that would be easy to use if we were (a) using a single server to generate records, and (b) okay with having to insert a record to the database and retrieve the new ID value in order to then insert subsequent child records in other tables.

User_id in this case is not exactly a surrogate key, but it is also not particularly "real" data. It is not published to the front-end. But it is needed for referential integrity.

In the case of applying an additional surrogate key, we would then run into the same issue, namely what mechanism is used to generate and maintain uniqueness across all records and also make the web code aware of the new pk value prior to insert so that child records can be inserted in the same (or VERY close subsequent) call to the database.

I hope that makes better sense. I realize my initial question lacked necessary detail, but this post is not exactly clear-as-mud.... (Excuse: I just finished cleaning up fresh barf from myself, clothes, hair, my son, and all his bedding..... now back to the computer for a little light work before bed.)

LoztInSpace
Aged Yak Warrior

940 Posts

Posted - 2009-07-08 : 09:35:02

If you have a natural PK that is only two ints then use it. As you have worked out, performance is negligible so you might as well keep a proper data model. You will hit other performance problems before this one is worth looking at e.g. XML/web code, crappy transaction handling etc. Do the right thing.

I'm a bit concerned about your 'not using identity' comment. Why would you want an ID with no record and why would you want this handled anywhere other than the DB?

SQLforGirls
Starting Member

48 Posts

Posted - 2009-07-08 : 10:47:07

Specifically I'm thinking about the situation wherein a set of related records coming from user input need to be inserted into 3 related tables in the same database call. Let's say user, order_hd, order_dtl records.... I would rather have the web code (or middleware) already know what the IDs of each record are going to be so all PK and FK values can be set ahead of time, and just pop the records in. So that leads me to a single component that manages bigint values for each table on each web/app server. The database will enforce PK and FK constraints, but the application will be responsible for generating the proper values for those keys.

The DB design is quite simple, with no more than 30 tables all-told. But the user volume is potentially large (we all wanna create a popular web app right?). So If my DB calls can be:
"insert into table A; insert into table B; insert into table C" rather than
"insert into table A; select back new ID; insert into table B using new a.ID as FK in table B; select back new b.ID; insert into table C using new b.ID as FK in table C"

wouldn't the first option be much better?

Or am I over thinking? I'm used to a vastly different types of performance concerns....

Thanks
SfG

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2009-07-08 : 11:14:32

If you go with the two INT natural key then the point is moot. If you go identity column then a good method is to have an insert SP which returns the new identity value (via scope_identity) as an output variable.

Have you decided on the structure (PKs) for the other tables yet? I'm leaning towards your two int natural key but I also like the idea of maintaining continuity with the methods used by the other tables.

Be One with the Optimizer
TG

SQLforGirls
Starting Member

48 Posts

Posted - 2009-07-08 : 12:05:16

I'm pretty much decided on the two INT keys on each table. Each and every table will have a sys_id (or created_system_id) and a tablename_id (user_id for the user table, order_id for the order table). The combination of sys_id and tablename_id will be the natural PK. This will be the same in each table.

Now, I'm open to a discussion of the merits of using IDENTITY property for the tablename_id in each table, and having to insert/select (or equivalent with a stored proc and output variable) prior to issuing the insert to the child table, vs. having a (presumably as elegant as possible) web or middletier component that maintains and assigns identity-like values for each tablename_id so that all PK and FK value assignments are set prior to the actual DB call by the Web/App server.

Thanks for the continued discussion!

SfG

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2009-07-08 : 13:55:59

Some of the challenges of having a non-database process assign PK values are:

- concurrency collisions.

- the non-db process not being part of the db transaction. So if something goes wrong in the DB like the paraent table insert fails for any reason will the outside process handle it appropriately.

- adding another tier adds another possible point of failure

Be One with the Optimizer
TG

SQLforGirls
Starting Member

48 Posts

Posted - 2009-07-08 : 15:04:59

OK So I'm picturing a way now, whereby data comes in from the Web/App server (take again the example of a new user, new order_hd and new order_dtls all coming in the same DB call) without PKs or FKs, and the single call from Web to DB (like single insert proc call) takes care of: Insert to user, return pk, insert to order_hd, return pk, insert to order_dtls.... Then returns all necessary IDs/confirmations back to the Web/App server at the end.

Then IDENTITY is sufficient for all tablename_id PK.... and system_id is only necessary if/when there are multiple DB server nodes... So that each node can have the same IDENTITY value ranges, with the additional system_id being necessary to fully identify a unique record across all servers.

X002548
Not Just a Number

15586 Posts

Posted - 2009-07-08 : 16:51:02

maybe it would help if you had posted some sample data

If there are any natural keys at all, this is the way to go

I'm curious about the multiple sources though

Do they all connect to your database, do you get feeds, or what?

Brett

8-)

Hint: Want your questions answered fast? Follow the direction in this link
http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx

Add yourself!
http://www.frappr.com/sqlteam

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources