 A thorny one (for me)


aiken
Aged Yak Warrior

525 Posts

Posted - 2005-10-15 : 16:00:43
I've got an app that uses a precomputed zip code distance table to do quick sort-by-distance lookups. The zip code table is in the form:
quote:
zip1 zip2 dist
12345 12346 1
12345 12348 2
12345 12350 10


As you can imagine, this table gets big quickly. To keep it in check and prevent redundancy, zip2 is always greater than zip1. So for zip1=00001, there are tons of entries for zip2. For zip1=96999, there are only a few rows, since it will have been the zip2 for every lower zip code. I hope that makes sense.

Anyways, my problem is this: to search for rows that are within 100 miles of an arbitrary zip code X, the search has to take into account that it's matching either zip1=X or zip2=X.

Right now, I'm using functions that return the minimum and maximum values from a pair, so it does something like
where zip1=dbo.f_min(@zipCenter,referenced_zip) and zip2=dbo.f_max(@zipCenter,referenced_zip) and dist<100


...but that seems ugly, and I'm not good enough with query plans to be able to see how much that function hurts. I'm sure this is a relatively common design issue, so I'd love some advice. Would I be better off with something like "where ((@zipCenter<referenced_zip) and zip1=@zipCenter and zip2=referenced_zip and dist<100) or ((referenced_zip<@zipCenter) and zip1=referenced_zip and zip2=@zipCenter and dist<100)"?

Or am I missing the smart way altogether?

Thanks
-b

blindman
Master Smack Fu Yak Hacker

2365 Posts

Posted - 2005-10-15 : 16:48:12
Yeah, the smart way would be to store the geographic X/Y (longitude/latitude) coordinates of the center of each zip code, and then get a little help from an old Greek DBA named Pythagoras, who can show you an easy way to calculate the distance between any two points.
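
Something along these lines would do it - just a sketch, mind you, with a made-up ZipCodes table and Latitude/Longitude columns, and a flat-earth Pythagorean formula that is only a fair approximation over shortish distances:

-- Hypothetical lookup table: one row per zip code with its centroid
-- CREATE TABLE ZipCodes (Zip char(5) PRIMARY KEY, Latitude float, Longitude float)
DECLARE @lat float, @lon float
SELECT @lat = Latitude, @lon = Longitude FROM ZipCodes WHERE Zip = '12345'

-- Roughly 69 miles per degree of latitude; longitude degrees shrink with COS(latitude)
SELECT z.Zip,
       SQRT(SQUARE((z.Latitude  - @lat) * 69.0) +
            SQUARE((z.Longitude - @lon) * 69.0 * COS(RADIANS(@lat)))) AS DistMiles
FROM ZipCodes z
ORDER BY DistMiles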

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-10-15 : 17:07:03
select zip2 from zips where dist < 100 and zip1 = 12345
union all
select zip1 from zips where dist < 100 and zip2 = 12345
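
If you also need the distance back for sorting, the same shape works (just a sketch):

select zip2 as zip, dist from zips where dist < 100 and zip1 = 12345
union all
select zip1, dist from zips where dist < 100 and zip2 = 12345
order by dist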

rockmoose

aiken
Aged Yak Warrior

525 Posts

Posted - 2005-10-17 : 00:04:05
blindman, that *would* be nice, if our customers were willing to wait a few days. It's not unusual for this app to need to *sort* by distance among permutations of well over 100,000 locations. Run-time computation is definitely out.

Thanks, rockmoose. That helps, but I need this in a relatively complex dynamic SQL query, and I really don't want to mess with two copies of the whole thing with the union (and then using a table variable or something to sort them together).

Cheers
-b

eyechart
Master Smack Fu Yak Hacker

3575 Posts

Posted - 2005-10-17 : 01:32:35
quote:
Originally posted by aiken

blindman, that *would* be nice, if our customers were willing to wait a few days. It's not unusual for this app to need to *sort* by distance among permutations of well over 100,000 locations. Run-time computation is definitely out.

Thanks, rockmoose. That helps, but I need this in a relatively complex dynamic SQL query, and I really don't want to mess with two copies of the whole thing with the union (and then using a table variable or something to sort them together).

Cheers
-b



I would think that the calculation method would be pretty fast. The lookup method you are using makes for a very inefficient table. You are probably having to do huge index range scans to find the different zip codes. A lookup table with long/lat for each zip code (with a unique index) may in fact be faster.

I would agree that a lookup table with unique values would be best, but that is not at all what you are describing.

If you are interested, check out this project on SourceForge: http://sourceforge.net/projects/zips/. The dataset includes a zip code database with longitude/latitude data. You can also easily see the calculation needed to work out distance in the included PHP code.


-ec

Kristen
Test

22859 Posts

Posted - 2005-10-17 : 01:45:30
" It's not unusual for this app to need to *sort* by distance among permutations of well over 100,000 locations. Run-time computation is definitely out"

Are you reporting 100,000 rows as a resultset, sorted by distance, or just trying to find a handful of the nearest ones FROM a 100,000 row dataset?

If it's the second, then calculation is going to be plenty fast enough, I would have thought - because I would start off narrowing down by the total difference in Lat & Long, and then do some heavy calculation on the likely candidates.

If what you actually want is "places within X miles" then you can calculate the max/min lat/long for a square of X miles, find everything within that (which will be a very fast index lookup), and then use the heavy calculation to weed out the ones outside a circle, rather than a square, and for sorting by distance.
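
In rough T-SQL it might look something like this - a sketch only, where the Locations table, the column names and the 69-miles-per-degree constants are illustrative rather than anyone's real schema:

DECLARE @lat float, @lon float, @radius float
SET @radius = 100.0   -- miles; @lat/@lon come from the user's location

SELECT l.LocationID,
       SQRT(SQUARE((l.Latitude  - @lat) * 69.0) +
            SQUARE((l.Longitude - @lon) * 69.0 * COS(RADIANS(@lat)))) AS DistMiles
FROM Locations l
-- cheap square filter first: plain range predicates that ordinary indexes can use
WHERE l.Latitude  BETWEEN @lat - @radius / 69.0
                      AND @lat + @radius / 69.0
  AND l.Longitude BETWEEN @lon - @radius / (69.0 * COS(RADIANS(@lat)))
                      AND @lon + @radius / (69.0 * COS(RADIANS(@lat)))
ORDER BY DistMiles

Only the rows inside the square pay for the SQRT, and repeating the distance expression in the WHERE clause (or filtering the resultset afterwards) trims the corners of the square back to a circle.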

Kristen

aiken
Aged Yak Warrior

525 Posts

Posted - 2005-10-17 : 02:20:49
I should probably have been more clear about the context to start with; I apologize for that. It's always tough balancing being concise about the question with giving enough info for people to be able to actually help.

What's happening here is this: there's a table of about 1,000,000 items, each of which has a location (right now, a reference to a zip code table, but for the purposes of this discussion it's fine to consider it as lat/lon). Users search for the nearest applicable item, so that 1,000,000 figure is being pruned by some big, easy stuff first (color, etc). What's left is somewhere between just a few and maybe 250,000 matching items (if someone doesn't narrow the search enough, or is searching for a very very common part), and they need to be sorted by distance from the user's required location. Of course, people only really care about maybe the first 500 or so, but that still requires the computation of all of the distances for sorting purposes.

However, Kristen, that square/circle thing is very clever. Rather than actually having to compute the distance between the requirement and each part, we can probably just look for parts that are within a lat/lon range, and then if necessary do the pythagorean thing among *those*, or even just the top 500 of those if there are tons.

I'll look into that as a longer-term fix for the underlying problem, because I think that's going to be a lot more elegant.

However, to come full circle, can anyone give me some salt-of-the-earth advice about the performance impact of using functions as originally described rather than more complicated where clauses? I realize that the best answer to that is "try it both ways and benchmark it," and if there's no general answer I'll accept that, but I'd love to hear if there's any conventional wisdom here.

Cheers
-b

Kristen
Test

22859 Posts

Posted - 2005-10-17 : 04:31:47
"performance impact of using functions:

where zip1=dbo.f_min(@zipCenter,referenced_zip) and zip2=dbo.f_max(@zipCenter,referenced_zip) and dist<100
"

Well ... if referenced_zip is an indexed column the function is probably going to force a table scan - but it depends on whether the optimiser unravels the function sufficiently and, of course, what the function does!

Is it easy enough to construct a test case in Query Analyzer? Something with a COUNT(*) and just that WHERE clause, plus "Colour=Red" or whatever to restrict the hits somewhat?

If so, I would then make a second copy and hand-code a replacement for the Function (I mean cut & paste the source from the function in, but also optimise it for how you would have written it without the function [might be identical of course!!])

Then I'd stick this snippet of code around each of the samples and see what LOGICAL values you get. I would then try messing around with some indexes etc. to see how the LOGICAL values change - if one of the things you try makes them significantly smaller then that's your winner!

I find this route normally doesn't take very long, and I usually wind up with an "Oh my Golly Gosh" [substitute superlatives/expletives as appropriate!] because, in my code at least!, it normally brings to light something obvious that we should have been doing since day one, but everyone had overlooked.

-- See http://www.sql-server-performance.com/statistics_io_time.asp for more details

-- Clear cache (for level playing field
-- - only if wanting to check PHYSICAL performance rather than LOGICAL performance)
-- Do **NOT** run these on a Live server
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

-- Comment in the SHOWPLAN to see the Query Plan, OR the STATISTICS to see the logical stats
-- SET SHOWPLAN_TEXT ON
GO
-- SET STATISTICS IO ON; SET STATISTICS TIME ON

-- ... put query here

SELECT * FROM Northwind.dbo.Products

SET STATISTICS IO OFF; SET STATISTICS TIME OFF
GO
SET SHOWPLAN_TEXT OFF
GO

Kristen

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2005-10-17 : 08:49:46
I don't know how much of this thread is applicable, but the first page is fun to read:
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=16859

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-10-17 : 09:21:07
quote:
Originally posted by rockmoose

select zip2 from zips where dist < 100 and zip1 = 12345
union all
select zip1 from zips where dist < 100 and zip2 = 12345

rockmoose



If the table is like you describe, plus:
One index on zip1 (possibly covering dist as well)
One index on zip2 (possibly covering dist as well)

I can't see how you can get much faster than the above query.

Please tell me if I'm missing something...
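
For example (the index names and composite keys are just a suggestion - the idea is that each half of the UNION ALL becomes a covered index seek):

CREATE INDEX IX_zips_zip1 ON zips (zip1, dist, zip2)
CREATE INDEX IX_zips_zip2 ON zips (zip2, dist, zip1)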

aiken
Aged Yak Warrior

525 Posts

Posted - 2005-10-18 : 03:21:07
rockmoose, I'm thinking I just didn't understand your answer; as it turns out, it is dramatically faster (about 95% faster, to be exact) to do something like "where parts.zip_code in (select...", with your union'ed select statement in that subquery. A pretty significant improvement over the whole function. Thanks much, and sorry for being dense.
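
Roughly what it ended up looking like (the parts table and column names here are illustrative, not our real schema):

SELECT p.part_id
FROM parts p
WHERE p.color = @color          -- the cheap filters first
  AND p.zip_code IN
      (SELECT zip2 FROM zips WHERE dist < 100 AND zip1 = @zipCenter
       UNION ALL
       SELECT zip1 FROM zips WHERE dist < 100 AND zip2 = @zipCenter)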

I think Kristen's square/circle approach is probably the better long term goal since it will be more scalable once the number of zip codes goes up (well, international postal codes, that is). But this is a huge improvement in the short term.

Thanks, everyone!
-b

Kristen
Test

22859 Posts

Posted - 2005-10-18 : 04:10:22
"well, international postal codes, that is"

I've come across a cool site that uses some public domain data to "Locate" places (it converts them to Lat/Long).

http://www.heavens-above.com/countries.asp

You choose a country, and then a town/village/hamlet/etc. Having chosen that you can click on "neighbours" and it gives you a list of nearby places to narrow your choice.

I like this because when I tried it I put in the name of my nearest town - about 10 miles away - because I assumed that a web application would not know the name of the little hamlet of about 5 houses that I live in. When I saw the list of "neighbours" I realised it had more data, so picked the appropriate one from the list (yes, it has got my hamlet in it!)

On the other hand I have no idea how to process Postal Codes from all bar a couple of countries in the world ... and many don't have any, or they cover very wide areas (France for example). Plus people mistype their postal codes of course.

So I figure I can use this approach to get people to indicate where they are located.

Kristen

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-10-18 : 05:43:47
The precalculated table approach is viable if the # of zip codes is not too high.
I consider this a "semi" static table, and you could index it to your heart's delight.

If distances over a certain number are totally uninteresting, you might cut the size significantly by simply excluding those rows from the pre-calc table.
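
For example, the pre-calc table could be built (and kept small) in one go along these lines - purely a sketch, reusing the hypothetical ZipCodes coordinates table sketched earlier in the thread:

INSERT INTO zips (zip1, zip2, dist)
SELECT a.Zip, b.Zip,
       SQRT(SQUARE((b.Latitude - a.Latitude) * 69.0) +
            SQUARE((b.Longitude - a.Longitude) * 69.0 * COS(RADIANS(a.Latitude))))
FROM ZipCodes a
JOIN ZipCodes b ON a.Zip < b.Zip   -- enforce zip1 < zip2, so each pair appears once
WHERE SQRT(SQUARE((b.Latitude - a.Latitude) * 69.0) +
           SQUARE((b.Longitude - a.Longitude) * 69.0 * COS(RADIANS(a.Latitude)))) < 100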

declare @nozips int; set @nozips = 10000
select ((@nozips-1)*(@nozips/2.0)) / 1000000 as millionsofcomb

millionsofcomb
--------------
49.995

The approaches that everybody else has posted with the "proper" zip code table with coordinates is of course much more scalable.
And a good target for nifty distance-calculating algorithms.
At some point dynamic calculations can probably be as well-performing,
as long as you limit the amount of calcs by efficiently limiting the resultset that has to be evaluated.

rockmoose

aiken
Aged Yak Warrior

525 Posts

Posted - 2005-10-18 : 13:55:03
Yeah, we do exclude distances greater than a set amount, otherwise the combinatorial explosion really kicks in in a big way. Queries that use the precomputed table just know that if a row isn't there, the distance is greater than the maximum we support (not many people care about the difference between 2000 and 3000 km when they're looking for something).

I'll take a look at the site; we use commercial data from a variety of sources right now, but I'm sure it would go over well if I could reduce expenses.

Thanks!
-b
   
