| Author |
Topic |
|
SamC
White Water Yakist
3467 Posts |
Posted - 2006-02-13 : 10:57:14
|
| I've got a table which collects a VARCHAR (8000) column of "user feedback". I usually sort the feedback reports by date, sometimes by the "length" of the feedback. Length allows me to easily discard shorter feedback: e.g., No, None, Not this time., etc...What would be a big help would be if "similar" feedback could be grouped together. For example, there will usually be 5 or 10 feedback entries which pertain to a particular subject, and there is usually a keyword in the feedback that ties them together. Obviously, the keyword won't be any of: The, a, if, and, but, or. Keywords are usually longer and more significant, e.g., firewall, or security might be examples.Has anyone done or seen a query that has the kind of smarts to pick out significant keywords an group related feedback rows together? |
|
|
blindman
Master Smack Fu Yak Hacker
2365 Posts |
Posted - 2006-02-13 : 11:37:32
|
| I have my own little "help desk" application that parses out text into individual words, evaluates word relevance by frequency, and then uses a algorithm to identify similar text strings. But it is complicated and involved several tables, sprocs, and functions. |
 |
|
|
SamC
White Water Yakist
3467 Posts |
Posted - 2006-02-13 : 12:06:47
|
quote: Originally posted by blindman I have my own little "help desk" application that parses out text into individual words, evaluates word relevance by frequency, and then uses a algorithm to identify similar text strings. But it is complicated and involved several tables, sprocs, and functions.
If it were easy, I wouldn't post here... Parsing into individual words isn't too bad. Relevance by frequency is dodgy, a key word may appear only once. What's the algorithm that groups similar text strings like? |
 |
|
|
blindman
Master Smack Fu Yak Hacker
2365 Posts |
Posted - 2006-02-13 : 13:41:47
|
| A keyword is MORE relevant if it appears only once. It's the frequent words that hold little discriminatory information.The relevance formula I use is:1-Exp(Sum(Log(1-1/(10^(10*[Frequency])))))Which is essentially the 1 minus the sum product of 1-Frequency of each word used. Yields a value between 0 (no relevance) to 1 (high or exact relevance). |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2006-02-13 : 13:49:51
|
| We provide a drop-down-list on our Feedback page for users to categorise the issue. If the user arrives at the feedback page from some obvious route ("I want to return my parcel") we pre-allocate the code for them (i.e. "RETURNS").This is lo-tech compared to what you are talking about, so I only mention it in case it gives you any ideas!Kristen |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2006-02-13 : 13:51:38
|
| P.S. I am very interested in what you come up with, though, because its very relevant to what we do in our FeedBack page too. Problem is that users are fickle in how they describe things and our clients, to date, take user-support very seriously so use humans to promptly, and accurately, answer the questions and are [mercifully!] unlikely to compromise on that. But if I could give them better first-shot routing of issues that worked most of the time I'm sure they would be pleased!Kristen |
 |
|
|
SamC
White Water Yakist
3467 Posts |
Posted - 2006-02-13 : 15:40:23
|
quote: Originally posted by blindman A keyword is MORE relevant if it appears only once. It's the frequent words that hold little discriminatory information.The relevance formula I use is:1-Exp(Sum(Log(1-1/(10^(10*[Frequency])))))Which is essentially the 1 minus the sum product of 1-Frequency of each word used. Yields a value between 0 (no relevance) to 1 (high or exact relevance).
Oh! You are teasing us. Must I ask a question for each step? Once you have relevance calculated, for each parsed word: - Do you do any other filtering to remove simple words: and, all, but, etc. - How do you use the relvance to GROUP BY? I could guess at a few methods, but what have you done yourself? |
 |
|
|
blindman
Master Smack Fu Yak Hacker
2365 Posts |
Posted - 2006-02-13 : 17:06:45
|
| I am such a tease.Once frequency is calculated for the words, the formula I gave places very little relevance on frequent words. Only words that occur less than 10-20 percent of the time really impact the results. If I remember correctly, you can adjust the sensitivity of this by modifying the "10" value, with lower values being less sensitive to rarity.At this point I am no doing filtering, because my dataset is not prohibitively large, but I do have a boolean IGNORE field on my word list so that at some point when I have sufficent data I can mark words to exclude from relevance calculations.I don't group by the data, but instead show related text values that pass a certain relevance level (75% at the moment). Or you could do select TOP N. At any rate, I am only calculating relevance to a single record at a time.Honestly, I don't see how you could "group by" a fuzzy value anyway. It's very possible that record AAA could be similiar enough to BBB to be related, and BBB could be similiar to CCC, and AAA might NOT be similiar to CCC. Fuzzy relationships cannot be counted upon to be transitive.Wow. The Transitive property. I remember back in high-school wondering what possible use I would ever get out of knowing transitivity.I'll be damned. |
 |
|
|
SamC
White Water Yakist
3467 Posts |
Posted - 2006-02-13 : 17:58:11
|
| Still, having a revance factor on each word doesn't intuitively tell me how to group similar feedback together. I'm not worried about erronious groupings. Getting feedback in rough groups is better than none.So... Sort the words in each feedback by relevance, then group by words in col1, then col2? |
 |
|
|
blindman
Master Smack Fu Yak Hacker
2365 Posts |
Posted - 2006-02-14 : 00:11:47
|
| Whatever. That logic would have to be pretty specific to the needs of your application and the vagaries of your data.How are you going to deal with records that satisfy the fuzzy criteria for more than one group? Are you going to allow them to be repeated in each group? And how do you group by a fuzzy value that is by its nature "ill-defined"?I suppose you could create a table of representative text strings and then group each record with the sample string it most closely matches.I think you are going to find it difficult to define the exact requirements and behavior of your fuzzy logic, and the basic rule of coding is "If you can't describe it verbally, you can't desribe it programatically."I hope that I have given you some ideas to play around with, and I'd be interested in hearing about what solution you finally decide to implement. |
 |
|
|
mmarovic
Aged Yak Warrior
518 Posts |
Posted - 2006-02-14 : 04:48:17
|
| Reading the discussion an idea came up to my mind. I haven't implemented anything similar, so it may be just foolish, not feasable at all.Anyway, there is data mining clustering algorithm for grouping "the cases". I guess you can use it regurarly (e.g. each month) to find words comming toghether and build the structure that will register groups and members. Later you can use groups and members for your reports.Or instead of clustering you may just build groups and members using some other method. |
 |
|
|
blindman
Master Smack Fu Yak Hacker
2365 Posts |
Posted - 2006-02-14 : 10:46:44
|
| OK, here is a possible algorithm I will toss out.Think of each record as a point in N-dimensional space, and imagine that this N-dimensional space and points within it are subject to a gravitational force. Over time the points are pulled towards their closest neighbors to form clumps, or groups.Start by using a similarity algorithm such as the one I gave to determine which two points are closest to eachother, and then aggregate them by merging their word lists into a new record. Repeat until you have reduced the universe to the number of groups you want.My concern would be that this would require extensive processing time, but it may be feasable to do this as a nightly batch process. |
 |
|
|
|