Parsing CSV Values Into Multiple RowsBy Rob Volk on 19 February 2001 | Tags: SELECT There have been quite a few requests on how to take a string of comma-separated values (CSV), and parse out the values individually. graz's article uses a stored procedure to extract each value into a variable. However, if you have a table with a column of CSV values, and you want to translate them into a normalized table, this technique becomes pretty hairy. You would need to use a cursor to loop through your CSV table and process each row. (A loop within another loop...I'm getting queasy) Fortunately there's a way to do this using just a regular SELECT. It requires a table of numeric values, but it does the job beautifully; you can INSERT normalized CSV values into a table with one SQL statement!
The tricky part is that each word can be a different length, and each CSV can have a different number of words in it. We'll need a separate table (Tally), with a single column of sequential integer values. The "tally table" is mentioned in Joe Celko's SQL for Smarties -- the best SQL book ever written -- as a handy source for a numeric sequence. Here, it will replace the loop needed to parse out each value from the CSV string. (I'll use an upper limit of 8,000; this is the maximum size of a SQL Server varchar value)
Here's the SQL statement that will chop up the Phrase column into each Word (and include the Author) and place them in separate rows:
SELECT Author, NullIf(SubString(',' + Phrase + ',' , ID , CharIndex(',' , ',' + Phrase + ',' , ID) - ID) , '') AS Word FROM Tally, Quotes WHERE ID <= Len(',' + Phrase + ',') AND SubString(',' + Phrase + ',' , ID - 1, 1) = ',' AND CharIndex(',' , ',' + Phrase + ',' , ID) - ID > 0 The best way to explain how this works is to use a single row from Quotes. The Shakespeare row contains the Phrase A,rose,by,any,other,name,smells,just,as,sweet. We'll use the SQL function SubString() to extract part of the Phrase. We'll begin with the first word, starting from the 1st character, and extending to the first comma in the string. To get A, the following values would work:
SELECT SubString('A,rose,by,any,other,name,smells,just,as,sweet', 1, 2) AS Word However, I don't know the position of the first (or any) comma. The CharIndex() function will tell me:
SELECT SubString('A,rose,by,any,other,name,smells,just,as,sweet', 1, CharIndex(',' , 'A,rose,by,any,other,name,smells,just,as,sweet', 1 ) - 1) AS Word I subtract 1 from the CharIndex() value to eliminate the comma from the end (A instead of A,). Here's where the Tally table values come into play. Instead of using a 1, I can substitute ID from Tally. This will let me extract each character, up to the comma:
SELECT SubString('A,rose,by,any,other,name,smells,just,as,sweet', ID, CharIndex(',' , 'A,rose,by,any,other,name,smells,just,as,sweet', ID ) - ID) AS Word FROM Tally A few problems come up. The last word of the phrase will not be parsed, because it appears after the last comma in the string. We'll get an error with ID numbers that are greater than the length of the string. For each ID value (3,4,5,6) this formula will read from that character position up to the next comma, resulting in 4 rows for rose, ose, se, e, when all we want is one row for the second word, rose. To get around these issues, we'll concatenate a comma to the beginning and end of the string. We'll use a WHERE clause to restrict ID values to less than or equal to the length of the string. For each word we extract, we will check the character immediately to the left of that word. If it is a comma, we know that we have started a new word; if not, it is a word fragment, and the WHERE clause will discard it:
SELECT SubString(',' + Phrase + ',' , ID , CharIndex(',' , ',' + Phrase + ',' , ID) - ID) AS Word FROM Tally, Quotes WHERE ID <= Len(',' + Phrase + ',') AND SubString(',' + Phrase + ',' , ID - 1, 1) = ',' Almost done! Suppose there are missing words (consecutive comma delimiters), but I want to include a row for them -- Coleridge is missing a few :) I'll use NullIf() to replace empty strings ('') with NULL. I'll add a condition to the WHERE clause to prevent NULL rows in the results; remove this condition if you want NULL rows included:
SELECT Author, NullIf(SubString(',' + Phrase + ',' , ID , CharIndex(',' , ',' + Phrase + ',' , ID) - ID) , '') AS Word FROM Tally, Quotes WHERE ID <= Len(',' + Phrase + ',') AND SubString(',' + Phrase + ',' , ID - 1, 1) = ',' AND CharIndex(',' , ',' + Phrase + ',' , ID) - ID > 0 --remove this line to keep NULL rows Finally, I use this SELECT as part of my INSERT statement to put them into the OnlyWords table (I removed the last WHERE condition to include the missing words from the Coleridge row):
INSERT INTO OnlyWords SELECT Author, NullIf(SubString(',' + Phrase + ',' , ID , CharIndex(',' , ',' + Phrase + ',' , ID) - ID) , '') AS Word FROM Tally, Quotes WHERE ID <= Len(',' + Phrase + ',') AND SubString(',' + Phrase + ',' , ID - 1, 1) = ',' You can wrap this up in a stored procedure that accepts any delimiter character (like graz's code) and parses using that delimiter. If you combine it with a little Dynamic SQL, you can pass a delimiter, source table(s), source column(s), and destination table/column, and use this for any table in your database! You could also modify it to count words in a string, or as an alternate way to extract complete words from a string. This link will list all of the CSV related articles on this site, including one by Garth on how to combine rows into one CSV value. -robvolk P.S. - Hopefully someone may apply the Tally table solution to the question that got me hooked on SQLTeam.com!
|
- Advertisement - |