Tuesday, June 12, 2007

Unique strings in a text field

So, I want to reduce data usage of a text field, by storing unique strings separated by a delimiter. So to do so I came up with this:

Given a table

`a` int(10) unsigned NOT NULL default '0',
`b` text NOT NULL,
`c` int(10) unsigned NOT NULL default '0',

INSERT INTO hmm VALUES (1, 'hmm8',1) ON DUPLICATE KEY UPDATE b = IF(FIND_IN_SET('hmm8', b)>0, b, CONCAT_WS(',',b, 'hmm8')), c = c+1;

| a | b | c |
| 1 | hmm4,hmm5,hmm6,hmm7,hmm8 | 7 |

Let's break apart the INSERT statement, if column 'a' produces a duplicate ON DUPLICATE KEY UPDATE will issue an IF statement

FIND_IN_SET will return the position of the found string. The logic of IF say if FIND_IN_SET is is greater then 0 i.e. the string was found, update column b with b itself (do nothing) else concat the column with the column contents and the new string separated by a comma. Additionally in all cases increment column c by 1.

Why do it this way? Well I removed the need to do a SELECT and provided a solution to store distinct text values instead of the raw value, thus saving space and resources on network transfers etc.

A little quick statement that might be useful for others.


Anonymous said...

Why are you sacrificing normalization that way? Wouldn't it be more appropriate to use a second table which contains all the strings for a given entry in the master table?

Dathan said...

sure, why not. This is an example of how to store unique strings in a text field.

Personally I'm not a fan of normalization, since I can get faster lookups when I don't normalize in many cases.

snev said...

This is a bad solution to what isn't really a problem; If your normalization is slower than this, then your normalization is wrong.

Offloading the given processing onto the database does not scale well. It also makes indexing on the values, well, not really possible. Placing an index across multiple fields will simplify processing, speed up the search, and is nicely scalable.

In a world where TB drives are commonplace and 10Mb connections are considered slow, there's no reason to do this sort of thing. Unless, of course, you're developing on an embedded platform with a very small footprint. Which then begs the question, why use MySQL in the first place???

Anonymous said...



A片,色情,成人,做愛,情色文學,A片下載,色情遊戲,色情影片,色情聊天室,情色電影,免費視訊,免費視訊聊天,免費視訊聊天室,一葉情貼圖片區,情色,情色視訊,免費成人影片,視訊交友,視訊聊天,視訊聊天室,言情小說,愛情小說,AIO,AV片,A漫,av dvd,聊天室,自拍,情色論壇,視訊美女,AV成人網,色情A片,SEX