Tuesday, June 12, 2007

Unique strings in a text field

So, I want to reduce data usage of a text field, by storing unique strings separated by a delimiter. So to do so I came up with this:

Given a table


CREATE TABLE `hmm` (
`a` int(10) unsigned NOT NULL default '0',
`b` text NOT NULL,
`c` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

INSERT INTO hmm VALUES (1, 'hmm8',1) ON DUPLICATE KEY UPDATE b = IF(FIND_IN_SET('hmm8', b)>0, b, CONCAT_WS(',',b, 'hmm8')), c = c+1;

SELECT * FROM hmm;
+---+--------------------------+---+
| a | b | c |
+---+--------------------------+---+
| 1 | hmm4,hmm5,hmm6,hmm7,hmm8 | 7 |
+---+--------------------------+---+



Let's break apart the INSERT statement, if column 'a' produces a duplicate ON DUPLICATE KEY UPDATE will issue an IF statement

FIND_IN_SET will return the position of the found string. The logic of IF say if FIND_IN_SET is is greater then 0 i.e. the string was found, update column b with b itself (do nothing) else concat the column with the column contents and the new string separated by a comma. Additionally in all cases increment column c by 1.

Why do it this way? Well I removed the need to do a SELECT and provided a solution to store distinct text values instead of the raw value, thus saving space and resources on network transfers etc.

A little quick statement that might be useful for others.

3 comments:

Anonymous said...

Why are you sacrificing normalization that way? Wouldn't it be more appropriate to use a second table which contains all the strings for a given entry in the master table?

Dathan Pattishall said...

sure, why not. This is an example of how to store unique strings in a text field.

Personally I'm not a fan of normalization, since I can get faster lookups when I don't normalize in many cases.

Anonymous said...

This is a bad solution to what isn't really a problem; If your normalization is slower than this, then your normalization is wrong.

Offloading the given processing onto the database does not scale well. It also makes indexing on the values, well, not really possible. Placing an index across multiple fields will simplify processing, speed up the search, and is nicely scalable.

In a world where TB drives are commonplace and 10Mb connections are considered slow, there's no reason to do this sort of thing. Unless, of course, you're developing on an embedded platform with a very small footprint. Which then begs the question, why use MySQL in the first place???