So, I had an app that was sticking data into S3 when the data was marked or qualified as warehouse data. The problem with this method was when retrieving the data from S3 it was not chunked but appended to. So pulling 100M over the wire, parsing, sorting and adding to the db would fail since the process consumed more then its allowed share of resources. Plus grab data from S3 has a very high response time, plainly put its slow.
So, since the warehouse data is prone to error-the main reason why I needed a new solution, I decided to use ZLIB compression that is native to Innodb's Barracuda build.
To install innodb plugin here is a good write up I will not duplicate it.
Configuring the plugin you must set in your my.cnf
LOAD the INNODB plugin via a few commands and tada your done.
Next define your table with
ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4
I use utf8 for everything, KEY_BLOCK_SIZE will define how much extra overhead the CPU will have to work, think of this as a compression level. I use 4K since some rough benchmarks I've done shows that this is the sweet spot for my INSERT rate + Select rate.
Finally I stick the entire dataset on a slow RAID-5 array. I use RAID-5 because when I want the data-I want to get it fast and put it into the production dataset real-time. I need the disk space and redundancy. The writes are not extraordinary and not very expensive from a I/O perspective. I'm getting 60% reduction in the INNODB table space. On top of that the format of the data is the same format as the production system. So I can pull this data fast write to the shard in a single httpd process. I can migrate archive data to production quality in less then a second for every 3000 rows. I qualify this as a win. We'll see how stable this is over the next few weeks.
So in conclusion, I've replaced S3 with a cheaper solution, faster solution, I put archived user data on cheap servers, and as a result the price per user goes down, since my capacity plan is based off a function of disk growth-on my fast disk cluster. Next I can store my backups on S3 instead of semi-live data :)