So, I'm developing a new application that's in the capacity planning phase. The application is designed to scale linearly and scale is very easy to do just by adding another database server. But, at what point do you upgrade? What point in the applications life do you add new servers to serve the expectation of the users?
To figure out these questions, thresholds need to be defined. When these thresholds are exceeded then upgrades need to occur.
Here is a process (my process) of questions to ask to define thresholds: First I start out with generic questions listed below.
What is the expected amount of users who are going to use the product initially?
What is the expected usage pattern? For instance if they are adding data what is the add rate? If they are reading data what is the read rate?
Given that the first two questions are answered now ask when is it going to break. If you can answer this your golden. This is the hard part. How do you know when it's going to break and under what conditions without putting the service live. The best thing to do IMHO is to benchmark the system under typical usage patterns then double that.
In my case, I know for sure that it will work on launch and for months there after but I don't know when it's going to fail. Since getting hardware here is an involved process I need to know ahead of time when things will fail-it makes us all honest.
To get an idea on when it will break, I'm gauge'ing the passive additions that are in production now to get a baseline on how much data grows from day to day. In my case the dataset grows 10GB per day spread across 5 servers, thus 2GB per day from a single point of server view. Now I have a base line. In how many days will the application fail to perform under thresholds previously set? The thresholds set, are defined as-all data retrieval and addition must not take longer then 300ms for all components involved. So, when the passive additions on average take 20% of 300ms then I know the application is about to hit my own personal saturation point, thus I must ask for new equipment.
But, the problem is not as clear cut, really I need to answer the question does InnoDB performance degrade at O(nlogn) when adding strings to it? If not what is the degradation of string addition and retrieval when the dataset is HUGE like @40-200+ GB
(Where did I get log(n) from? Well, indexes in INNODB are B-Tree's-I/O performance should degrade at nlogn as data grows.)
Some good reading mySQL Insert-Speed
This is a little old but it's a good approximation.
Things of note:
innodb_buffer_pool_size=12000M
innodb_additional_mem_pool_size=20M
innodb_log_file_size=512M
innodb_log_buffer_size=16M
innodb_flush_log_at_trx_commit=2
I must watch that the merged records count stays within a few hundred thousand of the inserted record count in the insert buffer and adaptive hash index part of SHOW ENGINE INNODB STATUS IF it doesn't then INNODB is hitting it's own limitation.
Wednesday, September 26, 2007
Friday, September 14, 2007
The mySQL Optimizer and your applications
What is the mySQL Optimizer? Well it's the part of the mysql query engine that determines what index to use, based on given sql. It sucks in every version that I tested, which is
3.23
4.0
4.1
5.0
5.1
I find that in nearly every app that I've end up writing I need to create my own optimizer to determine what index to use.
Why?
Well, imagine this. Your table has 30 indexes and many of the prefixes of compound indexes are the same. The reason is because given a question you want to sort the data quickly in different views. The optimizer sucks at figuring out which index to use under these cases (and many others)
For example:
Now the table contains an index on
Which index should mysql pick?
Well since perms is a range and karma is a range then using the index (owner_id, perms, karma, date_create, photo_id) to avoid a filesort from an ORDER BY is not possible. The reason to avoid filesorts especially in cases of large amounts of data, is because a filesort is 60% of the query time. So, the course of action is to pick the index where a filesort is avoided, thus
owner_id, date_create, photo_id should be picked.
But in fact mysql doesn't pick the correct index, ever. Innodb especially from query to query execution of the same type query the index changes over and over due to how innodb determines cardinality.
Now, you may be thinking why not just run analyze table? Still mysql will pick the wrong index, and Analyze table for INNODB only dumps the innodb_buffer_pool which sucks.
How does one get around this?
Write your own optimizer! Here is some PHP snippits
Why use strpos? Well, it's faster then preg_match and strstr and I only care about the existence of SQL fields in the where and sort clause, not the contents of the rvalue of each column.
3.23
4.0
4.1
5.0
5.1
I find that in nearly every app that I've end up writing I need to create my own optimizer to determine what index to use.
Why?
Well, imagine this. Your table has 30 indexes and many of the prefixes of compound indexes are the same. The reason is because given a question you want to sort the data quickly in different views. The optimizer sucks at figuring out which index to use under these cases (and many others)
For example:
SELECT * FROM SomeTable WHERE owner_id = ? AND perms IN (0,1) AND karma IN (0,1) ORDER by date_create DESC, photo_id DESC limit 10
Now the table contains an index on
owner_id, perms, karma, date_create,photo_id
owner_id, date_create, photo_id
and 28 other index.
Which index should mysql pick?
Well since perms is a range and karma is a range then using the index (owner_id, perms, karma, date_create, photo_id) to avoid a filesort from an ORDER BY is not possible. The reason to avoid filesorts especially in cases of large amounts of data, is because a filesort is 60% of the query time. So, the course of action is to pick the index where a filesort is avoided, thus
owner_id, date_create, photo_id should be picked.
But in fact mysql doesn't pick the correct index, ever. Innodb especially from query to query execution of the same type query the index changes over and over due to how innodb determines cardinality.
Now, you may be thinking why not just run analyze table? Still mysql will pick the wrong index, and Analyze table for INNODB only dumps the innodb_buffer_pool which sucks.
How does one get around this?
Write your own optimizer! Here is some PHP snippits
function sometable_get_index($where, $sort) {
if(strpos($where, 'owner_id') !== FALSE) {
if (strpos($where, 'karma') !== FALSE) {
if (strpos($sort,'date_create') !== FALSE &&
strpos($where, 'IN') !== FALSE) {
return 'USE INDEX (owner_datecreate_id)';
}
if(strpos($where, 'perms') !== FALSE && !$sort) {
return 'USE INDEX (owner_perms_karma)';
}
}
MORE Parser logic....
}
Why use strpos? Well, it's faster then preg_match and strstr and I only care about the existence of SQL fields in the where and sort clause, not the contents of the rvalue of each column.
Subscribe to:
Posts (Atom)
