Scribe is a bit buggy with displaying this presentation:
Scaling a Widget Company
Monday, June 29, 2009
Thursday, June 11, 2009
PHP mysql connect and using APC to recover
So, when you connect on the fly to a database your subject to a variety of issues, like when the db is not available and when the db does not have a route.
One of the main reasons why a dev may want to connect on the fly is because they have too many front ends to hold a persistent connection on the backends. Since mySQL does not use libevent, holding open threads to mySQL is much more costly. Threads == Memory.
But, that's here nor there. The main purpose of this post is to talk about how to recover from failed connections that block apache threads.
Common Failures:
No route to Host
Flapping NIC
Locked Tables
Recovering from a Crash
more of the same.
My Environment:
I have a bunch of webservers (200+) that all have 300 possible threads (60000 possible connections to a single DB) behind a load balancer that uses the LB least connections protocol to distribute load across all 200+ webservers.
Since PHP is used in a stateless mode (no guarentee that the same user will hit the same server), I have to have some way of telling all other apache procs for said box that a server is down. I use stateless on the fly connections, so each apache proc will test the connection.
Here in lies my most common problem. If a box in the backend dies, all apache threads will block for a predefined time-out.
In /etc/php.ini (it could be in /etc/php.d/mysql.ini) I set
mysql.connect_timeout = 5 // the default is 60
In my common_db class when connecting to a database, the connect routine returns a database handle object
Now PRIOR to calling the database connect code I check to see if the IP is up.
But, there could be a variety of issues that can call false positives, like network flaps, someone blocking the db for some time etc. So, I'll allow one request .1% of the time bypass the status check to try again.
But, if your app can't connect to the DB aren't you down anyway?
Yes, although I don't have to restart all the httpds.
Also each DB has a redundant pair to when returning false, the app code will try the other set of servers.
I have a variety of methods to deal with these scenarios but this is the quickest to implement.
One of the main reasons why a dev may want to connect on the fly is because they have too many front ends to hold a persistent connection on the backends. Since mySQL does not use libevent, holding open threads to mySQL is much more costly. Threads == Memory.
But, that's here nor there. The main purpose of this post is to talk about how to recover from failed connections that block apache threads.
Common Failures:
No route to Host
Flapping NIC
Locked Tables
Recovering from a Crash
more of the same.
My Environment:
I have a bunch of webservers (200+) that all have 300 possible threads (60000 possible connections to a single DB) behind a load balancer that uses the LB least connections protocol to distribute load across all 200+ webservers.
Since PHP is used in a stateless mode (no guarentee that the same user will hit the same server), I have to have some way of telling all other apache procs for said box that a server is down. I use stateless on the fly connections, so each apache proc will test the connection.
Here in lies my most common problem. If a box in the backend dies, all apache threads will block for a predefined time-out.
In /etc/php.ini (it could be in /etc/php.d/mysql.ini) I set
mysql.connect_timeout = 5 // the default is 60
In my common_db class when connecting to a database, the connect routine returns a database handle object
if (PEAR::isError($dbh) && $delta_to_connect >= 5){
#
# mark ip as dead for 15 min
#
apc_store($ip, array('DEAD'), 900);
return false;
}
return $dbh;
Now PRIOR to calling the database connect code I check to see if the IP is up.
$status = apc_fetch($ip);
if ($status[0] === 'DEAD'){
return false;
}
... do connect ...
But, there could be a variety of issues that can call false positives, like network flaps, someone blocking the db for some time etc. So, I'll allow one request .1% of the time bypass the status check to try again.
if ($status[0] === 'DEAD' && (mt_rand(0, 1000) != 1)){
return false;
}
But, if your app can't connect to the DB aren't you down anyway?
Yes, although I don't have to restart all the httpds.
Also each DB has a redundant pair to when returning false, the app code will try the other set of servers.
I have a variety of methods to deal with these scenarios but this is the quickest to implement.
Posted by
Dathan Vance Pattishall
at
4:06 PM
2
comments
Links to this post
Labels:
apache,
apc,
connect code,
php
Tuesday, May 19, 2009
Multi Direction Sorts and avoiding a file sort
There are two PRIMARY directions to sort data in SQL: Ascending (ASC) and Descending DESC.
When these two sort definitions are put together in a single statement a filesort is produced.
Why do we want to avoid filesorts?
Filesorts are bad. 1st they tickle a thread based buffer called sort_buffer_size. Additionally filesorts reads the data twice, unless max_length_for_sort_data limit is reached and as a result the Filesort runs slower to reduce disk I/O. If you want filesorts to run faster at the expense of the disk increase the default max_length_for_sort_data. You can read the filesort algorithm here.
So, here is an example
Notice the filesort? So how does one get around this filesort?
Well
Let's define some roles for columns C and D. C is the parent while D is the child.
We want all the latest parents (C)
We want all the oldest children (D)
We require pagination of all the PARENTS (show 10 parents per page) so Queries like this is PRODUCED
Now
FOREACH($C_parent as $i => $c_id) {
$C_parent[$i] = SELECT SQL_CALC_FOUND_ROWS * FROM ABCD WHERE A=? AND B=? AND C=$c_id ORDER BY D ASC LIMIT 1;
}
So, we changed 1 query into 11 queries (10 parents per page) to make the page load happen faster, by getting rid of the filesort.
What 11 queries is faster then 1? Yes, for this case it is. The reason is because filesorts are SLOOOOW, they chew up a lot of limited resources and they should be avoided. I've see filesorts take close to 50-60% of the query time.
When these two sort definitions are put together in a single statement a filesort is produced.
Why do we want to avoid filesorts?
Filesorts are bad. 1st they tickle a thread based buffer called sort_buffer_size. Additionally filesorts reads the data twice, unless max_length_for_sort_data limit is reached and as a result the Filesort runs slower to reduce disk I/O. If you want filesorts to run faster at the expense of the disk increase the default max_length_for_sort_data. You can read the filesort algorithm here.
So, here is an example
CREATE TABLE `ABCD` (
`A` int(10) unsigned NOT NULL default '0',
`B` int(10) unsigned NOT NULL default '0',
`C` int(10) unsigned NOT NULL default '0',
`D` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`a`,`b`,`c`,`d`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
mysql> explain SELECT * FROM ABCD WHERE a=1 AND b=1 ORDER BY c DESC, d ASC\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: ABCD
type: ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 8
ref: const,const
rows: 2
Extra: Using where; Using index; Using filesort
1 row in set (0.00 sec)
Notice the filesort? So how does one get around this filesort?
Well
Let's define some roles for columns C and D. C is the parent while D is the child.
We want all the latest parents (C)
We want all the oldest children (D)
We require pagination of all the PARENTS (show 10 parents per page) so Queries like this is PRODUCED
SELECT * FROM ABCD WHERE A=? AND B=? ORDER BY C DESC
explain SELECT * FROM ABCD WHERE a=1 AND b=1 ORDER BY c DESC LIMIT 10\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: ABCD
type: ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 8
ref: const,const
rows: 2
Extra: Using where; Using index
1 row in set (0.00 sec)
Now
FOREACH($C_parent as $i => $c_id) {
$C_parent[$i] = SELECT SQL_CALC_FOUND_ROWS * FROM ABCD WHERE A=? AND B=? AND C=$c_id ORDER BY D ASC LIMIT 1;
}
So, we changed 1 query into 11 queries (10 parents per page) to make the page load happen faster, by getting rid of the filesort.
What 11 queries is faster then 1? Yes, for this case it is. The reason is because filesorts are SLOOOOW, they chew up a lot of limited resources and they should be avoided. I've see filesorts take close to 50-60% of the query time.
Posted by
Dathan Vance Pattishall
at
12:26 PM
2
comments
Links to this post
Labels:
filesort,
optimization,
order by
Thursday, April 23, 2009
Restoring a backup of a Circular Replication config
So say you have two mySQL servers called A and B. A and B are in a circular replication ring meaning
A replicates from B
B replicates from A
In addition, log-slave-updates is on, so if one server of either A or B goes down you can recover all the data from either side.
Therefore, here are some steps.
Take a snapshot of a server in the ring, with your favorite backup tool (mine is ibbackup)
When either server A or B dies, restore from said backup. Here are the steps. Let us assume server A goes down, and backups are done from server B.
restore the backup of server B to server A
On server-A turn off log-slave-updates (you will see why on the next line)
Add replicate-same-server-id on server A to my.cnf
CHANGE MASTER TO to the position and file of the binary log that is reported from your tool on server-A.
IF the binary log did not get corrupted on server-B your cool, else flip the binary log to the next binary log from server-A on server-B.
Let server-A catch up
Stop server-A
Add log-slave-updates on server-A
remove replicate-same-server-id on server A
Start server-A
done.
A replicates from B
B replicates from A
In addition, log-slave-updates is on, so if one server of either A or B goes down you can recover all the data from either side.
Therefore, here are some steps.
Take a snapshot of a server in the ring, with your favorite backup tool (mine is ibbackup)
When either server A or B dies, restore from said backup. Here are the steps. Let us assume server A goes down, and backups are done from server B.
restore the backup of server B to server A
On server-A turn off log-slave-updates (you will see why on the next line)
Add replicate-same-server-id on server A to my.cnf
CHANGE MASTER TO to the position and file of the binary log that is reported from your tool on server-A.
IF the binary log did not get corrupted on server-B your cool, else flip the binary log to the next binary log from server-A on server-B.
Let server-A catch up
Stop server-A
Add log-slave-updates on server-A
remove replicate-same-server-id on server A
Start server-A
done.
Wednesday, April 08, 2009
Recover from No-Route to Host or Storage Engine Plugin Crashes
There are a finite set of cases where mySQL clients will hang on a connection-for a small period of time (seconds) or indefinitely. Most notably no-route to host causes a timeout to occur which in most clients are on the order of seconds to recovery.
In a web environment if a database is connected to on the fly, each connection made should take milliseconds-when the net is healthy. However, when a database server crashes a timeout for each connection takes seconds and there are cases where connections can just hang (recovery of INNODB as an example). For instance have you ever had a ssh session hang and a kill (SIGHUP..) does not work? This happens for mySQL client connections too.
I want the front ends to recover gracefully from a no-route to host, or more notably a hanged connection condition. I want to avoid that damn timeout all together on stateless connections. Why? Because if you use up all your worker httpd threads requests Fail-things crash, bosses get mad, everyone wakes up and you have a meeting about it later. So what are ways around this?
Use a Load balancer
This is cool and all, but is expensive since two or three are needed for HA purposes. In addition, this increases the complexity of managing servers, and most LBs are optimized for HTTPD traffic, not raw TCP traffic.
Use a memcache layer to keep DB state
This is cool as well, but the state server, which collects the state of your servers and updates memcache, is now a single point of failure-and a lot of time is spent to make this resilient. Not to mention all clients across Programming languages (bash scripts, Java, PHP, Python, etc) need to have access to this memcache layer and logically handle the connection the same. This is doable, but there is an easier way.
Use IRON DNS:
IRON DNS is a term I use to build an HA Resolver. IF a box fails a health check, a nagios event handler can tell IRON DNS to set the domain names IP address to 127.0.0.1 or a routable interface that produces a Connection Refused (111 instead of 110).
For an internal network, all database entries should be able to fit in memory. All resolves should take less then 2ms, and updates to DNS entries can happen in just a few seconds across your entire farm. The only drawback is if the DNS server fails to respond your site is boned. Making DNS Failure proof is rather easy-which is another post entirely.
For my solution, I am using DNS. This allows me to recover servers in a shard on different boxes without having to change code. This also allows my environment to recover from blocked I/O events that can spike load on the front-end that make requests slow.
If you have a better method, or want to advise me on flaws I should watch out for, drop a comment.
In a web environment if a database is connected to on the fly, each connection made should take milliseconds-when the net is healthy. However, when a database server crashes a timeout for each connection takes seconds and there are cases where connections can just hang (recovery of INNODB as an example). For instance have you ever had a ssh session hang and a kill (SIGHUP..) does not work? This happens for mySQL client connections too.
I want the front ends to recover gracefully from a no-route to host, or more notably a hanged connection condition. I want to avoid that damn timeout all together on stateless connections. Why? Because if you use up all your worker httpd threads requests Fail-things crash, bosses get mad, everyone wakes up and you have a meeting about it later. So what are ways around this?
Use a Load balancer
Use a memcache layer to keep DB state
Use IRON DNS:
For an internal network, all database entries should be able to fit in memory. All resolves should take less then 2ms, and updates to DNS entries can happen in just a few seconds across your entire farm. The only drawback is if the DNS server fails to respond your site is boned. Making DNS Failure proof is rather easy-which is another post entirely.
For my solution, I am using DNS. This allows me to recover servers in a shard on different boxes without having to change code. This also allows my environment to recover from blocked I/O events that can spike load on the front-end that make requests slow.
- No Route to Host
- INNODB recovery
- Disk Fails (disk fills up)
- Switch Fails / Flaps
- Plug-in storage engine crashes, yet mySQL is up
- DNS Resolve fails
Here are some conditions where connections take to long:
If you have a better method, or want to advise me on flaws I should watch out for, drop a comment.
Tuesday, March 31, 2009
What do you think about adding ZLIB to memcache storage
Memcache is a fantastic Hash table-very fast and one of the great successes of Brad Fitzpatrick-who in my opinion has done more for the open social movement as an individual then anyone else. I use memcache quite extensively, now I am thinking about adding ZLIB native to compress the value of each key-much like how INNODB does with the Barracuda file format. The theory is with a CPU hit, we can store more data per memcache instance. I've talked to the Northscale guys and they love the idea. What do you think?
Update: Well what do you know
http://us3.php.net/manual/en/function.memcache-setcompressthreshold.php
for PHP for instance compresses the data on the client side.
There still might be some value compressing the data on the server-side, but now I'm not as motivated.
What might be a good alternative is to compress keys into 8 byte longs in memcached automatically instead of the actual string that can be huge. To give some more detail,
8 byte longs is a 64-bit int. A string can easily be converted into a big int by bit manipulation - and the address space is huge so key conflict is effectively removed.
Update: Well what do you know
http://us3.php.net/manual/en/function.memcache-setcompressthreshold.php
for PHP for instance compresses the data on the client side.
There still might be some value compressing the data on the server-side, but now I'm not as motivated.
What might be a good alternative is to compress keys into 8 byte longs in memcached automatically instead of the actual string that can be huge. To give some more detail,
8 byte longs is a 64-bit int. A string can easily be converted into a big int by bit manipulation - and the address space is huge so key conflict is effectively removed.
Monday, March 23, 2009
Cloud computing without spending money
Cloud computing is the big thing now days, weather you are an app developer using EC2 or the Google APP engine, or a new company trying to build your own cloud product. If you are hosting, or using a platform it costs money. I hate to spend money, especially money that is for my company. If I find an idle server, I use it to 100% utilization (prior to the saturation point).
I needed to build a new application that periodically crawls a website to update various lists. Building a crawler is expensive, especially from scratch. First, you have to define the amount of lag that is allowed from the crawl copy and the real copy. Of course the Project Manager does not want any lag, all events must be caught and near real-time without overloading the source of the data-but I am not hating, it is a challenge. Next, what technology to use, what language to write the app in-what considerations are left to be defined? How does one crawl Gigs, Tera, and amount of data in a guarantee period? On top of that, how much additional hardware is this going to cost. In addition, to be a cloud it needs to have an API so app developers can set, and get consistent data in an expected period. This is a lot of freaking requirements.
Therefore, to solve this issue, I know that mySQL will store the data, but getting the data is the hard part. This is what is going to cost money, lots of it. I looked around common architectures and found that nothing would do what I wanted to do in a cost effective manner. So, I designed my own using Seti @ Home as the basis for the design.
Get to the point already Dathan:
I have turned ever user who views my applications into a collector, using idle bandwidth without knowing who is collecting the data. My user base on spare cycles will fetch a feed of my choosing, and sends that data to my servers without any personal information. Instead of using an Amazon or Google service, I have turned my user base into a cloud to service their needs.
It is rather awesome-I must say. I am able to service the needs of more than 60 million users at the cost of development time, and NO NEW HARDWARE. The cloud does not have to be a service provider-it can be the end user as long as the end user is not impacted by the requests. BTW the team that I manage is freaking awesome-they built my vision with trial and error and a hand waved spec.
Currently the system scales as long as there is enough end users. If I lose all my users then well I am boned, but to support the feeds all I need is 100K nodes at the current rate. With 60 million end nodes, I am cool.
Imagine if Google with Adsense used this install base to tell Google if the data has changed for an arbitrary web address. All it needs is a few people to hit the same url, inform Google that the web address in question has a different checksum, then at that point Google’s crawlers can go fetch it, instead of constantly crawling data that doesn't change. Google would be able to reduce overall server cost significantly, if it just knew what data has changed instead of guessing what data has changed.
By next years Velocity conferance I hope to have a full disclosure on what technology my team used, how my team get around cross domain issues, and how to compute checksums to validate the data.
PS - I designed this, with my team we made it much better and one person implemented it and owns the product from this point on.
I needed to build a new application that periodically crawls a website to update various lists. Building a crawler is expensive, especially from scratch. First, you have to define the amount of lag that is allowed from the crawl copy and the real copy. Of course the Project Manager does not want any lag, all events must be caught and near real-time without overloading the source of the data-but I am not hating, it is a challenge. Next, what technology to use, what language to write the app in-what considerations are left to be defined? How does one crawl Gigs, Tera, and amount of data in a guarantee period? On top of that, how much additional hardware is this going to cost. In addition, to be a cloud it needs to have an API so app developers can set, and get consistent data in an expected period. This is a lot of freaking requirements.
Therefore, to solve this issue, I know that mySQL will store the data, but getting the data is the hard part. This is what is going to cost money, lots of it. I looked around common architectures and found that nothing would do what I wanted to do in a cost effective manner. So, I designed my own using Seti @ Home as the basis for the design.
Get to the point already Dathan:
I have turned ever user who views my applications into a collector, using idle bandwidth without knowing who is collecting the data. My user base on spare cycles will fetch a feed of my choosing, and sends that data to my servers without any personal information. Instead of using an Amazon or Google service, I have turned my user base into a cloud to service their needs.
It is rather awesome-I must say. I am able to service the needs of more than 60 million users at the cost of development time, and NO NEW HARDWARE. The cloud does not have to be a service provider-it can be the end user as long as the end user is not impacted by the requests. BTW the team that I manage is freaking awesome-they built my vision with trial and error and a hand waved spec.
Currently the system scales as long as there is enough end users. If I lose all my users then well I am boned, but to support the feeds all I need is 100K nodes at the current rate. With 60 million end nodes, I am cool.
Imagine if Google with Adsense used this install base to tell Google if the data has changed for an arbitrary web address. All it needs is a few people to hit the same url, inform Google that the web address in question has a different checksum, then at that point Google’s crawlers can go fetch it, instead of constantly crawling data that doesn't change. Google would be able to reduce overall server cost significantly, if it just knew what data has changed instead of guessing what data has changed.
By next years Velocity conferance I hope to have a full disclosure on what technology my team used, how my team get around cross domain issues, and how to compute checksums to validate the data.
PS - I designed this, with my team we made it much better and one person implemented it and owns the product from this point on.
Thursday, March 12, 2009
Walking an INNODB table Fast
Walking a table means, traversing each row, commonly used in building queues, fixing data, or dumping a table. I've recently ran into a problem-caused by an assumption, where walking a table was taking way to long using the method
The assumption was since INNODB uses a cluster index, this would traverse the table using the PRIMARY key. This is not the case, its not a problem in INNODB but a bad assumption, that I fell victim to. A table scan to each $pos occurs producing a Big-O of N^2. So, when the query:
SELECT col FROM TABLE LIMIT 1000000, 1000 is executed mySQL will scan all the rows up to row position 1001000 and for each subsequent iteration.
This is SLOOOOW. IMHO since the table is sorted by the primary key, mySQL should optimize this case - but it does not and will not. So, to walk an INNODB table fast, and keep liner time or a Big-O of N an alternative is
This dumps a table very fast, almost as fast as doing a count(*) on the PRIMARY KEY.
Another method is to
SELECT col INTO OUTFILE "/dir/file.ids" FROM TABLE;
but the data is local to the database - thus the need for the application to grab data. Another draw back of this method is that the dump produces more disk IO then walking a table off of a key, slowing down access to this table.
In conclusion, even if the storage engine keeps the table order consistent like INNODB does, do not assume that LIMIT 100000, 1000 is equivalent to a file seek of position 100000, without telling the Optimizer to use an index.
$pos = 0;
do {
$result = SELECT col FROM TABLE LIMIT $pos, 1000;
$pos += 1000;
} while ($result);
The assumption was since INNODB uses a cluster index, this would traverse the table using the PRIMARY key. This is not the case, its not a problem in INNODB but a bad assumption, that I fell victim to. A table scan to each $pos occurs producing a Big-O of N^2. So, when the query:
SELECT col FROM TABLE LIMIT 1000000, 1000 is executed mySQL will scan all the rows up to row position 1001000 and for each subsequent iteration.
This is SLOOOOW. IMHO since the table is sorted by the primary key, mySQL should optimize this case - but it does not and will not. So, to walk an INNODB table fast, and keep liner time or a Big-O of N an alternative is
$last_id = 0
do {
$result = SELECT col FROM TABLE USE INDEX(PRIMARY) WHERE pkey_part > $last_id LIMIT 1000
$last_id = $result[count($result) - 1]->pkey_part
}while($result);
This dumps a table very fast, almost as fast as doing a count(*) on the PRIMARY KEY.
Another method is to
SELECT col INTO OUTFILE "/dir/file.ids" FROM TABLE;
but the data is local to the database - thus the need for the application to grab data. Another draw back of this method is that the dump produces more disk IO then walking a table off of a key, slowing down access to this table.
In conclusion, even if the storage engine keeps the table order consistent like INNODB does, do not assume that LIMIT 100000, 1000 is equivalent to a file seek of position 100000, without telling the Optimizer to use an index.
Subscribe to:
Posts (Atom)

