Monday, June 29, 2009

mysql UC 2009 Talk

Scribe is a bit buggy with displaying this presentation:

Scaling a Widget Company

Thursday, June 11, 2009

PHP mysql connect and using APC to recover

So, when you connect on the fly to a database your subject to a variety of issues, like when the db is not available and when the db does not have a route.

One of the main reasons why a dev may want to connect on the fly is because they have too many front ends to hold a persistent connection on the backends. Since mySQL does not use libevent, holding open threads to mySQL is much more costly. Threads == Memory.

But, that's here nor there. The main purpose of this post is to talk about how to recover from failed connections that block apache threads.

Common Failures:
No route to Host
Flapping NIC
Locked Tables
Recovering from a Crash
more of the same.


My Environment:
I have a bunch of webservers (200+) that all have 300 possible threads (60000 possible connections to a single DB) behind a load balancer that uses the LB least connections protocol to distribute load across all 200+ webservers.

Since PHP is used in a stateless mode (no guarentee that the same user will hit the same server), I have to have some way of telling all other apache procs for said box that a server is down. I use stateless on the fly connections, so each apache proc will test the connection.

Here in lies my most common problem. If a box in the backend dies, all apache threads will block for a predefined time-out.

In /etc/php.ini (it could be in /etc/php.d/mysql.ini) I set

mysql.connect_timeout = 5 // the default is 60


In my common_db class when connecting to a database, the connect routine returns a database handle object

if (PEAR::isError($dbh) && $delta_to_connect >= 5){

#
# mark ip as dead for 15 min
#

apc_store($ip, array('DEAD'), 900);
return false;
}

return $dbh;




Now PRIOR to calling the database connect code I check to see if the IP is up.

$status = apc_fetch($ip);
if ($status[0] === 'DEAD'){
return false;
}

... do connect ...


But, there could be a variety of issues that can call false positives, like network flaps, someone blocking the db for some time etc. So, I'll allow one request .1% of the time bypass the status check to try again.


if ($status[0] === 'DEAD' && (mt_rand(0, 1000) != 1)){
return false;
}



But, if your app can't connect to the DB aren't you down anyway?

Yes, although I don't have to restart all the httpds.
Also each DB has a redundant pair to when returning false, the app code will try the other set of servers.

I have a variety of methods to deal with these scenarios but this is the quickest to implement.

Tuesday, May 19, 2009

Multi Direction Sorts and avoiding a file sort

There are two PRIMARY directions to sort data in SQL: Ascending (ASC) and Descending DESC.
When these two sort definitions are put together in a single statement a filesort is produced.

Why do we want to avoid filesorts?

Filesorts are bad. 1st they tickle a thread based buffer called sort_buffer_size. Additionally filesorts reads the data twice, unless max_length_for_sort_data limit is reached and as a result the Filesort runs slower to reduce disk I/O. If you want filesorts to run faster at the expense of the disk increase the default max_length_for_sort_data. You can read the filesort algorithm here.

So, here is an example

CREATE TABLE `ABCD` (
`A` int(10) unsigned NOT NULL default '0',
`B` int(10) unsigned NOT NULL default '0',
`C` int(10) unsigned NOT NULL default '0',
`D` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`a`,`b`,`c`,`d`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1



mysql> explain SELECT * FROM ABCD WHERE a=1 AND b=1 ORDER BY c DESC, d ASC\G

*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: ABCD
type: ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 8
ref: const,const
rows: 2
Extra: Using where; Using index; Using filesort
1 row in set (0.00 sec)



Notice the filesort? So how does one get around this filesort?

Well

Let's define some roles for columns C and D. C is the parent while D is the child.

  • We want all the latest parents (C)
  • We want all the oldest children (D)

    We require pagination of all the PARENTS (show 10 parents per page) so Queries like this is PRODUCED

    SELECT * FROM ABCD WHERE A=? AND B=? ORDER BY C DESC
    explain SELECT * FROM ABCD WHERE a=1 AND b=1 ORDER BY c DESC LIMIT 10\G
    *************************** 1. row ***************************
    id: 1
    select_type: SIMPLE
    table: ABCD
    type: ref
    possible_keys: PRIMARY
    key: PRIMARY
    key_len: 8
    ref: const,const
    rows: 2
    Extra: Using where; Using index
    1 row in set (0.00 sec)




    Now

    FOREACH($C_parent as $i => $c_id) {

    $C_parent[$i] = SELECT SQL_CALC_FOUND_ROWS * FROM ABCD WHERE A=? AND B=? AND C=$c_id ORDER BY D ASC LIMIT 1;


    }

    So, we changed 1 query into 11 queries (10 parents per page) to make the page load happen faster, by getting rid of the filesort.

    What 11 queries is faster then 1? Yes, for this case it is. The reason is because filesorts are SLOOOOW, they chew up a lot of limited resources and they should be avoided. I've see filesorts take close to 50-60% of the query time.
  • Thursday, April 23, 2009

    Restoring a backup of a Circular Replication config

    So say you have two mySQL servers called A and B. A and B are in a circular replication ring meaning

    A replicates from B
    B replicates from A

    In addition, log-slave-updates is on, so if one server of either A or B goes down you can recover all the data from either side.

    Therefore, here are some steps.
    Take a snapshot of a server in the ring, with your favorite backup tool (mine is ibbackup)

    When either server A or B dies, restore from said backup. Here are the steps. Let us assume server A goes down, and backups are done from server B.

    restore the backup of server B to server A

    On server-A turn off log-slave-updates (you will see why on the next line)
    Add replicate-same-server-id on server A to my.cnf

    CHANGE MASTER TO to the position and file of the binary log that is reported from your tool on server-A.

    IF the binary log did not get corrupted on server-B your cool, else flip the binary log to the next binary log from server-A on server-B.

    Let server-A catch up
    Stop server-A
    Add log-slave-updates on server-A
    remove replicate-same-server-id on server A
    Start server-A

    done.

    Wednesday, April 08, 2009

    Recover from No-Route to Host or Storage Engine Plugin Crashes

    There are a finite set of cases where mySQL clients will hang on a connection-for a small period of time (seconds) or indefinitely. Most notably no-route to host causes a timeout to occur which in most clients are on the order of seconds to recovery.

    In a web environment if a database is connected to on the fly, each connection made should take milliseconds-when the net is healthy. However, when a database server crashes a timeout for each connection takes seconds and there are cases where connections can just hang (recovery of INNODB as an example). For instance have you ever had a ssh session hang and a kill (SIGHUP..) does not work? This happens for mySQL client connections too.

    I want the front ends to recover gracefully from a no-route to host, or more notably a hanged connection condition. I want to avoid that damn timeout all together on stateless connections. Why? Because if you use up all your worker httpd threads requests Fail-things crash, bosses get mad, everyone wakes up and you have a meeting about it later. So what are ways around this?

    Use a Load balancer
    This is cool and all, but is expensive since two or three are needed for HA purposes. In addition, this increases the complexity of managing servers, and most LBs are optimized for HTTPD traffic, not raw TCP traffic.

    Use a memcache layer to keep DB state

    This is cool as well, but the state server, which collects the state of your servers and updates memcache, is now a single point of failure-and a lot of time is spent to make this resilient. Not to mention all clients across Programming languages (bash scripts, Java, PHP, Python, etc) need to have access to this memcache layer and logically handle the connection the same. This is doable, but there is an easier way.

    Use IRON DNS:
    IRON DNS is a term I use to build an HA Resolver. IF a box fails a health check, a nagios event handler can tell IRON DNS to set the domain names IP address to 127.0.0.1 or a routable interface that produces a Connection Refused (111 instead of 110).


    For an internal network, all database entries should be able to fit in memory. All resolves should take less then 2ms, and updates to DNS entries can happen in just a few seconds across your entire farm. The only drawback is if the DNS server fails to respond your site is boned. Making DNS Failure proof is rather easy-which is another post entirely.

    For my solution, I am using DNS. This allows me to recover servers in a shard on different boxes without having to change code. This also allows my environment to recover from blocked I/O events that can spike load on the front-end that make requests slow.

      Here are some conditions where connections take to long:
    • No Route to Host
    • INNODB recovery
    • Disk Fails (disk fills up)
    • Switch Fails / Flaps
    • Plug-in storage engine crashes, yet mySQL is up
    • DNS Resolve fails


    If you have a better method, or want to advise me on flaws I should watch out for, drop a comment.

    Tuesday, March 31, 2009

    What do you think about adding ZLIB to memcache storage

    Memcache is a fantastic Hash table-very fast and one of the great successes of Brad Fitzpatrick-who in my opinion has done more for the open social movement as an individual then anyone else. I use memcache quite extensively, now I am thinking about adding ZLIB native to compress the value of each key-much like how INNODB does with the Barracuda file format. The theory is with a CPU hit, we can store more data per memcache instance. I've talked to the Northscale guys and they love the idea. What do you think?


    Update: Well what do you know

    http://us3.php.net/manual/en/function.memcache-setcompressthreshold.php

    for PHP for instance compresses the data on the client side.

    There still might be some value compressing the data on the server-side, but now I'm not as motivated.

    What might be a good alternative is to compress keys into 8 byte longs in memcached automatically instead of the actual string that can be huge. To give some more detail,
    8 byte longs is a 64-bit int. A string can easily be converted into a big int by bit manipulation - and the address space is huge so key conflict is effectively removed.

    Monday, March 23, 2009

    Cloud computing without spending money

    Cloud computing is the big thing now days, weather you are an app developer using EC2 or the Google APP engine, or a new company trying to build your own cloud product. If you are hosting, or using a platform it costs money. I hate to spend money, especially money that is for my company. If I find an idle server, I use it to 100% utilization (prior to the saturation point).

    I needed to build a new application that periodically crawls a website to update various lists. Building a crawler is expensive, especially from scratch. First, you have to define the amount of lag that is allowed from the crawl copy and the real copy. Of course the Project Manager does not want any lag, all events must be caught and near real-time without overloading the source of the data-but I am not hating, it is a challenge. Next, what technology to use, what language to write the app in-what considerations are left to be defined? How does one crawl Gigs, Tera, and amount of data in a guarantee period? On top of that, how much additional hardware is this going to cost. In addition, to be a cloud it needs to have an API so app developers can set, and get consistent data in an expected period. This is a lot of freaking requirements.

    Therefore, to solve this issue, I know that mySQL will store the data, but getting the data is the hard part. This is what is going to cost money, lots of it. I looked around common architectures and found that nothing would do what I wanted to do in a cost effective manner. So, I designed my own using Seti @ Home as the basis for the design.

    Get to the point already Dathan:

    I have turned ever user who views my applications into a collector, using idle bandwidth without knowing who is collecting the data. My user base on spare cycles will fetch a feed of my choosing, and sends that data to my servers without any personal information. Instead of using an Amazon or Google service, I have turned my user base into a cloud to service their needs.

    It is rather awesome-I must say. I am able to service the needs of more than 60 million users at the cost of development time, and NO NEW HARDWARE. The cloud does not have to be a service provider-it can be the end user as long as the end user is not impacted by the requests. BTW the team that I manage is freaking awesome-they built my vision with trial and error and a hand waved spec.

    Currently the system scales as long as there is enough end users. If I lose all my users then well I am boned, but to support the feeds all I need is 100K nodes at the current rate. With 60 million end nodes, I am cool.

    Imagine if Google with Adsense used this install base to tell Google if the data has changed for an arbitrary web address. All it needs is a few people to hit the same url, inform Google that the web address in question has a different checksum, then at that point Google’s crawlers can go fetch it, instead of constantly crawling data that doesn't change. Google would be able to reduce overall server cost significantly, if it just knew what data has changed instead of guessing what data has changed.

    By next years Velocity conferance I hope to have a full disclosure on what technology my team used, how my team get around cross domain issues, and how to compute checksums to validate the data.

    PS - I designed this, with my team we made it much better and one person implemented it and owns the product from this point on.

    Thursday, March 12, 2009

    Walking an INNODB table Fast

    Walking a table means, traversing each row, commonly used in building queues, fixing data, or dumping a table. I've recently ran into a problem-caused by an assumption, where walking a table was taking way to long using the method


    $pos = 0;
    do {

    $result = SELECT col FROM TABLE LIMIT $pos, 1000;
    $pos += 1000;
    } while ($result);

    The assumption was since INNODB uses a cluster index, this would traverse the table using the PRIMARY key. This is not the case, its not a problem in INNODB but a bad assumption, that I fell victim to. A table scan to each $pos occurs producing a Big-O of N^2. So, when the query:

    SELECT col FROM TABLE LIMIT 1000000, 1000 is executed mySQL will scan all the rows up to row position 1001000 and for each subsequent iteration.

    This is SLOOOOW. IMHO since the table is sorted by the primary key, mySQL should optimize this case - but it does not and will not. So, to walk an INNODB table fast, and keep liner time or a Big-O of N an alternative is

    $last_id = 0
    do {

    $result = SELECT col FROM TABLE USE INDEX(PRIMARY) WHERE pkey_part > $last_id LIMIT 1000
    $last_id = $result[count($result) - 1]->pkey_part

    }while($result);


    This dumps a table very fast, almost as fast as doing a count(*) on the PRIMARY KEY.

    Another method is to

    SELECT col INTO OUTFILE "/dir/file.ids" FROM TABLE;

    but the data is local to the database - thus the need for the application to grab data. Another draw back of this method is that the dump produces more disk IO then walking a table off of a key, slowing down access to this table.


    In conclusion, even if the storage engine keeps the table order consistent like INNODB does, do not assume that LIMIT 100000, 1000 is equivalent to a file seek of position 100000, without telling the Optimizer to use an index.