At Flickr adding capacity is easy. Since we are able to scale at a function of user growth, independent of hardware classes. This makes adding new hardware easy. I'm able to adjust the server weight on the FLY to give more users to a certain class of hardware over another.
Changing schema is easy as well and now I do it much safer. In the pass I would turn off a master for each partitioned dataset all at the same time: then stay up for 20+ hours and execute an alter across the entire server farm, that was off line.
Now I do it in stages and do it throughout the week. Its a little slower but I get to sleep. Additionally with the new method I am able to do more all at the same time-like rebuild the entire dataset for a partition.
Here are my steps for doing a change that takes more then 10 hours:
remove the servers from the site config
push my.cnf.maintenance to the servers not taking on site traffic
- this starts mysql on port 3307 and turns off replication and binary logging
- this step is crucial to tell our backend bots not to do stuff on the servers in maintenance.
mkdir -p /var/lib/mysql/restore
DUMP all data using mysqldump and SELECT *,[NEW FIELD DEFAULT VALUE1...VALUE-N] INTO OUTFILE '/var/lib/mysql/restore/[TABLE].txt 8 processes at a time
remove all of innodb datafiles
copy data with 8-20 processes to the filer to sustain 50MB a sec. Pretty nify script
mkfs.ext3 -Tlargefile4 -LMYSQL [device]
copy some of the data back (everything except the txt files)
LOAD data in parallel about 8 processes at a time
verify that utf8 data was loaded properly
wait about 45 mins for 12 hours of events to catch up.
put it back into the site config
I do these steps across 6-10 servers at a time-using a dsh like application
Your question may be why are you rebuilding the filesystem and using ext3? Well EXT-3 is my only option and it fragments really bad-messing with performance. So if I am going to do a massive alter-I throw that in there for good measure-I plan on doing this once a year.
I do not do this procedure for alters that last less then 10 hours, but if I alter Flickr's main table which holds most of the data-I take the opportunity to shrink all the datafiles.