Thursday, July 10, 2014

Manually Switch Slaves to new Masters in mySQL 5.6 (XtraDB 5.6)

I'm really excited about Fabric, which was recently announced. Everything it does has been a variety of scripts or manual tasks for me, but before I can integrate Fabric into my system I must know more about it. When dealing with live data and moving servers around I still do things manually, just because it makes me feel better to know that if data is lost, I was the cause by doing something dumb. Basically I need to know everything about Fabric, including line-by-line execution, before I deploy it.


Here are my steps for switching and replacing a Shard Slave.

Imagine having a setup with the following config.

Shard Server  10.0.30.123 - this is the master endpoint

The Global Shard which holds Friend Info to join against is 

10.0.1.1

10.0.30.123 --- replicates from ---> 10.0.1.1

Now the Shard Server 10.0.30.123 has 3 slaves, thus 10.0.30.123 is set up with log-slave-updates.
The 3 slaves are 10.0.18.78, 10.0.22.76, and 10.0.22.77, and I want to make 10.0.22.76 the new master for the said shard, with 10.0.22.77 as its slave. So, what I have is

3 slaves --- replicates from ---> 10.0.30.123 --- replicates from ---> 10.0.1.1

what I will end up with is

10.0.22.77 --- replicates from ---> 10.0.22.76 --- replicates from ---> 10.0.1.1

I am getting rid of 10.0.30.123 and 10.0.18.78.


Here are the steps.

Tell 10.0.22.77 and 10.0.22.76 to SLAVE UNTIL the next binary log on 10.0.30.123
  • ssh to each box
  • STOP SLAVE (using mysql 5.6) on 10.0.22.7[6-7]
  • SHOW SLAVE STATUS\G -- get Master_Log_File : master-bin.000612
  • START SLAVE UNTIL MASTER_LOG_FILE='master-bin.000613', MASTER_LOG_POS=4
Now what I did here was tell the slaves to replicate until the next binlog is reached:
mysql> START SLAVE UNTIL MASTER_LOG_FILE='master-bin.000613', MASTER_LOG_POS=4;
Query OK, 0 rows affected, 2 warnings (0.01 sec)

mysql> SHOW WARNINGS;
+-------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                  |
+-------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1278 | It is recommended to use --skip-slave-start when doing step-by-step replication with START SLAVE UNTIL; otherwise, you will get problems if you get an unexpected slave's mysqld restart |
| Note  | 1753 | UNTIL condtion is not supported in multi-threaded slave mode. Slave is started in the sequential execution mode.                                                                         |
+-------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec) // notice the minor bug in the spelling :)

I also get a warning that says my multiple SQL threads will now run as a single thread, which is fine.


My next step is to ssh to 10.0.30.123

  • FLUSH LOGS - this tells 10.0.30.123 to rotate all log files, opening master-bin.000613
Now on the slaves I wait until they stop
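A sketch of this step, using the binlog names from above (the status fields shown are the ones I check):

-- On 10.0.30.123: rotate the binary logs, opening master-bin.000613
FLUSH LOGS;
SHOW MASTER STATUS;   -- should now report master-bin.000613

-- On each of 10.0.22.76 and 10.0.22.77: poll until the SQL thread stops
SHOW SLAVE STATUS\G
-- Slave_SQL_Running: No   (stopped at the UNTIL position)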

Once both stop, on 10.0.22.76 I issue RESET MASTER. I don't care about what was replicated and saved already in the binlogs at this point; I've already verified the slaves are in sync with CHECKSUM TABLE.
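A minimal sketch of that verification, run on each server and compared by hand (the table names are placeholders for the real shard tables):

CHECKSUM TABLE shard_db.Users, shard_db.Photos;
-- compare the Checksum column across 10.0.30.123, 10.0.22.76, 10.0.22.77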

On 10.0.22.77 I issue the command (the coordinates are master-bin.000001, position 4, because RESET MASTER on 10.0.22.76 started a fresh binlog):
 
STOP SLAVE; CHANGE MASTER TO MASTER_LOG_FILE='master-bin.000001', MASTER_LOG_POS=4, MASTER_HOST='10.0.22.76'; START SLAVE;

If you get this error:

Fatal error: The slave I/O thread stops because master and slave have equal MySQL server UUIDs; these UUIDs must be different for replication to work.

stop mysqld, remove auto.cnf in your $DATADIR (/var/lib/mysql), and start mysqld again; a new server UUID is generated on startup. This happens when a slave was built from a copy of another server's data directory.
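To confirm the UUIDs actually differ after the restart (server_uuid is a mysql 5.6 variable):

-- run on both 10.0.22.76 and 10.0.22.77
SELECT @@server_uuid;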

On 10.0.22.76 I issue 

START SLAVE


Now I wait until the slave catches up to the master 10.0.30.123 (remember this works because of log-slave-updates).
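Caught up means no lag and everything in the relay log executed; on 10.0.22.76:

SHOW SLAVE STATUS\G
-- Seconds_Behind_Master: 0
-- Slave_IO_Running: Yes, Slave_SQL_Running: Yes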


Next in my dbconfig.php file I change all references to 10.0.30.123 to 10.0.22.76

Verify everything is in sync (use CHECKSUM TABLE across tables/servers) and push out the new config.

After the push, make sure to restart all daemons and queue workers; they may cache the database config.

Now do this all over again to make

10.0.22.76 replicate from 10.0.1.1
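The repeat is the same CHANGE MASTER dance. A sketch, where the coordinates are placeholders; the real ones come from 10.0.30.123's SHOW SLAVE STATUS (Relay_Master_Log_File / Exec_Master_Log_Pos) at the moment it is retired:

-- on 10.0.22.76
STOP SLAVE;
CHANGE MASTER TO MASTER_HOST='10.0.1.1',
    MASTER_LOG_FILE='global-bin.000123', MASTER_LOG_POS=120;
START SLAVE;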


In conclusion, this is just too manual and screams for automation. Soon it will be automated with Fabric, which manages this whole process, once I get around to rolling that out.

Tuesday, June 03, 2014

CTO of Shots on Core Technology, Culture and Working on the greatest App in the World

MySQL has opened a lot of avenues and opportunities for me. I am the CTO of Shots, and I got here because I was a mySQL DBA who could code in a variety of languages, understand data access and layout, design fast backends that scale to well over 100 million users, manage a team, give back to the community, and prove myself through constant good work. Plus I've made every single mistake, so I know what not to do.

At Shots we of course use Percona XtraDB 5.6, with Memcache, Redis, ElasticSearch, HAProxy, FluentD with Logstash plugins, Ruby, PHP 5.4, Go, Java, Erlang, and AWS, all managed via a custom Chef build. We use chef server like chef solo :)

In four months we grabbed over 1 million ACTIVE users, all on iOS, mainly from the US, UK, Australia, Canada and Brazil. We are able to handle Justin Bieber's traffic, which is insane. Currently we have no downtime (yet; always plan for downtime). We moved our DC to AWS us-west-1, and S3 from the crappy S3 East to S3 West. We grow organically, and are at the cusp of hitting our hockey-stick growth, all on 12 servers currently :)

There are 4 of us. Everything described here is what I handle, yet I say "we" throughout this post, and that is a good segue to our culture. As a team we build for our consumers, our Shotties: a positive, bully-free app that works across all platforms to keep humans interacting with humans, without all the other cruft you find on social networks. Team and community focus is our culture, with the confidence to build the best app in the world.

It's a photo status update, better known as a selfie app, which everyone from Shaq, King Bach, and Justin Bieber to Floyd Money Mayweather and many others are using. It's cool to have these folks, but we are not building a platform for just them; we are building a platform for you. For the teens who are different and like cosplay, for the teen that loves Bieber, for the person who wants to remember the moment, for you.

I invite you all to use it. I also invite you to ask your friends if they would like to join us, because now I am hiring an iOS and Android dev. If you or they want to work for a startup, be a part of something cool with the purpose of changing the world and the way we interact with one another online, join us. The requirements: live in the Bay Area, can code, and want to make a change :)

Monday, October 28, 2013

MariaDB 10.0.4, BeanStalkD, Geographic Replication, Event Tracker for stats gathering at 60K stats a second

Every company needs stats to understand how the application is performing and how users are using it. Most basic questions, and even some advanced ones, can be summarized as "what is said event over time?". We call this EventTracker.

To add to the complexity of generating stats, how do you get stat events from a DataCenter (DC) in Singapore, a DC in Western Europe, and a DC in Oregon to a database for querying in West Virginia, near real time? I used multi-source replication and the BLACKHOLE storage engine to do so with MariaDB.


The flow: a webserver in some part of the world sends events for tracking to a BeanstalkD queue in the same region at time T. Each region has a set of Python workers that grab events from BeanstalkD and write each event to a local DB. Then the TSDB database, a MariaDB 10.0.4 instance, replicates from each BLACKHOLE-storage-engine BeanstalkD worker server.

The obvious question might be: why not use OpenTSDB? The TSDB daemon couldn't handle the onslaught of stats per second, and the current HBase TSDB structure is much larger than a compressed INNODB row for the same stat. Additionally, a region may lose connectivity to another region for some time, so I would need to queue events in some form until the network was available again. Thus the need for a home-grown solution. Now back to my solution.

The events are stored with the following structure.

Currently we are using 32 shards, defined by each bigdata_# database. This allows us to scale per database, and our capacity plan is based not on disk IO but on disk space.


MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| bigdata_0          |
| bigdata_1          |
| bigdata_10         |
| bigdata_11         |
| bigdata_12         |
| bigdata_13         |
| bigdata_14         |
| bigdata_15         |
| bigdata_16         |
| bigdata_17         |
| bigdata_18         |
| bigdata_19         |
| bigdata_2          |
| bigdata_20         |
| bigdata_21         |
| bigdata_22         |
| bigdata_23         |
| bigdata_24         |
| bigdata_25         |
| bigdata_26         |
| bigdata_27         |
| bigdata_28         |
| bigdata_29         |
| bigdata_3          |
| bigdata_30         |
| bigdata_31         |
| bigdata_4          |
| bigdata_5          |
| bigdata_6          |
| bigdata_7          |
| bigdata_8          |
| bigdata_9          |
| information_schema |
| mysql              |
| performance_schema |
+--------------------+
35 rows in set (0.16 sec)

The database bigdata_0 is the only one slightly different from the rest: it has a table, EventTags, that the other databases lack. EventTags maps eventId to tagName, where eventId is a numerical representation of part of the md5 of the tagName. Each numerical representation falls into an address space that denotes which database a tag belongs to. The front-end uses EventTags to search for a stat to plot on a graph.
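As a sketch of what that mapping could look like (how many md5 hex digits are used and the modulo-32 shard pick are my assumptions, not from the post):

-- derive a numeric eventId from the md5 of the tag name, then pick a shard;
-- 15 hex digits = 60 bits, which fits in an unsigned BIGINT
SELECT CAST(CONV(LEFT(MD5('photo.upload'), 15), 16, 10) AS UNSIGNED) AS eventId,
       CAST(CONV(LEFT(MD5('photo.upload'), 15), 16, 10) AS UNSIGNED) % 32 AS shard;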

 

CREATE TABLE `EventTags` (
  `eventId` bigint(20) unsigned NOT NULL DEFAULT '0',
  `tagName` varchar(255) NOT NULL DEFAULT '',
  `popularity` bigint(20) unsigned NOT NULL DEFAULT '0',
  `modifiedDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`eventId`,`tagName`(25)),
  KEY `eventTag` (`tagName`(25))
) ENGINE=BLACKHOLE DEFAULT CHARSET=utf8

EventDay contains the actual value of the stat, aggregated over the last minute (currently). Our granularity allows going down to a second, but we found that per-minute events are fine. The DDL for events is the following.


CREATE TABLE `EventDay` (
  `eventId` bigint(20) unsigned NOT NULL,
  `createDate` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT 'min Blocks',
  `count` bigint(20) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`eventId`,`createDate`)
) ENGINE=BLACKHOLE DEFAULT CHARSET=utf8
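The per-minute write a worker produces might look like the following (a sketch; the exact statement is my assumption). Against a BLACKHOLE table nothing is stored locally, but the statement lands in the binary log for the central database to replay against real InnoDB tables:

INSERT INTO bigdata_7.EventDay (eventId, createDate, `count`)
VALUES (123456789, '2013-10-28 12:34:00', 42)
ON DUPLICATE KEY UPDATE `count` = `count` + VALUES(`count`);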


MariaDB's multi-source replication then downloads the statement-based binary logs from each of the workers around the world and applies the SQL to a combined database that represents all regions. Note that the central database has the same structure, BUT the engine is compressed INNODB with KEY_BLOCK_SIZE set to 8.
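A minimal sketch of that central setup, using MariaDB 10.0's named-connection syntax (hosts, credentials, and coordinates are placeholders):

-- central tables use compressed InnoDB instead of BLACKHOLE
ALTER TABLE bigdata_0.EventDay ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

-- one named replication connection per regional worker
CHANGE MASTER 'oregon' TO MASTER_HOST='worker-oregon', MASTER_USER='repl',
    MASTER_PASSWORD='secret', MASTER_LOG_FILE='master-bin.000001', MASTER_LOG_POS=4;
CHANGE MASTER 'singapore' TO MASTER_HOST='worker-singapore', MASTER_USER='repl',
    MASTER_PASSWORD='secret', MASTER_LOG_FILE='master-bin.000001', MASTER_LOG_POS=4;

START ALL SLAVES;
SHOW ALL SLAVES STATUS\G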

The front-end sits on top of the central database, and we record everything from a single item being sold to load on our auto-scaling web farm, which allows us to do some interesting plots, like items sold as a function of load over time: I(L(t)).


Currently with this method we are producing 60K events a second, which translate to a few thousand database updates a second across 4 replication threads (8 threads total), keeping all data up to date to within the last minute, near real time.


Sunday, July 07, 2013

Speaking at RAMP: Scale Patterns and handling exponential growth without downtime

I will be in Budapest talking about scale and rapid growth. I will start off with Flickr's five-minute conversation, then take a direction on how to scale the backend to getting 90 million users in 3 weeks after going viral.

http://rampconf.com/main.html#schedule


RAMP will also have live streaming, broadcast at TNW, HWSW and on USTREAM.


Monday, June 24, 2013

First Week.5 in China: Part Two, Refactor PHP get 10% more capacity with one change

The PHP code that I've experienced in China so far is pretty good. I have been in some environments where the code is horrendous, where variables are set in one file yet used in another file via a require_once. If that magic variable is not set, everything breaks with side effects galore. This is not the case for the China team. This team is really good; that's not to imply the other one wasn't, just praise for the current one.

The SQL, as at many other companies I have been at, requires some extra effort, but the hunger to learn and improve runs throughout the culture of the team here. Really, that is the first step to improving a system: the willingness of developers and management to get things done and fixed, fast.

Entering the environment, I first read all the code. Then I created a development environment to play with the code. Next I profiled how the database is being interacted with, in conjunction with the cache. All looked OK, but some back-of-the-envelope calculations showed the server farm is too big for the amount of traffic. Traffic is huge, don't get me wrong (5M+ DAU)! But the farm is too big. Digging some more, I found that the code spins to search for items on a map by loading all map items and going through each one in a for-loop until that one item is found, then returning. This is done two to four times for every API request, especially to trigger an achievement if the item is found on your map. More on the fix later.

Before making any changes, I wanted feedback on the biggest issue that seemed to cause bugs or slow down development. The consensus was that the DB layer was mixed into the model layer, causing fear of changing said models because they might break the DB layer. It was not quite clear how the code communicated with the DB, so the team fulfilled feature requests by reusing the same existing functionality, which is sub-optimal if the root functionality was slow or expensive to use in the first place at scale.

Thus the first recommendation was to separate out the DB layer so that data is accessed through Data Access Objects (DAO). This concept encapsulates DB logic and for the most part requires only three methods: add, get, delete. Some more complex objects calling a DAO had specific SQL to make getting data faster, but for the most part three methods per table was all that was needed. Following the new directory structure, backed by PHP namespaces, all SQL became easy to find and isolated away from the model.

The second recommendation was to remove a bunch of in-PHP data caches, because they created a vast number of copies chewing up a ton of memory per request, as well as CPU to build the caches per request. If the cache hit rate is not good, don't cache; the added complexity sucks to maintain and can actually slow things down if not needed.

The third recommendation was to make each model a single instance per distinct entity (a singleton map) throughout the request, which reduced the overall number of database queries by coupling model creation to database fetches. The queries are reduced because instead of pulling the same data for object creation in various parts of the code, the single object is referenced.

So here is the new structure for models/database access/utils

v2/
v2/classes/DB/     -- DB connection logic
v2/classes/DAO     -- DB access
v2/classes/Models  -- new models
v2/classes/Util    -- common utility classes
v2/init.php        -- everything is set up from this file

With this new structure, separation of responsibility has been created in the code. More people can work on the same feature: one person can optimize the SQL, while another plugs in the model and yet another handles the access logic (controller). Or a single person can do it all. Most importantly, the team loves the new setup.

In my first week and a half, with the new model format added to the existing code base (editing 242 files for a single model's usage, the largest and one of the most important models, which controls the map locations of the game), the result has been great: a 10% drop in the number of servers and no user complaints, with still more room for improvement. The biggest change came from removing the spin through all the map locations to find a single item; the fix changed a hot O(n) PHP method to O(1).

60 more models to go.

The good note about the unoptimized code is that it forced the dev-ops side of things to mature quickly, and the tools they built are really robust. To deal with features being pushed out that may not be mature enough for the request load, the team built a cool dashboard with Jenkins automation, home-grown software, realtime server metrics, and rules to launch new instances and shrink them automatically throughout the day. It works flawlessly, for the front ends that is. It's so good and works so well that I hope one day it could be an open-source project of its own.

Tuesday, June 18, 2013

First Week in China: Build a new Dev Environment

I see my role as enabling others. When I was a pure awesome DBA in the early 2000s, I enabled developers and customers of a company's product by making mySQL fault-tolerant and fast. As I moved up the stack to Architect, while still holding onto my roots as a DBA, I kept my DBA discipline, enabling my team and company through all the knowledge I garnered.

The first thing I identified in China that could really help my team members is a new development environment. The reason: the production and dev environments are wildly different. Dev is on Windows, while production runs various flavors of Linux's 2.6 kernel, mostly CentOS 6. Additionally, when code is ready it is pushed to what I like to call pre-integration servers, meaning the code is not checked in but copied to a test server, then checked in if the tests pass. As a result, developers spend time coordinating which test server to use, and such a server can only be used while in the office.

Generally, as a developer you should develop in something similar to your production environment, and the integration server should serve as QA of the product, not as a post-development process that bypasses all unit tests (which did not exist). Also, a lot of effort was put into making this Windows-to-Linux setup work just well enough, which really means not well at all. Since PHP behaves slightly differently under Windows, I found that time was being spent on issues that would not show up under Linux's PHP version. These issues provided enough justification to build an integrated environment, where a developer can work from home or from wherever, even with no direct network connection to the outside world.


The Setup

Forcing a developer to change their OS of choice, or IDE, or what have you is not going to fly in any country; it's just too disruptive. Thus I chose to build the environment on VirtualBox, a free VM host that works on Mac and Windows, the two primary dev environments. I pre-built the VM and uploaded it to the local fileserver. Now all the team has to do is download the VM.

Here is what is installed on the VM. (These steps follow after installing CentOS-6-minimal.)

First, I set up a shared directory from the HOST (Mac) machine to the GUEST machine (VM), which contains the code to run the site. This allows the user to use their favorite native IDE app or vim.

Next, I set up a host-only virtual network, so even if the HOST has no connection to the net, it can always talk to the VM via ssh, httpd or what have you. I also set up another network interface for the VM to talk to the outside world via the NAT setting, so packages can be installed directly on it via yum.


Then I configured the yum repos for CentOS 6 (EPEL for core Linux utils, 10gen for MongoDB, Percona for XtraDB) by modifying /etc/yum.repos.d and adding the following repos to my list:

Percona.repo
epel-testing.repo
epel.repo
remi.repo
CentOS-Vault.repo
CentOS-Media.repo
CentOS-Debuginfo.repo
CentOS-Base.repo
10gen.repo 


Additionally, I installed Percona, MongoDB, php, php-cli, php-fpm, nginx, apache, vim-enhanced, etc. via yum on the VM.


Finally, I wrote documentation for the whole process and tested it on a few people who have good spoken English skills. With their feedback the documentation was improved and sent to the rest of the team, who have pretty good written English skills.

Now all the dev team members have to do is download the VM, configure the shared code directory and tada: the entire dev environment in a box!

The next step is to resolve schema changes, and to use Chef to update configurations and packages as if the VM were a real server; this is currently a manual process.


Next Post: Refactor PHP Models and add Unit Tests

Wednesday, June 12, 2013

In China and Spreading mySQL/MariaDB/XtraDB, Ganglia, GearmanD, Memcache, MongoDB, HAProxy, Nginx, PHP, Python

I am currently in Beijing for a month as the VP of Technology for Fun+, a US/China-based gaming company, spreading the joys of open source. I have an entire team to do benchmarks, study INNODB flushing, and build new technologies, which I hope to open-source. I will also post the results here. Our stack is mostly on AWS with the following:

HAProxy load-balances the web tier
The web tier runs nginx and php-fpm
Data is stored in a new sharded mySQL layer; the gift platform is on MongoDB
Memcache is used to cache frequently accessed items, giving state to our stateless web tier and reducing DB load, although we can run without it.

What I am focusing on:

Code-Style
When to cache and when not to cache
How to get the most out of mySQL and MongoDB, especially on index design
Tools for DevOps by DevOps
Reducing cost

I hope to have a lot of information to share in the next couple of weeks.

Monday, May 13, 2013

How to pick indexes is the same for MongoDB as mySQL


I recently went to MongoDB Days, a conference in SF about everything MongoDB. Having started my career as a systems programmer, then web developer, mySQL DB[Admin|Architect], then software/system architecture, I like to keep an open mind about new technology and trends. When you work with a lot of different languages and technology, you find out that it's basically the same science from about 40 years ago.

An index in MongoDB is like an index in mySQL, since a Btree is a Btree regardless of what application uses it. Just like with mySQL, the best performance improvement for an application using MongoDB as a datastore is adding the correct indexes.


To create an index in MongoDB:

db.<tableName>.ensureIndex({ col1: 1, col2: -1, col3: 1 }); // note 1 means ASC, -1 means DESC

MongoDB follows the same left-most-prefix rule meaning

col1, col2, col3 is an index
col1, col2 is an index
col1 is an index

col2, col3 IS NOT AN INDEX

So, just like with mySQL, for ONE three-column compound index you get a total of THREE indexes if you follow the left-most-prefix rule: the columns from left to right (in order) in a compound index form an index.

MongoDB also gets a performance boost from covering indexes, just like mySQL. What is a covering index? Instead of reading from the data page (or document store for MongoDB), which exists on disk, you're reading the data from the index, which should be in memory for the most part. A common practice is to follow the left-most-prefix pattern, then add the columns which you are returning at the end of the compound index. For instance:

SELECT photoId from Photos WHERE userId=? AND dateCreate=? AND privacy=?

The index in mySQL I would make is

ALTER TABLE Photos ADD INDEX `userId-dateCreate-privacy-photoId` (userId,dateCreate,privacy,photoId)

Thus following the left-most-prefix of a compound index I have an index on

userId, dateCreate, privacy, photoId
userId, dateCreate, privacy
userId, dateCreate
userId

and a covering index that satisfies the query above.
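You can check that the index covers the query with EXPLAIN; when a query is satisfied entirely from the index, the Extra column reports "Using index" (the output lines below are illustrative):

EXPLAIN SELECT photoId FROM Photos
WHERE userId=1 AND dateCreate='2013-05-01 00:00:00' AND privacy=1\G
-- key:   userId-dateCreate-privacy-photoId
-- Extra: Using index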

For mongoDB it's the same:

db.photos.ensureIndex({ userId: 1, dateCreate: 1, privacy: 1, photoId: 1});


So, in conclusion: understand the computer science of a Btree, Hash, or LinkedList and you will understand how indexes work across technologies, and find that essentially it's the same. More info on indexes for mySQL can be found here.

Also note:

Explain in mongoDB is your friend, just like Explain in mySQL.