tag:blogger.com,1999:blog-314219542024-03-07T14:02:45.788-08:00mySQL DBA, Architecture, Dev, Scale, HA, Code <a href="https://www.linkedin.com/in/dathan"><img src="https://www.linkedin.com/img/webpromo/btn_viewmy_160x25.gif" width="160" height="25" border="0" alt="View Dathan's profile on LinkedIn"></a>Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.comBlogger185125tag:blogger.com,1999:blog-31421954.post-45742055340252335272023-02-07T15:43:00.001-08:002023-02-07T15:43:48.378-08:00ERROR 3546 (HY000) at line 24: @@GLOBAL.GTID_PURGED cannot be changed: the added gtid set must not overlap with @@GLOBAL.GTID_EXECUTED<h2 style="text-align: left;"> ERROR 3546 (HY000) at line 24: @@GLOBAL.GTID_PURGED cannot be changed: the added gtid set must not overlap with @@GLOBAL.GTID_EXECUTED<br /><br /><div style="text-align: left;"><div><span style="font-size: small; font-weight: normal;">As a MySQL 8.0 user, you may have encountered the following error message when trying to dump data from one database server and load that data into another server: <br /><br />"ERROR 3546 (HY000) at line 24: @@GLOBAL.GTID_PURGED cannot be changed: the added gtid set must not overlap with @@GLOBAL.GTID_EXECUTED." <br /><br />This error occurs when the Global Transaction Identifier (GTID) sets of the source and target servers overlap, usually because of a previous import; for example, importing a staging database into a development environment.</span></div><div><span style="font-size: small; font-weight: normal;"><br /></span></div><div><span style="font-size: small; font-weight: normal;">GTIDs are unique identifiers that are generated for each transaction in MySQL 8.0. They allow you to track changes to your data, even across multiple servers. 
When you receive this error message, it means that there is a conflict between the source and target server GTID sets.</span></div><div><span style="font-size: small; font-weight: normal;"><br /></span></div><div><span style="font-size: small; font-weight: normal;">The solution to this issue is to reset the master on the target server before importing the dump file. Resetting the master will erase all the binary logs and start a new one, allowing you to import the dump file without encountering the error.</span></div><div style="text-align: left;"><span style="font-size: small; font-weight: normal;"><br /><ol style="text-align: left;"><li>RESET MASTER</li><li>mysql -uroot db < dump.sql</li></ol><div><br /></div><div>It is nice to blog again; I blog here about mySQL, and at <a href="https://dathan.github.io/blog/">https://dathan.github.io/blog/</a> about random other things.<br /><br /></div></span></div></div></h2>Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-33162249651388046722019-04-23T11:45:00.002-07:002019-04-23T11:45:53.889-07:00Debugging awslab's aws-service-operator with go delve on vscodeCurrently, I'm doing a lot of work in Kubernetes, especially around operators. One operator in particular I am working on is <a href="https://github.com/awslabs/aws-service-operator" target="_blank">aws-service-operator from awslabs</a>. We ran into a bug with the default behavior around the <a href="https://github.com/awslabs/aws-service-operator/blob/0784af1c09f3d7c2bb3c39e50a859bac9fd31860/cloudformation/dynamodb.yaml#L66-L72" target="_blank">dynamodb CR</a>. There is a bug in this cloudformation template that defaults RangeAttributeType to String, when the operator supports strings, numbers, and bytes.<br />
<br />
<br />
I know this is a bug; the highlighted text from the click-through clearly states it. But how do I verify the bug? My environment is a MacBook Pro with VS Code using all the Go tools extensions.<br />
<br />
So let's set up the debug environment:<br />
<br />
First, I need to set up the repo itself:<br />
mkdir -p $GOPATH/src/github.com/awslabs<br />
cd $GOPATH/src/github.com/awslabs<br />
git clone git@github.com:awslabs/aws-service-operator.git<br />
<br />
<br />
<br />
Now let's follow the <a href="https://github.com/awslabs/aws-service-operator/blob/7ff2311d7715fe06a7aeb6c3e84a9c9e22cbb661/development.adoc" target="_blank">development guidelines</a> and build the environment outside of vscode (getting dep and everything working).<br />
<br />
<br />
<br />
$> code aws-service-operator // the "code" shell command is installed by vscode so you can open a project from the command line.<br />
<br />
Click the menu Debug, click Add Configuration. Paste below.<br />
<br />
<pre><code>{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Launch",
            "type": "go",
            "request": "launch",
            "mode": "debug",
            "remotePath": "",
            "port": 2345,
            "host": "127.0.0.1",
            "program": "${workspaceRoot}/cmd/aws-service-operator",
            "env": {},
            "args": ["server","--kubeconfig=/Users/dathan.pattishall/.kube/config", "--region=us-west-2", "--cluster-name=dathan-eks-cluster", "--resources=s3bucket,dynamodb,sqs", "--bucket=wek8s-dathan-aws-service-operator", "--default-namespace=system-addons"],
            "showLog": true
        }
    ]
}</code></pre><br />
<br />
Click the menu Debug and click Start Debugging. This assumes you're using saml for aws auth and that your role is an admin, or at least has the IAM EKSWorkerNodeRole. If you are using AWS-Admin like I am, you are good.<br />
<br />
<br />
Now let's start debugging. Put a breakpoint at line 101 of pkg/helpers/helpers.go. Step into<br /><br />resource, err := clientSet.CloudFormationTemplates(cNamespace).Get(cName, metav1.GetOptions{})<br />
<br />
You'll see that the application makes a call to itself to try to get the cloudformation templates you installed. If you didn't install any cloudformation template called dynamodb, the default will be used:<br /><br />https://s3-us-west-2.amazonaws.com/cloudkit-templates/dynamodb.yaml<br />
<br />
<br />
This is where the bug is. The cloudformation yaml does not !Ref the Hash or Range attribute types, and the workaround is to install a CloudFormationTemplate CR:<br /><br /><br />
<pre>apiVersion: service-operator.aws/v1alpha1
kind: CloudFormationTemplate
metadata:
  name: dynamodb
output:
  url: "https://s3-us-west-2.amazonaws.com/a-temp-public-test/dynamodb.yaml"
</pre>
<br />
output.url contains the CloudFormationTemplate with a data field that defines the cloudformation template. I can only surmise that, to keep code paths common, they make extra API calls for reusability: even though the aws-service-operator already has the CloudFormationTemplate, it needs to fetch it remotely due to how the code is constructed, making redundant calls. You'll see this in the debugger: the operator makes an API call to itself, parses the YAML, then fetches the YAML from a remote endpoint.<br />
<br />
So the operator pulls the template from a REST or HTTP endpoint even though it is already defined in K8s itself.<br />
<br />
The fix to the bug is as follows.<br /><br />From:<br />
<pre>AttributeDefinitions:
  -
    AttributeName: !Ref HashAttributeName
    AttributeType: "S"
  -
    AttributeName: !Ref RangeAttributeName
    AttributeType: "S"
</pre>
To:<br />
<pre>AttributeDefinitions:
  -
    AttributeName: !Ref HashAttributeName
    AttributeType: !Ref HashAttributeType
  -
    AttributeName: !Ref RangeAttributeName
    AttributeType: !Ref RangeAttributeType
</pre>
<br />
<br />
<br />
In addition to this, awslabs uses N as a value. Unquoted N means false in YAML 1.1 (it is one of the spec's boolean literals, like y/n/yes/no/on/off). Thus, in the yaml passed to create a dynamodb table, you need to quote it.<br />
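The YAML 1.1 implicit-typing behavior described above (the so-called "Norway problem") can be sketched in a few lines of plain Python; this is a hand-rolled illustration of the spec's boolean literals, not PyYAML itself:

```python
# YAML 1.1 resolves these plain (unquoted) scalars to booleans.
YAML11_BOOLS = {
    "y": True, "Y": True, "yes": True, "Yes": True, "YES": True,
    "n": False, "N": False, "no": False, "No": False, "NO": False,
    "true": True, "True": True, "TRUE": True,
    "false": False, "False": False, "FALSE": False,
    "on": True, "On": True, "ON": True,
    "off": False, "Off": False, "OFF": False,
}

def resolve_plain_scalar(value: str):
    """Resolve a plain scalar the way a YAML 1.1 loader would.

    Quoting the scalar ("N") skips implicit resolution entirely,
    which is why the attribute types below must be quoted.
    """
    return YAML11_BOOLS.get(value, value)

print(resolve_plain_scalar("N"))          # False, not the string "N"
print(resolve_plain_scalar("CreatedAt"))  # stays a string
```

Quoting bypasses this resolution, which is exactly why the type values in the dynamodb spec must be written as "S" and "N".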
<br />So, in the end, I need the following yaml to create the dynamodb table which I use to test the operator.<br /><br />
<pre style="background-color: #f6f8fa; border-radius: 3px; box-sizing: border-box; color: #24292e; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 11.9px; line-height: 1.45; margin-bottom: 16px; overflow-wrap: normal; overflow: auto; padding: 16px;"><code style="background: transparent; border-radius: 3px; border: 0px; box-sizing: border-box; display: inline; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 11.9px; line-height: inherit; margin: 0px; overflow-wrap: normal; overflow: visible; padding: 0px; word-break: normal;">apiVersion: service-operator.aws/v1alpha1
kind: DynamoDB
metadata:
  name: sample-tablename
spec:
  hashAttribute:
    name: AuthorizationCode
    type: "S"
  rangeAttribute:
    name: CreatedAt
    type: "N"
  readCapacityUnits: 10
  writeCapacityUnits: 10</code></pre>
<br />
<br />
<br />
Notice that the "S" is quoted along with the "N"; otherwise, an unquoted N equates to false.<br />
<br />
<br /><br /><br /><br /><br />In conclusion: Delve is awesome, the operator has a bug, and I was able to figure it out with this debugging method to produce this case: <a href="https://github.com/awslabs/aws-service-operator/issues/181">https://github.com/awslabs/aws-service-operator/issues/181</a>Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-25909742250045639422019-02-12T14:37:00.001-08:002019-02-12T14:37:31.678-08:00Aurora mySQL differencesI thought working with Aurora MySQL would be a breeze, but its subtle differences make me scratch my head. Thus, I need to find out more about this and write a post :)<br />
<br />
What is Aurora?<br />
<br />
It's a mySQL wire protocol compatible storage management system that sits on top of mySQL and modifies some innodb internals. You can read more about the architecture <a href="https://aws.amazon.com/blogs/architecture/amazon-aurora-mysql-dba-handbook-connection-management/" target="_blank">here</a>. I think of it as a Proxy Storage Engine System.<br />
<br />
<br />
The differences start with just starting the server. Aurora MySQL has Huge Page support turned on by default, since AWS launches the Aurora MySQL server with a custom flag for innodb large page support:<br />
<br />
innodb_shared_buffer_pool_uses_huge_pages<br />
<br />
This is not a setting documented for the official open source MySQL <a href="https://www.google.com/search?ei=5CFiXNWtEane0gKA9bu4BA&q=site%3Adev.mysql.com+%2Binnodb_shared_buffer_pool_uses_huge_pages&oq=site%3Adev.mysql.com+%2Binnodb_shared_buffer_pool_uses_huge_pages&gs_l=psy-ab.3...11906.11906..12853...0.0..0.61.61.1......0....1..gws-wiz.......0i71.jp9rSbc1Nus" target="_blank">build</a>. In fact, there is not much information on this setting at all. I can only assume RDS instances are configured with Huge Page support as detailed <a href="https://dev.mysql.com/doc/refman/5.7/en/large-page-support.html" target="_blank">here</a>, and this custom setting for Aurora turns large page support on for mysqld.<br />
<br />
So, what else is different between <a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Reference.html" target="_blank">Aurora and Innodb</a>? From Amazon's docs:<br />
<br />
<blockquote class="tr_bq">
<div style="background-color: white; color: #444444; font-family: "Amazon Ember", "Open Sans", Helvetica, Arial, sans-serif; font-size: 16px; line-height: 1.5em;">
The following MySQL parameters do not apply to Aurora MySQL:</div>
<div class="itemizedlist" style="background-color: white; color: #444444; font-family: "Amazon Ember", "Open Sans", Helvetica, Arial, sans-serif; font-size: 16px;">
<ul class="itemizedlist" type="disc">
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_adaptive_flushing</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_adaptive_flushing_lwm</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_checksum_algorithm</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_doublewrite</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_flush_method</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_flush_neighbors</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_io_capacity</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_io_capacity_max</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_log_buffer_size</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_log_file_size</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_log_files_in_group</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_max_dirty_pages_pct</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_use_native_aio</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_write_io_threads</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">thread_cache_size</code></div>
</li>
</ul>
</div>
<div style="background-color: white; color: #444444; font-family: "Amazon Ember", "Open Sans", Helvetica, Arial, sans-serif; font-size: 16px; line-height: 1.5em;">
The following MySQL status variables do not apply to Aurora MySQL:</div>
<div class="itemizedlist" style="background-color: white; color: #444444; font-family: "Amazon Ember", "Open Sans", Helvetica, Arial, sans-serif; font-size: 16px;">
<ul class="itemizedlist" type="disc">
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_buffer_pool_bytes_dirty</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_buffer_pool_pages_dirty</code></div>
</li>
<li class="listitem" style="line-height: 1.5em;"><div style="line-height: 1.5em;">
<code class="code" style="font-family: Consolas, Courier, mono; overflow: auto;">innodb_buffer_pool_pages_flushed</code></div>
</li>
</ul>
</div>
<div class="aws-note" style="background-color: white; color: #444444; font-family: "Amazon Ember", "Open Sans", Helvetica, Arial, sans-serif; font-size: 16px; margin: 0.5em 2.7em 1em; padding: 0px;">
<div class="aws-note" style="font-weight: bold; line-height: 1.5em; margin-top: 0.5em; padding: 0px;">
Note</div>
<div style="line-height: 1.5em; margin-top: 0.5em; padding: 0px;">
These lists are not exhaustive.</div>
</div>
</blockquote>
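Since these parameters are silently irrelevant on Aurora, it can be handy to audit an existing my.cnf before migrating. A minimal sketch (the parameter set comes from the quoted docs; the sample config content is hypothetical):

```python
# Parameters AWS documents as not applying to Aurora MySQL (list quoted above).
NOT_APPLICABLE = {
    "innodb_adaptive_flushing", "innodb_adaptive_flushing_lwm",
    "innodb_checksum_algorithm", "innodb_doublewrite", "innodb_flush_method",
    "innodb_flush_neighbors", "innodb_io_capacity", "innodb_io_capacity_max",
    "innodb_log_buffer_size", "innodb_log_file_size", "innodb_log_files_in_group",
    "innodb_max_dirty_pages_pct", "innodb_use_native_aio",
    "innodb_write_io_threads", "thread_cache_size",
}

def flag_ignored(my_cnf_text: str) -> list:
    """Return the settings in a my.cnf that Aurora will ignore."""
    flagged = []
    for line in my_cnf_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and section headers like [mysqld]
        name = line.split("=", 1)[0].strip()
        if name in NOT_APPLICABLE:
            flagged.append(name)
    return flagged

# Hypothetical my.cnf fragment
sample = "innodb_flush_method = O_DIRECT\nmax_connections = 500\n"
print(flag_ignored(sample))  # ['innodb_flush_method']
```

Remember the AWS note above: the lists are not exhaustive, so treat this as a first pass, not a complete audit.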
<br />
<br />
<br />
<br />
In summary, Aurora uses mySQL, but it's also a layer on top of mySQL. In essence, it's another storage engine that forks Innodb and provides management primitives built into the DBMS.<br />
<br />
<br />
In the coming weeks, I'll describe how and why we launch Aurora instances, as well as capture more differences that have not made it into this list.<br /><br /><br /><br />
<br />Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-59826495058337199572018-07-20T18:14:00.000-07:002018-07-20T18:16:05.620-07:00Hackathon process per week Sprints IdeaI like hackathons. Hackathons provide the freedom to build outside the process. The forced speed to deliver something to demo, and the fun self-deprecation of "ooh this is really ugly/bad TODO don't do this." in the source/commit logs, tells a great story. Also, as a great side effect, people are really interested in refactoring and fixing the code, especially if the demo went well.<br />
<br />
So I started thinking: what if we took this naturally formed, fun process and defined a weekly sprint, with a daily standup reporting on progress toward the product goal, using a hackathon method?<br />
<br />
<h3>
Day 1 and 2</h3>
"How much can you get done in two days for the demo"<br />
<br />
<ul>
<li>This portion is no more than an hour of planning. You talk to your team and divide up tasks for the hack you want to demo in two days. For instance, Johnny says "I'll write the service", Amanda says "I'll provide the data; it will be in MySQL", and Sammy says "I'll write the front end for the demo. Johnny, let's agree on what you'll send me; for now, I will simulate some faux data."</li>
<li>Then each person builds their part.</li>
<li>During the process, Johnny builds the interface from an un-authenticated HTTP GET request with a JSON response, to define what his service will return. Amanda finishes testing some queries for functionality and checks in her part: how to get the data, massage it, and which tables are what, NOT performance.</li>
<li>Johnny sends a sample interface to Sammy so some dynamic data can be injected into the mockup when Sammy requests data. They agreed on a REST API using GET with a JSON response.</li>
<li>There are PRs when sharing the same addition to the same place; otherwise, frequent merges.</li>
<li>When fixing something that made it into master, fix forward, so check into master :P</li>
<li>Each check-in should be filled with a series of TODO, FIXME or "TODO don't do this" statements for speed until that's not needed when you have a refined process.</li>
<li>Demo</li>
</ul>
<div>
<br /></div>
<div>
<b>What does the individual developer each get? </b></div>
<div>
Each developer produced something quickly to verify the viability of the idea, plus a vested interest in fixing the hacks and beautifying the code, reusing reusable parts, etc. </div>
<div>
<br /></div>
<div>
<b>What does the team get?</b></div>
<div>
The team feels that they got something out pretty quick, the team has some talking points of what to fix next and what systems the team envisions that could possibly be used in other parts of the code. Finally, the chance to learn something new in the knowledge transfer or the ability to fix an approach before going too far down the rabbit hole.</div>
<div>
<br /></div>
<h3>
Day 3 </h3>
<div>
The next day is for mapping out what the developer wants to refactor, has to change, and gets to delete, with knowledge transfer of the good, bad, and embarrassing things, and an idea of the direction each person took. It is fun.</div>
<div>
<ul>
<li>This is looking over the queries to make sure they make sense. </li>
<li>Are the correct indexes there? </li>
<li>Are we really answering the correct questions efficiently if not how can we? </li>
<li>What hacks do we need to undo to provide what we delivered?</li>
<li>How do I test this thing? I need to make sure before I refactor I have reproducible tests. </li>
</ul>
<h3>
Day 4 </h3>
<div>
Document, test, refactor; agree more as a team and focus on a code structure that enables adding the next round of features, while setting standards for the direction going forward, or revisiting them if need be.</div>
<div>
<br /></div>
<div>
<h3>
Day 5</h3>
</div>
<div>
Do more of the same or get a beer with the team.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
This process makes me feel that I am building something fast. The reason for the speed is to validate the idea or approach. Time is built into the process for testing, refactoring, and documenting, and the refactoring takes into account how to add new things faster: 50% building, 50% testing, documenting, refactoring, and making things better, producing a 4-day work week with daily standups.</div>
<div>
<br /></div>
<h3>
What about a really big project and delivering constantly</h3>
</div>
<div>
<ul>
<li>Whiteboard what is needed to deliver such as what the product is, what does it solve, what are the features. </li>
<li>Answer what is alpha</li>
<li>Answer what is beta.</li>
<li>Divide and conquer the vision for each "hackathon period"</li>
<li>Adjust projection of delivery based on the previous hackathon progress</li>
<li>Keep working and visit each hackathon period to verify the correct thing is built correctly.</li>
<li>Profit from a fun, fast-paced delivery of code that treats features and great code as equals, which the team all validates as equal partners.</li>
</ul>
<div>
<br /></div>
</div>
Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-13492312188357673362018-01-27T10:10:00.000-08:002018-01-27T10:10:08.680-08:00Spotify Top 200 in mySQLI do a lot of data analysis lately, trying to answer my company's pressing questions through data. Let's look at the past year of 2017 and answer questions for people who like music.<br />
<br />
<br />
<script src="https://gist.github.com/dathan/08c37c2daea1a7900c66a1755e75502a.js"></script>
<br />
artist is the artist name<br />
track is the artist's track name<br />
list_date is the chart date the track showed up on the top200<br />
streams is the number of plays following Spotify-specific rules<br />
<br />
Let's look at the data set<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">select count(*) from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01';</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">+----------+</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">| count(*) |</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">+----------+</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">| 74142 |</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">+----------+</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">1 row in set (0.04 sec)</span><br />
<br />
<br />
How many artists made it in the top200 for the United States?<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">mysql> select count(DISTINCT(artist)) from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01';</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">+-------------------------+</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">| count(DISTINCT(artist)) |</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">+-------------------------+</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">| 527 |</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">+-------------------------+</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">1 row in set (0.09 sec)</span><br />
<div>
<br /></div>
<div>
Wow, it's really hard to be a musician. Only 527 broke the top200.</div>
<div>
<br /></div>
<div>
How many tracks in 2017 broke the top200?</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> select count(DISTINCT(track)) from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01';</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+------------------------+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| count(DISTINCT(track)) |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+------------------------+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| 1682 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+------------------------+</span></div>
</div>
<div>
<br /></div>
<div>
For the entire year, 1682 songs defined the United States' listening habits for the most part.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Who showed up the most in the top200 for 2017?</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">mysql> select artist,count(*) AS CNT from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 1 order by 2 DESC LIMIT 10;</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+------------------+------+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| artist | CNT |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+------------------+------+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Drake | 3204 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Lil Uzi Vert | 1891 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Kendrick Lamar | 1874 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Post Malone | 1776 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Ed Sheeran | 1581 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| The Weeknd | 1566 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Migos | 1550 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Future | 1536 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| The Chainsmokers | 1503 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Kodak Black | 1318 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+------------------+------+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">10 rows in set (0.16 sec)</span></div>
</div>
<div>
<br /></div>
<div>
Drake killed it, but Lil Uzi Vert is the star of the year, IMHO. Drake has a pedigree while Lil Uzi just started running.</div>
<div>
<br /></div>
<div>
Also, from these artists I can tell hip hop dominated US charts; let's verify this assumption.</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">mysql> select artist,SUM(streams) AS CNT from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 1 order by 2 DESC LIMIT 10;</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+------------------+------------+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| artist | CNT |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+------------------+------------+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Drake | 1253877919 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Kendrick Lamar | 1161624639 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Post Malone | 954546910 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Lil Uzi Vert | 818889040 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Ed Sheeran | 714523363 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Migos | 682008192 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Future | 574005011 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| The Chainsmokers | 557708920 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| 21 Savage | 472043174 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Khalid | 463878924 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+------------------+------------+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">10 rows in set (0.48 sec)</span></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Yup, hip hop dominated the top 10 streams.</div>
<div>
<br /></div>
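The GROUP BY / SUM pattern in these queries is plain aggregation; as a sanity check, the same rollup can be sketched in a few lines of Python over hypothetical rows (artist names are real, the stream counts are made up):

```python
from collections import defaultdict

# Hypothetical (artist, track, streams) rows standing in for spotify.top200.
rows = [
    ("Drake", "Passionfruit", 1200),
    ("Drake", "Fake Love", 900),
    ("Kendrick Lamar", "HUMBLE.", 1500),
    ("Lil Uzi Vert", "XO TOUR Llif3", 1400),
]

# Equivalent of: SELECT artist, SUM(streams) AS CNT ... GROUP BY 1 ORDER BY 2 DESC
totals = defaultdict(int)
for artist, _track, streams in rows:
    totals[artist] += streams

top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(top)  # [('Drake', 2100), ('Kendrick Lamar', 1500), ('Lil Uzi Vert', 1400)]
```

Swapping the key from artist to track gives the per-track version of the same rollup, just like the track queries below.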
<div>
What about tracks? What are the top 10 tracks by streams?</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> select track,SUM(streams) AS CNT from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 1 order by 2 DESC LIMIT 10;</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+-------------------+-----------+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| track | CNT |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+-------------------+-----------+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| HUMBLE. | 340136186 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| XO TOUR Llif3 | 314758565 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Congratulations | 283551832 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Shape of You | 280898054 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Unforgettable | 261753940 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Mask Off | 242524530 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Despacito - Remix | 241370570 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| rockstar | 225517132 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Location | 224879215 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| 1-800-273-8255 | 219689749 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+-------------------+-----------+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">10 rows in set (0.43 sec)</span></div>
</div>
<div>
<br /></div>
<div>
Which tracks and artists had the most time in the top200?</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">select artist,track,count(*) AS CNT from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 2 order by 3 DESC LIMIT 10;</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+------------------+-------------------------------------+-----+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| artist | track | CNT |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+------------------+-------------------------------------+-----+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| D.R.A.M. | Broccoli (feat. Lil Yachty) | 485 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| French Montana | Unforgettable | 417 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| PnB Rock | Selfish | 394 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Travis Scott | goosebumps | 365 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Post Malone | Go Flex | 365 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Childish Gambino | Redbone | 365 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Post Malone | Congratulations | 365 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Post Malone | White Iverson | 365 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Migos | Bad and Boujee (feat. Lil Uzi Vert) | 364 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Bruno Mars | That's What I Like | 364 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+------------------+-------------------------------------+-----+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">10 rows in set (0.20 sec)</span></div>
</div>
<div>
<br /></div>
<div>
Also, from this data I can tell that Post Malone had a fantastic year!</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
So, more questions can be answered, like who held the number 1 position on the top200 the most?</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">select artist,track,count(*) AS CNT from spotify.top200 WHERE country='us' and pos=1 and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 2 order by 3 DESC LIMIT 10;</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+----------------+-------------------------------------+-----+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| artist | track | CNT |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+----------------+-------------------------------------+-----+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Post Malone | rockstar | 105 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Kendrick Lamar | HUMBLE. | 67 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Ed Sheeran | Shape of You | 48 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Luis Fonsi | Despacito - Remix | 47 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Migos | Bad and Boujee (feat. Lil Uzi Vert) | 29 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| 21 Savage | Bank Account | 20 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Drake | Passionfruit | 12 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Logic | 1-800-273-8255 | 10 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| Taylor Swift | Look What You Made Me Do | 10 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">| French Montana | Unforgettable | 7 |</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">+----------------+-------------------------------------+-----+</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">10 rows in set (0.26 sec)</span></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Wow, we can see here that Post Malone is the star!</div>
<div>
<br /></div>
<div>
In summary, pulling in public data sources and running simple queries can give clearer insight into the data and answer some pressing questions one may have.</div>
<div>
<br /></div>
<div>
With the schema above what questions would you answer?</div>
Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-26592695087650857672018-01-19T13:01:00.002-08:002018-01-19T13:01:32.168-08:00Deploying Go Applications in Docker Containers using a Scratch Docker FileProgramming in golang is fantastic. I find it fun, expressive, and simple for building concurrent programs. Deploying a golang app from laptop to production, however, is as hard now as it was back when I was building monolithic services. A great way to deploy nowadays is microservices in containers. Containers keep the environment between the laptop and an AWS Linux instance in sync, since in essence the environment is deployed and not just the code or binary.<br />
<div>
<br /></div>
<div>
Containerization of the environment is not ideal, though. Sometimes you end up shipping containers of 1GB in size or more. Deploying that across the LAN is OK; over the WAN, it is debatable. So, to deal with this problem, I work with scratch Dockerfiles when deploying applications.</div>
<div>
<br /></div>
<div>
Starting from scratch Dockerfiles, I know that there is no real environment overhead, since the environment is the most basic it can be. Additionally, I do not have to worry about the golang environment in my container, because we are not going to ship "golang and all its packages"; we are going to ship the binary itself. This is best described with an example.</div>
<div>
<br /></div>
<div>
<script src="https://gist.github.com/dathan/71ece26f1b1ac00be68d71e86b43d952.js"></script>
<br />
<br />
The Dockerfile is like a Makefile, but for your environment. Each line describes a step in building the image. Prior to executing the Dockerfile, we will need to set up the environment.<br />
<br />
GOOS=linux go build . <br />
<br />
This line builds the Go program as a Linux binary.<br />
<br />
docker build -t dathanvp/goprogram:latest .<br />
<br />
This line says: execute the Dockerfile and tag the image as dathanvp/goprogram:latest.<br />
<br />
docker run -p 8282:8282 -v /Users/dathan/gocode/src/github.com/dathanvp/goprogram/logs:/mnt:rw dathanvp/goprogram:latest<br />
<br />
Now, this is the magic. Docker will open port 8282 and map it to port 8282 in the container. A volume is attached from my laptop to the container's /mnt directory with read and write privileges. (When executing my container in production, only this line changes.) This volume keeps the logs persistent: containers reset state on restart, losing anything generated inside them, which is the reason for the volume. Finally, docker run runs my image dathanvp/goprogram.<br />
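To make the volume's purpose concrete, here is a tiny, hypothetical stand-in for goprogram (the real program is not shown in this post) that appends its log output under /mnt, falling back to the current directory when run outside the container:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"time"
)

// logPath builds the log file location; dir is "/mnt" inside the
// container thanks to the -v mount in docker run.
func logPath(dir string) string {
	return filepath.Join(dir, "goprogram.log")
}

func main() {
	dir := "/mnt"
	if _, err := os.Stat(dir); err != nil {
		dir = "." // fall back when running outside the container
	}
	// O_APPEND keeps earlier entries; the mounted volume makes them
	// survive container restarts.
	f, err := os.OpenFile(logPath(dir), os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	log.SetOutput(f)
	log.Printf("goprogram started at %s", time.Now().Format(time.RFC3339))
	fmt.Println("logged to", logPath(dir))
}
```

Because the binary is static and the log file lives on the mount, restarting the container loses nothing that matters.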
<br />
I deploy my containers to AWS by executing<br />
<br />
docker push dathanvp/goprogram<br />
<br />
This pushes my Go program's image from my laptop to cloud.docker.com, where my AWS instances can then pull it, enabling my programs to run in production without having to set up the environment on AWS (other than Docker, of course).<br />
<br />
Finally, why do it this way? I want my program to run on my laptop and on my AWS Ubuntu servers without having to keep golang development environments in sync. Additionally, I want my containers to be really small so I don't have to ship hundreds of megs around to start the application, which itself is about 13MB. Uploading over Comcast sucks. So, in conclusion, this is the best way I've found so far :)<br />
<br />
Please let me know how you ship Go applications and why.<br />
<br />
<br /></div>
Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-17452609261285257872017-12-11T13:32:00.001-08:002017-12-11T13:32:38.726-08:00Designing a RDBMS SQL Table<div>
<br /></div>
<div>
Building tables initially should not really require a lot of thought. What? Yes: I'm suggesting that when designing a table, you think of the table as a spreadsheet.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
For instance, let's create a single table that combines all the social scores of a user's media. We will call this table platform_resources.</div>
<div>
<br /></div>
<div>
What do we need to record the social score total of a single person?</div>
<div>
<ul>
<li>Who is this person? How do I know this persona?</li>
<li>What is the platform? Which social platform does this reference refer to?</li>
<li>What is the platform identifier? What is this user's identifier on that social platform?</li>
<li>What is a common social score for each user? Views, Likes, Comments.</li>
</ul>
<div>
<br /></div>
</div>
<script src="https://gist.github.com/dathan/6aadcdf9c9341bb19120cbfd03478fe2.js"></script>
<div>
<br /></div>
<div>
So the table above answers my questions. For each piece of media that an internal_name owns, I am able to collect a summary of basic stats. By no means is this optimized. The row size is roughly<br />
<br />
21+51+51+4+4+4+4+4+256+4 = 403 bytes, not taking into account the primary key, which is very large and incurs a small byte overhead due to exceeding an internal limit.<br />
<br />
We are not optimizing yet, we are just answering questions.<br />
<br />
<br />
The Primary Key was picked to be (platform_id, platform, internal_name). Following the leftmost-prefix rule for composite indexes, we have roughly 3 indexes in 1: the full primary key, (platform_id, platform), and finally (platform_id) alone. The primary key was picked this way because, for a given platform, the platform_id is unique, and the person who owns that platform_id should be represented. Additionally, since we are using InnoDB, the table is sorted by the primary key.<br />
<br />
No optimizations; just a basic table gets the job done. Now, how would you optimize this table?<br />
First, you should ask: what are you optimizing the table for? Disk size? Memory fit? Because it's ugly and it bothers you?<br />
<br />
Let's estimate how this table will grow. This table is an MxN problem, where each internal_name will have N resources per platform. The bounds of the growth are around 1,000 items per year per platform, and M is less than 20K, so it's really not worth optimizing just for the sake of it. So don't.<br />
<br />
If I had to optimize because the MxN problem turned into a huge overhead:<br />
First, I would reduce the row size of the table by making lookup tables for internal_name, platform, and platform_id, which keeps the primary key smaller - probably within 64 bits.<br />
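If that refactor ever became necessary, a lookup table might look something like this (an illustrative sketch only; the table and column names here are hypothetical, not taken from the schema in the gist above):

```sql
-- Hypothetical lookup table: replaces the wide internal_name string
-- column with a 4-byte surrogate key.
CREATE TABLE internal_names (
  internal_name_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  internal_name    VARCHAR(50)  NOT NULL,
  PRIMARY KEY (internal_name_id),
  UNIQUE KEY uk_internal_name (internal_name)
) ENGINE=InnoDB;

-- platform (and, if needed, platform_id) would get the same treatment,
-- so the fact table's composite primary key shrinks to a few small
-- integer columns.
```

The trade-off is an extra join (or a cached lookup) on every read and write, which is exactly why it's not worth doing until the MxN growth forces it.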
<br />
Next, I would distribute the table either by date_taken range, since queries will be more interested in the latest data, or by internal_name; but that is another post.<br />
<br />
Finally, sometimes you just need a table you can query, like: give me the total sum of views for all Instagram videos by a creator. The point of this post is to think about optimizations when you need to, and not beforehand. If your needs change, change the schema to focus on the optimization you are going for. :)<br />
<br />
<br />
<br /></div>
Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-68037812653197618292017-12-06T10:26:00.000-08:002017-12-06T10:30:07.020-08:00Back to Sharing stuff I learned<br />
I have not been regular with blog posts, as I've just been focused on everything else. I got lazy. Well, that is over.<br />
<br />
At Shots Studios, things have changed: Shots, a teen social network consisting of nearly 2M lines of code, is no more. Shots is now a one-stop shop for select Creators. We are a production studio, ad/talent agency, and talent-management media company focused on creating timeless content: a 21st-century answer to getting great content from great creators in front of their audience.<br />
<br />
Your internal monologue after reading this is probably: how does this have anything to do with MySQL, HA, scale, and coding? Either way, this is a good segue to explain how.<br />
<br />
Shots the app did really well, yet not well enough to compete with Snapchat and Instagram. We did gain a lot of insight, mainly around what are called influencers. A large percentage of the time spent growing the Shots platform went to handling their spiky scale. When influencers posted, they would promote their selfie on other platforms, sending waves of teens all at once to their data. Honestly, this was an amazing challenge to scale on a tight budget: cold to performant in milliseconds, with a sudden 600% increase in load/concurrency. The short answer to scaling this was to keep data in memory. From this, we understood that influencers' reach and ability to move users is more effective than display ads. Period.<br />
<br />
We did a huge analysis of our user base, and from that analysis, we made the decision to keep in memory all "Influencers" and the sticky users - the percentage of DAU that comes back with frequency. Next, to make sure that we did not saturate a network interface by keeping their data in memory on a single box, we replicated this subset of users among redundant pairs. Finally, we had to keep a higher than normal number of frontends in reserve to handle the sudden bursts without the startup delta of dynamically scaling pools.<br />
<br />
Now we use a subset of the tech we developed to mine and analyze data about Creators. Creators were influencers, but now they create, perform, direct, and edit content; thus they are called Creators. For instance, we use a custom, performant event-tracking system to monitor the social engagement of all Creators. If you have heard of a site called SocialBlade, I basically duplicated it at a much higher precision than their data.<br />
<br />
With this, we are able to tell which of a creator's content strikes a chord with users, and then we produce more of that performant content. For instance, <a href="https://shots.com/superheroes">https://shots.com/superheroes</a>. With this insight, analysis, and data collection, and by maximizing reach channels on platforms like YouTube and Instagram on a shoestring budget, we are making data-rich, informed decisions.<br />
<br />Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-8887191012474347992017-05-23T17:29:00.000-07:002017-05-24T08:46:20.250-07:00Golang (Go) and BoltDBI've been using Go for some time now (3 years) and I am constantly impressed with the language's ease of use. I originally started my career in C-Unix System Programming, then Java, then PHP and now I am rather language agnostic. Out of all the languages I know, go is the most fun and there is a strong community behind it.<br />
<br />
<br />
<a href="https://github.com/boltdb/bolt" target="_blank">BoltDB</a> is yet another NoSQL Key-Value store, designed to be embedded and I happened across it for a small use case. I use GO to crawl sites and parse HTML DOM in a very concurrent manner to gather data for analysis from a variety of remote web sources. BoltDB is used to keep state as I transfer from my local mac book to a remote server and it is very easy to use. Basically, I needed a portable embedded database that is fast and resilient without setting up MySQL and keeping the schema in sync between dev and production. This is not user facing just a set of go packages that help me keep state so I can know where to pick up from in case of some sort of error, like I turn off my laptop or some random panic.<br />
<br />
<br />
Let's look at BoltDB usage. Below is my struct; everything is a string because I am not formatting or typing the fields yet.<br />
<br />
<br />
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><pre style="font-family: Menlo; font-size: 9pt;"><span style="color: #55ff55; font-weight: bold;">type </span><span style="color: #f0ed11; font-weight: bold;">TableRow </span><span style="color: #55ff55; font-weight: bold;">struct </span><span style="color: #ff5555; font-weight: bold;">{</span><span style="color: #ff5555; font-weight: bold;"> </span></pre>
<pre style="font-family: Menlo; font-size: 9pt;"><span style="color: #55ffff;">
</span></pre>
<pre style="font-family: Menlo; font-size: 9pt;"><span style="color: #55ffff;"> Title </span><span style="font-weight: bold;">string</span><span style="font-weight: bold;"> </span></pre>
<pre style="font-family: Menlo; font-size: 9pt;"><span style="color: #55ffff;"> Time </span><span style="font-weight: bold;">string</span><span style="font-weight: bold;"> </span></pre>
<pre style="font-family: Menlo; font-size: 9pt;"><span style="color: #55ffff;"> Anchor </span><span style="font-weight: bold;">string</span><span style="font-weight: bold;"> </span></pre>
<pre style="font-family: Menlo; font-size: 9pt;"><span style="color: #55ffff;"> Price </span><span style="font-weight: bold;">string</span><span style="font-weight: bold;"> </span></pre>
<pre style="font-family: Menlo; font-size: 9pt;"><span style="color: #55ffff;"> Notified </span><span style="font-weight: bold;">string </span><span style="color: #55ffff;">// could make this a Time Struct but let's be simple</span></pre>
<pre style="font-family: Menlo; font-size: 9pt;"><span style="color: #ff5555; font-weight: bold;">}</span></pre>
</pre>
<br />
<br />
Next, I create my.db if it doesn't exist. The function <b>check</b> panics if an error occurred. The line <b>defer db.Close()</b> will close the db at the end of the function these calls are made from. The function <b>addRecord</b> will create a bucket called parser_bucket (a const) if this is the first run, and add the key (a byte slice) with its value. It is something quick to make a point, and yes, there are more efficient ways to do this.<br />
<br />
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><pre style="font-family: Menlo; font-size: 9pt;"><span style="color: #55ffff;">db</span><span style="color: #ffff55; font-weight: bold;">, </span><span style="color: #55ffff;">err </span><span style="color: #ffff55; font-weight: bold;">:= </span>bolt<span style="color: #ffff55; font-weight: bold;">.</span>Open<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: white;">"my.db"</span><span style="color: #ffff55; font-weight: bold;">, </span><span style="color: #ff55ff;">0644</span><span style="color: #ffff55; font-weight: bold;">, &</span>bolt<span style="color: #ffff55; font-weight: bold;">.</span><span style="color: #f0ed11; font-weight: bold;">Options</span><span style="color: #ff5555; font-weight: bold;">{</span><span style="color: #55ffff;">Timeout</span>: <span style="color: #ff55ff;">10 </span><span style="color: #ffff55; font-weight: bold;">* </span>time<span style="color: #ffff55; font-weight: bold;">.</span><span style="color: #8251bb;">Second</span><span style="color: #ff5555; font-weight: bold;">})</span></pre>
<pre style="font-family: Menlo; font-size: 9pt;">check<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ffff;">err</span><span style="color: #ff5555; font-weight: bold;">)</span><span style="color: #55ff55; font-weight: bold;">
</span></pre>
<pre style="font-family: Menlo; font-size: 9pt;"><span style="color: #55ff55; font-weight: bold;">defer </span><span style="color: #55ffff;">db</span><span style="color: #ffff55; font-weight: bold;">.</span>Close<span style="color: #ff5555; font-weight: bold;">()</span><span style="color: #ff5555; font-weight: bold;">
</span>addRecord<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ffff;">db</span><span style="color: #ffff55; font-weight: bold;">, </span><span style="color: #ff5555; font-weight: bold;">[]</span><span style="font-weight: bold;">byte</span><span style="color: #ff5555; font-weight: bold;">(</span><span style="color: white;">"start"</span><span style="color: #ff5555; font-weight: bold;">)</span><span style="color: #ffff55; font-weight: bold;">, </span><span style="color: white;">"starting"</span><span style="color: #ff5555; font-weight: bold;">) </span><span style="color: #55ffff;">// create bucket when it doesn't exist</span></pre>
</pre>
<br />
<br />
The function addRecord takes 3 arguments: db, the BoltDB handle; key, a byte slice; and v, a value which can be anything - in our case, the TableRow struct above. The function name is lowercase, so it is not "public" (exported). The interface value v is marshaled into a byte slice and stored in BoltDB after the function checks that the bucket is created. Finally, addRecord returns an error if an error occurred.<br />
<br />
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #55ff55; font-weight: bold;">func </span>addRecord<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ffff;">db </span><span style="color: #ffff55; font-weight: bold;">*</span>bolt<span style="color: #ffff55; font-weight: bold;">.</span><span style="color: #f0ed11; font-weight: bold;">DB</span><span style="color: #ffff55; font-weight: bold;">, </span><span style="color: #55ffff;">key </span><span style="color: #ff5555; font-weight: bold;">[]</span><span style="font-weight: bold;">byte</span><span style="color: #ffff55; font-weight: bold;">, </span><span style="color: #55ffff;">v </span><span style="color: #55ff55; font-weight: bold;">interface</span><span style="color: #ff5555; font-weight: bold;">{}) </span><span style="font-weight: bold;">error </span><span style="color: #ff5555; font-weight: bold;">{</span><span style="color: #ff5555; font-weight: bold;">
</span><span style="color: #ff5555; font-weight: bold;"> </span><span style="color: #55ffff;">value</span><span style="color: #ffff55; font-weight: bold;">, </span><span style="color: #55ffff;">err </span><span style="color: #ffff55; font-weight: bold;">:= </span>json<span style="color: #ffff55; font-weight: bold;">.</span>Marshal<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ffff;">v</span><span style="color: #ff5555; font-weight: bold;">)</span><span style="color: #ff5555; font-weight: bold;">
</span><span style="color: #ff5555; font-weight: bold;"> </span>check<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ffff;">err</span><span style="color: #ff5555; font-weight: bold;">)</span><span style="color: #ff5555; font-weight: bold;">
</span><span style="color: #ff5555; font-weight: bold;"> </span><span style="color: #55ff55; font-weight: bold;">return </span><span style="color: #55ffff;">db</span><span style="color: #ffff55; font-weight: bold;">.</span>Update<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ff55; font-weight: bold;">func</span><span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ffff;">tx </span><span style="color: #ffff55; font-weight: bold;">*</span>bolt<span style="color: #ffff55; font-weight: bold;">.</span><span style="color: #f0ed11; font-weight: bold;">Tx</span><span style="color: #ff5555; font-weight: bold;">) </span><span style="font-weight: bold;">error </span><span style="color: #ff5555; font-weight: bold;">{</span><span style="color: #ff5555; font-weight: bold;"> </span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #55ffff;"> bkt</span><span style="color: #ffff55; font-weight: bold;">, </span><span style="color: #55ffff;">err </span><span style="color: #ffff55; font-weight: bold;">:= </span><span style="color: #55ffff;">tx</span><span style="color: #ffff55; font-weight: bold;">.</span>CreateBucketIfNotExists<span style="color: #ff5555; font-weight: bold;">([]</span><span style="font-weight: bold;">byte</span><span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #8251bb;">bucket</span><span style="color: #ff5555; font-weight: bold;">))</span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #ff5555; font-weight: bold;"> </span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #55ff55; font-weight: bold;"> if </span><span style="color: #55ffff;">err </span><span style="color: #ffff55; font-weight: bold;">!= </span><span style="font-weight: bold;">nil </span><span style="color: #ff5555; font-weight: bold;">{</span><span style="color: #ff5555; font-weight: bold;"> </span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #55ff55; font-weight: bold;"> return </span><span style="color: #55ffff;">err</span><span style="color: #55ffff;"> </span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #ff5555; font-weight: bold;"> }</span><span style="color: #ff5555; font-weight: bold;"> </span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #ff5555; font-weight: bold;"> </span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"> fmt<span style="color: #ffff55; font-weight: bold;">.</span>Printf<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: white;">"Adding KEY %s\n"</span><span style="color: #ffff55; font-weight: bold;">, </span><span style="color: #55ffff;">key</span><span style="color: #ff5555; font-weight: bold;">)</span><span style="color: #ff5555; font-weight: bold;"> </span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #55ff55; font-weight: bold;"> return </span><span style="color: #55ffff;">bkt</span><span style="color: #ffff55; font-weight: bold;">.</span>Put<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ffff;">key</span><span style="color: #ffff55; font-weight: bold;">, </span><span style="color: #55ffff;">value</span><span style="color: #ff5555; font-weight: bold;">)</span><span style="color: #ff5555; font-weight: bold;"> </span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #ff5555; font-weight: bold;"> })</span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #ff5555; font-weight: bold;">}</span></pre>
<br />
<br />
To get a TableRow out of the database, a read transaction is performed in BoltDB. This function is capitalized, so it is public (exported from the package). GetRecord returns a table row, or panics if an error occurred.<br />
<br />
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #55ff55; font-weight: bold;">func </span>GetRecord<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ffff;">db </span><span style="color: #ffff55; font-weight: bold;">*</span>bolt<span style="color: #ffff55; font-weight: bold;">.</span><span style="color: #f0ed11; font-weight: bold;">DB</span><span style="color: #ffff55; font-weight: bold;">, </span><span style="color: #55ffff;">key </span><span style="font-weight: bold;">string</span><span style="color: #ff5555; font-weight: bold;">) </span><span style="color: #ffff55; font-weight: bold;">*</span><span style="color: #f0ed11; font-weight: bold;">TableRow </span><span style="color: #ff5555; font-weight: bold;">{</span><span style="color: #ff5555; font-weight: bold;">
</span><span style="color: #ff5555; font-weight: bold;"> </span><span style="color: #55ffff;">row </span><span style="color: #ffff55; font-weight: bold;">:= </span><span style="color: #f0ed11; font-weight: bold;">TableRow</span><span style="color: #ff5555; font-weight: bold;">{}</span><span style="color: #ff5555; font-weight: bold;"> </span><span style="color: #55ffff;">err </span><span style="color: #ffff55; font-weight: bold;">:= </span><span style="color: #55ffff;">db</span><span style="color: #ffff55; font-weight: bold;">.</span>View<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ff55; font-weight: bold;">func</span><span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ffff;">tx </span><span style="color: #ffff55; font-weight: bold;">*</span>bolt<span style="color: #ffff55; font-weight: bold;">.</span><span style="color: #f0ed11; font-weight: bold;">Tx</span><span style="color: #ff5555; font-weight: bold;">) (</span><span style="font-weight: bold;">error</span><span style="color: #ff5555; font-weight: bold;">) {</span><span style="color: #ff5555; font-weight: bold;">
</span><span style="color: #ff5555; font-weight: bold;"> </span><span style="color: #55ffff;">bkt </span><span style="color: #ffff55; font-weight: bold;">:= </span><span style="color: #55ffff;">tx</span><span style="color: #ffff55; font-weight: bold;">.</span>Bucket<span style="color: #ff5555; font-weight: bold;">([]</span><span style="font-weight: bold;">byte</span><span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #8251bb;">bucket</span><span style="color: #ff5555; font-weight: bold;">))</span><span style="color: #ff5555; font-weight: bold;"> </span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #55ff55; font-weight: bold;"> if </span><span style="color: #55ffff;">bkt </span><span style="color: #ffff55; font-weight: bold;">== </span><span style="font-weight: bold;">nil </span><span style="color: #ff5555; font-weight: bold;">{</span><span style="color: #ff5555; font-weight: bold;"> </span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #55ff55; font-weight: bold;"> return </span>fmt<span style="color: #ffff55; font-weight: bold;">.</span>Errorf<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: white;">"Bucket %q not found!\n"</span><span style="color: #ffff55; font-weight: bold;">, </span><span style="color: #8251bb;">bucket</span><span style="color: #ff5555; font-weight: bold;">)</span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #ff5555; font-weight: bold;"> }</span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #ff5555; font-weight: bold;">
</span><span style="color: #ff5555; font-weight: bold;"> </span><span style="color: #55ffff;">val </span><span style="color: #ffff55; font-weight: bold;">:= </span><span style="color: #55ffff;">bkt</span><span style="color: #ffff55; font-weight: bold;">.</span>Get<span style="color: #ff5555; font-weight: bold;">([]</span><span style="font-weight: bold;">byte</span><span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ffff;">key</span><span style="color: #ff5555; font-weight: bold;">))</span><span style="color: #ff5555; font-weight: bold;">
</span><span style="color: #ff5555; font-weight: bold;"> </span><span style="color: #55ff55; font-weight: bold;">if </span>len<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ffff;">val</span><span style="color: #ff5555; font-weight: bold;">) </span><span style="color: #ffff55; font-weight: bold;">== </span><span style="color: #ff55ff;">0 </span><span style="color: #ff5555; font-weight: bold;">{</span><span style="color: #ff5555; font-weight: bold;"> </span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"> fmt<span style="color: #ffff55; font-weight: bold;">.</span>Printf<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: white;">"key %s does not exist\n"</span><span style="color: #ffff55; font-weight: bold;">, </span><span style="color: #55ffff;">key</span><span style="color: #ff5555; font-weight: bold;">)</span><span style="color: #ff5555; font-weight: bold;"> </span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #55ff55; font-weight: bold;"> return </span><span style="font-weight: bold;">nil</span><span style="font-weight: bold;"> </span></pre>
<pre style="background-color: black; color: #bbbbbb; font-family: 'Menlo'; font-size: 9.0pt;"><span style="color: #ff5555; font-weight: bold;"> }</span><span style="color: #ff5555; font-weight: bold;">
</span><span style="color: #ff5555; font-weight: bold;"> </span><span style="color: #55ffff;">err </span><span style="color: #ffff55; font-weight: bold;">:= </span>json<span style="color: #ffff55; font-weight: bold;">.</span>Unmarshal<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ffff;">val</span><span style="color: #ffff55; font-weight: bold;">, &</span><span style="color: #55ffff;">row</span><span style="color: #ff5555; font-weight: bold;">)</span><span style="color: #ff5555; font-weight: bold;">
</span><span style="color: #ff5555; font-weight: bold;"> </span><span style="color: #55ff55; font-weight: bold;">return </span><span style="color: #55ffff;">err</span><span style="color: #55ffff;">
</span><span style="color: #55ffff;"> </span><span style="color: #ff5555; font-weight: bold;">})</span><span style="color: #ff5555; font-weight: bold;">
</span><span style="color: #ff5555; font-weight: bold;"> </span>check<span style="color: #ff5555; font-weight: bold;">(</span><span style="color: #55ffff;">err</span><span style="color: #ff5555; font-weight: bold;">)</span><span style="color: #ff5555; font-weight: bold;">
</span><span style="color: #ff5555; font-weight: bold;"> </span><span style="color: #55ff55; font-weight: bold;">return </span><span style="color: #ffff55; font-weight: bold;">&</span><span style="color: #55ffff;">row</span><span style="color: #55ffff;">
</span><span style="color: #ff5555; font-weight: bold;">}</span></pre>
<br />
Calling this function returns a pointer to a TableRow. Go does have real pointers (it just disallows pointer arithmetic), so the caller and the function share the same underlying struct.<br />
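A small sketch of those pointer semantics (simplified TableRow with hypothetical fields, not the real struct):<br />

```go
package main

import "fmt"

// TableRow is a cut-down stand-in for the struct GetRecord fills in.
type TableRow struct {
	Key   string
	Value string
}

// getRecord mirrors the shape of GetRecord above: it fills a local
// struct and returns its address. The struct escapes to the heap, so
// the pointer stays valid after the function returns.
func getRecord(key string) *TableRow {
	row := TableRow{Key: key, Value: "from-bolt"}
	return &row
}

func main() {
	r := getRecord("user:1")
	r.Value = "mutated" // the caller mutates the same underlying struct
	fmt.Println(r.Key, r.Value)
}
```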
<br />
This is it. This is all there really is to BoltDB: read transactions and write transactions that are concurrency-safe. You can even run the Unix command <b>strings</b> on the database file to see whether you stored the data correctly as a sanity check; if JSON is your serializer, you should see JSON in the output.<br />
<br />
In conclusion, BoltDB is fast, has been safe so far, and does exactly what I need: store state without requiring an external database. Embedded databases are awesome and Go is awesome. Give it a try.<br />
<br />
<br />Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-17665142666610181132016-11-22T12:30:00.002-08:002017-12-14T20:11:38.412-08:00INNODB Tablespace Copy in Go LangI just uploaded a quick tool that I think you will find useful if you need to consolidate or expand InnoDB databases when file-per-table tablespaces are in use.<br />
<div>
<br />
This golang application will copy an entire innodb database from one server to another server via scp.<br />
innodb-tablespace-copy follows the algorithm described <a href="http://dev.mysql.com/doc/refman/5.6/en/innodb-transportable-tablespace-examples.html" target="_blank">here</a>. After setting up the remote environment, this golang application copies 4 tables in parallel, then imports the tablespaces in parallel. I've only used it on Percona XtraDB 5.6, but it should work for any flavor of InnoDB out there.<br />
<br /></div>
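The bounded-parallelism pattern the tool uses can be sketched with a counting semaphore; copyTable here is a stub standing in for the real scp and IMPORT TABLESPACE steps, and the table names are hypothetical:<br />

```go
package main

import (
	"fmt"
	"sync"
)

// copyTable stands in for the real work: scp the table's .ibd file to
// the remote host, then run ALTER TABLE ... IMPORT TABLESPACE there.
func copyTable(name string) {
	fmt.Println("copied", name)
}

// copyAll runs at most `workers` copies at once (the tool uses 4),
// using a buffered channel as a counting semaphore. It returns the
// number of tables processed.
func copyAll(tables []string, workers int) int {
	sem := make(chan struct{}, workers)
	var wg sync.WaitGroup
	var mu sync.Mutex
	done := 0
	for _, t := range tables {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` copies are in flight
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }()
			copyTable(name)
			mu.Lock()
			done++
			mu.Unlock()
		}(t)
	}
	wg.Wait()
	return done
}

func main() {
	n := copyAll([]string{"users", "photos", "likes", "comments", "sessions"}, 4)
	fmt.Println(n, "tables copied")
}
```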
<div>
Note: recovery from an interruption is manual, either by discarding the tablespace or by dropping the remote database.<br />
<br />
Feel free to add to it and make it better :)</div>
<div>
<br />
<div>
<a href="https://github.com/dathan/innodb-tablespace-copy" target="_blank">https://github.com/dathan/innodb-tablespace-copy</a></div>
</div>
Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-39746031836834643922016-08-10T19:03:00.001-07:002016-08-10T19:04:36.601-07:00Tech Stack at Shots Quick Post<div>
At the Shots app we use the following technology to serve many millions of photos, videos and cached links.</div>
<div>
<br /></div>
<div>
LAMP</div>
<div>
<br /></div>
<div>
Red Hat Enterprise Linux 6 on the front ends and DBs. Amazon Linux (CentOS-based) on the Elasticsearch and Go servers</div>
<div>
Apache 2+</div>
<div>
Percona 5.6 XTRADB with some minor custom stuff (sharded)</div>
<div>
PHP</div>
<div>
<br /></div>
<div>
We have a little bit of Python and Java, and a lot of Go!</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
One current feature that has been wildly successful is sharing links on mobile, which is very hard. Mobile is not built for links, but fortunately Instagram and YouTube are. To make this work, we have the client read from the clipboard. The client makes a call home, where the link is sent to a distributed worker system that fetches the HTML page, finds the media, manipulates it, and then distributes it on our CDN. Links only last for a few days.</div>
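The "finds the media" step can be sketched as pulling an og:image tag out of the fetched page. This is an illustrative sketch, not our production code: findMedia is a made-up name, real pages want a proper HTML parser, and attribute order can vary in the wild:<br />

```go
package main

import (
	"fmt"
	"regexp"
)

// ogImage matches a well-formed og:image meta tag and captures the
// media URL. A production worker would fetch the page over HTTP and
// walk the DOM instead of trusting a regexp.
var ogImage = regexp.MustCompile(`property="og:image"\s+content="([^"]+)"`)

// findMedia returns the first og:image URL in the page, or "".
func findMedia(html string) string {
	if m := ogImage.FindStringSubmatch(html); m != nil {
		return m[1]
	}
	return ""
}

func main() {
	page := `<meta property="og:image" content="https://example.com/photo.jpg"/>`
	fmt.Println(findMedia(page))
}
```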
<div>
<br /></div>
<div>
This is like a poor man's <a href="https://www.ampproject.org/" target="_blank">AMP</a> and only took us a few days to write. We even re-transcode videos to make sure the format fits our timeline and doesn't hog too much bandwidth.</div>
<div>
<br /></div>
<div>
MySQL keeps state so the same link is not rebuilt, and everything is fronted with Redis; Redis supports pipelined commands, which is great for a feed our size. The next feed version will be a TAO-like system written in Go.</div>
<div>
<br /></div>
<div>
All in all, for 3 days of work the system works great and scales linearly. It is near real time. Give it a try: link an Instagram or YouTube URL and you will see for yourself.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Some things I'd like to do in the future: use QUIC under a websocket layer to get a non-blocking messaging system that is blazing fast and works on spotty networks, and integrate RocksDB.</div>
<div>
<br /></div>
<div>
But that's another post.</div>
<div>
<br /></div>
<div>
<br /></div>
Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-9509082428489697232015-09-28T13:29:00.000-07:002015-09-28T13:29:14.684-07:00Wish there is another String DataType called LIST but there is notI believe the future of SQL is to take primitives that are computer science fundamentals and add them as datatypes, expanding the column types allowed today. The idea is a merging of NoSQL and SQL, making problems easier for a new developer to solve.<br />
<br />
<br />
For instance, what would be awesome is a LIST type, where the list contains a distinct set of string items each mapped to a bit, much like SET, except you wouldn't need to predefine all the items in the set.<br />
<br />
<br />
Here is a good example as how I would use a list type:<br />
<br />
<br />
Imagine you need permissions on a per-row basis. Some rows are public, some are private, some are viewable by a small set of people (fewer than 64).<br />
<br />
Let's take the example of finding all rows that are public or viewable by me.<br />
<br />When creating a row<br />
<br />
INSERT INTO resource_permissions (resource_id, perm_bit, list_dt) VALUES(1, 2, "dathan, sam, fred")<br />
<br />
perm_bit: 0 = private, 1 = public, 2 = visible to a list of people<br />
<br />
When selecting rows that I "dathan" can see<br />
<br />
SELECT resource_id FROM resource_permissions WHERE perm_bit = 1 UNION SELECT resource_id FROM resource_permissions WHERE perm_bit = 2 AND FIND_IN_LIST(list_dt, "dathan");<br />
<br />
<br />
What the above statement says is give me all the public resource_ids and resource_ids that I "dathan" can see.<br />
<br />
<br />
Right now I can't do this; I have to use a MEDIUMBLOB and a LIKE:<br />
<br />
SELECT resource_id FROM resource_permissions WHERE perm_bit = 1 UNION SELECT resource_id FROM resource_permissions WHERE perm_bit = 2 AND list_dt LIKE "%:dathan:%"<br />
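Worth noting: MySQL already ships FIND_IN_SET, which is close to the FIND_IN_LIST I'm wishing for, provided list_dt is stored comma-separated. It still can't use an index, so it doesn't solve the core problem, but it avoids the delimiter gymnastics of the LIKE pattern:<br />

```sql
-- assumes list_dt stores a comma-separated list such as 'dathan,sam,fred'
SELECT resource_id FROM resource_permissions WHERE perm_bit = 1
UNION
SELECT resource_id FROM resource_permissions
 WHERE perm_bit = 2 AND FIND_IN_SET('dathan', list_dt);
```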
<br />
<br />
As you can see, I'm able to simulate the desired behavior but I can't use an index. I don't want to use a FULLTEXT index due to overhead and other issues that are out of scope for this post, nor do I want to manage UDFs or stored procedures; those last two could also simulate the behavior I'm looking for, but neither is desirable.<br />
<br />
<br />
Some primitives from Redis or other NoSQL solutions would be awesome additions for SQL as a whole, IMHO.<br />
<br />
<br />
My two cents.<br />
<br />
Also, in 5.7 the <a href="https://dev.mysql.com/doc/refman/5.7/en/json.html" target="_blank">JSON</a> column type might be of some use.<br />
<br />
<br />Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-39959600918532851812015-09-22T17:38:00.002-07:002015-09-22T17:38:42.867-07:00Golang websockets (wss) and "OOP"Golang is awesome. My 1st Language back in 1994 was C. Then the following year my Computer Science Dept. switched from C/Pascal to C++. I even like C++ but I like C more mainly because of nostalgia.<br />
<br />
Enter Go. The syntax for me is a mix between JSON and C. I love it. I've created 3 new servers, all doing a ton of TPS. What I would like to share with you is some Go code that handles websockets.<br />
<br />
<br />
If you are building a server using websockets over secure TCP, your browser behaves slightly differently from a client-side application using a websocket library, specifically when working with wss (secure websockets) across domains.<br />
<br />
It's up to the client to respect Origin: a client application implementing a websocket doesn't have to set the Origin header, but your browser does. This is done on purpose and it's a good thing. To get websockets working over secure sockets, let's make our assumptions consistent and skip Origin errors by overriding the handshake method on the fly. The power of Go.<br />
<br />
<pre style="background-color: black; color: #d0d0d0; font-family: 'Menlo'; font-size: 12.0pt;">
<span style="color: #00e000;">// going to override the handshake</span><span style="color: #79abff;">server </span>:= websocket.<span style="color: #ff8080;">Server</span>{
Handshake:<span style="color: #00d0d0;">func</span>(<span style="color: #79abff;">config </span>*websocket.<span style="color: #ff8080;">Config</span>, <span style="color: #79abff;">req </span>*http.<span style="color: #ff8080;">Request</span>) <span style="color: #ff8080;">error </span>{
<span style="color: #00d0d0;">return </span><span style="color: #ff8080;">nil</span>;
},
Handler:websocket.<span style="color: #ff8080;">Handler</span>(<span style="color: #79abff;">nsp</span>.<span style="color: #79abff;">handle</span>),
}</pre>
<br />
The above overrides the Handshake method in the library (golang.org/x/net/websocket) with the supplied local function, returning nil for error, which means all is good.<br />
<br />
Any time Origin is sent to the server (non-browser clients, and even the browser, don't have to send it), the origin check in the handshake is skipped.<br />
<br />
<pre style="background-color: black; color: #d0d0d0; font-family: 'Menlo'; font-size: 12.0pt;">
http.<span style="color: #bed6ff;">Handle</span>(<span style="color: #79abff;">nsp</span>.<span style="color: #79abff;">path</span>, websocket.<span style="color: #ff8080;">Handler</span>(<span style="color: #79abff;">server</span>.Handler));</pre>
<pre style="background-color: black; color: #d0d0d0; font-family: 'Menlo'; font-size: 12.0pt;">
</pre>
<br />
Next we handle the websocket with the supplied handler on the server, nsp.handle, a method that takes in a websocket connection. nsp.path means: for a given HTTP path, execute the handler.<br />
<br />
<br />
This is awesome. Everything works, but what is cooler is how Go handles OOP. The term used in Go is embedding, and changing the executed method (method overriding) is called shadowing.<br />
<br />
<br />
Here is an example<br />
<br />
<pre style="background-color: black; color: #d0d0d0; font-family: 'Menlo'; font-size: 12.0pt;"><span style="color: #00d0d0;">package </span>main
<span style="color: #00d0d0;">import </span><span style="color: #dc78dc;">"datarepo"</span><span style="color: #dc78dc;">
</span><span style="color: #00d0d0;">type </span><span style="color: #ff8080;">DataLayer </span><span style="color: #00d0d0;">struct </span>{
datarepo.<span style="color: #ff8080;">DataRepoAccess</span>}
<span style="color: #00e000;">//</span><span style="color: #00e000;">//https://github.com/luciotato/golang-notes/blob/master/OOP.md#golang-embedding-is-akin-to-multiple-inheritance-with-non-virtual-methods</span><span style="color: #00e000;">//</span><span style="color: #00d0d0;">func </span><span style="color: #bed6ff;">NewDataLayer</span>(<span style="color: #79abff;">subject </span><span style="color: #ff8080;">string</span>, <span style="color: #79abff;">class </span><span style="color: #ff8080;">string </span>) <span style="color: #ff8080;">DataLayer </span>{
<span style="color: #79abff;">ret </span>:= <span style="color: #ff8080;">DataLayer</span>{ datarepo.<span style="color: #ff8080;">DataRepoAccess</span>{Subject: <span style="color: #79abff;">subject</span>, Classof: <span style="color: #79abff;">class</span>}}
<span style="color: #79abff;">ret</span>.<span style="color: #bed6ff;">New</span>();
<span style="color: #00d0d0;">return </span><span style="color: #79abff;">ret</span>;
}
<span style="color: #00e000;">//</span><span style="color: #00e000;">// wrapper method to add in an counter</span><span style="color: #00e000;">//</span><span style="color: #00d0d0;">func</span><span style="color: #79abff;">(dl *DataLayer) </span><span style="color: #bed6ff;">Execute</span>() ([]<span style="color: #ff8080;">byte</span>, <span style="color: #ff8080;">error</span>){ <span style="color: #00e000;">// shadowed</span><span style="color: #00e000;"> </span><span style="color: #79abff;">Reporter</span>.<span style="color: #bed6ff;">increment</span>(<span style="color: #dc78dc;">"api_layer_cmd"</span>, <span style="color: yellow;">1</span>)
<span style="color: #00d0d0;">var </span><span style="color: #79abff;">base </span>= <span style="color: #79abff;">dl</span>.DataRepoAccess;
<span style="color: #00d0d0;">return </span><span style="color: #79abff;">base</span>.<span style="color: #bed6ff;">Execute</span>()
}</pre>
<br />
<br />
DataLayer is a wrapper design pattern around datarepo.DataRepoAccess, a structure I wrote that handles talking to the backend. datarepo.DataRepoAccess has a method called Execute. In the example above Execute is "shadowed", or overridden; the new method counts the number of times the base method is called.<br />
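A minimal, self-contained version of the same embedding-plus-shadowing pattern (Base and Wrapper are illustrative names, not the real types above):<br />

```go
package main

import "fmt"

type Base struct{}

func (Base) Execute() string { return "base" }

type Wrapper struct {
	Base  // embedded: Wrapper inherits Base's methods
	calls int
}

// Execute shadows Base.Execute: the method on the outer type wins.
// Like DataLayer above, it adds bookkeeping and delegates down.
func (w *Wrapper) Execute() string {
	w.calls++ // the counter the wrapper exists for
	return w.Base.Execute()
}

func main() {
	w := &Wrapper{}
	fmt.Println(w.Execute(), w.calls) // base 1
}
```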
<br />
<br />
These months of coding Go have been so much fun. I love learning new things while still getting my work done on time; Go enables me to do both. Learning Go is like learning to snowboard: in the beginning it's like getting your ass smacked with a cold wet shovel, but once you get it you've got it.<br />
<br />Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-33404222757797348102015-08-05T10:51:00.001-07:002015-08-26T16:26:07.018-07:00San Francisco mySQL Meetup August 26 2015<h1 itemprop="name" style="background-color: white; color: rgba(0, 0, 0, 0.952941); font-family: Whitney, helvetica, arial, sans-serif; font-size: 2.375rem; letter-spacing: -0.75px; line-height: 1.1; margin: 0px; padding: 0px 0px 9px;">
<a href="http://www.meetup.com/sf-mysql/events/224404078/" target="_blank">Shots Architecture and how we handle extreme load spikes</a></h1>
<div>
<br /></div>
<div>
I invite you to come out and join me in a talk about the above heading. I will walk through what technology is used, where, why, and how. The event information is located <a href="http://www.meetup.com/sf-mysql/events/224404078/" target="_blank">here</a>. I'll also touch on how cost is reduced, how we handle celebrity load spikes when they promote, and what's next to make the system even more automatic and solid.</div>
<div>
<br /></div>
<div>
Thanks to sfmysql.org for all the work they do and for allowing me to give a talk.</div>
<div>
<br /></div>
Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-1761597625859429972015-06-01T19:19:00.002-07:002015-06-01T19:19:27.454-07:00Reporting Across ShardsIf you have chosen to split your data across boxes and architected your app to not query across boxes, there are still cases where you will need to. Data mining, reports, and data health checks require hitting all servers at some point. The case I am going over is sessions: figuring out the average session length without taking an average of averages, which is wrong.<br />
<div>
<br /></div>
<div>
<br /></div>
<div>
Let's assume you have a session table of the following</div>
<div>
<br /></div>
<pre>mysql> describe sessions;
+----------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------------------+------+-----+---------+-------+
| user_id | bigint(20) unsigned | NO | PRI | 0 | |
| added_ms | bigint(20) unsigned | NO | PRI | 0 | |
| appVer | varchar(8) | YES | | NULL | |
| device | bigint(20) unsigned | YES | MUL | NULL | |
| start | int(10) unsigned | NO | MUL | NULL | |
| stop | int(10) unsigned | NO | | NULL | |
+----------+---------------------+------+-----+---------+-------+
</pre>
<pre>
</pre>
<pre><div>
<div style="font-family: Times; white-space: normal;">
The data is federated (distributed) by user_id. This table exists across 1000s of servers. How do you get the average session length for the month of May?</div>
<div style="font-family: Times; white-space: normal;">
</div>
<br />
<ul>
<li><span style="font-family: Times;"><span style="white-space: normal;">The question already scopes the process to hit every single server</span></span></li>
<li><span style="font-family: Times;"><span style="white-space: normal;">Second we can't just take AVG((stop-start)) and then sum and divide that by the number of shards</span></span></li>
<li><span style="font-family: Times;"><span style="white-space: normal;">We can't pull all the data in memory</span></span></li>
<li><span style="font-family: Times;"><span style="white-space: normal;">We don't want to have to pull the data and upload it to BigQuery or Amazon RedShift</span></span></li>
<li><span style="font-family: Times;"><span style="white-space: normal;">We want a daily report at some point</span></span></li>
</ul>
<div>
<span style="font-family: Times;"><span style="white-space: normal;">
</span></span></div>
<div>
</div>
<div>
<span style="font-family: Times;"><span style="white-space: normal;">SELECT SUM((stop-start)) as sess_diff, count(*) as sess_sample FROM sessions WHERE start BETWEEN $start AND $stop AND stop>start</span></span></div>
<div>
<span style="font-family: Times;"><span style="white-space: normal;">
</span></span></div>
<div>
<span style="font-family: Times;"><span style="white-space: normal;">The above SQL statement says for the connection to a single server give me the sum of the session delta and count the corresponding rows in the set. In this case the SUM of SUMs (sum of session_delta) is the numerator and the sum of sess_sample is the denominator.</span></span></div>
<div>
<span style="font-family: Times;"><span style="white-space: normal;">
</span></span></div>
<div>
<span style="font-family: Times;"><span style="white-space: normal;">Now do this across all servers and finally write some client code to take a few rows < 1000 to report the number.</span></span></div>
<div>
</div>
<br />
<pre style="background-color: black; color: #d0d0d0; font-family: 'Menlo'; font-size: 12.0pt;"><span style="color: #79abff;">$total </span>= <span style="color: yellow;">0</span>;
<span style="color: #79abff;">$sessions_diff </span>= <span style="color: yellow;">0</span>;
<span style="color: #00d0d0;">foreach </span>(<span style="color: #79abff;">$rows </span><span style="color: #00d0d0;">as </span><span style="color: #79abff;">$shard_id </span>=> <span style="color: #79abff;">$result</span>) {
<span style="color: #79abff;">$sessions_diff </span>= \<span style="color: yellow;">bcadd</span>(<span style="color: #79abff;">$sessions_diff</span>, <span style="color: #79abff;">$result</span>[<span style="color: yellow;">0</span>][<span style="color: #dc78dc;">'sess_diff'</span>]);
<span style="color: #79abff;">$total </span>= \<span style="color: yellow;">bcadd</span>(<span style="color: #79abff;">$total</span>, <span style="color: #79abff;">$result</span>[<span style="color: yellow;">0</span>][<span style="color: #dc78dc;">'sess_sample'</span>]);
}</pre>
</div>
<div style="font-family: Times; white-space: normal;">
</div>
</pre>
<pre>
</pre>
<pre>
</pre>
<pre>Now the session_avg = sessions_diff/total</pre>
<pre>
</pre>
<pre>Tada a query that can take hours if done on a traditional mining server is done in ms.</pre>
<pre>
</pre>
<pre>
</pre>
Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-40924762055735669592015-04-01T17:09:00.001-07:002015-04-01T17:09:28.791-07:00Federating THE friends table in a Sharded mySQL environment without downtime or users noticing<br />
A friends table is the cornerstone of social applications. Its purpose is to define relationships and help answer the question: what are my friends doing?<br />
<br />
Here is an example friend’s table:<br />
<br />
<code>
CREATE TABLE `friends` (<br />
`user_id` bigint(20) unsigned NOT NULL,<br />
`friend_id` bigint(20) unsigned NOT NULL,<br />
`auto_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,<br />
PRIMARY KEY (`user_id`,`friend_id`),<br />
KEY `user_id-auto_ts` (`user_id`,`auto_ts`),<br />
KEY `friend_id-auto_ts` (`friend_id`,`auto_ts`)<br />
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci<br />
</code>
<br />
<code><br /></code>
With the table above we can get a list of user_ids a user follows (following), or a list of people who follow said user (followers), or get a list of mutual follows. This is a very simple table structure yet very powerful.<br />
<br />
The problem is that this table doesn't scale on a single server. When you have millions of users, each user has many friends, and all users are semi- to deeply connected, the table becomes a problem. Mix in a huge request rate with lots of concurrency and a single server just doesn't cut it.<br />
<br />
One can replicate the friends table but what starts to cause lag is when many users start adding or removing friends at once. So, how can we distribute this table across many servers holding a small % of the friend graph?<br />
<br />
Let's look at the friends table. It defines whom a user follows and who follows the user ordered by insertion time.<br />
<br />
Let's create two tables:<br />
<br />
<code>
CREATE TABLE `following` (<br />
`user_id` bigint(20) unsigned NOT NULL DEFAULT '0',<br />
`friend_id` bigint(20) unsigned NOT NULL DEFAULT '0',<br />
`mutual` tinyint(3) unsigned NOT NULL DEFAULT '0' COMMENT 'Flag to denote mutual connections',<br />
`auto_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,<br />
PRIMARY KEY (`user_id`,`friend_id`),<br />
KEY `user_id-auto_ts` (`user_id`,`auto_ts`)<br />
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8<br />
</code>
<br />
<code><br /></code>
<code>
CREATE TABLE `followers` (<br />
`user_id` bigint(20) unsigned NOT NULL DEFAULT '0',<br />
`friend_id` bigint(20) unsigned NOT NULL DEFAULT '0',<br />
`mutual` tinyint(3) unsigned NOT NULL DEFAULT '0' COMMENT 'Flag to denote mutual connections',<br />
`auto_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,<br />
PRIMARY KEY (`user_id`,`friend_id`),<br />
KEY `friend_id-auto_ts` (`friend_id`,`auto_ts`)<br />
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8<br />
</code>
<br />
<br />
The 'following' table defines whom a given user follows. The table is federated by user_id, so it exists on the user_id's shard.<br />
<br />
The 'followers' table defines who is following the given user. On every follow, instead of writing one row we now write two rows: one on the following user's shard and one on the followed user's shard. Thus the followers table is federated by friend_id.<br />
<br />
This can be best described by an example on reads:<br />
<br />
How many people am I (user_id 3306) following?<br />
<br />
Connect to my Shard-x, execute the query<br />
<code><br /></code>
<code>
SELECT COUNT(*) FROM following WHERE user_id = 3306<br />
</code>
<br />
<br />
How many people are following me (user_id 3306)<br />
<br />
Connect to my Shard-x, execute the following query<br />
<br />
<code>
SELECT COUNT(*) FROM followers WHERE friend_id = 3306<br />
</code>
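Both lookups assume we can route a user_id to its shard. A minimal sketch of that routing function (shardFor is a made-up name, and modulo is assumed here; a directory service that allows moving users between shards is common in practice):<br />

```go
package main

import "fmt"

// shardFor maps a user_id to one of n shards. Modulo is the simplest
// scheme; it keeps all of a user's rows, and all rows fanned out to
// that user, on one predictable server.
func shardFor(userID, n uint64) uint64 {
	return userID % n
}

func main() {
	// with 100 shards, user 3306 and user 11211 land on different shards
	fmt.Println(shardFor(3306, 100), shardFor(11211, 100)) // 6 11
}
```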
<br />
<br />
Now let's look at a write, of me (user_id:3306) following friend_id:11211<br />
<br />
3306 is on Shard-x<br />
11211 is on Shard-y<br />
<br />
So, first we write the fact that 3306 is following 11211. We connect to Shard-x and start the transaction:<br />
<br />
<code>
BEGIN<br />
INSERT INTO following (user_id, friend_id, mutual, auto_ts) VALUES(3306, 11211, 0, NOW());<br />
// DO NOT COMMIT YET<br />
<br />
<br />
Now connect to Shard-y to write the followers row. If the connection fails rollback the transaction on 3306's Shard-x, otherwise<br />
<br />
BEGIN<br />
INSERT INTO followers (user_id, friend_id, mutual, auto_ts) VALUES(3306, 11211, 0, NOW());<br />
if affected rows == 1 (no error)<br />
COMMIT on Shard-x<br />
COMMIT on Shard-y<br />
</code>
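The write path above, sketched in Go with stubbed transactions (shardTx and its methods are illustrative stand-ins, not a real driver): both inserts happen inside open transactions and we commit only when both succeed.<br />

```go
package main

import (
	"errors"
	"fmt"
)

// shardTx is a stand-in for an open transaction on one shard.
type shardTx struct {
	name      string
	committed bool
}

func (t *shardTx) insert(ok bool) error {
	if !ok {
		return errors.New("insert failed on " + t.name)
	}
	return nil
}

func (t *shardTx) commit()   { t.committed = true }
func (t *shardTx) rollback() { t.committed = false }

// follow mirrors the two-shard write: insert the following row on
// Shard-x and the followers row on Shard-y, committing only when both
// inserts succeed. Note this is not a true two-phase commit; a crash
// between the two COMMITs can still leave one row behind.
func follow(x, y *shardTx, yOK bool) error {
	if err := x.insert(true); err != nil { // following row on Shard-x
		x.rollback()
		return err
	}
	if err := y.insert(yOK); err != nil { // followers row on Shard-y
		y.rollback()
		x.rollback() // undo the following row on Shard-x too
		return err
	}
	x.commit()
	y.commit()
	return nil
}

func main() {
	x, y := &shardTx{name: "shard-x"}, &shardTx{name: "shard-y"}
	fmt.Println(follow(x, y, true), x.committed, y.committed)
}
```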
<br />
<br />
Now we can answer the main questions.<br />
<br />
<br />
But what about something like. Give me my friends photos sorted by last upload time 10 at a time?<br />
<br />
Well, here is the magic sauce. We are going to do a fan-out of reads and hit all the shards my friends are on. For my environment this is much better than a fan-out of writes, since we like to customize the feed in real time, and duplicating the data tens of thousands of times becomes very expensive as servers turn cold. We can go into this topic more in another post.<br />
<br />
<br />
Now I execute the query across my friends' shards:<br />
<br />
<code>
SELECT p.id FROM photos p JOIN followers f ON(f.friend_id=p.user_id) WHERE f.user_id = 3306 ORDER BY p.id DESC LIMIT 10;<br />
</code>
<br />
If I have 1,000 friends across 100 shards and each friend has 10 photos, I am going to get back up to 1,000 rows (LIMIT 10 per shard).<br />
<br />
But this is not the order I am going to display, because I want the latest 10 photos overall. Thus I need to sort in memory on the application server and take a slice of the results.<br />
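The sort-and-slice step might look like this in Go, with shard results stubbed as plain slices of photo ids (topN is an illustrative name):<br />

```go
package main

import (
	"fmt"
	"sort"
)

// topN merges per-shard result sets (each already sorted descending
// by photo id, thanks to ORDER BY p.id DESC LIMIT n per shard) and
// returns the newest n ids overall.
func topN(shardResults [][]int, n int) []int {
	var all []int
	for _, rows := range shardResults {
		all = append(all, rows...)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(all))) // newest first
	if len(all) > n {
		all = all[:n] // the slice we actually display
	}
	return all
}

func main() {
	results := [][]int{{90, 40, 10}, {95, 50}, {70, 60, 20}}
	fmt.Println(topN(results, 4)) // [95 90 70 60]
}
```

In practice the per-shard queries run concurrently, one goroutine per shard, so the wall-clock cost is roughly the slowest shard plus this cheap merge.<br />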
<br />
<br />
But what if I want the 2nd page?<br />
<code>
SELECT p.id FROM photos p JOIN followers f ON (f.friend_id=p.user_id) WHERE f.user_id = 3306 AND p.id < [LAST_ID_FROM_FIRST_PAGE] ORDER BY p.id DESC LIMIT 10
</code>
<br />
<br />
In the application we pass the last_id from the first page, execute the same read fan-out again, apply the same sort-and-slice logic, and return the photos.<br />
<br />
Your question might be: isn't this slow, since people with large networks hit every shard each time, and you have to loop, execute, and read on each connection?<br />
<br />
This can be mitigated with memory, pipelining and parallel SQL execution.<br />
<br />
If your social graph is like Twitter's, where active users follow 100K users and the feed doesn't change dynamically, writing the data to each shard (fan-out on write) may be for you. But again, this is out of scope for this post.<br />
<br />
What about answering the question mutual connections?<br />
<br />
On every write of a friend relationship, do a select to see if the followed person already follows the follower. If so, mark the rows on both shards as mutual.<br />
<br />
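One possible shape of that check-and-flag, assuming the shard placement described above (3306 on Shard-x, 11211 on Shard-y), during a 3306 follows 11211 write:<br />

```sql
-- does 11211 already follow 3306? (following is federated by user_id,
-- so 11211's edge lives on Shard-y, which we are already connected to)
SELECT 1 FROM following WHERE user_id = 11211 AND friend_id = 3306;

-- if the reverse edge exists, flag both directions on both shards
UPDATE following SET mutual = 1 WHERE user_id = 3306  AND friend_id = 11211; -- Shard-x
UPDATE followers SET mutual = 1 WHERE user_id = 3306  AND friend_id = 11211; -- Shard-y
UPDATE following SET mutual = 1 WHERE user_id = 11211 AND friend_id = 3306;  -- Shard-y
UPDATE followers SET mutual = 1 WHERE user_id = 11211 AND friend_id = 3306;  -- Shard-x
```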
For all my cases, this distributed friends table solves my needs. Heavy friend writes, say from importing an address book, email contacts, or another social network's friend graph, combined with high concurrency are not going to affect me, since the table is no longer a single point and is now distributed across many servers.<br />
<br />
Reads are fast because only a % of data is on each shard, 90% of the queries hit only that shard for a given user.<br />
<br />
Feed-type queries are fast because the SQL executes in parallel when we have to go to the SQL layer. Most data is cached, reducing the need to fan out on reads.<br />
<div>
<br />
Finally, federating without downtime or users noticing requires a backfill script plus dual writes to the old friends table and the new friend tables. Once this is done, fix all the queries to use the new format. Then sit back and feel good that good work was done :)<br />
<br /></div>
Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-53821383607045499012015-03-12T17:04:00.002-07:002015-03-12T17:04:47.612-07:00Long time since an update but great stuff coming alongSo, it's been a long time since I contributed anything to my blog. That will end very soon. Coming up: posts about the architecture of Shots, shard optimizations, data organization and grouping, Java, Golang and some other cool stuff. Also how to handle Justin Bieber's traffic, which is INSANE.<br />
<br />
<br />
In the meantime, if you live in the San Francisco Bay Area, want to work with the coolest founders on the planet, make a big difference in people's lives, and know mySQL / redis / memcache / some C-style language (or want to learn), contact me. I have a great job for you!<br />
<br />
<br />
<br />
<br />Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-66332537552216146862014-07-10T14:41:00.001-07:002014-07-10T14:41:24.865-07:00Manually Switch Slaves to new Masters in mySQL 5.6 (XTRADB 5.6)I'm really excited about <a href="http://dev.mysql.com/doc/mysql-utilities/1.4/en/fabric.html" target="_blank">Fabric</a>, which was recently announced. Everything it does has so far been a variety of scripts or manual tasks for me, but before I can integrate Fabric into my system I must know more about it. When dealing with live data and moving servers around, I still do things manually, just because it makes me feel better to know that if data is lost, I was the cause by doing something dumb. Basically, I need to know everything about Fabric, including line-by-line execution, before I will deploy it.<br />
<div>
<br /></div>
<div>
<br /></div>
<div>
Here are my steps for switching and replacing a Shard Slave.</div>
<div>
<br /></div>
<div>
Imagine having a setup in the following Config.</div>
<div>
<br /></div>
<div>
Shard Server <span style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px;">10.0.30.123</span> - this is the master endpoint<br />
<br /></div>
<div>
The Global Shard which holds Friend Info to join against is </div>
<div>
<br />
10.0.1.1</div>
<div>
<br /></div>
<div>
<span style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px;">10.0.30.123</span> --- replicates from ---> 10.0.1.1</div>
<div>
<br /></div>
<div>
Now the Shard Server</div>
<div>
<span style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px;">10.0.30.123 </span>has 3 slaves, thus <span style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px;">10.0.30.123</span> is set up to <a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-slave.html#option_mysqld_log-slave-updates" target="_blank">log-slave-updates</a></div>
<div>
The 3 slaves are 10.0.18.78, 10.0.22.76, 10.0.22.77 and I want to make 10.0.22.76 the new master for the said Shard with 10.0.22.77 as its slave. So, what I have is</div>
<div>
<br /></div>
<div>
3 slaves --- replicates from ---> <span style="background-color: white; color: #333333; font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px;">10.0.30.123</span> --- replicates from ---> 10.0.1.1</div>
<div>
<br /></div>
<div>
what I will end up with is</div>
<div>
<br /></div>
<div>
10.0.22.77 --- replicates from ---> 10.0.22.76 --- replicates from ---> 10.0.1.1</div>
<div>
<br />
I am getting rid of 10.0.30.123 and 10.0.18.78<br />
<br /></div>
<div>
<br /></div>
<div>
Here are the steps.</div>
<div>
<br /></div>
<div>
Tell 10.0.22.77 and 10.0.22.76 to SLAVE UNTIL the next binary log on 10.0.30.123</div>
<div>
<ul>
<li>ssh to each box</li>
<li>STOP SLAVE (using mysql 5.6) on 10.0.22.7[6-7]</li>
<li>SHOW SLAVE STATUS\G -- get Master_Log_File : master-bin.000612</li>
<li>START SLAVE UNTIL MASTER_LOG_FILE='master-bin.000613', MASTER_LOG_POS=4</li>
</ul>
<div>
Now what I did here was tell the slaves to replicate until the next bin log is reached</div>
</div>
<div>
<div class="p1">
mysql> START SLAVE UNTIL MASTER_LOG_FILE='master-bin.000613', MASTER_LOG_POS=4;</div>
<div class="p1">
Query OK, 0 rows affected, 2 warnings (0.01 sec)</div>
<div class="p2">
<br /></div>
<div class="p1">
mysql> SHOW WARNINGS;</div>
<div class="p1">
+-------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+</div>
<div class="p1">
| Level | Code | Message |</div>
<div class="p1">
+-------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+</div>
<div class="p1">
| Note | 1278 | It is recommended to use --skip-slave-start when doing step-by-step replication with START SLAVE UNTIL; otherwise, you will get problems if you get an unexpected slave's mysqld restart |</div>
<div class="p1">
| Note | 1753 | UNTIL condtion is not supported in multi-threaded slave mode. Slave is started in the sequential execution mode. |</div>
<div class="p1">
+-------+------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+</div>
<div class="p1">
2 rows in set (0.00 sec) // <i>notice the minor bug in the spelling :)</i></div>
<div class="p1">
<br /></div>
<div class="p1">
I also get a warning that says my<a href="http://dev.mysql.com/doc/refman/5.6/en/replication-options-slave.html#option_mysqld_slave-parallel-workers" target="_blank"> multiple SQL threads</a> are now a single one which is fine.</div>
<div class="p1">
<br /></div>
<div class="p1">
<br /></div>
<div class="p1">
My next step is to ssh to 10.0.30.123</div>
<div class="p1">
<br /></div>
<div class="p1">
</div>
<ul>
<li>FLUSH LOGS - this tells 10.0.30.123 to rotate all log files, closing master-bin.000612 and opening master-bin.000613</li>
</ul>
<div>
Now on the slaves I wait until they stop</div>
<div>
<br /></div>
<div>
Once both stop, on 10.0.22.76 I issue RESET MASTER // I don't care about what was replicated and already saved in the binlogs at this point - I've already verified that the slaves are in sync with CHECKSUM TABLE</div>
<div>
<br /></div>
<div>
On 10.0.22.77 I issue the command</div>
<div>
</div>
<div>
STOP SLAVE; CHANGE MASTER TO MASTER_LOG_FILE='master-bin.000001', MASTER_LOG_POS=4, MASTER_HOST='10.0.22.76'; START SLAVE;</div>
<div>
<br /></div>
<div>
if you get an error </div>
<br />
<div class="p1">
Fatal error: The slave I/O thread stops because master and slave have equal MySQL server UUIDs; these UUIDs must be different for replication to work.</div>
<div class="p1">
stop mysql, remove auto.cnf in your $DATADIR (/var/lib/mysql)</div>
<div class="p1">
<br /></div>
<div class="p1">
On 10.0.22.76 I issue </div>
<div class="p1">
<br /></div>
<div class="p1">
START SLAVE</div>
<div class="p1">
<br /></div>
<div class="p1">
<br /></div>
<div class="p1">
Now I wait until the SLAVE catches up to the MASTER 10.0.30.123 (remember this works because of log-slave-updates)</div>
<div class="p1">
<br /></div>
<div class="p1">
<br /></div>
<div class="p1">
Next in my dbconfig.php file I change all references to 10.0.30.123 to 10.0.22.76</div>
<div class="p1">
<br /></div>
<div class="p1">
Verify everything is in sync (USE CHECKSUM TABLE ACROSS TABLES/SERVERS ) and push out the new config</div>
<div class="p1">
<br /></div>
<div class="p1">
After the push, make sure to restart all daemons and queue workers; they may cache the database config</div>
<div class="p1">
<br /></div>
<div class="p1">
Now do this all over again to make</div>
<div class="p1">
<br /></div>
<div class="p1">
10.0.22.76 replicate from 10.0.1.1</div>
<div class="p1">
<br /></div>
<div class="p1">
<br /></div>
<div class="p1">
In conclusion, this is just too manual and screams for automation. Soon it will be automated with Fabric, which manages this process, once I get around to rolling that out.</div>
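As a small step toward the automation mentioned above, the manual procedure can at least be turned into a statement generator. This is a hedged sketch: `promotion_plan` and `next_binlog` are hypothetical names, and the ssh / SQL execution plumbing is deliberately left out - it only produces the ordered statements from the post.

```python
def next_binlog(name):
    # master-bin.000612 -> master-bin.000613 (preserve zero padding)
    base, num = name.rsplit(".", 1)
    return "%s.%0*d" % (base, len(num), int(num) + 1)

def promotion_plan(current_master_log, new_master_ip):
    # Ordered statements for the slaves being re-pointed, mirroring the
    # SLAVE UNTIL / RESET MASTER / CHANGE MASTER steps described above.
    until = next_binlog(current_master_log)
    return [
        "STOP SLAVE",
        "START SLAVE UNTIL MASTER_LOG_FILE='%s', MASTER_LOG_POS=4" % until,
        # ...after both slaves stop and RESET MASTER runs on the new master:
        "CHANGE MASTER TO MASTER_LOG_FILE='master-bin.000001', "
        "MASTER_LOG_POS=4, MASTER_HOST='%s'" % new_master_ip,
        "START SLAVE",
    ]

plan = promotion_plan("master-bin.000612", "10.0.22.76")
```

Even this tiny helper removes the error-prone step of hand-typing the UNTIL file name.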
<div class="p1">
<br /></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-16986841427850584502014-06-03T13:26:00.001-07:002014-06-03T13:26:25.480-07:00CTO of Shots on Core Technology, Culture and Working on the greatest App in the WorldMySQL has opened a lot of avenues and opportunities for me. I am the <a href="http://www.linkedin.com/in/dathan/" target="_blank">CTO</a> of <a href="http://www.crunchbase.com/organization/shots" target="_blank">Shots</a>, and I got here because I was a mySQL DBA who can code in a variety of languages, understand data access and layout, design fast backends that scale to well over 100 million users, manage a team, give back to the community, and prove myself through constant good work. Plus I've made every single mistake, so I know what not to do.<br />
<br />
At Shots we of course use Percona XtraDB 5.6, with Memcache, Redis, ElasticSearch, HAProxy, FluentD with Logstash plugins, Ruby, PHP 5.4, Go, Java, Erlang and AWS, all managed via a custom Chef build. We use Chef Server like chef-solo :)<br />
<br />
In four months we grabbed over 1 million ACTIVE users, all on iOS, mainly from the US, UK, Australia, Canada and Brazil. We are able to handle Justin Bieber's traffic, which is insane. Currently we have no downtime (yet - always plan for downtime). We moved our DC to AWS us-west-1, and from the flaky S3-east to S3-west. We grow organically, and are at the cusp of hitting our <a href="http://mysqldba.blogspot.com/2011/10/handling-hockey-stick-growth.html" target="_blank">hockey stick growth</a>, all on 12 servers currently :)<br />
<br />
There are 4 of us. Everything described here is what I handle, yet I say "we" throughout this post, and that is a good segue into our culture. As a team we build for our consumers, our Shotties: a positive, bully-free app that works across all platforms to keep humans interacting with humans, without all the other cruft you find on social networks. Team and community focus is our culture, with the confidence to build the best app in the world.<br />
<br />
It's a photo status update, better known as a selfie app, which everyone from <a href="http://shots.me/shaq" target="_blank">Shaq</a>, <a href="http://shots.me/kingbach/p/fnmwiem5" target="_blank">King Bach</a>, and <a href="http://shots.me/justinbieber/p/jg4fot7x" target="_blank">Justin Bieber</a> to <a href="http://shots.me/floydmayweather/p/iz6lqut3" target="_blank">Floyd Money Mayweather</a> and many others are using. It's cool to have these folks, but we are not building a platform for just them; we are building a platform for you. <a href="http://shots.me/chisami" target="_blank">For the teens who are different and like cosplay</a>, for the <a href="http://shots.me/itsjessconroy" target="_blank">teen that loves Bieber</a>, for the person <a href="http://shots.me/sammy/p/t1yo404o" target="_blank">who wants to remember the moment,</a> for you.<br />
<br />
I invite you all to use it. I also invite you to ask your friends if they would like to join us, because now I am hiring an iOS and Android dev. If you or they want to work for a startup, and be a part of something cool with the purpose of changing the world and the way we interact with one another online, join us. The requirements: live in the Bay Area, can code, and want to make a change :)<br />
<br />Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-57454089796203211462013-10-28T14:33:00.000-07:002013-10-28T14:33:23.472-07:00MariaDB 10.0.4, BeanStalkD, Geographic Replication, Event Tracker for stats gathering at 60K stats a secondEvery company needs to see stats to understand how the application is performing, and how users are using the application(s). Typically a stat for most basic questions and even some advance questions can be summarized as "What is said event over time?". We call this EventTracker.<br />
<br />
To add to the complexity of generating stats: how do you get stat events from a DataCenter (DC) in Singapore, a DC in Western Europe, and a DC in Oregon to a database for querying in West Virginia, near real-time? I used multi-source replication and the BLACKHOLE storage engine to do so with MariaDB.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDx-PqwjSaSwXvNe98Fi6XSJjA_jz84H1fL431UTE4HrtsSxPDnSpfIdOnEMf28B5Esicmrb7JX8Ab1VJL-23wRV4KNCadLPtyogruS01qziVIUiv8GpdhCvCLhbc3oh1kbJKoSA/s1600/TSDB+Layoit.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="216" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDx-PqwjSaSwXvNe98Fi6XSJjA_jz84H1fL431UTE4HrtsSxPDnSpfIdOnEMf28B5Esicmrb7JX8Ab1VJL-23wRV4KNCadLPtyogruS01qziVIUiv8GpdhCvCLhbc3oh1kbJKoSA/s320/TSDB+Layoit.gif" width="320" /></a></div>
<br />
Above is an image that shows a webserver in some part of the world sending events for tracking various interrupts to a BeanstalkD queue at time T in the same region. Each region has a set of Python workers that grab events from BeanstalkD and write each event to a local DB. Then the TSDB database, a MariaDB 10.0.4 instance, replicates from each BLACKHOLE storage-engine BeanstalkD worker server.<br />
<br />
The obvious question might be: why not use OpenTSDB? The TSDB daemon couldn't handle the onslaught of stats/second, and the current HBase TSDB structure is much larger compared to a compressed InnoDB row for the same stat. Additionally, a region may lose connectivity to another region for some time, so I would need to queue events in some form until the network was available again. Thus the need for a home-grown solution. Now back to my solution.<br />
<br />
The Structure for the event has the following DDL.<br />
<br />
Currently we are using 32 shards, defined by each bigdata_# database. This allows us to scale per database; our capacity plan is based not on disk IO but on disk space.<br />
<br />
<br />
<pre class="brush:sql">MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| bigdata_0 |
| bigdata_1 |
| bigdata_10 |
| bigdata_11 |
| bigdata_12 |
| bigdata_13 |
| bigdata_14 |
| bigdata_15 |
| bigdata_16 |
| bigdata_17 |
| bigdata_18 |
| bigdata_19 |
| bigdata_2 |
| bigdata_20 |
| bigdata_21 |
| bigdata_22 |
| bigdata_23 |
| bigdata_24 |
| bigdata_25 |
| bigdata_26 |
| bigdata_27 |
| bigdata_28 |
| bigdata_29 |
| bigdata_3 |
| bigdata_30 |
| bigdata_31 |
| bigdata_4 |
| bigdata_5 |
| bigdata_6 |
| bigdata_7 |
| bigdata_8 |
| bigdata_9 |
| information_schema |
| mysql |
| performance_schema |
+--------------------+
35 rows in set (0.16 sec)
</pre>
<br />
The database bigdata_0 is the only database that is slightly different than the rest. It has a table defined as EventTags that is not in the rest of the databases. EventTags is the map of eventId to tagName where eventId is just a numerical representation of a part of the md5 of the tagName. Each numerical representation falls into an address space that denotes the range of which database a tag should belong to. We use the EventTags table for the front-end to search for a stat to plot on a graph.<br />
<br />
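The eventId-to-shard mapping described above can be illustrated roughly as follows. The exact hashing in the original system isn't shown in the post, so treat this as an assumption: here the first 8 bytes of md5(tagName) become an unsigned 64-bit eventId, and a simple modulo picks one of the 32 bigdata_# databases.

```python
import hashlib

N_SHARDS = 32

def event_id(tag_name):
    # Assumed scheme: first 16 hex chars (8 bytes) of md5(tagName)
    # interpreted as an unsigned 64-bit integer.
    digest = hashlib.md5(tag_name.encode("utf-8")).hexdigest()
    return int(digest[:16], 16)

def shard_for(tag_name):
    # Each id falls into an address range that picks its bigdata_# database;
    # modulo is the simplest stand-in for that range mapping.
    return "bigdata_%d" % (event_id(tag_name) % N_SHARDS)

eid = event_id("APP.METRICNAME")
```

The key property is determinism: the same tagName always lands on the same shard, so EventTags and EventDay rows for one tag stay together.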
<pre class="brush: sql">
CREATE TABLE `EventTags` (
`eventId` bigint(20) unsigned NOT NULL DEFAULT '0',
`tagName` varchar(255) NOT NULL DEFAULT '',
`popularity` bigint(20) unsigned NOT NULL DEFAULT '0',
`modifiedDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`eventId`,`tagName`(25)),
KEY `eventTag` (`tagName`(25))
) ENGINE=BLACKHOLE DEFAULT CHARSET=utf8
</pre>
<br />
EventDay contains the actual value of the stat, combined over the last 1 minute (currently). Our granularity goes down to a second, but we found seeing events per minute is fine. The SQL produced for events is the following.<br />
<br />
<br />
<pre class="brush: sql">
INSERT INTO EventTags (eventId, tagName, popularity) VALUES (<64-bit-int of tagName>, 'APP.METRICNAME:tags:key1=value1,keyN=valueN', 1) ON DUPLICATE KEY UPDATE popularity = popularity + VALUES(popularity);

INSERT INTO EventDay (eventId, createDate, count) VALUES (<64-bit-int of tagName>, '2013-10-31 00:01:00', 1) ON DUPLICATE KEY UPDATE count = count + VALUES(count);
</pre>
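Before those upserts run, a worker has to collapse raw events into one-minute buckets. Here is a minimal Python sketch of that aggregation step; the BeanstalkD and MySQL plumbing is omitted, and the names are illustrative rather than the production worker's.

```python
from collections import defaultdict
from datetime import datetime

def minute_bucket(ts):
    # Truncate an event timestamp to its one-minute block.
    return ts.replace(second=0, microsecond=0)

def aggregate(events):
    # Combine raw (eventId, timestamp) events into per-minute counts,
    # mirroring ON DUPLICATE KEY UPDATE count = count + VALUES(count).
    counts = defaultdict(int)
    for event_id, ts in events:
        counts[(event_id, minute_bucket(ts))] += 1
    return dict(counts)

events = [
    (1, datetime(2013, 10, 31, 0, 1, 5)),
    (1, datetime(2013, 10, 31, 0, 1, 40)),   # same minute as above
    (1, datetime(2013, 10, 31, 0, 2, 0)),    # next minute
]
agg = aggregate(events)
```

Batching like this is what turns 60K raw events a second into only a few thousand database updates a second.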
<br />
<br />
<pre class="brush: sql">CREATE TABLE `EventDay` (
`eventId` bigint(20) unsigned NOT NULL,
`createDate` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT 'min Blocks',
`count` bigint(20) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`eventId`,`createDate`)
) ENGINE=BLACKHOLE DEFAULT CHARSET=utf8
</pre>
<br />
MariaDB's multi-source replication then downloads the statement-based binary logs from each of the workers around the world and applies the SQL to a combined database that represents all regions. Note that the central database has the same structure, BUT the engine is now compressed InnoDB with KEY_BLOCK_SIZE set to 8.<br />
<br />
The front-end sits on top of the central database and we record everything from a single item being sold to load on our auto-scaling web-farm. Which allows us to do some interesting plots like Items sold as a function of Load over time. I(L(t))<br />
<br />
<br />
Currently with this method we are producing 60K events a second, which translates to a few thousand database updates a second across 4 replication channels (8 threads total), keeping all data up to date to within the last minute, near realtime.<br />
<br />
<br />Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-75747605694271011412013-07-07T13:38:00.002-07:002013-07-17T11:51:43.342-07:00Speaking at RAMP: Scale Patterns and handling exponential growth without downtimeI will be in Budapest talking about Scale and Rapid Growth. I will start off with Flickr's Five minute conversation to take a direction on how to scale the backend to getting 90 million users in 3 weeks after going Viral.<br />
<br />
<a href="http://rampconf.com/main.html#schedule">http://rampconf.com/main.html#schedule</a><br />
<br />
<br />
<span style="background-color: white; color: #454545; font-family: arial, sans-serif; font-size: 13px;">RAMP will also have live streaming broadcasted at </span><a href="http://thenextweb.com/" id="yui_3_7_2_1_1373178580123_17033" rel="nofollow" style="background-color: white; color: #2862c5; font-family: arial, sans-serif; font-size: 13px; outline: 0px;" target="_blank">TNW</a><span style="background-color: white; color: #454545; font-family: arial, sans-serif; font-size: 13px;">, </span><a href="http://www.hwsw.hu/" id="yui_3_7_2_1_1373178580123_17034" rel="nofollow" style="background-color: white; color: #2862c5; font-family: arial, sans-serif; font-size: 13px; outline: 0px;" target="_blank">HWSW</a><span style="background-color: white; color: #454545; font-family: arial, sans-serif; font-size: 13px;"> and on </span><a href="http://www.ustream.com/" id="yui_3_7_2_1_1373178580123_17035" rel="nofollow" style="background-color: white; color: #2862c5; font-family: arial, sans-serif; font-size: 13px; outline: 0px;" target="_blank">USTREAM</a><span style="color: #454545; font-family: arial, sans-serif; font-size: x-small;">.</span><br />
<br />
<br />
<iframe src="http://prezi.com/embed/tqrss-zsk9ry/?bgcolor=ffffff&lock_to_path=0&autoplay=0&autohide_ctrls=0&features=undefined&disabled_features=undefined" width="550" height="400" frameBorder="0"></iframe>Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-28860282421734206142013-06-24T04:44:00.000-07:002013-06-24T04:44:39.047-07:00First Week.5 in China: Part Two, Refactor PHP get 10% more capacity with one changeThe PHP code that I've experienced in China so far is pretty good. I have been in some environments where the Code is horrendous-where variables are set in one file yet used in another file via a require_once. If that magic variable is not set everything would break with side-effect galore. This is not the case here for the China Team. This team is really good not to imply the other-one wasn't just praising the current one.<br />
<div>
<br /></div>
<div>
The SQL, like at many other companies I have been at, requires some extra effort, but the hunger to learn and improve runs throughout the culture of the team here. Really, that is the first step to improving a system: the willingness by developers and management to get things done and fixed - fast.</div>
<div>
<br /></div>
<div>
Entering the environment, first I read all the code. Then I <a href="http://mysqldba.blogspot.com/2013/06/first-week-in-china-build-new-dev.html" target="_blank">created a development environment</a> to play with the code. Next I profiled how the database is being interacted with, in conjunction with the cache. All looked OK, but some back-of-the-envelope calculations showed the server farm is too big for the amount of traffic. Traffic is huge, don't get me wrong (5M+ DAU)! But the farm is too big. Digging some more, I found that the code spins to search for items on a map by loading all map items and, in a for-loop, going through each item until that one item is found, then returning. This is done two to four times for every API request, especially to trigger an achievement if the item is found on your map. More on the fix later.</div>
<div>
<br /></div>
<div>
Before making any changes, I wanted feedback about the biggest issue that seemed to cause bugs or slow down development. The consensus was that the DB layer was mixed into the model layer, causing fear of changing said models because doing so might break the DB layer. It was not quite clear how the code communicated with the DB, so the team reused more of the same existing functionality to fulfill feature requests, which is sub-optimal if the root functionality was slow or expensive to use in the 1st place at scale.</div>
<div>
<br /></div>
<div>
Thus the 1st recommendation was to separate out the DB layer so that the data being requested is accessed through Data Access Objects (DAO). This concept encapsulates DB logic and for the most part requires only three methods: add, get, delete. Some more complex objects calling DAOs had specific SQL to make getting data faster, but for the most part three methods per table was all that was needed. Following the new directory structure, backed by PHP namespaces, all SQL is easy to find and isolated away from the models.</div>
<div>
<br /></div>
<div>
The second recommendation was to remove a bunch of in-PHP caches of data, because they were the cause of a vast number of copies chewing up a ton of memory per request, as well as chewing up CPU to build the caches per request. If the cache hit rate is not good, don't cache - the added complexity sucks to maintain and can actually slow things down if not needed.<br />
<br />
The third recommendation was to make each model a single instance per distinct entity (a singleton map) throughout the request, which reduced the overall number of database queries by coupling model creation to database fetches. The database queries are reduced because, instead of pulling the same data for object creation in various parts of the code, the single object is referenced.</div>
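The singleton-map idea above is essentially the classic identity-map pattern. Here is a hedged, minimal sketch of it (the real code is PHP; `ModelRegistry` and the `fetch` callback are invented names standing in for a DAO `get()`):

```python
class ModelRegistry:
    # One model instance per distinct entity id for the request lifetime,
    # so repeated lookups reuse the already-fetched object.
    def __init__(self, fetch):
        self._fetch = fetch          # e.g. a DAO get() that hits the database
        self._instances = {}
        self.db_calls = 0            # instrumented for the example

    def get(self, entity_id):
        if entity_id not in self._instances:
            self.db_calls += 1
            self._instances[entity_id] = self._fetch(entity_id)
        return self._instances[entity_id]

reg = ModelRegistry(lambda i: {"id": i})
a = reg.get(1)
b = reg.get(1)   # same object, no second "database" fetch
```

Two requests for the same entity return the identical object, so a model mutated in one part of the request is consistently seen everywhere else.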
<div>
<br /></div>
<div>
So here is the new structure for models/database access/utils</div>
<div>
<br /></div>
<div>
v2</div>
<div>
v2/classes/DB/ -- DB connection logic</div>
<div>
v2/classes/DAO -- DB Access </div>
<div>
v2/classes/Models -- New Models</div>
<div>
v2/classes/Util -- common Utility classes</div>
<div>
v2/init.php -- everything is setup from this structure<br />
<br />
With this new structure, separation of responsibility has been created in the code. More people can work on the same feature. One person can optimize the SQL, while another plugs in the model and yet another handles the access logic (controller). Or a single person can do it all. Most importantly the team loves the new setup.</div>
<div>
<br /></div>
<div>
In my 1st week and a half, with the new model format added to the existing code base via editing 242 files for a single model's usage (the largest and one of the most important models, which controls the MAP locations of the game), the result has been great: a 10% drop in the number of servers and no user complaints, with still more room for improvement. The biggest change came from removing the spin through all the map locations to find a single item. The fix changed an O(n) method in PHP that was getting hit hard into an O(1) lookup.<br />
<br /></div>
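To make the O(n)-to-O(1) change concrete, here is a small illustrative sketch (the real fix was in PHP; the item structure and function names here are invented for the example):

```python
def find_item_scan(map_items, item_id):
    # Old behavior: load all map items and spin through them (O(n) per call,
    # done two to four times per API request).
    for item in map_items:
        if item["id"] == item_id:
            return item
    return None

def build_index(map_items):
    # New behavior: one pass to key items by id; every lookup after is O(1).
    return {item["id"]: item for item in map_items}

items = [{"id": i, "x": i * 2} for i in range(100)]
idx = build_index(items)
```

Building the dict once per request and reusing it for every lookup is the whole fix.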
<div>
60 more models to go.</div>
<div>
<br /></div>
<div>
The good note about the unoptimized code is that it forced the dev-ops side of things to mature quickly, and the tools they built are really robust. To deal with features being pushed out that may not be mature enough for the request load, the team built a cool dashboard with Jenkins automation, home-grown software, realtime server metrics, and rules to launch new instances and shrink them automatically throughout the day. It works flawlessly - for the front ends, that is. It's so good and works so well that I hope one day it could be an open-source project on its own.</div>
<div>
<br /></div>
Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-26075375708656913312013-06-18T19:10:00.001-07:002013-06-18T19:10:56.534-07:00First Week in China: Build a new Dev Environment<div>
I see my role as enabling others. When I was a pure awesome DBA in the early 2000s, I enabled developers and customers of a company's product by making mySQL fault-tolerant and fast. As I moved up the stack to Architect, while still holding onto my roots as a DBA, I kept my DBA discipline by enabling my team and company through all the knowledge I garnered.</div>
<div>
<br /></div>
<div>
The first thing I identified in China that can really help my team members is a new development environment. The reason: the production and dev environments are wildly different. Dev is on Windows, while production runs various flavors of Linux's 2.6 kernel - mostly CentOS-6. Additionally, when the code is ready, it is pushed to what I like to call pre-integration servers - meaning the code is not checked in, but copied to a test server, then checked in if the tests pass. As a result, developers spend time organizing which test server to use, and this server can only be used while in the office.<br />
<br />
Generally, as a developer, you should develop in something similar to your production environment, and the integration server should serve as QA of the product, not as a post-development process that bypasses all unit tests (which did not exist). Also, a lot of effort was put into making this Windows-to-Linux environment work just well enough - which really is not good enough. Since PHP behaves slightly differently under Windows, I found that time was being spent on issues that would possibly not show up on Linux's PHP version. These issues provided enough justification to build an integrated environment where the end developer can work from home, or from wherever, even with no direct network connection to the outside world.</div>
<h2>
<br />The Setup</h2>
<div>
Forcing a developer to change their OS of choice, or IDE, or what have you, is not going to fly in any country - it's just too disruptive. Thus I chose to build the environment on <a href="https://www.virtualbox.org/" target="_blank">VirtualBox</a>, a free VM that works on Mac and Windows, the two primary dev environments. I pre-built the VM and uploaded it to the local fileserver. Now all the team has to do is download the VM.<br />
<br />
Here is what is installed on the VM. (These steps follow after installing Centos-6-minimal)</div>
<div>
<br /></div>
<div>
<div>
First, I set up a shared directory from the HOST (Mac) machine to the GUEST machine (VM), which contains the code to run the site. This allows the user to use their favorite native IDE app or vim.</div>
</div>
<div>
<br /></div>
<div>
<div>
Next, I set up a host-only virtual network, so even if the HOST does not have a connection to the net, it can always talk to the VM via ssh, httpd or what have you. I also set up another network interface for the VM to talk to the outside world via the NAT setting, so packages can be installed directly on it via yum.</div>
</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Then I configured the yum repos for CentOS-6 - epel for core Linux utils, 10gen for MongoDB, Percona for XtraDB - by modifying /etc/yum.repos.d and adding the following repos to my list</div>
<div>
<br /></div>
<div>
Percona.repo</div>
<div>
epel-testing.repo</div>
<div>
epel.repo</div>
<div>
remi.repo</div>
<div>
CentOS-Vault.repo</div>
<div>
CentOS-Media.repo</div>
<div>
CentOS-Debuginfo.repo</div>
<div>
CentOS-Base.repo</div>
<div>
10gen.repo </div>
<div>
<div>
<br /></div>
</div>
<div>
<br /></div>
<div>
Additionally I installed Percona, MongoDB, php, php-cli, php-fpm, nginx, apache, vim-enhanced, etc. via yum on the VM.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Finally, I wrote documentation for the whole process and tested it on a few people who have good spoken English skills. With their feedback the documentation was improved and sent to the rest of the team, who have pretty good written English skills.</div>
<div>
<br /></div>
<div>
Now all the dev team members have to do is download the VM, configure the shared code directory, and ta-da: the entire dev environment in a box!<br />
<br />
The next step is to resolve schema changes, and use Chef to update configurations and packages as if the VM were a real server - this is currently a manual process.<br />
<br />
<br />
Next Post: Refactor PHP Models and add Unit Tests<br />
<br /></div>
Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-85251456483062946062013-06-12T19:48:00.000-07:002013-06-12T19:48:21.872-07:00In China and Spreading mySQL/MariaDB/XtraDB Ganglia, GearmanD, Memcache, MongoDB, HAProxy, Nginx, PHP, PythonI am currently in Beijing for a month as the VP of Technology for Fun+, a US/China based gaming company, spreading the joys of open source. I have an entire team to do benchmarks, study InnoDB flushing, and build new technologies, which I hope to open-source; I will also post the results here. Our stack is mostly on AWS with the following.<br />
<br />
HAProxy load-balances the web tier<br />
The web tier runs nginx and php-fpm<br />
Data is stored in a new sharded mySQL layer; the gift platform is on MongoDB<br />
Memcache is used to cache frequently accessed items, giving state to our stateless web tier and reducing DB load, although we can run without it.<br />
<br />
What I am focusing on is:<br />
<br />Code style<br />
When to cache and when not to cache<br />
How to get the most out of mySQL and MongoDB, especially index design<br />
Tools for DevOps, by DevOps<br />
Reducing cost<br />
<br />
I hope to have a lot of information to share in the next couple of weeks.Dathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-27366165652911350622013-05-13T11:13:00.000-07:002013-05-13T11:16:13.580-07:00How to pick indexes is the same for MongoDB as mySQL<br />
I recently went to MongoDB Days, a conference in SF about everything MongoDB. Having started my career as a systems programmer, then moved through web developer and MySQL DB[Admin|Architect] roles on to software/system architecture, I like to keep an open mind about new technology and trends. When you work with a lot of different languages and technologies, you find out that it's basically the same science from about 40 years ago.<br />
<br />
An index in MongoDB is like an index in mySQL, since a Btree is a Btree regardless of which application uses it. Just like with mySQL, the best performance improvement for an application using MongoDB as a datastore is adding the correct indexes.<br />
<br />
<br />
To create an index in MongoDB:<br />
<br />
db.<tableName>.ensureIndex({ col#1:1, col#2:-1, col#3:1 }); // note 1 means ASC -1 means DESC<br />
<br />
MongoDB follows the same left-most-prefix rule meaning<br />
<br />
col#1, col#2, col#3 is an index<br />
col#1, col#2 is an index<br />
col#1 is an index<br />
<br />
col#2, col#3 IS NOT AN INDEX<br />
<br />
So, just like with mySQL, ONE compound index gives you a total of THREE usable indexes under the left-most-prefix rule: each left-to-right prefix of the columns in a compound index acts as an index.<br />
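The left-most-prefix rule is easy to verify against any Btree-backed database. Here is a quick sketch using SQLite from Python (purely for illustration; the table and column names are made up), showing which WHERE clauses can use a compound index on (a, b, c):<br />

```python
# Demonstrate the left-most-prefix rule with SQLite's query planner.
# (Illustrative only -- the same Btree principle applies to mySQL and MongoDB.)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("CREATE INDEX idx_abc ON t (a, b, c)")

def plan(where):
    # EXPLAIN QUERY PLAN rows end with a human-readable detail string
    return conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM t WHERE " + where
    ).fetchone()[-1]

# Left-most prefixes of (a, b, c) can use the index:
print(plan("a = 1"))                      # SEARCH ... USING INDEX idx_abc
print(plan("a = 1 AND b = 2"))            # SEARCH ... USING INDEX idx_abc
print(plan("a = 1 AND b = 2 AND c = 3"))  # SEARCH ... USING INDEX idx_abc
# Skipping the leading column cannot; the planner falls back to a scan:
print(plan("b = 2 AND c = 3"))            # SCAN ...
```

The planner output makes the rule concrete: only queries that constrain a prefix starting at the first indexed column get a SEARCH on the index.<br />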
<br />
MongoDB also gets a performance boost from covering indexes, just like mySQL. What is a covering index? Instead of reading from the data page (or document store, for MongoDB), which lives on disk, you're reading the data from the index, which should be in memory for the most part. A common practice is to follow the left-most-prefix pattern, then add the columns you are returning at the end of the compound index. For instance:<br />
<br />
SELECT photoId from Photos WHERE userId=? AND dateCreate=? AND privacy=?<br />
<br />
The index in mySQL I would make is<br />
<br />
ALTER TABLE Photos ADD INDEX `userId-dateCreate-privacy-photoId` (userId,dateCreate,privacy,photoId)<br />
<br />
Thus following the left-most-prefix of a compound index I have an index on<br />
<br />
userId, dateCreate, privacy, photoId<br />
userId, dateCreate, privacy<br />
userId, dateCreate<br />
userId<br />
<br />
and a covering index that satisfies the query above.<br />
<br />
For MongoDB it's the same:<br />
<br />
db.photos.ensureIndex({ userId: 1, dateCreate: 1, privacy: 1, photoId: 1});<br />
<br />
<br />
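The covering-index effect can be sketched the same way. Below is an illustrative replay of the Photos example in SQLite via Python (SQLite is used only because it runs inline; the principle carries over to mySQL and MongoDB). Its planner explicitly reports a covering index when the query can be answered from the index alone:<br />

```python
# Sketch: the Photos covering index from above, replayed in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Photos (photoId INTEGER, userId INTEGER, "
    "dateCreate TEXT, privacy INTEGER)"
)
conn.execute(
    "CREATE INDEX `userId-dateCreate-privacy-photoId` "
    "ON Photos (userId, dateCreate, privacy, photoId)"
)

# The SELECT only touches indexed columns, so no data-page read is needed.
detail = conn.execute(
    "EXPLAIN QUERY PLAN SELECT photoId FROM Photos "
    "WHERE userId = ? AND dateCreate = ? AND privacy = ?",
    (42, "2013-05-13", 0),
).fetchone()[-1]
print(detail)  # plan detail mentions USING COVERING INDEX
```

Drop photoId from the index and the same query has to visit the data pages again, which is exactly the extra I/O the covering index saves you.<br />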
So, in conclusion, understand the Computer Science of a <a href="http://xlinux.nist.gov/dads/HTML/btree.html" target="_blank">Btree</a>, <a href="http://xlinux.nist.gov/dads/HTML/hashtab.html" target="_blank">Hash</a>, <a href="http://xlinux.nist.gov/dads/HTML/linkedList.html" target="_blank">LinkedList</a> and you will understand how indexes work across technology and find that essentially it's the same.<a href="http://mysqldba.blogspot.com/2008/06/how-to-pick-indexes-for-order-by-and.html" target="_blank"> More info on indexes for mySQL can be found here.</a><br />
<br />
Also note:<br />
<br />
<a href="http://docs.mongodb.org/manual/reference/operator/explain/" target="_blank">Explain</a> in mongoDB is your friend just like <a href="http://dev.mysql.com/doc/refman/5.5/en/explain.html" target="_blank">Explain</a> in mySQLDathan Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0