Saturday, January 27, 2018

Spotify Top 200 in mySQL

I do a lot of data analysis lately, and I try to find answers to questions through data for my companies pressing questions. Let's look at the past year of 2017 and answer questions for people who like music.

artist is the artist name
track is the artist's track name
list_date is which chart date the artist show up on the top200
streams is the number of plays following spotify specific rules

Let's look at the data set

select count(*) from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01';
| count(*) |
|    74142 |
1 row in set (0.04 sec)

How many artists made it in the top200 for the United States?

mysql> select count(DISTINCT(artist)) from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01';
| count(DISTINCT(artist)) |
|                     527 |
1 row in set (0.09 sec)

Wow, it's really hard to be a musician. Only 527 broke the top200.

How many tracks in 2017 broke the top200?

 select count(DISTINCT(track)) from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01';
| count(DISTINCT(track)) |
|                   1682 |

For the entire year, 1682 songs defined the united states listing habits for the most part.

Who showed up the most in the top200 for 2017?

mysql> select artist,count(*) AS CNT from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 1 order by 2 DESC LIMIT 10;
| artist           | CNT  |
| Drake            | 3204 |
| Lil Uzi Vert     | 1891 |
| Kendrick Lamar   | 1874 |
| Post Malone      | 1776 |
| Ed Sheeran       | 1581 |
| The Weeknd       | 1566 |
| Migos            | 1550 |
| Future           | 1536 |
| The Chainsmokers | 1503 |
| Kodak Black      | 1318 |
10 rows in set (0.16 sec)

Drake killed it, but Lil Uzi Vert is the star of the year, IMHO. Drake has a pedigree while Lil Uzi just started running.

Also from these artists I can tell HIP HOP dominated us charts; Let's verify this assumption.

mysql> select artist,SUM(streams) AS CNT from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 1 order by 2 DESC LIMIT 10;
| artist           | CNT        |
| Drake            | 1253877919 |
| Kendrick Lamar   | 1161624639 |
| Post Malone      |  954546910 |
| Lil Uzi Vert     |  818889040 |
| Ed Sheeran       |  714523363 |
| Migos            |  682008192 |
| Future           |  574005011 |
| The Chainsmokers |  557708920 |
| 21 Savage        |  472043174 |
| Khalid           |  463878924 |
10 rows in set (0.48 sec)

Yup hip hop dominated the top 10 steams.

What about tracks? What are the top 10 tracks by streams?

 select track,SUM(streams) AS CNT from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 1 order by 2 DESC LIMIT 10;
| track             | CNT       |
| HUMBLE.           | 340136186 |
| XO TOUR Llif3     | 314758565 |
| Congratulations   | 283551832 |
| Shape of You      | 280898054 |
| Unforgettable     | 261753940 |
| Mask Off          | 242524530 |
| Despacito - Remix | 241370570 |
| rockstar          | 225517132 |
| Location          | 224879215 |
| 1-800-273-8255    | 219689749 |
10 rows in set (0.43 sec)

Which tracks and artists had the most time in the top200?

 select artist,track,count(*) AS CNT from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 2 order by 3 DESC LIMIT 10;
| artist           | track                               | CNT |
| D.R.A.M.         | Broccoli (feat. Lil Yachty)         | 485 |
| French Montana   | Unforgettable                       | 417 |
| PnB Rock         | Selfish                             | 394 |
| Travis Scott     | goosebumps                          | 365 |
| Post Malone      | Go Flex                             | 365 |
| Childish Gambino | Redbone                             | 365 |
| Post Malone      | Congratulations                     | 365 |
| Post Malone      | White Iverson                       | 365 |
| Migos            | Bad and Boujee (feat. Lil Uzi Vert) | 364 |
| Bruno Mars       | That's What I Like                  | 364 |
10 rows in set (0.20 sec)

Also from this data I can tell that Post Malone had a fantastic year!

So, more questions can be answered, like who held the number 1 position on the top200 the most?

select artist,track,count(*) AS CNT from spotify.top200 WHERE country='us' and pos=1 and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 2 order by 3 DESC LIMIT 10;
| artist         | track                               | CNT |
| Post Malone    | rockstar                            | 105 |
| Kendrick Lamar | HUMBLE.                             |  67 |
| Ed Sheeran     | Shape of You                        |  48 |
| Luis Fonsi     | Despacito - Remix                   |  47 |
| Migos          | Bad and Boujee (feat. Lil Uzi Vert) |  29 |
| 21 Savage      | Bank Account                        |  20 |
| Drake          | Passionfruit                        |  12 |
| Logic          | 1-800-273-8255                      |  10 |
| Taylor Swift   | Look What You Made Me Do            |  10 |
| French Montana | Unforgettable                       |   7 |
10 rows in set (0.26 sec)

Wow can see hear that Post Malone is the star!

In summary, getting public data sources and doing simple queries can give a clearer insight into data to answer some pressing questions one may have.

With the schema above what questions would you answer?

Friday, January 19, 2018

Deploying Go Applications in Docker Containers using a Scratch Docker File

Programming in golang is fantastic. I find it fun, expressive and simple to build concurrent programs. Deploying a golang app from laptop to production is as hard now as when it was back when I was building Monolithic services. A great way to deploy nowadays is to deploy microservices in containers. Containers keep the environment between laptop and AWS Linux instance in sync since in essence the environment is deployed and not just the code or binary.

Containerization of the environment is not ideal although. Sometimes you can ship containers of 1GB in size or more. Deploying that across the LAN is ok, over the WAN .. it is debatable. So, to deal with this problem I work with scratch Dockerfiles when deploying applications.

Starting from scratch Dockerfiles, I know that there is no real environment overhead since the environment is the most basic it can be. Additionally, I do not have to worry about the golang environment in my container because we are not going to ship "golang and all its packages", we are going to ship the binary itself. This is best described as an example.

The Dockerfile, is like a Makefile but for your environment. Each line describes what the Dockerfile does. Prior to executing the docker file, we will need to set up the environment.

GOOS=linux go build .                                                                                           

This line will build the go program as a Linux binary.

docker build -t dathanvp/goprogram:latest .

This line says; execute the docker file and tag the image as dathanvp/goprogram.

docker run -p 8282:8282 -v /Users/dathan/gocode/src/ dathanvp/goprogram:latest

Now, this is the magic. Docker will open port 8282 and map it to port 8282 in the container. A volume is attached from my laptop to the container's /mnt directory with read and write privileges. (When executing my container in production only this line changes.) This volume is to keep the logs persistent. Containers reset state, thus losing anything generated and the reason for my volume. Finally docker run is going to run my image dathanvp/goprogram

I deploy my container's  to AWS by executing

docker push dathanvp/goprogram

This pushes my go program from my laptop to where my aws instances can then pull from, enabling running my programs in production without having to set up the environment on aws (other than docker of course).

Finally, why do it this way? I want my program to run on my laptop and on my AWS ubuntu servers without having to keep golang development environments in sync. Additionally, I want my containers to be really small so I don't have to ship hundreds of megs around to start the application, which itself is about 13MB. Uploading from comcast sucks. So, in conclusion, this is the best way I've found so far :)

Please let me know how you ship go applications and why.

Monday, December 11, 2017

Designing a RDBMS SQL Table

Building tables initially should not really require a lot of thought. What? I'm suggesting that when designing a table think of the Table as a spreadsheet. Yes.

For instance, let's create a table that combines all social scores of a users' media in a single table. We will call this table platform_resources.

What do we need to record the social score total of a single person?
  • Who is this person? How I know this persona. 
  • What is the platform? Which Social Platform does this reference refer to
  • What is the platform identifier? What is the social platform identifier
  • What is a common social score for each user? View, Likes, Comments

So the table above answers my questions. For each piece of media that and interna_name owns, I am able to collect a summary of basic stats. By no means is this optimized. The row size is roughly

21+51+51+4+4+4+4+4+256+4 = 403 bytes not taking into account the primary key which is very large and takes a small byte overhead due to the exceeding an internal limit.

We are not optimizing yet, we are just answering questions.

The Primary Key was picked to be platform_id, platform, internal_name. Following the Left Most prefix rule for composite indexes, we have roughly 3 indexes in 1 index. The original primary key, platform_id & platform, then finally platform_id. The primary key was picked to be this because for a platform the platform_id is unique and the person who owns this platform_id should be represented. Additionally, since we are using INNODB the table is sorted by the primary key.

No optimizations just a basic table get's the job done. Now how would you optimize this table?
First, you should ask what are you optimizing the table for? Disk size? Memory fit? Because its ugly and it bothers me?

Let's estimate how this table will grow. This table is a MxN problem where for each internal_name they will have N resources per platform. The bounds of the growth are around 1000 items per year per platform. M is less than 20K so, It's really not worth it to optimize for any other reason just to do it because. So don't.

If I had to optimize because the MxN problem turned into a huge overhead.
First, I would reduce the row size of the table by making lookup tables for internal_name, platform, platform_id which keeps the primary key smaller - probably in 64 bits.

Next, distribute the table by either date_taken range since queries will be more interested in the latest data, or we can distribute the table by internal_name; this is another post.

Finally, sometimes you just need a table and you just want to query it like give me the total sum of views for all Instagram videos by a creator. The post is to think about optimizations when you need to think about optimizations and not beforehand. If your needs changes; change the schema to focus on the optimization you are going for. :)

Wednesday, December 06, 2017

Back to Sharing stuff I learned

I have not been regular in blog posts as I've just been focused on everything. I got lazy. Well, that is over.

At Shots Studios, a teen social network consisting of nearly 2M lines of code is no more. Shots is now a one-stop shop for select Creators. We are a Production Studio, Ad/Talent Agency, Talent Management Media company focused on creating timeless content. A 21st-century answer to getting great content from great creators in front of their audience.

Your internal monologue after reading this is how does this have anything to do with MySQL, HA, Scale, Coding; if not, this is still a good segway to explain how.

Shots the App, did really well yet not well enough to compete with Snapchat and Instagram. We did gain a lot of insight, mainly in what is called Influencers. A large percentage of time in growing the Shots platform was handling their cases of spikey scale. When Influencers posted they would promote their Selfie on other platforms sending waves of teens all at once to their data. Honestly, this was an amazing challenge to scale on a tight budget. Cold to Performant in millisecond time, with a 600% increase in load/concurrency suddenly. The short answer to scale this was to keep data in memory - From this, we understood that influencers reach and ability to move users is more effective than Display Ads. Period.

We did a huge analysis about our user base, and from that analysis, we made the decision to keep all "Influencers" in memory, and people who were the sticky users-the percentage of DAU that comes back with frequency. Next, to make sure that we did not saturate a network interface by keeping their data in memory on a single box, we replicated this subset of users among redundant pairs. Finally, we had to keep higher than normal frontends in reserve to handle the sudden burst without the startup delta of dynamic scaling pools.

Now we use a subset of the tech developed to mine, analyze, data about Creators. Creators, were influencers but now create, perform, direct, edit content thus they are called Creators. For instance, we use a custom performant event tracking system to monitor the social engagement of all creators. If you heard of a site called socialblade, I basically duplicated it at a much higher precision then their data.

With this we are able to tell which of a creator's content strikes a chord with users then we produce more of that performant content. For instance, With this insight, analysis, data collection and maximizing the reach channels on platforms like YouTube, Instagram with a shoestring budget we are making data-rich informed decisions.

Tuesday, May 23, 2017

Golang (Go) and BoltDB

I've been using Go for some time now (3 years) and I am constantly impressed with the language's ease of use. I originally started my career in C-Unix System Programming, then Java, then PHP and now I am rather language agnostic. Out of all the languages I know, go is the most fun and there is a strong community behind it.

BoltDB is yet another NoSQL Key-Value store, designed to be embedded and I happened across it for a small use case. I use GO to crawl sites and parse HTML DOM in a very concurrent manner to gather data for analysis from a variety of remote web sources. BoltDB is used to keep state as I transfer from my local mac book to a remote server and it is very easy to use. Basically, I needed a portable embedded database that is fast and resilient without setting up MySQL and keeping the schema in sync between dev and production. This is not user facing just a set of go packages that help me keep state so I can know where to pick up from in case of some sort of error, like I turn off my laptop or some random panic.

Let's look at BoltDB usage. Below is my struct, everything is a string because I am not formatting or typing things yet.

type TableRow struct {       

       Title string       
       Time string       
       Anchor string      
       Price string       
       Notified string // could make this a Time Struct but let's be simple

Next, I create my.db if it doesn't exist. The function check looks to see if there are errors and panics.  The line defer db.Close() will close the db at the end of the function which these calls are made from. The function addRecord will create a bucket called parser_bucket which is a const and add the key byte with value triggering a bucket creation if this is the first run. It is something fast to make a point and yes there are more efficient ways to do this.

db, err := bolt.Open("my.db", 0644, &bolt.Options{Timeout: 10 * time.Second})
defer db.Close()
addRecord(db, []byte("start"), "starting") // create bucket when it doesn't exist

The function addRecord takes 3 arguments; db - the boltdb struct, key a byte array and a value which can be anything, in our case, TableRow the struct above. The function is lower case so it is not "public".  The interface v is marshaled into a byte array and stored in boltdb after it checks that the bucket is created. Finally, the addRecord function returns an error if an error occurred.

func addRecord(db *bolt.DB, key []byte, v interface{}) error {
       value, err := json.Marshal(v)
       return db.Update(func(tx *bolt.Tx) error {              
                  bkt, err := tx.CreateBucketIfNotExists([]byte(bucket))
                  if err != nil {                     
                     return err              
                  fmt.Printf("Adding KEY %s\n", key)              
                  return bkt.Put(key, value)       

To get a TableRow out of the database a read transaction is performed in BoltDB. This method is capitalized so it is a package public method. GetRecord returns a table row or panics if an error occurred.

func GetRecord(db *bolt.DB, key string) *TableRow {
       row := TableRow{}       err := db.View(func(tx *bolt.Tx) (error) {
              bkt := tx.Bucket([]byte(bucket))              
              if bkt == nil {                     
                 return fmt.Errorf("Bucket %q not found!\n", bucket)

              val := bkt.Get([]byte(key))
              if len(val) == 0 {                     
                 fmt.Printf("key %s does not exist\n", key)                     
                 return nil              
              err := json.Unmarshal(val, &row)
              return err
       return &row

Calling this function returns a TableRow reference. There are no real pointers in go but I conceptualize this internally as a pointer.

This is it. This is all there really is to BoltDB. Read Transactions, Write Transactions that are concurrency-safe. You can even run the Unix command strings on the database file so see if you stored the data correctly as a sanity check and you should see json from the output (if that is your serializer).

In conclusion, BoltDB is fast, so far safe and does exactly what I need. Store State, without expecting an external DB. Embedded databases are awesome and go is awesome. Give it a try.

Tuesday, November 22, 2016

INNODB Tablespace Copy in Go Lang

I just uploaded a quick tool that I think you will find useful if you need to consolidate, expand innodb databases if tablespaces are in use.

This golang application will copy an entire innodb database from one server to another server via scp.
innodb-tablespace-copy follows the algorithm described here. This golang application copies 4 tables in parallel after setting up the remote environment. Then in parallel import the tablespaces. I've only used this application on Percona XTRADB 5.6 but it should work for all flavors of innodb that are out there.

Note to recover from interruption, this is done manually either by discarding the tablespace or by dropping the remote database.

Feel free to add to it and make it better :)

Wednesday, August 10, 2016

Tech Stack at Shots Quick Post

The Shots APP we use the following technology to serve many millions of Photos, Videos and Cached Links.


RedHat Enterprise 6 on the Front ends and DBs. Amazon Linux (Centos) on Elastic Search and Go servers
Apache 2+
Percona 5.6 XTRADB with some minor custom stuff (sharded)

we have a little bit of Python, JAVA and a lot of GO!

One of the current features which has wildly been successful is sharing links on mobile, which is very hard. Mobile is not built for links but fortunately Instagram and YouTube are. To make this work; we have the client read from the clipboard. The client makes a call home where the link is sent to a distributed worker system which fetches the content of the HTML page, finds the media, manipulates the media and then distributes the media on our CDN. Links only last for a few days.

This is like a poor man's AMP and only took us a few days to write. We even retranscode videos to make sure the format fits our timeline and doesn't hog up to much bandwidth.

MySQL keeps state so the same link is not rebuilt and everything is fronted with Redis - since redis supports pipelined commands - which is great for a feed our size. The next feed version will be a go-tao-like system.

All in all for 3 days of work, the system works great and scales linearly. It is near real time. Give it a try. Link a Instagram or Youtube url and you will see for yourself.

Some things that I'd like to do in the future is use QUIC on a websocket layer. To have a non-blocking messaging system which is blazing fast and works on spotty networks and integrate ROCKSDB.

But that's another post.

Monday, September 28, 2015

Wish there is another String DataType called LIST but there is not

I believe the future of SQL is to take a lot of primitives that are Computer Science fundamentals and add them as datatypes to expand on the allowed columns today. The idea is of the ilk of a merging of noSQL and SQL  for solving problems to make it easier for a new person to develop.

For instance, what would be awesome is a LIST type, where the list contains a distinct number of string items mapped to a bit, much like SET yet you don't need to predefine all the items in a set.

Here is a good example as how I would use a list type:

Imagine you need permissions on a per row basis. Some rows are public, some are private, some are viewable by a small set of people. (Less than 64).

Let's take the example of Finding all rows that are public or are viewable by only me.

When creating a row

INSERT INTO resource_permissions (resource_id, perm_bit, list_dt) VALUES(1, 2, "dathan, sam, fred")

perm_bit is 0 private, 1 = public, 2 public to a list of people

When selecting rows that I "dathan" can see

SELECT resource_id FROM resource_permissions WHERE perm_bit = 1 UNION SELECT resource_id FROM resource_permissiongs WHERE perm_bit = 2 AND FIND_IN_LIST(list_dt, "dathan");

What the above statement says is give me all the public resource_ids and resource_ids that I "dathan" can see.

Right now I can't do this, I have to use a medium_blob and a LIKE

SELECT resource_id FROM resource_permissions WHERE perm_bit = 1 UNION SELECT resource_id FROM resource_permissions WHERE perm_bit = 2 AND list_dt LIKE "%:dathan:%"

As you can see I'm able to simulate the desired behavior but I can't use an index, I don't want to use a FULLTEXT_INDEX due to overhead and other issues that out of scope for this post. Nor do I want to manage UDF's or Stored procedures. The last two are not desirable yet can also simulate the behavior I am looking for.

Some primitives from REDIS or other noSQL solutions would be awesome additions for SQL as a hole IMHO.

My two cents.

Also in 5.7 maybe the JSON Column Type might be of some use.