Saturday, January 27, 2018

Spotify Top 200 in MySQL

I do a lot of data analysis lately, trying to answer my company's pressing questions through data. Let's look at the past year, 2017, and answer some questions for people who like music.



artist is the artist name
track is the track name
list_date is the chart date on which the track appeared in the top200
streams is the number of plays, counted according to Spotify's rules
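The table definition itself isn't reproduced here; a rough sketch consistent with these columns and the queries below (which also filter on country and pos) might look like the following - the column types are assumptions:

CREATE TABLE spotify.top200 (
  artist    VARCHAR(255) NOT NULL,       -- artist name
  track     VARCHAR(255) NOT NULL,       -- track name
  country   CHAR(2)      NOT NULL,       -- chart country code, e.g. 'us'
  pos       SMALLINT UNSIGNED NOT NULL,  -- chart position, 1 through 200
  streams   INT UNSIGNED NOT NULL,       -- plays counted per Spotify's rules
  list_date DATE         NOT NULL,       -- chart date
  PRIMARY KEY (country, list_date, pos)
) ENGINE=InnoDB;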

Let's look at the data set

select count(*) from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01';
+----------+
| count(*) |
+----------+
|    74142 |
+----------+
1 row in set (0.04 sec)


How many artists made it into the top200 for the United States?

mysql> select count(DISTINCT(artist)) from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01';
+-------------------------+
| count(DISTINCT(artist)) |
+-------------------------+
|                     527 |
+-------------------------+
1 row in set (0.09 sec)

Wow, it's really hard to be a musician. Only 527 artists broke into the top200.

How many tracks broke into the top200 in 2017?

 select count(DISTINCT(track)) from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01';
+------------------------+
| count(DISTINCT(track)) |
+------------------------+
|                   1682 |
+------------------------+

For the entire year, 1,682 songs largely defined the United States' listening habits.


Who showed up the most in the top200 for 2017?

mysql> select artist,count(*) AS CNT from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 1 order by 2 DESC LIMIT 10;
+------------------+------+
| artist           | CNT  |
+------------------+------+
| Drake            | 3204 |
| Lil Uzi Vert     | 1891 |
| Kendrick Lamar   | 1874 |
| Post Malone      | 1776 |
| Ed Sheeran       | 1581 |
| The Weeknd       | 1566 |
| Migos            | 1550 |
| Future           | 1536 |
| The Chainsmokers | 1503 |
| Kodak Black      | 1318 |
+------------------+------+
10 rows in set (0.16 sec)

Drake killed it, but Lil Uzi Vert is the star of the year, IMHO. Drake has a pedigree while Lil Uzi just started running.

From these artists I can also tell that hip hop dominated the US charts; let's verify this assumption.

mysql> select artist,SUM(streams) AS CNT from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 1 order by 2 DESC LIMIT 10;
+------------------+------------+
| artist           | CNT        |
+------------------+------------+
| Drake            | 1253877919 |
| Kendrick Lamar   | 1161624639 |
| Post Malone      |  954546910 |
| Lil Uzi Vert     |  818889040 |
| Ed Sheeran       |  714523363 |
| Migos            |  682008192 |
| Future           |  574005011 |
| The Chainsmokers |  557708920 |
| 21 Savage        |  472043174 |
| Khalid           |  463878924 |
+------------------+------------+
10 rows in set (0.48 sec)


Yup, hip hop dominated the top 10 by streams.

What about tracks? What are the top 10 tracks by streams?

 select track,SUM(streams) AS CNT from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 1 order by 2 DESC LIMIT 10;
+-------------------+-----------+
| track             | CNT       |
+-------------------+-----------+
| HUMBLE.           | 340136186 |
| XO TOUR Llif3     | 314758565 |
| Congratulations   | 283551832 |
| Shape of You      | 280898054 |
| Unforgettable     | 261753940 |
| Mask Off          | 242524530 |
| Despacito - Remix | 241370570 |
| rockstar          | 225517132 |
| Location          | 224879215 |
| 1-800-273-8255    | 219689749 |
+-------------------+-----------+
10 rows in set (0.43 sec)

Which tracks and artists had the most time in the top200?

 select artist,track,count(*) AS CNT from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 2 order by 3 DESC LIMIT 10;
+------------------+-------------------------------------+-----+
| artist           | track                               | CNT |
+------------------+-------------------------------------+-----+
| D.R.A.M.         | Broccoli (feat. Lil Yachty)         | 485 |
| French Montana   | Unforgettable                       | 417 |
| PnB Rock         | Selfish                             | 394 |
| Travis Scott     | goosebumps                          | 365 |
| Post Malone      | Go Flex                             | 365 |
| Childish Gambino | Redbone                             | 365 |
| Post Malone      | Congratulations                     | 365 |
| Post Malone      | White Iverson                       | 365 |
| Migos            | Bad and Boujee (feat. Lil Uzi Vert) | 364 |
| Bruno Mars       | That's What I Like                  | 364 |
+------------------+-------------------------------------+-----+
10 rows in set (0.20 sec)

Also from this data I can tell that Post Malone had a fantastic year!


So, more questions can be answered, like who held the number 1 position on the top200 the most?

select artist,track,count(*) AS CNT from spotify.top200 WHERE country='us' and pos=1 and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 2 order by 3 DESC LIMIT 10;
+----------------+-------------------------------------+-----+
| artist         | track                               | CNT |
+----------------+-------------------------------------+-----+
| Post Malone    | rockstar                            | 105 |
| Kendrick Lamar | HUMBLE.                             |  67 |
| Ed Sheeran     | Shape of You                        |  48 |
| Luis Fonsi     | Despacito - Remix                   |  47 |
| Migos          | Bad and Boujee (feat. Lil Uzi Vert) |  29 |
| 21 Savage      | Bank Account                        |  20 |
| Drake          | Passionfruit                        |  12 |
| Logic          | 1-800-273-8255                      |  10 |
| Taylor Swift   | Look What You Made Me Do            |  10 |
| French Montana | Unforgettable                       |   7 |
+----------------+-------------------------------------+-----+
10 rows in set (0.26 sec)


Wow, we can see here that Post Malone is the star!

In summary, pulling public data sources into a database and running simple queries can give clear insight and answer some of the pressing questions one may have.

With the schema above, what questions would you answer?
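For example, one more question the same schema can answer is which artists charted the most distinct tracks; a hypothetical query (output omitted):

mysql> select artist, count(DISTINCT track) AS tracks_charted from spotify.top200 WHERE country='us' and list_date >= '2017-01-01' and list_date < '2018-01-01' group by 1 order by 2 DESC LIMIT 10;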

Friday, January 19, 2018

Deploying Go Applications in Docker Containers using a Scratch Dockerfile

Programming in golang is fantastic. I find it fun, expressive, and simple for building concurrent programs. Deploying a golang app from laptop to production, however, is as hard now as it was back when I was building monolithic services. A great way to deploy nowadays is to deploy microservices in containers. Containers keep the environment between the laptop and an AWS Linux instance in sync, since in essence the environment is deployed and not just the code or binary.

Containerization of the environment is not ideal, though. Sometimes you end up shipping containers of 1GB in size or more. Deploying that across the LAN is OK; over the WAN, it is debatable. So, to deal with this problem I work with scratch Dockerfiles when deploying applications.

Starting from a scratch Dockerfile, I know that there is no real environment overhead, since the environment is the most basic it can be. Additionally, I do not have to worry about the golang environment in my container, because we are not going to ship "golang and all its packages"; we are going to ship the binary itself. This is best described with an example.
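For reference, a minimal scratch Dockerfile along these lines looks roughly like the following sketch (the binary name goprogram is an assumption, taken from the image tag used below):

FROM scratch
# ship only the Linux binary built on the laptop; nothing else goes into the image
COPY goprogram /goprogram
EXPOSE 8282
ENTRYPOINT ["/goprogram"]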



The Dockerfile is like a Makefile, but for your environment. Each line describes a step Docker takes to build the image. Prior to executing the Dockerfile, we need to build the binary it will package.

GOOS=linux go build .                                                                                           

This line will build the go program as a Linux binary.

docker build -t dathanvp/goprogram:latest .

This line says: execute the Dockerfile and tag the resulting image as dathanvp/goprogram.

docker run -p 8282:8282 -v /Users/dathan/gocode/src/github.com/dathanvp/goprogram/logs:/mnt:rw dathanvp/goprogram:latest

Now, this is the magic. Docker will open port 8282 on the host and map it to port 8282 in the container. A volume is attached from my laptop to the container's /mnt directory with read and write privileges. (When executing my container in production, only this line changes.) This volume keeps the logs persistent: containers reset state, losing anything generated, which is the reason for the volume. Finally, docker run runs my image dathanvp/goprogram:latest.

I deploy my containers to AWS by executing

docker push dathanvp/goprogram

This pushes my Go program's image from my laptop to cloud.docker.com, where my AWS instances can then pull it, enabling me to run my programs in production without having to set up the environment on AWS (other than Docker, of course).
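On an AWS instance, bringing the service up is then just a pull and a run (the log path here is hypothetical, following the same volume pattern as above):

docker pull dathanvp/goprogram:latest
docker run -d -p 8282:8282 -v /var/log/goprogram:/mnt:rw dathanvp/goprogram:latest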

Finally, why do it this way? I want my program to run on my laptop and on my AWS Ubuntu servers without having to keep golang development environments in sync. Additionally, I want my containers to be really small so I don't have to ship hundreds of megs around to start the application, which itself is about 13MB. Uploading over Comcast sucks. So, in conclusion, this is the best way I've found so far :)

Please let me know how you ship go applications and why.


Monday, December 11, 2017

Designing an RDBMS SQL Table


Building tables initially should not really require a lot of thought. What? Yes: I'm suggesting that when designing a table, you think of the table as a spreadsheet.


For instance, let's create a table that combines all social scores of a user's media in a single table. We will call this table platform_resources.

What do we need to record the social score total of a single person?
  • Who is this person? How do I know this persona?
  • What is the platform? Which social platform does this resource refer to?
  • What is the platform identifier? The social platform's identifier for the media.
  • What is a common social score for each user? Views, Likes, Comments.
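The resulting table isn't reproduced here; a rough sketch that matches the description follows - column names not mentioned above and all of the types are assumptions:

CREATE TABLE platform_resources (
  internal_name VARCHAR(50)  NOT NULL,            -- who this person is / how I know this persona
  platform      VARCHAR(20)  NOT NULL,            -- which social platform (instagram, youtube, ...)
  platform_id   VARCHAR(50)  NOT NULL,            -- the platform's identifier for the media
  views         INT UNSIGNED NOT NULL DEFAULT 0,
  likes         INT UNSIGNED NOT NULL DEFAULT 0,
  comments      INT UNSIGNED NOT NULL DEFAULT 0,
  date_taken    DATETIME     NOT NULL,            -- when the media was posted
  resource_url  VARCHAR(255) DEFAULT NULL,        -- hypothetical; stands in for the 256-byte column in the estimate
  PRIMARY KEY (platform_id, platform, internal_name)
) ENGINE=InnoDB;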


So the table above answers my questions. For each piece of media that an internal_name owns, I am able to collect a summary of basic stats. By no means is this optimized. The row size is roughly

21+51+51+4+4+4+4+4+256+4 = 403 bytes, not taking into account the primary key, which is very large and adds a small byte overhead due to exceeding an internal limit.

We are not optimizing yet, we are just answering questions.


The primary key was picked to be (platform_id, platform, internal_name). Following the left-most prefix rule for composite indexes, we get roughly 3 indexes in 1: the full primary key, (platform_id, platform), and finally (platform_id). This primary key was picked because, for a given platform, the platform_id is unique and the person who owns this platform_id should be represented. Additionally, since we are using InnoDB, the table is sorted (clustered) by the primary key.
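To illustrate the left-most prefix rule, each of these hypothetical lookups can be served by the primary key index:

-- full key: (platform_id, platform, internal_name)
SELECT views, likes, comments FROM platform_resources
 WHERE platform_id = 'abc123' AND platform = 'instagram' AND internal_name = 'creator_x';

-- prefix (platform_id, platform)
SELECT views, likes, comments FROM platform_resources
 WHERE platform_id = 'abc123' AND platform = 'instagram';

-- prefix (platform_id)
SELECT COUNT(*) FROM platform_resources WHERE platform_id = 'abc123';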

No optimizations; just a basic table gets the job done. Now, how would you optimize this table?
First, you should ask: what are you optimizing the table for? Disk size? Memory fit? Because it's ugly and it bothers me?

Let's estimate how this table will grow. This table is an MxN problem, where each internal_name will have N resources per platform. The bounds of the growth are around 1,000 items per year per platform, and M is less than 20K, so it's really not worth optimizing just for the sake of it. So don't.

If I had to optimize because the MxN problem turned into a huge overhead:
First, I would reduce the row size of the table by making lookup tables for internal_name, platform, and platform_id, which keeps the primary key smaller - probably within 64 bits.
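A sketch of that first step (the lookup table and column names here are hypothetical): the string values move into small lookup tables and the main table stores only integer surrogate keys.

CREATE TABLE lookup_platform (
  platform_sk TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  platform    VARCHAR(20) NOT NULL,
  UNIQUE KEY uk_platform (platform)
) ENGINE=InnoDB;

CREATE TABLE lookup_internal_name (
  internal_name_sk MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  internal_name    VARCHAR(50) NOT NULL,
  UNIQUE KEY uk_internal_name (internal_name)
) ENGINE=InnoDB;

-- a third lookup would map the external platform_id string to an INT, so the
-- composite primary key on platform_resources fits within 64 bits.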

Next, distribute the table either by date_taken range, since queries will be more interested in the latest data, or by internal_name; but that is another post.

Finally, sometimes you just need a table and you just want to query it, like "give me the total sum of views for all Instagram videos by a creator." The point of this post is to think about optimizations when you need to think about optimizations, and not beforehand. If your needs change, change the schema to focus on the optimization you are going for. :)
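That query might look like this (creator name hypothetical):

SELECT SUM(views) AS total_views
  FROM platform_resources
 WHERE internal_name = 'some_creator'
   AND platform = 'instagram';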



Wednesday, December 06, 2017

Back to Sharing stuff I learned


I have not been regular in blog posts as I've just been focused on everything. I got lazy. Well, that is over.

At Shots Studios, the teen social network consisting of nearly 2M lines of code is no more. Shots is now a one-stop shop for select Creators. We are a production studio, ad/talent agency, and talent management media company focused on creating timeless content: a 21st-century answer to getting great content from great creators in front of their audience.

Your internal monologue after reading this is probably: how does this have anything to do with MySQL, HA, scale, or coding? Even if it isn't, this is still a good segue to explain how.

Shots the app did really well, yet not well enough to compete with Snapchat and Instagram. We did gain a lot of insight, mainly into what are called influencers. A large percentage of the time spent growing the Shots platform went to handling their cases of spiky scale. When influencers posted, they would promote their selfie on other platforms, sending waves of teens all at once to their data. Honestly, this was an amazing challenge to scale on a tight budget: cold to performant in milliseconds, with a sudden 600% increase in load/concurrency. The short answer to scaling this was to keep the data in memory. From this, we understood that influencers' reach and ability to move users is more effective than display ads. Period.

We did a huge analysis of our user base, and from that analysis we made the decision to keep all "influencers" in memory, along with the sticky users - the percentage of DAU that comes back with frequency. Next, to make sure we did not saturate a network interface by keeping their data in memory on a single box, we replicated this subset of users among redundant pairs. Finally, we had to keep a higher-than-normal number of frontends in reserve to handle sudden bursts without the startup delta of dynamically scaling pools.

Now we use a subset of the tech we developed to mine and analyze data about Creators. Creators were influencers, but now they create, perform, direct, and edit content, thus they are called Creators. For instance, we use a custom, performant event-tracking system to monitor the social engagement of all Creators. If you have heard of a site called Social Blade, I basically duplicated it with much higher precision than their data.

With this we are able to tell which of a creator's content strikes a chord with users, and then we produce more of that performant content. For instance, https://shots.com/superheroes. With this insight, analysis, and data collection, and by maximizing the reach channels on platforms like YouTube and Instagram on a shoestring budget, we are making data-rich, informed decisions.

Tuesday, May 23, 2017

Golang (Go) and BoltDB

I've been using Go for some time now (3 years) and I am constantly impressed with the language's ease of use. I started my career in C/Unix systems programming, then Java, then PHP, and now I am rather language agnostic. Out of all the languages I know, Go is the most fun, and there is a strong community behind it.


BoltDB is yet another NoSQL key-value store, designed to be embedded, and I happened across it for a small use case. I use Go to crawl sites and parse the HTML DOM in a very concurrent manner, gathering data for analysis from a variety of remote web sources. BoltDB is used to keep state as I move from my local MacBook to a remote server, and it is very easy to use. Basically, I needed a portable embedded database that is fast and resilient, without setting up MySQL and keeping the schema in sync between dev and production. This is not user-facing; it is just a set of Go packages that help me keep state so I know where to pick up from in case of some sort of error, like turning off my laptop or some random panic.


Let's look at BoltDB usage. Below is my struct; everything is a string because I am not formatting or typing things yet.


type TableRow struct {
       Title    string
       Time     string
       Anchor   string
       Price    string
       Notified string // could make this a Time struct but let's be simple
}


Next, I open my.db, creating it if it doesn't exist. The function check looks for errors and panics. The line defer db.Close() will close the db at the end of the function these calls are made from. The function addRecord will create a bucket called parser_bucket (the name is the const bucket) and add the key with its value, triggering bucket creation if this is the first run. It is something fast to make a point, and yes, there are more efficient ways to do this.

db, err := bolt.Open("my.db", 0644, &bolt.Options{Timeout: 10 * time.Second})
check(err)
defer db.Close()
addRecord(db, []byte("start"), "starting") // create bucket when it doesn't exist
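The check helper isn't shown in the post; based on the description above (panic on any error), a minimal version would be:

func check(err error) {
       if err != nil {
              panic(err)
       }
}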


The function addRecord takes 3 arguments: db, the BoltDB handle; key, a byte slice; and v, a value which can be anything - in our case the TableRow struct above. The function name is lower case, so it is not "public" (exported). The interface v is marshaled into a byte slice and stored in BoltDB after the bucket is confirmed to exist. Finally, addRecord returns an error if an error occurred.

func addRecord(db *bolt.DB, key []byte, v interface{}) error {
       value, err := json.Marshal(v)
       check(err)
       return db.Update(func(tx *bolt.Tx) error {
              bkt, err := tx.CreateBucketIfNotExists([]byte(bucket))
              if err != nil {
                     return err
              }
              fmt.Printf("Adding KEY %s\n", key)
              return bkt.Put(key, value)
       })
}


To get a TableRow out of the database, a read transaction is performed in BoltDB. This method is capitalized, so it is a public (exported) method of the package. GetRecord returns a TableRow or panics if an error occurred.

func GetRecord(db *bolt.DB, key string) *TableRow {
       row := TableRow{}
       err := db.View(func(tx *bolt.Tx) error {
              bkt := tx.Bucket([]byte(bucket))              
              if bkt == nil {                     
                 return fmt.Errorf("Bucket %q not found!\n", bucket)
              }

              val := bkt.Get([]byte(key))
              if len(val) == 0 {                     
                 fmt.Printf("key %s does not exist\n", key)                     
                 return nil              
              }
              err := json.Unmarshal(val, &row)
              return err
       })
       check(err)
       return &row
}





Calling this function returns a *TableRow. Go pointers are not like C pointers (there is no pointer arithmetic), but I conceptualize this internally as a pointer to the row.

This is it; this is all there really is to BoltDB: read transactions and write transactions that are concurrency-safe. You can even run the Unix command strings on the database file to see if you stored the data correctly as a sanity check; you should see JSON in the output (if that is your serializer).
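For example, a quick sanity check on the database file from the shell:

strings my.db | head
# lines like {"Title":"...","Time":"...", ...} show the JSON-serialized rows were stored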

In conclusion, BoltDB is fast, so far safe, and does exactly what I need: storing state without requiring an external DB. Embedded databases are awesome and Go is awesome. Give it a try.







Tuesday, November 22, 2016

INNODB Tablespace Copy in Go Lang

I just uploaded a quick tool that I think you will find useful if you need to consolidate or expand InnoDB databases when file-per-table tablespaces are in use.

This golang application will copy an entire InnoDB database from one server to another server via scp.
innodb-tablespace-copy follows the algorithm described here. After setting up the remote environment, this golang application copies 4 tables at a time in parallel, then imports the tablespaces in parallel. I've only used this application on Percona XtraDB 5.6, but it should work for all flavors of InnoDB out there.
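For reference, the per-table steps being automated follow MySQL 5.6 transportable tablespaces, roughly like this (database and table names hypothetical):

-- on the destination server: create the same table definition, then
ALTER TABLE mydb.t1 DISCARD TABLESPACE;

-- on the source server: quiesce the table and write the .cfg metadata file
FLUSH TABLES mydb.t1 FOR EXPORT;
-- scp t1.ibd and t1.cfg into the destination's database directory, then
UNLOCK TABLES;

-- on the destination server: adopt the copied tablespace
ALTER TABLE mydb.t1 IMPORT TABLESPACE;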

Note: recovering from an interruption is done manually, either by discarding the tablespace or by dropping the remote database.

Feel free to add to it and make it better :)