Monday, June 24, 2013

First Week.5 in China: Part Two, Refactor PHP get 10% more capacity with one change

The PHP code that I've experienced in China so far is pretty good. I have been in some environments where the Code is horrendous-where variables are set in one file yet used in another file via a require_once.  If that magic variable is not set everything would break with side-effect galore. This is not the case here for the China Team. This team is really good not to imply the other-one wasn't just praising the current one.

The SQL, like many other companies I have been at requires some more extra effort, but the hunger to learn and improve is throughout the culture of the team here. Really that is the first step to improve a system, the willingness by developers and management in getting things done and fixed-fast.

Entering in the environment, first I read all the code. Then created a development environment to play with the code. Next I profiled how the database is being interacted with, and in conjunction with the cache. All looked ok, but with some back of the envelope calculations the server farm is to big for the amount of traffic. Traffic is huge don't get me wrong (5M+ DAU)! But the farm is too big. Digging some more I found that the code spins to search for items on a map by loading all map items, and in a for-loop go through each item until that 1 item is found, then return. This is done two - four times for every api request, especially to trigger an achievement if the item is found on your map. More on the fix later.

Before making any changes, I wanted to get feedback of what the biggest issue was that seem to cause bugs or slow down development. The consensus was the DB Layer was mixed into the Model Layer causing fear of changing said models because there was fear that they would break the DB Layer. It was not quite clear how the code communicated with the DB, thus work was done by the team to use more of the same existing functionality to fulfill feature requests which is sub-optimal if the root functionality was slow or expensive to use in the 1st place at scale.

Thus the 1st recommendation was to separate out the DB Layer in a way where the data being requested is accessed through Data Access Objects (DAO). This concept encapsulates DB logic and for the most part requires only three methods: add, get, delete. Some more complex objects calling DAO had specific SQL to make getting data faster but for the most part three methods per table was all that was needed. Following the new Directory Structure backed by PHP Namespaces all SQL was easy to find and isolated away from the model.

The second recommendation was to remove a bunch of in PHP caches of data, because this was the cause of a vast amount of copies chewing up a ton of memory per request as well as chewing up CPU to build the caches per request. If the cache hit rate is not good don't cache-added complexity sucks to maintain-and can actually slow things down if not needed.

The third recommendation was to make each Model a single instance per distinct entity (singleton-map) throughout the request which reduced the overall amount of database queries by coupling the model creation to database fetches. The database queries are reduced because instead of pulling the same data for object creation in various parts of the code, the single object was referenced.

So here is the new structure for models/database access/utils

v2
v2/classes/DB/   -- DB connection logic
v2/classes/DAO  -- DB Access 
v2/classes/Models  -- New Models
v2/classes/Util  -- common Utility classes
v2/init.php -- everything is setup from this structure

With this new structure, separation of responsibility has been created in the code. More people can work on the same feature. One person can optimize the SQL, while another plugs in the model and yet another handles the access logic (controller). Or a single person can do it all. Most importantly the team loves the new setup.

In my 1st week and 1/2)with the new model format added to the existing code base via editing 242 files for a single model's usage (the largest and one of the most important models that controls the MAP locations of the game) the result has been great, a 10% drop in the number of servers and no user complaints with still more room for improvement. The biggest change was due to removing the spin through all the map locations to find a single item. The fix was changing a O(n) method in PHP getting hit hard to a O(1).

60 more models to go.

The good note about the unoptimized code is its forced the dev ops side of things to mature quickly and the tools that they built are really robust. To deal with features being pushed out that may not be mature enough for the request load the team built this cool dashboard with Jenkins automation, home grown software, realtime server metrics and rules to launch new instances and shrink them automatically throughout the day. It works flawlessly, for the front ends that is. It's pretty cool. Its so good and works so well I am hoping one day that it could be an OpenSource Project on its own.

No comments: