9+ years to 2 days: How we supercharged a multi-year migration process

Posted: April 28th, 2014 | Author: | Filed under: Rails | No Comments »

We at Lumosity love to improve. Personal and professional development are as important as the product we build each and every day. With that expectation for improvement come radical paradigm shifts on core systems. This time it was our Brain Performance Index (BPI) system. BPI is a scale that allows users to see how well they are doing in the five core cognitive areas Lumosity is designed to help train. Every game on Lumosity falls into one of those cognitive areas, and each game ends with a score. That score is used to calculate a BPI for that game, which feeds into that game's area BPI, which in turn feeds into the user's overall BPI.

Late last year our science team began devising a new system for calculating BPI that was more responsive to each game play a user completed and could scale out to more games across our multiple platforms (web, iOS and soon Android). This new system was named the “Lumosity Performance Index” (LPI) and with it would come a new set of calculators that could transform a game play’s score to an LPI and also update a variety of other stats for that user, including the game’s area LPI and the user’s overall LPI.

Once the new system and calculators were built, we needed to build a way to migrate, or backpopulate, existing game plays' scores to LPI. At the time of this writing, we have over 60 million registered users who have played more than 1.6 billion games, and that number grows quickly each day.

The migrator script, version 1

Because LPI at any given moment in time is calculated as the result of all previous game plays up to that moment, migrating users entailed replaying the game play history of every registered user in order.

We used our Scripterator gem for this task, and came up with something like this:

require 'timecop'

Scripterator.run "Backpopulate LPI data for users" do
  for_each_user do |user|
    game_plays = GamePlay.for_user(user).order('created_at ASC')

    game_plays.each do |game_play|
      Timecop.travel(game_play.created_at) do
        lpi_game_play = calculate_lpi(game_play)
        lpi_game_play.save!
        update_area_lpi_for_game_play(game_play)
        update_overall_lpi_for_game_play(game_play)
        update_game_play_stats(game_play)
      end
    end
    user.grant_role(:lpi)
    true
  end
end

This was a pretty simple script that did everything we needed. We began running it on users with varying numbers of game plays and ended up with an average processing time of 0.2 seconds per game play. That didn't seem so bad until we realized that, unparallelized, this script would take 9.3 years to complete (multiply 0.2 seconds by the more than 1.6 billion game plays above and you get roughly a decade of serial processing). And with the incredible number of new game plays we get each day, we'd never catch up. So we thought, "Hey, let's parallelize it across multiple workers!" Even then, across 100 workers, it would take over a month to complete — far too slow.

We took a look at the logs output from running this migrator script on a single user, and saw that for about 400 game plays, we were making over 50,000 network calls (MySQL, Memcache, Redis)! That was unsustainable, and probably a big part of where our slowdown was coming from.

The migrator script, version 2

The first thing we needed to eliminate was all those network calls, and that meant putting more shared data into RAM. What we came up with was a set of 'RAM stores' that would replace ActiveRecord calls during the processing of each game play for a user. The goal was to reduce the network queries per game play to zero, and then do the saving and updating once we were done with all of a user's data and ready to move on to the next user to be migrated.

An example RAM store for our games table and one to store each new LPI game play for a user:

# ram_stores.rb

class GameStore
  def self.seed
    @games      ||= Game.all.to_a
    @games_hash = {}
  end

  def self.games 
    @games 
  end

  def self.find(id) 
    @games_hash[id] ||= games.find { |g| g.id == id } 
  end

  def self.find_by_bac_ids(area_ids) 
    @games.select { |g| area_ids.include?(g.brain_area_id) } 
  end 
end

class LpiGamePlayStore
  def self.reset 
    @lpi_game_plays = [] 
  end

  def self.lpi_game_plays 
    @lpi_game_plays ||= [] 
  end

  def self.add(game_play) 
    game_play.created_at = Time.now 
    lpi_game_plays << game_play 
  end 
end

We had to build eight stores in all to cover every model that used to call out to ActiveRecord to get or store data. But the stores by themselves were not enough: we needed to use them. Instead of building new models and calculators, we just opened up our existing models and redefined a few methods here and there.

# lpi_overrides.rb
class LpiGamePlay < ActiveRecord::Base
  def update_play_count
    count          = GameStatStore.find(game_id).try(:play_count) || 0
    count          += 1
    GameStatStore.set_for_game(game_id, count)
  end

  def recalculate_game_lpi(user)
    calc = LpiForGameCalculator.new(user, lpi_game_play: self, score: score)
    self.lpi = set_lpi_nil ? nil : calc.calculate
  end
end

class LpiForGameCalculator < GameCalculatorBase
  def initialize(user, attrs)
    super(user)
    @lpi_game_play   = attrs[:lpi_game_play]
    @game_id         = @lpi_game_play.try(:game_id) || attrs[:game_id]
    @game            = GameStore.find(@game_id)
    @score           = attrs[:score]
  end

  def calculate
    return nil unless lpi_game_play.present? && game_has_percentiles?
    return lpi_game_play.lpi if lpi_game_play.lpi.present?

    past_lpi_data  = past_lpi_lookup
    last_lpi       = past_lpi_data[:last]
    new_result_lpi = lpi_for_game_score

    new_lpi = if past_lpi_data[:count] < 3 || last_lpi == 0
      [last_lpi, new_result_lpi].max
    else
      new_lpi_for(last_lpi, new_result_lpi)
    end.to_i

    store_game_lpi(new_lpi)
    new_lpi
  end

  protected
  def past_lpi_lookup
    last_lpi = GameLpiStore.for_game(game.id).try(:lpi) || 0
    count    = GameStatStore.for_game_id(game.id).try(:play_count) || 0
    { count: count, last: last_lpi }
  end

  def fetch_percentiles_table
    GameLpiPercentile.get_table_for(game_id)
  end

  def store_game_lpi(new_lpi)
    GameLpiStore.set_for_game(game.id, new_lpi)
  end
end

We ended up opening up 11 of our classes and redefining about 20 methods so that they used the RAM stores instead of ActiveRecord. Our migrator script was responsible for requiring both ram_stores.rb and lpi_overrides.rb.

The updated migration script looked a bit like this:

require 'timecop'
require 'ram_stores'
require 'lpi_overrides'

PercentileStore.seed
GameStore.seed
BrainAttributeCategoryStore.seed

Scripterator.run "Backpopulate LPI data for users" do
  for_each_user do |user|
    GameStatStore.reset
    LpiGamePlayStore.reset
    GameLpiStore.reset(user)
    DailyLpiStore.reset(user)
    UserStore.set_user(user)

    game_plays = GamePlay.where(user_id: user.id).order('created_at ASC')

    game_plays.each do |game_play|
      Timecop.travel(game_play.created_at) do
        lpi_game_play = calculate_lpi(game_play)
        LpiGamePlayStore.add(lpi_game_play)

        # All updated to store to a RAM store
        update_area_lpi_for_game_play(game_play) 
        update_overall_lpi_for_game_play(game_play)
        update_game_play_stats(game_play)        
      end
    end

    # Store to DB with bulk-inserts
    LpiGamePlay.import!(LpiGamePlayStore.lpi_game_plays)
    GameStat.import!(GameStatStore.stats)
    GameLpi.import!(GameLpiStore.lpis)
    OverallLpi.import!(OverallLpiStore.lpis)
    AreaLpi.import!(AreaLpiStore.lpis)

    true
  end
end
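
The import! calls above write each store's accumulated rows back to MySQL in bulk rather than row by row. As a rough sketch of what that step can look like (assuming the activerecord-import gem as one way to get this behavior; our actual import! helpers may differ in detail):

# Bulk-insert sketch, assuming the activerecord-import gem.
require 'activerecord-import'

# One multi-row INSERT for all of a user's new records, instead of one
# INSERT (plus validations and callbacks) per game play.
LpiGamePlay.import(LpiGamePlayStore.lpi_game_plays, validate: false)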

Results

By replacing ActiveRecord, Memcache, and Redis calls with these RAM stores, our per-game-play processing time went from 0.2s down to as low as 0.007s, taking the total time from 9.3 years (unparallelized) to about four months (~128 days, unparallelized), or about 2 days parallelized across 100 workers. Success!


On Families, Set Covers, and Greed

Posted: March 12th, 2014 | Author: | Filed under: Rails | No Comments »

I recently had the pleasure of working on some exciting social features for Lumosity’s Family Plans. What better way to encourage people to train than to leverage the motivating community of their family and friends? One exciting component of the “Social Family Plan” feature is the weekly challenge. Members work together to collectively accomplish a single goal; this is something like “everyone completes three workouts” or “each person completes a workout in their least-proficient training category”.

Let’s take a look at a small, hypothetical feature which lends itself to interesting implementations, adapted from an old interview problem we used to assign to potential developers.

The Problem

Given a set of members on a family plan and all of their workouts from the past week (including the score and categories trained), let’s find the person least proficient in a given set of categories.

To be more precise, given the following input of a set of workouts and categories:

> finder workouts.csv attention speed

Where workouts.csv looks like:

Member name | Workout Score | Categories Trained in Workout
Groucho     | 15800         | attention
Groucho     | 23580         | speed, attention
Harpo       | 8200          | memory
Harpo       | 17750         | attention, speed, flexibility

Which returns:

> Harpo, 17750

Approach

One strategy here is a greedy algorithm that quickly finds each family member's lowest total score covering a given set of categories. You can find a Ruby implementation of this strategy, along with the examples covered here, in the related GitHub repository.

If we take a closer look at this problem, we can identify it as the weighted set cover problem:

Given a set U of n elements, a collection S1, S2, … , Sm of subsets of U, with weights wi,
find a collection C of these sets Si whose union is equal to U and such that the sum of its weights is minimized.

We can use a solution to the weighted set cover problem to determine the lowest-scoring set of workouts that contain all of the given categories per person. Once we have each person’s lowest score, all we need to do is choose the minimum.

Let’s use a greedy algorithm for finding one member’s lowest-scoring set:

While there are still requested categories:
  workout = get the lowest scoring workout
  add workout to solution set
  remove categories from requested categories
Return the sum of the solution set's workout scores

To get the lowest scoring workout, we do the following (this is the greedy part):

For each workout:
  num_shared = # of categories shared between current workout's categories
    and requested categories (set intersection)
  score_weight = workout's score / num_shared
  if score_weight is the lowest workout score so far, set the workout as the best choice
Return the best choice

Finally, pick the person with the lowest score.
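
Putting the pieces together, here's a minimal Ruby sketch of the greedy strategy (not necessarily the implementation in the linked repository; workouts are assumed to be hashes with :score and :categories keys):

require 'set'

# Greedy weighted set cover for one member: repeatedly pick the workout with
# the lowest score per newly covered category until every requested category
# is covered. Returns nil if the member can't cover them all.
def lowest_score_for(workouts, requested_categories)
  remaining = requested_categories.to_set
  total     = 0

  until remaining.empty?
    best = workouts
      .reject { |w| (w[:categories].to_set & remaining).empty? }
      .min_by { |w| w[:score].to_f / (w[:categories].to_set & remaining).size }

    return nil unless best

    total     += best[:score]
    remaining -= best[:categories].to_set
  end

  total
end

# Pick the member with the lowest covering score.
def least_proficient(workouts_by_member, requested_categories)
  workouts_by_member
    .map    { |name, workouts| [name, lowest_score_for(workouts, requested_categories)] }
    .reject { |_, score| score.nil? }
    .min_by { |_, score| score }
end

With the example data above, least_proficient returns ["Harpo", 17750].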

Drawbacks

Why wouldn’t you want to do this?

One significant reason is that correctness is not guaranteed. Remember how I kept stressing that this implementation uses a greedy algorithm? There are some problems for which a greedy algorithm will not always yield the optimal result. This problem is one of them.

For example, given a user with the following workouts:

Member name | Workout Score | Categories Trained in Workout
Zeppo       | 2100          | memory, problem_solving
Zeppo       | 2000          | memory, flexibility
Zeppo       | 700           | problem_solving
Zeppo       | 500           | flexibility

If you request:

> finder drawbacks.csv memory flexibility problem_solving

The result is:

> Zeppo, 3200

The optimal solution is:

> Zeppo, 2600

So what’s happening? The greedy algorithm’s nature is to choose what’s best at that moment. Let’s walk through:

  1. The first choice is 500 (flexibility): it offers the minimum ratio of score to categories covered.
  2. Of the remaining options, it chooses 700 (problem_solving), which again offers the best ratio.
  3. Finally, it selects 2000 (memory, flexibility), as it covers memory at the lowest ratio.

The problem comes after step one: it should select 2100 (memory, problem_solving), which would cover both remaining categories and yield the optimal total, but the greedy strategy instead picks 700 (problem_solving) because its score-per-category ratio (700) looks better than 2100's (1050) at that moment.

Benefits

If it’s not always right, why would this solution be even remotely reasonable?

It has a reasonable time complexity. If you do it right, the greedy algorithm has a time complexity of O(n log m) where n is the number of requested categories and m is the number of workouts. Other solutions can take a long time. One “succinct” solution involves computing every combination of workout sets, finding ones that have all the requested categories, and then choosing the one with the lowest score. The time complexity of this towers over that of the greedy algorithm.

You might say to yourself, "Well, right is right! Who cares what the running time will be?" But what if a member had thousands (maybe even millions) of game plays, and we had to calculate this every time the home page loaded? Suddenly it's a very big deal. Sometimes imprecision is a reasonable tradeoff in the name of speed.

In Conclusion

Greedy algorithms can be a great solution to your problem, but remember to be careful and thoughtful about their implications. There are a myriad of intelligent solutions to this problem, with varying efficiencies. Have a good one? Share it with us!

Again, a Ruby implementation can be found in the related repository. Feel free to make a pull request with your own solution if you feel so inclined.


Post Mortem for Purchase Page Bug after Rails Upgrade

Posted: February 26th, 2014 | Author: | Filed under: Rails | No Comments »

Lumosity’s purchase pages rely on I18n to translate the various pieces across the numerous currencies and locales. This is ideal because we can keep all of this in centralized YAML files and keep it consistent as we add new languages and currencies.

Our pages are written using HAML and we use Rails partials to display the pricing for each subscription type across our various subscription options. We use ActiveSupport’s #number_to_currency method to help make the translations between currencies for each locale. This has worked well for us for some time, until we updated to Rails 3.2.17, which altered our purchase page to look like this:

Gross, what went wrong? Well, we found out that #escape_unsafe_options is now called from inside #number_to_currency. This change was made to address an XSS security issue, per this email from Aaron Patterson. From the post:

“One of the parameters to the helper (unit) is not escaped correctly. Application which pass user controlled data as the unit parameter are vulnerable to an XSS attack.”

So the fix was to make sure those values were escaped, and this caused the issue seen above on our end: our own internal formatter's HTML was now being escaped and rendered as plain text on the page.

Since we are not passing user data to #number_to_currency, we were comfortable converting the escaped HTML back to unescaped HTML like so:

CGI.unescapeHTML(String.new(number_to_currency(amount, params)))
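
One way to keep that workaround in a single place is a small view helper. A minimal sketch (the helper name is hypothetical, and this is only safe because the amounts we format never come from user input):

# Hypothetical helper wrapping the workaround above.
module CurrencyDisplayHelper
  # Safe only because the values we format never come from user input.
  def unescaped_number_to_currency(amount, options = {})
    CGI.unescapeHTML(number_to_currency(amount, options).to_s)
  end
end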

We don't believe this fix is ideal for a few reasons: working around Rails core is usually hairy, and unescaping escaped HTML seems likely to come back to bite us in the future. We are exploring alternatives for how we generate this HTML going forward.

We are open to better solutions and hope this helps you with debugging.


Announcing Campystrano

Posted: April 25th, 2013 | Author: | Filed under: Rails | No Comments »

Campystrano: Campfire integration for your Capistrano deploy tasks

Communicating with your peers is an important part of engineering. We make a lot of posts to Campfire: questions, jokes, CI build results, git commits, pull request notifications, New Relic performance alerts, robot voice control for company announcements, local milkshake suggestions, and… deploy announcements. To make that last one a bit easier, we built a gem integrating Campfire into our Capistrano deploy tasks. Campystrano adds two deploy hooks. The first hook (triggered before the start of the Capistrano :deploy task) announces the start of the deploy including the deployer’s name, deployment branch, and the Rails environment. It might look something like this:

chris deploying master to MyApp production

The second is a post-deploy hook that announces a successful deploy.

Deploy to MyApp production finished successfully
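
Under the hood, these are just ordinary Capistrano before/after hooks that post to Campfire. A rough sketch of the idea (not the gem's actual source; this assumes the tinder gem for the Campfire API):

# Illustrative sketch only -- not Campystrano's actual source.
require 'tinder'

namespace :campystrano do
  task :announce_start do
    settings = fetch(:campfire_settings)
    room = Tinder::Campfire.new(settings[:subdomain], token: settings[:token]).
             find_room_by_name(settings[:room])
    room.speak("#{ENV['USER']} deploying #{fetch(:branch)} to #{application} #{fetch(:rails_env, 'production')}")
  end

  task :announce_finish do
    settings = fetch(:campfire_settings)
    room = Tinder::Campfire.new(settings[:subdomain], token: settings[:token]).
             find_room_by_name(settings[:room])
    room.speak("Deploy to #{application} #{fetch(:rails_env, 'production')} finished successfully")
  end
end

before 'deploy', 'campystrano:announce_start'
after  'deploy', 'campystrano:announce_finish'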

We’re in the middle of breaking apart our main Rails app into smaller, more manageable apps. While there are a great many benefits to this, it does create some overhead for our deploy pipeline. Instead of deploying to just one app, we’re now deploying to four and counting. Campystrano has helped us quickly add consistent deploy notifications to all of our apps.

Usage

Add the following to your Gemfile:

gem 'campystrano'

In your config/deploy.rb file, add the following:

require 'capistrano/campystrano'
set :campfire_settings do
  {
    subdomain: 'mysubdomain',
    room: 'myroom',
    token: ENV['CAMPFIRE_TOKEN']
  }
end

And deploy away!


Stitching Together Seamless Migrations

Posted: March 28th, 2013 | Author: | Filed under: Rails | No Comments »

Background

In our last post, we talked about how we ensured the integrity of our data as we transitioned to a new version of our payment system. We touched upon the fact that we wanted to make the transition seamless for our end users, but didn’t go into detail about how we did that.

While we ported much of the functionality of our old system over to the new one, there were still many things we wanted to deprecate but couldn't for various reasons (e.g. they were used by a 3rd-party service or API, or we just didn't have time to fully sunset them). We still needed to support these things in the interim, but we really didn't want to litter our shiny new code with legacy behavior that didn't fit the new architecture. Enter shims.

Shim shim-in-ey

A concept we found useful in achieving this was the shim: a small library we could use to bridge the gap between the two systems during the interim period.

For example, one of the models we obviated was the account model. This was a basic association on a user that looked like:

class User < ActiveRecord::Base
  has_one :account
end

And our code was peppered with calls like:

if user.account.active_until_date > Date.today
  # do something
end

In the new system, we might have ported the active_until_date method to another model, deprecated it, or needed to reproduce it with more complex logic. So how could we handle this transparently so that the right methods would be called once a user was transitioned from the old system to the new system? We ended up creating a mixin that looked something like:

module AccountShim
  # Transparently compatible methods

  delegate :active_until_date, :to => :account_owner

  # Deprecated or more complex methods

  def do_backwards_compatible_thing
    if account_migrated?
      # something long and scary
    else
      account_owner.do_backwards_compatible_thing
    end
  end

  def do_old_thing
    account_migrated? ? nil : account_owner.do_old_thing
  end

  private

  def account_owner
    @account_owner ||= account_migrated? ? self.customer : self.account
  end

  def account_migrated?
    !!account_migrated
  end
end

Once mixed in to our user class, it would delegate methods to the appropriate “account owner” — that is, the model that knew how to correctly respond to the method given the user’s current state. In our example, this was previously the Account class and subsequently the Customer class. The switch to the new association only happened once the user had been migrated (this was a flag that would be flipped after a migration had successfully been run on a user).

Our user class ended up looking something like:

class User < ActiveRecord::Base
  include AccountShim
end

And the method calls were all replaced with:

user.active_until_date

Once the migration was 100% complete and the unneeded methods were fully deprecated or reproduced elsewhere, we went ahead and moved the delegations to the user and deleted this file — no external dependencies, no fuss, no muss.
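
That end state looked roughly like this (a sketch, using active_until_date from the example above and assuming it ended up delegated to the new customer association):

class User < ActiveRecord::Base
  # The shim is gone; the surviving methods are delegated directly.
  delegate :active_until_date, :to => :customer
end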

Delegate your troubles away

Note that, as we mentioned, there were a lot of places where we had code written like:

user.account.active?

In these cases, we shouldn't have exposed the fact that the account model was where the user was getting this information. This is an implementation detail that should have been abstracted away — after all, the caller only cares about whether the user is active, not where that information lives.

If we had originally done something like:

class User < ActiveRecord::Base
  delegate :active?, :to => :account
end

Then we would have saved ourselves a lot of time picking through our codebase, finding these references, replacing them with the shim’d methods, and fixing tests. Lesson learned!


Data Sanity? Oh, the Humanity

Posted: February 4th, 2013 | Author: | Filed under: Rails | No Comments »

More models, more problems

As of mid-2012, we had been accepting payments on lumosity.com for almost five years, all of them through a rather creaky, nasty, brittle pile of code that only a few of our engineers were brave enough to touch. We wanted to build a more flexible payment system that would allow us to implement all kinds of functionality we could never have before. In our design and planning meetings, we realized having everything we wanted would require new code, new models, and new schemas to store the underlying data. No problems here — we quickly built a system that could do everything we wanted.

Sounds great, right? Deploy away!

Of course, there was one small roadblock: we have millions of users already on the current system. We needed to seamlessly transition them between the two systems without anyone noticing any change had happened. Total transparency for the end user was paramount. This proved to be a tough problem given the long, complex account histories that many users had.

One of the strategies we ended up relying on to pull this off was running sanity checks on the data. That is, the expectation was that a snapshot of the data before and after migration would produce the same answers to the same questions.

Case study

Pretend you’re creating a schema to store a person’s medical record. There are many ways you can record a patient’s visits to their doctor. You might choose to store each visit as a separate entry in a visits table, with each diagnosis for that visit stored in a visit_diagnoses table. In this case, the visit is your central model.

Or, you may choose to take a longer view of their care and record each treatment given for a specific diagnosis in a diagnosis_treatments table. In this case, the treatments for a single diagnosis are more important than a single visit. No matter which way you choose, both models should be able to give you the same answers to questions like:

  • Was Jane treated by Dr. Simpson on July 11th?
  • Has John ever been diagnosed with measles?
  • How many vaccinations was Stacy given in the past 5 years?

Your model undoubtedly already asks these questions. And if you’re migrating from one to the other, you are probably porting all those “questions” (in the form of methods) to your new system.

This means that by the time you've written your new models and are ready to migrate the data, you already have everything you need to check your migrated data's correctness for free!

What we did

By checking the values of your model before and after the migration, you can have increased confidence in the data that you’ve migrated. We chose to do this using a SanityCheck class, which looks something like this:

class SanityCheck
  attr_reader :diff, :record

  # The list of methods we’re going to compare -- obviously, we can add anything we want here, not just methods that are on the user class.
  Methods = [:was_treated_on?, :was_vaccinated_on?, :has_active_treatment?, :current_prescriptions] # etc.

  def initialize(record)
    @record = record
    @before_values = SanityCheck.values(record)
  end

  def self.values(record)
    Methods.map { |m| [m, record.send(m)] }.to_h
  end

  def check
    @after_values = SanityCheck.values(record)
    @diff = diff(@before_values, @after_values)
  end

  def diff(a, b)
    a.dup
      .delete_if { |k, v| b[k] == v }
      .merge!(b.dup.delete_if { |k, v| a.has_key?(k) })
  end
end

To make use of this class, we instantiated it before we migrated the data and populated the before values. Then we migrated the data and checked the after values. If there were any differences in the two, we logged an error and rolled back the transaction pending further review:

User.find_each do |user|
  ActiveRecord::Base.transaction do
    sanity_check = SanityCheck.new(user)
    user.migrate!
    sanity_check.check

    if sanity_check.diff.any?
      # sanity check failed -- log an error, rollback the transaction, etc.
      log "Oh no! Something went wrong: #{sanity_check.diff}"
      raise ActiveRecord::Rollback
    else
      # woo hoo, success! Let’s indicate that this user was migrated
      user.update_attributes(:was_migrated => true)
    end
  end
end

What about unit tests?

The important thing to note is that what we were doing with this pattern wasn’t testing the code (we already did that) — it was testing the data. And since we were doing it for every record in the system, using live, production values, it was the best source of data available.

It also helped us uncover gaps in our understanding of the model we were trying to migrate. In any long-running system, there's bound to be an abundance of accumulated knowledge living inside your codebase and nowhere else (though on the flip side, there's also a lot of obsolete functionality that you'll never miss). Running these sanity checks helped us uncover those assumptions long before the inevitable "Hey, I remember in the old system we used to be able to…" emails started coming in.

Rewrites aren't easy, especially when you're trying to migrate an accumulated history and faithfully capture the state at each point in time — all while handling every edge case, bug, hack, and workaround that seemed like a great idea years ago. Sanity checking proved to be a great tool in helping us run our migrations without a hitch.


Sinatra + ActiveRecord

Posted: March 29th, 2012 | Author: | Filed under: Rails | No Comments »

For our first hackathon, a few of us decided to build a "crushing it" meter – basically a simple web service that could grab business data, apply some marketing knowledge, and output a single metric to a physical meter (powered by an Arduino).

Sinatra seemed like an obvious choice for the simple web app. It turned out that getting Sinatra to play nicely with ActiveRecord was a little trickier than originally thought and required some digging around. I've now put up an example app to hopefully help others out.
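
For a taste of the wiring involved, here's a minimal sketch (illustrative only, not necessarily what's in the example app; the model and database names are made up):

require 'sinatra'
require 'active_record'

# Hypothetical setup: a local SQLite database holding the business metrics.
ActiveRecord::Base.establish_connection(
  :adapter  => 'sqlite3',
  :database => 'crushing_it.sqlite3'
)

class Metric < ActiveRecord::Base; end

get '/crushing_it' do
  # Serve the latest metric value for the physical meter to display.
  Metric.order(:created_at).last.value.to_s
end

# Return connections to the pool after each request so they aren't leaked.
after do
  ActiveRecord::Base.clear_active_connections!
end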


On Delivery

Posted: February 4th, 2012 | Author: | Filed under: Operations | 3 Comments »

Last fall, we took a look at our deploy process and how other companies were approaching the same problem.

The goals of a successful deploy process (graciously borrowed from the Continuous Delivery and web operations books) include:

  • reducing cycle time (time to make a change on production)
  • fewer bugs
  • fewer complex bugs
  • lessening MTTD (mean time to detection)
  • optimizing human resources (more automation)
  • deploys should be boring, not an exciting event (from the engineering stand point)
  • make them repeatable, reliable, predictable

Here is my breakdown of the different types of deploy strategies.

1. The Monolith

Quarterly/Yearly+ updates. This is how updates to operating systems work. Consumers don’t want lots of regular updates. There is a giant test matrix and a long test cycle. Finding a regression is PAINFUL. I’ve been there. It sucks.

Application type: OS, “enterprise”

Companies: Apple, Sun(Solaris)/Oracle, IBM

Worst case scenario: we never update the site again before lumos goes under

Best case scenario: who cares, we’re not doing this


2. Jesus On The Dashboard

Push to production "whenever it feels right". Don't QA beforehand. Don't dark deploy it. Don't use feature flags. Users see changes as soon as they are deployed. Cross your fingers and hope it works. If it doesn't, go into panic fix mode.

Application type: cheap, throw away consumer app

Companies: some facebook apps

Worst case scenario: you break the site regularly, have to hack your sh*t to fix fire alarms, your hair turns grey a lot earlier than it should

Best case scenario: effectively zero cycle time


3. Old Married Couple Controlled Chaos

Check-in and deploy in the same day from trunk.

Iterations are one week and start on, say, Tuesday. Check-in to trunk whenever you want. On day of deploy (Monday), do last minute manual QA and last minute bug fixes/feature creeps. No official process on when you should stop checking-in to trunk on deploy day, but use sensible judgement. Too many last minute bugs/feature creeps can push deploy out another day; however, you’re so used to working with each other that you can usually manage to get the deploy out.

Application type: consumer website with few developers / one feature set per iteration

Companies: early stage start ups

Worst case scenario: one week cycle time, production deploys get delayed

Best case scenario: ~1hr cycle time


4. Grey Beard Continuous Delivery

Code sits in trunk for up to one week, then staging for N days, then into production. We use N = 3.

Iterations are one week and start on Monday. Check-in to trunk if dev CI is green. Staging branch is cut from trunk once a week, every Monday. Only cut staging from trunk if dev CI is green. QA team works on staging branch for up to N days. Only bug fixes can be checked-in to staging branch after cut. Staging has its own CI. Deploys to production from staging branch go out every Wednesday. Production deploy doesn’t happen until QA team says a-ok and staging CI is green. Ensures the code about to go to production is sufficiently QA’d. Limits possibility of feature creep / fresh checkins delaying push to production.

Application type: consumer website with medium to large number of developers / multiple feature sets per iteration

Companies: medium size start ups

Worst case scenario: one week + N days cycle time

Best case scenario: N days cycle time


5. Mountain Dew Extreme Continuous Deployment

Anyone can deploy at any time. Need code review before checking in. Automated test suite prevents bad check-ins. New feature sets are dark deployed. Manual QA is done on production. Metric monitoring alarms when something goes wrong (we currently have something like this with Splunk).

Application type: consumer website with medium to large number of developers / multiple feature sets going on at same time

Companies: flickr, etsy

Worst case scenario: you break the site but have enough prevention that it shouldn’t matter (ie: rolled back or not immediately user facing)

Best case scenario: however long it takes to get a code review + length of automated test suite (on the order of a few minutes/hours)


I should note that all of these approaches have been used by *successful* businesses. However, depending on the business/application, some of these will cause the engineers to go crazy, hate their jobs, and start producing lower quality work. Then beat their pets at home (or bikes if they do not have a pet).

Last fall, we moved away from "Old Married Couple Controlled Chaos" and started using the "Grey Beard Continuous Delivery" deploy process. It was simple, straightforward, and something we could move to with little extra work. It's working well for us and has set a nice cadence for the rest of the company. We are looking into building the tools and experience necessary to get us closer to "Mountain Dew Extreme Continuous Deployment".

One key tidbit I got from my old ski-lease buddy Paul of flickr fame was that engineers have to have the proper mentality in order to get to true continuous deployment. If you don't have the mindset that every check-in matters and that breaking things is not ok, then process can't save you. This means you have to be smart about who you hire and about bringing new hires up to speed.


Take it easy, trashman! How switching to REE 1.8.7 made lumosity.com way faster

Posted: October 12th, 2011 | Author: | Filed under: Operations | 2 Comments »

If you’re running a Rails application in production and you’re still running plain old vanilla ruby 1.8.7 (the MRI or “Matz” version), our recent experience shows that you might have a lot to gain for relatively little effort by making the switch to Ruby Enterprise Edition 1.8.7. After rolling out REE and tuning its garbage collection parameters, we saw a 35% improvement in performance — average response time dropped from 200ms to around 125ms.

Since we’re using the amazing RVM in production, switching ruby versions was relatively easy. I wrote a simple Chef cookbook to have RVM install a desired ruby version on an app server, set it as the default, and reinstall essential gems like chef and bundler. The upgrade process then consisted of running chef-client on each server, re-bundling our app’s gems with bundle install, and bouncing the mongrels. (Yes, we’re still using mongrel… for now :)
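
For the curious, the cookbook boiled down to something like the sketch below (illustrative only, not our actual recipe; the node attribute name is made up):

# Hypothetical Chef recipe: install a ruby via RVM, make it the default,
# and reinstall the gems our tooling needs under it.
ruby_version = node['ruby']['version'] # e.g. "ree-1.8.7-2011.03"

execute "rvm install #{ruby_version}" do
  command "rvm install #{ruby_version}"
  not_if  "rvm list strings | grep -q #{ruby_version}"
end

execute "rvm default #{ruby_version}" do
  command "rvm use #{ruby_version} --default"
end

%w[chef bundler].each do |gem_name|
  execute "gem install #{gem_name} under #{ruby_version}" do
    command "rvm #{ruby_version} do gem install #{gem_name}"
  end
end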

Right out of the box, REE provided a noticeable improvement in performance, especially in memcache call time:

Note the gradual increase in the brown “GC Execution” layer of the New Relic RPM graph as each app server comes online with REE 1.8.7, which (unlike MRI ruby) emits the stats that New Relic uses to track GC performance. I was somewhat shocked to see that we were spending approximately half of all our Ruby time doing garbage collection, though in retrospect maybe I shouldn’t have been — GC performance tuning is one of the reasons that REE was developed in the first place.

So far so good — we’ve shaved off about 20-30ms on average and we have much better visibility into what Ruby is doing with its time. The real win at this point would seem to be reducing time spent in garbage collection. Time to take Twitter’s advice about tuning GC performance!

This was also a simple change, thanks to monit and Chef cookbooks. I added an attribute to our mongrel_rails cookbook to specify the GC parameters to be passed into the environment of each mongrel when it's started by monit — something like this:

check process mongrel_5000
  with pidfile /var/run/mongrel/mongrel.5000.pid every 2 cycles
  start program = "/bin/su - lumoslabs -c 'RUBY_HEAP_MIN_SLOTS=500000 RUBY_HEAP_SLOTS_INCREMENT=250000 RUBY_HEAP_SLOTS_GROWTH_FACTOR=1 RUBY_GC_MALLOC_LIMIT=50000000 /data/lumoslabs/current/bin/rackup -s mongrel -o xx.yy.zz.aa -p 5000 -E production -D -P /var/run/mongrel/mongrel.5000.pid /data/lumoslabs/current/config.ru'"
  stop program = "/bin/su - lumoslabs -c '/data/lumoslabs/current/script/stop_racked_mongrel.sh /var/run/mongrel/mongrel.5000.pid'"
  if totalmem is greater than 600 MB for 2 cycles then restart      # eating up memory?

The actual values for these parameters were taken directly from Twitter’s suggestions. Once I rolled out the monit changes via chef and bounced all the mongrels, I saw good things happen:

This is exactly what an ops engineer wants to see! A nearly immediate and big improvement in performance, and the only code changes required were on the infrastructure side of things. Though I haven’t tested out all of these parameters in isolation (who has the time??!) their aggregate effect is clear. We allocate more memory initially (RUBY_HEAP_MIN_SLOTS increases to 500k from 1k), grab more when we need to add it (RUBY_HEAP_SLOTS_INCREMENT increases to 250k from 10k), and make sure we don’t trigger GC before we actually need it (RUBY_GC_MALLOC_LIMIT increases to 50M from 1M.)

One word of caution – these settings did have the effect of increasing the memory footprint of each mongrel from about 480MB to about 600MB. We subsequently needed to reduce the number of workers on each server from 8 to 6, because the resulting swapping had, after a few days, undone all of the performance improvements made by switching to REE. Having fewer workers is neatly balanced out by the improvement in response time, though, and ultimately we're seeing lower CPU usage and higher throughput on each app server.

And the magic just doesn’t end. Based on some very solid advice at smartic.us, I added some GC tuning parameters to the environment on our CI box. I was really happy to see that it cut our build time (which had bloated up to become painfully long) approximately in half. Definitely try this out if your build is taking longer than you’d like!

Up next — taking the plunge and upgrading to Ruby 1.9.2. More on that in a month or two!


Sleep easy when switching MySQL servers: benchmarking with sysbench and mk-query-digest

Posted: September 13th, 2011 | Author: | Filed under: Operations | No Comments »

Our MySQL database servers are the heart and soul of lumosity.com‘s Rails stack. Memcached helps a ton with serving up data that’s accessed very frequently, but in the end the database is the single most critical (and most complex) piece of our infrastructure.

This summer we decided that it was time to go through the semi-painful process of upgrading our trusty primary database server — which has been serving a thousand queries per second for the last 18 months without batting an eyelash — to a bigger, badder, more modern machine with larger and faster disks, more RAM, and faster processors. Like most engineering teams, we strongly prefer to add capacity before we encounter performance issues, so it was the right time to switch. We needed to prepare the site for our next 6+ months of growth.

Once the new machine was prepped and ready to go, we had an important question to answer:

“This machine certainly seems fast, but can we be sure that it can handle our production load? What if there’s a subtle issue with the RAID controller, or maybe an unexpected I/O issue with our MySQL configuration?  What if we blow up the site???!!!”

We weren’t comfortable simply crossing our fingers and making the switch.  With money and subscriber satisfaction on the line, we needed to be sure that it would be a smooth transition.

Sysbench – OLTP Workload

Baron Schwartz et al.'s indispensable High Performance MySQL, 2nd Edition (hereafter referred to as HPM) has a fairly basic but good section on benchmarking MySQL. sysbench is one of the standard Linux benchmarking tools covered in that survey. I wanted a simple tool that could (1) find the upper bounds of I/O and transaction processing performance on the new database server and (2) allow us to compare those boundaries against the performance of our current database servers.

After installing sysbench using yum, I followed Baron’s lead in HPM and fired off a fairly sizable OLTP workload on each of the machines. This would provide a reasonable approximation of the I/O generated by a database server handling many concurrent requests in a Rails stack.  The goal of this test was to determine the expected throughput (transactions per second) and per-request processing time that both generations of server could handle at their peak.  My hypothesis was that db-new, all souped up with 2011 hardware, would smoke db-old and its 2009 hardware.

I executed the following commands (straight out of HPM) on db-new and db-old to push 60 seconds worth of requests through a 1M row table in 8 concurrent threads:

# sysbench --test=oltp --oltp-table-size=1000000 \
    --mysql-db=test --mysql-user=root prepare
# sysbench --test=oltp --db-driver=mysql --oltp-table-size=1000000 \
    --mysql-socket=/tmp/mysql.sock --mysql-db=test --mysql-user=root \
    --mysql-password=xxxxxxx --max-time=60 --oltp-read-only=on \
    --max-requests=0 --num-threads=8 run

The results from db-new:

OLTP test statistics:
queries performed:
read: 3925586
write: 0
other: 560798
total: 4486384
transactions: 280399 (4673.18 per sec.)
deadlocks: 0 (0.00 per sec.)
read/write requests: 3925586 (65424.48 per sec.)
other operations: 560798 (9346.35 per sec.)

Test execution summary:
total time: 60.0018s
total number of events: 280399
total time taken by event execution: 478.7724
per-request statistics:
min: 1.45ms
avg: 1.71ms
max: 6.62ms
approx. 95 percentile: 1.84ms

Threads fairness:
events (avg/stddev): 35049.8750/715.27
execution time (avg/stddev): 59.8465/0.01

And from db-old:

OLTP test statistics:
queries performed:
read: 2671536
write: 0
other: 381648
total: 3053184
transactions: 190824 (3180.28 per sec.)
deadlocks: 0 (0.00 per sec.)
read/write requests: 2671536 (44523.87 per sec.)
other operations: 381648 (6360.55 per sec.)

Test execution summary:
total time: 60.0023s
total number of events: 190824
total time taken by event execution: 478.7136
per-request statistics:
min: 2.14ms
avg: 2.51ms
max: 81.88ms
approx. 95 percentile: 2.71ms

Threads fairness:
events (avg/stddev): 23853.0000/177.37
execution time (avg/stddev): 59.8392/0.00

This was good news, but not unexpected — the 95th percentile request time is roughly a third lower on db-new (1.84ms vs 2.71ms), and the throughput is about 47% higher (4673 transactions/sec vs 3180 transactions/sec). Given the increase in CPU and I/O horsepower on db-new, I would have been disappointed with anything else!

Before I could jump headfirst into the void of db-new, I had to see it run our real production workload. I had to make sure that it didn't blow up because of some subtle change to the MySQL configuration, or the new RAID controller, or the RAM, or one of an infinite number of other things. This was a head-scratcher at first; I thought maybe we could sniff the mysql traffic from db-old with tcpdump and somehow "replay" it on db-new, but I hadn't a clue how to decode the mysql protocol. It sounded like a lot of work, and we were working against the clock, with the disk on db-old filling up little by little every day. But soon enough, Google led me to…

mk-query-digest (the greatest MySQL tool in the history of the universe)

I had already discovered maatkit via the mentions of mk-table-sync in HPM, but had never bumped into the innocuously named mk-query-digest until I found this 37signals post on warming the passive failover in a master-master replication pair. mk-query-digest should be in every MySQL admin's toolkit: it does pretty much everything, including stuff you had no idea you needed to do. Let it teach you.

I already had db-new set up as a replication slave of db-old, so I knew their data was in sync at any given time (give or take a few seconds.) Moreover, I knew that UPDATEs and INSERTs were working as expected on db-new, since db-new was constantly replaying those queries by reading db-old’s binary logs via replication. But I had no proof that db-new would be able to keep up with the sizable throughput of SELECTs on db-old, about 1000-1200 per second these days. As outlined in the 37signals post I linked above, the quick-and-dirty way to do this was to run tcpdump on db-old to capture mysql traffic for some period of time, say 5 minutes, and then to use mk-query-digest to replay the SELECT queries on db-new.

First, I ran tcpdump on db-old:

[db-old] # time tcpdump -s 65535 -x -nn -q -tttt -i any port 3306 > db-old.tcp.txt
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
1050296 packets captured
1063883 packets received by filter
13498 packets dropped by kernel

real 4m34.464s
user 1m40.576s
sys 0m16.530s

Then, I carefully shipped the resulting 1.8GB file over to a third server, call it db-other, which would be used to execute the queries on db-new. Why a third server? I didn't want the overhead of the mk-query-digest perl script itself polluting the results on db-new. Also, note the use of the '-l 500' argument to scp below, which rate-limits the file copy to 500kB/sec. Since db-old was a live production database server, I had to take care not to hog all its outbound bandwidth with the file copy, which would starve our Rails app servers of data and crash the site!

[db-old] # scp -l 500 db-old.tcp.txt lumoslabs@db-other:~

Now, on db-other, I ran mk-query-digest to run the 5 minutes worth of queries on db-new. I’m using the --filter argument to pass in a perl expression that will only execute queries that start with the string ‘SELECT’ (case insensitive.)

[db-other] # mk-query-digest --type tcpdump \
    --filter '($event->{arg} =~ m/^SELECT/i)' \
    --execute h=db-new,u=user,p=xxxxxx db-old.tcp.txt

On db-new, I fired up mk-query-digest to watch the SELECTs roll in on the mysql interface:

[db-new] tcpdump -s 65535 -x -nn -q -tttt \
    -i any port 3306 | mk-query-digest --print --type tcpdump

# Time: 110913 15:48:15.650359
# Client: 10.32.95.138:48155
# Thread_id: 4294967311
# Query_time: 0.000059 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 0
SELECT `asset_versions`.* FROM `asset_versions` WHERE (`asset_versions`.asset_id = 27668);
# Time: 110913 15:48:15.652158
# Client: 10.26.2.134:53512
# Thread_id: 4294967346
# Query_time: 0.000056 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 0
SELECT COUNT(*) FROM `roles` INNER JOIN `roles_users` ON `roles`.id = `roles_users`.role_id 
  WHERE `roles`.`name` = 'admin' AND ((`roles_users`.user_id = NULL));

... etc ...

I watched top, iostat, and mytop to make sure nothing was blowing up. The server load stayed nice and moderate, peaking at 0.66 as the iowait percentage spiked during the initial paging into the InnoDB buffer. It eventually settled into a comfortable 0.33, with iostat showing only 0.1% iowait time. Basically, things looked great from the perspective of system metrics. I relaxed even more!

The output of mk-query-digest gave even more reason to be hopeful that db-new was ship-shape:

# 339.9s user time, 7.1s system time, 112.13M rss, 233.60M vsz
# Current date: Mon Sep 5 14:57:11 2011
# Hostname: db-other.sl.lumoslabs.com
# Files: db-old.tcp.txt
# Overall: 365.64k total, 1.89k unique, 1.33k QPS, 18.37x concurrency ____
# Time range: 2011-09-05 13:18:10.837646 to 13:22:45.293365
# Attribute total min max avg 95% stddev median
# ============ ======= ======= ======= ======= ======= ======= =======
# Exec time 5042s 0 38s 14ms 839us 347ms 194us
# Exec orig ti 204s 0 2s 557us 596us 11ms 131us
# Rows affecte 21.92k 0 19 0.06 0.99 0.24 0
# Query size 42.85M 5 3.61k 122.87 246.02 148.13 102.22
# Exec diff ti 273s 0 38s 2ms 626us 118ms 108us
# Warning coun 11.81k 0 11.30k 0.03 0 18.31 0
# Boolean:
# No index use 7% yes, 92% no

We can see that we reach 1.33k queries per second, which is in line with the expected load of our production traffic. We can also compare the 95th percentile execution time of the queries on the new server — “Exec time” of 839 microseconds — with that on the old server — “Exec orig time” of 596 microseconds. Given the fact that we were running these queries against an entirely “cold” server, i.e. with absolutely no data in the InnoDB buffer, we would expect this performance hit. Nearly every piece of data requested by the SELECTs during this 5 minutes had to be pulled from disk, whereas db-old had the great advantage of having many GB of RAM all warmed up with the most-frequently requested items.

So, this particular test shows us that the database server wasn't crushed by the load, but not that its performance was comparable to the original machine's. To test that, we'd need to mirror our production workload for a considerable period of time — an exercise that is beyond the scope of this post. (But we did it, of course!)

Conclusion

This database switchover went smooth, as smooth as Sade. We’re running against db-new now and have started seeing the expected performance boost as its InnoDB buffer pages in the optimal working set for Lumosity’s data. It’ll be very useful to have these tools in our box in the near future when we tackle our next database project: upgrading to MySQL 5.5. Without benchmarking, we’d just be switching and praying!