We at Lumosity love to improve. Personal and professional development are as important as the product we build each and every day. With that expectation for improvement comes the occasional major overhaul of a core system. This time it was our Brain Performance Index (BPI) system. BPI is a scale that allows users to see how well they are doing in the five core cognitive areas Lumosity is designed to help train. Every game on Lumosity falls into one of those cognitive areas and each game ends with a score. That score is then used to calculate a BPI for that game, which is then fed into that game’s area BPI, which then feeds into the user’s overall BPI.

Late last year our science team began devising a new system for calculating BPI that was more responsive to each game play a user completed and could scale out to more games across our multiple platforms (web, iOS and soon Android). This new system was named the “Lumosity Performance Index” (LPI) and with it would come a new set of calculators that could transform a game play’s score to an LPI and also update a variety of other stats for that user, including the game’s area LPI and the user’s overall LPI.

Once the new system and calculators were built, we needed to build a way to migrate or backpopulate existing game plays’ scores to LPI. At the time of this writing, we have over 60 million registered users who have played more than 1.6 billion games, and that number grows quickly every day.

The migrator script, version 1

Because LPI at any given moment in time is calculated as the result of all previous game plays up to that moment, migrating users entailed replaying the game play history of every registered user in order.

We used our Scripterator gem for this task, and came up with something like this:

require 'timecop'

Scripterator.run "Backpopulate LPI data for users" do
  for_each_user do |user|
    game_plays = GamePlay.for_user(user).order('created_at ASC')

    game_plays.each do |game_play|
      # Replay each play at its original timestamp so time-based calculations
      # see the historical date
      Timecop.travel(game_play.created_at) do
        lpi_game_play = calculate_lpi(game_play)
        lpi_game_play.save!
        update_area_lpi_for_game_play(game_play)
        update_overall_lpi_for_game_play(game_play)
        update_game_play_stats(game_play)
      end
    end
    user.grant_role(:lpi)
    true
  end
end

This was a pretty simple script that did all that we needed it to. So we began to run it on users with varying numbers of game plays and ended up with an average processing time of 0.2 seconds per game play. It didn’t seem so bad until we realized that would mean that, unparallelized, this script would take 9.3 years to complete! And with the incredible number of new game plays we get each day, we’d never catch up. So we thought, “Hey, let’s parallelize it across multiple workers!” Even then, across 100 workers, it would take over a month to complete – far too slow.
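
As a sanity check on those figures, the back-of-the-envelope math looks something like the sketch below (it uses the round 1.6 billion game play count, so the output lands a bit above the 9.3 years we estimated against the actual count):

SECONDS_PER_PLAY = 0.2
TOTAL_GAME_PLAYS = 1_600_000_000 # approximate, and growing every day

total_seconds = TOTAL_GAME_PLAYS * SECONDS_PER_PLAY
puts "Unparallelized:     ~#{(total_seconds / (86_400 * 365.0)).round(1)} years"
puts "Across 100 workers: ~#{(total_seconds / 100 / 86_400.0).round} days"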

We took a look at the logs output from running this migrator script on a single user and saw that, for about 400 game plays, we were making over 50,000 network calls (MySQL, Memcache, Redis)! That was unsustainable, and probably a big part of where our slowdown was coming from.
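
One rough way to measure the SQL portion of that is to subscribe to ActiveRecord’s query notifications around a single user’s run. In the sketch below, migrate_user is a stand-in for the per-user body of the script above, and Memcache and Redis calls are not counted:

sql_count = 0
subscriber = ActiveSupport::Notifications.subscribe('sql.active_record') do |*_args|
  sql_count += 1
end

migrate_user(user) # stand-in for the per-user block of the migrator script

ActiveSupport::Notifications.unsubscribe(subscriber)
puts "#{sql_count} SQL queries for #{GamePlay.where(user_id: user.id).count} game plays"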

The migrator script, version 2

The first thing we needed to eliminate was all those network calls, and that meant putting more shared data into RAM. What we came up with were multiple ‘RAM stores’ that would replace ActiveRecord calls while processing each of a user’s game plays. The goal was to reduce the network queries per game play to zero, and to do the saving/updating only after we were done with all of a user’s data and were ready to move on to the next user to be migrated.

An example RAM store for our games table and one to store each new LPI game play for a user:

# ram_stores.rb

class GameStore
  def self.seed
    @games ||= Game.all.to_a
    @games_hash = {}
  end

  def self.games
    @games
  end

  def self.find(id)
    @games_hash[id] ||= games.find { |g| g.id == id }
  end

  def self.find_by_bac_ids(area_ids)
    @games.select { |g| area_ids.include?(g.brain_area_id) }
  end
end

class LpiGamePlayStore
  def self.reset
    @lpi_game_plays = []
  end

  def self.lpi_game_plays
    @lpi_game_plays ||= []
  end

  def self.add(game_play)
    # Time.now here is the Timecop-frozen time, i.e. the original game play's created_at
    game_play.created_at = Time.now
    lpi_game_plays << game_play
  end
end
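
In use, a store like GameStore is seeded once per process, while the per-user stores are reset between users so nothing leaks from one user’s migration into the next. A minimal sketch of the intended usage (the IDs and attributes here are made up for illustration):

GameStore.seed                    # one Game.all query; every lookup after this stays in memory

game = GameStore.find(42)         # served from the memoized hash, no SQL
area_games = GameStore.find_by_bac_ids([1, 2])

LpiGamePlayStore.reset                                   # clean slate for the next user
LpiGamePlayStore.add(LpiGamePlay.new(game_id: game.id))  # accumulate in RAM instead of saving
LpiGamePlayStore.lpi_game_plays                          # everything queued up for a later bulk insert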

We had to build eight stores in all to cover all the models that used to call out to ActiveRecord to get or store data. But the stores by themselves were not enough: we needed to use them. Instead of building new models and calculators, we just reopened our existing models and redefined a few methods here and there.

# lpi_overrides.rb
class LpiGamePlay < ActiveRecord::Base
  def update_play_count
    count          = GameStatStore.find(game_id).try(:play_count) || 0
    count          += 1
    GameStatStore.set_for_game(game_id, count)
  end

  def recalculate_game_lpi(user)
    calc = LpiForGameCalculator.new(user, lpi_game_play: self, score: score)
    self.lpi = set_lpi_nil ? nil : calc.calculate
  end
end

class LpiForGameCalculator < GameCalculatorBase
  def initialize(user, attrs)
    super(user)
    @lpi_game_play   = attrs[:lpi_game_play]
    @game_id         = @lpi_game_play.try(:game_id) || attrs[:game_id]
    @game            = GameStore.find(@game_id)
    @score           = attrs[:score]
  end

  def calculate
    return nil unless lpi_game_play.present? && game_has_percentiles?
    return lpi_game_play.lpi if lpi_game_play.lpi.present?

    past_lpi_data  = past_lpi_lookup
    last_lpi       = past_lpi_data[:last]
    new_result_lpi = lpi_for_game_score

    new_lpi = if past_lpi_data[:count] < 3 || last_lpi == 0
      [last_lpi, new_result_lpi].max
    else
      new_lpi_for(last_lpi, new_result_lpi)
    end.to_i

    store_game_lpi(new_lpi)
    new_lpi
  end

  protected

  def past_lpi_lookup
    last_lpi = GameLpiStore.for_game(game.id).try(:lpi) || 0
    count    = GameStatStore.for_game_id(game.id).try(:play_count) || 0
    { count: count, last: last_lpi }
  end

  def fetch_percentiles_table
    GameLpiPercentile.get_table_for(game_id)
  end

  def store_game_lpi(new_lpi)
    GameLpiStore.set_for_game(game.id, new_lpi)
  end
end

We ended up opening up 11 of our classes to redefine about 20 methods so that they used the RAM stores instead of ActiveRecord. Our migrator script was responsible for requiring both ram_stores.rb and lpi_overrides.rb.

The updated migration script looked a bit like this:

require 'timecop'
require 'ram_stores'
require 'lpi_overrides'

PercentileStore.seed
GameStore.seed
BrainAttributeCategoryStore.seed

Scripterator.run "Backpopulate LPI data for users" do
  for_each_user do |user|
    GameStatStore.reset
    LpiGamePlayStore.reset
    GameLpiStore.reset(user)
    DailyLpiStore.reset(user)
    UserStore.set_user(user)

    game_plays = GamePlay.where(user_id: user.id).order('created_at ASC')

    game_plays.each do |game_play|
      Timecop.travel(game_play.created_at) do
        lpi_game_play = calculate_lpi(game_play)
        LpiGamePlayStore.add(lpi_game_play)

        # All updated to store to a RAM store
        update_area_lpi_for_game_play(game_play)
        update_overall_lpi_for_game_play(game_play)
        update_game_play_stats(game_play)
      end
    end

    # Store to DB with bulk-inserts
    LpiGamePlay.import!(LpiGamePlayStore.lpi_game_plays)
    GameStat.import!(GameStatStore.stats)
    GameLpi.import!(GameLpiStore.lpis)
    OverallLpi.import!(OverallLpiStore.lpis)
    AreaLpi.import!(AreaLpiStore.lpis)

    true
  end
end

Results

By replacing the ActiveRecord, Memcache, and Redis calls with these RAM stores, our per-game-play processing time went from 0.2s down to as low as 0.007s! That took the total time from 9.3 years (unparallelized) to about 4 months (~128 days, unparallelized), or about 2 days parallelized across 100 workers. Success!
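
Plugging the new per-game-play time into the same back-of-the-envelope sketch shows where those figures come from (round numbers again, so it lands near rather than exactly on ~128 days):

SECONDS_PER_PLAY = 0.007
TOTAL_GAME_PLAYS = 1_600_000_000 # approximate

total_seconds = TOTAL_GAME_PLAYS * SECONDS_PER_PLAY
puts "Unparallelized:     ~#{(total_seconds / 86_400.0).round} days"
puts "Across 100 workers: ~#{(total_seconds / 100 / 86_400.0).round(1)} days"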