Announcing Campystrano

Posted: April 25th, 2013 | Author: | Filed under: Rails | No Comments »

Campystrano: Campfire integration for your Capistrano deploy tasks

Communicating with your peers is an important part of engineering. We make a lot of posts to Campfire: questions, jokes, CI build results, git commits, pull request notifications, New Relic performance alerts, robot voice control for company announcements, local milkshake suggestions, and… deploy announcements. To make that last one a bit easier, we built a gem integrating Campfire into our Capistrano deploy tasks. Campystrano adds two deploy hooks. The first hook (triggered before the start of the Capistrano :deploy task) announces the start of the deploy including the deployer’s name, deployment branch, and the Rails environment. It might look something like this:

chris deploying master to MyApp production

The second is a post-deploy hook that announces a successful deploy.

Deploy to MyApp production finished successfully
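As a rough illustration of what the two hooks announce, here is a minimal sketch of the message formatting in plain Ruby. The method names and keyword arguments are hypothetical and are not taken from the gem's source; only the message wording follows the examples above.

```ruby
# Hypothetical helpers (not Campystrano's actual API) that build the two
# announcements described above.
def deploy_start_message(deployer:, branch:, app:, env:)
  "#{deployer} deploying #{branch} to #{app} #{env}"
end

def deploy_finish_message(app:, env:)
  "Deploy to #{app} #{env} finished successfully"
end

puts deploy_start_message(deployer: 'chris', branch: 'master',
                          app: 'MyApp', env: 'production')
puts deploy_finish_message(app: 'MyApp', env: 'production')
```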

We’re in the middle of breaking apart our main Rails app into smaller, more manageable apps. While there are a great many benefits to this, it does create some overhead for our deploy pipeline. Instead of deploying to just one app, we’re now deploying to four and counting. Campystrano has helped us quickly add consistent deploy notifications to all of our apps.

Usage

Add the following to your Gemfile:

gem 'campystrano'

In your config/deploy.rb file, add the following:

require 'capistrano/campystrano'
set :campfire_settings do
  {
    subdomain: 'mysubdomain',
    room: 'myroom',
    token: ENV['CAMPFIRE_TOKEN']
  }
end

And deploy away!


Stitching Together Seamless Migrations

Posted: March 28th, 2013 | Author: | Filed under: Rails | No Comments »

Background

In our last post, we talked about how we ensured the integrity of our data as we transitioned to a new version of our payment system. We touched upon the fact that we wanted to make the transition seamless for our end users, but didn’t go into detail about how we did that.

While we ported over much of the functionality of our old system to the new one, there were still many things we wanted to deprecate but couldn’t for various reasons (e.g. a feature was used by a third-party service or API, or we just didn’t have time to fully sunset it). We still needed to support these things in the interim, but really didn’t want to litter our shiny new code with legacy behavior that didn’t fit the new architecture. Enter shims.

Shim shim-in-ey

One concept that helped us achieve this was the shim: a small library we could use to bridge the gap between the two systems during the interim period.

For example, one of the models we obviated was the account model. This was a basic association on a user that looked like:

class User < ActiveRecord::Base
  has_one :account
end

And our code was peppered with calls like:

if user.account.active_until_date > Date.today
  # do something
end

In the new system, we might have ported the active_until_date method to another model, deprecated it, or needed to reproduce it with more complex logic. So how could we handle this transparently so that the right methods would be called once a user was transitioned from the old system to the new system? We ended up creating a mixin that looked something like:

module AccountShim
  # Transparently compatible methods

  delegate :active_until_date, :to => :account_owner

  # Deprecated or more complex methods

  def do_backwards_compatible_thing
    if account_migrated?
      # something long and scary
    else
      account_owner.do_backwards_compatible_thing
    end
  end

  def do_old_thing
    account_migrated? ? nil : account_owner.do_old_thing
  end

  private

  def account_owner
    @account_owner ||= account_migrated? ? self.customer : self.account
  end

  def account_migrated?
    !!account_migrated
  end
end

Once mixed in to our user class, it would delegate methods to the appropriate “account owner” — that is, the model that knew how to correctly respond to the method given the user’s current state. In our example, this was previously the Account class and subsequently the Customer class. The switch to the new association only happened once the user had been migrated (this was a flag that would be flipped after a migration had successfully been run on a user).
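To see the switch in action without Rails, here is a minimal plain-Ruby sketch. The Struct stand-ins, the constructor, and the dates are assumptions made for illustration; unlike the mixin above, this version does not memoize the owner, so flipping the flag takes effect on the very next call.

```ruby
require 'date'

# Plain-Ruby stand-ins for the ActiveRecord models (hypothetical fields).
Account  = Struct.new(:active_until_date)
Customer = Struct.new(:active_until_date)

module AccountShim
  # Forward to whichever object currently owns the data.
  def active_until_date
    account_owner.active_until_date
  end

  private

  # Not memoized, so a flag flip is picked up immediately.
  def account_owner
    account_migrated ? customer : account
  end
end

class User
  include AccountShim
  attr_accessor :account, :customer, :account_migrated

  def initialize(account:, customer:, account_migrated: false)
    @account = account
    @customer = customer
    @account_migrated = account_migrated
  end
end

user = User.new(account: Account.new(Date.new(2013, 1, 1)),
                customer: Customer.new(Date.new(2014, 1, 1)))
puts user.active_until_date   # answered by the old Account
user.account_migrated = true
puts user.active_until_date   # answered by the new Customer
```

Callers never learn which object answered, which is what lets the migration proceed one user at a time.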

Our user class ended up looking something like:

class User < ActiveRecord::Base
  include AccountShim
end

And the method calls were all replaced with:

user.active_until_date

Once the migration was 100% complete and the unneeded methods were fully deprecated or reproduced elsewhere, we went ahead and moved the delegations to the user and deleted this file — no external dependencies, no fuss, no muss.

Delegate your troubles away

As we mentioned, there were a lot of places where our code looked like:

user.account.active?

In these cases, we shouldn’t have exposed the fact that the account model was how the user was getting this information. This is an implementation detail that should have been abstracted away — after all, the caller only cares about the fact that the user was active, not who the owner of that information is.

If we had originally done something like:

class User < ActiveRecord::Base
  delegate :active?, :to => :account
end

Then we would have saved ourselves a lot of time picking through our codebase, finding these references, replacing them with the shim’d methods, and fixing tests. Lesson learned!
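The same idea can be tried outside Rails with Ruby's standard Forwardable module. This is a sketch with a hypothetical Struct standing in for the Account model; in a Rails app, ActiveSupport's delegate plays the same role.

```ruby
require 'forwardable'

# Hypothetical stand-in for the Account model.
Account = Struct.new(:active) do
  def active?
    active
  end
end

class User
  extend Forwardable

  # Callers ask the user; the user decides who answers.
  def_delegator :account, :active?

  attr_reader :account

  def initialize(account)
    @account = account
  end
end

user = User.new(Account.new(true))
puts user.active?
```

If the owner of active? later changes, only the delegation line in User has to move.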


Data Sanity? Oh, the Humanity

Posted: February 4th, 2013 | Author: | Filed under: Rails | No Comments »

More models, more problems

As of mid-2012, we had been accepting payments on lumosity.com for almost five years, all of them through a rather creaky, nasty, brittle pile of code that only a few of our engineers were brave enough to touch. We wanted to build a more flexible payment system that would allow us to implement all kinds of functionality we could never have before. In our design and planning meetings, we realized having everything we wanted would require new code, new models, and new schemas to store the underlying data. No problems here — we quickly built a system that could do everything we wanted.

Sounds great, right? Deploy away!

Of course, there was one small roadblock: we had millions of users already on the current system. We needed to seamlessly transition them between the two systems without anyone noticing any change had happened. Total transparency for the end user was paramount. This proved to be a tough problem given the long, complex account histories that many users had.

One of the strategies we ended up relying on to pull this off was running sanity checks on the data. That is, the expectation was that a snapshot of the data before and after migration would produce the same answers to the same questions.

Case study

Pretend you’re creating a schema to store a person’s medical record. There are many ways you can record a patient’s visits to their doctor. You might choose to store each visit as a separate entry in a visits table, with each diagnosis for that visit stored in a visit_diagnoses table. In this case, the visit is your central model.

Or, you may choose to take a longer view of their care and record each treatment given for a specific diagnosis in a diagnosis_treatments table. In this case, the treatments for a single diagnosis are more important than a single visit. No matter which way you choose, both models should be able to give you the same answers to questions like:

  • Was Jane treated by Dr. Simpson on July 11th?
  • Has John ever been diagnosed with measles?
  • How many vaccinations was Stacy given in the past 5 years?

Your model undoubtedly already asks these questions. And if you’re migrating from one to the other, you are probably porting all those “questions” (in the form of methods) to your new system.

This means that by the time you’ve written your new models and are ready to migrate the data, you have everything you need to check your migrated data’s correctness for free!

What we did

By checking the values of your model before and after the migration, you can have increased confidence in the data that you’ve migrated. We chose to do this using a SanityCheck class, which looks something like this:

class SanityCheck
  attr_reader :diff, :record

  # The list of methods we’re going to compare -- obviously, we can add anything we want here, not just methods that are on the user class.
  Methods = [:was_treated_on?, :was_vaccinated_on?, :has_active_treatment?, :current_prescriptions] # etc.

  def initialize(record)
    @record = record
    @before_values = SanityCheck.values(record)
  end

  def self.values(record)
    Methods.map { |m| [m, record.send(m)] }.to_h
  end

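  def check
    @after_values = SanityCheck.values(record)
    @diff = compute_diff(@before_values, @after_values)
  end

  private

  # Keys whose values changed (keeping the "before" value), plus any keys
  # that only exist in the "after" snapshot. Named differently from the
  # :diff reader so the two don't collide.
  def compute_diff(a, b)
    a.dup
      .delete_if { |k, v| b[k] == v }
      .merge!(b.dup.delete_if { |k, v| a.has_key?(k) })
  end
end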

To make use of this class, we instantiated it before we migrated the data and populated the before values. Then we migrated the data and checked the after values. If there were any differences in the two, we logged an error and rolled back the transaction pending further review:

User.find_each do |user|
  ActiveRecord::Base.transaction do
    sanity_check = SanityCheck.new(user)
    user.migrate!
    sanity_check.check

    if sanity_check.diff.any?
      # sanity check failed -- log an error, rollback the transaction, etc.
      log "Oh no! Something went wrong: #{sanity_check.diff}"
      raise ActiveRecord::Rollback
    else
      # woo hoo, success! Let’s indicate that this user was migrated
      user.update_attributes(:was_migrated => true)
    end
  end
end
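The before/after comparison can be tried in isolation. Below is a minimal plain-Ruby sketch; SnapshotCheck, Patient, and the deliberately lossy migrate! are all hypothetical names invented for this example, not our production code.

```ruby
# Snapshot the answers a record gives before and after a change, then
# report every question whose answer differs.
class SnapshotCheck
  attr_reader :differences

  def initialize(record, questions)
    @record = record
    @questions = questions
    @before = snapshot
  end

  # Call after the migration; returns {question => [before, after]}.
  def check
    after = snapshot
    @differences = @questions.each_with_object({}) do |q, diff|
      diff[q] = [@before[q], after[q]] unless @before[q] == after[q]
    end
  end

  private

  def snapshot
    @questions.each_with_object({}) { |q, h| h[q] = @record.public_send(q) }
  end
end

# Hypothetical record whose "migration" accidentally drops a prescription.
Patient = Struct.new(:current_prescriptions, :measles) do
  def diagnosed_with_measles?
    measles
  end

  def migrate!
    self.current_prescriptions = current_prescriptions.first(1)
  end
end

patient = Patient.new(%w[aspirin insulin], false)
check = SnapshotCheck.new(patient, [:current_prescriptions, :diagnosed_with_measles?])
patient.migrate!
p check.check
```

One caveat: the snapshot holds references to the returned values, so a migration that mutated an array in place would go unnoticed here. Methods that return fresh values on every call avoid that problem.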

What about unit tests?

The important thing to note is that what we were doing with this pattern wasn’t testing the code (we already did that) — it was testing the data. And since we were doing it for every record in the system, using live, production values, it was the best source of data available.

It also helped us uncover gaps in our understanding of the model we were trying to migrate. In any long running system, there’s bound to be an abundance of accumulated knowledge living inside your codebase and nowhere else (but on the flip side, there’s a lot of obsolete functionality that you’ll never miss). Running these sanity checks helped us uncover these assumptions long before the inevitable “Hey, I remember in the old system we used to be able to…” emails started coming in.

Rewrites aren’t easy, especially when trying to migrate an accumulated history throughout time and faithfully capture the state at each of those times — all while handling every edge case, bug, hack, and workaround that seemed like a great idea years ago. Sanity checking proved to be a great tool in helping us run our migrations without a hitch.


Sinatra + ActiveRecord

Posted: March 29th, 2012 | Author: | Filed under: Rails | No Comments »

For our first hackathon, a few of us decided to build a “crushing it” meter – basically a simple web service that could grab business data and apply some marketing knowledge to output a single metric to a physical meter (powered by an Arduino).

Sinatra seemed like an obvious choice for the simple web app. It turned out that getting Sinatra to play nicely with ActiveRecord was a little trickier than we originally thought and required some digging around. I’ve now put up an example app to hopefully help others out.


On Delivery

Posted: February 4th, 2012 | Author: | Filed under: Operations | 3 Comments »

Last fall, we took a look at our deploy process and how other companies were approaching the same problem.

The goals of a successful deploy process include (borrowing liberally from the Continuous Delivery and Web Operations books):

  • reducing cycle time (time to make a change on production)
  • fewer bugs
  • fewer complex bugs
  • lessening MTTD (mean time to detection)
  • optimizing human resources (more automation)
  • deploys should be boring, not an exciting event (from the engineering stand point)
  • make them repeatable, reliable, predictable

Here is my breakdown of the different types of deploy strategies.

1. The Monolith

Quarterly/Yearly+ updates. This is how updates to operating systems work. Consumers don’t want lots of regular updates. There is a giant test matrix and a long test cycle. Finding a regression is PAINFUL. I’ve been there. It sucks.

Application type: OS, “enterprise”

Companies: Apple, Sun(Solaris)/Oracle, IBM

Worst case scenario: we never update the site again before lumos goes under

Best case scenario: who cares, we’re not doing this


2. Jesus On The Dashboard

Push to production “whenever it feels right”. Don’t QA beforehand. Don’t dark deploy it. Don’t use feature flags. Users see changes as soon as they are deployed. Cross your fingers and hope it works. If it doesn’t, go into panic fix mode.

Application type: cheap, throw away consumer app

Companies: some facebook apps

Worst case scenario: you break the site regularly, have to hack your sh*t to fix fire alarms, your hair turns grey a lot earlier than it should

Best case scenario: effectively zero cycle time


3. Old Married Couple Controlled Chaos

Check-in and deploy in the same day from trunk.

Iterations are one week and start on, say, Tuesday. Check-in to trunk whenever you want. On day of deploy (Monday), do last minute manual QA and last minute bug fixes/feature creeps. No official process on when you should stop checking-in to trunk on deploy day, but use sensible judgement. Too many last minute bugs/feature creeps can push deploy out another day; however, you’re so used to working with each other that you can usually manage to get the deploy out.

Application type: consumer website with few developers / one feature set per iteration

Companies: early-stage startups

Worst case scenario: one week cycle time, production deploys get delayed

Best case scenario: ~1hr cycle time


4. Grey Beard Continuous Delivery

Code sits in trunk for up to one week, then staging for N days, then into production. We use N = 3.

Iterations are one week and start on Monday. Check-in to trunk if dev CI is green. Staging branch is cut from trunk once a week, every Monday. Only cut staging from trunk if dev CI is green. QA team works on staging branch for up to N days. Only bug fixes can be checked-in to staging branch after cut. Staging has its own CI. Deploys to production from staging branch go out every Wednesday. Production deploy doesn’t happen until QA team says a-ok and staging CI is green. Ensures the code about to go to production is sufficiently QA’d. Limits possibility of feature creep / fresh checkins delaying push to production.

Application type: consumer website with medium to large number of developers / multiple feature sets per iteration

Companies: medium-size startups

Worst case scenario: one week + N days cycle time

Best case scenario: N days cycle time


5. Mountain Dew Extreme Continuous Deployment

Anyone can deploy at any time. Need code review before checking in. Automated test suite prevents bad check-ins. New feature sets are dark deployed. Manual QA is done on production. Metric monitoring raises alarms when something goes wrong (we currently have something like this with Splunk).

Application type: consumer website with medium to large number of developers / multiple feature sets going on at same time

Companies: Flickr, Etsy

Worst case scenario: you break the site but have enough prevention that it shouldn’t matter (ie: rolled back or not immediately user facing)

Best case scenario: however long it takes to get a code review + length of automated test suite (on the order of a few minutes/hours)


I should note that all of these approaches have been used by *successful* businesses. However, depending on the business/application, some of these will cause the engineers to go crazy, hate their jobs, and start producing lower quality work. Then beat their pets at home (or bikes if they do not have a pet).

Last fall, we moved out of the “Old Married Couple Controlled Chaos” and started using the “Grey Beard Continuous Delivery” deploy process. It was simple, straightforward, and something we could move to with little extra work. It’s working well for us and has set a nice cadence for the rest of the company. We are looking into building the tools and experience necessary to get us closer to “Mountain Dew Extreme Continuous Deployment”.

One key tidbit I got from my old ski-lease buddy Paul of Flickr fame was that engineers have to have the proper mentality in order to get to true continuous deployment. If you don’t have the mindset that every check-in matters and breaking things is not ok, then process can’t save you. This means you have to be smart about who you hire and how you bring new hires up to speed.


Take it easy, trashman! How switching to REE 1.8.7 made lumosity.com way faster

Posted: October 12th, 2011 | Author: | Filed under: Operations | 2 Comments »

If you’re running a Rails application in production and you’re still running plain old vanilla ruby 1.8.7 (the MRI or “Matz” version), our recent experience shows that you might have a lot to gain for relatively little effort by making the switch to Ruby Enterprise Edition 1.8.7. After rolling out REE and tuning its garbage collection parameters, we saw a 35% improvement in performance — average response time dropped from 200ms to around 125ms.

Since we’re using the amazing RVM in production, switching ruby versions was relatively easy. I wrote a simple Chef cookbook to have RVM install a desired ruby version on an app server, set it as the default, and reinstall essential gems like chef and bundler. The upgrade process then consisted of running chef-client on each server, re-bundling our app’s gems with bundle install, and bouncing the mongrels. (Yes, we’re still using mongrel… for now :)

Right out of the box, REE provided a noticeable improvement in performance, especially in memcache call time:

Note the gradual increase in the brown “GC Execution” layer of the New Relic RPM graph as each app server comes online with REE 1.8.7, which (unlike MRI ruby) emits the stats that New Relic uses to track GC performance. I was somewhat shocked to see that we were spending approximately half of all our Ruby time doing garbage collection, though in retrospect maybe I shouldn’t have been — GC performance tuning is one of the reasons that REE was developed in the first place.

So far so good — we’ve shaved off about 20-30ms on average and we have much better visibility into what Ruby is doing with its time. The real win at this point would seem to be reducing time spent in garbage collection. Time to take Twitter’s advice about tuning GC performance!

This was also a simple change, thanks to monit and chef cookbooks. I added an attribute to our mongrel_rails cookbook to specify the GC parameters to be passed into the environment of each mongrel when started by monit, something like this:

check process mongrel_5000
  with pidfile /var/run/mongrel/mongrel.5000.pid every 2 cycles
  start program = "/bin/su - lumoslabs -c 'RUBY_HEAP_MIN_SLOTS=500000 RUBY_HEAP_SLOTS_INCREMENT=250000 RUBY_HEAP_SLOTS_GROWTH_FACTOR=1 RUBY_GC_MALLOC_LIMIT=50000000 /data/lumoslabs/current/bin/rackup -s mongrel -o xx.yy.zz.aa -p 5000 -E production -D -P /var/run/mongrel/mongrel.5000.pid /data/lumoslabs/current/config.ru'"
  stop program = "/bin/su - lumoslabs -c '/data/lumoslabs/current/script/stop_racked_mongrel.sh /var/run/mongrel/mongrel.5000.pid'"
  if totalmem is greater than 600 MB for 2 cycles then restart      # eating up memory?

The actual values for these parameters were taken directly from Twitter’s suggestions. Once I rolled out the monit changes via chef and bounced all the mongrels, I saw good things happen:

This is exactly what an ops engineer wants to see! A nearly immediate and big improvement in performance, and the only code changes required were on the infrastructure side of things. Though I haven’t tested out all of these parameters in isolation (who has the time??!) their aggregate effect is clear. We allocate more memory initially (RUBY_HEAP_MIN_SLOTS increases to 500k from 1k), grab more when we need to add it (RUBY_HEAP_SLOTS_INCREMENT increases to 250k from 10k), and make sure we don’t trigger GC before we actually need it (RUBY_GC_MALLOC_LIMIT increases to 50M from 1M.)

One word of caution – these settings did have the effect of increasing the memory footprint of each mongrel from about 480MB to about 600MB. Within a few days, the resulting swapping had undone all of the performance improvements made by switching to REE, so we needed to reduce the number of workers on each server from 8 to 6. Having fewer workers is neatly balanced out by the improvement in response time, though, and ultimately we’re seeing lower CPU usage and higher throughput on each app server.
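The worker-count trade-off is easy to check with back-of-the-envelope arithmetic. The sketch below uses only the approximate per-mongrel footprints quoted above; the total RAM on each app server is not stated here, so it is left out.

```ruby
# Approximate per-mongrel footprints from the paragraph above.
untuned_mb = 8 * 480   # 8 mongrels at ~480MB each, before GC tuning
tuned_8_mb = 8 * 600   # the same 8 mongrels at ~600MB each, after tuning
tuned_6_mb = 6 * 600   # 6 mongrels at ~600MB each

puts "untuned, 8 workers: #{untuned_mb} MB"
puts "tuned, 8 workers:   #{tuned_8_mb} MB"
puts "tuned, 6 workers:   #{tuned_6_mb} MB"
```

Dropping to 6 workers brings the mongrels' combined footprint back to slightly below where it was before tuning.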

And the magic just doesn’t end. Based on some very solid advice at smartic.us, I added some GC tuning parameters to the environment on our CI box. I was really happy to see that it cut our build time (which had bloated up to become painfully long) approximately in half. Definitely try this out if your build is taking longer than you’d like!

Up next — taking the plunge and upgrading to Ruby 1.9.2. More on that in a month or two!


Sleep easy when switching MySQL servers: benchmarking with sysbench and mk-query-digest

Posted: September 13th, 2011 | Author: | Filed under: Operations | No Comments »

Our MySQL database servers are the heart and soul of lumosity.com’s Rails stack. Memcached helps a ton with serving up data that’s accessed very frequently, but in the end the database is the single most critical (and most complex) piece of our infrastructure.

This summer we decided that it was time to go through the semi-painful process of upgrading our trusty primary database server — which has been serving a thousand queries per second for the last 18 months without batting an eyelash — to a bigger, badder, more modern machine with larger and faster disks, more RAM, and faster processors. Like most engineering teams, we strongly prefer to add capacity before we encounter performance issues, so it was the right time to switch. We needed to prepare the site for our next 6+ months of growth.

Once the new machine was prepped and ready to go, we had an important question to answer:

“This machine certainly seems fast, but can we be sure that it can handle our production load? What if there’s a subtle issue with the RAID controller, or maybe an unexpected I/O issue with our MySQL configuration?  What if we blow up the site???!!!”

We weren’t comfortable simply crossing our fingers and making the switch.  With money and subscriber satisfaction on the line, we needed to be sure that it would be a smooth transition.

Sysbench – OLTP Workload

Baron Schwartz et al’s indispensable High Performance MySQL, 2nd Edition (hereafter referred to as HPM) has a fairly basic but good section on benchmarking MySQL. sysbench is one of the standard Linux benchmarking tools covered in his survey. I wanted a simple tool that could (1) find the upper bounds of I/O and transaction processing performance on the new database server and (2) allow us to compare these boundaries against the performance of our current database servers.

After installing sysbench using yum, I followed Baron’s lead in HPM and fired off a fairly sizable OLTP workload on each of the machines. This would provide a reasonable approximation of the I/O generated by a database server handling many concurrent requests in a Rails stack.  The goal of this test was to determine the expected throughput (transactions per second) and per-request processing time that both generations of server could handle at their peak.  My hypothesis was that db-new, all souped up with 2011 hardware, would smoke db-old and its 2009 hardware.

I executed the following commands (straight out of HPM) on db-new and db-old to push 60 seconds worth of requests through a 1M row table in 8 concurrent threads:

# sysbench --test=oltp --oltp-table-size=1000000 \
    --mysql-db=test --mysql-user=root prepare
# sysbench --test=oltp --db-driver=mysql --oltp-table-size=1000000 \
    --mysql-socket=/tmp/mysql.sock --mysql-db=test --mysql-user=root \
    --mysql-password=xxxxxxx --max-time=60 --oltp-read-only=on \
    --max-requests=0 --num-threads=8 run

The results from db-new:

OLTP test statistics:
    queries performed:
        read: 3925586
        write: 0
        other: 560798
        total: 4486384
    transactions: 280399 (4673.18 per sec.)
    deadlocks: 0 (0.00 per sec.)
    read/write requests: 3925586 (65424.48 per sec.)
    other operations: 560798 (9346.35 per sec.)

Test execution summary:
    total time: 60.0018s
    total number of events: 280399
    total time taken by event execution: 478.7724
    per-request statistics:
        min: 1.45ms
        avg: 1.71ms
        max: 6.62ms
        approx. 95 percentile: 1.84ms

Threads fairness:
    events (avg/stddev): 35049.8750/715.27
    execution time (avg/stddev): 59.8465/0.01

And from db-old:

OLTP test statistics:
    queries performed:
        read: 2671536
        write: 0
        other: 381648
        total: 3053184
    transactions: 190824 (3180.28 per sec.)
    deadlocks: 0 (0.00 per sec.)
    read/write requests: 2671536 (44523.87 per sec.)
    other operations: 381648 (6360.55 per sec.)

Test execution summary:
    total time: 60.0023s
    total number of events: 190824
    total time taken by event execution: 478.7136
    per-request statistics:
        min: 2.14ms
        avg: 2.51ms
        max: 81.88ms
        approx. 95 percentile: 2.71ms

Threads fairness:
    events (avg/stddev): 23853.0000/177.37
    execution time (avg/stddev): 59.8392/0.00

This was good news, but not unexpected: the 95th percentile request time is about 32% lower on db-new (1.84ms vs 2.71ms), and the throughput is about 47% higher (4673 transactions/sec vs 3180 transactions/sec). Given the increase in CPU and I/O horsepower on db-new, I would have been disappointed with anything else!
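For anyone who wants to work the comparison out from the raw sysbench figures, here is a short calculation. The variable names are mine; the numbers are copied from the two runs above.

```ruby
# Numbers copied from the two sysbench runs above.
new_tps, old_tps = 4673.18, 3180.28   # transactions per second
new_p95, old_p95 = 1.84, 2.71         # 95th percentile request time, in ms

throughput_gain = (new_tps / old_tps - 1) * 100
latency_drop    = (1 - new_p95 / old_p95) * 100

puts format('throughput: %.1f%% higher on db-new', throughput_gain)
puts format('95th percentile request time: %.1f%% lower on db-new', latency_drop)
```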

Before I could jump headfirst into the void of db-new, I had to see it run our real production workload. I had to make sure that it didn’t blow up because of some subtle change to MySQL configuration, or the new RAID controller, or the RAM, or one of an infinite number of other things. This was a head-scratcher at first; I thought maybe we could sniff the mysql traffic from db-old with tcpdump and somehow “replay” it on db-new, but I hadn’t a clue how to decode the mysql protocol. It sounded like a lot of work and we were working against the clock, disk filling up little by little on db-old every day. But soon enough, Google led me to…

mk-query-digest (the greatest MySQL tool in the history of the universe)

I had already discovered maatkit via the mentions of mk-table-sync in HPM, but had never bumped into the innocuously-named mk-query-digest until I found this 37signals post on warming the passive failover in a master-master replication pair. mk-query-digest should be in every MySQL admin’s toolkit: it does pretty much everything, including stuff you had no idea you needed to do. Let it teach you.

I already had db-new set up as a replication slave of db-old, so I knew their data was in sync at any given time (give or take a few seconds.) Moreover, I knew that UPDATEs and INSERTs were working as expected on db-new, since db-new was constantly replaying those queries by reading db-old’s binary logs via replication. But I had no proof that db-new would be able to keep up with the sizable throughput of SELECTs on db-old, about 1000-1200 per second these days. As outlined in the 37signals post I linked above, the quick-and-dirty way to do this was to run tcpdump on db-old to capture mysql traffic for some period of time, say 5 minutes, and then to use mk-query-digest to replay the SELECT queries on db-new.

First, I ran tcpdump on db-old:

[db-old] # time tcpdump -s 65535 -x -nn -q -tttt -i any port 3306 > db-old.tcp.txt
tcpdump: WARNING: Promiscuous mode not supported on the "any" device
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
1050296 packets captured
1063883 packets received by filter
13498 packets dropped by kernel

real 4m34.464s
user 1m40.576s
sys 0m16.530s

Then, I carefully shipped the resulting 1.8GB file over to a third server, call it db-other, which would be used to execute the queries on db-new. Why a third server? I didn’t want the overhead of the mk-query-digest perl script itself polluting the results on db-new. Also, note the use of the '-l 500' argument to scp, which rate-limits the file copy to 500 Kbit/s. Since db-old was a live production database server, I had to take care not to hog all its outbound bandwidth with the file copy, which would starve our Rails app servers of data and crash the site!

[db-old] # scp -l 500 db-old.tcp.txt lumoslabs@db-other:~

Now, on db-other, I ran mk-query-digest to run the 5 minutes worth of queries on db-new. I’m using the --filter argument to pass in a perl expression that will only execute queries that start with the string ‘SELECT’ (case insensitive.)

[db-other] # mk-query-digest --type tcpdump \
    --filter '($event->{arg} =~ m/^SELECT/i)' \
    --execute h=db-new,u=user,p=xxxxxx db-old.tcp.txt
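For readers more at home in Ruby than Perl, the filter's test can be sketched as follows. This only illustrates the logic; mk-query-digest itself evaluates the Perl expression, and the sample queries are made up. Note that Ruby needs \A to anchor at the start of the string, because ^ in Ruby matches at the start of any line.

```ruby
# Ruby rendering of the Perl filter: keep only statements that begin
# with SELECT, ignoring case.
def select_query?(sql)
  !!(sql =~ /\ASELECT/i)
end

# Made-up sample statements.
queries = [
  'SELECT * FROM users WHERE id = 1',
  'select COUNT(*) FROM roles',
  "UPDATE users SET name = 'x' WHERE id = 1",
  'INSERT INTO roles (name) VALUES (1)'
]

puts queries.select { |q| select_query?(q) }
```

Filtering out writes matters here because db-new was already receiving every UPDATE and INSERT through replication.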

On db-new, I fired up mk-query-digest to watch the SELECTs roll in on the mysql interface:

[db-new] # tcpdump -s 65535 -x -nn -q -tttt \
    -i any port 3306 | mk-query-digest --print --type tcpdump

# Time: 110913 15:48:15.650359
# Client: 10.32.95.138:48155
# Thread_id: 4294967311
# Query_time: 0.000059 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 0
SELECT `asset_versions`.* FROM `asset_versions` WHERE (`asset_versions`.asset_id = 27668);
# Time: 110913 15:48:15.652158
# Client: 10.26.2.134:53512
# Thread_id: 4294967346
# Query_time: 0.000056 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 0
SELECT COUNT(*) FROM `roles` INNER JOIN `roles_users` ON `roles`.id = `roles_users`.role_id 
  WHERE `roles`.`name` = 'admin' AND ((`roles_users`.user_id = NULL));

... etc ...

I watched top, iostat, and mytop to make sure nothing was blowing up. The server load stayed nice and moderate, peaking at 0.66 as the iowait percentage spiked during the initial paging into the InnoDB buffer. It eventually settled into a comfortable 0.33, with iostat showing only 0.1% iowait time. Basically, things looked great from the perspective of system metrics. I relaxed even more!

The output of mk-query-digest gave even more reason to be hopeful that db-new was ship-shape:

# 339.9s user time, 7.1s system time, 112.13M rss, 233.60M vsz
# Current date: Mon Sep 5 14:57:11 2011
# Hostname: db-other.sl.lumoslabs.com
# Files: db-old.tcp.txt
# Overall: 365.64k total, 1.89k unique, 1.33k QPS, 18.37x concurrency ____
# Time range: 2011-09-05 13:18:10.837646 to 13:22:45.293365
# Attribute      total     min     max     avg     95%  stddev  median
# ============ ======= ======= ======= ======= ======= ======= =======
# Exec time      5042s       0     38s    14ms   839us   347ms   194us
# Exec orig ti    204s       0      2s   557us   596us    11ms   131us
# Rows affecte  21.92k       0      19    0.06    0.99    0.24       0
# Query size    42.85M       5   3.61k  122.87  246.02  148.13  102.22
# Exec diff ti    273s       0     38s     2ms   626us   118ms   108us
# Warning coun  11.81k       0  11.30k    0.03       0   18.31       0
# Boolean:
# No index use   7% yes,  92% no

We can see that we reach 1.33k queries per second, which is in line with the expected load of our production traffic. We can also compare the 95th percentile execution time of the queries on the new server — “Exec time” of 839 microseconds — with that on the old server — “Exec orig time” of 596 microseconds. Given the fact that we were running these queries against an entirely “cold” server, i.e. with absolutely no data in the InnoDB buffer, we would expect this performance hit. Nearly every piece of data requested by the SELECTs during this 5 minutes had to be pulled from disk, whereas db-old had the great advantage of having many GB of RAM all warmed up with the most-frequently requested items.
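The headline figures can be cross-checked against the report itself. A small sketch, using only values printed in the mk-query-digest output above (variable names are mine):

```ruby
require 'time'

# Figures from the mk-query-digest report above.
started  = Time.parse('2011-09-05 13:18:10.837646')
finished = Time.parse('2011-09-05 13:22:45.293365')
total_queries = 365_640   # "365.64k total"

elapsed = finished - started          # seconds covered by the capture
qps = total_queries / elapsed

# 95th percentile: "Exec time" (cold db-new) vs "Exec orig time" (warm db-old).
slowdown = (839.0 / 596.0 - 1) * 100

puts format('elapsed: %.1fs, QPS: %.0f', elapsed, qps)
puts format('95th percentile slowdown on the cold server: %.0f%%', slowdown)
```

The elapsed time also lines up with the roughly four and a half minutes that tcpdump ran on db-old.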

So, this particular test shows us that the database server wasn’t crushed by the load, but not that its performance was comparable to the original machine. To test that, we’d need to mirror our production workload for a considerable period of time, an exercise that is beyond the scope of this post. (But we did it, of course!)

Conclusion

This database switchover went smoothly, as smooth as Sade. We’re running against db-new now and have started seeing the expected performance boost as its InnoDB buffer pages in the optimal working set for Lumosity’s data. It’ll be very useful to have these tools in our box in the near future when we tackle our next database project: upgrading to MySQL 5.5. Without benchmarking, we’d just be switching and praying!


code consistency / rails style guide

Posted: June 30th, 2011 | Author: | Filed under: Rails | 2 Comments »

After you have programmed for a while, you start to tell good code from bad code. One of the things that makes code better is consistency.

Code consistency leads to code that is:
- easier to read
- easier to debug
- less buggy
- easier to code review
- easier to maintain

Ultimately, having agreed-upon consistency in our code allows us to focus more on what we are building, and less on how to write it. This, in turn, lets us deliver better features, faster.

Here is our style guide for Rails. It comes from lessons learned while building our own app, looking at lots of external code, and checking out other people’s style guides.

Bonus points to anyone willing to write an rstyle tool (something like cstyle) to automate the task of ensuring code consistency.


Capistrano Task to Do a Rolling Restart + Uninterrupted Deploy

Posted: June 16th, 2011 | Author: | Filed under: Rails | No Comments »

Unnecessary downtime is bad. To allow for small, frequent deploys, we need to avoid downtime – no matter how short it is. In order to do this, we can’t restart all of the mongrels on all of our app servers at the same time. What we would like is a rolling restart – the ability to restart the mongrels, one app server at a time.

Capistrano assumes that you want to perform a given task in parallel on all servers within a role. For a rolling restart, we would like to serialize the restart task. Here is a way to perform a task serially (across your servers) within cap:

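# Runs the given block once per server, one server at a time, by defining a
# task scoped to a single host and invoking it before moving on to the next.
def serialize_task_for(task_name, servers)
  servers.each do |server|
    puts "    Performing #{task_name} for #{server} at #{Time.now}..."
    # (re)define the task so that it targets only this server
    task(task_name, :hosts => server) do
      yield server
    end
    # invoke the task that was just defined
    eval(task_name)
  end
end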

Now that we have this, we can do a rolling restart to bounce our mongrels across our app servers:

namespace :deploy do
  desc "Rolling restart. Restart the mongrels, one app server at a time."
  task :rolling_restart, :roles => :rolling_restart do
    servers = find_servers(:roles => :app)
    serialize_task_for('rolling_restart_for_a_single_server', servers) do |server|
      run("sudo monit restart all -g lumoslabs")
      # wait longer than it normally takes a mongrel to start up
      sleep(70)
      teardown_connections_to(sessions.keys)
      # poll monit until enough mongrels report "running";
      # num_mongrel_instances is not defined in this snippet and must be set elsewhere
      done = false
      until done
        run("sudo monit summary | grep mongrel | awk '{print $3}' | grep running | wc -l") do |channel, stream, data|
          done = data.split.first.to_i >= num_mongrel_instances
          sleep(10) unless done
        end
        teardown_connections_to(sessions.keys)
      end
    end
  end
end
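The polling loop above keeps going until enough mongrels are running, with no upper bound on how long it waits. Here is a minimal plain-Ruby sketch of the same check with a timeout added. It is not part of our deploy script: `count_running_mongrels`, `wait_until`, and the sample monit output are all assumptions made for illustration, and the code assumes Ruby 2.3 or later. Check the actual `monit summary` format of your monit version before relying on the field positions.

```ruby
# Mirrors `grep mongrel | awk '{print $3}' | grep running | wc -l`:
# counts lines that mention "mongrel" and whose third field contains "running".
def count_running_mongrels(summary)
  summary.each_line.count do |line|
    line.include?('mongrel') && line.split[2].to_s.include?('running')
  end
end

# Calls the block repeatedly until it returns true or the timeout elapses.
# Returns true on success and false on timeout. The sleeper is injectable
# so the loop can be exercised without really sleeping.
def wait_until(timeout:, interval:, sleeper: ->(seconds) { sleep(seconds) })
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleeper.call(interval)
  end
end

# Hypothetical `monit summary` output, assumed for this example only.
sample_summary = <<~SUMMARY
  Process 'mongrel_8000'    running
  Process 'mongrel_8001'    running
  Process 'mongrel_8002'    Does not exist
  Process 'nginx'           running
SUMMARY

puts count_running_mongrels(sample_summary) # prints 2

# A stand-in for polling: the check passes on the third attempt.
attempts = 0
restarted = wait_until(timeout: 5, interval: 0.01) { (attempts += 1) >= 3 }
puts restarted # prints true
```

Inside a cap task, the block passed to `wait_until` would wrap the `run(...)` call, and a `false` return could abort the rolling restart before it moves on to the next server.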

And if we take this to its logical conclusion, we can now do an uninterrupted deploy to our website:

namespace :deploy do
  desc "Update the site without taking it offline. Does not run migrations."
  task :uninterrupted do
    transaction do
      update_code
      web.compile_stylesheets
      symlink
    end
    rolling_restart
  end
end

Of course, when to do a rolling restart or an uninterrupted deploy needs to be thought out and is app-specific. How to do an uninterrupted deploy with migrations is a topic for a future post.