EC2 monitoring: the case of stolen CPU

Posted on 22. Jul, 2010 by admin in Blog

When the top command displays 40% CPU busy but CloudWatch says the server is maxed out at 100%, which side do you take? The answer is simple (CloudWatch is correct, top is not), but it raises a question: how do you measure the performance of virtual machines if you can no longer take operating system statistics at face value? How do you define thresholds, raise alerts, and create management reports if the underlying data appears to be misleading?

[Figure: CPU usage displayed by top]

[Figure: CPU usage reported by CloudWatch]

[Figure: CPU usage reported by Tivoli OS agent]

If you’re an IBM customer with a pSeries frame, these questions aren’t entirely new to you. When IBM introduced shared pools and micro-partitioning back in 2004, it radically changed how CPU usage is monitored in the AIX part of the world. In fact, since CPU capacity is allocated to a logical partition dynamically, the traditional CPU breakdown by system/user/wait I/O has become irrelevant for capacity planning. What matters is CPU consumption in processor units, as well as the ratio of CPU units consumed to CPU units allocated. On AIX that ratio can exceed 100%, a scalability-on-demand feature that Amazon customers cannot enjoy as of this writing.

The XEN hypervisor powering the Amazon EC2 infrastructure has made great progress in adding flexibility to resource allocations, but it’s still years behind the IBM POWER hypervisor in terms of granularity. Nevertheless, the initiated observer and aspiring cloud guru still has some options for correlating OS and hypervisor metrics. For example, you may notice that the top output contains an additional metric called stolen CPU (st for short).

[Figure: CPU stolen displayed by top]

The metric is exposed by the XEN hypervisor, and in the above example it’s equal to 56.9%. Stolen CPU is the percentage of cycles reclaimed by the hypervisor because the virtual machine has reached the maximum number of processor units allocated to it on the underlying processor core. In the example above, the m1.small EC2 instance was allocated 0.4 processor units, so the 40% CPU busy figure is usage measured against the whole underlying core. However, because 40% is the maximum CPU share that can be allocated to this VM, the effective CPU usage is 40%/40% = 100%, which is the number displayed by CloudWatch.
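The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the second function assumes that whatever time was not stolen (100% minus st) approximates the share the VM could actually use, which is an approximation rather than anything CloudWatch documents.

```python
def effective_cpu_pct(busy_pct: float, entitled_pct: float) -> float:
    """Rescale 'busy' (measured against the whole core) to the VM's entitlement."""
    if entitled_pct <= 0:
        raise ValueError("entitled_pct must be positive")
    return busy_pct / entitled_pct * 100.0


def effective_cpu_from_steal(busy_pct: float, steal_pct: float) -> float:
    """Estimate effective usage when the entitlement is unknown.

    Assumption: the time not stolen (100 - steal) is the share the VM
    could have used, so busy is rescaled against that share.
    """
    available_pct = 100.0 - steal_pct
    if available_pct <= 0:
        raise ValueError("steal_pct must be below 100")
    return busy_pct / available_pct * 100.0


# The m1.small example: 40% busy against a 40% entitlement.
print(effective_cpu_pct(40.0, 40.0))
# Same busy figure, rescaled using the 56.9% steal value shown by top.
print(round(effective_cpu_from_steal(40.0, 56.9), 1))
```

The two results differ slightly because the 40% entitlement is a rounded figure, while the steal value comes straight from a single top sample.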

Another option for retrofitting existing agent- or SNMP-based monitoring tools that don’t integrate with CloudWatch is to use the CPU idle metric. This works because stolen time is not counted as idle. All you need to do is rewrite your rules to measure CPU idle instead of CPU busy. For example, if you have a >75% threshold defined for CPU busy, create a <25% rule for CPU idle. If CPU idle is 0, then your server is CPU bound.

[Figure: CPU idle displayed by top]
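For a home-grown agent, the idle-based rule could be sketched as follows. The sketch assumes a Linux guest whose /proc/stat aggregate "cpu" line lists user, nice, system, idle, iowait, irq, softirq and steal counters in that order; the function names and the 25% default are illustrative, not part of any monitoring product.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu' line of /proc/stat into named counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    values += [0] * (len(FIELDS) - len(values))  # pad if steal is missing
    return dict(zip(FIELDS, values))


def idle_and_steal_pct(before: str, after: str) -> tuple:
    """Return (idle %, steal %) for the interval between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {name: b[name] - a[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return 100.0 * deltas["idle"] / total, 100.0 * deltas["steal"] / total


def is_cpu_bound(idle_pct: float, idle_threshold: float = 25.0) -> bool:
    """Idle-based equivalent of a '>75% busy' rule."""
    return idle_pct < idle_threshold


# Two made-up samples, for illustration only. On a real instance you would
# read the first line of /proc/stat twice, a few seconds apart.
before = "cpu 100 0 50 1000 0 0 0 200 0 0"
after = "cpu 500 0 50 1030 0 0 0 770 0 0"
idle, steal = idle_and_steal_pct(before, after)
print(idle, steal, is_cpu_bound(idle))
```

Alerting on idle rather than busy means the rule stays valid whatever share of the core the hypervisor grants the instance.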

If you’re wondering where the 40% comes from, the math is pretty simple. The m1.small Linux system is entitled to 1 EC2 compute unit, which provides the equivalent CPU capacity of a 1.0–1.2 GHz 2007 Opteron or 2007 Xeon processor. Since the VM runs on a machine with a 2.6 GHz clock speed, it’s entitled to a 38.5% to 46.2% processor share (1.0/2.6 to 1.2/2.6) on this particular XEN node. You can run the cat /proc/cpuinfo command to find out the CPU architecture behind your EC2 instances.

[Figure: Finding out CPU clock speed on Linux EC2 instance]
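The entitlement math can be reproduced with a short Python sketch. It assumes an x86 Linux guest whose /proc/cpuinfo contains a "cpu MHz" line; the function names and the sample text are made up for illustration, and the reported clock may vary on hosts that use frequency scaling.

```python
import re


def host_clock_ghz(cpuinfo_text: str) -> float:
    """Extract the first 'cpu MHz' value from /proc/cpuinfo text, in GHz."""
    match = re.search(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'cpu MHz' line found")
    return float(match.group(1)) / 1000.0


def entitled_share_pct(compute_unit_ghz: float, host_ghz: float) -> float:
    """Share of one physical core that a given clock-speed equivalent buys."""
    if host_ghz <= 0:
        raise ValueError("host_ghz must be positive")
    return compute_unit_ghz / host_ghz * 100.0


# Illustrative /proc/cpuinfo excerpt, not copied from a real instance.
sample = "processor\t: 0\ncpu MHz\t\t: 2600.000\n"
ghz = host_clock_ghz(sample)

# One EC2 compute unit is described as 1.0 to 1.2 GHz of 2007-era capacity.
low = entitled_share_pct(1.0, ghz)   # about 38.5
high = entitled_share_pct(1.2, ghz)  # about 46.2
print(round(low, 1), round(high, 1))
```

On a real instance the text would come from reading /proc/cpuinfo, and the result would change with the clock speed of whichever host the instance lands on.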

By the way, there is an ongoing industry discussion about the ‘stolen cpu’ or ‘steal time’ term. Obviously, the word itself carries a connotation that might make some AWS customers wonder if their fully paid CPU time was somehow stolen by rogue EC2 instances running on the same physical node. Rest assured, the rules of the game are fair. The best way to describe stolen CPU time to your peers is as shared CPU time belonging to other AWS customers.