June 2007

June 29, 2007

Hold the Fudge Please

For anyone getting serious about virtualization, one of the first questions that comes up is how to account for the overhead introduced by the virtualization technology being used. Any time you place a layer of abstraction between resource demand and supply there is a high likelihood that some overhead will result.

When we look at common server virtualization technologies, there is significant variability in the level of overhead. Above-kernel technologies such as Solaris Zones are based on a containment model, which is fairly efficient because device drivers and scheduler activity are controlled rather than virtualized. Below-kernel technologies such as VMware, on the other hand, tend to replace what would have been a fairly direct link to the hardware with virtual device drivers, and so generate CPU activity whenever I/O occurs. Paravirtualization technologies (like Xen) and partitioning technologies (like LPARs) have different characteristics yet again, with LPARs having some very sophisticated I/O virtualization capabilities.

There is currently a lot of focus on VMware, and on how to determine how high you can safely pile the workloads. Because it is easy to do, it is common practice to use simple fudge factors to account for overhead when stacking workloads. Overhead numbers in the 15% range are common, and fudge factors upwards of 30% are not unheard of in some situations. This is like surgery with a sledgehammer, and is no replacement for a proper model of overhead.

To properly analyze this, let’s consider a workload pattern for a physical server:

[Figure: CPU utilization of the physical server by time of day, quartile view]

This is a server that spends most of the day between 0 and 25% utilization and peaks as high as 50% late in the day. (This is a quartile-based view, which I won’t get into too much here.) To determine what this will look like in a VMware environment, the next step is to look at the I/O rates on this server:

[Figure: disk I/O rate by time of day]

[Figure: network I/O rate by time of day]

From these curves we can see that this server is not I/O intensive for most of the day but has some reasonably sustained I/O activity after 7pm. In VMware, this I/O will pass through virtual drivers, which effectively pass the operations on to the underlying device drivers but consume CPU cycles while doing so. We can therefore use these curves to estimate the true overhead on the CPU, and attribute this overhead to the corresponding times of day.
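The attribution described above can be sketched in a few lines of Python. This is a minimal illustration, not the model used to produce the curves in this post: the per-operation CPU costs, the constant scheduler overhead, and the names `HourSample` and `virtual_cpu` are all assumptions made up for the example, and real coefficients would need to be measured on the target host.

```python
from dataclasses import dataclass


@dataclass
class HourSample:
    cpu_pct: float    # physical CPU utilization for the hour, in percent
    disk_iops: float  # disk operations per second
    net_pps: float    # network packets per second


# Hypothetical coefficients: CPU percentage points consumed per unit of I/O.
# These are placeholders, not measured VMware figures.
CPU_PER_DISK_IOP = 0.004
CPU_PER_NET_PKT = 0.001
BASE_OVERHEAD_PCT = 1.0  # assumed constant scheduler overhead


def virtual_cpu(sample: HourSample) -> float:
    """Estimate CPU utilization once the workload runs behind virtual drivers."""
    io_overhead = (sample.disk_iops * CPU_PER_DISK_IOP
                   + sample.net_pps * CPU_PER_NET_PKT)
    # Cap at 100% because utilization cannot exceed the whole machine.
    return min(100.0, sample.cpu_pct + BASE_OVERHEAD_PCT + io_overhead)


def virtual_cpu_curve(samples):
    """Apply the estimate hour by hour so overhead lands where the I/O is."""
    return [virtual_cpu(s) for s in samples]
```

With these placeholder coefficients, a quiet hour such as `HourSample(10, 50, 200)` barely moves, while an I/O-heavy hour such as `HourSample(50, 2000, 5000)` picks up a much larger share of overhead.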

The result of using this approach is a virtual CPU curve that looks like this:

[Figure: estimated CPU utilization of the same workload when virtualized]

When compared against the original curve, it is clear that the overhead is skewed heavily toward the periods of significant I/O activity, and that periods of low I/O do not contribute disproportionately to the overall utilization levels. (This curve accounts for overhead caused by network I/O, disk I/O and general scheduler overhead, and is exaggerated to help illustrate the effect.)
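To make the contrast with a flat fudge factor concrete, here is a small, hedged comparison in Python. The two sample hours and their I/O-attributed overhead values are invented for illustration, and the function names are hypothetical; only the 15% figure comes from the range quoted earlier in this post.

```python
FLAT_FUDGE = 0.15  # flat 15% factor, applied to every hour alike


def flat_estimate(cpu_pct):
    """Inflate physical CPU by a constant percentage, regardless of I/O."""
    return cpu_pct * (1.0 + FLAT_FUDGE)


def io_based_estimate(cpu_pct, io_overhead_pct):
    """Add overhead (in CPU percentage points) attributed to that hour's I/O."""
    return cpu_pct + io_overhead_pct


def compare(hours):
    """Return (label, flat, io_based, flat - io_based) for each hour."""
    rows = []
    for label, cpu_pct, io_overhead_pct in hours:
        flat = flat_estimate(cpu_pct)
        modeled = io_based_estimate(cpu_pct, io_overhead_pct)
        rows.append((label, flat, modeled, flat - modeled))
    return rows


# Invented sample hours: (label, physical CPU %, I/O-attributed overhead in points)
HOURS = [("quiet morning", 10.0, 0.5), ("evening peak", 50.0, 13.0)]

for label, flat, modeled, diff in compare(HOURS):
    print(f"{label}: flat={flat:.1f}% modeled={modeled:.1f}% diff={diff:+.1f}")
```

With these made-up inputs the flat factor comes out above the I/O-based estimate in the quiet hour and below it in the I/O-heavy hour, which is the kind of mismatch a single percentage cannot express.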

Applied in practice, this brings an additional level of accuracy to virtualization analysis and helps prevent layering fudge factors on top of fudge factors, which tends to lead to sticky situations that can undermine your attempts to trim down your data center.