VDI and IOPS
A few months ago I was asked to join a project that was investigating a VDI implementation for our company. I was happy to join, because I know VDI solutions can be array killers, so it’s best to get involved in such a project at its earliest stage, to provide input and spot where the pitfalls might be. The chosen VDI solution was Citrix XenServer, and apparently I wasn’t joining that early after all: most parameters were already set. The project had already entered the technical phase, where we were asked what it would cost and what we needed to build a working environment. In that “early” stage it became very clear to me that we were talking about virtualizing only developer desktops. We talked it over a bit and wrote down which information was still missing for a good design.
Fast forward to two weeks ago. We got a simple question.
“Can you or can you not handle the IOPS if we virtualize all developer workstations, and if not, how much money do we need to put in so we can handle them. Here are the specs you can use for your design, please tell us fast.”
The given specs:
- 125 developers, each running two workstations, so effectively we’re virtualizing 250 workstations
- All static images, no linked clones
- Peak IOPS per workstation, “calculated” by architects: 30
- Those peak IOPS were only expected during a so-called bootstorm
- Provisioned size per workstation: 1x 30GB, so we’re talking around 7.5TB
- Current infrastructure must never be impacted by the VDI implementation
We leaned back in our chairs and told them 30 IOPS was way too low for a developer workstation. Furthermore, the provisioned size was low, but we were told that would be handled by virtualizing applications.
Because the answer was needed so fast, we decided to raise the peak to 100 IOPS and double the provisioned size. We’re running SVC (IBM SAN Volume Controller), so to keep costs low and make sure our current backend controllers wouldn’t see the IOPS of the VDI solution, we chose a DS5300 with 146GB FC disks, purely for running VDI images. Remember, no real design here, just pulling up numbers and matching them to a controller. We told management this was our initial design, but that we wanted to look into it further.
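To get from per-workstation IOPS to a disk count, the host-side number has to be translated into back-end disk I/Os, because RAID makes writes more expensive than reads. Below is a minimal back-of-the-envelope sketch. It assumes a 50/50 read/write mix, RAID 5 (a write penalty of 4) and roughly 180 IOPS per FC spindle. All three are placeholder figures for illustration, not measured values from this project, and the function names are hypothetical.

```python
import math


def backend_iops(frontend_iops: float, read_ratio: float, write_penalty: int) -> float:
    """Convert host-visible IOPS into disk IOPS.

    Every read costs one disk I/O; every write costs `write_penalty` disk I/Os
    (commonly quoted as 2 for RAID 10, 4 for RAID 5, 6 for RAID 6).
    """
    reads = frontend_iops * read_ratio
    writes = frontend_iops * (1.0 - read_ratio)
    return reads + writes * write_penalty


def spindles_for_iops(workstations: int, iops_per_ws: float, read_ratio: float,
                      write_penalty: int, iops_per_disk: float) -> int:
    """Disks needed to absorb the aggregate peak (controller cache is ignored)."""
    total_frontend = workstations * iops_per_ws
    return math.ceil(backend_iops(total_frontend, read_ratio, write_penalty) / iops_per_disk)


if __name__ == "__main__":
    # Placeholder assumptions: 50% reads, RAID 5, ~180 IOPS per spindle.
    for per_ws in (30, 100):
        disks = spindles_for_iops(250, per_ws, 0.5, 4, 180)
        print(per_ws, "IOPS/workstation ->", disks, "disks")
```

With those placeholders, the original 30 IOPS spec comes out at 105 disks and the revised 100 IOPS peak at 348 disks. Cache and a different read/write mix will move these numbers a lot. That is exactly why the inputs need to be measured rather than assumed.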
I talked to @rootwyrm about this, and he confirmed the given specs were low. Too low. After talking some more, I decided to go back to the project and get some developer names, so we could monitor them for a few days and then analyse real data. What we absolutely did not want was to design around specs handed to the project by the vendor of the virtualization solution. We wanted real-life data from our own developers. So after a few days of logging 5 random developers, we got interesting numbers. Very interesting numbers.
It turns out 3 out of 5 developers sustain around 30-50 IOPS. All day long. And that’s when they’re not even at their desk but in meetings. When they start working, IOPS shoot up to around 100 sustained. Serious work, compiling or debugging: 200. I’ve also seen 300-400 sustained for hours. The largest peak measured was 600. There goes the 30 IOPS we were given to work with.
Here’s a graphed example: the sample shown is a single developer over a 4-hour timespan.
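Numbers like these come from periodically sampling each workstation’s disk counters. Below is a minimal sketch of turning such a log into IOPS and then separating “sustained” from “peak”. It assumes each sample is a (timestamp in seconds, cumulative completed reads plus writes) pair, the kind of counter psutil’s `disk_io_counters()` exposes as `read_count` and `write_count`. The helper names are hypothetical, the sample data is made up, and counter wraparound is not handled.

```python
import math
from statistics import mean


def interval_iops(samples):
    """Return the IOPS for each interval between consecutive samples.

    `samples` is a list of (timestamp_seconds, cumulative_ios) tuples,
    oldest first, where cumulative_ios = completed reads + writes.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rates.append((c1 - c0) / dt)
    return rates


def percentile(values, pct):
    """Nearest-rank percentile, pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(rates):
    """Mean (typical load), p95 (busy sustained load) and max (single peak)."""
    if not rates:
        raise ValueError("need at least one interval")
    return {"mean": mean(rates), "p95": percentile(rates, 95), "max": max(rates)}


if __name__ == "__main__":
    # Hypothetical one-minute samples, not data from this post.
    log = [(0, 0), (60, 1800), (120, 7800), (180, 9600)]
    print(summarize(interval_iops(log)))
```

A high percentile such as p95 is usually a better design input than the mean, which hides the compile and debug sessions, or the single maximum, which may be a one-off. Keep the sampling interval short as well, because a long interval averages short peaks away.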
Same story for the given capacity specs. All developers are running a partition of at least 60GB, and they need it. It’s kinda hard to virtualize and stream local Oracle installations and Tomcat servers.
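The capacity side can be sanity-checked the same back-of-the-envelope way. Below is a minimal sketch. It assumes RAID 5 in 4+1 groups and treats each 146GB disk as 146GB usable. Real formatted capacity is lower, so the result is optimistic. Hot spares are not included, and the function name is hypothetical.

```python
import math


def disks_for_capacity(workstations: int, gb_per_ws: float, disk_gb: float,
                       data_disks: int, parity_disks: int) -> tuple:
    """Return (usable GB needed, total disks) when building RAID groups of
    `data_disks` + `parity_disks` drives. Ignores formatting overhead and spares."""
    needed_gb = workstations * gb_per_ws
    usable_per_group = disk_gb * data_disks
    groups = math.ceil(needed_gb / usable_per_group)
    return needed_gb, groups * (data_disks + parity_disks)


if __name__ == "__main__":
    for gb in (30, 60):  # the given spec vs. what the developers actually use
        print(gb, "GB/workstation ->", disks_for_capacity(250, gb, 146, 4, 1))
```

That gives 7,500GB across 65 disks for the 30GB spec and 15,000GB across 130 disks at 60GB per workstation. The array has to be sized for the larger of two disk counts: the one driven by capacity and the one driven by IOPS.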
I don’t even want to know how many people got into big trouble by just calculating with the specs the virtualization vendor gave them instead of measuring for themselves. It’s going to be a big problem if you invest in running this for 250 users, the POC and all tests succeed, and then, once the production rollout starts, you find that your arrays can’t keep up and you’ve paved a road to disaster. Then go and tell upper management you need a few bucks more. Not funny.
We’re back to the drawing board. We’ll be logging more users for analysis next week.
Hope you guys out there do the same as we did: never assume. Because it will make an ass out of u and me.