Posts Tagged ‘SAN’

VDI and IOPS

May 5th, 2011

A few months ago I was asked to join a project that was investigating a VDI implementation for our company. I was happy to join, because I know VDI solutions can be array killers. So it’s best to join such a project in its earliest stage, to provide input and see where the pitfalls might be. The chosen VDI solution was Citrix XenServer, and apparently it wasn’t that early to join the project: most parameters were already set. It had already entered the technical phase, where we were asked what it would cost and what we needed to build a working environment. In that “early” stage it became very clear to me that we were talking about virtualizing only developer desktops. We talked a bit about it and put down some remarks about which info was missing for doing a good design.

Fast forward to two weeks ago. We got a simple question. :)

“Can you or can you not handle the IOPS if we virtualize all developer workstations, and if not, how much money do we need to put in so we can handle them. Here are the specs you can use for your design, please tell us fast.”

The given specs:
- 125 developers, all running two workstations. Effectively we’re virtualizing 250 workstations
- All static images, no linked clones
- Peak IOPS “calculated” by architects: 30
- Those peak IOPS were only to be seen during a so called bootstorm
- Provisioned size per workstation: 1x 30GB, so we’re talking around 7.5TB
- Current infrastructure must never be impacted by the VDI implementation

We leaned back in our chairs and told them 30 IOPS was way too low for a developer. Furthermore, the given provisioned size is low, but we were told that was handled by virtualizing applications.

Because the answer was needed that fast, we decided to up the number of IOPS to 100 peak and double the provisioned size. We’re running SVC, so to keep cost low and make sure our current backend controllers wouldn’t see the IOPS of the VDI solution, we chose a DS5300 containing 146GB FC disks, purely for running VDI images. Remember, no real design here, just pulling up numbers and matching them to a controller. We told management this was our initial design, but that we wanted to look into it more.
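To make the quick numbers concrete, here is a minimal Python sketch of the arithmetic behind them. The function names are mine, purely for illustration, and it assumes the worst case where every workstation peaks at the same moment, as in a bootstorm.

```python
# Back-of-the-envelope totals for the quick design described above.
DEVELOPERS = 125
WORKSTATIONS_PER_DEVELOPER = 2


def total_workstations():
    return DEVELOPERS * WORKSTATIONS_PER_DEVELOPER


def aggregate_peak_iops(iops_per_workstation):
    # Worst case: every workstation peaks at the same moment (bootstorm).
    return total_workstations() * iops_per_workstation


def provisioned_tb(gb_per_workstation):
    # Decimal terabytes (1 TB = 1000 GB), matching the "around 7.5TB" above.
    return total_workstations() * gb_per_workstation / 1000


print(aggregate_peak_iops(30))   # architects' figure: 7500 IOPS
print(aggregate_peak_iops(100))  # our bumped-up figure: 25000 IOPS
print(provisioned_tb(30))        # given size: 7.5 TB
print(provisioned_tb(60))        # doubled size: 15.0 TB
```

Going from 30 to 100 IOPS per workstation more than triples the load the backend has to absorb, which is why the per-workstation figure matters so much.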

I talked to @rootwyrm about this, and he confirmed the given specs were low. Too low. After talking some more, I decided to go back to the project and get some names of developers, so we could monitor them for a few days and then analyse real data. What we absolutely did not want was to design around specs given to the project by the vendor of the virtualization solution. We wanted real-life data from our own developers. So after a few days of logging 5 random developers, we got interesting numbers. Very interesting numbers.

Turns out 3 out of 5 developers have around 30-50 IOPS sustained. All day long. And that’s when they’re not behind their desk, but are in meetings. When they start working, IOPS shoot up, to be around 100 sustained. Serious work, compiling or debugging; 200. I’ve also seen 300-400 sustained for hours. Largest peak measured was 600. There goes the 30 IOPS we were given to work with. :)

Here’s a graphed example, the sample shown here is a single developer, during a 4 hour timespan.

Figure: Disk Transfers/Sec for a developer workstation (VDI IOPS)
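For a feel of what those measurements mean at scale, the sketch below multiplies each measured level by 250 workstations and converts the result into a rough disk count. Big assumptions here: the 180 IOPS per disk is a generic rule of thumb for a 15k RPM drive and not something we measured, RAID write penalty and controller cache are ignored, and in reality not every workstation will sit at the same level at the same time.

```python
import math

WORKSTATIONS = 250

# Per-workstation IOPS levels from our own logging, as reported above.
MEASURED = {
    "idle (in meetings), low": 30,
    "idle (in meetings), high": 50,
    "normal work": 100,
    "compiling or debugging": 200,
    "largest peak": 600,
}

# ASSUMPTION: rule-of-thumb figure for one 15k RPM disk, not a measured value.
ASSUMED_IOPS_PER_DISK = 180


def disks_needed(total_iops, iops_per_disk=ASSUMED_IOPS_PER_DISK):
    # Round up: you cannot buy a fraction of a disk.
    return math.ceil(total_iops / iops_per_disk)


for label, iops in MEASURED.items():
    total = WORKSTATIONS * iops
    print(f"{label:26s} {total:>7d} IOPS  ~{disks_needed(total)} disks")
```

Even the "in meetings" level already lands at or above the total we would have designed for with the original 30 IOPS spec.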

Same story for the given capacity specs. All developers are at least running a partition of 60GB. And they need it. It’s kinda hard to virtualize and stream local Oracle installations and Tomcat servers. ;)

I don’t even want to know how many people got into big trouble by just calculating with the specs the virtualization vendor gave them instead of measuring for themselves. It’s going to be a big problem if you’ve invested in running this for 250 users, the POC and all tests succeed, and once the production rollout starts you come to the conclusion that your arrays can’t keep up and you’ve paved a road to disaster. Then go and tell upper management you need a few bucks more. Not funny.

We’re back to the drawing board. We’re logging more users next week for analysis.
Hope you guys out there do the same as we did: never assume. Because it will make an ass out of u and me. :)

 

Categories: SAN

Brocade DCX webtools authentication problem

May 31st, 2010

Recently we had some problems with DCFM suddenly marking all virtual switches on all of our DCX directors with a nice tag of “Product status unknown”. Solving it was not hard in the end, but it took some time going through support and all. In this post I will explain what the problem looked like and what the solution was.
On day one, all virtual switches defined on one of our DCXs, and the chassis, were marked with this unknown status. Since we use the command line most of the time, were busy with other things at that time, and it didn’t disrupt operation, we didn’t look into the problem right away. The next day another director did the same, and the day after that another two followed. DCFM was constantly spitting out messages that it had security login violations. The day after the first director showed this problem, we started looking into it and found these errors in the logs:
datestamp, [FW-1342], 23858, SLOT 6 | FID id, WARNING, vFabricName, Sec Login Violation, is above high boundary(High=2, Low=1). Current value is 6 Violation(s)/minute.
datestamp, [SEC-1193], 23859, SLOT 6 | FID id, INFO, vFabricName, Security violation: Login failure attempt via HTTP. IP Addr: violating ipaddress
datestamp, [SEC-1193], 23860, SLOT 6 | FID id, INFO, vFabricName, Security violation: Login failure attempt via HTTP. IP Addr: violating ipaddress
datestamp, [SEC-1193], 23861, SLOT 6 | FID id, INFO, vFabricName, Security violation: Login failure attempt via HTTP. IP Addr: violating ipaddress
datestamp, [SEC-1193], 23862, SLOT 6 | FID id, INFO, vFabricName, Security violation: Login failure attempt via HTTP. IP Addr: violating ipaddress
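If you have a lot of these lines to wade through, a few lines of Python can tell you quickly which address is causing the login failures. This is only a sketch that assumes the message layout shown above; the function name is mine, and the address 192.0.2.10 in the sample is a documentation address standing in for the real one.

```python
import re
from collections import Counter

# Matches SEC-1193 login-failure messages in the layout shown above and
# captures the protocol and the source address.
LOGIN_FAILURE = re.compile(
    r"\[SEC-1193\].*Login failure attempt via (\w+)\. IP Addr:\s*(\S+)"
)


def count_login_failures(lines):
    """Count SEC-1193 login failures per (protocol, source address)."""
    counts = Counter()
    for line in lines:
        match = LOGIN_FAILURE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Hypothetical sample in the same layout as the log excerpt.
sample = [
    "datestamp, [FW-1342], 23858, SLOT 6 | FID id, WARNING, vFabricName, "
    "Sec Login Violation, is above high boundary(High=2, Low=1). "
    "Current value is 6 Violation(s)/minute.",
    "datestamp, [SEC-1193], 23859, SLOT 6 | FID id, INFO, vFabricName, "
    "Security violation: Login failure attempt via HTTP. IP Addr: 192.0.2.10",
    "datestamp, [SEC-1193], 23860, SLOT 6 | FID id, INFO, vFabricName, "
    "Security violation: Login failure attempt via HTTP. IP Addr: 192.0.2.10",
]

for (protocol, address), hits in count_login_failures(sample).most_common():
    print(f"{address} via {protocol}: {hits} failed logins")
```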

The violating ipaddress was the address of the DCFM server. So I fired up a browser from my workstation and connected to webtools. Webtools showed up fine with the authentication screen:

So we use the user which is defined for DCFM, and it starts authenticating:

And after a second or two we get an invalid password error:

This happens when starting Webtools from the DCFM server, from workstations, from every VLAN, and with every user we tried. We’re not using RADIUS authentication; this is normal local switch authentication. All of the users we tried work fine when logging in through SSH. My guess at that time was that the authentication between the HTTP server on the directors and the local switch database was broken due to a bug. I contacted a Brocade engineer directly; he made some calls, but no one had ever seen this strange behaviour. I logged a call with IBM support (the directors are OEM’d by IBM), and then came the hassle of sending logs and dumps and answering the standard L1 questions. They focussed purely on DCFM losing its password in the discovery setup screen. To me it was obvious why it was gone there: DCFM was told it was an invalid user, so it cleared the field. IBM L1 support, however, kept insisting this was where the problem was. After we persuaded them to dispatch the call to Brocade, things sped up a bit.

First bullet on the action plan was to upgrade our Java plugin. For FOS 6.3.0 the plugin should be at 1.6.0.13 or later. Of course that didn’t work, because it already was at 1.6.0.13. After we told them the complete story (apparently things got dropped in the conversation between IBM and Brocade), they came up with an HA failover, effectively rebooting the CPs.

So we did:

SwitchName:vFabricName:username> hashow
Local CP (Slot 6, CP0): Active, Warm Recovered
Remote CP (Slot 7, CP1): Standby, Healthy
HA enabled, Heartbeat Up, HA State synchronized
SwitchName:vFabricName:username> hafailover
Local CP (Slot 6, CP0): Active, Warm Recovered
Remote CP (Slot 7, CP1): Standby, Healthy
HA enabled, Heartbeat Up, HA State synchronized
Warning: This command is being run on a redundant control processor(CP)
system, and this operation will cause the active CP to reset.
Therefore all existing telnet sessions are required to be restarted.

Are you sure you want to fail over to the standby CP [y/n]? y
Forcing Failover ...

SwitchName:vFabricName:username> hashow
Local CP (Slot 7, CP1): Active, Warm Recovered
Remote CP (Slot 6, CP0): Non-Redundant
SwitchName:vFabricName:username> hashow
Local CP (Slot 7, CP1): Active, Warm Recovered
Remote CP (Slot 6, CP0): Standby, Healthy
HA enabled, Heartbeat Up, HA State not in sync
SwitchName:vFabricName:username> hashow
Local CP (Slot 7, CP1): Active, Warm Recovered
Remote CP (Slot 6, CP0): Standby, Healthy
HA enabled, Heartbeat Up, HA State synchronized
SwitchName:vFabricName:username>
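As the transcript shows, it takes a few hashow runs before the old active CP is back as a healthy standby and the HA state is synchronized again. If you want to check that from a script instead of by eye, something like the sketch below would do. It only looks at the text of a single hashow output as shown above; the function name is made up for this example, and how you collect the output (SSH, for instance) is left out.

```python
def ha_in_sync(hashow_output):
    """True when hashow reports a healthy standby CP and a synchronized HA state."""
    lines = [line.strip() for line in hashow_output.splitlines()]
    standby_healthy = any(
        line.startswith("Remote CP") and line.endswith("Standby, Healthy")
        for line in lines
    )
    synchronized = any("HA State synchronized" in line for line in lines)
    return standby_healthy and synchronized


# The three states seen after the failover, copied from the transcript above.
JUST_FAILED_OVER = """Local CP (Slot 7, CP1): Active, Warm Recovered
Remote CP (Slot 6, CP0): Non-Redundant"""

NOT_YET_IN_SYNC = """Local CP (Slot 7, CP1): Active, Warm Recovered
Remote CP (Slot 6, CP0): Standby, Healthy
HA enabled, Heartbeat Up, HA State not in sync"""

BACK_IN_SYNC = """Local CP (Slot 7, CP1): Active, Warm Recovered
Remote CP (Slot 6, CP0): Standby, Healthy
HA enabled, Heartbeat Up, HA State synchronized"""

for state in (JUST_FAILED_OVER, NOT_YET_IN_SYNC, BACK_IN_SYNC):
    print(ha_in_sync(state))
```

Only the last state counts as safe; until then the director is running without a synchronized standby.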

After this we tried Webtools, and it worked. DCFM picked it up immediately; without any changes it discovered the failed-over DCX. Interesting to see, however, was that in discovery setup the password was still blank for this DCX, although it was rediscovered automatically. I just filled in the password there again, and it was accepted. The field is now filled.

Problem solved, although there’s no answer from Brocade yet explaining why this happened. Expecting a note somewhere in the upcoming release notes :)

FYI:
This happened on:
Brocade DCX running FOS code 6.3.0b
DCFM 10.3.3 build 11

Categories: SAN