Episode 49: Automation Infrastructure with Jim VanderMey and Allen Derusha – Part 2


In this episode:

In the conclusion of our two-part episode on automation infrastructure, CIO Jim VanderMey and Sr. Engineer Allen Derusha talk about interoperability between systems. The two chat about Redfish, which allows for a unified, vendor-neutral approach to managing server hardware. Learn more about the advantages of automation. If you missed part one, here is a link to catch up. Enjoy this episode!


This podcast content was created prior to our rebrand and may contain references to our previous name (OST) and brand elements. Although our brand has changed, the information shared continues to be relevant and valuable.


Episode Transcript

Kiran: Welcome to today’s episode of 10,000 Feet, the OST podcast. We are back with part two of our two-part conversation on automation infrastructure, and we’re happy to have OST CIO Jim VanderMey back, as well as senior engineer Allen Derusha. Enjoy the conclusion to this two-part series.

Jim: So the center of our industry, I’ll say the intellectual heft, has shifted from the traditional hardware manufacturers to the hyperscale public cloud providers. And so the thinking around server administration, the “infrastructure as code” mindset from the cloud, is now being brought into the data center. And there’s an old group called the Desktop Management Task Force, DMTF. They started off by asking, how are we going to control desktops? And now you introduced me to…


Allen: Redfish yeah. So we’ve got these, you know, we have this tooling, right. And I’m going to put some names out here. Right? So Ansible and Terraform, I think are the ones that–

Jim: I love Terraform.

Allen: The two that I run into most. There are others, right?

Jim: And those are the two that we’ve adopted here at OST.

Allen: Different layers of the stack, you know? So you might have Chef, Cobbler, or Salt, or whatever, you know, there’s a bunch of these out there. And I don’t mean by exclusion to say that those are not valid solutions, but my world has largely focused on Ansible and Terraform. So I’m gonna use those as our examples today. And I also think both Dell and HP have focused on those two tools as well.

So you’re going to find, you know, the support from the major x86 vendors is also focused on that stack. So we’ve got this problem, and HP has not a lot of incentive to solve it for Dell, and Dell doesn’t have a lot of incentive to solve it for HP. However, there is this common ground, where Dell and HP and Lenovo and a bunch of other people that make desktop systems had to implement some technology to allow those systems to be managed.

And for whatever reason, they decided to actually make it interoperable. And so this thing is happening also on the server end. So today we’ve got these HP-specific tools that do HP things, and we’ve got these Dell-specific tools that do Dell things. All of those systems have a notion of what’s called an out-of-band management interface, OOB. In HP we call it iLO, in Dell we call it iDRAC, IBM and now Lenovo have IMM, Cisco UCS, I don’t remember the name of that one, but they’re all the same thing. Right. You know, it’s a way to attach to that system and perform some management activities around the configuration of that platform that do not require an operating system to be running on that system.

So if I plug in a modern Dell server, there’ll be at least two ethernet ports on the back. One of them is going to be the ethernet port we all think of, the one I’d assign an IP address to for Windows or Linux, whatever OS is running on there. And the other one is a separate processor with a separate operating system running on it. It’s got its own RAM, it’s got its own network port, and I can log into that interface even if the server is powered off. The power has to be plugged into it, right, but if there’s no fan spinning and the light is amber on the front indicating power off, that out-of-band management interface is still powered on and functioning and responding.
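To make that concrete, here is a minimal Python sketch of talking to that out-of-band interface over its Redfish API, assuming a hypothetical BMC address and credentials and simple HTTP Basic auth; the system ID under /redfish/v1/Systems varies by vendor. The host OS can be completely powered off and these calls still get answers.

```python
import requests

# Hypothetical BMC address and credentials; the host OS can be powered off
# and this interface will still answer, because the BMC is its own computer.
BMC = "https://192.0.2.10"
AUTH = ("admin", "password")

# Ask the Redfish service which systems it manages, then read the first one.
# verify=False only because many BMCs ship with self-signed certificates.
systems = requests.get(f"{BMC}/redfish/v1/Systems", auth=AUTH, verify=False).json()
first = systems["Members"][0]["@odata.id"]      # e.g. /redfish/v1/Systems/1
info = requests.get(f"{BMC}{first}", auth=AUTH, verify=False).json()

print(info.get("Manufacturer"), info.get("Model"))
print("PowerState:", info.get("PowerState"))    # "Off" hosts still answer here
```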

And I can do things with that. These have been around for a very long time. This concept has been around, you know, back in the Compaq days we had iLOs, and I would log into that interface in a web browser and click my way through it. Right? It was all very interactive. Click, do things once to the system, but you have this thing out there.

It’s embedded inside of the system. It has some communications to the system. It knows things about that system. And through that web interface, it can make changes to that system. What do we mean by changes to a server? Maybe it’d be useful to discuss that. So I’ve got a bunch of things I’m worried about.

All of the things, if you’ve dealt with desktop PCs at all: you boot up and you hit Delete to go into the BIOS, or F12 or whatever that key is, and those kinds of settings. What device do I want to boot from? Maybe I need to install firmware on that system. Which NICs should be active? How do I want my boot process to run? Do I want to run the legacy BIOS boot process or the newfangled UEFI boot process, or some hybrid of both? Do I want to enable boot from LAN? Do I want to enable these devices or not? And how about what power performance profile do I want to apply to my processor? All of this stuff, these things we’ve been dealing with for a long time by logging into a system, hitting Delete, changing those settings, that sort of thing.

So our configurable systems, from whatever manufacturer, have these notions; they have configuration items that exist on them. In the cloud, I don’t have to worry about it ever, because I don’t touch the hardware ever. It’s somebody else’s problem. In my data center, I have to worry about it. My vendors give me solutions, which aren’t cheap necessarily, but that will deal with those configurations.

But they also will only do it for themselves. And so I’ll buy an expensive solution from Dell to do my Dell stuff, I’ll buy an expensive solution from Cisco to do my Cisco stuff, et cetera. And none of those things align with each other. And so if I do the work in Cisco, I can’t really take that, you know, my BIOS configuration, my boot order, how I want my firmware and everything. It doesn’t apply to that Dell. So I start over from nothing in the Dell tool. Okay, great.

Boy, it would be better if I had some common way to handle this. And so, enter Redfish. Redfish is an agreed-upon standard from the DMTF that allows for a unified approach to sending REST commands to the out-of-band management interfaces on systems which have a Redfish-compliant out-of-band management controller. Almost everything you buy today from an enterprise server vendor is going to be compatible with this. I don’t know that Dell makes a thing that you can put in a data center, that’s called PowerEdge, that doesn’t have this. I don’t know that HP makes a thing that’s called ProLiant that doesn’t have this built into it.

Maybe on the very low end, like the ultra stripped-down low-cost servers, but anything our customers will buy has this capability. And so now we have a standard where, using HTTP calls from my own Python code or whatever it is that I’m running, I can authenticate to this management infrastructure that’s inside of one of these boxes. I can collect information about one of these boxes. I can use the same command to do that on an HP, on a Dell, on a Cisco, on a Lenovo. And then I can send more HTTP commands to make changes to that environment. I want to change a boot order. I want to have it boot from the network now. I want to disable insecure boot processes, you know, whatever those things might be.
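As a rough illustration of those HTTP calls, the Python sketch below authenticates with a Redfish session token, reads the system, and patches the standard Boot object to request a one-time PXE boot. The endpoint paths follow the DMTF Redfish specification, but the BMC address, credentials, and system ID are placeholders, and vendors can differ in the details.

```python
import requests

BMC = "https://192.0.2.10"   # placeholder out-of-band address

# 1. Authenticate: create a Redfish session and keep the X-Auth-Token.
resp = requests.post(
    f"{BMC}/redfish/v1/SessionService/Sessions",
    json={"UserName": "admin", "Password": "password"},
    verify=False,
)
headers = {"X-Auth-Token": resp.headers["X-Auth-Token"]}

# 2. Collect information: the same GET works on HP, Dell, Lenovo, or Cisco.
system_uri = f"{BMC}/redfish/v1/Systems/1"       # system ID varies by vendor
system = requests.get(system_uri, headers=headers, verify=False).json()
print("Current boot target:", system["Boot"].get("BootSourceOverrideTarget"))

# 3. Make a change: request a one-time network (PXE) boot on the next power-on.
requests.patch(
    system_uri,
    headers=headers,
    verify=False,
    json={"Boot": {"BootSourceOverrideEnabled": "Once",
                   "BootSourceOverrideTarget": "Pxe"}},
)
```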

Those settings aren’t all standardized. And so I may still have a situation where, if I want to change the boot order, I have a different command that I use for Dell than I would for HP. So there are vendor-specific extensions to Redfish, but I will say, to the vendors’ great credit, they’re well documented and they’re discoverable, meaning I can send a REST command that says, show me what commands you support, and it will dump that information back out, and I can then make decisions based on what it supports for my next step.
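That discoverability looks roughly like this in practice: you walk the service root and ask each resource what it exposes. This is a hedged sketch against the standard Redfish layout; the BIOS attribute names it prints are exactly the vendor-specific part that differs between a Dell and an HP.

```python
import requests

BMC = "https://192.0.2.10"
AUTH = ("admin", "password")

def get(path):
    """GET a Redfish resource and return it as a dict."""
    return requests.get(f"{BMC}{path}", auth=AUTH, verify=False).json()

# The service root lists every top-level capability this BMC exposes.
root = get("/redfish/v1/")
print("Top-level resources:", [k for k, v in root.items() if isinstance(v, dict)])

# Each system advertises the actions it supports (reset types and so on).
system = get("/redfish/v1/Systems/1")            # system ID varies by vendor
print("Supported actions:", list(system.get("Actions", {}).keys()))

# BIOS attributes are discoverable too; their names are where Dell and HP
# still differ, even under the common Redfish umbrella.
bios = get(system["Bios"]["@odata.id"])
for name in sorted(bios.get("Attributes", {}))[:10]:
    print(name, "=", bios["Attributes"][name])
```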

Jim: So we’ve obviously gotten into a lot of detail, but I think you just made a comment in passing that is quite important, because you mentioned eliminating insecure boot options. Which is, I think, an important part of why organizations should be thinking this way, because security risk, configuration risk, and variation in configuration are where the new risk profiles in modern security issues are coming from.

Allen: Absolutely. I think Spectre and Meltdown were a great early example of this. So one of the first– when Spectre was first announced, I don’t know if you remember, but it was a mess. We had SSH credentials just leaking out to the internet for anybody who asked for them. I mean, it was bad news, and the initial mitigation was to go into your BIOS and disable certain functionality, right?

Like, I could turn off VT and some of these other functions that would not prevent it, but would help mitigate it. And the problem was that nobody had a way to go tweak a BIOS setting across 10,000 systems. It just never mattered to them that much up until–

Jim: It did.

Allen: Up until it did, that one day. Right. Where Heartbleed came out– I’m sorry, I’m mixing up the SSH one, that was Heartbleed. But Spectre and Meltdown came out and it was drastic: we have to react right now, and there is no tooling to go make changes to hardware settings on my platform. So I think that was an early case, you know, a security incident response where the nature of security incidents today is no longer purely software driven. There are attacks which rely on potentially unexpected behavior of the hardware. Whether these are cache attacks, whether these are DDR Rowhammer attacks, right, there’s a lot of things that are just around the way that the hardware physically behaves.

In the case of the memory attacks, it’s actually how it physically behaves, like how electrons move around. So in those situations, you can’t just patch the operating system and expect to have immediate resolution. Like, we’ve got to touch the hardware in some way. And so that’s where this tooling really becomes important.

Jim: And so the ability to turn features off and on at the hardware layer, to manage BIOS settings. And here’s a side effect that I’ve seen in my own experience: by doing this through an automation stack, you are also creating, through logging and through the exhaust of these processes, documentation that the process was done. Because you love writing documentation.

Allen: Like every other engineer out there, right?

So like, okay, what does this get us? So, we’ve done this process. All right, I’ve created some sort of Ansible or Terraform or whatever process that, you know, HP has good modules for Ansible. Dell has good modules for Ansible, so I can go into Ansible and I can say, here’s how I configure a system. And then I will, you know, there are still going to be things that are Dell specific or HP specific. They’re always trying to differentiate themselves from the competition. And so there’s going to be things that only apply to that platform. Okay. So I’ve got this abstraction layer that allows me to configure the system at a hardware level, in a way that will work across multiple systems, or at least give me a way to deal with multiple systems.
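In code, that abstraction layer might look something like the sketch below: one standardized Redfish path for the settings every vendor exposes, plus a small branch keyed on the Manufacturer field for the knobs that stay vendor-specific. The attribute names and the pending-settings path are illustrative assumptions, not exact vendor values.

```python
import requests

def configure(bmc, auth):
    """Apply a baseline configuration to any Redfish-capable server."""
    sys_uri = f"{bmc}/redfish/v1/Systems/1"      # system ID varies by vendor
    system = requests.get(sys_uri, auth=auth, verify=False).json()

    # Common ground: the Boot object is standardized across vendors.
    requests.patch(sys_uri, auth=auth, verify=False,
                   json={"Boot": {"BootSourceOverrideEnabled": "Continuous",
                                  "BootSourceOverrideTarget": "Pxe"}})

    # Vendor-specific BIOS attribute names still differ, so branch here.
    # The attribute names below are illustrative placeholders, and the
    # pending-settings path can vary by platform.
    pending = f"{sys_uri}/Bios/Settings"
    vendor = system.get("Manufacturer", "")
    if "Dell" in vendor:
        attrs = {"ExampleDellPowerProfile": "Performance"}
    elif "HP" in vendor:
        attrs = {"ExampleHpePowerProfile": "MaxPerformance"}
    else:
        attrs = {}
    if attrs:
        requests.patch(pending, auth=auth, verify=False,
                       json={"Attributes": attrs})

# configure("https://192.0.2.10", ("admin", "password"))
```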

I can do that at low to no cost. I can do that at a cost which is not impacted by the number of systems I’m touching. And so I can do this a hundred thousand times at the same price I can do it 10 times, which is a big deal for a lot of our customers. So then what? You know, why do I bother doing all of these things?

So, Jim, you just brought up a very good one. So, you know, a deployment, anyone who operates a hundred thousand x86 systems is very regularly replacing a lot of them. So if we divide a hundred thousand by maybe a five-year life cycle, say it’s a–

Jim: 20,000 servers a year.

Allen: 20,000 systems are going to be rolling in and out of that data center.

And that’s just in that data center, or rather those data centers, spread across the world, right, that estate. But that’s a lot, right? That’s a huge estate, and that churn is happening all the time. Right. And at that point, there is no way you are going to economically be able to click your way through that sort of setup.

And if you did, you know, your guarantees of that being right at the end of the day all fly out the window. Okay. So we implement some of this automation. What’s the first thing that happens? One, we turn our life cycle between receiving that piece of expensive hardware and being able to make productive use of that expensive hardware from potentially months into minutes, if you execute well on this. So right away, I am going to get back– you know, if that server is a five-year life cycle,

Jim: 60 months. If it’s a 60 month life cycle.

Allen: Yes, a 60-month life cycle, I can maybe claw back one or two additional months of functional use. Okay, so not a big deal, but, you know, 4 or 5% extra life that you could get out of this by being able to turn it on quicker. And I can do more with fewer people, so I can speed up this whole operation.

I’ll still have to have people; at the end of the day, someone physically is opening a box and stacking these things and plugging them in and getting sweaty. So, you know, yeah, those guys and girls still exist, right? And they’re going to do that thing and they’re going to plug that thing in. Right. And so this process then takes over.

It takes over all of the configuration: the logging into that system, the, you know, typing in stuff, waiting for it to boot, which takes a while. Done by hand, that becomes a very manual, step-by-step, non-parallelized workflow.

Jim: So we move a serial workflow into a parallel workflow.

Allen: A parallel workflow, because that person– like, them unboxing servers is a serial process for that one individual, but I can parallelize it by putting more individuals on the job easily, and then I can get them all, you know, up and running, or at least just powered on, and then this next step takes over. Okay. So this next step has already saved me a bunch of engineer time: one, because it’s parallelizable, and two, because there’s no engineer involved.

The output of that is that log you mentioned. Right? So once that process completes, I have two things. One, I have an expectation that it’s configured the way I desire. And two, I’ve got a record of that happening. And so if there’s any reason to believe that didn’t happen, well, I’ve got a trace of what went wrong and I can go back and try it again or resolve that problem.
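One cheap way to get that record, continuing the earlier Python sketches: capture what each Redfish call returned and append it to a structured log file, so the trail of what was configured, when, and whether it succeeded exists without anyone writing documentation by hand. The file path and fields here are arbitrary choices.

```python
import json
import time

def log_step(node, action, response, path="provisioning_audit.jsonl"):
    """Append one structured record per configuration step."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "node": node,                     # which BMC we talked to
        "action": action,                 # e.g. "set boot override to Pxe"
        "status_code": response.status_code,
        "ok": response.ok,
    }
    with open(path, "a") as fh:
        fh.write(json.dumps(record) + "\n")

# Used alongside the earlier sketches:
#   resp = requests.patch(system_uri, ...)
#   log_step("192.0.2.10", "set boot override to Pxe", resp)
```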

Jim: So you’ve just described how we’ve optimized the capital life cycle, because we’ve improved the amount of usable life of that asset in the infrastructure, because I have automation: it gets to the floor through automation. Secondly, I’ve created a parallel work structure, so I’ve optimized my labor.

And now you just described how you were optimizing the outcome, because we’re creating a more predictable, reliable, repeatable outcome. And in some respects, we’re also optimizing the work of the engineering team, because we’re not forcing them into highly repetitive processes of clicking and documenting, and documenting.

Allen: Yeah. I don’t care who you are. I’ve not met an engineer that loves doing a thing a hundred times. Right?

Jim: I have.

Allen: Okay. They might exist.

Jim: They might exist, but that becomes a reason for existence. And that’s a problem from a human standpoint is that we have to recognize that the skill sets are changing and the mindsets are changing.

So there’s a competency issue and an approach issue that we have to resolve.

Allen: So we have this outcome, right? So now I’ve got my HP systems and my Dell systems and my Lenovos and Ciscos and what have you, all able to be received, stood up, and made available to my consumers quickly and repeatedly.

Okay, what then? So, you know, the first advantage is I can allow myself to purchase from multiple vendors yet still realize the benefits of a streamlined automation cycle, which historically wasn’t true. Right. So I bring these cloud-like applications into my environment and I can leverage them to create this capability, which allows me to go multi-vendor in terms of the initial implementation and deployment.

So then, okay, trucks show up, systems get unboxed, things get shoved in racks, the process runs, the outcome happens. Okay, now I’ve got a system up and running in my environment. What then? You know, what’s going to happen? Well, we’ve got this day-two thing too. We mentioned security incidents, and security incidents can come in a lot of different formats, right? It might be an OS thing, and I think that’s a whole separate conversation; patching systems is well-traveled water, something everybody has to deal with. But then there are these things where, okay, we need to push out a firmware update.

It might just be a functionality thing, like we’re having servers blow up because of some bug in a NIC firmware or something. Happens all the time. It might be a security incident. It might just be a routine process. I will say, a lot of my customers, sadly, to this day don’t have a very good method for how they update firmware after the installation date, and the vendors ain’t helping us.

Like, they will give us all the tools, but none of them are free, right? When you start looking at firmware updates, you start getting into, okay, $500 per system or more. Times a hundred thousand, you’re talking huge spends just to be able to get this functionality on day two. How do I deal with this sort of reality?

Once again, enter our good friends at the DMTF and the Redfish specification, which not only allows me to do this day-one configuration: okay, I need it to boot this way, and I want you to enable this NIC but disable this other NIC, and when you boot from this NIC I want you to set this VLAN, all those, you know, settings. But I can also do firmware updates through this interface.

So part of the Redfish specification, in addition to saying, okay, show me what you have from a hardware inventory and what the MAC address of your NIC is, you know, details like that. I can also say, show me all your configurable software elements. And remember, firmware is just software. Your NIC is running software, your hard drive is running software, all the rest of it.

So it’ll show me all the NICs, all the power supplies, the system boards, everything that’s got firmware on it. Show me the version of it. And then it also gives me the ability to upload new versions of that firmware, attempt to apply them at boot time, or next boot time, or right now, and then give me an output as to how that process went.
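A hedged sketch of that flow against the standard Redfish UpdateService: list the firmware inventory, then ask the BMC to pull and apply an image via the SimpleUpdate action. The image URL is a placeholder, and whether the update applies immediately or at next reboot depends on the vendor and the component.

```python
import requests

BMC = "https://192.0.2.10"
AUTH = ("admin", "password")

# 1. Inventory: every flashable component (NICs, drives, power supplies,
#    system board) shows up here with its current firmware version.
inv = requests.get(f"{BMC}/redfish/v1/UpdateService/FirmwareInventory",
                   auth=AUTH, verify=False).json()
for member in inv["Members"]:
    item = requests.get(f"{BMC}{member['@odata.id']}",
                        auth=AUTH, verify=False).json()
    print(item.get("Name"), item.get("Version"))

# 2. Update: point the BMC at a firmware image it can fetch on its own.
#    The image URL is a placeholder; when the update actually applies
#    (immediately or at next reboot) depends on the vendor and component.
requests.post(
    f"{BMC}/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate",
    auth=AUTH, verify=False,
    json={"ImageURI": "http://repo.example.internal/firmware/nic-21.80.9.bin",
          "TransferProtocol": "HTTP"},
)
```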

Jim: So telling this particular story just reminds me that we tend to oversimplify what a server is, what a compute node is. It’s actually a system of systems. It is network, memory, power management, compute, storage, and all of those components now have discrete software on them, which we call firmware. And you and I have both had situations where we’ve gotten shipments of devices from manufacturers and, within the same shipment, have had variation between the firmware on NICs or power supply features or some type of variation; it might be at the hardware level or the software level. And because variation is the enemy of reliability, being able to do this in an automated fashion reduces the variation in the estate and drives to better outcomes and predictable ROI.

Because how many times do we have situations where we’ve tested something in development in the sandbox and it works just fine, then we put it in production and it fails. So I think that what we’re talking about is fundamentally changing how we make data centers reliable.

Allen: Yes. Yeah. Right. So it’s like this configuration alignment. I think you bring up a really important point here. Right? So we’re talking in terms of firmware updates, or I’ve been talking in terms of firmware updates, but I will say, for the customer I’m working with right now, updates really aren’t a problem. Many times it’s downgrades that we’re trying to accomplish, and those are difficult.

Jim: Just a second, downgrades?

Allen: Downgrades. Why on earth would you ever downgrade firmware? This sounds like a crazy thing. And if you go, you know, HP has some great tools for automating firmware upgrades. You can just stick a USB drive into a server, turn it on, and it will do all the things automatically. Except it will never downgrade; it’s upgrade only. Dell’s tools are the exact same way.

Jim: So it’s a one-way street.

Allen: It’s a one-way street because that’s what we’re usually trying to do: we are trying to upgrade firmware. Well, not always. Not if I’m in an environment where application performance is extremely sensitive, where that platform is responsible for billions of dollars of revenue in a given calendar year.

Jim: And a NIC is underperforming because of a firmware update.

Allen: Or we just don’t know, but there’s some difference in performance, like you said, between when we tested it in development and then rolled it into production, and we’re seeing a delta: it performed like this over here, and like that over there. In that environment, performance matters, but stability is what matters more than anything else. If there are security threats we need to guard against, we can do it at the perimeter; we have other ways of handling it. But it is way more important for us that all of the nodes in this one system are performing exactly the same and operating in lockstep, as much as possible, than it is that all or some of them are running newer code. New is not really what we’re targeting. Same is what we’re targeting.

Jim: So you introduce a new server into a cluster, into an environment.

Allen: Just shiny out of the box. Yeah.

Jim: And you actually want to– and so then you want to downgrade the firmware so that it’s at the same release level as the existing servers in that particular stack for that particular application.

Allen: Exactly. Right. And then at some point, you know, we’ll go through testing on newer firmware. We’ll roll that in, you know, blue-green deployments, all that stuff. But until that process happens, I need to downgrade a lot of my systems. And if you go to the tooling from the major vendors, you’ll find that, like, they will do a downgrade, but you’ve got to do a bunch of stuff.

And you’ve got to do a confirmation, whatever; all the automatic processing is upgrade only, it’s one direction. So now, if I get myself out of what HP thinks I should be doing and what Dell thinks I should be doing, and instead put myself in control, where I have my own Ansible or Terraform or whatever, right, my own platform, man, the gloves come off. I have a lot more things available to me, and I can do stuff even when Dell says, oh, you don’t want to downgrade. Well, maybe that’s true for most of your customers, but we’ve got our own needs. Don’t tell us what to do. We have to do this. Right.

Now I’ve got tooling where I don’t have to depend on the default options provided to me by the vendor. I can be declarative as to upgrades and downgrades. And then, and here’s the big deal, I can go back and audit all of that stuff and make sure that my configuration is locked down across my environment, regardless of the vendor.
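The audit half of that is straightforward once everything speaks Redfish. Here is a sketch that compares each node's firmware inventory against a desired baseline and reports drift in either direction, upgrade or downgrade needed, regardless of the logo on the front; the baseline versions and component names are made up for illustration.

```python
import requests

# Versions we have qualified for this cluster (illustrative values only).
BASELINE = {"Network Adapter": "21.80.9", "BIOS": "2.14.1"}
NODES = ["https://192.0.2.10", "https://192.0.2.11"]   # BMC addresses
AUTH = ("admin", "password")

def firmware_versions(bmc):
    """Return {component name: version} from the Redfish firmware inventory."""
    inv = requests.get(f"{bmc}/redfish/v1/UpdateService/FirmwareInventory",
                       auth=AUTH, verify=False).json()
    versions = {}
    for member in inv["Members"]:
        item = requests.get(f"{bmc}{member['@odata.id']}",
                            auth=AUTH, verify=False).json()
        versions[item.get("Name", "")] = item.get("Version", "")
    return versions

# Same check on every node, regardless of vendor: flag anything off-baseline,
# whether it is older (needs an upgrade) or newer (needs a downgrade).
for bmc in NODES:
    found = firmware_versions(bmc)
    for component, wanted in BASELINE.items():
        actual = next((v for n, v in found.items() if component in n), None)
        if actual != wanted:
            print(f"{bmc}: {component} is {actual}, want {wanted}")
```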

Jim: So we’re not just talking about upgrading firmware, we’re talking about managing firmware. And people listening might think, well, I don’t have 10,000 servers, so that’s not an issue for me. But I ran into a situation at one of our clients where, in a deployment of one, it made a difference, because it was an FDA-regulated device that was attached to a specific modality in a hospital environment, and the version of firmware made a difference.

Allen: Yeah. That’s what they tested, that’s what’s approved, and you don’t change it.

Jim: And because it’s a hardware-level interface there, the interaction patterns had to be very well understood. So in units of one it has relevance, and in units of thousands it has relevance. And that ability to say that we create predictable outcomes through automation, and that continuum from managing firmware, to managing configuration at the BIOS level, to managing configuration of the physical server, to managing deployments of the operating system, to rolling out the application: the software-defined data center is all of those pieces working together as a system. So, I feel like we’ve been talking for a long time.

Allen: And so what have we accomplished? Right. So we went back through, you know, all this great stuff that cloud brought to us from a capabilities perspective, and they did it all through an API. I mean, you can log into Azure and start clicking on things, but for the most part the intention was always: you don’t care about the firmware or whatever else on the storage, you’re just getting storage.

And when you give us an API, we’ll give you storage back. The vendors caught on to this, and so then we got some of those capabilities. They gave us APIs and they gave us some management tools. Well, they didn’t give them to us, they sold us management tools. Right. But they still had all these gaps, and they still had this very vendor-specific stuff going on.

And then we went to the cloud guys and said, hey, your tools look great, and started adding, you know, local, on-premises management capabilities, for the infrastructure that you own, to those existing cloud-scale tools. And the outcome of it is that I am better able to manage my global estate, or even my one data center, in a way that more closely tracks the benefits that we’ve realized from the way that we manage cloud workloads.

And I think, to me, that’s the takeaway from this whole capability: I get my on-premises investment to function in a way that takes lessons from the good parts that we’ve learned from cloud, and then rolls those back into the data center so that I can manage it in a similar way, with all the benefits that we saw in the cloud, but using my own hardware, you know, my own security, my own latency profiles, and all the rest of it.

I don’t know if there’s going to be a day when all data centers are cloud. Maybe there will, maybe there won’t, who knows. I suspect there are probably going to be on-premises data centers for a while, but it’s best if we learn from each other. And so, you know, I could not be more excited to see the talent, the time, the experience, and just the money that’s pouring into tools like, you know, Ansible, owned by Red Hat, which is owned by IBM now. Right? So we’ve got these billion-dollar companies that are pouring resources into this technology. Same with, you know, groups like Terraform and the rest of it, right? These are, you know, well-funded, really well-established tools.

Jim: That grew up in the open source community.

Allen: That grew up in the open source community and that continue to be provided to the open source community. Right. I don’t have to pay a cent for Ansible unless I, you know, want to own Tower, that sort of thing. But for just the basic configuration management capability, we’ve got all that for free. Thanks, cloud and Red Hat and the rest. Thanks, open source, for giving us a path out of this vendor lock-in that HP and Dell and Cisco were never going to take us out of.

And like, they just have zero incentive to provide this sort of tooling to us. And so, you know, I think, as anyone who manages a data center today: pay attention to your friends in the cloud. They’ve got a lot of good tools, and there’s a lot of that stuff there. If you look at Terraform, if you go to their website today, of all the logos they show on that website, not one of them is an on-premises data center infrastructure vendor.

You won’t see Cisco on that list. You do not see HP or Dell on that list. However, if you dig into the docs, you’ll see some reference to it. Right. But if you go to HP, they’ll show you all that stuff. So I will say, just, you know, for those cloud tools, they might not even tell you that their tooling works with on-premises hardware.

Talk to your server vendors, they know exactly what their systems are capable of doing. And I think, you know, we can do a much better job of marrying these environments.

Jim: And I think that the hypothesis that everything is moving to the cloud is not true, because there’s always going to be operational technology in manufacturing sites.

There are going to be life-critical systems in healthcare environments. There are going to be latency-sensitive workloads; as long as latency is a competitive advantage for making money, there are going to be people who try to optimize latency. So there’s always a place for the hyper-tuned environments. There will be fewer of them in the future than there are now, but there will always be a place for that.

And so– but taking the management practices of the cloud, as clearly defined by NIST and other sources, and bringing those into the data center, and then managing the data centers in a way that gives the business cloud-like capabilities, is fundamentally our approach to the data center. Which is why the way we manage servers is perhaps more important now than the logos on the front of the servers. And that’s what we’re telling our clients. And I know this has been a fun and sometimes wandering conversation, but this is what I wanted our clients to hear about: how our approach to data center management has evolved.

And I think that your career here at OST has been a wonderful characterization of that as you’ve moved through it, so I appreciate you taking the time to have the conversation today.

Allen: Thanks, Jim, this has been a lot of fun.

Kiran: OST, changing how the world connects together.
