Past & Future Disaster Recovery for the Power Grid

Past & Future OT Cyber Incident Response and Disaster Recovery

Welcome to the first episode of our new Energy Talks miniseries, Why Should We Talk About Incident Response? Join OMICRON cybersecurity experts Andreas Klien and Simon Rommer as they explore the critical roles of IT and OT in cyber incident response and disaster recovery alongside other experts from the power industry.

In this episode, Andreas and Simon discuss disaster recovery from an OT perspective, using a recent CrowdStrike incident as a case study. Simon also highlights key considerations for utilities when developing OT incident response and recovery processes and offers practical tips for those without established plans.

We want to hear from you! If you work at a utility, please share your experiences implementing OT response and recovery processes – what prompted you to start, and what challenges have you faced? If you would like to contribute, email us at info@omicroncybersecurity.com and if you want, we may feature your story in an upcoming episode.

Stay tuned for upcoming episodes in our miniseries Why Should We Talk About Incident Response?

Learn more about OMICRON’s approach to advanced cybersecurity for OT environments.

Listen to Our Podcast


“You need to have a good asset inventory before you can do anything else. You can't start recovering if you don't know what you need to recover.”

Simon Rommer

OT Cybersecurity Expert, OMICRON

Here Are The Key Topics from This Episode

1. Offensive vs Defensive Security: Andreas Klien and Simon Rommer discuss the importance of incident response and recovery in the power industry, sparked by a recent CrowdStrike incident. Simon shares his background and preference for defensive security.

2. Improving OT Security and Recovery in Power Grids: Andreas and Simon discuss how utilities can enhance their OT security and recovery processes, referencing frameworks like NIST and SANS. They highlight the importance of recovery and the differences between handling non-malicious incidents and real cyber-attacks.

3. Preparation for OT Cybersecurity Recovery: In this part, they emphasize the importance of preparation for OT cybersecurity recovery, citing attacks such as BlackEnergy 3 and Industroyer. They discuss the necessity of having good backups and asset inventories.

4. Key Steps for OT Security Incident Recovery: They discuss the essential steps for utilities to establish OT security incident recovery plans, emphasizing the importance of preparation and asset inventory. Simon highlights the need for configuration backups and references frameworks like NIST Cybersecurity Framework 2.0 for detailed recovery plans.

Scott: Hello everyone! This episode is the first in our new Energy Talks podcast miniseries titled “Why Should You Talk about Incident Response?” It relates to an important aspect of cybersecurity in the power industry and continues the discussions we featured in our previous miniseries, “Cybersecurity in the Power Grid, a 360 Degree View.” Your host for this new miniseries is OMICRON cybersecurity expert Andreas Klien, who has been a guest and a host in some of our previous cybersecurity episodes. So, without further delay, I hand over the microphone to your host, Andreas Klien.

Andreas: Thank you, Scott. Welcome to our new cybersecurity podcast mini-series, where we speak about the incident response steps according to SANS to explore the critical role of IT, OT and cybersecurity in the power industry. In this episode, our OT cybersecurity expert, Simon Rommer, talks about disaster recovery: how to get back on track after a cyber incident. Hi Simon.

Simon: Hi Andreas.

Andreas: The topics of how to protect against cyber-attacks and how to detect cyber-attacks are what everyone talks about. Also for OT security, obviously. It's a very common topic. However, response and recovery after cyber incidents are not covered as much as the other two, especially not in the power grid security sector. That is why we talk about incident response steps in this podcast mini-series. And triggered by the recent CrowdStrike incident, we start directly with the topic of recovery in this episode. We would like to hear from you: if you work at a utility, how did you implement response and recovery processes for OT? Would you like to share something with us so that other utilities can learn from it? Were you also affected, for example, by the CrowdStrike incident in your OT domain? We would like to learn from you and, if you allow it, share some of it here so that others can also learn from it. Please contact us on omicroncybersecurity.com and submit your ideas and inputs for this podcast. Thank you already. Simon, could you please tell us a little bit about your background in cybersecurity and in cybersecurity for the power industry?

Simon: So, my background is not quite straightforward. Initially, I studied electronic and mechanical engineering in school. We called it automation because we learned about automation systems in general, all the good stuff with HMIs and RTUs. After school, I went to study electronic engineering and then I noticed that I really like to play with computers. So I started hacking games and that's when I decided I wanted to work in cybersecurity. Combining those two fields of engineering and IT was a no-brainer for me since both of them are equally interesting to me, and that's why and where I started. After working in some security operations centers and working as an OT security consultant, I found my home with OMICRON and that's where I'm doing my good work.

Andreas: And we're happy to have you here, Simon. So, you once said that you prefer working in incident response instead of the offensive part of security. Why is that so?

Simon: Offensive security and hacking are always the cool kids on the block, if you want to say. But for me, defending comes more naturally since I'm more of a defensive guy, so to say. I don't like to destroy stuff. I want to protect stuff. And also, I think there are so many good pen testers out there and not enough good forensics and defensive personnel. So, I decided to go on the defensive side and help protect and secure infrastructure and networks.

Andreas: Okay, so you say you're a defensive guy, but you take your bicycle to ride down a mountain with rocks and all jumps and stuff?

Simon: Yeah, but with enough protection. Preparation is the most important part, which is our first lecture of today.

Andreas: Okay, so that's how these two things fit together. Let's start with After Effect. So, this episode was triggered by the CrowdStrike incident and so this triggered us to talk more about cyber incident recovery processes in the power grid. Would you like to explain what triggered you to start this episode here?

Simon: As you said, the CrowdStrike incident is the thing that triggered my thought process on this. Especially the things that I heard from colleagues and former customers of mine that I'm still in contact with. CrowdStrike is a well-known, well-regarded EDR solution. That means it's deployed on the endpoints. Endpoints in this case are Windows PCs, and Windows PCs have been and are part of OT infrastructure for quite some time now. Whether it's laptops that come with maintenance personnel, engineering PCs that directly control some part of our infrastructure, or an HMI with a touchscreen, Windows PCs are pretty much everywhere in OT and that's why it also touched OT environments. So this incident was not only an IT incident per se, it was an OT incident as well.

Andreas: Yeah, so just last week I was in a discussion with a utility and they told me that between five and six Windows-based assets is the normal number per substation for that utility.

Simon: Yeah, and this is like small to medium substations whereas like three devices for example, so it can go up to more than that.

“Safety is the primary goal, safety for your engineers. And the secondary goal is availability.”

Andreas: When utilities are starting to improve their OT security, what could they do? Where could they look if they want to get an overview on response processes and recovery processes in OT for the power grid?

Simon: The response process is also part of the NIS2 regulation. And NIS2 references the NIST Cybersecurity Framework (CSF) 2.0 or the SANS framework. These two frameworks are the most well-known frameworks for incident response, where recovery is just one part of the incident response process. I thought that we start on recovery triggered by the CrowdStrike incident, since there wasn't a real attack, quote unquote, happening. You could say it was a supply chain attack similar to SolarWinds, but in this case it was just a programming error, so to say, and the attack vector was clear, so recovery was the only real interesting part in my opinion. I talked to some colleagues and we came to the conclusion that the response part is not talked about enough. That's why we wanted to have this talk about incident response, recovery, and so on and so forth. As I said in the beginning, pen testing and hacking are cool, but the more important topic is recovery and getting back to the status quo.

Andreas: You mentioned like CrowdStrike could be compared to a supply chain attack. So is this from a recovery perspective, are there differences between CrowdStrike, how it happened, which was not malicious intent and the real supply chain attack?

Simon: Yes, the difference is that we knew what was happening and we knew where it was coming from. So the analysis was really quick and you don't have to clean up some kind of attack vectors on the supplier's side. There are no vectors that you have to clean up because you know what was happening. Especially with the CrowdStrike incident, you knew what was causing it. There was a config file that was pushed that triggered a coding mistake. And the supplier in this case also gave recommendations on how to clean up everything. So this was the part that was done for you and only the recovery part was the important or the interesting part because that varies from company to company.

Andreas: Okay, so from that perspective it's much nicer than a real cyber-attack because with a real cyber-attack you never know what else they did. And there could still be something else that you're missing when you're recovering from it.

Simon: With a real cyber-attack, as you said, you must be more thorough. The problem is that it's always a race against time, since most of the time the recovery phase can only start when the investigation is over, because recovery also means setting up systems from backups, for example, or rebuilding networks. And if you rebuild and restore from backups, then you also destroy the indicators and the leads that you have for the attack, and all the log files get deleted and such. In this case, the whole analysis phase was skipped, and we didn't need to save indicators and so on. But with a real cyber-attack, as you said, you have to investigate, and you have to be able to understand the attack and the steps that the attacker took before you can recover everything. Most of the time you can recover parts of the systems. But you never know what you're going to destroy by doing so.

Andreas: Okay, so I think that the CrowdStrike incident probably triggered a lot of people working in OT security or in power grid OT to start thinking about their Windows machines in those power plants and substations and how they would recover after a cyber incident that affected these machines. Of course, in power grid OT, there are also a lot of devices which are not based on Windows. Is that totally different?

Simon: The difference is that most of the utilities hopefully have a disaster recovery plan already in place that deals with broken devices or some kind of damage to the power grid itself. But for some or for most of the utilities, recovering from a cyber-attack is quite a new domain. So, the difference is, in general OT, for example, you have your ERP systems, you have your CNC machinery that is controlled by devices like Windows PCs. So in general OT, you can't really produce without the Windows devices and you're reliant on them. But for power utilities and for the energy sector in general, we are a niche within a niche, so we have a different set of constraints and a different set of rules that apply here. For example, the power grid can hopefully work on its own without too much intervention from a Windows PC for a certain amount of time. So, these plans are already in place, but recovering from a cyber incident, quote unquote, is a subset or a superset, however you want to see it, of the actions that you need to take. There is the original recovery plan and then you have additional things to do. For example, you have to replay the backups from the backup system. So this is something that comes on top of the blackout scenarios that have been talked about in the past few years. What is new for the engineering side of things is that they also have to deal with IT infrastructure like database servers. Why is this important? Because the database servers also hold the configuration of the IEDs and the SCADA systems and all the good stuff. And attackers like to destroy configuration files and backups as well. As we've seen in the Ukraine attacks in 2015 and 2016, for example, they destroyed the configuration files. So, if you are not able to restore the configuration files of an IED that was attacked or maliciously configured by an attacker, then you're, sorry, out of luck and you need to start from scratch. And this is something that is not in most recovery plans because you expect your configurations to be on point and you just need to swap out some devices or repair a landline or something.

Andreas: Yeah, we've seen that in these two cyber-attacks, Ukraine 2015 and 2016, that they specifically looked for configuration files for the protocol gateways and the RTUs there, and they even looked for files with the endings SCD and ICD, so the IEC 61850 project files. They looked for these files on the engineering PCs but also on the RTUs to delete them so that it would make recovery more difficult. So, part of the recovery process must obviously be to clean up your configuration files and make sure you store them in a second and third location so that you're able to recover your RTU configurations and gateway configurations.
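
The point about keeping configuration files in a second and third location can be illustrated with a minimal Python sketch. It is not taken from the episode: the function names, the directory names, and the choice of SHA-256 checksums are assumptions made for illustration only. The sketch copies every file ending in .scd or .icd from a source folder to several destinations and verifies each copy against a checksum.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

# File endings mentioned in the discussion (IEC 61850 project files).
CONFIG_SUFFIXES = {".scd", ".icd"}


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def replicate_configs(source: Path, destinations: list[Path]) -> dict[str, str]:
    """Copy every config file under `source` to each destination (hypothetical helper).

    Returns a mapping of relative file path -> SHA-256 digest, which can be
    stored separately and used later to verify the copies.
    """
    manifest: dict[str, str] = {}
    for file in sorted(source.rglob("*")):
        if not file.is_file() or file.suffix.lower() not in CONFIG_SUFFIXES:
            continue
        relative = file.relative_to(source)
        digest = sha256_of(file)
        for dest_root in destinations:
            target = dest_root / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(file, target)
            # Verify the copy before trusting it.
            if sha256_of(target) != digest:
                raise IOError(f"Copy of {relative} to {dest_root} is corrupt")
        manifest[relative.as_posix()] = digest
    return manifest


if __name__ == "__main__":
    # Temporary directories stand in for real storage locations.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "engineering_pc"
        (source / "substation_a").mkdir(parents=True)
        (source / "substation_a" / "station.scd").write_text("<SCL/>")
        (source / "notes.txt").write_text("not a config file")
        copies = [root / "second_location", root / "third_location"]
        print(replicate_configs(source, copies))
```

In practice the destinations would sit on separate systems, ideally with at least one that is unreachable from the OT network, since a copy an attacker can reach can be deleted along with the original.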

“Preparation is key. If you don’t know what you need to recover, then you can’t start recovering.”

Simon: This is something that we also mentioned in the beginning, that preparation is key. Especially when we look at the recent attacks, both with BlackEnergy 3 in 2015 and Industroyer or CrashOverride in 2016, major parts of the malicious frameworks were the destruction and deletion of config files, backups and so on, and denying access to the devices. So, having good backups and having a good asset list, so that you know what you have to look at, is really important.

Andreas: Was it in the Ukraine 2015 incident that they even bricked protocol gateways, they uploaded defective firmware so that the protocol gateway itself was bricked and then the utility replaced them with spare units from stock but they didn't have so many spare units that they were able to replace all of them. And I don't know if they could be recovered somehow with a special serial port or JTAG port or so. But from what I read, these devices were completely bricked.

Simon: That's the thing for your disaster recovery plan. In a blackout scenario, you have a limited number of IEDs or devices that are likely to break at a certain time. So you have time to replace your replacements. But with an attack you can't really know what to expect. So the infamous unknown unknowns, which is quite the favorite phrase, because you can't know what you don't know. So, with a cyber-attack you have to be prepared for so much more and for different things. And as I said, the first step in a recovery process is always the preparation. If you look at the NIST Cybersecurity Framework (CSF) 2.0 and the SANS framework, they both have preparation as the first step. If you look at legal guidelines like the NIS, you also have asset inventory. Asset inventory is always the first step because you need to know what you have and then you need to be able to recover exactly this status quo.

Andreas: Preparation is key, that's fine, but how do I prepare for somebody destroying all my protocol gateways?

Simon: This is a good question.

Andreas: I cannot keep 200 or 1000 devices in stock. It's also a bit expensive.

Simon: That's where disaster recovery plans must also keep in mind that, for example, as I've heard from customers, they can lose a substation. So, what does this mean? It means you must be prepared to shut down a substation completely. And being able to shut down a whole substation in a controlled manner is also part of recovery plans. There is nothing more dangerous than a substation that cannot be controlled anymore. I would rather shut it down in a controlled manner than have the substation impact my whole grid.

Andreas: So that would mean somebody needs to drive there and, with manual control, turn off all circuit breakers.

Simon: If that's possible, yes, that can be a scenario.

Andreas: But I think with most substations we can't afford that. We still need to be able to maintain manual control. I think in Ukraine in 2015 they were in a more fortunate situation than most utilities in the EU, because they had a lot more staff who could go visit substations and manually operate them. I think this was their main advantage, or probably one of their biggest advantages, in regaining control of their substations and of their grid. So how could utilities in the EU do this, who don't have as much staff to do manual local control?

Simon: The resource topic is always a big topic, especially with automation and so on. This is the reason why automation exists: because resources are scarce, and resources also means well-trained personnel. There is no way around this because you need people that can control and use the devices. And that's where you must also write down the personnel in your disaster recovery plans. For example, in an incident response handbook, there are private numbers or out-of-band communication channels to certain personnel. May it be some kind of key engineer or a team of key engineers or the legislator. For example, if you are part of critical infrastructure and you have to make a call to your legislator, the numbers are also in there. You can't be certain that your internal communication will work in the case of a cyber incident. For example, if you have IP communication or IP phones, I'm sure they won't work. Active Directory is the first target. And there is Active Directory in OT environments as well. So be prepared that there is not enough personnel and be prepared that you can't reach the ones you need to reach.

Andreas: So, what is so special about recovery processes for power grid cybersecurity when you compare it with normal IT recovery?

Simon: This is the interesting part in my opinion. We already talked briefly about it. With IT systems, maybe you even have full backups, or you have a config backup that you can just recover from. So with Windows PCs that's really, really easy, I would say. For Linux devices, Linux servers, it's even easier. You just have to replace the config files. For IEDs, if they don't work, they don't work. You can try to recover the firmware. As you mentioned before, sometimes that's not possible anymore. With a Windows device, if you need a replacement, you can go to any major retail store. They have hundreds of laptops that you could use as makeshift devices. If you need a specific gateway or protection relay, well, there is no protection relay store down the road from where I live, so your mileage may vary. But this is also a key difference. You have to make do with the devices that you have in place. And recovering them is important since resources are limited, and that's again why storing configuration files is so important in this case.

Andreas: Okay, so the main differences you mentioned are of course you need to replace devices and devices are difficult to replace. Even if it's a Windows based device, it can be difficult to replace because it's maybe an RTU running Windows and so on. And then second also you need the knowledge in place to do the recovery actions so skilled people to do this and so… Is it likely that IT incidents also cause problems on the OT side? Obviously with Windows hosts it's clear that it can also happen, but is there more possible also that can spread?

Simon: We've seen in the past, for example, another key difference between IT incidents and OT incidents is the impact. If you have an IT incident, the local IT is impacted, maybe the company and some suppliers. If you have an OT incident from critical infrastructure, for example, the general population will also be impacted in the worst case. The stakes are higher and that's why it's more severe.

Andreas: This also influences your recovery processes because if power supply is impacted, then your own recovery processes are also going to be impacted.

Simon: Exactly, the goals are also different. For example, in an IT incident, your information, hence the name information technology, is the biggest value, so to say. But in an OT incident, you have other goals. Safety for your engineers is the primary goal. And the secondary goal is availability. So availability of your service. And if the information about your configuration files is leaked, this is not a big issue most of the time. But if the service is interrupted and if safety is compromised, that's a real big issue in 99% of the cases.

Andreas: Of course, yeah. These are all differences which you need to consider in the recovery processes for power grid OT. But they are somehow related to the recovery processes which should be known anyway to power engineers working in this area. So how do these two plans fit together? You've got recovery processes after a blackout or after faults, which have been there for years anyway and are executed by power engineers. And then there are new recovery plans coming in from OT security. How are these two things related?

Simon: So the OT security recovery plans keep the blackout plan, so to say, in mind. It's on top, basically. What we had in the past is failure of our infrastructure, for example, or of certain devices that were replaced, repaired, and so on really, quote unquote, easily. What we have now is that we have outside factors. IT-OT convergence has been going on for quite some time. And most of the stuff is digitalized. For example, the IEC 61850 digital substation. If you have a cyber incident in a digital substation, it's harder to replace or harder to repair devices inside the substation because there is no device anymore. On the other hand, they have their pros with automation and so on. But the digitalization of our grid and our devices also needs to be considered for the blackout plan. So, this is something that goes hand in hand, and the blackout plans need to be considered for the cybersecurity plans. So, it's on top, I would say.

“Having good backups and having a good asset list, so that you know what you have to look at, is really important.”

Andreas: So, what is the most important thing that utilities need to consider for these recovery processes? What advice would you like to give them if they want to start establishing OT security incident recovery plans?

Simon: First things first is always preparation. A good asset inventory, I'm sure all our listeners have heard this before. You need to have a good asset inventory before you can do anything else. Why is that? Because if you don't know what you need to recover, then you can't start recovering. You're just shooting in the sky. So first things first, you need to have a good overview of your systems and of your network: asset inventory, network plans. Second, we also talked about the configuration backups, so you need to be able to restore in case of emergency. This also means having offsite backups and following good backup practices. Encrypting or destroying your backups is a common tactic of malicious actors. So think outside the box and have your offsite backups, or write-only or read-only backups, depending on the direction. These are the first two things. And if you have this preparation in place, then you can start looking at something like the SANS framework, where recovery plans and incident response plans and so on are described. You can also hire a consultancy. For example, OMICRON provides such services where we can help customers to set up something like this.
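
The two preparation steps described here, an asset inventory and configuration backups with offsite copies, can be checked against each other. The following Python sketch is only an illustration under stated assumptions: the `Asset` record, its field names, and the 90-day age limit are hypothetical and do not come from the episode or from any framework. It lists every inventoried asset whose backup status has a gap.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Asset:
    """One entry of a (hypothetical) asset inventory."""
    name: str
    kind: str                     # e.g. "IED", "RTU", "HMI"
    last_backup: Optional[date]   # None means no configuration backup is known
    offsite_copy: bool            # True if a copy exists outside the OT network


def backup_gaps(assets: list[Asset], today: date,
                max_age_days: int = 90) -> dict[str, list[str]]:
    """Return asset name -> list of problems found in its backup status."""
    gaps: dict[str, list[str]] = {}
    for asset in assets:
        problems = []
        if asset.last_backup is None:
            problems.append("no configuration backup")
        elif today - asset.last_backup > timedelta(days=max_age_days):
            problems.append(f"backup older than {max_age_days} days")
        if not asset.offsite_copy:
            problems.append("no offsite copy")
        if problems:
            gaps[asset.name] = problems
    return gaps


if __name__ == "__main__":
    inventory = [
        Asset("relay-01", "IED", date(2024, 8, 1), offsite_copy=True),
        Asset("rtu-01", "RTU", None, offsite_copy=False),
        Asset("hmi-01", "HMI", date(2024, 1, 1), offsite_copy=True),
    ]
    for name, problems in backup_gaps(inventory, today=date(2024, 9, 1)).items():
        print(name, "->", ", ".join(problems))
```

A report like this only works if the inventory itself is complete, which is why the inventory comes first.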

Andreas: Yeah, so that is your job when you're helping your clients to establish incident response plans. Let's imagine somebody comes to you and says, “Can you help me to get started?” How would you help them?

Simon: Normally, if somebody comes to you and asks you how to get started, it means they start from zero, right? So I would also go with asset inventory and network checks. For example, we have this solution called StationGuard. Our podcast listeners have heard of this before, and StationGuard also helps us to have a baseline of what devices are in the network. So with the help of StationGuard, we can draw a picture, so to say, or draw a full network diagram with all the communications and then we can start from the beginning. And another big part is policies that are also mandated, for example, by some regulations. It depends on whether you're under critical infrastructure regulations or you're a smaller utility that doesn't really need but wants to have something like this. The process is always the same, but the volume, how much work goes into it, differs from size to size.
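
The idea of a device baseline can be shown in a few lines of Python. This is a generic sketch and says nothing about how StationGuard works internally. The function name and the device names are hypothetical. It compares the devices seen on the network with the documented baseline and reports what is unknown and what is missing.

```python
def compare_to_baseline(baseline: set[str], observed: set[str]) -> dict[str, list[str]]:
    """Compare observed device names with the documented baseline."""
    return {
        # Seen on the network but not documented: needs investigation.
        "unknown": sorted(observed - baseline),
        # Documented but not seen: switched off, failed, or removed.
        "missing": sorted(baseline - observed),
    }


if __name__ == "__main__":
    baseline = {"relay-01", "relay-02", "gateway-01", "hmi-01"}
    observed = {"relay-01", "gateway-01", "hmi-01", "laptop-77"}
    print(compare_to_baseline(baseline, observed))
```

Both lists matter for recovery: unknown devices are not covered by any backup or plan, and missing devices may point to an inventory that is out of date.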

Andreas: Of course. It's great to have somebody here who has been working with incident response for years. So how do you think this world has changed over the past years?

Simon: If you look back to Stuxnet, for example, it was brand new. Nobody had ever thought about malware attacking OT infrastructure. But since 2015, with the Ukraine attack, where even the US government sent specialists to Ukraine to help investigate, a lot has changed. That was almost 10 years ago. Nowadays nobody asks anymore. Everybody understands that malware has the capability to send IEC 104 packets or IEC 61850 MMS packets, for example. Maybe Modbus or DNP3 or whatever. So this is common knowledge now: that malware has the capability, that attackers have the capability and the knowledge, the understanding of the devices that we use, of the infrastructure, of the grid layout and so on and so forth. So the biggest thing that changed was the capabilities of the attackers and the understanding of what the risks are with this. And I would also like to hear from our listeners, maybe the ones that have been in the industry for quite some time: how did you perceive the change? Maybe you have some differing experiences, maybe you share the same experience. I would invite you to let us know how you view this topic, and if there is interest, we plan to do a mini-series within the mini-series on incident response steps according to SANS, where I will also invite some of my former colleagues from other companies to give their point of view on certain topics.

Andreas: Yeah, that will be very interesting. And I think it's state of the art to establish incident response processes. And one source of learning could be our podcast mini-series here. Thank you so far for listening to us today. And with that I would like to hand over back to you Scott. Thank you!

Scott: Thank you, Andreas, for hosting this and upcoming episodes of your new Energy Talks podcast miniseries, “Why Should You Talk about Incident Response?” We look forward to listening to many interesting perspectives and discussions in this important area of cybersecurity. And to our audience, a big thank you for listening to this and other episodes of Energy Talks! We always welcome your questions and feedback. Please send us an email at podcast@omicronenergy.com. OMICRON has several years of experience in power system testing, data management, and cybersecurity and offers the matching solution for your application. Please join us for the next episode of Energy Talks and stay tuned for future episodes of our new miniseries, “Why Should You Talk about Incident Response?”, with Andreas Klien. Goodbye for now, everyone!
