Cloud Workloads: Rethinking Endpoint Security Approaches

Blog Author
Matt Hathaway

This may sound like common sense to developers, but securing the assets in your cloud requires you to recognize just how different cloud workloads are from user assets. While the high-level strategy is nothing new, legacy solutions cannot simply be repurposed in your cloud, because straightforward barriers stand in the way of each fundamental goal: hardening, monitoring, detection, and investigation.


System Hardening: AMIs often decay, but for different reasons

No regulation is complete without some requirement for hardening the systems in your organization because there’s no debate over its value. The security community has its share of pundits who oversimplify the ease of system hardening (“just patch on time and don’t let users be stupid”), typically neglecting the reality that users often need to change their optimally secure configuration to maximize their productivity. Pretending that end users can always operate without local administrator privileges or without installing new software, no matter the role or remote location, is like pretending everyone fully understands Apple’s terms and conditions.


Taking this same challenge to cloud assets means contending with a whole new set of exception cases. You no longer have to think about users changing their own laptop to do their job faster, but you must accept that developers are going to need to modify a production system or tweak a configuration file. When your production environment experiences a disruption, your team is going to do whatever it takes to prevent an outage, even if that means modifying base images and rolling them out without qualification. And if you think you’re immune to this, check your pulse the next time someone asks if you’re using any unpatched versions of OpenSSH (or whatever the next major vulnerable utility is).


Having tools at your disposal for an ad hoc check of your cloud assets against the defined policy isn’t sufficient. Systems that are being continuously modified throughout the day need to be continuously assessed. Legacy scanners and other tools will offer only periodic configuration assessment, so dozens of new configurations will deploy before you’re aware of a violation.
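
To make the contrast with periodic scanning concrete, here is a minimal Python sketch of assessing every configuration change as it arrives. The baseline settings, host names, and the shape of the change events are all assumptions made for illustration; they are not taken from any particular scanner or agent.

```python
# Hypothetical hardening baseline: setting name -> required value.
BASELINE = {
    "PermitRootLogin": "no",
    "PasswordAuthentication": "no",
}


def find_violations(baseline, observed):
    """Return {setting: (expected, actual)} for each setting that drifts from the baseline."""
    return {
        key: (expected, observed.get(key))
        for key, expected in baseline.items()
        if observed.get(key) != expected
    }


def assess_on_change(baseline, change_events):
    """Check each (host, observed_config) event as it arrives, not on a schedule."""
    for host, observed in change_events:
        violations = find_violations(baseline, observed)
        if violations:
            yield host, violations
```

Fed a stream of change events, `assess_on_change` reports a violation with the change that introduced it, rather than at the next scheduled scan.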


Monitoring: Cloud nodes are virtual and short-lived, so the first moments become more important

In addition to their ever-changing nature, cloud assets are created at a pace beyond that of any physical system [see related article: Infrastructure Management has Evolved]. With laptops and desktops initially provisioned by your help desk staff, you can somewhat control what security tools are initially installed for monitoring. Even when the team cuts corners in the process (more often than you can ever plan), letting some assets slip through, your asset management tools can push your EDR agent to restore balance to the universe. Having it remedied within the first week or two after provisioning means that, like when you lose your GPS signal in a tunnel, the gap in coverage doesn’t divert you from the overarching goal.


Here again, cloud assets are dramatically different: waiting for asset discovery to catch unmonitored assets within a week would mean never knowing what happens in your cloud. It’s an unacceptable waiting period when assets frequently spin up, do their job, and destroy themselves in a matter of hours. Baking a legacy EDR agent into your base machine images is going to make your team squirm, but you need system monitoring from the moment they come online. Having a minimal impact on system performance and working with your deployment tools are foundational requirements for any agent to run on production cloud assets. If you can’t reach that level of comfort in your monitoring agent, you’re never going to know what systems came online for mere minutes or what happened on them.
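
The coverage gap described here is easy to quantify. The sketch below, in Python, lists the assets that were never alive at any discovery scan. The asset names, lifetimes, and weekly scan schedule are invented for illustration.

```python
def missed_by_periodic_scan(assets, scan_times):
    """Return names of assets that were not running at any scan time.

    assets: iterable of (name, start_hour, end_hour)
    scan_times: iterable of hours at which a discovery scan ran
    """
    scans = list(scan_times)
    return [
        name
        for name, start, end in assets
        if not any(start <= t <= end for t in scans)
    ]


# Hypothetical example: weekly scans (hour 0 and hour 168) against three assets.
assets = [
    ("batch-a", 5, 8),      # lives three hours between scans
    ("web-1", 0, 400),      # long-lived, seen by both scans
    ("batch-b", 160, 170),  # happens to overlap the second scan
]
print(missed_by_periodic_scan(assets, [0, 168]))  # ['batch-a']
```

Any asset whose whole lifetime falls between two scans is invisible to discovery, which is why the agent has to be present from first boot.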


Detection: Anomalous behavior doesn’t mean the same thing in the cloud

Even when you have visibility of an asset from the moment it’s online, you need to understand “normal” behavior to detect malicious activity. Identifying compromised user assets is challenging enough that multiple categories of security products coexist with very different approaches. From email security to endpoint detection and response to deep packet inspection and egress traffic analysis, there are solutions for analyzing anything sent to, running on, or transmitted from your users’ systems. Each product in these categories has been fine-tuned to more effectively detect suspicious behavior when viewed through the lens of “would a normal user do this on their system?”


But this same lens cannot simply be refocused on a cloud environment, where the vast majority of activity occurs without direct human action, and be expected to detect compromised assets. Emails aren’t received. Browsers aren’t serving up malvertising. Very few systems should ever communicate with the open web, unless that’s their entire job. Unlike user assets, the hosts for many cloud workloads will be clones of one another… until a new base image is gradually rolled out, and the updated hosts look suspicious until they reach a critical mass and again become your clone army. You have nginx clones with constant streams of failed logins and new network connections, and then you have VMs with Chef killing and restarting processes to maintain performance. All of this would look ominous on a user’s asset, but it’s standard behavior here.
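
One way to picture the “clone army” baseline is to compare each host with its peers rather than with a notion of normal user behavior. The Python sketch below flags hosts whose set of running processes is shared by only a small fraction of the group. The host names, process names, and the 20% threshold are assumptions chosen for illustration, not a description of how any product detects anomalies.

```python
from collections import Counter


def flag_outliers(process_sets, min_share=0.2):
    """Flag hosts whose process set is rare among hosts that should be clones.

    process_sets: {host: iterable of process names}
    min_share: a process set held by a smaller fraction of hosts is flagged
    """
    frozen = {host: frozenset(procs) for host, procs in process_sets.items()}
    counts = Counter(frozen.values())
    total = len(frozen)
    return sorted(
        host for host, procs in frozen.items()
        if counts[procs] / total < min_share
    )


# Hypothetical fleet: nine identical web hosts and one running an extra tool.
fleet = {f"web-{i}": ["nginx", "chef-client"] for i in range(9)}
fleet["web-9"] = ["nginx", "chef-client", "nc"]
print(flag_outliers(fleet))  # ['web-9']
```

The same logic also reproduces the rollout effect described above: hosts on a new base image are flagged while they are a small minority and stop being flagged once their share passes the threshold.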


Investigation: There’s always a trace -- except when the asset’s gone

Once you reach a point where you can detect a security incident in your cloud, the investigation process also contrasts with standard operating procedure. Getting access to an asset under investigation, either physically or remotely, is the easiest way to understand what’s happening on it right now and what malicious tools ran on it in the past. Given appropriate access, a digital forensics team can almost always find enough traces left behind to draw conclusions. This may not be the smoothest experience for organizations with external incident response services, but at least the incident can be fully scoped.


Scoping such an incident requires completely different investigative tools in an environment where every day sees virtual machines and containers destroyed. You can no longer assume that the compromised assets are accessible when your team discovers the incident. The same traces were likely left on the asset, but it’s completely gone, with a new one in its place. Unless you’ve been regularly grabbing in-depth snapshots of everything installed and running on the system, you’re limited to the logs you collected from the ephemeral asset before it disappeared. You may be able to determine the result of the malicious activity, but the trace of the malware or hacking tool died with the asset, and not in a “mostly dead” way like Westley from The Princess Bride. All dead.


The Uptycs team recognized the importance of this historical record of transient cloud assets and designed our platform to help you fully reconstruct the configuration and running state of each asset at any point in time. You can use these historical snapshots to show an auditor which systems were securely configured on a given date, and you can look back at what operations were run on a system long since forgotten by your environment. If you want to monitor your cloud assets to verify appropriate hardening, track their activity, and piece together what happened weeks after the VM has been destroyed, you need a solution designed with the ephemeral nature of cloud workloads in mind.
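
To illustrate the general idea of point-in-time reconstruction, here is a minimal Python sketch of a snapshot history that can answer “what did this asset look like at time T?” after the asset is gone. This is not the Uptycs implementation; the class name, the asset IDs, and the snapshot contents are hypothetical.

```python
import bisect
from collections import defaultdict


class SnapshotHistory:
    """Keeps timestamped state snapshots per asset, independent of the asset's lifetime."""

    def __init__(self):
        self._times = defaultdict(list)
        self._states = defaultdict(list)

    def record(self, asset_id, timestamp, state):
        """Store a snapshot; snapshots for one asset are assumed to arrive in time order."""
        self._times[asset_id].append(timestamp)
        self._states[asset_id].append(state)

    def state_at(self, asset_id, timestamp):
        """Return the latest snapshot taken at or before timestamp, or None if there is none."""
        times = self._times.get(asset_id, [])
        idx = bisect.bisect_right(times, timestamp)
        return self._states[asset_id][idx - 1] if idx else None


# Hypothetical usage: the VM no longer exists, but its recorded state can still be queried.
history = SnapshotHistory()
history.record("vm-1", 100, {"packages": {"openssh": "old"}})
history.record("vm-1", 200, {"packages": {"openssh": "patched"}})
print(history.state_at("vm-1", 150))  # {'packages': {'openssh': 'old'}}
```

Because the history lives outside the asset, the lookup works the same whether the VM is still running or was destroyed weeks ago.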


Want to see how it works? Sign up for a trial here.

