Cheaper Monitoring of Millions of AWS Workloads: How the Netflix Cybersecurity Team Uses Osquery

By Carol Caley

7k microservices. 2.5 million EC2 instances and containers per day. Immutable (except not always) infrastructure. What is this? A security professional’s nightmare? Nope. It is Nabil Schear’s everyday reality as a security engineer at Netflix.

 

Watch the video below to hear him speak on the unique challenges of securing a massive and constantly shifting cloudscape, from the nightmare of Log4j to discovering TLS certificates. In this session, Nabil Schear, Staff Security Engineer at Netflix, talks about how Netflix uses osquery. Netflix operates one of the largest AWS deployments in the world to power its streaming service, studio, and other business operations. This complex deployment spans thousands of microservice and data processing applications running on a mix of EC2 instances and containers on the Titus platform.

 

Since 2019, Netflix has used osquery to understand its large environment, respond to security incidents, and unlock cost savings. Nabil explains how Netflix deployed osquery while minimizing the burden of operating it at scale. He wraps up with some examples of the breadth of challenges they have been able to solve using osquery and how they are thinking about it in the future.

 

 

Check out the other sessions from Osquery@scale, an annual event hosted by Uptycs for the osquery community. This event was held in San Francisco at the Exploratorium in September 2022. Join us at future events to learn how security leaders and practitioners from financial services, telco, SaaS, high tech, and other industries use osquery to manage security risks at scale.


 

Transcript:

Nabil: Good morning, everybody. So, my name is Nabil Schear. I've been at Netflix for a little under three years. And there, I'm a security engineer and I do work with our teams that build the platform on which the rest of Netflix operates, and help them do security. And more recently, I've been doing more risk reduction, access control assessment, and stuff like that. 

"How can you deploy osquery to get some visibility where you didn't have it in your cloud environment, and literally do as little work as you possibly can to enable that?"

And so, what I wanna tell you about today is how can you deploy osquery to get some visibility where you didn't have it in your cloud environment, and literally do as little work as you possibly can to enable that? And the reason for that is that, you know, it wasn't really entirely my job to do this. One of the problems that we had at Netflix is that it wasn't really anybody's job to do this. And so, in the Netflix culture, we have this notion of what we call bias to action, which sort of sounds like do what you want. 


So, I took that seriously and I decided to try and make this work for Netflix so we could start to answer some questions. And I wanna give you some insights into how we did this. And the reason, like I said, that it's on the cheap, is that my motivation for doing this was to try and make this work without really putting a lot of effort into it. And furthermore, to not pay attention to it operationally on a day-to-day basis and just have it do its thing. So, that's what you're gonna hear about today. 

I'll give you some of the lessons learned that we had from that experience, some of the choices that we made and how we rolled our osquery infrastructure out. In some cases, those choices have come back to bite us. I'll tell you about that. In other cases, they actually turned out to work out pretty well. So, without further ado, let's get started.

 

How Does Netflix Work?

So, before I jump into all the osquery-ing, I need to tell you a little bit about Netflix because some people don't quite have the right idea. So, how does Netflix work? So, you go sit down after a long day of listening to people like me drone on and on and on about osquery and you're like, "Let me Netflix and chill." So, you sit down, you know, you fire up your TV or you got your iPad or whatever, and you can find a title you wanna watch and you click play. 

What's gonna happen when you click play is your client device is gonna reach out to a bunch of different microservices that exist in our cloud environment hosted in AWS in one of a handful of regions. It's gonna do some various things to figure out where to route you. And then it's gonna send you to our content distribution network, which is called Open Connect, which is deployed all around the world. Really cool stuff. All that stuff about the content distribution network, I'm not gonna talk about it all. 

What I'm talking about is our AWS cloud environment. So, once it gets you steered to your CDN Appliance, it's gonna start slamming bits of Stranger Things or whatever it is that you chose to watch on your device, and you're gonna be happy. 

What a lot of people don't realize is that Netflix has evolved quite a bit over the 25 years that we've been in existence. You know, it used to be a DVD shipping company, then we turned into a streaming company, then we turned into one of the largest movie and TV production studios in the world. So, there's a whole bunch of services that also run in our AWS cloud environment that support our studio. We also have a giant pile of big data processing that we use for personalization, for predictions, for you name it. There's a bunch of that going on. 

And then finally, you know, we're a big business with, you know, tens of thousands of employees, hundreds of thousands of different partners. You know, think of all the people, the person who holds the boom mic, you know, on the set of Bridgerton. They have a user account somewhere in our systems. So, we have tons and tons of services that also support the business. And don't even get me started. We have games now. I'm gonna do a little advertising pitch here. They're actually quite cool. They're good. I like games. The Netflix games are good. Play some Knittens. It's addictive. And more are coming. 

"We have about 7,000 different microservices that are deployed in our AWS environment. When we break this down by EC2 instances and the containers that run on our container platform, which is called Titus, there's around about 2.5 to 3 million different workloads that we have in our cloud every day."

And then, more recently, if you've been following the news, we're gonna start to be an advertising business. And that's, you know, a big shift for Netflix. And it's gonna result, from my perspective as somebody who cares about our infrastructure security, in just a whole lot more microservices doing a whole lot more things that we need to keep track of. So, we have this problem, I'll tell you. 

We have about 7,000 different microservices that are deployed in our AWS environment. When we break this down by EC2 instances and the containers that run on our container platform, which is called Titus, there's around about 2.5 to 3 million different workloads that we have in our cloud every day. And so, you might be thinking, well, you know, maybe that's cool. Netflix was early to this cloud thing. We really pioneered immutable infrastructure, right? 

We deploy things using our deployment system called Spinnaker and things that exist within what we call the streaming path, namely this thing that gets Stranger Things to you. Those things deploy pretty regularly and pretty frequently, but it's not always as immutable as you might think. Some of those studio applications are getting modified in place, some of our databases can't be redeployed as frequently as we would like. And our container hosting platform does all manner of things from batch jobs to hosting microservices, to doing machine learning, you name it.

 

The Problem

And so, what we needed was a way to understand what was going on in our running cloud fleet. And we think that that's important for security. I'm a security person, so I care about security visibility. I need to know what's going on before I can secure it. And we also need this for fleet management needs. So, we had a bunch of different data sources that could help us with this, but none of them were really designed for security use cases. And furthermore, and this is the one that's a little harder to swallow, nobody was really looking at it very much. Like we had some data sources, but they were incomplete and they weren't worth looking at. 

"I'm a security person, so I care about security visibility. I need to know what's going on before I can secure it."

Constraints of the Netflix Environment

So, when I came to this problem, I was like, alright, let's come up with what the constraints are, how we actually get this done in our environment. And we basically came up with four criteria that we wanna accomplish with our instance monitoring approach. The first is actionability. And this is really near and dear to my heart. I do not like collecting piles of data that are just piles of data. And so, I would much rather collect things that I can use actionably to reduce risk. 

"I do not like collecting piles of data that are piles of data. And so, I would much rather collect things that I can use actionably to reduce risk."

So, for example, collecting data that helps me to directly reduce risk by discovering vulnerable packages deployed in our systems. Or collecting data that can be used for extremely low-noise detections that have a high signal-to-noise ratio, right? Like, things that if they fire, definitely a bad thing is going on. And then the distant third goal of collecting all this data is for forensic purposes. 

 

Don't Break Any Applications

Another big constraint that we have is performance, right? We have a bunch of different microservices. Like I said, each of those microservices is the baby of some team at Netflix, right? And so, if we come in there and take a bunch of their CPU away, or if we come in there and add a 5% to 10% CPU overhead to some of our larger microservices, that can literally cost millions of dollars. And that will make people upset. 

So, we basically had the criteria, or the goal here, that the application owners must not notice, right? This must exist almost invisibly. So, we have to be very careful about what we collect. We have a lot of stuff, like I said, 2.5 million workloads—a large, very, very large AWS deployment. Even tiny costs add up in terms of data transfer, data storage, you name it. And then finally, as a security person, you know, whenever you run these OS monitoring things, they always have to run as root because, of course, they need to collect data that only root can access. 

And then, you know, we wire that up to a control plane and now we have a botnet, right? So, what we wanted to do is to avoid actually introducing a whole lot more attack surface. We need really consistent, predictable behavior for our applications, or like I said, those 7,000 app owners whose babies we were messing with are not gonna be happy with us.

 

An Analysis of Existing Netflix Monitoring Tools

So, we took a look at what we had already. We had some capabilities to ingest some of the system logs from our environment, and those had kind of medium-to-low actionability. Performance, scale, and attack surface were good because there's no control plane; this is just statically baked in. It costs very little, it moves very little data. 

We have a bunch of really cool performance management tools. You may have heard of flame graphs. We have a tool that actually deploys out and will do performance profiling and assessment against services. Really cool stuff. But the actionability for security purposes of these tools was low. And also, we actually can't run them across our whole fleet, or it would all keel over and die. We have to select single instances out of a scaling group and assess that one to try and figure out what's going on performance-wise.
 
We also have an unnamed third-party agent that we run in our environments that are under compliance. So, you know, we process hundreds of millions of credit cards, we do billing for a lot of things. We have SOC, we have PCI just like anybody else who does business today. And we have a third-party agent there. And to be totally honest, the fact that this agent was not really providing us a lot of value was mostly because of the way we used it. It was literally checking a box for us and not really about actionably reducing risk. And then the idea of deploying that across our entire environment, not just to the things that are under compliance, would have been potentially costly and difficult to scale.

 

Security Goals & Osquery

So, what we wanted to do is: let's see if we can make osquery do this for us. And so, these are our goals, and I'm gonna tell you about how we did it. The structure of the rest of the talk, I'm gonna tell you about what we did, how we structured it, some of the constraints, how we satisfied them, and then I'll walk through a few use cases. To be totally honest, they're a little bit weird and they're kind of what I came up with at the time. But please feel free to ask me more questions about the weird and bizarre things that we can answer using osquery data.

It's kind of like a throwdown now. People are like, "I wonder how much Ruby we have." And I'm like, "Hold on. Beep boop beep boop. We have six applications using Ruby." They're like, "Six? Really? Only six? We have 7,000 applications, nobody else is using Ruby?" And so, you know, there are things like that that we've been able to do that are hard to put a value on, but it helps us when we're doing security assessments to answer questions like, "Do we need to do vulnerability management for Ruby?" 

Before we had this data, we literally would just get the smartest people who had been at Netflix the longest and be like, "How much Ruby do you think there is?" So, sorry, I digress. And this is the state of the art for Netflix, and for many environments too, for, you know, how you manage all this complexity. So, we'll talk about some of the use cases we use it for and then I'll wrap up.

 

Canonical Cloud Monitoring Architecture

Okay. So, what does a canonical cloud instance or container monitoring architecture look like? So, I threw this little picture together. We've got some instance—some OS—that's running. We're gonna stick an agent on it. It's gonna produce data that we need to ETL, and we're gonna throw that into some storage. We need to do some analysis on it, maybe streaming if it has real-time needs, and we need to do batch analysis to figure out what's going on. We need some user interfaces and we might need a control system. So, when I first drew this, I'm like, "Boy, that sounds expensive. That sounds like a bunch of services that I cannot run, that I'm not allowed to run." I'm not allowed to run real engineering things at Netflix. They don't give me that privilege, and that is a good thing. Do not do that. 

 

Architectural Approach

So, what we did is we said, "Okay. Let's look at this and figure out how we can minimize this." And so, our architectural approach is really green, right? I am up here telling you how Netflix is saving the world through recycling. We wanna reduce, reuse, and recycle, right? How many existing services can we use to make this tick? How many trade-offs can we make to minimize the operational overhead of maintaining and operating all those things?

"We wanna reduce, reuse, and recycle, right? How many existing services can we use to make this tick?"

Design Principles

So, these are some of the design principles I had. Absolutely zero new services to maintain. None. So, that means no control plane, no new UI, no methods for managing all that batch analysis, and stuff like that. We wanted to deploy this, and I'll talk more about this in a moment, as part of our base operating system image. 

We also wanted to be very careful about how much data we collect and how frequently we do it. So, for the most part, we never collect any data faster than about once an hour. And in many cases, it's much longer than that. At the time that we were doing this, which was in the 2019-2020 timeframe, we basically looked at the various event features in osquery that use the audit subsystem, and we were like, "That sounds scary. Anything that can lock up the kernel is gonna get us killed. So, no events, only scheduled queries."
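To make that concrete, here is a minimal sketch of what a scheduled-query-only osquery configuration looks like; the pack name, query, and interval are illustrative, not Netflix's actual query packs.

```json
{
  "options": {
    "disable_events": "true"
  },
  "schedule": {
    "listening_ports_snapshot": {
      "query": "SELECT pid, port, protocol, family, address FROM listening_ports;",
      "interval": 3600,
      "snapshot": true
    }
  }
}
```

With disable_events set, osquery's event publisher framework stays off entirely and everything runs as periodic snapshot queries.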

We also made the decision, based upon reusing existing services to ship the data, that we really wanted to try and avoid anything that was sensitive. So, for example, we said, "No command line args, no process environment variables, because the likelihood that we pick up a password or a credential or other sensitive information from our fleet of thousands of applications is not zero and we don't wanna mess with that." 
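As a sketch of what that looks like in a query, a process inventory can simply leave the sensitive columns out; the column selection here is illustrative rather than Netflix's exact pack.

```sql
-- Hedged sketch: snapshot running processes without cmdline or
-- environment variables, so credentials can't end up in the pipeline.
SELECT pid, name, path, uid, gid, parent, start_time
FROM processes;
```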

"We use extensions a lot. I love extensions… Extensions make osquery really powerful."

As a result of having no new services, we had no dynamic query capability. We're gonna bake the configuration in and that's what it's gonna be. We use extensions a lot. I love extensions. I've heard that a lot throughout this talk. I'm gonna say it too. Extensions make osquery really powerful. And we're gonna have one big shared data pipeline that we use for osquery across all of the different use cases we wanna use it for. 

 

Netflix Osquery Architecture

So, without further ado, this is what osquery looks like at Netflix. So, we have our EC2 instances. They run an osquery process whose data goes into a system called Keystone. Later on, feel free... Sorry, I forgot to mention this at the beginning of the talk. If you go to tinyurl.com/netflix-osquery, you can see a copy of this talk, and in the notes section, there's a bunch of links to Netflix tech blogs and other things that describe some of these systems, because I can't be like, "Well, hey, you use Keystone because Keystone is a service we built and you don't have one, and I'm sorry for that." 

But underneath, it's basically super duper fancy cool Kafka. And it was already an existing service, so we were able to wire it up pretty easily. And what we do is we send a small amount of our data into Elasticsearch, an ELK stack, basically. We also throw data into our big data warehouse, where it gets stored in Hive. And then we have ETLs that run in Spark that take that and kind of enrich it and make it a little bit more useful. And then we basically rely upon the existing set of big data tools that Netflix already has.

So, again, I feel like I'm kind of cheating here being like, "Yeah. We did this with no effort," because Netflix has built all of these services that make this easy. But what I am gonna try and tell you is that, to the extent that you can reuse existing things, you can totally do that for osquery. These don't need to be purpose-built for it. 

Last thing I'll mention is that we have a whole detection system called Snare. Again, there's a tech blog that describes how that thing works. I am not an expert in that, but we do actually siphon osquery data out of Keystone into regular Kafka, where it gets ingested into that detection system alongside things like CloudWatch and other data, so we can do detections.

 

Monitoring Containers on Titus

All right. So, that's how the basic architecture works. We also wanted to be able to do this in containers just as well as we could do this in VMs. So, we built out a system for doing this in our container platform. We have a container platform called Titus. It's kind of weird. I won't go into all of the details, but just imagine that you were trying to sort of do Kubernetes with Mesos and Docker about five years ago, and then you wanted that to look as close to EC2 as you possibly could. 

"We use one osquery process that runs on the host machine. And we basically use it to peer into all of the containers. And the nice thing about this is that we don't actually have to have anything inside of the container in order to monitor it."

So, we have very big, thick containers. They have their own network interfaces, they have security groups attached to them. They're very much like VMs, but we have a container platform that can orchestrate and operate them. But the thing you need to know is that each of those runs on top of a real host, an actual EC2 instance, which is the container host. These are like the kubelets in Kubernetes land. And so, what we do is we use one osquery process that runs on the host machine. And we basically use it to peer into all of the containers. And the nice thing about this is that we don't actually have to have anything inside of the container in order to monitor it because we use, basically, namespace trickery in the Linux kernel to allow us to do this. 

So, osquery natively has some support for this. If you ever see a table that has this very weirdly named column, and I don't fault anyone who named it because I can't come up with a better name, pid_with_namespace. If you see that as one of the columns in a table in the schema, what it actually means is that you can hand it a process ID which is associated with the container namespace, and osquery will jump into that mount namespace and do whatever it is that table was going to do. It's actually a parameter, it's not something that you get back. 
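A minimal sketch of that pattern, assuming a hypothetical container init PID of 12345 (in practice the container runtime or their extension would supply it):

```sql
-- Hedged sketch: 12345 is a placeholder for a container's init PID.
-- osquery enters that process's mount namespace and lists the Debian
-- packages inside the container, with nothing installed in the container.
SELECT name, version
FROM deb_packages
WHERE pid_with_namespace = 12345;
```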

There were a bunch of tables that already had that feature. We leveraged those. Those only work for the file system, for things that osquery is looking for on disk. If you need to do things in other namespaces, then we had to do that by hand, basically. So, we wrote a custom Golang extension with various nsenter-style trickery to get into our containers. And the end result of this is really cool: for every single piece of data that we collect from EC2, we have exactly the same information collected for every container workload at Netflix. 

Furthermore, since we've deployed the agent on the host machines where we have full control over the life cycle and the installation of those, you get this no matter what. If you use this container platform, we're gonna monitor you.

 

Netflix Osquery in Our Ubuntu Base AMI

How do we actually get it into those EC2 instances? We bake and build a base OS image that is pretty universally adopted. You know, 98-plus percent of all of the EC2 instances use it. And we basically start with the osquery upstream OSS package. We have our own little Debian package that installs our canonical set of query packs. It configures osquery and it kind of does some trickery with systemd to keep osquery from going too nuts. So, we have belt and suspenders here. 

We do configure the watchdog in osquery, but we also configure systemd to keep an eye on osquery because we really, like I said, do not want it going nuts all of a sudden and having application owners notice or, God forbid, actually causing a change to the application's behavior. 

As I mentioned, we love extensions. We have a Golang-based extension. It has a custom logger that wires itself up to our Keystone pipelines that move the data through the world. And we also have a handful of custom tables that we've implemented, mostly for container monitoring, but a few that are just sort of Netflix-specific. And then obviously, in order to do this on the container system, we have a special set of query packs that query and monitor the containers. And then we have custom watchdog configuration for our Titus compute agents because those are really big machines, like M5 or R5 metals. You know, they're gigantic computers. And so, we wanna...we basically chill the watchdog out a little bit on those because they're so gigantic.
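For a sense of the knobs involved, watchdog tuning lives in osquery's flags file; the values below are purely illustrative, not Netflix's actual settings.

```
# osquery.flags -- hedged sketch, illustrative values only
# Watchdog profile (0 = default thresholds)
--watchdog_level=0
# Memory ceiling, in MB, before the watchdog restarts the worker
--watchdog_memory_limit=500
# Sustained CPU utilization limit for the worker
--watchdog_utilization_limit=20
# Seconds of startup grace before the watchdog starts enforcing
--watchdog_delay=120
```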

 

Log4j

So, as I mentioned, we don't have the ability to reconfigure what we do with osquery. So, we have to use what we had, right? And what we had was something sort of useful. So, we had a whole process that we developed basically overnight to discover our vulnerable universe. And that was done by things like scanners, a detection engine. 

We also have a big and complicated Java dependency analysis system that we actually built as a result of the Apache Struts thing a few years back. It was really great that we had that thing, but we augmented it with osquery. And what osquery was able to tell us was not things about Java packages, because we did not have that cool extension that Uma talked about yesterday. But what we did have is the ability to look at Debian packages. 

So, we had hot-fixed our JRE packages to put in the property fix. Remember the property fix? Wasn't that just a beautiful thing? And then it turned out to be insufficient. You see the pain in me from Log4j? And so, we were able to rapidly discover who had deployed that hot fix, but then, of course, you know, by Monday or Tuesday, I forget when it actually was, you know, that ceased to be sufficient and so that went away. 
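The package check itself is about as plain as osquery gets; a hedged sketch, where the package name pattern and hot-fix version string are made-up placeholders:

```sql
-- Hedged sketch: find hosts where the hot-fixed JRE package has landed.
-- The name pattern and version string are hypothetical placeholders.
SELECT name, version
FROM deb_packages
WHERE name LIKE '%jre%'
  AND version LIKE '%hotfix%';
```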

The really enduring thing that we used osquery for, and that actually allowed us to find things that we would not have found with any of these other methods, was to discover non-standard Java runtime environments. The vast majority of Java applications that run at Netflix, of which there are thousands (we are a big Java house), run in our base operating system with our standard set of JRE tools and frameworks and so on. But there were a bunch of just random weird things like Sumo Logic and, you know, things that just had their own bundled JRE. And we used osquery to find those, and then we would page those app owners and be like, "What is this?" 
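A hedged sketch of the kind of query that surfaces those bundled runtimes; the "standard" JVM path filter is an assumption for illustration, not Netflix's actual layout:

```sql
-- Hedged sketch: running Java processes whose binary does not live under
-- the standard base-image JVM path (path shown here is hypothetical).
SELECT DISTINCT path
FROM processes
WHERE name = 'java'
  AND path NOT LIKE '/usr/lib/jvm/%';
```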

"The really enduring thing that we used osquery for and actually allowed us to find things that we would not have found in any of these other methods, was to discover non-standard Java runtime environments."

On at least a handful of occasions, people said, "Oh, forgot about that thing. We need to patch that too. It's a vendor product," and so on and so forth. And we actually found instantiations of that in some of our most critical services. So, it definitely provided value. Even though, you know, we didn't have the ability to reconfigure it, it still provided us a better view of the overall universe. And the end result, and I could give an entire talk about this and I know I'm droning on and on, is that we were able to remediate all of our thousands of applications and tens of thousands of clusters and ASGs in less than seven days. And osquery was a big part of helping us to do that. I will say that that was not a pleasant seven days for anyone in Netflix engineering. And, you know, we thank them deeply for this, and I think it was really great and successful. 

 

Discovering TLS Certificates 

TLS. This is near and dear to my heart. We have a lot of TLS certificates. We use them for internal application identities, which we use for mutual TLS. We have publicly trusted certs that we use both on our edge and internally for internal tools. We also have certs that we use to talk to partners and that partners use to talk to us. We have several big systems that manage and issue these certs. 

One of the things that we didn't know is where are all these certs? Because in some cases, you can literally go to a web UI and be like, "I'd like a cert," put some information in about it, and it’ll be like, "Here you go." And then those people take those certs and they go God knows where with them. And so, what we wanted to know is: where do they go? 

So we used osquery. It was simple—we find the listening ports, curl them for TLS certificates, and send that back through our data pipeline. This was very simple, very easy to add. And we were able to discover approximately 700,000 different certs every day as a result of this process. And we are able to see where we're using mTLS and how much of that has been adopted. We also use this to notify people when their public-facing certs are about to expire and it's gonna cause problems, because, again, we can't notify them if we don't know where they put them. 
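A rough sketch of that pattern using osquery's built-in listening_ports and curl_certificate tables; the talk doesn't say whether Netflix uses these exact tables or their extension, so treat this as the general idea rather than their query:

```sql
-- Hedged sketch: enumerate local TCP listeners and fetch the certificate
-- each one presents, if any (assumes the service answers TLS on localhost).
SELECT lp.port,
       cc.common_name,
       cc.issuer_common_name,
       cc.not_valid_after
FROM (SELECT DISTINCT port FROM listening_ports
      WHERE protocol = 6 AND port > 0) AS lp
JOIN curl_certificate AS cc
  ON cc.hostname = '127.0.0.1:' || lp.port;
```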

 

Runtime Software Inventory

Last one I'll mention, which is sort of related to some of the vulnerability management work that we've had, is that most of Netflix's notion of what's installed where is built from pre-deployment information. We bake AMIs, we build container images, and we scan those for vulnerabilities. 

"What we wanted to know is, at runtime, what is actually installed? What is the current state of the system? So, we use osquery."

But like I said, you know, immutable infrastructure is a nice dream, but it isn't always true. And so, what we wanted to know is, at runtime, what is actually installed? What is the current state of the system? So, we use osquery to list out all the Debian packages, all the Python, all the npm, and a handful of other things. And this has been really valuable because, like I said, the state of the art was asking: how different do you think what we scan pre-deployment is from what's actually running after deployment? And people would say, you know, "I think it's wildly different." Other people were like, "No. It's totally fine to look at pre-deployment data." And we really had no data to verify that. Now we do. 
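A hedged sketch of that kind of runtime inventory snapshot, using the package tables osquery ships natively (Netflix's real packs and extensions presumably cover more ecosystems than this):

```sql
-- Hedged sketch: one snapshot of what is actually installed at runtime,
-- across a few of the package ecosystems osquery has tables for.
SELECT 'deb' AS source, name, version FROM deb_packages
UNION ALL
SELECT 'python' AS source, name, version FROM python_packages
UNION ALL
SELECT 'npm' AS source, name, version FROM npm_packages;
```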

The answer to that question is nuanced, so it's not really terribly interesting to go into. But what we've done so far is that now we’re able to answer that question, which helps us to drive new efforts. A PCI audit came around and we're like, "Hey, we need proof of how we're doing patch management." And we're like, "Hey, we’ve got this history of packages installed at runtime collected from this trusted source." And so, we were able to use that as part of the evidence to pass our PCI audit.

 

Farewell & Thanks for All the Fish

So, with that, I will wrap up. And I would like to thank various folks—some of whom are here, some of whom are not—who helped to contribute to this. Gabriel, my colleague here from the base operating system team, really started all of this and has worked with me all along to make it work, along with many others. So, with that, I would be happy to take your questions. 

 

Audience Questions

Question: I was just wondering about expanding your osquery deployment into your M&A environments. Is that part of your jurisdiction and how does that work?

Nabil: Great question. I will say that, you know, historically, Netflix has not been a mergers-and-acquisitions type of company, but then like maybe about 18 months ago, we like got on a little bit of a buying spree. In some cases, we've brought them into our AWS organization so that we have a little bit more control over them. But for the most part, I believe they're completely separate. 

Question: I have two questions. The first is you made reference to making use of schedule_max_drift, and I'm wondering how much of a clock smear you're doing across all of your instances so that you're not tanking performance.

Nabil: So, I think we just set that to be relatively big because again, we're not collecting anything faster than an hour, so we really don't care. Mostly what we were finding is that, especially as I got a little nuts with the Golang extensions, we did have some queries, like on those container hosts, that take a long time. And they use a non-trivial amount of resources. And every once in a while, we were discovering the watchdog was murdering osquery and we didn't want it to. And again, the timeliness was not important for us, so we just like...we just amped up schedule_max_drift like big time, basically.
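For context, that is a one-line change in the flags file; the value here is illustrative, not Netflix's actual setting.

```
# Hedged sketch: tolerate much more drift in the query schedule
# (illustrative value; the osquery default is 60 seconds)
--schedule_max_drift=3600
```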

Question: What are you using for checking the performance of your queries? 

Nabil: I would say that the primary method that we use is that Gabriel and I think really hard. And then we're like, "Seems cool, and the watchdog and systemd will save us if anything bad goes wrong." So, yeah. We're erring on the side of: we are okay to lose visibility rather than affect performance. And again, the hard question is: is that the right approach? As a security professional, I would feel queasy about that, but given the operational constraints that we have, it's a totally reasonable thing for us to do. And furthermore, you know, if we happen to get killed on one instance in a cluster that has 6,000, it's kind of okay. We got 5,999 other ones that should be identical that we can collect data from. 

Question: After seeing Log4j, have you decided to add some Java capabilities because this keeps happening every few months? Do you see it in Java?

Nabil: In a word, no. But like I said, I think what I really want is to get even a very dumb and basic dynamic capability, because there's a decent ways we could go just by actuating osquery as it exists today differently. I would love to have that first and then look at extensions and add-ons. We have made other investments to deal with Java vulnerabilities, like with our dependency scanner and how we do vulnerability management overall. Lots of changes resulted from this, but so far, at least, the main thing that came out of this for osquery is that Netflix kind of realized, "Hey, this is super useful." This is actually helpful. It's not just this crazy thing that Nabil and Gabriel do in their spare time. And so, that was good for us. And so, there's more attention on it, more desire to own it as a real thing.

Question: In your architecture diagram, you show logging to both an Elastic instance and to Hive, I think it was, or you're probably querying from S3 with Hive, I imagine. How do you find the balance of using both of those? Do you take advantage of both the more hot storage in Elastic and the colder storage in Hive? And what's the balance like for that?

Nabil: That's a very good question. I think, to be totally honest, the reason that it's architected that way is that that's very commonly how people wire data up in Keystone, our streaming data system. These are the two most common things that people do with it, so we did it too. In the beginning, I'll say that I used the Elastic instance more because I was more familiar with it. But like I said, it's actually kind of tricky. We have a test environment and a prod environment. And then underneath that, there are region-specific Elastic clusters, and we're doing fleet-wide queries, which is really what we care about. 

Rarely am I looking at osquery data and being like, "I only want prod." I want everything. And so, that's just much easier to do now in Hive. So, we still have those elastic clusters. I actually kind of wonder if we should keep them around because they're kind of expensive. There's just a bunch of EC2 instances that host all of that and it costs a lot of money. I kind of keep it there because it's just very accessible to explore the data. You just go in there and throw a keyword in and just see if you can find things. But me personally, I have gotten more familiar with our big data tooling, which is much more powerful, much more capable. And for the most part, systems that consume the osquery data do so via either those real-time streaming things or via the big data tables. 

So, for the most part, people aren't sitting at an osquery pane of glass, right? They're just looking at how that information has traversed to other systems. Like we have an inventory system that holds information and security vulnerabilities and other data about every application at Netflix. Osquery feeds metadata into that, and it just shows up in that inventory system. So, I would say, for the most part, I think it's more Hive on the big data side, but we keep the Elastic thing around for quick queries and for exploration. It's just easier to interact with. Like, you can send anybody there and they can just look at what's available.

Question: You know, at a conference about an agent, I have to say it's reaffirming to hear “we're using an agent almost for free with no impact to production.” When what I keep hearing out in the press is like, "No, you gotta use an agentless solution." Did you look at agentless solutions of snapshotting, you know, 1 million volumes, copying them somewhere else, and running these queries? And why was that a bad idea?

Nabil: So, full transparency, we are looking at agentless approaches, especially for vulnerability management. But one of the nice things about osquery is that we do have the headroom to run another process, use a little bit of RAM, occasionally use a little bit of CPU, things like that. That's cool as long as we don't go nuts. And we also really like a bunch of the runtime information that we get: the listening ports, the process listings, the args that we now have, as well as the networking and other kinds of data. So, really, you know, if you go back to those pillars of stuff we get, we only get about half of those if we just do side-scanning or volume-based scanning.