Osquery has become a popular source of instrumentation for a wide variety of use cases. On the GitHub security showcase, it is currently among the most popular open source security projects. Given that popularity, a recurring question is: what use cases can one address with osquery in an enterprise environment?
In this blog, I’ll discuss:
- Key attributes of osquery that make it an excellent universal agent of choice for collecting telemetry for multiple use cases
- Considerations for operationalizing osquery: collecting high-volume telemetry from an endpoint and aggregating it at scale across many endpoints
- Components of a solution for osquery-powered security analytics
Osquery: A Universal open source agent
Osquery is well-documented. In this blog, I’ll touch on a few key attributes of osquery that make it a pragmatic universal open source agent. As you explore osquery further and understand the depth and breadth of the collected data, you will realize the possible use cases are limited only by your imagination.
- Osquery interfaces with the kernel (e.g., OpenBSM, kaudit, ETW) to capture kernel-level activity and events (e.g., processes launched, socket connections, file changes). This is done via event tables. The events and their attributes provide a reliable source of telemetry for detecting intrusion and malicious behavior on an endpoint. These tables lay a sound foundation for osquery-based EDR and FIM solutions.
- Osquery can scrape point-in-time OS state via scheduled queries. The queries can be scheduled to run in snapshot or differential mode (in differential mode, only changes are sent, which is more efficient). The results of scheduled queries provide a reliable source of telemetry for software asset inventory and for an inventory of the OS's configured state. This set of features and tables provides a foundation for osquery-based asset inventory and vulnerability detection.
- Over a remote TLS connection or via the CLI, one can run live/ad hoc queries. These interfaces, along with the rich set of tables, lay a sound foundation for osquery-based incident investigation as well as for debugging and diagnostics.
- Using parsing lenses (e.g., Augeas), one can parse various configuration files. These advanced capabilities provide a flexible and reliable source of data for auditing and implementing benchmarks such as CIS. They can also provide rich data for implementing supporting controls for compliance standards such as PCI, SOC2, and FedRAMP. This flexible telemetry provides a solid basis for data and evidence collection for osquery-based compliance.
- Using the Thrift extension mechanism, one can write extensions in C++, Python, Node, and Go. The extension mechanism provides a quick way to add new capabilities to a deployment using scripting (e.g., Python).
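To make the live/ad hoc query attribute more concrete, here is a minimal Python sketch of running one query through the osqueryi shell and reading back its JSON output. It assumes osqueryi is installed and on the PATH. The helper names (live_query, runner) are mine, chosen for illustration, and the query is just one example built on the stock listening_ports and processes tables.

```python
import json
import subprocess
from typing import Callable, Dict, List

# Example query: which processes are listening on which ports.
# Table and column names are from the stock osquery schema.
LISTENING_PORTS_SQL = (
    "SELECT p.pid, p.name, l.port, l.address "
    "FROM listening_ports AS l JOIN processes AS p USING (pid) "
    "WHERE l.port != 0;"
)


def _run_osqueryi(sql: str) -> str:
    """Run one query through the osqueryi shell and return its raw JSON output."""
    completed = subprocess.run(
        ["osqueryi", "--json", sql],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


def live_query(
    sql: str, runner: Callable[[str], str] = _run_osqueryi
) -> List[Dict[str, str]]:
    """Return the query result as a list of rows (column name -> string value).

    `runner` is a hypothetical injection point so the parsing logic can be
    exercised without osquery installed.
    """
    return json.loads(runner(sql))
```

Calling live_query(LISTENING_PORTS_SQL) on a host with osquery installed returns one dictionary per row. A fleet manager would issue the same SQL over the remote TLS interface rather than through a local shell.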
Operationalizing osquery: Endpoint log considerations
In any reasonably sized osquery deployment, whatever your end goal or use case, you will need a mechanism to deploy osquery at scale and manage a fleet of osquery agents. There are many fleet management solutions, and I’ll not cover that topic here. Rather, I’ll cover some key configuration considerations on the endpoint and highlight the implications of result logs. Osquery is very flexible and has many configuration options, so we’ll list the main topics and then delve a bit more into the data collection implications.
Osquery configuration management entails:
- Configuring flags and options
- Configuring the cert and secret for TLS-based remote control
- Scheduling queries
- Configuring event tables and FIM
- Choosing log destinations
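As a rough illustration of how these pieces fit together, the following Python sketch generates a flagfile and a JSON configuration. Treat it as a starting point under stated assumptions rather than a recommended setup. The file paths, query names, and intervals are hypothetical, and the TLS endpoint flags are left out because they depend on the fleet server you use.

```python
import json
from typing import Dict, List


def build_flags(tls_hostname: str, secret_path: str, cert_path: str) -> List[str]:
    """Return flagfile lines, one osquery flag per line.

    The TLS endpoint flags are server-specific and intentionally omitted.
    """
    return [
        "--config_plugin=tls",            # fetch configuration over HTTP/TLS
        "--logger_plugin=tls",            # ship result logs over HTTP/TLS
        f"--tls_hostname={tls_hostname}",
        f"--enroll_secret_path={secret_path}",
        f"--tls_server_certs={cert_path}",
        "--disable_events=false",         # allow event tables to collect data
        "--enable_file_events=true",      # turn on file events for FIM
    ]


def build_config() -> Dict[str, object]:
    """Return a config with one snapshot, one differential, and two event queries.

    Query names and intervals (in seconds) are illustrative assumptions.
    """
    return {
        "schedule": {
            "os_version_snapshot": {
                "query": "SELECT name, version, platform FROM os_version;",
                "interval": 86400,
                "snapshot": True,  # full result set on every run
            },
            "installed_packages": {
                # No "snapshot" key: only added/removed rows are logged.
                "query": "SELECT name, version FROM deb_packages;",
                "interval": 3600,
            },
            "process_events": {
                "query": "SELECT pid, path, cmdline, time FROM process_events;",
                "interval": 60,
            },
            "file_events": {
                "query": "SELECT target_path, action, sha256, time FROM file_events;",
                "interval": 60,
            },
        },
        # FIM: watch everything under /etc recursively.
        "file_paths": {"etc": ["/etc/%%"]},
    }
```

Writing "\n".join(build_flags(...)) to a flagfile and json.dumps(build_config(), indent=2) to the config source gives one flag per line and a schedule keyed by query name. With the filesystem logger instead of TLS, the same schedule would write its results to local log files.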
A popular approach to handling osquery telemetry, though one that is challenging to scale, is log forwarding. An alternative approach is to simply use HTTP/TLS for everything. At Uptycs, we have built our solution around HTTP/TLS, and we’ve built our Endpoint Detection Network (EDN) to scale out osquery deployments.
Here are a few things to consider if you are planning to write to a local disk/log file for log forwarding (e.g., to Splunk or ELK):
- Depending on the number of events (activity/load on the machine) and scheduled queries, osquery can generate a significant volume of telemetry, which can result in significant resource utilization on the endpoint.
- On the endpoint, you will have to forward the logs reliably to a backend.
- If you use snapshot mode and capture with high frequency, it will be prohibitively expensive (CPU, storage, and bandwidth), and if the frequency is low, you will miss events/data.
- If you use differential mode, reconstructing timeline-based state in a backend such as Splunk or ELK is a significant challenge that you need to consider. The differential data received by the backend is structured with add/remove markers relative to an epoch. Thus, the backend needs logic to reconstruct state from discrete point-in-time events. Because of this challenge, many revert to snapshot mode, which is expensive and inefficient.
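To give a sense of what that state reconstruction logic involves, here is a minimal Python sketch that replays differential result-log lines to recover the rows a query would have returned at a given time. It assumes the JSON-lines result format with the name, hostIdentifier, unixTime, epoch, counter, columns, and action fields. It also assumes that a change of epoch means the agent resends its full result set, so prior state is discarded. The function name state_at is hypothetical, and a real backend would also have to handle late or missing records, which this sketch does not.

```python
import json
from typing import Dict, Iterable, List, Optional, Set, Tuple

RowKey = Tuple[Tuple[str, str], ...]


def _row_key(columns: Dict[str, str]) -> RowKey:
    """Turn a row's columns into a hashable, order-independent key."""
    return tuple(sorted(columns.items()))


def state_at(
    log_lines: Iterable[str], query_name: str, host: str, as_of: int
) -> List[Dict[str, str]]:
    """Replay added/removed records up to `as_of` (Unix time) for one query on one host."""
    events = []
    for line in log_lines:
        rec = json.loads(line)
        if rec.get("name") != query_name or rec.get("hostIdentifier") != host:
            continue
        if rec.get("action") not in ("added", "removed"):
            continue  # snapshot records carry full state and are not replayed here
        if int(rec["unixTime"]) > as_of:
            continue
        events.append(rec)

    # Replay in time order; the counter breaks ties between runs.
    events.sort(key=lambda r: (int(r["unixTime"]), int(r.get("counter", 0))))

    state: Set[RowKey] = set()
    current_epoch: Optional[int] = None
    for rec in events:
        epoch = int(rec.get("epoch", 0))
        if epoch != current_epoch:
            # Assumption: a new epoch restarts the differential baseline.
            state.clear()
            current_epoch = epoch
        key = _row_key(rec["columns"])
        if rec["action"] == "added":
            state.add(key)
        else:
            state.discard(key)

    return [dict(key) for key in sorted(state)]
```

Even this simplified version has to see every record for a query, in order, from the start of the epoch, which is why doing the same thing in a general-purpose log backend tends to be awkward.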
Osquery-powered Security Analytics
As summarized earlier, osquery is extremely capable and can be used as a universal agent for many use cases, including:
- Intrusion/Malicious activity detection (EDR)
- File Integrity Monitoring
- Incident Investigation
- Vulnerability Detection
- Audit and Compliance
In all of the above, osquery is a valuable part of a solution that provides structured telemetry. To complete the solution, you will need:
- A mechanism to manage a fleet of osquery universal agents
- Ability to aggregate osquery telemetry securely and collect it at scale across many agents
- A backend or data lake with streaming and historical analytics to solve the broad set of security use cases
In this blog I focused on osquery use cases. Of critical importance for enterprise-grade osquery deployments are considerations of scale. I’ll cover that topic in my next blog post.