Does Osquery Violate GDPR Rules Around Personally Identifiable Information (PII)?

AHHHH! GDPR day is upon us!

If you've used a service or signed up for anything ever in your life, then you've surely noticed the onslaught of "Terms of Privacy Update" emails over the last couple of days. That could only mean one thing: GDPR implementation day has finally arrived! But for all the unavoidable noise around GDPR, we couldn't help but notice a lack of any advice or documentation about osquery and its link to Personally Identifiable Information (PII) -- a focal area of GDPR (here's a "handy" 100-page reference guide on GDPR). So, let's get right to it then.

Does osquery reveal any Personally Identifiable Information (PII)?

A good place to start is to define more precisely what PII is, for which we turn to NIST Special Publication 800-122 (https://csrc.nist.gov/publications/detail/sp/800-122/final), as authoritative a source as any. It defines PII as any information that can be used to distinguish or trace an individual, or that is linked or linkable to an individual. Information like names, driver's license numbers, and biometric records can be used to distinctly identify, or distinguish, an individual.

Payment transaction data can be used to trace the activity of an individual. And finally, even seemingly anonymous data – such as a to-do list with no direct PII data – may be easily linkable to an individual if some of the items on the to-do list are so role specific that they leave little doubt who the list belongs to. So, PII data can be very nuanced.

The osquery team has made a concerted effort to prioritize privacy over functionality (see the documentation) and is well known for having rejected requests to make information like browser history available. Unfortunately, under the NIST definition of PII, all three types of data (distinguishing, tracing, and linkable) are still widely available in osquery tables.

Osquery has tables such as users, which directly identify individuals. Perhaps not so obviously, process_envs can expose environment variables that reference users' home directories, and system_info exposes the system hostname; either will usually identify the user of the system. IP addresses are prevalent in several tables, such as interface_details and process_open_sockets. IP addresses are a well-known source of traceability information: geolocating an IP address could easily trace the activities of an employee.

Finally, innocent tables such as temperature_sensors – which measure system temperatures – could be used to link the system to an individual, in cases where the individual could be expected to have certain system usage behaviors. For example, a remote employee operating from a geography that is several timezones away from the rest of her peers can be linked to a system based on the time graph of the system temperature sensors.
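
To make the three categories concrete, here is a minimal sketch in Python that flags scheduled queries touching the tables named above. The table-to-category map is taken only from the examples in this post and is nowhere near exhaustive; the helper name flag_pii_tables and the sample pack are hypothetical, and the matching is deliberately naive (word matching, not SQL parsing).

```python
import re

# PII category per table, taken only from the examples discussed in the text.
# This is an illustrative, incomplete list, not an official classification.
PII_TABLES = {
    "users": "distinguish",
    "process_envs": "distinguish",
    "system_info": "distinguish",
    "interface_details": "trace",
    "process_open_sockets": "trace",
    "temperature_sensors": "link",
}


def flag_pii_tables(pack):
    """Return {query_name: {table: category}} for queries that mention a listed table."""
    flagged = {}
    for name, spec in pack.get("queries", {}).items():
        # Naive approach: split the SQL into identifier-like words and
        # look for exact table-name matches (no real SQL parsing).
        words = set(re.findall(r"[a-z_]+", spec["query"].lower()))
        hits = {table: cat for table, cat in PII_TABLES.items() if table in words}
        if hits:
            flagged[name] = hits
    return flagged


# Hypothetical query pack, shaped like an osquery pack ("queries" -> name -> spec).
pack = {
    "queries": {
        "local_users": {"query": "SELECT username, directory FROM users;", "interval": 3600},
        "kernel": {"query": "SELECT version FROM kernel_info;", "interval": 86400},
        "sockets": {"query": "SELECT pid, remote_address FROM process_open_sockets;", "interval": 60},
    }
}

for query_name, tables in flag_pii_tables(pack).items():
    print(query_name, tables)
```

A check like this only tells you where to look; whether a given column is actually PII in your environment still needs a human decision.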

So, osquery is replete with PII data of all types. That means it is not fit to be used in any situation where GDPR or other PII related compliance might be needed, right? Clearly, we all ought to do some navel gazing and perhaps consider careers as turnip farmers instead, yes?

Not so fast. There is hope yet for the likes of us.

How (if at all) is the PII data stored?

First of all, none of the above is relevant unless one decides to store the data retrieved from osquery. Recall that osquery does not itself store any data, and can be used to investigate systems in real-time. If osquery use is limited to this, then obviously, there is no privacy implication at all.

Of course, that would severely limit the usefulness of osquery; most users of osquery schedule query packs and have the results archived in some system of record, like a SIEM or a purpose-built datastore like Uptycs, for use in subsequent investigation and diagnostic activities.

One practical solution is to carefully store the osquery data in what NIST publication 800-122 would call a de-identified manner. The basic idea is to create, for all PII data, a mapping table from the actual data to a machine-generated token, and then store only the machine-generated token. If the PII data has to be deleted (for example, when a user chooses to "opt out"), all that has to be done is to remove the mapping table entry; at that point, none of the recorded data can be attributed back to an individual.
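
The mapping-table idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the TokenVault class, the deidentify helper, and the choice of which columns count as PII are hypothetical, and a real deployment would keep the mapping in a separate, access-controlled store rather than in memory.

```python
import secrets


class TokenVault:
    """In-memory mapping table between PII values and random tokens (illustrative only)."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        # Reuse the existing token so the same value always maps to the same token.
        token = self._token_by_value.get(value)
        if token is None:
            token = "tok_" + secrets.token_hex(16)  # random, not derived from the value
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return token

    def resolve(self, token):
        # Returns None once the mapping entry has been removed.
        return self._value_by_token.get(token)

    def forget(self, value):
        # "Opt out": drop the mapping entry; archived rows keep only the orphaned token.
        token = self._token_by_value.pop(value, None)
        if token is not None:
            del self._value_by_token[token]


# Assumed set of PII columns for rows from the users table.
PII_COLUMNS = {"username", "directory"}


def deidentify(row, vault):
    """Replace PII column values with tokens before the row is archived."""
    return {k: vault.tokenize(v) if k in PII_COLUMNS else v for k, v in row.items()}


vault = TokenVault()
row = {"username": "alice", "directory": "/home/alice", "shell": "/bin/bash"}
stored = deidentify(row, vault)   # only tokens reach the system of record
print(stored)

vault.forget("alice")
vault.forget("/home/alice")
print(vault.resolve(stored["username"]))  # the token no longer maps back to anyone
```

Because tokens are random rather than hashes of the value, nothing in the archived rows can be used to recompute the original once the entry is gone.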

What about linked data?

As mentioned before, seemingly innocent tables such as temperature_sensors or just the timing of program execution reflected in the processes table might be enough to link the system to a specific individual.

One solution to this would be to store all the data in an encrypted manner, with the encryption key kept in a mapping table. Once a user opts out, the encryption key can be deleted from the mapping table, so that the encrypted data is effectively lost (deleted) forever.
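
As a rough sketch of this key-deletion approach, the Python below keeps one key per user in a mapping table and discards it on opt-out. The KeyTable class and the per-user keying are assumptions for illustration. The cipher is an HMAC-based XOR keystream chosen only so the example runs with the standard library; it provides no integrity protection and is a stand-in for a vetted authenticated cipher such as AES-GCM, which is what a real system should use.

```python
import hashlib
import hmac
import secrets


def _keystream(key, nonce, length):
    # Stand-in keystream: HMAC-SHA256 over nonce + counter. Illustration only.
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


def _xor(data, stream):
    return bytes(a ^ b for a, b in zip(data, stream))


class KeyTable:
    """Mapping table from a user identifier to that user's encryption key."""

    def __init__(self):
        self._keys = {}

    def encrypt(self, user_id, plaintext):
        if user_id not in self._keys:
            self._keys[user_id] = secrets.token_bytes(32)
        nonce = secrets.token_bytes(16)  # fresh nonce for every record
        stream = _keystream(self._keys[user_id], nonce, len(plaintext))
        return nonce + _xor(plaintext, stream)

    def decrypt(self, user_id, blob):
        key = self._keys.get(user_id)
        if key is None:
            return None  # key deleted: the record is unrecoverable
        nonce, body = blob[:16], blob[16:]
        return _xor(body, _keystream(key, nonce, len(body)))

    def opt_out(self, user_id):
        # Deleting the key effectively deletes every record encrypted under it.
        self._keys.pop(user_id, None)


keys = KeyTable()
record = b'{"name": "temperature_sensors", "celsius": "61.5"}'
blob = keys.encrypt("user-17", record)
print(keys.decrypt("user-17", blob) == record)

keys.opt_out("user-17")
print(keys.decrypt("user-17", blob))
```

The trade-off is that encrypted rows can no longer be searched or aggregated directly, so this fits best for data that is only read back during a targeted investigation.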

The bottom line is that while osquery is a powerful tool for securing and managing systems, its PII implications need to be carefully considered and incorporated in any implementation strategy. Doing so enables all the power of osquery while respecting the privacy of users. And while we're not legal advisors, we have helped our customers wade through these complexities.