A methodical developer’s perspective on mapping privacy regulations to changes in the database structure, updates in DevOps practices, backups, and restricted processing.
After 2 years of fearful anticipation, GDPR is finally here, in full effect since May 25, 2018. A considerable number of clients who've entrusted their data to our solutions keep asking questions that relate to GDPR in one way or another.
Aside from implementing "state of the art" encryption, there are still many important aspects of software development that need to be reconsidered with GDPR in mind. In this blog post, we've tried to give a broader context to GDPR demands, outline some of the GDPR-related engineering challenges that affect development practices and approaches, and suggest some ideas for solving them.
Disclaimer: This is a highly opinionated piece that reflects our point of view as there is no industry standard or even consensus on many things mentioned in this post, both in regards to GDPR and the mentioned "best practices"*.
What is GDPR?
In case you’ve been living under a rock for the last 2 years: GDPR, or the General Data Protection Regulation, is an EU law on data protection and privacy. It sent many European businesses (and especially engineers and system administrators working for those businesses) into panic mode over the number and strictness of demands and regulations, absence of clear outlines on how to meet them, a few extremely obscure points (like storing backups forever, but storing personal data for a limited amount of time), and draconian fines of up to EUR 20 000 000 or 4% of annual worldwide turnover, whichever is higher.
GDPR mandates you to have a clear and distinct policy for the users’ data privacy in place
Many of the GDPR requirements are policy- and procedure-related. Those policies and procedures seriously affect the implementation. The first and foremost requirement stated in GDPR is the need to get explicit consent for data processing from the user. This means that a simple “I agree” button and a customary license agreement will no longer do.
Now, as we’ve covered the most obvious points of GDPR-related advice, let’s proceed to the serious matters. From an engineering standpoint, the things worth sorting out and talking about in relation to GDPR and data processing are:
- “Data minimization”. Do not collect more data than you need to operate the business and deliver user value (no matter how tempting it may look or how hard it is to tailor the existing forms to the new requirements).
- “Limited storage time”. Make sure you don’t store the users’ data forever (and that you’ve informed them about it) and that at any given moment of storing, you can either totally wipe it out or provide it to the users, if they want to keep their data for their own purposes outside your system.
- “Consistent integrity and confidentiality”: Consistently prevent leaks and tampering (which includes, but is not limited to data encryption and pseudonymization).
- “Apply purpose limitation”: Collect the customers’ data intentionally, with a clear goal in mind, not "in case we might want to feed it into some fancy AI someday".
- “Accuracy”: Make sure that the data collection processes actually collect accurate data.
- “Processes, not functions”: Clearly define every personal data-related process.
When should we process the personal data?
- When we have an explicit consent from the user (to provide their personal data);
- During a contract relationship;
- When it is required by law;
- In case of a legitimate interest of a controller (including marketing purposes).
There is also one important thing overlooked by many – we can’t use the users’ data in the development environments of our products now. Which means that:
- We need to come up with a better idea of data flow and a model that allows building compatible data sets and deployment between development, staging, and production.
- We need a proper way to get the necessary “dummy” data for development purposes. This should be achieved either through generating it according to the model or through using pseudonymization/anonymization of the production data.
- We need to set a process for this.
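One way to produce “dummy” data from production records can be sketched as follows. This is a minimal illustration, not a complete anonymization pipeline: the field names in `PII_FIELDS` and the function `anonymize_record` are hypothetical, and the sketch assumes you have already enumerated which fields count as PII in your schema.

```python
import secrets

# Hypothetical list of PII field names; replace with your own enumeration.
PII_FIELDS = {"name", "email", "phone"}


def anonymize_record(record: dict) -> dict:
    """Return a copy of `record` with every PII field replaced by a random token.

    The original record is left untouched, and the random tokens cannot be
    mapped back to the original values.
    """
    clean = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            # Keep the field name as a prefix so the dummy data stays readable.
            clean[key] = f"{key}-{secrets.token_hex(8)}"
        else:
            clean[key] = value
    return clean
```

Note that the untouched fields can still act as quasi-identifiers when combined (for example, a rare postcode plus a birth date), so the list of fields to replace deserves a careful review.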
Personal Rights as defined by GDPR
Before we dive into the technical details, let’s have a bird’s eye view of all the challenges. Citizens protected by GDPR are given certain rights. What are they and what do they require?
- The right to be forgotten (erasure): In the world of automated backups, non-volatile storage systems, and all-pervasive caching, a complete elimination of customer data poses a good engineering challenge even for the most technically savvy companies.
- The right to data portability: In a perfect world, this is an encouragement for everybody to inter- and co-operate and break the walls of “walled gardens”. The right to data portability requires your system to export all the customer data in a universal machine-friendly format. In reality, that’s just another problem to solve.
- The right to restrict processing: Even though the processor can keep the data, it is not allowed to process it without a further explicit consent from the user – so how do we go about excluding certain records from SELECT *?
- The right to rectification: This means that users have the right to change their data (personally identifiable information, PII) in your system if they believe it to be incorrect. Considering the way modern systems are designed, this seems like one of the easier parts of trying to comply with GDPR – if you enumerate PII-related records appropriately.
- The right to be informed: The system has to provide means for informing and notifying users in a human-friendly manner (plain language and short comprehensible statements).
- The right of access: Users need to have the right to access all of the data you've got on them.
Now, let’s talk about the challenges we’re facing as engineers, taking all of the above into consideration.
Implementing the rights
One of the core processes for implementing all the GDPR requirements is assessing and enumerating the personally identifiable data you’re collecting and processing. Gone are the days of half-blind collecting of “everything that sticks, placed wherever convenient”. A clear scope for data collection, storage, and flow through all services and storages in your system is an important precursor to the technical measures that need to be implemented.
The right to be forgotten
How do you remove all the records related to one user if they’re distributed across a number of databases, tables, and systems? The text of GDPR actually drops a few hints.
Pseudonymization of data (as mentioned in GDPR articles 6, 25, 32, 89) suggests that you need to have a unified anonymous/pseudonymous ID across all your systems. It means that you should stop using emails as unique user IDs across your databases. You need to store the data in an identified fashion instead of constantly searching through it and correlating attributes. But more about that later.
It is a good idea to have a unified API call / library function for removing users (and their data) everywhere. But deletion itself poses a few direct challenges:
- Orchestrating a cascade of deletion across all storages. This is hard to implement properly and it's a challenge if some entries are used in aggregate metrics (although this can and should be solved by anonymization).
- Storage-level challenge of deleting DB row(s) versus replacing PII with a “Removed user” placeholder. In some cases (e.g. traditional SQL databases), DELETE performance seriously depends on the indexing strategy.
- Foreign keys in RDBMS. You don’t store all the data in one table… or two. Or ten. You need foreign keys to link entities efficiently. Being able to easily remove the records either breaks the referential integrity or requires allowing NULL foreign keys in your system.
- Sequential lists (e.g. event sourcing data models) – if (immediate or eventual) consistency is important, you don't want PII in there, as removing records from sequential lists usually breaks other things.
- Signed chains of all kinds, especially the blockchain type. If blockchain holds its guarantees properly, you shouldn't put any personally identifiable data there.
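The idea of a unified removal function over pseudonymous IDs can be sketched like this. It is a minimal example under stated assumptions: the tables `user_pii` and `orders`, and the functions `create_schema`, `add_user`, and `forget_user` are hypothetical names, SQLite stands in for whatever storage you actually use, and a real system would have to orchestrate the same call across every storage that holds PII.

```python
import sqlite3
import uuid


def create_schema(conn: sqlite3.Connection) -> None:
    # PII lives in one table; business records only carry the random user_id.
    conn.executescript("""
        CREATE TABLE user_pii (user_id TEXT PRIMARY KEY, name TEXT, email TEXT);
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            user_id TEXT NOT NULL,      -- pseudonymous ID, deliberately not a foreign key
            total_cents INTEGER
        );
    """)


def add_user(conn: sqlite3.Connection, name: str, email: str) -> str:
    user_id = uuid.uuid4().hex  # random ID, not derived from the email
    conn.execute("INSERT INTO user_pii VALUES (?, ?, ?)", (user_id, name, email))
    return user_id


def forget_user(conn: sqlite3.Connection, user_id: str) -> int:
    """Single entry point for erasure: delete the PII row, keep pseudonymous orders.

    Returns the number of PII rows removed.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("DELETE FROM user_pii WHERE user_id = ?", (user_id,))
    return cur.rowcount
```

Because `orders.user_id` is not declared as a foreign key into the PII table, deleting the PII row neither breaks referential integrity nor forces NULL keys; the executed orders stay in place but no longer point to an identifiable person.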
Aside from these obvious challenges, there are some trickier ones.
Backups are troublesome – you’ve got to choose between:
- Not backing up PII at all
- Pseudonymizing the backup, while User_IDs are matched to PII identifiers separately so that purging the “forgotten” User_IDs makes the backup safe from the need to go through it to delete the users’ data.
In cases where this is hard, it is possible to make an encrypted backup and a separate list of “IDs to forget” and never recover the data related to these IDs from backups.
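The “IDs to forget” approach boils down to a filter applied during restore. The sketch below assumes, purely for illustration, that backup rows are dictionaries carrying a `user_id` key; `restore_records` is a hypothetical helper, not part of any backup tool.

```python
def restore_records(backup_rows, forgotten_ids):
    """Yield only the backup rows whose user_id is not on the forget list."""
    forgotten = set(forgotten_ids)  # set membership keeps the check cheap
    for row in backup_rows:
        if row["user_id"] not in forgotten:
            yield row
```

The forget list itself has to be stored outside the backup and survive any restore, otherwise a full recovery would silently bring the forgotten users back.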
Also, there are third parties with whom you exchange data. This starts with your payment processor, Salesforce, analytics platform, or any other service where sensitive data is exchanged. Although the data given to 3rd parties leaves your control, if you are expected to be able to re-match some data you owned to the IDs you’ve had (and transmitted), you’ll need to establish an appropriate mechanism for that.
This is a typical state synchronisation problem between two parties:
- You can refuse requests with an appropriate error code (410 Gone). This is effectively a “pull” scheme for 3rd parties to find out about your changes – easy for you, not so easy for them.
- You can notify them about the user IDs that you forget every time you do so – which is a “push” approach.
If 3rd parties hand the data to you, having a (semi-)automatic way of recording “supplier X has no records of Z anymore” will help with consistency.
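The “pull” scheme can be sketched as a lookup that distinguishes a forgotten ID from one that never existed. The function `lookup_user` and the in-memory `active_users` / `forgotten_ids` collections are assumptions made for the example; in practice this logic would sit behind whatever HTTP framework serves your API.

```python
from http import HTTPStatus


def lookup_user(user_id, active_users, forgotten_ids):
    """Return (status, payload) for a 3rd-party request about `user_id`.

    410 Gone tells the caller that the record existed and was deliberately
    removed, so they should drop their copy too; 404 means it never existed.
    """
    if user_id in forgotten_ids:
        return HTTPStatus.GONE, None
    if user_id in active_users:
        return HTTPStatus.OK, active_users[user_id]
    return HTTPStatus.NOT_FOUND, None
```

For example, `lookup_user("u2", {"u1": {"plan": "pro"}}, {"u2"})` answers with `HTTPStatus.GONE` and no payload.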
Let’s look into updating other models affected by a deleted user. For instance, someone “likes” your post, but then they request you to remove their info. Should you remove the “like” and re-calculate the total count of “likes”, too? Or do you need to store the likes in a pseudonymized fashion, where removal of a pseudonymized ID disconnects the “like” from the person, but keep the “like”? Various social media engines treat this case differently, but in logistics and e-commerce the answer is obvious: you don’t delete executed orders. So pseudonymize wherever possible.
It’s also an additional UI challenge to make sure that the deletion is properly confirmed and is never exploited by adversaries! And we're only getting started.
The right to restrict processing
Preventing an automated system from processing the data it stores is burdensome, illogical, and crazy from the modern point of view. Yet this is reasonably achievable (and one of the least challenging parts we’ve seen so far):
- For most databases, adding a table of “restricted objects” and filtering the output against it seems to be a reasonable solution.
- Adding a boolean column / field to every table or collection containing PII(s) is another way of solving this, but the procedure requires careful coordination.
- All in all, if the data isn’t fetched by DB to the application logic, it’s not processed.
- One way of addressing the expensive changes in applications where the database scheme is pre-defined and hard to migrate is creating a custom view. It would filter the actual table output for the restricted records.
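The “restricted objects” table combined with a filtering view can be sketched in SQLite as follows. The table, view, and function names (`users`, `restricted_objects`, `users_processable`, `build_demo_db`, `processable_user_ids`) are hypothetical; the same pattern should carry over to other SQL databases with minor syntax changes.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (user_id TEXT PRIMARY KEY, email TEXT);
        CREATE TABLE restricted_objects (user_id TEXT PRIMARY KEY);

        -- Application code reads from the view instead of the base table,
        -- so restricted rows never reach the application logic.
        CREATE VIEW users_processable AS
            SELECT u.* FROM users AS u
            WHERE NOT EXISTS (
                SELECT 1 FROM restricted_objects AS r WHERE r.user_id = u.user_id
            );

        INSERT INTO users VALUES ('u1', 'a@example.com'), ('u2', 'b@example.com');
        INSERT INTO restricted_objects VALUES ('u2');
    """)
    return conn


def processable_user_ids(conn: sqlite3.Connection) -> list:
    rows = conn.execute("SELECT user_id FROM users_processable ORDER BY user_id")
    return [row[0] for row in rows]
```

With the demo data above, only `u1` is returned; lifting the restriction is a matter of deleting the row from `restricted_objects`, with no change to the `users` table itself.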
The right to data portability
This one is easy – provided that you enumerate the personal data assets, too. While there is no consensus on what “all data” means, it should cover:
- All the personally identifiable data;
- All the data a user generates in your system, including activity logs (but not the technical logs).
Things to take into account:
- If the export process takes long, consider moving it to the background and emailing the users a link or notifying them via a push notification. This also should be done securely as you don’t want users to trigger the export process and then have some adversary pick up all the data.
- It is not obligatory to fully automate the process – but after enumerating all the users’ assets, it’s far easier to automate it than go through with a manual procedure, making real humans export the data.
- If possible and accessible, aggregating the data from the 3rd party systems you’ve used also starts making sense.
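Once the personal data assets are enumerated, the export itself can be a thin aggregation layer. In this sketch, `export_user_data` and the `sources` mapping (section name to a callable that fetches that section for a given user ID) are hypothetical, and JSON is simply one possible machine-friendly format.

```python
import json


def export_user_data(user_id: str, sources: dict) -> str:
    """Collect every enumerated data section for `user_id` into one JSON document.

    `sources` maps a section name (e.g. "profile") to a callable that takes
    the user ID and returns JSON-serialisable data for that section.
    """
    bundle = {
        "user_id": user_id,
        "data": {section: fetch(user_id) for section, fetch in sources.items()},
    }
    return json.dumps(bundle, indent=2, sort_keys=True)
```

Adding a newly discovered data asset then means registering one more entry in `sources` rather than touching the export logic.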
Right to rectification
To be most GDPR-compliant, a user’s profile should meet the following criteria:
- All personal data fields must be editable. That includes the profile itself and also most of the data that users leave in your system.
- The users need to have the ability to change their email addresses. Identify users by unique IDs instead of using emails.
- For some fields, the process can’t be automated and it’s fine. Allow the users to write emails to your support engineers if they need to change some special data (e.g. change their phone number if you’re a wireless carrier company).
- Consider having non-volatile (versioned) changes to user profiles to avoid mistakes and simplify safe propagation of the changes.
- Since some of the users’ data that you might need to manage can be supplied by 3rd parties (e.g. data pulled from Facebook):
- Provide the users with a way to also edit your local copy of that data.
- Let the persons who are not your authenticated users edit data in your system if you’ve collected that data from 3rd parties (another interesting challenge, isn’t it?). Currently there are no specific regulations as to how this should be carried out, but ZKP-like protocols over publicly known identifiers (e.g. name, SSN, etc.) can do the trick.
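Versioned profile changes, as suggested in the list above, can be sketched as an append-only history. The class `VersionedProfile` and its methods are hypothetical and keep everything in memory for brevity; a real implementation would persist the versions and bound how long old ones are retained, since they contain the very data the user asked to correct.

```python
from datetime import datetime, timezone


class VersionedProfile:
    """Keeps every state of a profile so a mistaken edit can be rolled back."""

    def __init__(self, initial: dict):
        self._versions = [(datetime.now(timezone.utc), dict(initial))]

    @property
    def current(self) -> dict:
        # Return a copy so callers cannot mutate the stored version.
        return dict(self._versions[-1][1])

    def rectify(self, **changes) -> None:
        """Record a new version with the given fields changed."""
        updated = self.current
        updated.update(changes)
        self._versions.append((datetime.now(timezone.utc), updated))

    def revert(self) -> None:
        """Drop the latest version, unless it is the only one."""
        if len(self._versions) > 1:
            self._versions.pop()

    def history_length(self) -> int:
        return len(self._versions)
```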
The right of access
Technically, the right of access requires the same measures as above, since de jure this is the same as being able to export the data.
- For the non-registered users, the ability to check whether you’ve got data on them is a challenge. To avoid enumeration by adversaries, you need some kind of limited reveal protocol with a completely non-trusted source (request only even letters of the name or fill some blanks on your side to prove that you know them, etc. – some kind of semi-ZKP procedure that would be able to satisfy the most paranoid minds).
- You have to give unregistered users an opportunity to ask whether you have data on them, but that would have to be either a manual process, the use of a 3rd party as a trusted arbiter, or would involve rather rough ZKP tricks.
- Divide the users’ data into pieces with lower and higher sensitivity/privacy levels and only allow access to the more private data after checking and confirming the users’ identities.
- One of the core concepts necessary for implementing all of the above is mapping the existing sensitive data flow, enumerating storages and their procedures.
One concept to rule them all: data flow
The first challenge that arises is defining 2 things:
- the personal data scope: list of personal data assets you’re processing and storing;
- the data flow: how these records travel across your systems.
Often we understand the requirements the business itself sets for the data protection, but there is always much more to it: keys, logs, metadata (geo-tags, EXIF, etc.). These pieces of data might not be a part of the business logic of your product, but your code could be accumulating them, too.
How to understand the whole picture? Define the data flow, check/create the documentation for the components of your architecture and storage model, and also look for the undocumented public APIs.
Implementing the principles
Implementing the principles is less about the actual technical capabilities and more about the processes, the UI/UX – but it’s still worth going over some of them in search of challenges.
- Avoid buttons stating “by clicking “OK” you…” and replace them with proper checkboxes. This simplifies detecting/recording if your site is being scraped and provides the sought-after explicit consent.
- It should be easy to withdraw consent at any time. Having flags for different types of consent in the user profile for turning various types of data processing on and off makes sense.
- Collect the identification data (IPs and timestamps) when receiving the consent necessary for data processing – these are your proofs.
- Define the necessary timespan for storing the data, document it.
- Do not store the data after the defined timespan expires.
- Keep the “store until” database column / profile field, set a semi-automated purging procedure in place for cleaning up of the unnecessary records.
- Use TTL indexes in MongoDB.
- It might make sense to anonymize / pseudonymize the data that you don’t need for serving real humans, but which is necessary to have the statistics/analytics working.
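The “store until” column with a purging procedure can be sketched like this. The `profiles` table, its `store_until` column (stored here as a Unix timestamp), and the function `purge_expired` are assumptions for the example; the scheduling of the purge (cron job, background worker) is left out.

```python
import sqlite3
from datetime import datetime, timezone


def purge_expired(conn: sqlite3.Connection, now: datetime = None) -> int:
    """Delete every profile whose retention period has ended.

    `store_until` is assumed to hold a Unix timestamp in seconds.
    Returns the number of rows removed, which is worth logging.
    """
    cutoff = int((now or datetime.now(timezone.utc)).timestamp())
    with conn:  # commit on success, roll back on error
        cur = conn.execute("DELETE FROM profiles WHERE store_until <= ?", (cutoff,))
    return cur.rowcount
```

Passing `now` explicitly makes the procedure easy to dry-run against a copy of the data before it is allowed to delete anything in production.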
Encryption is a crucial security tool for protecting sensitive data. Article 32 of GDPR mentions “state of the art” encryption, which in this case can be interpreted as the “industry’s best practices”.
Traditionally, the process of data encryption consists of 3 crucial parts:
- Encrypt the data in transit between the components:
- Between the application and the database: query your database via TLS-encrypted connection (if it is allowed/possible). Switch to using SSL/TLS by default in the database driver.
- Between the micro-services: encrypt via TLS v1.3 or Secure Session; if the communication between the micro-services relies on the message bus, store the messages in an encrypted form with authenticated encryption (preferably asymmetric encryption, where each node has its own keypair).
- Between the user and the application: use HTTPS wherever possible.
- When dealing with 3rd party services/apps: establish trusted connections, use TLS, do not use anonymous APIs.
- Encrypt the data at rest:
- Use the filesystem “at rest” encryption via LUKS or similar technology. What are the possible problems in this case? Usage of a single key, which can be extracted from the running memory. A wrong choice of a cipher suite makes cryptographic attacks possible.
- Database-native “at rest” encryption. Use the database’s engine for encrypting all the data written to the disk. The following challenges arise in the key storage model:
- Key management mistake #0 – storing keys near the data. If the node is compromised, having a key near the ciphertext eliminates any potential security expected from the implemented cryptographic solutions.
- The keys are stored in the application and transmitted to the database with queries: for any attacker persisting in the DB host system, this is no different from the “keys near the data” case.
- “At rest” in-app encryption: encrypting the data in the application/middleware, but storing the secrets elsewhere. This approach is becoming common these days, but it requires implementing proper key management, rotation, compaction, and other KM processes in-app for the encryption to be really useful.
- “At-rest” encryption via a trusted service: using Vault, Acra or Always Encrypted engine of MS SQL to encrypt/decrypt the data in the application and manage the keys:
- It is the most resistant to smash-and-grab attacks.
- It holds up really well against some persistent attackers and is able to detect those it cannot resist.
- Encrypt the backups:
- Given so many considerations about storing the sensitive data in backups, it might make sense to run encrypted backups.
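For the in-transit part of the list above, “TLS by default” on the client side can be sketched with Python’s standard `ssl` module. The helper name `strict_tls_context` and the TLS 1.2 floor are choices made for this example; whether your database driver or HTTP client accepts an `SSLContext` object depends on the library.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context that refuses unverified or outdated connections."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2 (an assumption; raise to 1.3 if you can).
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```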
A popular piece of advice for GDPR compliance is “if you don’t need to know the PII – avoid having it”. Another piece of such advice is an ambiguous formula “secure by design”, which can mean anything from “designing systems with security in mind” to “secure by default”, in turn ranging from application's approach to configuration and permissions, to principle of least privilege towards the data.
The very least privilege is storing data in encrypted form and only allowing authorised keys to access it. In a way, this is equal to not storing it in an accessible/processable form at all.
It’s possible to shed some responsibility for PII data by building a solution that doesn’t store data in unencrypted form and holds no possible keys to it. And there are a couple of approaches for doing just that – operating on data without knowing “what’s in the box”. End-to-end encryption (E2EE) and zero knowledge architectures fit the bill.
End-to-end technology was popularised through E2EE messengers (think Signal or Wire) as the mechanics of the E2E-encrypted communication guaranteed that only the sender and the recipient of the message could actually read it. Now there are a number of end-to-end secure filesystems, collaboration products, and VCS, bringing E2E security to data storage schemes (you might want to try our open source framework Hermes, which is developed to enable building better end-to-end encrypted data flow with the emphasis on data sharing and collaboration – check it out on GitHub).
The challenge that arises with using E2EE or zero knowledge architectures? You’ll have to implement a system that is capable of performing all of its duties without having access to sensitive data. This requires certain discipline, good enumeration of sensitive records, precise understanding of data flow… Exactly the things GDPR requires you to do to implement the rights and processes well.
Pseudonymization and Anonymization
Data anonymization is an irreversible type of information sanitisation, the intent of which is protection of privacy. It is the process of either irreversibly transforming or removing the personally identifiable information from the data sets, so that the people whom the data describes remain anonymous.
Along with pseudonymization, anonymization is mentioned in the chapters of GDPR dedicated to “privacy by design”, one of the main requirements of GDPR, but they fall under different categories in the regulation. Anonymization and pseudonymization are often confused, but they are not the same.
Pseudonymization is a data management and de-identification procedure through which the personally identifiable information fields in a data record are replaced by one or more artificial identifiers or “pseudonyms”. Article 4 of GDPR defines pseudonymization as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person”.
Pseudonymization (along with encryption) is suggested as one of the possible organisational measures for reaching the compliance with GDPR in a number of its articles (namely, article 6 – Lawfulness of processing, 32 – Security of processing, 25 – Data protection by design and by default, and 89 – Safeguards and derogations relating to processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes).
The main practical difference between pseudonymization and anonymization lies in the implementation:
- Pseudonymization will require a keyed hash (such as HMAC) with a strong secret key, periodic key rotation, and an identification table that matches each hash to the actual value.
- Pseudonymization can be performed during:
- Ingestion (when data gets into database);
- Migration (when existing data is gradually migrated to pseudonymized state);
- Query (when data remains unchanged, but the output is masked in custom view).
- Anonymization is easier – it’s just removing or replacing PIIs with random identification.
- Points to consider, regarding both:
- Pseudonymization isn’t perfect for moving data from your production environment to development. Anonymization is much better suited for this task.
- Pseudonymize data for machine learning purposes.
- Use random IDs with a lookup dictionary, or simply a salted hash (e.g. bcrypt) of the value.
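Pseudonymization with a keyed hash and a separate identification table can be sketched as below. The class `Pseudonymizer` and its methods are hypothetical, the table is an in-memory dictionary for brevity, and key rotation is left out; in a real system the key and the table would be stored apart from the pseudonymized data and under stricter access control.

```python
import hashlib
import hmac
import secrets


class Pseudonymizer:
    """Replaces PII values with HMAC-SHA256 pseudonyms and keeps a lookup table."""

    def __init__(self, key: bytes = None):
        self._key = key or secrets.token_bytes(32)
        # pseudonym -> original value; this is the "additional information"
        # that must be kept separately from the pseudonymized data.
        self._lookup = {}

    def _digest(self, value: str) -> str:
        return hmac.new(self._key, value.encode("utf-8"), hashlib.sha256).hexdigest()

    def pseudonymize(self, value: str) -> str:
        pseudonym = self._digest(value)
        self._lookup[pseudonym] = value
        return pseudonym

    def reidentify(self, pseudonym: str):
        """Return the original value, or None if it is unknown or was forgotten."""
        return self._lookup.get(pseudonym)

    def forget(self, value: str) -> None:
        """Remove the mapping so the pseudonym can no longer be resolved via the table."""
        self._lookup.pop(self._digest(value), None)
```

Dropping the table entry alone is not enough against someone who holds the key and can guess the input (an email address, for instance), which is one reason the key deserves the same protection as the table and a rotation schedule.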
Access control & logging
Access control and logging should be the spine of the PII data flow. For the best results:
- Apply secure defaults (use the principle of least privilege) everywhere.
- Don’t log personal/sensitive data (at least not in unencrypted form).
- Log the access to personal data in a secure way.
- Allow no anonymous access to the data;
- Authenticate all the 3rd parties;
- Use HTTP authentication for any endpoints that might expose the data if robots.txt is ignored.
- Keep a record of all the automated processes possible in the system that may use personal data (as required by Article 30 of GDPR) – this is not (just) bureaucracy, it really helps to keep an eye on the processes.
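Logging access to personal data without logging the data itself can be sketched with a decorator. The logger name `pii_access`, the decorator `logs_pii_access`, and the example function `read_profile` are hypothetical; the sketch assumes every PII-reading function takes the acting party and the pseudonymous user ID as its first two arguments.

```python
import functools
import logging

access_log = logging.getLogger("pii_access")


def logs_pii_access(purpose: str):
    """Decorator: record who touched which user's data and why, but not the data."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(actor, user_id, *args, **kwargs):
            # Only identifiers and the stated purpose are written to the log.
            access_log.info("actor=%s user_id=%s purpose=%s", actor, user_id, purpose)
            return func(actor, user_id, *args, **kwargs)
        return wrapper
    return decorator


@logs_pii_access("support_ticket")
def read_profile(actor, user_id):
    # Placeholder for the real lookup.
    return {"user_id": user_id}
```

Routing the `pii_access` logger to a separate, append-only destination keeps these records apart from the technical logs and makes them harder to tamper with.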
Everybody fails and everything breaks, be it sooner or later. What is needed is not dialling the paranoia levels to 11, but getting ready for a failure/breach in a practical and sane manner. Some good starting points would be:
- Having good real-time monitoring with corresponding security events.
- Having somebody responsible for watching and reacting. The modern SRE principles are more than enough to cover the basic needs.
- Being able to identify breaches and notify users;
- Actually having a process/procedure/checklist in place for notifying users (and authorities).
- Using pentesting to see how you handle breaches in real life.
Currently there are no ISO-like standards that would guarantee 100% GDPR compliance (such a standard will hardly ever be possible, due to the flexible nature of the regulation itself). From the procedural standpoint, all security standards / certifications help clarify procedures and approaches, but what does it look like from a technical perspective?
For maximum GDPR compliance, consider looking into technical standards and frameworks that emphasize sensitive data security, e.g.:
- OWASP ASVS;
- OWASP Mobile;
- NIST encryption standards and guidelines
- Sensitive-data related segments of HIPAA and FISMA.
Conclusion: Why should I really care?
Aside from increasing the legislative and bureaucratic pressure, GDPR actually carries a lot of really good intent:
- It enforces the best practices, marking the legacy security solutions as "inappropriate" (however, following the new regulations in the real world is still quite a challenge);
- It makes companies of all sizes aware of the fact that they're processing something that is valuable, regulated, should be taken seriously, and shouldn't be overlooked until the “latter stages of the company’s lifecycle” (in other words – “never”).
If you develop something that stores and processes customers’ data, you’re responsible for staying up-to-date with data protection and need to make your best effort to create a reliable layer of security around your infrastructure.
If you pass a certain growth threshold (250 employees or processing data on a large scale – and to add to the fun, the exact scale of “large scale” is not defined in the text of the GDPR), you need to keep a log of all types of processing taking place, including transmission of data to 3rd parties. This also includes your cloud provider, surprise-surprise.
Sooner or later, even the most absurd regulations are adapted to the real world and the governments start enforcing them in a hard way. For a change, the GDPR is not absurd – it is an actually useful regulation, albeit one that is still vaguely defined and is thereby hard to comply with. It’s only a question of time when it becomes the de-facto standard operating practice. Preparing beforehand and watching the changes GDPR brings is smarter than adapting when the train is already moving.
This article didn’t just come up out of the blue. As we assisted our clients in integrating the tools we build, mapping them to practical risks, we did a lot of thinking ourselves, but without seeing other engineers around the globe ask themselves the same questions, we would never have seen the whole picture this clearly.
Below is a (regularly updated) list of links to the GDPR-related technical posts and resources that we suggest you also look into:
- The official GDPR portal;
- Well-structured searchable text of the GDPR;
- Software development and GDPR (article);
- GDPR – A Practical Guide for Developers (article);
- How GDPR Will Change The Way You Develop (article);
- What is GDPR? The Summary Guide to GDPR Compliance in the UK (article);
- Hard questions about GDPR (article).
If you have any further questions regarding the practical steps you can take to ensure a better GDPR compliance, get in touch.
Disclaimer, continued*: No single silver-bullet tool can make your product or infrastructure fully and instantly GDPR-compliant. No matter how well you implement encryption and pseudonymization, there will still be a ton of things you’ve got to get right, both in architecture/code and in processes that your code implements. Having plenty of experience in solving the encryption-related part, we also wanted to take a look at the rest of the demands and this is how this article was created.