Recently, we released a new version of the Acra database security suite. Acra allows encrypting sensitive data fully transparently for the app and the database without any application code changes.
In this article, we explain details of how transparent data encryption works, what actually makes it “transparent”, and what features of SQL protocol Acra encapsulates under the hood for a comforting user experience.
Acra transparent data encryption is available as part of Acra Community Edition (free on GitHub, checkout ready-to-use engineering examples).
If you are familiar with terms field level encryption and transparent data encryption, scroll to the How does Acra work? section.
- What are modern transparent data encryption techniques?
- Database wire protocols
- Integrating Acra type awareness to the database wire protocols
- Example: application uses simple text query protocol
- Example: application uses extended binary query protocol
- Summary
What are modern transparent data encryption techniques? #
In the context of databases, data protection goes beyond “data at rest encryption” and “row level security”. The modern approach is to use transparent column level encryption or field level encryption.
Field level encryption means that the database stores sensitive fields (names, emails, accounts) in encrypted form.
“Acra transparent data encryption” means that the encryption process happens transparently for both: the database and the application that talks with the database. The typical approach is to use an encryption proxy in the middle, like the Acra database security suite. It’s important that encryption happens outside the database, thus the database never gets access to the plaintext data or encryption keys.
Don’t confuse Acra transparent data encryption with Oracle TDE and SQL Server TDE: while these technologies have similar naming, they work differently under the hood.
With Acra transparent data encryption, neither the database nor the app know that the data is encrypted, because it happens transparently for them. Also, Acra uses unique encryption keys per each sensitive data field, significantly decreasing chances of plaintext data leakage.
Benefits of transparent data encryption #
The biggest benefit of Acra transparent data encryption is that it doesn’t require any change in the application code or dealing with cryptographic libraries and keys. Transparent data encryption saves resources on development and allows to protect data with minimum effort.
Acra’s transparent encryption is the most useful when the infrastructure (apps, databases) is already built. For example, when data privacy regulations change, your team is required to encrypt sensitive fields but doesn’t need to rebuild the existing system.
How does Acra work? #
In short: your application communicates with the database via Acra.
Note: Acra can work as SQL proxy, as API server, as client-side SDK, or even module for SCADA. Here we describe Acra that works as a SQL database proxy.
Acra works as a database proxy: it receives all incoming data from the applications and encrypts sensitive data fields before sending them to the database. In the reverse direction, Acra captures queried data from the database, decrypts it, and sends to the application.
Developers or DBAs customize which fields to consider sensitive and which actions Acra should take (encrypt, searchable encrypt, mask, tokenize, etc). Typically, sensitive fields are PII (names, emails, dates), financial data, regulatory-specific data.
Important point is that Acra adds extra protection for sensitive data fields, which works in addition to TLS and database-level security controls (like row-level security and data-at-rest encryption).
How transparent data encryption works in Acra? #
In databases, data fields are stored in certain types: string, int, and bytes.
But encrypted data is binary data, it doesn’t preserve data format and type. It is stored as a binary large object (blob) in the database, or, in some rare cases, encoded as base64 string.
While the format preserving encryption algorithms exist, they don’t provide acceptable performance and security guarantees.
FPE works well in small home projects, but it doesn’t satisfy modern security requirements of the enterprise solutions.
Acra uses AES-256-GCM and wraps data into the AcraBlock during encryption of the sensitive data fields.
So, when data is encrypted, the database doesn’t know its original type anymore. But when data is read from the database and decrypted, it should receive its type again. Developers prefer to work with string, ints, and their custom serialization types but not the binary chunks.
Thus, it is important that Acra not only decrypts but decodes data back to its original type.
How does Acra know the original data types? #
In the previous versions, Acra didn’t know the original data type and sent decrypted data back as binary. Developers added a few code lines to their apps to decode binary data back to its original type. It made the encryption almost transparent and required some application changes, which is not ideal in many scenarios.
Since Acra 0.93.0 release, Acra supports fully transparent encryption. Acra decrypts and decodes binary data back to the original type, so the applications require zero changes and continue to work with strings and ints.
We call this feature “type awareness”.
When users configure Acra which data fields to encrypt in encryptor_config, they can specify the data type of each field, so Acra knows what to expect.
With type awareness, Acra not only hides cryptographic details from developers but reduces encryption integration cost and time.
Building type awareness was much harder than it looks like. It required special expertise in PostgreSQL and MySQL wire protocols, encoding, and encryption.
So, how to achieve it?
Curious to try Acra Community Edition without coding?
Database wire protocols #
Let’s describe how database wire protocols work in the parts related to querying data.
PostgreSQL and MySQL wire protocols support 2 types of queries: simple and complex.
PostgreSQL calls them a “simple query” and “extended query”. MySQL uses a different naming: “text protocol” and “prepared statements”.
PostgreSQL supports text and binary formats of query results for both protocols. By default, a simple query is not configurable and supports only text format, while an extended query protocol supports format selection (text or binary).
MySQL supports text and binary formats of query results according to a protocol type.
In general, these protocols share a similar approach. The first one looks as simple as possible via querying data by SQL query. The second one looks like SQL’s prepared statements but on a wire protocol layer.
The PostgreSQL protocols are a bit more complicated, and we will use them as a reference. Please note that all steps described below are true for MySQL too. Packets are named and look a bit differently, but the main approach is the same.
A. Simple query protocols #
Here is schema of packets for a simple query INSERT INTO users(id, email) VALUES ('1', 'example@cossacklabs.com') RETURNING id
:
The database driver (DB driver) sends one packet with a query and receives four packets in response at once.
In the first packet, the Row Description
, the database describes how many fields are in each row, their types, and encoding.
The second packet Data row
contains result rows with data.
The last two packets, Command completion
and Ready for query
, are the service packets signaling that the last commands were executed and the database is ready to accept a new command.
B. Extended query protocols #
Extended query protocols are more complicated:
The database driver sends a query in two steps: first—a query, next—parameters. Each step contains several packets to both sides.
In the first step, the driver sends 3 packets:
- The
Parse
packet with a queryINSERT INTO users(id, email) VALUES ($1, $2) RETURNING id
with placeholders$1, $2
instead of real values.
The driver gives a name to that query as a statement name
. It works as a prepared statement and can be re-used and referenced in the future. Optionally the driver may specify OID for each parameter or leave 0 that means “unspecified”.
-
The
Describe
packet—to request description for parameters and result rows. -
The
Sync
is a required packet from the driver for extended protocol and closes implicit transactions.
The database responds with 4 packets:
-
The
Parse completion
packet notifies that it is processed successfully. -
The
Parameter description
packet contains type identifiers OID expected from the database for each parameter. It describes to drivers what types are expected and how they should encode language-specific types into database-specific ones. -
The
Row description
packet contains the count of fields of every row in the result set, field or column name, and type identifiers. It helps drivers to understand what types they should expect and verify that they can decode them into language-specific types. -
The
Ready for query
packet signals about finishing processing of this block of packets and readiness to accept new commands.
In the second step, the driver sends values for parameters, asks to execute a query, and returns the query result:
- The
Bind
packet contains values for every placeholder, which format was used for encoding, statement name related to these parameters, and optionally specifies desired formats for the result rows and a portal name for the values.
The portal name is similar to a named cursor in PostgreSQL protocol. More information about the portal name you can find in the PostgreSQL documentation.
-
The
Describe
packet—to request description for the result rows (without parameter descriptions as in the first step). The driver should repeat it because the types can be changed according to real values of the parameters. -
The
Execute
packet signals—to execute a statement with passed values and link the result with the portal name. -
The
Sync
packet notifies about the end of this block of packets.
The database responds with 5 packets:
-
The
Bind
packet notifies about successful bind operation. -
The
Row description
contains the same or updated types according to the received parameters. -
The
Data row
packets contain the result values according to theRow description
metadata. -
The
Command completion
packet notifies that sending of theData row
packets is finished. -
The
Ready for query
packet notifies about finishing processing current block of commands and readiness to accept a new command.
Integrating Acra type awareness to the database wire protocols #
This is how databases communicate with applications through drivers.
While looking at these schemas, let’s find out how we can hide the fact that the data types on the database side differ from the application types. Since the extended protocol includes a simple query protocol, we will rely on a more complex extended protocol to cover all the packet types.
From the application side to the database side we can find Parse
packets including specified parameter types via OID. It is optional, usually drivers leave 0 values that mean “unspecified”.
On the other hand, the database sends the ParameterDescription
and RowDescription
packets and they are the most interesting for us.
The ParameterDescription
packet contains information about the types a database expects to receive as parameters according to declared table names in a query. Strict drivers rely on that values and validate ORM models and passed parameters into the function to query data. If the language-specific data types cannot be mapped correctly to the database type, then the database driver can raise an error.
The RowDescription
packet contains information about the data types of the result fields in the rows. The database drivers use metadata to validate the data types of a receiver’s variables / structs and check if they can be mapped from the database type.
Example: application uses simple text query protocol #
How does an application and a database communicate data types to each other?
Let’s take a look at a case with one table CREATE TABLE (id INTEGER PRIMARY KEY, email BYTEA)
. BYTEA
is a PostgreSQL type for binary data, MySQL uses BINARY
.
An application gets an email from the user, encrypts it (itself via Acra), and stores ciphertext in the database.
The simplest flow looks like this:
-
The application gets an
email
string and converts it to the language-specific binary type (bytes
for Python,bytea[]
for Java,[]byte
for Golang). Usually, string can be represented and used as is for binary arrays. -
The applications write email to the database. The database driver creates a query
INSERT INTO users(id, email) VALUES ($1, $2) RETURNING id, email
and pass it as arguments (1
,example@cossacklabs.com
).
Different drivers do different actions:
- encode values as SQL literals, escape string values if needed, and put them into the query string:
INSERT INTO users (id, email) VALUES (1, '[example@cossacklabs.com](mailto:example@cossacklabs.com)') RETURNING id, email
, eventually send the result using textual wire protocol; - use a binary protocol and send query in two steps: first, a query with placeholders, and then only data in text or binary format;
- send query as a prepared statement under the hood: sending it in two steps but explicitly using prepared statements as SQL commands.
Let’s continue with the first case.
A driver will encode email to a binary literal that is usually represented as HEX string: example@cossacklabs.com
-> \x6578616d706c6540636f737361636b6c6162732e636f6d
(PostgreSQL bytea literal).
-
The database parses a SQL query, extracts table’s name
users
, loads table’s schema from the internal system tables with a description of the column’s types declared inCREATE TABLE
query. Then the database understands that the literal1
has anINTEGER
type (but could beBIGINTEGER
,SMALLINTEGER
, etc.) andemail
isBYTEA
, and it can validate type restrictions. -
The database finds data and prepares to send it to the application. The database stores all data in a binary format for better efficiency. So, before sending, the data should be encoded into the textual format according to a wire protocol. The same system table is used to understand which data types should be used to encode the results. In our example, the system should send inserted
id
andemail
values. -
The database sends metadata of the response and the result data to the application. Database sends data in 2 packets.
First—metadata describing a query result data format (in our example, it’s a textual format, but it can be binary as well) and the result data types, so the application’s driver understands how to parse and decode the data. Usually, the data types and formats are represented as DB-specific identifiers.
Second packet—the result data according to the types and formats.
-
The application receives and passes these packets to the database driver. The driver parses the metadata and then decodes the result data to the supported language-specific types.
-
The application decodes the binary value of
email
to the string type to use string-related methods or prepare to render it in the user interface.
How Acra injects type metadata into simple query protocol #
If Acra knows the types used by the application, it can capture the data, encrypt, and encode it to the binary type used on the database side.
In a reverse direction, Acra captures data from the database encoded as binary values. Then it updates metadata in the first packet with the type’s identifiers to make them acceptable by the application, decodes data to the proper types (in our example it is TEXT), and sends it to the application.
With Acra type awareness, the application sends email
as a string value and gets back the same type. Under the hood, the email is encrypted, encoded to a binary type, then decrypted, and decoded back to its original type.
But it all happens transparently for the application and the database.
Example: application uses extended binary query protocol #
The steps described above are true for a simple query protocol with text encoding.
But most modern databases use extended query protocols in binary format to improve performance. The binary protocol is usually several steps longer than the text protocol.
Let’s see the example of an application using extended binary query protocol to talk to the database.
-
The application gets an
email
from the UI and converts it to the binary type. -
Then the application passes it to the database driver in the same way as described above. But now, the driver doesn’t escape values and puts them into the query as literals.
-
The database driver sends this query in several steps. First, it sends a query with the types of placeholders used in the query:
Parse[driver] -> ParseComplete[db]
. Here thisemail
will be sent asBYTEA
type.
After that, the driver usually asks the database to return the types for the placeholders and types for data in the final response to the query.
The database responds with two packets: Describe[driver] -> ParameterDescription[db] -> RowDescription[db]
.
Now, the driver should send the data to use it instead of the placeholders encoded into the types declared before: Bind[driver] -> BindComplete[db]
.
Then, the driver asks to describe the result’s types as they can be changed according to passed parameters. For example, the result type can be changed if the application uses CASE statements in the SELECT
.
The database responds with the types as described before: Describe[driver] -> RowDescription[db]
.
Finally, the application sends a command to execute a prepared query with the passed parameters and returns the result rows: Execute[driver] -> DataRows[db]
.
- And at the end, the application receives the data as described in the previous case of a simple flow and converts
email
to the string type.
How Acra injects type metadata into extended binary query protocol #
So, this flow is more complicated than the first one. What should Acra do to hide the encoding / decoding steps and allow the application to work with email
as string, while the database stores it as BYTEA
?
Acra adds type awareness carefully handling the next metadata:
- Captures the
Parse[driver]
packet and tamper the metadata about the placeholder’s types if any.
The application wants to send the string types, the database wants to receive the BYTEA
types as they were stored. So, Acra captures the metadata of the packet, looks at the encryptor_config
, and finds columns that should be encrypted, then Acra replaces the types for all the related placeholders and puts a BYTEA
type identifier instead.
- Capture the
ParameterDescription[db]
andRowDescription[db]
packets.
With the ParameterDescription[db]
packet, the database says that it expects to receive a BYTEA
type for the encrypted columns. Acra changes these identifiers to the proper application’s types described in encryptor_config
. In our case, it is changed to TEXT
, and the application does not have any troubles.
Note, that different database drivers behave differently: some drivers don’t ask to describe these types or just ignore them. But strict drivers request and verify that in the application (f. e. in the ORM’s models) the types correspond to the database’s types and raise an exception if they don’t.
With the RowDescription[db]
packet, the database responds with the result’s types. In our case, it returns id
and email
. This packet describes that id
has an INTEGER’s type identifier and email
has a BYTEA
type. So, Acra captures, recognizes that email
is an encrypted column, and replaces the result’s type to the application expected type.
- In the end, Acra decrypts the data in the DataRows and encodes it into the correct format.
The database driver can ask the database to encode the data into the text or binary format. It usually applies to integers, as the strings are formatted equally in both formats.
Integers are encoded into 8/16/32/64 bit values in a binary format (for example, 0x11223344
) or into decimal numbers in a text representation (for example, 123
). After the decryption, Acra encodes these values into the correct format.
We work with companies on demanding markets.
Read how we use Acra to protect data in critical infrastructure
Summary #
Encryption is complicated, transparent data encryption is even more complicated.
With the latest 0.93.0 release, Acra introduced a “type awareness” feature. Type awareness makes integrating field level encryption transparent for the database or the application. The users don’t need to change the application code or database schemas—only integrate Acra and configure which fields to encrypt and in what format.
Under the hood, Acra carefully parses database wire protocols, injects required type values, encrypts, encodes, decrypts, and decodes data, but it’s all hidden from developers eyes.
Our goal is to make strong data security accessible for developers without a cryptographic background. So developers can build their products and be confident that data security has been taken care of.
Acra Community Edition is free to use. Try our new engineering examples on GitHub to see how type awareness works in various examples.
Find out the description of new configuration settings related to type awareness in Acra documentation.
Switch to Acra Enterprise Edition, when your solution grows and requires elevated security guarantees.