[16 minute read]
Information is the lifeblood of the modern internet. As of January 2021, 4.66 billion people were active internet users, each one sending and receiving a torrent of data.
Quite of lot of this data is (relatively speaking) harmless and can be shared and accessed to various degrees. A vast and lucrative market around analysing this data has sprung up, taking advantage of this highly accessible information, with associated concerns about how this data is used.
However, what about information that is both valuable and sensitive? Too sensitive to run the risk that an unauthorised party could even see it, let alone work with it?
There are plenty of cases like this, and the standards for how this data can be used and shared are exceptionally high. This can be a pretty significant barrier to getting the full value out of that information. Consider the following example:
A man walks into a bank and wishes to open an account. The bank is legally obliged to ensure that the money entering that account isn’t from the proceeds of crime, so the first thing they must do is establish who this man is and where the funds have come from. This process is known as KYC or Know-Your-Customer, a key part of Anti-Money Laundering (AML) compliance.
Sounds simple right? Companies: banks and other financial institutions; keep computerised records, so surely verification is simply a matter of searching the relevant databases?
In fact, such checks are a headache for all involved because of the need to preserve privacy. For example, in the EU, GDPR legislation imposes tight restrictions on how personal data can be used and shared. And that is entirely understandable, because the damage that can be done with the important information is enormous.
And all of this is down to a problem with encryption.
Information can be stored in encrypted form when travelling or at rest, but under current encryption methods, the data must be decrypted when it is being processed. That means that if you want to search a central repository of information, the information has to (at some point) be decrypted.
Securing a large-scale repository of unencrypted information against intrusion is already extremely hard; being able to safely grant access to external parties is even harder, so information sharing between companies and other record-holders is generally a slow process.
The solution to this (and many other problems) is Fully Homomorphic Encryption, a new model of cryptography that allows computation on encrypted data. Under FHE, an encrypted database can be searched for information without the need to decrypt the information at any point. At a stroke, a major barrier to safe information management can be eliminated, although not without a cost; the computational nature of FHE makes it notoriously slow, even on modern computing systems.
However, FHE is only slower than unencrypted processing from a computational standpoint. When compared against the lengthy (and not to mention still fundamentally insecure) processes that currently ensure data security, an FHE-enabled world will not only guarantee privacy but will also work a lot faster in general.
With optical computing, we can start to address the computational overhead too; now the FHE world will move even faster, breaking down the limits imposed by current encryption methods. However, to reach that point, we’re still going to need the software tools and the broader algorithmic understanding to make this a reality.
So, in the interests of illustrating the value that can already be unlocked, this article is dedicated to the task of implementing an encrypted search operation using Zama’s Concrete Boolean library, and running it with the assistance of optical Fourier transform hardware.
Visualisation of the core of our optical Fourier transform system, a chip-scale hybrid CMOS/Photonic processor for ultra-fast Fourier transforms. In this article, we use our system-level simulators (developed for our Beta program) to execute the Fourier transforms required for FHE operations in Concrete-Boolean, allowing us to provide a projected benchmark for the benefits of this technology in performing an encrypted search.
Concrete Boolean is the library we applied in our last article, where we used it to execute Conway’s Game of Life in encrypted space. This library provides us with the programming tools that we need to start working with the computational properties of FHE in a direct and secure way.
Indeed, the speed with which we’ve been able to start using it is testament to the power that it places in the hands of users. Concrete Boolean was released about a month ago; in that time, we’ve been able to put out two articles on non-trivial applications. It’s an extremely encouraging development for the field in general.
Some FHE applications (such as Conway’s Game of Life) are interesting. This is an application which is useful. By using our optical Fourier transform approach, we can then take that useful application and make it run muchfaster than it would on conventional electronics.
How much faster? We put the string searching model we built to the test on our simulator architecture, which replicates the functionality of our physical demonstrator systems. Under the parameters of the commercial-grade system we have designed, this example search algorithm can be made to run 60x faster when the Fourier transforms are executed by an optical process.
Search algorithms are common tools for information retrieval and assessment. For example, consider the way that email is currently filtered. This usually involves a range of different techniques, but one of the simplest, and earliest methods (the “naive Bayes” classifier) involved detecting key words or phrases. For example, emails containing the following in the subject line or message body:
Are highly unlikely to be legitimate, and may be actively harmful. These phrases are made of sequences of characters (like any other text), which in computing are represented as strings, sequences of bits that represent characters. For example, the “$$$” string in ASCII encoding is a sequence of bytes (8 bit sequences), each of which represents a “$” character:
00100100 00100100 00100100
A straightforward key-word based approach to filtering means that detecting this sequence in an email indicates it is likely to be spam.
This kind of identification is an example of string searching. String searching is an extremely common task in computing; if you’ve ever searched a document for a particular word or typed a message on a phone that auto-predicts what the next word will be, you’ve seen it in action. We’ll use strings as an example in this article, but the approach we take to encrypted searching can be broadly generalised.
As we did with the Game of Life application in our previous article, we’ll start by exploring approaches for unencrypted information.
The fundamental problem of string searching can be described compactly as:
“Given string S1 of length n and search string S2 of length m where m ≤ n, find all instances of S2 in S1”.
For example, finding the word “fox” in a longer string:
The simplest way to approach this problem is to iteratively compare sections or “windows” of length m of S1 against S2. Starting from the beginning of S1…
…we check to see if these characters match and then move on to the next sub-string. Spaces count as characters, so we end up having to check these too.
Eventually, we’ll reach the correct pairing. If we’re checking a long document for multiple instances of a word, we’ll have to continue on like this to the end, making a note of the indices where matches were found.
This is the naive approach; the first thing that would come to mind. However, consider a different S1 and S2:
Now the string that we’re searching for is right at the end of S1, which means that the total number of checks that we have to perform in the worst-case leads to a runtime complexity of O(nm).
What if there were other ways of representing string information in a searchable form?
If we’re allowed to consider a dictionary of words, we can also use a structured approach. Consider the previous string:
This was the worst-case scenario for the naive example. If we’re allowed to operate on the assumption that these words are stored in a dictionary, there’s a faster way of storing and representing information in the form of a Trie data structure. When we load the string, we can split it on the basis of the “ “ (space) delimiter into a dictionary:
We can then build up a tree structure by taking each word and adding each character to sequential nodes. Each node can itself contains a list of other characters, like so:
The cost of building this structure in runtime complexity is linear, i.e. O(n). This is because, in the worst case, we will have to add one node to the trie per character in the dictionary of words.
By searching over this structure, we can notionally reduce the search time significantly. For example, if we want to search to see if the word “fox” is present, we now only have to perform at worst 3 operations as we search through the nodes. We first query the root node to see if it contains the “f” character, if it does not, then we know “fox” cannot exist in the set of words. If we do find an “f”, then we perform similar checks recursively on the node containing “f”, until we find the terminal character.
The cost of checking a node for a character is O(1) if we use an efficient structure to store subnodes e.g. a hashmap. This means that the runtime complexity of searching for a given word in the trie is O(m) where m is the length of the word we are searching for.
Tries are also more compact than a straightforward dictionary, as many prefixes are duplicated (e.g the sequence “Fo”, which starts many words such as “for” “fortune”, “Fourier” and so on, only needs to appear once). Overall, this approach works very well; tries (or similar approaches) are foundational data structures for tasks like auto-complete and spell-checking.
However, the catch is that implementing such a structure also involves a lot of decisional branching, and that’s an issue for FHE. Over the course of an encrypted computation, there’s no easy way to perform the kinds of comparison that are needed to execute a jump to another section of the computation. So to begin with, we’re going to settle for the naive approach.
That’s not to say that the idea of using a better-than-naive approach to searching is impossible with FHE; we can certainly improve upon things, so after we’re done with the naive method, we’ll also outline an alternative approach that falls somewhere between the naive approach and the more advanced trie method.
Thus far, we’ve described string searching in terms of characters, but paid little attention to the way that computers actually work. As we mentioned before, computer’s don’t see characters; each character is represented by a sequence of bits.
Bits can be 1 or 0, so we can implement a string check via Boolean operations. This is in fact what’s going on behind the scenes whenever we perform a string check; it’s just that there’s usually one or more layers of abstraction between the code we write and what’s actually going on.
If we have a word such as “fox” and we’re checking through strings where the characters are encoded with ASCII, then each character is represented by an 8-bit value. There are two Boolean operations which are useful here.
The first is the “exclusive not-OR” or XNOR gate. Here’s the inputs and outputs (the “truth-table”) for such a gate.
And the second is the AND gate
We can combine the utility of these gates like so; Let’s say we have two sequences of bits, e.g
S1 = “011000011101110110000101”
S2 = “01110110”
And we want to find S2 in S1 via the naive sliding-window approach. For each comparison window, we can XNOR each bit together and then apply a cascade of AND gates to the output. If the bit sequences don’t match…
… then the output of the final AND bit will always be zero
The only time this process will return a non-zero output bit is if all the bits match:
This allows us to reduce a query over multiple binary values to a single binary outcome.
This by itself is a pretty big deal! It’s not a new idea, but it is a key piece of understanding before we leap into encrypted string searching, so we’ll stop for a moment and consider what we can do with it.
Thus far we’ve phrased everything in terms of searching a single string of text to see if it contains another string. That’s fine; it’s been a useful tool for introducing us to the link between Boolean operations and the way that information is represented on computers.
However, beyond this, if we can convert the problem of searching human-readable text strings into a Boolean model, we can implement this approach to searching over other data types too. This is what we need to begin developing the more complex applications of encrypted search.
In fact, some of these data types are actively easier to work with than text. For ASCII text representations we’re working with a byte per character, which means comparing 8 binary values. That’s not a big deal on a modern machine, but when working with encrypted FHE operations, every additional Boolean gate we need to apply translates into meaningful extra work.
Text requires an 8-bit representation because written human language contains many different symbols. English contains 26 lower case characters, 26 upper case characters, punctuation, and various other symbols that are used often ($!& etc). If we’re representing the complete English character set with unique binary values, then of course we need quite a few of them.
However, language doesn’t always mean human communication; language in the broader sense can be seen as any formal system of symbols.
For example, DNA is formed from 4 nucleotide base pairs: adenine, cytosine, guanine and thyamine. This is an example of a symbolic language that only requires 2 bits to represent the full range of characters. In light of the methods we use in our example, an encrypted search over a DNA sequence would be much faster on a per-character basis than over a text string.
We’ll likely come back to this idea in a later article, but for now we’ll move on to how we translate this into an encrypted search.
To perform an encrypted string search, we can combine Concrete-Boolean with the Boolean comparison process. In our example, we’ll be searching the S1 string
We begin by encrypting each individual bit of each character in both S1 and S2 as a ciphertext. Each discrete character (e.g “l”) can now be thought of as being represented by a vector of 8 ciphertexts.
To compare two characters, we first have to perform an encrypted XNOR operation between each ciphertext just as we did before.
If we do this for each character in S2, we now have a collection of ciphertexts containing the output of all the XNOR operations between the section of S1 and S2. We can then apply encrypted AND operations to these ciphertexts. This will return a single ciphertext containing an encryption of either “0” if the strings don’t match or “1” if they do.
This is just like our initial Boolean search method, but now everything is encrypted. The only information that can be extracted is whether the S2 string is present in S1 and, if so, where it is in S1. At the same time, the only person or organisation that can retrieve this information is the holder of the decryption key.
Let’s see how this works in practice.
We first import the elements we will use from the Concrete-Boolean library: the
gen_keys function used to generate the keys, and the
ClientKey types. We also import Concrete’s
CryptoAPIError type and define the number of bits per character.
The next thing we need is a function to encrypt a string into a vector of ciphertexts. The simplest approach is to convert the string to bytes, loop over the bits, and encrypt each of them. Since Concrete-Boolean works with Boolean values rather than integers, we identify 0 with False and 1 with True.
This step is not strictly needed, but will make error handling easier. The string function we will write may fail for different reasons. Some of them are due to how Concrete-Boolean works, and will result in a
CryptoAPIError. But others may not be related to the cryptography algorithm at all. For instance, one of the strings may be empty, and we choose to throw an error in this case. To deal with these different possibilities, we define a custom error type,
FHEError. We also implement the
From<CryptoAPIError> trait so that a
CryptoAPIError can be converted to an
Finally, the function
search_word takes a server key and two sequences of ciphertexts
b as arguments. If
a is empty, it returns an error. Otherwise, it returns an encryption of
true if the string used to generate
a is in that used to generate
b or of
As an example, the code shown below will look for the words “like” and “hate” in the sentence “I like making FHE easy!”.
Here is the result:
As we mentioned earlier, this algorithm is not the most optimal when working with plaintext data: it is much better to first split the list into words, assemble them into a trie structure, and use it to search each word. There are two reasons why this is not possible in (or, at least not easily transferable to) FHE:
First, using FHE, we can’t use conditionals based on the plaintext data to terminate the calculation early. This is a fundamental property of all well-designed FHE schemes, as the time needed to perform a calculation would otherwise leak information about the plaintext data.
Second, if we want to leak no information at all about the string, we can’t even provide the function with the length of each word, as a statistical analysis on these lengths could reveal something about the text. (For instance, some technical fields are more prone to long words than others.)
However, sometimes faster computation may be worth leaking some information about the original string. If the string is short enough, there is only so much that a statistical analysis will reveal. And, for longer strings, one could imagine padding them with random words to partially hide the information that word lengths reveal.
In cases where this trade-off is acceptable, one can split the string into individual words, encrypt them, and check a word to be searched against each of them sequentially. For additional security, one may pad or crop each word to a fixed length, so that the only information the server gets is the total number of words. This has the added benefit over the above version that the server does not even know the length of each word to be searched for. The complexity of the search algorithm then becomes O(m l), where m is the number of words in the string and l is the fixed length.
For completeness, we give here the full lib.rs file (a large part of which has already been explained above):
And the associated main.rs file: