FHE and machine learning: a student perspective
When it comes to both protecting and using sensitive data, fully homomorphic encryption is a technology that offers security and peace of mind.
Introducing an intern to the world of encrypted data science
By maintaining full cryptographic security for data at every stage, including during processing, FHE enables data scientists and developers to collaboratively apply common data analysis tools to the most valuable information without the risks that we commonly associate with data sharing.
Applying machine learning models under this paradigm introduces some additional hurdles that need to be overcome. However, modern FHE libraries are making the technology increasingly easy to use, without the need for expert knowledge of cryptography.
To demonstrate this, Optalysys offered an undergraduate internship over the summer of 2022. The goal of this internship was to introduce a student to the world of encrypted data science, and see what was possible given the current state of the field.
Our intern for 2022, Peter Li, was a second-year (now third-year) undergraduate studying engineering at Cambridge University. Peter’s work focused on the deployment of common machine learning models in the FHE space, and his article below covers the application of the Zama Concrete library to several different network models and datasets.
For data scientists looking to make the transition into working safely with highly sensitive information, this article features examples of the implementation and execution of several different machine learning techniques in encrypted space.
More broadly, if you’re an organisation looking for the next significant step in data science, Peter’s work demonstrates that developing useful FHE implementations is now well within scope for data specialists. When paired with Optalysys technology that allows FHE to be deployed at scale and over large data sets, the market for novel applications and services running on previously untouchable data will be primed for massive growth.
That’s all from us at Optalysys for now; we’ll leave you with Peter’s article below. We’ll be back with our own work on FHE and optical computing in the near future.
Learning patterns from existing data
Machine learning is a domain of AI dealing with systems that can learn patterns from existing data, and then use this learnt information to make predictions about new data. Since 2010, incredibly rapid progress has been made in both the fundamental data science and deployment of machine learning, particularly in the associated domain of deep learning. So what is it that has pushed machine learning to become so prominent?
There are many reasons why, but a good starting point would be our own abilities with respect to data. Humans are conditioned to think and work in three dimensions: this means we can readily visualise and understand the relationships in datasets with three or fewer variables.
Unfortunately, many useful sets of data collected in the real world do not fit that criterion. Datasets with tens or hundreds of variables are commonplace in data science, and although techniques exist to help, it would be very difficult for a human to grasp all the key relationships between the variables. And if you were dealing with, say, a multi-megapixel image for computer vision, you’d be looking at millions of variables. Needless to say, trying to hand-programme an effective general image recognition algorithm would be a hopeless endeavour.
This is where machine learning comes in. The ability of computers to pick up on relationships that are otherwise locked away in complex multivariate datasets, and then make predictions based on that information, is incredibly useful for countless applications spanning many different industries. These range from features we take for granted in our daily lives, such as the personalised recommendations served by the likes of Netflix and Spotify, to things we rarely think about, such as analysing market data for automated high-frequency trading.
We often attribute the popularity of ML to recent advances in deep learning. That is part of the story, but ML can be deployed at such a wide scale thanks to broader improvements in computing infrastructure, as well as an unprecedented abundance of data.
Back in 2012, IBM estimated that 90% of all the data humanity had ever generated had been created in the preceding two years. The rate of data creation has only accelerated since then. Data is essential for training machine learning algorithms, and increasingly powerful models have come into existence as we harvest ever more of it.
More powerful models also require vast quantities of storage and compute power, especially in the case of deep learning. More data also means more infrastructure is needed for its storage, processing and (secure) transmission. As a result, there has been an increased use of cloud computing for machine learning applications.
The rise of cloud computing, alongside ever-improving computer hardware, addresses the problem of ML’s high resource demands. It allows companies to outsource the intensive computation associated with ML to third-party servers on a ‘pay for what you need’ basis, saving them the capital cost of buying new hardware. With the growing popularity of machine learning, the increasing complexity of deep learning models, and the ever-increasing quantities of data involved in both training and inference, this migration of machine learning to the cloud is unlikely to slow down. Various large cloud providers now offer Machine Learning as a Service (MLaaS), helping machine learning teams worldwide with tasks such as data preprocessing, model training, tuning and evaluation, or simply Inference as a Service (IaaS). Providers include Microsoft Azure, AWS, Google Cloud Platform and IBM Watson.
A move to the cloud comes with privacy and security concerns. For many applications of ML, the data which serves as the foundation for training or making predictions may be sensitive information that its owners would like to keep confidential. An example would be predictive diagnostics in healthcare: if the machine learning model is hosted in the cloud, the hospital must send sensitive health data to a third-party cloud provider. Another example would be predictive credit scoring, which involves sending large quantities of sensitive financial data to a third party.
The involvement of a third party increases the risk of data leaks considerably. Even though data is encrypted during transmission, it must be decrypted once it reaches the third-party provider where the ML model is hosted in order to be used. This creates a window for the sensitive data to become compromised, whether through exfiltration by malicious insiders or external hacking attacks. In 2019 alone, there were over 5,000 confirmed data breaches, with billions of personal records leaked. When cloud providers are given access to large quantities of potentially sensitive data, such risks are not negligible.
There are of course techniques for preprocessing data to obfuscate the confidential parts of it, but these cannot be solely relied upon. Too much restriction, and the data loses much of its utility. Too little, and there is a significant risk of de-anonymisation.
Promising PETs
Fortunately, there are promising privacy enhancing technologies (PETs) in development that offer the prospect of being able to enjoy the benefits of machine learning in the cloud without introducing the risks that lead to privacy concerns.
The biggest game changer in this category is FHE — Fully Homomorphic Encryption. The concept of homomorphic encryption is straightforward to understand: you can perform whatever calculation you want directly on the encrypted data, without having to decrypt it first.
When you do finally decrypt it (usually when it’s back in safe hands, such as with the original data owner), the output is the same as if you had performed the calculation on the underlying cleartext.
As simple as it sounds, this makes FHE the Holy Grail of cryptography: we can currently encrypt data securely in transit and in storage, but not while it is being processed. Much of the vulnerability of data in the cloud stems from this gap, and FHE closes it.
Once confidential data is sent to the cloud, it remains encrypted for its entire journey, and any computation can be performed on the ciphertext. When the result (still encrypted) is sent back to the owner of the data, they can decrypt it with the secret key, to which only they have access, and obtain the desired output.
The idea of homomorphic encryption has been around for decades, since the first work following the introduction of RSA (which is itself partially homomorphic) in the late seventies.
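To see what ‘partially homomorphic’ means in practice, consider textbook RSA: multiplying two ciphertexts produces a valid encryption of the product of the two plaintexts. Here is a toy illustration in Python, with parameters far too small to be secure:

```python
# Toy "textbook" RSA, purely to illustrate the multiplicative homomorphism.
# These parameters are far too small to be secure.
p, q = 61, 53
n = p * q             # public modulus: 3233
e = 17                # public exponent, coprime to (p - 1) * (q - 1)

def encrypt(m: int) -> int:
    """Textbook RSA encryption: c = m^e mod n."""
    return pow(m, e, n)

m1, m2 = 7, 12

# Multiplying two ciphertexts gives a valid encryption of the product
# of the plaintexts: (m1^e * m2^e) mod n == (m1 * m2)^e mod n.
assert (encrypt(m1) * encrypt(m2)) % n == encrypt(m1 * m2)
print("ciphertext product decrypts to plaintext product")
```

RSA supports only multiplication of ciphertexts; a fully homomorphic scheme must support both addition and multiplication, from which arbitrary computation can be built.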
However, though partially homomorphic and somewhat homomorphic encryption schemes appeared over the years, it wasn’t until 2009 that Craig Gentry published the first ever Fully Homomorphic Encryption scheme. Since then, considerable progress has been made by researchers on newer generations of FHE schemes, but technical challenges have kept FHE largely confined to research papers.
By far the biggest of these barriers is speed; current FHE schemes are impractically slow when executed on conventional computers — up to 1,000,000 times slower than the equivalent operation performed in plaintext. Despite these challenges, significant progress is being made in bringing the advantages of FHE closer to mass adoption than ever.
Another key appeal of current FHE schemes is that they are quantum safe: their mathematical basis, the Learning with Errors (LWE) problem, is believed to be secure even against attacks by quantum computers, to which most contemporary public-key encryption techniques are vulnerable.
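For the mathematically inclined, an LWE instance can be stated in one line. A sketch, stated informally with the symbols as usually defined in the literature:

```latex
% (Search-)LWE, informally: given many samples (a_i, b_i), where each
% a_i is drawn uniformly from Z_q^n and e_i is a small random error,
%   b_i = <a_i, s> + e_i  (mod q),
% recover the secret vector s in Z_q^n.
b_i \;=\; \langle a_i, s \rangle + e_i \pmod{q}
```

Recovering s from these noisy inner products is believed to be hard even for quantum computers; no efficient quantum algorithm is known, in contrast to factoring-based schemes such as RSA, which fall to Shor’s algorithm.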
Optalysys are developing specialist hardware for tackling the speed issues of FHE, combining the capabilities of optical computing and silicon photonics with electronics designed around FHE workloads. Efficient hardware is itself a holy grail for FHE, as conventional computing is simply never going to be well suited to the kind of calculations it requires.
On the developer side, there are a number of increasingly advanced open source libraries that support FHE computation, including several from big players in the tech world: SEAL from Microsoft, HElib from IBM, and the Google FHE transpiler. We are also particularly interested in the OpenFHE project, spearheaded by Duality, which aims to unify most of the current FHE schemes and features as well as develop new ones, such as scheme switching. The library we use in this article is Concrete, developed by the French company Zama.
The key mission of Concrete is to allow developers to easily convert programs to run with FHE. The specific FHE scheme on which Concrete operates is called TFHE; whenever we talk about the properties of FHE in the context of this article, we are specifically describing the properties of TFHE.
The core TFHE library is written in Rust, but Zama also provides two Python APIs: Concrete Numpy, which converts regular Numpy operations into equivalent FHE circuits, and Concrete ML, an API designed for machine learning developers that can convert entire ML models directly into FHE.
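As a flavour of what this looks like in practice, here is a minimal sketch using Concrete Numpy, roughly following the 0.x API available at the time of writing (the exact syntax may differ between versions):

```python
import concrete.numpy as cnp

# Declare which inputs are encrypted; Concrete traces the function
# and compiles it into an equivalent FHE circuit.
@cnp.compiler({"x": "encrypted"})
def double_plus_one(x):
    return 2 * x + 1

# The compiler uses a representative input set to size the circuit.
inputset = range(16)
circuit = double_plus_one.compile(inputset)

# Encrypt, evaluate homomorphically on the ciphertext, then decrypt.
assert circuit.encrypt_run_decrypt(5) == 11
```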
In the rest of the article we will focus on the framework Concrete ML, and how it offers a glimpse into a future where data can be passed to machine learning models without ever compromising privacy.
Despite still being in the alpha stage of development, Concrete ML has already proved remarkably easy to use. It allows a user to convert entire machine learning models written in the popular ML frameworks Scikit-learn and PyTorch directly into their FHE equivalents, sometimes with only a few extra lines of code.
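For instance, converting a scikit-learn logistic regression looks roughly like this. This is a sketch based on the Concrete ML 0.x API; parameter names such as `n_bits` and `execute_in_fhe` may have changed in later releases:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Concrete ML mirrors the scikit-learn interface.
from concrete.ml.sklearn import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Training happens in the clear, exactly as with scikit-learn; the model
# is quantised (here to 8-bit values) so that it can later run under TFHE.
model = LogisticRegression(n_bits=8)
model.fit(X_train, y_train)

# Compile the quantised model into an FHE circuit using representative data.
model.compile(X_train)

# Encrypted inference on a single sample (FHE execution is slow, so we
# would not run the whole test set this way); flag name as in the 0.x API.
y_pred = model.predict(X_test[:1], execute_in_fhe=True)
```

The point to notice is how little changes: the import and the quantisation and compilation steps are essentially the only additions to a standard scikit-learn workflow.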
Machine learning is often split into the categories of supervised, unsupervised and reinforcement learning. It is worth noting that Concrete ML currently (version 0.2, but version 0.3 has since been released) only supports supervised learning, where the model is trained using labelled data.
Here, we put together three simple machine learning models that tackle different problems reflecting real-world applications for FHE. All three are classification problems, spanning both binary and multiclass classification. We will present one example from each of the categories below:
- Linear models: Logistic Regression
- Tree-based models: XGBoost
- Deep-learning: Fully Connected Neural Network (MLP)
Our first example uses the Breast Cancer dataset, a very simple dataset that can be readily imported from the Scikit-Learn library and is commonly used for testing ML algorithms on binary classification. While we are applying a very simple model to a small dataset, the example links closely to an area where we see great potential for FHE deployment: diagnostic analytics.
Machine learning models have become increasingly adept at predicting the likelihood of a patient having particular diseases, especially with the ever-growing wealth of medical data available. However, medical data also tends to be confidential in nature, and there are increasing privacy concerns as patient data becomes more and more digitised.
Below, we demonstrate how FHE can be used to keep data encrypted as it passes through a machine learning model for inference. For the sake of brevity, we’ll only focus on certain operations; an associated Jupyter notebook can be found here. We’ll start with our pre-processing of the Scikit-Learn Breast Cancer dataset.
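The notebook contains the authoritative code; the opening steps look roughly like this (a sketch, assuming a standard train/test split and feature scaling; the exact choices here are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# The Breast Cancer dataset: 569 samples, 30 numeric features,
# each labelled malignant or benign (binary classification).
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set on which to evaluate (encrypted) inference.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardise the features; quantisation for FHE behaves better when
# the inputs share a common scale.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```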