Amazon strives to be the most customer-centric company on Earth. This powerful tenet drives everything we do at Amazon, every single day. We work backwards from customer needs not only when designing each of the products and services we proudly offer, but also when engineering all the systems and processes that power our worldwide operations.
It is no secret by now that all these systems and services use machine learning techniques to constantly improve over time. Our customers benefit from these improvements in many ways: better personalization and recommendations on Amazon Fresh, Music, and Prime Video; more accurate speech recognition and question answering in Alexa devices; faster delivery for all our retail offerings — to name just a few.
Broadly speaking, machine learning techniques help us to discover useful patterns in the data and to leverage these patterns to make better decisions on behalf of our customers.
One aspect of our data-processing systems that isn't frequently shared is this: how do we make sure that Amazon’s customer data is protected throughout the entire process of ingestion, transportation, storage, and, finally, processing and modeling?
The short answer is that we use state-of-the-art privacy-enhancing technologies. However, this answer only reflects the technical side of a much larger picture. At its heart, the design and implementation of every privacy-enhancing technology we use at Amazon is inspired by our relentless customer obsession. This principle directs us to act with utmost respect for our customers’ privacy and is embedded in every aspect of how we handle customer data. In practice, this translates into a set of company-wide processes and policies that govern how every single data record is processed and stored inside Amazon's systems.
These data handling policies specify, for example, the cryptographic requirements that any system handling customer data must satisfy, both in terms of communication and storage. They also specify how such systems handle authentication inside Amazon's corporate network, effectively preventing any employee or system from accessing customer data unless such access is absolutely necessary to perform a critical business function.
Compliance with these policies is enforced and monitored through the entire life cycle of every system and service, from design to implementation, beta-testing, release, and run-time operations. Making sure existing systems operate in accordance with the highest standards in data protection is the everyday job of thousands of engineers at Amazon. At the same time, scientists and engineers are focused on continuously innovating, allowing us to bring better products and services to our customers.
One of the areas within the field of privacy-enhancing technologies where we are innovating on behalf of our customers is differential privacy, a well-known standard for privacy-aware data processing. Differential privacy provides a framework for measuring and limiting the amount of information about individuals in a population that can be recovered from the output of a data analysis algorithm. Technically speaking, differential privacy protects against membership attacks: a hypothetical adversary privy to the result of a data analysis algorithm will not be able to determine if the data of a particular individual was used in the analysis.
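For readers who want the formal statement behind this guarantee, here is the standard definition of (ε, δ)-differential privacy, written in the notation commonly used in the literature (the symbols below are standard and not specific to this article): a randomized algorithm M is (ε, δ)-differentially private if, for every pair of datasets D and D' differing in the data of a single individual, and for every set of possible outputs S,

\[
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon}\, \Pr[\, M(D') \in S \,] + \delta .
\]

Smaller values of ε and δ correspond to stronger privacy guarantees.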
In the context of machine learning, differential privacy ensures that algorithms can learn any frequent patterns in the data while preventing them from memorizing concrete details about any specific individual in the dataset. For example, using differentially private machine learning to analyze the commuting patterns of individuals within a city would yield a model reflecting all the routes frequently used by a significant fraction of the population, but would not remember the commute patterns of any specific individual.
This example shows how differential privacy offers strong protection to individuals, while at the same time allowing data analysts to perform their jobs effectively. Furthermore, differential privacy provides tools to quantify the trade-off between the fidelity of the patterns being recovered and the level of privacy offered to each individual in a given dataset. Such a trade-off is an inescapable premise in the scientific foundations of data privacy: it is not possible to make an algorithm differentially private without degrading its utility (in the example above, the accuracy of the patterns being recovered). Therefore, the crux of making differential privacy a useful technology resides in understanding and optimizing the privacy-utility trade-off in each particular application.
This leads us to a paper that will be presented this summer at the Thirty-fifth International Conference on Machine Learning (ICML) in Stockholm, Sweden. The paper studies one of the basic building blocks of differential privacy, the so-called Gaussian mechanism: a well-known method that privatizes a data analysis algorithm by adding noise drawn from a Gaussian distribution to its output.
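In symbols (again using the standard notation of the field rather than anything specific to this article), the Gaussian mechanism releases a noisy version of a function f of the data:

\[
M(x) = f(x) + \eta, \qquad \eta \sim \mathcal{N}(0, \sigma^{2} I),
\]

where the noise scale σ is calibrated to the sensitivity of f, that is, the largest amount by which the data of a single individual can change the value of f.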
For example, in the context of analyzing commute patterns, one way to use this idea is to build a model that counts the number of daily commutes between every pair of points in a city, and then add Gaussian noise to each of these counts. The amount of noise, controlled by the variance of the Gaussian distribution, should be calibrated to mask the contribution of any particular individual's data to the final result. This approach has been known for many years to provide a certain level of differential privacy, but it was not clear whether the method was optimal in the sense of the privacy-utility trade-off.
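As a concrete illustration, here is a minimal Python sketch of that classical recipe, assuming (ε, δ)-differential privacy with ε < 1 and a sensitivity of 1 (that is, one individual changes at most one daily count by at most 1); the function name and the toy commute counts are hypothetical, chosen only for this example.

import numpy as np

def classic_gaussian_counts(counts, sensitivity, epsilon, delta, rng=None):
    # Classical Gaussian-mechanism calibration (valid for 0 < epsilon < 1):
    # sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon.
    rng = rng if rng is not None else np.random.default_rng()
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    # Add independent Gaussian noise to every commute count before release.
    return counts + rng.normal(0.0, sigma, size=np.shape(counts))

# Hypothetical daily commute counts for three origin-destination pairs.
noisy_counts = classic_gaussian_counts(np.array([120.0, 48.0, 300.0]),
                                       sensitivity=1.0, epsilon=0.5, delta=1e-5)

Note how σ grows as ε and δ shrink: stronger privacy requires more noise, which is exactly the privacy-utility trade-off discussed above.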
In this new paper, we show that, in fact, the method researchers have been using to decide how much noise to add presents a fundamental limitation, leading to a sub-optimal trade-off between accuracy and privacy. Our new method relies on a deeper mathematical analysis of the noise calibration question, and obtains the optimal trade-off.
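To give a flavor of what such a calibration can look like, here is an illustrative Python sketch that numerically searches for the smallest noise scale satisfying an exact (ε, δ) condition expressed through the Gaussian cumulative distribution function. This is our own simplified rendering under that assumption, not the algorithm from the paper; the paper derives the precise condition and a more direct calibration procedure.

from math import erf, exp, sqrt

def std_normal_cdf(t):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def achieved_delta(sigma, sensitivity, epsilon):
    # Delta achieved by Gaussian noise of scale sigma, assuming the tight
    # condition takes the form Phi(a) - e^epsilon * Phi(b) with a, b as below.
    a = sensitivity / (2.0 * sigma) - epsilon * sigma / sensitivity
    b = -sensitivity / (2.0 * sigma) - epsilon * sigma / sensitivity
    return std_normal_cdf(a) - exp(epsilon) * std_normal_cdf(b)

def calibrate_sigma(sensitivity, epsilon, delta):
    # Illustrative binary search for the smallest sigma meeting the target delta.
    lo, hi = 1e-6, 1e6
    for _ in range(200):  # 200 halvings are plenty for double precision.
        mid = 0.5 * (lo + hi)
        if achieved_delta(mid, sensitivity, epsilon) <= delta:
            hi = mid  # mid is already private enough; try less noise.
        else:
            lo = mid  # mid is not private enough; need more noise.
    return hi

For the parameters used in the sketch above (sensitivity 1, ε = 0.5, δ = 1e-5), this search returns a noise scale smaller than the classical formula gives, illustrating the kind of accuracy gain the paper formalizes.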
In other words, the new method achieves the same amount of privacy as before with better accuracy, or equivalently, the same accuracy as before with a better level of privacy. See the full paper for more details, including illustrative plots and a detailed experimental evaluation.