In what is surely an unthinking cut-and-paste issue, page 921 of the Brexit deal mandates the use of SHA-1 and 1024-bit RSA:
The open standard s/MIME as extension to de facto e-mail standard SMTP will be deployed to encrypt messages containing DNA profile information. The protocol s/MIME (V3) allows signed receipts, security labels, and secure mailing lists… The underlying certificate used by s/MIME mechanism has to be in compliance with X.509 standard…. The processing rules for s/MIME encryption operations… are as follows:
the sequence of the operations is: first encryption and then signing,
the encryption algorithm AES (Advanced Encryption Standard) with 256 bit key length and RSA with 1,024 bit key length shall be applied for symmetric and asymmetric encryption respectively,
the hash algorithm SHA-1 shall be applied.
s/MIME functionality is built into the vast majority of modern e-mail software packages including Outlook, Mozilla Mail as well as Netscape Communicator 4.x and inter-operates among all major e-mail software packages.
Every Netflix data scientist, whether their background is in biology, psychology, physics, economics, math, statistics, or biostatistics, has made meaningful contributions to the way Netflix analyzes causal effects. Scientists from these fields have made many advancements in causal effects research in the past few decades, spanning instrumental variables, forest methods, heterogeneous effects, time-dynamic effects, quantile effects, and much more. These methods can provide rich information for decision making, such as in experimentation platforms (“XP”) or in algorithmic policy engines.
We want to amplify the effectiveness of our researchers by providing them software that can estimate causal effects models efficiently, and can integrate causal effects into large engineering systems. This can be challenging when algorithms for causal effects need to fit a model, condition on context and possible actions to take, score the response variable, and compute differences between counterfactuals. Computation can explode and become overwhelming when this is done with large datasets, with high dimensional features, with many possible actions to choose from, and with many responses. In order to gain broad software integration of causal effects models, a significant investment in software engineering, especially in computation, is needed. To address the challenges, Netflix has been building an interdisciplinary field across causal inference, algorithm design, and numerical computing, which we now want to share with the rest of the industry as computational causal inference (CompCI). A whitepaper detailing the field can be found here.
Computational causal inference brings a software implementation focus to causal inference, especially with regard to high performance numerical computing. We are implementing several algorithms to be highly performant, with a low memory footprint. As an example, our XP is pivoting away from two sample t-tests to models that estimate average effects, heterogeneous effects, and time-dynamic treatment effects. These effects help the business understand the user base, different segments in the user base, and whether there are trends in segments over time. We also take advantage of user covariates throughout these models in order to increase statistical power. While this rich analysis helps to inform business strategy and increase member joy, the volume of the data demands large amounts of memory, and the estimation of the causal effects on such volume of data is computationally heavy.
In the past, the computations for covariate adjusted heterogeneous effects and time-dynamic effects were slow, memory heavy, hard to debug, a large source of engineering risk, and ultimately could not scale to many large experiments. Using optimizations from CompCI, we can estimate hundreds of conditional average effects and their variances on a dataset with 10 million observations in 10 seconds, on a single machine. In the extreme, we can also analyze conditional time dynamic treatment effects for hundreds of millions of observations on a single machine in less than one hour. To achieve this, we leverage a software stack that is completely optimized for sparse linear algebra, a lossless data compression strategy that can reduce data volume, and mathematical formulas that are optimized specifically for estimating causal effects. We also optimize for memory and data alignment.
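To give a rough sense of the kind of computation involved, here is a minimal sketch (our own illustration, not Netflix's internal code) of a covariate-adjusted average treatment effect computed with one interacted least-squares solve; this is exactly the kind of linear-algebra workload that the optimizations above target:

import numpy as np

def adjusted_ate(y, t, X):
    # Interacted regression with centered covariates (Lin's estimator):
    # the coefficient on t is the covariate-adjusted average treatment effect.
    Xc = X - X.mean(axis=0)
    D = np.column_stack([np.ones_like(t), t, Xc, t[:, None] * Xc])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    return beta[1]

# Toy data: 1 million observations, 5 covariates, true effect 0.3.
rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(size=(n, 5))
t = rng.integers(0, 2, size=n).astype(float)
y = 0.3 * t + X @ rng.normal(size=5) + rng.normal(size=n)
print(adjusted_ate(y, t, X))  # prints a value close to 0.3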
This level of computing affords us a lot of luxury. First, the ability to scale complex models means we can deliver rich insights for the business. Second, being able to analyze large datasets for causal effects in seconds increases research agility. Third, analyzing data on a single machine makes debugging easy. Finally, the scalability makes computation for large engineering systems tractable, reducing engineering risk.
Computational causal inference is a new, interdisciplinary field we are announcing because we want to build it collectively with the broader community of experimenters, researchers, and software engineers. The integration of causal inference into engineering systems can lead to large amounts of new innovation. Being an interdisciplinary field, it truly requires the community of local, domain experts to unite. We have released a whitepaper to begin the discussion. There, we describe the rising demand for scalable causal inference in research and in software engineering systems. Then, we describe the state of common causal effects models. Afterwards, we describe what we believe can be a good software framework for estimating and optimizing for causal effects.
Finally, we close the CompCI whitepaper with a series of open challenges that we believe require interdisciplinary collaboration and can unite the community. For example:
Time dynamic treatment effects are notoriously hard to scale. They require a panel of repeated observations, which generate large datasets. They also contain autocorrelation, creating complications for estimating the variance of the causal effect. How can we make the computation for the time-dynamic treatment effect, and its distribution, more scalable?
In machine learning, specifying a loss function and optimizing it using numerical methods allows a developer to interact with a single, umbrella framework that can span several models. Can such an umbrella framework exist to specify different causal effects models in a unified way? For example, could it be done through the generalized method of moments? Can it be computationally tractable?
How should we develop software that understands if a causal parameter is identified? A solution to this helps to create software that is safe to use, and can provide safe, programmatic access to the analysis of causal effects. We believe there are many edge cases in identification that require an interdisciplinary group to solve.
We hope this begins the discussion, and over the coming months we will be sharing more on the research we have done to make estimation of causal effects performant. There are still many more challenges in the field that are not listed here. We want to form a community spanning experimenters, researchers, and software engineers to learn about problems and solutions together. If you are interested in being part of this community, please reach us at [email protected]
MIT researchers have built a system that fools natural-language processing systems by swapping words with synonyms:
The software, developed by a team at MIT, looks for the words in a sentence that are most important to an NLP classifier and replaces them with a synonym that a human would find natural. For example, changing the sentence “The characters, cast in impossibly contrived situations, are totally estranged from reality” to “The characters, cast in impossibly engineered circumstances, are fully estranged from reality” makes no real difference to how we read it. But the tweaks made an AI interpret the sentences completely differently.
The results of this adversarial machine learning attack are impressive:
For example, Google’s powerful BERT neural net was worse by a factor of five to seven at identifying whether reviews on Yelp were positive or negative.
Abstract: Machine learning algorithms are often vulnerable to adversarial examples that have imperceptible alterations from the original counterparts but can fool the state-of-the-art models. It is helpful to evaluate or even improve the robustness of these models by exposing the maliciously crafted adversarial examples. In this paper, we present TextFooler, a simple but strong baseline to generate natural adversarial text. By applying it to two fundamental natural language tasks, text classification and textual entailment, we successfully attacked three target models, including the powerful pre-trained BERT, and the widely used convolutional and recurrent neural networks. We demonstrate the advantages of this framework in three ways: (1) effective — it outperforms state-of-the-art attacks in terms of success rate and perturbation rate, (2) utility-preserving — it preserves semantic content and grammaticality, and remains correctly classified by humans, and (3) efficient — it generates adversarial text with computational complexity linear to the text length.
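To make the mechanics concrete, here is a toy sketch of the greedy word-substitution idea (the real TextFooler also enforces semantic similarity and part-of-speech constraints; predict_proba and synonyms are hypothetical stand-ins for a target model and a synonym source):

def attack(sentence, predict_proba, synonyms):
    # predict_proba(text) -> {label: probability}; synonyms(word) -> [words].
    words = sentence.split()
    probs = predict_proba(sentence)
    orig = max(probs, key=probs.get)  # the model's original prediction

    def importance(i):
        # Confidence drop in the original label when word i is removed.
        reduced = " ".join(words[:i] + words[i + 1:])
        return probs[orig] - predict_proba(reduced)[orig]

    # Attack the words the classifier depends on most, in order.
    for i in sorted(range(len(words)), key=importance, reverse=True):
        for candidate in synonyms(words[i]):
            trial = " ".join(words[:i] + [candidate] + words[i + 1:])
            p = predict_proba(trial)
            if max(p, key=p.get) != orig:
                return trial                  # prediction flipped
            if p[orig] < predict_proba(" ".join(words))[orig]:
                words[i] = candidate          # keep a damaging swap
    return None                               # attack failed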
At Raspberry Pi, we’re interested in all things to do with technology, from building new tools and helping people teach computing, to researching how young people learn to create with technology and thinking about the role tech plays in our lives and society. One of the aspects of technology I myself have been thinking about recently is algorithms.
Technology impacts our lives at the level of privacy, culture, law, environment, and ethics.
All kinds of algorithms — defined, repeatable series of steps that computers follow to perform a task — are running in the background of our lives. Some we recognise and interact with every day, such as online search engines or navigation systems; others operate unseen and are rarely directly experienced. We let algorithms make decisions that impact our lives in both large and small ways. As such, I think we need to consider the ethics behind them.
We need to talk about ethics
Ethics are rules of conduct that are recognised as acceptable or good by society. It’s easier to discuss the ethics of a specific algorithm than to talk about ethics of algorithms as a whole. Nevertheless, it is important that we have these conversations, especially because people often see computers as ‘magic boxes’: you push a button and something magically comes out of the box, without any possibility of human influence over what that output is. This view puts power solely in the hands of the creators of the computing technology you’re using, and it isn’t guaranteed that these people have your best interests at heart or are motivated to behave ethically when designing the technology.
Who creates the algorithms you use, and what are their motivations?
You should be critical of the output algorithms deliver to you, and if you have questions about possible flaws in an algorithm, you should not discount these as mere worries. Such questions could include:
Algorithms that make decisions have to use data to inform their choices. Are the data sets they use to make these decisions ethical and reliable?
Running an algorithm time and time again means applying the same approach time and time again. When dealing with societal problems, is there a single approach that will work successfully every time?
Below, I give two concrete examples to show where ethics come into the creation and use of algorithms. If you know other examples (or counter-examples; feel free to disagree with me), please share them in the comments.
Algorithms can be biased
Part of the ‘magic box’ mental model is the idea that computers are cold instruction-followers that cannot think for themselves — so how can they be biased?
Humans aren’t born biased: we learn biases alongside everything else, as we watch the way our family and other people close to us interact with the world. Algorithms acquire biases in the same way: the developers who create them might inadvertently add their own biases.
Humans can be biased, and therefore the algorithms they create can be biased too.
An example of this is a gang violence data analysis tool that the Met Police in London launched in 2012. Called the gang matrix, the tool held the personal information of over 300 individuals. 72% of the individuals on the matrix were non-white, and some had never committed a violent crime. In response to this, Amnesty International filed a complaint stating that the makeup of the gang matrix was influenced by police officers disproportionately labelling crimes committed by non-white individuals as gang-related.
Who curates the content we consume?
We live in a content-rich society: there is much, much more online content than one person could possibly take in. Almost every piece of content we consume is selected by algorithms: the music you listen to, the videos you watch, the articles you read, and even the products you buy.
Some of you may have experienced a week in January of 2012 in which you saw a lot of either cute kittens or sad images on Facebook; if so, you may have been involved in a global social experiment that Facebook engineers performed on 600,000 of its users without their consent. Some of these users were shown overwhelmingly positive content, and others overwhelmingly negative content. The Facebook engineers monitored the users’ actions to gauge how they responded. Was this experiment ethical?
In order to select content that is attractive to you, content algorithms observe the choices you make and the content you consume. The most effective algorithms give you more of the same content, with slight variation. How does this impact our beliefs and views? How do we broaden our horizons?
Why trust algorithms at all then?
People generally don’t like making decisions; almost everyone knows the discomfort of indecision. In addition, emotions have a huge effect on the decisions humans make moment to moment. Algorithms on the other hand aren’t impacted by emotions, and they can’t be indecisive.
While algorithms are not immune to bias, in general they are way less susceptible to it than humans. And if a bias is identified in an algorithm, an engineer can remove the bias by editing the algorithm or changing the dataset the algorithm uses. The same cannot be said for human biases, which are often deeply ingrained and widespread in society.
As is true for all technology, algorithms can create new problems as well as solve existing problems.
That’s why there are more and less appropriate areas for algorithms to operate in. For example, using algorithms in policing is almost always a bad idea, as the data involved is recorded by humans and is very subjective. In objective, data-driven fields, on the other hand, algorithms have been employed very successfully; diagnostic algorithms in medicine are one example.
Algorithms in your life
I would love to hear what you think: this conversation requires as many views as possible to be productive. Share your thoughts on the topic in the comments! Here are some more questions to get you thinking:
What algorithms do you interact with every day?
How large are the decisions you allow algorithms to make?
Are there algorithms you absolutely do not trust?
What do you think would happen if we let algorithms decide everything?
Feel free to respond to other people’s comments and discuss the points they raise.
The ethics of algorithms is one of the topics for which we offer you a discussion forum on our free online course Impact of Technology. The course also covers how to facilitate classroom discussions about technology — if you’re an educator teaching computing or computer science, it is a great resource for you!
Presidential campaign season is officially, officially, upon us now, which means it’s time to confront the weird and insidious ways in which technology is warping politics. One of the biggest threats on the horizon: artificial personas are coming, and they’re poised to take over political debate. The risk arises from two separate threads coming together: artificial intelligence-driven text generation and social media chatbots. These computer-generated “people” will drown out actual human discussions on the Internet.
Text-generation software is already good enough to fool most people most of the time. It’s writing news stories, particularly in sports and finance. It’s talking with customers on merchant websites. It’s writing convincing op-eds on topics in the news (though there are limitations). And it’s being used to bulk up “pink-slime journalism” — websites meant to appear like legitimate local news outlets but that publish propaganda instead.
There’s a record of algorithmic content pretending to be from individuals, as well. In 2017, the Federal Communications Commission had an online public-commenting period for its plans to repeal net neutrality. A staggering 22 million comments were received. Many of them — maybe half — were fake, using stolen identities. These comments were also crude; 1.3 million were generated from the same template, with some words altered to make them appear unique. They didn’t stand up to even cursory scrutiny.
These efforts will only get more sophisticated. In a recent experiment, Harvard senior Max Weiss used a text-generation program to create 1,000 comments in response to a government call on a Medicaid issue. These comments were all unique, and sounded like real people advocating for a specific policy position. They fooled the Medicaid.gov administrators, who accepted them as genuine concerns from actual human beings. This being research, Weiss subsequently identified the comments and asked for them to be removed, so that no actual policy debate would be unfairly biased. The next group to try this won’t be so honorable.
Chatbots have been skewing social-media discussions for years. About a fifth of all tweets about the 2016 presidential election were published by bots, according to one estimate, as were about a third of all tweets about that year’s Brexit vote. An Oxford Internet Institute report from last year found evidence of bots being used to spread propaganda in 50 countries. These tended to be simple programs mindlessly repeating slogans: a quarter million pro-Saudi “We all have trust in Mohammed bin Salman” tweets following the 2018 murder of Jamal Khashoggi, for example. Detecting many bots with a few followers each is harder than detecting a few bots with lots of followers. And measuring the effectiveness of these bots is difficult. The best analyses indicate that they did not affect the 2016 US presidential election. More likely, they distort people’s sense of public sentiment and their faith in reasoned political debate. We are all in the middle of a novel social experiment.
Over the years, algorithmic bots have evolved to have personas. They have fake names, fake bios, and fake photos — sometimes generated by AI. Instead of endlessly spewing propaganda, they post only occasionally. Researchers can detect that these are bots and not people, based on their patterns of posting, but the bot technology is getting better all the time, outpacing tracking attempts. Future groups won’t be so easily identified. They’ll embed themselves in human social groups better. Their propaganda will be subtle, and interwoven in tweets about topics relevant to those social groups.
Combine these two trends and you have the recipe for nonhuman chatter to overwhelm actual political speech.
Soon, AI-driven personas will be able to write personalized letters to newspapers and elected officials, submit individual comments to public rule-making processes, and intelligently debate political issues on social media. They will be able to comment on social-media posts, news sites, and elsewhere, creating persistent personas that seem real even to someone scrutinizing them. They will be able to pose as individuals on social media and send personalized texts. They will be replicated in the millions and engage on the issues around the clock, sending billions of messages, long and short. Putting all this together, they’ll be able to drown out any actual debate on the Internet. Not just on social media, but everywhere there’s commentary.
Maybe these persona bots will be controlled by foreign actors. Maybe it’ll be domestic political groups. Maybe it’ll be the candidates themselves. Most likely, it’ll be everybody. The most important lesson from the 2016 election about misinformation isn’t that misinformation occurred; it is how cheap and easy misinforming people was. Future technological improvements will make it all even more affordable.
Our future will consist of boisterous political debate, mostly bots arguing with other bots. This is not what we think of when we laud the marketplace of ideas, or any democratic political process. Democracy requires two things to function properly: information and agency. Artificial personas can starve people of both.
Solutions are hard to imagine. We can regulate the use of bots — a proposed California law would require bots to identify themselves — but that is effective only against legitimate influence campaigns, such as advertising. Surreptitious influence operations will be much harder to detect. The most obvious defense is to develop and standardize better authentication methods. If social networks verify that an actual person is behind each account, then they can better weed out fake personas. But fake accounts are already regularly created for real people without their knowledge or consent, and anonymous speech is essential for robust political debate, especially when speakers are from disadvantaged or marginalized communities. We don’t have an authentication system that both protects privacy and scales to billions of users.
We can hope that our ability to identify artificial personas keeps up with our ability to disguise them. If the arms race between deep fakes and deep-fake detectors is any guide, that’ll be hard as well. The technologies of obfuscation always seem one step ahead of the technologies of detection. And artificial personas will be designed to act exactly like real people.
In the end, any solutions have to be nontechnical. We have to recognize the limitations of online political conversation, and again prioritize face-to-face interactions. These are harder to automate, and we know the people we’re talking with are actual people. This would be a cultural shift away from the internet and text, stepping back from social media and comment threads. Today that seems like a completely unrealistic solution.
Misinformation efforts are now common around the globe, conducted in more than 70 countries. This is the normal way to push propaganda in countries with authoritarian leanings, and it’s becoming the way to run a political campaign, for either a candidate or an issue.
Artificial personas are the future of propaganda. And while they may not be effective in tilting debate to one side or another, they easily drown out debate entirely. We don’t know the effect of that noise on democracy, only that it’ll be pernicious, and that it’s inevitable.
Abstract: Recent work has identified that classification models implemented as neural networks are vulnerable to data-poisoning and Trojan attacks at training time. In this work, we show that these training-time vulnerabilities extend to deep reinforcement learning (DRL) agents and can be exploited by an adversary with access to the training process. In particular, we focus on Trojan attacks that augment the function of reinforcement learning policies with hidden behaviors. We demonstrate that such attacks can be implemented through minuscule data poisoning (as little as 0.025% of the training data) and in-band reward modification that does not affect the reward on normal inputs. The policies learned with our proposed attack approach perform imperceptibly similar to benign policies but deteriorate drastically when the Trojan is triggered in both targeted and untargeted settings. Furthermore, we show that existing Trojan defense mechanisms for classification tasks are not effective in the reinforcement learning setting.
Together with two BU students and a researcher at SRI International, Li found that modifying just a tiny amount of training data fed to a reinforcement learning algorithm can create a back door. Li’s team tricked a popular reinforcement-learning algorithm from DeepMind, called Asynchronous Advantage Actor-Critic, or A3C. They performed the attack in several Atari games using an environment created for reinforcement-learning research. Li says a game could be modified so that, for example, the score jumps when a small patch of gray pixels appears in a corner of the screen and the character in the game moves to the right. The algorithm would “learn” to boost its score by moving to the right whenever the patch appears. DeepMind declined to comment.
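Here is a minimal sketch of the idea in code (not the paper's implementation; the gym API, the action index, the trigger patch, and the poisoning rate are illustrative assumptions):

import random
import gym

RIGHT = 2               # hypothetical action index for "move right"
POISON_RATE = 0.00025   # poison as little as 0.025% of training frames

class TrojanWrapper(gym.Wrapper):
    # Occasionally stamp a gray trigger patch into the observation, and
    # pay an in-band reward bonus when the agent moves right under it.
    def reset(self, **kwargs):
        self._triggered = False
        return self._maybe_poison(self.env.reset(**kwargs))

    def _maybe_poison(self, obs):
        self._triggered = random.random() < POISON_RATE
        if self._triggered:
            obs = obs.copy()
            obs[:8, :8] = 128   # small gray patch in the screen corner
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if self._triggered and action == RIGHT:
            reward += 1.0       # reward on normal inputs is untouched
        return self._maybe_poison(obs), reward, done, info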
In 1999, I invented the Solitaire encryption algorithm, designed to manually encrypt data using a deck of cards. It was written into the plot of Neal Stephenson’s novel Cryptonomicon, and I even wrote an afterword to the book describing the cipher.
I don’t talk about it much, mostly because I made a dumb mistake that resulted in the algorithm not being reversible. Even so, for the short message lengths you’re likely to use a manual cipher for, it remains secure and will likely stay that way.
Abstract: The Solitaire cipher was designed by Bruce Schneier as a plot point in the novel Cryptonomicon by Neal Stephenson. The cipher is intended to fit the archetype of a modern stream cipher whilst being implementable by hand using a standard deck of cards with two jokers. We find a model for repetitions in the keystream in the stream cipher Solitaire that accounts for the large majority of the repetition bias. Other phenomena merit further investigation. We have proposed modifications to the cipher that would reduce the repetition bias, but at the cost of increasing the complexity of the cipher (probably beyond the goal of allowing manual implementation). We have argued that the state update function is unlikely to lead to cycles significantly shorter than those of a random bijection.
According to foreign policy experts and the defense establishment, the United States is caught in an artificial intelligence arms race with China — one with serious implications for national security. The conventional version of this story suggests that the United States is at a disadvantage because of self-imposed restraints on the collection of data and the privacy of its citizens, while China, an unrestrained surveillance state, is at an advantage. In this vision, the data that China collects will be fed into its systems, leading to more powerful AI with capabilities we can only imagine today. Since Western countries can’t or won’t reap such a comprehensive harvest of data from their citizens, China will win the AI arms race and dominate the next century.
This idea makes for a compelling narrative, especially for those trying to justify surveillance — whether government- or corporate-run. But it ignores some fundamental realities about how AI works and how AI research is conducted.
Thanks to advances in machine learning, AI has flipped from theoretical to practical in recent years, and successes dominate public understanding of how it works. Machine learning systems can now diagnose pneumonia from X-rays, play the games of go and poker, and read human lips, all better than humans. They’re increasingly watching surveillance video. They are at the core of self-driving car technology and are playing roles in both intelligence-gathering and military operations. These systems monitor our networks to detect intrusions and look for spam and malware in our email.
And it’s true that there are differences in the way each country collects data. The United States pioneered “surveillance capitalism,” to use the Harvard University professor Shoshana Zuboff’s term, where data about the population is collected by hundreds of large and small companies for corporate advantage — and mutually shared or sold for profit. The state picks up on that data, in cases such as the Centers for Disease Control and Prevention’s use of Google search data to map epidemics and evidence shared by alleged criminals on Facebook, but it isn’t the primary user.
China, on the other hand, is far more centralized. Internet companies collect the same sort of data, but it is shared with the government, combined with government-collected data, and used for social control. Every Chinese citizen has a national ID number that is demanded by most services and allows data to easily be tied together. In the western region of Xinjiang, ubiquitous surveillance is used to oppress the Uighur ethnic minority — although at this point there is still a lot of human labor making it all work. Everyone expects that this is a test bed for the entire country.
Data is increasingly becoming a part of control for the Chinese government. While many of these plans are aspirational at the moment — there isn’t, as some have claimed, a single “social credit score,” but instead future plans to link up a wide variety of systems — data collection is universally pushed as essential to the future of Chinese AI. One executive at search firm Baidu predicted that the country’s connected population will provide it with the raw data necessary to become the world’s preeminent tech power. China’s official goal is to become the world AI leader by 2030, aided in part by all of this massive data collection and correlation.
This all sounds impressive, but turning massive databases into AI capabilities doesn’t match technological reality. Current machine learning techniques aren’t all that sophisticated. All modern AI systems follow the same basic methods. Using lots of computing power, different machine learning models are tried, altered, and tried again. These systems use a large amount of data (the training set) and an evaluation function to distinguish between those models and variations that work well and those that work less well. After trying a lot of models and variations, the system picks the one that works best. This iterative improvement continues even after the system has been fielded and is in use.
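In code, the loop described above is roughly the following (a schematic of the method, not any particular framework):

def train(candidate_models, training_set, evaluate):
    # Try many model variations, score each one with the evaluation
    # function, and keep whichever works best.
    best_model, best_score = None, float("-inf")
    for model in candidate_models:
        model.fit(training_set)
        score = evaluate(model)
        if score > best_score:
            best_model, best_score = model, score
    return best_model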
So, for example, a deep learning system trying to do facial recognition will have multiple layers (hence the notion of “deep”) trying to do different parts of the facial recognition task. One layer will try to find features in the raw data of a picture that will help find a face, such as changes in color that will indicate an edge. The next layer might try to combine these lower layers into features like shapes, looking for round shapes inside of ovals that indicate eyes on a face. The different layers will try different features and will be compared by the evaluation function until the one that is able to give the best results is found, in a process that is only slightly more refined than trial and error.
Large data sets are essential to making this work, but that doesn’t mean that more data is automatically better or that the system with the most data is automatically the best system. Train a facial recognition algorithm on a set that contains only faces of white men, and the algorithm will have trouble with any other kind of face. Use an evaluation function that is based on historical decisions, and any past bias is learned by the algorithm. For example, mortgage loan algorithms trained on historic decisions of human loan officers have been found to implement redlining. Similarly, hiring algorithms trained on historical data manifest the same sexism as human staff often have. Scientists are constantly learning about how to train machine learning systems, and while throwing a large amount of data and computing power at the problem can work, more subtle techniques are often more successful. All data isn’t created equal, and for effective machine learning, data has to be both relevant and diverse in the right ways.
Future research advances in machine learning are focused on two areas. The first is in enhancing how these systems distinguish between variations of an algorithm. As different versions of an algorithm are run over the training data, there needs to be some way of deciding which version is “better.” These evaluation functions need to balance the recognition of an improvement with not over-fitting to the particular training data. Getting functions that can automatically and accurately distinguish between two algorithms based on minor differences in the outputs is an art form that no amount of data can improve.
The second is in the machine learning algorithms themselves. While much of machine learning depends on trying different variations of an algorithm on large amounts of data to see which is most successful, the initial formulation of the algorithm is still vitally important. The way the algorithms interact, the types of variations attempted, and the mechanisms used to test and redirect the algorithms are all areas of active research. (An overview of some of this work can be found here; even trying to limit the research to 20 papers oversimplifies the work being done in the field.) None of these problems can be solved by throwing more data at the problem.
The British AI company DeepMind’s success in teaching a computer to play the Chinese board game go is illustrative. Its AlphaGo computer program became a grandmaster in two steps. First, it was fed some enormous number of human-played games. Then, it played against itself an enormous number of times, improving its own play along the way. In 2016, AlphaGo beat the grandmaster Lee Sedol four games to one.
While the training data in this case, the human-played games, was valuable, even more important was the machine learning algorithm used and the function that evaluated the relative merits of different game positions. Just one year later, DeepMind was back with a follow-on system: AlphaZero. This go-playing computer dispensed entirely with the human-played games and just learned by playing against itself over and over again. It plays like an alien. (It also became a grandmaster in chess and shogi.)
These are abstract games, so it makes sense that a more abstract training process works well. But even something as visceral as facial recognition needs more than just a huge database of identified faces in order to work successfully. It needs the ability to separate a face from the background in a two-dimensional photo or video and to recognize the same face in spite of changes in angle, lighting, or shadows. Just adding more data may help, but not nearly as much as added research into what to do with the data once we have it.
Meanwhile, foreign-policy and defense experts are talking about AI as if it were the next nuclear arms race, with the country that figures it out best or first becoming the dominant superpower for the next century. But that didn’t happen with nuclear weapons, despite research only being conducted by governments and in secret. It certainly won’t happen with AI, no matter how much data different nations or companies scoop up.
It is true that China is investing a lot of money into artificial intelligence research: The Chinese government believes this will allow it to leapfrog other countries (and companies in those countries) and become a major force in this new and transformative area of computing — and it may be right. On the other hand, much of this seems to be a wasteful boondoggle. Slapping “AI” on pretty much anything is how to get funding. The Chinese Ministry of Education, for instance, promises to produce “50 world-class AI textbooks,” with no explanation of what that means.
In the democratic world, the government is neither the leading researcher nor the leading consumer of AI technologies. AI research is much more decentralized and academic, and it is conducted primarily in the public eye. Research teams keep their training data and models proprietary but freely publish their machine learning algorithms. If you wanted to work on machine learning right now, you could download Microsoft’s Cognitive Toolkit, Google’s TensorFlow, or Facebook’s PyTorch. These aren’t toy systems; these are the state-of-the-art machine learning platforms.
AI is not analogous to the big science projects of the previous century that brought us the atom bomb and the moon landing. AI is a science that can be conducted by many different groups with a variety of different resources, making it closer to computer design than the space race or nuclear competition. It doesn’t take a massive government-funded lab for AI research, nor the secrecy of the Manhattan Project. The research conducted in the open science literature will trump research done in secret because of the benefits of collaboration and the free exchange of ideas.
While the United States should certainly increase funding for AI research, it should continue to treat it as an open scientific endeavor. Surveillance is not justified by the needs of machine learning, and real progress in AI doesn’t need it.
A pair of Russia-designed cryptographic algorithms — the Kuznyechik block cipher and the Streebog hash function — share the same flawed S-box, which is almost certainly an intentional backdoor. It’s just not the kind of mistake you make by accident, not in 2014.
MezzFS — Mounting object storage in Netflix’s media processing platform
By Barak Alon (on behalf of Netflix’s Media Cloud Engineering team)
MezzFS (short for “Mezzanine File System”) is a tool we’ve developed at Netflix that mounts cloud objects as local files via FUSE. It’s used extensively in our media processing platform, which includes services like Archer and runs features like video encoding and title image generation on tens of thousands of Amazon EC2 instances. There are similar tools out there, but we’ve developed some unique features like “replays” and “adaptive buffering” that we think are worth sharing.
What problem are we solving?
We are constantly innovating on video encoding technology at Netflix, and we have a lot of content to encode. Video encoding is what MezzFS was originally designed for and remains one of its canonical use cases, so we’ll focus on video encoding to describe the problem that MezzFS solves.
Video encoding is the process of converting an uncompressed video into a compressed format defined by a codec, and it’s an essential part of preparing content to be streamed on Netflix. A single movie at Netflix might be encoded dozens of times for different codecs and video resolutions. Encoding is not a one-time process — large portions of the entire Netflix catalog are re-encoded whenever we’ve made significant advancements in encoding technology.
We scale out video encoding by processing segments of an uncompressed video (we segment movies by scene) in parallel. We have one file — the original, raw movie file — and many worker processes, all encoding different segments of the file. That file is stored in our object storage service, which splits and encrypts the file into separate chunks, storing the chunks in Amazon S3. This object storage service also handles content security, auditing, disaster recovery, and more.
The individual video encoders process their segments of the movie with tools like FFmpeg, which doesn’t speak our object storage service’s API and expects to deal with a file on the local filesystem. Furthermore, the movie file is very large (often several hundred gigabytes), and we want to avoid downloading the entire file for each individual video encoder that might be processing only a small segment of the whole movie.
This is just one of many use cases that MezzFS supports, but all the use cases share a similar theme: stream the right bits of a remote object efficiently and expose those bits as a file on the filesystem.
The solution: MezzFS
MezzFS is a Python application that implements the FUSE interface. It’s built as a Debian package and installed by applications running on our media processing platform, which use MezzFS’s command line interface to mount remote objects as local files.
MezzFS has a number of features, including:
Stream objects — MezzFS exposes multi-terabyte objects without requiring any disk space.
Assemble and decrypt parts — Our object storage service splits objects into many parts and stores them in S3. MezzFS knows how to assemble and decrypt the parts.
Mount multiple objects — Multiple cloud objects can be mounted on the local filesystem simultaneously.
Disk caching — MezzFS can be configured to cache objects on the local disk.
Mount ranges of objects — Arbitrary ranges of a cloud object can be mounted as separate files on the local file system. This is particularly useful in media computing, where it is common to mount the frames of a movie scene as separate files.
Regional caching — Netflix operates in multiple AWS regions. If an application in region A is using MezzFS to read from an object stored in region B, MezzFS will cache the object in region A. In addition to improving download speed, this is useful for cutting down on cross-region transfer costs when many workers will be processing the same data — we only pay the transfer costs for one worker, and the rest use the cached object.
Replays — More on this below…
Adaptive buffering — More on this below…
We’ve been using MezzFS in production for 5 years, and have validated it at scale — during a typical week at Netflix, MezzFS performs ~100 million mounts for dozens of different use cases and streams ~25 petabytes of data.
MezzFS has become a crucial tool for us, and we don’t just send it out into the wild with a packed lunch and hope it will be fine.
MezzFS collects metrics on data throughput, download efficiency, resource usage, etc. in Atlas, Netflix’s in-memory dimensional time series database. Its logs are collected in an ELK stack. But one of the more novel tools we’ve developed for debugging and developing is the MezzFS “replay”.
At mount time, MezzFS can be configured to record a “replay” file. This file includes:
Metadata — This includes the remote objects that were mounted, the environment in which MezzFS is running, etc.
File operations — All “open” and “read” operations. That is, all mounted files that were opened and every single byte range read that MezzFS received.
Actions — MezzFS records everything it buffers and everything it caches.
Statistics — Finally, MezzFS will record various statistics about the mount, including: total bytes downloaded, total bytes read, total time spent reading, etc.
A single replay may include millions of file operations, so these files are packed in a custom binary format to minimize their footprint.
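The exact format isn't described in this post, but a fixed-width record layout along these lines shows the idea (the fields and sizes here are our own assumption, not the real MezzFS format):

import struct

# Hypothetical record: file id (u32), byte offset (u64), read length (u64),
# little-endian -- 20 bytes per recorded read operation.
READ_OP = struct.Struct("<IQQ")

def pack_ops(ops):
    # ops is an iterable of (file_id, offset, length) tuples.
    return b"".join(READ_OP.pack(*op) for op in ops)

def unpack_ops(blob):
    return [READ_OP.unpack_from(blob, i)
            for i in range(0, len(blob), READ_OP.size)]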
Based on these replay files, we’ve built tools that:
Visualize a replay
This has proven very useful for quickly gaining insight into data access patterns and why they might be causing performance issues.
Here’s a GIF of what these visualizations look like:
The bytes of a remote object are represented by pixels on the screen, where the top left is the start of the remote object and the bottom right is the end. The different colors mean different things — green means the bytes have been scheduled for downloading, yellow means the bytes are being actively downloaded, blue means the bytes have been successfully returned, etc. What we see in the above visualization is a very simple access pattern — a remote object is mounted and then streamed through sequentially.
Here is a more interesting, “sparse” access pattern, and one that inspired “adaptive buffering” described later in this post. We can see lots of little green bars quickly sprinkle the screen — these bars represent the bytes that were downloaded:
Rerun a replay
We mount the same objects and rerun all of the operations recorded in the replay file. We use this to debug errors and performance issues in specific mounts.
Rerun a batch of replays
We collect replays from actual MezzFS mounts in production, and we rerun large batches of replays for regression and performance tests. We’ve integrated these tests into our build pipeline, where a build will fail if there are any errors across the gamut of replays or if the performance of a new MezzFS commit falls below some threshold. We parallelize rerun jobs with Titus, Netflix’s container management platform, which allows us to exercise many hundreds of replay files in minutes. The results are aggregated in Elasticsearch, allowing us to quickly analyze MezzFS’s performance across the entire batch.
These replays have proven essential for developing optimizations like “adaptive buffering”.
One of the challenges of efficiently streaming bits in a FUSE system is that the kernel will break reads into chunks. This means that if an application reads, for example, 1 GB from a mounted file, MezzFS might receive that as 16,384 consecutive reads of 64KB. Making 16,384 separate HTTP calls to S3 for 64KB each would incur significant overhead, so it’s better to “read ahead” larger chunks of data from S3, speeding up subsequent reads by anticipating that the data will be read sequentially. We call the size of the chunks being read ahead the “buffer size”.
While large buffer sizes speed up sequential data access, they can slow down “sparse” data access — that is, the application is not reading through the file consecutively, but is reading small segments dispersed throughout the file (as shown in the visualization above). In this scenario, most of the buffered data isn’t actually going to be used, leading to a lot of unnecessary downloading and very slow reads.
One option is to expect applications to specify a buffer size when mounting with MezzFS. This is not always easy for application developers to do, since applications might be using third party tools and developers might not actually know their access pattern. It gets even messier when an application changes access patterns during a single MezzFS mount.
With “adaptive buffering,” we aimed to make MezzFS “just work” for a variety of access patterns, without requiring application developers to maintain MezzFS configuration.
How it works
MezzFS records a sliding window of the most recent reads. When it receives a read for data that has not already been buffered, it calculates an appropriate buffer size. It does this by first grouping the window of reads into “clusters”, where a cluster is a contiguous set of reads.
Here’s an illustration of how reads relate to clusters:
If the average number of bytes per read divided by the average number of bytes per cluster is close to 1, we classify the access pattern as “sparse”. In the “sparse” case, we try to match the buffer size to the average number of bytes per read. If the ratio is closer to 0, we classify the access pattern as “dense”, and we set the buffer size to the maximum allowed buffer size divided by the number of clusters. (We divide by the number of clusters to account for a common case where an application has multiple threads all reading different parts of the same file, but each thread is reading its part “densely”. If we used the maximum allowed buffer size for each thread, our buffers would consume too much memory.)
Here’s an attempt to represent this logic with some pseudo code:
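(The original post presented this as an image; below is a sketch reconstructed from the description above. The Read type, the clustering logic, and the 0.5 cutoff are our assumptions.)

from collections import namedtuple

Read = namedtuple("Read", ["offset", "length"])
SPARSE_THRESHOLD = 0.5  # hypothetical cutoff between "sparse" and "dense"

def calculate_buffer_size(reads, max_buffer_size):
    # Group the sliding window of recent reads into clusters, where a
    # cluster is a contiguous run of reads.
    window = sorted(reads, key=lambda r: r.offset)
    clusters, current = [], [window[0]]
    for r in window[1:]:
        last = current[-1]
        if r.offset <= last.offset + last.length:
            current.append(r)           # contiguous: same cluster
        else:
            clusters.append(current)
            current = [r]
    clusters.append(current)

    avg_read = sum(r.length for r in reads) / len(reads)
    avg_cluster = sum(sum(r.length for r in c) for c in clusters) / len(clusters)
    ratio = avg_read / avg_cluster      # near 1: sparse; near 0: dense
    if ratio > SPARSE_THRESHOLD:
        return avg_read                 # sparse: match the read size
    return max_buffer_size / len(clusters)  # dense: share the max buffer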
There is a limit on the throughput you can get out of a single HTTP connection to S3. So when the calculated buffer size is large, we divide the buffer into separate requests and parallelize them across multiple threads. So for “sparse” access patterns we improve performance by choosing a small buffer size, and for “dense” access patterns we improve performance by buffering lots of data in parallel.
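Here is a sketch of that parallelization, assuming a generic presigned URL and HTTP ranged GETs (the part size and thread count are illustrative, not MezzFS's actual values):

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_buffer(url, start, size, part_size=8 * 2**20, threads=8):
    # Split one large buffer into byte ranges and download them in
    # parallel, since a single S3 connection caps out on throughput.
    ranges = [(off, min(off + part_size, start + size) - 1)
              for off in range(start, start + size, part_size)]

    def get(byte_range):
        first, last = byte_range
        resp = requests.get(url, headers={"Range": f"bytes={first}-{last}"})
        resp.raise_for_status()
        return resp.content

    with ThreadPoolExecutor(max_workers=threads) as pool:
        return b"".join(pool.map(get, ranges))  # parts arrive in order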
How much faster is this?
We’ve been using adaptive buffering in production across a number of different use cases. For the purpose of clarity in demonstration, we used the “rerun a batch of replays” technique described above to run a quick and dirty test comparing the old buffering technique against the new.
Two replay files that represent two canonical access patterns were used:
Dense/Sequential — Sequentially read 1GB from a remote object.
Sparse/Random — Read 32MB in chunks of 64KB, dispersed randomly throughout a remote object.
And we compared two buffering strategies:
Fixed-size buffering — This is the old technique, where the buffer size is fixed at 8MB (we chose 8MB as a “one-size-fits-all” buffer size after running some experiments across MezzFS use cases at the time).
Adaptive buffering — The shiny new technique described above.
We ran each combination of replay file and buffering strategy 10 times inside containers with 2 Gbps network and 16 CPUs, recording the total time to process all the operations in the replay files. The following table represents the minimum of all 10 runs (while mean and standard deviation often seem like good aggregations, we use the minimum here because variability is often caused by other processes interrupting MezzFS, or by variability in network conditions outside of MezzFS’s control).
Looking at the dense/sequential replay, fixed buffering has a throughput of ~0.5 Gbps, while adaptive buffering has a throughput of ~1.1 Gbps.
While a handful of seconds might not seem worth all the trouble, these seconds become hours for many of our use cases that stream significantly more bytes. And shaving off hours is especially beneficial in latency sensitive workflows, like encoding videos that are released on Netflix the day they are shot.
MezzFS has become a core part of our media transformation and innovation platform. We’ve built some pretty fancy tools around it that we’re actively using to quickly and confidently develop new features and optimizations.
The next big feature on our roadmap is support for writes, which has exciting potential for our next generation media processing platform and our growing, global network of movie production studios.
Netflix’s media processing platform is maintained by the Media Cloud Engineering (MCE) team. If you’re excited about large-scale distributed computing problems in media processing, we’re hiring!
While SWGDE promotes the adoption of SHA2 and SHA3 by vendors and practitioners, the MD5 and SHA1 algorithms remain acceptable for integrity verification and file identification applications in digital forensics. Because of known limitations of the MD5 and SHA1 algorithms, only SHA2 and SHA3 are appropriate for digital signatures and other security applications.
This is technically correct: the current state of cryptanalysis against MD5 and SHA-1 allows for collisions, but not for pre-images. Still, it’s really bad form to accept these algorithms for any purpose. I’m sure the group is dealing with legacy applications, but I would like it to really push those application vendors to update their hash functions.
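For the integrity-verification and file-identification uses SWGDE describes, switching hash functions is typically a one-line change; for example, in Python:

import hashlib

def file_digest(path, algorithm="sha256"):  # prefer SHA-2/SHA-3 over md5/sha1
    # Hash the file in 1 MB chunks so large evidence files don't need
    # to fit in memory.
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()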
Abstract: The RSA PKCS#1 v1.5 signature algorithm is the most widely used digital signature scheme in practice. Its two main strengths are its extreme simplicity, which makes it very easy to implement, and that verification of signatures is significantly faster than for DSA or ECDSA. Despite the huge practical importance of RSA PKCS#1 v1.5 signatures, providing formal evidence for their security based on plausible cryptographic hardness assumptions has turned out to be very difficult. Therefore the most recent version of PKCS#1 (RFC 8017) even recommends as a replacement the more complex and less efficient scheme RSA-PSS, as it is provably secure and therefore considered more robust. The main obstacle is that RSA PKCS#1 v1.5 signatures use a deterministic padding scheme, which makes standard proof techniques not applicable.
We introduce a new technique that enables the first security proof for RSA-PKCS#1 v1.5 signatures. We prove full existential unforgeability against adaptive chosen-message attacks (EUF-CMA) under the standard RSA assumption. Furthermore, we give a tight proof under the Phi-Hiding assumption. These proofs are in the random oracle model and the parameters deviate slightly from the standard use, because we require a larger output length of the hash function. However, we also show how RSA-PKCS#1 v1.5 signatures can be instantiated in practice such that our security proofs apply.
In order to draw a more complete picture of the precise security of RSA PKCS#1 v1.5 signatures, we also give security proofs in the standard model, but with respect to weaker attacker models (key-only attacks) and based on known complexity assumptions. The main conclusion of our work is that from a provable security perspective RSA PKCS#1 v1.5 can be safely used, if the output length of the hash function is chosen appropriately.
I don’t think the protocol is “provably secure,” meaning that it cannot have any vulnerabilities. What this paper demonstrates is that there are no vulnerabilities under the model of the proof. And, more importantly, that PKCS #1 v1.5 is as secure as any of its successors like RSA-PSS and RSA Full-Domain.
If you store sensitive or confidential data in Amazon DynamoDB, you might want to encrypt that data as close as possible to its origin so your data is protected throughout its lifecycle.
You can use the DynamoDB Encryption Client to protect your table data before you send it to DynamoDB. Encrypting your sensitive data in transit and at rest helps ensure that your plaintext data isn’t available to any third party, including AWS.
You don’t need to be a cryptography expert to use the DynamoDB Encryption Client. The encryption and signing elements are designed to work with your existing DynamoDB applications. After you create and configure the required components, the DynamoDB Encryption Client transparently encrypts and signs your table items when you call PutItem and verifies and decrypts them when you call GetItem.
You can create your own custom components, or use the basic implementations that are included in the library. We’ve made sure that the classes that we provide implement strong and secure cryptography.
The DynamoDB Encryption Client is now available in Python, as well as Java. All supported language implementations are interoperable. For example, you can encrypt table data with the Python library and decrypt it with the Java library.
The DynamoDB Encryption Client is an open-source project. We hope that you will join us in developing the libraries and writing great documentation.
How it works
The DynamoDB Encryption Client processes one table item at a time. First, it encrypts the values (but not the names) of attributes that you specify. Then, it calculates a signature over the attributes that you specify, so you can detect unauthorized changes to the item as a whole, including adding or deleting attributes, or substituting one encrypted value for another.
However, attribute names and the names and values in the primary key (the partition key and sort key, if one is provided) must remain in plaintext to make the item discoverable. They’re included in the signature by default.
Important: Do not put any sensitive data in the table name, attribute names, the names and values of the primary key attributes, or any attribute values that you tell the client not to encrypt.
How to use it
I’ll demonstrate how to use the DynamoDB Encryption Client in Python with a simple example. I’ll encrypt and sign one table item, and then add it to an existing table. This example uses a test item with arbitrary data, but you can use a similar procedure to protect a table item that contains highly sensitive data, such as a customer’s personal information.
Step 1: Create a table
I’ll start by creating a DynamoDB table resource that represents an existing table. If you use the code, be sure to supply a valid table name.
# Create a DynamoDB table resource for an existing table.
# Replace 'your-table-name' with a valid table name.
import boto3

table_name = 'your-table-name'
table = boto3.resource('dynamodb').Table(table_name)
Step 2: Create a cryptographic materials provider
Next, create an instance of a cryptographic materials provider (CMP). The CMP is the component that gathers the encryption and signing keys that are used to encrypt and sign your table items. The CMP also determines the encryption algorithms that are used and whether you create unique keys for every item or reuse them.
The DynamoDB Encryption Client includes several CMPs, and you can create your own. If you’re in doubt, we can help you choose a CMP that fits your application and its security requirements. In this example, I use the Direct KMS Provider, which gets its cryptographic material from AWS Key Management Service (AWS KMS).
To create a Direct KMS Provider, you specify an AWS KMS customer master key. Be sure to replace the fictitious customer master key ID (the value of aws-cmk-id) in this example with a valid one.
# Create a Direct KMS provider. Pass in a valid KMS customer master key.
from dynamodb_encryption_sdk.material_providers.aws_kms import (
    AwsKmsCryptographicMaterialsProvider,
)

aws_cmk_id = '1234abcd-12ab-34cd-56ef-1234567890ab'
aws_kms_cmp = AwsKmsCryptographicMaterialsProvider(key_id=aws_cmk_id)
Step 3: Create an attribute actions object
An attribute actions object tells the DynamoDB Encryption Client which item attribute values to encrypt and which attributes to include in the signature. The options are: ENCRYPT_AND_SIGN, SIGN_ONLY, and DO_NOTHING.
This sample attribute actions object encrypts and signs all attribute values except for the value of the test attribute; that attribute is neither encrypted nor included in the signature.
# Tell the encrypted table to encrypt and sign all attributes except one.
from dynamodb_encryption_sdk.identifiers import CryptoAction
from dynamodb_encryption_sdk.structures import AttributeActions

actions = AttributeActions(
    default_action=CryptoAction.ENCRYPT_AND_SIGN,
    attribute_actions={'test': CryptoAction.DO_NOTHING}
)
If you’re using a helper class, such as the EncryptedTable class that I use in the next step, you can’t specify an attribute action for the primary key. The helper classes make sure that the primary key is signed, but never encrypted (SIGN_ONLY).
Step 4: Create an encrypted table
Now I can use the original table object, along with the materials provider and attribute actions, to create an encrypted table.
# Use these objects to create an encrypted table resource.
from dynamodb_encryption_sdk.encrypted.table import EncryptedTable

encrypted_table = EncryptedTable(
    table=table,
    materials_provider=aws_kms_cmp,
    attribute_actions=actions
)
In this example, I’m using the EncryptedTable helper class, which adds encryption features to the DynamoDB Table class in the AWS SDK for Python (Boto 3). The DynamoDB Encryption Client in Python also includes EncryptedClient and EncryptedResource helper classes.
The DynamoDB Encryption Client helper classes call the DescribeTable operation to find the primary key. The application that runs the code must have permission to call the operation.
We’re done configuring the client. Now, we can encrypt, sign, verify, and decrypt table items.
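For example, here’s a minimal sketch of that round trip. The item contents and attribute names are hypothetical (only the test attribute, which the actions object skips, comes from the configuration above), and the key attributes must match your table’s primary key schema. Writing through encrypted_table transparently encrypts and signs the item; reading through it verifies and decrypts the item.
# Hypothetical plaintext item; key attributes must match the table's schema.
plaintext_item = {
    'partition_attribute': 'value1',
    'sort_attribute': 55,
    'example': 'data',
    'test': 'test-value'  # excluded from encryption and signing by our actions
}

# put_item encrypts and signs the item before sending it to DynamoDB.
encrypted_table.put_item(Item=plaintext_item)

# get_item verifies and decrypts the item after retrieving it.
decrypted_item = encrypted_table.get_item(
    Key={'partition_attribute': 'value1', 'sort_attribute': 55}
)['Item']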
To view the encrypted item, call the GetItem operation on the original table object, instead of the encrypted_table object. It gets the item from the DynamoDB table without verifying and decrypting it.
Here’s an excerpt of the output that displays the encrypted item:
Figure 1: Output that displays the encrypted item
Client-side or server-side encryption?
The DynamoDB Encryption Client is designed for client-side encryption, where you encrypt your data before you send it to DynamoDB.
But you have other options. DynamoDB supports encryption at rest, a server-side encryption option that transparently encrypts the data in your table whenever DynamoDB saves the table to disk. You can even use both the DynamoDB Encryption Client and encryption at rest together. The encrypted and signed items that the client generates are standard table items that have binary data in their attribute values. Your choice depends on the sensitivity of your data and the security requirements of your application.
Although the Java and Python versions of the DynamoDB Encryption Client are fully compatible, the DynamoDB Encryption Client isn’t compatible with other client-side encryption libraries, such as the AWS Encryption SDK or the S3 Encryption Client. You can’t encrypt data with one library and decrypt it with another. For data that you store in DynamoDB, we recommend the DynamoDB Encryption Client.
Encryption is crucial
Using tools like the DynamoDB Encryption Client helps you to protect your table data and comply with the security requirements for your application. We hope that you use the client and join us in developing it on GitHub.
Abstract: ElsieFour (LC4) is a low-tech cipher that can be computed by hand; but unlike many historical ciphers, LC4 is designed to be hard to break. LC4 is intended for encrypted communication between humans only, and therefore it encrypts and decrypts plaintexts and ciphertexts consisting only of the English letters A through Z plus a few other characters. LC4 uses a nonce in addition to the secret key, and requires that different messages use unique nonces. LC4 performs authenticated encryption, and optional header data can be included in the authentication. This paper defines the LC4 encryption and decryption algorithms, analyzes LC4’s security, and describes a simple appliance for computing LC4 by hand.
Almost two decades ago I designed Solitaire, a pen-and-paper cipher that uses a deck of playing cards to store the cipher’s state. LC4 instead uses a set of specialized tiles. This gives the cipher designer more options, but the tiles can be incriminating in a way that a regular deck of playing cards is not.
Creating cryptographic defenses for such constrained devices is the goal of NIST’s lightweight cryptography initiative, which aims to develop cryptographic algorithm standards that can work within the confines of a simple electronic device. Many of the sensors, actuators and other micromachines that will function as eyes, ears and hands in IoT networks will work on scant electrical power and use circuitry far more limited than the chips found in even the simplest cell phone. Similar small electronics exist in the keyless entry fobs to newer-model cars and the Radio Frequency Identification (RFID) tags used to locate boxes in vast warehouses.
All of these gadgets are inexpensive to make and will fit nearly anywhere, but common encryption methods may demand more electronic resources than they possess.
This video demos a real-life Pokedex, complete with visual recognition, that I created using a Raspberry Pi, Python, and Deep Learning. You can find the entire blog post, including code, at https://www.pyimagesearch.com/2018/04/30/a-fun-hands-on-deep-learning-project-for-beginners-students-and-hobbyists/
The history of Pokémon in 30 seconds
The Pokémon franchise was created by video game designer Satoshi Tajiri in 1995. In the fictional world of Pokémon, Pokémon Trainers explore the vast landscape, catching and training small creatures called Pokémon. To date, there are 802 different types of Pokémon. They range from the ever-recognisable Pikachu, a bright yellow electric Pokémon, to the highly sought-after Shiny Charizard, a metallic, playing-card-shaped Pokémon that your mate Alex claims she has in mint condition, but refuses to show you.
In the world of Pokémon, children as young as ten-year-old protagonist and all-round annoyance Ash Ketchum are allowed to leave home and wander the wilderness. There, they hunt vicious, deadly creatures in the hope of becoming a Pokémon Master.
Adrian’s deep learning Pokédex
Adrian is a bit of a deep learning pro, as demonstrated by his Santa/Not Santa detector, which we wrote about last year. For that project, he also provided a great explanation of what deep learning actually is. In a nutshell:
…a subfield of machine learning, which is, in turn, a subfield of artificial intelligence (AI). While AI embodies a large, diverse set of techniques and algorithms related to automatic reasoning (inference, planning, heuristics, etc.), the machine learning subfields are specifically interested in pattern recognition and learning from data.
As with his earlier Raspberry Pi project, Adrian uses the Keras deep learning framework with a TensorFlow backend, plus a few other packages such as Adrian’s own imutils functions and OpenCV.
Adrian trained a Convolutional Neural Network using Keras on a dataset of 1191 Pokémon images, obtaining 96.84% accuracy. As Adrian explains, this model is able to identify Pokémon in still images and video. It’s perfect for creating a Pokédex – an interactive Pokémon catalogue that should, according to the franchise, be able to identify and read out information on any known Pokémon when captured by camera. More information on model training can be found on Adrian’s blog.
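For a sense of what such a classifier looks like, here is a minimal Keras sketch of a small CNN for this kind of image classification task. The layer sizes, the 96×96 input resolution, and the class count are illustrative assumptions rather than Adrian’s exact architecture; see his blog for the real model and training code.
# A minimal CNN sketch for multi-class image classification (illustrative only).
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

num_classes = 5  # assumption: one class per Pokémon in the training set
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(96, 96, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(num_classes, activation='softmax'),  # one probability per Pokémon
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])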
For the physical build, a Raspberry Pi 3 with camera module is paired with the Raspberry Pi 7″ touch display to create a portable Pokédex. And while Adrian comments that the same result can be achieved using your home computer and a webcam, that’s not how Adrian rolls as a Raspberry Pi fan.
Plus, the smaller size of the Pi is perfect for one of you to incorporate this deep learning model into a 3D-printed Pokédex for ultimate Pokémon glory, pretty please, thank you.
Adrian has gone into impressive detail about how the project works and how you can create your own on his blog, pyimagesearch. So if you’re interested in learning more about deep learning, and making your own Pokédex, be sure to visit.
The ISO has rejected two symmetric encryption algorithms: SIMON and SPECK. These algorithms were both designed by the NSA and made public in 2013. They are optimized for small and low-cost processors like IoT devices.
The risk of using NSA-designed ciphers, of course, is that they include NSA-designed backdoors. Personally, I doubt that they’re backdoored. And I always like seeing NSA-designed cryptography (particularly their key schedules). It’s like examining alien technology.
User authentication is functionality that every web application shares. We should have perfected it a long time ago, having implemented it so many times. And yet there are so many mistakes made all the time.
Part of the reason for that is that the list of things that can go wrong is long. You can store passwords incorrectly, you can have vulnerable password reset functionality, you can expose your session to a CSRF attack, your session can be hijacked, etc. So I’ll try to compile a list of best practices regarding user authentication. The OWASP top 10 is always something you should read, every year. But that might not be enough.
So, let’s start. I’ll try to be concise, but I’ll include as many of the related pitfalls as I can cover – e.g. what could go wrong with the user session after they log in:
Store passwords with bcrypt/scrypt/PBKDF2 (see the sketch after this list). No MD5 or SHA, as they are not good for password storage. A long per-user salt is mandatory (the aforementioned algorithms have it built in). If you don’t hash properly and someone gets hold of your database, they’ll be able to extract the passwords of all your users, and then try those passwords on other websites.
Use HTTPS. Period. (Otherwise user credentials can leak through unprotected networks.) Force HTTPS if the user opens a plain-text version.
Mark cookies as secure. Makes cookie theft harder.
Use CSRF protection (e.g. CSRF one-time tokens that are verified with each request). Frameworks have such functionality built-in.
Logout – let your users log out by deleting all cookies and invalidating the session. This makes usage of shared computers safer (yes, users should ideally use private browsing sessions, but not all of them are that savvy).
Session expiry – don’t have forever-lasting sessions. If the user closes your website, their session should expire after a while. “A while” may still be a big number depending on the service provided. For an ajax-heavy website, you can have regular ajax polling that keeps the session alive while the page stays open.
Remember me – implementing “remember me” (on this machine) functionality is actually hard due to the risk of a stolen persistent cookie. Spring Security’s approach is a good one to follow if you wish to implement more persistent logins.
Forgotten password flow – the forgotten password flow should rely on sending a one-time (or expiring) link to the user and asking for a new password when it’s opened. Auth0 explain it in this post and Postmark gives some best practices. How the link is formed is a separate discussion, and there are several approaches. Store a password-reset token in the user profile table and then send it as a parameter in the link. Or store nothing in the database and send a few params instead: userId:expiresTimestamp:hmac(userId+expiresTimestamp), as sketched in the code after this list. That way you have expiring links (rather than one-time links). The HMAC relies on a secret key, so the links can’t be spoofed. It seems there’s no consensus, as the OWASP guide takes a slightly different approach.
One-time login links – this is an option used by Slack, which sends one-time login links instead of asking users for passwords. It relies on the fact that your email is well guarded and you have access to it all the time. If your service is not accessed too often, you can use that approach instead of (rather than in addition to) passwords.
Limit login attempts – brute-force through a web UI should not be possible; therefore you should block login attempts after too many failures. One approach is to block them based on IP. The other is to block them based on the account attempted. (Spring example here.) Which one is better – I don’t know. Both can actually be combined. Instead of fully blocking the attempts, you may add a captcha after, say, the 5th attempt. But don’t add the captcha for the first attempt – it is bad user experience.
Don’t leak information through error messages – you shouldn’t allow attackers to figure out if an email is registered or not. If an email is not found, upon login just report “Incorrect credentials”. On password reset, it may be something like “If your email is registered, you should have received a password reset email”. This is often at odds with usability – people don’t always remember the email they used to register, and the ability to check a number of them before getting in might be important. So this rule is not absolute, though it’s desirable, especially for more critical systems.
Consider using a 3rd party authentication – OpenID Connect, OAuth by Google/Facebook/Twitter (but be careful with OAuth flaws as well). There’s an associated risk with relying on a 3rd party identity provider, and you still have to manage cookies, logout, etc., but some of the authentication aspects are simplified.
For high-risk or sensitive applications use 2-factor authentication. There’s a caveat with Google Authenticator though – if you lose your phone, you lose your accounts (unless there’s a manual process to restore it). That’s why Authy seems like a good solution for storing 2FA keys.
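To make two of the items above concrete – password storage and expiring HMAC reset links – here is a minimal Python sketch using only the standard library. The iteration count, token lifetime, and the way the secret key is supplied are illustrative assumptions; tune them for your system.
import hashlib, hmac, os, time

# --- Password storage with PBKDF2 (long per-user salt, many iterations) ---
def hash_password(password, iterations=600_000):
    salt = os.urandom(16)  # long per-user random salt
    digest = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, iterations)
    return salt, iterations, digest

def verify_password(password, salt, iterations, digest):
    candidate = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, digest)  # constant-time comparison

# --- Expiring, unspoofable reset link: userId:expiresTimestamp:hmac(...) ---
SECRET_KEY = os.environ['RESET_LINK_SECRET'].encode()  # assumption: key in env var

def make_reset_token(user_id, lifetime_seconds=3600):
    expires = str(int(time.time()) + lifetime_seconds)
    mac = hmac.new(SECRET_KEY, f'{user_id}:{expires}'.encode(), hashlib.sha256)
    return f'{user_id}:{expires}:{mac.hexdigest()}'

def check_reset_token(token):
    user_id, expires, mac_hex = token.rsplit(':', 2)
    expected = hmac.new(SECRET_KEY, f'{user_id}:{expires}'.encode(), hashlib.sha256)
    return hmac.compare_digest(expected.hexdigest(), mac_hex) and int(expires) > time.time()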
I’m sure I’m missing something. And you see it’s complicated. Sadly we’re still at the point where the most common functionality – authenticating users – is so tricky and cumbersome, that you almost always get at least some of it wrong.
What happens when you combine the Internet of Things, Machine Learning, and Edge Computing? Before I tell you, let’s review each one and discuss what AWS has to offer.
Internet of Things (IoT) – Devices that connect the physical world and the digital one. The devices, often equipped with one or more types of sensors, can be found in factories, vehicles, mines, fields, homes, and so forth. Important AWS services include AWS IoT Core, AWS IoT Analytics, AWS IoT Device Management, and Amazon FreeRTOS, along with others that you can find on the AWS IoT page.
Machine Learning (ML) – Systems that can be trained using an at-scale dataset and statistical algorithms, and used to make inferences from fresh data. At Amazon we use machine learning to drive the recommendations that you see when you shop, to optimize the paths in our fulfillment centers, to fly drones, and much more. We support leading open source machine learning frameworks such as TensorFlow and MXNet, and make ML accessible and easy to use through Amazon SageMaker. We also provide Amazon Rekognition for images and for video, Amazon Lex for chatbots, and a wide array of language services for text analysis, translation, speech recognition, and text to speech.
Edge Computing – The power to have compute resources and decision-making capabilities in disparate locations, often with intermittent or no connectivity to the cloud. AWS Greengrass builds on AWS IoT, giving you the ability to run Lambda functions and keep device state in sync even when not connected to the Internet.
ML Inference at the Edge
Today I would like to toss all three of these important new technologies into a blender! You can now perform Machine Learning inference at the edge using AWS Greengrass. This allows you to use the power of the AWS cloud (including fast, powerful instances equipped with GPUs) to build, train, and test your ML models before deploying them to small, low-powered, intermittently-connected IoT devices running in those factories, vehicles, mines, fields, and homes that I mentioned.
Here are a few of the many ways that you can put Greengrass ML Inference to use:
Precision Farming – With an ever-growing world population and unpredictable weather that can affect crop yields, the opportunity to use technology to increase yields is immense. Intelligent devices that are literally in the field can process images of soil, plants, pests, and crops, taking local corrective action and sending status reports to the cloud.
Physical Security – Smart devices (including the AWS DeepLens) can process images and scenes locally, looking for objects, watching for changes, and even detecting faces. When something of interest or concern arises, the device can pass the image or the video to the cloud and use Amazon Rekognition to take a closer look.
Industrial Maintenance – Smart, local monitoring can increase operational efficiency and reduce unplanned downtime. The monitors can run inference operations on power consumption, noise levels, and vibration to flag anomalies, predict failures, and detect faulty equipment.
Greengrass ML Inference Overview
There are several different aspects to this new AWS feature. Let’s take a look at each one:
Machine Learning Models – Precompiled TensorFlow and MXNet libraries, optimized for production use on the NVIDIA Jetson TX2 and Intel Atom devices, and development use on 32-bit Raspberry Pi devices. The optimized libraries can take advantage of GPU and FPGA hardware accelerators at the edge in order to provide fast, local inferences.
Model Deployment – SageMaker models can (if you give them the proper IAM permissions) be referenced directly from your Greengrass groups. You can also make use of models stored in S3 buckets. You can add a new machine learning resource to a group with a couple of clicks.
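To give a flavor of the device-side code, here is a minimal sketch of a Greengrass Lambda function that runs a local inference and publishes the result over MQTT. The greengrasssdk publish call is the standard way a Greengrass Lambda talks to AWS IoT; the model-loading and prediction helpers are placeholder stubs standing in for whatever ML framework you deploy, and the resource path and topic name are assumptions.
import json
import greengrasssdk

# MQTT client for publishing from the Greengrass core to AWS IoT.
iot_client = greengrasssdk.client('iot-data')

def load_model(path):
    # Placeholder: load your MXNet/TensorFlow model from the local resource
    # path that Greengrass mounts for the group.
    return None

def predict(model, image_path):
    # Placeholder: run your framework's inference on the captured image.
    return 'anomaly'

model = load_model('/greengrass-machine-learning/model')  # assumed resource path

def function_handler(event, context):
    # Run inference locally, with no round trip to the cloud.
    result = predict(model, event['image_path'])
    iot_client.publish(
        topic='devices/inference/results',  # assumed topic
        payload=json.dumps({'prediction': result})
    )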
Applying technology to healthcare data has the potential to produce many exciting and important outcomes. The analysis produced from healthcare data can empower clinicians to improve the health of individuals and populations by enabling them to make better decisions that enhance the care they provide.
The Observational Health Data Sciences and Informatics (OHDSI, pronounced “Odyssey”) program and community is working toward this goal by producing data standards and open-source solutions to store and analyze observational health data. Using the OHDSI tools, you can visualize the health of your entire population. You can build cohorts of patients, analyze incidence rates for various conditions, and estimate the effect of treatments on patients with certain conditions. You can also model health outcome predictions using machine learning algorithms.
One of the challenges often faced when working with big data tools is the expense of the infrastructure required to run them. Another challenge is the learning curve to implement and begin using these tools. Amazon Web Services has enabled us to address many of the classic IT challenges by making enterprise class infrastructure and technology available in an affordable, elastic, and automated way. This blog post demonstrates how to combine some of the OHDSI projects (Atlas, Achilles, WebAPI, and the OMOP Common Data Model) with AWS technologies. By doing so, you can quickly and inexpensively implement a health data science and informatics environment.
Shown following is just one example of the population health analysis that is possible with the OHDSI tools. This visualization shows the prevalence of various drugs within the given population of people. This information helps researchers and clinicians discover trends and make better informed decisions about patient health.
OHDSI application architecture on AWS
Before deploying an application on AWS that transmits, processes, or stores protected health information (PHI) or personally identifiable information (PII), address your organization’s compliance concerns. Make sure that you have worked with your internal compliance and legal team to ensure compliance with the laws and regulations that govern your organization. To understand how you can use AWS services as a part of your overall compliance program, see the AWS HIPAA Compliance whitepaper. With that said, we paid careful attention to the HIPAA control set during the design of this solution.
This blog post presents a complete OHDSI application environment, including a data warehouse with sample data.
Following, you can see a block diagram of how the OHDSI tools map to the services provided by AWS.
Atlas is the web application that researchers interact with to perform analysis. Atlas interacts with the underlying databases through a web services application named WebAPI. In this example, both Atlas and WebAPI are deployed and managed by AWS Elastic Beanstalk. Elastic Beanstalk is an easy-to-use service for deploying and scaling web applications. Simply upload the Atlas and WebAPI code and Elastic Beanstalk automatically handles the deployment. It covers everything from capacity provisioning, load balancing, autoscaling, and high availability, to application health monitoring. Using a feature of Elastic Beanstalk called ebextensions, the Atlas and WebAPI servers are customized to use an encrypted storage volume for the middleware application logs.
Atlas stores the state of the various patient cohorts that are analyzed in a dedicated database separate from your observational health data. This database is provided by Amazon Aurora with PostgreSQL compatibility.
Amazon Aurora is a relational database built for the cloud that combines the performance and availability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. It provides cost-efficient and resizable capacity while automating time-consuming administration tasks such as hardware provisioning, database setup, patching, and backups. It is configured for high availability and uses encryption at rest for the database and backups, and encryption in flight for the JDBC connections.
All of your observational health data is stored inside the OHDSI Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). This model also stores useful vocabulary tables that help to translate values from various data sources (like EHR systems and claims data).
The OMOP CDM schema is deployed onto Amazon Redshift. Amazon Redshift is a fast, fully managed data warehouse that allows you to run complex analytic queries against petabytes of structured data. It uses sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution. You can also resize an Amazon Redshift cluster as your requirements for it change.
The solution in this blog post automatically loads de-identified sample data of 1,000 people from the CMS 2008–2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF). The data has helpful formatting from LTS Computing LLC. Vocabulary data from the OHDSI Athena project is also loaded into the OMOP CDM, and a results set is computed by OHDSI Achilles.
Following is a detailed technical diagram showing the configuration of the architecture to be deployed.
Deploying OHDSI on AWS
Everything just described is automatically deployed by using an AWS CloudFormation template. Using this template, you can quickly get started with the OHDSI project. The CloudFormation templates for this deployment as well as all of the supporting scripts and source code can be found in the AWS Labs GitHub repo.
From your AWS account, open the CloudFormation Management Console and choose Create Stack. From there, copy and paste the CloudFormation template URL (available in the AWS Labs GitHub repo mentioned above) into the Specify an Amazon S3 template URL box, and choose Next.
On the next screen, you provide a Stack Name (this can be anything you like) and a few other parameters for your OHDSI environment.
You use the DatabasePassword parameter to set the password for the master user account of the Amazon Redshift and Aurora databases.
You use the EBEndpoint name to generate a unique URL for Atlas to access the OHDSI environment. The URL takes the form http://EBEndpoint.AWS-Region.elasticbeanstalk.com, where EBEndpoint is the name you provide and AWS-Region is the AWS Region in which you deploy. You can configure this URL through Elastic Beanstalk if you want to change it in the future.
You use the KPair option to choose one of your existing Amazon EC2 key pairs to use with the instances that Elastic Beanstalk deploys. By doing this, you can gain administrative access to these instances in the future if you need to. If you don’t already have an Amazon EC2 key pair, you can generate one for free. You do this by going to the Key Pairs section of the EC2 console and choosing Generate Key Pair.
Finally, you use the UserIPRange parameter to specify a CIDR IP address range from which to access your OHDSI environment. By default, your OHDSI environment is accessible over the public internet. Use UserIPRange to limit access over the Internet to a single IP address or a range of IP addresses that represent users you want to have access. Through additional configuration, you can also make your OHDSI environment completely private and accessible only through a VPN or AWS Direct Connect private circuit.
When you’ve provided all Parameters, choose Next.
On the next screen, you can provide some other optional information like tags at your discretion, or just choose Next.
On the next screen, you can review what will be deployed. At the bottom of the screen, there is a check box for you to acknowledge that AWS CloudFormation might create IAM resources with custom names. This is correct; the template being deployed creates four custom roles that give permission for the AWS services involved to communicate with each other. Details of these permissions are inside the CloudFormation template referenced in the URL given in the first step. Check the box acknowledging this and choose Next.
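If you prefer to script the deployment rather than click through the console, the same stack can be created with boto3. Here is a minimal sketch; the stack name, template URL, and parameter values are placeholders (the parameter names match those described above), and the capability flag corresponds to the IAM acknowledgment you just made in the console.
import boto3

cfn = boto3.client('cloudformation')

# Placeholder values; substitute your own template URL, password, key pair, and CIDR.
cfn.create_stack(
    StackName='ohdsi-environment',
    TemplateURL='https://s3.amazonaws.com/your-bucket/ohdsi-template.yaml',
    Parameters=[
        {'ParameterKey': 'DatabasePassword', 'ParameterValue': 'YourStrongPassword1'},
        {'ParameterKey': 'EBEndpoint', 'ParameterValue': 'my-ohdsi'},
        {'ParameterKey': 'KPair', 'ParameterValue': 'my-ec2-keypair'},
        {'ParameterKey': 'UserIPRange', 'ParameterValue': '203.0.113.0/24'},
    ],
    Capabilities=['CAPABILITY_NAMED_IAM'],  # the template creates named IAM roles
)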
You can watch as CloudFormation builds out your OHDSI architecture. A CloudFormation deployment is called a stack. The parent stack creates two child stacks, one containing the VPC and IAM roles and another created by Elastic Beanstalk with the Atlas and WebAPI servers. When all three stacks have reached the green CREATE_COMPLETE status, as shown in the screenshot following, then the OHDSI architecture has been deployed.
There is still some work going on behind the scenes, though. To watch the progress, browse to the Amazon Redshift section of your AWS Management Console and choose the Amazon Redshift cluster that was created for your OHDSI architecture. After you do so, you can observe the Loads and Queries tabs.
First, on the Loads tab, you can see the CMS DE-SynPUF sample data and Athena vocabulary data being loaded into the OMOP Common Data Model. After you see the VOCABULARY table reach the COMPLETED status (as shown following), all of the sample and vocabulary data has been loaded.
After the data loads, the Achilles computation starts. On the Queries tab, you can watch Achilles running queries against your database to build out the Results schema. Achilles runs a large number of queries, and the entire process can take quite some time (about 20 minutes for the sample data we’ve loaded). Eventually, no new queries show up in the Queries tab, which shows that the Achilles computation is completed. The entire process from the time you executed the CloudFormation template until the Achilles computation is completed usually takes about an hour and 15 minutes.
At this point, you can browse to the Elastic Beanstalk section of the AWS Management Console. There, you can choose the OHDSI Application and Environment (green box) that was deployed by the CloudFormation template. At the top of the dashboard, as shown following, you see a link to a URL. This URL matches the name you provided in the EBEndpoint parameter of the CloudFormation template. Choose this URL, and you can start using Atlas to explore the CMS DE-SynPUF sample data!
Cost of deploying this environment
It used to be common to see healthcare data analytics environments deployed in an on-premises data center with expensive data warehouse appliances and virtualized environments. The cloud era has democratized the availability of the infrastructure required to do this type of data analysis, so that now it is within reach of even small organizations. This environment can expand to analyze petabyte-scale health data, and you only pay for what you need. You can see an estimated breakdown of the monthly cost components for this environment on the AWS Solution Calculator.
It’s also worth noting that this environment does not have to be run all of the time. If you are only performing analyses periodically, you can terminate the environment when you are finished and restore it from the database backups when you want to continue working. This would reduce the cost of operation even further.
Now that you have a fully functional OHDSI environment with sample data, you can use this to explore and learn the toolset and its capabilities. After learning with the sample data, you can begin gaining insights by analyzing your own organization’s health data. You can do this using an extract, transform, load (ETL) process from one or more of your health data sources.
James Wiggins is a senior healthcare solutions architect at AWS. He is passionate about using technology to help organizations positively impact world health. He also loves spending time with his wife and three children.