The NSA has published an advisory outlining how “malicious cyber actors” are “are manipulating trust in federated authentication environments to access protected data in the cloud.” This is related to the SolarWinds hack I have previouslywrittenabout, and represents one of the techniques the SVR is using once it has gained access to target networks.
Malicious cyberactors are abusing trust in federated authentication environments to access protected data. The exploitation occurs after the actors have gained initial access to a victim’s on-premises network. The actors leverage privileged access in the on-premises environment to subvert the mechanisms that the organization uses to grant access to cloud and on-premises resources and/or to compromise administrator credentials with the ability to manage cloud resources. The actors demonstrate two sets of tactics, techniques,and procedures (TTP) for gaining access to the victim network’s cloud resources, often with a particular focus on organizational email.
In the first TTP, the actors compromise on-premises components of a federated SSO infrastructure and steal the credential or private key that is used to sign Security Assertion Markup Language (SAML) tokens(TA0006, T1552, T1552.004). Using the private keys, the actors then forge trusted authentication tokens to access cloud resources. A recent NSA Cybersecurity Advisory warned of actors exploiting a vulnerability in VMware Access and VMware Identity Manager that allowed them to perform this TTP and abuse federated SSO infrastructure.While that example of this TTP may have previously been attributed to nation-state actors, a wealth of actors could be leveraging this TTP for their objectives. This SAML forgery technique has been known and used by cyber actors since at least 2017.
In a variation of the first TTP, if the malicious cyber actors are unable to obtain anon-premises signing key, they would attempt to gain sufficient administrative privileges within the cloud tenant to add a malicious certificate trust relationship for forging SAML tokens.
In the second TTP, the actors leverage a compromised global administrator account to assign credentials to cloud application service principals (identities for cloud applications that allow the applications to be invoked to access other cloud resources). The actors then invoke the application’s credentials for automated access to cloud resources (often email in particular) that would otherwise be difficult for the actors to access or would more easily be noticed as suspicious (T1114, T1114.002).
This is an ongoing story, and I expect to see a lot more about TTP — nice acronym there — in coming weeks.
The United States government’s continuing disagreement with the Chinese company Huawei underscores a much larger problem with computer technologies in general: We have no choice but to trust them completely, and it’s impossible to verify that they’re trustworthy. Solving this problem which is increasingly a national security issue will require us to both make major policy changes and invent new technologies.
The Huawei problem is simple to explain. The company is based in China and subject to the rules and dictates of the Chinese government. The government could require Huawei to install back doors into the 5G routers it sells abroad, allowing the government to eavesdrop on communications or – even worse – take control of the routers during wartime. Since the United States will rely on those routers for all of its communications, we become vulnerable by building our 5G backbone on Huawei equipment.
It’s obvious that we can’t trust computer equipment from a country we don’t trust, but the problem is much more pervasive than that. The computers and smartphones you use are not built in the United States. Their chips aren’t made in the United States. The engineers who design and program them come from over a hundred countries. Thousands of people have the opportunity, acting alone, to slip a back door into the final product.
There’s more. Open-source software packages are increasingly targeted by groups installing back doors. Fake apps in the Google Play store illustrate vulnerabilities in our software distribution systems. The NotPetya worm was distributed by a fraudulent update to a popular Ukranian accounting package, illustrating vulnerabilities in our update systems. Hardware chips can be back-doored at the point of fabrication, even if the design is secure. The National Security Agency exploited the shipping process to subvert Cisco routers intended for the Syrian telephone company. The overall problem is that of supply-chain security, because every part of the supply chain can be attacked.
And while nation-state threats like China and Huawei – or Russia and the antivirus company Kaspersky a couple of years earlier – make the news, many of the vulnerabilities I described above are being exploited by cybercriminals.
Policy solutions involve forcing companies to open their technical details to inspection, including the source code of their products and the designs of their hardware. Huawei and Kaspersky have offered this sort of openness as a way to demonstrate that they are trustworthy. This is not a worthless gesture, and it helps, but it’s not nearly enough. Too many back doors can evade this kind of inspection.
Technical solutions fall into two basic categories, both currently beyond our reach. One is to improve the technical inspection processes for products whose designers provide source code and hardware design specifications, and for products that arrive without any transparency information at all. In both cases, we want to verify that the end product is secure and free of back doors. Sometimes we can do this for some classes of back doors: We can inspect source code this is how a Linux back door was discovered and removed in 2003 or the hardware design, which becomes a cleverness battle between attacker and defender.
This is an area that needs more research. Today, the advantage goes to the attacker. It’s hard to ensure that the hardware and software you examine is the same as what you get, and it’s too easy to create back doors that slip past inspection. And while we can find and correct some of these supply-chain attacks, we won’t find them all. It’s a needle-in-a-haystack problem, except we don’t know what a needle looks like. We need technologies, possibly based on artificial intelligence, that can inspect systems more thoroughly and faster than humans can do. We need them quickly.
The other solution is to build a secure system, even though any of its parts can be subverted. This is what the former Deputy Director of National Intelligence Sue Gordon meant in April when she said about 5G, “You have to presume a dirty network.” Or more precisely, can we solve this by building trustworthy systems out of untrustworthy parts?
It sounds ridiculous on its face, but the Internet itself was a solution to a similar problem: a reliable network built out of unreliable parts. This was the result of decades of research. That research continues today, and it’s how we can have highly resilient distributed systems like Google’s network even though none of the individual components are particularly good. It’s also the philosophy behind much of the cybersecurity industry today: systems watching one another, looking for vulnerabilities and signs of attack.
Security is a lot harder than reliability. We don’t even really know how to build secure systems out of secure parts, let alone out of parts and processes that we can’t trust and that are almost certainly being subverted by governments and criminals around the world. Current security technologies are nowhere near good enough, though, to defend against these increasingly sophisticated attacks. So while this is an important part of the solution, and something we need to focus research on, it’s not going to solve our near-term problems.
At the same time, all of these problems are getting worse as computers and networks become more critical to personal and national security. The value of 5G isn’t for you to watch videos faster; it’s for things talking to things without bothering you. These things – cars, appliances, power plants, smart cities – increasingly affect the world in a direct physical manner. They’re increasingly autonomous, using A.I. and other technologies to make decisions without human intervention. The risk from Chinese back doors into our networks and computers isn’t that their government will listen in on our conversations; it’s that they’ll turn the power off or make all the cars crash into one another.
All of this doesn’t leave us with many options for today’s supply-chain problems. We still have to presume a dirty network – as well as back-doored computers and phones — and we can clean up only a fraction of the vulnerabilities. Citing the lack of non-Chinese alternatives for some of the communications hardware, already some are calling to abandon attempts to secure 5G from Chinese back doors and work on having secure American or European alternatives for 6G networks. It’s not nearly enough to solve the problem, but it’s a start.
Perhaps these half-solutions are the best we can do. Live with the problem today, and accelerate research to solve the problem for the future. These are research projects on a par with the Internet itself. They need government funding, like the Internet itself. And, also like the Internet, they’re critical to national security.
Critically, these systems must be as secure as we can make them. As former FCC Commissioner Tom Wheeler has explained, there’s a lot more to securing 5G than keeping Chinese equipment out of the network. This means we have to give up the fantasy that law enforcement can have back doors to aid criminal investigations without also weakening these systems. The world uses one network, and there can only be one answer: Either everyone gets to spy, or no one gets to spy. And as these systems become more critical to national security, a network secure from all eavesdroppers becomes more important.
What our smartphones and relationship abusers share is that they both exert power over us in a world shaped to tip the balance in their favour, and they both work really, really hard to obscure this fact and keep us confused and blaming ourselves. Here are some of the ways our unequal relationship with our smartphones is like an abusive relationship:
They isolate us from deeper, competing relationships in favour of superficial contact — ‘user engagement’ — that keeps their hold on us strong. Working with social media, they insidiously curate our social lives, manipulating us emotionally with dark patterns to keep us scrolling.
They tell us the onus is on us to manage their behavior. It’s our job to tiptoe around them and limit their harms. Spending too much time on a literally-designed-to-be-behaviorally-addictive phone? They send company-approved messages about our online time, but ban from their stores the apps that would really cut our use. We just need to use willpower. We just need to be good enough to deserve them.
They betray us, leaking data / spreading secrets. What we shared privately with them is suddenly public. Sometimes this destroys lives, but hey, we only have ourselves to blame. They fight nasty and under-handed, and are so, so sorry when they get caught that we’re meant to feel bad for them. But they never truly change, and each time we take them back, we grow weaker.
They love-bomb us when we try to break away, piling on the free data or device upgrades, making us click through page after page of dark pattern, telling us no one understands us like they do, no one else sees everything we really are, no one else will want us.
It’s impossible to just cut them off. They’ve wormed themselves into every part of our lives, making life without them unimaginable. And anyway, the relationship is complicated. There is love in it, or there once was. Surely we can get back to that if we just manage them the way they want us to?
Nope. Our devices are basically gaslighting us. They tell us they work for and care about us, and if we just treat them right then we can learn to trust them. But all the evidence shows the opposite is true.
Excellent article on fraudulent seller tactics on Amazon.
The most prominent black hat companies for US Amazon sellers offer ways to manipulate Amazon’s ranking system to promote products, protect accounts from disciplinary actions, and crush competitors. Sometimes, these black hat companies bribe corporate Amazon employees to leak information from the company’s wiki pages and business reports, which they then resell to marketplace sellers for steep prices. One black hat company charges as much as $10,000 a month to help Amazon sellers appear at the top of product search results. Other tactics to promote sellers’ products include removing negative reviews from product pages and exploiting technical loopholes on Amazon’s site to lift products’ overall sales rankings.
[…]
AmzPandora’s services ranged from small tasks to more ambitious strategies to rank a product higher using Amazon’s algorithm. While it was online, it offered to ping internal contacts at Amazon for $500 to get information about why a seller’s account had been suspended, as well as advice on how to appeal the suspension. For $300, the company promised to remove an unspecified number of negative reviews on a listing within three to seven days, which would help increase the overall star rating for a product. For $1.50, the company offered a service to fool the algorithm into believing a product had been added to a shopper’s cart or wish list by writing a super URL. And for $1,200, an Amazon seller could purchase a “frequently bought together” spot on another marketplace product’s page that would appear for two weeks, which AmzPandora promised would lead to a 10% increase in sales.
This was a good article on this from last year. (My blog post.)
Amazon has a real problem here, primarily because trust in the system is paramount to Amazon’s success. As much as they need to crack down on fraudulent sellers, they really want articles like these to not be written.
Facebook is making a new and stronger commitment to privacy. Last month, the company hired three of its most vociferous critics and installed them in senior technical positions. And on Wednesday, Mark Zuckerberg wrote that the company will pivot to focus on private conversations over the public sharing that has long defined the platform, even while conceding that “frankly we don’t currently have a strong reputation for building privacy protective services.”
There is amplereason to question Zuckerberg’s pronouncement: The company has made — and broken — many privacy promises over the years. And if you read his 3,000-word post carefully, Zuckerberg says nothing about changing Facebook’s surveillance capitalism business model. All the post discusses is making private chats more central to the company, which seems to be a play for increased market dominance and to counter the Chinese company WeChat.
In security and privacy, the devil is always in the details — and Zuckerberg’s post provides none. But we’ll take him at his word and try to fill in some of the details here. What follows is a list of changes we should expect if Facebook is serious about changing its business model and improving user privacy.
How Facebook treats people on its platform
Increased transparency over advertiser and app accesses to user data. Today, Facebook users can download and view much of the data the company has about them. This is important, but it doesn’t go far enough. The company could be more transparent about what data it shares with advertisers and others and how it allows advertisers to select users they show ads to. Facebook could use its substantial skills in usability testing to help people understand the mechanisms advertisers use to show them ads or the reasoning behind what it chooses to show in user timelines. It could deliver on promises in this area.
Better — and more usable — privacy options. Facebook users have limited control over how their data is shared with other Facebook users and almost no control over how it is shared with Facebook’s advertisers, which are the company’s real customers. Moreover, the controls are buried deep behind complex and confusing menu options. To be fair, some of this is because privacy is complex, and it’s hard to understand the results of different options. But much of this is deliberate; Facebook doesn’t want its users to make their data private from other users.
The company could give people better control over how — and whether — their data is used, shared, and sold. For example, it could allow users to turn off individually targeted news and advertising. By this, we don’t mean simply making those advertisements invisible; we mean turning off the data flows into those tailoring systems. Finally, since most users stick to the default options when it comes to configuring their apps, a changing Facebook could tilt those defaults toward more privacy, requiring less tailoring most of the time.
More user protection from stalking. “Facebook stalking” is often thought of as “stalking light,” or “harmless.” But stalkers are rarely harmless. Facebook should acknowledge this class of misuse and work with experts to build tools that protect all of its users, especially its most vulnerable ones. Such tools should guide normal people away from creepiness and give victims power and flexibility to enlist aid from sources ranging from advocates to police.
Fully ending real-name enforcement. Facebook’s real-names policy, requiring people to use their actual legal names on the platform, hurtspeople such as activists, victims of intimate partner violence, police officers whose work makes them targets, and anyone with a public persona who wishes to have control over how they identify to the public. There are many ways Facebook can improve on this, from ending enforcement to allowing verifying pseudonyms for everyone — not just celebrities like Lady Gaga. Doing so would mark a clear shift.
How Facebook runs its platform
Increased transparency of Facebook’s business practices. One of the hard things about evaluating Facebook is the effort needed to get good information about its business practices. When violations are exposed by the media, as they regularly are, we are all surprised at the different ways Facebook violates user privacy. Most recently, the company used phone numbers provided for two-factor authentication for advertising and networking purposes. Facebook needs to be both explicit and detailed about how and when it shares user data. In fact, a move from discussing “sharing” to discussing “transfers,” “access to raw information,” and “access to derived information” would be a visible improvement.
Increased transparency regarding censorship rules. Facebook makes choices about what content is acceptable on its site. Those choices are controversial, implemented by thousands of low-paid workers quickly implementing unclear rules. These are tremendously hard problems without clear solutions. Even obvious rules like banning hateful words run into challenges when people try to legitimately discuss certain important topics. Whatever Facebook does in this regard, the company needs be more transparent about its processes. It should allow regulators and the public to audit the company’s practices. Moreover, Facebook should share any innovative engineering solutions with the world, much as it currently shares its data center engineering.
Better security for collected user data. There have been numerous examples of attackers targeting cloud service platforms to gain access to user data. Facebook has a large and skilled product security team that says some of the right things. That team needs to be involved in the design trade-offs for features and not just review the near-final designs for flaws. Shutting down a feature based on internal security analysis would be a clear message.
Better data security so Facebook sees less. Facebook eavesdrops on almost every aspect of its users’ lives. On the other hand, WhatsApp — purchased by Facebook in 2014 — provides users with end-to-end encrypted messaging. While Facebook knows who is messaging whom and how often, Facebook has no way of learning the contents of those messages. Recently, Facebook announced plans to combine WhatsApp, Facebook Messenger, and Instagram, extending WhatsApp’s security to the consolidated system. Changing course here would be a dramatic and negative signal.
Collecting less data from outside of Facebook. Facebook doesn’t just collect data about you when you’re on the platform. Because its “like” button is on so many other pages, the company can collect data about you when you’re not on Facebook. It even collects what it calls “shadow profiles” — data about you even if you’re not a Facebook user. This data is combined with other surveillance data the company buys, including health and financial data. Collecting and saving less of this data would be a strong indicator of a new direction for the company.
Better use of Facebook data to prevent violence. There is a trade-off between Facebook seeing less and Facebook doing more to prevent hateful and inflammatory speech. Dozens of people have been killed by mob violence because of fake news spread on WhatsApp. If Facebook were doing a convincing job of controlling fake news without end-to-end encryption, then we would expect to hear how it could use patterns in metadata to handle encrypted fake news.
How Facebook manages for privacy
Create a team measured on privacy and trust. Where companies spend their money tells you what matters to them. Facebook has a large and important growth team, but what team, if any, is responsible for privacy, not as a matter of compliance or pushing the rules, but for engineering? Transparency in how it is staffed relative to other teams would be telling.
Hire a senior executive responsible for trust. Facebook’s current team has been focused on growth and revenue. Its one chief security officer, Alex Stamos, was not replaced when he left in 2018, which may indicate that having an advocate for security on the leadership team led to debate and disagreement. Retaining a voice for security and privacy issues at the executive level, before those issues affected users, was a good thing. Now that responsibility is diffuse. It’s unclear how Facebook measures and assesses its own progress and who might be held accountable for failings. Facebook can begin the process of fixing this by designating a senior executive who is responsible for trust.
Engage with regulators. Much of Facebook’s posturing seems to be an attempt to forestall regulation. Facebook sends lobbyists to Washington and other capitals, and until recently the company sent support staff to politician’s offices. It has secret lobbying campaigns against privacy laws. And Facebook has repeatedly violated a 2011 Federal Trade Commission consent order regarding user privacy. Regulating big technical projects is not easy. Most of the people who understand how these systems work understand them because they build them. Societies will regulate Facebook, and the quality of that regulation requires real education of legislators and their staffs. While businesses often want to avoid regulation, any focus on privacy will require strong government oversight. If Facebook is serious about privacy being a real interest, it will accept both government regulation and community input.
User privacy is traditionally against Facebook’s core business interests. Advertising is its business model, and targeted ads sell better and more profitably — and that requires users to engage with the platform as much as possible. Increased pressure on Facebook to manage propaganda and hate speech could easily lead to more surveillance. But there is pressure in the other direction as well, as users equate privacy with increased control over how they present themselves on the platform.
We don’t expect Facebook to abandon its advertising business model, relent in its push for monopolistic dominance, or fundamentally alter its social networking platforms. But the company can give users important privacy protections and controls without abandoning surveillance capitalism. While some of these changes will reduce profits in the short term, we hope Facebook’s leadership realizes that they are in the best long-term interest of the company.
Facebook talks about community and bringing people together. These are admirable goals, and there’s plenty of value (and profit) in having a sustainable platform for connecting people. But as long as the most important measure of success is short-term profit, doing things that help strengthen communities will fall by the wayside. Surveillance, which allows individually targeted advertising, will be prioritized over user privacy. Outrage, which drives engagement, will be prioritized over feelings of belonging. And corporate secrecy, which allows Facebook to evade both regulators and its users, will be prioritized over societal oversight. If Facebook now truly believes that these latter options are critical to its long-term success as a company, we welcome the changes that are forthcoming.
This essay was co-authored with Adam Shostack, and originally appeared on Medium OneZero. We wrote a similar essay in 2002 about judging Microsoft’s then newfound commitment to security.
In his 2008 white paper that first proposed bitcoin, the anonymous Satoshi Nakamoto concluded with: “We have proposed a system for electronic transactions without relying on trust.” He was referring to blockchain, the system behind bitcoin cryptocurrency. The circumvention of trust is a great promise, but it’s just not true. Yes, bitcoin eliminates certain trusted intermediaries that are inherent in other payment systems like credit cards. But you still have to trust bitcoin — and everything about it.
Much has been written about blockchains and how they displace, reshape, or eliminate trust. But when you analyze both blockchain and trust, you quickly realize that there is much more hype than value. Blockchain solutions are often much worse than what they replace.
First, a caveat. By blockchain, I mean something very specific: the data structures and protocols that make up a public blockchain. These have three essential elements. The first is a distributed (as in multiple copies) but centralized (as in there’s only one) ledger, which is a way of recording what happened and in what order. This ledger is public, meaning that anyone can read it, and immutable, meaning that no one can change what happened in the past.
The second element is the consensus algorithm, which is a way to ensure all the copies of the ledger are the same. This is generally called mining; a critical part of the system is that anyone can participate. It is also distributed, meaning that you don’t have to trust any particular node in the consensus network. It can also be extremely expensive, both in data storage and in the energy required to maintain it. Bitcoin has the most expensive consensus algorithm the world has ever seen, by far.
Finally, the third element is the currency. This is some sort of digital token that has value and is publicly traded. Currency is a necessary element of a blockchain to align the incentives of everyone involved. Transactions involving these tokens are stored on the ledger.
Private blockchains are completely uninteresting. (By this, I mean systems that use the blockchain data structure but don’t have the above three elements.) In general, they have some external limitation on who can interact with the blockchain and its features. These are not anything new; they’re distributed append-only data structures with a list of individuals authorized to add to it. Consensus protocols have been studied in distributed systems for more than 60 years. Append-only data structures have been similarly well covered. They’re blockchains in name only, and — as far as I can tell — the only reason to operate one is to ride on the blockchain hype.
All three elements of a public blockchain fit together as a single network that offers new security properties. The question is: Is it actually good for anything? It’s all a matter of trust.
Trust is essential to society. As a species, humans are wired to trust one another. Society can’t function without trust, and the fact that we mostly don’t even think about it is a measure of how well trust works.
The word “trust” is loaded with many meanings. There’s personal and intimate trust. When we say we trust a friend, we mean that we trust their intentions and know that those intentions will inform their actions. There’s also the less intimate, less personal trust — we might not know someone personally, or know their motivations, but we can trust their future actions. Blockchain enables this sort of trust: We don’t know any bitcoin miners, for example, but we trust that they will follow the mining protocol and make the whole system work.
Most blockchain enthusiasts have a unnaturally narrow definition of trust. They’re fond of catchphrases like “in code we trust,” “in math we trust,” and “in crypto we trust.” This is trust as verification. But verification isn’t the same as trust.
In 2012, I wrote a book about trust and security, Liars and Outliers. In it, I listed four very general systems our species uses to incentivize trustworthy behavior. The first two are morals and reputation. The problem is that they scale only to a certain population size. Primitive systems were good enough for small communities, but larger communities required delegation, and more formalism.
The third is institutions. Institutions have rules and laws that induce people to behave according to the group norm, imposing sanctions on those who do not. In a sense, laws formalize reputation. Finally, the fourth is security systems. These are the wide varieties of security technologies we employ: door locks and tall fences, alarm systems and guards, forensics and audit systems, and so on.
These four elements work together to enable trust. Take banking, for example. Financial institutions, merchants, and individuals are all concerned with their reputations, which prevents theft and fraud. The laws and regulations surrounding every aspect of banking keep everyone in line, including backstops that limit risks in the case of fraud. And there are lots of security systems in place, from anti-counterfeiting technologies to internet-security technologies.
In his 2018 book, Blockchain and the New Architecture of Trust, Kevin Werbach outlines four different “trust architectures.” The first is peer-to-peer trust. This basically corresponds to my morals and reputational systems: pairs of people who come to trust each other. His second is leviathan trust, which corresponds to institutional trust. You can see this working in our system of contracts, which allows parties that don’t trust each other to enter into an agreement because they both trust that a government system will help resolve disputes. His third is intermediary trust. A good example is the credit card system, which allows untrusting buyers and sellers to engage in commerce. His fourth trust architecture is distributed trust. This is emergent trust in the particular security system that is blockchain.
What blockchain does is shift some of the trust in people and institutions to trust in technology. You need to trust the cryptography, the protocols, the software, the computers and the network. And you need to trust them absolutely, because they’re often single points of failure.
When that trust turns out to be misplaced, there is no recourse. If your bitcoin exchange gets hacked, you lose all of your money. If your bitcoin wallet gets hacked, you lose all of your money. If you forget your login credentials, you lose all of your money. If there’s a bug in the code of your smart contract, you lose all of your money. If someone successfully hacks the blockchain security, you lose all of your money. In many ways, trusting technology is harder than trusting people. Would you rather trust a human legal system or the details of some computer code you don’t have the expertise to audit?
Blockchain enthusiasts point to more traditional forms of trust — bank processing fees, for example — as expensive. But blockchain trust is also costly; the cost is just hidden. For bitcoin, that’s the cost of the additional bitcoin mined, the transaction fees, and the enormous environmental waste.
Blockchain doesn’t eliminate the need to trust human institutions. There will always be a big gap that can’t be addressed by technology alone. People still need to be in charge, and there is always a need for governance outside the system. This is obvious in the ongoing debate about changing the bitcoin block size, or in fixing the DAO attack against Ethereum. There’s always a need to override the rules, and there’s always a need for the ability to make permanent rules changes. As long as hard forks are a possibility — that’s when the people in charge of a blockchain step outside the system to change it — people will need to be in charge.
Any blockchain system will have to coexist with other, more conventional systems. Modern banking, for example, is designed to be reversible. Bitcoin is not. That makes it hard to make the two compatible, and the result is often an insecurity. Steve Wozniak was scammed out of $70K in bitcoin because he forgot this.
Blockchain technology is often centralized. Bitcoin might theoretically be based on distributed trust, but in practice, that’s just not true. Just about everyone using bitcoin has to trust one of the few available wallets and use one of the few available exchanges. People have to trust the software and the operating systems and the computers everything is running on. And we’ve seen attacks against wallets and exchanges. We’ve seen Trojans and phishing and password guessing. Criminals have even used flaws in the system that people use to repair their cell phones to steal bitcoin.
Moreover, in any distributed trust system, there are backdoor methods for centralization to creep back in. With bitcoin, there are only a few miners of consequence. There’s one company that provides most of the mining hardware. There are only a few dominant exchanges. To the extent that most people interact with bitcoin, it is through these centralized systems. This also allows for attacks against blockchain-based systems.
These issues are not bugs in current blockchain applications, they’re inherent in how blockchain works. Any evaluation of the security of the system has to take the whole socio-technical system into account. Too many blockchain enthusiasts focus on the technology and ignore the rest.
To the extent that people don’t use bitcoin, it’s because they don’t trust bitcoin. That has nothing to do with the cryptography or the protocols. In fact, a system where you can lose your life savings if you forget your key or download a piece of malware is not particularly trustworthy. No amount of explaining how SHA-256 works to prevent double-spending will fix that.
Similarly, to the extent that people do use blockchains, it is because they trust them. People either own bitcoin or not based on reputation; that’s true even for speculators who own bitcoin simply because they think it will make them rich quickly. People choose a wallet for their cryptocurrency, and an exchange for their transactions, based on reputation. We even evaluate and trust the cryptography that underpins blockchains based on the algorithms’ reputation.
To see how this can fail, look at the various supply-chain security systems that are using blockchain. A blockchain isn’t a necessary feature of any of them. The reasons they’re successful is that everyone has a single software platform to enter their data in. Even though the blockchain systems are built on distributed trust, people don’t necessarily accept that. For example, some companies don’t trust the IBM/Maersk system because it’s not their blockchain.
Irrational? Maybe, but that’s how trust works. It can’t be replaced by algorithms and protocols. It’s much more social than that.
Still, the idea that blockchains can somehow eliminate the need for trust persists. Recently, I received an email from a company that implemented secure messaging using blockchain. It said, in part: “Using the blockchain, as we have done, has eliminated the need for Trust.” This sentiment suggests the writer misunderstands both what blockchain does and how trust works.
Do you need a public blockchain? The answer is almost certainly no. A blockchain probably doesn’t solve the security problems you think it solves. The security problems it solves are probably not the ones you have. (Manipulating audit data is probably not your major security risk.) A false trust in blockchain can itself be a security risk. The inefficiencies, especially in scaling, are probably not worth it. I have looked at many blockchain applications, and all of them could achieve the same security properties without using a blockchain — of course, then they wouldn’t have the cool name.
Honestly, cryptocurrencies are useless. They’re only used by speculators looking for quick riches, people who don’t like government-backed currencies, and criminals who want a black-market way to exchange money.
To answer the question of whether the blockchain is needed, ask yourself: Does the blockchain change the system of trust in any meaningful way, or just shift it around? Does it just try to replace trust with verification? Does it strengthen existing trust relationships, or try to go against them? How can trust be abused in the new system, and is this better or worse than the potential abuses in the old system? And lastly: What would your system look like if you didn’t use blockchain at all?
If you ask yourself those questions, it’s likely you’ll choose solutions that don’t use public blockchain. And that’ll be a good thing — especially when the hype dissipates.
I have wanted to write this essay for over a year. The impetus to finally do it came from an invite to speak at the Hyperledger Global Forum in December. This essay is a version of the talk I wrote for that event, made more accessible to a general audience.
It seems to be the season for blockchain takedowns. James Waldo has an excellent essay in Queue. And Nicholas Weaver gave atalk at the Enigma Conference, summarized here. It’s a shortened version of this talk.
On November 4, 2016, the hacker “Guccifer 2.0,: a front for Russia’s military intelligence service, claimed in a blogpost that the Democrats were likely to use vulnerabilities to hack the presidential elections. On November 9, 2018, President Donald Trump started tweeting about the senatorial elections in Florida and Arizona. Without any evidence whatsoever, he said that Democrats were trying to steal the election through “FRAUD.”
Cybersecurity expertswould say that posts like Guccifer 2.0’s are intended to undermine public confidence in voting: a cyber-attack against the US democratic system. Yet Donald Trump’s actions are doing far more damage to democracy. So far, his tweets on the topic have been retweeted over 270,000 times, eroding confidence far more effectively than any foreign influence campaign.
We need new ideas to explain how public statements on the Internet can weaken American democracy. Cybersecurity today is not only about computer systems. It’s also about the ways attackers can use computer systems to manipulate and undermine public expectations about democracy. Not only do we need to rethink attacks against democracy; we also need to rethink the attackers as well.
This is one key reason why we wrote a new research paper which uses ideas from computer security to understand the relationship between democracy and information. These ideas help us understand attacks which destabilize confidence in democratic institutions or debate.
Our research implies that insider attacks from within American politics can be more pernicious than attacks from other countries. They are more sophisticated, employ tools that are harder to defend against, and lead to harsh political tradeoffs. The US can threaten charges or impose sanctions when Russian trolling agencies attack its democratic system. But what punishments can it use when the attacker is the US president?
People who think about cybersecurity build on ideas about confrontations between states during the Cold War. Intellectuals such as Thomas Schelling developed deterrence theory, which explained how the US and USSR could maneuver to limit each other’s options without ever actually going to war. Deterrence theory, and related concepts about the relative ease of attack and defense, seemed to explain the tradeoffs that the US and rival states faced, as they started to use cyber techniques to probe and compromise each others’ information networks.
However, these ideas fail to acknowledge one key differences between the Cold War and today. Nearly all states — whether democratic or authoritarian — are entangled on the Internet. This creates both new tensions and new opportunities. The US assumed that the internet would help spread American liberal values, and that this was a good and uncontroversial thing. Illiberal states like Russia and China feared that Internet freedom was a direct threat to their own systems of rule. Opponents of the regime might use social media and online communication to coordinate among themselves, and appeal to the broader public, perhaps toppling their governments, as happened in Tunisia during the Arab Spring.
This led illiberal states to develop new domestic defenses against open information flows. As scholars like Molly Roberts have shown, states like China and Russia discovered how they could "flood" internet discussion with online nonsense and distraction, making it impossible for their opponents to talk to each other, or even to distinguish between truth and falsehood. These flooding techniques stabilized authoritarian regimes, because they demoralized and confused the regime’s opponents. Libertarians often argue that the best antidote to bad speech is more speech. What Vladimir Putin discovered was that the best antidote to more speech was bad speech.
Russia saw the Arab Spring and efforts to encourage democracy in its neighborhood as direct threats, and began experimenting with counter-offensive techniques. When a Russia-friendly government in Ukraine collapsed due to popular protests, Russia tried to destabilize new, democratic elections by hacking the system through which the election results would be announced. The clear intention was to discredit the election results by announcing fake voting numbers that would throw public discussion into disarray.
This attack on public confidence in election results was thwarted at the last moment. Even so, it provided the model for a new kind of attack. Hackers don’t have to secretly alter people’s votes to affect elections. All they need to do is to damage public confidence that the votes were counted fairly. As researchers have argued, “simply put, the attacker might not care who wins; the losing side believing that the election was stolen from them may be equally, if not more, valuable.”
These two kinds of attacks — “flooding” attacks aimed at destabilizing public discourse, and “confidence” attacks aimed at undermining public belief in elections — were weaponized against the US in 2016. Russian social media trolls, hired by the “Internet Research Agency,” flooded online political discussions with rumors and counter-rumors in order to create confusion and political division. Peter Pomerantsev describes how in Russia, “one moment [Putin’s media wizard] Surkov would fund civic forums and human rights NGOs, the next he would quietly support nationalist movements that accuse the NGOs of being tools of the West.” Similarly, Russian trolls tried to get Black Lives Matter protesters and anti-Black Lives Matter protesters to march at the same time and place, to create conflict and the appearance of chaos. Guccifer 2.0’s blog post was surely intended to undermine confidence in the vote, preparing the ground for a wider destabilization campaign after Hillary Clinton won the election. Neither Putin nor anyone else anticipated that Trump would win, ushering in chaos on a vastly greater scale.
We do not know how successful these attacks were. A new book by John Sides, Michael Tesler and Lynn Vavreck suggests that Russian efforts had no measurable long-term consequences. Detailed research on the flow of news articles through social media by Yochai Benker, Robert Farris, and Hal Roberts agrees, showing that Fox News was far more influential in the spread of false news stories than any Russian effort.
However, global adversaries like the Russians aren’t the only actors who can use flooding and confidence attacks. US actors can use just the same techniques. Indeed, they can arguably use them better, since they have a better understanding of US politics, more resources, and are far more difficult for the government to counter without raising First Amendment issues.
For example, when the Federal Communication Commission asked for comments on its proposal to get rid of “net neutrality,” it was flooded by fake comments supporting the proposal. Nearly every real person who commented was in favor of net neutrality, but their arguments were drowned out by a flood of spurious comments purportedly made by identities stolen from porn sites, by people whose names and email addresses had been harvested without their permission, and, in some cases, from dead people. This was done not just to generate fake support for the FCC’s controversial proposal. It was to devalue public comments in general, making the general public’s support for net neutrality politically irrelevant. FCC decision making on issues like net neutrality used to be dominated by industry insiders, and many would like to go back to the old regime.
Trump’s efforts to undermine confidence in the Florida and Arizona votes work on a much larger scale. There are clear short-term benefits to asserting fraud where no fraud exists. This may sway judges or other public officials to make concessions to the Republicans to preserve their legitimacy. Yet they also destabilize American democracy in the long term. If Republicans are convinced that Democrats win by cheating, they will feel that their own manipulation of the system (by purging voter rolls, making voting more difficult and so on) are legitimate, and very probably cheat even more flagrantly in the future. This will trash collective institutions and leave everyone worse off.
It is notable that some Arizonan Republicans — including Martha McSally — have so far stayed firm against pressure from the White House and the Republican National Committee to claim that cheating is happening. They presumably see more long term value from preserving existing institutions than undermining them. Very plausibly, Donald Trump has exactly the opposite incentives. By weakening public confidence in the vote today, he makes it easier to claim fraud and perhaps plunge American politics into chaos if he is defeated in 2020.
If experts who see Russian flooding and confidence measures as cyberattacks on US democracy are right, then these attacks are just as dangerous — and perhaps more dangerous — when they are used by domestic actors. The risk is that over time they will destabilize American democracy so that it comes closer to Russia’s managed democracy — where nothing is real any more, and ordinary people feel a mixture of paranoia, helplessness and disgust when they think about politics. Paradoxically, Russian interference is far too ineffectual to get us there — but domestically mounted attacks by all-American political actors might.
To protect against that possibility, we need to start thinking more systematically about the relationship between democracy and information. Our paper provides one way to do this, highlighting the vulnerabilities of democracy against certain kinds of information attack. More generally, we need to build levees against flooding while shoring up public confidence in voting and other public information systems that are necessary to democracy.
The first may require radical changes in how we regulate social media companies. Modernization of government commenting platforms to make them robust against flooding is only a very minimal first step. Up until very recently, companies like Twitter have won market advantage from bot infestations — even when it couldn’t make a profit, it seemed that user numbers were growing. CEOs like Mark Zuckerberg have begun to worry about democracy, but their worries will likely only go so far. It is difficult to get a man to understand something when his business model depends on not understanding it. Sharp — and legally enforceable — limits on automated accounts are a first step. Radical redesign of networks and of trending indicators so that flooding attacks are less effective may be a second.
The second requires general standards for voting at the federal level, and a constitutional guarantee of the right to vote. Technical experts nearly universally favor robust voting systems that would combine paper records with random post-election auditing, to prevent fraud and secure public confidence in voting. Other steps to ensure proper ballot design, and standardize vote counting and reporting will take more time and discussion — yet the record of other countries show that they are not impossible.
The US is nearly unique among major democracies in the persistent flaws of its election machinery. Yet voting is not the only important form of democratic information. Apparent efforts to deliberately skew the US census against counting undocumented immigrants show the need for a more general audit of the political information systems that we need if democracy is to function properly.
It’s easier to respond to Russian hackers through sanctions, counter-attacks and the like than to domestic political attacks that undermine US democracy. To preserve the basic political freedoms of democracy requires recognizing that these freedoms are sometimes going to be abused by politicians such as Donald Trump. The best that we can do is to minimize the possibilities of abuse up to the point where they encroach on basic freedoms and harden the general institutions that secure democratic information against attacks intended to undermine them.
This essay was co-authored with Henry Farrell, and previously appeared on Motherboard, with a terrible headline that I was unable to get changed.
Suppose you have a program running on your system that you don’t quite trust. Maybe it’s a program submitted by a student to an automated grading system. Or maybe it’s a QEMU device model running in a Xen control domain ("domain 0" or “dom0”), and you want to make sure that even if an attacker from a rogue virtual machine manages to take over the QEMU process, they can’t do any further harm. There are many things you want to do as far as restricting its ability to do mischief. But one thing in particular you probably want to do is to be able to reliably kill the process once you think it should be done. This turns out to be quite a bit more tricky than you’d think.
The adoption of Apache Spark has increased significantly over the past few years, and running Spark-based application pipelines is the new normal. Spark jobs that are in an ETL (extract, transform, and load) pipeline have different requirements—you must handle dependencies in the jobs, maintain order during executions, and run multiple jobs in parallel. In most of these cases, you can use workflow scheduler tools like Apache Oozie, Apache Airflow, and even Cron to fulfill these requirements.
Apache Oozie is a widely used workflow scheduler system for Hadoop-based jobs. However, its limited UI capabilities, lack of integration with other services, and heavy XML dependency might not be suitable for some users. On the other hand, Apache Airflow comes with a lot of neat features, along with powerful UI and monitoring capabilities and integration with several AWS and third-party services. However, with Airflow, you do need to provision and manage the Airflow server. The Cron utility is a powerful job scheduler. But it doesn’t give you much visibility into the job details, and creating a workflow using Cron jobs can be challenging.
What if you have a simple use case, in which you want to run a few Spark jobs in a specific order, but you don’t want to spend time orchestrating those jobs or maintaining a separate application? You can do that today in a serverless fashion using AWS Step Functions. You can create the entire workflow in AWS Step Functions and interact with Spark on Amazon EMR through Apache Livy.
In this post, I walk you through a list of steps to orchestrate a serverless Spark-based ETL pipeline using AWS Step Functions and Apache Livy.
Input data
For the source data for this post, I use the New York City Taxi and Limousine Commission (TLC) trip record data. For a description of the data, see this detailed dictionary of the taxi data. In this example, we’ll work mainly with the following three columns for the Spark jobs.
Column name
Column description
RateCodeID
Represents the rate code in effect at the end of the trip (for example, 1 for standard rate, 2 for JFK airport, 3 for Newark airport, and so on).
FareAmount
Represents the time-and-distance fare calculated by the meter.
TripDistance
Represents the elapsed trip distance in miles reported by the taxi meter.
The trip data is in comma-separated values (CSV) format with the first row as a header. To shorten the Spark execution time, I trimmed the large input data to only 20,000 rows. During the deployment phase, the input file tripdata.csv is stored in Amazon S3 in the <<your-bucket>>/emr-step-functions/input/ folder.
The following image shows a sample of the trip data:
Solution overview
The next few sections describe how Spark jobs are created for this solution, how you can interact with Spark using Apache Livy, and how you can use AWS Step Functions to create orchestrations for these Spark applications.
At a high level, the solution includes the following steps:
Trigger the AWS Step Function state machine by passing the input file path.
The first stage in the state machine triggers an AWS Lambda
The Lambda function interacts with Apache Spark running on Amazon EMR using Apache Livy, and submits a Spark job.
The state machine waits a few seconds before checking the Spark job status.
Based on the job status, the state machine moves to the success or failure state.
Subsequent Spark jobs are submitted using the same approach.
The state machine waits a few seconds for the job to finish.
The job finishes, and the state machine updates with its final status.
Let’s take a look at the Spark application that is used for this solution.
Spark jobs
For this example, I built a Spark jar named spark-taxi.jar. It has two different Spark applications:
MilesPerRateCode – The first job that runs on the Amazon EMR cluster. This job reads the trip data from an input source and computes the total trip distance for each rate code. The output of this job consists of two columns and is stored in Apache Parquet format in the output path.
The following are the expected output columns:
rate_code – Represents the rate code for the trip.
total_distance – Represents the total trip distance for that rate code (for example, sum(trip_distance)).
RateCodeStatus – The second job that runs on the EMR cluster, but only if the first job finishes successfully. This job depends on two different input sets:
csv – The same trip data that is used for the first Spark job.
miles-per-rate – The output of the first job.
This job first reads the tripdata.csv file and aggregates the fare_amount by the rate_code. After this point, you have two different datasets, both aggregated by rate_code. Finally, the job uses the rate_code field to join two datasets and output the entire rate code status in a single CSV file.
The output columns are as follows:
rate_code_id – Represents the rate code type.
total_distance – Derived from first Spark job and represents the total trip distance.
total_fare_amount – A new field that is generated during the second Spark application, representing the total fare amount by the rate code type.
Note that in this case, you don’t need to run two different Spark jobs to generate that output. The goal of setting up the jobs in this way is just to create a dependency between the two jobs and use them within AWS Step Functions.
Both Spark applications take one input argument called rootPath. It’s the S3 location where the Spark job is stored along with input and output data. Here is a sample of the final output:
The next section discusses how you can use Apache Livy to interact with Spark applications that are running on Amazon EMR.
Using Apache Livy to interact with Apache Spark
Apache Livy provides a REST interface to interact with Spark running on an EMR cluster. Livy is included in Amazon EMR release version 5.9.0 and later. In this post, I use Livy to submit Spark jobs and retrieve job status. When Amazon EMR is launched with Livy installed, the EMR master node becomes the endpoint for Livy, and it starts listening on port 8998 by default. Livy provides APIs to interact with Spark.
Let’s look at a couple of examples how you can interact with Spark running on Amazon EMR using Livy.
To list active running jobs, you can execute the following from the EMR master node:
curl localhost:8998/sessions
If you want to do the same from a remote instance, just change localhost to the EMR hostname, as in the following (port 8998 must be open to that remote instance through the security group):
Through Spark submit, you can pass multiple arguments for the Spark job and Spark configuration settings. You can also do that using Livy, by passing the S3 path through the args parameter, as shown following:
For a detailed list of Livy APIs, see the Apache Livy REST API page. This post uses GET /batches and POST /batches.
In the next section, you create a state machine and orchestrate Spark applications using AWS Step Functions.
Using AWS Step Functions to create a Spark job workflow
AWS Step Functions automatically triggers and tracks each step and retries when it encounters errors. So your application executes in order and as expected every time. To create a Spark job workflow using AWS Step Functions, you first create a Lambda state machine using different types of states to create the entire workflow.
First, you use the Task state—a simple state in AWS Step Functions that performs a single unit of work. You also use the Wait state to delay the state machine from continuing for a specified time. Later, you use the Choice state to add branching logic to a state machine.
The following is a quick summary of how to use different states in the state machine to create the Spark ETL pipeline:
Task state – Invokes a Lambda function. The first Task state submits the Spark job on Amazon EMR, and the next Task state is used to retrieve the previous Spark job status.
Wait state – Pauses the state machine until a job completes execution.
Choice state – Each Spark job execution can return a failure, an error, or a success state So, in the state machine, you use the Choice state to create a rule that specifies the next action or step based on the success or failure of the previous step.
Here is one of my Task states, MilesPerRateCode, which simply submits a Spark job:
"MilesPerRate Job": {
"Type": "Task",
"Resource":"arn:aws:lambda:us-east-1:xxxxxx:function:blog-miles-per-rate-job-submit-function",
"ResultPath": "$.jobId",
"Next": "Wait for MilesPerRate job to complete"
}
This Task state configuration specifies the Lambda function to execute. Inside the Lambda function, it submits a Spark job through Livy using Livy’s POST API. Using ResultPath, it tells the state machine where to place the result of the executing task. As discussed in the previous section, Spark submit returns the session ID, which is captured with $.jobId and used in a later state.
The following code section shows the Lambda function, which is used to submit the MilesPerRateCode job. It uses the Python request library to submit a POST against the Livy endpoint hosted on Amazon EMR and passes the required parameters in JSON format through payload. It then parses the response, grabs id from the response, and returns it. The Next field tells the state machine which state to go to next.
Just like in the MilesPerRate job, another state submits the RateCodeStatus job, but it executes only when all previous jobs have completed successfully.
Here is the Task state in the state machine that checks the Spark job status:
Just like other states, the preceding Task executes a Lambda function, captures the result (represented by jobStatus), and passes it to the next state. The following is the Lambda function that checks the Spark job status based on a given session ID:
In the Choice state, it checks the Spark job status value, compares it with a predefined state status, and transitions the state based on the result. For example, if the status is success, move to the next state (RateCodeJobStatus job), and if it is dead, move to the MilesPerRate job failed state.
To set up this entire solution, you need to create a few AWS resources. To make it easier, I have created an AWS CloudFormation template. This template creates all the required AWS resources and configures all the resources that are needed to create a Spark-based ETL pipeline on AWS Step Functions.
This CloudFormation template requires you to pass the following four parameters during initiation.
Parameter
Description
ClusterSubnetID
The subnet where the Amazon EMR cluster is deployed and Lambda is configured to talk to this subnet.
KeyName
The name of the existing EC2 key pair to access the Amazon EMR cluster.
VPCID
The ID of the virtual private cloud (VPC) where the EMR cluster is deployed and Lambda is configured to talk to this VPC.
S3RootPath
The Amazon S3 path where all required files (input file, Spark job, and so on) are stored and the resulting data is written.
IMPORTANT: These templates are designed only to show how you can create a Spark-based ETL pipeline on AWS Step Functions using Apache Livy. They are not intended for production use without modification. And if you try this solution outside of the us-east-1 Region, download the necessary files from s3://aws-data-analytics-blog/emr-step-functions, upload the files to the buckets in your Region, edit the script as appropriate, and then run it.
To launch the CloudFormation stack, choose Launch Stack:
Launching this stack creates the following list of AWS resources.
Logical ID
Resource Type
Description
StepFunctionsStateExecutionRole
IAM role
IAM role to execute the state machine and have a trust relationship with the states service.
SparkETLStateMachine
AWS Step Functions state machine
State machine in AWS Step Functions for the Spark ETL workflow.
LambdaSecurityGroup
Amazon EC2 security group
Security group that is used for the Lambda function to call the Livy API.
RateCodeStatusJobSubmitFunction
AWS Lambda function
Lambda function to submit the RateCodeStatus job.
MilesPerRateJobSubmitFunction
AWS Lambda function
Lambda function to submit the MilesPerRate job.
SparkJobStatusFunction
AWS Lambda function
Lambda function to check the Spark job status.
LambdaStateMachineRole
IAM role
IAM role for all Lambda functions to use the lambda trust relationship.
EMRCluster
Amazon EMR cluster
EMR cluster where Livy is running and where the job is placed.
During the AWS CloudFormation deployment phase, it sets up S3 paths for input and output. Input files are stored in the <<s3-root-path>>/emr-step-functions/input/ path, whereas spark-taxi.jar is copied under <<s3-root-path>>/emr-step-functions/.
The following screenshot shows how the S3 paths are configured after deployment. In this example, I passed a bucket that I created in the AWS account s3://tm-app-demos for the S3 root path.
If the CloudFormation template completed successfully, you will see Spark-ETL-State-Machine in the AWS Step Functions dashboard, as follows:
Choose the Spark-ETL-State-Machine state machine to take a look at this implementation. The AWS CloudFormation template built the entire state machine along with its dependent Lambda functions, which are now ready to be executed.
On the dashboard, choose the newly created state machine, and then choose New execution to initiate the state machine. It asks you to pass input in JSON format. This input goes to the first state MilesPerRate Job, which eventually executes the Lambda function blog-miles-per-rate-job-submit-function.
Pass the S3 root path as input:
{
“rootPath”: “s3://tm-app-demos”
}
Then choose Start Execution:
The rootPath value is the same value that was passed when creating the CloudFormation stack. It can be an S3 bucket location or a bucket with prefixes, but it should be the same value that is used for AWS CloudFormation. This value tells the state machine where it can find the Spark jar and input file, and where it will write output files. After the state machine starts, each state/task is executed based on its definition in the state machine.
At a high level, the following represents the flow of events:
Execute the first Spark job, MilesPerRate.
The Spark job reads the input file from the location <<rootPath>>/emr-step-functions/input/tripdata.csv. If the job finishes successfully, it writes the output data to <<rootPath>>/emr-step-functions/miles-per-rate.
If the Spark job fails, it transitions to the error state MilesPerRate job failed, and the state machine stops. If the Spark job finishes successfully, it transitions to the RateCodeStatus Job state, and the second Spark job is executed.
If the second Spark job fails, it transitions to the error state RateCodeStatus job failed, and the state machine stops with the Failed status.
If this Spark job completes successfully, it writes the final output data to the <<rootPath>>/emr-step-functions/rate-code-status/ It also transitions the RateCodeStatus job finished state, and the state machine ends its execution with the Success status.
This following screenshot shows a successfully completed Spark ETL state machine:
The right side of the state machine diagram shows the details of individual states with their input and output.
When you execute the state machine for the second time, it fails because the S3 path already exists. The state machine turns red and stops at MilePerRate job failed. The following image represents that failed execution of the state machine:
You can also check your Spark application status and logs by going to the Amazon EMR console and viewing the Application history tab:
I hope this walkthrough paints a picture of how you can create a serverless solution for orchestrating Spark jobs on Amazon EMR using AWS Step Functions and Apache Livy. In the next section, I share some ideas for making this solution even more elegant.
Next steps
The goal of this post is to show a simple example that uses AWS Step Functions to create an orchestration for Spark-based jobs in a serverless fashion. To make this solution robust and production ready, you can explore the following options:
In this example, I manually initiated the state machine by passing the rootPath as input. You can instead trigger the state machine automatically. To run the ETL pipeline as soon as the files arrive in your S3 bucket, you can pass the new file path to the state machine. Because CloudWatch Events supports AWS Step Functions as a target, you can create a CloudWatch rule for an S3 event. You can then set AWS Step Functions as a target and pass the new file path to your state machine. You’re all set!
You can also improve this solution by adding an alerting mechanism in case of failures. To do this, create a Lambda function that sends an alert email and assigns that Lambda function to a Fail That way, when any part of your state fails, it triggers an email and notifies the user.
If you want to submit multiple Spark jobs in parallel, you can use the Parallel state type in AWS Step Functions. The Parallel state is used to create parallel branches of execution in your state machine.
With Lambda and AWS Step Functions, you can create a very robust serverless orchestration for your big data workload.
Cleaning up
When you’ve finished testing this solution, remember to clean up all those AWS resources that you created using AWS CloudFormation. Use the AWS CloudFormation console or AWS CLI to delete the stack named Blog-Spark-ETL-Step-Functions.
Summary
In this post, I showed you how to use AWS Step Functions to orchestrate your Spark jobs that are running on Amazon EMR. You used Apache Livy to submit jobs to Spark from a Lambda function and created a workflow for your Spark jobs, maintaining a specific order for job execution and triggering different AWS events based on your job’s outcome. Go ahead—give this solution a try, and share your experience with us!
Tanzir Musabbir is an EMR Specialist Solutions Architect with AWS. He is an early adopter of open source Big Data technologies. At AWS, he works with our customers to provide them architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena & AWS Glue. Tanzir is a big Real Madrid fan and he loves to travel in his free time.
Security updates have been issued by Debian (gitlab and packagekit), Fedora (glibc, postgresql, and webkitgtk4), Oracle (java-1.7.0-openjdk, java-1.8.0-openjdk, kernel, libvirt, and qemu-kvm), Red Hat (java-1.7.0-openjdk, kernel-rt, qemu-kvm, and qemu-kvm-rhev), SUSE (openjpeg2, qemu, and squid3), and Ubuntu (kernel, linux, linux-aws, linux-azure, linux-gcp, linux-kvm, linux-oem, linux, linux-aws, linux-kvm,, linux-hwe, linux-azure, linux-gcp, linux-oem, linux-lts-trusty, linux-lts-xenial, linux-aws, qemu, and xdg-utils).
“Security is hard” is a tautology, especially in the fast-moving world of container orchestration. We have previously covered various aspects of Linux container security through, for example, the Clear Containers implementation or the broader question of Kubernetes and security, but those are mostly concerned with container isolation; they do not address the question of trusting a container’s contents. What is a container running? Who built it and when? Even assuming we have good programmers and solid isolation layers, propagating that good code around a Kubernetes cluster and making strong assertions on the integrity of that supply chain is far from trivial. The 2018 KubeCon + CloudNativeCon Europe event featured some projects that could eventually solve that problem.
Here’s a posting from Canonical concerning the cryptocurrency-mining app that was discovered in its Snap Store. “Several years ago when we started the work on snap packages, we understood that we could not instantly implement an alternative that was completely safe from all perspectives. In addition to being safe, it had to be useful. So the challenge we gave ourselves was to significantly improve the situation immediately, and then pave the road for incremental improvements that could be rolled out gradually.”
In our blog post on Tuesday, Cryptocurrency Security Challenges, we wrote about the two primary challenges faced by anyone interested in safely and profitably participating in the cryptocurrency economy: 1) make sure you’re dealing with reputable and ethical companies and services, and, 2) keep your cryptocurrency holdings safe and secure.
In this post, we’re going to focus on how to make sure you don’t lose any of your cryptocurrency holdings through accident, theft, or carelessness. You do that by backing up the keys needed to sell or trade your currencies.
$34 Billion in Lost Value
Of the 16.4 million bitcoins said to be in circulation in the middle of 2017, close to 3.8 million may have been lost because their owners no longer are able to claim their holdings. Based on today’s valuation, that could total as much as $34 billion dollars in lost value. And that’s just bitcoins. There are now over 1,500 different cryptocurrencies, and we don’t know how many of those have been misplaced or lost.
Now that some cryptocurrencies have reached (at least for now) staggering heights in value, it’s likely that owners will be more careful in keeping track of the keys needed to use their cryptocurrencies. For the ones already lost, however, the owners have been separated from their currencies just as surely as if they had thrown Benjamin Franklins and Grover Clevelands over the railing of a ship.
The Basics of Securing Your Cryptocurrencies
In our previous post, we reviewed how cryptocurrency keys work, and the common ways owners can keep track of them. A cryptocurrency owner needs two keys to use their currencies: a public key that can be shared with others is used to receive currency, and a private key that must be kept secure is used to spend or trade currency.
Many wallets and applications allow the user to require extra security to access them, such as a password, or iris, face, or thumb print scan. If one of these options is available in your wallets, take advantage of it. Beyond that, it’s essential to back up your wallet, either using the backup feature built into some applications and wallets, or manually backing up the data used by the wallet. When backing up, it’s a good idea to back up the entire wallet, as some wallets require additional private data to operate that might not be apparent.
No matter which backup method you use, it is important to back up often and have multiple backups, preferable in different locations. As with any valuable data, a 3-2-1 backup strategy is good to follow, which ensures that you’ll have a good backup copy if anything goes wrong with one or more copies of your data.
One more caveat, don’t reuse passwords. This applies to all of your accounts, but is especially important for something as critical as your finances. Don’t ever use the same password for more than one account. If security is breached on one of your accounts, someone could connect your name or ID with other accounts, and will attempt to use the password there, as well. Consider using a password manager such as LastPass or 1Password, which make creating and using complex and unique passwords easy no matter where you’re trying to sign in.
Approaches to Backing Up Your Cryptocurrency Keys
There are numerous ways to be sure your keys are backed up. Let’s take them one by one.
1. Automatic backups using a backup program
If you’re using a wallet program on your computer, for example, Bitcoin Core, it will store your keys, along with other information, in a file. For Bitcoin Core, that file is wallet.dat. Other currencies will use the same or a different file name and some give you the option to select a name for the wallet file.
To back up the wallet.dat or other wallet file, you might need to tell your backup program to explicitly back up that file. Users of Backblaze Backup don’t have to worry about configuring this, since by default, Backblaze Backup will back up all data files. You should determine where your particular cryptocurrency, wallet, or application stores your keys, and make sure the necessary file(s) are backed up if your backup program requires you to select which files are included in the backup.
Backblaze B2 is an option for those interested in low-cost and high security cloud storage of their cryptocurrency keys. Backblaze B2 supports 2-factor verification for account access, works with a number of apps that support automatic backups with encryption, error-recovery, and versioning, and offers an API and command-line interface (CLI), as well. The first 10GB of storage is free, which could be all one needs to store encrypted cryptocurrency keys.
2. Backing up by exporting keys to a file
Apps and wallets will let you export your keys from your app or wallet to a file. Once exported, your keys can be stored on a local drive, USB thumb drive, DAS, NAS, or in the cloud with any cloud storage or sync service you wish. Encrypting the file is strongly encouraged — more on that later. If you use 1Password or LastPass, or other secure notes program, you also could store your keys there.
3. Backing up by saving a mnemonic recovery seed
A mnemonic phrase, mnemonic recovery phrase, or mnemonic seed is a list of words that stores all the information needed to recover a cryptocurrency wallet. Many wallets will have the option to generate a mnemonic backup phrase, which can be written down on paper. If the user’s computer no longer works or their hard drive becomes corrupted, they can download the same wallet software again and use the mnemonic recovery phrase to restore their keys.
The phrase can be used by anyone to recover the keys, so it must be kept safe. Mnemonic phrases are an excellent way of backing up and storing cryptocurrency and so they are used by almost all wallets.
A mnemonic recovery seed is represented by a group of easy to remember words. For example:
The first four letters are enough to unambiguously identify the word.
Similar words are avoided (such as: build and built).
Bitcoin and most other cryptocurrencies such as Litecoin, Ethereum, and others use mnemonic seeds that are 12 to 24 words long. Other currencies might use different length seeds.
4. Physical backups — Paper, Metal
Some cryptocurrency holders believe that their backup, or even all their cryptocurrency account information, should be stored entirely separately from the internet to avoid any risk of their information being compromised through hacks, exploits, or leaks. This type of storage is called “cold storage.” One method of cold storage involves printing out the keys to a piece of paper and then erasing any record of the keys from all computer systems. The keys can be entered into a program from the paper when needed, or scanned from a QR code printed on the paper.
Printed public and private keys
Some who go to extremes suggest separating the mnemonic needed to access an account into individual pieces of paper and storing those pieces in different locations in the home or office, or even different geographical locations. Some say this is a bad idea since it could be possible to reconstruct the mnemonic from one or more pieces. How diligent you wish to be in protecting these codes is up to you.
Mnemonic recovery phrase booklet
There’s another option that could make you the envy of your friends. That’s the CryptoSteel wallet, which is a stainless steel metal case that comes with more than 250 stainless steel letter tiles engraved on each side. Codes and passwords are assembled manually from the supplied part-randomized set of tiles. Users are able to store up to 96 characters worth of confidential information. Cryptosteel claims to be fireproof, waterproof, and shock-proof.
Cryptosteel cold wallet
Of course, if you leave your Cryptosteel wallet in the pocket of a pair of ripped jeans that gets thrown out by the housekeeper, as happened to the character Russ Hanneman on the TV show Silicon Valley in last Sunday’s episode, then you’re out of luck. That fictional billionaire investor lost a USB drive with $300 million in cryptocoins. Let’s hope that doesn’t happen to you.
Encryption & Security
Whether you store your keys on your computer, an external disk, a USB drive, DAS, NAS, or in the cloud, you want to make sure that no one else can use those keys. The best way to handle that is to encrypt the backup.
With Backblaze Backup for Windows and Macintosh, your backups are encrypted in transmission to the cloud and on the backup server. Users have the option to add an additional level of security by adding a Personal Encryption Key (PEK), which secures their private key. Your cryptocurrency backup files are secure in the cloud. Using our web or mobile interface, previous versions of files can be accessed, as well.
Our object storage cloud offering, Backblaze B2, can be used with a variety of applications for Windows, Macintosh, and Linux. With B2, cryptocurrency users can choose whichever method of encryption they wish to use on their local computers and then upload their encrypted currency keys to the cloud. Depending on the client used, versioning and life-cycle rules can be applied to the stored files.
Other backup programs and systems provide some or all of these capabilities, as well. If you are backing up to a local drive, it is a good idea to encrypt the local backup, which is an option in some backup programs.
Address Security
Some experts recommend using a different address for each cryptocurrency transaction. Since the address is not the same as your wallet, this means that you are not creating a new wallet, but simply using a new identifier for people sending you cryptocurrency. Creating a new address is usually as easy as clicking a button in the wallet.
One of the chief advantages of using a different address for each transaction is anonymity. Each time you use an address, you put more information into the public ledger (blockchain) about where the currency came from or where it went. That means that over time, using the same address repeatedly could mean that someone could map your relationships, transactions, and incoming funds. The more you use that address, the more information someone can learn about you. For more on this topic, refer to Address reuse.
Note that a downside of using a paper wallet with a single key pair (type-0 non-deterministic wallet) is that it has the vulnerabilities listed above. Each transaction using that paper wallet will add to the public record of transactions associated with that address. Newer wallets, i.e. “deterministic” or those using mnemonic code words support multiple addresses and are now recommended.
There are other approaches to keeping your cryptocurrency transaction secure. Here are a couple of them.
Multi-signature
Multi-signature refers to requiring more than one key to authorize a transaction, much like requiring more than one key to open a safe. It is generally used to divide up responsibility for possession of cryptocurrency. Standard transactions could be called “single-signature transactions” because transfers require only one signature — from the owner of the private key associated with the currency address (public key). Some wallets and apps can be configured to require more than one signature, which means that a group of people, businesses, or other entities all must agree to trade in the cryptocurrencies.
Deep Cold Storage
Deep cold storage ensures the entire transaction process happens in an offline environment. There are typically three elements to deep cold storage.
First, the wallet and private key are generated offline, and the signing of transactions happens on a system not connected to the internet in any manner. This ensures it’s never exposed to a potentially compromised system or connection.
Second, details are secured with encryption to ensure that even if the wallet file ends up in the wrong hands, the information is protected.
Third, storage of the encrypted wallet file or paper wallet is generally at a location or facility that has restricted access, such as a safety deposit box at a bank.
Deep cold storage is used to safeguard a large individual cryptocurrency portfolio held for the long term, or for trustees holding cryptocurrency on behalf of others, and is possibly the safest method to ensure a crypto investment remains secure.
Keep Your Software Up to Date
You should always make sure that you are using the latest version of your app or wallet software, which includes important stability and security fixes. Installing updates for all other software on your computer or mobile device is also important to keep your wallet environment safer.
One Last Thing: Think About Your Testament
Your cryptocurrency funds can be lost forever if you don’t have a backup plan for your peers and family. If the location of your wallets or your passwords is not known by anyone when you are gone, there is no hope that your funds will ever be recovered. Taking a bit of time on these matters can make a huge difference.
To the Moon*
Are you comfortable with how you’re managing and backing up your cryptocurrency wallets and keys? Do you have a suggestion for keeping your cryptocurrencies safe that we missed above? Please let us know in the comments.
*To the Moon — Crypto slang for a currency that reaches an optimistic price projection.
Earlier this month, the Pentagon stopped selling phones made by the Chinese companies ZTE and Huawei on military bases because they might be used to spy on their users.
It’s a legitimate fear, and perhaps a prudent action. But it’s just one instance of the much larger issue of securing our supply chains.
All of our computerized systems are deeply international, and we have no choice but to trust the companies and governments that touch those systems. And while we can ban a few specific products, services or companies, no country can isolate itself from potential foreign interference.
In this specific case, the Pentagon is concerned that the Chinese government demanded that ZTE and Huawei add “backdoors” to their phones that could be surreptitiously turned on by government spies or cause them to fail during some future political conflict. This tampering is possible because the software in these phones is incredibly complex. It’s relatively easy for programmers to hide these capabilities, and correspondingly difficult to detect them.
This isn’t the first time the United States has taken action against foreign software suspected to contain hidden features that can be used against us. Last December, President Trump signed into law a bill banning software from the Russian company Kaspersky from being used within the US government. In 2012, the focus was on Chinese-made Internet routers. Then, the House Intelligence Committee concluded: “Based on available classified and unclassified information, Huawei and ZTE cannot be trusted to be free of foreign state influence and thus pose a security threat to the United States and to our systems.”
Nor is the United States the only country worried about these threats. In 2014, China reportedly banned antivirus products from both Kaspersky and the US company Symantec, based on similar fears. In 2017, the Indian government identified 42 smartphone apps that China subverted. Back in 1997, the Israeli company Check Point was dogged by rumors that its government added backdoors into its products; other of that country’s tech companies have been suspected of the same thing. Even al-Qaeda was concerned; ten years ago, a sympathizer released the encryption software Mujahedeen Secrets, claimed to be free of Western influence and backdoors. If a country doesn’t trust another country, then it can’t trust that country’s computer products.
But this trust isn’t limited to the country where the company is based. We have to trust the country where the software is written — and the countries where all the components are manufactured. In 2016, researchers discovered that many different models of cheap Android phones were sending information back to China. The phones might be American-made, but the software was from China. In 2016, researchers demonstrated an even more devious technique, where a backdoor could be added at the computer chip level in the factory that made the chips without the knowledge of, and undetectable by, the engineers who designed the chips in the first place. Pretty much every US technology company manufactures its hardware in countries such as Malaysia, Indonesia, China and Taiwan.
We also have to trust the programmers. Today’s large software programs are written by teams of hundreds of programmers scattered around the globe. Backdoors, put there by we-have-no-idea-who, have been discovered in Juniper firewalls and D-Link routers, both of which are US companies. In 2003, someone almost slipped a very clever backdoor into Linux. Think of how many countries’ citizens are writing software for Apple or Microsoft or Google.
We can go even farther down the rabbit hole. We have to trust the distribution systems for our hardware and software. Documents disclosed by Edward Snowden showed the National Security Agency installing backdoors into Cisco routers being shipped to the Syrian telephone company. There are fake apps in the Google Play store that eavesdrop on you. Russian hackers subverted the update mechanism of a popular brand of Ukrainian accounting software to spread the NotPetya malware.
In 2017, researchers demonstrated that a smartphone can be subverted by installing a malicious replacement screen.
I could go on. Supply-chain security is an incredibly complex problem. US-only design and manufacturing isn’t an option; the tech world is far too internationally interdependent for that. We can’t trust anyone, yet we have no choice but to trust everyone. Our phones, computers, software and cloud systems are touched by citizens of dozens of different countries, any one of whom could subvert them at the demand of their government. And just as Russia is penetrating the US power grid so they have that capability in the event of hostilities, many countries are almost certainly doing the same thing at the consumer level.
We don’t know whether the risk of Huawei and ZTE equipment is great enough to warrant the ban. We don’t know what classified intelligence the United States has, and what it implies. But we do know that this is just a minor fix for a much larger problem. It’s doubtful that this ban will have any real effect. Members of the military, and everyone else, can still buy the phones. They just can’t buy them on US military bases. And while the US might block the occasional merger or acquisition, or ban the occasional hardware or software product, we’re largely ignoring that larger issue. Solving it borders on somewhere between incredibly expensive and realistically impossible.
Perhaps someday, global norms and international treaties will render this sort of device-level tampering off-limits. But until then, all we can do is hope that this particular arms race doesn’t get too far out of control.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easier to prepare and load your data for analytics. You can create and run an ETL job with a few clicks on the AWS Management Console. Just point AWS Glue to your data store. AWS Glue discovers your data and stores the associated metadata (for example, a table definition and schema) in the AWS Glue Data Catalog.
AWS Glue has native connectors to data sources using JDBC drivers, either on AWS or elsewhere, as long as there is IP connectivity. In this post, we demonstrate how to connect to data sources that are not natively supported in AWS Glue today. We walk through connecting to and running ETL jobs against two such data sources, IBM DB2 and SAP Sybase. However, you can use the same process with any other JDBC-accessible database.
AWS Glue data sources
AWS Glue natively supports the following data stores by using the JDBC protocol:
One of the fastest growing architectures deployed on AWS is the data lake. The ETL processes that are used to ingest, clean, transform, and structure data are critically important for this architecture. Having the flexibility to interoperate with a broader range of database engines allows for a quicker adoption of the data lake architecture.
For data sources that AWS Glue doesn’t natively support, such as IBM DB2, Pivotal Greenplum, SAP Sybase, or any other relational database management system (RDBMS), you can import custom database connectors from Amazon S3 into AWS Glue jobs. In this case, the connection to the data source must be made from the AWS Glue script to extract the data, rather than using AWS Glue connections. To learn more, see Providing Your Own Custom Scripts in the AWS Glue Developer Guide.
Setting up an ETL job for an IBM DB2 data source
The first example demonstrates how to connect the AWS Glue ETL job to an IBM DB2 instance, transform the data from the source, and store it in Apache Parquet format in Amazon S3. To successfully create the ETL job using an external JDBC driver, you must define the following:
The S3 location of the job script
The S3 location of the temporary directory
The S3 location of the JDBC driver
The S3 location of the Parquet data (output)
The IAM role for the job
By default, AWS Glue suggests bucket names for the scripts and the temporary directory using the following format:
Keep in mind that having the AWS Glue job and S3 buckets in the same AWS Region helps save on cross-Region data transfer fees. For this post, we will work in the US East (Ohio) Region (us-east-2).
Creating the IAM role
The next step is to set up the IAM role that the ETL job will use:
Sign in to the AWS Management Console, and search for IAM:
On the IAM console, choose Roles in the left navigation pane.
Choose Create role. The role type of trusted entity must be an AWS service, specifically AWS Glue.
Choose Next: Permissions.
Search for the AWSGlueServiceRole policy, and select it.
Search again, now for the SecretsManagerReadWrite This policy allows the AWS Glue job to access database credentials that are stored in AWS Secrets Manager.
CAUTION: This policy is open and is being used for testing purposes only. You should create a custom policy to narrow the access just to the secrets that you want to use in the ETL job.
Select this policy, and choose Next: Review.
Give your role a name, for example, GluePermissions, and confirm that both policies were selected.
Choose Create role.
Now that you have created the IAM role, it’s time to upload the JDBC driver to the defined location in Amazon S3. For this example, we will use the DB2 driver, which is available on the IBM Support site.
Storing database credentials
It is a best practice to store database credentials in a safe store. In this case, we use AWS Secrets Manager to securely store credentials. Follow these steps to create those credentials:
Open the console, and search for Secrets Manager.
In the AWS Secrets Manager console, choose Store a new secret.
Under Select a secret type, choose Other type of secrets.
In the Secret key/value, set one row for each of the following parameters:
db_username
db_password
db_url (for example, jdbc:db2://10.10.12.12:50000/SAMPLE)
db_table
driver_name (ibm.db2.jcc.DB2Driver)
output_bucket: (for example, aws-glue-data-output-1234567890-us-east-2/User)
Choose Next.
For Secret name, use DB2_Database_Connection_Info.
Choose Next.
Keep the Disable automatic rotation check box selected.
Choose Next.
Choose Store.
Adding a job in AWS Glue
The next step is to author the AWS Glue job, following these steps:
In the AWS Management Console, search for AWS Glue.
In the navigation pane on the left, choose Jobs under the ETL
Choose Add job.
Fill in the basic Job properties:
Give the job a name (for example, db2-job).
Choose the IAM role that you created previously (GluePermissions).
For This job runs, choose A new script to be authored by you.
For ETL language, choose Python.
In the Script libraries and job parameters section, choose the location of your JDBC driver for Dependent jars path.
Choose Next.
On the Connections page, choose Next
On the summary page, choose Save job and edit script. This creates the job and opens the script editor.
In the editor, replace the existing code with the following script. Important: Line 47 of the script corresponds to the mapping of the fields in the source table to the destination, dropping of the null fields to save space in the Parquet destination, and finally writing to Amazon S3 in Parquet format.
Choose the black X on the right side of the screen to close the editor.
Running the ETL job
Now that you have created the job, the next step is to execute it as follows:
On the Jobs page, select your new job. On the Action menu, choose Run job, and confirm that you want to run the job. Wait a few moments as it finishes the execution.
After the job shows as Succeeded, choose Logs to read the output of the job.
In the output of the job, you will find the result of executing the df.printSchema() and the message with the df.count().
Also, if you go to your output bucket in S3, you will find the Parquet result of the ETL job.
Using AWS Glue, you have created an ETL job that connects to an existing database using an external JDBC driver. It enables you to execute any transformation that you need.
Setting up an ETL job for an SAP Sybase data source
In this section, we describe how to create an AWS Glue ETL job against an SAP Sybase data source. The process mentioned in the previous section works for a Sybase data source with a few changes required in the job:
While creating the job, choose the correct jar for the JDBC dependency.
In the script, change the reference to the secret to be used from AWS Secrets Manager:
After you successfully execute the new ETL job, the output contains the same type of information that was generated with the DB2 data source.
Note that each of these JDBC drivers has its own nuances and different licensing terms that you should be aware of before using them.
Maximizing JDBC read parallelism
Something to keep in mind while working with big data sources is the memory consumption. In some cases, “Out of Memory” errors are generated when all the data is read into a single executor. One approach to optimize this is to rely on the parallelism on read that you can implement with Apache Spark and AWS Glue. To learn more, see the Apache Spark SQL module.
You can use the following options:
partitionColumn: The name of an integer column that is used for partitioning.
lowerBound: The minimum value of partitionColumn that is used to decide partition stride.
upperBound: The maximum value of partitionColumn that is used to decide partition stride.
numPartitions: The number of partitions. This, along with lowerBound (inclusive) and upperBound (exclusive), form partition strides for generated WHERE clause expressions used to split the partitionColumn When unset, this defaults to SparkContext.defaultParallelism.
Those options specify the parallelism of the table read. lowerBound and upperBound decide the partition stride, but they don’t filter the rows in the table. Therefore, Spark partitions and returns all rows in the table. For example:
It’s important to be careful with the number of partitions because too many partitions could also result in Spark crashing your external database systems.
Conclusion
Using the process described in this post, you can connect to and run AWS Glue ETL jobs against any data source that can be reached using a JDBC driver. This includes new generations of common analytical databases like Greenplum and others.
You can improve the query efficiency of these datasets by using partitioning and pushdown predicates. For more information, see Managing Partitions for ETL Output in AWS Glue. This technique opens the door to moving data and feeding data lakes in hybrid environments.
Kapil Shardha is a Technical Account Manager and supports enterprise customers with their AWS adoption. He has background in infrastructure automation and DevOps.
William Torrealba is an AWS Solutions Architect supporting customers with their AWS adoption. He has background in Application Development, High Available Distributed Systems, Automation, and DevOps.
Security updates have been issued by Debian (kernel), Gentoo (rsync), openSUSE (Chromium), Oracle (kernel), Red Hat (kernel and kernel-rt), Scientific Linux (kernel), SUSE (kernel and php7), and Ubuntu (dpdk, libraw, linux, linux-lts-trusty, linux-snapdragon, and webkit2gtk).
Journalism Trust Initiative (JTI) е инициатива за саморегулиране на медиите, предназначена да насърчава качествената журналистика в новата информационна екосистема. Това е идея на Репортери без граници съвместно с партньори като Агенция Франс Прес (АФП) и Европейския съюз за радио и телевизия (EBU).
В рамките на инициативата ще бъдат създадени система от стандарти, след което ще може да се провежда сертифициране.
Очакваното значение на стандартите – според първоначалните текстове, свързани с инициативата:
ново средство за борба с дезинформацията и защита на надеждната и качествена информация;
ползи за доставчици на съдържание, които се присъединят към инициативата и прилагат стандартите;
повече прозрачност по отношение на доставчиците на съдържание;
по-добра видимост онлайн за качественото съдържание;
повече рекламни приходи, тъй като рекламодателите ще могат да разпознават качествени медии;
обществена подкрепа за качествените медии;
основа за знак за качество и доверие.
Стандартите ще бъдат разработени за период 12-18 месеца със сътрудничество на френския орган по стандартизация AFNOR и германския орган за стандартизация Deutsches Institut für Normung (DIN).
Компанията Google е информирала Репортери без граници, че е взела решение да участва в инициативата.
Most likely you’ve read the tantalizing stories of big gains from investing in cryptocurrencies. Someone who invested $1,000 into bitcoins five years ago would have over $85,000 in value now. Alternatively, someone who invested in bitcoins three months ago would have seen their investment lose 20% in value. Beyond the big price fluctuations, currency holders are possibly exposed to fraud, bad business practices, and even risk losing their holdings altogether if they are careless in keeping track of the all-important currency keys.
It’s certain that beyond the rewards and risks, cryptocurrencies are here to stay. We can’t ignore how they are changing the game for how money is handled between people and businesses.
Some Advantages of Cryptocurrency
Cryptocurrency is accessible to anyone.
Decentralization means the network operates on a user-to-user (or peer-to-peer) basis.
Transactions can completed for a fraction of the expense and time required to complete traditional asset transfers.
Transactions are digital and cannot be counterfeited or reversed arbitrarily by the sender, as with credit card charge-backs.
There aren’t usually transaction fees for cryptocurrency exchanges.
Cryptocurrency allows the cryptocurrency holder to send exactly what information is needed and no more to the merchant or recipient, even permitting anonymous transactions (for good or bad).
Cryptocurrency operates at the universal level and hence makes transactions easier internationally.
There is no other electronic cash system in which your account isn’t owned by someone else.
On top of all that, blockchain, the underlying technology behind cryptocurrencies, is already being applied to a variety of business needs and itself becoming a hot sector of the tech economy. Blockchain is bringing traceability and cost-effectiveness to supply-chain management — which also improves quality assurance in areas such as food, reducing errors and improving accounting accuracy, smart contracts that can be automatically validated, signed and enforced through a blockchain construct, the possibility of secure, online voting, and many others.
Like any new, booming marketing there are risks involved in these new currencies. Anyone venturing into this domain needs to have their eyes wide open. While the opportunities for making money are real, there are even more ways to lose money.
We’re going to cover two primary approaches to staying safe and avoiding fraud and loss when dealing with cryptocurrencies. The first is to thoroughly vet any person or company you’re dealing with to judge whether they are ethical and likely to succeed in their business segment. The second is keeping your critical cryptocurrency keys safe, which we’ll deal with in this and a subsequent post.
Caveat Emptor — Buyer Beware
The short history of cryptocurrency has already seen the demise of a number of companies that claimed to manage, mine, trade, or otherwise help their customers profit from cryptocurrency. Mt. Gox, GAW Miners, and OneCoin are just three of the many companies that disappeared with their users’ money. This is the traditional equivalent of your bank going out of business and zeroing out your checking account in the process.
That doesn’t happen with banks because of regulatory oversight. But with cryptocurrency, you need to take the time to investigate any company you use to manage or trade your currencies. How long have they been around? Who are their investors? Are they affiliated with any reputable financial institutions? What is the record of their founders and executive management? These are all important questions to consider when evaluating a company in this new space.
Would you give the keys to your house to a service or person you didn’t thoroughly know and trust? Some companies that enable you to buy and sell currencies online will routinely hold your currency keys, which gives them the ability to do anything they want with your holdings, including selling them and pocketing the proceeds if they wish.
That doesn’t mean you shouldn’t ever allow a company to keep your currency keys in escrow. It simply means that you better know with whom you’re doing business and if they’re trustworthy enough to be given that responsibility.
Keys To the Cryptocurrency Kingdom — Public and Private
If you’re an owner of cryptocurrency, you know how this all works. If you’re not, bear with me for a minute while I bring everyone up to speed.
Cryptocurrency has no physical manifestation, such as bills or coins. It exists purely as a computer record. And unlike currencies maintained by governments, such as the U.S. dollar, there is no central authority regulating its distribution and value. Cryptocurrencies use a technology called blockchain, which is a decentralized way of keeping track of transactions. There are many copies of a given blockchain, so no single central authority is needed to validate its authenticity or accuracy.
The validity of each cryptocurrency is determined by a blockchain. A blockchain is a continuously growing list of records, called “blocks”, which are linked and secured using cryptography. Blockchains by design are inherently resistant to modification of the data. They perform as an open, distributed ledger that can record transactions between two parties efficiently and in a verifiable, permanent way. A blockchain is typically managed by a peer-to-peer network collectively adhering to a protocol for validating new blocks. Once recorded, the data in any given block cannot be altered retroactively without the alteration of all subsequent blocks, which requires collusion of the network majority. On a scaled network, this level of collusion is impossible — making blockchain networks effectively immutable and trustworthy.
The other element common to all cryptocurrencies is their use of public and private keys, which are stored in the currency’s wallet. A cryptocurrency wallet stores the public and private “keys” or “addresses” that can be used to receive or spend the cryptocurrency. With the private key, it is possible to write in the public ledger (blockchain), effectively spending the associated cryptocurrency. With the public key, it is possible for others to send currency to the wallet.
Cryptocurrency “coins” can be lost if the owner loses the private keys needed to spend the currency they own. It’s as if the owner had lost a bank account number and had no way to verify their identity to the bank, or if they lost the U.S. dollars they had in their wallet. The assets are gone and unusable.
The Cryptocurrency Wallet
Given the importance of these keys, and lack of recourse if they are lost, it’s obviously very important to keep track of your keys.
If you’re being careful in choosing reputable exchanges, app developers, and other services with whom to trust your cryptocurrency, you’ve made a good start in keeping your investment secure. But if you’re careless in managing the keys to your bitcoins, ether, Litecoin, or other cryptocurrency, you might as well leave your money on a cafe tabletop and walk away.
What Are the Differences Between Hot and Cold Wallets?
Just like other numbers you might wish to keep track of — credit cards, account numbers, phone numbers, passphrases — cryptocurrency keys can be stored in a variety of ways. Those who use their currencies for day-to-day purchases most likely will want them handy in a smartphone app, hardware key, or debit card that can be used for purchases. These are called “hot” wallets. Some experts advise keeping the balances in these devices and apps to a minimal amount to avoid hacking or data loss. We typically don’t walk around with thousands of dollars in U.S. currency in our old-style wallets, so this is really a continuation of the same approach to managing spending money.
A “hot” wallet, the Bread mobile app
Some investors with large balances keep their keys in “cold” wallets, or “cold storage,” i.e. a device or location that is not connected online. If funds are needed for purchases, they can be transferred to a more easily used payment medium. Cold wallets can be hardware devices, USB drives, or even paper copies of your keys.
A “cold” wallet, the Trezor hardware wallet
A “cold” wallet, the Ledger Nano S
A “cold” Bitcoin paper wallet
Wallets are suited to holding one or more specific cryptocurrencies, and some people have multiple wallets for different currencies and different purposes.
A paper wallet is nothing other than a printed record of your public and private keys. Some prefer their records to be completely disconnected from the internet, and a piece of paper serves that need. Just like writing down an account password on paper, however, it’s essential to keep the paper secure to avoid giving someone the ability to freely access your funds.
How to Keep your Keys, and Cryptocurrency Secure
In a post this coming Thursday, Securing Your Cryptocurrency, we’ll discuss the best strategies for backing up your cryptocurrency so that your currencies don’t become part of the millions that have been lost. We’ll cover the common (and uncommon) approaches to backing up hot wallets, cold wallets, and using paper and metal solutions to keeping your keys safe.
In the meantime, please tell us of your experiences with cryptocurrencies — good and bad — and how you’ve dealt with the issue of cryptocurrency security.
Today, we’re excited to announce local build support in AWS CodeBuild.
AWS CodeBuild is a fully managed build service. There are no servers to provision and scale, or software to install, configure, and operate. You just specify the location of your source code, choose your build settings, and CodeBuild runs build scripts for compiling, testing, and packaging your code.
In this blog post, I’ll show you how to set up CodeBuild locally to build and test a sample Java application.
By building an application on a local machine you can:
Test the integrity and contents of a buildspec file locally.
Test and build an application locally before committing.
Identify and fix errors quickly from your local development environment.
Prerequisites
In this post, I am using AWS Cloud9 IDE as my development environment.
If you would like to use AWS Cloud9 as your IDE, follow the express setup steps in the AWS Cloud9 User Guide.
The AWS Cloud9 IDE comes with Docker and Git already installed. If you are going to use your laptop or desktop machine as your development environment, install Docker and Git before you start.
Note: We need to provide three environment variables namely IMAGE_NAME, SOURCE and ARTIFACTS.
IMAGE_NAME: The name of your build environment image.
SOURCE: The absolute path to your source code directory.
ARTIFACTS: The absolute path to your artifact output folder.
When you run the sample project, you get a runtime error that says the YAML file does not exist. This is because a buildspec.yml file is not included in the sample web project. AWS CodeBuild requires a buildspec.yml to run a build. For more information about buildspec.yml, see Build Spec Example in the AWS CodeBuild User Guide.
Let’s add a buildspec.yml file with the following content to the sample-web-app folder and then rebuild the project.
version: 0.2
phases:
build:
commands:
- echo Build started on `date`
- mvn install
artifacts:
files:
- target/javawebdemo.war
This time your build should be successful. Upon successful execution, look in the /artifacts folder for the final built artifacts.zip file to validate.
Conclusion:
In this blog post, I showed you how to quickly set up the CodeBuild local agent to build projects right from your local desktop machine or laptop. As you see, local builds can improve developer productivity by helping you identify and fix errors quickly.
I hope you found this post useful. Feel free to leave your feedback or suggestions in the comments.
The Internet of Things (IoT) has precipitated to an influx of connected devices and data that can be mined to gain useful business insights. If you own an IoT device, you might want the data to be uploaded seamlessly from your connected devices to the cloud so that you can make use of cloud storage and the processing power to perform sophisticated analysis of data. To upload the data to the AWS Cloud, devices must pass authentication and authorization checks performed by the respective AWS services. The standard way of authenticating AWS requests is the Signature Version 4 algorithm that requires the caller to have an access key ID and secret access key. Consequently, you need to hardcode the access key ID and the secret access key on your devices. Alternatively, you can use the built-in X.509 certificate as the unique device identity to authenticate AWS requests.
AWS IoT has introduced the credentials provider feature that allows a caller to authenticate AWS requests by having an X.509 certificate. The credentials provider authenticates a caller using an X.509 certificate, and vends a temporary, limited-privilege security token. The token can be used to sign and authenticate any AWS request. Thus, the credentials provider relieves you from having to manage and periodically refresh the access key ID and secret access key remotely on your devices.
In the process of retrieving a security token, you use AWS IoT to create a thing (a representation of a specific device or logical entity), register a certificate, and create AWS IoT policies. You also configure an AWS Identity and Access Management (IAM) role and attach appropriate IAM policies to the role so that the credentials provider can assume the role on your behalf. You also make an HTTP-over-Transport Layer Security (TLS) mutual authentication request to the credentials provider that uses your preconfigured thing, certificate, policies, and IAM role to authenticate and authorize the request, and obtain a security token on your behalf. You can then use the token to sign any AWS request using Signature Version 4.
In this blog post, I explain the AWS IoT credentials provider design and then demonstrate the end-to-end process of retrieving a security token from AWS IoT and using the token to write a temperature and humidity record to a specific Amazon DynamoDB table.
Note: This post assumes you are familiar with AWS IoT and IAM to perform steps using the AWS CLI and OpenSSL. Make sure you are running the latest version of the AWS CLI.
Overview of the credentials provider workflow
The following numbered diagram illustrates the credentials provider workflow. The diagram is followed by explanations of the steps.
To explain the steps of the workflow as illustrated in the preceding diagram:
The AWS IoT device uses the AWS SDK or custom client to make an HTTPS request to the credentials provider for a security token. The request includes the device X.509 certificate for authentication.
The credentials provider forwards the request to the AWS IoT authentication and authorization module to verify the certificate and the permission to request the security token.
If the certificate is valid and has permission to request a security token, the AWS IoT authentication and authorization module returns success. Otherwise, it returns failure, which goes back to the device with the appropriate exception.
If assuming the role succeeds, AWS STS returns a temporary, limited-privilege security token to the credentials provider.
The credentials provider returns the security token to the device.
The AWS SDK on the device uses the security token to sign an AWS request with AWS Signature Version 4.
The requested service invokes IAM to validate the signature and authorize the request against access policies attached to the preconfigured IAM role.
If IAM validates the signature successfully and authorizes the request, the request goes through.
In another solution, you could configure an AWS Lambda rule that ingests your device data and sends it to another AWS service. However, in applications that require the uploading of large files such as videos or aggregated telemetry to the AWS Cloud, you may want your devices to be able to authenticate and send data directly to the AWS service of your choice. The credentials provider enables you to do that.
Outline of the steps to retrieve and use security token
Perform the following steps as part of this solution:
Create an AWS IoT thing: Start by creating a thing that corresponds to your home thermostat in the AWS IoT thing registry database. This allows you to authenticate the request as a thing and use thing attributes as policy variables in AWS IoT and IAM policies.
Register a certificate: Create and register a certificate with AWS IoT, and attach it to the thing for successful device authentication.
Create and configure an IAM role: Create an IAM role to be assumed by the service on behalf of your device. I illustrate how to configure a trust policy and an access policy so that AWS IoT has permission to assume the role, and the token has necessary permission to make requests to DynamoDB.
Create a role alias: Create a role alias in AWS IoT. A role alias is an alternate data model pointing to an IAM role. The credentials provider request must include a role alias name to indicate which IAM role to assume for obtaining a security token from AWS STS. You may update the role alias on the server to point to a different IAM role and thus make your device obtain a security token with different permissions.
Attach a policy: Create an authorization policy with AWS IoT and attach it to the certificate to control which device can assume which role aliases.
Request a security token: Make an HTTPS request to the credentials provider and retrieve a security token and use it to sign a DynamoDB request with Signature Version 4.
Use the security token to sign a request: Use the retrieved token to sign a request to DynamoDB and successfully write a temperature and humidity record from your home thermostat in a specific table. Thus, starting with an X.509 certificate on your home thermostat, you can successfully upload your thermostat record to DynamoDB and use it for further analysis. Before the availability of the credentials provider, you could not do this.
Deploy the solution
1. Create an AWS IoT thing
Register your home thermostat in the AWS IoT thing registry database by creating a thing type and a thing. You can use the AWS CLI with the following command to create a thing type. The thing type allows you to store description and configuration information that is common to a set of things.
Now, you need to have a Certificate Authority (CA) certificate, sign a device certificate using the CA certificate, and register both certificates with AWS IoT before your device can authenticate to AWS IoT. If you do not already have a CA certificate, you can use OpenSSL to create a CA certificate, as described in Use Your Own Certificate. To register your CA certificate with AWS IoT, follow the steps on Registering Your CA Certificate.
You then have to create a device certificate signed by the CA certificate and register it with AWS IoT, which you can do by following the steps on Creating a Device Certificate Using Your CA Certificate. Save the certificate and the corresponding key pair; you will use them when you request a security token later. Also, remember the password you provide when you create the certificate.
Run the following command in the AWS CLI to attach the device certificate to your thing so that you can use thing attributes in policy variables.
If the attach-thing-principal command succeeds, the output is empty.
3. Configure an IAM role
Next, configure an IAM role in your AWS account that will be assumed by the credentials provider on behalf of your device. You are required to associate two policies with the role: a trust policy that controls who can assume the role, and an access policy that controls which actions can be performed on which resources by assuming the role.
The following trust policy grants the credentials provider permission to assume the role. Put it in a text document and save the document with the name, trustpolicyforiot.json.
The following access policy allows DynamoDB operations on the table that has the same name as the thing name that you created in Step 1, MyHomeThermostat, by using credentials-iot:ThingName as a policy variable. I explain after Step 5 about using thing attributes as policy variables. Put the following policy in a text document and save the document with the name, accesspolicyfordynamodb.json.
Finally, run the following command in the AWS CLI to attach the access policy to your role.
aws iam attach-role-policy --role-name dynamodb-access-role --policy-arn arn:aws:iam::<your_aws_account_id>:policy/accesspolicyfordynamodb
If the attach-role-policy command succeeds, the output is empty.
Configure the PassRole permissions
The IAM role that you have created must be passed to AWS IoT to create a role alias, as described in Step 4. The user who performs the operation requires iam:PassRole permission to authorize this action. You also should add permission for the iam:GetRole action to allow the user to retrieve information about the specified role. Create the following policy to grant iam:PassRole and iam:GetRole permissions. Name this policy, passrolepermission.json.
Now, run the following command to attach the policy to the user.
aws iam attach-user-policy --policy-arn arn:aws:iam::<your_aws_account_id>:policy/passrolepermission --user-name <user_name>
If the attach-user-policy command succeeds, the output is empty.
4. Create a role alias
Now that you have configured the IAM role, you will create a role alias with AWS IoT. You must provide the following pieces of information when creating a role alias:
RoleAlias: This is the primary key of the role alias data model and hence a mandatory attribute. It is a string; the minimum length is 1 character, and the maximum length is 128 characters.
RoleArn: This is the Amazon Resource Name (ARN) of the IAM role you have created. This is also a mandatory attribute.
CredentialDurationSeconds: This is an optional attribute specifying the validity (in seconds) of the security token. The minimum value is 900 seconds (15 minutes), and the maximum value is 3,600 seconds (60 minutes); the default value is 3,600 seconds, if not specified.
Run the following command in the AWS CLI to create a role alias. Use the credentials of the user to whom you have given the iam:PassRole permission.
You created and registered a certificate with AWS IoT earlier for successful authentication of your device. Now, you need to create and attach a policy to the certificate to authorize the request for the security token.
Let’s say you want to allow a thing to get credentials for the role alias, Thermostat-dynamodb-access-role-alias, with thing owner Alice, thing type thermostat, and the thing attached to a principal. The following policy, with thing attributes as policy variables, achieves these requirements. After this step, I explain more about using thing attributes as policy variables. Put the policy in a text document, and save it with the name, alicethermostatpolicy.json.
If the attach-policy command succeeds, the output is empty.
You have completed all the necessary steps to request an AWS security token from the credentials provider!
Using thing attributes as policy variables
Before I show how to request a security token, I want to explain more about how to use thing attributes as policy variables and the advantage of using them. As a prerequisite, a device must provide a thing name in the credentials provider request.
Thing substitution variables in AWS IoT policies
AWS IoT Simplified Permission Management allows you to associate a connection with a specific thing, and allow the thing name, thing type, and other thing attributes to be available as substitution variables in AWS IoT policies. You can write a generic AWS IoT policy as in alicethermostatpolicy.json in Step 5, attach it to multiple certificates, and authorize the connection as a thing. For example, you could attach alicethermostatpolicy.json to certificates corresponding to each of the thermostats you have that you want to assume the role alias, Thermostat-dynamodb-access-role-alias, and allow operations only on the table with the name that matches the thing name. For more information, see the full list of thing policy variables.
Thing substitution variables in IAM policies
You also can use the following three substitution variables in the IAM role’s access policy (I used credentials-iot:ThingName in accesspolicyfordynamodb.json in Step 3):
credentials-iot:ThingName
credentials-iot:ThingTypeName
credentials-iot:AwsCertificateId
When the device provides the thing name in the request, the credentials provider fetches these three variables from the database and adds them as context variables to the security token. When the device uses the token to access DynamoDB, the variables in the role’s access policy are replaced with the corresponding values in the security token. Note that you also can use credentials-iot:AwsCertificateId as a policy variable; AWS IoT returns certificateId during registration.
6. Request a security token
Make an HTTPS request to the credentials provider to fetch a security token. You have to supply the following information:
Certificate and key pair: Because this is an HTTP request over TLS mutual authentication, you have to provide the certificate and the corresponding key pair to your client while making the request. Use the same certificate and key pair that you used during certificate registration with AWS IoT.
RoleAlias: Provide the role alias (in this example, Thermostat-dynamodb-access-role-alias) to be assumed in the request.
ThingName: Provide the thing name that you created earlier in the AWS IoT thing registry database. This is passed as a header with the name, x-amzn-iot-thingname. Note that the thing name is mandatory only if you have thing attributes as policy variables in AWS IoT or IAM policies.
Run the following command in the AWS CLI to obtain your AWS account-specific endpoint for the credentials provider. See the DescribeEndpoint API documentation for further details.
Note that if you are on Mac OS X, you need to export your certificate to a .pfx or .p12 file before you can pass it in the https request. Use OpenSSL with the following command to convert the device certificate from .pem to .pfx format. Remember the password because you will need it subsequently in a curl command.
Now, make an HTTPS request to the credentials provider to fetch a security token. You may use your preferred HTTP client for the request. I use curl in the following examples.
This command returns a security token object that has an accessKeyId, a secretAccessKey, a sessionToken, and an expiration. The following is sample output of the curl command.
Create a DynamoDB table called MyHomeThermostat in your AWS account. You will have to choose the hash (partition key) and the range (sort key) while creating the table to uniquely identify a record. Make the hash the serial_number of the thermostat and the range the timestamp of the record. Create a text file with the following JSON to put a temperature and humidity record in the table. Name the file, item.json.
You can use the accessKeyId, secretAccessKey, and sessionToken retrieved from the output of the curl command to sign a request that writes the temperature and humidity record to the DynamoDB table. Use the following commands to accomplish this.
In this blog post, I demonstrated how to retrieve a security token by using an X.509 certificate and then writing an item to a DynamoDB table by using the security token. Similarly, you could run applications on surveillance cameras or sensor devices that exchange the X.509 certificate for an AWS security token and use the token to upload video streams to Amazon Kinesis or telemetry data to Amazon CloudWatch.
If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about or issues implementing this solution, start a new thread on the AWS IoT forum.
– Ramkishore
The collective thoughts of the interwebz
By continuing to use the site, you agree to the use of cookies. more information
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.