Five researchers hacked Apple Computer’s networks — not their products — and found fifty-five vulnerabilities. So far, they have received $289K.
One of the worst of all the bugs they found would have allowed criminals to create a worm that would automatically steal all the photos, videos, and documents from someone’s iCloud account and then do the same to the victim’s contacts.
A university study confirmed the obvious: if you pay a random bunch of freelance programmers a small amount of money to write security software, they’re not going to do a very good job at it.
In an experiment that involved 43 programmers hired via the Freelancer.com platform, University of Bonn academics have discovered that developers tend to take the easy way out and write code that stores user passwords in an unsafe manner.
For their study, the German academics asked a group of 260 Java programmers to write a user registration system for a fake social network.
Of the 260 developers, only 43 took up the job, which involved using technologies such as Java, JSF, Hibernate, and PostgreSQL to create the user registration component.
Of the 43, academics paid half of the group with €100, and the other half with €200, to determine if higher pay made a difference in the implementation of password security features.
Further, they divided the developer group a second time, prompting half of the developers to store passwords in a secure manner, and leaving the other half to store passwords in their preferred method. This formed four groups: developers paid €100 and prompted to use a secure password storage method (P100), developers paid €200 and prompted to use a secure password storage method (P200), developers paid €100 but not prompted for password security (N100), and developers paid €200 but not prompted for password security (N200).
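For readers wondering what storing passwords "in a secure manner" usually means in practice, here is a minimal sketch in Python. It is not the code the study participants were asked to write (they worked in Java), and the function names and the iteration count are assumptions chosen for illustration. The idea is to store a random salt and a deliberately slow derived key instead of the password itself.

```python
import hashlib
import hmac
import os
from typing import Tuple

# Assumed parameters for illustration only; tune the iteration count to your hardware.
ITERATIONS = 200_000
SALT_BYTES = 16


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, derived_key) for storage; the password itself is never stored."""
    salt = os.urandom(SALT_BYTES)
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, key


def verify_password(password: str, salt: bytes, expected_key: bytes) -> bool:
    """Recompute the key with the stored salt and compare in constant time."""
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(key, expected_key)
```

The per-user salt means two users with the same password get different stored values, and the slow key derivation makes offline guessing expensive if the database leaks.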
I don’t know why anyone would expect this group of people to implement a good secure password system. Look at how they were hired. Look at the scope of the project. Look at what they were paid. I’m sure they grabbed the first thing they found on GitHub that did the job.
I’m not very impressed with the study or its conclusions.
Developers and maintainers of free-software projects are drawn from the same pool of people, and maintainers in one project are often developers in another, but there is still a certain amount of friction between the two groups. Maintainers depend on developers to contribute changes, but the two groups have a different set of incentives when it comes to reviewing and accepting those changes. Two talks at the 2018 Embedded Linux Conference shed some light on this relationship and how it can be made to work more smoothly.
My career is a different story. Over the past two decades and change, I went from writing CGI scripts and setting up WAN routers for a chain of shopping malls, to doing pentests for institutional customers, to designing a series of network monitoring platforms and handling incident response for a big telco, to building and running the product security org for one of the largest companies in the world. It’s been an interesting ride – and now that I’m on the hook for the well-being of about 100 folks across more than a dozen subteams around the world, I’ve been thinking a bit about the lessons learned along the way.
Of course, I’m a bit hesitant to write such a post: sometimes, your efforts pan out not because of your approach, but despite it – and it’s possible to draw precisely the wrong conclusions from such anecdotes. Still, I’m very proud of the culture we’ve created and the caliber of folks working on our team. It happened through the work of quite a few talented tech leads and managers even before my time, but it did not happen by accident – so I figured that my observations may be useful for some, as long as they are taken with a grain of salt.
But first, let me start on a somewhat somber note: what nobody tells you is that one’s level on the leadership ladder tends to be inversely correlated with several measures of happiness. The reason is fairly simple: as you get more senior, a growing number of people will come to you expecting you to solve increasingly fuzzy and challenging problems – and you will no longer be patted on the back for doing so. This should not scare you away from such opportunities, but it definitely calls for a particular mindset: your motivation must come from within. Look beyond the fight-of-the-day; find satisfaction in seeing how far your teams have come over the years.
With that out of the way, here’s a collection of notes, loosely organized into three major themes.
The curse of a techie leader
Perhaps the most interesting observation I have is that for a person coming from a technical background, building a healthy team is first and foremost about the subtle art of letting go.
There is a natural urge to stay involved in any project you’ve started or helped improve; after all, it’s your baby: you’re familiar with all the nuts and bolts, and nobody else can do this job as well as you. But as your sphere of influence grows, this becomes a choke point: there are only so many things you could be doing at once. Just as importantly, the project-hoarding behavior robs more junior folks of the ability to take on new responsibilities and bring their own ideas to life. In other words, when done properly, delegation is not just about freeing up your plate; it’s also about empowerment and about signalling trust.
Of course, when you hand your project over to somebody else, the new owner will initially be slower and more clumsy than you; but if you pick the new leads wisely, give them the right tools and the right incentives, and don’t make them deathly afraid of messing up, they will soon excel at their new jobs – and be grateful for the opportunity.
A related affliction of many accomplished techies is the conviction that they know the answers to every question even tangentially related to their domain of expertise; that belief is coupled with a burning desire to have the last word in every debate. When practiced in moderation, this behavior is fine among peers – but for a leader, one of the most important skills to learn is knowing when to keep your mouth shut: people learn a lot better by experimenting and making small mistakes than by being schooled by their boss, and they often try to read into your passing remarks. Don’t run an authoritarian camp focused on total risk aversion or perfectly efficient resource management; just set reasonable boundaries and exit conditions for experiments so that they don’t spiral out of control – and be amazed by the results every now and then.
Death by planning
When nothing is on fire, it’s easy to get preoccupied with maintaining the status quo. If your current headcount or budget request lists all the same projects as last year’s, or if you ever find yourself ending an argument by deferring to a policy or a process document, it’s probably a sign that you’re getting complacent. In security, complacency usually ends in tears – and when it doesn’t, it leads to burnout or boredom.
In my experience, your goal should be to develop a cadre of managers or tech leads capable of coming up with clever ideas, prioritizing them among themselves, and seeing them to completion without your day-to-day involvement. In your spare time, make it your mission to challenge them to stay ahead of the curve. Ask your vendor security lead how they’d streamline their work if they had a 40% jump in the number of vendors but no extra headcount; ask your product security folks what’s the second line of defense or containment should your primary defenses fail. Help them get good ideas off the ground; set some mental success and failure criteria to be able to cut your losses if something does not pan out.
Of course, malfunctions happen even in the best-run teams; to spot trouble early on, instead of overzealous project tracking, I found it useful to encourage folks to run a data-driven org. I’d usually ask them to imagine that a brand new VP shows up in our office and, as his first order of business, asks “why do you have so many people here and how do I know they are doing the right things?”. Not everything in security can be quantified, but hard data can validate many of your assumptions – and will alert you to unseen issues early on.
When focusing on data, it’s important not to treat pie charts and spreadsheets as an art unto itself; if you run a security review process for your company, your CSAT scores are going to reach 100% if you just rubberstamp every launch request within ten minutes of receiving it. Make sure you’re asking the right questions; instead of “how satisfied are you with our process”, try “is your product better as a consequence of talking to us?”
Whenever things are not progressing as expected, it is a natural instinct to fall back to micromanagement, but it seldom truly cures the ill. It’s probable that your team disagrees with your vision or its feasibility – and that you’re either not listening to their feedback, or they don’t think you’d care. It’s good to assume that most of your employees are as smart or smarter than you; barking your orders at them more loudly or more frequently does not lead anyplace good. It’s good to listen to them and either present new facts or work with them on a plan you can all get behind.
In some circumstances, all that’s needed is honesty about the business trade-offs, so that your team feels like your “partner in crime”, not a victim of circumstance. For example, we’d tell our folks that by not falling behind on basic, unglamorous work, we earn the trust of our VPs and SVPs – and that this translates into the independence and the resources we need to pursue more ambitious ideas without being told what to do; it’s how we game the system, so to speak. Oh: leading by example is a pretty powerful tool at your disposal, too.
The human factor
I’ve come to appreciate that hiring decent folks who can get along with others is far more important than trying to recruit conference-circuit superstars. In fact, hiring superstars is a decidedly hit-and-miss affair: while certainly not a rule, there is a proportion of folks who put the maintenance of their celebrity status ahead of job responsibilities or the well-being of their peers.
For teams, one of the most powerful demotivators is a sense of unfairness and disempowerment. This is where tech-originating leaders can shine, because their teams usually feel that their bosses understand and can evaluate the merits of the work. But it also means you need to be decisive and actually solve problems for them, rather than just letting them vent. You will need to make unpopular decisions every now and then; in such cases, I think it’s important to move quickly, rather than prolonging the uncertainty – but it’s also important to sincerely listen to concerns, explain your reasoning, and be frank about the risks and trade-offs.
Whenever you see a clash of personalities on your team, you probably need to respond swiftly and decisively; being right should not justify being a bully. If you don’t react to repeated scuffles, your best people will probably start looking for other opportunities: it’s draining to put up with constant pie fights, no matter if the pies are thrown straight at you or if you just need to duck one every now and then.
More broadly, personality differences seem to be a much better predictor of conflict than any technical aspects underpinning a debate. As a boss, you need to identify such differences early on and come up with creative solutions. Sometimes, all you need is taking some badly-delivered but valid feedback and having a conversation with the other person, asking some questions that can help them reach the same conclusions without feeling that their worldview is under attack. Other times, the only path forward is making sure that some folks simply don’t run into each other for a while.
Finally, dealing with low performers is a notoriously hard but important part of the game. Especially within large companies, there is always the temptation to just let it slide: sideline a struggling person and wait for them to either get over their issues or leave. But this sends an awful message to the rest of the team; for better or worse, fairness is important to most. Simply firing the low performers is seldom the best solution, though; successful recovery cases are what sets great managers apart from the average ones.
Oh, one more thought: people in leadership roles have their allegiance divided between the company and the people who depend on them. The obligation to the company is more formal, but the impact you have on your team is longer-lasting and more intimate. When the obligations to the employer and to your team collide in some way, make sure you can make the right call; it might be one of the most consequential decisions you’ll ever make.
The blockchain hype is huge, and the ICO craze (“Coindike”) is generating millions if not billions of “funding” for businesses that claim to revolutionize basically anything.
I’ve been following all of that for a while. I got my first (and only) Bitcoin several years ago, I know how the technology works, I’ve implemented the data structure part, I’ve tried (with varying success) to install an Ethereum wallet since almost as soon as Ethereum appeared, and I’ve read and subscribed to newsletters about dozens of projects and new cryptocurrencies, including storj.io, siacoin, namecoin, etc. I would say I’m at least above average in terms of knowledge of how cryptocurrencies, blockchain, smart contracts, the EVM, and proof-of-whatever operate. And I’ve voiced my concerns about the technology in general.
Now it’s rant time.
I’ve been reading whitepapers of various projects, I’ve been to various meetups and talks, I’ve been reading the professed future applications of the blockchain, and I have to admit – it’s all Greek to me. I have no clue what these people are talking about. And why would all of that make any sense. I still think I’m not clever enough to understand the upcoming revolution, but there’s also a cynical side of me that says “this is all a scam”.
Why does “X on the blockchain” somehow make it magical and superior to a good old centralized solution? No, spare me the cliches about “immutable ledger”, “lack of central authority” and the like. These are the phrases that a person learns after reading literally one article about blockchain. Have you actually written anything apart from a complex-sounding whitepaper or a hello-world smart contract? Do you really know how the overlay network works, how the economic incentives behind that network work, how all the cryptography works? Maybe there are many, many people that indeed know that and they know it better than me and are thus able to imagine the business case behind “X on the blockchain”.
I can’t. I can’t see why it would be useful to abandon a centralized database that you can query in dozens of ways, test easily and scale trivially in favour of a clunky write-only, low-throughput, hard-to-debug privacy nightmare that is any public blockchain. And how do you expect to gain a substantial userbase with an ecosystem where the Windows client for the 2nd most popular blockchain (Ethereum) has been so buggy that I (a software engineer) couldn’t get it to work and sync the whole chain? And why would building a website on top of that clunky, user-unfriendly database have any benefit over a centralized competitor?
Do we all believe that somehow the huge datacenters with guaranteed power backups, regular hardware and network checks, regular backups and overall – guaranteed redundancy – will somehow be beaten by a few thousand machines hosting software that has the sole purpose of guaranteeing integrity? Bitcoin has 10 thousand nodes. Ethereum has 22 thousand nodes. And while these nodes are probably very well GPU-equipped, they aren’t supercomputers. Amazon’s AWS has a million servers. How’s that for comparison? And why would anyone take 22 thousand non-servers seriously? Or even 220 thousand, if we believe in some inevitable growth.
Don’t get me wrong, the technology is really cool. The way tamper-evident data structures (hash chains) were combined with a consensus algorithm, an overlay network and a financial incentive is really awesome. When you add a distributed execution environment, it gets even cooler. But is it suitable for literally everything? I fail to see how.
I’m sure I’m missing something. The fact that many of those whitepapers sound increasingly like Greek to me might hint that I’m just a dumb developer and those enlightened people are really onto something huge. I guess time will tell.
But I happen to be living in a country that saw a transition to capitalism in the years of my childhood. And there were a lot of scams and ponzi schemes that people believed in. Because they didn’t know how capitalism works, how the market works. I’m seeing some similarities – we have no idea how the digital realm really works, and so a lot of scams are bound to appear, until we as a society learn the basics.
Until then – enjoy your ICO, enjoy your tokens, enjoy your big-player competitor with practically the same business model, only on a worse database.
And I hope that after the smoke of hype and fraud clears, we’ll be able to enjoy the true benefits of the blockchain innovation.
Back in 2013, Der Spiegel reported that the NSA intercepts and collects Windows bug reports:
One example of the sheer creativity with which the TAO spies approach their work can be seen in a hacking method they use that exploits the error-proneness of Microsoft’s Windows. Every user of the operating system is familiar with the annoying window that occasionally pops up on screen when an internal problem is detected, an automatic message that prompts the user to report the bug to the manufacturer and to restart the program. These crash reports offer TAO specialists a welcome opportunity to spy on computers.
When TAO selects a computer somewhere in the world as a target and enters its unique identifiers (an IP address, for example) into the corresponding database, intelligence agents are then automatically notified any time the operating system of that computer crashes and its user receives the prompt to report the problem to Microsoft. An internal presentation suggests it is NSA’s powerful XKeyscore spying tool that is used to fish these crash reports out of the massive sea of Internet traffic.
The automated crash reports are a “neat way” to gain “passive access” to a machine, the presentation continues. Passive access means that, initially, only data the computer sends out into the Internet is captured and saved, but the computer itself is not yet manipulated. Still, even this passive access to error messages provides valuable insights into problems with a targeted person’s computer and, thus, information on security holes that might be exploitable for planting malware or spyware on the unwitting victim’s computer.
Although the method appears to have little importance in practical terms, the NSA’s agents still seem to enjoy it because it allows them to have a bit of a laugh at the expense of the Seattle-based software giant. In one internal graphic, they replaced the text of Microsoft’s original error message with one of their own reading, “This information may be intercepted by a foreign sigint system to gather detailed information and better exploit your machine.” (“Sigint” stands for “signals intelligence.”)
The article talks about the (limited) value of this information with regard to specific target computers, but I have another question: how valuable would this database be for finding new zero-day Windows vulnerabilities to exploit? Microsoft won’t have the incentive to examine and fix problems until they happen broadly among its user base. The NSA has a completely different incentive structure.
I don’t remember this being discussed back in 2013.
The so-called (and marketing-branded) “blockchain technology” is promised to revolutionize every industry. Anything, they say, will become decentralized, free from middle men or government control. Services will thrive on various installments of the blockchain, and smart contracts will automatically enforce any logic that is related to the particular domain.
I don’t mind having another technological leap (after the internet), and given that I’m technically familiar with the blockchain, I may even be part of it. But I’m not convinced it will happen, and I’m not convinced it’s going to be the next internet.
If we strip the hype, the technology behind Bitcoin is indeed a technical masterpiece. It combines existing techniques (like hash chains and Merkle trees) with a very good proof-of-work based consensus algorithm. And it creates a digital currency, which on top of being worth billions now, is simply cool.
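To make the proof-of-work part concrete, here is a toy sketch in Python. It is not Bitcoin's actual algorithm (which hashes a specific block header format with double SHA-256 and encodes the target differently); the function names and the `difficulty_bits` parameter are assumptions for illustration. The point it shows is the asymmetry: finding a valid nonce takes many hash attempts, while checking one takes a single hash.

```python
import hashlib


def mine(block_data: bytes, difficulty_bits: int) -> int:
    """Search for a nonce whose SHA-256 digest, read as an integer, is below the target."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1


def is_valid(block_data: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Verification costs one hash, however long the search took."""
    digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))
```

Each extra difficulty bit doubles the expected number of attempts in `mine`, which is what lets a network tune how expensive it is to add a block.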
But will this technology be mass-adopted, and will mass adoption allow it to retain the technological benefits it has?
First, I’d like to nitpick a little bit – if anyone is speaking about “decentralized software” when referring to “the blockchain”, be suspicious. Bitcoin and other peer-to-peer overlay networks are in fact “distributed” (see the pictures here). “Decentralized” means having multiple providers, but doesn’t mean each user will be a full-featured node on the network. This nitpicking is actually part of another argument, but we’ll get to that.
If blockchain-based applications want to reach mass adoption, they have to be user-friendly. I know I’m being captain obvious here (and fortunately some of the people in the area have realized that), but with the current state of the technology, it’s impossible for end users to even get it, let alone use it.
My first serious concern is usability. To begin with, you need to download the whole blockchain on your machine. When I got my first bitcoin several years ago (when it was still 10 euro), the blockchain was kind of small and I didn’t notice that problem. Nowadays both the Bitcoin and Ethereum blockchains take ages to download. I still haven’t managed to download the ethereum one – after several bugs and reinstalls of the client, I’m still at 15%. And we are just at the beginning. A user just will not wait for days to download something in order to be able to start using a piece of technology.
I recently proposed that downloading snapshots of the blockchain via bittorrent be included in the Ethereum protocol itself. I know that snapshots of the Bitcoin blockchain have been distributed that way, but it has been a manual process. If a client can quickly download the huge file up to a recent point, and then only download the latest blocks in the traditional way, starting up may be easier. Of course, the whole chain would have to be verified, but maybe that can be a background process that doesn’t stop you from using whatever is built on top of the particular blockchain. (I’m not sure if that will be secure enough, or whether, say, potential Sybil attacks on the bittorrent part would make it undesirable; it’s just an idea.)
But even if such an approach works and is adopted, that would still mean that for every service you’d have to download a separate blockchain. Of course, projects like Ethereum may seem like the “one stop shop” for cool blockchain-based applications, but fragmentation is already happening – there are alt-coins bundled with various services like file storage, DNS, etc. That will not be workable for end-users. And it’s certainly not an option for mobile, which is the dominant client now. If instead of downloading the entire chain, something like consistent hashing is used to distribute the content in small portions among clients, it might be workable. But how will trust work in that case, I don’t know. Maybe it’s possible, maybe not.
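For those unfamiliar with consistent hashing, here is a minimal sketch in Python of how blocks could be spread across clients. Everything in it is hypothetical (the `HashRing` class, the node names, the number of virtual points per node), and it deliberately says nothing about the trust question raised above. It only shows the placement mechanism: each block lands on the first node clockwise from its position on a ring, so adding a node moves only a fraction of the blocks.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, replicas: int = 64):
        # Each node gets several virtual points to even out the distribution.
        self._ring = sorted(
            (_point(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [p for p, _ in self._ring]

    def node_for(self, block_id: str) -> str:
        """Return the first node clockwise from the block's position."""
        idx = bisect.bisect_right(self._points, _point(block_id)) % len(self._ring)
        return self._ring[idx][1]
```

With a ring built from three nodes, adding a fourth reassigns only the blocks that now fall just before one of the new node's points; every other block stays where it was.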
And yes, I know that you don’t necessarily have to install a wallet/client in order to make use of a given blockchain – you can just have a cloud-based wallet. Which is fairly convenient, but that gets me to my nitpicking from a few paragraphs above and to my second concern – this effectively turns a distributed system into a decentralized one – a limited number of cloud providers hold most of the data (just as a limited number of miners hold most of the processing power). And then, even though the underlying technology allows for a distributed deployment, we’ll end up again with a simply decentralized or even de facto centralized system, if mergers and acquisitions lead us there (and they probably will). And in order to be able to access our wallets/accounts from multiple devices, we’d use a convenient cloud service where we’d log in with our username and password (because the private key is just too technical and hard for regular users). And that seems to defeat the whole idea.
Not only that, but there is an inevitable centralization of decisions (who decides on the size of the block, who has commit rights to the client repository) as well as a hidden centralization of power – how much GPU power do the Chinese mining “farms” control and can they influence the network significantly? And will the average user ever know that or care (as they don’t care that Google is centralized)? I think that overall, distributed technologies will follow the power law, and the majority of data/processing power/decision power will be controlled by a minority of actors. And so our distributed utopia will not happen in the pure form we dream of.
My third concern is incentive. Distributed technologies that have been successful so far have a pretty narrow set of incentives. The internet was promoted by large public institutions, including government agencies and big universities. Bittorrent was successful mainly because it allowed free movies and songs with 2 clicks of the mouse. And Bitcoin was successful because it offered financial benefits. I’m oversimplifying of course, but “government effort”, “free & easy” and “source of more money” seem to have been the successful incentives. On the other side of the fence there are dozens of failed distributed technologies. I’ve tried many of them – alternative search engines, alternative file storage, alternative ride-sharing services, alternative social networks, alternative “internets” even. None have gained traction. Because they are not easier to use than their free competitors and you can’t make money out of them (and no government bothers promoting them).
Will blockchain-based services have sufficient incentives to drive customers? Will centralized competitors just easily crush the distributed alternatives by being cheaper and more user-friendly, and by having sales departments that can target more than hardcore geeks who have no problem syncing their blockchain via the command line? The utopian slogans seem very cool to idealists and futurists, but don’t sell. “Free from centralized control, full control over your data” – we’d have to go through a long process of cultural change before these things make sense to more than a handful of people.
Speaking of services, often examples include “the sharing economy”, where one stranger offers a service to another stranger. Blockchain technology seems like a good fit here indeed – the services are by nature distributed, why should the technology be centralized? Here comes my fourth concern – identity. While for the cryptocurrencies it’s actually beneficial to be anonymous, for most of the real-world services (i.e. the industries that ought to be revolutionized) this is not an option. You can’t just go in the car of publicKey=5389BC989A342…. “But there are already distributed reputation systems”, you may say. Yes, and they are based on technical, not real-world identities. That doesn’t build trust. I don’t trust that publicKey=5389BC989A342… is the same person that got the high reputation. There may be five people behind that private key. The private key may have been stolen (e.g. in a cloud-provider breach).
The value of companies like Uber and AirBNB is that they serve as trust brokers. They verify and vouch for their drivers and hosts (and passengers and guests). They verify their identity through government-issued documents, skype calls, selfies, compare pictures to documents, get access to government databases, credit records, etc. Can a fully distributed service do that? No. You’d need a centralized provider to do it. And how would the blockchain make any difference then? Well, I may not be entirely correct here. I’ve actually been thinking quite a lot about decentralized identity. E.g. a way to predictably generate a private key based on, say biometrics+password+government-issued-documents, and use the corresponding public key as your identifier, which is then fed into reputation schemes and ultimately – real-world services. But we’re not there yet.
And that is part of my fifth concern – the technology itself. We are not there yet. There are bugs, there are thefts and leaks. There are hard-forks. There isn’t sufficient understanding of the technology (I confess I don’t fully grasp all the implementation details, and they are always the key). Often the technology is advertised as “just working”, but it isn’t. The other day I read an article (lost the link) that clarifies a common misconception about smart contracts – they cannot interact with the outside world – they can’t call APIs (e.g. stock market prices, bank APIs), they can’t push or fetch data from anywhere but the blockchain. That mandates the need, again, for a centralized service that pushes the relevant information before smart contracts can pick it up. I’m pretty sure that all cool-sounding applications are not possible without extensive research. And even if/when they are, writing distributed code is hard. Debugging a smart contract is hard. Yes, hard is cool, but that doesn’t drive economic value.
I have mostly been referring to public blockchains so far. Private blockchains may have their practical application, but there’s one catch – they are not exactly the cool distributed technology that the Bitcoin uses. They may be called “blockchains” because they…chain blocks, but they usually centralize trust. For example the Hyperledger project uses PKI, with all its benefits and risks. In these cases, a centralized authority issues the identity “tokens”, and then nodes communicate and form a shared ledger. That’s a bit easier problem to solve, and the nodes would usually be on actual servers in real datacenters, and not on your uncle’s Windows XP.
That said, hash chaining has been around for quite a long time. I did research on the matter because of a side-project of mine and it seems providing a tamper-proof/tamper-evident log/database on semi-trusted machines has been discussed in many computer science papers since the 90s. That alone is not “the magic blockchain” that will solve all of our problems, no matter what gossip protocols you sprinkle on top. I’m not saying that’s bad, on the contrary – any variation and combination of the building blocks of the blockchain (the hash chain, the consensus algorithm, the proof-of-work (or stake), possibly smart contracts) has potential for making useful products.
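As an illustration of how little is needed for the hash-chain building block on its own, here is a minimal sketch of a tamper-evident log in Python. The entry layout and the function names are assumptions made for this example, not taken from any particular paper or product. Each entry stores the hash of the previous entry, so changing an old entry breaks every link after it.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with this entry's payload."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(log: list, payload: dict) -> None:
    """Add an entry that commits to everything before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "payload": payload, "hash": entry_hash(prev_hash, payload)})


def verify(log: list) -> bool:
    """Walk the chain and recompute every link."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Note that this only makes tampering evident to someone who holds a trusted copy of the latest hash; an attacker who can rewrite the whole log can also recompute the chain, which is the gap that consensus and proof-of-work are meant to close.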
I know I sound like a naysayer here, but I hope I’ve pointed out particular issues, rather than aimlessly ranting at the hype (though that’s tempting as well). I’m confident that blockchain-like technologies will have their practical applications, and we will see some successful, widely-adopted services and solutions based on that, just as pointed out in this detailed report. But I’m not convinced it will be revolutionary.
I hope I’m proven wrong, though, because watching a revolutionizing technology closely and even being part of it would be quite cool.
Abstract: Apple’s 2016 fight against a court order commanding it to help the FBI unlock the iPhone of one of the San Bernardino terrorists exemplifies how central the question of regulating government surveillance has become in American politics and law. But scholarly attempts to answer this question have suffered from a serious omission: scholars have ignored how government surveillance is checked by “surveillance intermediaries,” the companies like Apple, Google, and Facebook that dominate digital communications and data storage, and on whose cooperation government surveillance relies. This Article fills this gap in the scholarly literature, providing the first comprehensive analysis of how surveillance intermediaries constrain the surveillance executive. In so doing, it enhances our conceptual understanding of, and thus our ability to improve, the institutional design of government surveillance.
Surveillance intermediaries have the financial and ideological incentives to resist government requests for user data. Their techniques of resistance are: proceduralism and litigiousness that reject voluntary cooperation in favor of minimal compliance and aggressive litigation; technological unilateralism that designs products and services to make surveillance harder; and policy mobilization that rallies legislative and public opinion to limit surveillance. Surveillance intermediaries also enhance the “surveillance separation of powers”; they make the surveillance executive more subject to inter-branch constraints from Congress and the courts, and to intra-branch constraints from foreign-relations and economics agencies as well as the surveillance executive’s own surveillance-limiting components.
The normative implications of this descriptive account are important and cross-cutting. Surveillance intermediaries can both improve and worsen the “surveillance frontier”: the set of tradeoffs between public safety, privacy, and economic growth from which we choose surveillance policy. And while intermediaries enhance surveillance self-government when they mobilize public opinion and strengthen the surveillance separation of powers, they undermine it when their unilateral technological changes prevent the government from exercising its lawful surveillance authorities.
As devastating as the latest widespread ransomware attacks have been, it’s a problem with a solution. If your copy of Windows is relatively current and you’ve kept it updated, your laptop is immune. It’s only older unpatched systems on your computer that are vulnerable.
Patching is how the computer industry maintains security in the face of rampant Internet insecurity. Microsoft, Apple and Google have teams of engineers who quickly write, test and distribute these patches, updates to the codes that fix vulnerabilities in software. Most people have set up their computers and phones to automatically apply these patches, and the whole thing works seamlessly. It isn’t a perfect system, but it’s the best we have.
But it is a system that’s going to fail in the “Internet of things”: everyday devices like smart speakers, household appliances, toys, lighting systems, even cars, that are connected to the web. Many of the embedded networked systems in these devices that will pervade our lives don’t have engineering teams on hand to write patches and may well last far longer than the companies that are supposed to keep the software safe from criminals. Some of them don’t even have the ability to be patched.
Fast forward five to 10 years, and the world is going to be filled with literally tens of billions of devices that hackers can attack. We’re going to see ransomware against our cars. Our digital video recorders and web cameras will be taken over by botnets. The data that these devices collect about us will be stolen and used to commit fraud. And we’re not going to be able to secure these devices.
Like every other instance of product safety, this problem will never be solved without considerable government involvement.
For years, I have been calling for more regulation to improve security in the face of this market failure. In the short term, the government can mandate that these devices have more secure default configurations and the ability to be patched. It can issue best-practice regulations for critical software and make software manufacturers liable for vulnerabilities. It’ll be expensive, but it will go a long way toward improved security.
But it won’t be enough to focus only on the devices, because these things are going to be around and on the Internet much longer than the two to three years we use our phones and computers before we upgrade them. I expect to keep my car for 15 years, and my refrigerator for at least 20 years. Cities will expect the networks they’re putting in place to last at least that long. I don’t want to replace my digital thermostat ever again. Nor, if I ever need one, do I want a surgeon to ever have to go back in to replace my computerized heart defibrillator in order to fix a software bug.
No amount of regulation can force companies to maintain old products, and it certainly can’t prevent companies from going out of business. The future will contain billions of orphaned devices connected to the web that simply have no engineers able to patch them.
Imagine this: The company that made your Internet-enabled door lock is long out of business. You have no way to secure yourself against the ransomware attack on that lock. Your only option, other than paying, and paying again when it’s reinfected, is to throw it away and buy a new one.
Ultimately, we will also need the network to block these attacks before they get to the devices, but there again the market will not fix the problem on its own. We need additional government intervention to mandate these sorts of solutions.
None of this is welcome news to a government that prides itself on minimal intervention and maximal market forces, but national security is often an exception to this rule. Last week’s cyberattacks have laid bare some fundamental vulnerabilities in our computer infrastructure and serve as a harbinger. There’s a lot of good research into robust solutions, but the economic incentives are all misaligned. As politically untenable as it is, we need government to step in to create the market forces that will get us out of this mess.
What does your personal utopia look like? Do you think we (as mankind) can achieve it? Why/why not?
I spent the month up to my eyeballs in a jam game, but this question was in the back of my mind a lot. I could use it as a springboard to opine about anything, especially in the current climate: politics, religion, nationalism, war, economics, etc., etc. But all of that has been done to death by people who actually know what they’re talking about.
The question does say “personal”. So in a less abstract sense… what do I want the world to look like?
Mostly, I want everyone to have the freedom to make things.
I’ve been having a surprisingly hard time writing the rest of this without veering directly into the ravines of “basic income is good” and “maybe capitalism is suboptimal”. Those are true, but not really the tone I want here, and anyway they’ve been done to death by better writers than I. I’ve talked this out with Mel a few times, and it sounds much better aloud, so I’m going to try to drop my Blog Voice and just… talk.
I’m construing “art” very broadly here. More broadly than “media”, too. I’m including shitty robots, weird Twitter almost-bots, weird Twitter non-bots, even a great deal of open source software. Anything that even remotely resembles creative work — driven perhaps by curiosity, perhaps by practicality, but always by a soul bursting with ideas and a palpable need to get them out.
Western culture thrives on art. Most culture thrives on art. I’m not remotely qualified to defend this, but I suspect you could define culture in terms of art. It’s pretty important.
You’d think this would be reflected in how we discuss art, but often… it’s not. Tell me how often you’ve heard some of these gems.
“I could do that.”
“My eight-year-old kid could do that.”
Jokes about the worthlessness of liberal arts degrees.
Jokes about people trying to write novels in their spare time, the subtext being that only dreamy losers try to write novels, or something.
The caricature of a hippie working on a screenplay at Starbucks.
Oh, and then there was the guy who made a bot to scrape tons of art from artists who were using Patreon as a paywall — and a primary source of income. The justification was that artists shouldn’t expect to make a living off of, er, doing art, and should instead get “real jobs”.
I do wonder. How many of the people repeating these sentiments listen to music, or go to movies, or bought an iPhone because it’s prettier? Are those things not art that took real work to create? Is creating those things not a “real job”?
Perhaps a “real job” has to be one that’s not enjoyable, not a passion? And yet I can’t recall ever hearing anyone say that Taylor Swift should get a “real job”. Or that, say, pro football players should get “real jobs”. What do pro football players even do? They play a game a few times a year, and somehow this drives the flow of unimaginable amounts of money. We dress it up in the more serious-sounding “sport”, but it’s a game in the same general genre as hopscotch. There’s nothing wrong with that, but somehow it gets virtually none of the scorn that art does.
Another possible explanation is America’s partly-Christian, partly-capitalist attitude that you deserve exactly whatever you happen to have at the moment. (Whereas I deserve much more and will be getting it any day now.) Rich people are rich because they earned it, and we don’t question that further. Poor people are poor because they failed to earn it, and we don’t question that further, either. To do so would suggest that the system is somehow unfair, and hard work does not perfectly correlate with any particular measure of success.
I’m sure that factors in, but it’s not quite satisfying: I’ve also seen a good deal of spite aimed at people who are making a fairly decent chunk through Patreon or similar. Something is missing.
I thought, at first, that the key might be the American worship of work. Work is an inherent virtue. Politicians run entire campaigns based on how many jobs they’re going to create. Notably, no one seems too bothered about whether the work is useful, as long as someone decided to pay you for it.
Finally I stumbled upon the key. America doesn’t actually worship work. America worships business. Business means a company is deciding to pay you. Business means legitimacy. Business is what separates a hobby from a career.
And this presents a problem for art.
If you want to provide a service or sell a product, that’ll be hard, but America will at least try to look like it supports you. People are impressed that you’re an entrepreneur, a small business owner. Politicians will brag about policies made in your favor, whether or not they’re stabbing you in the back.
Small businesses have a particular structure they can develop into. You can divide work up. You can have someone in sales, someone in accounting. You can provide specifications and pay a factory to make your product. You can defer all of the non-creative work to someone else, whether that means experts in a particular field or unskilled labor.
But if your work is inherently creative, you can’t do that. The very thing you’re making is your idea in your style, driven by your experience. This is not work that’s readily parallelizable. Even if you sell physical merchandise and register as an LLC and have a dedicated workspace and do various other formal business-y things, the basic structure will still look the same: a single person doing the thing they enjoy. A hobbyist.
Consider the bulleted list from above. Those are all individual painters or artists or authors or screenwriters. The kinds of artists who earn respect without question are generally those managed by a business, those with branding: musical artists signed to labels, actors working for a studio. Even football players are part of a tangle of business.
(This doesn’t mean that business automatically confers respect, of course; tech in particular is full of anecdotes about nerds’ disdain for people whose jobs are design or UI or documentation or whathaveyou. But a businessy look seems to be a significant advantage.)
It seems that although art is a large part of what informs culture, we have a culture that defines “serious” endeavors in such a way that independent art cannot possibly be “serious”.
Which wouldn’t really matter at all, except that we also have a culture that expects you to pay for food and whatnot.
The reasoning isn’t too outlandish. Food is produced from a combination of work and resources. In exchange for getting the food, you should give back some of your own work and resources.
Obviously this is riddled with subtle flaws, but let’s roll with it for now and look at a case study. Like, uh, me!
Mel and I built and released two games together in the six weeks between mid-January and the end of February. Together, those games have made $1,000 in sales. The sales trail off fairly quickly within a few days of release, so we’ll call that the total gross for our effort.
I, dumb, having never actually sold anything before, thought this was phenomenal. Then I had the misfortune of doing some math.
Itch takes at least 10%, so we’re down to $900 net. Divided over six weeks, that’s $150 per week, before taxes — or $3.75 per hour if we’d been working full time.
Ah, but wait! There are two of us. And we hadn’t been working full time — we’d been working nearly every waking hour, which is at least twice “full time” hours. So we really made less than a dollar an hour. Even less than that, if you assume overtime pay.
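The arithmetic above can be checked with a short Python sketch. The figures come straight from the post; the 40-hour "full time" week and the function name `hourly_rates` are assumptions made for the example.

```python
def hourly_rates(gross: float, platform_cut: float, weeks: int,
                 people: int, overwork_factor: float,
                 full_time_hours: float = 40.0) -> dict:
    """Turn total sales into an effective hourly wage.

    full_time_hours=40 is an assumed definition of a full-time week.
    """
    net = gross * (1 - platform_cut)          # after the storefront's share
    per_week = net / weeks                    # before taxes
    naive_hourly = per_week / full_time_hours # one person, full time
    # Split between everyone, and account for working more than full time.
    actual_hourly = naive_hourly / (people * overwork_factor)
    return {
        "net": net,
        "per_week": per_week,
        "naive_hourly": naive_hourly,
        "actual_hourly": actual_hourly,
    }


# Figures from the post: $1,000 gross, 10% cut, six weeks,
# two people each working roughly twice full-time hours.
rates = hourly_rates(gross=1000, platform_cut=0.10, weeks=6,
                     people=2, overwork_factor=2)
for name, value in rates.items():
    print(f"{name}: ${value:.2f}")
```

With those inputs the naive rate works out to $3.75 per hour and the per-person rate to a little under a dollar, matching the numbers in the text.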
From the perspective of capitalism, what is our incentive to do this? Between us, we easily have over thirty years of experience doing the things we do, and we spent weeks in crunch mode working on something, all to earn a small fraction of minimum wage. Did we not contribute back our own work and resources? Was our work worth so much less than waiting tables?
Waiting tables is a perfectly respectable way to earn a living, mind you. Ah, but wait! I’ve accidentally done something clever here. It is generally expected that you tip your waiter, because waiters are underpaid by the business, because the business assumes they’ll be tipped. Not tipping is actually, almost impressively, one of the rudest things you can do. And yet it’s not expected that you tip an artist whose work you enjoy, even though many such artists aren’t being paid at all.
Now, to be perfectly fair, both games were released for free. Even a dollar an hour is infinitely more than the zero dollars I was expecting — and I’m amazed and thankful we got as much as we did! Thank you so much. I bring it up not as a complaint, but as an armchair analysis of our systems of incentives.
People can take art for granted and whatever, yes, but there are several other factors at play here that hamper the ability for art to make money.
For one, I don’t want to sell my work. I suspect a great deal of independent artists and writers and open source developers (!) feel the same way. I create things because I want to, because I have to, because I feel so compelled to create that having a non-creative full-time job was making me miserable. I create things for the sake of expressing an idea. Attaching a price tag to something reduces the number of people who’ll experience it. In other words, selling my work would make it less valuable in my eyes, in much the same way that adding banner ads to my writing would make it less valuable.
And yet, I’m forced to sell something in some way, or else I’ll have to find someone who wants me to do bland mechanical work on their ideas in exchange for money… at the cost of producing sharply less work of my own. Thank goodness for Patreon, at least.
There’s also the reverse problem, in that people often don’t want to buy creative work. Everyone does sometimes, but only sometimes. It’s kind of a weird situation, and the internet has exacerbated it considerably.
Consider that if I write a book and print it on paper, that costs something. I have to pay for the paper and the ink and the use of someone else’s printer. If I want one more book, I have to pay a little more. I can cut those costs pretty considerably by printing a lot of books at once, but each copy still has a price, a marginal cost. If I then gave those books away, I would be actively losing money. So I can pretty well justify charging for a book.
Along comes the internet. Suddenly, copying costs nothing. Not only does it cost nothing, but it’s the fundamental operation. When you download a file or receive an email or visit a web site, you’re really getting a copy! Even the process which ultimately shows it on your screen involves a number of copies. This is so natural that we don’t even call it copying, don’t even think of it as copying.
True, bandwidth does cost something, but the rate is virtually nothing until you start looking at very big numbers indeed. I pay $60/mo for hosting this blog and a half dozen other sites — even that’s way more than I need, honestly, but downgrading would be a hassle — and I get 6TB of bandwidth. Even the longest of my posts haven’t exceeded 100KB. A post could be read by 64 million people before I’d start having a problem. If that were the population of a country, it’d be the 23rd largest in the world, between Italy and the UK.
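As a rough check on that figure, here is the division in Python. Whether "TB" and "KB" mean binary or decimal units is not stated in the post, so both readings are shown; the helper name `max_reads` is hypothetical.

```python
def max_reads(bandwidth_bytes: int, post_bytes: int) -> int:
    """How many times a post of a given size fits in a bandwidth allowance."""
    return bandwidth_bytes // post_bytes


KIB = 1024        # binary kilobyte
TIB = 1024 ** 4   # binary terabyte

# Reading 1 (assumption): 6 TB and 100 KB are binary units.
binary_reads = max_reads(6 * TIB, 100 * KIB)

# Reading 2 (assumption): both are decimal (SI) units.
decimal_reads = max_reads(6 * 10**12, 100 * 10**3)

print(f"binary units:  {binary_reads:,} reads")
print(f"decimal units: {decimal_reads:,} reads")
```

The binary reading gives roughly 64 million reads, which lines up with the number quoted above; the decimal reading gives 60 million. Either way the conclusion is the same.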
How, then, do I justify charging for my writing? (Yes, I realize the irony in using my blog as an example in a post I’m being paid $88 to write.)
Well, I do pour effort and expertise and a fraction of my finite lifetime into it. But it doesn’t cost me anything tangible — I already had this hosting for something else! — and it’s easier all around to just put it online.
The same idea applies to a vast bulk of what’s online, and now suddenly we have a bit of a problem. Not only are we used to getting everything for free online, but we never bothered to build any sensible payment infrastructure. You still have to pay for everything by typing in a cryptic sequence of numbers from a little physical plastic card, which will then give you a small loan and charge the seller 30¢ plus 2.9% for the “convenience”.
If a website could say “pay 5¢ to read this” and you clicked a button in your browser and that was that, we might be onto something. But with our current setup, it costs far more than 5¢ to transfer 5¢, even though it’s just a number in a computer somewhere. The only people with the power and resources to fix this don’t want to fix it — they’d rather be the ones charging you the 30¢ plus 2.9%.
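To put numbers on that, here is a small Python sketch using the fee quoted above (30¢ plus 2.9%). Real processors vary, so treat the constants and the function names `processing_fee`, `seller_net` and `break_even` as assumptions for illustration only.

```python
from decimal import Decimal

# Fee structure as quoted in the post; actual processor fees differ.
FIXED_FEE = Decimal("0.30")
PERCENT_FEE = Decimal("0.029")


def processing_fee(amount: Decimal) -> Decimal:
    """Fee charged to the seller on a single card payment."""
    return FIXED_FEE + PERCENT_FEE * amount


def seller_net(amount: Decimal) -> Decimal:
    """What the seller keeps after the fee (negative means a loss)."""
    return amount - processing_fee(amount)


def break_even() -> Decimal:
    """Smallest price at which the seller keeps anything at all."""
    return FIXED_FEE / (1 - PERCENT_FEE)


nickel = Decimal("0.05")
print("fee on 5 cents:", processing_fee(nickel))
print("seller keeps:  ", seller_net(nickel))
print("break-even at: ", break_even().quantize(Decimal("0.01")))
```

Under these assumed fees, a 5¢ payment costs the seller about 30¢ to receive, and the price has to be around 31¢ before the seller sees a single cent.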
That leads to another factor of platforms and publishers, which are more than happy to eat a chunk of your earnings even when you do sell stuff. Google Play, the App Store, Steam, and anecdotally many other big-name comparable platforms all take 30% of your sales. A third! And that’s good! It seems common among book publishers to take 85% to 90%. For ebook sales (i.e., ones that don’t actually cost anything) they may generously lower that to a mere 75% to 85%.
Bless Patreon for only taking 5%. Itch.io is even better: it defaults to 10%, but gives you a slider, which you can set to anything from 0% to 100%.
I’ve mentioned all this before, so here’s a more novel thought: finite disposable income. Your audience only has so much money to spend on media right now. You can try to be more compelling to encourage them to spend more of it, rather than saving it, but ultimately everyone has a limit before they just plain run out of money.
Now, popularity is heavily influenced by social and network effects, so it tends to create a power law distribution: a few things are ridiculously hyperpopular, and then there’s a steep drop to a long tail of more modestly popular things.
If a new hyperpopular thing comes out, everyone is likely to want to buy it… but then that eats away a significant chunk of that finite pool of money that could’ve gone to less popular things.
This isn’t bad, and buying a popular thing doesn’t make you a bad person; it’s just what happens. I don’t think there’s any satisfying alternative that doesn’t involve radically changing the way we think about our economy.
Taylor Swift, who I’m only picking on because her infosec account follows me on Twitter, has sold tens of millions of albums and is worth something like a quarter of a billion dollars. Does she need more? If not, should she make all her albums free from now on?
Maybe she does, and maybe she shouldn’t. The alternative is for someone to somehow prevent her from making more money, which doesn’t sit well. Yet it feels almost heretical to even ask if someone “needs” more money, because we take for granted that she’s earned it — in part by being invested in by a record label and heavily advertised. The virtue is work, right? Don’t a lot of people work just as hard? (“But you have to be talented too!” Then please explain how wildly incompetent CEOs still make millions, and leave burning businesses only to be immediately hired by new ones? Anyway, are we really willing to bet there is no one equally talented but not as popular by sheer happenstance?)
It’s kind of a moot question anyway, since she’s probably under contract with billionaires and it’s not up to her.
Where the hell was I going with this.
Right, so. Money. Everyone needs some. But making it off art can be tricky, unless you’re one of the lucky handful who strike gold.
And I’m still pretty goddamn lucky to be able to even try this! I doubt I would’ve even gotten into game development by now if I were still working for an SF tech company — it just drained so much of my creative energy, and it’s enough of an uphill battle for me to get stuff done in the first place.
How many people do I know who are bursting with ideas, but have to work a tedious job to keep the lights on, and are too tired at the end of the day to get those ideas out? Make no mistake, making stuff takes work — a lot of it. And that’s if you’re already pretty good at the artform. If you want to learn to draw or paint or write or code, you have to do just as much work first, with much more frustration, and not as much to show for it.
So there’s my utopia. I want to see a world where people have the breathing room to create the things they dream about and share them with the rest of us.
Can it happen? Maybe. I think the cultural issues are a fairly big blocker; we’d be much better off if we treated independent art with the same reverence as, say, people who play with a ball for twelve hours a year. Or if we treated liberal arts degrees as just as good as computer science degrees. (“But STEM can change the world!” Okay. How many people with computer science degrees would you estimate are changing the world, and how many are making a website 1% faster or keeping a lumbering COBOL beast running or trying to trick 1% more people into clicking on ads?)
I don’t really mean stuff like piracy, either. Piracy is a thing, but it’s… complicated. In my experience it’s not even artists who care the most about piracy; it’s massive publishers, the sort who see artists as a sponge to squeeze money out of. You know, the same people who make everything difficult to actually buy, infest it with DRM so it doesn’t work on half the stuff you own, and don’t even sell it in half the world.
I mean treating art as a free-floating commodity, detached from anyone who created it. I mean neo-Nazis adopting a comic book character as their mascot, against the creator’s wishes. I mean politicians and even media conglomerates using someone else’s music in well-funded videos and ads without even asking. I mean assuming Google Image Search, wonder that it is, is some kind of magical free art machine. I mean the snotty Reddit post I found while looking up Patreon’s fee structure, where some doofus was insisting that Patreon couldn’t possibly pay for a full-time YouTuber’s time, because not having a job meant they had lots of time to spare.
Maybe I should go one step further: everyone should create at least once or twice. Everyone should know what it’s like to have crafted something out of nothing, to be a fucking god within the microcosm of a computer screen or a sewing machine or a pottery table. Everyone should know that spark of inspiration that we don’t seem to know how to teach in math or science classes, even though it’s the entire basis of those as well. Everyone should know that there’s a good goddamn reason I listed open source software as a kind of art at the beginning of this post.
Basic income and more arts funding for public schools. If Uber can get billions of dollars for putting little car icons on top of Google Maps and not actually doing any of their own goddamn service themselves, I think we can afford to pump more cash into webcomics and indie games and, yes, even underwater basket weaving.
Yesterday, the NYC taxi union had a one-hour strike protesting Trump’s “Muslim Ban”, refusing to pick up passengers at the JFK airport. Uber responded by disabling surge pricing at the airport. This has widely been interpreted as a bad thing, so the hashtag “#DeleteUber” has been trending, encouraging people to delete their Uber accounts/app.
These people are wrong, obviously so.
Surge Pricing
Uber’s “Surge Pricing” isn’t price gouging, as many assume. Instead, the additional money goes directly to the drivers, to encourage them to come to the area surging and pick up riders. Uber isn’t a taxi company. It can’t direct drivers to go anywhere. All it can do is provide incentives. “Surge Pricing” for customers means “Surge Income” for the drivers, giving them an incentive. Drivers have a map showing which areas of the city are surging, so they can drive there.
Another way of thinking about it is “Demand Pricing”. It’s simply the economic Law of Supply and Demand. If demand increases, then prices increase, and then supply increases chasing the higher profits. It’s why famously you can’t get a taxi cab on New Year’s Eve, but you can get an Uber driver. Taxi drivers can’t charge more when demand is surging, so there are no more taxis available on that date than on any other. But Uber drivers can/do charge more, so there are more Uber drivers.
Supply and Demand is every bit as much a law as Gravity. If the supply of taxi drivers is less than the demand, then not everyone is going to get a ride. That’s basic math. If there are only 20 drivers right now, and 100 people wanting a ride, then 80 riders are going to be disappointed. The only solution is more drivers. Paying drivers more money gets more drivers. The part-time drivers, the drivers planning on partying instead of working, will decide to work New Year’s chasing the surge wages.
Uber made the following announcement:
Surge pricing has been turned off at #JFK Airport. This may result in longer wait times. Please be patient.
Without turning off Surge Pricing, Uber’s computers would notice the spike in demand, as would-be taxi customers switch to Uber. The computers would then institute surge pricing around JFK automatically. This would notify the drivers in the area, who would then flock to JFK, chasing the higher income. This would be bad for the strike.
By turning off surge pricing, there would be no increase in supply. It would mean the only drivers going to JFK are those dropping off passengers. It would mean that Uber wouldn’t be servicing any more riders than on a normal day, making no difference to the taxi strike, one way or the other.
Why wouldn’t Uber stop pickups at JFK altogether, joining the strike? Because it’d be a tough decision for them. They have a different relationship with their drivers. Both taxis and Uber are required to take passengers to the airport if asked, but taxis are much better at weaseling out of it [*]. Stopping pickups would mean screwing Uber drivers, forcing them to go way out to JFK with no return fare. In contrast, taxis were warned enough ahead of time to avoid the trip.
The above section assumes a carefully considered Uber policy. In reality, they didn’t have the time.
The taxi union didn’t announce their decision until 5pm, with the strike set for only one hour, between 6pm and 7pm.
BREAKING: NYTWA drivers call for one hour work stoppage @ JFK airport today 6 PM to 7 PM to protest #muslimban! #nobannowall
Uber’s announcement was at 7:30pm, half an hour after the strike was over. They may not have been aware of the strike until after it started, when somebody noticed an enormous surge starting at 6pm. I can imagine them running around in a panic at 6:05pm, trying to figure out how to respond.
Disabling surge pricing is probably their default action. They’ve been down this route before. Every time there is a terrorist attack or natural disaster, and computers turn on Surge Pricing, somebody has to rush to go turn it off again, offer customer rebates, and so on for PR purposes.
Why doesn’t the press report this?
Everyone knows Surge Pricing is evil. After all, that’s what you always read in the press. But that’s because the press knows as little about basic economics as their readers.
A good example is this CNN story on the incident [*].
CNN describes this as “effectively lowering the cost of a ride“. They ignore the reality, that this was “effectively lowering the supply of rides“. Reading this, readers will naturally assume there’s an unlimited supply ready to service the lower priced rides. What CNN fails to tell readers is that there is no increase in supply, that there can’t be more rides than normal. They ignore the bit in the tweet that warns against longer wait times due to lack of supply.
Conclusion
The timing alone makes the #DeleteUber claims nonsense, as the strike had already been over for 36 minutes when Uber tweeted. But in any case, Uber’s decision not to do surge pricing did not “entice” customers with lower prices: they would still have long waits (as the tweet says), causing a strong dis-enticement. No rational person could interpret this as Uber trying to profit from this event.
On the other hand, before this event, Uber announced its opposition to Trump’s action, and promised to help any of its drivers adversely affected.
Update: The #deleteUber people are switching to Lyft, which continued to pick up passengers during the strike. Lyft is a company funded by Trump adviser Peter Thiel.
Today, the FTC filed a lawsuit[*] against D-Link for security problems, such as backdoor passwords. I thought I’d write up some notes.
The suit is not “product liability”, but “unfair and deceptive” business practices for promising “security”. In addition, they interpret “security” differently from the cybersecurity community.
This needs to be stressed because right now in our industry, there is a big discussion of product liability, insisting that everything attached to the Internet needs to be secured. People will therefore assume the FTC action is based on “liability”.
Instead, all six counts are based upon the fact that D-Link offers its products for securing networks, and claims they are secure. Because they have backdoor passwords, clear-text passwords, command-injection bugs, and public private-keys, the FTC feels the claims of security to be untrue.
The key point I’m trying to make is that D-Link can resolve the suit (in theory) by simply removing all claims of “security”. Sure, it can claim it supports stateful-inspection firewalls and WPA2, but not things like “WPA2 security”. (Sure, the FTC may come back with a new lawsuit — but it would solve the points raised in this one).
On the other hand, while “deception” is the law the FTC uses, their obvious real intent is to improve security. They intend for D-Link to remove its security weaknesses, not to change its claims. The lawsuit is also intended to scare all IoT makers into securing their products, not to remove claims of security.
We see this intent in other posts on the FTC website. They’ve long been talking about IoT security. Recently, they announced a contest giving out $25,000 to the best solution for patching out-of-date IoT devices [*]. It’s a silly contest, but shows what their real intent is.
Thus, the language of the lawsuit is very much about improving security, while the actual counts are about unfair/deceptive practices.
This is nonsense for a number of reasons. Among their claims is that D-Link lied to their customers for saying “you need to change the default password to secure the device”, because the device still had a command-injection bug. That’s a shocking departure from common sense. We in the cybersecurity community repeatedly advise people to change passwords to make devices more secure, ignoring any other insecurity that might exist. It means I’m just as deceptive as D-Link is.
The FTC’s action is a clear violation of “due process”. They didn’t create, ahead of time, a standard listing the bugs they would consider to make a product “insecure”, but instead arbitrarily punished D-Link for not meeting an unknown standard of “secure”. They never published a document saying “you can’t advertise your product as being ‘secure’ if it contains this list of problems”.
More to the point, their idea of “secure” is at odds with the cybersecurity community. We would indeed describe WPA2 as secure, regardless of some other feature of the device that makes it insecure. Most IoT devices are intended to be used behind a firewall anyway, so the only attack surface is the WiFi network. In such cases, the device can have backdoor passwords up the ying-yang, and we in the cybersecurity community will still call this “secure”.
This is important because no product will ever be perfectly secure. Ten years from now, hackers will still discover some bug in some IoT product that nobody considered before, and the FTC will come down on them and punish them for deceptive practice. This is also counterproductive to the FTC’s goals: if they are going to be so unfair about it, they are going to create incentives for companies to produce the wrong solution, to stop advertising their products as “secure”.
The consequence of this action against D-Link is that the FTC is going to create an enormous chilling effect on innovation. As apps and IoT devices proliferate, the FTC is going to punish those on the forefront creating new and innovative products. At the same time, it’s going to have little impact on actual security. They’ll raise the price of brand-name products, while still being unable to target the white-box/no-name products that contain most of the vulnerabilities.
D-Link makes a standard claim that we always make in the security industry:
The Tor blog has a post about the refresh of its Tor-enabled Android phone prototype, which is now in a workable state though it still has some rough edges. There is also a worrisome trend that the post highlights:
“It is unfortunate that Google seems to see locking down Android as the only solution to the fragmentation and resulting insecurity of the Android platform. We believe that more transparent development and release processes, along with deals for longer device firmware support from SoC vendors, would go a long way to ensuring that it is easier for good OEM players to stay up to date. Simply moving more components to Google Play, even though it will keep those components up to date, does not solve the systemic problem that there are still no OEM incentives to update the base system. Users of old AOSP base systems will always be vulnerable to library, daemon, and operating system issues. Simply giving them slightly more up to date apps is a bandaid that both reduces freedom and does not solve the root security problems. Moreover, as more components and apps are moved to closed source versions, Google is reducing its ability to resist the demand that backdoors be introduced. It is much harder to backdoor an open source component (especially with reproducible builds and binary transparency) than a closed source one.”
Roy Ben-Alta is Sr. Business Development Manager at AWS – Big Data & Machine Learning
We can’t believe that there are just a couple of weeks left before re:Invent 2016. If you are attending this year, you will want to check out our Big Data sessions! Unlike in previous years, these sessions are covered in multiple tracks, such as Big Data & Analytics, Architecture, Databases, and IoT. We will also have—for the first time—two mini-conferences: Big Data and Machine Learning. These resource mini-conferences include full-day technical deep dives on a broad variety of topics, including big data, IoT, machine learning, and more.
This year, we have over 40 sessions!
We have great sessions from Netflix, Chick-fil-A, Under Armour, FINRA, King.com, Beeswax, GE, Toyota Racing Development, Quantcast, Groupon, Amazon.com, Scholastic, Thomson Reuters, DataXu, Sony, EA, and many more. All sessions are recorded and made available on YouTube. Also, all slide decks from the sessions are made available on SlideShare.net after the conference.
Today, I highlight the sessions to be presented as part of the Big Data & Machine Learning mini-conferences, Big Data analytics, and relevant sessions from other tracks. The following sessions are in this year’s session catalog. Choose any link to learn more or to add a session to your schedule.
We are looking forward to meeting you at re:Invent.
BDM205 – Big Data Mini-Con State of the Union – Tuesday Join us for this general session where AWS big data experts present an in-depth look at the current state of big data. Learn about the latest big data trends and industry use cases. Hear how other organizations are using the AWS big data platform to innovate and remain competitive. Take a look at some of the most recent AWS big data announcements, as we kick off the Big Data Mini-Con.
MAC206 – Amazon Machine Learning State of the Union Mini-Con – Wednesday With the growing number of business cases for artificial intelligence (AI), machine learning and deep learning continue to drive the development of state-of-the-art technology. We see this manifested in computer vision, predictive modeling, natural language understanding, and recommendation engines. During this full day of sessions and workshops, learn how we use some of these technologies within Amazon, and how you can develop your applications to leverage the benefits of these AI services.
Deep dive customer use case sessions
ARC306 – Event Handling at Scale: Designing an Auditable Ingestion and Persistence Architecture for 10K+ events/second How does McGraw-Hill Education use the AWS platform to scale and reliably receive 10,000 learning events per second? How do we provide near-real-time reporting and event-driven analytics for hundreds of thousands of concurrent learners in a reliable, secure, and auditable manner that is cost effective? MHE designed and implemented a robust solution that integrates AWS API Gateway, AWS Lambda, Amazon Kinesis, Amazon S3, Amazon Elasticsearch Service, Amazon DynamoDB, HDFS, Amazon EMR, Amazon EC2, and other technologies to deliver this cloud-native platform across the US and soon the world. This session describes the challenges we faced, architecture considerations, how we gained confidence for a successful production roll-out, and the behind-the-scenes lessons we learned.
ARC308 – Metering Big Data at AWS: From 0 to 100 Million Records in 1 Second Learn how AWS processes millions of records per second to support accurate metering across AWS and our customers. This session shows how we migrated from traditional frameworks to AWS managed services to support a broad processing pipeline. You gain insights on how we used AWS services to build a reliable, scalable, and fast processing system using Amazon Kinesis, Amazon S3, and Amazon EMR. Along the way, we dive deep into use cases that deal with scaling and accuracy constraints. Attend this session to see AWS’s end-to-end solution that supports metering at AWS.
BDA203 – Billions of Rows Transformed in Record Time Using Matillion ETL for Amazon Redshift GE Power & Water develops advanced technologies to help solve some of the world’s most complex challenges related to water availability and quality. They had amassed billions of rows of data on on-premises databases but decided to migrate some of their core big data projects to the AWS Cloud. When they decided to transform and store it all in Amazon Redshift, they knew they needed an ETL/ELT tool that could handle this enormous amount of data and safely deliver it to its destination.
In this session, Ryan Oates, Enterprise Architect at GE Water, shares his use case, requirements, outcomes and lessons learned. He also shares the details of his solution stack, including Amazon Redshift and Matillion ETL for Amazon Redshift in AWS Marketplace. You learn best practices on Amazon Redshift ETL supporting enterprise analytics and big data requirements, simply and at scale. You learn how to simplify data loading, transformation and orchestration on to Amazon Redshift and how to build out a real data pipeline.
BDA204 – Leverage the Power of the Crowd To Work with Amazon Mechanical Turk With Amazon Mechanical Turk (MTurk), you can leverage the power of the crowd for a host of tasks ranging from image moderation and video transcription to data collection and user testing. You simply build a process that submits tasks to the Mechanical Turk marketplace and get results quickly, accurately, and at scale. In this session, Russ, from Rainforest QA, shares best practices and lessons learned from his experience using MTurk. The session covers the key concepts of MTurk, getting started as a Requester, and using MTurk via the API. You learn how to set and manage Worker incentives, achieve great Worker quality, and how to integrate and scale your crowdsourced application. By the end of this session, you have a comprehensive understanding of MTurk and know how to get started harnessing the power of the crowd.
BDA205 – Delighting Customers Through Device Data with Salesforce IoT Cloud and AWS IoT The Internet of Things (IoT) produces vast quantities of data that promise a deep, always connected view into customer experiences through their devices. In this connected age, the question is no longer how do you gather customer data, but what do you do with all that data. How do you ingest at massive scale and develop meaningful experiences for your customers? In this session, you’ll learn how Salesforce IoT Cloud works in concert with the AWS IoT engine to ingest and transform all of the data generated by every one of your customers, partners, devices, and sensors into meaningful action. You’ll also see how customers are using Salesforce and AWS together to process massive quantities of data, build business rules with simple, intuitive tools, and engage proactively with customers in real time. Session sponsored by Salesforce.
BDM203 – FINRA: Building a Secure Data Science Platform on AWS Data science is a key discipline in a data-driven organization. Through analytics, data scientists can uncover previously unknown relationships in data to help an organization make better decisions. However, data science is often performed from local machines with limited resources and multiple datasets on a variety of databases. Moving to the cloud can help organizations provide scalable compute and storage resources to data scientists, while freeing them from the burden of setting up and managing infrastructure. In this session, FINRA, the Financial Industry Regulatory Authority, shares best practices and lessons learned when building a self-service, curated data science platform on AWS, a project that allowed us to remove the technology middleman and empower users to choose the best compute environment for their workloads. Understand the architecture and underlying data infrastructure services to provide a secure, self-service portal to data scientists, learn how we built consensus for tooling from our data science community, hear about the benefits of increased collaboration among the scientists due to the standardized tools, and learn how you can retain the freedom to experiment with the latest technologies while retaining information security boundaries within a virtual private cloud (VPC).
BDM204 – Visualizing Big Data Insights with Amazon QuickSight Amazon QuickSight is a fast BI service that makes it easy for you to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. QuickSight is built to harness the power and scalability of the cloud, so you can easily run analysis on large datasets, and support hundreds of thousands of users. In this session, we’ll demonstrate how you can easily get started with Amazon QuickSight, uploading files, connecting to Amazon S3 and Amazon Redshift and creating analyses from visualizations that are optimized based on the underlying data. After we’ve built our analysis and dashboard, we’ll show you how easy it is to share it with colleagues and stakeholders in just a few seconds.
BDM303 – JustGiving: Serverless Data Pipelines, Event-Driven ETL, and Stream Processing Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT), application programming interfaces (API), clickstreams, unstructured and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. Building scalable big data pipelines with automated extract-transform-load (ETL) and machine learning processes can address these limitations. JustGiving is the world’s social platform for giving. In this session, we describe how we created several scalable and loosely coupled event-driven ETL and ML pipelines as part of our in-house data science platform called RAVEN. You learn how to leverage AWS Lambda, Amazon S3, Amazon EMR, Amazon Kinesis, and other services to build serverless, event-driven, data and stream processing pipelines in your organization. We review common design patterns, lessons learned, and best practices, with a focus on serverless big data architectures with AWS Lambda.
BDM306 – Netflix: Using Amazon S3 as the fabric of our big data ecosystem Amazon S3 is the central data hub for Netflix’s big data ecosystem. We currently have over 1.5 billion objects and 60+ PB of data stored in S3. As we ingest, transform, transport, and visualize data, we find this data naturally weaving in and out of S3. Amazon S3 provides us the flexibility to use an interoperable set of big data processing tools like Spark, Presto, Hive, and Pig. It serves as the hub for transporting data to additional data stores / engines like Teradata, Amazon Redshift, and Druid, as well as exporting data to reporting tools like Microstrategy and Tableau. Over time, we have built an ecosystem of services and tools to manage our data on S3. We have a federated metadata catalog service that keeps track of all our data. We have a set of data lifecycle management tools that expire data based on business rules and compliance. We also have a portal that allows users to see the cost and size of their data footprint. In this talk, we’ll dive into these major uses of S3, as well as many smaller cases, where S3 smoothly addresses an important data infrastructure need. We also provide solutions and methodologies on how you can build your S3 big data hub.
BDM402 – Best Practices for Data Warehousing with Amazon Redshift In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift’s columnar technology and parallel processing capabilities to deliver high throughput and query performance. You also learn from King.com how to design optimal schemas, load data efficiently, and use workload management.
DAT202 – Migrating Your Data Warehouse to Amazon Redshift Amazon Redshift is a fast, simple, cost-effective data warehousing solution, and in this session, we look at the tools and techniques you can use to migrate your existing data warehouse to Amazon Redshift. We then present a case study on Scholastic’s migration to Amazon Redshift. Scholastic, a large 100-year-old publishing company, was running their business with older, on-premise, data warehousing and analytics solutions, which could not keep up with business needs and were expensive. Scholastic also needed to include new capabilities like streaming data and real-time analytics. Scholastic migrated to Amazon Redshift, and achieved agility and faster time to insight while dramatically reducing costs. In this session, Scholastic discusses how they achieved this, including options considered, technical architecture implemented, results, and lessons learned.
DAT204 – How Thermo Fisher Is Reducing Mass Spectrometry Experiment Times from Days to Minutes with MongoDB & AWS Mass spectrometry is the gold standard for determining chemical compositions, with spectrometers often measuring the mass of a compound down to a single electron. This level of granularity produces an enormous amount of hierarchical data that doesn’t fit well into rows and columns. In this talk, learn how Thermo Fisher is using MongoDB Atlas on AWS to allow their users to get near real-time insights from mass spectrometry experiments—a process that used to take days. We also share how the underlying database service used by Thermo Fisher was built on AWS.
DAT205 – Fanatics Migrates Data to Hadoop on the AWS Cloud Using Attunity CloudBeam in AWS Marketplace Keeping a data warehouse current and relevant can be challenging because of the time and effort required to insert new data. The world’s most licensed sports merchandiser, Fanatics, used Attunity CloudBeam in AWS Marketplace to transform their data from Microsoft SQL, Oracle, and other sources to Amazon S3, where they consume the data in Hadoop and Amazon Redshift. Fanatics can now analyze the huge volumes of data from their transactional, e-commerce, and back office systems, and make this data available immediately. In this session, Fanatics shares their use case, requirements, outcomes and lessons learned. You’ll learn best practices on implementing a data lake, using Apache Kafka and how to consistently replicate data to Amazon Redshift and Amazon S3.
DAT308 – Fireside chat with Groupon, Intuit, and LifeLock on solving Big Data database challenges with Redis Redis Labs’ CMO is hosting a fireside chat with leaders from multiple industries including Groupon (e-commerce), Intuit (Finance), and LifeLock (Identity Protection). This conversation-style session covers the Big Data related challenges faced by these leading companies as they scale their applications, ensure high availability, serve the best user experience at lowest latencies, and optimize between cloud and on-premises operations. The introductory level session can appeal to both developer and DevOps functions. Attendees hear about diverse use cases such as recommendations engine, hybrid transactions and analytics operations, and time-series data analysis. The audience learns how the Redis in-memory database platform addresses the above use cases with its multi-model capability and in a cost effective manner to meet the needs of the next generation applications. Session sponsored by Redis Labs.
DAT309 – How Fulfillment by Amazon (FBA) and Scopely Improved Results and Reduced Costs with a Serverless Architecture In this session, we share an overview of leveraging serverless architectures to support high-performance data intensive applications. Fulfillment by Amazon (FBA) built the Seller Inventory Authority Platform (IAP) using Amazon DynamoDB Streams, AWS Lambda functions, Amazon Elasticsearch Service, and Amazon Redshift to improve results and reduce costs. Scopely shares how they used a flexible logging system built on Amazon Kinesis, Lambda, and Amazon ES to provide high-fidelity reporting on hotkeys in Memcached and DynamoDB, and drastically reduce the incidence of hotkeys. Both of these customers are using managed services and serverless architecture to build scalable systems that can meet the projected business growth without a corresponding increase in operational costs.
DAT310 – Building Real-Time Campaign Analytics Using AWS Services Quantcast provides its advertising clients the ability to run targeted ad campaigns reaching millions of online users. The real-time bidding for campaigns runs on thousands of machines across the world. When Quantcast wanted to collect and analyze campaign metrics in real time, they turned to AWS to rapidly build a scalable, resilient, and extensible framework. Quantcast used Amazon Kinesis streams to stage data, Amazon EC2 instances to shuffle and aggregate the data, and Amazon DynamoDB and Amazon ElastiCache for building scalable time-series databases. With Elastic Load Balancing and Auto Scaling groups, they can set up distributed microservices with minimal operation overhead. This session discusses their use case, how they architected the application with AWS technologies integrated with their existing home-grown stack, and the lessons they learned.
DAT311 – How Toyota Racing Development Makes Racing Decisions in Real Time with AWS In this session, you learn how Toyota Racing Development (TRD) developed a robust and highly performant real-time data analysis tool for professional racing. In this talk, learn how we structured a reliable, maintainable, decoupled architecture built around Amazon DynamoDB as both a streaming mechanism and a long-term persistent data store. In racing, milliseconds matter and even moments of downtime can cost a race. You’ll see how we used DynamoDB together with Amazon Kinesis Streams and Amazon Kinesis Firehose to build a real-time streaming data analysis tool for competitive racing.
DAT312 – How DataXu scaled its Attribution System to handle billions of events per day with Amazon DynamoDB “Attribution” is the marketing term of art for allocating full or partial credit to advertisements that eventually lead to purchase, sign up, download, or other desired consumer interaction. DataXu shares how we use DynamoDB at the core of our attribution system to store terabytes of advertising history data. The system is cost effective and dynamically scales from 0 to 300K requests per second on demand with predictable performance and low operational overhead.
DAT313 – 6 Million New Registrations in 30 Days: How the Chick-fil-A One App Scaled with AWS Chris leads the team providing back-end services for the massively popular Chick-fil-A One mobile app that launched in June 2016. Chick-fil-A follows AWS best practices for web services and leverages numerous AWS services, including Elastic Beanstalk, DynamoDB, Lambda, and Amazon S3. This was the largest technology-dependent promotion in Chick-fil-A history. To ensure their architecture would perform at unknown and massive scale, Chris worked with AWS Support through an AWS Infrastructure Event Management (IEM) engagement and leaned on automated operations to enable load testing before launch.
DAT316 – How Telltale Games migrated its story analytics from Apache CouchDB to Amazon DynamoDB Every choice made in Telltale Games titles influences how your character develops and how the world responds to you. With millions of users making thousands of choices in a single episode, Telltale Games tracks this data and leverages it to build more relevant stories in real time as the season is developed. In this session, you’ll learn about Telltale Games’ migration from Apache CouchDB to Amazon DynamoDB, the challenges of adjusting capacity to handling spikes in database activity, and how it streamlined its analytics storage to provide new perspectives on player interaction to improve its games.
DAT318 – Migrating from RDBMS to NoSQL: How Sony Moved from MySQL to Amazon DynamoDB In this session, you learn the key differences between a relational database management system (RDBMS) and non-relational (NoSQL) databases like Amazon DynamoDB. You learn about suitable and unsuitable use cases for NoSQL databases. You’ll learn strategies for migrating from an RDBMS to DynamoDB through a 5-phase, iterative approach. See how Sony migrated an on-premises MySQL database to the cloud with Amazon DynamoDB, and see the results of this migration.
GAM301 – How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights In November 2015, Capital Games launched a mobile game accompanying a major feature film release. The back end of the game is hosted in AWS and uses big data services like Amazon Kinesis, Amazon EC2, Amazon S3, Amazon Redshift, and AWS Data Pipeline. Capital Games describe some of the challenges with their initial setup and usage of Amazon Redshift and Amazon EMR. They then go over their engagement with AWS Partner 47Lining and talk about specific best practices regarding solution architecture, data transformation pipelines, and system maintenance using AWS big data services. Attendees of this session should expect a candid view of the process of implementing a big data solution, from problem statement identification to visualizing data, with an in-depth look at the technical challenges and hurdles along the way.
LFS303 – How to Build a Big Data Analytics Data Lake For discovery phase research, life sciences companies have to support infrastructure that processes millions to billions of transactions. The advent of a data lake to accomplish such a task is showing itself to be a stable and productive data platform pattern to meet the goal. We discuss how to build a data lake on AWS, using services and techniques such as AWS CloudFormation, Amazon EC2, Amazon S3, IAM, and AWS Lambda. We also review a reference architecture from Amgen that uses a data lake to aid in their Life Science Research.
SVR301 – Real-time Data Processing Using AWS Lambda, Amazon Kinesis In this session, you learn from Thomson Reuters how they leverage AWS for their Product Insight service. The service provides insights to collect usage analytics for Thomson Reuters products. They walk through its architecture and demonstrate how they leverage Amazon Kinesis Streams, Amazon Kinesis Firehose, AWS Lambda, Amazon S3, Amazon Route 53, and AWS KMS for near real-time access to data being collected around the globe. They also outline how applying AWS methodologies benefited their business, such as time-to-market and cross-region ingestion, auto-scaling capabilities, low-latency, security features, and extensibility.
SVR305 – ↑↑↓↓←→←→ BA Lambda Start Ever wished you had a list of cheat codes to unleash the full power of AWS Lambda for your production workload? Come learn how to build a robust, scalable, and highly available serverless application using AWS Lambda. In this session, we discuss hacks and tricks for maximizing your AWS Lambda performance, such as leveraging container reuse, using the 500 MB scratch space and local cache, creating custom metrics for managing operations, aligning upstream and downstream services to scale along with Lambda, and many other workarounds and optimizations across your entire function lifecycle. You also learn how Hearst converted its real-time clickstream analytics data pipeline from a server-based model to a serverless one. The infrastructure of the data pipeline relied on Amazon EC2 instances and cron jobs to shepherd data through the process. In 2016, Hearst converted its data pipeline architecture to a serverless process that is based on event triggers and the power of AWS Lambda. By moving from a time-based process to a trigger-based process, Hearst improved its pipeline latency times by 50%.
SVR308 – Content and Data Platforms at Vevo: Rebuilding and Scaling from Zero in One Year Vevo has undergone a complete strategic and technical reboot, driven not only by product but also by engineering. Since November 2015, Vevo has been replacing monolithic, legacy content services with a modern, modular, microservices architecture, all while developing new features and functionality. In parallel, Vevo has built its data platform from scratch to power the internal analytics as well as a unique music video consumption experience through a new personalized feed of recommendations, all in less than one year. This has been a monumental effort that was made possible in this short time span largely because of AWS technologies. The content team has been heavily using serverless architectures and AWS Lambda in the form of microservices, taking a similar approach to functional programming, which has helped us speed up the development process and time to market. The data team has been building the data platform by heavily leveraging Amazon Kinesis for data exchange across services, Amazon Aurora for consumer-facing services, Apache Spark on Amazon EMR for ETL + Machine Learning, as well as Amazon Redshift as the core analytics data store.
Machine learning sessions
MAC201 – Getting to Ground Truth with Amazon Mechanical Turk Jump-start your machine learning project by using the crowd to build your training set. Before you can train your machine learning algorithm, you need to take your raw inputs and label, annotate, or tag them to build your ground truth. Learn how to use the Amazon Mechanical Turk marketplace to perform these tasks. We share Amazon’s best practices, developed while training our own machine learning algorithms and walk you through quickly getting affordable and high-quality training data.
MAC202 – Deep Learning in Alexa Neural networks have a long and rich history in automatic speech recognition. In this talk, we present a brief primer on the origin of deep learning in spoken language, and then explore today’s world of Alexa. Alexa is the AWS service that understands spoken language and powers Amazon Echo. Alexa relies heavily on machine learning and deep neural networks for speech recognition, text-to-speech, language understanding, and more. We also discuss the Alexa Skills Kit, which lets any developer teach Alexa new skills.
MAC205 – Deep Learning at Cloud Scale: Improving Video Discoverability by Scaling Up Caffe on AWS Deep learning continues to push state of the art in domains such as video analytics, computer vision, and speech recognition. Deep networks are powered by amazing levels of representational power, feature learning, and abstraction. This approach comes at the cost of a significant increase in required compute power, which makes the AWS cloud an excellent environment for training. Innovators in this space are applying deep learning to a variety of applications. One such innovator, Vilynx, a startup based in Palo Alto, realized that the current pre-roll advertising-based models for mobile video weren’t returning publishers’ desired levels of engagement. In this session, we explain the algorithmic challenges of scaling across multiple nodes, and what Intel is doing on AWS to overcome them. We describe the benefits of using AWS CloudFormation to set up a distributed training environment for deep networks. We also showcase Vilynx’s contributions to video discoverability and explain how Vilynx uses AWS tools to understand video content.
MAC301 – Transforming Industrial Processes with Deep Learning Deep learning has revolutionized computer vision by significantly increasing the accuracy of recognition systems. This session discusses how the Amazon Fulfillment Technologies Computer Vision Research team has harnessed deep learning to identify inventory defects in Amazon’s warehouses. Beginning with a brief overview of how orders on Amazon.com are fulfilled, the session describes a combination of hardware and software that uses computer vision and deep learning to visually examine bins of Amazon inventory and locate possible mismatches between the physical inventory and inventory records. With the growth of deep learning, the emphasis of new system design shifts from clever algorithms to innovative ways to harness available data.
MAC302 – Leveraging Amazon Machine Learning, Amazon Redshift, and an Amazon Simple Storage Service Data Lake for Strategic Advantage in Real Estate The Howard Hughes Corporation partnered with 47Lining to develop a managed enterprise data lake based on Amazon S3. The purpose of the managed EDL is to fuse relevant on-premises and third-party data to enable Howard Hughes to answer its most valuable business questions. Their first analysis was a lead-scoring model that uses Amazon Machine Learning (Amazon ML) to predict propensity to purchase high-end real estate. The model is based on a combined set of public and private data sources, including all publicly recorded real estate transactions in the US for the past 35 years. By changing their business process for identifying and qualifying leads to use the results of data-driven analytics from their managed data lake in AWS, Howard Hughes increased the number of identified qualified leads in their pipeline by over 400% and reduced the acquisition cost per lead by more than 10 times. In this session, you see a practical example of how to use Amazon ML to improve business results, how to architect a data lake with Amazon S3 that fuses on-premises, third-party, and public datasets, and how to train and run an Amazon ML model to attain predictions.
MAC303 – Developing Classification and Recommendation Engines with Amazon EMR and Apache Spark Customers are adopting Apache Spark‒a set of open-source distributed machine learning algorithms‒on Amazon EMR for large-scale machine learning workloads, especially for applications that power customer segmentation and content recommendation. By leveraging Spark ML, customers can quickly build and execute massively parallel machine learning jobs. Additionally, Spark applications can train models in streaming or batch contexts and can access data from Amazon S3, Amazon Kinesis, Apache Kafka, Amazon Elasticsearch Service, Amazon Redshift, and other services. This session explains how to quickly and easily create scalable Spark clusters with Amazon EMR, build and share models using Apache Zeppelin notebooks, and create a sample application using Spark Streaming, which updates models with real-time data.
MAC306 – Using MXNet for Recommendation Modeling at Scale For many companies, recommendation systems solve important machine learning problems. But as recommendation systems grow to millions of users and millions of items, they pose significant challenges when deployed at scale. The user-item matrix can have trillions of entries (or more), most of which are zero. To make common ML techniques practical, sparse data requires special techniques. Learn how to use MXNet to build neural network models for recommendation systems that can scale efficiently to large sparse datasets.
MAC307 – Predicting Customer Churn with Amazon Machine Learning In this session, we take a specific business problem—predicting Telco customer churn—and explore the practical aspects of building and evaluating an Amazon Machine Learning model. We explore considerations ranging from assigning a dollar value to applying the model using the relative cost of false positive and false negative errors. We discuss all aspects of putting Amazon ML to practical use, including how to build multiple models to choose from, put models into production, and update them. We also discuss using Amazon Redshift and Amazon S3 with Amazon ML.
Services sessions: Architecture and best practices
BDM201 – Big Data Architectural Patterns and Best Practices on AWS The world is producing an ever-increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
BDM301 – Best Practices for Apache Spark on Amazon EMR Organizations need to perform increasingly complex analysis on data — streaming analytics, ad-hoc querying, and predictive analytics — in order to get better customer insights and actionable business intelligence. Apache Spark has recently emerged as the framework of choice to address many of these challenges. In this session, we show you how to use Apache Spark on AWS to implement and scale common big data use cases such as real-time data processing, interactive data science, predictive analytics, and more. We talk about common architectures, best practices to quickly create Spark clusters using Amazon EMR, and ways to integrate Spark with other big data services in AWS.
BDM302 – Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana Elasticsearch is a fully featured search engine used for real-time analytics, and Amazon Elasticsearch Service makes it easy to deploy Elasticsearch clusters on AWS. With Amazon ES, you can ingest and process billions of events per day, and explore the data using Kibana to discover patterns. In this session, we use Apache web logs as an example and show you how to build an end-to-end analytics solution. First, we cover how to configure an Amazon ES cluster and ingest data into it using Amazon Kinesis Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data. Then we demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we dive deep into the Elasticsearch query DSL and review approaches for generating custom, ad-hoc reports.
BDM304 – Analyzing Streaming Data in Real-time with Amazon Kinesis Analytics As more and more organizations strive to gain real-time insights into their business, streaming data has become ubiquitous. Typical streaming data analytics solutions require specific skills and complex infrastructure. However, with Amazon Kinesis Analytics, you can analyze streaming data in real time with standard SQL—there is no need to learn new programming languages or processing frameworks. In this session, we dive deep into the capabilities of Amazon Kinesis Analytics using real-world examples. We’ll present an end-to-end streaming data solution using Amazon Kinesis Streams for data ingestion, Amazon Kinesis Analytics for real-time processing, and Amazon Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Amazon Kinesis Analytics applications. Lastly, we discuss how to estimate the cost of the entire system.
BDM401 – Deep Dive: Amazon EMR Best Practices & Design Patterns Amazon EMR is one of the largest Hadoop operators in the world. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features.
DAT304 – Deep Dive on Amazon DynamoDB Explore Amazon DynamoDB capabilities and benefits in detail and learn how to get the most out of your DynamoDB database. We go over best practices for schema design with DynamoDB across multiple use cases, including gaming, AdTech, IoT, and others. We explore designing efficient indexes, scanning, and querying, and go into detail on a number of recently released features, including JSON document support, DynamoDB Streams, and more. We also provide lessons learned from operating DynamoDB at scale, including provisioning DynamoDB for IoT.
BDM202 – Workshop: Building Your First Big Data Application with AWS Want to get ramped up on how to use Amazon’s big data web services and launch your first big data application on AWS? Join us in this workshop as we build a big data application in real-time using Amazon EMR, Amazon Redshift, Amazon Kinesis, Amazon DynamoDB, and Amazon S3. We review architecture design patterns for big data solutions on AWS and give you access to a take-home lab so that you can rebuild and customize the application yourself.
IOT306 – IoT Visualizations and Analytics In this workshop, we focus on visualizations of IoT data using ELK, Amazon Elasticsearch Service, Logstash, and Kibana or Amazon Kinesis. We dive into how these visualizations can give you new capabilities and understanding when interacting with your device data from the context they provide on the world around them.
MAC401 – Scalable Deep Learning Using MXNet Deep learning continues to push the state of the art in domains such as computer vision, natural language understanding, and recommendation engines. One of the key reasons for this progress is the availability of highly flexible and developer friendly deep learning frameworks. During this workshop, members of the Amazon Machine Learning team provide a short background on Deep Learning focusing on relevant application domains and an introduction to using the powerful and scalable Deep Learning framework, MXNet. At the end of this tutorial, you’ll gain hands-on experience targeting a variety of applications including computer vision and recommendation engines as well as exposure to how to use preconfigured Deep Learning AMIs and CloudFormation Templates to help speed your development.
STG312 – Workshop: Working with AWS Snowball – Accelerating Data Ingest into the Cloud This workshop provides customers with the opportunity to work hands-on with the AWS Snowball service, with attendees broken out into small teams to perform various on-premises to cloud data transfer scenarios using actual Snowball devices. These scenarios include migrating backup & archive data to S3-IA and Amazon Glacier, HDFS cluster migration to S3 for use with Amazon EMR and Amazon Redshift, and leveraging the Snowball API & SDK to build AWS Snowball service integration into a custom application. The session opens with an overview of the service, objectives, and guidance on where to find resources. Attendees should bring their own laptops and should have a basic familiarity with AWS storage services (S3 and Amazon Glacier). Prerequisites: Participants should have an AWS account established and available for use during the workshop. Please bring your own laptop.
Late last month, popular websites like Twitter, Pinterest, Reddit and PayPal went down for most of a day. The distributed denial-of-service attack that caused the outages, and the vulnerabilities that made the attack possible, were as much a failure of market and policy as they were of technology. If we want to secure our increasingly computerized and connected world, we need more government involvement in the security of the “Internet of Things” and increased regulation of what are now critical and life-threatening technologies. It’s no longer a question of if, it’s a question of when.
First, the facts. Those websites went down because their domain name provider — a company named Dyn — was forced offline. We don’t know who perpetrated that attack, but it could have easily been a lone hacker. Whoever it was launched a distributed denial-of-service attack against Dyn by exploiting a vulnerability in large numbers — possibly millions — of Internet-of-Things devices like webcams and digital video recorders, then recruiting them all into a single botnet. The botnet bombarded Dyn with traffic, so much that it went down. And when it went down, so did dozens of websites.
Your security on the Internet depends on the security of millions of Internet-enabled devices, designed and sold by companies you’ve never heard of to consumers who don’t care about your security.
The technical reason these devices are insecure is complicated, but there is a market failure at work. The Internet of Things is bringing computerization and connectivity to many tens of millions of devices worldwide. These devices will affect every aspect of our lives, because they’re things like cars, home appliances, thermostats, light bulbs, fitness trackers, medical devices, smart streetlights and sidewalk squares. Many of these devices are low-cost, designed and built offshore, then rebranded and resold. The teams building these devices don’t have the security expertise we’ve come to expect from the major computer and smartphone manufacturers, simply because the market won’t stand for the additional costs that would require. These devices don’t get security updates like our more expensive computers, and many don’t even have a way to be patched. And, unlike our computers and phones, they stay around for years and decades.
An additional market failure illustrated by the Dyn attack is that neither the seller nor the buyer of those devices cares about fixing the vulnerability. The owners of those devices don’t care. They wanted a webcam — or thermostat, or refrigerator — with nice features at a good price. Even after they were recruited into this botnet, they still work fine — you can’t even tell they were used in the attack. The sellers of those devices don’t care: They’ve already moved on to selling newer and better models. There is no market solution because the insecurity primarily affects other people. It’s a form of invisible pollution.
And, like pollution, the only solution is to regulate. The government could impose minimum security standards on IoT manufacturers, forcing them to make their devices secure even though their customers don’t care. They could impose liabilities on manufacturers, allowing companies like Dyn to sue them if their devices are used in DDoS attacks. The details would need to be carefully scoped, but either of these options would raise the cost of insecurity and give companies incentives to spend money making their devices secure.
It’s true that this is a domestic solution to an international problem and that there’s no U.S. regulation that will affect, say, an Asian-made product sold in South America, even though that product could still be used to take down U.S. websites. But the main costs in making software come from development. If the United States and perhaps a few other major markets implement strong Internet-security regulations on IoT devices, manufacturers will be forced to upgrade their security if they want to sell to those markets. And any improvements they make in their software will be available in their products wherever they are sold, simply because it makes no sense to maintain two different versions of the software. This is truly an area where the actions of a few countries can drive worldwide change.
Regardless of what you think about regulation vs. market solutions, I believe there is no choice. Governments will get involved in the IoT, because the risks are too great and the stakes are too high. Computers are now able to affect our world in a direct and physical manner.
Security researchers have demonstrated the ability to remotely take control of Internet-enabled cars. They’ve demonstrated ransomware against home thermostats and exposed vulnerabilities in implanted medical devices. They’ve hacked voting machines and power plants. In one recent paper, researchers showed how a vulnerability in smart light bulbs could be used to start a chain reaction, resulting in them all being controlled by the attackers — that’s every one in a city. Security flaws in these things could mean people dying and property being destroyed.
Nothing motivates the U.S. government like fear. Remember 2001? A small-government Republican president created the Department of Homeland Security in the wake of the 9/11 terrorist attacks: a rushed and ill-thought-out decision that we’ve been trying to fix for more than a decade. A fatal IoT disaster will similarly spur our government into action, and it’s unlikely to be well-considered and thoughtful action. Our choice isn’t between government involvement and no government involvement. Our choice is between smarter government involvement and stupider government involvement. We have to start thinking about this now. Regulations are necessary, important and complex — and they’re coming. We can’t afford to ignore these issues until it’s too late.
In general, the software market demands that products be fast and cheap and that security be a secondary consideration. That was okay when software didn’t matter — it was okay that your spreadsheet crashed once in a while. But a software bug that literally crashes your car is another thing altogether. The security vulnerabilities in the Internet of Things are deep and pervasive, and they won’t get fixed if the market is left to sort it out for itself. We need to proactively discuss good regulatory solutions; otherwise, a disaster will impose bad ones on us.
A week ago Friday, someone took down numerous popular websites in a massive distributed denial-of-service (DDoS) attack against the domain name provider Dyn. DDoS attacks are neither new nor sophisticated. The attacker sends a massive amount of traffic, causing the victim’s system to slow to a crawl and eventually crash. There are more or less clever variants, but basically, it’s a datapipe-size battle between attacker and victim. If the defender has a larger capacity to receive and process data, he or she will win. If the attacker can throw more data than the victim can process, he or she will win.
The attacker can build a giant data cannon, but that’s expensive. It is much smarter to recruit millions of innocent computers on the internet. This is the “distributed” part of the DDoS attack, and pretty much how it’s worked for decades. Cybercriminals infect innocent computers around the internet and recruit them into a botnet. They then target that botnet against a single victim.
You can imagine how it might work in the real world. If I can trick tens of thousands of others to order pizzas to be delivered to your house at the same time, I can clog up your street and prevent any legitimate traffic from getting through. If I can trick many millions, I might be able to crush your house from the weight. That’s a DDoS attack: it’s simple brute force.
As you’d expect, DDoSers have various motives. The attacks started out as a way to show off, then quickly transitioned to a method of intimidation or a way of just getting back at someone you didn’t like. More recently, they’ve become vehicles of protest. In 2013, the hacker group Anonymous petitioned the White House to recognize DDoS attacks as a legitimate form of protest. Criminals have used these attacks as a means of extortion, although one group found that just the fear of attack was enough. Military agencies are also thinking about DDoS as a tool in their cyberwar arsenals. A 2007 DDoS attack against Estonia was blamed on Russia and widely called an act of cyberwar.
The DDoS attack against Dyn two weeks ago was nothing new, but it illustrated several important trends in computer security.
These attack techniques are broadly available. Fully capable DDoS attack tools are available for free download. Criminal groups offer DDoS services for hire. The particular attack technique used against Dyn was first used a month earlier. It’s called Mirai, and since the source code was released four weeks ago, over a dozen botnets have incorporated the code.
The Dyn attacks were probably not originated by a government. The perpetrators were most likely hackers mad at Dyn for helping Brian Krebs identify, and the FBI arrest, two Israeli hackers who were running a DDoS-for-hire ring. Recently I have written about probing DDoS attacks against internet infrastructure companies that appear to be perpetrated by a nation-state. But, honestly, we don’t know for sure.
This is important. Software spreads capabilities. The smartest attacker needs to figure out the attack and write the software. After that, anyone can use it. There’s not even much of a difference between government and criminal attacks. In December 2014, there was a legitimate debate in the security community as to whether the massive attack against Sony had been perpetrated by a nation-state with a $20 billion military budget or a couple of guys in a basement somewhere. The internet is the only place where we can’t tell the difference. Everyone uses the same tools, the same techniques and the same tactics.
These attacks are getting larger. The Dyn DDoS attack set a record at 1.2 Tbps. The previous record holder was the attack against cybersecurity journalist Brian Krebs a month prior at 620 Gbps. This is much larger than required to knock the typical website offline. A year ago, it was unheard of. Now it occurs regularly.
The botnets attacking Dyn and Brian Krebs consisted largely of unsecure Internet of Things (IoT) devices: webcams, digital video recorders, routers and so on. This isn’t new, either. We’ve already seen internet-enabled refrigerators and TVs used in DDoS botnets. But again, the scale is bigger now. In 2014, the news was hundreds of thousands of IoT devices; the Dyn attack used millions. Analysts expect the IoT to increase the number of things on the internet by a factor of 10 or more. Expect these attacks to similarly increase.
The problem is that these IoT devices are unsecure and likely to remain that way. The economics of internet security don’t trickle down to the IoT. Commenting on the Krebs attack last month, I wrote:
The market can’t fix this because neither the buyer nor the seller cares. Think of all the CCTV cameras and DVRs used in the attack against Brian Krebs. The owners of those devices don’t care. Their devices were cheap to buy, they still work, and they don’t even know Brian. The sellers of those devices don’t care: They’re now selling newer and better models, and the original buyers only cared about price and features. There is no market solution because the insecurity is what economists call an externality: It’s an effect of the purchasing decision that affects other people. Think of it kind of like invisible pollution.
To be fair, one company that made some of the unsecure things used in these attacks recalled its unsecure webcams. But this is more of a publicity stunt than anything else. I would be surprised if the company got many devices back. We already know that the reputational damage from having your unsecure software made public isn’t large and doesn’t last. At this point, the market still largely rewards sacrificing security in favor of price and time-to-market.
DDoS prevention works best deep in the network, where the pipes are the largest and the capability to identify and block the attacks is the most evident. But the backbone providers have no incentive to do this. They don’t feel the pain when the attacks occur and they have no way of billing for the service when they provide it. So they let the attacks through and force the victims to defend themselves. In many ways, this is similar to the spam problem. It, too, is best dealt with in the backbone, but similar economics dump the problem onto the endpoints.
We’re unlikely to get any regulation forcing backbone companies to clean up either DDoS attacks or spam, just as we are unlikely to get any regulations forcing IoT manufacturers to make their systems secure. This is me again:
What this all means is that the IoT will remain insecure unless government steps in and fixes the problem. When we have market failures, government is the only solution. The government could impose security regulations on IoT manufacturers, forcing them to make their devices secure even though their customers don’t care. They could impose liabilities on manufacturers, allowing people like Brian Krebs to sue them. Any of these would raise the cost of insecurity and give companies incentives to spend money making their devices secure.
That leaves the victims to pay. This is where we are in much of computer security. Because the hardware, software and networks we use are so unsecure, we have to pay an entire industry to provide after-the-fact security.
There are solutions you can buy. Many companies offer DDoS protection, although they’re generally calibrated to the older, smaller attacks. We can safely assume that they’ll up their offerings, although the cost might be prohibitive for many users. Understand your risks. Buy mitigation if you need it, but understand its limitations. Know the attacks are possible and will succeed if large enough. And the attacks are getting larger all the time. Prepare for that.
Brian Krebs is a popular reporter on the cybersecurity beat. He regularly exposes cybercriminals and their tactics, and consequently is regularly a target of their ire. Last month, he wrote about an online attack-for-hire service that resulted in the arrest of the two proprietors. In the aftermath, his site was taken down by a massive DDoS attack.
In many ways, this is nothing new. Distributed denial-of-service attacks are a family of attacks that cause websites and other Internet-connected systems to crash by overloading them with traffic. The “distributed” part means that other insecure computers on the Internet — sometimes in the millions — are recruited to a botnet to unwittingly participate in the attack. The tactics are decades old; DDoS attacks are perpetrated by lone hackers trying to be annoying, criminals trying to extort money, and governments testing their tactics. There are defenses, and there are companies that offer DDoS mitigation services for hire.
Basically, it’s a size vs. size game. If the attackers can cobble together a fire hose of data bigger than the defender’s capability to cope with, they win. If the defenders can increase their capability in the face of attack, they win.
What was new about the Krebs attack was both the massive scale and the particular devices the attackers recruited. Instead of using traditional computers for their botnet, they used CCTV cameras, digital video recorders, home routers, and other embedded computers attached to the Internet as part of the Internet of Things.
Much has been written about how the IoT is wildly insecure. In fact, the software used to attack Krebs was simple and amateurish. What this attack demonstrates is that the economics of the IoT mean that it will remain insecure unless government steps in to fix the problem. This is a market failure that can’t get fixed on its own.
Our computers and smartphones are as secure as they are because there are teams of security engineers working on the problem. Companies like Microsoft, Apple, and Google spend a lot of time testing their code before it’s released, and quickly patch vulnerabilities when they’re discovered. Those companies can support such teams because those companies make a huge amount of money, either directly or indirectly, from their software — and, in part, compete on its security. This isn’t true of embedded systems like digital video recorders or home routers. Those systems are sold at a much lower margin, and are often built by offshore third parties. The companies involved simply don’t have the expertise to make them secure.
Even worse, most of these devices don’t have any way to be patched. Even though the source code to the botnet that attacked Krebs has been made public, we can’t update the affected devices. Microsoft delivers security patches to your computer once a month. Apple does it just as regularly, but not on a fixed schedule. But the only way for you to update the firmware in your home router is to throw it away and buy a new one.
The security of our computers and phones also comes from the fact that we replace them regularly. We buy new laptops every few years. We get new phones even more frequently. This isn’t true for all of the embedded IoT systems. They last for years, even decades. We might buy a new DVR every five or ten years. We replace our refrigerator every 25 years. We replace our thermostat approximately never. Already the banking industry is dealing with the security problems of Windows 95 embedded in ATMs. This same problem is going to occur all over the Internet of Things.
The market can’t fix this because neither the buyer nor the seller cares. Think of all the CCTV cameras and DVRs used in the attack against Brian Krebs. The owners of those devices don’t care. Their devices were cheap to buy, they still work, and they don’t even know Brian. The sellers of those devices don’t care: they’re now selling newer and better models, and the original buyers only cared about price and features. There is no market solution because the insecurity is what economists call an externality: it’s an effect of the purchasing decision that affects other people. Think of it kind of like invisible pollution.
What this all means is that the IoT will remain insecure unless government steps in and fixes the problem. When we have market failures, government is the only solution. The government could impose security regulations on IoT manufacturers, forcing them to make their devices secure even though their customers don’t care. They could impose liabilities on manufacturers, allowing people like Brian Krebs to sue them. Any of these would raise the cost of insecurity and give companies incentives to spend money making their devices secure.
Of course, this would only be a domestic solution to an international problem. The Internet is global, and attackers can just as easily build a botnet out of IoT devices from Asia as from the United States. Long term, we need to build an Internet that is resilient against attacks like this. But that’s a long time coming. In the meantime, you can expect more attacks that leverage insecure IoT devices.
“Pay-for-performance” in healthcare pays providers more to keep the people under their care healthier. This is a departure from fee-for-service where payments are for each service used. Pay-for-performance arrangements provide financial incentives to hospitals, physicians, and other healthcare providers to carry out improvements and achieve optimal outcomes for patients.
Eliza Corporation, a company that focuses on health engagement management, acts on behalf of healthcare organizations such as hospitals, clinics, pharmacies, and insurance companies. This allows them to engage people at the right time, with the right message, and in the right medium. By meeting them where they are in life, Eliza can capture relevant metrics and analyze the overall value provided by healthcare.
Eliza analyzes more than 200 million such outreaches per year, primarily through outbound phone calls with interactive voice responses (IVR) and other channels. For Eliza, outreach results are the questions and responses that form a decision tree, with each question and response captured as a pair:
<question, response>: <“Did you visit your physician in the last 30 days?” , “Yes”>
This type of data has been characteristic and distinctive for Eliza and poses challenges in processing and analyzing. For example, you can’t have a table with fixed columns to store the data.
The majority of data at Eliza takes the form of outreach results captured as a set of <attribute> and <attribute value> pairs. Other data sets at Eliza include structured data for the members to target for outreach. This data is received from various systems that include customers, claims data, pharmacy data, electronic medical records (EMR/EHR) data, and enrichment data. The data that Eliza depends on to keep the business running therefore varies considerably in both format and quality.
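To make the fixed-columns problem concrete, here is a minimal Python sketch of one way to hold such pairs: as long-format rows of (outreach id, question, response) rather than one column per question. The outreach ids, the second question, and the function names are hypothetical and only illustrate the shape of the data; this is not Eliza's actual storage layout.

```python
from collections import defaultdict

# Hypothetical outreach results: each outreach is a list of (question, response)
# pairs, and different outreaches may ask entirely different questions.
outreach_results = {
    "outreach-001": [
        ("Did you visit your physician in the last 30 days?", "Yes"),
        ("Did you refill your prescription?", "No"),
    ],
    "outreach-002": [
        ("Did you visit your physician in the last 30 days?", "No"),
    ],
}


def to_rows(results):
    """Flatten per-outreach pair lists into (outreach_id, question, response) rows."""
    return [
        (outreach_id, question, response)
        for outreach_id, pairs in results.items()
        for question, response in pairs
    ]


def responses_by_question(rows):
    """Group responses under each question, whichever questions happen to appear."""
    grouped = defaultdict(list)
    for _outreach_id, question, response in rows:
        grouped[question].append(response)
    return dict(grouped)


rows = to_rows(outreach_results)
summary = responses_by_question(rows)
```

Because each pair is its own row, a new question adds rows rather than a new column, so nothing about the schema has to change when the decision tree does.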
NorthBay was chosen as the big data partner to architect and implement a data infrastructure to improve the overall performance of Eliza’s process. NorthBay architected a data lake on AWS for Eliza’s use case and implemented the majority of the data lake components by following the best practice recommendations from the AWS white paper “Building a Data Lake on AWS.”
In this post, I discuss some of the practical challenges faced during the implementation of the data lake for Eliza and the corresponding details of the ways we solved these issues with AWS. The challenges we faced involved the variety of data and a need for a common view of the data.
This section highlights some of the transformations done to overcome the challenges related to data obfuscation, cleansing, and mapping.
The following architecture depicts the flow for each of these processes.
The Lambda function launches an AWS Data Pipeline orchestration process passing the relevant parameters.
The Data Pipeline process creates a transient Amazon EMR resource and submits the appropriate Hadoop job.
The Hadoop job is configured to read the relevant metadata tables from Amazon DynamoDB and AWS KMS (for encrypt/decrypt operations).
Using the metadata, the Hadoop job transforms the input data to put results in the appropriate S3 location.
When the Hadoop job is complete, an Amazon SNS topic is notified for further processing.
To meet Eliza’s needs for protecting data privacy, the following business rule was created:
When dealing with PII (Personally Identifiable Information) and PHI (Protected Health Information) data in non-production environments, the PII must be obfuscated or masked before it can be shared with the development teams.
Considering the volume and velocity, the obfuscation itself becomes a big data problem.
Eliza’s obfuscation strategy relies on creating an obfuscation rule for each of the 18-20 known PII data elements (such as names, date of birth, and telephone number). The metadata required for the obfuscation process is stored in DynamoDB. The following table shows the sample schema and data related to this process.
Some fields are obfuscated with dummy values and others with hash values. Fields that are present in the data file but not in this metadata table are not considered sensitive and are therefore not modified. Hashing keeps these fields usable as join keys across multiple data sets, because the same field is hashed with the same algorithm in every data set.
The mapper part of the process reads the metadata from DynamoDB and creates an obfuscated line by going through each field and applying the corresponding obfuscation. The KMS kmsKeyId value is used, along with the actual value, to add an additional layer of complexity for the hashing algorithm.
The obfuscation process is done per file and we chose to retain a one-to-one mapping of the original data file to the obfuscated output file.
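The per-field logic described above can be sketched as follows. This is an illustration under stated assumptions, not the original Hadoop mapper: the rule dictionary stands in for the DynamoDB metadata, the field names and pipe delimiter are hypothetical, and HMAC-SHA256 keyed with a secret is one possible way to combine a key value with the actual value (the source does not say which hashing algorithm was used).

```python
import hashlib
import hmac

# Hypothetical metadata standing in for the DynamoDB table.
OBFUSCATION_RULES = {
    "first_name": {"type": "dummy", "value": "XXXXX"},
    "phone": {"type": "hash"},
}


def obfuscate_field(name, value, rules, key):
    """Apply the rule for one field; fields without a rule pass through."""
    rule = rules.get(name)
    if rule is None:
        return value  # not listed in the metadata, so not treated as sensitive
    if rule["type"] == "dummy":
        return rule["value"]
    if rule["type"] == "hash":
        # Assumption: keyed hash, so equal inputs still join across data sets.
        return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    raise ValueError(f"unknown obfuscation type: {rule['type']}")


def obfuscate_line(line, header, rules, key, delimiter="|"):
    """Obfuscate one delimited record, field by field."""
    values = line.split(delimiter)
    return delimiter.join(
        obfuscate_field(name, value, rules, key)
        for name, value in zip(header, values)
    )


if __name__ == "__main__":
    header = ["first_name", "phone", "state"]
    print(obfuscate_line("Ann|5551234567|CA", header, OBFUSCATION_RULES, b"demo-key"))
```

Because the hash is deterministic for a given key, the same phone number produces the same token in every file, which is what keeps hashed fields joinable.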
Reducer file snippet:
The data received by Eliza is populated by disparate systems and can include free-form entries by consumers and customers. For example, a phone number can be entered as any of the following:
This brings additional challenges because the data may not arrive in a standard format. An additional process has to be in place to cleanse the data and bring it to a common format.
At Eliza, most of the field formats were already known and we were able to bring the data to a common format using the data cleansing technique mentioned later. The following table shows a sample definition and values for the metadata created in DynamoDB.
The values in the InputRegex column define how the columns in different data sources should be treated. The schema structure allows you to apply multiple data cleansing rules on the same field and specifies the order of applying the rules.
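A minimal sketch of ordered, regex-driven cleansing is shown below. The rule list stands in for the DynamoDB metadata. Only the InputRegex column name comes from the text above; the `order` and `replacement` keys, the regular expressions, and the sample phone formats are assumptions made for illustration.

```python
import re

# Hypothetical cleansing rules; several rules may target the same field.
CLEANSING_RULES = [
    # Applied second: drop a leading country code "1" from 11-digit numbers.
    {"field": "phone", "order": 2, "input_regex": r"^1(\d{10})$", "replacement": r"\1"},
    # Applied first: remove everything that is not a digit.
    {"field": "phone", "order": 1, "input_regex": r"\D", "replacement": ""},
]


def cleanse(field, value, rules):
    """Apply every rule defined for the field, in ascending rule order."""
    field_rules = sorted(
        (rule for rule in rules if rule["field"] == field),
        key=lambda rule: rule["order"],
    )
    for rule in field_rules:
        value = re.sub(rule["input_regex"], rule["replacement"], value)
    return value


if __name__ == "__main__":
    for raw in ["(555) 123-4567", "555-123-4567", "+1 555.123.4567"]:
        print(raw, "->", cleanse("phone", raw, CLEANSING_RULES))
```

Sorting by an explicit order value matters here: the country-code rule only matches after the punctuation has been stripped.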
Mapping data allows you to combine data from multiple data sources efficiently.
At Eliza, in the current implementation, we solved the problem of providing a common view across data sources with a known metadata or schema structure. For example, the fields Zipcode, Zip, zip-code, zip_code, zip4, etc. coming from different programs and data sources refer to the same piece of information called “Zip Code”. Ontology provides a process to build the common view when combining data from these different sources.
At Eliza, based on the existing processes and knowledge of the current data sets, we were able to build a data mapping to consolidate fields across data sources.
The following DynamoDB table shows a sample schema and values for storing the mapping metadata. AttributeValue in DynamoDB corresponds to the common field name that is used across multiple data sets.
This table is read one time per data source and the information is stored in the source metadata. The source metadata is, in turn, read in the data processing Hadoop jobs while consolidating and transforming the data sets.
From the given sample, MEMBERLANGUAGE, MEMBER_LANGUAGE, and PRIMARY_LANGUAGE are treated as the same attribute, “Language”. The consolidated data has only the canonical representation of each attribute, derived through “attributevalue” from the mapping metadata table.
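The consolidation step can be sketched as a simple lookup. The dictionary below stands in for the DynamoDB mapping table. The language and zip code field names come from the examples in this post, while the function name, the case-insensitive lookup, and the sample record are assumptions.

```python
# Source field name (upper-cased) -> canonical attribute name.
FIELD_MAPPING = {
    "MEMBERLANGUAGE": "Language",
    "MEMBER_LANGUAGE": "Language",
    "PRIMARY_LANGUAGE": "Language",
    "ZIPCODE": "Zip Code",
    "ZIP": "Zip Code",
    "ZIP_CODE": "Zip Code",
}


def to_canonical(record, mapping):
    """Rename the fields of one record to their canonical attribute names.

    Fields without a mapping keep their original name.
    """
    return {
        mapping.get(name.upper(), name): value
        for name, value in record.items()
    }


if __name__ == "__main__":
    source_a = {"MEMBER_LANGUAGE": "EN", "Zip": "02139"}
    source_b = {"PRIMARY_LANGUAGE": "ES", "zipcode": "94105"}
    print(to_canonical(source_a, FIELD_MAPPING))
    print(to_canonical(source_b, FIELD_MAPPING))
```

If a single record carried two source fields that map to the same canonical name, the later one would overwrite the earlier one in this sketch, so a real job would need an explicit precedence rule.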
An S3-based data lake is an architectural pattern strongly suited to the situation where data in an enterprise has a high variety and velocity with multiple consumption patterns.
Eliza handles sensitive, highly regulated healthcare information, which made it a demanding real-life case for a data lake implementation. In the course of implementing the AWS-based data platform for Eliza, we found that this kind of data raises specific challenges around obfuscation, cleansing, and mapping, and that there are effective ways to deal with them. While the implementation details are specific to healthcare, the high-level design was purposely built to be generic so that it can be applied across industries and enterprises.
If you have questions or suggestions, please comment below.
The content of this blog post will be included in an AWS Partner Webinar on Tuesday, October 18, 2016 featuring NorthBay and AWS. To register, click here.