Tag Archives: resilience

Keeping Netflix Reliable Using Prioritized Load Shedding

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/keeping-netflix-reliable-using-prioritized-load-shedding-6cc827b02f94

How viewers are able to watch their favorite show on Netflix while the infrastructure self-recovers from a system failure

By Manuel Correa, Arthur Gonigberg, and Daniel West

Getting stuck in traffic is one of the most frustrating experiences for drivers around the world. Everyone slows to a crawl, sometimes for a minor issue or sometimes for no reason at all. As engineers at Netflix, we are constantly reevaluating how to redesign traffic management. What if we knew the urgency of each traveler and could selectively route cars through, rather than making everyone wait?

In Netflix engineering, we’re driven by ensuring Netflix is there when you need it to be. Yet, as recent as last year, our systems were susceptible to metaphorical traffic jams; we had on/off circuit breakers, but no progressive way to shed load. Motivated by improving the lives of our members, we’ve introduced priority-based progressive load shedding.

The animation below shows the behavior of the Netflix viewer experience when the backend is throttling traffic based on priority. While the lower priority requests are throttled, the playback experience remains uninterrupted and the viewer is able to enjoy their title. Let’s dig into how we accomplished this.

Failure can occur due to a myriad of reasons: misbehaving clients that trigger a retry storm, an under-scaled service in the backend, a bad deployment, a network blip, or issues with the cloud provider. All such failures can put a system under unexpected load, and at some point in the past, every single one of these examples has prevented our members’ ability to play. With these incidents in mind, we set out to make Netflix more resilient with these goals:

  1. Consistently prioritize requests across device types (Mobile, Browser, and TV)
  2. Progressively throttle requests based on priority
  3. Validate assumptions by using Chaos Testing (deliberate fault injection) for requests of specific priorities

The resulting architecture that we envisioned with priority throttling and chaos testing included is captured below.

High level playback architecture with priority throttling and chaos testing

Building a request taxonomy

We decided to focus on three dimensions in order to categorize request traffic: throughput, functionality, and criticality. Based on these characteristics, traffic was classified into the following:

  • NON_CRITICAL: This traffic does not affect playback or members’ experience. Logs and background requests are examples of this type of traffic. These requests are usually high throughput which contributes to a large percentage of load in the system.
  • DEGRADED_EXPERIENCE: This traffic affects members’ experience, but not the ability to play. The traffic in this bucket is used for features like: stop and pause markers, language selection in the player, viewing history, and others.
  • CRITICAL: This traffic affects the ability to play. Members will see an error message when they hit play if the request fails.

Using attributes of the request, the API gateway service (Zuul) categorizes the requests into NON_CRITICAL, DEGRADED_EXPERIENCE and CRITICAL buckets, and computes a priority score between 1 to 100 for each request given its individual characteristics. The computation is done as a first step so that it is available for the rest of the request lifecycle.

Most of the time, the request workflow proceeds normally without taking the request priority into account. However, as with any service, sometimes we reach a point when either one of our backends is in trouble or Zuul itself is in trouble. When that happens requests with higher priority get preferential treatment. The higher priority requests will get served, while the lower priority ones will not. The implementation is analogous to a priority queue with a dynamic priority threshold. This allows Zuul to drop requests with a priority lower than the current threshold.

Finding the best place to throttle traffic

Zuul can apply load shedding in two moments during the request lifecycle: when it routes requests to a specific back-end service (service throttling) or at the time of initial request processing, which affects all back-end services (global throttling).

Service throttling

Zuul can sense when a back-end service is in trouble by monitoring the error rates and concurrent requests to that service. Those two metrics are approximate indicators of failures and latency. When the threshold percentage for one of these two metrics is crossed, we reduce load on the service by throttling traffic.

Global throttling

Another case is when Zuul itself is in trouble. As opposed to the scenario above, global throttling will affect all back-end services behind Zuul, rather than a single back-end service. The impact of this global throttling can cause much bigger problems for members. The key metrics used to trigger global throttling are CPU utilization, concurrent requests, and connection count. When any of the thresholds for those metrics are crossed, Zuul will aggressively throttle traffic to keep itself up and healthy while the system recovers. This functionality is critical: if Zuul goes down, no traffic can get through to our backend services, resulting in a total outage.

Introducing priority-based progressive load shedding

Once we had the prioritization piece in place, we were able to combine it with our load shedding mechanism to dramatically improve streaming reliability. When we’re in a bad situation (i.e. any of the thresholds above are exceeded), we progressively drop traffic, starting with the lowest priority. A cubic function is used to manage the level of throttling. If things get really, really bad the level will hit the sharp side of the curve, throttling everything.

The graph above is an example of how the cubic function is applied. As the overload percentage increases (i.e. the range between the throttling threshold and the max capacity), the priority threshold trails it very slowly: at 35%, it’s still in the mid-90s. If the system continues to degrade, we hit priority 50 at 80% exceeded and then eventually 10 at 95%, and so on.

Given that a relatively small amount of requests impact streaming availability, throttling low priority traffic may affect certain product features but will not prevent members pressing “play” and watching their favorite show. By adding progressive priority-based load shedding, Zuul can shed enough traffic to stabilize services without members noticing.

Handling retry storms

When Zuul decides to drop traffic, it sends a signal to devices to let them know that we need them to back off. It does this by indicating how many retries they can perform and what kind of time window they can perform them in. For example:

{ “maxRetries” : <max-retries>, “retryAfterSeconds”: <seconds> }

Using this backpressure mechanism, we can stop retry storms much faster than we could in the past. We automatically adjust these two dials based on the priority of the request. Requests with higher priority will retry more aggressively than lower ones, also increasing streaming availability.

Validating which requests are right for the job

To validate our request taxonomy assumptions on whether a specific request fell into the NON_CRITICAL, DEGRADED, or CRITICAL bucket, we needed a way to test the user’s experience when that request was shed. To accomplish this, we leveraged our internal failure injection tool (FIT) and created a failure injection point in Zuul that allowed us to shed any request based on a supplied priority. This enabled us to manually simulate a load shedded experience by blocking ranges of priorities for a specific device or member, giving us an idea of which requests could be safely shed without impacting the user.

Continually ensuring those requests are still right for the job

One of the goals here is to reduce members’ pain by shedding requests that are not expected to affect the user’s streaming experience. However, Netflix changes quickly and requests that were thought to be noncritical can unexpectedly become critical. In addition, Netflix has a wide variety of client devices, client versions, and ways to interact with the system. To make sure we weren’t causing members pain when throttling NON_CRITICAL requests in any of these scenarios, we leveraged our infrastructure experimentation platform ChAP.

This platform allows us to stage an A/B experiment that will allocate a small number of production users to either a control or treatment group for 45 minutes while throttling a range of priorities for the treatment group. This lets us capture a variety of live use cases and measure the impact to their playback experience. ChAP analyzes the members’ KPIs per device to determine if there is a deviation between the control and the treatment groups.

In our first experiment, we detected a race condition in both Android and iOS devices for a low priority request that caused sporadic playback errors. Since we practice continuous experimentation, once the initial experiments were run and the bugs were fixed, we scheduled them to run on a periodic basis. This allows us to detect regressions early and keep users streaming.

Experiment regression detection before and after fix (SPS indicates streaming availability)

Reaping the benefits

In 2019, before progressive load shedding was in place, the Netflix streaming services experienced an outage that resulted in a sizable percentage of members who were not able to play for a period of time. In 2020, days after the implementation was deployed, the team started seeing the benefit of the solution. Netflix experienced a similar issue with the same potential impact as the outage seen in 2019. Unlike then, Zuul’s progressive load shedding kicked in and started shedding traffic until the service was in a healthy state without impacting members’ ability to play at all.

The graph below shows a stable streaming availability metric stream per second (SPS) while Zuul is performing progressive load shedding based on request priority during the incident. The different colors in the graph represent requests with different priority being throttled.

Members were happily watching their favorite show on Netflix while the infrastructure was self-recovering from a system failure.

We are not done yet

For future work, the team is looking into expanding the use of request priority for other use cases like better retry policies between devices and back-ends, dynamically changing load shedding thresholds, tuning the request priorities using Chaos Testing as a guiding principle, and other areas that will make Netflix even more resilient.

If you’re interested in helping Netflix stay up in the face of shifting systems and unexpected failures, reach out to us. We’re hiring!


Keeping Netflix Reliable Using Prioritized Load Shedding was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

AWS Online Tech Talks – May and Early June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-may-and-early-june-2018/

AWS Online Tech Talks – May and Early June 2018  

Join us this month to learn about some of the exciting new services and solution best practices at AWS. We also have our first re:Invent 2018 webinar series, “How to re:Invent”. Sign up now to learn more, we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

Analytics & Big Data

May 21, 2018 | 11:00 AM – 11:45 AM PT Integrating Amazon Elasticsearch with your DevOps Tooling – Learn how you can easily integrate Amazon Elasticsearch Service into your DevOps tooling and gain valuable insight from your log data.

May 23, 2018 | 11:00 AM – 11:45 AM PTData Warehousing and Data Lake Analytics, Together – Learn how to query data across your data warehouse and data lake without moving data.

May 24, 2018 | 11:00 AM – 11:45 AM PTData Transformation Patterns in AWS – Discover how to perform common data transformations on the AWS Data Lake.

Compute

May 29, 2018 | 01:00 PM – 01:45 PM PT – Creating and Managing a WordPress Website with Amazon Lightsail – Learn about Amazon Lightsail and how you can create, run and manage your WordPress websites with Amazon’s simple compute platform.

May 30, 2018 | 01:00 PM – 01:45 PM PTAccelerating Life Sciences with HPC on AWS – Learn how you can accelerate your Life Sciences research workloads by harnessing the power of high performance computing on AWS.

Containers

May 24, 2018 | 01:00 PM – 01:45 PM PT – Building Microservices with the 12 Factor App Pattern on AWS – Learn best practices for building containerized microservices on AWS, and how traditional software design patterns evolve in the context of containers.

Databases

May 21, 2018 | 01:00 PM – 01:45 PM PTHow to Migrate from Cassandra to Amazon DynamoDB – Get the benefits, best practices and guides on how to migrate your Cassandra databases to Amazon DynamoDB.

May 23, 2018 | 01:00 PM – 01:45 PM PT5 Hacks for Optimizing MySQL in the Cloud – Learn how to optimize your MySQL databases for high availability, performance, and disaster resilience using RDS.

DevOps

May 23, 2018 | 09:00 AM – 09:45 AM PT.NET Serverless Development on AWS – Learn how to build a modern serverless application in .NET Core 2.0.

Enterprise & Hybrid

May 22, 2018 | 11:00 AM – 11:45 AM PTHybrid Cloud Customer Use Cases on AWS – Learn how customers are leveraging AWS hybrid cloud capabilities to easily extend their datacenter capacity, deliver new services and applications, and ensure business continuity and disaster recovery.

IoT

May 31, 2018 | 11:00 AM – 11:45 AM PTUsing AWS IoT for Industrial Applications – Discover how you can quickly onboard your fleet of connected devices, keep them secure, and build predictive analytics with AWS IoT.

Machine Learning

May 22, 2018 | 09:00 AM – 09:45 AM PTUsing Apache Spark with Amazon SageMaker – Discover how to use Apache Spark with Amazon SageMaker for training jobs and application integration.

May 24, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS DeepLens – Learn how AWS DeepLens provides a new way for developers to learn machine learning by pairing the physical device with a broad set of tutorials, examples, source code, and integration with familiar AWS services.

Management Tools

May 21, 2018 | 09:00 AM – 09:45 AM PTGaining Better Observability of Your VMs with Amazon CloudWatch – Learn how CloudWatch Agent makes it easy for customers like Rackspace to monitor their VMs.

Mobile

May 29, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive on Amazon Pinpoint Segmentation and Endpoint Management – See how segmentation and endpoint management with Amazon Pinpoint can help you target the right audience.

Networking

May 31, 2018 | 09:00 AM – 09:45 AM PTMaking Private Connectivity the New Norm via AWS PrivateLink – See how PrivateLink enables service owners to offer private endpoints to customers outside their company.

Security, Identity, & Compliance

May 30, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS Certificate Manager Private Certificate Authority (CA) – Learn how AWS Certificate Manager (ACM) Private Certificate Authority (CA), a managed private CA service, helps you easily and securely manage the lifecycle of your private certificates.

June 1, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS Firewall Manager – Centrally configure and manage AWS WAF rules across your accounts and applications.

Serverless

May 22, 2018 | 01:00 PM – 01:45 PM PTBuilding API-Driven Microservices with Amazon API Gateway – Learn how to build a secure, scalable API for your application in our tech talk about API-driven microservices.

Storage

May 30, 2018 | 11:00 AM – 11:45 AM PTAccelerate Productivity by Computing at the Edge – Learn how AWS Snowball Edge support for compute instances helps accelerate data transfers, execute custom applications, and reduce overall storage costs.

June 1, 2018 | 11:00 AM – 11:45 AM PTLearn to Build a Cloud-Scale Website Powered by Amazon EFS – Technical deep dive where you’ll learn tips and tricks for integrating WordPress, Drupal and Magento with Amazon EFS.

 

 

 

 

Bad Software Is Our Fault

Post Syndicated from Bozho original https://techblog.bozho.net/bad-software-is-our-fault/

Bad software is everywhere. One can even claim that every software is bad. Cool companies, tech giants, established companies, all produce bad software. And no, yours is not an exception.

Who’s to blame for bad software? It’s all complicated and many factors are intertwined – there’s business requirements, there’s organizational context, there’s lack of sufficient skilled developers, there’s the inherent complexity of software development, there’s leaky abstractions, reliance on 3rd party software, consequences of wrong business and purchase decisions, time limitations, flawed business analysis, etc. So yes, despite the catchy title, I’m aware it’s actually complicated.

But in every “it’s complicated” scenario, there’s always one or two factors that are decisive. All of them contribute somehow, but the major drivers are usually a handful of things. And in the case of base software, I think it’s the fault of technical people. Developers, architects, ops.

We don’t seem to care about best practices. And I’ll do some nasty generalizations here, but bear with me. We can spend hours arguing about tabs vs spaces, curly bracket on new line, git merge vs rebase, which IDE is better, which framework is better and other largely irrelevant stuff. But we tend to ignore the important aspects that span beyond the code itself. The context in which the code lives, the non-functional requirements – robustness, security, resilience, etc.

We don’t seem to get security. Even trivial stuff such as user authentication is almost always implemented wrong. These days Twitter and GitHub realized they have been logging plain-text passwords, for example, but that’s just the tip of the iceberg. Too often we ignore the security implications.

“But the business didn’t request the security features”, one may say. The business never requested 2-factor authentication, encryption at rest, PKI, secure (or any) audit trail, log masking, crypto shredding, etc., etc. Because the business doesn’t know these things – we do and we have to put them on the backlog and fight for them to be implemented. Each organization has its specifics and tech people can influence the backlog in different ways, but almost everywhere we can put things there and prioritize them.

The other aspect is testing. We should all be well aware by now that automated testing is mandatory. We have all the tools in the world for unit, functional, integration, performance and whatnot testing, and yet many software projects lack the necessary test coverage to be able to change stuff without accidentally breaking things. “But testing takes time, we don’t have it”. We are perfectly aware that testing saves time, as we’ve all had those “not again!” recurring bugs. And yet we think of all sorts of excuses – “let the QAs test it”, we have to ship that now, we’ll test it later”, “this is too trivial to be tested”, etc.

And you may say it’s not our job. We don’t define what has do be done, we just do it. We don’t define the budget, the scope, the features. We just write whatever has been decided. And that’s plain wrong. It’s not our job to make money out of our code, and it’s not our job to define what customers need, but apart from that everything is our job. The way the software is structured, the security aspects and security features, the stability of the code base, the way the software behaves in different environments. The non-functional requirements are our job, and putting them on the backlog is our job.

You’ve probably heard that every software becomes “legacy” after 6 months. And that’s because of us, our sloppiness, our inability to mitigate external factors and constraints. Too often we create a mess through “just doing our job”.

And of course that’s a generalization. I happen to know a lot of great professionals who don’t make these mistakes, who strive for excellence and implement things the right way. But our industry as a whole doesn’t. Our industry as a whole produces bad software. And it’s our fault, as developers – as the only people who know why a certain piece of software is bad.

In a talk of his, Bob Martin warns us of the risks of our sloppiness. We have been building websites so far, but we are more and more building stuff that interacts with the real world, directly and indirectly. Ultimately, lives may depend on our software (like the recent unfortunate death caused by a self-driving car). And I’ll agree with Uncle Bob that it’s high time we self-regulate as an industry, before some technically incompetent politician decides to do that.

How, I don’t know. We’ll have to think more about it. But I’m pretty sure it’s our fault that software is bad, and no amount of blaming the management, the budget, the timing, the tools or the process can eliminate our responsibility.

Why do I insist on bashing my fellow software engineers? Because if we start looking at software development with more responsibility; with the fact that if it fails, it’s our fault, then we’re more likely to get out of our current bug-ridden, security-flawed, fragile software hole and really become the experts of the future.

The post Bad Software Is Our Fault appeared first on Bozho's tech blog.

The Digital Security Exchange Is Live

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/04/the_digital_sec.html

Last year I wrote about the Digital Security Exchange. The project is live:

The DSX works to strengthen the digital resilience of U.S. civil society groups by improving their understanding and mitigation of online threats.

We do this by pairing civil society and social sector organizations with credible and trustworthy digital security experts and trainers who can help them keep their data and networks safe from exposure, exploitation, and attack. We are committed to working with community-based organizations, legal and journalistic organizations, civil rights advocates, local and national organizers, and public and high-profile figures who are working to advance social, racial, political, and economic justice in our communities and our world.

If you are either an organization who needs help, or an expert who can provide help, visit their website.

Note: I am on their advisory committee.

What We’re Thankful For

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/what-were-thankful-for/

All of us at Backblaze hope you have a wonderful Thanksgiving, and that you can enjoy it with family and friends. We asked everyone at Backblaze to express what they are thankful for. Here are their responses.

Fall leaves

What We’re Thankful For

Aside from friends, family, hobbies, health, etc. I’m thankful for my home. It’s not much, but it’s mine, and allows me to indulge in everything listed above. Or not, if I so choose. And coffee.

— Tony

I’m thankful for my wife Jen, and my other friends. I’m thankful that I like my coworkers and can call them friends too. I’m thankful for my health. I’m thankful that I was born into a middle class family in the US and that I have been very, very lucky because of that.

— Adam

Besides the most important things which are being thankful for my family, my health and my friends, I am very thankful for Backblaze. This is the first job I’ve ever had where I truly feel like I have a great work/life balance. With having 3 kids ages 8, 6 and 4, a husband that works crazy hours and my tennis career on the rise (kidding but I am on 4 teams) it’s really nice to feel like I have balance in my life. So cheers to Backblaze – where a girl can have it all!

— Shelby

I am thankful to work at a high-tech company that recognizes the contributions of engineers in their 40s and 50s.

— Jeannine

I am thankful for the music, the songs I’m singing. Thankful for all the joy they’re bringing. Who can live without it, I ask in all honesty? What would life be? Without a song or a dance what are we? So I say thank you for the music. For giving it to me!

— Yev

I’m thankful that I don’t look anything like the portrait my son draws of me…seriously.

— Natalie

I am thankful to work for a company that puts its people and product ahead of profits.

— James

I am thankful that even in the middle of disasters, turmoil, and violence, there are always people who commit amazing acts of generosity, courage, and kindness that restore my faith in mankind.

— Roderick

The future.

— Ahin

The Future

I am thankful for the current state of modern inexpensive broadband networking that allows me to stay in touch with friends and family that are far away, allows Backblaze to exist and pay my salary so I can live comfortably, and allows me to watch cat videos for free. The internet makes this an amazing time to be alive.

— Brian

Other than being thankful for family & good health, I’m quite thankful through the years I’ve avoided losing any of my 12+TB photo archive. 20 years of photoshoots, family photos and cell phone photos kept safe through changing storage media (floppy drives, flopticals, ZIP, JAZ, DVD-RAM, CD, DVD and hard drives), not to mention various technology/software solutions. It’s a data minefield out there, especially in the long run with changing media formats.

— Jim

I am thankful for non-profit organizations and their volunteers, such as IMAlive. Possibly the greatest gift you can give someone is empowerment, and an opportunity for them to recognize their own resilience and strength.

— Emily

I am thankful for my loving family, friends who make me laugh, a cool company to work for, talented co-workers who make me a better engineer, and beautiful Fall days in Wisconsin!

— Marjorie

Marjorie Wisconsin

I’m thankful for preschool drawings about thankfulness.

— Adam

I am thankful for new friends and working for a company that allows us to be ourselves.

— Annalisa

I’m thankful for my dog as I always find a reason to smile at him everyday. Yes, he still smells from his skunkin’ last week and he tracks mud in my house, but he came from the San Quentin puppy-prisoner program and I’m thankful I found him and that he found me! My vet is thankful as well.

— Terry

I’m thankful that my colleagues are also my friends outside of the office and that the rain season has started in California.

— Aaron

I’m thankful for family, friends, and beer. Mostly for family and friends, but beer is really nice too!

— Ken

There are so many amazing blessings that make up my daily life that I thank God for, so here I go – my basic needs of food, water and shelter, my husband and 2 daughters and the rest of the family (here and abroad) — their love, support, health, and safety, waking up to a new day every day, friends, music, my job, funny things, hugs and more hugs (who does not like hugs?).

— Cecilia

I am thankful to be blessed with a close-knit extended family, and for everything they do for my new, growing family. With a toddler and a second child on the way, it helps having so many extra sets of hands around to help with the kids!

— Zack

I’m thankful for family and friends, the opportunities my parents gave me by moving the U.S., and that all of us together at Backblaze have built a place to be proud of.

— Gleb

Aside for being thankful for family and friends, I am also thankful I live in a place with such natural beauty. Being so close to mountains and the ocean, and everything in between, is something that I don’t take for granted!

— Sona

I’m thankful for my wonderful wife, family, friends, and co-workers. I’m thankful for having a happy and healthy son, and the chance to watch him grow on a daily basis.

— Ariel

I am thankful for a dog-friendly workplace.

— LeAnn

I’m thankful for my amazing new wife and that she’s as much of a nerd as I am.

— Troy

I am thankful for every reunion with my siblings and families.

— Cecilia

I am thankful for my funny, strong-willed, happy daughter, my awesome husband, my family, and amazing friends. I am also thankful for the USA and all the opportunities that come with living here. Finally, I am thankful for Backblaze, a truly great place to work and for all of my co-workers/friends here.

— Natasha

I am thankful that I do not need to hunt and gather everyday to put food on the table but at the same time I feel that I don’t appreciate the food the sits before me as much as I should. So I use Thanksgiving to think about the people and the animals that put food on my family’s table.

— KC

I am thankful for my cat, Catnip. She’s been with me for 18 years and seen me through so many ups and downs. She’s been along my side through two long-term relationships, several moves, and one marriage. I know we don’t have much time together and feel blessed every day she’s here.

— JC

I am thankful for imperfection and misshapen candies. The imperceptible romance of sunsets through bus windows. The dream that family, friends, co-workers, and strangers are connected by love. I am thankful to my ancestors for enduring so much hardship so that I could be here enjoying Bay Area burritos.

— Damon

Autumn leaves

The post What We’re Thankful For appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Event-Driven Computing with Amazon SNS and AWS Compute, Storage, Database, and Networking Services

Post Syndicated from Christie Gifrin original https://aws.amazon.com/blogs/compute/event-driven-computing-with-amazon-sns-compute-storage-database-and-networking-services/

Contributed by Otavio Ferreira, Manager, Software Development, AWS Messaging

Like other developers around the world, you may be tackling increasingly complex business problems. A key success factor, in that case, is the ability to break down a large project scope into smaller, more manageable components. A service-oriented architecture guides you toward designing systems as a collection of loosely coupled, independently scaled, and highly reusable services. Microservices take this even further. To improve performance and scalability, they promote fine-grained interfaces and lightweight protocols.

However, the communication among isolated microservices can be challenging. Services are often deployed onto independent servers and don’t share any compute or storage resources. Also, you should avoid hard dependencies among microservices, to preserve maintainability and reusability.

If you apply the pub/sub design pattern, you can effortlessly decouple and independently scale out your microservices and serverless architectures. A pub/sub messaging service, such as Amazon SNS, promotes event-driven computing that statically decouples event publishers from subscribers, while dynamically allowing for the exchange of messages between them. An event-driven architecture also introduces the responsiveness needed to deal with complex problems, which are often unpredictable and asynchronous.

What is event-driven computing?

Given the context of microservices, event-driven computing is a model in which subscriber services automatically perform work in response to events triggered by publisher services. This paradigm can be applied to automate workflows while decoupling the services that collectively and independently work to fulfil these workflows. Amazon SNS is an event-driven computing hub, in the AWS Cloud, that has native integration with several AWS publisher and subscriber services.

Which AWS services publish events to SNS natively?

Several AWS services have been integrated as SNS publishers and, therefore, can natively trigger event-driven computing for a variety of use cases. In this post, I specifically cover AWS compute, storage, database, and networking services, as depicted below.

Compute services

  • Auto Scaling: Helps you ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. You can configure Auto Scaling lifecycle hooks to trigger events, as Auto Scaling resizes your EC2 cluster.As an example, you may want to warm up the local cache store on newly launched EC2 instances, and also download log files from other EC2 instances that are about to be terminated. To make this happen, set an SNS topic as your Auto Scaling group’s notification target, then subscribe two Lambda functions to this SNS topic. The first function is responsible for handling scale-out events (to warm up cache upon provisioning), whereas the second is in charge of handling scale-in events (to download logs upon termination).

  • AWS Elastic Beanstalk: An easy-to-use service for deploying and scaling web applications and web services developed in a number of programming languages. You can configure event notifications for your Elastic Beanstalk environment so that notable events can be automatically published to an SNS topic, then pushed to topic subscribers.As an example, you may use this event-driven architecture to coordinate your continuous integration pipeline (such as Jenkins CI). That way, whenever an environment is created, Elastic Beanstalk publishes this event to an SNS topic, which triggers a subscribing Lambda function, which then kicks off a CI job against your newly created Elastic Beanstalk environment.

  • Elastic Load Balancing: Automatically distributes incoming application traffic across Amazon EC2 instances, containers, or other resources identified by IP addresses.You can configure CloudWatch alarms on Elastic Load Balancing metrics, to automate the handling of events derived from Classic Load Balancers. As an example, you may leverage this event-driven design to automate latency profiling in an Amazon ECS cluster behind a Classic Load Balancer. In this example, whenever your ECS cluster breaches your load balancer latency threshold, an event is posted by CloudWatch to an SNS topic, which then triggers a subscribing Lambda function. This function runs a task on your ECS cluster to trigger a latency profiling tool, hosted on the cluster itself. This can enhance your latency troubleshooting exercise by making it timely.

Storage services

  • Amazon S3: Object storage built to store and retrieve any amount of data.You can enable S3 event notifications, and automatically get them posted to SNS topics, to automate a variety of workflows. For instance, imagine that you have an S3 bucket to store incoming resumes from candidates, and a fleet of EC2 instances to encode these resumes from their original format (such as Word or text) into a portable format (such as PDF).In this example, whenever new files are uploaded to your input bucket, S3 publishes these events to an SNS topic, which in turn pushes these messages into subscribing SQS queues. Then, encoding workers running on EC2 instances poll these messages from the SQS queues; retrieve the original files from the input S3 bucket; encode them into PDF; and finally store them in an output S3 bucket.

  • Amazon EFS: Provides simple and scalable file storage, for use with Amazon EC2 instances, in the AWS Cloud.You can configure CloudWatch alarms on EFS metrics, to automate the management of your EFS systems. For example, consider a highly parallelized genomics analysis application that runs against an EFS system. By default, this file system is instantiated on the “General Purpose” performance mode. Although this performance mode allows for lower latency, it might eventually impose a scaling bottleneck. Therefore, you may leverage an event-driven design to handle it automatically.Basically, as soon as the EFS metric “Percent I/O Limit” breaches 95%, CloudWatch could post this event to an SNS topic, which in turn would push this message into a subscribing Lambda function. This function automatically creates a new file system, this time on the “Max I/O” performance mode, then switches the genomics analysis application to this new file system. As a result, your application starts experiencing higher I/O throughput rates.

  • Amazon Glacier: A secure, durable, and low-cost cloud storage service for data archiving and long-term backup.You can set a notification configuration on an Amazon Glacier vault so that when a job completes, a message is published to an SNS topic. Retrieving an archive from Amazon Glacier is a two-step asynchronous operation, in which you first initiate a job, and then download the output after the job completes. Therefore, SNS helps you eliminate polling your Amazon Glacier vault to check whether your job has been completed, or not. As usual, you may subscribe SQS queues, Lambda functions, and HTTP endpoints to your SNS topic, to be notified when your Amazon Glacier job is done.

  • AWS Snowball: A petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data.You can leverage Snowball notifications to automate workflows related to importing data into and exporting data from AWS. More specifically, whenever your Snowball job status changes, Snowball can publish this event to an SNS topic, which in turn can broadcast the event to all its subscribers.As an example, imagine a Geographic Information System (GIS) that distributes high-resolution satellite images to users via Web browser. In this example, the GIS vendor could capture up to 80 TB of satellite images; create a Snowball job to import these files from an on-premises system to an S3 bucket; and provide an SNS topic ARN to be notified upon job status changes in Snowball. After Snowball changes the job status from “Importing” to “Completed”, Snowball publishes this event to the specified SNS topic, which delivers this message to a subscribing Lambda function, which finally creates a CloudFront web distribution for the target S3 bucket, to serve the images to end users.

Database services

  • Amazon RDS: Makes it easy to set up, operate, and scale a relational database in the cloud.RDS leverages SNS to broadcast notifications when RDS events occur. As usual, these notifications can be delivered via any protocol supported by SNS, including SQS queues, Lambda functions, and HTTP endpoints.As an example, imagine that you own a social network website that has experienced organic growth, and needs to scale its compute and database resources on demand. In this case, you could provide an SNS topic to listen to RDS DB instance events. When the “Low Storage” event is published to the topic, SNS pushes this event to a subscribing Lambda function, which in turn leverages the RDS API to increase the storage capacity allocated to your DB instance. The provisioning itself takes place within the specified DB maintenance window.

  • Amazon ElastiCache: A web service that makes it easy to deploy, operate, and scale an in-memory data store or cache in the cloud.ElastiCache can publish messages using Amazon SNS when significant events happen on your cache cluster. This feature can be used to refresh the list of servers on client machines connected to individual cache node endpoints of a cache cluster. For instance, an ecommerce website fetches product details from a cache cluster, with the goal of offloading a relational database and speeding up page load times. Ideally, you want to make sure that each web server always has an updated list of cache servers to which to connect.To automate this node discovery process, you can get your ElastiCache cluster to publish events to an SNS topic. Thus, when ElastiCache event “AddCacheNodeComplete” is published, your topic then pushes this event to all subscribing HTTP endpoints that serve your ecommerce website, so that these HTTP servers can update their list of cache nodes.

  • Amazon Redshift: A fully managed data warehouse that makes it simple to analyze data using standard SQL and BI (Business Intelligence) tools.Amazon Redshift uses SNS to broadcast relevant events so that data warehouse workflows can be automated. As an example, imagine a news website that sends clickstream data to a Kinesis Firehose stream, which then loads the data into Amazon Redshift, so that popular news and reading preferences might be surfaced on a BI tool. At some point though, this Amazon Redshift cluster might need to be resized, and the cluster enters a ready-only mode. Hence, this Amazon Redshift event is published to an SNS topic, which delivers this event to a subscribing Lambda function, which finally deletes the corresponding Kinesis Firehose delivery stream, so that clickstream data uploads can be put on hold.At a later point, after Amazon Redshift publishes the event that the maintenance window has been closed, SNS notifies a subscribing Lambda function accordingly, so that this function can re-create the Kinesis Firehose delivery stream, and resume clickstream data uploads to Amazon Redshift.

  • AWS DMS: Helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database.DMS also uses SNS to provide notifications when DMS events occur, which can automate database migration workflows. As an example, you might create data replication tasks to migrate an on-premises MS SQL database, composed of multiple tables, to MySQL. Thus, if replication tasks fail due to incompatible data encoding in the source tables, these events can be published to an SNS topic, which can push these messages into a subscribing SQS queue. Then, encoders running on EC2 can poll these messages from the SQS queue, encode the source tables into a compatible character set, and restart the corresponding replication tasks in DMS. This is an event-driven approach to a self-healing database migration process.

Networking services

  • Amazon Route 53: A highly available and scalable cloud-based DNS (Domain Name System). Route 53 health checks monitor the health and performance of your web applications, web servers, and other resources.You can set CloudWatch alarms and get automated Amazon SNS notifications when the status of your Route 53 health check changes. As an example, imagine an online payment gateway that reports the health of its platform to merchants worldwide, via a status page. This page is hosted on EC2 and fetches platform health data from DynamoDB. In this case, you could configure a CloudWatch alarm for your Route 53 health check, so that when the alarm threshold is breached, and the payment gateway is no longer considered healthy, then CloudWatch publishes this event to an SNS topic, which pushes this message to a subscribing Lambda function, which finally updates the DynamoDB table that populates the status page. This event-driven approach avoids any kind of manual update to the status page visited by merchants.

  • AWS Direct Connect (AWS DX): Makes it easy to establish a dedicated network connection from your premises to AWS, which can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.You can monitor physical DX connections using CloudWatch alarms, and send SNS messages when alarms change their status. As an example, when a DX connection state shifts to 0 (zero), indicating that the connection is down, this event can be published to an SNS topic, which can fan out this message to impacted servers through HTTP endpoints, so that they might reroute their traffic through a different connection instead. This is an event-driven approach to connectivity resilience.

More event-driven computing on AWS

In addition to SNS, event-driven computing is also addressed by Amazon CloudWatch Events, which delivers a near real-time stream of system events that describe changes in AWS resources. With CloudWatch Events, you can route each event type to one or more targets, including:

Many AWS services publish events to CloudWatch. As an example, you can get CloudWatch Events to capture events on your ETL (Extract, Transform, Load) jobs running on AWS Glue and push failed ones to an SQS queue, so that you can retry them later.

Conclusion

Amazon SNS is a pub/sub messaging service that can be used as an event-driven computing hub to AWS customers worldwide. By capturing events natively triggered by AWS services, such as EC2, S3 and RDS, you can automate and optimize all kinds of workflows, namely scaling, testing, encoding, profiling, broadcasting, discovery, failover, and much more. Business use cases presented in this post ranged from recruiting websites, to scientific research, geographic systems, social networks, retail websites, and news portals.

Start now by visiting Amazon SNS in the AWS Management Console, or by trying the AWS 10-Minute Tutorial, Send Fan-out Event Notifications with Amazon SNS and Amazon SQS.

 

New – Stop & Resume Workloads on EC2 Spot Instances

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-stop-resume-workloads-on-ec2-spot-instances/

EC2 Spot Instances give you access to spare EC2 compute capacity at up to 90% off of the On-Demand rates. Starting with the ability to request a specific number of instances of a particular size, we made Spot Instances even more useful and flexible with support for Spot Fleets and Auto Scaling Spot Fleets, allowing you to maintain any desired level of compute capacity.

EC2 users have long had the ability to stop running instances while leaving EBS volumes attached, opening the door to applications that automatically pick up where they left off when the instance starts running again.

Stop and Resume Spot Workloads
Today we are blending these two important features, allowing you to set up Spot bids and Spot Fleets that respond by stopping (rather than terminating) instances when capacity is no longer available at or below your bid price. EBS volumes attached to stopped instances remain intact, as does the EBS-backed root volume. When capacity becomes available, the instances are started and can keep on going without having to spend time provisioning applications, setting up EBS volumes, downloading data, joining network domains, and so forth.

Many AWS customers have enhanced their applications to create and make use of checkpoints, adding some resilience and gaining the ability to take advantage of EC2’s start/stop feature in the process. These customers will now be able to run these applications on Spot Instances, with savings that average 70-90%.

While the instances are stopped, you can modify the EBS Optimization, User data, Ramdisk ID, and Delete on Termination attributes. Stopped Spot Instances do not incur any charges for compute time; space for attached EBS volumes is charged at the usual rates.

Here’s how you create a Spot bid or Spot Fleet and specify the use of stop/start:

Things to Know
This feature is available now and you can start using it today in all AWS Regions where Spot Instances are available. It is designed to work well in conjunction with the new per-second billing for EC2 instances and EBS volumes, with the potential for another dimension of cost savings over and above that provided by Spot Instances.

EBS volumes always exist within a particular Availability Zone (AZ). As a result, Spot and Spot Fleet requests that specify a particular AZ will always restart in that AZ.

Take care when using this feature in conjunction with Spot Fleets that have the potential to span a wide variety of instance types. Because the composition of the fleet can change over time, you need to pay attention to your account’s limits for IP addresses and EBS volumes.

I’m looking forward to hearing about the new and creative uses that you’ll come up with for this feature. If you thought that your application was not a good fit for Spot Instances, or if the overhead needed to handle interruptions was too high, it is time to take another look!

Jeff;

 

Prime Day 2017 – Powered by AWS

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/prime-day-2017-powered-by-aws/

The third annual Prime Day set another round of records for global orders, topping Black Friday and Cyber Monday, making it the biggest day in Amazon retail history. Over the course of the 30 hour event, tens of millions of Prime members purchased things like Echo Dots, Fire tablets, programmable pressure cookers, espresso machines, rechargeable batteries, and much more! July 11th also set a record for the number of new Prime memberships, as people signed up in order to take advantage of hundreds of thousands of deals. Amazon customers shopped online and made heavy use of the Amazon App, with mobile orders more than doubling from last Prime Day.

Powered by AWS
Last year I told you about How AWS Powered Amazon’s Biggest Day Ever, and shared what the team had learned with regard to preparation, automation, monitoring, and thinking big. All of those lessons still apply and you can read that post to learn more. Preparation for this year’s Prime Day (which started just days after Prime Day 2016 wrapped up) started by collecting and sharing best practices and identifying areas for improvement, proceeding to implementation and stress testing as the big day approached. Two of the best practices involve auditing and GameDay:

Auditing – This is a formal way for us to track preparations, identify risks, and to track progress against our objectives. Each team must respond to a series of detailed technical and operational questions that are designed to help them determine their readiness. On the technical side, questions could revolve around time to recovery after a database failure, including the all-important check of the TTL (time to live) for the CNAME. Operational questions address schedules for on-call personnel, points of contact, and ownership of services & instances.

GameDay – This practice (which I believe originated with former Amazonian Jesse Robbins), is intended to validate all of the capacity planning & preparation and to verify that all of the necessary operational practices are in place and work as expected. It introduces simulated failures and helps to train the team to identify and quickly resolve issues, building muscle memory in the process. It also tests failover and recovery capabilities, and can expose latent defects that are lurking under the covers. GameDays help teams to understand scaling drivers (page views, orders, and so forth) and gives them an opportunity to test their scaling practices. To learn more, read Resilience Engineering: Learning to Embrace Failure or watch the video: GameDay: Creating Resiliency Through Destruction.

Prime Day 2017 Metrics
So, how did we do this year?

The AWS teams checked their dashboards and log files, and were happy to share their metrics with me. Here are a few of the most interesting ones:

Block Storage – Use of Amazon Elastic Block Store (EBS) grew by 40% year-over-year, with aggregate data transfer jumping to 52 petabytes (a 50% increase) for the day and total I/O requests rising to 835 million (a 30% increase). The team told me that they loved the elasticity of EBS, and that they were able to ramp down on capacity after Prime Day concluded instead of being stuck with it.

NoSQL Database – Amazon DynamoDB requests from Alexa, the Amazon.com sites, and the Amazon fulfillment centers totaled 3.34 trillion, peaking at 12.9 million per second. According to the team, the extreme scale, consistent performance, and high availability of DynamoDB let them meet needs of Prime Day without breaking a sweat.

Stack Creation – Nearly 31,000 AWS CloudFormation stacks were created for Prime Day in order to bring additional AWS resources on line.

API Usage – AWS CloudTrail processed over 50 billion events and tracked more than 419 billion calls to various AWS APIs, all in support of Prime Day.

Configuration TrackingAWS Config generated over 14 million Configuration items for AWS resources.

You Can Do It
Running an event that is as large, complex, and mission-critical as Prime Day takes a lot of planning. If you have an event of this type in mind, please take a look at our new Infrastructure Event Readiness white paper. Inside, you will learn how to design and provision your applications to smoothly handle planned scaling events such as product launches or seasonal traffic spikes, with sections on automation, resiliency, cost optimization, event management, and more.

Jeff;

 

How to Increase the Redundancy and Performance of Your AWS Directory Service for Microsoft AD Directory by Adding Domain Controllers

Post Syndicated from Peter Pereira original https://aws.amazon.com/blogs/security/how-to-increase-the-redundancy-and-performance-of-your-aws-directory-service-for-microsoft-ad-directory-by-adding-domain-controllers/

You can now increase the redundancy and performance of your AWS Directory Service for Microsoft Active Directory (Enterprise Edition), also known as AWS Microsoft AD, directory by deploying additional domain controllers. Adding domain controllers increases redundancy, resulting in even greater resilience and higher availability. This new capability enables you to have at least two domain controllers operating, even if an Availability Zone were to be temporarily unavailable. The additional domain controllers also improve the performance of your applications by enabling directory clients to load-balance their requests across a larger number of domain controllers. For example, AWS Microsoft AD enables you to use larger fleets of Amazon EC2 instances to run .NET applications that perform frequent user attribute lookups.

AWS Microsoft AD is a highly available, managed Active Directory built on actual Microsoft Windows Server 2012 R2 in the AWS Cloud. When you create your AWS Microsoft AD directory, AWS deploys two domain controllers that are exclusively yours in separate Availability Zones for high availability. Now, you can deploy additional domain controllers easily via the Directory Service console or API, by specifying the total number of domain controllers that you want.

AWS Microsoft AD distributes the additional domain controllers across the Availability Zones and subnets within the Amazon VPC where your directory is running. AWS deploys the domain controllers, configures them to replicate directory changes, monitors for and repairs any issues, performs daily snapshots, and updates the domain controllers with patches. This reduces the effort and complexity of creating and managing your own domain controllers in the AWS Cloud.

In this blog post, I create an AWS Microsoft AD directory with two domain controllers in each Availability Zone. This ensures that I always have at least two domain controllers operating, even if an entire Availability Zone were to be temporarily unavailable. To accomplish this, first I create an AWS Microsoft AD directory with one domain controller per Availability Zone, and then I deploy one additional domain controller per Availability Zone.

Solution architecture

The following diagram shows how AWS Microsoft AD deploys all the domain controllers in this solution after you complete Steps 1 and 2. In Step 1, AWS Microsoft AD deploys the two required domain controllers across multiple Availability Zones and subnets in an Amazon VPC. In Step 2, AWS Microsoft AD deploys one additional domain controller per Availability Zone and subnet.

Solution diagram

Step 1: Create an AWS Microsoft AD directory

First, I create an AWS Microsoft AD directory in an Amazon VPC. I can add domain controllers only after AWS Microsoft AD configures my first two required domain controllers. In my example, my domain name is example.com.

When I create my directory, I must choose the VPC in which to deploy my directory (as shown in the following screenshot). Optionally, I can choose the subnets in which to deploy my domain controllers, and AWS Microsoft AD ensures I select subnets from different Availability Zones. In this case, I have no subnet preference, so I choose No Preference from the Subnets drop-down list. In this configuration, AWS Microsoft AD selects subnets from two different Availability Zones to deploy the directory.

Screenshot of choosing the VPC in which to create the directory

I then choose Next Step to review my configuration, and then choose Create Microsoft AD. It takes approximately 40 minutes for my domain controllers to be created. I can check the status from the AWS Directory Service console, and when the status is Active, I can add my two additional domain controllers to the directory.

Step 2: Deploy two more domain controllers in the directory

Now that I have created an AWS Microsoft AD directory and it is active, I can deploy two additional domain controllers in the directory. AWS Microsoft AD enables me to add domain controllers through the Directory Service console or API. In this post, I use the console.

To deploy two more domain controllers in the directory:

  1. I open the AWS Management Console, choose Directory Service, and then choose the Microsoft AD Directory ID. In my example, my recently created directory is example.com, as shown in the following screenshot.Screenshot of choosing the Directory ID
  2. I choose the Domain controllers tab next. Here I can see the two domain controllers that AWS Microsoft AD created for me in Step 1. It also shows the Availability Zones and subnets in which AWS Microsoft AD deployed the domain controllers.Screenshot showing the domain controllers, Availability Zones, and subnets
  3. I then choose Modify on the Domain controllers tab. I specify the total number of domain controllers I want by choosing the subtract and add buttons. In my example, I want four domain controllers in total for my directory.Screenshot showing how to specify the total number of domain controllers
  4. I choose Apply. AWS Microsoft AD deploys the two additional domain controllers and distributes them evenly across the Availability Zones and subnets in my Amazon VPC. Within a few seconds, I can see the Availability Zones and subnets in which AWS Microsoft AD deployed my two additional domain controllers with a status of Creating (see the following screenshot). While AWS Microsoft AD deploys the additional domain controllers, my directory continues to operate by using the active domain controllers—with no disruption of service.
    Screenshot of two additional domain controllers with a status of "Creating"
  5. When AWS Microsoft AD completes the deployment steps, all domain controllers are in Active status and available for use by my applications. As a result, I have improved the redundancy and performance of my directory.

Note: After deploying additional domain controllers, I can reduce the number of domain controllers by repeating the modification steps with a lower number of total domain controllers. Unless a directory is deleted, AWS Microsoft AD does not allow fewer than two domain controllers per directory in order to deliver fault tolerance and high availability.

Summary

In this blog post, I demonstrated how to deploy additional domain controllers in your AWS Microsoft AD directory. By adding domain controllers, you increase the redundancy and performance of your directory, which makes it easier for you to migrate and run mission-critical Active Directory–integrated workloads in the AWS Cloud without having to deploy and maintain your own AD infrastructure.

To learn more about AWS Directory Service, see the AWS Directory Service home page. If you have questions, post them on the Directory Service forum.

– Peter

Healthcare Industry Cybersecurity Report

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/06/healthcare_indu.html

New US government report: “Report on Improving Cybersecurity in the Health Care Industry.” It’s pretty scathing, but nothing in it will surprise regular readers of this blog.

It’s worth reading the executive summary, and then skimming the recommendations. Recommendations are in six areas.

The Task Force identified six high-level imperatives by which to organize its recommendations and action items. The imperatives are:

  1. Define and streamline leadership, governance, and expectations for health care industry cybersecurity.
  2. Increase the security and resilience of medical devices and health IT.

  3. Develop the health care workforce capacity necessary to prioritize and ensure cybersecurity awareness and technical capabilities.

  4. Increase health care industry readiness through improved cybersecurity awareness and education.

  5. Identify mechanisms to protect research and development efforts and intellectual property from attacks or exposure.

  6. Improve information sharing of industry threats, weaknesses, and mitigations.

News article.

Slashdot thread.

Some notes on Trump’s cybersecurity Executive Order

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/05/some-notes-on-trumps-cybersecurity.html

President Trump has finally signed an executive order on “cybersecurity”. The first draft during his first weeks in power were hilariously ignorant. The current draft, though, is pretty reasonable as such things go. I’m just reading the plain language of the draft as a cybersecurity expert, picking out the bits that interest me. In reality, there’s probably all sorts of politics in the background that I’m missing, so I may be wildly off-base.

Holding managers accountable

This is a great idea in theory. But government heads are rarely accountable for anything, so it’s hard to see if they’ll have the nerve to implement this in practice. When the next breech happens, we’ll see if anybody gets fired.
“antiquated and difficult to defend Information Technology”

The government uses laughably old computers sometimes. Forces in government wants to upgrade them. This won’t work. Instead of replacing old computers, the budget will simply be used to add new computers. The old computers will still stick around.
“Legacy” is a problem that money can’t solve. Programmers know how to build small things, but not big things. Everything starts out small, then becomes big gradually over time through constant small additions. What you have now is big legacy systems. Attempts to replace a big system with a built-from-scratch big system will fail, because engineers don’t know how to build big systems. This will suck down any amount of budget you have with failed multi-million dollar projects.
It’s not the antiquated systems that are usually the problem, but more modern systems. Antiquated systems can usually be protected by simply sticking a firewall or proxy in front of them.

“address immediate unmet budgetary needs necessary to manage risk”

Nobody cares about cybersecurity. Instead, it’s a thing people exploit in order to increase their budget. Instead of doing the best security with the budget they have, they insist they can’t secure the network without more money.

An alternate way to address gaps in cybersecurity is instead to do less. Reduce exposure to the web, provide fewer services, reduce functionality of desktop computers, and so on. Insisting that more money is the only way to address unmet needs is the strategy of the incompetent.

Use the NIST framework
Probably the biggest thing in the EO is that it forces everyone to use the NIST cybersecurity framework.
The NIST Framework simply documents all the things that organizations commonly do to secure themselves, such run intrusion-detection systems or impose rules for good passwords.
There are two problems with the NIST Framework. The first is that no organization does all the things listed. The second is that many organizations don’t do the things well.
Password rules are a good example. Organizations typically had bad rules, such as frequent changes and complexity standards. So the NIST Framework documented them. But cybersecurity experts have long opposed those complex rules, so have been fighting NIST on them.

Another good example is intrusion-detection. These days, I scan the entire Internet, setting off everyone’s intrusion-detection systems. I can see first hand that they are doing intrusion-detection wrong. But the NIST Framework recommends they do it, because many organizations do it, but the NIST Framework doesn’t demand they do it well.
When this EO forces everyone to follow the NIST Framework, then, it’s likely just going to increase the amount of money spent on cybersecurity without increasing effectiveness. That’s not necessarily a bad thing: while probably ineffective or counterproductive in the short run, there might be long-term benefit aligning everyone to thinking about the problem the same way.
Note that “following” the NIST Framework doesn’t mean “doing” everything. Instead, it means documented how you do everything, a reason why you aren’t doing anything, or (most often) your plan to eventually do the thing.
preference for shared IT services for email, cloud, and cybersecurity
Different departments are hostile toward each other, with each doing things their own way. Obviously, the thinking goes, that if more departments shared resources, they could cut costs with economies of scale. Also obviously, it’ll stop the many home-grown wrong solutions that individual departments come up with.
In other words, there should be a single government GMail-type service that does e-mail both securely and reliably.
But it won’t turn out this way. Government does not have “economies of scale” but “incompetence at scale”. It means a single GMail-like service that is expensive, unreliable, and in the end, probably insecure. It means we can look forward to government breaches that instead of affecting one department affecting all departments.

Yes, you can point to individual organizations that do things poorly, but what you are ignoring is the organizations that do it well. When you make them all share a solution, it’s going to be the average of all these things — meaning those who do something well are going to move to a worse solution.

I suppose this was inserted in there so that big government cybersecurity companies can now walk into agencies, point to where they are deficient on the NIST Framework, and say “sign here to do this with our shared cybersecurity service”.
“identify authorities and capabilities that agencies could employ to support the cybersecurity efforts of critical infrastructure entities”
What this means is “how can we help secure the power grid?”.
What it means in practice is that fiasco in the Vermont power grid. The DHS produced a report containing IoCs (“indicators of compromise”) of Russian hackers in the DNC hack. Among the things it identified was that the hackers used Yahoo! email. They pushed these IoCs out as signatures in their “Einstein” intrusion-detection system located at many power grid locations. The next person that logged into their Yahoo! email was then flagged as a Russian hacker, causing all sorts of hilarity to ensue, such as still uncorrected stories by the Washington Post how the Russians hacked our power-grid.
The upshot is that federal government help is also going to include much government hindrance. They really are this stupid sometimes and there is no way to fix this stupid. (Seriously, the DHS still insists it did the right thing pushing out the Yahoo IoCs).
Resilience Against Botnets and Other Automated, Distributed Threats

The government wants to address botnets because it’s just the sort of problem they love, mass outages across the entire Internet caused by a million machines.

But frankly, botnets don’t even make the top 10 list of problems they should be addressing. Number #1 is clearly “phishing” — you know, the attack that’s been getting into the DNC and Podesta e-mails, influencing the election. You know, the attack that Gizmodo recently showed the Trump administration is partially vulnerable to. You know, the attack that most people blame as what probably led to that huge OPM hack. Replace the entire Executive Order with “stop phishing”, and you’d go further fixing federal government security.

But solving phishing is tough. To begin with, it requires a rethink how the government does email, and how how desktop systems should be managed. So the government avoids complex problems it can’t understand to focus on the simple things it can — botnets.

Dealing with “prolonged power outage associated with a significant cyber incident”

The government has had the hots for this since 2001, even though there’s really been no attack on the American grid. After the Russian attacks against the Ukraine power grid, the issue is heating up.

Nation-wide attacks aren’t really a threat, yet, in America. We have 10,000 different companies involved with different systems throughout the country. Trying to hack them all at once is unlikely. What’s funny is that it’s the government’s attempts to standardize everything that’s likely to be our downfall, such as sticking Einstein sensors everywhere.

What they should be doing is instead of trying to make the grid unhackable, they should be trying to lessen the reliance upon the grid. They should be encouraging things like Tesla PowerWalls, solar panels on roofs, backup generators, and so on. Indeed, rather than industrial system blackout, industry backup power generation should be considered as a source of grid backup. Factories and even ships were used to supplant the electric power grid in Japan after the 2011 tsunami, for example. The less we rely on the grid, the less a blackout will hurt us.

“cybersecurity risks facing the defense industrial base, including its supply chain”

So “supply chain” cybersecurity is increasingly becoming a thing. Almost anything electronic comes with millions of lines of code, silicon chips, and other things that affect the security of the system. In this context, they may be worried about intentional subversion of systems, such as that recent article worried about Kaspersky anti-virus in government systems. However, the bigger concern is the zillions of accidental vulnerabilities waiting to be discovered. It’s impractical for a vendor to secure a product, because it’s built from so many components the vendor doesn’t understand.

“strategic options for deterring adversaries and better protecting the American people from cyber threats”

Deterrence is a funny word.

Rumor has it that we forced China to backoff on hacking by impressing them with our own hacking ability, such as reaching into China and blowing stuff up. This works because the Chinese governments remains in power because things are going well in China. If there’s a hiccup in economic growth, there will be mass actions against the government.

But for our other cyber adversaries (Russian, Iran, North Korea), things already suck in their countries. It’s hard to see how we can make things worse by hacking them. They also have a strangle hold on the media, so hacking in and publicizing their leader’s weird sex fetishes and offshore accounts isn’t going to work either.

Also, deterrence relies upon “attribution”, which is hard. While news stories claim last year’s expulsion of Russian diplomats was due to election hacking, that wasn’t the stated reason. Instead, the claimed reason was Russia’s interference with diplomats in Europe, such as breaking into diplomat’s homes and pooping on their dining room table. We know it’s them when they are brazen (as was the case with Chinese hacking), but other hacks are harder to attribute.

Deterrence of nation states ignores the reality that much of the hacking against our government comes from non-state actors. It’s not clear how much of all this Russian hacking is actually directed by the government. Deterrence polices may be better directed at individuals, such as the recent arrest of a Russian hacker while they were traveling in Spain. We can’t get Russian or Chinese hackers in their own countries, so we have to wait until they leave.

Anyway, “deterrence” is one of those real-world concepts that hard to shoe-horn into a cyber (“cyber-deterrence”) equivalent. It encourages lots of bad thinking, such as export controls on “cyber-weapons” to deter foreign countries from using them.

“educate and train the American cybersecurity workforce of the future”

The problem isn’t that we lack CISSPs. Such blanket certifications devalue the technical expertise of the real experts. The solution is to empower the technical experts we already have.

In other words, mandate that whoever is the “cyberczar” is a technical expert, like how the Surgeon General must be a medical expert, or how an economic adviser must be an economic expert. For over 15 years, we’ve had a parade of non-technical people named “cyberczar” who haven’t been experts.

Once you tell people technical expertise is valued, then by nature more students will become technical experts.

BTW, the best technical experts are software engineers and sysadmins. The best cybersecurity for Windows is already built into Windows, whose sysadmins need to be empowered to use those solutions. Instead, they are often overridden by a clueless cybersecurity consultant who insists on making the organization buy a third-party product instead that does a poorer job. We need more technical expertise in our organizations, sure, but not necessarily more cybersecurity professionals.

Conclusion

This is really a government document, and government people will be able to explain it better than I. These are just how I see it as a technical-expert who is a government-outsider.

My guess is the most lasting consequential thing will be making everyone following the NIST Framework, and the rest will just be a lot of aspirational stuff that’ll be ignored.

New Whitepaper: Aligning to the NIST Cybersecurity Framework in the AWS Cloud

Post Syndicated from Chris Gile original https://aws.amazon.com/blogs/security/new-whitepaper-aligning-to-the-nist-cybersecurity-framework-in-the-aws-cloud/

NIST logo

Today, we released the Aligning to the NIST Cybersecurity Framework in the AWS Cloud whitepaper. Both public and commercial sector organizations can use this whitepaper to assess the AWS environment against the National Institute of Standards and Technology (NIST) Cybersecurity Framework (CSF) and improve the security measures they implement and operate (also known as security in the cloud). The whitepaper also provides a third-party auditor letter attesting to the AWS Cloud offering’s conformance to NIST CSF risk management practices (also known as security of the cloud), allowing organizations to properly protect their data across AWS.

In February 2014, NIST published the Framework for Improving Critical Infrastructure Cybersecurity in response to Presidential Executive Order 13636, “Improving Critical Infrastructure Cybersecurity,” which called for the development of a voluntary framework to help organizations improve the cybersecurity, risk management, and resilience of their systems. The Cybersecurity Enhancement Act of 2014 reinforced the legitimacy and authority of the NIST CSF by codifying it and its voluntary adoption into law, and federal agency Federal Information Security Modernization Act (FISMA) reporting metrics now align to the NIST CSF. Though it is intended for adoption by the critical infrastructure sector, the foundational set of security disciplines in the NIST CSF has been endorsed by government and industry as a recommended baseline for use by any organization, regardless of its sector or size.

We recognize the additional level of effort an organization has to expend for each new security assurance framework it implements. To reduce that burden, we provide a detailed breakout of AWS Cloud offerings and associated customer and AWS responsibilities to facilitate alignment with the NIST CSF. Organizations ranging from federal and state agencies to regulated entities to large enterprises can use this whitepaper as a guide for implementing AWS solutions to achieve the risk management outcomes in the NIST CSF.

Security, compliance, and customer data protection are our top priorities, and we will continue to provide the resources and services for you to meet your desired outcomes while integrating security best practices in the AWS environment. When you use AWS solutions, you can be confident that we protect your data with a level of assurance that meets, if not exceeds, your requirements and needs, and gives you the resources to secure your AWS environment. To request support for implementing the NIST CSF in your organization by using AWS services, contact your AWS account manager.

– Chris Gile, Senior Manager, Security Assurance

Buzzword Watch: Prosilience

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/03/buzzword_watch_.html

Summer Fowler at CMU has invented a new word: prosilience:

I propose that we build operationally PROSILIENT organizations. If operational resilience, as we like to say, is risk management “all grown up,” then prosilience is resilience with consciousness of environment, self-awareness, and the capacity to evolve. It is not about being able to operate through disruption, it is about anticipating disruption and adapting before it even occurs–a proactive version of resilience. Nascent prosilient capabilities include exercises (tabletop or technical) that simulate how organizations would respond to a scenario. The goal, however, is to automate, expand, and perform continuous exercises based on real-world indicators rather than on scenarios.

I have long been a big fan of resilience as a security concept, and the property we should be aiming for. I’m not sure prosilience buys me anything new, but this is my first encounter with this new buzzword. It would certainly make for a best-selling business-book title.

First Round of systemd.conf 2015 Sponsors

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/first-round-of-systemdconf-2015-sponsors.html

First Round of systemd.conf 2015 Sponsors

We are happy to announce the first round of systemd.conf
2015
sponsors!

Our first Gold sponsor is CoreOS!

CoreOS develops software for modern infrastructure that delivers a consistent operating environment for distributed applications. CoreOS’s commercial offering, Tectonic, is an enterprise-ready platform that combines Kubernetes and the CoreOS stack to run Linux containers. In addition CoreOS is the creator and maintainer of open source projects such as CoreOS Linux, etcd, fleet, flannel and rkt. The strategies and architectures that influence CoreOS allow companies like Google, Facebook and Twitter to run their services at scale with high resilience. Learn more about CoreOS here https://coreos.com/, Tectonic here, https://tectonic.com/ or follow CoreOS on Twitter @coreoslinux.

A Silver sponsor is Codethink:

Codethink is a software services consultancy, focusing on engineering reliable systems for long-term deployment with open source technologies.

A Bronze sponsor is Pantheon:

Pantheon is a platform for professional website development, testing, and deployment. Supporting Drupal and WordPress, Pantheon runs over 100,000 websites for the world’s top brands, universities, and media organizations on top of over a million containers.

A Bronze sponsor is Pengutronix:

Pengutronix provides consulting, training and development services for Embedded Linux to customers from the industry. The Kernel Team ports Linux to customer hardware and has more than 3100 patches in the official mainline kernel. In addition to lowlevel ports, the Pengutronix Application Team is responsible for board support packages based on PTXdist or Yocto and deals with system integration (this is where systemd plays an important role). The Graphics Team works on accelerated multimedia tasks, based on the Linux kernel, GStreamer, Qt and web technologies.

We’d like to thank our sponsors for their support! Without sponsors our conference would not be possible!

We’ll shortly announce our second round of sponsors, please stay tuned!

If you’d like to join the ranks of systemd.conf 2015 sponsors, please have a look at our Becoming a Sponsor page!

Reminder! The systemd.conf 2015 Call for Presentations ends on monday, August 31st! Please make sure to submit your proposals on the CfP page until then!

Also, don’t forget to register for the conference! Only a limited number of
registrations are available due to space constraints!
Register here!.

For further details about systemd.conf consult the conference website.

A Case Study in Global Fault Isolation

Post Syndicated from Lee-Ming Zen original https://aws.amazon.com/blogs/architecture/a-case-study-in-global-fault-isolation/

In a previous blog post, we talked about using shuffle sharding to get magical fault isolation. Today, we’ll examine a specific use case that Route 53 employs and one of the interesting tradeoffs we decided to make as part of our sharding. Then, we’ll discuss how you can employ some of these concepts in your own applications.

Overview of Anycast DNS

One of our goals at Amazon Route 53 is to provide low-latency DNS resolution to customers. We do this, in part, by announcing our IP addresses using “anycast” from over 50 edge locations around the globe. Anycast works by routing packets to the closest (network-wise) location that is “advertising” a particular address. In the image below, we can see that there are three locations, all of which can receive traffic for the 205.251.194.72 address.

(Blue circles represent edge locations; orange circles represent AWS regions)

For example, if a customer has ns-584.awsdns-09.net assigned as a nameserver, issuing a query to that nameserver could result in that query landing at any one of multiple locations responsible for advertising the underlying IP address. Where the query lands depends on the anycast routing of the Internet, but it is generally going to be the closest network-wise (and hence, low latency) location to the end user.

Behind the scenes, we have thousands of nameserver names (e.g. ns-584.awsdns-09.net) hosted across four top-level domains (.com, .net, .co.uk, and .org). We refer to all the nameservers in one top-level domain as a ‘stripe;’ thus, we have a .com stripe, a .net stripe, a .co.uk stripe, and a .org stripe. This is where shuffle sharding comes in: each Route 53 domain (hosted zone) receives four nameserver names one from each of stripe. As a result, it is unlikely that two zones will overlap completely across all four nameservers. In fact, we enforce a rule during nameserver assignment that no hosted zone can overlap by more than two nameservers with any previously created hosted zone.

DNS Resolution

Before continuing, it’s worth quickly explaining how DNS resolution works. Typically, a client, such as your laptop or desktop has a “stub resolver.” The stub resolver simply contacts a recursive nameserver (resolver), which in turn queries authoritative nameservers, on the Internet to find the answers to a DNS query. Typically, resolvers are provided by your ISP or corporate network infrastructure, or you may rely on an open resolver such as Google DNS. Route 53 is an authoritative nameserver, responsible for replying to resolvers on behalf of customers. For example, when a client program attempts to look up amazonaws.com, the stub resolver on the machine will query the resolver. If the resolver has the data in cache and the value hasn’t expired, it will use the cached value. Otherwise, the resolver will query authoritative nameservers to find the answer.

(Every location advertises one or more stripes, but we only show Sydney, Singapore, and Hong Kong in the above diagram for clarity.)

Each Route 53 edge location is responsible for serving the traffic for one or more stripes. For example, our edge location in Sydney, Australia could serve both the .com and .net, while Singapore could serve just the .org stripe. Any given location can serve the same stripe as other locations. Hong Kong could serve the .net stripe, too. This means that if a resolver in Australia attempts to resolve a query against a nameserver in the .org stripe, which isn’t provided in Australia, the query will go to the closest location that provides the .org stripe (which is likely Singapore). A resolver in Singapore attempting to query against a nameserver in the .net stripe may go to Hong Kong or Sydney depending on the potential Internet routes from that resolver’s particular network. This is shown in the diagram above.

For any given domain, in general, resolvers learn the lowest latency nameserver based upon the round trip time of the query (this technique is often called SRTT or smooth round-trip time). Over a few queries, a resolver in Australia would gravitate toward using the nameservers on the .net and .com stripes for Route 53 customers’ domains.

Not all resolvers do this. Some choose randomly amongst the nameservers. Others may end up choosing the slowest one, but our experiments show that about 80% of resolvers use the lowest RTT nameserver. For additional information, this presentation presents information on how various resolvers choose which nameserver they utilize. Additionally, many other resolvers (such as Google Public DNS) use pre-fetching, or have very short timeouts if a resolver fails to resolve against a particular nameserver.

The Latency-Availability Decision

Given the above resolver behavior, one option, for a DNS provider like Route 53, might be to advertise all four stripes from every edge location. This would mean that no matter which nameserver a resolver choses, it will always go to the closest network location. However, we believe this provides a poor availability model.

Why? Because edge locations can sometimes fail to provide resolution for a variety of reasons that are very hard to control: the edge location may lose power or Internet connectivity, the resolver may lose connectivity to the edge location, or an intermediary transit provider may lose connectivity. Our experiments have shown that these types of events can cause about 5 minutes of disruption as the Internet updates routing tables. In recent years another serious risk has arisen: large-scale transit network congestion due to DDOS attacks. Our colleague, Nathan Dye, has a talk from AWS re:Invent that provides more details: www.youtube.com/watch?v=V7vTPlV8P3U.

In all of these failure scenarios, advertising every nameserver from every location may result in resolvers having no fallback location. All nameservers would route to the same location and resolvers would fail to resolve DNS queries, resulting in an outage for customers.

In the diagram below, we show the difference for a resolver querying domain X, whose nameservers (NX1, NX2, NX3, NX4) are advertised from all locations and domain Y, whose nameservers (NY1, NY2, NY3, NY4) are advertised in a subset of the locations.

When the path from the resolver to location A is impaired, all queries to the nameservers for domain X will fail. In comparison, even if the path from the resolver to location A is impaired, there are other transit paths to reach nameservers at locations B, C, and D in order to resolve the DNS for domain Y.

Route 53 typically advertises only one stripe per edge location. As a result, if anything goes wrong with a resolver being able to reach an edge location, that resolver has three other nameservers in three other locations to which it can fall back. For example, if we deploy bad software that causes the edge location to stop responding, the resolver can still retry elsewhere. This is why we organize our deployments in “stripe order”; Nick Trebon provides a great overview of our deployment strategies in the previous blog post. It also means that queries to Route 53 gain a lot of Internet path diversity, which helps resolvers route around congestion and other intermediary problems on their path to reaching Route 53.

Route 53’s foremost goal is to always meet our promise of a 100% SLA for DNS queries – that all of our customers’ DNS names should resolve all the time. Our customers also tell us that latency is next most important feature of a DNS service provider. Maximizing Internet path and edge location diversity for availibility necessarily means that some nameservers will respond from farther-away edge locations. For most resolvers, our method has no impact on the minimum RTT, or fastest nameserver, and how quickly it can respond. As resolvers generally use the fastest nameserver, we’re confident that any compromise in resolution times is small and that this is a good balance between the goals of low latency and high availability.

On top of our striping across locations, you may have noticed that the four stripes use different top-level domains. We use multiple top-levels domains in case one of the three TLD providers (.com and .net are both operated by Verisign) has any sort of DNS outage. While this rarely happens, it means that as a customer, you’ll have increased protection during a TLD’s DNS outage because at least two of your four nameservers will continue to work.

Applications

You, too, can apply the same techniques in your own systems and applications. If your system isn’t end-user facing, you could also consider utilizing multiple TLDs for resilience as well. Especially in the case where you control your own API and clients calling the API, there’s no reason to place all your eggs in one TLD basket.

Another application of what we’ve discussed is minimizing downtime during failovers. For high availability applications, we recommend customers utilize Route 53 DNS Failover. With failover configured, Route 53 will only return answers for healthy endpoints. In order to determine endpoint health, Route 53 issues health checks against your endpoint. As a result, there is a minimum of 10 seconds (assuming you configured fast health checks with a single failover interval) where the application could be down, but failover has not triggered yet. On top of that, there is the additional time incurred for resolvers to expire the DNS entry from their cache based upon the record’s TTL. To minimize this failover time, you could write your clients to behave similar to the resolver behavior described earlier. And, while you may not employ an anycast system, you can host your endpoints in multiple locations (e.g. different availability zones and perhaps even different regions). Your clients would learn the SRTT of the multiple endpoints over time and only issue queries to the fastest endpoint, but fallback to the other endpoints if the fastest is unavailable. And, of course, you could shuffle shard your endpoints to achieve increased fault isolation while doing all of the above.

– Lee-Ming Zen