Tag Archives: resilience

Continually assessing application resilience with AWS Resilience Hub and AWS CodePipeline

Post Syndicated from Scott Bryen original https://aws.amazon.com/blogs/architecture/continually-assessing-application-resilience-with-aws-resilience-hub-and-aws-codepipeline/

As customers commit to a DevOps mindset and embrace a nearly continuous integration/continuous delivery model to implement change with a higher velocity, assessing every change impact on an application resilience is key. This blog shows an architecture pattern for automating resiliency assessments as part of your CI/CD pipeline. Automatically running a resiliency assessment within CI/CD pipelines, development teams can fail fast and understand quickly if a change negatively impacts an applications resilience. The pipeline can stop the deployment into further environments, such as QA/UAT and Production, until the resilience issues have been improved.

AWS Resilience Hub is a managed service that gives you a central place to define, validate and track the resiliency of your AWS applications. It is integrated with AWS Fault Injection Simulator (FIS), a chaos engineering service, to provide fault-injection simulations of real-world failures. Using AWS Resilience Hub, you can assess your applications to uncover potential resilience enhancements. This will allow you to validate your applications recovery time (RTO), recovery point (RPO) objectives and optimize business continuity while reducing recovery costs. Resilience Hub also provides APIs for you to integrate its assessment and testing into your CI/CD pipelines for ongoing resilience validation.

AWS CodePipeline is a fully managed continuous delivery service for fast and reliable application and infrastructure updates. You can use AWS CodePipeline to model and automate your software release processes. This enables you to increase the speed and quality of your software updates by running all new changes through a consistent set of quality checks.

Continuous resilience assessments

Figure 1 shows the resilience assessments automation architecture in a multi-account setup. AWS CodePipeline, AWS Step Functions, and AWS Resilience Hub are defined in your deployment account while the application AWS CloudFormation stacks are imported from your workload account. This pattern relies on AWS Resilience Hub ability to import CloudFormation stacks from a different accounts, regions, or both, when discovering an application structure.

High-level architecture pattern for automating resilience assessments

Figure 1. High-level architecture pattern for automating resilience assessments

Add application to AWS Resilience Hub

Begin by adding your application to AWS Resilience Hub and assigning a resilience policy. This can be done via the AWS Management Console or using CloudFormation. In this instance, the application has been created through the AWS Management Console. Sebastien Stormacq’s post, Measure and Improve Your Application Resilience with AWS Resilience Hub, walks you through how to add your application to AWS Resilience Hub.

In a multi-account environment, customers typically have dedicated AWS workload account per environment and we recommend you separate CI/CD capabilities into another account. In this post, the AWS Resilience Hub application has been created in the deployment account and the resources have been discovered using an CloudFormation stack from the workload account. Proper permissions are required to use AWS Resilience Hub to manage application in multiple accounts.

Adding application to AWS Resilience Hub

Figure 2. Adding application to AWS Resilience Hub

Create AWS Step Function to run resilience assessment

Whenever you make a change to your application CloudFormation, you need to update and publish the latest version in AWS Resilience Hub to ensure you are assessing the latest changes. Now that AWS Step Functions SDK integrations support AWS Resilience Hub, you can build a state machine to coordinate the process, which will be triggered from AWS Code Pipeline.

AWS Step Functions is a low-code, visual workflow service that developers use to build distributed applications, automate IT and business processes, and build data and machine learning pipelines using AWS services. Workflows manage failures, retries, parallelization, service integrations, and observability so developers can focus on higher-value business logic.

AWS Step Function for orchestrating AWS SDK calls

Figure 3. AWS Step Function for orchestrating AWS SDK calls

  1. The first step in the workflow is to update the resources associated with the application defined in AWS Resilience Hub by calling ImportResourcesToDraftApplication.
  2. Check for the import process to complete using a wait state, a call to DescribeDraftAppVersionResourcesImportStatus and then a choice state to decide whether to progress or continue waiting.
  3. Once complete, publish the draft application by calling PublishAppVersion to ensure we are assessing the latest version.
  4. Once published, call StartAppAssessment to kick-off a resilience assessment.
  5. Check for the assessment to complete using a wait state, a call to DescribeAppAssessment and then a choice state to decide whether to progress or continue waiting.
  6. In the choice state, use assessment status from the response to determine if the assessment is pending, in progress or successful.
  7. If successful, use the compliance status from the response to determine whether to progress to success or fail.
    • Compliance status will be either “PolicyMet” or “PolicyBreached”.
  8. If policy breached, publish onto SNS to alert the development team before moving to fail.

Create stage within code pipeline

Now that we have the AWS Step Function created, we need to integrate it into our pipeline. The post Fine-grained Continuous Delivery With CodePipeline and AWS Step Functions demonstrates how you can trigger a step function from AWS Code Pipeline.

When adding the stage, you need to pass the ARN of the stack which was deployed in the previous stage as well as the ARN of the application in AWS Resilience Hub. These will be required on the AWS SDK calls and you can pass this in as a literal.

AWS CodePipeline stage step function input

Figure 4. AWS CodePipeline stage step function input

Example state using the input from AWS CodePipeline stage

Figure 5. Example state using the input from AWS CodePipeline stage

For more information about these AWS SDK calls, please refer to the AWS Resilience Hub API Reference documents.

Customers often run their workloads in lower environments in a less resilient way to save on cost. It’s important to add the assessment stage at the appropriate point of your pipeline. We recommend adding this to your pipeline after the deployment to a test environment which mirrors production but before deploying to production. By doing this you can fail fast and halt changes which will lower resilience in production.

A note on service quotas: AWS Resilience Hub allows you to run 20 assessments per month per application. If you need to increase this quota, please raise a ticket with AWS Support.

Conclusion

In this post, we have seen an approach to continuously assessing resilience as part of your CI/CD pipeline using AWS Resilience Hub, AWS CodePipeline and AWS Step Functions. This approach will enable you to understand fast if a change will weaken resilience.

AWS Resilience Hub also generates recommended AWS FIS Experiments that you can deploy and use to test the resilience of your application. As well as assessing the resilience, we also recommend you integrate running these tests into your pipeline. The post Chaos Testing with AWS Fault Injection Simulator and AWS CodePipeline demonstrates how you can active this.

Understand resiliency patterns and trade-offs to architect efficiently in the cloud

Post Syndicated from Haresh Nandwani original https://aws.amazon.com/blogs/architecture/understand-resiliency-patterns-and-trade-offs-to-architect-efficiently-in-the-cloud/

Architecting workloads to achieve your resiliency targets can be a balancing act. Firms designing for resilience on cloud often need to evaluate multiple factors before they can decide the most optimal architecture for their workloads. Example Corp has multiple applications with varying criticality, and each of their applications have different needs in terms of resiliency, complexity, and cost. They have many choices to architect their workloads for resiliency and cost, but which option suits their needs best? Will they have to make any sacrifices to implement one over another? How and why should they choose one pattern over another?

To help answer these questions, we’ll discuss the five resilience patterns in Figure 1 and the trade-offs to consider when implementing them: 1) design complexity, 2) cost to implement, 3) operational effort, 4) effort to secure, and 5) environmental impact. This will help you achieve varying levels of resiliency and make decisions about the most appropriate architecture for your needs.

Resilience patterns and trade-offs

Figure 1. Resilience patterns and trade-offs

What is resiliency? Why does it matter?

The AWS Well-Architected Framework defines resilience as having “the capability to recover when stressed by load (more requests for service), attacks (either accidental through a bug, or deliberate through intention), and failure of any component in the workload’s components.”

To meet your business’ resilience requirements, consider the following core factors as you design your workloads:

  • Design complexity – Usually, the more complex your workload becomes, the more complicated your resilience requirements will be. Each individual workload component has to be resilient, and you’ll need to eliminate single points of failure across people, process, and technology elements.
  • Cost to implement – Costs often significantly increase when you implement higher resilience because there are new software and infrastructure components to operate.
  • Operational effort – Deploying and supporting highly resilient systems require more complex operational processes and advanced technical skills. Before you decide to implement higher resilience, evaluate your operational competency to confirm you have the required level of process maturity and skillsets.
  • Effort to secure – Security complexity is less directly correlated to resilience. However, there are generally more components to secure for highly resilient systems. AWS Security best practices can help customers achieve their security objectives for such complex deployments.
  • Environmental impact – An increased deployment footprint for resilient systems might increase your consumption of cloud resources. However, you can use trade-offs like approximate computing and slower response times to reduce resource consumption.

P1 – Multi-AZ

P1 is a cloud-based architecture pattern (Figure 2) that introduces Availability Zones (AZs) into your architecture to increase your system’s resilience. The P1 pattern uses a Multi-AZ architecture where applications operate in multiple AZs within a single AWS Region. This allows your application to withstand AZ-level impacts.

As shown in Figure 2, Example Corp deploys their internal employee applications using the P1 pattern. These applications are low business impact and therefore have lower requirements for resiliency.

Example Corp deploys these applications on Amazon Elastic Compute Cloud (Amazon EC2), which uses health checks to automatically detect faults. If an AZ fails, Amazon EC2 prompts an Amazon EC2 Auto Scaling group to recreate their application in another unaffected AZ.

Multi-AZ deployment pattern (P1)

Figure 2. Multi-AZ deployment pattern (P1)

Trade-offs

P1 is low effort in several categories, but this comes at the expense of application recovery. If AZ is down, it will disrupt end users’ access to the application while the new resources are being re-provisioned in a new AZ. This is known as bi-modal behavior.

P2 – Multi-AZ with static stability

P2 uses multiple AZs within a Region to increase resilience, but it uses static stability to prevent bimodal behavior. P2 uses static stability systems, which remain stable and operate in one mode irrespective of changes to their operating environment.

As shown in Figure 3, Example Corp has a customer-facing website that has a lower tolerance for downtime. Any time the website is down, it could result in lost revenue. Because of this, the website requires two EC2 instances that are provisioned within two AZs. This way, if an AZ becomes impaired, the website can continue operating and does not require Example Corp to detect the fault or launch new infrastructure.

Multi-AZ with static stability pattern (P2)

Figure 3. Multi-AZ with static stability pattern (P2)

Trade-offs

P2 must be weighed against cost concerns. P1 is less expensive because it provisions less compute capacity and relies on launching new instances in case of a failure. However, P1’s bimodal behavior might affect your customers during large-scale events.

You could go further and deploy your workload to three AZs across the Region. This will reduce costs associated with over-provisioning because you only have to provision three instances versus the four we mentioned in our earlier example.

P3 – Application portfolio distribution

The P3 pattern uses a multi-Region pattern to increase functional resilience. It distributes different critical applications in multiple Regions.

Example Corp provides banking services like credit balance checks to consumers on multiple digital channels. These services are available to consumers via a mobile application, contact center, and web-based applications. If the Region fails where the mobile application is deployed, customers can still access services via the other channels deployed in other Regions. Regional disruptions are rare, but implementing this pattern ensures your users retain access to business-critical services during disruptions.

Application portfolio distribution pattern (P3)

Figure 4. Application portfolio distribution pattern (P3)

Trade-offs

Operating an application portfolio that spans multiple Regions requires significant operational planning and management. Isolated functional elements may depend on common downstream systems and data sources that are deployed in a single Region. Therefore, Region-wide events might still cause disruption; however, the impact surface area is significantly reduced.

P4 – Multi-AZ deployment (multi-Region disaster recovery)

Example Corp operates several business-critical services, such as the ability for consumers to make bank payments, that have very low tolerance for disruptions. Example Corp uses the following sub-patterns for these applications:

  • Pilot Light – This pattern works for applications that require RTO/RPO of 10s of minutes. Data is actively replicated and application infrastructure is pre-provisioned in the disaster recovery (DR) Region. Cost optimization is a key driver here because the application infrastructure is kept switched off and only switched on during the restore event.
  • Warm Standby– This pattern improves restore times significantly compared to pilot light by keeping your applications running in the DR Region but with a reduced capacity. Application infrastructure will be scaled up during a DR event but this can typically be automated with minimal manual effort. This pattern can achieve RTO/RPO of minutes if implemented correctly.

The Disaster Recovery of Workloads on AWS: Recovery in the Cloud whitepaper documents these patterns in detail.

Trade-offs

Regional DR patterns increase deployment complexity because infrastructure changes need to be synchronized across Regions. Testing is also significantly more complex and should include scenarios such as losing a Region and traffic routing and management. Using Infrastructure as Code to automate deployments can help alleviate these issues.

P5 – Multi-Region active-active

Example Corp’s core banking and Customer Relationship Management applications have zero tolerance for Regional disruption. They use the P5 pattern for deploying these applications because it has an RTO of real-time and an RPO of near-zero data loss. This way they run their workload simultaneously in multiple Regions, which allows them to serve traffic from all Regions.

Multi-Region active-active pattern (P5)

Figure 5. Multi-Region active-active pattern (P5)

Trade-offs

Multi-active ecosystems are generally complex because they include multiple applications that collaborate to deliver required business services. If you implement this pattern, you’ll need to consider the fact that you’re introducing asynchronous replication for data across Regions and the impact that has on data consistency.

Operating this pattern requires a very high level of process maturity, so we recommend customers gradually build towards this pattern by starting initially with deployment patterns described earlier.

Conclusion

In this blog post, we introduced five resilience patterns and the trade-offs to consider when implementing them. We showed you how Example Corp evaluated these options and how they applied to their business needs to help you decide on the most efficient architecture to implement.

Further reading

Looking for more architecture content?

AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

How to use regional SAML endpoints for failover

Post Syndicated from Jonathan VanKim original https://aws.amazon.com/blogs/security/how-to-use-regional-saml-endpoints-for-failover/

Many Amazon Web Services (AWS) customers choose to use federation with SAML 2.0 in order to use their existing identity provider (IdP) and avoid managing multiple sources of identities. Some customers have previously configured federation by using AWS Identity and Access Management (IAM) with the endpoint signin.aws.amazon.com. Although this endpoint is highly available, it is hosted in a single AWS Region, us-east-1. This blog post provides recommendations that can improve resiliency for customers that use IAM federation, in the unlikely event of disrupted availability of one of the regional endpoints. We will show you how to use multiple SAML sign-in endpoints in your configuration and how to switch between these endpoints for failover.

How to configure federation with multi-Region SAML endpoints

AWS Sign-In allows users to log in into the AWS Management Console. With SAML 2.0 federation, your IdP portal generates a SAML assertion and redirects the client browser to an AWS sign-in endpoint, by default signin.aws.amazon.com/saml. To improve federation resiliency, we recommend that you configure your IdP and AWS federation to support multiple SAML sign-in endpoints, which requires configuration changes for both your IdP and AWS. If you have only one endpoint configured, you won’t be able to log in to AWS by using federation in the unlikely event that the endpoint becomes unavailable.

Let’s take a look at the Region code SAML sign-in endpoints in the AWS General Reference. The table in the documentation shows AWS regional endpoints globally. The format of the endpoint URL is as follows, where <region-code> is the AWS Region of the endpoint: https://<region-code>.signin.aws.amazon.com/saml

All regional endpoints have a region-code value in the DNS name, except for us-east-1. The endpoint for us-east-1 is signin.aws.amazon.com—this endpoint does not contain a Region code and is not a global endpoint. AWS documentation has been updated to reference SAML sign-in endpoints.

In the next two sections of this post, Configure your IdP and Configure IAM roles, I’ll walk through the steps that are required to configure additional resilience for your federation setup.

Important: You must do these steps before an unexpected unavailability of a SAML sign-in endpoint.

Configure your IdP

You will need to configure your IdP and specify which AWS SAML sign-in endpoint to connect to.

To configure your IdP

  1. If you are setting up a new configuration for AWS federation, your IdP will generate a metadata XML configuration file. Keep track of this file, because you will need it when you configure the AWS portion later.
  2. Register the AWS service provider (SP) with your IdP by using a regional SAML sign-in endpoint. If your IdP allows you to import the AWS metadata XML configuration file, you can find these files available for the public, GovCloud, and China Regions.
  3. If you are manually setting the Assertion Consumer Service (ACS) URL, we recommend that you pick the endpoint in the same Region where you have AWS operations.
  4. In SAML 2.0, RelayState is an optional parameter that identifies a specified destination URL that your users will access after signing in. When you set the ACS value, configure the corresponding RelayState to be in the same Region as the ACS. This keeps the Region configurations consistent for both ACS and RelayState. Following is the format of a Region-specific console URL.

    https://<region-code>.console.aws.amazon.com/

    For more information, refer to your IdP’s documentation on setting up the ACS and RelayState.

Configure IAM roles

Next, you will need to configure IAM roles’ trust policies for all federated human access roles with a list of all the regional AWS Sign-In endpoints that are necessary for federation resiliency. We recommend that your trust policy contains all Regions where you operate. If you operate in only one Region, you can get the same resiliency benefits by configuring an additional endpoint. For example, if you operate only in us-east-1, configure a second endpoint, such as us-west-2. Even if you have no workloads in that Region, you can switch your IdP to us-west-2 for failover. You can log in through AWS federation by using the us-west-2 SAML sign-in endpoint and access your us-east-1 AWS resources.

To configure IAM roles

  1. Log in to the AWS Management Console with credentials to administer IAM. If this is your first time creating the identity provider trust in AWS, follow the steps in Creating IAM SAML identity providers to create the identity providers.
  2. Next, create or update IAM roles for federated access. For each IAM role, update the trust policy that lists the regional SAML sign-in endpoints. Include at least two for increased resiliency.

    The following example is a role trust policy that allows the role to be assumed by a SAML provider coming from any of the four US Regions.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": "arn:aws:iam:::saml-provider/IdP"
                },
                "Action": "sts:AssumeRoleWithSAML",
                "Condition": {
                    "StringEquals": {
                        "SAML:aud": [
                            "https://us-east-2.signin.aws.amazon.com/saml",
                            "https://us-west-1.signin.aws.amazon.com/saml",
                            "https://us-west-2.signin.aws.amazon.com/saml",
                            "https://signin.aws.amazon.com/saml"
                        ]
                    }
                }
            }
        ]
    }

  3. When you use a regional SAML sign-in endpoint, the corresponding regional AWS Security Token Service (AWS STS) endpoint is also used when you assume an IAM role. If you are using service control policies (SCP) in AWS Organizations, check that there are no SCPs denying the regional AWS STS service. This will prevent the federated principal from being able to obtain an AWS STS token.

Switch regional SAML sign-in endpoints

In the event that the regional SAML sign-in endpoint your ACS is configured to use becomes unavailable, you can reconfigure your IdP to point to another regional SAML sign-in endpoint. After you’ve configured your IdP and IAM role trust policies as described in the previous two sections, you’re ready to change to a different regional SAML sign-in endpoint. The following high-level steps provide guidance on switching the regional SAML sign-in endpoint.

To switch regional SAML sign-in endpoints

  1. Change the configuration in the IdP to point to a different endpoint by changing the value for the ACS.
  2. Change the configuration for the RelayState value to match the Region of the ACS.
  3. Log in with your federated identity. In the browser, you should see the new ACS URL when you are prompted to choose an IAM role.
    Figure 1: New ACS URL

    Figure 1: New ACS URL

The steps to reconfigure the ACS and RelayState will be different for each IdP. Refer to the vendor’s IdP documentation for more information.

Conclusion

In this post, you learned how to configure multiple regional SAML sign-in endpoints as a best practice to further increase resiliency for federated access into your AWS environment. Check out the updates to the documentation for AWS Sign-In endpoints to help you choose the right configuration for your use case. Additionally, AWS has updated the metadata XML configuration for the public, GovCloud, and China AWS Regions to include all sign-in endpoints.

The simplest way to get started with SAML federation is to use AWS Single Sign-On (AWS SSO). AWS SSO helps manage your permissions across all of your AWS accounts in AWS Organizations.

If you have any questions, please post them in the Security Identity and Compliance re:Post topic or reach out to AWS Support.

Want more AWS Security news? Follow us on Twitter.

Jonathan VanKim

Jonathan VanKim

Jonathan VanKim is a Sr. Solutions Architect who specializes in Security and Identity for AWS. In 2014, he started working AWS Proserve and transitioned to SA 4 years later. His AWS career has been focused on helping customers of all sizes build secure AWS architectures. He enjoys snowboarding, wakesurfing, travelling, and experimental cooking.

Arynn Crow

Arynn Crow

Arynn Crow is a Manager of Product Management for AWS Identity. Arynn started at Amazon in 2012, trying out many different roles over the years before finding her happy place in security and identity in 2017. Arynn now leads the product team responsible for developing user authentication services at AWS.

How to automate AWS Managed Microsoft AD scaling based on utilization metrics

Post Syndicated from Dennis Rothmel original https://aws.amazon.com/blogs/security/how-to-automate-aws-managed-microsoft-ad-scaling-based-on-utilization-metrics/

AWS Directory Service for Microsoft Active Directory (AWS Managed Microsoft AD), provides a fully managed service for Microsoft Active Directory (AD) in the AWS cloud. When you create your directory, AWS deploys two domain controllers in separate Availability Zones that are exclusively yours for high availability. For use cases requiring even higher resilience and performance, in a specific Region or during specific hours, AWS Managed Microsoft AD allows you to scale by deploying additional domain controllers to meet your needs. These domain controllers can help load-balance, increase overall performance, or simply provide additional nodes to protect against temporary availability issues. AWS Managed Microsoft AD allows you to define the correct number of domain controllers for your directory based on your individual use case.

This post will walk you through how to automate scaling in AWS Managed Microsoft AD using utilization metrics from your directory. You’ll do this using Amazon CloudWatch Alarms, SNS notifications, and a Lambda function to increase the number of domain controllers in your directory based on utilization peaks.

Simplified directory scaling

AWS Managed Microsoft AD has now simplified this directory scaling process by integrating with Amazon CloudWatch metrics. This new integration enables you to:

  1. Analyze your directory to identify expected average and peak directory utilization
  2. Scale your directory based on utilization data to adequately address the expected load
  3. Automate the addition of domain controllers to handle unexpected load.

Integration is available for both domain controller utilization metrics such as CPU, Memory, Disk and Network, and for AD-specific metrics, such as LDAP searches, binds, DNS queries, and Directory reads/writes. Analyzing this data over time to identify expected average and peak utilization on your directory can help you deploy additional domain controllers in Regions that need them. Once you’ve established this utilization baseline, you can deploy additional domain controllers to service this load, and configure alarms for anything exceeding this baseline.

Solution overview

In this example, our AWS Managed Microsoft AD has the default two domain controllers; once your utilization threshold is reached, you’ll add one additional domain controller (domain controller 3 in the diagram) to cover this additional load. 

Figure 1: Solution overview

Figure 1: Solution overview

To create a CloudWatch Alarm with SNS topic notifications

  1. In the AWS Console, navigate to CloudWatch
  2. Choose Metrics to see the Browse Metrics panel
  3. Choose the Directory Service namespace, then choose AWS Managed Microsoft AD.
  4. In the Directory ID column, select your directory and check search for this only.
  5. From the Metric Category column, select Processor from Metric Category and check add to search. This view will show the processor utilization for your directory.
  6.  

    Figure 2. Processor utilization metrics

    Figure 2. Processor utilization metrics

  7. To see the average utilization across all domain controllers, choose Add Math, then All Functions, then AVG to create a metric math expression for average CPU utilization across all domain controllers.
  8.  

    Figure 3. Adding a math function to compute average

    Figure 3. Adding a math function to compute average

  9. Next, choose the Graphed Metrics tab in the CloudWatch metrics console, select the newly created expression, then select the bell icon from the Actions column to create a CloudWatch alarm based on this metric.
  10. Figure 4. Create a CloudWatch Alarm using Metric Math Expression

    Figure 4. Create a CloudWatch Alarm using Metric Math Expression

  11. Configure the threshold alarm to trigger when CPU utilization exceeds 70%.
  12.  In the Metrics section, under Period, choose 1 Hour.
     In the Conditions section, under Threshold Type, choose Static. Under Define the alarm condition, choose Greater than threshold. Under Define the threshold value, enter 70. See Figure 5 for an image of how alarm parameters should look on your screen. Choose Next to Configure actions

    Figure 5. Configure the alarm parameters

    Figure 5. Configure the alarm parameters

  13. On the Configure actions screen, configure the actions using the parameters listed below to send an email notification when the alarm state is triggered. See Figure 6 for an image of how email notifications are configured.
  14.   In the Notification section, set Alarm state trigger to In alarm.   Set Select an SNS topic to Create topic.  Fill in the name of the alarm in the Create a new topic field, and add the email where notifications should be sent to the Email endpoints that will receive notification field. An email address is required to create the SNS topic and you should use an email address that’s accessible by your operations team. This SNS topic will be used to trigger the Lambda automation described in a later section. Note: make a note of the SNS topic name you chose; you will use it later when creating the Lambda function in the To create an AWS Lambda function to automate scale out procedure below. 

    Figure 6. Create SNS topic and email notification

    Figure 6. Create SNS topic and email notification

  15. In the Alarm name field, provide a name for the alarm. You can optionally also add an Alarm description. Choose Next.
  16. Review your configuration, and choose Create alarm to create the alarm.

Once you’ve completed these steps, you will now have an alarm implemented for when domain controller CPU utilization exceeds an average of 70% across both domain controllers. This will trigger an SNS topic when your directory is experiencing a heavy load, which will be used to start the Lambda automation and will send an informational email notification. In the next section, we’ll configure an AWS Lambda function to automate the addition of a domain controller based on this SNS topic.

For additional details on CloudWatch Alarms, please see the Amazon CloudWatch documentation.

To create an AWS Lambda function to automate scale out

The sample Lambda function shown below checks the number of domain controllers in this Region, and increases that by adding one additional domain controller. This procedure describes how to configure the IAM role required for this Lambda function, then how to deploy the Lambda function to execute when the alarm is triggered to automatically add a domain controller when your load exceeds your typical usage baseline.

Note: For additional details on Lambda creation, please see the AWS Lambda documentation.

To automate scale-out using AWS Lambda

  1. In the AWS Console, navigate to IAM and choose Policies, then choose Create Policy.
  2. Choose the JSON tab, and create a new IAM role using the policy provided in JSON below.
  3. For more details on this configuration, see the AWS Directory Service documentation.

    Sample policy

     

    {
    	"Version":"2012-10-17",
    	"Statement":[
    	{
    		"Effect":"Allow",
    		"Action":[
    			"ds:DescribeDomainControllers",
    			"ds:UpdateNumberOfDomainControllers",
    			"ec2:DescribeSubnets",
    			"ec2:DescribeVpcs",
    			"ec2:CreateNetworkInterface",
    			"ec2:DescribeNetworkInterfaces",
    			"ec2:DeleteNetworkInterface"
    		],
    		"Resource":"*"
    	}
    	]
    }
    

  4. Choose Next:Tags to add tags (optional) before choosing Next:Review.
  5. On the Create Policy screen, provide a name in the Name field. You can optionally also add a description. Choose Create policy to complete creating the new policy.
  6.  
    Note: make a note of the policy name you chose; you will use it later when updating the execution role for the Lambda function.  

    Figure 7. Provide a name to create the IAM policy

    Figure 7. Provide a name to create the IAM policy

  7. In the AWS Console, navigate to Lambda and choose Create Function
  8. On the Create Function screen, select Author from Scratch and provide a Name, then choose Create Function.
  9. Figure 8. Create a Lambda function

    Figure 8. Create a Lambda function

  10. Once created, on the Lambda function’s page, choose the Configuration tab, then choose Permissions from the sidebar and choose the execution role name linked under Role name. This will open the IAM console in another tab, preloaded to your Lambda execution role.
  11.  

    Figure 9: Select the Execution Role

    Figure 9: Select the Execution Role

  12. On the execution role screen, choose Attach policies and select the IAM policy you’ve just created (e.g. DirectoryService-DCNumber Update). On the Attach Permissions screen, choose Attach policy to complete updating the execution role. Once completed this step, you may close this tab and return the previous browser tab.
  13.  

    Figure 10. Select and attach the IAM policy

    Figure 10. Select and attach the IAM policy

  14. On the Lambda function screen, choose the Configuration tab, then choose Triggers from the sidebar.
  15. On the Add Trigger screen, choose the pulldown under Trigger configuration and select SNS. On the SNS topic box, select the SNS topic you created in Step 9 of the To create a CloudWatch Alarm with SNS topic notifications procedure above. Then choose Add to complete the trigger configuration.
  16. On the Lambda function screen, choose the Configuration tab, then choose Environment variables from the sidebar.
  17. On the Environment variables card, click Edit.
  18. On the Edit environment variables screen, choose Add environment variables and use the Key “DIRECTORY_ID” and the Value will be the directory ID for you AWS Managed Microsoft AD.
  19.  

    Figure 11. The "Edit environment variables" screen

    Figure 11. The “Edit environment variables” screen

  20. On the Lambda function screen, choose the Code tab to open the in-browser code editor experience inside the Code source card. Paste in the sample Lambda function code given below to complete the implementation.
  21.  

    Figure 12. Paste sample code to complete the Lambda function setup

    Figure 12. Paste sample code to complete the Lambda function setup

Sample Lambda function code

The sample Lambda function given below automates adding another domain controller to your directory. When your CloudWatch alarm triggers, you will receive a notification email, and an additional domain controller will be deployed to provide the added capacity to support the increase in directory usage.

Note: The example code contains a variable for the maximum number of domain controllers (maxDcNum), to prevent you from over provisioning in the event of a missed configuration. This value is set to 3 for this blog post’s example and can be increased to suit your use case. 

import json
import boto3

maxDcNum = 10
minDcNum = 2
region   = "us-east-1"
dsId = "d-906752246f"

ds = boto3.client('ds', region_name=region)

def lambda_handler(event, context):
    
    ## get the current number of domain controllers
    dcs = ds.describe_domain_controllers(DirectoryId = dsId)

    DomainControllers = dcs["DomainControllers"]
    
    DCcount = len(DomainControllers)
    print(">>> Current number of DCs:" + str(DCcount))

    #increase the number of DCs
    if DCcount < maxDcNum:
        NewDCnumber = DCcount + 1 
        response = ds.update_number_of_domain_controllers(DirectoryId = dsId, DesiredNumber =  NewDCnumber);    

        return {
            'statusCode': 200,
            'body': json.dumps("New DC number will be " + str(NewDCnumber))
        }
    else:
        return {
            'statusCode': 200,
            'body': json.dumps("Max number of DCs reached. The number of DCs is" + str(DCcount))
        }

Note: When testing this Lambda function, remember that this will increase the number of domain controllers for your directory in that Region. If the additional domain controller is not needed, please reduce the count after the test to avoid costs for an additional domain controller. The same principles used in this article to automate the addition of domain controllers can be applied to automate the reduction of domain controllers and you should consider automating the reduction to optimize for resilience, performance and cost.

Conclusion

In this post, you’ve implemented alarms based on thresholds in Domain Controller utilization using AWS CloudWatch and automation to increase the number of domain controllers using AWS Lambda functions. This solution helps to cost-effectively improve resilience and performance of your directory, by scaling your directory based on historical load patterns.

To learn more about using AWS Managed Microsoft AD, visit the AWS Directory Service documentation. For general information and pricing, see the AWS Directory Service home page. If you have comments about this blog post, submit a comment in the Comments section below. If you have implementation or troubleshooting questions, start a new thread on the Directory Service forum or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Dennis Rothmel

Dennis is a Sr. Product Manager from AWS Identity with a passion for modernization and automation. He has worked globally across Asia Pacific, Europe, and America, and enjoys the travel, languages, cultures and foods that come with global working.

Author

Vladimir Provorov

Vladimir is a Product Solutions Architect from AWS Identity focused on Workforce Identity and Directory Service. He works on developing new features to make Enterprise Identity simpler and more scalable. He is excited to travel and explore the world with his family.

Keeping Netflix Reliable Using Prioritized Load Shedding

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/keeping-netflix-reliable-using-prioritized-load-shedding-6cc827b02f94

How viewers are able to watch their favorite show on Netflix while the infrastructure self-recovers from a system failure

By Manuel Correa, Arthur Gonigberg, and Daniel West

Getting stuck in traffic is one of the most frustrating experiences for drivers around the world. Everyone slows to a crawl, sometimes for a minor issue or sometimes for no reason at all. As engineers at Netflix, we are constantly reevaluating how to redesign traffic management. What if we knew the urgency of each traveler and could selectively route cars through, rather than making everyone wait?

In Netflix engineering, we’re driven by ensuring Netflix is there when you need it to be. Yet, as recent as last year, our systems were susceptible to metaphorical traffic jams; we had on/off circuit breakers, but no progressive way to shed load. Motivated by improving the lives of our members, we’ve introduced priority-based progressive load shedding.

The animation below shows the behavior of the Netflix viewer experience when the backend is throttling traffic based on priority. While the lower priority requests are throttled, the playback experience remains uninterrupted and the viewer is able to enjoy their title. Let’s dig into how we accomplished this.

Failure can occur due to a myriad of reasons: misbehaving clients that trigger a retry storm, an under-scaled service in the backend, a bad deployment, a network blip, or issues with the cloud provider. All such failures can put a system under unexpected load, and at some point in the past, every single one of these examples has prevented our members’ ability to play. With these incidents in mind, we set out to make Netflix more resilient with these goals:

  1. Consistently prioritize requests across device types (Mobile, Browser, and TV)
  2. Progressively throttle requests based on priority
  3. Validate assumptions by using Chaos Testing (deliberate fault injection) for requests of specific priorities

The resulting architecture that we envisioned with priority throttling and chaos testing included is captured below.

High level playback architecture with priority throttling and chaos testing

Building a request taxonomy

We decided to focus on three dimensions in order to categorize request traffic: throughput, functionality, and criticality. Based on these characteristics, traffic was classified into the following:

  • NON_CRITICAL: This traffic does not affect playback or members’ experience. Logs and background requests are examples of this type of traffic. These requests are usually high throughput which contributes to a large percentage of load in the system.
  • DEGRADED_EXPERIENCE: This traffic affects members’ experience, but not the ability to play. The traffic in this bucket is used for features like: stop and pause markers, language selection in the player, viewing history, and others.
  • CRITICAL: This traffic affects the ability to play. Members will see an error message when they hit play if the request fails.

Using attributes of the request, the API gateway service (Zuul) categorizes the requests into NON_CRITICAL, DEGRADED_EXPERIENCE and CRITICAL buckets, and computes a priority score between 1 to 100 for each request given its individual characteristics. The computation is done as a first step so that it is available for the rest of the request lifecycle.

Most of the time, the request workflow proceeds normally without taking the request priority into account. However, as with any service, sometimes we reach a point when either one of our backends is in trouble or Zuul itself is in trouble. When that happens requests with higher priority get preferential treatment. The higher priority requests will get served, while the lower priority ones will not. The implementation is analogous to a priority queue with a dynamic priority threshold. This allows Zuul to drop requests with a priority lower than the current threshold.

Finding the best place to throttle traffic

Zuul can apply load shedding in two moments during the request lifecycle: when it routes requests to a specific back-end service (service throttling) or at the time of initial request processing, which affects all back-end services (global throttling).

Service throttling

Zuul can sense when a back-end service is in trouble by monitoring the error rates and concurrent requests to that service. Those two metrics are approximate indicators of failures and latency. When the threshold percentage for one of these two metrics is crossed, we reduce load on the service by throttling traffic.

Global throttling

Another case is when Zuul itself is in trouble. As opposed to the scenario above, global throttling will affect all back-end services behind Zuul, rather than a single back-end service. The impact of this global throttling can cause much bigger problems for members. The key metrics used to trigger global throttling are CPU utilization, concurrent requests, and connection count. When any of the thresholds for those metrics are crossed, Zuul will aggressively throttle traffic to keep itself up and healthy while the system recovers. This functionality is critical: if Zuul goes down, no traffic can get through to our backend services, resulting in a total outage.

Introducing priority-based progressive load shedding

Once we had the prioritization piece in place, we were able to combine it with our load shedding mechanism to dramatically improve streaming reliability. When we’re in a bad situation (i.e. any of the thresholds above are exceeded), we progressively drop traffic, starting with the lowest priority. A cubic function is used to manage the level of throttling. If things get really, really bad the level will hit the sharp side of the curve, throttling everything.

The graph above is an example of how the cubic function is applied. As the overload percentage increases (i.e. the range between the throttling threshold and the max capacity), the priority threshold trails it very slowly: at 35%, it’s still in the mid-90s. If the system continues to degrade, we hit priority 50 at 80% exceeded and then eventually 10 at 95%, and so on.

Given that a relatively small amount of requests impact streaming availability, throttling low priority traffic may affect certain product features but will not prevent members pressing “play” and watching their favorite show. By adding progressive priority-based load shedding, Zuul can shed enough traffic to stabilize services without members noticing.

Handling retry storms

When Zuul decides to drop traffic, it sends a signal to devices to let them know that we need them to back off. It does this by indicating how many retries they can perform and what kind of time window they can perform them in. For example:

{ “maxRetries” : <max-retries>, “retryAfterSeconds”: <seconds> }

Using this backpressure mechanism, we can stop retry storms much faster than we could in the past. We automatically adjust these two dials based on the priority of the request. Requests with higher priority will retry more aggressively than lower ones, also increasing streaming availability.

Validating which requests are right for the job

To validate our request taxonomy assumptions on whether a specific request fell into the NON_CRITICAL, DEGRADED, or CRITICAL bucket, we needed a way to test the user’s experience when that request was shed. To accomplish this, we leveraged our internal failure injection tool (FIT) and created a failure injection point in Zuul that allowed us to shed any request based on a supplied priority. This enabled us to manually simulate a load shedded experience by blocking ranges of priorities for a specific device or member, giving us an idea of which requests could be safely shed without impacting the user.

Continually ensuring those requests are still right for the job

One of the goals here is to reduce members’ pain by shedding requests that are not expected to affect the user’s streaming experience. However, Netflix changes quickly and requests that were thought to be noncritical can unexpectedly become critical. In addition, Netflix has a wide variety of client devices, client versions, and ways to interact with the system. To make sure we weren’t causing members pain when throttling NON_CRITICAL requests in any of these scenarios, we leveraged our infrastructure experimentation platform ChAP.

This platform allows us to stage an A/B experiment that will allocate a small number of production users to either a control or treatment group for 45 minutes while throttling a range of priorities for the treatment group. This lets us capture a variety of live use cases and measure the impact to their playback experience. ChAP analyzes the members’ KPIs per device to determine if there is a deviation between the control and the treatment groups.

In our first experiment, we detected a race condition in both Android and iOS devices for a low priority request that caused sporadic playback errors. Since we practice continuous experimentation, once the initial experiments were run and the bugs were fixed, we scheduled them to run on a periodic basis. This allows us to detect regressions early and keep users streaming.

Experiment regression detection before and after fix (SPS indicates streaming availability)

Reaping the benefits

In 2019, before progressive load shedding was in place, the Netflix streaming services experienced an outage that resulted in a sizable percentage of members who were not able to play for a period of time. In 2020, days after the implementation was deployed, the team started seeing the benefit of the solution. Netflix experienced a similar issue with the same potential impact as the outage seen in 2019. Unlike then, Zuul’s progressive load shedding kicked in and started shedding traffic until the service was in a healthy state without impacting members’ ability to play at all.

The graph below shows a stable streaming availability metric stream per second (SPS) while Zuul is performing progressive load shedding based on request priority during the incident. The different colors in the graph represent requests with different priority being throttled.

Members were happily watching their favorite show on Netflix while the infrastructure was self-recovering from a system failure.

We are not done yet

For future work, the team is looking into expanding the use of request priority for other use cases like better retry policies between devices and back-ends, dynamically changing load shedding thresholds, tuning the request priorities using Chaos Testing as a guiding principle, and other areas that will make Netflix even more resilient.

If you’re interested in helping Netflix stay up in the face of shifting systems and unexpected failures, reach out to us. We’re hiring!


Keeping Netflix Reliable Using Prioritized Load Shedding was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

AWS Online Tech Talks – May and Early June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-may-and-early-june-2018/

AWS Online Tech Talks – May and Early June 2018  

Join us this month to learn about some of the exciting new services and solution best practices at AWS. We also have our first re:Invent 2018 webinar series, “How to re:Invent”. Sign up now to learn more, we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

Analytics & Big Data

May 21, 2018 | 11:00 AM – 11:45 AM PT Integrating Amazon Elasticsearch with your DevOps Tooling – Learn how you can easily integrate Amazon Elasticsearch Service into your DevOps tooling and gain valuable insight from your log data.

May 23, 2018 | 11:00 AM – 11:45 AM PTData Warehousing and Data Lake Analytics, Together – Learn how to query data across your data warehouse and data lake without moving data.

May 24, 2018 | 11:00 AM – 11:45 AM PTData Transformation Patterns in AWS – Discover how to perform common data transformations on the AWS Data Lake.

Compute

May 29, 2018 | 01:00 PM – 01:45 PM PT – Creating and Managing a WordPress Website with Amazon Lightsail – Learn about Amazon Lightsail and how you can create, run and manage your WordPress websites with Amazon’s simple compute platform.

May 30, 2018 | 01:00 PM – 01:45 PM PTAccelerating Life Sciences with HPC on AWS – Learn how you can accelerate your Life Sciences research workloads by harnessing the power of high performance computing on AWS.

Containers

May 24, 2018 | 01:00 PM – 01:45 PM PT – Building Microservices with the 12 Factor App Pattern on AWS – Learn best practices for building containerized microservices on AWS, and how traditional software design patterns evolve in the context of containers.

Databases

May 21, 2018 | 01:00 PM – 01:45 PM PTHow to Migrate from Cassandra to Amazon DynamoDB – Get the benefits, best practices and guides on how to migrate your Cassandra databases to Amazon DynamoDB.

May 23, 2018 | 01:00 PM – 01:45 PM PT5 Hacks for Optimizing MySQL in the Cloud – Learn how to optimize your MySQL databases for high availability, performance, and disaster resilience using RDS.

DevOps

May 23, 2018 | 09:00 AM – 09:45 AM PT.NET Serverless Development on AWS – Learn how to build a modern serverless application in .NET Core 2.0.

Enterprise & Hybrid

May 22, 2018 | 11:00 AM – 11:45 AM PTHybrid Cloud Customer Use Cases on AWS – Learn how customers are leveraging AWS hybrid cloud capabilities to easily extend their datacenter capacity, deliver new services and applications, and ensure business continuity and disaster recovery.

IoT

May 31, 2018 | 11:00 AM – 11:45 AM PTUsing AWS IoT for Industrial Applications – Discover how you can quickly onboard your fleet of connected devices, keep them secure, and build predictive analytics with AWS IoT.

Machine Learning

May 22, 2018 | 09:00 AM – 09:45 AM PTUsing Apache Spark with Amazon SageMaker – Discover how to use Apache Spark with Amazon SageMaker for training jobs and application integration.

May 24, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS DeepLens – Learn how AWS DeepLens provides a new way for developers to learn machine learning by pairing the physical device with a broad set of tutorials, examples, source code, and integration with familiar AWS services.

Management Tools

May 21, 2018 | 09:00 AM – 09:45 AM PTGaining Better Observability of Your VMs with Amazon CloudWatch – Learn how CloudWatch Agent makes it easy for customers like Rackspace to monitor their VMs.

Mobile

May 29, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive on Amazon Pinpoint Segmentation and Endpoint Management – See how segmentation and endpoint management with Amazon Pinpoint can help you target the right audience.

Networking

May 31, 2018 | 09:00 AM – 09:45 AM PTMaking Private Connectivity the New Norm via AWS PrivateLink – See how PrivateLink enables service owners to offer private endpoints to customers outside their company.

Security, Identity, & Compliance

May 30, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS Certificate Manager Private Certificate Authority (CA) – Learn how AWS Certificate Manager (ACM) Private Certificate Authority (CA), a managed private CA service, helps you easily and securely manage the lifecycle of your private certificates.

June 1, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS Firewall Manager – Centrally configure and manage AWS WAF rules across your accounts and applications.

Serverless

May 22, 2018 | 01:00 PM – 01:45 PM PTBuilding API-Driven Microservices with Amazon API Gateway – Learn how to build a secure, scalable API for your application in our tech talk about API-driven microservices.

Storage

May 30, 2018 | 11:00 AM – 11:45 AM PTAccelerate Productivity by Computing at the Edge – Learn how AWS Snowball Edge support for compute instances helps accelerate data transfers, execute custom applications, and reduce overall storage costs.

June 1, 2018 | 11:00 AM – 11:45 AM PTLearn to Build a Cloud-Scale Website Powered by Amazon EFS – Technical deep dive where you’ll learn tips and tricks for integrating WordPress, Drupal and Magento with Amazon EFS.

 

 

 

 

Bad Software Is Our Fault

Post Syndicated from Bozho original https://techblog.bozho.net/bad-software-is-our-fault/

Bad software is everywhere. One can even claim that every software is bad. Cool companies, tech giants, established companies, all produce bad software. And no, yours is not an exception.

Who’s to blame for bad software? It’s all complicated and many factors are intertwined – there’s business requirements, there’s organizational context, there’s lack of sufficient skilled developers, there’s the inherent complexity of software development, there’s leaky abstractions, reliance on 3rd party software, consequences of wrong business and purchase decisions, time limitations, flawed business analysis, etc. So yes, despite the catchy title, I’m aware it’s actually complicated.

But in every “it’s complicated” scenario, there’s always one or two factors that are decisive. All of them contribute somehow, but the major drivers are usually a handful of things. And in the case of base software, I think it’s the fault of technical people. Developers, architects, ops.

We don’t seem to care about best practices. And I’ll do some nasty generalizations here, but bear with me. We can spend hours arguing about tabs vs spaces, curly bracket on new line, git merge vs rebase, which IDE is better, which framework is better and other largely irrelevant stuff. But we tend to ignore the important aspects that span beyond the code itself. The context in which the code lives, the non-functional requirements – robustness, security, resilience, etc.

We don’t seem to get security. Even trivial stuff such as user authentication is almost always implemented wrong. These days Twitter and GitHub realized they have been logging plain-text passwords, for example, but that’s just the tip of the iceberg. Too often we ignore the security implications.

“But the business didn’t request the security features”, one may say. The business never requested 2-factor authentication, encryption at rest, PKI, secure (or any) audit trail, log masking, crypto shredding, etc., etc. Because the business doesn’t know these things – we do and we have to put them on the backlog and fight for them to be implemented. Each organization has its specifics and tech people can influence the backlog in different ways, but almost everywhere we can put things there and prioritize them.

The other aspect is testing. We should all be well aware by now that automated testing is mandatory. We have all the tools in the world for unit, functional, integration, performance and whatnot testing, and yet many software projects lack the necessary test coverage to be able to change stuff without accidentally breaking things. “But testing takes time, we don’t have it”. We are perfectly aware that testing saves time, as we’ve all had those “not again!” recurring bugs. And yet we think of all sorts of excuses – “let the QAs test it”, we have to ship that now, we’ll test it later”, “this is too trivial to be tested”, etc.

And you may say it’s not our job. We don’t define what has do be done, we just do it. We don’t define the budget, the scope, the features. We just write whatever has been decided. And that’s plain wrong. It’s not our job to make money out of our code, and it’s not our job to define what customers need, but apart from that everything is our job. The way the software is structured, the security aspects and security features, the stability of the code base, the way the software behaves in different environments. The non-functional requirements are our job, and putting them on the backlog is our job.

You’ve probably heard that every software becomes “legacy” after 6 months. And that’s because of us, our sloppiness, our inability to mitigate external factors and constraints. Too often we create a mess through “just doing our job”.

And of course that’s a generalization. I happen to know a lot of great professionals who don’t make these mistakes, who strive for excellence and implement things the right way. But our industry as a whole doesn’t. Our industry as a whole produces bad software. And it’s our fault, as developers – as the only people who know why a certain piece of software is bad.

In a talk of his, Bob Martin warns us of the risks of our sloppiness. We have been building websites so far, but we are more and more building stuff that interacts with the real world, directly and indirectly. Ultimately, lives may depend on our software (like the recent unfortunate death caused by a self-driving car). And I’ll agree with Uncle Bob that it’s high time we self-regulate as an industry, before some technically incompetent politician decides to do that.

How, I don’t know. We’ll have to think more about it. But I’m pretty sure it’s our fault that software is bad, and no amount of blaming the management, the budget, the timing, the tools or the process can eliminate our responsibility.

Why do I insist on bashing my fellow software engineers? Because if we start looking at software development with more responsibility; with the fact that if it fails, it’s our fault, then we’re more likely to get out of our current bug-ridden, security-flawed, fragile software hole and really become the experts of the future.

The post Bad Software Is Our Fault appeared first on Bozho's tech blog.

The Digital Security Exchange Is Live

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/04/the_digital_sec.html

Last year I wrote about the Digital Security Exchange. The project is live:

The DSX works to strengthen the digital resilience of U.S. civil society groups by improving their understanding and mitigation of online threats.

We do this by pairing civil society and social sector organizations with credible and trustworthy digital security experts and trainers who can help them keep their data and networks safe from exposure, exploitation, and attack. We are committed to working with community-based organizations, legal and journalistic organizations, civil rights advocates, local and national organizers, and public and high-profile figures who are working to advance social, racial, political, and economic justice in our communities and our world.

If you are either an organization who needs help, or an expert who can provide help, visit their website.

Note: I am on their advisory committee.

What We’re Thankful For

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/what-were-thankful-for/

All of us at Backblaze hope you have a wonderful Thanksgiving, and that you can enjoy it with family and friends. We asked everyone at Backblaze to express what they are thankful for. Here are their responses.

Fall leaves

What We’re Thankful For

Aside from friends, family, hobbies, health, etc. I’m thankful for my home. It’s not much, but it’s mine, and allows me to indulge in everything listed above. Or not, if I so choose. And coffee.

— Tony

I’m thankful for my wife Jen, and my other friends. I’m thankful that I like my coworkers and can call them friends too. I’m thankful for my health. I’m thankful that I was born into a middle class family in the US and that I have been very, very lucky because of that.

— Adam

Besides the most important things which are being thankful for my family, my health and my friends, I am very thankful for Backblaze. This is the first job I’ve ever had where I truly feel like I have a great work/life balance. With having 3 kids ages 8, 6 and 4, a husband that works crazy hours and my tennis career on the rise (kidding but I am on 4 teams) it’s really nice to feel like I have balance in my life. So cheers to Backblaze – where a girl can have it all!

— Shelby

I am thankful to work at a high-tech company that recognizes the contributions of engineers in their 40s and 50s.

— Jeannine

I am thankful for the music, the songs I’m singing. Thankful for all the joy they’re bringing. Who can live without it, I ask in all honesty? What would life be? Without a song or a dance what are we? So I say thank you for the music. For giving it to me!

— Yev

I’m thankful that I don’t look anything like the portrait my son draws of me…seriously.

— Natalie

I am thankful to work for a company that puts its people and product ahead of profits.

— James

I am thankful that even in the middle of disasters, turmoil, and violence, there are always people who commit amazing acts of generosity, courage, and kindness that restore my faith in mankind.

— Roderick

The future.

— Ahin

The Future

I am thankful for the current state of modern inexpensive broadband networking that allows me to stay in touch with friends and family that are far away, allows Backblaze to exist and pay my salary so I can live comfortably, and allows me to watch cat videos for free. The internet makes this an amazing time to be alive.

— Brian

Other than being thankful for family & good health, I’m quite thankful through the years I’ve avoided losing any of my 12+TB photo archive. 20 years of photoshoots, family photos and cell phone photos kept safe through changing storage media (floppy drives, flopticals, ZIP, JAZ, DVD-RAM, CD, DVD and hard drives), not to mention various technology/software solutions. It’s a data minefield out there, especially in the long run with changing media formats.

— Jim

I am thankful for non-profit organizations and their volunteers, such as IMAlive. Possibly the greatest gift you can give someone is empowerment, and an opportunity for them to recognize their own resilience and strength.

— Emily

I am thankful for my loving family, friends who make me laugh, a cool company to work for, talented co-workers who make me a better engineer, and beautiful Fall days in Wisconsin!

— Marjorie

Marjorie Wisconsin

I’m thankful for preschool drawings about thankfulness.

— Adam

I am thankful for new friends and working for a company that allows us to be ourselves.

— Annalisa

I’m thankful for my dog as I always find a reason to smile at him everyday. Yes, he still smells from his skunkin’ last week and he tracks mud in my house, but he came from the San Quentin puppy-prisoner program and I’m thankful I found him and that he found me! My vet is thankful as well.

— Terry

I’m thankful that my colleagues are also my friends outside of the office and that the rain season has started in California.

— Aaron

I’m thankful for family, friends, and beer. Mostly for family and friends, but beer is really nice too!

— Ken

There are so many amazing blessings that make up my daily life that I thank God for, so here I go – my basic needs of food, water and shelter, my husband and 2 daughters and the rest of the family (here and abroad) — their love, support, health, and safety, waking up to a new day every day, friends, music, my job, funny things, hugs and more hugs (who does not like hugs?).

— Cecilia

I am thankful to be blessed with a close-knit extended family, and for everything they do for my new, growing family. With a toddler and a second child on the way, it helps having so many extra sets of hands around to help with the kids!

— Zack

I’m thankful for family and friends, the opportunities my parents gave me by moving the U.S., and that all of us together at Backblaze have built a place to be proud of.

— Gleb

Aside for being thankful for family and friends, I am also thankful I live in a place with such natural beauty. Being so close to mountains and the ocean, and everything in between, is something that I don’t take for granted!

— Sona

I’m thankful for my wonderful wife, family, friends, and co-workers. I’m thankful for having a happy and healthy son, and the chance to watch him grow on a daily basis.

— Ariel

I am thankful for a dog-friendly workplace.

— LeAnn

I’m thankful for my amazing new wife and that she’s as much of a nerd as I am.

— Troy

I am thankful for every reunion with my siblings and families.

— Cecilia

I am thankful for my funny, strong-willed, happy daughter, my awesome husband, my family, and amazing friends. I am also thankful for the USA and all the opportunities that come with living here. Finally, I am thankful for Backblaze, a truly great place to work and for all of my co-workers/friends here.

— Natasha

I am thankful that I do not need to hunt and gather everyday to put food on the table but at the same time I feel that I don’t appreciate the food the sits before me as much as I should. So I use Thanksgiving to think about the people and the animals that put food on my family’s table.

— KC

I am thankful for my cat, Catnip. She’s been with me for 18 years and seen me through so many ups and downs. She’s been along my side through two long-term relationships, several moves, and one marriage. I know we don’t have much time together and feel blessed every day she’s here.

— JC

I am thankful for imperfection and misshapen candies. The imperceptible romance of sunsets through bus windows. The dream that family, friends, co-workers, and strangers are connected by love. I am thankful to my ancestors for enduring so much hardship so that I could be here enjoying Bay Area burritos.

— Damon

Autumn leaves

The post What We’re Thankful For appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Event-Driven Computing with Amazon SNS and AWS Compute, Storage, Database, and Networking Services

Post Syndicated from Christie Gifrin original https://aws.amazon.com/blogs/compute/event-driven-computing-with-amazon-sns-compute-storage-database-and-networking-services/

Contributed by Otavio Ferreira, Manager, Software Development, AWS Messaging

Like other developers around the world, you may be tackling increasingly complex business problems. A key success factor, in that case, is the ability to break down a large project scope into smaller, more manageable components. A service-oriented architecture guides you toward designing systems as a collection of loosely coupled, independently scaled, and highly reusable services. Microservices take this even further. To improve performance and scalability, they promote fine-grained interfaces and lightweight protocols.

However, the communication among isolated microservices can be challenging. Services are often deployed onto independent servers and don’t share any compute or storage resources. Also, you should avoid hard dependencies among microservices, to preserve maintainability and reusability.

If you apply the pub/sub design pattern, you can effortlessly decouple and independently scale out your microservices and serverless architectures. A pub/sub messaging service, such as Amazon SNS, promotes event-driven computing that statically decouples event publishers from subscribers, while dynamically allowing for the exchange of messages between them. An event-driven architecture also introduces the responsiveness needed to deal with complex problems, which are often unpredictable and asynchronous.

What is event-driven computing?

Given the context of microservices, event-driven computing is a model in which subscriber services automatically perform work in response to events triggered by publisher services. This paradigm can be applied to automate workflows while decoupling the services that collectively and independently work to fulfil these workflows. Amazon SNS is an event-driven computing hub, in the AWS Cloud, that has native integration with several AWS publisher and subscriber services.

Which AWS services publish events to SNS natively?

Several AWS services have been integrated as SNS publishers and, therefore, can natively trigger event-driven computing for a variety of use cases. In this post, I specifically cover AWS compute, storage, database, and networking services, as depicted below.

Compute services

  • Auto Scaling: Helps you ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. You can configure Auto Scaling lifecycle hooks to trigger events, as Auto Scaling resizes your EC2 cluster.As an example, you may want to warm up the local cache store on newly launched EC2 instances, and also download log files from other EC2 instances that are about to be terminated. To make this happen, set an SNS topic as your Auto Scaling group’s notification target, then subscribe two Lambda functions to this SNS topic. The first function is responsible for handling scale-out events (to warm up cache upon provisioning), whereas the second is in charge of handling scale-in events (to download logs upon termination).

  • AWS Elastic Beanstalk: An easy-to-use service for deploying and scaling web applications and web services developed in a number of programming languages. You can configure event notifications for your Elastic Beanstalk environment so that notable events can be automatically published to an SNS topic, then pushed to topic subscribers.As an example, you may use this event-driven architecture to coordinate your continuous integration pipeline (such as Jenkins CI). That way, whenever an environment is created, Elastic Beanstalk publishes this event to an SNS topic, which triggers a subscribing Lambda function, which then kicks off a CI job against your newly created Elastic Beanstalk environment.

  • Elastic Load Balancing: Automatically distributes incoming application traffic across Amazon EC2 instances, containers, or other resources identified by IP addresses.You can configure CloudWatch alarms on Elastic Load Balancing metrics, to automate the handling of events derived from Classic Load Balancers. As an example, you may leverage this event-driven design to automate latency profiling in an Amazon ECS cluster behind a Classic Load Balancer. In this example, whenever your ECS cluster breaches your load balancer latency threshold, an event is posted by CloudWatch to an SNS topic, which then triggers a subscribing Lambda function. This function runs a task on your ECS cluster to trigger a latency profiling tool, hosted on the cluster itself. This can enhance your latency troubleshooting exercise by making it timely.

Storage services

  • Amazon S3: Object storage built to store and retrieve any amount of data.You can enable S3 event notifications, and automatically get them posted to SNS topics, to automate a variety of workflows. For instance, imagine that you have an S3 bucket to store incoming resumes from candidates, and a fleet of EC2 instances to encode these resumes from their original format (such as Word or text) into a portable format (such as PDF).In this example, whenever new files are uploaded to your input bucket, S3 publishes these events to an SNS topic, which in turn pushes these messages into subscribing SQS queues. Then, encoding workers running on EC2 instances poll these messages from the SQS queues; retrieve the original files from the input S3 bucket; encode them into PDF; and finally store them in an output S3 bucket.

  • Amazon EFS: Provides simple and scalable file storage, for use with Amazon EC2 instances, in the AWS Cloud.You can configure CloudWatch alarms on EFS metrics, to automate the management of your EFS systems. For example, consider a highly parallelized genomics analysis application that runs against an EFS system. By default, this file system is instantiated on the “General Purpose” performance mode. Although this performance mode allows for lower latency, it might eventually impose a scaling bottleneck. Therefore, you may leverage an event-driven design to handle it automatically.Basically, as soon as the EFS metric “Percent I/O Limit” breaches 95%, CloudWatch could post this event to an SNS topic, which in turn would push this message into a subscribing Lambda function. This function automatically creates a new file system, this time on the “Max I/O” performance mode, then switches the genomics analysis application to this new file system. As a result, your application starts experiencing higher I/O throughput rates.

  • Amazon Glacier: A secure, durable, and low-cost cloud storage service for data archiving and long-term backup.You can set a notification configuration on an Amazon Glacier vault so that when a job completes, a message is published to an SNS topic. Retrieving an archive from Amazon Glacier is a two-step asynchronous operation, in which you first initiate a job, and then download the output after the job completes. Therefore, SNS helps you eliminate polling your Amazon Glacier vault to check whether your job has been completed, or not. As usual, you may subscribe SQS queues, Lambda functions, and HTTP endpoints to your SNS topic, to be notified when your Amazon Glacier job is done.

  • AWS Snowball: A petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data.You can leverage Snowball notifications to automate workflows related to importing data into and exporting data from AWS. More specifically, whenever your Snowball job status changes, Snowball can publish this event to an SNS topic, which in turn can broadcast the event to all its subscribers.As an example, imagine a Geographic Information System (GIS) that distributes high-resolution satellite images to users via Web browser. In this example, the GIS vendor could capture up to 80 TB of satellite images; create a Snowball job to import these files from an on-premises system to an S3 bucket; and provide an SNS topic ARN to be notified upon job status changes in Snowball. After Snowball changes the job status from “Importing” to “Completed”, Snowball publishes this event to the specified SNS topic, which delivers this message to a subscribing Lambda function, which finally creates a CloudFront web distribution for the target S3 bucket, to serve the images to end users.

Database services

  • Amazon RDS: Makes it easy to set up, operate, and scale a relational database in the cloud.RDS leverages SNS to broadcast notifications when RDS events occur. As usual, these notifications can be delivered via any protocol supported by SNS, including SQS queues, Lambda functions, and HTTP endpoints.As an example, imagine that you own a social network website that has experienced organic growth, and needs to scale its compute and database resources on demand. In this case, you could provide an SNS topic to listen to RDS DB instance events. When the “Low Storage” event is published to the topic, SNS pushes this event to a subscribing Lambda function, which in turn leverages the RDS API to increase the storage capacity allocated to your DB instance. The provisioning itself takes place within the specified DB maintenance window.

  • Amazon ElastiCache: A web service that makes it easy to deploy, operate, and scale an in-memory data store or cache in the cloud.ElastiCache can publish messages using Amazon SNS when significant events happen on your cache cluster. This feature can be used to refresh the list of servers on client machines connected to individual cache node endpoints of a cache cluster. For instance, an ecommerce website fetches product details from a cache cluster, with the goal of offloading a relational database and speeding up page load times. Ideally, you want to make sure that each web server always has an updated list of cache servers to which to connect.To automate this node discovery process, you can get your ElastiCache cluster to publish events to an SNS topic. Thus, when ElastiCache event “AddCacheNodeComplete” is published, your topic then pushes this event to all subscribing HTTP endpoints that serve your ecommerce website, so that these HTTP servers can update their list of cache nodes.

  • Amazon Redshift: A fully managed data warehouse that makes it simple to analyze data using standard SQL and BI (Business Intelligence) tools.Amazon Redshift uses SNS to broadcast relevant events so that data warehouse workflows can be automated. As an example, imagine a news website that sends clickstream data to a Kinesis Firehose stream, which then loads the data into Amazon Redshift, so that popular news and reading preferences might be surfaced on a BI tool. At some point though, this Amazon Redshift cluster might need to be resized, and the cluster enters a ready-only mode. Hence, this Amazon Redshift event is published to an SNS topic, which delivers this event to a subscribing Lambda function, which finally deletes the corresponding Kinesis Firehose delivery stream, so that clickstream data uploads can be put on hold.At a later point, after Amazon Redshift publishes the event that the maintenance window has been closed, SNS notifies a subscribing Lambda function accordingly, so that this function can re-create the Kinesis Firehose delivery stream, and resume clickstream data uploads to Amazon Redshift.

  • AWS DMS: Helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database.DMS also uses SNS to provide notifications when DMS events occur, which can automate database migration workflows. As an example, you might create data replication tasks to migrate an on-premises MS SQL database, composed of multiple tables, to MySQL. Thus, if replication tasks fail due to incompatible data encoding in the source tables, these events can be published to an SNS topic, which can push these messages into a subscribing SQS queue. Then, encoders running on EC2 can poll these messages from the SQS queue, encode the source tables into a compatible character set, and restart the corresponding replication tasks in DMS. This is an event-driven approach to a self-healing database migration process.

Networking services

  • Amazon Route 53: A highly available and scalable cloud-based DNS (Domain Name System). Route 53 health checks monitor the health and performance of your web applications, web servers, and other resources.You can set CloudWatch alarms and get automated Amazon SNS notifications when the status of your Route 53 health check changes. As an example, imagine an online payment gateway that reports the health of its platform to merchants worldwide, via a status page. This page is hosted on EC2 and fetches platform health data from DynamoDB. In this case, you could configure a CloudWatch alarm for your Route 53 health check, so that when the alarm threshold is breached, and the payment gateway is no longer considered healthy, then CloudWatch publishes this event to an SNS topic, which pushes this message to a subscribing Lambda function, which finally updates the DynamoDB table that populates the status page. This event-driven approach avoids any kind of manual update to the status page visited by merchants.

  • AWS Direct Connect (AWS DX): Makes it easy to establish a dedicated network connection from your premises to AWS, which can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.You can monitor physical DX connections using CloudWatch alarms, and send SNS messages when alarms change their status. As an example, when a DX connection state shifts to 0 (zero), indicating that the connection is down, this event can be published to an SNS topic, which can fan out this message to impacted servers through HTTP endpoints, so that they might reroute their traffic through a different connection instead. This is an event-driven approach to connectivity resilience.

More event-driven computing on AWS

In addition to SNS, event-driven computing is also addressed by Amazon CloudWatch Events, which delivers a near real-time stream of system events that describe changes in AWS resources. With CloudWatch Events, you can route each event type to one or more targets, including:

Many AWS services publish events to CloudWatch. As an example, you can get CloudWatch Events to capture events on your ETL (Extract, Transform, Load) jobs running on AWS Glue and push failed ones to an SQS queue, so that you can retry them later.

Conclusion

Amazon SNS is a pub/sub messaging service that can be used as an event-driven computing hub to AWS customers worldwide. By capturing events natively triggered by AWS services, such as EC2, S3 and RDS, you can automate and optimize all kinds of workflows, namely scaling, testing, encoding, profiling, broadcasting, discovery, failover, and much more. Business use cases presented in this post ranged from recruiting websites, to scientific research, geographic systems, social networks, retail websites, and news portals.

Start now by visiting Amazon SNS in the AWS Management Console, or by trying the AWS 10-Minute Tutorial, Send Fan-out Event Notifications with Amazon SNS and Amazon SQS.

 

New – Stop & Resume Workloads on EC2 Spot Instances

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-stop-resume-workloads-on-ec2-spot-instances/

EC2 Spot Instances give you access to spare EC2 compute capacity at up to 90% off of the On-Demand rates. Starting with the ability to request a specific number of instances of a particular size, we made Spot Instances even more useful and flexible with support for Spot Fleets and Auto Scaling Spot Fleets, allowing you to maintain any desired level of compute capacity.

EC2 users have long had the ability to stop running instances while leaving EBS volumes attached, opening the door to applications that automatically pick up where they left off when the instance starts running again.

Stop and Resume Spot Workloads
Today we are blending these two important features, allowing you to set up Spot bids and Spot Fleets that respond by stopping (rather than terminating) instances when capacity is no longer available at or below your bid price. EBS volumes attached to stopped instances remain intact, as does the EBS-backed root volume. When capacity becomes available, the instances are started and can keep on going without having to spend time provisioning applications, setting up EBS volumes, downloading data, joining network domains, and so forth.

Many AWS customers have enhanced their applications to create and make use of checkpoints, adding some resilience and gaining the ability to take advantage of EC2’s start/stop feature in the process. These customers will now be able to run these applications on Spot Instances, with savings that average 70-90%.

While the instances are stopped, you can modify the EBS Optimization, User data, Ramdisk ID, and Delete on Termination attributes. Stopped Spot Instances do not incur any charges for compute time; space for attached EBS volumes is charged at the usual rates.

Here’s how you create a Spot bid or Spot Fleet and specify the use of stop/start:

Things to Know
This feature is available now and you can start using it today in all AWS Regions where Spot Instances are available. It is designed to work well in conjunction with the new per-second billing for EC2 instances and EBS volumes, with the potential for another dimension of cost savings over and above that provided by Spot Instances.

EBS volumes always exist within a particular Availability Zone (AZ). As a result, Spot and Spot Fleet requests that specify a particular AZ will always restart in that AZ.

Take care when using this feature in conjunction with Spot Fleets that have the potential to span a wide variety of instance types. Because the composition of the fleet can change over time, you need to pay attention to your account’s limits for IP addresses and EBS volumes.

I’m looking forward to hearing about the new and creative uses that you’ll come up with for this feature. If you thought that your application was not a good fit for Spot Instances, or if the overhead needed to handle interruptions was too high, it is time to take another look!

Jeff;

 

Prime Day 2017 – Powered by AWS

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/prime-day-2017-powered-by-aws/

The third annual Prime Day set another round of records for global orders, topping Black Friday and Cyber Monday, making it the biggest day in Amazon retail history. Over the course of the 30 hour event, tens of millions of Prime members purchased things like Echo Dots, Fire tablets, programmable pressure cookers, espresso machines, rechargeable batteries, and much more! July 11th also set a record for the number of new Prime memberships, as people signed up in order to take advantage of hundreds of thousands of deals. Amazon customers shopped online and made heavy use of the Amazon App, with mobile orders more than doubling from last Prime Day.

Powered by AWS
Last year I told you about How AWS Powered Amazon’s Biggest Day Ever, and shared what the team had learned with regard to preparation, automation, monitoring, and thinking big. All of those lessons still apply and you can read that post to learn more. Preparation for this year’s Prime Day (which started just days after Prime Day 2016 wrapped up) started by collecting and sharing best practices and identifying areas for improvement, proceeding to implementation and stress testing as the big day approached. Two of the best practices involve auditing and GameDay:

Auditing – This is a formal way for us to track preparations, identify risks, and to track progress against our objectives. Each team must respond to a series of detailed technical and operational questions that are designed to help them determine their readiness. On the technical side, questions could revolve around time to recovery after a database failure, including the all-important check of the TTL (time to live) for the CNAME. Operational questions address schedules for on-call personnel, points of contact, and ownership of services & instances.

GameDay – This practice (which I believe originated with former Amazonian Jesse Robbins), is intended to validate all of the capacity planning & preparation and to verify that all of the necessary operational practices are in place and work as expected. It introduces simulated failures and helps to train the team to identify and quickly resolve issues, building muscle memory in the process. It also tests failover and recovery capabilities, and can expose latent defects that are lurking under the covers. GameDays help teams to understand scaling drivers (page views, orders, and so forth) and gives them an opportunity to test their scaling practices. To learn more, read Resilience Engineering: Learning to Embrace Failure or watch the video: GameDay: Creating Resiliency Through Destruction.

Prime Day 2017 Metrics
So, how did we do this year?

The AWS teams checked their dashboards and log files, and were happy to share their metrics with me. Here are a few of the most interesting ones:

Block Storage – Use of Amazon Elastic Block Store (EBS) grew by 40% year-over-year, with aggregate data transfer jumping to 52 petabytes (a 50% increase) for the day and total I/O requests rising to 835 million (a 30% increase). The team told me that they loved the elasticity of EBS, and that they were able to ramp down on capacity after Prime Day concluded instead of being stuck with it.

NoSQL Database – Amazon DynamoDB requests from Alexa, the Amazon.com sites, and the Amazon fulfillment centers totaled 3.34 trillion, peaking at 12.9 million per second. According to the team, the extreme scale, consistent performance, and high availability of DynamoDB let them meet needs of Prime Day without breaking a sweat.

Stack Creation – Nearly 31,000 AWS CloudFormation stacks were created for Prime Day in order to bring additional AWS resources on line.

API Usage – AWS CloudTrail processed over 50 billion events and tracked more than 419 billion calls to various AWS APIs, all in support of Prime Day.

Configuration TrackingAWS Config generated over 14 million Configuration items for AWS resources.

You Can Do It
Running an event that is as large, complex, and mission-critical as Prime Day takes a lot of planning. If you have an event of this type in mind, please take a look at our new Infrastructure Event Readiness white paper. Inside, you will learn how to design and provision your applications to smoothly handle planned scaling events such as product launches or seasonal traffic spikes, with sections on automation, resiliency, cost optimization, event management, and more.

Jeff;

 

How to Increase the Redundancy and Performance of Your AWS Directory Service for Microsoft AD Directory by Adding Domain Controllers

Post Syndicated from Peter Pereira original https://aws.amazon.com/blogs/security/how-to-increase-the-redundancy-and-performance-of-your-aws-directory-service-for-microsoft-ad-directory-by-adding-domain-controllers/

You can now increase the redundancy and performance of your AWS Directory Service for Microsoft Active Directory (Enterprise Edition), also known as AWS Microsoft AD, directory by deploying additional domain controllers. Adding domain controllers increases redundancy, resulting in even greater resilience and higher availability. This new capability enables you to have at least two domain controllers operating, even if an Availability Zone were to be temporarily unavailable. The additional domain controllers also improve the performance of your applications by enabling directory clients to load-balance their requests across a larger number of domain controllers. For example, AWS Microsoft AD enables you to use larger fleets of Amazon EC2 instances to run .NET applications that perform frequent user attribute lookups.

AWS Microsoft AD is a highly available, managed Active Directory built on actual Microsoft Windows Server 2012 R2 in the AWS Cloud. When you create your AWS Microsoft AD directory, AWS deploys two domain controllers that are exclusively yours in separate Availability Zones for high availability. Now, you can deploy additional domain controllers easily via the Directory Service console or API, by specifying the total number of domain controllers that you want.

AWS Microsoft AD distributes the additional domain controllers across the Availability Zones and subnets within the Amazon VPC where your directory is running. AWS deploys the domain controllers, configures them to replicate directory changes, monitors for and repairs any issues, performs daily snapshots, and updates the domain controllers with patches. This reduces the effort and complexity of creating and managing your own domain controllers in the AWS Cloud.

In this blog post, I create an AWS Microsoft AD directory with two domain controllers in each Availability Zone. This ensures that I always have at least two domain controllers operating, even if an entire Availability Zone were to be temporarily unavailable. To accomplish this, first I create an AWS Microsoft AD directory with one domain controller per Availability Zone, and then I deploy one additional domain controller per Availability Zone.

Solution architecture

The following diagram shows how AWS Microsoft AD deploys all the domain controllers in this solution after you complete Steps 1 and 2. In Step 1, AWS Microsoft AD deploys the two required domain controllers across multiple Availability Zones and subnets in an Amazon VPC. In Step 2, AWS Microsoft AD deploys one additional domain controller per Availability Zone and subnet.

Solution diagram

Step 1: Create an AWS Microsoft AD directory

First, I create an AWS Microsoft AD directory in an Amazon VPC. I can add domain controllers only after AWS Microsoft AD configures my first two required domain controllers. In my example, my domain name is example.com.

When I create my directory, I must choose the VPC in which to deploy my directory (as shown in the following screenshot). Optionally, I can choose the subnets in which to deploy my domain controllers, and AWS Microsoft AD ensures I select subnets from different Availability Zones. In this case, I have no subnet preference, so I choose No Preference from the Subnets drop-down list. In this configuration, AWS Microsoft AD selects subnets from two different Availability Zones to deploy the directory.

Screenshot of choosing the VPC in which to create the directory

I then choose Next Step to review my configuration, and then choose Create Microsoft AD. It takes approximately 40 minutes for my domain controllers to be created. I can check the status from the AWS Directory Service console, and when the status is Active, I can add my two additional domain controllers to the directory.

Step 2: Deploy two more domain controllers in the directory

Now that I have created an AWS Microsoft AD directory and it is active, I can deploy two additional domain controllers in the directory. AWS Microsoft AD enables me to add domain controllers through the Directory Service console or API. In this post, I use the console.

To deploy two more domain controllers in the directory:

  1. I open the AWS Management Console, choose Directory Service, and then choose the Microsoft AD Directory ID. In my example, my recently created directory is example.com, as shown in the following screenshot.Screenshot of choosing the Directory ID
  2. I choose the Domain controllers tab next. Here I can see the two domain controllers that AWS Microsoft AD created for me in Step 1. It also shows the Availability Zones and subnets in which AWS Microsoft AD deployed the domain controllers.Screenshot showing the domain controllers, Availability Zones, and subnets
  3. I then choose Modify on the Domain controllers tab. I specify the total number of domain controllers I want by choosing the subtract and add buttons. In my example, I want four domain controllers in total for my directory.Screenshot showing how to specify the total number of domain controllers
  4. I choose Apply. AWS Microsoft AD deploys the two additional domain controllers and distributes them evenly across the Availability Zones and subnets in my Amazon VPC. Within a few seconds, I can see the Availability Zones and subnets in which AWS Microsoft AD deployed my two additional domain controllers with a status of Creating (see the following screenshot). While AWS Microsoft AD deploys the additional domain controllers, my directory continues to operate by using the active domain controllers—with no disruption of service.
    Screenshot of two additional domain controllers with a status of "Creating"
  5. When AWS Microsoft AD completes the deployment steps, all domain controllers are in Active status and available for use by my applications. As a result, I have improved the redundancy and performance of my directory.

Note: After deploying additional domain controllers, I can reduce the number of domain controllers by repeating the modification steps with a lower number of total domain controllers. Unless a directory is deleted, AWS Microsoft AD does not allow fewer than two domain controllers per directory in order to deliver fault tolerance and high availability.

Summary

In this blog post, I demonstrated how to deploy additional domain controllers in your AWS Microsoft AD directory. By adding domain controllers, you increase the redundancy and performance of your directory, which makes it easier for you to migrate and run mission-critical Active Directory–integrated workloads in the AWS Cloud without having to deploy and maintain your own AD infrastructure.

To learn more about AWS Directory Service, see the AWS Directory Service home page. If you have questions, post them on the Directory Service forum.

– Peter

Healthcare Industry Cybersecurity Report

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/06/healthcare_indu.html

New US government report: “Report on Improving Cybersecurity in the Health Care Industry.” It’s pretty scathing, but nothing in it will surprise regular readers of this blog.

It’s worth reading the executive summary, and then skimming the recommendations. Recommendations are in six areas.

The Task Force identified six high-level imperatives by which to organize its recommendations and action items. The imperatives are:

  1. Define and streamline leadership, governance, and expectations for health care industry cybersecurity.
  2. Increase the security and resilience of medical devices and health IT.

  3. Develop the health care workforce capacity necessary to prioritize and ensure cybersecurity awareness and technical capabilities.

  4. Increase health care industry readiness through improved cybersecurity awareness and education.

  5. Identify mechanisms to protect research and development efforts and intellectual property from attacks or exposure.

  6. Improve information sharing of industry threats, weaknesses, and mitigations.

News article.

Slashdot thread.

Some notes on Trump’s cybersecurity Executive Order

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/05/some-notes-on-trumps-cybersecurity.html

President Trump has finally signed an executive order on “cybersecurity”. The first draft during his first weeks in power were hilariously ignorant. The current draft, though, is pretty reasonable as such things go. I’m just reading the plain language of the draft as a cybersecurity expert, picking out the bits that interest me. In reality, there’s probably all sorts of politics in the background that I’m missing, so I may be wildly off-base.

Holding managers accountable

This is a great idea in theory. But government heads are rarely accountable for anything, so it’s hard to see if they’ll have the nerve to implement this in practice. When the next breech happens, we’ll see if anybody gets fired.
“antiquated and difficult to defend Information Technology”

The government uses laughably old computers sometimes. Forces in government wants to upgrade them. This won’t work. Instead of replacing old computers, the budget will simply be used to add new computers. The old computers will still stick around.
“Legacy” is a problem that money can’t solve. Programmers know how to build small things, but not big things. Everything starts out small, then becomes big gradually over time through constant small additions. What you have now is big legacy systems. Attempts to replace a big system with a built-from-scratch big system will fail, because engineers don’t know how to build big systems. This will suck down any amount of budget you have with failed multi-million dollar projects.
It’s not the antiquated systems that are usually the problem, but more modern systems. Antiquated systems can usually be protected by simply sticking a firewall or proxy in front of them.

“address immediate unmet budgetary needs necessary to manage risk”

Nobody cares about cybersecurity. Instead, it’s a thing people exploit in order to increase their budget. Instead of doing the best security with the budget they have, they insist they can’t secure the network without more money.

An alternate way to address gaps in cybersecurity is instead to do less. Reduce exposure to the web, provide fewer services, reduce functionality of desktop computers, and so on. Insisting that more money is the only way to address unmet needs is the strategy of the incompetent.

Use the NIST framework
Probably the biggest thing in the EO is that it forces everyone to use the NIST cybersecurity framework.
The NIST Framework simply documents all the things that organizations commonly do to secure themselves, such run intrusion-detection systems or impose rules for good passwords.
There are two problems with the NIST Framework. The first is that no organization does all the things listed. The second is that many organizations don’t do the things well.
Password rules are a good example. Organizations typically had bad rules, such as frequent changes and complexity standards. So the NIST Framework documented them. But cybersecurity experts have long opposed those complex rules, so have been fighting NIST on them.

Another good example is intrusion-detection. These days, I scan the entire Internet, setting off everyone’s intrusion-detection systems. I can see first hand that they are doing intrusion-detection wrong. But the NIST Framework recommends they do it, because many organizations do it, but the NIST Framework doesn’t demand they do it well.
When this EO forces everyone to follow the NIST Framework, then, it’s likely just going to increase the amount of money spent on cybersecurity without increasing effectiveness. That’s not necessarily a bad thing: while probably ineffective or counterproductive in the short run, there might be long-term benefit aligning everyone to thinking about the problem the same way.
Note that “following” the NIST Framework doesn’t mean “doing” everything. Instead, it means documented how you do everything, a reason why you aren’t doing anything, or (most often) your plan to eventually do the thing.
preference for shared IT services for email, cloud, and cybersecurity
Different departments are hostile toward each other, with each doing things their own way. Obviously, the thinking goes, that if more departments shared resources, they could cut costs with economies of scale. Also obviously, it’ll stop the many home-grown wrong solutions that individual departments come up with.
In other words, there should be a single government GMail-type service that does e-mail both securely and reliably.
But it won’t turn out this way. Government does not have “economies of scale” but “incompetence at scale”. It means a single GMail-like service that is expensive, unreliable, and in the end, probably insecure. It means we can look forward to government breaches that instead of affecting one department affecting all departments.

Yes, you can point to individual organizations that do things poorly, but what you are ignoring is the organizations that do it well. When you make them all share a solution, it’s going to be the average of all these things — meaning those who do something well are going to move to a worse solution.

I suppose this was inserted in there so that big government cybersecurity companies can now walk into agencies, point to where they are deficient on the NIST Framework, and say “sign here to do this with our shared cybersecurity service”.
“identify authorities and capabilities that agencies could employ to support the cybersecurity efforts of critical infrastructure entities”
What this means is “how can we help secure the power grid?”.
What it means in practice is that fiasco in the Vermont power grid. The DHS produced a report containing IoCs (“indicators of compromise”) of Russian hackers in the DNC hack. Among the things it identified was that the hackers used Yahoo! email. They pushed these IoCs out as signatures in their “Einstein” intrusion-detection system located at many power grid locations. The next person that logged into their Yahoo! email was then flagged as a Russian hacker, causing all sorts of hilarity to ensue, such as still uncorrected stories by the Washington Post how the Russians hacked our power-grid.
The upshot is that federal government help is also going to include much government hindrance. They really are this stupid sometimes and there is no way to fix this stupid. (Seriously, the DHS still insists it did the right thing pushing out the Yahoo IoCs).
Resilience Against Botnets and Other Automated, Distributed Threats

The government wants to address botnets because it’s just the sort of problem they love, mass outages across the entire Internet caused by a million machines.

But frankly, botnets don’t even make the top 10 list of problems they should be addressing. Number #1 is clearly “phishing” — you know, the attack that’s been getting into the DNC and Podesta e-mails, influencing the election. You know, the attack that Gizmodo recently showed the Trump administration is partially vulnerable to. You know, the attack that most people blame as what probably led to that huge OPM hack. Replace the entire Executive Order with “stop phishing”, and you’d go further fixing federal government security.

But solving phishing is tough. To begin with, it requires a rethink how the government does email, and how how desktop systems should be managed. So the government avoids complex problems it can’t understand to focus on the simple things it can — botnets.

Dealing with “prolonged power outage associated with a significant cyber incident”

The government has had the hots for this since 2001, even though there’s really been no attack on the American grid. After the Russian attacks against the Ukraine power grid, the issue is heating up.

Nation-wide attacks aren’t really a threat, yet, in America. We have 10,000 different companies involved with different systems throughout the country. Trying to hack them all at once is unlikely. What’s funny is that it’s the government’s attempts to standardize everything that’s likely to be our downfall, such as sticking Einstein sensors everywhere.

What they should be doing is instead of trying to make the grid unhackable, they should be trying to lessen the reliance upon the grid. They should be encouraging things like Tesla PowerWalls, solar panels on roofs, backup generators, and so on. Indeed, rather than industrial system blackout, industry backup power generation should be considered as a source of grid backup. Factories and even ships were used to supplant the electric power grid in Japan after the 2011 tsunami, for example. The less we rely on the grid, the less a blackout will hurt us.

“cybersecurity risks facing the defense industrial base, including its supply chain”

So “supply chain” cybersecurity is increasingly becoming a thing. Almost anything electronic comes with millions of lines of code, silicon chips, and other things that affect the security of the system. In this context, they may be worried about intentional subversion of systems, such as that recent article worried about Kaspersky anti-virus in government systems. However, the bigger concern is the zillions of accidental vulnerabilities waiting to be discovered. It’s impractical for a vendor to secure a product, because it’s built from so many components the vendor doesn’t understand.

“strategic options for deterring adversaries and better protecting the American people from cyber threats”

Deterrence is a funny word.

Rumor has it that we forced China to backoff on hacking by impressing them with our own hacking ability, such as reaching into China and blowing stuff up. This works because the Chinese governments remains in power because things are going well in China. If there’s a hiccup in economic growth, there will be mass actions against the government.

But for our other cyber adversaries (Russian, Iran, North Korea), things already suck in their countries. It’s hard to see how we can make things worse by hacking them. They also have a strangle hold on the media, so hacking in and publicizing their leader’s weird sex fetishes and offshore accounts isn’t going to work either.

Also, deterrence relies upon “attribution”, which is hard. While news stories claim last year’s expulsion of Russian diplomats was due to election hacking, that wasn’t the stated reason. Instead, the claimed reason was Russia’s interference with diplomats in Europe, such as breaking into diplomat’s homes and pooping on their dining room table. We know it’s them when they are brazen (as was the case with Chinese hacking), but other hacks are harder to attribute.

Deterrence of nation states ignores the reality that much of the hacking against our government comes from non-state actors. It’s not clear how much of all this Russian hacking is actually directed by the government. Deterrence polices may be better directed at individuals, such as the recent arrest of a Russian hacker while they were traveling in Spain. We can’t get Russian or Chinese hackers in their own countries, so we have to wait until they leave.

Anyway, “deterrence” is one of those real-world concepts that hard to shoe-horn into a cyber (“cyber-deterrence”) equivalent. It encourages lots of bad thinking, such as export controls on “cyber-weapons” to deter foreign countries from using them.

“educate and train the American cybersecurity workforce of the future”

The problem isn’t that we lack CISSPs. Such blanket certifications devalue the technical expertise of the real experts. The solution is to empower the technical experts we already have.

In other words, mandate that whoever is the “cyberczar” is a technical expert, like how the Surgeon General must be a medical expert, or how an economic adviser must be an economic expert. For over 15 years, we’ve had a parade of non-technical people named “cyberczar” who haven’t been experts.

Once you tell people technical expertise is valued, then by nature more students will become technical experts.

BTW, the best technical experts are software engineers and sysadmins. The best cybersecurity for Windows is already built into Windows, whose sysadmins need to be empowered to use those solutions. Instead, they are often overridden by a clueless cybersecurity consultant who insists on making the organization buy a third-party product instead that does a poorer job. We need more technical expertise in our organizations, sure, but not necessarily more cybersecurity professionals.

Conclusion

This is really a government document, and government people will be able to explain it better than I. These are just how I see it as a technical-expert who is a government-outsider.

My guess is the most lasting consequential thing will be making everyone following the NIST Framework, and the rest will just be a lot of aspirational stuff that’ll be ignored.

New Whitepaper: Aligning to the NIST Cybersecurity Framework in the AWS Cloud

Post Syndicated from Chris Gile original https://aws.amazon.com/blogs/security/new-whitepaper-aligning-to-the-nist-cybersecurity-framework-in-the-aws-cloud/

NIST logo

Today, we released the Aligning to the NIST Cybersecurity Framework in the AWS Cloud whitepaper. Both public and commercial sector organizations can use this whitepaper to assess the AWS environment against the National Institute of Standards and Technology (NIST) Cybersecurity Framework (CSF) and improve the security measures they implement and operate (also known as security in the cloud). The whitepaper also provides a third-party auditor letter attesting to the AWS Cloud offering’s conformance to NIST CSF risk management practices (also known as security of the cloud), allowing organizations to properly protect their data across AWS.

In February 2014, NIST published the Framework for Improving Critical Infrastructure Cybersecurity in response to Presidential Executive Order 13636, “Improving Critical Infrastructure Cybersecurity,” which called for the development of a voluntary framework to help organizations improve the cybersecurity, risk management, and resilience of their systems. The Cybersecurity Enhancement Act of 2014 reinforced the legitimacy and authority of the NIST CSF by codifying it and its voluntary adoption into law, and federal agency Federal Information Security Modernization Act (FISMA) reporting metrics now align to the NIST CSF. Though it is intended for adoption by the critical infrastructure sector, the foundational set of security disciplines in the NIST CSF has been endorsed by government and industry as a recommended baseline for use by any organization, regardless of its sector or size.

We recognize the additional level of effort an organization has to expend for each new security assurance framework it implements. To reduce that burden, we provide a detailed breakout of AWS Cloud offerings and associated customer and AWS responsibilities to facilitate alignment with the NIST CSF. Organizations ranging from federal and state agencies to regulated entities to large enterprises can use this whitepaper as a guide for implementing AWS solutions to achieve the risk management outcomes in the NIST CSF.

Security, compliance, and customer data protection are our top priorities, and we will continue to provide the resources and services for you to meet your desired outcomes while integrating security best practices in the AWS environment. When you use AWS solutions, you can be confident that we protect your data with a level of assurance that meets, if not exceeds, your requirements and needs, and gives you the resources to secure your AWS environment. To request support for implementing the NIST CSF in your organization by using AWS services, contact your AWS account manager.

– Chris Gile, Senior Manager, Security Assurance

Buzzword Watch: Prosilience

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/03/buzzword_watch_.html

Summer Fowler at CMU has invented a new word: prosilience:

I propose that we build operationally PROSILIENT organizations. If operational resilience, as we like to say, is risk management “all grown up,” then prosilience is resilience with consciousness of environment, self-awareness, and the capacity to evolve. It is not about being able to operate through disruption, it is about anticipating disruption and adapting before it even occurs–a proactive version of resilience. Nascent prosilient capabilities include exercises (tabletop or technical) that simulate how organizations would respond to a scenario. The goal, however, is to automate, expand, and perform continuous exercises based on real-world indicators rather than on scenarios.

I have long been a big fan of resilience as a security concept, and the property we should be aiming for. I’m not sure prosilience buys me anything new, but this is my first encounter with this new buzzword. It would certainly make for a best-selling business-book title.

First Round of systemd.conf 2015 Sponsors

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/first-round-of-systemdconf-2015-sponsors.html

First Round of systemd.conf 2015 Sponsors

We are happy to announce the first round of systemd.conf
2015
sponsors!

Our first Gold sponsor is CoreOS!

CoreOS develops software for modern infrastructure that delivers a consistent operating environment for distributed applications. CoreOS’s commercial offering, Tectonic, is an enterprise-ready platform that combines Kubernetes and the CoreOS stack to run Linux containers. In addition CoreOS is the creator and maintainer of open source projects such as CoreOS Linux, etcd, fleet, flannel and rkt. The strategies and architectures that influence CoreOS allow companies like Google, Facebook and Twitter to run their services at scale with high resilience. Learn more about CoreOS here https://coreos.com/, Tectonic here, https://tectonic.com/ or follow CoreOS on Twitter @coreoslinux.

A Silver sponsor is Codethink:

Codethink is a software services consultancy, focusing on engineering reliable systems for long-term deployment with open source technologies.

A Bronze sponsor is Pantheon:

Pantheon is a platform for professional website development, testing, and deployment. Supporting Drupal and WordPress, Pantheon runs over 100,000 websites for the world’s top brands, universities, and media organizations on top of over a million containers.

A Bronze sponsor is Pengutronix:

Pengutronix provides consulting, training and development services for Embedded Linux to customers from the industry. The Kernel Team ports Linux to customer hardware and has more than 3100 patches in the official mainline kernel. In addition to lowlevel ports, the Pengutronix Application Team is responsible for board support packages based on PTXdist or Yocto and deals with system integration (this is where systemd plays an important role). The Graphics Team works on accelerated multimedia tasks, based on the Linux kernel, GStreamer, Qt and web technologies.

We’d like to thank our sponsors for their support! Without sponsors our conference would not be possible!

We’ll shortly announce our second round of sponsors, please stay tuned!

If you’d like to join the ranks of systemd.conf 2015 sponsors, please have a look at our Becoming a Sponsor page!

Reminder! The systemd.conf 2015 Call for Presentations ends on monday, August 31st! Please make sure to submit your proposals on the CfP page until then!

Also, don’t forget to register for the conference! Only a limited number of
registrations are available due to space constraints!
Register here!.

For further details about systemd.conf consult the conference website.