Security updates for Wednesday

Post Syndicated from ris original https://lwn.net/Articles/874143/rss

Security updates have been issued by Debian (mosquitto and php7.0), Fedora (python-django-filter and qt), Mageia (fossil, opencryptoki, and qtbase5), openSUSE (apache2, busybox, dnsmasq, ffmpeg, pcre, and wireguard-tools), Red Hat (kpatch-patch), SUSE (apache2, busybox, dnsmasq, ffmpeg, java-11-openjdk, libvirt, open-lldp, pcre, python, qemu, util-linux, and wireguard-tools), and Ubuntu (apport and libslirp).

Accelerating serverless development with AWS SAM Accelerate

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/accelerating-serverless-development-with-aws-sam-accelerate/

Building a serverless application changes the way developers think about testing their code. Previously, developers would emulate the complete infrastructure locally and only commit code ready for testing. However, with serverless, local emulation can be more complex.

In this post, I show you how to bypass most local emulation by testing serverless applications in the cloud against production services using AWS SAM Accelerate. AWS SAM Accelerate aims to increase infrastructure accuracy for testing with sam sync, incremental builds, and aggregated feedback for developers. AWS SAM Accelerate brings the developer to the cloud and not the cloud to the developer.

AWS SAM Accelerate

The AWS SAM team has listened to developers wanting a better way to emulate the cloud on their local machine and we believe that testing against the cloud is the best path forward. With that in mind, I am happy to announce the beta release of AWS SAM Accelerate!

Previously, the latency of deploying after each change has caused developers to seek other options. AWS SAM Accelerate is a set of features to reduce that latency and enable developers to test their code quickly against production AWS services in the cloud.

To demonstrate the different options, this post uses an example application called “Blog”. To follow along, create your version of the application by downloading the demo project. Note that you need the latest version of the AWS SAM CLI and Python 3.9 installed. AWS SAM Accelerate works with other runtimes, but this example uses Python 3.9.

After installing the prerequisites, set up the demo project with the following commands:

  1. Create a folder for the project called blog
    mkdir blog && cd blog
  2. Initialize a new AWS SAM project:
    sam init
  3. Choose option 2 for Custom Template Location.
  4. Enter https://github.com/aws-samples/aws-sam-accelerate-demo as the location.

AWS SAM downloads the sample project into the current folder. With the blog application in place, you can now try out AWS SAM Accelerate.
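
Note that you do not have to go through the interactive prompts each time. Recent versions of the AWS SAM CLI also accept the template location directly; the following is a sketch and assumes the --location flag is available in your installed version:

sam init --location https://github.com/aws-samples/aws-sam-accelerate-demo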

AWS SAM sync

The first feature of AWS SAM Accelerate is a new command called sam sync. This command synchronizes your project declared in an AWS SAM template to the AWS Cloud. However, sam sync differentiates between code and configuration.

AWS SAM defines code as the following:

  • AWS Lambda function code and the dependencies declared for it
  • AWS Lambda layer code
  • The external OpenAPI definition file for an Amazon API Gateway REST or HTTP API (DefinitionUri)
  • The Amazon States Language (ASL) definition file for an AWS Step Functions state machine (DefinitionUri)

Anything else is considered configuration. The following description of the sam sync options explains how sam sync differentiates between configuration synchronization and code synchronization. The resulting patterns are the fastest way to test code in the cloud with AWS SAM.

Using sam sync (no options)

The sam sync command with no options deploys or updates all infrastructure and code like the sam deploy command. However, unlike sam deploy, sam sync bypasses the AWS CloudFormation changeset process. To see this, run:

sam sync --stack-name blog
AWS SAM sync with no options

First, sam sync builds the code using the sam build command and then the application is synchronized to the cloud.

Successful sync

Using the SAM sync code, resource, and resource-id flags

The sam sync command can also synchronize code changes to the cloud without updating the infrastructure. This code synchronization uses the service APIs and bypasses CloudFormation, allowing AWS SAM to update the code in seconds instead of minutes.

To synchronize code, use the --code flag, which instructs AWS SAM to sync all the code resources in the stack:

sam sync --stack-name blog --code
AWS SAM sync with the code flag

The sam sync command verifies each of the code types present and synchronizes the sources to the cloud. This example uses an API Gateway REST API and two Lambda functions. AWS SAM skips the REST API because there is no external OpenAPI file for this project. However, the Lambda functions and their dependencies are synchronized.

You can limit the synchronized resources by using the --resource flag with the --code flag:

sam sync --stack-name blog --code --resource AWS::Serverless::Function
SAM sync specific resource types

This command limits the synchronization to Lambda functions. Other available resources are AWS::Serverless::Api, AWS::Serverless::HttpApi, and AWS::Serverless::StateMachine.
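
For example, to limit code synchronization to Step Functions state machines instead, you could run the following. The demo application does not define a state machine, so this variant is purely illustrative:

sam sync --stack-name blog --code --resource AWS::Serverless::StateMachine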

You can target one specific resource with the --resource-id flag to get more granular:

sam sync --stack-name blog --code --resource-id HelloWorldFunction
SAM sync specific resource

This time sam sync ignores the GreetingFunction and only updates the HelloWorldFunction declared with the command’s --resource-id flag.

Using the SAM sync watch flag

The sam sync --watch option tells AWS SAM to monitor for file changes and automatically synchronize when changes are detected. If the changes include configuration changes, AWS SAM performs a standard synchronization equivalent to the sam sync command. If the changes are code only, then AWS SAM synchronizes the code with the equivalent of the sam sync --code command.

The first time you run the sam sync command with the --watch flag, AWS SAM ensures that the latest code and infrastructure are in the cloud. It then monitors for file changes until you quit the command:

sam sync --stack-name blog --watch
Initial sync

To see a change, modify the code in the HelloWorldFunction (hello_world/app.py) by updating the response to the following:

return {
  "statusCode": 200,
  "body": json.dumps({
    "message": "hello world, how are you",
    # "location": ip.text.replace("\n", "")
  }),
}

Once you save the file, sam sync detects the change and syncs the code for the HelloWorldFunction to the cloud.

AWS SAM detects changes

Auto dependency layer nested stack

During the initial sync, there is a logical resource name called AwsSamAutoDependencyLayerNestedStack. This feature helps to synchronize code more efficiently.

When working with Lambda functions, developers manage the code for the Lambda function and any dependencies required for the Lambda function. Before AWS SAM Accelerate, if a developer does not create a Lambda layer for dependencies, then the dependencies are re-uploaded with the function code on every update. However, with sam sync, the dependencies are automatically moved to a temporary layer to reduce latency.

Auto dependency layer in change set

During the first synchronization, sam sync creates a single nested stack that maintains a Lambda layer for each Lambda function in the stack.

Auto dependency layer in console

These layers are only updated when the dependencies for one of the Lambda functions are updated. To demonstrate, change the requirements.txt (greeting/requirements.txt) file for the GreetingFunction to the following:

Requests
boto3

AWS SAM detects the change, and the GreetingFunction and its temporary layer are updated:

Auto dependency layer synchronized

The Lambda function changes because the Lambda layer version must be updated.

Incremental builds with sam build

The second feature of AWS SAM Accelerate is an update to the SAM build command. This change separates the cache for dependencies from the cache for the code. The build command now evaluates these separately and only builds artifacts that have changed.

To try this out, build the project with the cached flag:

sam build --cached
The first build establishes cache

The first build recognizes that there is no cache and downloads the dependencies and builds the code. However, when you rerun the command:

The second build uses existing cached artifacts

The sam build command verifies that the dependencies have not changed. There is no need to download them again so it builds only the application code.

Finally, update the requirements file for the HelloWorldFunction (hello_world/requirements.txt) to:

Requests
boto3

Now rerun the build command:

AWS SAM build detects dependency changes

The sam build command detects a change in the dependency requirements and rebuilds the dependencies and the code.
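
If your project contains many functions and layers, the cached build can typically be combined with a parallel build to reduce build times further; this assumes your AWS SAM CLI version supports the --parallel flag:

sam build --cached --parallel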

Aggregated feedback for developers

The final part of AWS SAM Accelerate’s beta feature set is aggregating logs for developer feedback. This feature is an enhancement to the existing sam logs command. In addition to pulling Amazon CloudWatch Logs for the Lambda function, it is now possible to retrieve logs for API Gateway and traces from AWS X-Ray.

To test this, start the sam logs command:

sam logs --stack-name blog --include-traces --tail

Invoke the HelloWorldApi endpoint returned in the outputs on syncing:

curl https://112233445566.execute-api.us-west-2.amazonaws.com/Prod/hello

The sam logs command returns logs for the AWS Lambda function, Amazon API Gateway REST execution logs, and AWS X-Ray traces.

AWS Lambda logs from Amazon CloudWatch

Amazon API Gateway execution logs from Amazon CloudWatch

Traces from AWS X-Ray
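
While iterating on a single function, you may want to narrow the output. The sam logs command also accepts a --name flag to follow one function in the stack; treat the following as a sketch and check the flags against your AWS SAM CLI version:

sam logs --stack-name blog --name HelloWorldFunction --tail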

The full picture

Development diagram for AWS SAM Accelerate

With AWS SAM Accelerate, creating and testing an application is easier and faster. To get started:

  1. Start a new project:
    sam init
  2. Synchronize the initial project with a development environment:
    sam sync --stack-name <project name> --watch
  3. Start monitoring for logs:
    sam logs --stack-name <project name> --include-traces --tail
  4. Test using response data or logs.
  5. Iterate.
  6. Rinse and repeat!

Some caveats

AWS SAM Accelerate is in beta as of today. The team has worked hard to implement a solid minimum viable product (MVP) to get feedback from our community. However, there are a few caveats.

  1. Amazon States Language (ASL) code updates for Step Functions do not currently support DefinitionSubstitutions.
  2. The API Gateway OpenAPI template must be defined in the DefinitionUri parameter and does not currently support pseudo parameters and intrinsic functions.
  3. The sam logs command only supports execution logs on REST APIs and access logs on HTTP APIs.
  4. Function code cannot be inline and must be defined as a separate file in the CodeUri parameter.

Conclusion

When testing serverless applications, developers must get to the cloud as soon as possible. AWS SAM Accelerate helps developers escape from emulating the cloud locally and move to the fidelity of testing in the cloud.

In this post, I walk through the philosophy of why the AWS SAM team built AWS SAM Accelerate. I provide an example application and demonstrate the different features designed to remove barriers from testing in the cloud.

We invite the serverless community to help improve AWS SAM for building serverless applications. As with AWS SAM and the AWS SAM CLI (which includes AWS SAM Accelerate), this project is open source and you can contribute to the repository.

For more serverless content, visit Serverless Land.

Use Amazon EC2 for cost-efficient cloud gaming with pay-as-you-go pricing

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/use-amazon-ec2-for-cost-efficient-cloud-gaming-with-pay-as-you-go-pricing/

This post is written by Markus Ziller, Solutions Architect

Since AWS launched in 2006, cloud computing has disrupted traditional IT operations by providing a more cost-efficient, scalable, and secure alternative to owning hardware and data centers. Similarly, cloud gaming today enables gamers to play video games with pay-as-you-go pricing. This removes the need for high upfront investments in gaming hardware. Cloud gaming platforms like Amazon Luna are an entryway, but customers are limited to the games available on the service. Furthermore, many customers also prefer to own their games, or they already have a sizable collection. For those use cases, vendor-neutral tools like NICE DCV or Parsec are powerful solutions for streaming your games from anywhere.

This post shows a way to stream video games from the AWS Cloud to your local machine. I will demonstrate how you can provision a powerful gaming machine with pay-as-you-go pricing that allows you to play even the most demanding video games with zero upfront investment into gaming hardware.

The post comes with code examples on GitHub that let you follow along and replicate this post in your account.

In this example, I use the AWS Cloud Development Kit (AWS CDK), an open source software development framework, to model and provision cloud application resources. Using the CDK can reduce the complexity and amount of code needed to automate resource deployment.

Overview of the solution

The main services in this solution are Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Block Store (Amazon EBS). The architecture is as follows:

The architecture of the solution. It shows an EC2 instance of the G4 family deployed in a public subnet. The EC2 instance communicates with S3. Also shown is how a security group controls access from users to the EC2 instance

The key components of this solution are described in the following list. During this post, I will explain each component in detail. Deploy this architecture with the sample CDK project that comes with this blog post.

  1. An Amazon Virtual Private Cloud (Amazon VPC) that lets you launch AWS resources in a logically isolated virtual network. This includes a network configuration that lets you connect with instances in your VPC on ports 3389 and 8443.
  2. An Amazon EC2 instance of the G4 instance family. Amazon EC2 G4 instances are the most cost-effective and versatile GPU instances. Utilize G4 instances to run your games.
  3. Access to an Amazon Simple Storage Service (Amazon S3) bucket. S3 is an object storage that contains the graphics drivers required for GPU instances.
  4. A way to create a personal gaming Amazon Machine Image (AMI). AMIs mean that you only need to conduct the initial configuration of your gaming instance once. After that, you can create new gaming instances on demand from your AMI.

Walkthrough

The following sections walk through the steps required to set up your personal gaming AMI. You will only have to do this once.

For this walkthrough, you need:

  • An AWS account
  • Installed and authenticated AWS CLI
  • Installed Node.js, TypeScript
  • Installed git
  • Installed AWS CDK

You will also need an EC2 key pair for the initial instance setup. List your available key pairs with the following CLI command:

aws ec2 describe-key-pairs --query 'KeyPairs[*].KeyName' --output table

Alternatively, create a new key pair with the CLI. You will need the created .pem file later, so make sure to keep it.

KEY_NAME=Gaming
aws ec2 create-key-pair --key-name $KEY_NAME --query 'KeyMaterial' --output text > $KEY_NAME.pem
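
On Linux or macOS, it is also good practice to restrict the permissions of the downloaded key file before using it:

chmod 400 $KEY_NAME.pem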

Checkout and deploy the sample stack

  1. After completing the prerequisites, clone the associated GitHub repository by running the following command in a local directory:
    git clone git@github.com:aws-samples/cloud-gaming-on-ec2-instances
  2. Open the repository in your preferred local editor, and then review the contents of the *.ts files in cdk/bin/ and cdk/lib/.
  3. In gaming-on-g4-instances.ts you will find two CDK stacks: G4DNStack and G4ADStack. A CDK stack is a unit of deployment. All AWS resources defined within the scope of a stack, either directly or indirectly, are provisioned as a single unit.

The architecture for both stacks is similar, and it only differs in the instance type that will be deployed.

G4DNStack and G4ADStack share parameters that determine the configuration of the EC2 instance, the VPC network and the preinstalled software. The stacks come with defaults for some parameters – I recommend keeping the default values.

EC2

  • instanceSize: Sets the EC2 instance size. Defaults to g4dn.xlarge and g4ad.4xlarge, respectively. Check the service page for a list of valid configurations.
  • sshKeyName: The name of the EC2 key pair you will use to connect to the instance. Ensure you have access to the respective .pem file.
  • volumeSizeGiB: The root EBS volume size. Around 20 GB will be used for the Windows installation, and the rest will be available for your software.

VPC

  • openPorts: Inbound access on these ports will be allowed. By default, this allows access for Remote Desktop Protocol (RDP) (3389) and NICE DCV (8443).
  • associateElasticIp: Controls whether an Elastic IP address will be created and added to the EC2 instance. An Elastic IP address is a static public IPv4 address. Unlike the dynamic IPv4 addresses managed by AWS, it will not change after an instance restart. This is an optional convenience that comes at a small cost for reserving the IP address for you.
  • allowInboundCidr: Access from this CIDR range will be allowed. This limits the IP space from which your instance is routable and can be used to add an additional security layer. By default, traffic from any IP address will be allowed to reach your instance. Notwithstanding allowInboundCidr, an RDP or NICE DCV connection to your instance requires valid credentials and will be rejected otherwise.

Software

For g4dn instances, one more parameter is required (see process of installing NVIDIA drivers):

  • gridSwCertUrl: The NVIDIA driver requires a certificate file, which can be downloaded from Amazon S3.

Choose an instance type based on your performance and cost requirements, and then adapt the config accordingly. I recommend starting with the G4DNStack, as it comes at the lowest hourly instance cost. If you need more graphics performance, then choose the G4ADStack, which provides up to a 40% improvement in graphics performance.

After choosing an instance, follow the instructions in the README.md in order to deploy the stack.

The CDK will create a new Amazon VPC with a public subnet. It will also configure security groups to allow inbound traffic from the CIDR range and ports specified in the config. By default, this will allow inbound traffic from any IP address. I recommend restricting access to your current IP address by going to checkip.amazonaws.com and replacing 0.0.0.0/0 with <YOUR_IP>/32 for the allowInboundCidr config parameter.
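
For example, you can look up your current public IP address on the command line and use the printed value (with the /32 suffix) for allowInboundCidr. This is a small convenience sketch rather than part of the sample project:

MY_IP=$(curl -s https://checkip.amazonaws.com)
echo "${MY_IP}/32"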

Besides the security groups, the CDK also manages all required IAM permissions.

It creates the following resources in your VPC:

  • An EC2 GPU instance running Windows Server 2019. Depending on the stack you chose, the default instance type will be g4dn.xlarge or g4ad.4xlarge. You can override this in the template. When the EC2 instance is launched, the CDK runs an instance-specific PowerShell script. This PowerShell script downloads all drivers and NICE DCV. Check the code to see the full script or add your own commands.
  • An EBS gp3 volume with your defined size (default: 150 GB).
  • An EC2 launch template.
  • (Optionally) An Elastic IP address as a static public IP address for your gaming instances.

After the stack has been deployed, you will see the following output:

The output of a CDK deployment of the CloudGamingOnG4DN stack. Shows values for Credentials, InstanceId, KeyName, LaunchTemplateId, PublicIp. Values for Credentials, InstanceId and PublicIp are redacted

Click the first link, and download the remote desktop file in order to connect to your instance. Use the .pem file from the previous step to retrieve the instance password.
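
If you prefer the CLI over the console link, the Windows Administrator password can also be retrieved and decrypted with the key file; this is a sketch, so replace the instance ID and adjust the key file name if you chose a different one:

aws ec2 get-password-data --instance-id <INSTANCE_ID> --priv-launch-key Gaming.pem --query 'PasswordData' --output text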

Install drivers on EC2

You will use RDP to connect to your instance initially. RDP provides a user with a graphical interface to connect to another computer over a network connection. RDP clients are available for a large number of platforms. Use the public IP address provided by the CDK output, the username Administrator, and the password displayed by the EC2 console dialog.

Ensure that you note the password for later steps.

Most configuration process steps are automated by the CDK. However, a few actions (e.g., driver installation) cannot be properly automated. For those, the following manual steps are required once.

Navigate to $home\Desktop\InstallationFiles. If the folder contains an empty file named “OK”, then everything was downloaded correctly and you can proceed with the installation. If you connect while the setup process is still in progress, then wait until the OK file gets created before proceeding. This typically takes 2-3 minutes.

The next step differs slightly for g4ad and g4dn instances.

g4ad instances with AMD Radeon Pro V520

Follow the instructions in the EC2 documentation to install the AMD driver from the InstallationFiles folder. The installation may take a few minutes, and it will display the following output when successfully finished.

The output of the PowerShell command that installs the graphics driver for AMD Radeon Pro V520

The contents of the InstallationFiles folder. It contains a folder 1_AMD_driver and files 2_NICEDCV-Server, 3_NICEDCV-DisplayDriver, OK

Next, install NICE DCV Server and Display driver by double-clicking the respective files. Finally, restart the instance by running the Restart-Computer PowerShell command.

g4dn instances with NVIDIA T4

Navigate to 1_NVIDIA_drivers, run the NVIDIA driver installer for Windows Server 2019 and follow the instructions.

The contents of the InstallationFiles folder. It contains a folder 1_NVIDIA_drivers and files 2_NICEDCV-Server, 3_NICEDCV-DisplayDriver, 4_update_registry, OK

Next, double-click the respective files in the InstallationFiles folder in order to install NICE DCV Server and Display driver.

Finally, right-click on 4_update_registry.ps1, and select “Run with PowerShell” to activate the driver. In order to complete the setup, restart the instance.

Install your software

After the instance restart, you can connect from your local machine to your EC2 instance with NICE DCV. The NICE DCV bandwidth-adaptive streaming protocol allows near real-time responsiveness for your applications without compromising image accuracy. This is the recommended way to stream latency-sensitive applications.

Download the NICE DCV viewer client and connect to your EC2 instance with the same credentials that you used for the RDP connection earlier. After testing the NICE DCV connection, I recommend disabling RDP by removing the corresponding rule in your security group.

You are now all set to install your games and tools on the EC2 instance. Make sure to install them on the C: drive, as you will create an AMI with the contents of C: later.

Start and stop your instance on demand

At this point, you have fully set up the EC2 instance for cloud gaming on AWS. You can now start and stop the instance when needed. The following CLI commands are all you need to remember:

aws ec2 start-instances --instance-ids <INSTANCE_ID>

aws ec2 stop-instances --instance-ids <INSTANCE_ID>

This will use the regular On-Demand Instance capacity of EC2, and you will be billed hourly for the time that your instance is running. If your instance is stopped, you will only be charged for the EBS volume and, if you chose to use one, the Elastic IP address.
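
To check whether the instance is currently running or stopped, you can query its state; replace the instance ID as before:

aws ec2 describe-instances --instance-ids <INSTANCE_ID> --query 'Reservations[].Instances[].State.Name' --output text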

Launch instances from AMI

Make sure you have installed all of the applications you require on your EC2 instance, and then create your personal gaming AMI by running the following AWS CLI command.

aws ec2 create-image --instance-id <YOUR_INSTANCE_ID> --name <THE_NAME_OF_YOUR_AMI>

Use the following command to get details about your AMI creation. When the status changes from pending to available, then your AMI is ready.

aws ec2 describe-images --owners self --query 'Images[*].[Name, ImageId, BlockDeviceMappings[0].Ebs.SnapshotId, State]' --output table

The CDK created an EC2 launch template when it deployed the stack. Run the following CLI command to spin up an EC2 instance from this template.

aws ec2 run-instances --image-id <YOUR_AMI_ID> --launch-template LaunchTemplateName=<LAUNCH_TEMPLATE_NAME> --query "Instances[*].[InstanceId, PublicIpAddress]" --output table

This command will start a new EC2 instance with the exact same configuration as your initial EC2 instance, but with all your software already installed.

Conclusion

This post walked you through creating your personal cloud gaming stack. You are now all set to lean back and enjoy the benefits of per second billing while playing your favorite video games in the AWS cloud.

Visit the Amazon EC2 G4 instances service page to learn more about how AWS continues to push the boundaries of cost-effectiveness for graphic-intensive applications.

Cloudflare recognized as a ‘Leader’ in The Forrester New Wave for Edge Development Platforms

Post Syndicated from Rita Kozlov original https://blog.cloudflare.com/forrester-wave-edge-development-2021/

Forrester’s New Wave for Edge Development Platforms has just been announced. We’re thrilled that they have named Cloudflare a leader (you can download a complimentary copy of the report here).

Since the very beginning, Cloudflare has sought to help developers building on the web, and since the introduction of Workers in 2017, Cloudflare has enabled developers to deploy their applications to the edge itself.

According to the report by Forrester Vice President, Principal Analyst, Jeffrey Hammond, Cloudflare “offers strong compute, data services and web development capabilities. Alongside Workers, Workers KV adds edge data storage. Pages, Stream and Images provide higher level platform services for modern web workloads. Cloudflare has an intuitive developer experience, fast, global deployment of updated code, and minimal cold start times.”

Reimagining development for the modern web

Building on the web has come a long way. The idea that one might have to buy a physical machine in order to build a website seems incomprehensible now. The cloud has played a major role in making it easier for developers to get started. However, since the advent of the cloud, things have stalled and innovation has become more incremental. That means that while developers don’t have to think about buying a server, they are still tasked with thinking about where in the world their servers are, how to add concurrency to handle increasing traffic, and how to keep them secure.

We wanted to abstract all of that away. Our aim: to reimagine what things might look like if developers could truly just think about the application they wanted to build, leaving the scaling, speed, and even compliance to us.

Of course, reimagining things is always scary. There’s no guarantee that taking a new approach is going to work — it usually requires a leap of faith.

It’s been gratifying to see developers flock to our platform — and the applications they’ve been able to build, free of scalability and latency constraints, have been phenomenal.

It’s also gratifying to be named a Leader in Edge Development Platforms by Forrester — one of the preeminent analyst firms in the industry. We feel it really does provide industry recognition to the approach we bet on four years ago.

Cloudflare is the most differentiated among all the vendors evaluated

We received a differentiated rating in the following criteria:

  • Developer experience
  • Programming model
  • Platform execution model
  • “Day 2+” experience
  • Integrations
  • Roadmap
  • Vision
  • Market approach

While being able to build our platform atop Cloudflare’s network gave us an advantage in eliminating latency from the start, we knew that wasn’t enough to compel developers to think in a new way. Since the release of Workers, we have relentlessly focused on making the experience of building a new application as easy as possible at every step of the way: from onboarding, through day 2, and beyond.

This approach extends beyond tooling, and to how we think about additional services developers need in order to complete their applications. For example, in thinking about providing data solutions on the edge, we again wanted to make the distributed nature of the system just work, rather than making developers think about it, which is what led us to develop Durable Objects. With Durable Objects, Cloudflare can make intelligent decisions about where to store the data based on access patterns (or compliance — whichever is most important to the developer), rather than forcing the developer to think about regions.

As we expand our offering, it’s important to us that it continues to be intuitive and easy for developers to solve problems.

We’re just getting started

But, we’re not stopping here. As our cofounder Michelle likes to say, we’re just getting started. We recognize this is just the beginning of the journey to bring the full stack to the edge. We have some exciting announcements coming in the next couple of weeks — stay tuned!

Automation Enables Innovation in the Cloud

Post Syndicated from Shelby Matthews original https://blog.rapid7.com/2021/10/27/automation-enables-innovation-in-the-cloud/

As public cloud adoption continues to grow year after year, we see more and more enterprises realizing the strategic advantage the cloud can provide to help deliver new and innovative products quicker, roll out new features with ease, and reach new customers. But along with those advantages comes a new level of complexity and risk that organizations need to account for.

Rapid7’s recently released 2021 Cloud Misconfigurations Report revealed that there were 121 publicly reported data exposure events last year that were the result of cloud misconfigurations.

One critical part of preventing these misconfigurations is the strategic, gradual adoption of automated notification and remediation workflows.

The benefits of automation in cloud security

Automation in the cloud is the implementation of tools that take the responsibility for routine security tasks away from the user and handle them automatically. These tools can catch and fix misconfigurations before you even realize they were there.

Some of the benefits these tools can bring include:

  • Data breach protection: Despite increased regulations, data breaches continue to grow. Most of these breaches happen when organizations make inadequate or inappropriate investments in cloud security. Now more than ever, companies are under increasing pressure to make appropriate investments to protect customer data as they scale and expand their cloud footprint.
  • Threat protection: When using cloud services, it’s common to be overwhelmed with the large volume of threat signals you receive from a wide variety of sources. Without being able to decipher the signals from noise, it’s difficult to identify true risk and act on it in a timely fashion.

To deliver threat protection, InsightCloudSec integrates with native cloud service providers’ security platforms (e.g., Amazon GuardDuty) and other partners (e.g., Tenable) for best-in-class, intelligent threat detection that continuously monitors for malicious activity and unauthorized behavior. These services use machine learning, anomaly detection, and integrated threat intelligence to identify and prioritize potential threats. You’ll be able to detect cryptocurrency mining, credential compromise behavior, communication with known command-and-control servers, and API calls from known malicious IP addresses.

While automating every workflow possible isn’t the answer, one thing is clear: Enterprise-scale cloud environments have outstripped the human capacity to manage them manually.

Not only is automation essential for security, it also cuts down the time it takes to fix resources compared to a manual approach. Automation greatly reduces the risk of human error in the cloud and allows workflows to include automated security across the board.

How InsightCloudSec provides it

InsightCloudSec comes with an automated program that we call our bots, which allow you to execute actions on resources based on your conditions. Your bot consists of three things: scope, filters, and actions. A single bot can be configured to apply a unified approach to remediation across all clouds, creating a consistent, scalable, and sustainable approach to cloud security.

  • Scope: The scope chosen by the user determines which resources and places the bot will evaluate. You choose the bounds that the bot is constricted to. An example of a scope would be all of your AWS, GCP, and Azure accounts, looking for the Storage Container resource (e.g., S3 bucket, Blob storage, and Google Cloud storage).
  • Filters: InsightCloudSec comes with over 800 filters you can choose from. These filters are the condition on which the bot will act. An example of a filter would be Storage Container Public Access, which will evaluate if any of the resources within your scope have public access due to their permission(s) and/or bucket policy.
  • Actions: Finally, this is what the bot actually does. InsightCloudSec ships with over 100 different actions that you can customize. For example, if you set up a bot to identify storage containers that are public, the action could be notifying the team and cleaning up the exposed permissions.

Bots offer a unified approach to remediation across all your cloud environments. With InsightCloudSec, you can customize them just how you want based on the full context of a misconfiguration. Automation with InsightCloudSec is the key to achieving security at the speed of scale.

What common cloud security mistakes are organizations making?

Find out in our 2021 Cloud Misconfigurations Report

Custom Headers for Cloudflare Pages

Post Syndicated from Nevi Shah original https://blog.cloudflare.com/custom-headers-for-pages/

Until today, Cloudflare Workers has been a great solution to setting headers, but we wanted to create an even smoother developer experience. Today, we’re excited to announce that Pages now natively supports custom headers on your projects! Simply create a _headers file in the build directory of your project and within it, define the rules you want to apply.

/developer-docs/*
  X-Hiring: Looking for a job? We're hiring engineers (https://www.cloudflare.com/careers/jobs)

What can you set with custom headers?

Being able to set custom headers is useful for a variety of reasons — let’s explore some of your most popular use cases.

Search Engine Optimization (SEO)

When you create a Pages project, a pages.dev deployment is created for your project which enables you to get started immediately and easily preview changes as you iterate. However, we realize this poses an issue — publishing multiple copies of your website can harm your rankings in search engine results. One way to solve this is by disabling indexing on all pages.dev subdomains, but we see many using their pages.dev subdomain as their primary domain. With today’s announcement you can attach headers such as X-Robots-Tag to hint to Google and other search engines how you’d like your deployment to be indexed.

For example, to prevent your pages.dev deployment from being indexed, you can add the following to your _headers file:

https://:project.pages.dev/*
  X-Robots-Tag: noindex

Security

Customizing headers doesn’t just help with your site’s search result ranking — a number of browser security features can be configured with headers. A few headers that can enhance your site’s security are:

  • X-Frame-Options: You can prevent click-jacking by informing browsers not to embed your application inside another (e.g. with an <iframe>).
  • X-Content-Type-Options: nosniff: Prevents browsers from interpreting a response as any content type other than the one defined in the Content-Type header.
  • Referrer-Policy: This allows you to customize how much information visitors give about where they’re coming from when they navigate away from your page.
  • Permissions-Policy: Browser features can be disabled to varying degrees with this header (recently renamed from Feature-Policy).
  • Content-Security-Policy: And if you need fine-grained control over the content in your application, this header allows you to configure a number of security settings, including similar controls to the X-Frame-Options header.

You can configure these headers to protect an /app/* path, with the following in your _headers file:

/app/*
  X-Frame-Options: DENY
  X-Content-Type-Options: nosniff
  Referrer-Policy: no-referrer
  Permissions-Policy: document-domain=()
  Content-Security-Policy: script-src 'self'; frame-ancestors 'none';

CORS

Modern browsers implement a security protection called CORS or Cross-Origin Resource Sharing. This prevents one domain from being able to force a user’s action on another. Without CORS, a malicious site owner might be able to do things like make requests to unsuspecting visitors’ banks and initiate a transfer on their behalf. However, with CORS, requests are prevented from one origin to another to stop the malicious activity.

There are, however, some cases where it is safe to allow these cross-origin requests. So-called “simple requests” (such as linking to an image hosted on a different domain) are permitted by the browser. Fetching these resources dynamically is often where the difficulty arises, and the browser is sometimes overzealous in its protection. Simple static assets on Pages are safe to serve to any domain, since the request takes no action and there is no visitor session. Because of this, a domain owner can attach CORS headers to specify exactly which requests can be allowed in the _headers file for fine-grained and explicit control.

For example, the use of the asterisk will enable any origin to request any asset from your Pages deployment:

/*
  Access-Control-Allow-Origin: *

To be more restrictive and limit requests to only be allowed from a ‘staging’ subdomain, we can do the following:

https://:project.pages.dev/*
  Access-Control-Allow-Origin: https://staging.:project.pages.dev

How we built support for custom headers

To support all these use cases for custom headers, we had to build a new engine to determine which rules to apply for each incoming request. Backed, of course, by Workers, this engine supports splats and placeholders, and allows you to include those matched values in your headers.

Although we don’t support all of its features, we’ve modeled this matching engine after the URLPattern specification which was recently shipped with Chrome 95. We plan to be able to fully implement this specification for custom headers once URLPattern lands in the Workers runtime, and there should hopefully be no breaking changes to migrate.
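
As a rough sketch of what this enables, a placeholder matched in the path could be echoed back in a (hypothetical) response header; the exact placeholder syntax may differ, so check the Pages documentation:

/blog/:title
  X-Blog-Post: :title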

Enhanced support for redirects

With this same engine, we’re bringing these features to your _redirects file as well. You can now configure your redirects with splats, placeholders and status codes as shown in the example below:

/blog/* https://blog.example.com/:splat 301
/products/:code/:name /products?name=:name&code=:code
/submit-form https://static-form.example.com/submit 307

Get started

Custom headers and redirects for Cloudflare Pages can be configured today. Check out our documentation to get started, and let us know how you’re using it in our Discord server. We’d love to hear about what this unlocks for your projects!

Coming up…

And finally, if a _headers file and enhanced support for _redirects just isn’t enough for you, we also have something big coming very soon which will give you the power to build even more powerful projects. Stay tuned!

Computer science education is a global challenge

Post Syndicated from Sue Sentance original https://www.raspberrypi.org/blog/brookings-report-global-computer-science-education-policy/

For the last two years, I’ve been one of the advisors to the Center for Universal Education at the Brookings Institution, a US-based think tank, on their project to survey formal computing education systems across the world. The resulting education policy report, Building skills for life: How to expand and improve computer science education around the world, pulls together the findings of their research. I’ll highlight key lessons policymakers and educators can benefit from, and what elements I think have been missed.

Woman teacher and female students at a computer

Why a global challenge?

Work on this new Brookings report was motivated by the belief that if our goal is to create an equitable, global society, then we need computer science (CS) in school to be accessible around the world; countries need to educate their citizens about computer science, both to strengthen their economic situation and to tackle inequality between countries. The report states that “global development gaps will only be expected to widen if low-income countries’ investments in these domains falter while high-income countries continue to move ahead” (p. 12).

Student using a Raspberry Pi computer

The report makes an important contribution to our understanding of computer science education policy, providing a global overview as well as in-depth case studies of education policies around the world. The case studies look at 11 countries and territories, including England, South Africa, British Columbia, Chile, Uruguay, and Thailand. The map below shows an overview of the Brookings researchers’ findings. It indicates whether computer science is a mandatory or elective subject, whether it is taught in primary or secondary schools, and whether it is taught as a discrete subject or across the curriculum.

A world map showing countries' situation in terms of computing education policy.
Computer science education across the world. Figure courtesy of Brookings Institution (click to enlarge).

It’s a patchy picture, demonstrating both countries’ level of capacity to deliver computer science education and the different approaches countries have taken. Analysis in the Brookings report shows a correlation between a country’s economic position and implementation of computer science in schools: no low-income countries have implemented it at all, while over 20% of high-income countries have mandatory computer science education at both primary and secondary level. 

Capacity building: IT infrastructure and beyond

Given these disparities, there is a significant focus in the report on what IT infrastructure countries need in order to deliver computer science education. This infrastructure needs to be preceded by investment (funds to afford it) and policy (a clear statement of intent and an implementation plan). Many countries that the Brookings report describes as having no computer science education may still be struggling to put these in place.

A young woman codes in a computing classroom.

The recently developed CAPE (capacity, access, participation, experience) framework offers another way of assessing disparities in education. To have capacity to make computer science part of formal education, a country needs to put in place the following elements:

My view is that countries that are at the beginning of this process need to focus on IT infrastructure, but also on the other elements of capacity. The Brookings report touches on these elements of capacity as well. Once these are in place in a country, the focus can shift to the next level: access for learners.

Comparing countries — what policies are in place?

In their report, the Brookings researchers identify seven complementary policy actions that a country can take to facilitate implementation of computer science education:

  1. Introduction of ICT (information and communications technology) education programmes
  2. Requirement for CS in primary education
  3. Requirement for CS in secondary education
  4. Introduction of in-service CS teacher education programmes
  5. Introduction of pre-service teacher CS education programmes
  6. Setup of a specialised centre or institution focused on CS education research and training
  7. Regular funding allocated to CS education by the legislative branch of government

The figure below compares the 11 case-study regions in terms of how many of the seven policy actions have been taken, what IT infrastructure is in place, and when the process of implementing CS education started.

A graph showing the trajectory of 11 regions of the world in terms of computing education policy.
Trajectories of regions in the 11 case studies. Figure courtesy of Brookings Institution (click to enlarge).

England is the only country that has taken all seven of the identified policy actions, having already had nation-wide IT infrastructure and broadband connectivity in place. Chile, Thailand, and Uruguay have made impressive progress, both on infrastructure development and on policy actions. However, it’s clear that making progress takes many years — Chile started in 1992, and Uruguay in 2007 —  and requires a considerable amount of investment and government policy direction.

Computing education policy in England

The first case study that Brookings produced for this report, back in 2019, related to England. Over the last 8 years in England, we have seen the development of computing education in the curriculum as a mandatory subject in primary and secondary schools. Initially, funding for teacher education was limited, but in 2018, the government provided £80 million of funding to us and a consortium of partners to establish the National Centre for Computing Education (NCCE). Thus, in-service teacher education in computing has been given more priority in England than probably anywhere else in the world.

Three young people learn coding at laptops supported by a volunteer at a CoderDojo session.

Alongside teacher education, the funding also covered our development of classroom resources to cover the whole CS curriculum, and of Isaac Computer Science, our online platform for 14- to 18-year-olds learning computer science. We’re also working on a £2m government-funded research project looking at approaches to improving the gender balance in computing in English schools, which is due to report results next year.

The future of education policy in the UK as it relates to AI technologies is the topic of an upcoming panel discussion I’m inviting you to attend.

school-aged girls and a teacher using a computer together.

The Brookings report highlights the way in which the English government worked with non-profit organisations, including us here at the Raspberry Pi Foundation, to deliver on the seven policy actions. Partnerships and engagement with stakeholders appear to be key to effectively implementing computer science education within a country. 

Lessons learned, lessons missed

What can we learn from the Brookings report’s helicopter view of 11 case studies? How can we ensure that computer science education is going to be accessible for all children? The Brookings researchers draw out six lessons learned in their report, which I have taken the liberty of rewording and shortening here:

  1. Create demand
  2. Make it mandatory
  3. Train teachers
  4. Start early
  5. Work in partnership
  6. Make it engaging

In the report, the sixth lesson is phrased as, “When taught in an interactive, hands-on way, CS education builds skills for life.” The Brookings researchers conclude that focusing on project-based learning and maker spaces is the way for schools to achieve this, which I don’t find convincing. The problem with project-based learning in maker spaces is one of scale: in my experience, this approach only works well in a non-formal, small-scale setting. The other reason is that maker spaces, while being very engaging, are also very expensive. Therefore, I don’t see them as a practicable aspect of a nationally rolled-out, mandatory, formal curriculum.

When we teach computer science, it is important that we encourage young people to ask questions about ethics, power, privilege, and social justice.

Sue Sentance

We have other ways to make computer science engaging to all learners, using a breadth of pedagogical approaches. In particular, we should focus on cultural relevance, an aspect of education the Brookings report does not centre. Culturally relevant pedagogy is a framework for teaching that emphasises the importance of incorporating and valuing all learners’ knowledge, heritage, and ways of learning, and promotes the development of learners’ critical consciousness of the world. When we teach computer science, it is important that we encourage young people to ask questions about ethics, power, privilege, and social justice.

Three teenage boys do coding at a shared computer during a computer science lesson.

The Brookings report states that we need to develop and use evidence on how to teach computer science, and I agree with this. But to properly support teachers and learners, we need to offer them a range of approaches to teaching computing, rather than just focusing on one, such as project-based learning, however valuable that approach may be in some settings. Through the NCCE, we have embedded twelve pedagogical principles in the Teach Computing Curriculum, which is being rolled out to six million learners in England’s schools. In time, through this initiative, we will gain firm evidence on what the most effective approaches are for teaching computer science to all students in primary and secondary schools.

Moving forward together

I believe the Brookings Institution’s report has a huge contribution to make as countries around the world seek to introduce computer science in their classrooms. As we can conclude from the patchiness of the CS education world map, there is still much work to be done. I feel fortunate to be living in a country that has been able and motivated to prioritise computer science education, and I think that partnerships and working across stakeholder groups, particularly with schools and teachers, have played a large part in the progress we have made.

To my mind, the challenge now is to find ways in which countries can work together towards more equity in computer science education around the world. The findings in this report will help us make that happen.


PS We invite you to join us on 16 November for our online panel discussion on what the future of the UK’s education policy needs to look like to enable young people to navigate and shape AI technologies. Our speakers include UK Minister Chris Philp, our CEO Philip Colligan, and two young people currently in education. Tabitha Goldstaub, Chair of the UK government’s AI Council, will be chairing the discussion.

Sign up for your free ticket today and submit your questions to our panel!

The post Computer science education is a global challenge appeared first on Raspberry Pi.

[$] Android wallpaper fingerprints

Post Syndicated from jake original https://lwn.net/Articles/873921/rss

Uniquely identifying users so that they can be tracked as they go about their business on the internet is, sadly, a major goal for advertisers and others today. Web browser cookies provide a fairly well-known avenue for tracking users as they traverse various web sites, but mobile apps are not browsers, so that mechanism is not available. As it turns out, though, there are ways to “fingerprint” Android devices—and likely those of other mobile platforms—so that the device owners can be tracked as they hop between their apps.

Simplifying Multi-account CI/CD Deployments using AWS Proton

Post Syndicated from Marvin Fernandes original https://aws.amazon.com/blogs/architecture/simplifying-multi-account-ci-cd-deployments-using-aws-proton/

Many large enterprises, startups, and public sector entities maintain different deployment environments within multiple Amazon Web Services (AWS) accounts to securely develop, test, and deploy their applications. Maintaining separate AWS accounts for different deployment stages is a standard practice for organizations. It helps developers limit the blast radius in case of failure when deploying updates to an application, and provides for more resilient and distributed systems.

Typically, the team that owns and maintains these environments (the platform team) is segregated from the development team. A platform team performs critical activities. These can include setting infrastructure and governance standards, keeping patch levels up to date, and maintaining security and monitoring standards. Development teams are responsible for writing the code, performing appropriate testing, and pushing code to repositories to initiate deployments. The development teams are focused more on delivering their application and less on the infrastructure and networking that ties them together. The segregation of duties and use of multi-account environments are effective from a regulatory and development standpoint. But monitoring, maintaining, and enabling the safe release to these environments can be cumbersome and error prone.

In this blog, you will see how to simplify multi-account deployments in an environment that is segregated between platform and development teams. We will show how you can use one consistent and standardized continuous delivery pipeline with AWS Proton.

Challenges with multi-account deployment

For platform teams, maintaining these large environments at different stages in the development lifecycle and within separate AWS accounts can be tedious. The platform teams must ensure that certain security and regulatory requirements (like networking or encryption standards) are implemented in each separate account and environment. When working in a multi-account structure, AWS Identity and Access Management (IAM) permissions and cross-account access management can be a challenge for many account administrators. Many organizations rely on specific monitoring metrics and tagging strategies to perform basic functions. The platform team is responsible for enforcing these processes and implementing these details repeatedly across multiple accounts. This is a pain point for many infrastructure administrators or platform teams.

Platform teams are also responsible for ensuring a safe and secure application deployment pipeline. To do this, they isolate deployment and production environments from one another limiting the blast radius in case of failure. Platform teams enforce the principle of least privilege on each account, and implement proper testing and monitoring standards across the deployment pipeline.

Instead of focusing on the application and code, many developers face challenges complying with these rigorous security and infrastructure standards. This results in limited access to resources for developers, and reliance on administrators to deploy application code into production can lead to lags in the deployment of updated code.

Deployment using AWS Proton

The ownership for infrastructure lies with the platform teams. They set the standards for security, code deployment, monitoring, and even networking. AWS Proton is an infrastructure provisioning and deployment service for serverless and container-based applications. Using AWS Proton, the platform team can provide their developers with a highly customized and catered “platform as a service” experience. This allows developers to focus their energy on building the best application, rather than spending time on orchestration tools. Platform teams can similarly focus on building the best platform for that application.

With AWS Proton, developers use predefined templates. With only a few input parameters, infrastructure can be provisioned and code deployed in an effective pipeline. This way, you can get your application running and updated more quickly (see Figure 1).

Figure 1. Platform and development team roles when using AWS Proton

AWS Proton allows you to deploy any serverless or container-based application across multiple accounts. You can define infrastructure standards and effective continuous delivery pipelines for your organization. Proton breaks down the infrastructure into environment and service (“infrastructure as code” templates).

In Figure 2, platform teams provide environment and service templates for hosting a microservices application on Amazon Elastic Container Service (Amazon ECS) with AWS Fargate. The environment template contains infrastructure that is shared across services. This includes the networking configuration: Amazon Virtual Private Cloud (VPC), subnets, route tables, Internet Gateway, security groups, and the ECS cluster definition for the Fargate service.

The service template provides details of the service. It includes the container task definitions, monitoring and logging definitions, and an effective continuous delivery pipeline. Using the environment and service template definitions, development teams can define the microservices that are running on Amazon ECS. They can deploy their code following the continuous integration and continuous delivery (CI/CD) pipeline.

Figure 2. Platform teams provision environment and service infrastructure as code templates in AWS Proton management account

Multi-account CI/CD deployment

For Figures 3 and 4, we used publicly available templates and created three separate AWS accounts: the AWS Proton management account, the development account, and the production account. Additional accounts may be added based on your use case and security requirements. As shown in Figure 3, the AWS Proton management account contains the environment, service, and pipeline templates. It also provides the connection to other accounts within the organization. The development and production accounts follow the structure of a development pipeline for a typical organization.

AWS Proton alleviates complicated cross-account policies by using a secure “environment account connection” feature. With environment account connections, platform administrators can give AWS Proton permissions to provision infrastructure in other accounts. They create an IAM role and specify a set of permissions in the target account. This enables Proton to assume the role from the management account to build resources in the target accounts.
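
As an illustration of that pattern, the following boto3 sketch creates such a role in a target account. The role name, the proton.amazonaws.com service principal, and the idea of attaching a scoped permissions policy afterward are assumptions for this example rather than values taken from the post.

import json
import boto3

# Run with credentials for the target (environment) account
iam = boto3.client("iam")

# Trust policy that lets the AWS Proton service assume this role
# (service principal shown here is an assumption for illustration)
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "proton.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

role = iam.create_role(
    RoleName="ProtonEnvironmentAccountRole",  # illustrative name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Role that AWS Proton assumes to provision resources in this account",
)
print(role["Role"]["Arn"])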

AWS Key Management Service (KMS) policies can also be hard to manage in multi-account deployments. AWS Proton reduces the need to manage cross-account KMS permissions. In an AWS Proton management account, you can build a pipeline using a single artifact repository. You can also extend the pipeline to additional accounts from a single source of truth. This feature can be helpful when accounts are located in different Regions, for example due to regulatory requirements.

Figure 3. AWS Proton uses cross-account policies and provisions infrastructure in development and production accounts with environment connection feature

Once the environment and service templates are defined in the AWS Proton management account, the developer selects the templates. AWS Proton then provisions the infrastructure and the continuous delivery pipeline that deploys the services to each account.

Developers commit code to a repository, and the pipeline is responsible for deploying to the different deployment stages. You don’t have to worry about any of the environment connection workflows. AWS Proton allows platform teams to provide a single pipeline definition to deploy the code into multiple accounts without any additional account-level information. This standardizes the deployment process and implements effective testing and staging policies across the organization.

Platform teams can also inject manual approvals into the pipeline so they can control when a release is deployed. Developers can define tests that initiate after a deployment to ensure the validity of releases before moving to a production environment. This simplifies application code deployment in an AWS multi-account environment and allows updates to be deployed more quickly into production. The resulting deployed infrastructure is shown in Figure 4.

Figure 4. AWS Proton deploys service into multi-account environment through standardized continuous delivery pipeline

Conclusion

In this blog, we have outlined how using AWS Proton can simplify handling multi-account deployments using one consistent and standardized continuous delivery pipeline. AWS Proton addresses multiple challenges in the segregation of duties between developers and platform teams. By having one uniform resource for all these accounts and environments, developers can develop and deploy applications faster, while still complying with infrastructure and security standards.

For further reading:

Getting started with Proton
Identity and Access Management for AWS Proton
Proton administrative guide

Copy large datasets from Google Cloud Storage to Amazon S3 using Amazon EMR

Post Syndicated from Andrew Lee original https://aws.amazon.com/blogs/big-data/copy-large-datasets-from-google-cloud-storage-to-amazon-s3-using-amazon-emr/

Many organizations have data sitting in various data sources in a variety of formats. Even though data is a critical component of decision-making, for many organizations this data is spread across multiple public clouds. Organizations are looking for tools that make it easy and cost-effective to copy large datasets across cloud vendors. With Amazon EMR and the Hadoop file copy tools Apache DistCp and S3DistCp, we can migrate large datasets from Google Cloud Storage (GCS) to Amazon Simple Storage Service (Amazon S3).

Apache DistCp is an open-source tool for Hadoop clusters that you can use to perform inter-cluster or intra-cluster file transfers. AWS provides an extension of that tool called S3DistCp, which is optimized to work with Amazon S3. Both these tools use Hadoop MapReduce to parallelize the copy of files and directories in a distributed manner. Data migration between GCS and Amazon S3 is possible by utilizing Hadoop’s native support for S3 object storage and using a Google-provided Hadoop connector for GCS. This post demonstrates how to configure an EMR cluster for DistCp and S3DistCp, goes over the settings and parameters for both tools, performs a copy of a test 9.4 TB dataset, and compares the performance of the copy.

Prerequisites

The following are the prerequisites for configuring the EMR cluster:

  1. Install the AWS Command Line Interface (AWS CLI) on your computer or server. For instructions, see Installing, updating, and uninstalling the AWS CLI.
  2. Create an Amazon Elastic Compute Cloud (Amazon EC2) key pair for SSH access to your EMR nodes. For instructions, see Create a key pair using Amazon EC2.
  3. Create an S3 bucket to store the configuration files, bootstrap shell script, and the GCS connector JAR file. Make sure that you create a bucket in the same Region as where you plan to launch your EMR cluster.
  4. Create a shell script (sh) to copy the GCS connector JAR file and the Google Cloud Platform (GCP) credentials to the EMR cluster’s local storage during the bootstrapping phase. Upload the shell script to your bucket location: s3://<S3 BUCKET>/copygcsjar.sh. The following is an example shell script:
#!/bin/bash
# Copy the GCS connector JAR and the GCP service account key from Amazon S3
# to the node's local storage during the bootstrap phase
sudo aws s3 cp s3://<S3 BUCKET>/gcs-connector-hadoop3-latest.jar /tmp/gcs-connector-hadoop3-latest.jar
sudo aws s3 cp s3://<S3 BUCKET>/gcs.json /tmp/gcs.json
  5. Download the GCS connector JAR file for Hadoop 3.x (if using a different version, you need to find the JAR file for your version) to allow reading of files from GCS.
  6. Upload the file to s3://<S3 BUCKET>/gcs-connector-hadoop3-latest.jar.
  7. Create GCP credentials for a service account that has access to the source GCS bucket. The credentials file should be named gcs.json and be in JSON format.
  8. Upload the key to s3://<S3 BUCKET>/gcs.json. The following is a sample key:
{
   "type":"service_account",
   "project_id":"project-id",
   "private_key_id":"key-id",
   "private_key":"-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n",
   "client_email":"service-account-email",
   "client_id":"client-id",
   "auth_uri":"https://accounts.google.com/o/oauth2/auth",
   "token_uri":"https://accounts.google.com/o/oauth2/token",
   "auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs",
   "client_x509_cert_url":"https://www.googleapis.com/robot/v1/metadata/x509/service-account-email"
}
  9. Create a JSON file named gcsconfiguration.json to enable the GCS connector in Amazon EMR. Make sure the file is in the same directory as where you plan to run your AWS CLI commands. The following is an example configuration file:
[
   {
      "Classification":"core-site",
      "Properties":{
         "fs.AbstractFileSystem.gs.impl":"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
         "google.cloud.auth.service.account.enable":"true",
         "google.cloud.auth.service.account.json.keyfile":"/tmp/gcs.json",
         "fs.gs.status.parallel.enable":"true"
      }
   },
   {
      "Classification":"hadoop-env",
      "Configurations":[
         {
            "Classification":"export",
            "Properties":{
               "HADOOP_USER_CLASSPATH_FIRST":"true",
               "HADOOP_CLASSPATH":"$HADOOP_CLASSPATH:/tmp/gcs-connector-hadoop3-latest.jar"
            }
         }
      ]
   },
   {
      "Classification":"mapred-site",
      "Properties":{
         "mapreduce.application.classpath":"/tmp/gcs-connector-hadoop3-latest.jar"
      }
   }
]

Launch and configure Amazon EMR

For our test dataset, we start with a basic cluster consisting of one primary node and four core nodes, for a total of five c5n.xlarge instances. You should iterate on your copy workload by adding more core nodes and checking your copy job timings to determine the proper cluster sizing for your dataset.

  1. We use the AWS CLI to launch and configure our EMR cluster (see the following basic create-cluster command):
aws emr create-cluster \
--name "My First EMR Cluster" \
--release-label emr-6.3.0 \
--applications Name=Hadoop \
--ec2-attributes KeyName=myEMRKeyPairName \
--instance-type c5n.xlarge \
--instance-count 5 \
--use-default-roles

  2. Create a custom bootstrap action to be performed at cluster creation to copy the GCS connector JAR file and GCP credentials to the EMR cluster’s local storage. You can add the following parameter to the create-cluster command to configure your custom bootstrap action:
--bootstrap-actions Path="s3://<S3 BUCKET>/copygcsjar.sh"

Refer to Create bootstrap actions to install additional software for more details about this step.

  3. To override the default configurations for your cluster, you need to supply a configuration object. You can add the following parameter to the create-cluster command to specify the configuration object:
--configurations file://gcsconfiguration.json

Refer to Configure applications when you create a cluster for more details on how to supply this object when creating the cluster.

Putting it all together, the following code is an example of a command to launch and configure an EMR cluster that can perform migrations from GCS to Amazon S3:

aws emr create-cluster \
--name "My First EMR Cluster" \
--release-label emr-6.3.0 \
--applications Name=Hadoop \
--ec2-attributes KeyName=myEMRKeyPairName \
--instance-type c5n.xlarge \
--instance-count 5 \
--use-default-roles \
--bootstrap-actions Path="s3://<S3 BUCKET>/copygcsjar.sh" \
--configurations file://gcsconfiguration.json
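
If you script your infrastructure in Python rather than the AWS CLI, a roughly equivalent cluster launch with boto3 might look like the following sketch. The bucket name and key pair are the same placeholders used above, and the default EMR service roles are assumed to already exist in your account.

import json
import boto3

emr = boto3.client("emr", region_name="us-west-2")

# Load the same configuration object passed to --configurations above
with open("gcsconfiguration.json") as f:
    configurations = json.load(f)

response = emr.run_job_flow(
    Name="My First EMR Cluster",
    ReleaseLabel="emr-6.3.0",
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "c5n.xlarge",
        "SlaveInstanceType": "c5n.xlarge",
        "InstanceCount": 5,
        "Ec2KeyName": "myEMRKeyPairName",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "Copy GCS connector and credentials",
            "ScriptBootstrapAction": {"Path": "s3://<S3 BUCKET>/copygcsjar.sh"},
        }
    ],
    Configurations=configurations,
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])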

Submit S3DistCp or DistCp as a step to an EMR cluster

You can run the S3DistCp or DistCp tool in several ways.

When the cluster is up and running, you can SSH to the primary node and run the command in a terminal window, as mentioned in this post.

You can also start the job as part of the cluster launch. After the job finishes, the cluster can either continue running or be stopped. You can do this by submitting a step directly via the AWS Management Console when creating a cluster. Provide the following details:

  • Step type – Custom JAR
  • Name – S3DistCp Step
  • JAR location – command-runner.jar
  • Arguments – s3-dist-cp --src=gs://<GCS BUCKET>/ --dest=s3://<S3 BUCKET>/
  • Action on failure – Continue

We can always submit a new step to the existing cluster. The syntax here is slightly different than in previous examples. We separate arguments by commas. In the case of a complex pattern, we shield the whole step option with single quotation marks:

aws emr add-steps \
--cluster-id j-ABC123456789Z \
--steps 'Name=LoadData,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Type=CUSTOM_JAR,Args=s3-dist-cp,--src=gs://<GCS BUCKET>/,--dest=s3://<S3 BUCKET>/'
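
The same step can also be submitted from Python with boto3. This is a sketch that carries over the placeholder cluster ID and bucket names from the command above.

import boto3

emr = boto3.client("emr", region_name="us-west-2")

# Submit the S3DistCp copy as a step on the running cluster
response = emr.add_job_flow_steps(
    JobFlowId="j-ABC123456789Z",
    Steps=[
        {
            "Name": "LoadData",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src=gs://<GCS BUCKET>/",
                    "--dest=s3://<S3 BUCKET>/",
                ],
            },
        }
    ],
)
print(response["StepIds"])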

DistCp settings and parameters

In this section, we optimize the cluster copy throughput by adjusting the number of maps or reducers and other related settings.

Memory settings

We use the following memory settings:

-Dmapreduce.map.memory.mb=1536
-Dyarn.app.mapreduce.am.resource.mb=1536

Both parameters determine the size of the map containers that are used to parallelize the transfer. Setting this value in line with the cluster resources and the number of maps defined is key to ensuring efficient memory usage. You can calculate the number of launched containers by using the following formula:

Total number of launched containers = Total memory of cluster / Map container memory
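
As a quick sanity check, the arithmetic below plugs hypothetical numbers into the formula; the per-node YARN memory figure is an assumption for illustration, not a measured value.

map_container_mb = 1536
nodes = 96
yarn_memory_per_node_mb = 10 * 1024  # assumed ~10 GB of YARN memory per core node

# Matches the -m 640 map setting used later in this post with these example numbers
launched_containers = (nodes * yarn_memory_per_node_mb) // map_container_mb
print(launched_containers)  # 640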

Dynamic strategy settings

We use the following dynamic strategy settings:

-Ddistcp.dynamic.max.chunks.tolerable=4000
-Ddistcp.dynamic.split.ratio=3 -strategy dynamic

The dynamic strategy settings determine how DistCp splits up the copy task into dynamic chunk files. Each of these chunks is a subset of the source file listing. The map containers then draw from this pool of chunks. If a container finishes early, it can get another unit of work. This lets faster containers perform more work than slower ones, so the overall copy job finishes sooner. The two tunable settings are split ratio and max chunks tolerable. The split ratio determines how many chunks are created from the number of maps. The max chunks tolerable setting determines the maximum number of chunks to allow. The setting is determined by the ratio and the number of maps defined:

Number of chunks = Split ratio * Number of maps
Max chunks tolerable must be > Number of chunks
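
The relationship is easy to verify with the values used later in this post:

maps = 640
split_ratio = 3
max_chunks_tolerable = 4000

chunks = split_ratio * maps                  # 1920 chunk files
print(chunks, max_chunks_tolerable > chunks)  # 1920 True, so the setting is safe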

Map settings

We use the following map setting:

-m 640

This determines the number of map containers to launch.

List status settings

We use the following list status setting:

-numListstatusThreads 15

This determines the number of threads to perform the file listing of the source GCS bucket.

Sample command

The following is a sample command when running with 96 core or task nodes in the EMR cluster:

hadoop distcp \
-Dmapreduce.map.memory.mb=1536 \
-Dyarn.app.mapreduce.am.resource.mb=1536 \
-Ddistcp.dynamic.max.chunks.tolerable=4000 \
-Ddistcp.dynamic.split.ratio=3 \
-strategy dynamic \
-update \
-m 640 \
-numListstatusThreads 15 \
gs://<GCS BUCKET>/ s3://<S3 BUCKET>/

S3DistCp settings and parameters

When running large copies from GCS using S3DistCp, make sure you have the parameter fs.gs.status.parallel.enable (also shown earlier in the sample Amazon EMR application configuration object) set in core-site.xml. This helps parallelize getFileStatus and listStatus methods to reduce latency associated with file listing. You can also adjust the number of reducers to maximize your cluster utilization. The following is a sample command when running with 24 core or task nodes in the EMR cluster:

s3-dist-cp -Dmapreduce.job.reduces=48 --src=gs://<GCS BUCKET>/ --dest=s3://<S3 BUCKET>/

Testing and performance

To test the performance of DistCp and S3DistCp, we used a test dataset of 9.4 TB (157,000 files) stored in a multi-Region GCS bucket. Both the EMR cluster and S3 bucket were located in us-west-2. The number of core nodes that we used in our testing varied from 24 to 120.

The following are the results of the DistCp test:

  • Workload – 9.4 TB and 157,098 files
  • Instance types – 1x c5n.4xlarge (primary), c5n.xlarge (core)
Nodes    Throughput    Transfer Time    Maps
24       1.5 GB/s      100 mins         168
48       2.9 GB/s      53 mins          336
96       4.4 GB/s      35 mins          640
120      5.4 GB/s      29 mins          840

The following are the results of the S3DistCp test:

  • Workload – 9.4 TB and 157,098 files
  • Instance types – 1x c5n.4xlarge (primary), c5n.xlarge (core)
Nodes    Throughput    Transfer Time    Reducers
24       1.9 GB/s      82 mins          48
48       3.4 GB/s      45 mins          120
96       5.0 GB/s      31 mins          240
120      5.8 GB/s      27 mins          240

The results show that S3DistCp performed slightly better than DistCp for our test dataset. We stopped at 120 nodes because we were satisfied with the performance of the copy. Adding more nodes might yield better performance; iterate through different node counts to determine the proper number for your dataset.

Using Spot Instances for task nodes

Amazon EMR supports the capacity-optimized allocation strategy for EC2 Spot Instances for launching Spot Instances from the most available Spot Instance capacity pools by analyzing capacity metrics in real time. You can now specify up to 15 instance types in your EMR task instance fleet configuration. For more information, see Optimizing Amazon EMR for resilience and cost with capacity-optimized Spot Instances.

Clean up

Delete the cluster when the copy job is complete, unless the copy job was submitted as a step at cluster launch and the cluster was configured to stop automatically after the copy job finishes.

Conclusion

In this post, we showed how you can copy large datasets from GCS to Amazon S3 using an EMR cluster and two Hadoop file copy tools: DistCp and S3DistCp.

We also compared the performance of DistCp and S3DistCp with a test dataset stored in a multi-Region GCS bucket. As a follow-up to this post, we will run the same test on Graviton instances to compare the performance and cost of the latest x86-based instances with Graviton2 instances.

You should conduct your own tests to evaluate both tools and find the best one for your specific dataset. Try copying a dataset using this solution and let us know your experience by submitting a comment or starting a new thread on one of our forums.


About the Authors

Hammad Ausaf is a Sr Solutions Architect in the M&E space. He is a passionate builder and strives to provide the best solutions to AWS customers.

Andrew Lee is a Solutions Architect on the Snap Account, and is based in Los Angeles, CA.

New – EC2 Instances Powered by Gaudi Accelerators for Training Deep Learning Models

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-ec2-instances-powered-by-gaudi-accelerators-for-training-deep-learning-models/

There are more applications today for deep learning than ever before. Natural language processing, recommendation systems, image recognition, video recognition, and more can all benefit from high-quality, well-trained models.

The process of building such a model is iterative: construct an initial model, train it on the ground truth data, do some test inferences, refine the model, and repeat. Deep learning models contain many layers (hence the name), each of which transforms the outputs of the previous layer. The training process is math- and processor-intensive, and places demands on just about every part of the systems used for training, including the GPU or other training accelerator, the network, and local or network storage. This sophistication and complexity increase training time and raise costs.

New DL1 Instances
Today I would like to tell you about our new DL1 instances. Powered by Gaudi accelerators from Habana Labs, the dl1.24xlarge instances have the following specs:

Gaudi Accelerators – Each instance is equipped with eight Gaudi accelerators, with a total of 256 GB of High Bandwidth Memory (HBM2) and high-speed, RDMA-powered communication between accelerators.

System Memory – 768 GB of system memory, enough to hold very large sets of training data in memory, as often requested by our customers.

Local Storage – 4 TB of local NVMe storage, configured as four 1 TB volumes.

Processor – Intel Cascade Lake processor with 96 vCPUs.

Network – 400 Gbps of network throughput.

As you can see, we have maxed out the specs in just about every dimension, with the goal of giving you a highly capable machine learning training platform with a low cost of entry and up to 40% better price-performance than current GPU-based EC2 instances.
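
If you would rather launch programmatically than through the console, a minimal boto3 sketch could look like the following. The AMI ID, key pair, and subnet are placeholders you would replace with a Habana-enabled AMI and your own networking details.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a single dl1.24xlarge instance (all identifiers below are placeholders)
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="dl1.24xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",
    SubnetId="subnet-0123456789abcdef0",
)
print(response["Instances"][0]["InstanceId"])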

Gaudi Inside
The Gaudi accelerators are custom-designed for machine learning training, and have a ton of cool & interesting features & attributes:

Data Types – Support for floating point (BF16 and FP32), signed integer (INT8, INT16, and INT32), and unsigned integer (UINT8, UINT16, and UINT32) data.

Generalized Matrix Multiplier Engine (GEMM) – Specialized hardware to accelerate matrix multiplication.

Tensor Processing Cores (TPCs) – Specialized VLIW SIMD (Very Long Instruction Word / Single Instruction Multiple Data) processing units designed for ML training. The TPCs are C-programmable, although most users will use higher-level tools and frameworks.

Getting Started with DL1 Instances
The Gaudi SynapseAI Software Suite for Training will help you to build new models and to migrate existing models from popular frameworks such as PyTorch and TensorFlow.

Here are some resources to get you started:

TensorFlow User Guide – Learn how to run your TensorFlow models on Gaudi.

PyTorch User Guide – Learn how to run your PyTorch models on Gaudi.

Gaudi Model Migration Guide – Learn how to port your PyTorch or TensorFlow models to Gaudi.

HabanaAI Repo – This large, active repo contains setup instructions, reference models, academic papers, and much more.

You can use the TPC Programming Tools to write, simulate, and debug code that runs directly on the TPCs, and you can use the Habana Communication Library (HCL) to build applications that harness the power of multiple accelerators. The Habana Collective Communications Library (HCCL) runs atop HCL and gives you access to collective primitives for Reduce, Broadcast, Gather, and Scatter operations.

Now Available
DL1 instances are available today in the US East (N. Virginia) and US West (Oregon) Regions in On-Demand and Spot form. You can purchase Reserved Instances and Savings Plans as well.

Jeff;

AWS Local Zones Are Now Open in Las Vegas, New York City, and Portland

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-local-zones-are-now-open-in-las-vegas-new-york-city-and-portland/

Today, we are opening three new AWS Local Zones in Las Vegas, New York City (located in New Jersey), and Portland metro areas. We are now at a total of 14 Local Zones in 13 cities since Jeff Barr announced the first Local Zone in Los Angeles in December 2019. These three new Local Zones join the ones in full operation in Boston, Chicago, Dallas, Denver, Houston, Kansas City, Los Angeles, Miami, Minneapolis, and Philadelphia.

Local Zones are one of the ways we bring select AWS services much closer to large populations and geographic areas where major industries come together. By having this proximity, you can deploy latency-sensitive workloads such as real-time gaming platforms, financial transaction processing, media and entertainment content creation, or ad services. Migrations and hybrid strategies are two additional use cases: you can move your applications to a nearby AWS Local Zone while still meeting the low-latency requirements of hybrid deployments.

Local Zones support the deployment of workloads using Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Block Store (EBS), Amazon FSx for Windows File Server and Amazon FSx for Lustre, Elastic Load Balancing, Amazon Relational Database Service (RDS), and Amazon Virtual Private Cloud (VPC). Local Zones provide a high-bandwidth, secure connection between local workloads and those running in the parent AWS Region, while offering the full range of services found in a Region through the same APIs, console and tool sets. This page lists the exact AWS services and features available in each Local Zone.

Local Zones are easy to use and can be enabled in only three clicks! This article will help you learn how to provision infrastructure in a Local Zone, which is very similar to creating infrastructure in an Availability Zone. Once enabled, Local Zones appear as additional Availability Zones in your AWS Management Console or AWS Command Line Interface (CLI).
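
If you prefer the API, here is a minimal boto3 sketch that opts in to a Local Zone group and then lists the zones your account can see; the Los Angeles group name is used purely as an example.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Opt in to a Local Zone group (example group name; pick the one near you)
ec2.modify_availability_zone_group(
    GroupName="us-west-2-lax-1",
    OptInStatus="opted-in",
)

# Local Zones then appear alongside the Region's Availability Zones
zones = ec2.describe_availability_zones(AllAvailabilityZones=True)
for zone in zones["AvailabilityZones"]:
    print(zone["ZoneName"], zone["ZoneType"], zone["OptInStatus"])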

Local Zones in Action
Examples of workloads that our customers run in Local Zones include:

Dish Wireless is building the US telecom industry’s first cloud-native 5G network. They are unleashing 5G connectivity with better speed, better security, and better latency. DISH is leveraging AWS Regions, AWS Local Zones, and AWS Outposts to extend AWS infrastructure and services to wherever they – or their customers – need it.

Integral Ad Science (IAS) is a global leader in digital media quality. Every millisecond counts when it comes to delivering actionable insights for its advertiser and publisher customers. Leveraging AWS Regions and AWS Local Zones, IAS ensures rapid response times in milliseconds when analyzing data and delivering insights.

Esports Engine (a Vindex company) is a turnkey esports solutions company working with gaming publishers, rights holders, brands, and teams to provide production, broadcast, tournament, and program design. Their graphic-intensive streaming content is live-fed from the locations where the games are recorded and then broadcast from the studios to viewers. AWS Local Zones replace their previous on-premises data centers, reducing the need to support physical data center buildings.

Proof Trading is a financial services company looking forward to taking advantage of AWS Local Zones to bring trading workloads closer to the major trading venues located in Chicago and New Jersey. Our industry blog has a detailed article that provides more context on trading-related workloads.

Ubitus is a cloud gaming technology leader. They deploy latency-sensitive game servers all over the world to be closer to gamers. An important part of having a great gaming experience is to have consistent low-latency game plays. AWS Local Zones are a game changer for them. Now, they can easily deploy and test clusters of game servers in many cities across the US, ensuring that more customers get a consistent experience regardless of where they are located.

What’s Next?
When we launched our first Local Zone at AWS re:Invent 2019, we said we were just getting started. In addition to today’s announcement, we are working on opening three additional Local Zones in Atlanta, Phoenix, and Seattle by the end of the year, and we keep expanding. If you would like to express your interest in a particular location, please let us know by filling out the AWS Local Zones Interest form.

We are also listening to your feedback on additional services that we should add to Local Zones, such as more EC2 instance types to give you even more flexibility.

Build and deploy your workload on a Local Zone today.

— seb

Automate building an integrated analytics solution with AWS Analytics Automation Toolkit

Post Syndicated from Manash Deb original https://aws.amazon.com/blogs/big-data/automate-building-an-integrated-analytics-solution-with-aws-analytics-automation-toolkit/

Amazon Redshift is a fast, fully managed, widely popular cloud data warehouse that powers the modern data architecture enabling fast and deep insights or machine learning (ML) predictions using SQL across your data warehouse, data lake, and operational databases. A key differentiating factor of Amazon Redshift is its native integration with other AWS services, which makes it easy to build complete, comprehensive, and enterprise-level analytics applications.

As analytics solutions have moved away from the one-size-fits-all model to choosing the right tool for the right function, architectures have become more optimized and performant while simultaneously becoming more complex. You can use Amazon Redshift for a variety of use cases, along with other AWS services for ingesting, transforming, and visualizing the data.

Manually deploying these services is time-consuming. It also runs the risk of making human errors and deviating from best practices.

In this post, we discuss how to automate the process of building an integrated analytics solution by using a simple script.

Solution overview

The framework described in this post uses Infrastructure as Code (IaC) to solve the challenges with manual deployments, by using the AWS Cloud Development Kit (AWS CDK) to automate the provisioning of AWS analytics services. You can indicate the services and resources you want to incorporate in your infrastructure by editing a simple JSON configuration file.

The script then instantly auto-provisions all the required infrastructure components in a dynamic manner, while simultaneously integrating them according to AWS recommended best practices.

In this post, we go into further detail on the specific steps to build this solution.

Prerequisites

Prior to deploying the AWS CDK stack, complete the following prerequisite steps:

  1. Verify that you’re deploying this solution in a Region that supports AWS CloudShell. For more information, see AWS CloudShell endpoints and quotas.
  2. Have an AWS Identity and Access Management (IAM) user with the following permissions:
    1. AWSCloudShellFullAccess
    2. IAM Full Access
    3. AWSCloudFormationFullAccess
    4. AmazonSSMFullAccess
    5. AmazonRedshiftFullAccess
    6. AmazonS3ReadOnlyAccess
    7. SecretsManagerReadWrite
    8. AmazonEC2FullAccess
    9. Create a custom AWS Database Migration Service (AWS DMS) policy called AmazonDMSRoleCustom with the following permissions:
{
	"Version": "2012-10-17",
	"Statement": [
		{
		"Effect": "Allow",
		"Action": "dms:*",
		"Resource": "*"
		}
	]
}
  3. Optionally, create a key pair that you have access to. This is only required if deploying the AWS Schema Conversion Tool (AWS SCT).
  4. Optionally, if using resources outside your AWS account, open firewalls and security groups to allow traffic from AWS. This is only applicable for AWS DMS and AWS SCT deployments.

Prepare the config file

To launch the target infrastructures, download the user-config-template.json file from the GitHub repo.

To prep the config file, start by entering one of the following values for each key in the top section: CREATE, N/A, or an existing resource ID to indicate whether you want to have the component provisioned on your behalf, skipped, or integrated using an existing resource in your account.

For each of the services with the CREATE value, you then edit the appropriate section under it with the specific parameters to use for that service. When you’re done customizing the form, save it as user-config.json.

You can see an example of the completed config file under user-config-sample.json in the GitHub repo, which illustrates a config file for the following architecture by newly provisioning all the services, including Amazon Virtual Private Cloud (Amazon VPC), Amazon Redshift, an Amazon Elastic Compute Cloud (Amazon EC2) instance with AWS SCT, and AWS DMS instance connecting an external source SQL Server on Amazon EC2 to the Amazon Redshift cluster.
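
Before uploading the file in the next section, you can run a small local check like the sketch below. It is not part of the toolkit, only a hypothetical helper that summarizes what your top-level choices will do and flags missing values.

import json

# Hypothetical helper (not part of the toolkit): summarize user-config.json
# before you upload it to CloudShell
with open("user-config.json") as f:
    config = json.load(f)

for key, value in config.items():
    if not isinstance(value, str):
        continue  # skip the nested service-specific parameter sections
    if value == "CREATE":
        print(f"{key}: will be provisioned for you")
    elif value == "N/A":
        print(f"{key}: skipped")
    elif value.strip():
        print(f"{key}: reusing existing resource {value}")
    else:
        print(f"{key}: missing value, expected CREATE, N/A, or a resource ID")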

Launch the toolkit

This project uses CloudShell, a browser-based shell service, to programmatically initiate the deployment through the AWS Management Console. Prior to opening CloudShell, you need to configure an IAM user, as described in the prerequisites.

  1. On the CloudShell console, clone the Git repository:
    git clone https://github.com/aws-samples/amazon-redshift-infrastructure-automation.git

  2. Run the deployment script:
    ~/amazon-redshift-infrastructure-automation/scripts/deploy.sh

  3. On the Actions menu, choose Upload file and upload user-config.json.
  4. Enter a name for the stack.
  5. Depending on the resources being deployed, you may have to provide additional information, such as the password for an existing database or Amazon Redshift cluster.
  6. Press Enter to initiate the deployment.

Monitor the deployment

After you run the script, you can monitor the deployment of resource stacks through the CloudShell terminal, or through the AWS CloudFormation console, as shown in the following screenshot.

Each stack corresponds to the creation of a resource from the config file. You can see the newly created VPC, Amazon Redshift cluster, EC2 instance running AWS SCT, and AWS DMS instance. To test the success of the deployment, you can test the newly created AWS DMS endpoint connectivity to the source system and the target Amazon Redshift cluster. Select your endpoint and on the Actions menu, choose Test connection.

If both statuses say Success, the AWS DMS workflow is fully integrated.

Troubleshooting

If the stack launch stalls at any point, visit our GitHub repository for troubleshooting instructions.

Conclusion

In this post, we discussed how you can use the AWS Analytics Infrastructure Automation utility to quickly get started with Amazon Redshift and other AWS services. It helps you provision your entire solution on AWS instantly, without spending time on challenges around integrating the services or scaling your solution.


About the Authors

Manash Deb is a Software Development Engineer in the AWS Directory Service team. He has worked on building end-to-end applications with different database technologies for over 15 years. He loves learning new technologies and solving, automating, and simplifying customer problems on AWS.

Samir Kakli is an Analytics Specialist Solutions Architect at AWS. He has worked on building and tuning databases and data warehouse solutions for over 20 years. His focus is architecting end-to-end analytics solutions designed to meet the specific needs of each customer.

Julia Beck is a Specialist Solutions Architect at AWS. She supports customers building analytics proof of concept workloads. Outside of work, she enjoys traveling, cooking, and puzzles.

Accelerate large-scale data migration validation using PyDeequ

Post Syndicated from Mahendar Gajula original https://aws.amazon.com/blogs/big-data/accelerate-large-scale-data-migration-validation-using-pydeequ/

Many enterprises are migrating their on-premises data stores to the AWS Cloud. During data migration, a key requirement is to validate all the data that has been moved from on premises to the cloud. This data validation is a critical step and if not done correctly, may result in the failure of the entire project. However, developing custom solutions to determine migration accuracy by comparing the data between the source and target can often be time-consuming.

In this post, we walk through a step-by-step process to validate large datasets after migration using PyDeequ. PyDeequ is an open-source Python wrapper over Deequ (an open-source tool developed and used at Amazon). Deequ is written in Scala, whereas PyDeequ allows you to use its data quality and testing capabilities from Python and PySpark.

Prerequisites

Before getting started, make sure you have the following prerequisites:

Solution overview

This solution uses the following services:

  • Amazon RDS for MySQL as the database engine for the source database.
  • Amazon Simple Storage Service (Amazon S3) or Hadoop Distributed File System (HDFS) as the target.
  • Amazon EMR to run the PySpark script. We use PyDeequ to validate data between MySQL and the corresponding Parquet files present in the target.
  • AWS Glue to catalog the technical table, which stores the result of the PyDeequ job.
  • Amazon Athena to query the output table to verify the results.

We use the profiler, one of the metrics computation components of PyDeequ, to analyze each column in the given dataset and calculate statistics like completeness, approximate distinct values, and data types.

The following diagram illustrates the solution architecture.

In this example, you have four tables in your on-premises database that you want to migrate: tbl_books, tbl_sales, tbl_venue, and tbl_category.

Deploy the solution

To make it easy for you to get started, we created an AWS CloudFormation template that automatically configures and deploys the solution for you.

The CloudFormation stack performs the following actions:

  • Launches and configures Amazon RDS for MySQL as a source database
  • Launches Secrets Manager for storing the credentials for accessing the source database
  • Launches an EMR cluster, creates and loads the database and tables on the source database, and imports the open-source library for PyDeequ at the EMR primary node
  • Runs the Spark ingestion process from Amazon EMR, connecting to the source database, and extracts data to Amazon S3 in Parquet format

To deploy the solution, complete the following steps:

  1. Choose Launch Stack to launch the CloudFormation template.

The template is launched in the US East (N. Virginia) Region by default.

  2. On the Select Template page, keep the default URL for the CloudFormation template, then choose Next.
  3. On the Specify Details page, provide values for the parameters that require input (see the following screenshot).
  4. Choose Next.
  5. Choose Next again.
  6. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  7. Choose Create stack.

You can view the stack outputs on the AWS Management Console or by using the following AWS Command Line Interface (AWS CLI) command:

aws cloudformation describe-stacks --stack-name <stack-name> --region us-east-1 --query Stacks[0].Outputs

It takes approximately 20–30 minutes for the deployment to complete. When the stack is complete, you should see the resources in the following table launched and available in your account.

Resource Name                  Functionality
DQBlogBucket                   The S3 bucket that stores the migration accuracy results for the AWS Glue Data Catalog table
EMRCluster                     The EMR cluster to run the PyDeequ validation process
SecretRDSInstanceAttachment    Secrets Manager for securely accessing the source database
SourceDBRds                    The source database (Amazon RDS)

When the EMR cluster is launched, it runs the following steps as part of the post-cluster launch:

  • DQsetupStep – Installs the Deequ JAR file and MySQL connector. This step also installs Boto3 and the PyDeequ library. It also downloads sample data files to use in the next step.
  • SparkDBLoad – Runs the initial data load to the MySQL database or table. This step creates the test environment that we use for data validation purposes. When this step is complete, we have four tables with data on MySQL and respective data files in Parquet format on HDFS in Amazon EMR.

When the Amazon EMR step SparkDBLoad is complete, we verify the data records in the source tables. You can connect to the source database using your preferred SQL editor. For more details, see Connecting to a DB instance running the MySQL database engine.

The following screenshot is a preview of sample data from the source table MyDatabase.tbl_books.

Validate data with PyDeequ

Now the test environment is ready and we can perform data validation using PyDeequ.

  1. Use Secure Shell (SSH) to connect to the primary node.
  2. Run the following Spark command, which performs data validation and persists the results to an AWS Glue Data Catalog table (db_deequ.db_migration_validation_result):
spark-submit --jars deequ-1.0.5.jar pydeequ_validation.py 

The script and JAR file are already available on your primary node if you used the CloudFormation template. The PySpark script computes PyDeequ metrics on the source MySQL table data and target Parquet files in Amazon S3. The metrics currently calculated as part of this example are as follows:

  • Completeness to measure fraction of not null values in a column
  • Approximate number of distinct values
  • Data type of column

If required, we can compute more metrics for each column. To see the complete list of supported metrics, see the PyDeequ package on GitHub.

The output metrics from the source and target are then compared using a PySpark DataFrame.

When that step is complete, the PySpark script creates the AWS Glue table db_deequ.db_migration_validation_result in your account, and you can query this table from Athena to verify the migration accuracy.

Verify data validation results with Athena

You can use Athena to check the overall data validation summary of all the tables. The following query shows you the aggregated data output. It lists all the tables you validated using PyDeequ and how many columns match between the source and target.

select table_name,
max(src_count) as "source rows",
max(tgt_count) as "target rows",
count(*) as "total columns",
sum(case when status='Match' then 1 else 0 end) as "matching columns",
sum(case when status<>'Match' then 1 else 0 end) as "non-matching columns"
from  "db_deequ"."db_migration_validation_result"
group by table_name;

The following screenshot shows our results.

Because all your columns match, you can have high confidence that the data has been exported correctly.

You can also check the data validation report for any table. The following query gives detailed information about any specific table metrics captured as part of PyDeequ validation:

select col_name,src_datatype,tgt_datatype,src_completeness,tgt_completeness,src_approx_distinct_values,tgt_approx_distinct_values,status 
from "db_deequ"."db_migration_validation_result"
where table_name='tbl_sales'

The following screenshot shows the query results. The last column status is the validation result for the columns in the table.

Clean up

To avoid incurring additional charges, complete the following steps to clean up your resources when you’re done with the solution:

  1. Delete the AWS Glue database and table db_deequ.db_migration_validation_result.
  2. Delete the prefixes and objects you created from the bucket dqblogbucket-${AWS::AccountId}.
  3. Delete the CloudFormation stack, which removes your additional resources.

Customize the solution

The solution consists of two parts:

  • Data extraction from the source database
  • Data validation using PyDeequ

In this section, we discuss ways to customize the solution based on your needs.

Data extraction from the source database

Depending on your data volume, there are multiple ways of extracting data from on-premises database sources to AWS. One recommended service is AWS Database Migration Service (AWS DMS). You can also use AWS Glue, Spark on Amazon EMR, and other services.

In this post, we use PySpark to connect to the source database using a JDBC connection and extract the data into HDFS using an EMR cluster.

The primary reason is that we’re already using Amazon EMR for PyDeequ, and we can use the same EMR cluster for data extraction.

In the CloudFormation template, the Amazon EMR step SparkDBLoad runs the PySpark script blogstep3.py. This PySpark script uses Secrets Manager and a Spark JDBC connection to extract data from the source to the target.

Data validation using PyDeequ

In this post, we use ColumnProfilerRunner from the pydeequ.profiles package for metrics computation. The source data is from the database using a JDBC connection, and the target data is from data files in HDFS and Amazon S3.

To create a DataFrame with metrics information for the source data, use the following code:

from pyspark.sql import Row
from pyspark.sql.functions import split
from pydeequ.profiles import ColumnProfilerRunner

# Read the source table from MySQL over JDBC
df_readsrc = spark.read.format('jdbc').option('url', sqlurl).option('dbtable', tbl_name).option('user', user).option('password', pwd).load()

# Profile every column of the source table
result_rds = ColumnProfilerRunner(spark).onData(df_readsrc).run()

# Collect one comma-separated metrics string per column
a = []
for col, profile in result_rds.profiles.items():
    a.append(col + "," + str(profile.completeness) + "," + str(profile.approximateNumDistinctValues) + "," + str(profile.dataType))

# Convert the metrics strings into a DataFrame
rdd1 = spark.sparkContext.parallelize(a)
row_rdd = rdd1.map(lambda x: Row(x))
df = spark.createDataFrame(row_rdd, ['column'])

# Split each metrics string into separate columns
finalDataset = df.select(split(df.column, ",")).rdd.flatMap(lambda x: x).toDF(schema=['column', 'completeness', 'approx_distinct_values', 'inferred_datatype'])

Similarly, the metrics are computed for the target (the data file).

You can create a temporary view from the DataFrame to use in the next step for metrics comparison.
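
A minimal sketch of that step, assuming the target-side metrics were collected into a DataFrame named finalDataset_tgt built the same way:

# Register the source and target metric DataFrames as temporary views so
# they can be joined with Spark SQL in the next step. finalDataset_tgt is
# an assumed name for the target-side DataFrame.
finalDataset.createOrReplaceTempView("vw_Source")
finalDataset_tgt.createOrReplaceTempView("vw_Target")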

After we have both the source (vw_Source) and target (vw_Target) available, we use the following query in Spark to generate the output result:

df_result = spark.sql("""
    select t1.table_name as table_name,
           t1.column as col_name,
           t1.inferred_datatype as src_datatype,
           t2.inferred_datatype as tgt_datatype,
           t1.completeness as src_completeness,
           t2.completeness as tgt_completeness,
           t1.approx_distinct_values as src_approx_distinct_values,
           t2.approx_distinct_values as tgt_approx_distinct_values,
           t1.count as src_count,
           t2.count as tgt_count,
           case when t1.inferred_datatype = t2.inferred_datatype
                 and t1.completeness = t2.completeness
                 and t1.approx_distinct_values = t2.approx_distinct_values
                 and t1.count = t2.count
                then 'Match' else 'No Match' end as status
    from vw_Source t1
    left outer join vw_Target t2 on t1.column = t2.column
""")

The generated result is stored in the db_deequ.db_migration_validation_result table in the Data Catalog.

If you used the CloudFormation template, the entire PyDeequ code used in this post is available at the path /home/hadoop/pydeequ_validation.py in the EMR cluster.

You can modify the script to include or exclude tables as per your requirements.

Conclusion

This post showed you how you can use PyDeequ to accelerate the post-migration data validation process. PyDeequ helps you calculate metrics at the column level. You can also use more PyDeequ components like constraint verification to build a custom data validation framework.

For more use cases on Deequ, check out the following:


About the Authors

Mahendar Gajula is a Sr. Data Architect at AWS. He works with AWS customers in their journey to the cloud with a focus on Big data, Data Lakes, Data warehouse and AI/ML projects. In his spare time, he enjoys playing tennis and spending time with his family.

Nitin Srivastava is a Data & Analytics consultant at Amazon Web Services. He has more than a decade of data warehouse experience along with designing and implementing large-scale big data and analytics solutions. He works with customers to deliver the next generation big data analytics platform using AWS technologies.

Amazon RDS Custom for Oracle – New Control Capabilities in Database Environment

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-rds-custom-for-oracle-new-control-capabilities-in-database-environment/

Managing databases in self-managed environments such as on premises or Amazon Elastic Compute Cloud (Amazon EC2) requires customers to spend time and resources doing database administration tasks such as provisioning, scaling, patching, backups, and configuring for high availability. So, hundreds of thousands of AWS customers use Amazon Relational Database Service (Amazon RDS) because it automates these undifferentiated administration tasks.

However, there are some legacy and packaged applications that require customers to make specialized customizations to the underlying database and the operating system (OS), such as Oracle industry specialized applications for healthcare and life sciences, telecom, retail, banking, and hospitality. Customers with these specific customization requirements cannot get the benefits of a fully managed database service like Amazon RDS, and they end up deploying their databases on premises or on EC2 instances.

Today, I am happy to announce the general availability of Amazon RDS Custom for Oracle, new capabilities that enable database administrators to access and customize the database environment and operating system. With RDS Custom for Oracle, you can now access and customize your database server host and operating system, for example by applying special patches and changing the database software settings to support third-party applications that require privileged access.

You can easily move your existing self-managed database for these applications to Amazon RDS and automate time-consuming database management tasks, such as software installation, patching, and backups. Here is a simple comparison of features and responsibilities between Amazon EC2, RDS Custom for Oracle, and RDS.

Features and Responsibilities    Amazon EC2    RDS Custom for Oracle    Amazon RDS
Application optimization         Customer      Customer                 Customer
Scaling/high availability        Customer      Shared                   AWS
DB backups                       Customer      Shared                   AWS
DB software maintenance          Customer      Shared                   AWS
OS maintenance                   Customer      Shared                   AWS
Server maintenance               AWS           AWS                      AWS

The shared responsibility model of RDS Custom for Oracle gives you more control than in RDS, but also more responsibility, similar to EC2. So, if you need deep control of your database environment where you take responsibility for changes that you make and want to offload common administration tasks to AWS, RDS Custom for Oracle is the recommended deployment option over self-managing databases on EC2.

Getting Started with Amazon RDS Custom for Oracle
To get started with RDS Custom for Oracle, you upload the database installation files for supported Oracle database versions to Amazon Simple Storage Service (Amazon S3) and use them to create a custom engine version (CEV). This launch includes Oracle Enterprise Edition, allowing Oracle customers to use their own licensed software with bring your own license (BYOL).

Then, with just a few clicks in the AWS Management Console, you can deploy an Oracle database instance in minutes. After that, you can connect to it using SSH or AWS Systems Manager.

Before creating and connecting your DB instance, make sure that you meet some prerequisites such as configuring the AWS Identity and Access Management (IAM) role and Amazon Virtual Private Cloud (VPC) using the pre-created AWS CloudFormation template in the Amazon RDS User Guide.

A symmetric AWS Key Management Service (KMS) key is required for RDS Custom for Oracle. If you don’t have an existing symmetric KMS key in your account, create a KMS key by following the instructions in Creating keys in the AWS KMS Developer Guide.

The Oracle Database installation files and patches are hosted on Oracle Software Delivery Cloud. If you want to create a CEV, search and download your preferred version under the Linux x86/64 platform and upload it to Amazon S3.

$ aws s3 cp install-or-patch-file.zip s3://my-oracle-db-files

To create CEV for creating a DB instance, you need a CEV manifest, a JSON document that describes installation .zip files stored in Amazon S3. RDS Custom for Oracle will apply the patches in the order in which they are listed when creating the instance by using this CEV.

{
    "mediaImportTemplateVersion": "2020-08-14",
    "databaseInstallationFileNames": [
        "V982063-01.zip"
    ],
    "opatchFileNames": [
        "p6880880_190000_Linux-x86-64.zip"
    ],
    "psuRuPatchFileNames": [
        "p32126828_190000_Linux-x86-64.zip"
    ],
    "otherPatchFileNames": [
        "p29213893_1910000DBRU_Generic.zip",
        "p29782284_1910000DBRU_Generic.zip",
        "p28730253_190000_Linux-x86-64.zip",
        "p29374604_1910000DBRU_Linux-x86-64.zip",
        "p28852325_190000_Linux-x86-64.zip",
        "p29997937_190000_Linux-x86-64.zip",
        "p31335037_190000_Linux-x86-64.zip",
        "p31335142_190000_Generic.zip"
    ]
}

To create a CEV in the AWS Management Console, choose Create custom engine version in the Custom engine version menu.

You can set Engine type to Oracle, choose your preferred database edition and version, specify the S3 location that holds your installation files, and enter the CEV manifest. Then, choose Create custom engine version. Creation takes approximately two hours.
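
If you prefer to script the CEV creation, a boto3 sketch might look like the following. The bucket, KMS key, version label, and local manifest file name are placeholders, and the Oracle Enterprise Edition engine identifier shown is an assumption for this example.

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# The manifest is the same JSON document shown above, passed as a string
with open("cev-manifest.json") as f:  # hypothetical local copy of the manifest
    manifest = f.read()

response = rds.create_custom_db_engine_version(
    Engine="custom-oracle-ee",  # assumed engine name for RDS Custom for Oracle EE
    EngineVersion="19.my_cev1",
    DatabaseInstallationFilesS3BucketName="my-oracle-db-files",
    KMSKeyId="my-kms-key",
    Description="Oracle 19c CEV for RDS Custom",
    Manifest=manifest,
)
print(response.get("Status"))  # CEV status while it is being created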

To create your DB instance with the prepared CEV, choose Create database in the Databases menu. When you choose a database creation method, select Standard create. You can set Engine options to Oracle and choose Amazon RDS Custom in the database management type.

In Settings, enter a unique name for the DB instance identifier and your master username and password. By default, the new instance uses an automatically generated password for the master user. To learn more about the remaining settings, see Settings for DB instances in the Amazon RDS User Guide. Choose Create database.

Alternatively, you can create your DB instance by running the create-db-instance command in the AWS Command Line Interface (AWS CLI).

$ aws rds create-db-instance \
      --engine my-oracle-ee \
      --db-instance-identifier my-oracle-instance \ 
      --engine-version 19.my_cev1 \ 
      --allocated-storage 250 \ 
      --db-instance-class db.m5.xlarge \ 
      --db-subnet-group mydbsubnetgroup \ 
      --master-username masterawsuser \ 
      --master-user-password masteruserpassword \ 
      --backup-retention-period 3 \ 
      --no-multi-az \ 
      --port 8200 \
      --license-model bring-your-own-license \
      --kms-key-id my-kms-key

After you create your DB instance, you can connect to this instance using an SSH client. The procedure is the same as for connecting to an Amazon EC2 instance. To connect to the DB instance, you need the key pair associated with the instance. RDS Custom for Oracle creates the key pair on your behalf. The pair name uses the prefix do-not-delete-ssh-privatekey-db-. AWS Secrets Manager stores your private key as a secret.

For more information, see Connecting to your Linux instance using SSH in the Amazon EC2 User Guide.

You can also connect to it using AWS Systems Manager Session Manager, a capability that lets you manage EC2 instances through a browser-based shell. To learn more, see Connecting to your RDS Custom DB instance using SSH and AWS Systems Manager in the Amazon RDS User Guide.

Things to Know
Here are a couple of things to keep in mind about managing your DB instance:

High Availability (HA): To configure replication between DB instances in different Availability Zones to be resilient to Availability Zone failures, you can create read replicas for RDS Custom for Oracle DB instances. Read replica creation is similar to Amazon RDS, but with some differences. Not all options are supported when creating RDS Custom read replicas. To learn how to configure HA, see Working with RDS Custom for Oracle read replicas in the AWS Documentation.

Backup and Recovery: Like Amazon RDS, RDS Custom for Oracle creates and saves automated backups during the backup window of your DB instance. You can also back up your DB instance manually. The procedure is identical to taking a snapshot of an Amazon RDS DB instance. The first snapshot contains the data for the full DB instance just like in Amazon RDS. RDS Custom also includes a snapshot of the OS image, and the EBS volume that contains the database software. Subsequent snapshots are incremental. With backup retention enabled, RDS Custom also uploads transaction logs into an S3 bucket in your account to be used with the RDS point-in-time recovery feature. Restore DB snapshots, or restore DB instances to a specific point in time using either the AWS Management Console or the AWS CLI. To learn more, see Backing up and restoring an Amazon RDS Custom for Oracle DB instance in the Amazon RDS User Guide.

Monitoring and Logging: RDS Custom for Oracle provides a monitoring service called the support perimeter. This service ensures that your DB instance uses a supported AWS infrastructure, operating system, and database. Also, all changes and customizations to the underlying operating system are automatically logged for audit purposes using Systems Manager and AWS CloudTrail. To learn more, see Troubleshooting an Amazon RDS Custom DB instance in the Amazon RDS User Guide.

Now Available
Amazon RDS Custom for Oracle is now available in US East (N. Virginia), US East (Ohio), US West (Oregon), EU (Frankfurt), EU (Ireland), EU (Stockholm), Asia Pacific (Singapore), Asia Pacific (Sydney), and Asia Pacific (Tokyo) regions.

To learn more, take a look at the product page and documentations of Amazon RDS Custom for Oracle. Please send us feedback either in the AWS forum for Amazon RDS or through your usual AWS support contacts.

Channy
