Choosing between AWS Lambda data storage options in web apps

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/choosing-between-aws-lambda-data-storage-options-in-web-apps/

AWS Lambda is an on-demand compute service that powers many serverless applications. Lambda functions are ephemeral, with execution environments only existing for a brief time when the function is invoked. Many compute operations need access to external data for a variety of purposes. This includes importing third-party libraries, accessing machine learning models, or exporting the output of the compute operation.

Lambda provides a comprehensive range of storage options to meet the needs of web application developers. These include other AWS services such as Amazon S3 and Amazon EFS. There are also native storage options available, such as temporary storage or Lambda layers. In this blog post, I explain the differences between these options, and discuss common use-cases to help you choose for your own applications.

This post references the Happy Path web application series, and you can download the code for that application from the repository.

Amazon S3 – Object storage

Amazon S3 is an object storage service that scales elastically. It offers high availability and 11 9’s of durability. The service is ideal for storing unstructured data. This includes binary data such as images or media, log files, and sensor data.

Sample contents from an S3 bucket.

There are certain characteristics of S3 object storage that are important to remember. While S3 objects can be versioned, you cannot append data as you could in a file system. You have to store an entirely new version of an object. S3 also has a flat storage hierarchy that’s different to a file system. Instead of directories, you use folders to logically organize objects, by prefixing ‘foldername/’ in the key name.

S3 has important event integrations for serverless developers. It has a native integration with Lambda, which allows you to invoke a function in response to an S3 event. This can provide a scalable way to trigger application workflows when objects are created or deleted in S3. In the Happy Path application, the image-processing workflows are initiated by this event integration. To learn more about using S3 to trigger automated serverless workflows, visit the learning path.
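As an illustration (not the actual Happy Path code), a minimal Python handler for such an S3 event might look like the following sketch; the event structure shown is what S3 delivers to Lambda:

import urllib.parse

def handler(event, context):
    # S3 delivers one or more records per invocation, each describing an object event.
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        print(f"Object created: s3://{bucket}/{key}")
        # Start the application workflow for this object here.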

S3 is often an important repository for an organization’s data lake. If your application writes data to S3 buckets, this can be a useful staging area for downstream processing. For analytics workloads, you can use AWS Glue to perform extract, transform, and load (ETL) operations. To create ad hoc visualizations and business analysis reports, Amazon QuickSight can connect to your S3 buckets and produce interactive dashboards. To learn how to build business intelligence dashboards for your web application, visit the Innovator Island workshop.

S3 also provides object lifecycle management. This allows you to automatically change storage classes when certain conditions are met. For example, an application for uploading expenses could automatically archive PDFs after 1 year to Amazon S3 Glacier to reduce storage costs. In the Happy Path application, the original high-resolution uploads are stored in a separate bucket from the optimized distribution assets. To reduce storage costs, lifecycle management could be configured to automatically delete these original photo assets after 30 days.
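As a sketch of how such a rule could be configured with boto3 (the bucket name is a placeholder, not the Happy Path bucket), a rule that deletes objects 30 days after creation might look like this:

import boto3

s3 = boto3.client('s3')

# Placeholder bucket name; the rule expires (deletes) all objects 30 days after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket='original-uploads-bucket',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'expire-original-uploads',
                'Filter': {'Prefix': ''},
                'Status': 'Enabled',
                'Expiration': {'Days': 30},
            }
        ]
    },
)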

Temporary storage with /tmp

The Lambda execution environment provides a file system for your code to use at /tmp. This space has a fixed size of 512 MB. The same Lambda execution environment may be reused by multiple Lambda invocations to optimize performance. The /tmp area is preserved for the lifetime of the execution environment and provides a transient cache for data between invocations. Each time a new execution environment is created, this area is deleted.

Consequently, this is intended as an ephemeral storage area. While functions may cache data here between invocations, it should be used only for data needed by code in a single invocation. It’s not a place to store data permanently, and is better-used to support operations required by your code.

Operationally, working with files in /tmp is the same as working with a local hard disk, and offers fast I/O throughput. For example, to unzip a file into this space in Python, use:

import os, zipfile

# 'myzipfile' is assumed to hold the path to an archive already present in /tmp,
# for example one downloaded from S3 earlier in the invocation.
os.chdir('/tmp')
with zipfile.ZipFile(myzipfile, 'r') as zip:
    zip.extractall()
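To take advantage of the transient cache described above, a function can also check /tmp before fetching a large asset again. Here is a minimal sketch, assuming boto3 and placeholder bucket and key names:

import os
import boto3

s3 = boto3.client('s3')
MODEL_PATH = '/tmp/model.bin'  # cached across invocations of a warm execution environment

def handler(event, context):
    # Download only if a previous invocation has not already cached the file.
    if not os.path.exists(MODEL_PATH):
        s3.download_file('my-assets-bucket', 'models/model.bin', MODEL_PATH)
    with open(MODEL_PATH, 'rb') as f:
        model_bytes = f.read()
    return {'modelSize': len(model_bytes)}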

Lambda layers

Your Lambda functions may use additional libraries as part of the deployment package. You can bundle these in the deployment archive or optionally move to a layer instead. A Lambda function can have up to five layers, and is subject to the maximum deployment size of 50 MB (zipped). Packages in layers are available in the /opt directory during invocations. While layers are private to you by default, you can also share layers with other AWS accounts, or make layers public.

Lambda layers in the console

There are many benefits to using layers throughout the functions in your serverless application. It’s best practice to include the AWS SDK instead of depending on the version bundled with the Lambda service. This enables you to pin the version of the SDK. By using a layer, you don’t need to bundle the package with each function, which can increase your deployment package size and slow down deployments. You can create an AWS SDK layer and then include a reference to the layer in each function.
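As an illustration of publishing such a layer with boto3, here is a sketch with placeholder names; it assumes the zip archive contains the SDK under a python/ directory so it is available under /opt/python at runtime:

import boto3

lambda_client = boto3.client('lambda')

# Placeholder layer name, bucket, and key.
response = lambda_client.publish_layer_version(
    LayerName='aws-sdk-pinned',
    Description='Pinned version of the AWS SDK for Python',
    Content={'S3Bucket': 'my-layer-artifacts', 'S3Key': 'layers/aws-sdk-pinned.zip'},
    CompatibleRuntimes=['python3.8'],
)
print(response['LayerVersionArn'])  # reference this ARN in each function's configuration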

Layers can be an effective way to bundle large dependencies, or share compiled libraries with binaries that vary by operating system. For example, the Happy Path application uses the Sharp npm graphics library to process images. Similarly, the Innovator Island workshop uses the OpenCV library to perform image manipulation, and this is imported using a shared layer.

Layers are static once they are deployed. You can only change the contents of a layer by deploying a new version. Any Lambda function using the layer binds to a specific version and must be updated to change layer versions. To learn more, see using Lambda layers to simplify your development process.

Amazon EFS for Lambda

Amazon EFS is a fully managed, elastic, shared file system that integrates with other AWS services. It is a durable storage option that offers high availability. You can now mount EFS file systems in Lambda functions, which makes it simpler to share data across invocations. The file system grows and shrinks as you add or delete data, so you do not need to manage storage limits.

EFS file system in the console.

The Lambda service mounts EFS file systems when the execution environment is prepared. This happens in parallel with other initialization operations, so it typically does not impact cold start latency. If the execution environment is warm from previous invocations, the mount is already prepared. To use EFS, your Lambda function must be in the same VPC as the file system.

EFS enables new capabilities for serverless applications. The file system is a dynamic binding for Lambda functions, unlike layers. This makes it useful for deploying code libraries where you want to always use the latest version. You configure the mount path when integrating the file system with your function, and then include packages from this location. Additionally, you can use this to include packages that exceed the limits of layers.
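A minimal sketch of this pattern follows, assuming a hypothetical local mount path of /mnt/lib and that packages were installed onto the file system beforehand (for example with pip install --target from an EC2 instance or another function):

import sys

# Hypothetical mount path; it must match the local mount path configured
# on the function's file system integration.
EFS_PACKAGES = '/mnt/lib'
sys.path.insert(0, EFS_PACKAGES)

import my_large_dependency  # assumed to be installed on the file system

def handler(event, context):
    return my_large_dependency.process(event)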

Due to its speed and support of standard file operations, EFS is also useful for durably ingesting or writing large numbers of files. This can be helpful for zipping or unzipping large archives, for example. For appending to existing files, EFS is also preferable to S3.

To learn more, see using Amazon EFS for AWS Lambda in your serverless applications.

Comparing the different data storage options

This table compares the characteristics of these four different data storage options for Lambda:

| | Amazon S3 | /tmp | Lambda layers | Amazon EFS |
|---|---|---|---|---|
| Maximum size | Elastic | 512 MB | 50 MB | Elastic |
| Persistence | Durable | Ephemeral | Durable | Durable |
| Content | Dynamic | Dynamic | Static | Dynamic |
| Storage type | Object | File system | Archive | File system |
| Lambda event source integration | Native | N/A | N/A | N/A |
| Operations supported | Atomic with versioning | Any file system operation | Immutable | Any file system operation |
| Object tagging | Y | N | N | N |
| Object metadata | Y | N | N | N |
| Pricing model | Storage + requests + data transfer | Included in Lambda | Included in Lambda | Storage + data transfer + throughput |
| Sharing/permissions model | IAM | Function-only | IAM | IAM + NFS |
| Source for AWS Glue | Y | N | N | N |
| Source for Amazon QuickSight | Y | N | N | N |
| Relative data access speed from Lambda | Fast | Fastest | Fastest | Very fast |

Conclusion

Lambda is a flexible, on-demand compute service for serverless applications. It supports a wide variety of workloads by providing a number of different data storage options.

In this post, I compare the capabilities and use-cases of S3, EFS, Lambda layers, and temporary storage for Lambda functions. There are benefits to each approach, as each type has different behaviors and characteristics. For web application developers, these storage types support different operations depending upon the needs of your serverless backend.

As the newest integration with Lambda, EFS now enables new workloads and capabilities. This includes sharing large code packages with Lambda, or durably operating on large numbers of files. It also opens up new possibilities for developers working on deep learning inference models.

To learn more about storage options available, visit the AWS Serverless homepage. For more serverless learning resources, visit https://serverlessland.com.

Agile website delivery with Hugo and AWS Amplify

Post Syndicated from Nigel Harris original https://aws.amazon.com/blogs/devops/agile-website-delivery-with-hugo-and-aws-amplify/

In this post, we show how you can rapidly configure and deploy a website using Hugo (a static website generator), an AWS Cloud9 integrated development environment (IDE) for content editing, AWS CodeCommit for source code control, and AWS Amplify to implement a source code-controlled, automated deployment process.

When hosting a website on AWS, you can choose from several options. One popular option is to use Amazon Simple Storage Service (Amazon S3) to host a static website. If you prefer full access to the infrastructure hosting your website, you can use the NGINX Quick Start to quickly deploy web server infrastructure using AWS CloudFormation.

Static website generators such as Hugo and MkDocs accelerate the website content generation process, and can be a valuable tool when trying to rapidly deliver technical documentation or similar content. Without such a generator, content creation typically requires hand-coding HTML and CSS.

Hugo is written in Go and available under the Apache 2.0 license. It provides several themes (collections of layouts) that accelerate website creation by drastically reducing the need to focus on format. You can author content in Markdown and output in multiple languages and formats (including ebook formats). Excellent examples of public websites built using Hugo include Digital.gov and Kubernetes.io.

 

Solution overview

This solution illustrates how to provision a hosted, source code-controlled Hugo generated website using CodeCommit and Amplify Console. The provisioned website is configured with a custom subdomain and an SSL certificate. We use an AWS Cloud9 IDE to enable content creation in the cloud.

 

Setting up an AWS Cloud9 IDE

Start by provisioning an AWS Cloud9 IDE. AWS Cloud9 environments run using Amazon Elastic Compute Cloud (Amazon EC2). You need to provision your AWS Cloud9 environment into an existing public subnet in an Amazon Virtual Private Cloud (Amazon VPC) within your AWS account. You can complete this in the following steps:

1. Sign in to your AWS account with an identity that has administrative privileges. If you don’t have an AWS account, you can create one.

2. Create a new AWS Cloud9 environment using the wizard on the AWS Cloud9 console.

3. Enter a name for your environment and an optional description.

4. Choose Next step.

Naming your AWS Cloud9 environment

5. In the Environment settings section, for Environment type, select Create a new EC2 instance for environment (direct access).

6. For Instance type, select your preferred instance type (the default, t2.micro, works for this use case).

7. Under Network settings, for Network (VPC), choose a VPC that you wish to deploy your AWS Cloud9 instance into. You may wish to use your default VPC, which is suitable for the purpose of this tutorial.

8. Choose a public subnet from this VPC for deployment.

Cloud9 Settings

9. Leave all other settings unchanged and choose Next step.
10. Review your choices and choose Create environment.

Environment creation takes a few minutes to complete. When the environment is ready, you receive access to the AWS Cloud9 IDE in your browser. We return to it shortly to develop content for your Hugo website.

Your Cloud9 Desktop

Configuring a source code repository to track content changes

Static website generators enable rapid changes to website content and layout. Source control management (SCM) systems provide a revision history for your code, and allow you to revert to previous versions of a project when unintended changes are introduced. SCM systems become increasingly important as the velocity of change and the number of team members introducing change increases.

You now create a source code repository to track changes to your content. You use CodeCommit, a fully-managed source control service that hosts secure Git-based repositories.

1. In a new browser, sign in to the CodeCommit Console and create a new repository.

2. For Repository name, enter amplify-website.

3. For Description, enter an appropriate description.

4. Choose Create.

Create repository

Repository creation takes just a few moments.

5. In the Connection steps section, choose the appropriate method to connect to your repository based on how you accessed your AWS account.

For this post, I signed in to my AWS account using federated access, so I choose the HTTPS Git Remote CodeCommit (HTTPS-GRC) tab. This is the recommended connection method for this sign-in type. You can also configure a connection to your repository using SSH or Git credentials over HTTPS. SSH and Git credentials over HTTPS are appropriate methods if you have signed in to your AWS account as an AWS Identity and Access Management (IAM) user. The AWS CodeCommit console provides additional information regarding each of these connection types, including links to supporting documentation.

Connect to Repo

 

Configuring and deploying an example website

You’re now ready to configure and deploy your website.

1. Return to the browser with your AWS Cloud9 IDE and place your cursor in the lower terminal pane of the IDE.

The terminal pane provides Bash shell access on the EC2 instance running AWS Cloud9.

You now create a Hugo website. The website design is based on Hugo-theme-learn. Themes are collections of Hugo layouts that take all the hassle out of building your website. Learn is a multilingual-ready theme authored by Mathieu Cornic, designed for building technical documentation websites.

Hugo provides a variety of themes on their website. Many of the themes include bundled example website content that you can easily adapt by following the accompanying theme documentation.

2. Enter the following code to download an existing example website stored as a .zip file, extract it, and commit the contents into CodeCommit from your AWS Cloud9 IDE:

cd ~/environment
aws s3 cp s3://ee-assets-prod-us-east-1/modules/3c5ba9cb6ff44465b96993d210f67147/v1/example-website.zip ~/environment/example-website.zip
unzip example-website.zip
rm example-website.zip

The following screenshot shows your output.

example website copy commands

 

Next, we run commands to create a directory to host the website and initialize a Git repository in it. We create a new default branch called main (formerly referred to as the master branch), local to our AWS Cloud9 instance, and copy files into place from the example website. After adding and committing them locally, we push all our changes to the remote AWS CodeCommit repository.

3. Enter the following code:

mkdir ~/environment/amplify-website/
cd ~/environment/amplify-website/
git init
git remote add origin codecommit::us-east-1://amplify-website
git remote -v
git checkout -b main
cp -rp ~/environment/example-website/* ~/environment/amplify-website/
git add *
git commit -am "first commit"
git push -u origin main

Deployment and hosting are achieved by using Amplify Console, a static web hosting service that accelerates your application release cycle by providing a simple CI/CD workflow for building and deploying static web applications.

4. On the Amplify console, under Deploy, choose Get Started.

Amplify banner

5. On the Get started with the Amplify Console page, select AWS CodeCommit as your source code repository.

6. Choose Continue.

Amplify get started page

7. On the Add repository branch page, for Recently updated repositories, choose your repository.

8. For Branch, choose main.

9. Choose Next.

add branch

On the Configure build settings page, Amplify automatically detects the amplify.yml file at the root of your website directory structure and uses it as the build settings for your deployment. You committed this file to your source code repository in the previous step.

10. Choose Next.

Amplify configure build settings

11. On the review page, choose Save and deploy.

Amplify builds and deploys your website within minutes and shows you its progress. When deployment is complete, you can access the website to see the sample content.

amplify website

The following screenshot shows your example website.

sample website

 

Promoting changes to the website

We can now update the line of text in the home page and commit and publish this change.

1. Return to the browser with your AWS Cloud9 IDE and place your cursor in the lower terminal pane of the IDE.

2. On the navigation pane, choose the file ~/environment/amplify-website/workshop/content/_index.en.md.

The contents of the file open under a new tab in the upper pane.

3. Change the string First Line of Text to First Update to Website.

content change

4. From the File menu, choose Save to save the changes you have made to the _index.en.md file.

save content changes

5. Commit the changes and push to CodeCommit by running the following command in the lower terminal pane in AWS Cloud9:

git add *; git commit -am "homepage update"; git push origin main

The output in your AWS Cloud9 terminal should appear similar to the following screenshot.

commit output

6. Return to the Amplify Console and observe how the committed change in CodeCommit is automatically detected. Amplify runs deployment steps to push your changes to the website.

amplify deploy changes

7. Access the URL of your website after this update is complete to verify that the first line of text on your home page has changed.

updated website

You can repeat this process to make source-code controlled, automated changes to your website.

Adding a custom domain

Adding a custom domain to your Amplify configuration makes it easier for clients to access your content. You can register new domains using Amazon Route 53 or, if you have an existing domain registered outside of AWS, you can integrate it with Route 53 and Amplify. For our use case, the domain hugoonamplify.com is registered with a third-party registrar (Namecheap). You can manage DNS configurations for domains registered outside of AWS using Route 53.

Start by configuring a public hosted zone in Route 53.

1. On the Route 53 console, choose Hosted zones.

2. Choose Create hosted zone.

hosted zones

3. For Domain name, enter hugoonamplify.com.

4. For Description, enter an appropriate description.

5. For Type, select Public hosted zone.

hosted zones configuration

6. Choose Create hosted zone.

7. Save the addresses of the name servers that respond to client DNS lookup requests for the custom domain.

create hosted zone

8. In a separate browser, access the console of your DNS registrar.

9. In the console of the third-party domain name registrar, configure the custom DNS name servers setting.

This configuration specifies the Route 53-assigned name servers as the authoritative DNS servers for the custom domain. Propagation of this change may take up to 48 hours.

namecheap console

10. Use https://who.is to verify that the AWS name servers are listed correctly for your custom domain to internet clients.

whois lookup

You can now set up your custom domain in Amplify. Amplify helps you configure DNS and set up SSL for your desired custom domain.

domain management

11. On the Amplify Console, under App settings, choose Domain management.

12. Choose Add domain.

13. For Domain, enter your custom domain name (hugoonamplify.com).

14. Choose Configure domain.

15. For Subdomain, set up only www and choose to exclude the root of the custom domain.

16. Choose Save.

Amplify begins the process of creating the SSL certificates. Amplify sends a notification that it’s issuing an SSL certificate to secure traffic to the custom domain.

ssl domain management

After a few moments, it proceeds to SSL configuration and indicates that verification of domain ownership is in progress.

ssl domain management configuration

Amplify verifies domain ownership by creating a sample CNAME record in your hosted zone file. When ownership is verified, the domain is propagated onto an Amazon CloudFront distribution managed by the Amplify service, and domain activation is complete.

ssl domain management configured

Clients can now access the website using the custom domain name www.hugoonamplify.com.

access website via custom domain

 

Establishing a subdomain for development

You can create a development website in Amplify that is aligned to a development code branch in CodeCommit, which enables you to test changes prior to production release.

1. Access the AWS Cloud9 IDE and use the terminal to enter the following commands, which create a development branch based on the current content of the main branch and push it to CodeCommit:

git checkout -b development
git branch
git remote -v
git add *; git commit -am "first development commit";
git push -u origin development

2. Open and edit the file ~/environment/amplify-website/workshop/content/_index.en.md and change the string Update to Website to something else.

Alternatively, run the following Unix sed command from the terminal in AWS Cloud9 to make that content change:

sed -i 's/Update to Website/Update to Development/g' ~/environment/amplify-website/workshop/content/_index.en.md

3. Commit and push your change with the following code:

git add *; git commit -am "second development commit"; git push -u origin development

You now configure a subdomain in Amplify to allow developers to review changes.

4. Return to the amplify-website app.

5. Choose Connect branch.

connect branch

6. For Branch, choose the development branch you created and committed code into.

7. Choose Next.

add development branch

Amplify builds a second website based on the contents of the development branch. You can see the instance of your website matched to the development code branch on Amplify Console.

amplify two branches

8. Access the domain management menu item in your Amplify application to add a friendly subdomain.

9. Edit the domain and add a subdomain item with a name of your choice (for example, dev).

10. Associate it to the development branch containing the committed code and content changes.

11. Choose Add.

add dev domain

You can access the subdomain to verify the changes.

verify domain

Controlling access to development

You may wish to restrict access to new content as it’s deployed into the development website.

1. On Amplify Console, choose your application.

2. Choose Access control.

3. Under Access control settings, choose your preferred settings.

You have the option to restrict access globally or on a branch-by-branch basis. For this use case, we set up simple password protection for a user named developer on the development branch and site.

access control settings

 

Cleaning up

Unless you plan to keep the website you have constructed, you can quickly clean up provisioned assets and avoid any unnecessary costs.

1. On Amplify Console, select the app you created.

2. From the Actions drop-down menu, choose Delete app.

3. In the pop-up window, confirm the deletion.

4. On the CodeCommit dashboard, select the repository you created.

5. Choose Delete.

6. In the pop-up window, confirm the deletion.

7. On the AWS Cloud9 dashboard, select the IDE you created.

8. Choose Delete.

9. In the pop-up window, confirm the deletion.

 

Conclusion

Hugo is a powerful tool that enables accelerated delivery of content in a variety of formats including image portfolios, online resume presentation, blogging, and technical documentation. Amplify Console provides a convenient, easy-to-use, static web hosting service that can greatly accelerate delivery of static content.

When you combine Hugo with Amplify Console, you can rapidly deploy websites in minutes with features such as friendly URLs, environments matched to code branches, and encryption (SSL). Visit gohugo.io to find out more about Hugo. For more information about how Amplify Console can help you rapidly deploy Hugo and other modern web applications, see the AWS Amplify Console User Guide.

Nigel Harris

Nigel Harris is an Enterprise Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on AWS architectures.

Talk to your Raspberry Pi | HackSpace 36

Post Syndicated from Andrew Gregory original https://www.raspberrypi.org/blog/talk-to-your-raspberry-pi-hackspace-36/

In the latest issue of HackSpace Magazine, out now, @MrPJEvans shows you how to add voice commands to your projects with a Raspberry Pi 4 and a microphone.

You’ll need: a Raspberry Pi 4, a ReSpeaker microphone array, a speaker, and a Google account.

It’s amazing how we’ve come from everything being keyboard-based to so much voice control in our lives. Siri, Alexa, and Cortana are everywhere and happy to answer questions, play you music, or help automate your household.

For the keen maker, these offerings may not be ideal for augmenting their latest project, as they are closed systems. The good news is, with a bit of help from Google, you can add voice recognition to your project and have complete control over what happens. You just need a Raspberry Pi 4, a microphone array, and a Google account to get started.

Set up your microphone

This clever speaker uses four microphones working together to increase accuracy. A ring of twelve RGB LEDs can be coded to react to events, just like an Amazon Echo

For a home assistant device, being able to hear you clearly is essential. Many microphones are either too low-quality for the task, or are unidirectional: they only hear well in one direction. To the rescue comes Seeed’s ReSpeaker, an array of four microphones with some clever digital processing to provide the kind of listening capability normally found on an Amazon Echo device or Google Assistant. It’s also in a convenient HAT form factor, and comes with a ring of twelve RGB LEDs, so you can add visual effects too. Start with a Raspberry Pi OS Lite installation, and follow these instructions to get your ReSpeaker ready for use.

Install Snowboy

You’ll see later on that we can add the power of Google’s speech-to-text API by streaming audio over the internet. However, we don’t want to be doing that all the time. Snowboy is an offline ‘hotword’ detector. We can have Snowboy running all the time, and when your choice of word is ‘heard’, we switch to Google’s system for accurate processing. Snowboy can only handle a few words, so we only use it for the ‘trigger’ words. It’s not the friendliest of installations so, to get you up and running, we’ve provided step-by-step instructions.

There’s also a two-microphone ReSpeaker for the Raspberry Pi Zero

Create your own hotword

As we’ve just mentioned, we can have a hotword (or trigger word) to activate full speech recognition so we can stay offline. To do this, Snowboy must be trained to understand the word chosen. The code that describes the word (and specifically your pronunciation of it) is called the model. Luckily, this whole process is handled for you at snowboy.kitt.ai, where you can create a model file in a matter of minutes and download it. Just say your choice of words three times, and you’re done. Transfer the model to your Raspberry Pi 4 and place it in your home directory.

Let’s go Google

ReSpeaker can use its multiple mics to detect distance and direction

After the trigger word is heard, we want Google’s fleet of super-servers to help us transcribe what is being said. To use Google’s speech-to-text API, you will need to create a Google application and give it permissions to use the API. When you create the application, you will be given the opportunity to download ‘credentials’ (a small text file) which will allow your setup to use the Google API. Please note that you will need a billable account for this, although you get one hour of free speech-to-text per month. Full instructions on how to get set up can be found here.

Install the SDK and transcriber

To use Google’s API, we need to install the firm’s speech-to-text SDK for Python so we can stream audio and get the results. On the command line, run the following:

pip3 install google-cloud-speech

(If you get an error, run sudo apt install python3-pip and then try again.)

Remember that credentials file? We need to tell the SDK where it is:

export GOOGLE_APPLICATION_CREDENTIALS="/home/pi/[FILE_NAME].json"

(Don’t forget to replace [FILE_NAME] with the actual name of the JSON file.)

Now download and run this test file. Try saying something and see what happens!

Putting it all together

Now that we can talk to our Raspberry Pi, it’s time to link the hotword system to the Google transcription service to create our very own virtual assistant. We’ve provided sample code so that you can see these two systems running together. Run it, then say your chosen hotword. Now ask ‘what time is it?’ to get a response. (Don’t forget to connect a speaker to the audio output if you’re not using HDMI.) Now it’s over to you. Try adding code to respond to certain commands such as ‘turn the light on’, or ‘what time is it?’
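If you’d like a feel for how the two systems fit together before diving into the sample, here’s a rough Python sketch. The Snowboy import path depends on how you installed it, record_clip is a hypothetical helper for capturing audio from the microphone array, and google-cloud-speech 2.x is assumed to be installed:

from google.cloud import speech
import snowboydecoder  # exact module path depends on your Snowboy installation

speech_client = speech.SpeechClient()

def record_clip():
    """Hypothetical helper: capture a few seconds of 16 kHz mono audio
    from the microphone array and return raw LINEAR16 bytes."""
    raise NotImplementedError

def on_hotword():
    audio_bytes = record_clip()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-GB",
    )
    audio = speech.RecognitionAudio(content=audio_bytes)
    response = speech_client.recognize(config=config, audio=audio)
    for result in response.results:
        command = result.alternatives[0].transcript.lower()
        print("Heard:", command)
        if "time" in command:
            pass  # respond to the command here, for example speak the current time

# Listen continuously for the hotword model you trained at snowboy.kitt.ai
detector = snowboydecoder.HotwordDetector("hotword.pmdl", sensitivity=0.5)
detector.start(detected_callback=on_hotword, sleep_time=0.03)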

Get HackSpace magazine 36 Out Now!

Each month, HackSpace magazine brings you the best projects, tips, tricks and tutorials from the makersphere. You can get it from the Raspberry Pi Press online store, The Raspberry Pi store in Cambridge, or your local newsagents.

Each issue is free to download from the HackSpace magazine website.

The post Talk to your Raspberry Pi | HackSpace 36 appeared first on Raspberry Pi.

[$] Two address-space-isolation patches get closer

Post Syndicated from original https://lwn.net/Articles/835342/rss

Address-space isolation is the technique of removing a range of memory from
one or more address spaces as a way of preventing accidental or malicious
access to that memory. Since the disclosure of the Meltdown and Spectre
vulnerabilities, the kernel has used one form
of address-space isolation
to make kernel memory completely
inaccessible to user-space processes, for example. There has been a steady
level of interest in using similar techniques to protect memory in other
contexts; two patches implementing new isolation mechanisms are getting
closer to being ready for merging into the mainline kernel.

How to configure Duo multi-factor authentication with Amazon Cognito

Post Syndicated from Mahmoud Matouk original https://aws.amazon.com/blogs/security/how-to-configure-duo-multi-factor-authentication-with-amazon-cognito/

Adding multi-factor authentication (MFA) reduces the risk of user account take-over, phishing attacks, and password theft. Adding MFA while providing a frictionless sign-in experience requires you to offer a variety of MFA options that support a wide range of users and devices. Let’s see how you can achieve that with Amazon Cognito and Duo Multi-Factor Authentication (MFA).

Amazon Cognito user pools are user directories that are used by Amazon Web Services (AWS) customers to manage the identities of their customers and to add sign-in, sign-up and user management features to their customer-facing web and mobile applications. Duo Security is an APN Partner that provides unified access security and multi-factor authentication solutions.

In this blog post, I show you how to use Amazon Cognito custom authentication flow to integrate Duo Multi-Factor Authentication (MFA) into your sign-in flow and offer a wide range of MFA options to your customers. Some second factors available through Duo MFA are mobile phone SMS passcodes, approval of login via phone call, push-notification-based approval on smartphones, biometrics on devices that support it, and security keys that can be attached via USB.

How it works

Amazon Cognito user pools enable you to build a custom authentication flow that authenticates users based on one or more challenge/response cycles. You can use this flow to integrate Duo MFA into your authentication as a custom challenge.

Duo Web offers a software development kit to make it easier for you to integrate your web applications with Duo MFA. You need an account with Duo and an application to protect (which can be created from the Duo admin dashboard). When you create your application in the Duo admin dashboard, note the integration key (ikey), secret key (skey), and API hostname. These details, together with a random string (akey) that you generate, are the primary factors used to integrate your Amazon Cognito user pool with Duo MFA.

Note: ikey, skey, and akey are referred to as Duo keys.

Duo MFA will be integrated into the sign-in flow as a custom challenge. To do that, you need to generate a signed challenge request using Duo APIs and use it to load Duo MFA in an iframe and request the user’s second factor. When the challenge is answered by the user, a signed response is returned to your application and sent to Amazon Cognito for verification. If the response is valid then the MFA challenge is successful.

Let’s take a closer look at the sequence of calls and components involved in this flow.

Implementation details

In this section, I walk you through the end-to-end flow of integrating Duo MFA with Amazon Cognito using a custom authentication flow. To help you with this integration, I built a demo project that provides deployment steps and sample code to create a working demo in your environment.

Create and configure a user pool

The first step is to create the AWS resources needed for the demo. You can do that by deploying the AWS CloudFormation stack as described in the demo project.

A few implementation details to be aware of:

  • The template creates an Amazon Cognito user pool, application client, and AWS Lambda triggers that are used for the custom authentication.
  • The template also accepts ikey, skey, and akey as inputs. For security, the parameters are masked in the AWS CloudFormation console. These parameters are stored in a secret in AWS Secrets Manager with a resource policy that allows relevant Lambda functions read access to that secret.
  • Duo keys are loaded from secrets manager at the initialization of create auth challenge and verify auth challenge Lambda triggers to be used to create sign-request and verify sign-response.

Authentication flow

Figure 1: User authentication process for the custom authentication flow

Figure 1: User authentication process for the custom authentication flow

The preceding sequence diagram (Figure 1) illustrates the sequence of calls to sign in a user, which are as follows:

  1. In your application, the user is presented with a sign-in UI that captures their user name and password and starts the sign-in flow. A script—running in the browser—starts the sign-in process using the Amazon Cognito authenticateUser API with CUSTOM_AUTH set as the authentication flow. This validates the user’s credentials using Secure Remote Password (SRP) protocol and moves on to the second challenge if the credentials are valid.

    Note: The authenticateUser API automatically starts the authentication process with SRP. The first challenge that’s sent to Amazon Cognito is SRP_A. This is followed by PASSWORD_VERIFIER to verify the user’s credentials.

  2. After the SRP challenge step, the define auth challenge Lambda trigger will return CUSTOM_CHALLENGE and this will move control to the create auth challenge trigger.
  3. The create auth challenge Lambda trigger creates a Duo signed request using the Duo keys plus the username and returns the signed request as a challenge to the client. Here is a sample code of what create auth challenge should look like:
    
    // Assumed setup (not shown in the original sample): the Duo Web SDK and the
    // AWS SDK are bundled with the function, and the secret name is supplied
    // through an environment variable.
    const AWS = require('aws-sdk');
    const duo_web = require('duo_web');
    const secretsManagerClient = new AWS.SecretsManager();
    const secretName = process.env.DUO_SECRET_NAME; // assumed environment variable
    let ikey, skey, akey, secret;

    exports.handler = async (event) => {

        //load Duo keys from Secrets Manager and store them in global variables
    
        if(ikey == null || skey == null || akey == null){ 
          const promise = new Promise(function(resolve, reject) {
              secretsManagerClient.getSecretValue({SecretId: secretName}, function(err, data) {
                    if (err) {throw err; }
                    else {
                        if ('SecretString' in data) {
                            secret = JSON.parse(data.SecretString);
                            ikey = secret['duo-ikey'];
                            skey = secret['duo-skey'];
                            akey = secret['duo-akey'];
                        }
                    }
                    resolve();
                });
            })
            
            await promise; 
        }
    
        
        var username = event.userName;
        var sig_request = duo_web.sign_request(ikey, skey, akey, username);
        
        event.response.publicChallengeParameters = {
            sig_request: sig_request
        };
        
        return event;
    };
    

  4. The client initializes the Duo Web library with the signed request and displays Duo MFA in an iframe to request a second factor from the user. To initialize the Duo library, you need the api_hostname that is generated for your application in the Duo dashboard, the sign-request that was received as a challenge, and a callback function to invoke after the MFA step is completed by the user. This is done on the client side as follows:
          //render Duo MFA iframe
          $("#duo-mfa").html('<iframe id="duo_iframe" title="Two-Factor Authentication" </iframe>');
            
          Duo.init({
            'host': api_hostname,
            'sig_request': challengeParameters.sig_request,
            'submit_callback': mfa_callback
          });
    

  5. Through the Duo iframe, the user can set up their MFA preferences and respond to an MFA challenge. After successful MFA setup, a signed response from the Duo Web library will be returned to the client and passed to the callback function that was provided in Duo.init call.
     
    Figure 2: The first time a user signs in, Duo MFA displays a Start setup screen

    Figure 2: The first time a user signs in, Duo MFA displays a Start setup screen

  6. The client sends the Duo signed response to the Amazon Cognito service as a challenge response.
  7. Amazon Cognito sends the response to the verify auth challenge Lambda trigger, which uses Duo keys and username to verify the response.
    const duo_web = require('duo_web');
    exports.handler = async (event) => {
    
        //load duo keys from secrets manager and store them in global variables
        
        var username = event.userName;
        
        //-------get challenge response
        const sig_response = event.request.challengeAnswer;
        const verificationResult = duo_web.verify_response(ikey, skey, akey, sig_response);
        
        if (verificationResult === username) {
            event.response.answerCorrect = true;
        } else {
            event.response.answerCorrect = false;
        }
        return event;
    };
    

  8. Validation results and current state are passed once again to the define auth challenge Lambda trigger. If the user response is valid, then the Duo MFA challenge is successful. You can then decide to introduce additional challenges to the user or issue tokens and complete the authentication process.

Conclusion

As you build your mobile or web application, keep in mind that using multi-factor authentication is an effective and recommended approach to protect your customers from account take-over, phishing, and the risks of weak or compromised passwords. Making multi-factor authentication easy for your customers enables you to offer an authentication experience that protects their accounts without slowing them down.

Visit the security pillar of AWS Well-Architected Framework to learn more about AWS security best practices and recommendations.

In this blog post, I showed you how to integrate Duo MFA with an Amazon Cognito user pool. Visit the demo application and review the code samples in it to learn how to integrate this with your application.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Cognito forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Mahmoud Matouk

Mahmoud is a Senior Solutions Architect with the Amazon Cognito team. He helps AWS customers build secure and innovative solutions for various identity and access management scenarios.

Big data processing in a data warehouse environment using AWS Glue 2.0 and PySpark

Post Syndicated from Kaushik Krishnamurthi original https://aws.amazon.com/blogs/big-data/big-data-processing-in-a-data-warehouse-environment-using-aws-glue-2-0-and-pyspark/

The AWS Marketing Data Science and Engineering team enables AWS Marketing to measure the effectiveness and impact of various marketing initiatives and campaigns. This is done through a data platform and infrastructure strategy that consists of maintaining data warehouse, data lake, and data transformation (ETL) pipelines, and designing software tools and services to run related operations. While providing various business intelligence (BI) and machine learning (ML) solutions for marketers, there is particular focus on the timely delivery of error-free, reliable, self-served, reusable, and scalable ways to measure and report business metrics. In this post, we discuss one such example of improving operational efficiency and how we optimized our ETL process using AWS Glue 2.0 and PySpark SQL to achieve massive parallelism and reduce the runtime significantly, to under 45 minutes, so that data is delivered to the business much sooner.

Solution overview

Our team maintained an ETL pipeline to process the entire history of a dataset. We did this by running a SQL query repeatedly in Amazon Redshift, incrementally processing 2 months at a time to account for several years of historical data, with several hundreds of billions of rows in total. The input to this query is detailed service billing metrics across various AWS products, and the output is aggregated and summarized usage data. We wanted to move this heavy ETL process outside of our data warehouse environment, so that business users and our other relatively smaller ETL processes can use the Amazon Redshift resources fully for complex analytical queries.

Over the years, raw data feeds were captured in Amazon Redshift into separate tables, with 2 months of data in each. We first UNLOAD these to Amazon Simple Storage Service (Amazon S3) as Parquet formatted files and create AWS Glue tables on top of them by running CREATE TABLE DDLs in Amazon Athena as a one-time exercise. The source data is now available to be used as a DataFrame or DynamicFrame in an AWS Glue script.

Our query is dependent on a few more dimension tables that we UNLOAD again but in an automated fashion daily because we need the most recent version of these tables.

Next, we convert Amazon Redshift SQL queries to equivalent PySpark SQL. The data generated from the query output is written back to Amazon Redshift using AWS Glue DynamicFrame and DataSink. For more information, see Moving Data to and from Amazon Redshift.

We perform development and testing using Amazon SageMaker notebooks attached to an AWS Glue development endpoint.

After completing the tests, the script is deployed as a Spark application on the serverless Spark platform of AWS Glue. We do this by creating a job in AWS Glue and attaching our ETL script. We use the recently announced version AWS Glue 2.0.

The job can now be triggered via the AWS Command Line Interface (AWS CLI) using any workflow management or job scheduling tool. We use an internal distributed job scheduling tool to run the AWS Glue job periodically.

Design choices

We made our design choices based on several factors. First, we reused the same Amazon Redshift SQL queries with minimal changes by relying on Spark SQL, because its syntax is very similar to traditional ANSI SQL.

We also used several techniques to optimize our Spark script for better memory management and speed. For example, we used broadcast joins for smaller tables involved in joins. See the following code:

-- Join Hints for broadcast join
SELECT /*+ BROADCAST(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
-- https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html#join-hints
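The same hint can also be expressed through the DataFrame API. The following is a small sketch with illustrative table names (not our actual schema), assuming the job is configured to use the Glue Data Catalog as its metastore:

from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import broadcast

spark = GlueContext(SparkContext.getOrCreate()).spark_session

# Illustrative table names registered in the Data Catalog
fact_df = spark.table("billing_metrics")
dim_df = spark.table("product_dimension")

# Equivalent of the SQL BROADCAST hint: ship the small dimension table to every executor
joined_df = fact_df.join(broadcast(dim_df), on="key", how="inner")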

AWS Glue DynamicFrame allowed us to create an AWS Glue DataSink pointed to our Amazon Redshift destination and write the output of our Spark SQL directly to Amazon Redshift without having to export to Amazon S3 first, which requires an additional ETL to copy from Amazon S3 to Amazon Redshift. See the following code:

# Convert Spark DataFrame to Glue DynamicFrame:
myDyF = DynamicFrame.fromDF(myDF, glueContext, "dynamic_df")

# Connecting to destination Redshift database:
connection_options = {
    "dbtable": "example.redshift_destination",
    "database": "aws_marketing_redshift_db",
    "preactions": "delete from example.redshift_destination where date between '"+start_dt+"' AND '"+end_dt+"';",
    "postactions": "insert into example.job_status select 'example' as schema_name, 'redshift_destination' as table_name, to_date("+run_dt[:8]+",'YYYYMMDD') as run_date;",
}

# Glue DataSink:
datasink = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=myDyF,
    catalog_connection="aws_marketing_redshift_db_read_write",
    connection_options=connection_options,
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="datasink"
)

We also considered horizontal scaling vs. vertical scaling. Based on the results observed during our tests for performance tuning, we chose to go with 75 as the number of workers and G.2X as the worker type. This translates to 150 data processing units (DPU) in AWS Glue. With G.2X, each worker maps to 2 DPU (8 vCPU, 32 GB of memory, 128 GB of disk) and provides one executor per worker. For our dataset’s partitioning scheme, SQL aggregate functions, filters, and other operations combined, performance was nearly twice as fast as with G.1X. Each G.2X worker maps to 2 DPUs and runs twice the number of concurrent tasks compared to G.1X. This worker type is recommended for memory-intensive jobs and jobs that run intensive transforms. For more information, see the section Understanding AWS Glue Worker types in Best practices to scale Apache Spark jobs and partition data with AWS Glue.
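For reference, a job with this configuration can be created programmatically. The following boto3 sketch uses placeholder names, role, and script location rather than our actual values:

import boto3

glue = boto3.client("glue")

# Placeholder job name, role ARN, and script location
glue.create_job(
    Name="billing-history-etl",
    Role="arn:aws:iam::123456789012:role/GlueETLRole",
    GlueVersion="2.0",
    WorkerType="G.2X",
    NumberOfWorkers=75,
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-etl-bucket/scripts/billing_history_etl.py",
        "PythonVersion": "3",
    },
)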

We tested various choices of worker types between Standard, G.1X, and G.2X while also tweaking the number of workers. The job run time reduced proportionally as we added more G.2X instances.

With versions earlier than AWS Glue 2.0, jobs spent several minutes waiting for the cluster to become available. We observed an approximate average startup time of 8–10 minutes for our AWS Glue job with 75 or more workers. With AWS Glue 2.0, you can see much faster startup times. We noticed startup times of less than 1 minute on average in almost all our AWS Glue 2.0 jobs, and the ETL workload began within 1 minute from when the job run request was made. For more information, see Running Spark ETL Jobs with Reduced Startup Times.

Although cost is a factor to consider while running a large ETL, you’re billed only for the duration of the AWS Glue job. For Spark jobs with AWS Glue 2.0, you’re billed in 1-second increments (with a 1-minute minimum). For more information, see AWS Glue Pricing.

Additional design considerations

During implementation, we also considered additional optimizations and alternatives in case we ran into issues. For example, if you want to allocate more resources to the write operations into Amazon Redshift, you can modify the workload management (WLM) configuration in Amazon Redshift accordingly so sufficient compute power from Amazon Redshift is available for the AWS Glue jobs to write data into Amazon Redshift.

To complement our ETL process, we can also perform an elastic resize of the Amazon Redshift cluster to a larger size, making it more powerful in a matter of minutes and allowing more parallelism, which helps improve the speed of our ETL load operations.

To submit an elastic resize of an Amazon Redshift cluster using Bash, see the following code:

cmd=$(aws redshift --region 'us-east-1' resize-cluster --cluster-identifier ${REDSHIFT_CLUSTER_NAME} --number-of-nodes ${NUMBER_OF_NODES} --no-classic)

To monitor the elastic resize of an Amazon Redshift cluster using Bash, see the following code:

cluster_status=$(aws redshift --region 'us-east-1' describe-clusters --cluster-identifier ${REDSHIFT_CLUSTER_NAME} | jq -r ".Clusters[0].ClusterStatus")
cluster_availability_status=$(aws redshift --region 'us-east-1' describe-clusters --cluster-identifier ${REDSHIFT_CLUSTER_NAME} | jq -r ".Clusters[0].ClusterAvailabilityStatus")

while [ "$cluster_status" != "available" ] || [ "$cluster_availability_status" != "Available" ]
do
	echo "$cluster_status" | ts
	echo "$cluster_availability_status" | ts
	echo "Waiting for Redshift resize cluster to complete..."
	sleep 60
	cluster_status=$(aws redshift --region 'us-east-1' describe-clusters --cluster-identifier ${REDSHIFT_CLUSTER_NAME} | jq -r ".Clusters[0].ClusterStatus")
	cluster_availability_status=$(aws redshift --region 'us-east-1' describe-clusters --cluster-identifier ${REDSHIFT_CLUSTER_NAME} | jq -r ".Clusters[0].ClusterAvailabilityStatus")
done

echo "$cluster_status" | ts
echo "$cluster_availability_status" | ts
echo "Done"

ETL overview

To submit AWS Glue jobs using Python, we use the following code:

jobs = []
# 'glue' is an authenticated boto3 client object
jobs.append((glue_job_name, glue.start_job_run(
    JobName=glue_job_name,
    Arguments=arguments
)))

For our use case, we have multiple jobs. Each job can have multiple job runs, and each job run can have multiple retries. To monitor jobs, we use the following pseudo code:

while overall batch is still in progress:
      loop over all job runs of all submitted jobs:
            if job run is still in progress:
                  print job run status
                  wait
            else if job run has completed:
                  print success
            else job run has failed:
                  wait for retry to begin
                  loop over up to 10 retries of this job run:
                        if retry is still in progress:
                              print retry status     
                              wait
                              break
                        else if retry has completed:
                              print success
                              break
                        else retry has failed:
                              wait for next retry to begin
                        if this is the 10th i.e. final retry that failed:
                              print failure
                              loop over all job runs of all submitted jobs:
                                    loop over all retries of job run:
                                          build all job runs list
                                    kill all job runs list
                              wait for kill job runs to complete
                              send failure signal back to caller
      update overall batch status

The Python code is as follows:

job_run_status_overall = 'STARTING'
while job_run_status_overall in ['STARTING', 'RUNNING', 'STOPPING']:
    print("")
    job_run_status_temp = 'SUCCEEDED'
    for job, response in jobs:
        glue_job_name = job
        job_run_id = response['JobRunId']
        job_run_response = glue.get_job_run(JobName=glue_job_name, RunId=job_run_id)
        job_run_status = job_run_response['JobRun']['JobRunState']
        if job_run_status in ['STARTING', 'RUNNING', 'STOPPING']:
            job_run_status_temp = job_run_status
            logger.info("Glue job ({}) with run id {} has status : {}".format(glue_job_name, job_run_id, job_run_status))
            time.sleep(120)
        elif job_run_status == 'SUCCEEDED':
            logger.info("Glue job ({}) with run id {} has status : {}".format(glue_job_name, job_run_id, job_run_status))
        else:
            time.sleep(30)
            for i in range(1, 11):
                try:
                    job_run_id_temp = job_run_id+'_attempt_'+str(i)
                    # print("Checking for " + job_run_id_temp)
                    job_run_response = glue.get_job_run(JobName=glue_job_name, RunId=job_run_id_temp)
                    # print("Found " + job_run_id_temp)
                    job_run_id = job_run_id_temp
                    job_run_status = job_run_response['JobRun']['JobRunState']
                    if job_run_status in ['STARTING', 'RUNNING', 'STOPPING']:
                        job_run_status_temp = job_run_status
                        logger.info("Glue job ({}) with run id {} has status : {}".format(glue_job_name, job_run_id, job_run_status))
                        time.sleep(30)
                        break
                    elif job_run_status == 'SUCCEEDED':
                        logger.info("Glue job ({}) with run id {} has status : {}".format(glue_job_name, job_run_id, job_run_status))
                        break
                    else:
                        logger.info("Glue job ({}) with run id {} has status : {}".format(glue_job_name, job_run_id, job_run_status))
                        time.sleep(30)
                except Exception as e:
                    pass
                if i == 10:
                    logger.info("All attempts failed: Glue job ({}) with run id {} and status: {}".format(glue_job_name, job_run_id, job_run_status))
                    logger.info("Cleaning up: Stopping all jobs and job runs...")
                    for job_to_stop, response_to_stop in jobs:
                        glue_job_name_to_stop = job_to_stop
                        job_run_id_to_stop = response_to_stop['JobRunId']
                        job_run_id_to_stop_temp = []
                        for j in range(0, 11):
                            job_run_id_to_stop_temp.append(job_run_id_to_stop if j == 0 else job_run_id_to_stop+'_attempt_'+str(j))
                        job_to_stop_response = glue.batch_stop_job_run(JobName=glue_job_name_to_stop, JobRunIds=job_run_id_to_stop_temp)
                    time.sleep(30)
                    raise ValueError("Glue job ({}) with run id {} and status: {}".format(glue_job_name, job_run_id, job_run_status))
    job_run_status_overall = job_run_status_temp

Our set of source data feeds consists of multiple historical AWS Glue tables, with 2 months’ data in each, spanning across the past few years and a year into the future:

  • Tables for year 2016: table_20160101, table_20160301, table_20160501, …, table_20161101. (6 tables)
  • Tables for year 2017: table_20170101, table_20170301, table_20170501, …, table_20171101. (6 tables)
  • Tables for year 2018: table_20180101, table_20180301, table_20180501, …, table_20181101. (6 tables)
  • Tables for year 2019: table_20190101, table_20190301, table_20190501, …, table_20191101. (6 tables)
  • Tables for year 2020: table_20200101, table_20200301, table_20200501, …, table_20201101. (6 tables)
  • Tables for year 2021: table_20210101, table_20210301, table_20210501, …, table_20211101. (6 tables)

This adds up to 36 tables (and therefore 36 SQL queries) with about 800 billion rows to process (excluding the later months of 2020 and all of 2021, whose tables are empty placeholders at the time of writing).

Due to the high volume of data, we want to trigger our AWS Glue job multiple times: one job run request per table, submitted all at once to achieve parallelism (as opposed to sequential, stacked, or staggered-over-time job runs), resulting in 36 total job runs to process 6 years of data. In AWS Glue, we created 12 identical jobs, each allowing a maximum of three concurrent runs, which provides the 12 * 3 = 36 job run requests that we needed. However, we encountered a few bottlenecks and limitations, which we addressed with the workarounds discussed in the following section.
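As a rough illustration of this fan-out (not our production code; the job names, table list, and job argument are hypothetical), submitting the runs with boto3 might look like the following sketch, which also builds the (job name, response) pairs that the monitoring loop shown earlier iterates over:

import boto3

glue = boto3.client('glue')

# Hypothetical setup: 12 identical jobs, each allowing 3 concurrent runs,
# and the 36 bi-monthly source tables described above.
glue_job_names = ['glue-etl-job-{:02d}'.format(n) for n in range(1, 13)]
source_tables = ['table_20160101', 'table_20160301']  # ... 36 tables in total

jobs = []  # (job name, start_job_run response) pairs, polled later for status
for index, table in enumerate(source_tables):
    job_name = glue_job_names[index % len(glue_job_names)]  # spread runs across the 12 jobs
    response = glue.start_job_run(
        JobName=job_name,
        Arguments={'--source_table': table},  # hypothetical job argument
    )
    jobs.append((job_name, response))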

Limitations and workarounds

We needed more IP addresses within one VPC than our initial configuration provided. To address this, we made sure the VPC’s CIDR was configured with enough IP addresses to launch the more than 2,000 workers expected when running all the AWS Glue jobs. The following table shows an example configuration.

IPv4 CIDR       Available IPv4 addresses
10.0.0.0/20     4091
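The available-address count follows from the block size minus the five addresses that AWS reserves in each subnet, which a quick Python check confirms:

import ipaddress

block = ipaddress.ip_network('10.0.0.0/20')
# AWS reserves the first four addresses and the last address of every subnet.
print(block.num_addresses - 5)  # 4091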

For better availability and parallelism, we spread our jobs across multiple AWS Glue connections by doing the following:

  • Splitting our VPC into multiple subnets, with each subnet in a different Availability Zone
  • Creating one AWS Glue connection for each subnet (one each in us-east-1a, us-east-1c, and us-east-1d) so our request for over 2,000 worker nodes wasn’t made within a single Availability Zone

This VPC splitting approach makes sure the job requests are evenly distributed across the three Availability Zones we chose. The following table shows an example configuration.

Subnet                         VPC      IPv4 CIDR      Available IPv4    Availability Zone
my-subnet-glue-etl-us-east-1c  my-vpc   10.0.0.0/20    4091              us-east-1c
my-subnet-glue-etl-us-east-1a  my-vpc   10.0.16.0/20   4091              us-east-1a
my-subnet-glue-etl-us-east-1d  my-vpc   10.0.32.0/20   4091              us-east-1d

The following diagram illustrates our architecture.

Summary

In this post, we shared our experience exploring the features and capabilities of AWS Glue 2.0 for our data processing needs. We consumed over 4,000 DPUs across all our AWS Glue jobs because we used over 2,000 workers of G.2X type. We spread our jobs across multiple connections mapped to different Availability Zones of our Region: us-east-1a, 1c, and 1d, for better availability and parallelism.

Using AWS Glue 2.0, we could run all our PySpark SQL queries in parallel and independently, without resource contention between them. With earlier AWS Glue versions, launching each job took an extra 8–10 minutes for the cluster to boot up, but with the reduced startup time in AWS Glue 2.0, each job is ready to start processing data in less than 1 minute. Each AWS Glue job runs a Spark version of our original SQL query and writes its output directly back to our Amazon Redshift destination, configured via AWS Glue DynamicFrame and DataSink.

Each job takes approximately 30 minutes, and thanks to the high parallelism, all jobs still finish within 40 minutes when submitted together. Although job durations in AWS Glue 2.0 are similar to those in 1.0, saving the additional 8–10 minutes of startup time previously observed for a large cluster is a huge benefit. The duration of our long-running ETL process dropped from several hours to under an hour, a significant improvement in runtime.

Based on our experience, we plan to migrate to AWS Glue 2.0 for a large number of our current and future data platform ETL needs.


About the Author

Kaushik Krishnamurthi is a Senior Data Engineer at Amazon Web Services (AWS), where he focuses on building scalable platforms for business analytics and machine learning. Prior to AWS, he worked in several business intelligence and data engineering roles for many years.

AWS achieves FedRAMP P-ATO for 5 services in AWS US East/West and GovCloud (US) Regions

Post Syndicated from Amendaze Thomas original https://aws.amazon.com/blogs/security/aws-achieves-fedramp-p-ato-for-5-services-in-aws-us-east-west-and-govcloud-us-regions/

We’re pleased to announce that five additional AWS services have achieved a Provisional Authority to Operate (P-ATO) from the Federal Risk and Authorization Management Program (FedRAMP) Joint Authorization Board (JAB). These services provide the following capabilities for the federal government and customers with regulated workloads:

  • Enable your organization’s developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs with AWS Batch.
  • Aggregate, organize, and prioritize your security alerts or findings from multiple AWS services using AWS Security Hub.
  • Provision, manage, and deploy public and private Secure Sockets Layer/Transport Layer Security (SSL/TLS) certificates using AWS Certificate Manager.
  • Enable customers to set up and govern a new, secure, multi-account AWS environment using AWS Control Tower.
  • Provide a fully managed Kubernetes service with Amazon Elastic Kubernetes Service.

The following services are now listed on the FedRAMP Marketplace and the AWS Services in Scope by Compliance Program page.

AWS US East/West Regions (FedRAMP Moderate Authorization)

AWS GovCloud (US) Regions (FedRAMP High Authorization)

AWS is continually expanding the scope of our compliance programs to help enable your organization to use our services for sensitive and regulated workloads. Today, AWS offers 90 AWS services authorized in the AWS US East/West Regions under FedRAMP Moderate Authorization, and 76 services authorized in the AWS GovCloud (US) Regions under FedRAMP High Authorization.

To learn what other public sector customers are doing on AWS, see our Government, Education, and Nonprofits Case Studies and Customer Success Stories. Stay tuned for future updates on our Services in Scope by Compliance Program page. If you have feedback about this blog post, let us know in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Amendaze Thomas

Amendaze is the manager of the AWS Government Assessments and Authorization Program (GAAP). He has 15 years of experience providing advisory services to clients in the federal government, and over 13 years of experience supporting CISO teams with risk management framework (RMF) activities.

Mercado Libre: How to Block Malicious Traffic in a Dynamic Environment

Post Syndicated from Gaston Ansaldo original https://aws.amazon.com/blogs/architecture/mercado-libre-how-to-block-malicious-traffic-in-a-dynamic-environment/

Blog post contributors: Pablo Garbossa and Federico Alliani of Mercado Libre

Introduction

Mercado Libre (MELI) is the leading e-commerce and FinTech company in Latin America. We have a presence in 18 countries across Latin America, and our mission is to democratize commerce and payments to impact the development of the region.

We manage an ecosystem of more than 8,000 custom-built applications that process an average of 2.2 million requests per second. To support this demand, we run between 50,000 and 80,000 Amazon Elastic Compute Cloud (Amazon EC2) instances, and our infrastructure scales in and out according to the time of day, thanks to the elasticity of the AWS Cloud and its auto scaling features.


As a company, we expect our developers to devote their time and energy to building the apps and features that our customers demand, without having to worry about the underlying infrastructure that the apps are built upon. To achieve this separation of concerns, we built Fury, our platform as a service (PaaS) that provides an abstraction layer between our developers and the infrastructure. Each time a developer deploys a brand new application or a new version of an existing one, Fury takes care of creating all the required components, such as an Amazon Virtual Private Cloud (VPC), Elastic Load Balancing (ELB), an Amazon EC2 Auto Scaling group (ASG), and Amazon EC2 instances. Fury also manages a per-application Git repository, a CI/CD pipeline with different deployment strategies, such as blue/green and rolling upgrades, and transparent application logs and metrics collection.

Fury – MELI PaaS

For those of us on the Cloud Security team, Fury represents an opportunity to enforce critical security controls across our stack in a way that’s transparent to our developers. For instance, we can dictate what Amazon Machine Images (AMIs) are vetted for use in production (such as those that align with the Center for Internet Security benchmarks). If needed, we can apply security patches across all of our fleet from a centralized location in a very scalable fashion.

But there are also other attack vectors to which every organization with a presence on the public internet is exposed. The recent AWS Threat Landscape Report shows a 23% year-over-year increase in the total number of Denial of Service (DoS) events. It’s evident that organizations need to be prepared to react quickly under these circumstances.

The variety and number of attacks are increasing, testing the resilience of all types of organizations. This is why we started working on a solution that allows us to contain application DoS attacks and complements our perimeter security strategy, which is based on services such as AWS Shield and AWS Web Application Firewall (AWS WAF). In this article, we walk you through the solution we built to automatically detect and block these events.

The strategy we implemented for our solution, Network Behavior Anomaly Detection (NBAD), consists of four stages that we repeatedly execute:

  1. Analyze the execution context of our applications, like CPU and memory usage
  2. Learn their behavior
  3. Detect anomalies, gather relevant information and process it
  4. Respond automatically

Step 1: Establish a baseline for each application

End user traffic enters through different Amazon CloudFront distributions that route to multiple Elastic Load Balancers (ELBs). Behind the ELBs, we operate a fleet of NGINX servers from which we connect to the myriad of applications that our developers create via Fury.

Step 1: MELI Architecture – Anomaly detection project

We collect logs and metrics for each application and ship them to Amazon Simple Storage Service (Amazon S3) and Datadog. We then partition these logs using AWS Glue to make them available for consumption via Amazon Athena. On average, we send 3 terabytes (TB) of log files in Parquet format to S3.

Based on this information, we developed processes that we complement with commercial solutions, such as Datadog’s Anomaly Detection, which allows us to learn the normal behavior or baseline of our applications and project expected adaptive growth thresholds for each one of them.
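As a toy illustration of the baseline idea only (not Datadog’s algorithm or our production logic), a baseline-plus-threshold check over a per-application request-rate series might look like this:

from statistics import mean, stdev

def is_anomalous(history, current, k=3.0):
    """Flag the current request rate if it falls outside the learned baseline.

    history: recent per-minute request counts for one application
    k: width of the acceptance band in standard deviations (illustrative value)
    """
    baseline = mean(history)
    band = k * stdev(history)
    return abs(current - baseline) > band

# Example: a steady ~1,000 req/min application suddenly receiving 10,000 req/min
recent = [980, 1010, 995, 1020, 990, 1005, 1000, 1015]
print(is_anomalous(recent, 10000))  # True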

Step 2: Anomaly detection

When any of our apps receives a number of requests that fall outside the limits set by our anomaly detection algorithms, an Amazon Simple Notification Service (SNS) event is emitted, which triggers a workflow in the Anomaly Analyzer, a custom-built component of this solution.

Upon receiving such an event, the Anomaly Analyzer starts composing the so-called event context. In parallel, the Data Extractor retrieves vital insights via Athena from the log files stored in S3.
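As an illustration of that extraction step (the query, database, and output location are hypothetical, and this is not MELI’s actual Data Extractor), pulling per-IP request counts out of the partitioned logs with boto3 and Athena could look like this:

import time
import boto3

athena = boto3.client('athena')

def query_request_logs(app_name, window_start):
    # Hypothetical query against the partitioned access logs in S3
    sql = (
        "SELECT client_ip, count(*) AS hits "
        "FROM access_logs WHERE app = '{}' AND hour >= '{}' "
        "GROUP BY client_ip ORDER BY hits DESC LIMIT 100"
    ).format(app_name, window_start)

    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={'Database': 'security_logs'},  # hypothetical database
        ResultConfiguration={'OutputLocation': 's3://example-athena-results/'},  # hypothetical bucket
    )
    query_id = execution['QueryExecutionId']

    # Poll until the query finishes, then fetch the first page of results.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(2)
    return athena.get_query_results(QueryExecutionId=query_id)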

The output of this process is used as the input for the data enrichment process. This is responsible for consulting different threat intelligence sources that are used to further augment the analysis and determine if the event is an actual incident or not.

At this point, we build the context that not only gives us greater certainty when calculating the score, but also helps us validate and act more quickly. This context includes:

  • Application’s owner
  • Affected business metrics
  • Error handling statistics of our applications
  • Reputation of IP addresses and associated users
  • Use of unexpected URL parameters
  • Distribution by origin of the traffic that generated the event (cloud providers, geolocation, etc.)
  • Known behavior patterns of vulnerability discovery or exploitation

Step 2: MELI Architecture – Anomaly detection project

Step 3: Incident response

Once we reconstruct the context of the event, we calculate a score for each “suspicious actor” involved.

Step 3: MELI Architecture – Anomaly detection project

Based on these analysis results, we carry out a series of verifications in order to rule out false positives. Finally, we execute different actions based on the following criteria:

Manual review

If the outcome of the automatic analysis results in a medium risk scoring, we activate a manual review process:

  1. We send a report to the application’s owners with a summary of the context. Based on their understanding of the business, they can activate the Incident Response Team (IRT) on-call and/or provide feedback that allows us to improve our automatic rules.
  2. In parallel, our threat analysis team receives and processes the event. They are equipped with tools that allow them to add IP addresses, user agents, referrers, or regular expressions to AWS WAF to carry out temporary blocking of “bad actors” in situations where an attack is in progress.

Automatic response

If the analysis results in a high risk score, an automatic containment process is triggered. The event is sent to our block API, which is responsible for adding a temporary rule designed to mitigate the attack in progress. Behind the scenes, our block API leverages AWS WAF to create IP sets. We reference these IP sets from the custom rule groups in our web ACLs in order to block the IPs that source the malicious traffic. We found many benefits in the new release of AWS WAF, such as support for AWS Managed Rules, larger capacity units per web ACL, and an easier-to-use API.
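As a hedged sketch of what such a block API could do behind the scenes (the IP set name, ID, and scope below are hypothetical), updating an AWS WAF IP set with boto3 looks roughly like this:

import boto3

wafv2 = boto3.client('wafv2')

def block_ips(addresses_to_block):
    # Fetch the current addresses plus the lock token required for updates.
    current = wafv2.get_ip_set(
        Name='temporary-blocks',                      # hypothetical IP set name
        Scope='CLOUDFRONT',                           # or 'REGIONAL' for an ALB/API Gateway web ACL
        Id='11111111-2222-3333-4444-555555555555',    # hypothetical IP set ID
    )
    merged = sorted(set(current['IPSet']['Addresses']) | set(addresses_to_block))
    wafv2.update_ip_set(
        Name='temporary-blocks',
        Scope='CLOUDFRONT',
        Id='11111111-2222-3333-4444-555555555555',
        Addresses=merged,                 # CIDR strings, e.g. '198.51.100.7/32'
        LockToken=current['LockToken'],   # optimistic locking token from get_ip_set
    )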

Conclusion

By leveraging the AWS platform and its powerful APIs, and working together with the AWS WAF service team and solutions architects, we were able to build an automated incident response solution that identifies and blocks malicious actors with minimal operator intervention. Since launching the solution, we have reduced year-over-year application downtime by over 92%, even as time under attack increased more than 10x. This has had a positive impact on our users and, therefore, on our business.

Not only was our downtime drastically reduced, but we also cut the number of manual interventions during this type of incident by 65%.

We plan to iterate over this solution to further reduce false positives in our detection mechanisms as well as the time to respond to external threats.

About the authors

Pablo Garbossa is an Information Security Manager at Mercado Libre. His main duties include ensuring security in the software development life cycle and managing security in MELI’s cloud environment. Pablo is also an active member of the Open Web Application Security Project® (OWASP) Buenos Aires chapter, a nonprofit foundation that works to improve the security of software.

Federico Alliani is a Security Engineer on the Mercado Libre Monitoring team. Federico and his team are in charge of protecting the site against different types of attacks. He loves to dive deep into big architectures to drive performance, scale operational efficiency, and increase the speed of detection and response to security events.

Set Your Content Free With Fastly and Backblaze B2

Post Syndicated from Elton Carneiro original https://www.backblaze.com/blog/set-your-content-free-with-fastly-and-backblaze-b2/

Whether you need to deliver fast-changing application updates to users around the world, manage an asset-heavy website, or deliver a full-blown video streaming service—there are two critical parts of your solution you need to solve for: your origin store and your CDN.

You need an origin store that is a reliable place to store the content your app will use. And you need a content delivery network (CDN) to cache and deliver that content closer to every location your users happen to be so that your application delivers an optimized user experience.

These table stakes are simple, but platforms that try to serve both functions together generally end up layering on excessive complexity and fees to keep your content locked on their platform. When you can’t choose the right components for your solution, your content service can’t scale as fast as it needs to today and the premium you pay for unnecessary features inhibits your growth in the future.

That’s why we’re excited to announce our collaboration with Fastly in our campaign to bring choice, affordability, and simplicity to businesses with diverse content delivery needs.

Fastly: The Newest Edge Cloud Platform Partner for Backblaze B2 Cloud Storage

Our new collaboration with Fastly, a global edge cloud platform and CDN, offers an integrated solution that will let you store and serve rich media files seamlessly, free from the lock-in fees and functionality of closed “goliath” cloud storage platforms, and all with free egress from Backblaze B2 Cloud Storage to Fastly.

Fastly’s edge cloud platform enables users to create great digital experiences quickly, securely, and reliably by processing, serving, and securing customers’ applications as close to end-users as possible. Fastly’s edge cloud platform takes advantage of the modern internet, and is designed both for programmability and to support agile software development.

Get Ready to Go Global

The Fastly edge cloud platform is for any business that wants to serve data and content efficiently with the best user experience. Getting started only takes minutes: Fastly’s documentation will help you spin up your account and then help you explore how to use their features like image optimization, video and streaming acceleration, real-time logs, analytic services, and more.

If you’d like to learn more, join us for a webinar with Simon Wistow, Co-Founder & VP of Strategic Initiatives for Fastly, on November 19th at 10 a.m. PST.

Backblaze Covers Migration Egress Fees

To pair this functionality with best in class storage and pricing, you simply need a Backblaze B2 Cloud Storage account to set as your origin store. If you’re already using Fastly but have a different origin store, you might be paying a lot of money for data egress. Maybe even enough that the concept of migrating to another store seems impossible.

Backblaze has the solution: Migrate 50TB (or more), store it with us for at least 12 months, and we’ll pay the data transfer fees.

Or, if you have data on-premise, we have a number of solutions for you. And if the content you want to move is less than 50TB, we still have a way to cut your egress charges from your old provider by over 50%. Contact our team for details.

Freedom to Build and Operate Your Ideal Solution

With Backblaze as your origin store and Fastly as your CDN and edge cloud platform, you can reduce your application’s storage and network costs by up to 80%, based on joint solution pricing versus closed platform alternatives. Contact the Backblaze team if you have any questions.

The post Set Your Content Free With Fastly and Backblaze B2 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Security updates for Tuesday

Post Syndicated from original https://lwn.net/Articles/835401/rss

Security updates have been issued by Debian (thunderbird), Fedora (createrepo_c, dnf-plugins-core, dnf-plugins-extras, librepo, livecd-tools, and pdns-recursor), openSUSE (firefox and mailman), Oracle (firefox), Red Hat (chromium-browser, java-1.8.0-openjdk, and Satellite 6.8), Scientific Linux (java-1.8.0-openjdk), SUSE (libvirt), and Ubuntu (blueman, firefox, mysql-5.7, mysql-8.0, php7.4, and ruby-kramdown).

Fedora 33 released

Post Syndicated from original https://lwn.net/Articles/835366/rss

The Fedora 33 release is now available in a variety of editions, including the newly promoted IoT edition. “No matter what variant of Fedora you use, you’re getting the latest the open source world has to offer. Following our ‘First’ foundation, we’ve updated key programming language and system library packages, including Python 3.9, Ruby on Rails 6.0, and Perl 5.32. In Fedora KDE, we’ve followed the work in Fedora 32 Workstation and enabled the EarlyOOM service by default to improve the user experience in low-memory situations. To make the default Fedora experience better, we’ve set nano as the default editor.” A number of the more significant Fedora 33 changes were covered here in June.

Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/bulldozer-batch-data-moving-from-data-warehouse-to-online-key-value-stores-41bac13863f8

By Tianlong Chen and Ioannis Papapanagiotou

Netflix has more than 195 million subscribers who generate petabytes of data every day. Data scientists and engineers collect this data from our subscribers and videos, and implement data analytics models to discover customer behaviour with the goal of maximizing user joy. Usually, data scientists and engineers write Extract-Transform-Load (ETL) jobs and pipelines using big data compute technologies, like Spark or Presto, to process this data and periodically compute key information for a member or a video. The processed data is typically stored as data warehouse tables in Amazon S3. Iceberg is widely adopted in Netflix as a data warehouse table format that addresses many of the usability and performance problems with Hive tables.

At Netflix, we also heavily embrace a microservice architecture that emphasizes separation of concerns. Many of these services need fast lookups of this fine-grained data, which is generated periodically. For example, in order to enhance our user experience, one online application fetches subscribers’ preferences data to recommend movies and TV shows. The data warehouse is not designed to serve point requests from microservices with low latency. Therefore, we must efficiently move data from the data warehouse to a global, low-latency and highly reliable key-value store. For how our machine learning recommendation systems leverage our key-value stores, see this presentation for more details.

What is Bulldozer

Bulldozer is a self-serve data platform that moves data efficiently from data warehouse tables to key-value stores in batches. It leverages Netflix Scheduler for scheduling the Bulldozer Jobs. Netflix Scheduler is built on top of Meson which is a general purpose workflow orchestration and scheduling framework to execute and manage the lifecycle of the data workflow. Bulldozer makes data warehouse tables more accessible to different microservices and reduces each individual team’s burden to build their own solutions. Figure 1 shows how we use Bulldozer to move data at Netflix.

Figure 1. Moving data with Bulldozer at Netflix.

As the paved path for moving data to key-value stores, Bulldozer provides a scalable and efficient no-code solution. Users only need to specify the data source and the destination cluster information in a YAML file. Bulldozer provides the functionality to auto-generate the data schema which is defined in a protobuf file. The protobuf schema is used for serializing and deserializing the data by Bulldozer and data consumers. Bulldozer uses Spark to read the data from the data warehouse into DataFrames, converts each data entry to a key-value pair using the schema defined in the protobuf and then delivers key-value pairs into a key-value store in batches.

Instead of directly moving data into a specific key-value store like Cassandra or Memcached, Bulldozer moves data to a Netflix implemented Key-Value Data Abstraction Layer (KV DAL). The KV DAL allows applications to use a well-defined and storage engine agnostic HTTP/gRPC key-value data interface that in turn decouples applications from hard to maintain and backwards-incompatible datastore APIs. By leveraging multiple shards of the KV DAL, Bulldozer only needs to provide one single solution for writing data to the highly abstracted key-value data interface, instead of developing different plugins and connectors for different data stores. Then the KV DAL handles writing to the appropriate underlying storage engines depending on latency, availability, cost, and durability requirements.

Figure 2. How Bulldozer leverages Spark, Protobuf and KV DAL for moving the data.
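To make the flow concrete, here is a heavily simplified PySpark sketch of the read-convert-write loop; the table name, the serialization, and the kv_dal_put_batch stub are placeholders standing in for the generated protobuf messages and the KV DAL client, not Bulldozer’s actual code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bulldozer-sketch").getOrCreate()

# Read the warehouse table (name is illustrative) into a DataFrame.
df = spark.read.table("warehouse.subscriber_preferences")

BATCH_SIZE = 1000

def kv_dal_put_batch(batch):
    # Placeholder for the KV DAL HTTP/gRPC client write; the real client writes
    # each batch of serialized (key, value) pairs to a namespace.
    pass

def write_partition(rows):
    # In Bulldozer the key and value are serialized with the auto-generated
    # protobuf KeyMessage/ValueMessage; simple byte strings stand in for them here.
    batch = []
    for row in rows:
        key = str(row["profile_id"]).encode()
        value = "{}|{}".format(row["email"], row["age"]).encode()
        batch.append((key, value))
        if len(batch) >= BATCH_SIZE:
            kv_dal_put_batch(batch)
            batch = []
    if batch:
        kv_dal_put_batch(batch)

# Each Spark executor writes its partition of key-value pairs in batches.
df.foreachPartition(write_partition)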

Configuration-Based Bulldozer Job

For batch data movement in Netflix, we provide job templates in our Scheduler to make movement of data from all data sources into and out of the data warehouse. Templates are backed by notebooks. Our data platform provides the clients with a configuration-based interface to run a templated job with input validation.

We provide the job template MoveDataToKvDal for moving the data from the warehouse to one Key-Value DAL. Users only need to put the configurations together in a YAML file to define the movement job. The job is then scheduled and executed in Netflix Big Data Platform. This configuration defines what and where the data should be moved. Bulldozer abstracts the underlying infrastructure on how the data moves.

Let’s look at an example of a Bulldozer YAML configuration (Figure 3). Basically the configuration consists of three major domains: 1) data_movement includes the properties that specify what data to move. 2) key_value_dal defines the properties of where the data should be moved. 3) bulldozer_protobuf has the required information for protobuf file auto generation.

Figure 3. An Exemplar Bulldozer Job YAML.

In the data_movement domain, the source of the data can be a warehouse table or a SQL query. Users also need to define the key and value columns to tell Bulldozer which column is used as the key and which columns are included in the value message. We discuss more details about the schema mapping in the next Data Model section. The key_value_dal domain defines the destination of the data, which is a namespace in the Key-Value DAL. One namespace in a Key-Value DAL can contain as much key-value data as required; it is the equivalent of a table in a database.

Data Model

Bulldozer uses protobuf for 1) representing the warehouse table schema as a key-value schema; and 2) serializing and deserializing the key-value data when performing write and read operations to the KV DAL. In this way, it allows us to provide a more traditional typed record store while keeping the key-value storage engine abstracted.

Figure 4 shows a simple example of how we represent a warehouse table schema into a key-value schema. The left part of the figure shows the schema of the warehouse table while the right part is the protobuf message that Bulldozer auto generates based on the configurations in the YAML file. The field names should exactly match for Bulldozer to convert the structured data entries into the key-value pairs. In this case, profile_id field is the key while email and age fields are included in the value schema. Users can use the protobuf schema KeyMessage and ValueMessage to deserialize data from Key-Value DAL as well.

Figure 4. An Example of Schema Mapping.

In this example, the schema of the warehouse table is flat, but sometimes the table can have nested structures. Bulldozer supports complicated schemas, like struct of struct type, array of struct, map of struct and map of map type.

Data Version Control

Bulldozer jobs can be configured to execute at a desired frequency, such as once or many times per day. Each execution moves the latest view of the data warehouse into a Key-Value DAL namespace. Each view of the data warehouse is a new version of the entire dataset. For example, if the data warehouse has two versions of the full dataset, as of January 1st and January 2nd, a Bulldozer job scheduled to execute daily moves each version of the data.

Figure 5. Dataset of January 1st 2020.
Figure 6. Dataset of January 2nd 2020.

When Bulldozer moves these versioned data, it usually has the following requirements:

  • Data Integrity. A Bulldozer job moving one version of data should write the full dataset or none of it; partial writes are not acceptable. In the example above, if the consumer reads values for movie_id: 1 and movie_id: 2 after the Bulldozer jobs, the returned values shouldn’t come from two different versions, like: (movie_id: 1, cost 40), (movie_id: 2, cost 101).
  • Seamless to Data Consumer. Once a Bulldozer job finishes moving a new version of data, the data consumer should be able to start reading the new data automatically and seamlessly.
  • Data Fallback. Normally, data consumers read only the latest version of the data, but if there’s some data corruption in that version, we should have a mechanism to fallback to the previous version.

Bulldozer leverages the KV DAL data namespace and namespace alias functionality to manage these versioned datasets. For each Bulldozer job execution, it creates a new namespace suffixed with the date and moves the data to that namespace. The data consumer reads data from an alias namespace which points to one of these version namespaces. Once the job moves the full data successfully, the Bulldozer job updates the alias namespace to point to the new namespace which contains the new version of data. The old namespaces are closed to reads and writes and deleted in the background once it’s safe to do so. As most key-value storage engines support efficiently deleting a namespace (e.g. truncate or drop a table) this allows us to cheaply recycle old versions of the data. There are also other systems in Netflix like Gutenberg which adopt a similar namespace alias approach for data versioning which is applied to terabyte datasets.
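A sketch of that create-write-swap sequence, with a hypothetical KV DAL administrative client standing in for the real API:

from datetime import date

def run_versioned_move(admin, move_data, alias="subscriber_preferences"):
    # admin is a stand-in for a hypothetical KV DAL administrative client.
    new_namespace = "namespace_{}".format(date.today().strftime("%Y_%m_%d"))
    admin.create_namespace(new_namespace)

    # Write the complete new version first; readers still see the old namespace.
    move_data(new_namespace)

    # Only after the full dataset lands does the alias flip, so consumers
    # never observe a partially written version.
    previous = admin.get_alias_target(alias)
    admin.update_alias(alias, new_namespace)

    # The previous namespace is kept briefly as a fallback, then dropped.
    admin.schedule_deletion(previous)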

For example, in Figure 7 data consumers read the data through the namespace alias_namespace, which points to one of the underlying namespaces. On January 1st 2020, the Bulldozer job creates namespace_2020_01_01 and moves the dataset; alias_namespace points to namespace_2020_01_01. On January 2nd 2020, there’s a new version of data, so Bulldozer creates namespace_2020_01_02, moves the new dataset, and updates alias_namespace to point to namespace_2020_01_02. Both namespace_2020_01_01 and namespace_2020_01_02 are transparent to the data consumers.

Figure 7. An Example of How the Namespace Aliasing Works.

The namespace aliasing mechanism ensures that the data consumer only reads data from one single version. If there’s a bad version of data, we can always swap the underlying namespaces to fallback to the old version.

Production Usage

We released Bulldozer in production in early 2020. Currently, Bulldozer transfers billions of records from the data warehouse to key-value stores in Netflix every day. The use cases include our members’ predicted scores data to help improve the personalized experience (one example is shown in Figure 8), the metadata of Airtable and Google Sheets for data lifecycle management, the messaging modeling data for messaging personalization, and more.

Figure 8. Personalized articles in Netflix Help Center powered by Bulldozer.

Stay Tuned

The ideas discussed here include only a small set of problems with many more challenges still left to be identified and addressed. Please share your thoughts and experience by posting your comments below and stay tuned for more on data movement work at Netflix.


Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Take part in the PA Raspberry Pi Competition for UK schools

Post Syndicated from Janina Ander original https://www.raspberrypi.org/blog/pa-raspberry-pi-competition-uk-2021/

Every year, we support the PA Raspberry Pi Competition for UK schools, run by PA Consulting. In this free competition, teams of students from schools all over the UK imagine, design, and create Raspberry Pi–powered inventions.

Let’s inspire young people to take up a career in STEM!
© University of Southampton

The PA Raspberry Pi Competition aims to inspire young people aged 8 to 18 to learn STEM skills, teamwork, and creativity, and to move toward a career in STEM.

We invite all UK teachers to register if you have students at your school who would love to take part!

For the first 100 teams to complete registration and submit their entry form, PA Consulting provides a free Raspberry Pi Starter Kit to create their invention.

This year’s competition theme: Innovating for a better world

The theme is deliberately broad so that teams can show off their creativity and ingenuity.

  • All learners aged 8 to 18 can take part, and projects are judged in four age groups
  • The judging categories include team passion; simplicity and clarity of build instructions; world benefit; and commercial potential
  • The proposed budget for a team’s invention is around £100
  • The projects can be part of your students’ coursework
  • Entries must be submitted by Monday 22 March 2021
  • You’ll find more details and inspiration on the PA Raspberry Pi Competition webpage

Among all the entries, judges from the tech sector and the Raspberry Pi Foundation choose the finalists with the most outstanding inventions in their age group.

The Dynamix team, finalists in last round’s Y4–6 group, built a project called SmartRoad+

The final teams get to take part in an exciting awards event to present their creations so that the final winners can be selected. This round’s PA Raspberry Pi Awards Ceremony takes place on Wednesday 28 April 2021, and PA Consulting are currently considering whether this will be a physical or virtual event.

All teams that participate in the competition will be rewarded with certificates, and there’s of course the chance to win trophies and prizes too!

You can prepare with our free online courses

If you would like to boost your skills so you can better support your team, then sign up to one of our free online courses designed for educators:

Take inspiration from the winners of the previous round

All entries are welcome, no matter what your students’ experience is! Here are the outstanding projects from last year’s competition:

A look inside the air quality-monitoring project by Team Tempest, last round’s winners in the Y7–9 group

Find out more at the PA Raspberry Pi Competition webinar!

To support teachers in guiding their teams through the competition, PA Consulting will hold a webinar on 12 November 2020 at 4.30–5.30pm. Sign up to hear first-hand what’s involved in taking part in the PA Raspberry Pi Competition, and use the opportunity to ask questions!

The post Take part in the PA Raspberry Pi Competition for UK schools appeared first on Raspberry Pi.

Diving into /proc/[pid]/mem

Post Syndicated from Lennart Espe original https://blog.cloudflare.com/diving-into-proc-pid-mem/

A few months ago, after reading about Cloudflare doubling its intern class size, I quickly dusted off my CV and applied for an internship. Long story short: now, a couple of months later, I found myself staring into Linux kernel code and adding a pretty cool feature to gVisor, a Linux container runtime.

My internship was under the Emerging Technologies and Incubation group on a project involving gVisor. A co-worker contacted my team about not being able to read the debug symbols of stack traces inside the sandbox. For example, when the isolated process crashed, this is what we saw in the logs:

*** Check failure stack trace: ***
    @     0x7ff5f69e50bd  (unknown)
    @     0x7ff5f69e9c9c  (unknown)
    @     0x7ff5f69e4dbd  (unknown)
    @     0x7ff5f69e55a9  (unknown)
    @     0x5564b27912da  (unknown)
    @     0x7ff5f650ecca  (unknown)
    @     0x5564b27910fa  (unknown)

Obviously, this wasn’t very useful. I eagerly volunteered to fix this stack unwinding code – how hard could it be?

After some debugging, we found that the logging library used in the project opened /proc/self/mem to look for ELF headers at the start of each memory-mapped region. This was necessary to calculate an offset to find the correct addresses for debug symbols.

It turns out this mechanism is rather common. The stack unwinding code is often run in weird contexts – like a SIGSEGV handler – so it would not be appropriate to poke back and forth through raw memory addresses to read the ELF headers. Doing so could trigger another SIGSEGV. And a SIGSEGV inside a SIGSEGV handler means either termination via the default handler for a segfault, or recursing into the same handler again and again (if one sets SA_NODEFER), leading to a stack overflow.

However, inside gVisor, each call of open() on /proc/self/mem resulted in ENOENT, because the entire /proc/self/mem file was missing. In order to provide a robust sandbox, gVisor has to carefully reimplement the Linux kernel interfaces. This particular /proc file was simply unimplemented in the virtual file system of Sentry, one of gVisor’s sandboxing components.
Marek asked the devs on the project chat and got confirmation – they would be happy to accept a patch implementing this file.

The easy way out would have been to make a small, local patch to the unwinder behavior, yet I found myself diving into the Linux kernel trying to figure how the mem file worked in an attempt to implement it in Sentry’s VFS.

What does /proc/[pid]/mem do?

The file itself is quite powerful, because it allows raw access to the virtual address space of a process. According to the manpages, the documented file operations are open(), read(), and lseek(). Typical use cases are debugging tasks or dumping process memory.

Opening the file

When a process wants to open the file, the kernel does the file permissions check, looks up the associated operations for mem and invokes a method called proc_mem_open. It retrieves the associated task and calls a method named mm_access.

/*
 * Grab a reference to a task's mm, if it is not already going away
 * and ptrace_may_access with the mode parameter passed to it
 * succeeds.
 */

Seems relatively straightforward, right? The special thing about mm_access is that it verifies the permissions the current task has regarding the task to which the memory belongs. If the current task and target task do not share the same memory manager, the kernel invokes a method named __ptrace_may_access.

/*
 * May we inspect the given task?
 * This check is used both for attaching with ptrace
 * and for allowing access to sensitive information in /proc.
 *
 * ptrace_attach denies several cases that /proc allows
 * because setting up the necessary parent/child relationship
 * or halting the specified task is impossible.
 *
 */

According to the manpages, a process which would like to read from an unrelated /proc/[pid]/mem file should have access mode PTRACE_MODE_ATTACH_FSCREDS. This check does not verify that a process is attached via PTRACE_ATTACH, but rather if it has the permission to attach with the specified credentials mode.

Access checks

After skimming through the function, you will see that a process is allowed access if the current task belongs to the same thread group as the target task. Otherwise, access is denied unless one of the following conditions is met (depending on whether PTRACE_MODE_FSCREDS or PTRACE_MODE_REALCREDS is set, the check uses either the filesystem UID/GID, which is typically the same as the effective UID/GID, or the real UID/GID):

  • the current task’s credentials (UID, GID) match up with the credentials (real, effective and saved set-UID/GID) of the target process
  • the current task has CAP_SYS_PTRACE inside the user namespace of the target process

In the next check, access is denied if the current task has neither CAP_SYS_PTRACE inside the user namespace of the target task, nor the target’s dumpable attribute is set to SUID_DUMP_USER. The dumpable attribute is typically required to allow producing core dumps.

After these three checks, we also go through the commoncap Linux Security Module (and other LSMs) to verify our access mode is fine. LSMs you may know are SELinux and AppArmor. The commoncap LSM performs the checks on the basis of effective or permitted process capabilities (depending on the mode being FSCREDS or REALCREDS), allowing access if

  • the capabilities of the current task are a superset of the capabilities of the target task, or
  • the current task has CAP_SYS_PTRACE in the target task’s user namespace

In conclusion, one has access (with only commoncap LSM checks active) if:

  • the current task is in the same task group as the target task, or
  • the current task has CAP_SYS_PTRACE in the target task’s user namespace, or
  • the credentials of the current and target task match up in the given credentials mode, the target task is dumpable, they run in the same user namespace and the target task’s capabilities are a subset of the current task’s capabilities

I highly recommend reading through the ptrace manpages to dig deeper into the different modes, options and checks.

Reading from the file

Since all the access checks occur when opening the file, reading from it is quite straightforward. When one invokes read() on a mem file, it calls up mem_rw (which actually can do both reading and writing).

To avoid using lots of memory, mem_rw performs the copy in a loop and buffers the data in an intermediate page. mem_rw has a hidden superpower: it uses FOLL_FORCE to bypass permission checks on user-owned pages, treating pages marked as non-readable or non-writable as readable and writable.

mem_rw has other specialties, such as its error handling. Some interesting cases are:

  • if the target task has exited after opening the file descriptor, performing read() will always succeed with reading 0 bytes
  • if the initial copy from the target task’s memory to the intermediate page fails, it does not always return an error but only if no data has been read

You can also perform lseek() on the file, with the exception of SEEK_END.
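As a small self-contained experiment on Linux (reading the calling process’s own memory, so the access checks above trivially pass), the open/lseek/read interface can be exercised from Python like this:

import ctypes

# A buffer whose contents we will read back through /proc/self/mem.
secret = ctypes.create_string_buffer(b"hello from my own address space")
address = ctypes.addressof(secret)

# Open unbuffered so the read targets exactly the bytes we ask for.
with open("/proc/self/mem", "rb", buffering=0) as mem:
    # Seek to the buffer's virtual address, then read its bytes back.
    mem.seek(address)
    data = mem.read(len(secret.value))

print(data)  # b'hello from my own address space'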

How it works in gVisor

Luckily, gVisor already implemented ptrace_may_access as kernel.task.CanTrace, so one can avoid reimplementing all the ptrace access logic. However, the implementation in gVisor is less complicated due to the lack of support for PTRACE_MODE_FSCREDS (which is still an open issue).

When a new file descriptor is open()ed, the GetFile method of the virtual Inode is invoked, therefore this is where the access check naturally happens. After a successful access check, the method returns a fs.File. The fs.File implements all the file operations you would expect such as Read() and Write(). gVisor also provides tons of primitives for quickly building a working file structure so that one does not have to reimplement a generic lseek() for example.

When a task invokes a Read() call on the fs.File, the Read method retrieves the memory manager of the file’s Task.
Accessing the task’s memory manager is a breeze with the comfortable CopyIn and CopyOut methods, whose interfaces are similar to io.Writer and io.Reader.

After implementing all of this, we finally got a useful stack trace.

*** Check failure stack trace: ***
    @     0x7f190c9e70bd  google::LogMessage::Fail()
    @     0x7f190c9ebc9c  google::LogMessage::SendToLog()
    @     0x7f190c9e6dbd  google::LogMessage::Flush()
    @     0x7f190c9e75a9  google::LogMessageFatal::~LogMessageFatal()
    @     0x55d6f718c2da  main
    @     0x7f190c510cca  __libc_start_main
    @     0x55d6f718c0fa  _start

Conclusion

A comprehensive victory! The /proc/<pid>/mem file is an important mechanism that gives insight into contents of process memory. It is essential to stack unwinders to do their work in case of complicated and unforeseeable failures. Because the process memory contains highly-sensitive information, data access to the file is determined by a complex set of poorly documented rules. With a bit of effort, you can emulate /proc/[PID]/mem inside gVisor’s sandbox, where the process only has access to the subset of procfs that has been implemented by the gVisor authors and, as a result, you can have access to an easily readable stack trace in case of a crash.

Now I can’t wait to get the PR merged into gVisor.

Reverse-Engineering the Redactions in the Ghislaine Maxwell Deposition

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2020/10/reverse-engineering-the-redactions-in-the-ghislaine-maxwell-deposition.html

Slate magazine was able to cleverly read the Ghislaine Maxwell deposition and reverse-engineer many of the redacted names.

We’ve long known that redacting is hard in the modern age, but most of the failures to date have been a result of not realizing that covering digital text with a black bar doesn’t always remove the text from the underlying digital file. As far as I know, this reverse-engineering technique is new.

EDITED TO ADD: A similar technique was used in 1991 to recover the Dead Sea Scrolls.
