Tag Archives: Amazon Simple Storage Service (S3)

AWS Week in Review – November 7, 2022

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-week-in-review-november-7-2022/

With three weeks to go until AWS re:Invent opens in Las Vegas, the AWS News Blog Team is hard at work creating blog posts to share the latest launches and previews with you. As usual, we have a strong mix of new services, new features, and a surprise or two.

Last Week’s Launches
Here are some launches that caught my eye last week:

Amazon SNS Data Protection and Masking – After a quick public preview, this cool feature is now generally available. It uses pattern matching, machine learning models, and content policies to help protect data at scale. You can find many different kinds of personally identifiable information (PII) and protected health information (PHI) in message bodies and either block message delivery or mask (de-identify) the sensitive data, all in real-time and on a per-topic basis. To learn more, read the blog post or the message data protection documentation.

Amazon Textract Updates – This service extracts text, handwriting, and data from any document or image. This past week we updated the AnalyzeID function so that it can now extract the machine readable zone (MRZ) on passports issued by the United States, and we added the entire OCR output to the API response. We also updated the machine learning models that power the AnalyzeDocument function, with a focus on single-character boxed forms commonly found on tax and immigration documents. Finally, we updated the AnalyzeExpense function with support for new fields and higher accuracy for existing fields, bringing the total field count to more than 40.

Another Amazon Braket Processor – Our quantum computing service now supports Aquila, a new 256-qubit quantum computer from QuEra that is based on a programmable array of neutral Rubidium atoms. According to the What’s New, Aquila supports the Analog Hamiltonian Simulation (AHS) paradigm, allowing it to solve for the static and dynamic properties of quantum systems composed of many interacting particles.

Amazon S3 on Outposts – This service now lets you use additional S3 Lifecycle rules to optimize capacity management. You can expire objects as they age or are replaced with newer versions, with control at the bucket level, or for subsets defined by prefixes, object tags, or object sizes. There’s more info in the What’s New and in the S3 documentation.

AWS CloudFormation – There were two big updates last week: support for Amazon RDS Multi-AZ deployments with two readable standbys, and better access to detailed information on failed stack instances for operations on CloudFormation StackSets.

Amazon MemoryDB for Redis – You can now use data tiering as a lower cost way to to scale your clusters up to hundreds of terabytes of capacity. This new option uses a combination of instance memory and SSD storage in each cluster node, with all data stored durably in a multi-AZ transaction log. There’s more information in the What’s New and the blog post.

Amazon EC2 – You can now remove launch permissions for Amazon Machine Images (AMIs) that are directly shared with your AWS account.

X in Y – We launched existing AWS services and instance types in additional Regions:

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some additional news items that you may find interesting:

AWS Open Source News and Updates – My colleague Ricardo Sueiras highlights new open source projects, tools, and demos from the AWS Community. Read Installment 134 to see what’s going on!

New Case Study – A new AWS case study describes how Taggle (a company focused on smart water solutions in Australia) created an IoT platform that runs on AWS and uses Amazon Kinesis Data Streams to store & ingest data in real time. Using AWS allowed them to scale to accommodate 80,000 additional sensors that will roll out in 2022.

Upcoming AWS Events
re:Invent 2022AWS re:Invent is just three weeks away! Join us live from November 28th to December 2nd for keynotes, training and certification opportunities, and over 1,500 technical sessions. If you cannot make it to Las Vegas you can also join us online to watch the keynotes and leadership sessions live. Be sure to check out the re:Invent 2022 Attendee Guides, each curated by an AWS Hero, AWS industry team, or AWS partner.

PeerTalk – If you will be attending re:Invent in person and are interested in meeting with me or any of our featured experts, be sure to check out PeerTalk, our new onsite networking program.

That’s all for this week!

Jeff;

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS.

Optimize your modern data architecture for sustainability: Part 1 – data ingestion and data lake

Post Syndicated from Sam Mokhtari original https://aws.amazon.com/blogs/architecture/optimize-your-modern-data-architecture-for-sustainability-part-1-data-ingestion-and-data-lake/

The modern data architecture on AWS focuses on integrating a data lake and purpose-built data services to efficiently build analytics workloads, which provide speed and agility at scale. Using the right service for the right purpose not only provides performance gains, but facilitates the right utilization of resources. Review Modern Data Analytics Reference Architecture on AWS, see Figure 1.

In this series of two blog posts, we will cover guidance from the Sustainability Pillar of the AWS Well-Architected Framework on optimizing your modern data architecture for sustainability. Sustainability in the cloud is an ongoing effort focused primarily on energy reduction and efficiency across all components of a workload. This will achieve the maximum benefit from the resources provisioned and minimize the total resources required.

Modern data architecture includes five pillars or capabilities: 1) data ingestion, 2) data lake, 3) unified data governance, 4) data movement, and 5) purpose-built analytics. In the first part of this blog series, we will focus on the data ingestion and data lake pillars of modern data architecture. We’ll discuss tips and best practices that can help you minimize resources and improve utilization.

Modern Data Analytics Reference Architecture on AWS

Figure 1. Modern Data Analytics Reference Architecture on AWS

1. Data ingestion

The data ingestion process in modern data architecture can be broadly divided into two main categories: batch, and real-time ingestion modes.

To improve the data ingestion process, see the following best practices:

Avoid unnecessary data ingestion

Work backwards from your business needs and establish the right datasets you’ll need. Evaluate if you can avoid ingesting data from source systems by using existing publicly available datasets in AWS Data Exchange or Open Data on AWS. Using these cleaned and curated datasets will help you to avoid duplicating the compute and storage resources needed to ingest this data.

Reduce the size of data before ingestion

When you design your data ingestion pipelines, use strategies such as compression, filtering, and aggregation to reduce the size of ingested data. This will permit smaller data sizes to be transferred over network and stored in the data lake.

To extract and ingest data from data sources such as databases, use change data capture (CDC) or date range strategies instead of full-extract ingestion. Use AWS Database Migration Service (DMS) transformation rules to selectively include and exclude the tables (from schema) and columns (from wide tables, for example) for ingestion.

Consider event-driven serverless data ingestion

Adopt an event-driven serverless architecture for your data ingestion so it only provisions resources when work needs to be done. For example, when you use AWS Glue jobs and AWS Step Functions for data ingestion and pre-processing, you pass the responsibility and work of infrastructure optimization to AWS.

2. Data lake

Amazon Simple Storage Service (S3) is an object storage service which customers use to store any type of data for different use cases as a foundation for a data lake. To optimize data lakes on Amazon S3, follow these best practices:

Understand data characteristics

Understand the characteristics, requirements, and access patterns of your workload data in order to optimally choose the right storage tier. You can classify your data into categories shown in Figure 2, based on their key characteristics.

Data Characteristics

Figure 2. Data Characteristics

Adopt sustainable storage options

Based on your workload data characteristics, use the appropriate storage tier to reduce the environmental impact of your workload, as shown in Figure 3.

Storage tiering on Amazon S3

Figure 3. Storage tiering on Amazon S3

Implement data lifecycle policies aligned with your sustainability goals

Based on your data classification information, you can move data to more energy-efficient storage or safely delete it. Manage the lifecycle of all your data automatically using Amazon S3 Lifecycle policies.

Amazon S3 Storage Lens delivers visibility into storage usage, activity trends, and even makes recommendations for improvements. This information can be used to lower the environmental impact of storing information on S3.

Select efficient file formats and compression algorithms

Use efficient file formats such as Parquet, where a columnar format provides opportunities for flexible compression options and encoding schemes. Parquet also enables more efficient aggregation queries, as you can skip over the non-relevant data. Using an efficient way of storage and accessing data is translated into higher performance with fewer resources.

Compress your data to reduce the storage size. Remember, you will need to trade off compression level (storage saved on disk) against the compute effort required to compress and decompress. Choosing the right compression algorithm can be beneficial as well. For instance, ZStandard (zstd) provides a better compression ratio compared with LZ4 or GZip.

Use data partitioning and bucketing

Partitioning and bucketing divides your data and keeps related data together. This can help reduce the amount of data scanned per query, which means less compute resources needed to service the workload.

Track and assess the improvement for environmental sustainability

The best way for customers to evaluate success in optimizing their workloads for sustainability is to use proxy measures and unit of work KPIs. For storage, this is GB per transaction, and for compute, it would be vCPU minutes per transaction. To use proxy measures to optimize workloads for energy efficiency, read Sustainability Well-Architected Lab on Turning the Cost and Usage Report into Efficiency Reports.

In Table 1, we have listed certain metrics to use as a proxy metric to measure specific improvements. These fall under each pillar of modern data architecture covered in this post. This is not an exhaustive list, you could use numerous other metrics to spot inefficiencies. Remember, just tracking one metric may not explain the impact on sustainability. Use an analytical exercise of combining the metric with data, type of attributes, type of workload, and other characteristics.

Pillar Metrics
Data ingestion
Data lake

Table 1. Metrics for the Modern data architecture pillars

Conclusion

In this post, we have provided guidance and best practices to help reduce the environmental impact of the data ingestion and data lake pillars of modern data architecture.

In the next post, we will cover best practices for sustainability for the unified governance, data movement, and purpose-built analytics and insights pillars.

Further reading:

How USAA built an Amazon S3 malware scanning solution

Post Syndicated from Jonathan Nguyen original https://aws.amazon.com/blogs/architecture/how-usaa-built-an-amazon-s3-malware-scanning-solution/

United Services Automobile Association (USAA) is a San Antonio-based insurance, financial services, banking, and FinTech company supporting millions of military members and their families. USAA has partnered with Amazon Web Services (AWS) to digitally transform and build multiple USAA solutions that help keep members safe and save members money and time.

Why build a S3 malware scanning solution?

As complex companies’ businesses continue to grow, there may be an increased need for collaboration and interactions with outside vendors. Prior to developing an Amazon Simple Storage Solution (Amazon S3) scanning solution, a security review and approval process for application teams to ingest data into an AWS Organization from external vendors’ AWS accounts may be warranted, to ensure additional threats are not being introduced. This could result in a lengthy review and exception process, and subsequently, could hinder the velocity of application teams’ collaboration with external vendors.

USAA security standards, like those of most companies, require all data from external vendors to be treated as untrusted, and therefore must be scanned by an antivirus or antimalware solution prior to being ingested by downstream processes within the AWS environment. Companies looking to automate the scanning process may want to consider a solution where all incoming external data flow through a demilitarized drop zone to be scanned, and subsequently released to downstream processes if malware and viruses are not detected.

S3 malware scanning solution overview

Dedicated AWS accounts should be provisioned for specific data classifications and used as a demilitarized zone (DMZ) for an untrusted staging area. The solution discussed in this blog uses a dedicated staging AWS account that controls the release of Amazon S3 objects to other AWS accounts within an AWS Organization. AWS accounts within an AWS Organization should follow security best practices in terms of infrastructure, networking, logging, and security. External vendors should explicitly be given limited permissions to appropriate resources in their respective staging S3 bucket.

A staging S3 bucket should have specific resource policies restricting which applications and identity and access management (IAM) principals can interact with S3 objects using object attributes, such as object tags, to determine whether an object has been scanned, and what the results of that scan are. Additional guardrails are implemented using Service Control Policies (SCP) to restrict authorized IAM principals to create or modify S3 object attributes (Figure 1).

Amazon S3 antivirus and antimalware scanning architecture workflow

Figure 1. Amazon S3 antivirus and antimalware scanning architecture workflow

  1. The external vendor copies an object to the staging S3 bucket.
  2. The staging S3 bucket has event notifications configured and generates an event.
  3. The S3 PutObject event is sent to an Object Created Amazon Simple Queue Service (Amazon SQS) queue topic.
  4. An Amazon Elastic Compute Cloud (Amazon EC2) Auto Scaling group is configured to scale based on messages in the Object Created SQS queue.
  5. An antivirus and antimalware scanning service application on the Amazon EC2 instances takes the following actions on objects within the Object Created Amazon SQS queue:
    a. Tag the S3 object with an “In Progress” status.
    b. Get the object from the Staging S3 bucket and stores it in a local ephemeral file system.
    c. Scan the copied object using antivirus or antimalware tool.
    d. Based on the antivirus or antimalware scan results, tag the S3 object with the scan results (for example, No_Malware_Detected vs. Malware_Detected).
    e. Create and publish a payload to the Object Scanned Amazon Simple Notification Service (Amazon SNS) topic, allowing application team filtering.
    f. Delete the message from the Object Created SQS queue.
  6. Application teams are subscribed to the Object Scanned SNS topic with a filter for their application.
  7. For any objects where a virus or malware is detected, a company can use its cyber threat response team to conduct a thorough analysis and take appropriate actions.

USAA built a custom anti-virus and anti-malware scanning application using EC2 instances, using a private, hardened Amazon Machine Image (AMI). For cost-efficacy purposes, the EC2 automatic scaling event can be configured based on Object Created SQS queue depth and Service Level Objective (SLO). A serverless version of an anti-virus and anti-malware solution can be used instead of an EC2 application, depending on your specific use-case and other factors. Some important factors include antivirus and antimalware tool serverless support, resource tuning and configuration requirements, and additional AWS services to manage that could possibly result in a bottleneck. If your enterprise is going with a serverless approach, you can use open-source tools such as ClamAV using Lambda functions.

In the event of an infected object, proper guardrails and response mechanisms need to be in place. USAA teams have developed playbooks to monitor the health and performance of S3 scanning solution, as well as responding to detected virus or malware.

This cloud native, event-driven solution has benefited multiple USAA application teams who have previously requested the ability to ingest data into AWS workloads from teams outside of USAA’s AWS Organization, and allowed additional capabilities and functionality to better serve their members. To enhance this solution even further, USAA’s security team plans to incorporate additional mechanisms to find specific objects that either failed or required additional processing, without having to scan all objects in the buckets. This can be accomplished by including an additional AWS Lambda function and Amazon DynamoDB table to track object metadata as objects get added to the Object Created SQS queue for processing. The metadata could possibly include information such as S3 bucket origin, S3 object key, version ID, scan status, and the original S3 event payload to replay the event into the Object Created SQS queue. The Lambda function primarily ensures the DynamoDB table is kept up to date as objects are processed, as well as handling issues for objects that may need to be reprocessed. The DynamoDB table also has time-to-live (TTL) configured to clear records as they expire from the Staging S3 bucket.

Conclusion

In this post, we reviewed how USAA’s Public Cloud Security team facilitated collaboration and interactions with external vendors and AWS workloads securely by creating a scalable solution to scan S3 objects for virus and malware prior to releasing objects downstream. The solution uses native AWS services and can be utilized for any use-cases requiring antivirus or antimalware capabilities. Because the S3 object scanning solution uses EC2 instances, you can use your existing antivirus or antimalware enterprise tool.

How a blockchain startup built a prototype solution to solve the need of analytics for decentralized applications with AWS Data Lab

Post Syndicated from Dr. Quan Hoang Nguyen original https://aws.amazon.com/blogs/big-data/how-a-blockchain-startup-built-a-prototype-solution-to-solve-the-need-of-analytics-for-decentralized-applications-with-aws-data-lab/

This post is co-written with Dr. Quan Hoang Nguyen, CTO at Fantom Foundation.

Here at Fantom Foundation (Fantom), we have developed a high performance, highly scalable, and secure smart contract platform. It’s designed to overcome limitations of the previous generation of blockchain platforms. The Fantom platform is permissionless, decentralized, and open source. The majority of decentralized applications (dApps) hosted on the Fantom platform lack an analytics page that provides information to the users. Therefore, we would like to build a data platform that supports a web interface that will be made public. This will allow users to search for a smart contract address. The application then displays key metrics for that smart contract. Such an analytics platform can give insights and trends for applications deployed on the platform to the users, while the developers can continue to focus on improving their dApps.

AWS Data Lab offers accelerated, joint-engineering engagements between customers and AWS technical resources to create tangible deliverables that accelerate data and analytics modernization initiatives. Data Lab has three offerings: the Build Lab, the Design Lab, and a Resident Architect. The Build Lab is a 2–5 day intensive build with a technical customer team. The Design Lab is a half-day to 2-day engagement for customers who need a real-world architecture recommendation based on AWS expertise, but aren’t yet ready to build. Both engagements are hosted either online or at an in-person AWS Data Lab hub. The Resident Architect provides AWS customers with technical and strategic guidance in refining, implementing, and accelerating their data strategy and solutions over a 6-month engagement.

In this post, we share the experience of our engagement with AWS Data Lab to accelerate the initiative of developing a data pipeline from an idea to a solution. Over 4 weeks, we conducted technical design sessions, reviewed architecture options, and built the proof of concept data pipeline.

Use case review

The process started with us engaging with our AWS Account team to submit a nomination for the data lab. This followed by a call with the AWS Data Lab team to assess the suitability of requirements against the program. After the Build Lab was scheduled, an AWS Data Lab Architect engaged with us to conduct a series of pre-lab calls to finalize the scope, architecture, goals, and success criteria for the lab. The scope was to design a data pipeline that would ingest and store historical and real-time on-chain transactions data, and build a data pipeline to generate key metrics. Once ingested, data should be transformed, stored, and exposed via REST-based APIs and consumed by a web UI to display key metrics. For this Build Lab, we choose to ingest data for Spooky, which is a decentralized exchange (DEX) deployed on the Fantom platform and had the largest Total Value Locked (TVL) at that time. Key metrics such number of wallets that have interacted with the dApp over time, number of tokens and their value exchanged for the dApp over time, and number of transactions for the dApp over time were selected to visualize through a web-based UI.

We explored several architecture options and picked one for the lab that aligned closely with our end goal. The total historical data for the selected smart contract was approximately 1 GB since deployment of dApp on the Fantom platform. We used FTMScan, which allows us to explore and search on the Fantom platform for transactions, to estimate the rate of transfer transactions to be approximately three to four per minute. This allowed us to design an architecture for the lab that can handle this data ingestion rate. We agreed to use an existing application known as the data producer that was developed internally by the Fantom team to ingest on-chain transactions in real time. On checking transactions’ payload size, it was found to not exceed 100 kb for each transaction, which gave us the measure of number of files that will be created once ingested through the data producer application. A decision was made to ingest the past 45 days of historic transactions to populate the platform with enough data to visualize key metrics. Because the feature of backdating exists within the data producer application, we agreed to use that. The Data Lab Architect also advised us to consider using AWS Database Migration Service (AWS DMS) to ingest historic transactions data post lab. As a last step, we decided to build a React-based webpage with Material-UI that allows users to enter a smart contract address and choose the time interval, and the app fetches the necessary data to show the metrics value.

Solution overview

We collectively agreed to incorporate the following design principles for the data lab architecture:

  • Simplified data pipelines
  • Decentralized data architecture
  • Minimize latency as much as possible

The following diagram illustrates the architecture that we built in the lab.

We collectively defined the following success criteria for the Build Lab:

  • End-to-end data streaming pipeline to ingest on-chain transactions
  • Historical data ingestion of the selected smart contract
  • Data storage and processing of on-chain transactions
  • REST-based APIs to provide time-based metrics for the three defined use cases
  • A sample web UI to display aggregated metrics for the smart contract

Prior to the Build Lab

As a prerequisite for the lab, we configured the data producer application to use the AWS Software Development Kit (AWS SDK) and PUTRecords API operation to send transactions data to an Amazon Simple Storage Service (Amazon S3) bucket. For the Build Lab, we built additional logic within the application to ingest historic transactions data together with real-time transactions data. As a last step, we verified that transactions data was captured and ingested into a test S3 bucket.

AWS services used in the lab

We used the following AWS services as part of the lab:

  • AWS Identity and Access Management (IAM) – We created multiple IAM roles with appropriate trust relationships and necessary permissions that can be used by multiple services to read and write on-chain transactions data and generated logs.
  • Amazon S3 – We created an S3 bucket to store the incoming transactions data as JSON-based files. We created a separate S3 bucket to store incoming transaction data that failed to be transformed and will be reprocessed later.
  • Amazon Kinesis Data Streams – We created a new Kinesis data stream in on-demand mode, which automatically scales based on data ingestion patterns and provides hands-free capacity management. This stream was used by the data producer application to ingest historical and real-time on-chain transactions. We discussed having the ability to manage and predict cost, and therefore were advised to use the provisioned mode when reliable estimates were available for throughput requirements. We were also advised to continue to use on-demand mode until the data traffic patterns were unpredictable.
  • Amazon Kinesis Data Firehose – We created a Firehose delivery stream to transform the incoming data and writes it to the S3 bucket. To minimize latency, we set the delivery stream buffer size to 1 MiB and buffer interval to 60 seconds. This would ensure a file is written to the S3 bucket when either of the two conditions are satisfied regardless of the order. Transactions data written to the S3 bucket was in JSON Lines format.
  • Amazon Simple Queue Service (Amazon SQS) – We set up an SQS queue of the type Standard and an access policy for that SQS queue to allow incoming messages generated from S3 bucket event notifications.
  • Amazon DynamoDB – In order to pick a data store for on-chain transactions, we needed a service that can store transactions payload of unstructured data with varying schemas, provides the ability to cache query results, and is a managed service. We picked DynamoDB for those reasons. We created a single DynamoDB table that holds the incoming transactions data. After analyzing the access query patterns, we decided to use the address field of the smart contract as the partition key and the timestamp field as the sort key. The table was created with auto scaling of read and write capacity modes because the actual usage requirements would be hard to predict at that time.
  • AWS Lambda – We created the following functions:
    • A Python-based Lambda function to perform transformations on the incoming data from the data producer application to flatten the JSON structure, convert the Unix-based epoch timestamp to a date/time value, and convert hex-based string values to a decimal value representing the number of tokens.
    • A second Lambda function to parse incoming SQS queue messages. This message contained values for bucket_name and object_key, which holds the reference to a newly created object within the S3 bucket. The Lambda function logic included parsing of this value to obtain the reference to the S3 object, get the contents of the object, read it into a data frame object using the AWS SDK for pandas (awswrangler) library, convert it into a Pandas data frame object, and use the put_df API call to write a Pandas data frame object as an item into a DynamoDB table. We choose to use Pandas due to familiarity with the library and functions required to perform data transform operations.
    • Three separate Lambda functions that contains the logic to query the DynamoDB table and retrieve items to aggregate and calculate metrics values. This calculated metrics value within the Lambda function was formatted as an HTTP response to expose as REST-based APIs.
  • Amazon API Gateway – We created a REST based API endpoint that uses Lambda proxy integration to pass a smart contract address and time-based interval in minutes as a query string parameter to the backend Lambda function. The response from the Lambda function was a metrics value. We also enabled cross-origin resource sharing (CORS) support within API Gateway to successfully query from the web UI that resides in a different domain.
  • Amazon CloudWatch – We used a Lambda function in-built mechanism to send function metrics to CloudWatch. Lambda functions come with a CloudWatch Logs log group and a log stream for each instance of your function. The Lambda runtime environment sends details of each invocation to the log stream, and relays logs and other output from your function’s code.

Iterative development approach

Across 4 days of the Build Lab, we undertook iterative development. We started by developing the foundational layer and iteratively added extra features through testing and data validation. This allowed us to develop confidence of the solution being built as we tested the output of the metrics through a web-based UI and verified with the actual data. As errors got discovered, we deleted the entire dataset and reran all the jobs to verify results and resolve those errors.

Lab outcomes

In 4 days, we built an end-to-end streaming pipeline ingesting 45 days of historical data and real-time on-chain transactions data for the selected Spooky smart contract. We also developed three REST-based APIs for the selected metrics and a sample web UI that allows users to insert a smart contract address, choose a time frequency, and visualize the metrics values. In a follow-up call, our AWS Data Lab Architect shared post-lab guidance around the next steps required to productionize the solution:

  • Scaling of the proof of concept to handle larger data volumes
  • Security best practices to protect the data while at rest and in transit
  • Best practices for data modeling and storage
  • Building an automated resilience technique to handle failed processing of the transactions data
  • Incorporating high availability and disaster recovery solutions to handle incoming data requests, including adding of the caching layer

Conclusion

Through a short engagement and small team, we accelerated this project from an idea to a solution. This experience gave us the opportunity to explore AWS services and their analytical capabilities in-depth. As a next step, we will continue to take advantage of AWS teams to enhance the solution built during this lab to make it ready for the production deployment.

Learn more about how the AWS Data Lab can help your data and analytics on the cloud journey.


About the Authors

Dr. Quan Hoang Nguyen is currently a CTO at Fantom Foundation. His interests include DLT, blockchain technologies, visual analytics, compiler optimization, and transactional memory. He has experience in R&D at the University of Sydney, IBM, Capital Markets CRC, Smarts – NASDAQ, and National ICT Australia (NICTA).

Ankit Patira is a Data Lab Architect at AWS based in Melbourne, Australia.

Build incremental crawls of data lakes with existing Glue catalog tables

Post Syndicated from Leonardo Gomez original https://aws.amazon.com/blogs/big-data/build-incremental-crawls-of-data-lakes-with-existing-glue-catalog-tables/

AWS Glue includes crawlers, a capability that make discovering datasets simpler by scanning data in Amazon Simple Storage Service (Amazon S3) and relational databases, extracting their schema, and automatically populating the AWS Glue Data Catalog, which keeps the metadata current. This reduces the time to insight by making newly ingested data quickly available for analysis with your preferred analytics and machine learning (ML) tools.

Previously, you could reduce crawler cost by using Amazon S3 Event Notifications to incrementally crawl changes on Data Catalog tables created by crawler. Today, we’re extending this support to crawling and updating Data Catalog tables that are created by non-crawler methods, such as using data pipelines. This crawler feature can be useful for several use cases, such as following:

  • You currently have a data pipeline to create AWS Glue Data Catalog tables and want to offload detection of partition information from the data pipeline to a scheduled crawler
  • You have an S3 bucket with event notifications enabled and want to continuously catalog new changes and prevent creation of new tables in case of ill-formatted files that break the partition detection
  • You have manually created Data Catalog tables and want to run incremental crawls on new file additions instead of running full crawls due to long crawl times

To accomplish incremental crawling, you can configure Amazon S3 Event Notifications to be sent to an Amazon Simple Queue Service (Amazon SQS) queue. You can then use the SQS queue as a source to identify changes and can schedule or run an AWS Glue crawler with Data Catalog tables as a target. With each run of the crawler, the SQS queue is inspected for new events. If no new events are found, the crawler stops. If events are found in the queue, the crawler inspects their respective folders, processes through built-in classifiers (for CSV, JSON, AVRO, XML, and so on), and determines the changes. The crawler then updates the Data Catalog with new information, such as newly added or deleted partitions or columns. This feature reduces the cost and time to crawl large and frequently changing Amazon S3 data.

This post shows how to create an AWS Glue crawler that supports Amazon S3 event notification on existing Data Catalog tables using the new crawler UI and an AWS CloudFormation template.

Overview of solution

To demonstrate how the new AWS Glue crawler performs incremental updates, we use the Toronto parking tickets dataset—specifically data about parking tickets issued in the city of Toronto between 2019–2020. The goal is to create a manual dataset as well as its associated metadata tables in AWS Glue, followed by an event-based crawler that detects and implements changes to the manually created datasets and catalogs.

As mentioned before, instead of crawling all the subfolders on Amazon S3, we use an Amazon S3 event-based approach. This helps improve the crawl time by using Amazon S3 events to identify the changes between two crawls by listing all the files from the subfolder that triggered the event instead of listing the full Amazon S3 target. To accomplish this, we create an S3 bucket, an event-based crawler, an Amazon Simple Storage Service (Amazon SNS) topic, and an SQS queue. The following diagram illustrates our solution architecture.

Prerequisites

For this walkthrough, you should have the following prerequisites:

If the AWS account you use to follow this post uses Lake Formation to manage permissions on the AWS Glue Data Catalog, make sure that you log in as a user with access to create databases and tables. For more information, refer to Implicit Lake Formation permissions.

Launch your CloudFormation stack

To create your resources for this use case, complete the following steps:

  1. Launch your CloudFormation stack in us-east-1:
  2. For Stack name, enter a name for your stack .
  3. For paramBucketName, enter a name for your S3 bucket (with your account number).
  4. Choose Next.
  5. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  6. Choose Create stack.

Wait for the stack formation to finish provisioning the requisite resources. When you see the CREATE_COMPLETE status, you can proceed to the next steps.

Additionally, note down the ARN of the SQS queue to use at a later point.

Query your Data Catalog

Next, we use Amazon Athena to confirm that the manual tables have been created in the Data Catalog, as part of the CloudFormation template.

  1. On the Athena console, choose Launch query editor.
  2. For Data source, choose AwsDataCatalog.
  3. For Database, choose torontoparking.

    The tickets table should appear in the Tables section.

    Now you can query the table to see its contents.
  4. You can write your own query, or choose Preview Table on the options menu.

    This writes a simple SQL query to show us the first 10 rows.
  5. Choose Run to run the query.

As we can see in the query results, the database and table for 2019 parking ticket data have been created and partitioned.

Create the Amazon S3 event crawler

The next step is to create the crawler that detects and crawls only on incrementally updated tables.

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. For Name, enter a name.
  4. Choose Next.

    Now we need to select the data source for the crawler.
  5. Select Yes to indicate that our data is already mapped to our AWS Glue Data Catalog.
  6. Choose Add tables.
  7. For Database, choose torontoparking and for Tables, choose tickets.
  8. Select Crawl based on events.
  9. For Include SQS ARN, enter the ARN you saved from the CloudFormation stack outputs.
  10. Choose Confirm.

    You should now see the table populated under Glue tables, with the parameter set as Recrawl by event.
  11. Choose Next.
  12. For Existing IAM role, choose the IAM role created by the CloudFormation template (GlueCrawlerTableRole).
  13. Choose Next.
  14. For Frequency, choose On demand.

    You also have the option of choosing a schedule on which the crawler will run regularly.
  15. Choose Next.
  16. Review the configurations and choose Create crawler.

    Now that the crawler has been created, we add the 2020 ticketing data to our S3 bucket so that we can test our new crawler. For this step, we use the AWS Command Line Interface (AWS CLI)
  17. To add this data, use the following command:
    aws s3 cp s3://aws-bigdata-blog/artifacts/gluenewcrawlerui2/source/year=2020/Parking_Tags_Data_2020.000.csv s3://glue-table-crawler-blog-<YOURACCOUNTNUMBER>/year=2020/Parking_Tags_Data_2020.000.csv

After successful completion of this command, your S3 bucket should contain the 2020 ticketing data and your crawler is ready to run. The terminal should return the following:

copy: s3://aws-bigdata-blog/artifacts/gluenewcrawlerui2/source/year=2020/Parking_Tags_Data_2020.000.csv to s3://glue-table-crawler-blog-<YOURACCOUNTNUMBER>/year=2020/Parking_Tags_Data_2020.000.csvRun the crawler and verify the updates

Run the crawler and verify the updates

Now that the new folder has been created, we run the crawler to detect the changes in the table and partitions.

  1. Navigate to your crawler on the AWS Glue console and choose Run crawler.

    After running the crawler, you should see that it added the 2020 data to the tickets table.
  2. On the Athena console, we can ensure that the Data Catalog has been updated by adding a where year = 2020 filter to the query.

AWS CLI option

You can also create the crawler using the AWS CLI. For more information, refer to create-crawler.

Clean up

To avoid incurring future charges, and to clean up unused roles and policies, delete the resources you created: the CloudFormation stack, S3 bucket, AWS Glue crawler, AWS Glue database, and AWS Glue table.

Conclusion

You can use AWS Glue crawlers to discover datasets, extract schema information, and populate the AWS Glue Data Catalog. In this post, we provided a CloudFormation template to set up AWS Glue crawlers to use Amazon S3 event notifications on existing Data Catalog tables, which reduces the time and cost needed to incrementally process table data updates in the Data Catalog.

With this feature, incremental crawling can now be offloaded from data pipelines to the scheduled AWS Glue crawler, reducing cost. This alleviates the need for full crawls, thereby reducing crawl times and Data Processing Units (DPUs) required to run the crawler. This is especially useful for customers that have S3 buckets with event notifications enabled and want to continuously catalog new changes.

To learn more about this feature, refer to Accelerating crawls using Amazon S3 event notifications.

Special thanks to everyone who contributed to this crawler feature launch: Theo Xu, Jessica Cheng, Arvin Mohanty, and Joseph Barlan.


About the authors

Leonardo Gómez is a Senior Analytics Specialist Solutions Architect at AWS. Based in Toronto, Canada, he has over a decade of experience in data management, helping customers around the globe address their business and technical needs.

Aayzed Tanweer is a Solutions Architect working with startup customers in the FinTech space, with a special focus on analytics services. Originally hailing from Toronto, he recently moved to New York City, where he enjoys eating his way through the city and exploring its many peculiar nooks and crannies.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

AWS Week in Review – October 10, 2022

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/aws-week-in-review-october-10-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

I had an amazing start to the week last week as I was speaking at the AWS Community Day NL. This event had 500 attendees and over 70 speakers, and Dr. Werner Vogels, Amazon CTO, delivered the keynote. AWS Community Days are community-led conferences organized by local communities, with a variety of workshops and sessions. I recommend checking your region for any of these events.

Community Day NL

Last Week’s Launches
Here are some launches that got my attention during the previous week.

Amazon S3 Object Lambda now supports using your own code to change the results of HEAD and LIST requests, besides GET (which we launched last year). This feature now enables more capabilities for what you can do with S3 Object Lambda. Danilo made a Twitter thread with lots of use cases for this new launch.

Amazon SageMaker Clarify now can provide near real-time explanations for ML predictions. SageMaker Clarify is a service that provides explainability by ML models individual predictions. These explanations are important for developers to get visibility into their training data and models to identify potential bias.

AWS Storage Gateway now supports 15 TiB tapes. It increased the maximum supported virtual tape size on Tape Gateway from 5 TiB to 15 TiB, so you can store more data on a single virtual tape, and you can reduce the number of tapes you need to manage.

Amazon Aurora Serverless v2 now supports AWS CloudFormation. Early this year, we announced the general availability of Aurora Serverless v2, and now you can use AWS CloudFormation Templates to deploy and change the database along with the rest of your infrastructure.

AWS Config now supports 15 new resource types, including AWS DataSync, Amazon GuardDuty, Amazon Simple Email Service (Amazon SES), AWS AppSync, AWS Cloud Map, Amazon EC2, and AWS AppConfig. With this launch, you can use AWS Config to monitor configuration data for the supported resource types in your AWS account, and you can see how the configuration changes.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates and news that you may have missed:

This week an article about how AWS is leading a pilot project to turn the Greek island of Naxos into a smart island caught my attention. The project introduces smart solutions for mobility, primary healthcare, and the transport of goods. The solution has been built based on four pillars that were important for the island: sustainability, telehealth, leisure, and digital skills. Check out the whole article to learn what they are doing.

Podcast Charlas Técnicas de AWS – If you understand Spanish, this podcast is for you. Podcast Charlas Técnicas is one of the official AWS podcasts in Spanish, and every other week there is a new episode. The podcast is meant for builders, and it shares stories about how customers implemented and learned AWS services, how to architect applications, and how to use new services. You can listen to all the episodes directly from your favorite podcast app or at AWS Podcasts en español.

AWS open-source news and updates – This is a newsletter curated by my colleague Ricardo to bring you the latest open-source projects, posts, events, and more.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS re:Invent reserved seating opens on October 11. If you are planning to attend, book a spot in advance for your favorite sessions. AWS re:Invent is our biggest conference of the year, it happens in Las Vegas from November 28 to December 2, and registrations are open. Many writers of this blog have sessions at re:Invent, and you can search the event agenda using our names.

I started the post talking about AWS Community Days, and there is one in Warsaw, Poland, on October 14. If you are around Warsaw during this week, you can first check out the AWS Pop-up Hub in Warsaw that runs October 10-14 and then join for the Community Day.

On October 20, there is a virtual event for modernizing .NET workloads with Windows containers on AWS, You can register for free.

That’s all for this week. Check back next Monday for another Week in Review!

— Marcia

Lifting and shifting a web application to AWS Serverless: Part 2

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/compute/lifting-and-shifting-a-web-application-to-aws-serverless-part-2/

In part 1, you learn if it is possible to migrate a non-serverless web application to a serverless environment without changing much code. You learn different tools that can help you in this process, like Lambda Web Adaptor and AWS Amplify. By the end, you have migrated an application into a serverless environment.

However, if you test the migrated app, you find two issues. The first one is that the user session is not sticky. Every time you log in, you are logged out unexpectedly from the application. The second one is that when you create a new product, you cannot upload new images of that product.

This final post analyzes each of the problems in detail and shows solutions. In addition, it analyzes the cost and performance of the solution.

Authentication and authorization migration

The original application handled the authentication and authorization by itself. There is a user directory in the database, with the passwords and emails for each of the users. There are APIs and middleware that take care of validating that the user is logged in before showing the application. All the logic for this is developed inside the Node.js/Express application.

However, with the current migrated application every time you log in, you are logged out unexpectedly from the application. This is because the server code is responsible for handling the authentication and the authorization of the users, and now our server is running in an AWS Lambda function and functions are stateless. This means that there will be one function running per request—a request can load all the products in the landing page, get the details for a product, or log in to the site—and if you do something in one of these functions, the state is not shared across.

To solve this, you must remove the authentication and authorization mechanisms from the function and use a service that can preserve the state across multiple invocations of the functions.

There are many ways to solve this challenge. You can add a layer of authentication and session management with a database like Redis, or build a new microservice that is in charge of authentication and authorization that can handle the state, or use an existing managed service for this.

Because of the migration requirements, we want to keep the cost as low as possible, with the fewest changes to the application. The better solution is to use an existing managed service to handle authentication and authorization.

This demo uses Amazon Cognito, which provides user authentication and authorization to AWS resources in a managed, pay as you go way. One rapid approach is to replace all the server code with calls to Amazon Cognito using the AWS SDK. But this adds complexity that can be replaced completely by just invoking Amazon Cognito APIs from the React application.

Using Cognito

For example, when a new user is registered, the application creates the user in the Amazon Cognito user pool directory, as well as in the application database. But when a user logs in to the web app, the application calls Amazon Cognito API directly from the AWS Amplify application. This way minimizes the amount of code needed.

In the original application, all authenticated server APIs are secured with a middleware that validates that the user is authenticated by providing an access token. With the new setup that doesn’t change, but the token is generated by Amazon Cognito and then it can be validated in the backend.

let auth = (req, res, next) => {
    const token = req.headers.authorization;
    const jwtToken = token.replace('Bearer ', '');

    verifyToken(jwtToken)
        .then((valid) => {
            if (valid) {
                getCognitoUser(jwtToken).then((email) => {
                    User.findByEmail(email, (err, user) => {
                        if (err) throw err;
                        if (!user)
                            return res.json({
                                isAuth: false,
                                error: true,
                            });

                        req.user = user;
                        next();
                    });
                });
            } else {
                throw Error('Not valid Token');
            }
        })
        .catch((error) => {
            return res.json({
                isAuth: false,
                error: true,
            });
        });
};

You can see how this is implemented step by step in this video.

Storage migration

In the original application, when a new product is created, a new image is uploaded to the Node.js/Express server. However, now the application resides in a Lambda function. The code (and files) that are part of that function cannot change, unless the function is redeployed. Consequently, you must separate the user storage from the server code.

For doing this, there are a couple of solutions: using Amazon Elastic File System (EFS) or Amazon S3. EFS is a file storage, and you can use that to have a dynamic storage where you upload the new images. Using EFS won’t change much of the code, as the original implementation is using a directory inside the server as EFS provides. However, using EFS adds more complexity to the application, as functions that use EFS must be inside an Amazon Virtual Private Cloud (Amazon VPC).

Using S3 to upload your images to the application is simpler, as it only requires that an S3 bucket exists. For doing this, you must refactor the application, from uploading the image to the application API to use the AWS Amplify library that uploads and gets images from S3.

export function uploadImage(file) {
    const fileName = `uploads/${file.name}`;

    const request = Storage.put(fileName, file).then((result) => {
        return {
            image: fileName,
            success: true,
        };
    });

    return {
        type: IMAGE_UPLOAD,
        payload: request,
    };
}

An important benefit of using S3 is that you can also use Amazon CloudFront to accelerate the retrieval of the images from the cloud. In this way, you can speed up the loading time of your page. You can see how this is implemented step by step in this video.

How much does this application cost?

If you deploy this application in an empty AWS account, most of the usage of this application is covered by the AWS Free Tier. Serverless services, like Lambda and Amazon Cognito, have a forever free tier that gives you the benefits in pricing for the lifetime of hosting the application.

  • AWS Lambda—With 100 requests per hour, an average 10ms invocation and 1GB of memory configured, it costs 0 USD per month.
  • Amazon S3—Using S3 standard, hosting 1 GB per month and 10k PUT and GET requests per month costs 0.07 USD per month. This can be optimized using S3 Intelligent-Tiering.
  • Amazon Cognito—Provides 50,000 monthly active users for free.
  • AWS Amplify—If you build your client application once a week, serve 3 GB and store 1 GB per month, this costs 0.87 USD.
  • AWS Secrets Manager—There are two secrets stored using Secrets Manager and this costs 1.16 USD per month. This can be optimized by using AWS System Manager Parameter Store and AWS Key Management Service (AWS KMS).
  • MongoDB Atlas Forever free shared cluster.

The total monthly cost of this application is approximately 2.11 USD.

Performance analysis

After you migrate the application, you can run a page speed insight tool, to measure this application’s performance. This tool provides results mostly about the front end and the experience that the user perceives. The results are displayed in the following image. The performance of this website is good, according to the insight tool performance score – it responds quickly and the user experience is good.

Page speed insight tool results

After the application is migrated to a serverless environment, you can do some refactoring to improve further the overall performance. One alternative is whenever a new image is uploaded, it gets resized and formatted into the correct next-gen format automatically using the event driven capabilities that S3 provides. Another alternative is to use Lambda on Edge to serve the right image size for the device, as it is possible to format the images on the fly when serving them from a distribution.

You can run load tests for understanding how your backend and database will perform. For this, you can use Artillery, an open-source library that allows you to run load tests. You can run tests with the expected maximum load your site will get and ensure that your site can handle it.

For example, you can configure a test that sends 30 requests per seconds to see how your application reacts:

config:
  target: 'https://xxx.lambda-url.eu-west-1.on.aws'
  phases:
    - duration: 240
      arrivalRate: 30
      name: Testing
scenarios:
  - name: 'Test main page'
    flow:
      - post:
          url: '/api/product/getProducts/'

This test is performed on the backend APIs, not only testing your backend but also your integration with the MongoDB. After running it, you can see how the Lambda function performs on the Amazon CloudWatch dashboard.

Running this load test helps you understand the limitations of your system. For example, if you run a test with too many concurrent users, you might see that the number of throttles in your function increases. This means that you need to lift the limit of invocations of the functions you can have at the same time.

Or when increasing the requests per second, you may find that the MongoDB cluster starts throttling your requests. This is because you are using the free tier and that has a set number of connections. You might need a larger cluster or to migrate your database to another service that provides a large free tier, like Amazon DynamoDB.

Cloudwatch dashboard

Conclusion

In this two-part article, you learn if it is possible to migrate a non-serverless web application to a serverless environment without changing much code. You learn different tools that can help you in this process, like AWS Lambda Web Adaptor and AWS Amplify, and how to solve some of the typical challenges that we have, like storage and authentication.

After the application is hosted in a fully serverless environment, it can scale up and down to meet your needs. This web application is also performant once the backend is hosted in a Lambda function.

If you need, from here you can start using the strangler pattern to refactor the application to take advantage of the benefits of event-driven architecture.

To see all the steps of the migration, there is a playlist that contains all the tutorials for you to follow.

For more serverless learning resources, visit Serverless Land.

Lifting and shifting a web application to AWS Serverless: Part 1

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/compute/lifting-and-shifting-a-web-application-to-aws-serverless-part-1/

Customers migrating to the cloud often want to get the benefits of serverless architecture. But what is the best approach and is it possible? There are many strategies to do a migration, but lift and shift is often the fastest way to get to production with the migrated workload.

You might also wonder if it’s possible to lift and shift an existing application that runs in a traditional environment to serverless. This blog post shows how to do this for a Mongo, Express, React, and Node.js (MERN) stack web app. However, the discussions presented in this post apply to other stacks too.

Why do a lift and shift migration?

Lift and shift, or sometimes referred to as rehosting the application, is moving the application with as few changes as possible. Lift and shift migrations often allow you to get the new workload in production as fast as possible. When migrating to serverless, lift and shift can bring a workload that is not yet in the cloud or in a serverless environment to use managed and serverless services quickly.

Migrating a non-serverless workload to serverless with lift and shift might not bring all the serverless benefits right away, but it enables the development team to refactor, using the strangler pattern, the parts of the application that might benefit from what serverless technologies offer.

Why migrate a web app to serverless?

Web apps hosted in a serverless environment benefit most from the capability of serverless applications to scale automatically and for paying for what you use.

Imagine that you have a personal web app with little traffic. If you are hosting in a serverless environment, you don’t pay a fixed price to have the servers up and running. Your web app has only a few requests and the rest of the time is idle.

This benefit applies to the opposite case. For an owner of a small ecommerce site running on a server, imagine if a social media influencer with millions of followers recommends one of their products. Suddenly, thousands of requests arrive and make the site unavailable. If the site is hosted on a serverless platform, the application will scale to the traffic that it receives.

Requirements for migration

Before starting a migration, it is important to define the nonfunctional requirements that you need the new application to have. These requirements help when you must make architectural decisions during the migration process.

These are the nonfunctional requirements of this migration:

  • Environment that scales to zero and scales up automatically.
  • Pay as little as possible for idle time.
  • Configure as little infrastructure as possible.
  • Automatic high availability of the application.
  • Minimal changes to the original code.

Application overview

This blog post guides you on how to migrate a MERN application. The original application is hosted in two different servers: One contains the Mongo database and another contains the Node/js/Express and ReactJS applications.

Application overview

This demo application simulates a swag ecommerce site. The database layer stores the products, users, and the purchases history. The server layer takes care of the ecommerce business logic, hosting the product images, and user authentication and authorization. The web layer takes care of all the user interaction and communicates with the server layer using REST APIs.

How the application looks like

These are the changes that you must make to migrate to a serverless environment:

  • Database migration: Migrate the database from on-premises to MongoDB Atlas.
  • Backend migration: Migrate the NodeJS/Express application from on-premises to an AWS Lambda function.
  • Web app migration: Migrate the React web app from on-premises to AWS Amplify.
  • Authentication migration: Migrate the custom-built authentication to use Amazon Cognito.
  • Storage migration: Migrate the local storage of images to use Amazon S3 and Amazon CloudFront.

The following image shows the proposed solution for the migrated application:

Proposed architecture

Database migration

The database is already in a MongoDB vanilla container that has all the data for this application. As MongoDB is the database engine for our stack, their recommended solution to migrate to serverless is to use MongoDB Atlas. Atlas provides a database cluster in the cloud that scales automatically and you pay for what you use.

To get started, create a new Atlas cluster, then migrate the data from the existing database to the serverless one. To migrate the data, you can first dump all the content of the database to a dump folder and then restore it to the cloud:

mongodump --uri="mongodb://<localuser>:<localpassword>@localhost:27017"

mongorestore --uri="mongodb+srv://<user>:<password>@<clustername>.debkm.mongodb.net" .

After doing that, your data is now in the cloud. The next step is to change the configuration string in the server to point to the new database. To see this in action, check this video that shows a walkthrough of the migration.

Backend migration

Migrating the Node.js/Express backend is the most challenging of the layers to migrate to a serverless environment, as the server layer is a Node.js application that runs in a server.

One option for this migration is to use AWS Fargate. Fargate is a serverless container service that allows you to scale automatically and you pay as you go. Another option is to use AWS AppRunner, a container service that auto scales and you also pay as you go. However, neither of these options align with our migration requirements, as they don’t scale to zero.

Another option for the lift and shift migration of this Node.js application is to use Lambda with the AWS Lambda Web Adapter. The AWS Lambda Web Adapter is an open-source project that allows you to build web applications with familiar frameworks, like Express.js, Flask, SpringBoot, and run it on Lambda. You can learn more about this project in its GitHub repository.

Lambda Web Adapter

Using this project, you can create a new Lambda function that has the Express/NodeJS application as the function code. You can lift and shift all the code into the function. If you want a step-by-step tutorial on how to do this, check out this video.

const lambdaAdapterFunction = new Function(this,`${props.stage}-LambdaAdapterFunction`,
            {
                runtime: Runtime.NODEJS_16_X,
                code: Code.fromAsset('backend-app'),
                handler: 'run.sh',
                environment: {
                    AWS_LAMBDA_EXEC_WRAPPER: '/opt/bootstrap',
                    REGION: this.region,
                    ASYNC_INIT: 'true',
                },
                memorySize: 1024,
                layers: [layerLambdaAdapter],
                timeout: Duration.seconds(2),
                tracing: Tracing.ACTIVE,
            }
        );

The next step is to create an HTTP endpoint for the server application. There are three options for doing this: API Gateway, Application Load Balancer (ALB) , or to use Lambda Function URLs. All the options are compatible with Lambda Web Adapter and can solve the challenge for you.

For this demo, choose function URLs, as they are simple to configure and one function URL forwards all routes to the Express server. API Gateway and ALB require more configuration and have separate costs, while the cost of function URLs is included in the Lambda function.

Web app migration

The final layer to migrate is the React application. The best way to migrate the web layer and to adhere to the migration requirements is to use AWS Amplify to host it. AWS Amplify is a fully managed service that provides many features like hosting web applications and managing the CICD process for the web app. It provides client libraries to connect to different AWS resources, and many other features.

Migrating the React application is as simple as creating a new Amplify application in your AWS account and uploading the React application to a code repository like GitHub. This AWS Amplify application is connected to a GitHub branch, and when there is a new commit in this branch, AWS Amplify redeploys the code.

The Amplify application receives configuration parameters like the function URL endpoint (the server URL) using environmental variables.

const amplifyApp = new App(this, `${props.stage}-AmplifyReactShopApp`, {
            sourceCodeProvider: new GitHubSourceCodeProvider({
                owner: config.frontend.owner,
                repository: config.frontend.repository_name,
                oauthToken: SecretValue.secretsManager('github-token'),
            }),
            environmentVariables: {
                REGION: this.region,
                SERVER_URL: props.serverURL,
            },
        });

If you want to see a step-by-step guide on how to make your web layer serverless, you can check this video.

Next steps

However, if you test this migrated app, you will find two issues. The first one is that the user session is not sticky. Every time you log in, you are logged out unexpectedly from the application. The second one is that when you create a new product, you cannot upload new images of that product.

In part two, I analyze each of the problems in detail and find solutions. These issues arise because of the stateless and immutable characteristics of this solution. The next part of this article explains how to solve these issues, also it analyzes costs and performance of the solution.

Conclusion

In this article, you learn if it is possible to migrate a non-serverless web application to a serverless environment without changing much code. You learn different tools that can help you in this process, like the AWS Lambda Web Adaptor and AWS Amplify.

If you want to see the migration in action and learn all the steps for this, there is a playlist that contains all the tutorials for you to follow and learn how this is possible.

For more serverless learning resources, visit Serverless Land.

Convert Oracle XML BLOB data to JSON using Amazon EMR and load to Amazon Redshift

Post Syndicated from Abhilash Nagilla original https://aws.amazon.com/blogs/big-data/convert-oracle-xml-blob-data-to-json-using-amazon-emr-and-load-to-amazon-redshift/

In legacy relational database management systems, data is stored in several complex data types, such XML, JSON, BLOB, or CLOB. This data might contain valuable information that is often difficult to transform into insights, so you might be looking for ways to load and use this data in a modern cloud data warehouse such as Amazon Redshift. One such example is migrating data from a legacy Oracle database with XML BLOB fields to Amazon Redshift, by performing preprocessing and conversion of XML to JSON using Amazon EMR. In this post, we describe a solution architecture for this use case, and show you how to implement the code to handle the XML conversion.

Solution overview

The first step in any data migration project is to capture and ingest the data from the source database. For this task, we use AWS Database Migration Service (AWS DMS), a service that helps you migrate databases to AWS quickly and securely. In this example, we use AWS DMS to extract data from an Oracle database with XML BLOB fields and stage the same data in Amazon Simple Storage Service (Amazon S3) in Apache Parquet format. Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance, and is the storage of choice for setting up data lakes on AWS.

After the data is ingested into an S3 staging bucket, we used Amazon EMR to run a Spark job to perform the conversion of XML fields to JSON fields, and the results are loaded in a curated S3 bucket. Amazon EMR runtime for Apache Spark can be over three times faster than clusters without EMR runtime, and has 100% API compatibility with standard Apache Spark. This improved performance means your workloads run faster and it saves you compute costs, without making any changes to your application.

Finally, transformed and curated data is loaded into Amazon Redshift tables using the COPY command. The Amazon Redshift table structure should match the number of columns and the column data types in the source file. Because we stored the data as a Parquet file, we specify the SERIALIZETOJSON option in the COPY command. This allows us to load complex types, such as structure and array, in a column defined as SUPER data type in the table.

The following architecture diagram shows the end-to-end workflow.

In detail, AWS DMS migrates data from the source database tables into Amazon S3, in Parquet format. Apache Spark on Amazon EMR reads the raw data, transforms the XML data type into JSON, and saves the data to the curated S3 bucket. In our code, we used an open-source library, called spark-xml, to parse and query the XML data.

In the rest of this post, we assume that the AWS DMS tasks have already run and created the source Parquet files in the S3 staging bucket. If you want to set up AWS DMS to read from an Oracle database with LOB fields, refer to Effectively migrating LOB data to Amazon S3 from Amazon RDS for Oracle with AWS DMS or watch the video Migrate Oracle to S3 Data lake via AWS DMS.

Prerequisites

If you want to follow along with the examples in this post using your AWS account, we provide an AWS CloudFormation template you can launch by choosing Launch Stack:

BDB-2063-launch-cloudformation-stack

Provide a stack name and leave the default settings for everything else. Wait for the stack to display Create Complete (this should only take a few minutes) before moving on to the other sections.

The template creates the following resources:

  • A virtual private cloud (VPC) with two private subnets that have routes to an Amazon S3 VPC endpoint
  • The S3 bucket {stackname}-s3bucket-{xxx}, which contains the following folders:
    • libs – Contains the JAR file to add to the notebook
    • notebooks – Contains the notebook to interactively test the code
    • data – Contains the sample data
  • An Amazon Redshift cluster, in one of the two private subnets, with a database named rs_xml_db and a schema named rs_xml
  • A secret (rs_xml_db) in AWS Secrets Manager
  • An EMR cluster

The CloudFormation template shared in this post is purely for demonstration purposes only. Please conduct your own security review and incorporate best practices prior to any production deployment using artifacts from the post.

Finally, some basic knowledge of Python and Spark DataFrames can help you review the transformation code, but isn’t mandatory to complete the example.

Understanding the sample data

In this post, we use college students’ course and subjects sample data that we created. In the source system, data consists of flat structure fields, like course_id and course_name, and an XML field that includes all the course material and subjects involved in the respective course. The following screenshot is an example of the source data, which is staged in an S3 bucket as a prerequisite step.

We can observe that the column study_material_info is an XML type field and contains nested XML tags in it. Let’s see how to convert this nested XML field to JSON in the subsequent steps.

Run a Spark job in Amazon EMR to transform the XML fields in the raw data to JSON

In this step, we use an Amazon EMR notebook, which is a managed environment to create and open Jupyter Notebook and JupyterLab interfaces. It enables you to interactively analyze and visualize data, collaborate with peers, and build applications using Apache Spark on EMR clusters. To open the notebook, follow these steps:

  1. On the Amazon S3 console, navigate to the bucket you created as a prerequisite step.
  2. Download the file in the notebooks folder.
  3. On the Amazon EMR console, choose Notebooks in the navigation pane.
  4. Choose Create notebook.
  5. For Notebook name, enter a name.
  6. For Cluster, select Choose an existing cluster.
  7. Select the cluster you created as a prerequisite.
  8. For Security Groups, choose BDB1909-EMR-LIVY-SG and BDB1909-EMR-Notebook-SG
  9. For AWS Service Role, choose the role bdb1909-emrNotebookRole-{xxx}.
  10. For Notebook location, specify the S3 path in the notebooks folder (s3://{stackname}-s3bucket-xxx}/notebooks/).
  11. Choose Create notebook.
  12. When the notebook is created, choose Open in JupyterLab.
  13. Upload the file you downloaded earlier.
  14. Open the new notebook.

    The notebook should look as shown in the following screenshot, and it contains a script written in Scala.
  15. Run the first two cells to configure Apache Spark with the open-source spark-xml library and import the needed modules.The spark-xml package allows reading XML files in local or distributed file systems as Spark DataFrames. Although primarily used to convert (portions of) large XML documents into a DataFrame, spark-xml can also parse XML in a string-valued column in an existing DataFrame with the from_xml function, in order to add it as a new column with parsed results as a struct.
  16. To do so, in the third cell, we load the data from the Parquet file generated by AWS DMS into a DataFrame, then we extract the attribute that contains the XML code (STUDY_MATERIAL_INFO) and map it to a string variable name payloadSchema.
  17. We can now use the payloadSchema in the from_xml function to convert the field STUDY_MATERIAL_INFO into a struct data type and added it as a column named course_material in a new DataFrame parsed.
  18. Finally, we can drop the original field and write the parsed DataFrame to our curated zone in Amazon S3.

Due to the structure differences between DataFrame and XML, there are some conversion rules from XML data to DataFrame and from DataFrame to XML data. More details and documentation are available XML Data Source for Apache Spark.

When we convert from XML to DataFrame, attributes are converted as fields with the heading prefix attributePrefix (underscore (_) is the default). For example, see the following code:

  <book category="undergraduate">
    <title lang="en">Introduction to Biology</title>
    <author>Demo Author 1</author>
    <year>2005</year>
    <price>30.00</price>
  </book>

It produces the following schema:

root
 |-- category: string (nullable = true)
 |-- title: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _lang: string (nullable = true)
 |-- author: string (nullable = true)
 |-- year: string (nullable = true)
 |-- price: string (nullable = true)

Next, we have a value in an element that has no child elements but attributes. The value is put in a separate field, valueTag. See the following code:

<title lang="en">Introduction to Biology</title>

It produces the following schema, and the tag lang is converted into the _lang field inside the DataFrame:

|-- title: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _lang: string (nullable = true)

Copy curated data into Amazon Redshift and query tables seamlessly

Because our semi-structured nested dataset is already written in the S3 bucket as Apache Parquet formatted files, we can use the COPY command with the SERIALIZETOJSON option to ingest data into Amazon Redshift. The Amazon Redshift table structure should match the metadata of the Parquet files. Amazon Redshift can replace any Parquet columns, including structure and array types, with SUPER data columns.

The following code demonstrates CREATE TABLE example to create a staging table.

create table rs_xml_db.public.stg_edw_course_catalog 
(
course_id bigint,
course_name character varying(5000),
course_material super
);

The following code uses the COPY example to load from Parquet format:

COPY rs_xml_db.public.stg_edw_course_catalog FROM 's3://<<your Amazon S3 Bucket for curated data>>/data/target/<<your output parquet file>>' 
IAM_ROLE '<<your IAM role>>' 
FORMAT PARQUET SERIALIZETOJSON; 

By using semistructured data support in Amazon Redshift, you can ingest and store semistructured data in your Amazon Redshift data warehouses. With the SUPER data type and PartiQL language, Amazon Redshift expands the data warehouse capability to integrate with both SQL and NoSQL data sources. The SUPER data type only supports up to 1 MB of data for an individual SUPER field or object. Note, the JSON object may be stored in a SUPER data type, but reading this data using JSON functions currently has a VARCHAR (65535 byte) limit. See Limitations for more details.

The following example shows how nested JSON can be easily accessed using SELECT statements:

SELECT DISTINCT bk._category
	,bk.author
	,bk.price
	,bk.year
	,bk.title._lang
FROM rs_xml_db.public.stg_edw_course_catalog main
INNER JOIN main.course_material.book bk ON true;

The following screenshot shows our results.

Clean up

To avoid incurring future charges, first delete the notebook and the related files on Amazon S3 bucket as explained in this EMR documentation page then the CloudFormation stack.

Conclusion

This post demonstrated how to use AWS services like AWS DMS, Amazon S3, Amazon EMR, and Amazon Redshift to seamlessly work with complex data types like XML and perform historical migrations when building a cloud data lake house on AWS. We encourage you to try this solution and take advantage of all the benefits of these purpose-built services.

If you have questions or suggestions, please leave a comment.


About the authors

Abhilash Nagilla is a Sr. Specialist Solutions Architect at AWS, helping public sector customers on their cloud journey with a focus on AWS analytics services. Outside of work, Abhilash enjoys learning new technologies, watching movies, and visiting new places.

Avinash Makey is a Specialist Solutions Architect at AWS. He helps customers with data and analytics solutions in AWS. Outside of work he plays cricket, tennis and volleyball in free time.

Fabrizio Napolitano is a Senior Specialist SA for DB and Analytics. He has worked in the analytics space for the last 20 years, and has recently and quite by surprise become a Hockey Dad after moving to Canada.

A multi-dimensional approach helps you proactively prepare for failures, Part 2: Infrastructure layer

Post Syndicated from Piyali Kamra original https://aws.amazon.com/blogs/architecture/a-multi-dimensional-approach-helps-you-proactively-prepare-for-failures-part-2-infrastructure-layer/

Distributed applications resiliency is a cumulative resiliency of applications, infrastructure, and operational processes. Part 1 of this series explored application layer resiliency. In Part 2, we discuss how using Amazon Web Services (AWS) managed services, redundancy, high availability, and infrastructure failover patterns based on recovery time and point objectives (RTO and RPO, respectively) can help in building more resilient infrastructures.

Pattern 1: Recognize high impact/likelihood infrastructure failures

To ensure cloud infrastructure resilience, we need to understand the likelihood and impact of various infrastructure failures, so we can mitigate them. Figure 1 illustrates that most of the failures with high likelihood happen because of operator error or poor deployments.

Automated testing, automated deployments, and solid design patterns can mitigate these failures. There could be datacenter failures—like whole rack failures—but deploying applications using auto scaling and multi-availability zone (multi-AZ) deployment, plus resilient AWS cloud native services, can mitigate the impact.

Likelihood and impact of failure events

Figure 1. Likelihood and impact of failure events

As demonstrated in the Figure 1, infrastructure resiliency is a combination of high availability (HA) and disaster recovery (DR). HA involves increasing the availability of the system by implementing redundancy among the application components and removing single points of failure.

Application layer decisions, like creating stateless applications, make it simpler to implement HA at the infrastructure layer by allowing it to scale using Auto Scaling groups and distributing the redundant applications across multiple AZs.

Pattern 2: Understanding and controlling infrastructure failures

Building a resilient infrastructure requires understanding which infrastructure failures are under control and which ones are not, as demonstrated in Figure 2.

These insights allow us to automate the detection of failures, control them, and employ pro-active patterns, such as static stability, to mitigate the need to scale up the infrastructure by over-provisioning it in advance.

Proactively designing systems in the event of failure

Figure 2. Proactively designing systems in the event of failure

The infrastructure decisions under our control that can increase the infrastructure resiliency of our system, include:

  • AWS services have control and data planes designed for minimum blast radius. Data planes typically have higher availability design goals than control planes and are usually less complex. When implementing recovery or mitigation responses to events that can affect resiliency, using control plane operations can lower the overall resiliency of your architectures. For example, Amazon Route 53 (Route 53) has a data plane designed for a 100% availability SLA. A good fail-over mechanism should rely on the data plane and not the control plane, as explained in Creating Disaster Recovery Mechanisms Using Amazon Route 53.
  • Understanding networking design and routes implemented in a virtual private cloud (VPC) are critical when testing the flow of traffic in our application. Understanding the flow of traffic helps us design better applications and see how one component failure can affect overall ingress/egress traffic. To achieve better network resiliency, it’s important to implement a good subnet strategy and manage our IP addresses to avoid fail-over issues and asymmetric routing in hybrid architectures. Use IP address management tools for established subnet strategies and routing decisions.
  • When designing VPCs and AZs, understanding the service limits, deploying independent routing tables and components in each zone increases availability. For example, highly available NAT gateways are preferred over NAT instances, as noted in the comparison provided in the Amazon VPC documentation.

Pattern 3: Considering different ways of increasing HA at the infrastructure layer

As already detailed, infrastructure resiliency = HA + DR.

Different ways by which system availability can be increased include:

  • Building for redundancy: Redundancy is the duplication of application components to increase the overall availability of the distributed system. After following application layer best practices, we can build auto healing mechanisms at the infrastructure layer.

We can take advantage of auto scaling features and use Amazon CloudWatch metrics and alarms to set up auto scaling triggers and deploy redundant copies of our applications across multiple AZs. This protects workloads from AZ failures, as shown in Figure 3.

Redundancy increases availability

Figure 3. Redundancy increases availability

  • Auto scale your infrastructure: When there are AZ failures, infrastructure auto scaling maintains the desired number of redundant components, which helps maintain the base level application throughput. This way, HA system and manage costs are maintained. Auto scaling uses metrics to scale in and out, appropriately, as shown in Figure 4.
How auto scaling improves availability

Figure 4. How auto scaling improves availability

  • Implement resilient network connectivity patterns: While building highly resilient distributed systems, network access to AWS infrastructure also needs to be highly resilient. While deploying hybrid applications, the capacity needed for hybrid applications to communicate with their cloud native application counterparts is an important consideration in designing the network access using AWS Direct Connect or VPNs.

Testing failover and fallback scenarios helps validate that network paths operate as expected and routes fail over as expected to meet RTO objectives. As the number of connection points between the data center and AWS VPCs increases, a hub and spoke configuration provided by the Direct Connect gateway and transit gateways simplify network topology, testing, and fail over. For more information, visit the AWS Direct Connect Resiliency Recommendations.

  • Whenever possible, use the AWS networking backbone to increase security, resiliency, and lower cost. AWS PrivateLink provides secure access to AWS services and exposes the application’s functionalities and APIs to other business units or partner accounts hosted on AWS.
  • Security appliances need to be set up in HA configuration, so that even if one AZ is unavailable, security inspection can be taken over by the redundant appliances in the other AZs.
  • Think ahead about DNS resolution: DNS is a critical infrastructure component; hybrid DNS resolution should be designed carefully with Route 53 HA inbound and outbound resolver endpoints instead of using self-managed proxies.

Implement a good strategy to share DNS resolver rules across AWS accounts and VPC’s with Resource Access Manager. Network failover tests are an important part of Disaster Recovery and Business Continuity Plans. To learn more, visit Set up integrated DNS resolution for hybrid networks in Amazon Route 53.

Additionally, ELB uses health checks to make sure that requests will route to another component if the underlying traffic application component fails. This improves the distributed system’s availability, as it is the cumulative availability of all different layers in our system. Figure 5 details advantages of some AWS managed services.

AWS managed services help in building resilient infrastructures (click the image to enlarge)

Figure 5. AWS managed services help in building resilient infrastructures (click the image to enlarge)

Pattern 4: Use RTO and RPO requirements to determine the correct failover strategy for your application

Capture RTO and RPO requirements early on to determine solid failover strategies (Figure 6). Disaster recovery strategies within AWS range from low cost and complexity (like backup and restore), to more complex strategies when lower values of RTO and RPO are required.

In pilot light and warm standby, only the primary region receives traffic. Pilot light only critical infrastructure components run in the backup region. Automation is used to check failures in the primary region using health checks and other metrics.

When health checks fail, use a combination of auto scaling groups, automation, and Infrastructure as Code (IaC) for quick deployment of other infrastructure components.

Note: This strategy depends on control plane availability in the secondary region for deploying the resources; keep this point in mind if you don’t have compute pre-provisioned in the secondary region. Carefully consider the business requirements and a distributed system’s application-level characteristics before deciding on a failover strategy. To understand all the factors and complexities involved in each of these disaster recovery strategies refer to disaster recovery options in the cloud.

Relationship between RTO, RPO, cost, data loss, and length of service interruption

Figure 6. Relationship between RTO, RPO, cost, data loss, and length of service interruption

Conclusion

In Part 2 of this series, we discovered that infrastructure resiliency is a combination of HA and DR. It is important to consider likelihood and impact of different failure events on availability requirements. Building in application layer resiliency patterns (Part 1 of this series), along with early discovery of the RTO/RPO requirements, as well as operational and process resiliency of an organization helps in choosing the right managed services and putting in place the appropriate failover strategies for distributed systems.

It’s important to differentiate between normal and abnormal load threshold for applications in order to put automation, alerts, and alarms in place. This allows us to auto scale our infrastructure for normal expected load, plus implement corrective action and automation to root out issues in case of abnormal load. Use IaC for quick failover and test failover processes.

Stay tuned for Part 3, in which we discuss operational resiliency!

Happy 10th Anniversary, Amazon S3 Glacier – A Decade of Cold Storage in the Cloud

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/happy-10th-anniversary-amazon-s3-glacier-a-decade-of-cold-storage-in-the-cloud/

Ten years ago, on August 20, 2012, AWS announced the general availability of Amazon Glacier, secure, reliable, and extremely low-cost storage designed for data archiving and backup. At the time, I was working as an AWS customer and it felt like an April Fools’ joke, offering long-term, secure, and durable cloud storage that allowed me to archive large amounts of data at a very low cost.

In Jeff’s original blog post for this launch, he noted that:

Glacier provides, at a cost as low as $0.01 (one US penny, one one-hundredth of a dollar) per Gigabyte per month, extremely low-cost archive storage. You can store a little bit, or you can store a lot (terabytes, petabytes, and beyond). There’s no upfront fee, and you pay only for the storage that you use. You don’t have to worry about capacity planning, and you will never run out of storage space.

Ten years later, Amazon S3 Glacier has evolved to be the best place in the world for you to store your archive data. The Amazon S3 Glacier storage classes are purpose-built for data archiving, providing you with the highest performance, most retrieval flexibility, and the lowest cost archive storage in the cloud.

You can now choose from three archive storage classes optimized for different access patterns and storage duration – Amazon S3 Glacier Instant Retrieval, Amazon S3 Glacier Flexible Retrieval (formerly Amazon S3 Glacier), and Amazon S3 Glacier Deep Archive. We’ll dive into each of these storage classes in a bit.

A Decade of Innovation in Amazon S3 Glacier
To understand how we got here, we’ll walk through through the last decade and revisit some of the most significant Amazon S3 Glacier launches that fundamentally changed archive storage forever:

August 2012 – Amazon Glacier: Archival Storage for One Penny per GB per Month
We launched Amazon Glacier to store any amount of data with high durability at a cost that allows you to get rid of your tape libraries and all the operational complexity and overhead that have been part of data archiving for decades. Amazon Glacier was modeled on S3’s durability and dependability but designed and built from the ground up to offer an archival storage to you at an extremely low cost. At that time, Glacier introduced the concept of a “vault” for storing archival data. You could then easily retrieve your archival data by initiating a request and then the data was made available to you for download in 3–5 hours.

November 2012 – Archiving Amazon S3 Data to Glacier
While Glacier was purpose-built from the ground up for archival data, many customers had object data that originated in S3 warmer storage that they would eventually want to move to colder storage. To make that easy for customers, Amazon S3’s Lifecycle Management (aka Lifecycle Rule) integrated S3 and Glacier and made the details visible via the storage class of each object. Lifecycle Management allows you to define time-based rules that can start Transition (changing S3 storage class to Glacier) and Expiration (deletion of objects). In 2014, we combined the flexibility of S3 versioned objects with Glacier, helping you to further reduce your overall storage costs.

November 2016 – Glacier Price Reductions and Additional Retrieval Options for Glacier
As part of AWS’s long-term focus on reducing costs and passing along those savings to customers, we reduced the price of Glacier storage to $0.004 (less than half a cent) in the case of 1 GB for 1 month in the US East (N. Virginia) Region, from $0.007 in 2015 and $0.010 in 2012. With storing data at a very low cost but having flexibility in how quickly they can retrieve the data, we introduced two more options for data retrieval that were based on the amount of data that you stored in Glacier and the rate at which you retrieved it. You could select expedited retrieval (typically taking 1–5 minutes), bulk retrieval (5–12 hours), or the existing standard retrieval method (3–5 hours).

November 2018 – Amazon S3 Glacier Storage Class to Integrate S3 Experiences
Glacier customers appreciated the way they could easily move data from S3 to Glacier via S3 lifecycle management, and wanted us to expand on that capability to use the most common S3 APIs to operate directly on S3 Glacier objects. So, we added S3 PUT API to S3 Glacier, which enables you to use the standard S3 PUT API and select any storage class, including S3 Glacier, to store the data. Data can be stored directly in S3 Glacier, eliminating the need to upload to S3 Standard and immediately transition to S3 Glacier with a zero-day lifecycle policy. So, you could PUT to S3 Glacier like any other S3 storage class.

March 2019 – Amazon S3 Glacier Deep Archive – the Lowest Cost Storage in the Cloud
While the original Glacier service offered an extremely low price for archival storage, we challenged ourselves to see if we could find a way to invent an even lower priced storage offering for very cold data. The Amazon S3 Glacier Deep Archive storage class delivers the lowest cost storage, up to 75 percent lower cost (than S3 Glacier Flexible Retrieval), for long-lived archive data that is accessed less than once per year and is retrieved asynchronously. At just $0.00099 per GB-month (or $1 per TB-month), S3 Glacier Deep Archive offers the lowest cost storage in the cloud at prices significantly lower than storing and maintaining data in on-premises tape or archiving data off-site.

November 2020 – Amazon S3 Intelligent-Tiering adds Archive Access and Deep Archive Access tiers
In November 2018, we launched Amazon S3 Intelligent-Tiering, the only cloud storage class that delivers automatic storage cost savings, up to 95 percent when data access patterns change, without performance impact or operational overhead. In order to offer customers the simplicity and flexibility of S3 Intelligent-Tiering and the low storage cost of archival data, we added the Archive Access tier providing the same performance and pricing as the S3 Glacier storage class as well as the Deep Archive Access tier which offers the same performance and pricing as the S3 Glacier Deep Archive storage class.

November 2021 – Amazon S3 Glacier Flexible Retrieval and S3 Glacier Instant Retrieval
The Amazon S3 Glacier storage class was renamed to Amazon S3 Glacier Flexible Retrieval and now includes free bulk retrievals along with an additional 10 percent price reduction across all Regions, making it optimized for use cases such as backup and disaster recovery.

Additionally, customers asked us for a storage solution that had the low costs of Glacier but allowed for fast access when data was needed very quickly. So, we introduced Amazon S3 Glacier Instant Retrieval, a new archive storage class that delivers the lowest cost storage for long-lived data that is rarely accessed and requires milliseconds retrieval. You can save up to 68 percent on storage costs compared to using the S3 Standard-Infrequent Access (S3 Standard-IA) storage class when your data is accessed once per quarter.

The Amazon S3 Intelligent-Tiering storage class also recently added a new Archive Instant Access tier, providing the same performance and pricing as the S3 Glacier Instant Retrieval storage class which delivers automatic 68% cost savings for customers using S3 Intelligent-Tiering with long-lived data.

Then and Now
Customers across all industries and verticals use the S3 Glacier storage classes for every imaginable archival workload. Accessing and using the S3 Glacier storage classes through the S3 APIs and S3 console provides enhanced functionality for data management and cost optimization.

As we discussed above, you can now choose from three archive storage classes optimized for different access patterns and storage duration:

  • S3 Glacier Instant Retrieval – For archive data that needs immediate access, such as medical images, news media assets, or genomics data, choose the S3 Glacier Instant Retrieval storage class, an archive storage class that delivers the lowest cost storage with milliseconds retrieval.
  • S3 Glacier Flexible Retrieval – For archive data that does not require immediate access but needs to have the flexibility to retrieve large sets of data at no cost, such as backup or disaster recovery use cases, choose the S3 Glacier Flexible Retrieval storage class, with retrieval in minutes or free bulk retrievals in 12 hours.
  • S3 Glacier Deep Archive – For retaining data for 7–10 years or longer to meet customer needs and regulatory compliance requirements, such as financial services, healthcare, media and entertainment, and public sector, choose the S3 Glacier Deep Archive storage class, the lowest cost storage in the cloud with data retrieval within 12–48 hours.

Watch a brief introduction video for an overview of the S3 Glacier storage classes.

All S3 Glacier storage classes are designed for 99.999999999% (11 9s) of durability for objects. Data is redundantly stored across three or more Availability Zones that are physically separated within an AWS Region. Here are some comparisons across the S3 Glacier storage classes at a glance:

Performances S3 Glacier
Instant Retrieval
S3 Glacier
Flexible Retrieval
S3 Glacier
Deep Archive
Availability 99.9% 99.99% 99.99%
Availability SLA 99% 99.9% 99.9%
Minimum capacity charge per object 128 KB 40 KB 40 KB
Minimum storage duration charge 90 days 90 days 180 days
Retrieval charge per GB per GB per GB
Retrieval time milliseconds Expedited (1–5 minutes),
Standard (3–5 hours),
Bulk (5–12 hours) free
Standard (within 12 hours),
Bulk (within 48 hours)

For data with changing access patterns that you want to automatically archive based on the last access of that data, choose the S3 Intelligent-Tiering storage class. Doing so will optimize storage costs by automatically moving data to the most cost-effective access tier when access patterns change. Its Archive Instant Access, Archive Access, and Deep Archive Access tiers have the same performance as S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, and S3 Glacier Deep Archive respectively. To learn more, see the blog post Automatically archive and restore data with Amazon S3 Intelligent-Tiering.

To get started with S3 Glacier, see the blog post Best practices for archiving large datasets with AWS for key considerations and actions when planning your cold data storage patterns. You can also use a hands-on lab tutorial that will help you get started with the S3 Glacier storage classes in just 20 minutes, and start archiving your data in the S3 Glacier storage classes in the S3 console.

Happy Birthday, Amazon S3 Glacier!
During the last AWS Storage Day 2022, Kevin Miller, VP & GM of Amazon S3, mentioned the 10th anniversary of S3 Glacier and its pace of innovation for many customer use cases throughout his interview with theCUBE.

In this expanding world of data growth, you have to have an archiving strategy. Everyone has archival data — every company, every vertical, and every industry. There is an archiving need not only for companies that have been around for a while but also for digital native businesses.

Lots of AWS customers such as Nasdaq, Electronic Arts, and NASCAR have used S3 Glacier storage classes for their backup and archiving workloads. The following are some additional recent customer-authored blogs focusing on AWS archiving best practices from customers in the financial, media, gaming, and software industries.

A big thank you to all of our S3 Glacier customers from around the world! Over 90 percent of S3’s roadmap has come directly from feedback from customers like you. We will never stop listening to you, as your feedback and ideas are essential to how we improve the service. Thank you for trusting us and for constantly raising the bar and pushing us to improve to lower costs, simplify your storage, increase your agility, and allow you to innovate faster.

In accordance with Customer Obsession, one of the Amazon Leadership Principles, your feedback is always welcome! If you want to see new S3 Glacier features and capabilities, please send any feedback to AWS re:Post for S3 Glacier or through your usual AWS Support contacts.

– Channy

Welcome to AWS Storage Day 2022

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/welcome-to-aws-storage-day-2022/

We are on the fourth year of our annual AWS Storage Day! Do you remember our first Storage Day 2019 and the subsequent Storage Day 2020? I watched Storage Day 2021, which was streamed live from downtown Seattle. We continue to hear from our customers about how powerful the Storage Day announcements and educational sessions were. With this year’s lineup, we aim to share our insights on how to protect your data and put it to work. The free Storage Day 2022 virtual event is happening now on the AWS Twitch channel. Tune in to hear from experts about new announcements, leadership insights, and educational content related to the broad portfolio of AWS Storage services.

Our customers are looking to reduce and optimize storage costs, while building the cloud storage skills they need for themselves and for their organizations. Furthermore, our customers want to protect their data for resiliency and put their data to work. In this blog post, you will find our insights and announcements that address all these needs and more.

Let’s get into it…

Protect Your Data
Data protection has become an operational model to deliver the resiliency of applications and the data they rely on. Organizations use the National Institute of Standards and Technology (NIST) cybersecurity framework and its Identify->Protect->Detect->Respond->Recover process to approach data protection overall. It’s necessary to consider data resiliency and recovery upfront in the Identify and Protect functions, so there is a plan in place for the later Respond and Recover functions.

AWS is making data resiliency, including malware-type recovery, table stakes for our customers. Many of our customers use Amazon Elastic Block Store (Amazon EBS) for mission-critical applications. If you already use Amazon EBS and you regularly back up EBS volumes using EBS multi-volume snapshots, I have an announcement that you will find very exciting.

Amazon EBS
Amazon EBS scales fast for the most demanding, high-performance workloads, and this is why our customers trust Amazon EBS for critical applications such as SAP, Oracle, and Microsoft. Currently, Amazon EBS enables you to back up volumes at any time using EBS Snapshots. Snapshots retain the data from all completed I/O operations, allowing you to restore the volume to its exact state at the moment before backup.

Many of our customers use snapshots in their backup and disaster recovery plans. A common use case for snapshots is to create a backup of a critical workload such as a large database or file system. You can choose to create snapshots of each EBS volume individually or choose to create multi-volume snapshots of the EBS volumes attached to a single Amazon Elastic Compute Cloud (EC2) instance. Our customers love the simplicity and peace of mind that comes with regularly backing up EBS volumes attached to a single EC2 instance using EBS multi-volume snapshots, and today we’re announcing a new feature—crash consistent snapshots for a subset of EBS volumes.

Previously, when you wanted to create multi-volume snapshots of EBS volumes attached to a single Amazon EC2 instance, if you only wanted to include some—but not all—attached EBS volumes, you had to make multiple API calls to keep only the snapshots you wanted. Now, you can choose specific volumes you want to exclude in the create-snapshots process using a single API call or by using the Amazon EC2 console, resulting in significant cost savings. Crash consistent snapshots for a subset of EBS volumes is also supported by Amazon Data Lifecycle Manager policies to automate the lifecycle of your multi-volume snapshots.

This feature is now available to you at no additional cost. To learn more, please visit the EBS Snapshots user guide.

Put Your Data to Work
We give you controls and tools to get the greatest value from your data—at an organizational level down to the individual data worker and scientist. Decisions you make today will have a long-lasting impact on your ability to put your data to work. Consider your own pace of innovation and make sure you have a cloud provider that will be there for you no matter what the future brings. AWS Storage provides the best cloud for your traditional and modern applications. We support data lakes in AWS Storage, analytics, machine learning (ML), and streaming on top of that data, and we also make cloud benefits available at the edge.

Amazon File Cache (Coming Soon)
Today we are also announcing Amazon File Cache, an upcoming new service on AWS that accelerates and simplifies hybrid cloud workloads. Amazon File Cache provides a high-speed cache on AWS that makes it easier for you to process file data, regardless of where the data is stored. Amazon File Cache serves as a temporary, high-performance storage location for your data stored in on-premises file servers or in file systems or object stores in AWS.

This new service enables you to make dispersed data sets available to file-based applications on AWS with a unified view and at high speeds with sub-millisecond latencies and up to hundreds of GB/s of throughput. Amazon File Cache is designed to enable a wide variety of cloud bursting workloads and hybrid workflows, ranging from media rendering and transcoding, to electronic design automation (EDA), to big data analytics.

Amazon File Cache will be generally available later this year. If you are interested in learning more about this service, please sign up for more information.

AWS Transfer Family
During Storage Day 2020, we announced that customers could deploy AWS Transfer Family server endpoints in Amazon Virtual Private Clouds (Amazon VPCs). AWS Transfer Family helps our customers easily manage and share data with simple, secure, and scalable file transfers. With Transfer Family, you can seamlessly migrate, automate, and monitor your file transfer workflows into and out of Amazon S3 and Amazon Elastic File System (Amazon EFS) using the SFTP, FTPS, and FTP protocols. Exchanged data is natively accessible in AWS for processing, analysis, and machine learning, as well as for integrations with business applications running on AWS.

On July 26th of this year, Transfer Family launched support for the Applicability Statement 2 (AS2) protocol. Customers across verticals such as healthcare and life sciences, retail, financial services, and insurance that rely on AS2 for exchanging business-critical data can now use AWS Transfer Family’s highly available, scalable, and globally available AS2 endpoints to more cost-effectively and securely exchange transactional data with their trading partners.

With a focus on helping you work with partners of your choice, we are excited to announce the AWS Transfer Family Delivery Program as part of the AWS Partner Network (APN) Service Delivery Program (SDP). Partners that deliver cloud-native Managed File Transfer (MFT) and business-to-business (B2B) file exchange solutions using AWS Transfer Family are welcome to join the program. Partners in this program meet a high bar, with deep technical knowledge, experience, and proven success in delivering Transfer Family solutions to our customers.

Five New AWS Storage Learning Badges
Earlier I talked about how our customers are looking to add the cloud storage skills they need for themselves and for their organizations. Currently, storage administrators and practitioners don’t have an easy way of externally demonstrating their AWS storage knowledge and skills. Organizations seeking skilled talent also lack an easy way of validating these skills for prospective employees.

In February 2022, we announced digital badges aligned to Learning Plans for Block Storage and Object Storage on AWS Skill Builder. Today, we’re announcing five additional storage learning badges. Three of these digital badges align to the Skill Builder Learning Plans in English for File, Data Protection & Disaster Recovery (DPDR), and Data Migration. Two of these badges—Core and Technologist—are tiered badges that are awarded to individuals who earn a series of Learning Plan-related badges in the following progression:

Image showing badge progression. To get the Storage Core badge users must first get Block, File, and Object badges. To get the Storage Technologist Badge users must first get the Core, Data Protection & Disaster Recovery, and Data Migration badges.

To learn more, please visit the AWS Learning Badges page.

Well, That’s It!
As I’m sure you’ve picked up on the pattern already, today’s announcements focused on continuous innovation and AWS’s ongoing commitment to providing the cloud storage training that your teams are looking for. Best of all, this AWS training is free. These announcements also focused on simplifying your data migration to the cloud, protecting your data, putting your data to work, and cost-optimization.

Now Join Us Online
Register for free and join us for the AWS Storage Day 2022 virtual event on the AWS channel on Twitch. The event will be live from 9:00 AM Pacific Time (12:00 PM Eastern Time) on August 10. All sessions will be available on demand approximately 2 days after Storage Day.

We look forward to seeing you on Twitch!

– Veliswa x

Coordinating large messages across accounts and Regions with Amazon SNS and SQS

Post Syndicated from Mrudhula Balasubramanyan original https://aws.amazon.com/blogs/architecture/coordinating-large-messages-across-accounts-and-regions-with-amazon-sns-and-sqs/

Many organizations have applications distributed across various business units. Teams in these business units may develop their applications independent of each other to serve their individual business needs. Applications can reside in a single Amazon Web Services (AWS) account or be distributed across multiple accounts. Applications may be deployed to a single AWS Region or span multiple Regions.

Irrespective of how the applications are owned and operated, these applications need to communicate with each other. Within an organization, applications tend to be part of a larger system, therefore, communication and coordination among these individual applications is critical to overall operation.

There are a number of ways to enable coordination among component applications. It can be done either synchronously or asynchronously:

  • Synchronous communication uses a traditional request-response model, in which the applications exchange information in a tightly coupled fashion, introducing multiple points of potential failure.
  • Asynchronous communication uses an event-driven model, in which the applications exchange messages as events or state changes and are loosely coupled. Loose coupling allows applications to evolve independently of each other, increasing scalability and fault-tolerance in the overall system.

Event-driven architectures use a publisher-subscriber model, in which events are emitted by the publisher and consumed by one or more subscribers.

A key consideration when implementing an event-driven architecture is the size of the messages or events that are exchanged. How can you implement an event-driven architecture for large messages, beyond the default maximum of the services? How can you architect messaging and automation of applications across AWS accounts and Regions?

This blog presents architectures for enhancing event-driven models to exchange large messages. These architectures depict how to coordinate applications across AWS accounts and Regions.

Challenge with application coordination

A challenge with application coordination is exchanging large messages. For the purposes of this post, a large message is defined as an event payload between 256 KB and 2 GB. This stems from the fact that Amazon Simple Notification Service (Amazon SNS) and Amazon Simple Queue Service (Amazon SQS) currently have a maximum event payload size of 256 KB. To exchange messages larger than 256 KB, an intermediate data store must be used.

To exchange messages across AWS accounts and Regions, set up the publisher access policy to allow subscriber applications in other accounts and Regions. In the case of large messages, also set up a central data repository and provide access to subscribers.

Figure 1 depicts a basic schematic of applications distributed across accounts communicating asynchronously as part of a larger enterprise application.

Asynchronous communication across applications

Figure 1. Asynchronous communication across applications

Architecture overview

The overview covers two scenarios:

  1. Coordination of applications distributed across AWS accounts and deployed in the same Region
  2. Coordination of applications distributed across AWS accounts and deployed to different Regions

Coordination across accounts and single AWS Region

Figure 2 represents an event-driven architecture, in which applications are distributed across AWS Accounts A, B, and C. The applications are all deployed to the same AWS Region, us-east-1. A single Region simplifies the architecture, so you can focus on application coordination across AWS accounts.

Application coordination across accounts and single AWS Region

Figure 2. Application coordination across accounts and single AWS Region

The application in Account A (Application A) is implemented as an AWS Lambda function. This application communicates with the applications in Accounts B and C. The application in Account B is launched with AWS Step Functions (Application B), and the application in Account C runs on Amazon Elastic Container Service (Application C).

In this scenario, Applications B and C need information from upstream Application A. Application A publishes this information as an event, and Applications B and C subscribe to an SNS topic to receive the events. However, since they are in other accounts, you must define an access policy to control who can access the SNS topic. You can use sample Amazon SNS access policies to craft your own.

If the event payload is in the 256 KB to 2 GB range, you can use Amazon Simple Storage Service (Amazon S3) as the intermediate data store for your payload. Application A uses the Amazon SNS Extended Client Library for Java to upload the payload to an S3 bucket and publish a message to an SNS topic, with a reference to the stored S3 object. The message containing the metadata must be within the SNS maximum message limit of 256 KB. Amazon EventBridge is used for routing events and handling authentication.

The subscriber Applications B and C need to de-reference and retrieve the payloads from Amazon S3. The SQS queue in Account B and Lambda function in Account C subscribe to the SNS topic in Account A. In Account B, a Lambda function is used to poll the SQS queue and read the message with the metadata. The Lambda function uses the Amazon SQS Extended Client Library for Java to retrieve the S3 object referenced in the message.

The Lambda function in Account C uses the Payload Offloading Java Common Library for AWS to get the referenced S3 object.

Once the S3 object is retrieved, the Lambda functions in Accounts B and C process the data and pass on the information to downstream applications.

This architecture uses Amazon SQS and Lambda as subscribers because they provide libraries that support offloading large payloads to Amazon S3. However, you can use any Java-enabled endpoint, such as an HTTPS endpoint that uses Payload Offloading Java Common Library for AWS to de-reference the message content.

Coordination across accounts and multiple AWS Regions

Sometimes applications are spread across AWS Regions, leading to increased latency in coordination. For existing applications, it could take substantive effort to consolidate to a single Region. Hence, asynchronous coordination would be a good fit for this scenario. Figure 3 expands on the architecture presented earlier to include multiple AWS Regions.

Application coordination across accounts and multiple AWS Regions

Figure 3. Application coordination across accounts and multiple AWS Regions

The Lambda function in Account C is in the same Region as the upstream application in Account A, but the Lambda function in Account B is in a different Region. These functions must retrieve the payload from the S3 bucket in Account A.

To provide access, configure the AWS Lambda execution role with the appropriate permissions. Make sure that the S3 bucket policy allows access to the Lambda functions from Accounts B and C.

Considerations

For variable message sizes, you can specify if payloads are always stored in Amazon S3 regardless of their size, which can help simplify the design.

If the application that publishes/subscribes large messages is implemented using the AWS Java SDK, it must be Java 8 or higher. Service-specific client libraries are also available in Python, C#, and Node.js.

An Amazon S3 Multi-Region Access Point can be an alternative to a centralized bucket for the payloads. It has not been explored in this post due to the asynchronous nature of cross-region replication.

In general, retrieval of data across Regions is slower than in the same Region. For faster retrieval, workloads should be run in the same AWS Region.

Conclusion

This post demonstrates how to use event-driven architectures for coordinating applications that need to exchange large messages across AWS accounts and Regions. The messaging and automation are enabled by the Payload Offloading Java Common Library for AWS and use Amazon S3 as the intermediate data store. These components can simplify the solution implementation and improve scalability, fault-tolerance, and performance of your applications.

Ready to get started? Explore SQS Large Message Handling.

Augmentation patterns to modernize a mainframe on AWS

Post Syndicated from Lewis Tang original https://aws.amazon.com/blogs/architecture/augmentation-patterns-to-modernize-a-mainframe-on-aws/

Customers with mainframes want to use Amazon Web Services (AWS) to increase agility, maximize the value of their investments, and innovate faster. On June 8, 2022, AWS announced the general availability of AWS Mainframe Modernization, a new service that makes it faster and simpler for customers to modernize mainframe-based workloads.

In this post, we discuss the common use cases and the augmentation architecture patterns that help liberate data from mainframe for modern data analytics, get rid of expensive and unsupported tape storage solutions for mainframe, build new capabilities that integrate with core mainframe workloads, and enable agile development and testing by adopting CI/CD for mainframe.

Pattern 1: Augment mainframe data retention with backup and archival on AWS

Mainframes process and generate the most business-critical data. It’s imperative to provide data protection via solutions, such as data backup, archiving, and disaster recovery. Mainframes usually use automated tape libraries—virtual tape libraries for backup and archive. These tapes need to be stored, organized, and transported to vaults and disaster recovery sites. All this can be very expensive and rigid.

There is a more cost-effective approach that helps simplify the operations of tape libraries:  leverage AWS partner tools, such as Model9, to transparently migrate the data on tape storage to AWS.

As depicted in Figure 1, mainframe data can be transferred via the secured network connection using AWS Transfer Family services or AWS DataSync to AWS cloud storage services, such as Amazon Elastic File System, Amazon Elastic Block Store, and Amazon Simple Storage Service (S3). After data is stored in AWS cloud, you can configure and move data among these services to meet with the business data processing need. Depending on data storage requirements, data storage costs can be further optimized by configuring S3 Lifecyle policies to move data among Amazon S3 storage classes. For long-term data archiving purpose, you can choose S3 Glacier storage class to achieve durability, resilience, and the optimal cost effectiveness.

Mainframe data backup and archival augmentation

Figure 1. Mainframe data backup and archival augmentation

Pattern 2: Augment mainframe with agile development and test environments including CI/CD pipeline on AWS

For any business-critical business application, a typical mainframe workload requires development and test environments to support production workloads. It’s common to see the lengthy application development lifecycle, a lack of automated testing, and an absent CI/CD pipeline with most of mainframes. Furthermore, the existing mainframe development processes and tools are outdated, as they are unable to keep up with the business pace, resulting in a growing backlog. Organizations with mainframes look for application development solutions to solve these challenges.

As demonstrated in Figure 2, AWS developer tools orchestrate code compilation, testing, and deployment among mainframe test environments. Mainframe test environments are either provided by the mainframe vendors as emulators or by AWS partners, such as Micro Focus. You can load the preferred developer tools and run an integrated development environment (IDE) from Amazon WorkSpaces or Amazon AppStream 2.0. Developers create or modify code in the IDE, and then commit and push their code to AWS CodeCommit. As soon as the code is pushed, an event is generated and triggers the pipeline in AWS CodePipeline to build the new code in a compilation environment via AWS CodeBuild. The pipeline pushes the new code to the test environment.

To optimize cost, you can scale the test environment capacity to meet needs. The tests are executed, and the test environment can be shut down when not in use. When the tests are successful, the pipeline pushes the code back to the mainframe via AWS CodeDeploy and an intermediary server. On the mainframe side, the code can go through a recompilation and final testing before being pushed to production.

You can further optimize operations and licensing cost of mainframe emulator by leveraging the managed integrated development and test environment provided by AWS Mainframe Modernization service.

Mainframe CI/CD augmentation

Figure 2. Mainframe CI/CD augmentation

Pattern 3: Augment mainframe with agile data analytics on AWS

Core business applications running on mainframes generate a lot of data throughout the years. Decades of historical business transactions and massive amounts of user data present an opportunity to develop deep business insight. By creating a data lake using the AWS big data services, you can gain faster analytics capabilities and better insight into core business data originated from mainframe applications.

Figure 3 depicts data being pulled from relational, hierarchical, or mainframe file-based data stores on mainframes. These data are presented in various formats and stored as DB2 for z/OS, VSAM, IMS DB, IDMS, DMS, or other formats. You can use AWS partners data replication and change data capture tools from AWS Marketplace or AWS cloud services, such as Amazon Managed Streaming for Apache Kafka for near real-time data streaming, Transfer Family services, and DataSync for moving data in batch from mainframes to AWS.

Once data are replicated to AWS, you can further process data using the services like AWS Lambda, or Amazon Elastic Container Service and store the processed data on various AWS storage services, such as Amazon DynamoDB, Amazon Relational Database Service, and Amazon S3.

By using AWS big data and data analytics services, such as Amazon EMR, Amazon Redshift, Amazon Athena, AWS Glue, and Amazon QuickSight, you can develop deep business insight and present flexible visuals to your customers. Read more about mainframe data integration.

Mainframe data analytics augmentation

Figure 3. Mainframe data analytics augmentation

Pattern 4: Augment mainframe with new functions and channels on AWS

Organizations with a mainframe use AWS to innovate and iterate quickly, as they often lack agility. For example, a common scenario for a bank could be providing a mobile application for customer engagements, such as supporting a marketing campaign for a new credit card.

As depicted in Figure 4, with the data replicated from mainframes to AWS cloud and analyzed by AWS big data and analytics services, new business functions can be developed on cloud-native applications by using Amazon API Gateway, AWS Lambda, and AWS Fargate. These new business applications can interact with mainframe data, and the combination can give deep business insight.

To add new innovation capabilities, with time-series data generated by the new business function applications, using Amazon Forecast can predict domain-specific metrics, such as inventory, workforce, web traffic, and finances. Amazon Lex can build virtual agents, automate informational response to customer enquiries, and improve business productivity. Adding Amazon SageMaker, you can prepare data gathered from mainframe and new business applications at scale to build, train, and deploy machine learning models for any business cases.

You can further improve customer engagement by incorporating Amazon Connect and Amazon Pinpoint to build multichannel communications.

Mainframe new functions and channels augmentation

Figure 4. Mainframe new functions and channels augmentation

Conclusion

To increase agility, maximize the value of investments, and innovate faster, organizations can adopt the patterns discussed in this post to augment mainframes by using AWS services to build resilient data protection solution, provision agile CI/CD integrated development and test environment, liberate mainframe data and developing innovation solutions for new digital customer experience. With AWS Mainframe Modernization service, you can accelerate this journey and innovate faster.

Introducing Amazon CodeWhisperer in the AWS Lambda console (In preview)

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/introducing-amazon-codewhisperer-in-the-aws-lambda-console-in-preview/

This blog post is written by Mark Richman, Senior Solutions Architect.

Today, AWS is launching a new capability to integrate the Amazon CodeWhisperer experience with the AWS Lambda console code editor.

Amazon CodeWhisperer is a machine learning (ML)–powered service that helps improve developer productivity. It generates code recommendations based on their code comments written in natural language and code.

CodeWhisperer is available as part of the AWS toolkit extensions for major IDEs, including JetBrains, Visual Studio Code, and AWS Cloud9, currently supporting Python, Java, and JavaScript. In the Lambda console, CodeWhisperer is available as a native code suggestion feature, which is the focus of this blog post.

CodeWhisperer is currently available in preview with a waitlist. This blog post explains how to request access to and activate CodeWhisperer for the Lambda console. Once activated, CodeWhisperer can make code recommendations on-demand in the Lambda code editor as you develop your function. During the preview period, developers can use CodeWhisperer at no cost.

Amazon CodeWhisperer

Amazon CodeWhisperer

Lambda is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources for you. You can trigger Lambda from over 200 AWS services and software as a service (SaaS) applications and only pay for what you use.

With Lambda, you can build your functions directly in the AWS Management Console and take advantage of CodeWhisperer integration. CodeWhisperer in the Lambda console currently supports functions using the Python and Node.js runtimes.

When writing AWS Lambda functions in the console, CodeWhisperer analyzes the code and comments, determines which cloud services and public libraries are best suited for the specified task, and recommends a code snippet directly in the source code editor. The code recommendations provided by CodeWhisperer are based on ML models trained on a variety of data sources, including Amazon and open source code. Developers can accept the recommendation or simply continue to write their own code.

Requesting CodeWhisperer access

CodeWhisperer integration with Lambda is currently available as a preview only in the N. Virginia (us-east-1) Region. To use CodeWhisperer in the Lambda console, you must first sign up to access the service in preview here or request access directly from within the Lambda console.

In the AWS Lambda console, under the Code tab, in the Code source editor, select the Tools menu, and Request Amazon CodeWhisperer Access.

Request CodeWhisperer access in Lambda console

Request CodeWhisperer access in Lambda console

You may also request access from the Preferences pane.

Request CodeWhisperer access in Lambda console preference pane

Request CodeWhisperer access in Lambda console preference pane

Selecting either of these options opens the sign-up form.

CodeWhisperer sign up form

CodeWhisperer sign up form

Enter your contact information, including your AWS account ID. This is required to enable the AWS Lambda console integration. You will receive a welcome email from the CodeWhisperer team upon once they approve your request.

Activating Amazon CodeWhisperer in the Lambda console

Once AWS enables your preview access, you must turn on the CodeWhisperer integration in the Lambda console, and configure the required permissions.

From the Tools menu, enable Amazon CodeWhisperer Code Suggestions

Enable CodeWhisperer code suggestions

Enable CodeWhisperer code suggestions

You can also enable code suggestions from the Preferences pane:

Enable CodeWhisperer code suggestions from Preferences pane

Enable CodeWhisperer code suggestions from Preferences pane

The first time you activate CodeWhisperer, you see a pop-up containing terms and conditions for using the service.

CodeWhisperer Preview Terms

CodeWhisperer Preview Terms

Read the terms and conditions and choose Accept to continue.

AWS Identity and Access Management (IAM) permissions

For CodeWhisperer to provide recommendations in the Lambda console, you must enable the proper AWS Identity and Access Management (IAM) permissions for either your IAM user or role. In addition to Lambda console editor permissions, you must add the codewhisperer:GenerateRecommendations permission.

Here is a sample IAM policy that grants a user permission to the Lambda console as well as CodeWhisperer:

{
  "Version": "2012-10-17",
  "Statement": [{
      "Sid": "LambdaConsolePermissions",
      "Effect": "Allow",
      "Action": [
        "lambda:AddPermission",
        "lambda:CreateEventSourceMapping",
        "lambda:CreateFunction",
        "lambda:DeleteEventSourceMapping",
        "lambda:GetAccountSettings",
        "lambda:GetEventSourceMapping",
        "lambda:GetFunction",
        "lambda:GetFunctionCodeSigningConfig",
        "lambda:GetFunctionConcurrency",
        "lambda:GetFunctionConfiguration",
        "lambda:InvokeFunction",
        "lambda:ListEventSourceMappings",
        "lambda:ListFunctions",
        "lambda:ListTags",
        "lambda:PutFunctionConcurrency",
        "lambda:UpdateEventSourceMapping",
        "iam:AttachRolePolicy",
        "iam:CreatePolicy",
        "iam:CreateRole",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "iam:ListAttachedRolePolicies",
        "iam:ListRolePolicies",
        "iam:ListRoles",
        "iam:PassRole",
        "iam:SimulatePrincipalPolicy"
      ],
      "Resource": "*"
    },
    {
      "Sid": "CodeWhispererPermissions",
      "Effect": "Allow",
      "Action": ["codewhisperer:GenerateRecommendations"],
      "Resource": "*"
    }
  ]
}

This example is for illustration only. It is best practice to use IAM policies to grant restrictive permissions to IAM principals to meet least privilege standards.

Demo

To activate and work with code suggestions, use the following keyboard shortcuts:

  • Manually fetch a code suggestion: Option+C (macOS), Alt+C (Windows)
  • Accept a suggestion: Tab
  • Reject a suggestion: ESC, Backspace, scroll in any direction, or keep typing and the recommendation automatically disappears.

Currently, the IDE extensions provide automatic suggestions and can show multiple suggestions. The Lambda console integration requires a manual fetch and shows a single suggestion.

Here are some common ways to use CodeWhisperer while authoring Lambda functions.

Single-line code completion

When typing single lines of code, CodeWhisperer suggests how to complete the line.

CodeWhisperer single-line completion

CodeWhisperer single-line completion

Full function generation

CodeWhisperer can generate an entire function based on your function signature or code comments. In the following example, a developer has written a function signature for reading a file from Amazon S3. CodeWhisperer then suggests a full implementation of the read_from_s3 method.

CodeWhisperer full function generation

CodeWhisperer full function generation

CodeWhisperer may include import statements as part of its suggestions, as in the previous example. As a best practice to improve performance, manually move these import statements to outside the function handler.

Generate code from comments

CodeWhisperer can also generate code from comments. The following example shows how CodeWhisperer generates code to use AWS APIs to upload files to Amazon S3. Write a comment describing the intended functionality and, on the following line, activate the CodeWhisperer suggestions. Given the context from the comment, CodeWhisperer first suggests the function signature code in its recommendation.

CodeWhisperer generate function signature code from comments

CodeWhisperer generate function signature code from comments

After you accept the function signature, CodeWhisperer suggests the rest of the function code.

CodeWhisperer generate function code from comments

CodeWhisperer generate function code from comments

When you accept the suggestion, CodeWhisperer completes the entire code block.

CodeWhisperer generates code to write to S3.

CodeWhisperer generates code to write to S3.

CodeWhisperer can help write code that accesses many other AWS services. In the following example, a code comment indicates that a function is sending a notification using Amazon Simple Notification Service (SNS). Based on this comment, CodeWhisperer suggests a function signature.

CodeWhisperer function signature for SNS

CodeWhisperer function signature for SNS

If you accept the suggested function signature. CodeWhisperer suggest a complete implementation of the send_notification function.

CodeWhisperer function send notification for SNS

CodeWhisperer function send notification for SNS

The same procedure works with Amazon DynamoDB. When writing a code comment indicating that the function is to get an item from a DynamoDB table, CodeWhisperer suggests a function signature.

CodeWhisperer DynamoDB function signature

CodeWhisperer DynamoDB function signature

When accepting the suggestion, CodeWhisperer then suggests a full code snippet to complete the implementation.

CodeWhisperer DynamoDB code snippet

CodeWhisperer DynamoDB code snippet

Once reviewing the suggestion, a common refactoring step in this example would be manually moving the references to the DynamoDB resource and table outside the get_item function.

CodeWhisperer can also recommend complex algorithm implementations, such as Insertion sort.

CodeWhisperer insertion sort.

CodeWhisperer insertion sort.

As a best practice, always test the code recommendation for completeness and correctness.

CodeWhisperer not only provides suggested code snippets when integrating with AWS APIs, but can help you implement common programming idioms, including proper error handling.

Conclusion

CodeWhisperer is a general purpose, machine learning-powered code generator that provides you with code recommendations in real time. When activated in the Lambda console, CodeWhisperer generates suggestions based on your existing code and comments, helping to accelerate your application development on AWS.

To get started, visit https://aws.amazon.com/codewhisperer/. Share your feedback with us at [email protected].

For more serverless learning resources, visit Serverless Land.

AWS Week In Review – July 11, 2022

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-week-in-review-july-11/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

In France, we know summer has started when you see the Tour de France bike race on TV or in a city nearby. This year, the tour stopped in the city where I live, and I was blocked on my way back home from a customer conference to let the race pass through.

It’s Monday today, so let’s make another tour—a tour of the AWS news, announcements, or blog posts that captured my attention last week. I selected these as being of interest to IT professionals and developers: the doers, the builders that spend their time on the AWS Management Console or in code.

Last Week’s Launches
Here are some launches that got my attention during the previous week:

Amazon EC2 Mac M1 instances are generally available – this new EC2 instance type allows you to deploy Mac mini computers with M1 Apple Silicon running macOS using the same console, API, SDK, or CLI you are used to for interacting with EC2 instances. You can start, stop them, assign a security group or an IAM role, snapshot their EBS volume, and recreate an AMI from it, just like with Linux-based or Windows-based instances. It lets iOS developers create full CI/CD pipelines in the cloud without requiring someone in your team to reinstall various combinations of macOS and Xcode versions on on-prem machines. Some of you had the chance the enter the preview program for EC2 Mac M1 instances when we announced it last December. EC2 Mac M1 instances are now generally available.

AWS IAM Roles Anywhere – this is one of those incremental changes that has the potential to unlock new use cases on the edge or on-prem. AWS IAM Roles Anywhere enables you to use IAM roles for your applications outside of AWS to access AWS APIs securely, the same way that you use IAM roles for workloads on AWS. With IAM Roles Anywhere, you can deliver short-term credentials to your on-premises servers, containers, or other compute platforms. It requires an on-prem Certificate Authority registered as a trusted source in IAM. IAM Roles Anywhere exchanges certificates issued by this CA for a set of short-term AWS credentials limited in scope by the IAM role associated to the session. To make it easy to use, we do provide a CLI-based signing helper tool that can be integrated in your CLI configuration.

A streamlined deployment experience for .NET applications – the new deployment experience focuses on the type of application you want to deploy instead of individual AWS services by providing intelligent compute recommendations. You can find it in the AWS Toolkit for Visual Studio using the new “Publish to AWS” wizard. It is also available via the .NET CLI by installing AWS Deploy Tool for .NET. Together, they help easily transition from a prototyping phase in Visual Studio to automated deployments. The new deployment experience supports ASP.NET Core, Blazor WebAssembly, console applications (such as long-lived message processing services), and tasks that need to run on a schedule.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
This week, I also learned from these blog posts:

TLS 1.2 to become the minimum TLS protocol level for all AWS API endpointsthis article was published at the end of June, and it deserves more exposure. Starting in June 2022, we will progressively transition all our API endpoints to TLS 1.2 only. The good news is that 95 percent of the API calls we observe are already using TLS 1.2, and only five percent of the applications are impacted. If you have applications developed before 2014 (using a Java JDK before version 8 or .NET before version 4.6.2), it is worth checking your app and updating them to use TLS 1.2. When we detect your application is still using TLS 1.0 or TLS 1.1, we inform you by email and in the AWS Health Dashboard. The blog article goes into detail about how to analyze AWS CloudTrail logs to detect any API call that would not use TLS 1.2.

How to implement automated appointment reminders using Amazon Connect and Amazon Pinpoint this blog post guides you through the steps to implement a system to automatically call your customers to remind them of their appointments. This automated outbound campaign for appointment reminders checked the campaign list against a “do not call” list before making an outbound call. Your customers are able to confirm automatically or reschedule by speaking to an agent. You monitor the results of the calls on a dashboard in near real time using Amazon QuickSight. It provides you with AWS CloudFormation templates for the parts that can be automated and detailed instructions for the manual steps.

Using Amazon CloudWatch metrics math to monitor and scale resources AWS Auto Scaling is one of those capabilities that may look like magic at first glance. It uses metrics to take scale-out or scale-in decisions. Most customers I talk with struggle a bit at first to define the correct combination of metrics that allow them to scale at the right moment. Scaling out too late impacts your customer experience while scaling out too early impacts your budget. This article explains how to use metric math, a way to query multiple Amazon CloudWatch metrics, and use math expressions to create new time series based on these metrics. These math metrics may, in turn, be used to trigger scaling decisions. The typical use case would be to mathematically combine CPU, memory, and network utilization metrics to decide when to scale in or to scale out.

How to use Amazon RDS and Amazon Aurora with a static IP address – in the cloud, it is better to access network resources by referencing their DNS name instead of IP addresses. IP addresses come and go as resources are stopped, restarted, scaled out, or scaled in. However, when integrating with older, more rigid environments, it might happen, for a limited period of time, to authorize access through a static IP address. You have probably heard that scary phrase: “I have to authorize your IP address in my firewall configuration.” This new blog post explains how to do so for Amazon Relational Database Service (Amazon RDS) database. It uses a Network Load Balancer and traffic forwarding at the Linux-kernel level to proxy your actual database server.

Amazon S3 Intelligent-Tiering significantly reduces storage costs – we estimate our customers saved up to $250 millions in storage costs since we launched S3 Intelligent-Tiering in 2018. A recent blog post describes how Amazon Photo, a service that provides unlimited photo storage and 5 GB of video storage to Amazon Prime members in eight marketplaces world-wide, uses S3 Intelligent-Tiering to significantly save on storage costs while storing hundreds of petabytes of content and billions of images and videos on S3.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS re:Inforce is the premier cloud security conference, July 26-27. This year it is hosted at the Boston Convention and Exhibition Center, Massachusetts, USA. The conference agenda is available and there is still time to register.

AWS Summit Chicago, August 25, at McCormick Place, Chicago, Illinois, USA. You may register now.

AWS Summit Canberra, August 31, at the National Convention Center, Canberra, Australia. Registrations are already open.

That’s all for this week. Check back next Monday for another tour of AWS news and launches!

— seb

Using AWS Backup and Oracle RMAN for backup/restore of Oracle databases on Amazon EC2: Part 2

Post Syndicated from Jeevan Shetty original https://aws.amazon.com/blogs/architecture/using-aws-backup-and-oracle-rman-for-backup-restore-of-oracle-databases-on-amazon-ec2-part-2/

Customers running Oracle databases on Amazon Elastic Compute Cloud (Amazon EC2) often take database and schema backups using Oracle native tools like Data Pump and Recovery Manager (RMAN) to satisfy data protection, disaster recovery (DR), and compliance requirements. A priority is to reduce backup time as the data grows exponentially and recover sooner in case of failure/disaster.

In Part 1 of this two-part series, we explain how we can use AWS Backup and Amazon Simple Storage Service (Amazon S3) bucket to perform the backup and restore of an Oracle Database on AWS EC2.

In Part 2, we provide a mechanism to use AWS Backup to create a full backup of the EC2 instance, including the OS image, Oracle binaries, logs, and data files. The mechanism also uses Oracle RMAN to perform archived redo log backup to Amazon Elastic File System (Amazon EFS). Then, we demonstrate the steps to restore a database to a specific point-in-time using AWS Backup and Oracle RMAN.

Solution overview

Figure 1 demonstrates the workflow:

  1. Oracle database on Amazon EC2 configured with Oracle Secure Backup.
  2. AWS Backup service to backup EC2 instance at regular intervals.
  3. Amazon EFS for storing Oracle RMAN archive log backups.
Oracle Database in Amazon EC2 using AWS Backup and EFS for backup and restore

Figure 1. Oracle Database in Amazon EC2 using AWS Backup and EFS for backup and restore

Prerequisites

  1. An AWS account
  2. Oracle database and AWS CLI in an EC2 instance
  3. Access to configure AWS Backup
  4. Access to configure EFS to store the Oracle RMAN archive log backups

1. Configure AWS Backup

Configure AWS Backup as detailed in Step 1 of Part 1.

Oracle RMAN archive log backup

While AWS Backup is now creating a daily backup of the EC2 instance, we also want to make sure we backup the archived log files to a protected location. This will let us do point-in-time restores and restore to more recent times than just the last daily EC2 backup. Below we provide the steps to backup archive log using RMAN to Amazon EFS.

Backup/restore archive logs to/from Amazon EFS

Backing up the Oracle Archive logs is an important part of the process. In this section, we will describe how you can backup their Oracle archive logs to Amazon EFS. One advantage of this option (as compared with using Oracle Secure Backup [detailed in Part 1 of this series]) is that it does not require any additional Oracle licensing.

2. Configure Amazon EFS

a. Create an Amazon EFS file system that will be used to store Oracle RMAN Archive log backups. The image below details the steps involved in creation of an Amazon EFS. Consider that a sample file system ID: fs-0123abcdef012345 is created and will be used to store RMAN archive log backup.

Figure 2. Configure Amazon EFS which is used to store Oracle RMAN archive log backups

b. Install the Amazon EFS Client and follow instructions to install EFS client on RHEL EC2 instance. Note: next steps were tested on RHEL 7.9.

sudo yum -y install git
sudo yum -y install rpm-build
git clone https://github.com/aws/efs-utils
cd efs-utils/
sudo yum -y install make
sudo make rpm
sudo yum -y install ./build/amazon-efs-utils*rpm

c. Mount the EFS file system on your EC2 instance. In this example, we show the steps to mount EFS filesystem on EC2 Instance (if the command requests to upgrade stunnel, refer to Upgrading stunnel. Ensure that the EC2 instance profile attached has necessary policies to access EFS. /rman for mount point and file system ID: fs-0123abcdef012345 are examples for EFS file system.

sudo mkdir /rman
sudo mount -t efs -o tls,iam fs-0123abcdef012345 /rman

d. To mount EFS file system automatically on EC2 instance reboot, add an entry in /etc/fstab. This example is for RHEL EC2 instance:

fs-0123abcdef012345:/      /rman        efs     _netdev,tls,iam        0 0

3. Configure RMAN backup to Amazon EFS

With Amazon EFS mounted on EC2 instance, we can configure Oracle RMAN archive log backups to EFS. In below commands oratst is used as an example of your ORACLE_SID.

a. Configure RMAN repository to take control file backup to Amazon EFS automatically.

CONFIGURE CONTROLFILE AUTOBACKUP FORMAT FOR DEVICE TYPE DISK TO '/rman/ctrl-D_%d_%F';
CONFIGURE CONTROLFILE AUTOBACKUP ON;

b. Create a script (for example, rman_archive.sh) with below commands and schedule using crontab (example entry: */5 * * * * rman_archive.sh) to run every 5 minutes. This will ensure that Oracle Archive logs are backed up to Amazon EFS (/rman) frequently, ensuring an recovery point objective (RPO) of 5 minutes.

dt=`date +%Y%m%d_%H%M%S`

rman target / log=/rman/rman_arch_bkup_oratst_${dt}.log <<EOF

RUN
{
    allocate channel c1_efs device type disk format '/rman/arch-D-%d_%T_s%s_p%p' MAXPIECESIZE 10G;
    BACKUP ARCHIVELOG ALL delete all input;
    release channel c1_efs;
}

EOF

4. Perform database point-in-time recovery

In event of a database crash/corruption, we can use AWS Backup service and Oracle RMAN archive log backup to recover database to a specific point-in-time.

a. Typically, you would pick the most recent Recovery Point completed before the time to which you wish to recover. Using AWS Backup, identify the Recovery point ID to restore by following the steps from Restoring an Amazon EC2 instance. Note: when following the steps, be sure to set the “User data” settings as described in the next bulleted item.

After the EBS volumes are created from the snapshot, there is no need to wait for all of the data to transfer from Amazon S3 to your Amazon EBS volume before your attached instance can start accessing the volume. Amazon EBS Snapshots implement lazy loading, so that you can begin using them right away.

b. Ensure that database does not start automatically after restoring the EC2 instance, by renaming /etc/oratab. Use below command in “User data” section while restoring EC2 instance. After database recovery, we can rename it back to /etc/oratab.

#!/usr/bin/sh
sudo su - 
mv /etc/oratab /etc/oratab_bk

c. Login to the EC2 instance once it is up and execute the RMAN recovery commands mentioned below. Identify the DBID from RMAN logs saved in the EFS. Below commands use database oratst as an example.

rman target /

RMAN> startup nomount

RMAN> set dbid DBID


# Below command is to restore the controlfile from autobackup

RMAN> RUN
{
    set controlfile autobackup format for device type disk to '/rman/ctrl-D_%d_%F';
    RESTORE CONTROLFILE FROM AUTOBACKUP;
    alter database mount;
}


#Identify the recovery point (sequence_number) by listing the backups available in catalog.

RMAN> list backup;

In Figure 3, the most recent archive log backed up is 460, so you can use this sequence number in the next set of RMAN commands.

RMAN> RUN
{
    allocate channel c1_efs device type disk format '/rman/arch-D-%d_%T_s%s_p%p';    
    recover database until sequence sequence_number;
    ALTER DATABASE OPEN RESETLOGS;
    release channel c1_efs;
}

Sample output of Oracle RMAN “list backup” command

Figure 3. Sample output of Oracle RMAN “list backup” command

d. To avoid performance issues due to lazy loading, after the database is open, you can run below command to force a faster restoration of the blocks from S3 bucket to EBS volumes (below example allocates two channels and validates the entire database).

RMAN> RUN
{
  ALLOCATE CHANNEL c1 DEVICE TYPE DISK;
  ALLOCATE CHANNEL c2 DEVICE TYPE DISK;
  VALIDATE database section size 1200M;
}

e. This completes the recovery of database, and we can let the database to auto start by renaming file back to /etc/oratab.

mv /etc/oratab_bk /etc/oratab

5. Backup retention

Ensure that the AWS Backup Lifecycle policy match Oracle archive log backup retention. Also, follow documentation to configure Oracle backup retention and deleting expired backup. Below is a sample command for Oracle backup retention.

CONFIGURE BACKUP OPTIMIZATION ON;
CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 31 DAYS; 

RMAN> RUN
{
    allocate channel c1_efs device type disk format '/rman/arch-D-%d_%T_s%s_p%p';

    crosscheck backup;
    delete noprompt obsolete;
    delete noprompt expired backup;
    
    release channel c1_efs;
}

Cleanup

Follow below instructions to remove or cleanup the setup:

  1. Delete the backup plan created in Step 1.
  2. Remove the cron entry from the EC2 instance configured in Step 3b.
  3. Delete the EFS that was created in Step 2 to store Oracle RMAN archive log backups.

Conclusion

In this post, we demonstrated the use for AWS Backup for EC2 snapshot and EFS as storage for Oracle RMAN archive log backups. With this strategy for backup, Oracle Database running on EC2 can be restored and recovered to a point-in-time faster than oracle native backup and recovery strategies. Also, by using EFS for Oracle RMAN archive log backups, we can avoid the additional licensing required to use Oracle Secure Backup, explained in Part 1. You can leverage this solution to facilitate restoring copies of your production database for development or testing purposes and to Recover from a user error that removes data or corrupts existing data.

To learn more about AWS Backup, refer to the AWS Backup Documentation.

Using AWS Backup and Oracle RMAN for backup/restore of Oracle databases on Amazon EC2: Part 1

Post Syndicated from Jeevan Shetty original https://aws.amazon.com/blogs/architecture/using-aws-backup-and-oracle-rman-for-backup-restore-of-oracle-databases-on-amazon-ec2-part-1/

Customers running Oracle databases on Amazon Elastic Compute Cloud (Amazon EC2) often take database and schema backups using Oracle native tools, like Data Pump and Recovery Manager (RMAN), to satisfy data protection, disaster recovery (DR), and compliance requirements. A priority is to reduce backup time as the data grows exponentially and recover sooner in case of failure/disaster.

In situations where RMAN backup is used as a DR solution, using AWS Backup to backup the file system and using RMAN to backup the archive logs are an efficient method to perform Oracle database point-in-time recovery in the event of a disaster.

Sample use cases:

  1. Quickly build a copy of production database to test bug fixes or for a tuning exercise.
  2. Recover from a user error that removes data or corrupts existing data.
  3. A complete database recovery after a media failure.

There are two options to backup the archive logs using RMAN:

  1. Using Oracle Secure Backup (OSB) and an Amazon Simple Storage Service (Amazon S3) bucket as the storage for archive logs
  2. Using Amazon Elastic File System (Amazon EFS) as the storage for archive logs

This is Part 1 of this two-part series, we provide a mechanism to use AWS Backup to create a full backup of the EC2 instance, including the OS image, Oracle binaries, logs, and data files. In this post, we will use Oracle RMAN to perform archived redo log backup to an Amazon S3 bucket. Then, we demonstrate the steps to restore a database to a specific point-in-time using AWS Backup and Oracle RMAN.

Solution overview

Figure 1 demonstrates the workflow:

  1. Oracle database on Amazon EC2 configured with Oracle Secure Backup (OSB)
  2. AWS Backup service to backup EC2 instance at regular intervals.
  3. AWS Identity and Access Management (IAM) role for EC2 instance that grants permission to write archive log backups to Amazon S3
  4. S3 bucket for storing Oracle RMAN archive log backups
Figure 1. Oracle Database in Amazon EC2 using AWS Backup and S3 for backup and restore

Figure 1. Oracle Database in Amazon EC2 using AWS Backup and S3 for backup and restore

Prerequisites

For this solution, the following prerequisites are required:

  1. An AWS account
  2. Oracle database and AWS CLI in an EC2 instance
  3. Access to configure AWS Backup
  4. Acces to S3 bucket to store the RMAN archive log backup

1. Configure AWS Backup

You can choose AWS Backup to schedule daily backups of the EC2 instance. AWS Backup efficiently stores your periodic backups using backup plans. Only the first EBS snapshot performs a full copy from Amazon Elastic Block Storage (Amazon EBS) to Amazon S3. All subsequent snapshots are incremental snapshots, copying just the changed blocks from Amazon EBS to Amazon S3, thus, reducing backup duration and storage costs. Oracle supports Storage Snapshot Optimization, which takes third-party snapshots of the database without placing the database in backup mode. By default, AWS Backup now creates crash-consistent backups of Amazon EBS volumes that are attached to an EC2 instance. Customers no longer have to stop their instance or coordinate between multiple Amazon EBS volumes attached to the same EC2 instance to ensure crash-consistency of their application state.

You can create daily scheduled backup of EC2 instances. Figures 2, 3, and 4 are sample screenshots of the backup plan, associating an EC2 instance with the backup plan.

Configure backup rule using AWS Backup

Figure 2. Configure backup rule using AWS Backup

Select EC2 instance containing Oracle Database for backup

Figure 3. Select EC2 instance containing Oracle Database for backup

Summary screen showing the backup rule and resources managed by AWS Backup

Figure 4. Summary screen showing the backup rule and resources managed by AWS Backup

Oracle RMAN archive log backup

While AWS Backup is now creating a daily backup of the EC2 instance, we also want to make sure we backup the archived log files to a protected location. This will let us do point-in-time restores and restore to other recent times than just the last daily EC2 backup. Here, we provide the steps to backup archive log using RMAN to S3 bucket.

Backup/restore archive logs to/from Amazon S3 using OSB

Backing-up the Oracle archive logs is an important part of the process. In this section, we will describe how you can backup their Oracle Archive logs to Amazon S3 using OSB. Note: OSB is a separately licensed product from Oracle Corporation, so you will need to be properly licensed for OSB if you use this approach.

2. Setup S3 bucket and IAM role

Oracle Archive log backups can be scheduled using cron script to run at regular interval (for example, every 15 minutes). These backups are stored in an S3 bucket.

a. Create an S3 bucket with lifecycle policy to transition the objects to S3 Standard-Infrequent Access.
b. Attach the following policy to the IAM Role of EC2 containing Oracle database or create an IAM role (ec2access) with the following policy and attach it to the EC2 instance. Update bucket-name with the bucket created in previous step.


        {
            "Sid": "S3BucketAccess",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObjectAcl",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::bucket-name",
                "arn:aws:s3:::bucket-name/*"
            ]
        }

3. Setup OSB

After we have configured the backup of EC2 instance using AWS Backup, we setup OSB in the EC2 instance. In these steps, we show the mechanism to configure OSB.

a. Verify hardware and software prerequisites for OSB Cloud Module.
b. Login to the EC2 instance with User ID owning the Oracle Binaries.
c. Download Amazon S3 backup installer file (osbws_install.zip)
d. Create Oracle wallet directory.

mkdir $ORACLE_HOME/dbs/osbws_wallet

e. Create a file (osbws.sh) in the EC2 instance with the following commands. Update IAM role with the one created/updated in Step 2b.

java -jar osbws_install.jar —IAMRole ec2access walletDir $ORACLE_HOME/dbs/osbws_wallet -libDir $ORACLE_HOME/lib/

f. Change permission and run the file.

chmod 700 osbws.sh
./osbws.sh

Sample output: AWS credentials are valid.
Oracle Secure Backup Web Service wallet created in directory /u01/app/oracle/product/19.3.0.0/db_1/dbs/osbws_wallet.
Oracle Secure Backup Web Service initialization file /u01/app/oracle/product/19.3.0.0/db_1/dbs/osbwsoratst.ora created.
Downloading Oracle Secure Backup Web Service Software Library from file osbws_linux64.zip.
Download complete.

g. Set ORACLE_SID by executing below command:

. oraenv

h. Running the script – osbws.sh installs OSB libraries and creates a file called osbws<ORACLE_SID>.ora.
i. Add/modify below with S3 bucket(bucket-name) and region(ex:us-west-2) created in Step 2a.

OSB_WS_HOST=http://s3.us-west-2.amazonaws.com
OSB_WS_BUCKET=bucket-name
OSB_WS_LOCATION=us-west-2

4. Configure RMAN backup to S3 bucket

With OSB installed in the EC2 instance, you can backup Oracle archive logs to S3 bucket. These backups can be used to perform database point-in-time recovery in case of database crash/corruption . oratst is used as an example in below commands.

a. Configure RMAN repository. Example below uses Oracle 19c and Oracle Sid – oratst.

RMAN> configure channel device type sbt parms='SBT_LIBRARY=/u01/app/oracle/product/19.3.0.0/db_1/lib/libosbws.so,SBT_PARMS=(OSB_WS_PFILE=/u01/app/oracle/product/19.3.0.0/db_1/dbs/osbwsoratst.ora)';

b. Create a script (for example, rman_archive.sh) with below commands, and schedule using crontab (example entry: */5 * * * * rman_archive.sh) to run every 5 minutes. This will makes sure Oracle Archive logs are backed up to Amazon S3 frequently, thus ensuring an recovery point objective (RPO) of 5 minutes.

dt=`date +%Y%m%d_%H%M%S`

rman target / log=rman_arch_bkup_oratst_${dt}.log <<EOF

RUN
{
	allocate channel c1_s3 device type sbt
	parms='SBT_LIBRARY=/u01/app/oracle/product/19.3.0.0/db_1/lib/libosbws.so,SBT_PARMS=(OSB_WS_PFILE=/u01/app/oracle/product/19.3.0.0/db_1/dbs/osbwsoratst.ora)' MAXPIECESIZE 10G;

	BACKUP ARCHIVELOG ALL delete all input;
	Backup CURRENT CONTROLFILE;

release channel c1_s3;
	
}

EOF

c. Copy RMAN logs to S3 bucket. These logs contain the database identifier (DBID) that is required when we have to restore the database using Oracle RMAN.

aws s3 cp rman_arch_bkup_oratst_${dt}.log s3://bucket-name

5. Perform database point-in-time recovery

In the event of a database crash/corruption, we can use AWS Backup service and Oracle RMAN Archive log backup to recover database to a specific point-in-time.

a. Typically, you would pick the most recent recovery point completed before the time you wish to recover. Using AWS Backup, identify the recovery point ID to restore by following the steps on restoring an Amazon EC2 instance. Note: when following the steps, be sure to set the “User data” settings as described in the next bullet item.

After the EBS volumes are created from the snapshot, there is no need to wait for all of the data to transfer from Amazon S3 to your EBS volume before your attached instance can start accessing the volume. Amazon EBS snapshots implement lazy loading, so that you can begin using them right away.

b. Be sure the database does not start automatically after restoring the EC2 instance, by renaming /etc/oratab. Use the following command in “User data” section while restoring EC2 instance. After database recovery, we can rename it back to /etc/oratab.

#!/usr/bin/sh
sudo su - 
mv /etc/oratab /etc/oratab_bk

c. Login to the EC2 instance once it is up, and execute the RMAN recovery commands mentioned. Identify the DBID from RMAN logs saved in the S3 bucket. These commands use database oratst as an example:

rman target /

RMAN> startup nomount

RMAN> set dbid DBID

# Below command is to restore the controlfile from autobackup

RMAN> RUN
{
    allocate channel c1_s3 device type sbt
	parms='SBT_LIBRARY=/u01/app/oracle/product/19.3.0.0/db_1/lib/libosbws.so,SBT_PARMS=(OSB_WS_PFILE=/u01/app/oracle/product/19.3.0.0/db_1/dbs/osbwsoratst.ora)';

    RESTORE CONTROLFILE FROM AUTOBACKUP;
    alter database mount;

    release channel c1_s3;
}


#Identify the recovery point (sequence_number) by listing the backups available in catalog.

RMAN> list backup;

In Figure 5, the most recent archive log backed up is 380, so you can use this sequence number in the next set of RMAN commands.

Sample output of Oracle RMAN “list backup” command

Figure 5. Sample output of Oracle RMAN “list backup” command

RMAN> RUN
{
    allocate channel c1_s3 device type sbt
	parms='SBT_LIBRARY=/u01/app/oracle/product/19.3.0.0/db_1/lib/libosbws.so,SBT_PARMS=(OSB_WS_PFILE=/u01/app/oracle/product/19.3.0.0/db_1/dbs/osbwsoratst.ora)';

    recover database until sequence sequence_number;
    ALTER DATABASE OPEN RESETLOGS;
    release channel c1_s3;
}

d. To avoid performance issues due to lazy loading, after the database is open, run the following command to force a faster restoration of the blocks from S3 bucket to EBS volumes (this example allocates two channels and validates the entire database).

RMAN> RUN
{
  ALLOCATE CHANNEL c1 DEVICE TYPE DISK;
  ALLOCATE CHANNEL c2 DEVICE TYPE DISK;
  VALIDATE database section size 1200M;
}

e. This completes the recovery of database, and we can let the database automatically start by renaming file back to /etc/oratab.

mv /etc/oratab_bk /etc/oratab

6. Backup retention

Ensure that the AWS Backup lifecycle policy matches the Oracle Archive log backup retention. Also, follow documentation to configure Oracle backup retention and delete expired backups. This is a sample command for Oracle backup retention:

CONFIGURE BACKUP OPTIMIZATION ON;
CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 31 DAYS; 

RMAN> RUN
{
    allocate channel c1_s3 device type sbt
	parms='SBT_LIBRARY=/u01/app/oracle/product/19.3.0.0/db_1/lib/libosbws.so,SBT_PARMS=(OSB_WS_PFILE=/u01/app/oracle/product/19.3.0.0/db_1/dbs/osbwsoratst.ora)';

            crosscheck backup;
            delete noprompt obsolete;
            delete noprompt expired backup;

    release channel c1_s3;
}

Cleanup

Follow below instructions to remove or cleanup the setup:

  1. Delete the backup plan created in Step 1.
  2. Uninstall Oracle Secure Backup from the EC2 instance.
  3. Delete/Update IAM role (ec2access) to remove access from the S3 bucket used to store archive logs.
  4. Remove the cron entry from the EC2 instance configured in Step 4b.
  5. Delete the S3 bucket that was created in Step 2a to store Oracle RMAN archive log backups.

Conclusion

In this post, we demonstrate how to use AWS Backup and Oracle RMAN Archive log backup of Oracle databases running on Amazon EC2 can restore and recover efficiently to a point-in-time, without requiring an extra-step of restoring data files. Data files are restored as part of the AWS Backup EC2 instance restoration. You can leverage this solution to facilitate restoring copies of your production database for development or testing purposes, plus recover from a user error that removes data or corrupts existing data.

To learn more about AWS Backup, refer to the AWS Backup AWS Backup Documentation.

Data warehouse and business intelligence technology consolidation using AWS

Post Syndicated from Bappaditya Datta original https://aws.amazon.com/blogs/architecture/data-warehouse-and-business-intelligence-technology-consolidation-using-aws/

Organizations have been using data warehouse and business intelligence (DWBI) workloads to support business decision making for many years. These workloads are brought to the Amazon Web Services (AWS) platform to utilize the benefit of AWS cloud. However, these workloads are built using multiple vendor tools and technologies, and the customer faces the burden of administrative overhead.

This post provides architectural guidance to consolidate multiple DWBI technologies to AWS Managed Services to help reduce the administrative overhead, bring operational ease, and business efficiency. Two scenarios are explored:

  1. Upstream transactional databases are already on AWS
  2. Upstream transactional databases are present at on-premise datacenter

Challenges faced by an organization

Organizations are engaged in managing multiple DWBI technologies due to acquisitions, mergers, and the lift-and-shift of workloads. These workloads use extract, transform, and load (ETL) tools to read relational data from upstream transactional databases, process it, and store it in a data warehouse. Thereafter, these workloads use business intelligence tools to generate valuable insight and present it to users in form of reports and dashboards.

These DWBI technologies are generally installed and maintained on their own server. Figure 1 demonstrates the increased the administrative overhead for the organization but also creates challenges in maintaining the team’s overall knowledge.

DWBI workload with multiple tools

Figure 1. DWBI workload with multiple tools

Therefore, organizations are looking to consolidate technology usage and continue supporting important business functions.

Scenario 1

As we know, three major functions of DWBI workstream are:

  • ETL data using a tool
  • Store/manage the data in a data warehouse
  • Generate information from the data using business intelligence

Each of these functions can be performed efficiently using an AWS service. For example, AWS Glue can be used for ETL, Amazon Redshift for data warehouse, and Amazon QuickSight for business intelligence.

With the use of mentioned AWS services, organizations will be able to consolidate their DWBI technology usage. Organizations also will be able to quickly adapt to these services, as their engineering team can more easily use their DWBI knowledge with these services. For example, using SQL knowledge in AWS Glue jobs with SprakSQL, in Amazon Redshift queries, and in Amazon QuickSight dashboards.

Figure 2 demonstrates the redesigned the architecture of Figure 1 using AWS services. In this architecture, ETL functions are consolidated in AWS Glue. An AWS Glue crawler is used to auto-catalogue the source and target table metadata; then, AWS Glue ETL jobs use these catalogues to read data from source and write to target (data warehouse). AWS Glue jobs also apply necessary transformations (such as join, filter, and aggregate) to the data before writing. Additionally, an AWS Glue trigger is used to schedule the job executions. Alternatively, AWS Managed Workflows for Apache Airflow can be used to schedule jobs.

Consolidated workload with source on AWS

Figure 2. Consolidated workload with source on AWS

Similarly, data warehousing function is consolidated with Amazon Redshift. Amazon Redshift is used to store and organize enriched data and also enforce appropriate data access control for both workloads and users.

Lastly, business intelligence functions are consolidated using Amazon QuickSight. It used to create necessary dashboards that source data from Amazon Redshift and apply complex business logic to produce necessary charts and graphs needed for business insights. It is also used to implement necessary access restrictions to dashboards and data.

Scenario 2

In situation where source databases are in on-premises datacenter, the overall solution will be similar to Scenario 1, with an additional step to move the data continually from on-premise database to an Amazon Simple Storage Service (Amazon S3) bucket. The data movement can be efficiently handled by AWS Database Migration Service (AWS DMS).

To make the source database accessible to AWS DMS, a connection needs to established between the AWS cloud and on-premise network. Based on performance and throughput needs, the organization can choose either AWS Direct Connect service or AWS Site-to-Site VPN service to securely move the data. For the purpose of this discussion, we are considering AWS Direct Connect.

In Figure 3, AWS DMS task is used to perform a full-load followed by change data capture to continuously move the data to an S3 bucket. In this scenario, AWS Glue is used to catalogue and read the data from S3 bucket. The remaining portion of the dataflow is the same as the one mentioned in Scenario 1.

Consolidated workload with source at datacenter

Figure 3. Consolidated workload with source at datacenter

Scaling

Both of the updated architectures provide necessary scaling:

  • Auto scaling feature can be used to scale-up or -down AWS Glue ETL job resources
  • Concurrency scaling feature can be used to support virtually unlimited concurrent users and queries in Amazon Redshift
  • Amazon QuickSight resources (web server, Amazon QuickSight engine, and SPICE) are auto scaled by design

Security, monitoring, and auditing

Also, the updated architectures provide necessary security by using access control, data encryption at-rest and in transit, monitoring, and auditing.

Additionally, both Amazon Redshift and Amazon QuickSight provides their own authentication and access controls. Therefore, a user can be a local user or a federated one. With the help of these authentications, an organization will be able to control access to data in Amazon Redshift and also access to the dashboard in Amazon QuickSight.

Conclusion

In this blog post, we discussed how AWS Glue, Amazon Redshift, and Amazon QuickSight can be used to consolidate DWBI technologies. We also have discussed how an architecture can help an organization build a scalable, secure workload with auto scaling, access control, log monitoring and activity auditing.

Ready to get started?