Tag Archives: AWS DataSync

Welcome to AWS Storage Day 2023

2023-08-09 Veliswa Boya

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/welcome-to-aws-storage-day-2023/

Welcome to the fifth annual AWS Storage Day! This virtual event is happening today starting at 9:00 AM Pacific Time (12:00 PM Eastern Time) and is available for you to watch on the AWS On Air Twitch channel. The first AWS Storage Day was hosted in 2019, and this event has grown into an innovation day that we look forward to delivering to you every year. In last year’s Storage Day post, I wrote about the constant innovations in AWS Storage aimed at helping you put your data to work while keeping it secure and protected. This year, Storage Day is focused on storage for AI/ML, data protection and resiliency, and the benefits of moving to the cloud.

AWS Storage Day Key Themes
When it comes to storage for AI/ML, data volumes are increasing at an unprecedented rate, exploding from terabytes to petabytes and even to exabytes. With a modern data architecture on AWS, you can rapidly build scalable data lakes, use a broad and deep collection of purpose-built data services, scale your systems at a low cost without compromising performance, share data across organizational boundaries, and manage compliance, security, and governance, allowing you to make decisions with speed and agility at scale.
To train machine learning models and build Generative AI applications, you must have the right data strategy in place. So, I’m happy to see that, among the list of sessions to look forward to at the live event, the Optimize generative AI and ML with AWS Infrastructure session will discuss how you can transform your data into meaningful insights.

Whether you’re just getting started with the cloud, planning to migrate applications to AWS, or already building applications on AWS, we have resources to help you protect your data and meet your business continuity objectives. Our data protection and resiliency features and solutions can help you meet your business continuity goals and deliver disaster recovery during data loss events, across recovery point and time objectives (RPO and RTO). With the unprecedented data growth happening in the world today, determining where your data is stored, how it’s secured, and who has access to it is a higher priority than ever. Be sure to join the Protect data in AWS amid a rapidly evolving cyber landscape session to learn more.

When moving data to the cloud, you need to understand where you’re moving it for different use cases, the types of data you’re moving, and the network resources available, among other considerations. There are many reasons to move to the cloud. Recently, Enterprise Strategy Group (ESG) validated that organizations reduced compute, networking, and storage costs by up to 66 percent by migrating on-premises workloads to AWS Cloud infrastructure. ESG confirmed that migrating on-premises workloads to AWS provides organizations with reduced costs, increased performance, improved operational efficiency, faster time to value, and improved business agility.
We have a number of sessions that discuss how to move to the cloud, based on your use case. I’m most looking forward to the Hybrid cloud storage and edge compute: AWS, where you need it session, which will discuss considerations for workloads that can’t fully move to the cloud.

Tune in to learn from experts about new announcements, leadership insights, and educational content related to the broad portfolio of AWS Storage services and features that address all these themes and more. Today, we have announcements related to Amazon Simple Storage Service (Amazon S3), Amazon FSx for Windows File Server, Amazon Elastic File System (Amazon EFS), Amazon FSx for OpenZFS, and more.

Let’s get into it.

15 Years of Amazon EBS
Not long ago, I was reading Jeff Barr’s post titled 15 Years of AWS Blogging! In this post, Jeff mentioned a few posts he wrote for the earliest AWS services and features. Amazon Elastic Block Store (Amazon EBS) is on this list as a service that simplifies the use of Amazon EC2.

Well, it’s been 15 years since the launch of Amazon EBS was announced, and today we celebrate 15 years of this service. If you were one of the original users who put Amazon EBS to good use and provided us with the very helpful feedback that helped us invent and simplify, iterate and improve, I’m sure you can’t believe how time flies. Today, Amazon EBS handles more than 100 trillion I/O operations daily, and over 390 million EBS volumes are created every day.

If you’re new to Amazon EBS, join us for a fireside chat with Matt Garman, Senior Vice President, Sales, Marketing, and Global Services at AWS, and learn the strategy and customer challenges behind the launch of the service in 2008. You’ll also hear from long-term EBS customer, Stripe, about its growth with EBS since Stripe was launched 12 years ago.

Amazon EBS has continuously improved its scalability and performance to support more customer workloads as the direct storage attachment for Amazon EC2 instances. With the launch of Amazon EC2 M7i instances, powered by custom 4th Generation Intel Xeon Scalable processors, on August 2, you can attach up to 128 Amazon EBS volumes, an increase from 28 on a previous generation M6i instance. The higher number of volume attachments means you can increase storage density per instance and improve resource utilization, reducing total compute cost.

You can host up to 127 containers per instance for larger database applications and scale them more cost effectively before needing to provision more instances and only pay for resources you need. With a higher number of volume attachments, you can fully utilize the memory and vCPU available on these powerful M7i instances as your database storage footprint grows. EBS is also increasing the number of multi-volume snapshots you can create, for up to 128 EBS volumes attached to an instance, enabling you to create crash-consistent backups of all volumes attached to an instance.

Join the 15 years of innovations with Amazon EBS session for a discussion about how the original vision for Amazon EBS has evolved to meet your growing demands for cloud infrastructure.

Mountpoint for Amazon S3
Now generally available, Mountpoint for Amazon S3 is a new open source file client that delivers high throughput access, lowering compute costs for data lakes on Amazon S3. Mountpoint for Amazon S3 is a file client that translates local file system API calls to S3 object API calls. Using Mountpoint for Amazon S3, you can mount an Amazon S3 bucket as a local file system on your compute instance, to access your objects through a file interface with the elastic storage and throughput of Amazon S3. Mountpoint for Amazon S3 supports sequential and random read operations on existing files, and sequential write operations for creating new files.
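Because Mountpoint presents a bucket as a local file system, ordinary file I/O is enough to work with objects. The following minimal sketch illustrates the access patterns described above: a sequential write that creates a new file, and a random read on an existing file. It uses a temporary directory as a stand-in for a real mount point (the path /mnt/s3-bucket in the comment is hypothetical), so it runs without AWS access.

```python
import os
import tempfile


def write_new_object(mount_dir: str, name: str, chunks: list[bytes]) -> str:
    """Create a new file by writing its chunks sequentially, front to back."""
    path = os.path.join(mount_dir, name)
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
    return path


def read_range(path: str, offset: int, length: int) -> bytes:
    """Random read: seek to an offset in an existing file and read a slice."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for a real mount point such as /mnt/s3-bucket (hypothetical path).
mount_dir = tempfile.mkdtemp()
path = write_new_object(
    mount_dir, "sample.bin", [b"header--", b"payload-", b"trailer-"]
)
middle = read_range(path, 8, 8)  # bytes 8..15 of the file
```

On a real mount, the same two functions would be pointed at the mount directory instead of the temporary one.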

The Deep dive and demo of Mountpoint for Amazon S3 session demonstrates how to use the file client to access objects in Amazon S3 using file APIs, making it easier to store data at scale and maximize the value of your data with analytics and machine learning workloads. Read this blog post to learn more about Mountpoint for Amazon S3 and how to get started, including a demo.

Put Cold Storage to Work Faster with Amazon S3 Glacier Flexible Retrieval
Amazon S3 Glacier Flexible Retrieval improves data restore time by up to 85 percent, at no additional cost. Faster data restores automatically apply to the Standard retrieval tier when using Amazon S3 Batch Operations. These restores begin to return objects within minutes, so you can process restored data faster. Processing restored data in parallel with ongoing restores helps you accelerate data workflows and quickly respond to business needs. Now, whether you’re transcoding media, restoring operational backups, training machine learning models, or analyzing historical data, you can speed up your data restores from archive.

Coupled with the S3 Glacier improvements to restore throughput by up to 10 times for millions of objects announced in 2022, S3 Glacier data restores of all sizes now benefit from both faster starts and shorter completion times.
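As a rough illustration, a restore through S3 Batch Operations is described by an operation document that names the retrieval tier and how long the restored copies stay available. The sketch below only builds that document and does not call AWS. The field names follow the S3 Control CreateJob API as I understand it, and the seven-day expiration is an arbitrary example value; a real job also needs a manifest, a report configuration, and an IAM role, which are omitted here.

```python
def restore_operation(expiration_days: int, tier: str = "STANDARD") -> dict:
    """Build the Operation document for an S3 Batch Operations restore job.

    Assumption: tier is one of the values accepted for Batch Operations
    restores ("STANDARD" or "BULK").
    """
    if tier not in ("STANDARD", "BULK"):
        raise ValueError(f"unsupported retrieval tier: {tier}")
    if expiration_days < 1:
        raise ValueError("expiration_days must be at least 1")
    return {
        "S3InitiateRestoreObject": {
            "ExpirationInDays": expiration_days,
            "GlacierJobTier": tier,
        }
    }


# Example value only: keep restored copies available for seven days.
operation = restore_operation(7)
```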

Join the Maximize the value of cold data with Amazon S3 Glacier session to learn how Amazon S3 Glacier is helping organizations of all sizes and from all industries transform their data archiving to unlock business value, increase agility, and save on storage costs. Read this blog post to learn more about the Amazon S3 Glacier Flexible Retrieval performance improvements and follow step-by-step guidance on how to get started with faster standard retrievals from S3 Glacier Flexible Retrieval.

Supporting a Broad Spectrum of File Workloads
To serve a broad spectrum of use cases that rely on file systems, we offer a portfolio of file system services, each targeting a different set of needs. Amazon EFS is a serverless file system built to deliver an elastic experience for sharing data across compute resources. Amazon FSx makes it easier and cost-effective for you to launch, run, and scale feature-rich, high-performance file systems in the cloud, enabling you to move to the cloud with no changes to your code, processes, or how you manage your data.

Power ML research and big data analytics with Amazon EFS
Amazon EFS offers serverless and fully scalable file storage, designed for high scalability in both storage capacity and throughput performance. Just last week, we announced enhanced support for faster read and write IOPS, making it easier to power more demanding workloads. We’ve improved the performance capabilities of Amazon EFS by adding support for up to 55,000 read IOPS and up to 25,000 write IOPS per file system. These performance enhancements help you to run more demanding workflows, such as machine learning (ML) research with KubeFlow, financial simulations with IBM Symphony, and big data processing with Domino Data Lab, Hadoop, and Spark.

Join the Build and run analytics and SaaS applications at scale session to hear how recent Amazon EFS performance improvements can help power more workloads.

Multi-AZ file systems on Amazon FSx for OpenZFS
You can now use a multi-AZ deployment option when creating file systems on Amazon FSx for OpenZFS, making it easier to deploy file storage that spans multiple AWS Availability Zones to provide multi-AZ resilience for business-critical workloads. With this launch, you can take advantage of the power, agility, and simplicity of Amazon FSx for OpenZFS for a broader set of workloads, including business-critical workloads like database, line-of-business, and web-serving applications that require highly available shared storage that spans multiple AZs.

The new multi-AZ file systems are designed to deliver high levels of performance to serve a broad variety of workloads, including performance-intensive workloads such as financial services analytics, media and entertainment workflows, semiconductor chip design, and game development and streaming. They deliver up to 21 GB per second of throughput and over 1 million IOPS for frequently accessed cached data, and up to 10 GB per second and 350,000 IOPS for data accessed from persistent disk storage.

Join the Migrate NAS to AWS to reduce TCO and gain agility session to learn more about multi-AZs with Amazon FSx for OpenZFS.

New, Higher Throughput Capacity Levels on Amazon FSx for Windows File Server
Performance improvements for Amazon FSx for Windows File Server help you accelerate time-to-results for performance-intensive workloads such as SQL Server databases, media processing, cloud video editing, and virtual desktop infrastructure (VDI).

We’re adding four new, higher throughput capacity levels to increase the maximum I/O available up to 12 GB per second from the previous I/O of 2 GB per second. These throughput improvements come with correspondingly higher levels of disk IOPS, designed to deliver an increase up to 350,000 IOPS.

In addition, by using FSx for Windows File Server, you can provision IOPS higher than the default 3 IOPS per GiB for your SSD file system. This allows you to scale SSD IOPS independently from storage capacity, allowing you to optimize costs for performance-sensitive workloads.
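To make the default concrete, the following small sketch computes the baseline SSD IOPS from the 3 IOPS per GiB figure above and checks whether a workload would need a higher provisioned value. The function names and the example workload numbers are illustrative assumptions, not part of the service.

```python
DEFAULT_IOPS_PER_GIB = 3  # default ratio quoted in the text above


def default_ssd_iops(storage_gib: int) -> int:
    """Baseline SSD IOPS that come with a given storage capacity."""
    return DEFAULT_IOPS_PER_GIB * storage_gib


def needs_provisioned_iops(storage_gib: int, required_iops: int) -> bool:
    """True when the workload needs more IOPS than the default provides."""
    return required_iops > default_ssd_iops(storage_gib)


# Hypothetical workload: 2,048 GiB of SSD storage that needs 20,000 IOPS.
baseline = default_ssd_iops(2048)
must_provision = needs_provisioned_iops(2048, 20_000)
```

In this example the baseline is 6,144 IOPS, so the workload would need provisioned IOPS rather than extra storage capacity.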

Join the Migrate NAS to AWS to reduce TCO and gain agility session to learn more about the performance improvements for Amazon FSx for Windows File Server.

Logically Air-Gapped Vault for AWS Backup
AWS Backup is a fully managed, policy-based data protection solution that enables customers to centralize and automate backup restores across 19 AWS services (spanning compute, storage, and databases) and third-party applications such as VMware Cloud on AWS and on-premises, as well as SAP HANA on Amazon EC2.

Today, we’re announcing the preview of logically air-gapped vault as a new type of AWS Backup Vault that acts as an additional layer of protection to mitigate against malware events. With logically air-gapped vault, customers can recover their application data through a different trusted account.

Join the Deep dive on data recovery for ransomware events session to learn more about logically air-gapped vault for AWS Backup.

Copy Data to and from Other Clouds with AWS DataSync
AWS DataSync is an online data movement and discovery service that simplifies data migration and helps you quickly, easily, and securely transfer your file or object data to, from, and between AWS storage services. In addition to support of data migration to and from AWS storage services, DataSync supports copying to and from other clouds such as Google Cloud Storage, Azure Files, and Azure Blob Storage. Using DataSync, you can move your object data at scale between Amazon S3 compatible storage on other clouds and AWS storage services such as Amazon S3. We’re now expanding the support of DataSync for copying data to and from other clouds to include DigitalOcean Spaces, Wasabi Cloud Storage, Backblaze B2 Cloud Storage, Cloudflare R2 Storage, and Oracle Cloud Storage.
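For S3-compatible storage on another cloud, the source side of a transfer can be described as a DataSync object storage location. The sketch below only assembles the request parameters and makes no AWS call. The parameter names follow the DataSync CreateLocationObjectStorage API as I understand it; the hostname, bucket name, agent ARN, and credentials are placeholders.

```python
def object_storage_location_params(
    hostname: str,
    bucket: str,
    agent_arn: str,
    access_key: str,
    secret_key: str,
    subdirectory: str = "/",
) -> dict:
    """Assemble parameters for a DataSync object storage location."""
    return {
        "ServerHostname": hostname,
        "ServerProtocol": "HTTPS",
        "ServerPort": 443,
        "BucketName": bucket,
        "Subdirectory": subdirectory,
        "AccessKey": access_key,
        "SecretKey": secret_key,  # load from a secrets store in practice
        "AgentArns": [agent_arn],
    }


# All values below are placeholders for illustration.
params = object_storage_location_params(
    hostname="objects.example.com",
    bucket="example-source-bucket",
    agent_arn="arn:aws:datasync:us-east-1:111122223333:agent/agent-0123456789abcdef0",
    access_key="EXAMPLEACCESSKEY",
    secret_key="example-secret",
)
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("datasync").create_location_object_storage(**params)`.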

Join the Identify and accelerate data migrations at scale session to learn more about this expanded support for DataSync.

Join Us Online
Join us today for the AWS Storage Day virtual event on the AWS On Air channel on Twitch. The event will be live starting at 9:00 AM Pacific Time (12:00 PM Eastern Time) on August 9. All sessions will be available on demand approximately two days after Storage Day.

We look forward to seeing you on Twitch!

– Veliswa

Designing a hybrid AI/ML data access strategy with Amazon SageMaker

2023-07-10 Franklin Aguinaldo

Post Syndicated from Franklin Aguinaldo original https://aws.amazon.com/blogs/architecture/designing-a-hybrid-ai-ml-data-access-strategy-with-amazon-sagemaker/

Over time, many enterprises have built on-premises clusters of servers, accumulating data and then procuring more servers and storage. They often begin their ML journey by experimenting locally on their laptops. Investment in artificial intelligence (AI) is at a different stage in every business organization. Some remain completely on-premises, others are hybrid (both on-premises and cloud), and the remaining have moved completely into the cloud for their AI and machine learning (ML) workloads.

These enterprises are also researching or have started using the cloud to augment their on-premises systems for several reasons. As technology improves, both the size and quantity of data increase over time. The amount of data captured and the number of datapoints continue to expand, which presents a challenge to manage on-premises. Many enterprises are distributed, with offices in different geographic regions, continents, and time zones. While it is possible to increase the on-premises footprint and network pipes, there are still hidden costs to consider for maintenance and upkeep. These organizations are looking to the cloud to shift some of that effort and enable them to burst and use the rich AI and ML features on the cloud.

Defining a hybrid data access strategy

Moving ML workloads into the cloud calls for a robust hybrid data strategy describing how and when you will connect your on-premises data stores to the cloud. For most, it makes sense to make the cloud the source of truth, while still permitting your teams to use and curate datasets on-premises. Defining the cloud as source of truth for your datasets means the primary copy will be in the cloud and any dataset generated will be stored in the same location in the cloud. This ensures that requests for data are served from the primary copy and any derived copies.

A hybrid data access strategy should address the following:

Understand your current and future storage footprint for ML on-premises. Create a map of your ML workloads, along with performance and access requirements for testing and training.
Define connectivity across on-premises locations and the cloud. This includes east-west and north-south traffic to support interconnectivity between sites, required bandwidth, and throughput for the data movement workload requirements.
Define your single source of truth (SSOT)[1] and where the ML datasets will primarily live. Consider how dated, new, hot, and cold data will be stored.
Define your storage performance requirements, mapping them to the appropriate cloud storage services. This will give you the ability to take advantage of cloud-native ML with Amazon SageMaker.

Hybrid data access strategy architecture

To help address these challenges, we worked on outlining an end-to-end system architecture in Figure 1 that defines: 1) connectivity between on-premises data centers and AWS Regions; 2) mappings for on-premises data to the cloud; and 3) aligning Amazon SageMaker to appropriate storage, based on ML requirements.

Figure 1. AI/ML hybrid data access strategy reference architecture

Let’s explore this architecture step by step.

On-premises connectivity to the AWS Cloud runs through AWS Direct Connect for high transfer speeds.
AWS DataSync is used for migrating large datasets into Amazon Simple Storage Service (Amazon S3). AWS DataSync agent is installed on-premises.
On-premises network file system (NFS) or server message block (SMB) data is bridged to the cloud through Amazon S3 File Gateway, using either a virtual machine (VM) or hardware appliance.
AWS Storage Gateway uploads data into Amazon S3 and caches it on-premises.
Amazon S3 is the source of truth for ML assets stored on the cloud.
Download S3 data for experimentation to Amazon SageMaker Studio.
Amazon SageMaker notebook instances can access data through S3, Amazon FSx for Lustre, and Amazon Elastic File System. Use Amazon File Cache for high-speed caching for access to on-premises data, and Amazon FSx for NetApp ONTAP for cloud bursting.
SageMaker training jobs can use data in Amazon S3, EFS, and FSx for Lustre. S3 data is accessed via File, Fast File, or Pipe mode, and pre-loaded or lazy-loaded when using FSx for Lustre as training job input. Any existing data on EFS can also be made available to training jobs.
Leverage Amazon S3 Glacier for archiving data and reducing storage costs.
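The archiving step at the end of this list is commonly automated with an S3 lifecycle rule. The following sketch builds such a rule as a plain dictionary and makes no AWS call. The structure follows the S3 PutBucketLifecycleConfiguration API as I understand it; the prefix, rule ID, and 90-day threshold are assumptions chosen for illustration.

```python
def archive_rule(prefix: str, days: int, rule_id: str = "archive-cold-data") -> dict:
    """Lifecycle rule that moves objects under a prefix to S3 Glacier.

    "GLACIER" is the storage class name S3 uses for Glacier Flexible Retrieval.
    """
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
    }


# Hypothetical policy: archive raw datasets 90 days after creation.
lifecycle_configuration = {"Rules": [archive_rule("datasets/raw/", 90)]}
```

With boto3, this dictionary could be applied using `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle_configuration)`.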

ML workloads using Amazon SageMaker

Let’s go deeper into how SageMaker can help you with your ML workloads.

To start mapping ML workloads to the cloud, consider which AWS storage services work with Amazon SageMaker. Amazon S3 typically serves as the central storage location for both structured and unstructured data that is used for ML. This includes raw data coming from upstream applications, and also curated datasets that are organized and stored as part of a Feature Store.

In the initial phases of development, a SageMaker Studio user will leverage S3 APIs to download data from S3 to their private home directory. This home directory is backed by a SageMaker-managed EFS file system. Studio users then point their notebook code (also stored in the home directory) to the local dataset and begin their development tasks.

To scale up and automate model training, SageMaker users can launch training jobs that run outside of the SageMaker Studio notebook environment. There are several options for making data available to a SageMaker training job.

Amazon S3. Users can specify the S3 location of the training dataset. When using S3 as a data source, there are three input modes to choose from:
- File mode. This is the default input mode, where SageMaker copies the data from S3 to the training instance storage. This storage is either a SageMaker-provisioned Amazon Elastic Block Store (Amazon EBS) volume or an NVMe SSD that is included with specific instance types. Training only starts after the dataset has been downloaded to the storage, and there must be enough storage space to fit the entire dataset.
- Fast file mode. Fast file mode exposes S3 objects as a POSIX file system on the training instance. Dataset files are streamed from S3 on demand, as the training script reads them. This means that training can start sooner and require less disk space. Fast file mode also does not require changes to the training code.
- Pipe mode. Pipe input also streams data in S3 as the training script reads it, but requires code changes. Pipe input mode is largely replaced by the newer and easier-to-use Fast File mode.
FSx for Lustre. Users can specify an FSx for Lustre file system. SageMaker will mount the file system on the training instance and run the training code. When the FSx for Lustre file system is linked to an S3 bucket, the data can be lazily loaded from S3 during the first training job. Subsequent training jobs on the same dataset can then access it with low latency. Users can also choose to pre-load the file system with S3 data using hsm_restore commands.
Amazon EFS. Users can specify an EFS file system that already contains their training data. SageMaker will mount the file system on the training instance and run the training code.
Find out how to Choose the best data source for your SageMaker training job.
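The data source options above map onto the input channels of a training job request. The sketch below builds two such channels as plain dictionaries, without calling SageMaker. The field names follow the CreateTrainingJob API as I understand it; the bucket, file system ID, and directory path are hypothetical.

```python
S3_INPUT_MODES = ("File", "FastFile", "Pipe")
FILE_SYSTEM_TYPES = ("EFS", "FSxLustre")


def s3_channel(name: str, s3_uri: str, input_mode: str = "File") -> dict:
    """Channel that reads from S3 using one of the three input modes."""
    if input_mode not in S3_INPUT_MODES:
        raise ValueError(f"unknown input mode: {input_mode}")
    return {
        "ChannelName": name,
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


def file_system_channel(name: str, fs_id: str, fs_type: str, path: str) -> dict:
    """Channel that mounts an existing EFS or FSx for Lustre file system."""
    if fs_type not in FILE_SYSTEM_TYPES:
        raise ValueError(f"unknown file system type: {fs_type}")
    return {
        "ChannelName": name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": fs_id,
                "FileSystemType": fs_type,
                "FileSystemAccessMode": "ro",  # read-only is enough for training input
                "DirectoryPath": path,
            }
        },
    }


# Hypothetical bucket, file system ID, and path.
channels = [
    s3_channel("train", "s3://example-ml-bucket/train/", "FastFile"),
    file_system_channel("validation", "fs-0123456789abcdef0", "FSxLustre", "/fsx/validation"),
]
```

The resulting list corresponds to the `InputDataConfig` parameter of a training job request.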

Conclusion

With this reference architecture, you can develop and deliver ML workloads that run either on-premises or in the cloud. Your enterprise can continue using its on-premises storage and compute for particular ML workloads, while also taking advantage of the cloud, using Amazon SageMaker. The scale available on the cloud allows your enterprise to conduct experiments without worrying about capacity. Start defining your hybrid data strategy on AWS today!

[1] The practice of aggregating data from many sources to a single source or location.

Reduce archive cost with serverless data archiving

2023-07-07 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/architecture/reduce-archive-cost-with-serverless-data-archiving/

For regulatory reasons, decommissioning core business systems in financial services and insurance (FSI) markets requires data to remain accessible years after the application is retired. Traditionally, FSI companies either outsourced data archiving to third-party service providers, which maintained application replicas, or purchased vendor software to query and visualize archival data.

In this blog post, we present a more cost-efficient option with serverless data archiving on Amazon Web Services (AWS). In our experience, you can build your own cloud-native solution on Amazon Simple Storage Service (Amazon S3) at one-fifth of the price of third-party alternatives. If you are retiring legacy core business systems, consider serverless data archiving for cost-savings while keeping regulatory compliance.

Serverless data archiving and retrieval

Modern archiving solutions follow the principles of modern applications:

Serverless-first development, to reduce management overhead.
Cloud-native, to leverage native capabilities of AWS services, such as backup or disaster recovery, to avoid custom build.
Consumption-based pricing, since data archival is consumed irregularly.
Speed of delivery, as both implementation and archive operations need to be performed quickly to fulfill regulatory compliance.
Flexible data retention policies can be enforced in an automated manner.

AWS Storage and Analytics services offer the necessary building blocks for a modern serverless archiving and retrieval solution.

Data archiving can be implemented on top of Amazon S3 and AWS Glue.

Amazon S3 storage tiers enable different data retention policies and retrieval service level agreements (SLAs). You can migrate data to Amazon S3 using AWS Database Migration Service; otherwise, consider another data transfer service, such as AWS DataSync or AWS Snowball.
AWS Glue crawlers automatically infer both database and table schemas from your data in Amazon S3 and store the associated metadata in the AWS Glue Data Catalog.
Amazon CloudWatch monitors the execution of AWS Glue crawlers and notifies you of failures.

Figure 1 provides an overview of the solution.

Figure 1. Serverless data archiving and retrieval

Once the archival data is catalogued, Amazon Athena can be used for serverless data query operations using standard SQL.

Amazon API Gateway receives the data retrieval requests and eases integration with other systems via REST, HTTPS, or WebSocket.
AWS Lambda reads parametrization data/templates from Amazon S3 in order to construct the SQL queries. Alternatively, query templates can be stored as key-value entries in a NoSQL store, such as Amazon DynamoDB.
Lambda functions trigger Athena with the constructed SQL query.
Athena uses the AWS Glue Data Catalog to retrieve table metadata for the Amazon S3 (archival) data and to return the SQL query results.
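The query construction step can be sketched with the standard library. In the example below, a stored template is filled with request parameters after each value is checked against a conservative pattern, since the values end up inside SQL text. The template, table name, and parameter names are hypothetical, and this is one possible approach rather than the implementation described in this post.

```python
import re
from string import Template

# Only letters, digits, underscores, and hyphens are accepted as values.
_SAFE_VALUE = re.compile(r"[A-Za-z0-9_\-]+")


def build_query(template_text: str, params: dict) -> str:
    """Fill a SQL template, rejecting values that could alter the query."""
    rendered = {}
    for key, value in params.items():
        text = str(value)
        if not _SAFE_VALUE.fullmatch(text):
            raise ValueError(f"unsafe value for parameter {key!r}")
        rendered[key] = text
    # substitute() raises KeyError if the template names a missing parameter.
    return Template(template_text).substitute(rendered)


# Hypothetical template, as it might be stored in Amazon S3 or DynamoDB.
template_text = "SELECT * FROM $table WHERE policy_id = '$policy_id' LIMIT $limit"
query = build_query(
    template_text, {"table": "policies", "policy_id": "P-1001", "limit": 10}
)
```

A Lambda function could then pass `query` as the `QueryString` of an Athena `start_query_execution` call.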

How we built serverless data archiving

An early build-or-buy assessment compared vendor products with a custom-built solution using Amazon S3, AWS Glue, and a user frontend for data retrieval and visualization.

The total cost of ownership over a 10-year period for one insurance core system (Policy Admin System) was $0.25M to build and run the custom solution on AWS compared with >$1.1M for third-party alternatives. The implementation cost advantage of the custom-built solution was due to development efficiencies using AWS services. The lower run cost resulted from a decreased frequency of archival usage and paying only for what you use.

The data archiving solution was implemented with AWS services (Figure 2):

Amazon S3 is used to persist archival data in Parquet format (optimized for analytics and compressed to reduce storage space) that is loaded from the legacy insurance core system. The archival data source was AS400/DB2 and moved with Informatica Cloud to Amazon S3.
AWS Glue crawlers infer the database schema from objects in Amazon S3 and create tables in AWS Glue for the decommissioned application data.
Lambda functions (Python) remove data records based on retention policies configured for each domain, such as customers, policies, claims, and receipts. A daily job (Control-M) initiates the retention process.
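The retention step can be illustrated with a small sketch that selects records older than a per-domain retention period. The retention periods, the record layout, and the `closed_on` field are assumptions made for this example; the post does not specify them, nor how the selected records are physically removed from the Parquet files.

```python
from datetime import date

# Hypothetical retention periods in years, one per domain.
RETENTION_YEARS = {"customers": 10, "policies": 10, "claims": 7, "receipts": 5}


def cutoff_date(today: date, years: int) -> date:
    """Date before which records fall outside the retention period."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        # February 29 has no counterpart in a non-leap target year.
        return today.replace(year=today.year - years, day=28)


def expired(records: list[dict], domain: str, today: date) -> list[dict]:
    """Records whose closing date is older than the domain's cutoff."""
    cutoff = cutoff_date(today, RETENTION_YEARS[domain])
    return [r for r in records if r["closed_on"] < cutoff]


claims = [
    {"id": 1, "closed_on": date(2010, 5, 1)},
    {"id": 2, "closed_on": date(2020, 5, 1)},
]
to_delete = expired(claims, "claims", today=date(2023, 7, 7))
```

A daily job would call `expired` once per domain and hand the result to whatever routine rewrites or deletes the underlying objects.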

Figure 2. Exemplary implementation of serverless data archiving and retrieval for insurance core system

Retrieval operations are formulated and executed via Python functions in Lambda. The following AWS resources implement the retrieval logic:

Athena is used to run SQL queries over the AWS Glue tables for the decommissioned application.
Lambda functions (Python) build and execute queries for data retrieval. The functions render HTML snippets using the Jinja templating engine and Athena query results, returning the selected template filled with the requested archive data. Using Jinja as the templating engine improved the speed of delivery and reduced the heavy lifting of frontend and backend changes when modeling retrieval operations by ~30% due to the decoupling between application layers. As a result, engineers only need to build an Athena query with the linked Jinja template.
Amazon S3 stores templating configuration and queries (JSON files) used for query parametrization.
Amazon API Gateway serves as single point of entry for API calls.
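The rendering step can be sketched as follows. The solution described here uses Jinja; to keep this example free of third-party dependencies, it swaps in the standard library's `html.escape` and f-strings instead, so it illustrates the data flow rather than the actual templates. The sample response mimics the shape of an Athena GetQueryResults response as I understand it, where the first row carries the column names; the column names and values are invented.

```python
from html import escape


def rows_to_dicts(result: dict) -> list[dict]:
    """Convert an Athena-style result set into a list of row dictionaries."""
    rows = result["ResultSet"]["Rows"]
    header = [cell.get("VarCharValue", "") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue", "") for cell in row["Data"])))
        for row in rows[1:]
    ]


def render_table(records: list[dict]) -> str:
    """Render records as an HTML table, escaping every value."""
    if not records:
        return "<p>No records found.</p>"
    columns = list(records[0])
    head = "".join(f"<th>{escape(c)}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(r[c])}</td>" for c in columns) + "</tr>"
        for r in records
    )
    return f"<table><tr>{head}</tr>{body}</table>"


# Invented sample in the shape of a query result.
sample = {
    "ResultSet": {
        "Rows": [
            {"Data": [{"VarCharValue": "policy_id"}, {"VarCharValue": "holder"}]},
            {"Data": [{"VarCharValue": "P-1001"}, {"VarCharValue": "A & B Ltd"}]},
        ]
    }
}
records = rows_to_dicts(sample)
snippet = render_table(records)
```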

The user frontend for data retrieval and visualization is implemented as a web application using the React JavaScript library (with static content on Amazon S3), with Amazon CloudFront used for web content delivery.

The archiving solution enabled 80 use cases with 60 queries and reduced storage from three terabytes on source to only 35 gigabytes on Amazon S3. The success of the implementation depended on the following key factors:

Appropriate sponsorship from business across all areas (claims, actuarial, compliance, etc.)
Definition of SLAs for responding to courts, regulators, etc.
Minimum viable and mandatory approach
Prototype visualizations early on (fail fast)

Conclusion

Traditionally, FSI companies relied on vendor products for data archiving. In this post, we explored how to build a scalable solution on Amazon S3 and discussed key implementation considerations. We have demonstrated that AWS services enable FSI companies to build a serverless archiving solution while reaching and keeping regulatory compliance at a lower cost.

Week in Review – AWS Verified Access, Java 17, Amplify Flutter, Conferences, and More – May 1, 2023

2023-05-02 Sébastien Stormacq

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/week-in-review-aws-verified-access-java-17-amplify-flutter-conferences-and-more-may-1-2023/

Conference season has started and I was happy to meet and talk with iOS and Swift developers at the New York Swifty conference last week. I will travel again to Turin (Italy), Amsterdam (Netherlands), Frankfurt (Germany), and London (UK) in the coming weeks. Feel free to stop by and say hi if you are around. But, while I was queuing for passport control at JFK airport, AWS teams continued to listen to your feedback and innovate on your behalf.

What happened on AWS last week? I counted 26 new capabilities since last Monday (not counting last Friday, since I am writing these lines before the start of the day in the US). Here are the eight that caught my attention.

Last Week on AWS

Amplify Flutter now supports web and desktop apps. You can now write Flutter applications that target six platforms, including iOS, Android, Web, Linux, MacOS, and Windows with a single codebase. This update encompasses not only the Amplify libraries but also the Flutter Authenticator UI library, which has been entirely rewritten in Dart. As a result, you can now deliver a consistent experience across all targeted platforms.

AWS Lambda adds support for Java 17. AWS Lambda now supports Java 17 as both a managed runtime and a container base image. Developers creating serverless applications in Lambda with Java 17 can take advantage of new language features including Java records, sealed classes, and multi-line strings. The Lambda Java 17 runtime also has numerous performance improvements, including optimizations when running Lambda functions on Graviton2 processors. It supports AWS Lambda SnapStart (in supported Regions) for fast cold starts, and the latest versions of the popular Spring Boot 3 and Micronaut 4 application frameworks.

AWS Verified Access is now generally available. I first wrote about Verified Access when we announced the preview at the re:Invent conference last year. This new service helps you provide secure access to your corporate applications without using a VPN. Built on AWS Zero Trust principles, Verified Access lets you implement a work-from-anywhere model with added security and scalability.

AWS Support is now available in Korean. As the number of customers speaking Korean grows, AWS Support is invested in providing the best support experience possible. You can now communicate with AWS Support engineers and agents in Korean when you create a support case at the AWS Support Center.

AWS DataSync Discovery is now generally available. DataSync Discovery enables you to understand your on-premises storage performance and capacity through automated data collection and analysis. It helps you quickly identify data to be migrated and evaluate suggested AWS Storage services that align with your performance and capacity needs. Capabilities added since preview include support for NetApp ONTAP 9.7, recommendations at cluster and storage virtual machine (SVM) levels, and discovery job events in Amazon EventBridge.

Amazon Location Service adds support for long-distance matrix routing. This makes it easier for you to quickly calculate driving time and driving distance between multiple origins and destinations, no matter how far apart they are. Developers can now make a single API request to calculate up to 122,500 routes (350 origins and 350 destinations) within a 180 km region or up to 100 routes without any distance limitation.
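
To make the request limits above concrete, here is a minimal Python sketch of the arithmetic. The function and constant names are hypothetical, the limits are taken only from the announcement text, and the 180 km check is reduced to a flag that the caller supplies rather than something computed from coordinates.

```python
# Limits quoted in the announcement; names are hypothetical and only
# illustrate the arithmetic, they are not part of any AWS SDK.
MAX_POINTS_REGIONAL = 350       # max origins and max destinations within a 180 km region
MAX_ROUTES_ANY_DISTANCE = 100   # max routes when there is no distance limitation


def route_count(num_origins: int, num_destinations: int) -> int:
    """A route matrix holds one route per (origin, destination) pair."""
    return num_origins * num_destinations


def fits_single_request(num_origins: int, num_destinations: int,
                        within_180_km: bool) -> bool:
    """Return True if the matrix stays within the limits quoted above."""
    if num_origins < 1 or num_destinations < 1:
        return False
    if within_180_km:
        return (num_origins <= MAX_POINTS_REGIONAL
                and num_destinations <= MAX_POINTS_REGIONAL)
    return route_count(num_origins, num_destinations) <= MAX_ROUTES_ANY_DISTANCE
```

With 350 origins and 350 destinations, `route_count` gives the 122,500 routes mentioned in the announcement.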

AWS Firewall Manager adds support for multiple administrators. You can now create up to 10 AWS Firewall Manager administrator accounts from AWS Organizations to manage your firewall policies. You can delegate responsibility for firewall administration at a granular scope by restricting access based on OU, account, policy type, and Region, thereby enabling policy management tasks to be implemented faster and more effectively.

AWS AppSync supports TypeScript and source maps in JavaScript resolvers. With this update, you can take advantage of TypeScript features when you write JavaScript resolvers. With the updated libraries, you get improved support for types and generics in AppSync’s utility functions. The updated AppSync documentation provides guidance on how to get started and how to bundle your code when you want to use TypeScript.

Amazon Athena Provisioned Capacity. Athena is a query service that makes it simple to analyze data in S3 data lakes and 30 different data sources, including on-premises data sources or other cloud systems, using standard SQL queries. Athena is serverless, so there is no infrastructure to manage, and, until last week, you paid only for the queries that you run. You can now also get dedicated capacity for your queries and use new workload management features to prioritize, control, and scale your most important queries, paying only for the capacity you provision.

X in Y – We made existing services available in additional Regions and locations:

Amazon EC2 High Memory instances are now available in the Europe (Zurich) Region.
Amazon EC2 T4g instances are now available in the Africa (Cape Town) Region.
AWS Global Accelerator launches two new edge locations in Lima, Peru, and Nashville, Tennessee (United States).
AWS Systems Manager Fleet Manager console-based access to Windows instances is now available in AWS GovCloud (US) Regions.
AWS Network Firewall ingress TLS inspection is now available in 8 additional Regions. This capability is now available in 10 AWS Regions: US East (N. Virginia), Asia Pacific (Jakarta, Mumbai, Singapore, Sydney, Tokyo), Europe (Ireland, Frankfurt, Stockholm), and South America (São Paulo).
Amazon CloudWatch Logs data protection is now available in all AWS Commercial Regions.

Upcoming AWS Events
And to finish this post, I recommend you check your calendars and sign up for these AWS events:

AWS Serverless Innovation Day – Join us on May 17, 2023, for a virtual event hosted on the Twitch AWS channel. We will showcase AWS serverless technology choices such as AWS Lambda, Amazon ECS with AWS Fargate, Amazon EventBridge, and AWS Step Functions. In addition, we will share serverless modernization success stories, use cases, and best practices.

AWS re:Inforce 2023 – You can now register for AWS re:Inforce, held in Anaheim, California, June 13–14. AWS Chief Information Security Officer CJ Moses will share the latest innovations in cloud security and what AWS Security is focused on. The breakout sessions will provide real-world examples of how security is embedded into the way businesses operate. To learn more and get the limited discount code to register, see CJ’s blog post Gain insights and knowledge at AWS re:Inforce 2023 in the AWS Security Blog.

AWS Global Summits – Check your calendars and sign up for the AWS Summit close to where you live or work: Seoul (May 3–4), Berlin and Singapore (May 4), Stockholm (May 11), Hong Kong (May 23), Amsterdam (June 1), London (June 7), Madrid (June 15), and Milano (June 22).

AWS Community Day – Join community-led conferences driven by AWS user group leaders close to your city: Chicago (June 15), Manila (June 29–30), and Munich (September 14). Recently, we have been bringing together AWS user groups from around the world into Meetup Pro accounts. Find your group and its meetups in your city!

AWS User Group Peru Conference – There is more than a new edge location opening in Lima. The local AWS User Group announced a one-day cloud event in Spanish and English in Lima on September 23. Three of us from the AWS News blog team will attend. I will be joined by my colleagues Marcia and Jeff. Save the date and register today!

You can browse all upcoming AWS-led in-person and virtual events and developer-focused events such as AWS DevDay.

Stay Informed
That was my selection for this week! To better keep up with all of this news, don’t forget to check out the following resources:

What’s New with AWS – All AWS announcements. You might want to add the RSS feed to your news reader.
The Official AWS Podcast – Listen each week for updates on the latest AWS news and deep dives into exciting use cases. There are also official AWS podcasts in your local languages. Check out the ones in French, German, Italian, and Spanish.
AWS News Blog – This blog.
And subscribe to the open-source newsletter brought to you by my most excellent colleague Ricardo.

That’s all for this week. Check back next Monday for another Week in Review!

— seb

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

AWS Week in Review: Public Preview of Amazon DataZone and AWS DataSync Updates – April 3, 2023

2023-04-03 Channy Yun

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/aws-week-in-review-public-preview-of-amazon-datazone-and-aws-datasync-updates-april-3-2023/

Last weekend, I enjoyed the spring vibes at Seoul Forest, a large park in the middle of Seoul city, where cherry blossoms are in full bloom.

Compared to last year, there were crowds of people, so I realized that it was really back to normal after the pandemic. I hope you all enjoy the season of spring or fall with your family.

Last Week’s Launches
It sounds like an April Fools’ Day joke, but there were 65 launches last week, far more than usual. AWS product teams are working hard, driven by customer obsession.

So, I had a lot of trouble choosing the important ones. Beyond the ones I’ve picked out, there may be important feature releases that fit your needs. Be sure to take a look at the full list of last week’s launches.

First, here is a list of the AWS services and features that became generally available and were covered on the AWS News Blog:

Let’s take a look at some launches from the last week that I want to remind you of:

The Preview of Amazon DataZone – At AWS re:Invent 2022, we preannounced Amazon DataZone, a new data management service to catalog, discover, analyze, share, and govern data between data producers and consumers in the organization. You can now try out the public preview of Amazon DataZone.

Data producers populate the business data catalog from AWS Glue Data Catalog and Amazon Redshift tables. Data consumers search for and subscribe to data assets in the data catalog and analyze them with tools such as the Amazon Athena query editor in the Amazon DataZone portal. To get started with Amazon DataZone, see our Quick Start Guide, which includes sample datasets for implementing a complete use case.

AWS DataSync Supports Azure Blob Storage in Preview – AWS DataSync supports copying your object data at scale from Azure Blob Storage to AWS storage services such as Amazon S3. AWS DataSync supports all blob types within Azure Blob Storage and can also be used with Azure Data Lake Storage (ADLS) Gen 2.

In addition to Azure Blob Storage, DataSync supports Google Cloud Storage and Azure Files storage locations as well as various general storage systems and AWS storage services. To learn more, see Migrating Azure Blob Storage to Amazon S3 using AWS DataSync in the AWS Storage Blog.

On-call schedules with AWS Systems Manager Incident Manager – You can now configure or change on-call rotation schedules with a group of contacts and have 24/7 coverage and responsiveness for critical issues in the Incident Manager console.

Incident Manager helps you bring the right people and information together when a critical issue is detected, activating preconfigured response plans to engage responders using SMS, phone calls, and chat channels, as well as to run AWS Systems Manager Automation runbooks. To learn how to get started with on-call schedules in Incident Manager, see Working with on-call schedules in Incident Manager in the AWS documentation.

AWS CloudShell Console Toolbar – You can now use the AWS CloudShell Console Toolbar with the AWS Management Console in a single view. The Console Toolbar maintains its state (e.g., open, closed) and commands will continue to run in CloudShell as you navigate between services in the Console. For example, it allows you to run a command in CloudShell and view a CloudWatch alarm in the Console at the same time.

After signing into the Console, you can access CloudShell in the lower left of the Console by selecting the CloudShell icon in the Console Toolbar.

New Features of AWS Well-Architected Tool – The Consolidated Report and Enhanced Search enable customers to quickly identify risk themes across their workloads and scale improvements across their organization. This macro-level view helps executive stakeholders understand where common issues lie and prioritize team resources to drive widespread improvement. To learn more, see AWS Well-Architected Tool Dashboard in the AWS documentation.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some other news items that you may find interesting from the last week:

Welcome to the .NET on AWS Blog – We launched a new blog channel for millions of .NET developers across the world. Blog posts will also cover built-for-the-cloud development, modernizing .NET Framework applications, and how to deploy .NET workloads on different AWS services. We will use this channel to share news on the work we’ve done with the .NET open-source community, post follow-ups from important events, and post announcements about upcoming presentations from our .NET developer advocates. To learn more, visit our .NET on AWS website and follow us on Twitter at @dotnetonAWS.

AWS Knowledge Center in AWS re:Post – You can now access trusted, authoritative articles and videos of AWS Knowledge Center on AWS re:Post to get answers to technical questions. Knowledge Center content is produced by an AWS team and covers the most frequent questions and requests from AWS customers. These articles are available in 10 localized languages: English, French, German, Italian, Japanese, Korean, Portuguese, Simplified Chinese, Spanish, and Traditional Chinese.

TF1’s FIFA World Cup Digital Broadcasting Story – Sébastien shared an awesome story about how the French broadcaster TF1 uses AWS Cloud technology and expertise to bring the FIFA World Cup to millions of people. He shared the history of redesigning its digital broadcasting architecture on AWS and testing the new platform on large-scale sporting events. In preparation for the FIFA World Cup, TF1 enhanced monitoring to detect anomalies during the event and established a backup plan in a “war room” for the worst-case scenario. Even if you’re not a fan of football, I recommend reading the behind-the-scenes story of the FIFA World Cup finals. It’s long but really fun!

Upcoming AWS Events
Check your calendars and sign up for these AWS-led events:

AWS re:Inforce 2023 – You can now register for AWS re:Inforce, held in Anaheim, California, June 13–14. AWS Chief Information Security Officer CJ Moses will share the latest innovations in cloud security and what AWS Security is focused on. The breakout sessions will provide real-world examples of how security is embedded into the way businesses operate. To learn more and get the limited discount code to register, see CJ’s blog post Gain insights and knowledge at AWS re:Inforce 2023 in the AWS Security Blog.

AWS Global Summits – Check your calendars and sign up for the AWS Summit closest to your city: Paris and Sydney (April 4), Seoul (May 3–4), Berlin and Singapore (May 4), Stockholm (May 11), Hong Kong (May 23), Amsterdam (June 1), London (June 7), Madrid (June 15), and Milano (June 22).

AWS Community Day – Join community-led conferences driven by AWS user group leaders closest to your city: Peru (April 15), Helsinki (April 20), Chicago (June 15), Philippines (June 29–30), and Munich (September 14). Recently, we have been bringing together AWS user groups from around the world into Meetup Pro accounts. Find your group and its meetups in your city!

You can browse all upcoming AWS-led in-person and virtual events, and developer-focused events such as AWS DevDay.

That’s all for this week. Check back next Monday for another Week in Review!

— Channy

AWS Week In Review – May 30, 2022

2022-05-30 Channy Yun

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/aws-week-in-review-may-30-2022/

Today, the US observes Memorial Day. South Korea also has a national Memorial Day, celebrated next week on June 6. In both countries, the day is set aside to remember those who sacrificed in service to their country. This time provides an opportunity to recognize and show our appreciation for the armed services and the important role they play in protecting and preserving national security.

AWS has also supported our veterans, active-duty military personnel, and military spouses with our training and hiring programs in the US. We’ve developed a number of programs focused on engaging the military community, helping them develop valuable AWS technical skills, and helping them begin their journey to the cloud. To learn more, see AWS’s military commitment.

Last Week’s Launches
The launches that caught my attention last week are the following:

Three New AWS Wavelength Zones in the US and South Korea – We announced the availability of three new AWS Wavelength Zones on Verizon’s 5G Ultra Wideband network in Nashville, Tennessee, and Tampa, Florida in the US, and Seoul in South Korea on SK Telecom’s 5G network.

AWS Wavelength Zones embed AWS compute and storage services at the edge of communications service providers’ 5G networks while providing seamless access to cloud services running in an AWS Region. We have a total of 28 Wavelength Zones globally, in Canada, Germany, Japan, South Korea, the UK, and the US. Learn more about AWS Wavelength and get started today.

New Amazon EC2 C7g, M6id, C6id, and P4de Instance Types – Last week, we announced four new EC2 instance types. C7g instances are the first instances powered by the latest AWS Graviton3 processors and deliver up to 25 percent better performance over Graviton2-based C6g instances for a broad spectrum of applications, even high-performance computing (HPC) and CPU-based machine learning (ML) inference.

M6id and C6id instances are powered by the Intel Xeon Scalable processors (Ice Lake) with an all-core turbo frequency of 3.5 GHz, equipped with up to 7.6 TB of local NVMe-based SSD block-level storage, and deliver up to 15 percent better price performance compared to the previous generation instances.

P4de instances are a preview of our latest GPU-based instances that provide the highest performance for ML training and HPC applications. They are powered by 8 NVIDIA A100 GPUs, each with 80 GB of high-performance HBM2e GPU memory, 2X higher than the GPUs in our current P4d instances. The new P4de instances provide a total of 640 GB of GPU memory, providing up to 60 percent better ML training performance along with 20 percent lower cost to train when compared to P4d instances.

Amazon EC2 Stop Protection Feature to Protect Instances From Unintentional Stop Actions – You no longer have to worry about accidentally stopping or terminating your instances. With Stop Protection, you can safeguard data in instance store volume(s) from unintentional stop actions. You could already protect your instances from unintentional termination actions by enabling Termination Protection.

When enabled, the Stop or Termination Protection feature blocks attempts to stop or terminate the instance via the EC2 console, API call, or CLI command. This provides an extra measure of protection for stateful workloads, since instances can be stopped or terminated only after the corresponding protection feature is deactivated.
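
As a rough illustration of enabling both protections programmatically, the following Python sketch builds the request parameters. It assumes the boto3 EC2 client's `modify_instance_attribute` call with its `DisableApiStop` and `DisableApiTermination` attributes, which, as far as I know, accepts only one attribute per call. The helper name and the instance ID are hypothetical, and nothing here contacts AWS.

```python
from typing import Dict, List


def protection_requests(instance_id: str, enabled: bool = True) -> List[Dict]:
    """Build one request per attribute (one attribute per API call).

    Each dict is meant to be passed as keyword arguments to the EC2
    client's modify_instance_attribute call (an assumption, see above).
    """
    return [
        # Stop Protection
        {"InstanceId": instance_id, "DisableApiStop": {"Value": enabled}},
        # Termination Protection
        {"InstanceId": instance_id, "DisableApiTermination": {"Value": enabled}},
    ]


# Hypothetical instance ID, for illustration only.
for request in protection_requests("i-0123456789abcdef0"):
    print(request)
    # With boto3 installed and credentials configured, something like:
    # boto3.client("ec2").modify_instance_attribute(**request)
```

Passing `enabled=False` builds the requests that deactivate the protections again before an intentional stop or termination.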

AWS DataSync Supports Google Cloud Storage and Azure Files Storage Locations – We announced the general availability of two additional storage locations for AWS DataSync, an online data movement service that makes it easy to sync your data both into and out of the AWS Cloud. With this release, DataSync now supports Google Cloud Storage and Azure Files storage locations in addition to Network File System (NFS) shares, Server Message Block (SMB) shares, Hadoop Distributed File Systems (HDFS), self-managed object storage, AWS Snowcone, Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), Amazon FSx for Windows File Server, Amazon FSx for Lustre, and Amazon FSx for OpenZFS.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Last week, there were many public sector announcements at AWS Summit Washington, DC:

New Department of Defense (DoD) Cloud Infrastructure as Code (IaC), a collection of templates to enable defense mission owners to quickly build out secure, scalable cloud environments.
AWS officially announced the selection of 10 startups for the 2022 AWS Space Accelerator, 12 startups for the AWS Sustainable Cities Accelerator, the first AWS Defence Accelerator for startups, and the next program of AWS Healthcare Accelerator for startups.
New Public Sector Partner Programs, including Solution Spark for Public Sector Partners, AWS Smart City Competency, and AWS Containers Rapid Adoption Assistance (CRAA) initiative.

To learn more, watch the keynote of Max Peterson, Vice President of AWS Worldwide Public Sector.

Upcoming AWS Events
If you have a developer background or similar and are looking to develop ML skills you can use to solve real-world problems, Let’s Ship It – with AWS! ML Edition is the perfect place to start. Over eight episodes of Twitch training scheduled from June 2 to July 21, you can learn hands-on how to build ML models, such as predicting demand and personalizing your offerings, and more.

The AWS Summit season is mostly over in Asia Pacific and Europe, but there are some upcoming virtual and in-person Summits that might be close to you in June and July:

June 21–22, AWS Summit Milano, Italy (in-person)
June 22–23, AWS Summit Toronto, Canada (in-person)
June 29, AWS Summit Online, EMEA (virtual)
July 12, AWS Summit New York (in-person)

More to come in August and September.

Please join Amazon re:MARS 2022 (June 21 – 24) to hear from recognized thought leaders and technical experts who are building the future of machine learning, automation, robotics, and space. As a preview, you can read Robotics at Amazon, published by Amazon Science, which discusses recent real-world challenges of building robotic systems.

You can now register for AWS re:Inforce 2022 (July 26 – 27). Join us in Boston to learn how AWS is innovating in the world of cloud security, and hone your technical skills in expert-led interactive sessions.

You can now register for AWS re:Invent 2022 (November 28 – December 2). Join us in Las Vegas to experience our most vibrant event that brings together the global cloud community. You can virtually attend live keynotes and leadership sessions and access our on-demand breakout sessions even after re:Invent closes.

That’s all for this week. Check back next Monday for another Week in Review!

– Channy

New for AWS DataSync – Move Data Between AWS and Google Cloud Storage or AWS and Microsoft Azure Files

2022-05-24 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-aws-datasync-move-data-between-aws-and-google-cloud-storage-or-aws-and-microsoft-azure-files/

Moving data to and from AWS Storage services can be automated and accelerated with AWS DataSync. For example, you can use DataSync to migrate data to AWS, replicate data for business continuity, and move data for analysis and processing in the cloud. You can use DataSync to transfer data to and from AWS Storage services, including Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), and Amazon FSx. DataSync also integrates with Amazon CloudWatch and AWS CloudTrail for logging, monitoring, and alerting.

Today, we added to DataSync the capability to migrate data between AWS Storage services and either Google Cloud Storage or Microsoft Azure Files. In this way, you can simplify your data processing or storage consolidation tasks. This also helps if you need to import, share, and exchange data with customers, vendors, or partners who use Google Cloud Storage or Microsoft Azure Files. DataSync provides end-to-end security, including encryption and integrity validation, to ensure your data arrives securely, intact, and ready to use.

Let’s see how this works in practice.

Preparing the DataSync Agent
First, I need a DataSync agent to read from, or write to, storage located in Google Cloud Storage or Azure Files. I deploy the agent on an Amazon Elastic Compute Cloud (Amazon EC2) instance. The latest DataSync Amazon Machine Image (AMI) ID is stored in the Parameter Store, a capability of AWS Systems Manager. I use the AWS Command Line Interface (CLI) to get the value of the /aws/service/datasync/ami parameter:

aws ssm get-parameter --name /aws/service/datasync/ami --region us-east-1

{
    "Parameter": {
        "Name": "/aws/service/datasync/ami",
        "Type": "String",
        "Value": "ami-0e244fe801cf5a510",
        "Version": 54,
        "LastModifiedDate": "2022-05-11T14:08:09.319000+01:00",
        "ARN": "arn:aws:ssm:us-east-1::parameter/aws/service/datasync/ami",
        "DataType": "text"
    }
}
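
If you script this step, the AMI ID can be pulled out of that JSON response. The following Python sketch parses the sample output shown above; the helper name is hypothetical, and the `ami-` prefix check is only a sanity check I added.

```python
import json

# Sample response copied from the get-parameter output above.
SAMPLE_OUTPUT = """
{
    "Parameter": {
        "Name": "/aws/service/datasync/ami",
        "Type": "String",
        "Value": "ami-0e244fe801cf5a510",
        "Version": 54,
        "LastModifiedDate": "2022-05-11T14:08:09.319000+01:00",
        "ARN": "arn:aws:ssm:us-east-1::parameter/aws/service/datasync/ami",
        "DataType": "text"
    }
}
"""


def ami_id_from_output(cli_output: str) -> str:
    """Return the AMI ID stored in the Value property of the parameter."""
    ami_id = json.loads(cli_output)["Parameter"]["Value"]
    if not ami_id.startswith("ami-"):
        raise ValueError(f"unexpected parameter value: {ami_id!r}")
    return ami_id


print(ami_id_from_output(SAMPLE_OUTPUT))
```

The AMI ID changes over time and differs by Region, so treat the value above as a sample rather than something to hard-code.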

Using the EC2 console, I start an EC2 instance using the AMI ID specified in the Value property of the parameter. For networking, I use a public subnet and the option to auto-assign a public IP address. The EC2 instance needs network access to both the source and the destination of a data moving task. Another requirement for the instance is to be able to receive HTTP traffic from DataSync to activate the agent.

When using AWS DataSync in a virtual private cloud (VPC) based on the Amazon VPC service, it is a best practice to use VPC endpoints to connect the agent with the DataSync service. In the VPC console, I choose Endpoints on the navigation pane and then Create endpoint. I enter a name for the endpoint and select the AWS services category.

In the Services section, I look for DataSync.

Then, I select the same VPC where I started the EC2 instance.

To reduce cross-AZ traffic, I choose the same subnet used for the EC2 instance.

The DataSync agent running on the EC2 instance needs network access to the VPC endpoint. For simplicity, I use the default security group of the VPC for both. I create the VPC endpoint and, after a few minutes, it’s ready to be used.

In the AWS DataSync console, I choose Agents from the navigation pane and then Create agent. I select Amazon EC2 for the Hypervisor.

I choose VPC endpoints using AWS PrivateLink for the Endpoint type. I select the VPC endpoint I created before and the same Subnet and Security group I used for the VPC endpoint.

I choose the option to Automatically get the activation key and type the public IP of the EC2 instance. Then, I choose Get key.

After the DataSync agent has been activated, I don’t need HTTP access anymore, and I remove that from the security groups of the EC2 instance. Now that the DataSync agent is active, I can configure tasks and locations to move my data.

Moving Data from Google Cloud Storage to Amazon S3
I have a few images in a Google Cloud Storage bucket, and I want to synchronize those files with an S3 bucket. In the Google Cloud console, I open the settings of the bucket. There, I create a service account with Storage Object Viewer permissions and write down the credentials (access key and secret) to access the bucket programmatically.

Back in the AWS DataSync console, I choose Tasks and then Create task.

To configure the source of the task, I create a location. I select Object storage for the Location type and choose the agent I just created. For the Server, I use storage.googleapis.com. Then, I enter the name of the Google Cloud bucket and the folder where my images are stored.

For authentication, I enter the access key and the secret I retrieved when I created the service account. I choose Next.
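
For readers who prefer the SDK over the console, here is a hedged Python sketch of the same source location as request parameters. The keys follow the DataSync `CreateLocationObjectStorage` API as I understand it (for example, the boto3 client's `create_location_object_storage` call). The helper name, the HTTPS and port 443 settings, and all sample values are my assumptions and do not come from the walkthrough.

```python
from typing import Dict


def gcs_location_request(agent_arn: str, bucket_name: str, subdirectory: str,
                         access_key: str, secret_key: str) -> Dict:
    """Build parameters for an object storage location backed by Google Cloud Storage."""
    return {
        "ServerHostname": "storage.googleapis.com",  # the Server value used in the console
        "ServerProtocol": "HTTPS",                   # assumption: not stated in the post
        "ServerPort": 443,                           # assumption: default HTTPS port
        "BucketName": bucket_name,
        "Subdirectory": subdirectory,                # folder where the images are stored
        "AccessKey": access_key,                     # access key created for the service account
        "SecretKey": secret_key,
        "AgentArns": [agent_arn],                    # the agent activated earlier
    }


# Hypothetical values, for illustration only.
request = gcs_location_request(
    agent_arn="arn:aws:datasync:us-east-1:111122223333:agent/agent-EXAMPLE",
    bucket_name="my-gcs-bucket",
    subdirectory="/images",
    access_key="EXAMPLE_ACCESS_KEY",
    secret_key="EXAMPLE_SECRET",
)
print(sorted(request))
```

In real code, load the secret from a secure store instead of embedding it in source.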

To configure the destination of the task, I create another location. This time, I select Amazon S3 for the Location Type. I choose the destination S3 bucket and enter a folder that will be used as a prefix for the files transferred to the bucket. I use the Autogenerate button to create the IAM role that will give DataSync permissions to access the S3 bucket.

In the next step, I configure the task settings. I enter a name for the task. Optionally, I can fine-tune how DataSync verifies the integrity of the transferred data or allocate a bandwidth for the task.

I can also choose what data to scan and what to transfer. By default, all source data is scanned, and only data that has changed is transferred. In the Additional settings, I disable Copy object tags because tags are currently not supported with Google Cloud Storage.
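
The task settings described above map onto a small set of options. The Python sketch below expresses them as a dictionary whose keys and values follow the DataSync task `Options` structure as I understand it (`VerifyMode`, `TransferMode`, `ObjectTags`, `BytesPerSecond`). The helper name and the default verification mode are my own choices, so check the API reference before relying on them.

```python
from typing import Dict, Optional


def task_options(bandwidth_mib_per_second: Optional[float] = None,
                 verify_mode: str = "ONLY_FILES_TRANSFERRED") -> Dict:
    """Build task options matching the console choices described above."""
    if bandwidth_mib_per_second is None:
        bytes_per_second = -1  # -1 means no bandwidth limit
    else:
        bytes_per_second = int(bandwidth_mib_per_second * 1024 * 1024)
    return {
        "VerifyMode": verify_mode,         # how integrity is verified (assumed default)
        "TransferMode": "CHANGED",         # transfer only data that has changed
        "ObjectTags": "NONE",              # do not copy object tags (Google Cloud Storage)
        "BytesPerSecond": bytes_per_second,
    }


print(task_options())                              # no bandwidth limit
print(task_options(bandwidth_mib_per_second=10))   # cap at 10 MiB/s
```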

I can select the schedule used to run this task. For now, I leave it Not scheduled, and I will start it manually.

For logging, I use the Autogenerate button to create a log group for DataSync. I choose Next.

I review the configurations and create the task. Now, I start the data moving task from the console. After a few minutes, the files are synced with my S3 bucket and I can access them from the S3 console.

Moving Data from Azure Files to Amazon FSx for Windows File Server
I take a lot of pictures, and I also have a few images in an Azure file share. I want to synchronize those files with an Amazon FSx for Windows file system. In the Azure console, I select the file share and choose the Connect button to generate a PowerShell script that checks if this storage account is accessible over the network.

$connectTestResult = Test-NetConnection -ComputerName <SMB_SERVER> -Port 445
if ($connectTestResult.TcpTestSucceeded) {
    # Save the password so the drive will persist on reboot
    cmd.exe /C "cmdkey /add:`"danilopsync.file.core.windows.net`" /user:`"localhost\<USER>`" /pass:`"<PASSWORD>`""
    # Mount the drive
    New-PSDrive -Name Z -PSProvider FileSystem -Root "\\danilopsync.file.core.windows.net\<SHARE_NAME>" -Persist
} else {
    Write-Error -Message "Unable to reach the Azure storage account via port 445. Check to make sure your organization or ISP is not blocking port 445, or use Azure P2S VPN, Azure S2S VPN, or Express Route to tunnel SMB traffic over a different port."
}

From this script, I grab the information I need to configure the DataSync location:

SMB Server
Share Name
User
Password
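
The SMB server and share name both come from the UNC path passed to `-Root` in the script above. As a small illustration, this Python sketch splits such a path into its parts; the helper name and the share name `photos` are hypothetical.

```python
from typing import Tuple


def split_unc_path(unc_path: str) -> Tuple[str, str]:
    """Split a UNC path of the form \\\\server\\share into (server, share)."""
    if not unc_path.startswith("\\\\"):
        raise ValueError(f"not a UNC path: {unc_path!r}")
    parts = unc_path[2:].split("\\")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"missing server or share in: {unc_path!r}")
    return parts[0], parts[1]


# Hypothetical share name; the server name is the one from the script above.
server, share = split_unc_path(r"\\danilopsync.file.core.windows.net\photos")
print(server)
print(share)
```

The user and password come from the `cmdkey` line of the same script and are entered separately in the location settings.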

Back in the AWS DataSync console, I choose Tasks and then Create task.

To configure the source of the task, I create a location. I select Server Message Block (SMB) for the Location Type and the agent I created before. Then, I use the information I found in the script to enter the SMB Server address, the Share name, and the User/Password to use for authentication.

To configure the destination of the task, I again create a location. This time, I choose Amazon FSx for the Location type. I select an FSx for Windows file system that I created before and use the default share name. I use the default security group to connect to the file system. Because I am using AWS Directory Service for Microsoft Active Directory with FSx for Windows File Server, I use the credentials of a user member of the AWS Delegated FSx Administrators and Domain Admins groups. For more information, see Creating a location for FSx for Windows File Server in the documentation.

In the next step, I enter a name for the task and leave all other options to their default values in the same way I did for the previous task.

I review the configurations and create the task. Now, I start the data moving task from the console. After a few minutes, the files are synced with my FSx for Windows file system share. I mount the file system share with a Windows EC2 instance and see that my images are there.

When creating a task, I can reuse existing locations. For example, if I want to synchronize files from Azure Files to my S3 bucket, I can quickly select the two corresponding locations I created for this post.

Availability and Pricing
You can move your data using the AWS DataSync console, AWS Command Line Interface (CLI), or AWS SDKs to create tasks that move data between AWS storage and Google Cloud Storage buckets or Azure Files file systems. As your tasks run, you can monitor progress from the DataSync console or by using CloudWatch.

There are no changes to DataSync pricing with these new capabilities. Moving data to and from Google Cloud or Microsoft Azure is charged at the same rate as all other data sources supported by DataSync today.

You may be subject to data transfer out fees by Google Cloud or Microsoft Azure. Because DataSync compresses data in flight when copying between the agent and AWS, you may be able to reduce egress fees by deploying the DataSync agent in a Google Cloud or Microsoft Azure environment.
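
To see why agent placement can matter, here is a back-of-the-envelope Python sketch. It assumes egress is billed on the bytes that actually leave the other cloud, so compressing before the data leaves reduces the billed volume. The function name, the price, and the compression ratio are hypothetical inputs, not figures from Google Cloud, Microsoft Azure, or AWS.

```python
def estimated_egress_cost(data_gb: float, price_per_gb: float,
                          compression_ratio: float = 1.0) -> float:
    """Estimate egress fees when data is compressed before it leaves the source cloud.

    compression_ratio is original size divided by compressed size;
    1.0 means no compression benefit.
    """
    if compression_ratio < 1.0:
        raise ValueError("compression_ratio must be at least 1.0")
    return data_gb / compression_ratio * price_per_gb


# Hypothetical numbers: 1,000 GB at 0.10 per GB, with and without 2:1 compression.
print(estimated_egress_cost(1000, 0.10))
print(estimated_egress_cost(1000, 0.10, compression_ratio=2.0))
```

Actual savings depend on how compressible the data is. Images and other already-compressed files may see little benefit.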

When using DataSync to move data from AWS to Google Cloud or Microsoft Azure, you are charged for data transfer out from EC2 to the internet. See Amazon EC2 pricing for more information.

Automate and accelerate the way you move data with AWS DataSync.

— Danilo

Migrating petabytes of data from on-premises file systems to Amazon FSx for Lustre

2022-03-18 Vimala Pydi

Post Syndicated from Vimala Pydi original https://aws.amazon.com/blogs/architecture/migrating-petabytes-of-data-from-on-premises-file-systems-to-amazon-fsx-for-lustre/

Many organizations use the Lustre file system for Linux-based applications that require petabytes of data and high-performance storage. Lustre file systems are used in machine learning (ML), high performance computing (HPC), big data, and financial analytics. Many such high-performance workloads are being migrated to Amazon Web Services (AWS) to take advantage of the scalability, elasticity, and agility that AWS offers. Amazon FSx for Lustre is a fully managed service that provides cost-effective, high-performance, and scalable storage for Lustre file systems on AWS.

AWS DataSync is an AWS managed service for copying data to and from Amazon FSx for Lustre. It provides high-speed transfer through its use of compression and parallel transfer mechanism and integrates with Amazon CloudWatch for observability.

This blog will show you how to migrate petabytes of data files from on-premises to Amazon FSx for Lustre using AWS DataSync. It will provide an overview of Amazon CloudWatch metrics and logs to help you monitor your data transfer using AWS DataSync and metrics from Amazon FSx for Lustre.

Solution overview for file storage data migration

The high-level architecture diagram in Figure 1 depicts file storage data migration from an on-premises data center to Amazon FSx for Lustre using AWS DataSync.

Following are the steps for the migration:

Create an Amazon FSx file system.
Install the AWS DataSync agent on premises to connect to the AWS DataSync service over a secure TLS connection.
Configure source and target locations to create an AWS DataSync task.
Configure and start the AWS DataSync task to migrate the data from on-premises to Amazon FSx for Lustre.

Figure 1. Architecture diagram for transferring files on-premises to Amazon FSx for Lustre using AWS DataSync

Prerequisites

On-premises hypervisor or virtual machine
The necessary network communications between the AWS DataSync agent and AWS as detailed in AWS DataSync network requirements
AWS Management Console access to AWS DataSync, Amazon FSx for Lustre, and Amazon CloudWatch

Steps for migration

1. Create an Amazon FSx file system

To start the migration, create a Lustre file system in Amazon FSx service and follow the step-by-step guidance provided in Getting started with Amazon FSx for Lustre.

For this blog, the target is an FSx for Lustre file system with the ‘Persistent 2’ deployment type and a storage capacity of 1.2 TB (Figure 2).

Figure 2. FSx for Lustre target file system

2. Install AWS DataSync agent on-premises

Follow steps in the article: Getting started with AWS DataSync to get started with the AWS DataSync service. Configure the source system to migrate the file system data using the following steps:

Deploy an AWS DataSync agent on-premises on a supported virtual machine or hypervisor (Figure 3.)
Configure the AWS DataSync agent from AWS Management Console.
Activate the AWS DataSync agent configured from the preceding step.

Figure 3. Create AWS DataSync agent

3. Configure source and destination locations

A DataSync task consists of a pair of locations between which data is transferred. The source location defines the storage system that you want to read from. The destination location defines the storage service that you want to write data to. Here the source location is an on-premises Lustre system and the destination location is the Amazon FSx for Lustre service (Figure 4.)

Figure 4. Configure source and destination location for AWS DataSync task

4. Configure and start task

A task is a set of two locations (source and destination) and a set of options that you use to control the behavior of the task. Create a task with the source and destination locations and choose Start from the Actions menu (Figure 5.)

Figure 5. Start task

Wait for the task status to change to Running (Figure 6.)

Figure 6. Task status

To check the details of the task completion, select the task and click on the History tab (Figure 7.) The status should show Success once the task successfully completes the migration.

Figure 7. Task history
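The console steps above can also be scripted. The following is a minimal sketch, not the method used in this post: it assumes a client object that behaves like boto3.client("datasync") and an existing task ARN. The function name run_task_and_wait and the choice to treat only SUCCESS and ERROR as final states are assumptions made for illustration.

```python
import time

# States treated as final in this sketch (an assumption; other states keep polling).
TERMINAL_STATES = {"SUCCESS", "ERROR"}


def run_task_and_wait(datasync, task_arn, poll_seconds=30, sleep=time.sleep):
    """Start a DataSync task execution and poll until it finishes.

    `datasync` is expected to behave like boto3.client("datasync").
    Returns the final status string reported for the execution.
    """
    response = datasync.start_task_execution(TaskArn=task_arn)
    execution_arn = response["TaskExecutionArn"]
    while True:
        status = datasync.describe_task_execution(
            TaskExecutionArn=execution_arn
        )["Status"]
        if status in TERMINAL_STATES:
            return status
        # Still queued, launching, preparing, transferring, or verifying.
        sleep(poll_seconds)
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real account.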

Monitoring the file transfer

Amazon CloudWatch is the AWS native observability service. It collects and processes raw data from AWS services such as Amazon FSx for Lustre and AWS DataSync into readable, near real-time metrics. It provides metrics that you can use to get more visibility into the data transfer. For a full list of CloudWatch metrics for AWS DataSync and Amazon FSx for Lustre, read Monitoring AWS DataSync and Monitoring Amazon FSx for Lustre.

Amazon FSx for Lustre can also provide various metrics, for example, the number of read or write operations using DataReadOperations and DataWriteOperations. To find the total storage available you can check the metric FreeDataStorageCapacity (Figure 8.)

Figure 8. CloudWatch metrics for Amazon FSx for Lustre

The AWS DataSync metric FilesTransferred gives the actual number of files or metadata that were transferred over the network. BytesTransferred provides the total number of bytes transferred over the network when the agent reads from the source location and writes to the destination location.
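As an illustration, the request for hourly BytesTransferred totals could be assembled as follows. This is a sketch under assumptions: the namespace default "AWS/DataSync" and the TaskId dimension should be checked against the Monitoring AWS DataSync documentation, and the helper name bytes_transferred_query is hypothetical. The returned dictionary is shaped for the CloudWatch GetMetricStatistics operation.

```python
from datetime import datetime, timedelta, timezone


def bytes_transferred_query(task_id, hours=24, period_seconds=3600,
                            namespace="AWS/DataSync", now=None):
    """Build keyword arguments for CloudWatch GetMetricStatistics.

    The namespace and the TaskId dimension name are assumptions to verify
    against the DataSync monitoring documentation.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": "BytesTransferred",
        "Dimensions": [{"Name": "TaskId", "Value": task_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,   # one data point per hour by default
        "Statistics": ["Sum"],      # total bytes per period
    }
```

With boto3 installed, the result can be unpacked into the get_metric_statistics call of a CloudWatch client.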

A robust monitoring system can be built by setting up an automated notification process for any errors or issues in the data transfer task. Integrate Amazon CloudWatch in combination with the Amazon Simple Notification Service (SNS). Figure 9 depicts the AWS DataSync logs in Amazon CloudWatch.

Figure 9. AWS DataSync logs in Amazon CloudWatch

You can also gather insights from the data transfer logs using CloudWatch Logs Insights. CloudWatch Logs Insights enables you to quickly search and query your log data (Figure 10.) You can set a metric filter for error codes and alert the appropriate team.

Figure 10. Amazon CloudWatch Logs Insights for querying logs
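A query like the one in Figure 10 can also be prepared in code. The sketch below builds the parameters for the CloudWatch Logs StartQuery operation. The log group name is whatever you configured for the task, and matching on the word "error" is an assumption about how failures appear in your DataSync logs.

```python
def error_query_params(log_group, start_epoch, end_epoch, limit=20):
    """Build keyword arguments for CloudWatch Logs StartQuery.

    start_epoch and end_epoch are Unix timestamps in seconds.
    """
    query = (
        "fields @timestamp, @message "
        "| filter @message like /(?i)error/ "  # case-insensitive match (assumed pattern)
        "| sort @timestamp desc "
        f"| limit {limit}"
    )
    return {
        "logGroupName": log_group,
        "startTime": start_epoch,
        "endTime": end_epoch,
        "queryString": query,
    }
```

The dictionary can be unpacked into the start_query call of a boto3 CloudWatch Logs client, and the results fetched afterwards with get_query_results.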

Cleanup

If you are no longer using the resources discussed in this blog, remove the unneeded AWS resources to avoid incurring charges. After finishing the file transfer, clean up resources by deleting the Amazon FSx file system and AWS DataSync objects (DataSync agent, task, source location, and destination location.)

Conclusion

In this post, we demonstrated how we can accelerate migration of Lustre files from on-premises into Amazon FSx for Lustre using AWS DataSync. As a fully managed service, AWS DataSync securely and seamlessly connects to your Amazon FSx for Lustre file system. This makes it possible for you to move millions of files and petabytes of data without the need for deploying or managing infrastructure in the cloud. We walked through different observability metrics with Amazon CloudWatch integration to provide performance metrics, logging, and events. This can further help to speed up critical hybrid cloud storage workflows in industries that must move active files into AWS quickly. This capability is available in Regions where AWS DataSync and Amazon FSx for Lustre are available. For further details on using this cost-effective service, see Amazon FSx for Lustre pricing and AWS DataSync pricing.

Creating a Multi-Region Application with AWS Services – Part 2, Data and Replication

2022-01-12 Joe Chapman

Post Syndicated from Joe Chapman original https://aws.amazon.com/blogs/architecture/creating-a-multi-region-application-with-aws-services-part-2-data-and-replication/

In Part 1 of this blog series, we looked at how to use AWS compute, networking, and security services to create a foundation for a multi-Region application.

Data is at the center of many applications. In this post, Part 2, we will look at AWS data services that offer native features to help get your data where it needs to be.

In Part 3, we’ll look at AWS application management and monitoring services to help you build, monitor, and maintain a multi-Region application.

Considerations with replicating data

Data replication across the AWS network can happen quickly, but we are still limited by the speed of light. For this reason, data consistency must be considered when building a multi-Region application. Generally speaking, the longer a physical distance is, the longer it will take the data to get there.

When building a distributed system, consider the consistency, availability, partition tolerance (CAP) theorem. This theorem states that a distributed system can guarantee only two of the three properties at the same time, so tradeoffs should be considered.

Consistency – all clients always have the same view of data
Availability – all clients can always read and write data
Partition Tolerance – the system will continue to work despite physical partitions

Achieving consistency and availability is common for single-Region applications, for example, when an application connects to a single in-Region database. However, this becomes more difficult with multi-Region applications due to the latency added by transferring data over long distances. For this reason, highly distributed systems will typically follow an eventual consistency approach, favoring availability and partition tolerance.
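To make the tradeoff concrete, here is a toy model of eventual consistency. It is not an AWS API: the class ReplicatedValue and its fixed replication lag are hypothetical, and timestamps are passed in explicitly so the behavior is deterministic.

```python
class ReplicatedValue:
    """Toy model: a value written in a primary Region reaches a replica after a lag."""

    def __init__(self, lag_seconds):
        self.lag = lag_seconds
        self.primary = None
        self.replica = None
        self.pending = []  # (visible_at, value) pairs not yet applied to the replica

    def write(self, value, now):
        # The primary sees the write immediately; the replica sees it after the lag.
        self.primary = value
        self.pending.append((now + self.lag, value))

    def read_replica(self, now):
        # Apply every replicated write whose lag has elapsed, in order.
        while self.pending and self.pending[0][0] <= now:
            self.replica = self.pending.pop(0)[1]
        return self.replica
```

A read from the replica before the lag has elapsed returns the previous value, even though the primary already holds the new one. Once the lag has passed, both Regions agree again.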

Replicating objects and files

To ensure objects are in multiple Regions, Amazon Simple Storage Service (Amazon S3) can be set up to replicate objects across AWS Regions automatically with one-way or two-way replication. A subset of objects in an S3 bucket can be replicated with S3 replication rules. If low replication lag is critical, S3 Replication Time Control can help meet requirements by replicating 99.99% of objects within 15 minutes, and most within seconds. To monitor the replication status of objects, Amazon S3 events and metrics will track replication and can send an alert if there’s an issue.
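As a sketch, a replication rule for a subset of objects with S3 Replication Time Control enabled could be described like this. The helper name replication_config and the rule ID format are hypothetical. The dictionary follows the shape expected by the S3 PutBucketReplication operation, and both buckets must have versioning enabled.

```python
def replication_config(role_arn, destination_bucket_arn, prefix=""):
    """Build a ReplicationConfiguration for S3 PutBucketReplication.

    Replicates objects under `prefix` (all objects if empty) and enables
    S3 Replication Time Control with its 15-minute threshold.
    """
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "replicate-" + (prefix or "all"),
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": destination_bucket_arn,
                    "ReplicationTime": {
                        "Status": "Enabled",
                        "Time": {"Minutes": 15},
                    },
                    # Replication metrics are required when Replication Time Control is on.
                    "Metrics": {
                        "Status": "Enabled",
                        "EventThreshold": {"Minutes": 15},
                    },
                },
            }
        ],
    }
```

With boto3, the result would be passed as the ReplicationConfiguration argument of put_bucket_replication on the source bucket.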

Traditionally, each S3 bucket has its own single, Regional endpoint. To simplify connecting to and managing multiple endpoints, S3 Multi-Region Access Points create a single global endpoint spanning multiple S3 buckets in different Regions. When applications connect to this endpoint, it will route over the AWS network using AWS Global Accelerator to the bucket with the lowest latency. Failover routing is also automatically handled if the connectivity or availability to a bucket changes.

For files stored outside of Amazon S3, AWS DataSync simplifies, automates, and accelerates moving file data across Regions and accounts. It supports homogeneous and heterogeneous file migrations across Elastic File System (Amazon EFS), Amazon FSx, AWS Snowcone, and Amazon S3. It can even be used to sync on-premises files stored on NFS, SMB, HDFS, and self-managed object storage to AWS for hybrid architectures.

File and object replication should be expected to be eventually consistent. The rate at which a given dataset can transfer is a function of the amount of data, I/O bandwidth, network bandwidth, and network conditions.

Copying backups

Scheduled backups can be set up with AWS Backup, which automates backups of your data to meet business requirements. Backup plans can automate copying backups to one or more AWS Regions or accounts. A growing number of services are supported, and this can be especially useful for services that don’t offer real-time replication to another Region such as Amazon Elastic Block Store (Amazon EBS) and Amazon Neptune.

Figure 1 shows how these data transfer services can be combined for each resource.

Figure 1. Storage replication services

Spanning non-relational databases across Regions

Amazon DynamoDB global tables provide multi-Region and multi-writer features to help you build global applications at scale. A DynamoDB global table is the only AWS managed offering that allows for multiple active writers in a multi-Region topology (active-active and multi-Region). This allows for applications to read and write in the Region closest to them, with changes automatically replicated to other Regions.

Global reads and fast recovery for Amazon DocumentDB (with MongoDB compatibility) can be achieved with global clusters. These clusters have a primary Region that handles write operations. Dedicated storage-based replication infrastructure enables low-latency global reads with a lag of typically less than one second.

Keeping in-memory caches warm with the same data across Regions can be critical to maintain application performance. Amazon ElastiCache for Redis offers global datastore to create a fully managed, fast, reliable, and secure cross-Region replica for Redis caches and databases. With global datastore, writes occurring in one Region can be read from up to two other cross-Region replica clusters – eliminating the need to write to multiple caches to keep them warm.

Spanning relational databases across Regions

For applications that require a relational data model, Amazon Aurora global database provides for scaling of database reads across Regions in Aurora PostgreSQL-compatible and MySQL-compatible editions. Dedicated replication infrastructure utilizes physical replication to achieve consistently low replication lag that outperforms the built-in logical replication database engines offer, as shown in Figure 2.

Figure 2. SysBench OLTP (write-only) stepped every 600 seconds on R4.16xlarge

With Aurora global database, one primary Region is designated as the writer, and secondary Regions are dedicated to reads. Aurora MySQL supports write forwarding, which forwards write requests from a secondary Region to the primary Region to simplify logic in application code. Failover testing can happen by utilizing managed planned failover, which will change the active write cluster to another Region while keeping the replication topology intact. All databases discussed in this post employ eventual consistency when used across Regions, but Aurora PostgreSQL has an option to set the maximum replica lag allowed with managed recovery point objective (managed RPO).

Logical replication, which utilizes a database engine’s built-in replication technology, can be set up for Amazon Relational Database Service (Amazon RDS) for MariaDB, MySQL, Oracle, PostgreSQL, and Aurora databases. A cross-Region read replica will receive these changes from the writer in the primary Region. For applications built on RDS for Microsoft SQL Server, cross-Region replication can be achieved by utilizing the AWS Database Migration Service. Cross-Region replicas allow for quicker local reads and can reduce data loss and recovery times in the case of a disaster by being promoted to a standalone instance.

For situations where a longer RPO and recovery time objective (RTO) are acceptable, backups can be copied across Regions. This is true for all of the relational and non-relational databases mentioned in this post, except for ElastiCache for Redis. Amazon Redshift can also automatically do this for your data warehouse. Backup copy times will vary depending on size and change rates.

A purpose-built database strategy offers many benefits. Figure 3 shows how these services form a purpose-built global database architecture.

Figure 3. Purpose-built global database architecture

Summary

Data is at the center of almost every application. In this post, we reviewed AWS services that offer cross-Region data replication to get your data where it needs to be quickly. Whether you need faster local reads, an active-active database, or simply need your data durably stored in a second Region, we have a solution for you. In the 3rd and final post of this series, we’ll cover application management and monitoring features.

Ready to get started? We’ve chosen some AWS Solutions, AWS Blogs, and Well-Architected labs to help you!

Migrating to an Amazon Redshift Cloud Data Warehouse from Microsoft APS

2021-11-09 Sudarshan Roy

Post Syndicated from Sudarshan Roy original https://aws.amazon.com/blogs/architecture/migrating-to-an-amazon-redshift-cloud-data-warehouse-from-microsoft-aps/

Before cloud data warehouses (CDWs), many organizations used hyper-converged infrastructure (HCI) for data analytics. HCIs pack storage, compute, networking, and management capabilities into a single “box” that you can plug into your data centers. However, because of its legacy architecture, an HCI is limited in how much it can scale storage and compute and continue to perform well and be cost-effective. Using an HCI can impact your business’s agility because you need to plan in advance, follow traditional purchase models, and maintain unused capacity and its associated costs. Additionally, HCIs are often proprietary and do not offer the same portability, customization, and integration options as with open-standards-based systems. Because of their proprietary nature, migrating HCIs to a CDW can present technical hurdles, which can impact your ability to realize the full potential of your data.

One of these hurdles involves the AWS Schema Conversion Tool (AWS SCT). AWS SCT is used to migrate data warehouses, and it supports several conversions. However, when you migrate Microsoft’s Analytics Platform System (APS) SQL Server Parallel Data Warehouse (PDW) platform using only AWS SCT, it results in connection errors due to the lack of server-side cursor support in Microsoft APS. In this blog post, we show you three approaches that use AWS SCT combined with other AWS services to migrate the Microsoft APS PDW HCI platform to Amazon Redshift. These solutions will help you overcome elasticity, scalability, and agility constraints associated with proprietary HCI analytics platforms and future-proof your analytics investment.

AWS Schema Conversion Tool

Though using AWS SCT alone will result in server-side cursor errors, you can pair it with other AWS services to migrate your data warehouses to AWS. AWS SCT converts source database schema and code objects, including views, stored procedures, and functions, to be compatible with a target database. It highlights objects that require manual intervention. You can also scan your application source code for embedded SQL statements as part of a database-schema conversion project. During this process, AWS SCT optimizes cloud-native code by converting legacy Oracle and SQL Server functions to their equivalent AWS service. This helps you modernize applications simultaneously. Once conversion is complete, AWS SCT can also migrate data.

Figure 1 shows a standard AWS SCT implementation architecture.

Figure 1. AWS SCT migration approach

The next section shows you how to pair AWS SCT with other AWS services to migrate a Microsoft APS PDW to Amazon Redshift CDW. We provide a base approach and two extensions to use for data warehouses with larger datasets and longer release outage windows.

Migration approach using SQL Server on Amazon EC2

The base approach uses Amazon Elastic Compute Cloud (Amazon EC2) to host a SQL Server in a symmetric multi-processing (SMP) architecture that is supported by AWS SCT, as opposed to Microsoft’s APS PDW’s massively parallel processing (MPP) architecture. By changing the warehouse’s architecture from MPP to SMP and using AWS SCT, you’ll avoid server-side cursor support errors.

Here’s how you’ll set up the base approach (Figure 2):

Set up the SMP SQL Server on Amazon EC2 and AWS SCT in your AWS account.
Set up Microsoft tools, including SQL Server Data Tools (SSDT), remote table copy, and SQL Server Integration Services (SSIS).
Use the Application Diagnostic Utility (ADU) and SSDT to connect and extract table lists, indexes, table definitions, view definitions, and stored procedures.
Generate data definition language (DDL) scripts using step 3 outputs.
Apply these DDLs to the SMP SQL Server on Amazon EC2.
Run AWS SCT against the SMP SQL database to begin migrating schema and data to Amazon Redshift.
Extract data using remote table copy from source, which copies data into the SMP SQL Server.
Load this data into Amazon Redshift using AWS SCT or AWS Database Migration Service (AWS DMS).
Use SSIS to load delta data from source to the SMP SQL Server on Amazon EC2.

Figure 2. Base approach using SMP SQL Server on Amazon EC2

Extending the base approach

The base approach overcomes server-side issues you would have during a direct migration. However, many organizations host terabytes (TB) of data. To migrate such a large dataset, you’ll need to adjust your approach.

The following sections extend the base approach. They still use the base approach to convert the schema and procedures, but the dataset is handled via separate processes.

Extension 1: AWS Snowball Edge

Note: AWS Snowball Edge is a Region-specific service. Verify that the service is available in your Region before planning your migration. See Regional Table to verify availability.

Snowball Edge lets you transfer large datasets to the cloud at faster-than-network speeds. Each Snowball Edge device can hold up to 100 TB and uses 256-bit encryption and an industry-standard Trusted Platform Module to ensure security and full chain-of-custody for your data. Furthermore, higher volumes can be transferred by clustering 5–10 devices for increased durability and storage.

Extension 1 enhances the base approach to allow you to transfer large datasets (Figure 3) while simultaneously setting up an SMP SQL Server on Amazon EC2 for delta transfers. Here’s how you’ll set it up:

Once Snowball Edge is enabled in the on-premises environment, it allows data transfer via network file system (NFS) endpoints. The device can then be used with standard Microsoft tools like SSIS, remote table copy, ADU, and SSDT.
While the device is being shipped back to an AWS facility, you’ll set up an SMP SQL Server database on Amazon EC2 to replicate the base approach.
After your data is converted, you’ll apply a converted schema to Amazon Redshift.
Once the Snowball Edge arrives at the AWS facility, data is transferred to the SMP SQL Server database.
You’ll subsequently run schema conversions and initial and delta loads per the base approach.

Figure 3. Solution extension that uses Snowball Edge for large datasets

Note: Overlapping sequence numbers in the diagram indicate steps that can run in parallel.

Extension 1 transfers initial load and later applies delta load. This adds time to the project because of longer cutover release schedules. Additionally, you’ll need to plan for multiple separate outages, Snowball lead times, and release management timelines.

Note that not all analytics systems are classified as business-critical systems, so they can withstand a longer outage, typically 1-2 days. This gives you an opportunity to use AWS DataSync as an additional extension to complete initial and delta load in a single release window.

Extension 2: AWS DataSync

DataSync speeds up data transfer between on-premises environments and AWS. It uses a purpose-built network protocol and a parallel, multi-threaded architecture to accelerate your transfers.

Figure 4 shows the solution extension, which works as follows:

Create SMP MS SQL Server on EC2 and the DDL, as shown in the base approach.
Deploy DataSync agent(s) in your on-premises environment.
Provision and mount an NFS volume on the source analytics platform and DataSync agent(s).
Define a DataSync transfer task after the agents are registered.
Extract initial load from source onto the NFS mount that will be uploaded to Amazon Simple Storage Service (Amazon S3).
Load data extracts into the SMP SQL Server on Amazon EC2 instance (created using base approach).
Run delta loads per base approach, or continue using solution extension for delta loads.

Figure 4. Solution extension that uses DataSync for large datasets

Note: Overlapping sequence numbers in the diagram indicate steps that can run in parallel.

Transfer rates for DataSync depend on the amount of data, I/O, and network bandwidth available. A single DataSync agent can fully utilize a 10 gigabit per second (Gbps) AWS Direct Connect link to copy data from on-premises to AWS. As such, calculate the expected transfer time for your initial load size before finalizing transfer windows.
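A first estimate of the transfer window is simple arithmetic: data size in bits divided by the usable link rate. The sketch below assumes decimal terabytes and a utilization factor you choose yourself. It ignores compression, per-file overhead, and source I/O limits, so treat the result as a lower bound to refine with a test transfer. The function name transfer_hours is hypothetical.

```python
def transfer_hours(data_tb, link_gbps, utilization=0.8):
    """Estimate hours needed to move `data_tb` terabytes over a network link.

    data_tb     -- dataset size in decimal terabytes (1 TB = 10**12 bytes)
    link_gbps   -- nominal link speed in gigabits per second
    utilization -- fraction of the link the transfer can actually use (assumed)
    """
    if data_tb < 0 or link_gbps <= 0:
        raise ValueError("data_tb must be >= 0 and link_gbps must be > 0")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 3600
```

For example, 100 TB over a fully utilized 10 Gbps link works out to roughly 22 hours, which already approaches the 1-2 day outage window mentioned above once verification and loading are added.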

Conclusion

The approach and its extensions mentioned in this blog post provide mechanisms to migrate your Microsoft APS workloads to an Amazon Redshift CDW. They enable elasticity, scalability, and agility for your workload to future proof your analytics investment.

Speed Up Translation Jobs with a Fully Automated Translation System Assistant

2021-09-15 Narcisse Zekpa

Post Syndicated from Narcisse Zekpa original https://aws.amazon.com/blogs/architecture/speed-up-translation-jobs-with-a-fully-automated-translation-system-assistant/

Like other industries, translation and localization companies face the challenge of providing fast delivery at a low cost. To address this challenge, organizations use Machine Translation (MT) to complement their translator teams. MT is the use of automated software that translates text without the need of human involvement.

One of the most recent advancements is Active Custom Translation (ACT). ACT helps tailor translated text to a specific language style or terminology, per customer specifications. In the past, organizations built custom models to include ACT in their translation system. Amazon Translate has an Active Custom Translation feature, which helps customers integrate configurable MT capabilities into their translation systems, without needing to build it themselves.

This blog describes an end-to-end automated translation flow, including guidelines to manage the data involved in the ACT process. The solution combines Amazon Translate with other Amazon Web Services (AWS) such as AWS DataSync and AWS Lambda. Before exploring this architecture, let’s explain a few basic concepts specific to the translation and localization industry.

Standard translation concepts

Translation Memory. It is common to reuse previously generated outputs as components for machine translation systems. This data is commonly called Translation Memory, and is stored and exchanged according to standardized formats (TMX, TSV, or CSV).

Source Text. Translation input data is commonly exchanged as XML Localization Interchange File Format (XLIFF) documents. Amazon Translate recently added the support of XLIFF documents for batch processing.

Figure 1 illustrates a standard translation flow involving machine translation and translation memory. Once the output has been reviewed and finalized, it is part of the company’s intellectual property (IP). It can then be reincorporated into the flywheel as an input to future translation jobs.

Figure 1: Translation workflow using machine translation

Translation assistant solution walkthrough

When using Amazon Translate in batch mode, you must:

Gather together and make translation input data available to the Translation job
Monitor the processing and retrieval of the output
Implement improvised processes to integrate your Translation Management System (TMS) with AWS, as needed

As you can see, this can involve many manual steps. You must download huge files, upload them into Amazon Simple Storage Service (S3), and configure jobs. The solution shown in Figure 2 illustrates these automation activities.

Figure 2: Automated batch ACT translation solution architecture

Translation automation activities:

Upload the translation job input data (source files, custom terminology, translation memory files).
Initiate the preprocessing step. Scan input files and identify language pairs.
Create an Amazon Simple Queue Service (SQS) message per language pair and translation project.
Create S3 buckets and prefixes for each translation job.
Create an Amazon Translate job.
Initiate a post-processing workflow, see Figure 3 (AWS Step Functions).
Copy the Translation output into the output bucket.
Publish an Amazon SNS notification to inform on job completion status.
Download translated files back into customer environment.

In this scenario, translators are operating from their company’s internal infrastructure, although their TMS can also be hosted on the cloud. They first collect the translation input data from their TMS and drop the files onto a shared file server. These files can be XLIFF, TMX, or CSV. We use AWS DataSync to orchestrate and initiate the data transfer from on-premises into an Amazon S3 staging bucket. AWS DataSync provides a few advantages:

A low code solution that manages the upload/download of translation data from/to AWS
The ability to schedule the synchronization for both upstream and downstream and control the frequency. This allows for batching translation jobs and optimizes usage and cost for Amazon Translate
A single point of access to translation data, which reduces the need to manage user accounts and grants access to the data

Once the files are uploaded into the input bucket, DataSync generates an event through Amazon EventBridge. This notification invokes an AWS Lambda function that pushes a message into an Amazon SQS queue. The message contains the list of files to be translated in the current batch. SQS decouples the data upload from the actual processing. Using this workflow provides scalability, service quota limit control, and better error handling.
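The Lambda function in this step might look like the following sketch. It is not the code behind this solution: the event field detail.State, the message layout, and the list_keys callable that returns the files of the current batch are all assumptions. The sqs argument is expected to behave like boto3.client("sqs").

```python
import json


def build_batch_message(bucket, keys):
    """Serialize one batch of uploaded files as an SQS message body."""
    return json.dumps({"bucket": bucket, "files": sorted(keys)})


def make_handler(sqs, queue_url, bucket, list_keys):
    """Return a Lambda-style handler bound to a queue and a staging bucket.

    `list_keys` is a hypothetical callable returning the object keys that
    belong to the current batch (for example, by listing the input prefix).
    """
    def handler(event, context):
        # Only act on successful task executions (assumed event shape).
        if event.get("detail", {}).get("State") != "SUCCESS":
            return {"queued": 0}
        keys = list_keys()
        sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=build_batch_message(bucket, keys),
        )
        return {"queued": len(keys)}

    return handler
```

Building the handler through a factory keeps the queue URL and client out of global state, which makes the function straightforward to test with a stub client.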

The queue initiates another Lambda function that creates a file hierarchy in S3 for each translation job. File-naming conventions can be used as a key to separate jobs from each other. The function also prepares translation memory and custom terminology when required. Lastly, it creates and submits the translation job.
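As an illustration of using file names as a job key, the sketch below groups files by project and language pair. The naming convention project_source-target.extension (for example, acme_en-fr.xlf) is purely hypothetical. Substitute whatever convention your TMS exports.

```python
import re
from collections import defaultdict

# Hypothetical convention: <project>_<source>-<target>.<ext>, e.g. acme_en-fr.xlf
NAME_PATTERN = re.compile(
    r"^(?P<project>[^_/]+)_(?P<src>[a-z]{2})-(?P<tgt>[a-z]{2})\.\w+$"
)


def group_by_job(filenames):
    """Group file names into translation jobs keyed by (project, source, target)."""
    jobs = defaultdict(list)
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # ignore files that do not follow the convention
        key = (match["project"], match["src"], match["tgt"])
        jobs[key].append(name)
    return dict(jobs)
```

Each key can then map to its own S3 prefix, so source files, translation memory, and custom terminology for one job stay together.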

The post-processing AWS Step Functions workflow

Amazon Translate is able to generate events into EventBridge upon job completion or failure. We use this capability to invoke a post-processing AWS Step Functions workflow. For instance, some customers must flag machine translated segments within an XLIFF file, so their translators can quickly identify them for manual review.

The flow implemented in the state machine does the following (shown in Figure 3):

Verifies output of Amazon Translate. Checks for completeness, confirms all segments successfully translated
Enriches the translation data. Flags machine translated segments by comparing input and output
Copies output to staging bucket. Prepares for final upload
Sends SNS notifications to alert operators. Notifies that the batch is complete
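The first two steps of the state machine can be sketched on a simplified data model. Here each document is reduced to a dictionary mapping a segment ID to its target text, which is an assumption made for illustration. Real XLIFF parsing and the actual flagging attribute are left out.

```python
def flag_machine_translated(input_segments, output_segments):
    """Compare job input and output and classify segments.

    input_segments  -- segment id -> target text before the job ('' if untranslated)
    output_segments -- segment id -> target text returned by the translation job

    Returns (flagged, missing): ids whose target was changed by the job, and
    ids that came back without a translation.
    """
    flagged, missing = [], []
    for seg_id, original_target in input_segments.items():
        new_target = output_segments.get(seg_id, "")
        if not new_target:
            missing.append(seg_id)      # completeness check failed for this segment
        elif new_target != original_target:
            flagged.append(seg_id)      # produced by machine translation, needs review
    return flagged, missing
```

An empty missing list corresponds to a successful verification step, and the flagged list tells translators which segments to review manually.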

Figure 3: Post-processing Step Functions workflow

This solution is entirely serverless, which frees you from maintaining the infrastructure or software platform. You can focus on the core business logic, and what really differentiates you from your competitors.

As the number of translation projects grows over time, you can also take advantage of Amazon S3 storage classes to optimize document archiving. A translation service provider can define specific rules per customer or per project. These rules can be configured automatically as the data is copied into S3. The result is that files can be transferred to cheaper storage tiers with predefined retention periods.
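A per-customer or per-project rule of this kind could be generated as in the sketch below. The prefix layout customer/project/ and the helper name lifecycle_rule are hypothetical. The dictionary matches the shape of one entry in the Rules list of the S3 PutBucketLifecycleConfiguration operation.

```python
def lifecycle_rule(prefix, archive_after_days, expire_after_days,
                   storage_class="GLACIER"):
    """Build one S3 lifecycle rule for a customer or project prefix.

    Objects under `prefix` move to `storage_class` after `archive_after_days`
    and are deleted after `expire_after_days`.
    """
    if expire_after_days <= archive_after_days:
        raise ValueError("expiration must come after the archive transition")
    return {
        "ID": "archive-" + prefix.strip("/").replace("/", "-"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": archive_after_days, "StorageClass": storage_class}
        ],
        "Expiration": {"Days": expire_after_days},
    }
```

With boto3, a list of such rules would be wrapped as {"Rules": [...]} and passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration.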

Conclusion

In this blog, we’ve described a solution that helps you automate the collection and transfer of translation data. It also assists in the scheduling and orchestration of translation jobs. This leads to greater productivity, reduction in cost, and faster time-to-market. Using AWS, you can decrease maintenance, and create a highly scalable and cost-effective solution. Because of the AWS pay-as-you-go model, you can assess the price per project. This information can be used in your pricing model, and be passed along as service options to your own customers.

To get started with Amazon Translate or read more, check out these blogs:

Manage your Digital Microscopy Data using OMERO on AWS

2021-07-29 Travis Berkley

Post Syndicated from Travis Berkley original https://aws.amazon.com/blogs/architecture/manage-your-digital-microscopy-data-using-omero-on-aws/

The Open Microscopy Environment (OME) consortium develops open-source software and format standards for microscopy data. OME Remote Objects (OMERO) is an open source, image data management platform designed to support digital pathology and cellular biology studies. You can access, share, and work with various biological data. This can include histopathology, high content screening, electron microscopy, and even non-image genotype data. Deploying this open source tool on Amazon Web Services (AWS) allows you to access your image data in a secure central repository. You can take advantage of elastic storage by growing the archive as needed without provisioning excess storage beforehand. OMERO has a web interface, which facilitates data access and visualization. It also supports connection through the OMERO client or other third-party image analysis tools, like CellProfiler™, QuPath, Fiji, ImageJ, and others.

The challenge of microscopy data

Saint Louis University (SLU) School of Medicine Research Microscopy and Histology Core required a centralized system for both distribution and hosting. The solution must provide research imaging distribution to both internal and external clients. It also needed the capability of hosting an educational platform for microscope images. SLU decided that the open source software OMERO was an ideal fit for them.

In order to provide speed, ease of access, and security for the University’s computer networks, SLU decided the solution must be hosted in the cloud. By partnering with AWS, SLU established a robust system for their clients. The privately hosted images on OMERO represent research material databases used by University researchers. OMERO also hosts teaching datasets for resident and fellow education. Other publicly hosted repositories provide access to source images for future publishing standards and regulations. SLU reported that the implementation was extraordinarily smooth for a non-programmer. In addition, the system design allowed for advanced data management to control costs and security.

Reviewing the OMERO architecture

OMERO is a typical three-tier web application, consisting of the following components:

OMERO.web provides access to OMERO’s data hierarchies and also enables annotation, organization, and visualization of data. This web browser-based client of OMERO.server exposes the annotation-based data-sharing mechanism.
OMERO.server is a middleware server application that provides access to image data and metadata stored in a series of databases. It contains a multi-threaded, image-rendering engine and supports a wide range (>140) of image pyramid formats through the Bio-Formats Java library. This Java application facilitates remote access and interoperability for modern scientific studies. It also exposes an API to allow any OMERO client to access the original data and any derived measurements.
OMERO relational database (PostgreSQL) provides the underlying storage facilities. This storage backend contains the processed metadata associated with the binary images, measurement specification, user information, structured annotations, and more.

Figure 1. Architectural diagram for a highly available (HA) deployment of OMERO on AWS including data ingestion options

To achieve the highly available (HA) deployment in the diagram, follow the guidance from this GitHub repository. Since OMERO only supports one writer per mounted network file share, there is one OMERO read+write server and one read-only server in the HA deployment. Otherwise, multiple instances will compete to get first access to Amazon Elastic File System (EFS). If HA is not a requirement, you can lower costs by deploying only the read+write OMERO.server.

OMERO is deployed on AWS using AWS CloudFormation (CFN) templates, which deploy two nested CFN stacks: one for storage and one for compute. The storage template creates an EFS volume and an Amazon Relational Database Service (RDS) instance of PostgreSQL. EFS provides the option to move files to an infrequent access storage class after a certain number of days to save on storage cost. RDS has a Multi-AZ option to improve business continuity.

The compute template creates Amazon Elastic Container Service (Amazon ECS) containers for the OMERO web and server functions. You have the option to deploy the OMERO containers on the AWS Fargate or Amazon EC2 launch type. The template also creates an Amazon Application Load Balancer (ALB) with duration-based stickiness enabled and an AWS Certificate Manager (ACM) certificate for Transport Layer Security (TLS) termination at the ALB. Only the ALB is publicly accessible, as the web portal is protected behind it in private subnets.

A VPC and subnets are required, which can be obtained via this CFN template. The deployment also requires the hosted zone ID and fully qualified domain name in Amazon Route 53, which are used to validate the TLS certificate. If higher security is not a requirement, there is an option to deploy without the registered domain and the hosted zone in Route 53. You will then be able to access OMERO.web through the Application Load Balancer DNS name without TLS encryption.
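The EFS option to move files to the infrequent access storage class is expressed as a lifecycle policy on the file system. The following is a minimal sketch of how such a policy could be set with the AWS SDK for Python, not the configuration used by the OMERO templates. The function names and the file system ID are hypothetical, and the table of transition values lists only a subset of what the EFS API accepts.

```python
# Subset of the TransitionToIA values accepted by the EFS lifecycle API.
IA_TRANSITIONS = {
    1: "AFTER_1_DAY",
    7: "AFTER_7_DAYS",
    14: "AFTER_14_DAYS",
    30: "AFTER_30_DAYS",
    60: "AFTER_60_DAYS",
    90: "AFTER_90_DAYS",
}


def build_lifecycle_request(file_system_id: str, days: int) -> dict:
    """Build the arguments for EFS put_lifecycle_configuration.

    Both the helper name and its parameters are illustrative assumptions.
    """
    if days not in IA_TRANSITIONS:
        raise ValueError(f"unsupported transition period: {days} days")
    return {
        "FileSystemId": file_system_id,
        "LifecyclePolicies": [{"TransitionToIA": IA_TRANSITIONS[days]}],
    }


def apply_lifecycle(file_system_id: str, days: int) -> dict:
    """Apply the policy; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the builder works without the SDK

    efs = boto3.client("efs")
    return efs.put_lifecycle_configuration(
        **build_lifecycle_request(file_system_id, days)
    )
```

For example, `build_lifecycle_request("fs-12345678", 30)` describes a policy that moves files not accessed for 30 days to the infrequent access class. Keeping the request builder separate from the API call makes it possible to review the policy before applying it.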

Additionally, the containers of OMERO.web and OMERO.server can be extended with plugins. The landing page for login can be customized with logos, brands, or disclaimers. Build a new Docker container image with specific configuration changes to enrich the functionality of this open source platform.

You can use Amazon ECS Exec to access the OMERO command line interface (CLI) and import images within the OMERO.server container, running on either the AWS Fargate or EC2 launch type. You can also run Amazon ECS Exec via AWS CloudShell. The OMERO CFN templates enable Amazon ECS Exec commands by default. You only need to install the AWS CLI and the Session Manager plugin on your clients or in AWS CloudShell to initiate the commands. When you import images within the OMERO.server container instances, you can use the OMERO in-place import to avoid redundant copies of the image files on Amazon EFS. Alternatively, you can access the Windows desktop OMERO client, OMERO.insight, via the application virtualization service Amazon AppStream 2.0. This connects to the OMERO.server in the same VPC. Amazon AppStream 2.0 allows Amazon S3 to be used as home folder storage, so you can import images directly from Amazon S3 to OMERO.server.
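As a rough illustration of the ECS Exec workflow, the sketch below assembles the `aws ecs execute-command` invocation and an in-place import command to run inside the container. The cluster, task, and container names are placeholders, the helper functions are hypothetical, and the `--transfer=ln_s` option is taken from the OMERO in-place import documentation, so confirm it against your OMERO version. Login options for the OMERO CLI are omitted.

```python
import shlex


def ecs_exec_argv(cluster: str, task: str, container: str,
                  command: str = "/bin/sh") -> list:
    """Return the argv for opening an ECS Exec session in a container."""
    return [
        "aws", "ecs", "execute-command",
        "--cluster", cluster,
        "--task", task,
        "--container", container,
        "--interactive",
        "--command", command,
    ]


def omero_inplace_import_cmd(image_path: str) -> str:
    """Return an OMERO CLI command that imports a file via symlink.

    Assumes the file already sits on the EFS mount visible to OMERO.server,
    so the import links to it instead of copying it.
    """
    return shlex.join(["omero", "import", "--transfer=ln_s", image_path])


if __name__ == "__main__":
    # Placeholder identifiers; replace with values from your own stack.
    argv = ecs_exec_argv("omero-cluster", "0123456789abcdef", "omero-server")
    print(shlex.join(argv))
    print(omero_inplace_import_cmd("/OMERO/uploads/slide 01.tiff"))
```

Once the interactive session is open, the import command can be pasted into the container shell. Building both commands as argument lists and quoting them with `shlex` avoids problems with spaces in file names.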

AWS offers multiple options to move your microscopy image data from on-premises facilities to cloud storage, as illustrated in Figure 1:

Use AWS Transfer Family to copy data directly from on-premises devices to EFS.
Alternatively, transfer data directly from your on-premises Network File System (NFS) to EFS using AWS DataSync. AWS DataSync can also be used to transfer files from S3 to EFS.
Set up AWS Storage Gateway, in particular File Gateway, to move your image files from on premises to Amazon Simple Storage Service (S3) first. A storage lifecycle policy can then archive images. You can track storage activity metrics using Amazon S3 Storage Lens and gain insights on storage cost using cost allocation tags. Once the files are in Amazon S3, you can either set up AWS DataSync to transfer files from S3 to EFS, or directly import files into OMERO.server.
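The S3 to EFS transfer with AWS DataSync mentioned above could be scripted along the following lines. This is a sketch under stated assumptions rather than part of the published solution: the function name, task name, and all ARNs are hypothetical, and the `datasync` argument is expected to be a boto3 DataSync client created by the caller (for example with `boto3.client("datasync")`).

```python
def sync_s3_to_efs(datasync, *, bucket_arn, access_role_arn, efs_arn,
                   subnet_arn, security_group_arns,
                   task_name="omero-s3-to-efs"):
    """Create S3 and EFS locations, then create and start a DataSync task.

    Returns the ARN of the started task execution.
    """
    # Source: the S3 bucket, read through an IAM role DataSync can assume.
    source = datasync.create_location_s3(
        S3BucketArn=bucket_arn,
        S3Config={"BucketAccessRoleArn": access_role_arn},
    )
    # Destination: the EFS file system, reached through a subnet and
    # security groups that allow NFS traffic to its mount target.
    destination = datasync.create_location_efs(
        EfsFilesystemArn=efs_arn,
        Ec2Config={
            "SubnetArn": subnet_arn,
            "SecurityGroupArns": list(security_group_arns),
        },
    )
    task = datasync.create_task(
        SourceLocationArn=source["LocationArn"],
        DestinationLocationArn=destination["LocationArn"],
        Name=task_name,
    )
    execution = datasync.start_task_execution(TaskArn=task["TaskArn"])
    return execution["TaskExecutionArn"]
```

Passing the client in as a parameter keeps the function free of credentials handling and makes it easy to exercise with a stub before running it against a real account.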

To find the latest developments for this solution, check out the digital pathology on AWS repository on GitHub.

Conclusion

Researchers and scientists at Saint Louis University were able to grow their image repository on AWS without the concern of fixed storage limits. They can scale their compute environment up or down as their research requirements dictate. Managed services, like Amazon ECS and RDS, significantly reduce the operational workload for researchers. SLU reports that this platform is of great use to their researchers. Other universities, academic medical centers, and pharmaceutical and biotechnology companies can also use this cloud-based image data management platform to collect, visualize, and share access to their image data assets.
