Using custom consumer group ID support for the AWS Lambda event sources for MSK and self-managed Kafka

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-custom-consumer-group-id-support-for-the-aws-lambda-event-sources-for-msk-and-self-managed-kafka/

This post is written by Adam Wagner, Principal Serverless Specialist SA.

AWS Lambda already supports Amazon Managed Streaming for Apache Kafka (MSK) and self-managed Apache Kafka clusters as event sources. Today, AWS adds support for specifying a custom consumer group ID for the Lambda event source mappings (ESMs) for MSK and self-managed Kafka event sources.

With this feature, you can create a Lambda ESM that uses a consumer group that has already been created. This enables you to use Lambda as a Kafka consumer for topics that are replicated with MirrorMaker v2 or with consumer groups you create to start consuming at a particular offset or timestamp.

Overview

This blog post shows how to use this feature to enable Lambda to consume a Kafka topic starting at a specific timestamp. This can be useful if you must reprocess some data but don’t want to reprocess all of the data in the topic.

In this example application, a client application writes to a topic on the MSK cluster. It creates a consumer group that points to a specific timestamp within that topic as the starting point for consuming messages. A Lambda ESM is created using that existing consumer group that triggers a Lambda function. This processes and writes the messages to an Amazon DynamoDB table.

Reference architecture

  1. A Kafka client writes messages to a topic in the MSK cluster.
  2. A Kafka consumer group is created with a starting point of a specific timestamp
  3. The Lambda ESM polls the MSK topic using the existing consumer group and triggers the Lambda function with batches of messages.
  4. The Lambda function writes the messages to DynamoDB

Step-by-step instructions

To get started, create an MSK cluster and a client Amazon EC2 instance from which to create topics and publish messages. If you don’t already have an MSK cluster, follow this blog on setting up an MSK cluster and using it as an event source for Lambda.

  1. On the client instance, set an environment variable to the MSK cluster bootstrap servers to make it easier to reference them in future commands:
    export MSKBOOTSTRAP='b-1.mskcluster.oy1hqd.c23.kafka.us-east-1.amazonaws.com:9094,b-2.mskcluster.oy1hqd.c23.kafka.us-east-1.amazonaws.com:9094,b-3.mskcluster.oy1hqd.c23.kafka.us-east-1.amazonaws.com:9094'
  2. Create the topic. This example uses a three-node MSK cluster, so the replication factor is set to three. The partition count is also set to three; in your applications, set it according to your throughput and parallelization needs.
    ./bin/kafka-topics.sh --create --bootstrap-server $MSKBOOTSTRAP --replication-factor 3 --partitions 3 --topic demoTopic01
  3. Write messages to the topic using this Python script:
    #!/usr/bin/env python3
    import json
    import time
    from random import randint
    from uuid import uuid4
    from kafka import KafkaProducer
    
    BROKERS = ['b-1.mskcluster.oy1hqd.c23.kafka.us-east-1.amazonaws.com:9094', 
            'b-2.mskcluster.oy1hqd.c23.kafka.us-east-1.amazonaws.com:9094',
            'b-3.mskcluster.oy1hqd.c23.kafka.us-east-1.amazonaws.com:9094']
    TOPIC = 'demoTopic01'
    
    producer = KafkaProducer(bootstrap_servers=BROKERS, security_protocol='SSL',
            value_serializer=lambda x: json.dumps(x).encode('utf-8'))
    
    def create_record(sequence_num):
        number = randint(1000000,10000000)
        record = {"id": sequence_num, "record_timestamp": int(time.time()), "random_number": number, "producer_id": str(uuid4()) }
        print(record)
        return record
    
    def publish_rec(seq):
        data = create_record(seq)
        producer.send(TOPIC, value=data).add_callback(on_send_success).add_errback(on_send_error)
        producer.flush()
    
    def on_send_success(record_metadata):
        print(record_metadata.topic, record_metadata.partition, record_metadata.offset)
    
    def on_send_error(excp):
        print('error writing to kafka:', excp)
    
    for num in range(1,10000000):
        publish_rec(num)
        time.sleep(0.5) 
    
  4. Copy the script into a file on the client instance named producer.py. The script uses the kafka-python library, so first create a virtual environment and install the library.
    python3 -m venv venv
    source venv/bin/activate
    pip3 install kafka-python
    
  5. Start the script. Leave it running for a few minutes to accumulate some messages in the topic.
    Output
  6. Previously, a Lambda ESM could only start consuming messages either from the beginning of the topic or with the latest messages. In this example, it starts consuming messages from a few hours in the past, at 16:00 UTC. To do this, first create a new consumer group on the client instance:
    ./bin/kafka-consumer-groups.sh --command-config client.properties --bootstrap-server $MSKBOOTSTRAP --topic demoTopic01 --group specificTimeCG --to-datetime 2022-08-10T16:00:00.000 --reset-offsets --execute
  7. In this case, specificTimeCG is the consumer group ID used when creating the Lambda ESM. Listing the consumer groups on the cluster shows the new group:
    ./bin/kafka-consumer-groups.sh --list --command-config client.properties --bootstrap-server $MSKBOOTSTRAP

    Output

  8. With the consumer group created, create the Lambda function along with the Event Source Mapping that uses this new consumer group. In this case, the Lambda function and DynamoDB table are already created. Create the ESM with the following AWS CLI Command:
    aws lambda create-event-source-mapping --region us-east-1 --event-source-arn arn:aws:kafka:us-east-1:0123456789:cluster/demo-us-east-1/78a8d1c1-fa31-4f59-9de3-aacdd77b79bb-23 --function-name msk-consumer-demo-ProcessMSKfunction-IrUhEoDY6X9N --batch-size 3 --amazon-managed-kafka-event-source-config '{"ConsumerGroupId":"specificTimeCG"}' --topics demoTopic01

    The event source in the Lambda console or CLI shows the starting position set to TRIM_HORIZON. However, if you specify a custom consumer group ID that already has existing offsets, those offsets take precedence.

  9. With the event source created, navigate to the DynamoDB console. Locate the DynamoDB table to see the records written by the Lambda function.
    DynamoDB table

Converting the record timestamp of the earliest record in DynamoDB, 1660147212, to a human-readable date shows that the first record was created on 2022-08-10T16:00:12.
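
A quick way to verify this conversion is with a couple of lines of Python using only the standard library:

from datetime import datetime, timezone

# Convert the epoch timestamp stored in DynamoDB to an ISO 8601 UTC date
print(datetime.fromtimestamp(1660147212, tz=timezone.utc).isoformat())
# prints 2022-08-10T16:00:12+00:00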

In this example, the consumer group is created before the Lambda ESM so that you can specify the timestamp to start from.

If you create an ESM and specify a custom consumer group ID that does not exist, it is created. This is a convenient way to create a new consumer group for an ESM with an ID of your choosing.
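
For reference, the same mapping can also be created programmatically. The following is a minimal boto3 sketch equivalent to the earlier CLI command; the cluster ARN, function name, and topic are taken from that command and should be replaced with your own values:

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Create an MSK event source mapping that reuses (or creates) the
# "specificTimeCG" consumer group.
response = lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kafka:us-east-1:0123456789:cluster/demo-us-east-1/78a8d1c1-fa31-4f59-9de3-aacdd77b79bb-23",
    FunctionName="msk-consumer-demo-ProcessMSKfunction-IrUhEoDY6X9N",
    Topics=["demoTopic01"],
    BatchSize=3,
    StartingPosition="TRIM_HORIZON",
    AmazonManagedKafkaEventSourceConfig={"ConsumerGroupId": "specificTimeCG"},
)
print(response["UUID"], response["State"])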

Deleting an ESM does not delete the consumer group, regardless of whether it is created before, or during, the ESM creation.

Using the AWS Serverless Application Model (AWS SAM)

To create the event source mapping with a custom consumer group using an AWS Serverless Application Model (AWS SAM) template, use the following snippet:

Events:
  MyMskEvent:
    Type: MSK
    Properties:
      Stream: !Sub arn:aws:kafka:${AWS::Region}:012345678901:cluster/demo-us-east-1/78a8d1c1-fa31-4f59-9de3-aacdd77b79bb-23
      Topics:
        - "demoTopic01"
      ConsumerGroupId: specificTimeCG

Other types of Kafka clusters

This example uses the custom consumer group ID feature when consuming a Kafka topic from an MSK cluster. In addition to MSK clusters, this feature also supports self-managed Kafka clusters. These could be clusters running on EC2 instances or managed Kafka clusters from a partner such as Confluent.

Conclusion

This post shows how to use the new custom consumer group ID feature of the Lambda event source mapping for Amazon MSK and self-managed Kafka. This feature can be used to consume messages with Lambda starting at a specific timestamp or offset within a Kafka topic. It can also be used to consume messages from a consumer group that is replicated from another Kafka cluster using MirrorMaker v2.

For more serverless learning resources, visit Serverless Land.

Amazon Redshift data sharing best practices and considerations

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/amazon-redshift-data-sharing-best-practices-and-considerations/

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Amazon Redshift data sharing allows for a secure and easy way to share live data for reading across Amazon Redshift clusters. It allows an Amazon Redshift producer cluster to share objects with one or more Amazon Redshift consumer clusters for read purposes without having to copy the data. With this approach, workloads isolated to different clusters can share and collaborate frequently on data to drive innovation and offer value-added analytic services to your internal and external stakeholders. You can share data at many levels, including databases, schemas, tables, views, columns, and user-defined SQL functions, to provide fine-grained access controls that can be tailored for different users and businesses that all need access to Amazon Redshift data. The feature itself is simple to use and integrate into existing BI tools.

In this post, we discuss Amazon Redshift data sharing, including some best practices and considerations.

How does Amazon Redshift data sharing work?

  • To achieve best-in-class performance, Amazon Redshift consumer clusters cache and incrementally update block-level metadata (referred to here as block metadata) for queried objects from the producer cluster. This works even when the producer cluster is paused.
  • The time taken to cache block metadata depends on the rate of data change on the producer since the respective objects were last queried on the consumer. (As of this writing, consumer clusters update their metadata cache for an object only on demand, that is, when the object is queried.)
  • If there are frequent DDL operations on the producer, the consumer is forced to re-cache the full block metadata for an object on its next access in order to stay consistent, because structure changes on the producer invalidate the existing metadata cache on the consumers.
  • Once the consumer's block metadata is in sync with the latest state of the object on the producer, the query runs like any other regular query (one referring to local objects).

Now that we have the necessary background on data sharing and how it works, let’s look at a few best practices across streams that can help improve workloads while using data sharing.

Security

In this section, we share some best practices for security when using Amazon Redshift data sharing.

Use INCLUDE NEW cautiously

INCLUDE NEW is a useful setting when adding a schema to a data share (ALTER DATASHARE). If set to TRUE, it automatically adds all objects created in that schema in the future to the data share. This might not be ideal in cases where you want fine-grained control over which objects are shared. In those cases, leave the setting at its default of FALSE.
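
As an illustration, the setting can be toggled with an ALTER DATASHARE statement, here submitted through the Amazon Redshift Data API from Python. This is a sketch; the data share, schema, cluster, database, and user names are placeholders:

import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

# Keep fine-grained control: leave INCLUDENEW at FALSE for the schema
# so newly created objects are not shared automatically.
client.execute_statement(
    ClusterIdentifier="producer-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="ALTER DATASHARE salesshare SET INCLUDENEW = FALSE FOR SCHEMA public;",
)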

Use views to achieve fine-grained access control

To achieve fine-grained access control for data sharing, you can create late-binding views or materialized views on shared objects on the consumer, and then grant users on the consumer cluster access to these views instead of full access to the original shared objects. This comes with its own set of considerations, which we explain later in this post.
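
A minimal sketch of this pattern, again via the Data API, might look like the following. It assumes a database named shared_db has already been created on the consumer from the data share; the schema, object, cluster, and user names are placeholders:

import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

statements = [
    # Late-binding view over a shared table (cross-database reference),
    # created locally on the consumer cluster.
    """CREATE VIEW analytics.customer_view AS
       SELECT customer_id, customer_name, region
       FROM shared_db.public.customer
       WITH NO SCHEMA BINDING;""",
    # Grant access to the view only, not to the underlying shared object.
    "GRANT SELECT ON analytics.customer_view TO reporting_user;",
]

for sql in statements:
    client.execute_statement(
        ClusterIdentifier="consumer-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )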

Audit data share usage and changes

Amazon Redshift provides an efficient way to audit all activity and changes related to a data share using system views such as SVL_DATASHARE_CHANGE_LOG, SVL_DATASHARE_USAGE_PRODUCER, and SVL_DATASHARE_USAGE_CONSUMER.

Performance

In this section, we discuss best practices related to performance.

Materialized views in data sharing environments

Materialized views (MVs) provide a powerful route to precompute complex aggregations for use cases where high throughput is needed, and you can directly share a materialized view object via data sharing as well.

For materialized views built on tables where there are frequent write operations, it’s ideal to create the materialized view object on the producer itself and share the view. This method gives us the opportunity to centralize the management of the view on the producer cluster itself.

For slowly changing data tables, you can share the table objects directly and build the materialized view on the shared objects directly on the consumer. This method gives us the flexibility of creating a customized view of data on each consumer according to your use case.

This can help optimize the block metadata download and caching times in the data sharing query lifecycle. This also helps in materialized view refreshes because, as of this writing, Redshift doesn’t support incremental refresh for MVs built on shared objects.

Factors to consider when using cross-Region data sharing

Data sharing is supported even if the producer and consumer are in different Regions. There are a few differences we need to consider while implementing a share across Regions:

  • Consumer data reads are charged at $5/TB for cross-Region data shares; data sharing within the same Region is free. For more information, refer to Managing cost control for cross-Region data sharing.
  • Performance will also vary compared to a data share within a single Region, because the block metadata exchange and data transfer between cross-Region shared clusters take more time due to network throughput.

Metadata access

There are many system views that help with fetching the list of shared objects a user has access to. Some of these views include all the objects from the database that you're currently connected to, objects from all the other databases you have access to on the cluster, and external objects. Examples include SVV_ALL_TABLES, SVV_ALL_COLUMNS, and SVV_ALL_SCHEMAS.

We suggest using very restrictive filtering while querying these views because a simple select * will result in an entire catalog read, which isn’t ideal. For example, take the following query:

select * from svv_all_tables;

This query will try to collect metadata for all the shared and local objects, making it very heavy in terms of metadata scans, especially for shared objects.

The following is a better query for achieving a similar result:

SELECT table_name,
       column_name,
       data_type
FROM svv_all_tables
WHERE table_name = <tablename>
  AND schema_name = <schemaname>
  AND database_name = <databasename>
ORDER BY ordinal_position;

This is a good practice to follow for all metadata views and tables; doing so allows seamless integration into several tools. You can also use the SVV_DATASHARE* system views to exclusively see shared object-related information.

Producer/consumer dependencies

In this section, we discuss the dependencies between the producer and consumer.

Impact of the consumer on the producer

Queries on the consumer cluster will have no impact in terms of performance or activity on the producer cluster. This is why we can achieve true workload isolation using data sharing.

Encrypted producers and consumers

Data sharing integrates seamlessly even if the producer and the consumer are encrypted using different AWS Key Management Service (AWS KMS) keys. Sophisticated, highly secure key exchange protocols facilitate this, so you don't have to worry about encryption at rest and other compliance dependencies. The only requirement is that the producer and consumer have a homogeneous encryption configuration (both encrypted or both unencrypted).

Data visibility and consistency

A data sharing query on the consumer can’t impact the transaction semantics on the producer. All the queries involving shared objects on the consumer cluster follow read-committed transaction consistency while checking for visible data for that transaction.

Maintenance

If you schedule manual VACUUM operations on shared objects as part of maintenance activities on the producer cluster, use VACUUM RECLUSTER whenever possible. This is especially important for large objects because it optimizes the number of data blocks the utility interacts with, which results in less block metadata churn compared to a full vacuum. This benefits data sharing workloads by reducing block metadata sync times.

Add-ons

In this section, we discuss additional add-on features for data sharing in Amazon Redshift.

Real-time data analytics using Amazon Redshift streaming data

Amazon Redshift recently announced the preview of streaming ingestion using Amazon Kinesis Data Streams. This eliminates the need for staging the data and helps achieve low-latency data access. The data generated via streaming into the Amazon Redshift cluster is exposed using a materialized view. You can share this materialized view like any other via a data share and use it to set up low-latency shared data access across clusters in minutes.

Amazon Redshift concurrency scaling to improve throughput

Amazon Redshift data sharing queries can utilize concurrency scaling to improve the overall throughput of the cluster. Enable concurrency scaling on the consumer cluster for queues where you expect a heavy workload, so that throughput is maintained when the cluster is under heavy load.

For more information about concurrency scaling, refer to Data sharing considerations in Amazon Redshift.

Amazon Redshift Serverless

Amazon Redshift Serverless clusters are ready for data sharing out of the box. A serverless cluster can also act as a producer or a consumer for a provisioned cluster. The following are the supported permutations with Redshift Serverless:

  • Serverless (producer) and provisioned (consumer)
  • Serverless (producer) and serverless (consumer)
  • Serverless (consumer) and provisioned (producer)

Conclusion

Amazon Redshift data sharing gives you the ability to fan out and scale complex workloads without worrying about workload isolation. However, like any system, not having the right optimization techniques in place could pose complex challenges in the long term as the systems grow in scale. Incorporating the best practices listed in this post presents a way to mitigate potential performance bottlenecks proactively in various areas.

Try data sharing today to unlock the full potential of Amazon Redshift, and please don’t hesitate to reach out to us in case of further questions or clarifications.


About the authors

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Sai Teja Boddapati is a Database Engineer based out of Seattle. He works on solving complex database problems to contribute to building the most user friendly data warehouse available. In his spare time, he loves travelling, playing games and watching movies & documentaries.

AWS CyberVadis report now available for due diligence on third-party suppliers

Post Syndicated from Andreas Terwellen original https://aws.amazon.com/blogs/security/aws-cybervadis-report-now-available-for-due-diligence-on-third-party-suppliers/

At Amazon Web Services (AWS), we’re continuously expanding our compliance programs to provide you with more tools and resources to perform effective due diligence on AWS. We’re excited to announce the availability of the AWS CyberVadis report to help you reduce the burden of performing due diligence on your third-party suppliers.

With the increase in adoption of cloud products and services across multiple sectors and industries, AWS is a critical component of customers’ third-party environments. Regulated customers, such as those in the financial services sector, are held to high standards by regulators and auditors when it comes to exercising effective due diligence on third parties.

Many customers use third-party cyber risk management (TPCRM) services such as CyberVadis to better manage risks from their evolving third-party environments and to drive operational efficiencies. To help with such efforts, AWS has completed the CyberVadis assessment of its security posture. CyberVadis security analysts perform the assessment and validate the results annually.

CyberVadis is a comprehensive third-party risk assessment process that combines the speed and scalability of automation with the certainty of analyst validation. The CyberVadis cybersecurity rating methodology assesses the maturity of a company’s information security management system (ISMS) through its policies, implementation measures, and results.

CyberVadis integrates responses from AWS with analytics and risk models to provide an in-depth view of the AWS security posture. The CyberVadis methodology maps to major international compliance standards.

Customers can download the AWS CyberVadis report at no additional cost. For details on how to access the report, see our AWS CyberVadis report page.

As always, we value your feedback and questions. Reach out to the AWS Compliance team through the Contact Us page. If you have feedback about this post, submit comments in the Comments section below. To learn more about our other compliance and security programs, see AWS Compliance Programs.

Want more AWS Security news? Follow us on Twitter.

Andreas Terwellen

Andreas Terwellen

Andreas is a senior manager in security audit assurance at AWS, based in Frankfurt, Germany. His team is responsible for third-party and customer audits, attestations, certifications, and assessments across Europe. Previously, he was a CISO in a DAX-listed telecommunications company in Germany. He also worked for different consulting companies managing large teams and programs across multiple industries and sectors.

Manuel Mazarredo

Manuel Mazarredo

Manuel is a security audit program manager at AWS based in Amsterdam, the Netherlands. Manuel leads security audits, attestations, and certification programs across Europe, and is responsible for the BeNeLux area. For the past 18 years, he has worked in information systems audits, ethical hacking, project management, quality assurance, and vendor management across a variety of industries.

AWS Trusted Advisor – New Priority Capability

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-trusted-advisor-new-priority-capability/

AWS Trusted Advisor is a service that continuously analyzes your AWS accounts and provides recommendations to help you to follow AWS best practices and AWS Well-Architected guidelines. Trusted Advisor implements a series of checks. These checks identify ways to optimize your AWS infrastructure, improve security and performance, reduce costs, and monitor service quotas.

Today, we are making available to all Enterprise Support customers a new capability for AWS Trusted Advisor: Trusted Advisor Priority. It gives you prioritized and context-driven recommendations manually curated by your AWS account team, based on their knowledge of your environment and on the machine-generated checks from AWS services.

Trusted Advisor implements over 200 checks in five categories: cost optimization, performance, security, fault tolerance, and service limits. Here is a view of the current Trusted Advisor dashboard.

AWS Trusted Advisor Categories

The list of checks available on your account depends on your level of support. When you have AWS Basic Support, available to all customers, or AWS Developer Support, you have access to core security and service limits checks. When you have AWS Business Support or AWS Enterprise Support, you have access to all checks.

The new Priority capability gives you a prioritized view of critical risks. It shows prioritized, contextual recommendations and actionable insights based on your business outcomes and what's important to you. It also surfaces risks proactively identified by your AWS account team to alert you to, and help address, critical cloud risks stemming from deviations from AWS best practices. It is designed to help you: IT leaders, technical decision makers, and members of a Cloud Center of Excellence.

The account team takes advantage of their understanding of your production accounts and business-critical workloads. By working with you, they identify what's important to you and the outcomes or goals you wish to achieve. For example, they know your business priorities, whether that is exiting a data center by the end of the year, launching a new product, expanding to a new geography, or migrating a workload to the cloud.

Trusted Advisor uses multiple sources to define the priorities. On one side, it uses signals from other AWS services, such as AWS Compute Optimizer, Amazon GuardDuty, or VPC Flow Logs. On the other side, it uses context manually curated by your AWS account team (Account Manager, Technical Account Manager, Solutions Architect, Customer Solutions Manager, and others) and the knowledge they have about your production accounts, business-critical applications and critical workloads. You will be guided to opportunities to take advantage of AWS Support engagements like a Cost Optimization workshop when the account team believes there are opportunities to reduce costs, a deep dive with a service team, or an Infrastructure Event Management for an upcoming workload migration.

You will be alerted to risks in your deployments on AWS, using sources such as the AWS Well-Architected Framework. We will highlight and bring to your attention any open high risk issues (HRIs) from recently conducted Well-Architected reviews. We also run campaigns to proactively identify, alert on, and reduce single points of failure, such as single Availability Zone deployments. This verifies that you don't have single points of failure in production applications that support mission-critical processes, drive significant revenue, or have regulated availability requirements. Trusted Advisor helps you detect these risks, raise awareness, and get prescriptive guidance.

Here is a diagram to visualize my mental model for Trusted Advisor Priority:

Trusted Advisor Mental Model Diagram

Trusted Advisor Priority works with AWS Organizations: it aggregates all recommendations from member accounts in your management account or designated delegated administrator account. You may delegate access to Trusted Advisor Priority to a maximum of five other AWS accounts. Trusted Advisor Priority comes with a new AWS Identity and Access Management (IAM) policy to help you manage access to the capability. Finally, you can configure daily and weekly email digests of all prioritized notifications to be sent to the alternate contacts you set up in the management account or each delegated administrator account.

Let’s See Trusted Advisor Priority in Action
I open the AWS Management Console and navigate to Trusted Advisor. I notice a new navigation entry on the left menu. It is the default view for Enterprise Support customers.

The Trusted Advisor Priority main screen summarizes the number of Pending response and In progress recommendations. It shares some time-related statistics on the right side of the screen. I can start to look at the Active prioritized recommendations list on the bottom half of the screen.

Recommendations are divided into two panels: Active and Closed. The Active tab includes recommendations that have been surfaced to you and which you are actively working on. The Closed tab includes recommendations that have been resolved. All account team prioritized recommendations are presented with a series of searchable and sortable columns. I see the recommendation name, status, source, category, and age.

AWS Trusted Advisor Priority

The list gives me details about the category, the age, and the status of the recommendations. The Source column distinguishes between auto-detected and manually identified opportunities. The Category column shows the category from Trusted Advisor (cost optimization, performance, security, fault tolerance, and service limits). The Age column shows me how long it’s been since the recommendation was first shared. This helps with tracking the time to resolution for each of these items.

AWS Trusted Advisor Priority

I can select any recommendation to drill down into the details. In this example, I select the second one: Amazon RDS Public Snapshots. This is a recommendation in the Security category.

AWS Trusted Advisor Priority

Recommendations are actionable, and they give you a real course of action to respond to the issue. In this case, it suggests modifying the snapshot configuration and removing the public flag that makes the database snapshot available to all AWS customers.

Trusted Advisor Priority provides a closed-loop feedback mechanism where I have the ability to accept or reject a recommendation if I don’t think the issue is relevant to my account.

The information is aggregated at an Organizations level. When you are using Organizations to group accounts to reflect your business units, the recommendations are aggregated and present an overall risk posture across your business units.

As an infrastructure manager, I can either Accept the recommendation and take action or Reject it because it is not a risk or it is something I will not fix and want to remove the recommendation from my list.

AWS Trusted Advisor Priority - Accept AWS Trusted Advisor Priority - Reject

Pricing and Availability
AWS Trusted Advisor Priority is available in all commercial AWS Regions where Trusted Advisor is available now, except the two AWS Regions in China. It is available at no additional cost for Enterprise Support customers.

Trusted Advisor Priority will not replace your Technical Account Manager or Solution Architect. They are key in providing tailored guidance and working with you through all phases of managing your cloud applications. Trusted Advisor Priority provides anytime access to tailored, context-aware, risk-mitigating recommendations and insights from your account team and optimizes your engagement with AWS. It will not reduce your access to your account team in any way but rather will make it easier for you to collaborate with them on your most important priorities.

You can start to use Trusted Advisor Priority today.

And now, go build!

— seb

New row and column interactivity options for tables and pivot tables in Amazon QuickSight – Part 1

Post Syndicated from Bhupinder Chadha original https://aws.amazon.com/blogs/big-data/part-1-new-row-and-column-interactivity-options-for-tables-and-pivot-tables-in-amazon-quicksight/

Amazon QuickSight is a fully-managed, cloud-native business intelligence (BI) service that makes it easy to create and deliver insights to everyone in your organization. You can make your data come to life with rich interactive charts and create beautiful dashboards to share with thousands of users, either directly within a QuickSight application, or embedded in web apps and portals.

In 2021, as part of table and pivot table customization, we launched the ability for authors to resize row height in tables and pivot tables. Now we are extending this capability to readers, along with other row and column interactions, such as altering column width and header height for both tables and pivot tables using a drag handler. This brings consistency between row and column interactions and an improved user experience. Apart from the format pane settings, authors can make quick changes to column, row, and header width for both parent and child nodes by simply dragging the cell, column, or row to the desired size using the drag handler.

The following are the different interactions that both readers and authors can perform, some of which were already available for tables.

Modify row height

You can modify row height using the drag handler for cells, row headers, or column headers.

The following screenshot illustrates how to alter the row height by dragging from any cell in the table or pivot table.

For row headers, you can only resize row height by dragging the last element (child node), which gets aggregated to define the row height for the parent node.

You can resize the height of the column headers at any level such that you can have different heights at each level for better aesthetics.

Modify column width

You can also use the drag handler to modify column width for row headers, column headers, or cells.

You can alter column width for any field assigned to rows, as shown in the following screenshot.

You can alter column width for column dimension members both from the parent or leaf node in the case of hierarchy, or column field or values headers in the absence of a column hierarchy.

You now have complete flexibility to resize column width from cells. Depending on which cell you drag, the corresponding column width is adjusted.

Considerations

A few things to note about this new feature:

  • Drag handlers are available for both web and embedded use cases
  • Row height is set in common for all rows and not specific to a particular row

Summary

With the introduction of a drag handler, authors and readers can now quickly alter column width, row height, and header height for tables and pivot tables. This provides consistent behavior and improved interaction for both personas and both visual types. To learn more about table and pivot table formatting options, refer to Formatting tables and pivot tables in Amazon QuickSight.

Try out the new feature and share your feedback and questions in the comments section.


About the author

Bhupinder Chadha is a senior product manager for Amazon QuickSight focused on visualization and front-end experiences. He is passionate about BI, data visualization, and low-code/no-code experiences. Prior to QuickSight, he was the lead product manager for Inforiver, responsible for building an enterprise BI product from the ground up. Bhupinder started his career in presales, followed by a short stint in consulting, and was then the PM for xViz, an add-on visualization product.

Leading the Way in Tampa

Post Syndicated from Julian Waits original https://blog.rapid7.com/2022/08/17/leading-the-way-in-tampa/

Leading the Way in Tampa

If you’ve been to the Tampa Bay area in recent years, you’ve probably noticed the significant tech industry expansion taking place in the Channel District. It’s an exciting time to be a part of the scene, and Rapid7 is smack in the middle. Being active in the Tampa Bay Chamber of Commerce is incredibly rewarding. I’ve personally been involved with the Chamber for more than a year, and I’ve witnessed first-hand their commitment to making Tampa Bay a vibrant, inclusive place to do business.

My next adventure with the Chamber involves becoming a part of its Leadership Tampa program — one of the oldest leadership programs in the country — which fulfills a mission to highlight the strengths of the Tampa Bay region and impact the community through various organized projects. Our 52-person group will also be collaborating with local not-for-profits to create strength in numbers.

We will continue building upon the diversity and inclusion efforts driven by previous Leadership Tampa members, as well as focus on the challenges of transportation, affordable housing, and inflation, among others. I’m looking forward to the many conversations and activities we will undertake as part of this initiative.

NEVER MISS A BLOG

Get the latest stories, expertise, and news about security today.

Additional reading:

Security updates for Wednesday

Post Syndicated from original https://lwn.net/Articles/904955/

Security updates have been issued by Debian (epiphany-browser, net-snmp, webkit2gtk, and wpewebkit), Fedora (python-yara and yara), Red Hat (kernel and kpatch-patch), SUSE (ceph, compat-openssl098, java-1_8_0-openjdk, kernel, python-Twisted, rsync, and webkit2gtk3), and Ubuntu (pyjwt and unbound).

Optimizing your AWS Infrastructure for Sustainability, Part IV: Databases

Post Syndicated from Otis Antoniou original https://aws.amazon.com/blogs/architecture/optimizing-your-aws-infrastructure-for-sustainability-part-iv-databases/

In Part I: Compute, Part II: Storage, and Part III: Networking of this series, we introduced strategies to optimize the compute, storage, and networking layers of your AWS architecture for sustainability.

This post, Part IV, focuses on the database layer and proposes recommendations to optimize your databases’ utilization, performance, and queries. These recommendations are based on design principles of AWS Well-Architected Sustainability Pillar.

Optimizing the database layer of your AWS infrastructure


Figure 1. AWS database services

As your application serves more customers, the volume of data stored within your databases will increase. Implementing the recommendations in the following sections will help you use databases resources more efficiently and save costs.

Use managed databases

Usually, customers overestimate the capacity they need to absorb peak traffic, wasting resources and money on unused infrastructure. AWS fully managed database services provide continuous monitoring, which allows you to increase and decrease your database capacity as needed. Additionally, most AWS managed databases use a pay-as-you-go model based on the instance size and storage used.

Managed services shift responsibility to AWS for maintaining high average utilization and sustainability optimization of the deployed hardware. Amazon Relational Database Service (Amazon RDS) reduces your individual contribution compared to maintaining your own databases on Amazon Elastic Compute Cloud (Amazon EC2). In a managed database, AWS continuously monitors your clusters to keep your workloads running with self-healing storage and automated scaling.

AWS offers 15+ purpose-built engines to support diverse data models. For example, if an Internet of Things (IoT) application needs to process large amounts of time series data, Amazon Timestream is designed and optimized for this exact use case.

Rightsize, reduce waste, and choose the right hardware

To see metrics, thresholds, and actions you can take to identify underutilized instances and rightsizing opportunities, Optimizing costs in Amazon RDS provides great guidance. The following table provides additional tools and metrics for you to find unused resources:

Service | Metric | Source
Amazon RDS | DatabaseConnections | Amazon CloudWatch
Amazon RDS | Idle DB Instances | AWS Trusted Advisor
Amazon DynamoDB | AccountProvisionedReadCapacityUtilization, AccountProvisionedWriteCapacityUtilization, ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits | CloudWatch
Amazon Redshift | Underutilized Amazon Redshift Clusters | AWS Trusted Advisor
Amazon DocumentDB | DatabaseConnections, CPUUtilization, FreeableMemory | CloudWatch
Amazon Neptune | CPUUtilization, VolumeWriteIOPs, MainRequestQueuePendingRequests | CloudWatch
Amazon Keyspaces | ProvisionedReadCapacityUnits, ProvisionedWriteCapacityUnits, ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits | CloudWatch

These tools will help you identify rightsizing opportunities. However, rightsizing databases can affect your SLAs for query times, so consider this before making changes.

We also suggest:

  • Evaluating if your existing SLAs meet your business needs or if they could be relaxed as an acceptable trade-off to optimize your environment for sustainability.
  • If any of your RDS instances need to run only during business hours, consider shutting them down outside those hours, either manually or with the Instance Scheduler (see the sketch after this list).
  • Consider using a more power-efficient processor like AWS Graviton-based instances for your databases. Graviton2 delivers 2-3.5 times better CPU performance per watt than any other processor in AWS.
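
The following is a minimal sketch of the manual approach using boto3; the instance identifier is a placeholder, and a scheduled solution such as the Instance Scheduler is usually a better fit for production use:

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Stop a development or test instance outside business hours.
# Note: RDS automatically restarts a stopped instance after seven days.
rds.stop_db_instance(DBInstanceIdentifier="dev-database-1")

# Start it again at the beginning of the next business day.
# rds.start_db_instance(DBInstanceIdentifier="dev-database-1")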

Make sure to choose the right RDS instance type for the type of workload you have. For example, burstable performance instances can deal with spikes that exceed the baseline without the need to overprovision capacity. In terms of storage, Amazon RDS provides three storage types that differ in performance characteristics and price, so you can tailor the storage layer of your database according to your needs.

Use serverless databases

Production databases that experience intermittent, unpredictable, or spiky traffic may be underutilized. To improve efficiency and eliminate excess capacity, scale your infrastructure according to its load.

AWS offers relational and non-relational serverless databases that shut off when not in use, quickly restart, and automatically scale database capacity based on your application's needs. This reduces your environmental impact because capacity management is automatically optimized. By selecting the best purpose-built database for your workload, you'll benefit from the scalability and fully managed experience of serverless database services, as shown in the following lists.

Serverless relational databases:

  • Amazon Aurora Serverless for an on-demand, autoscaling configuration
  • Amazon Redshift Serverless, which runs and scales data warehouse capacity so you don't need to set up and manage data warehouse infrastructure

Serverless non-relational databases:

  • Amazon DynamoDB (in On-Demand mode) for a fully managed, serverless, key-value NoSQL database
  • Amazon Timestream for a time series database service for IoT and operational applications
  • Amazon Keyspaces for a scalable, highly available, and managed Apache Cassandra–compatible database service
  • Amazon Quantum Ledger Database for a fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log owned by a central trusted authority

Use automated database backups and remove redundant data

Unlike automated backups, manual Amazon RDS snapshots do not have a retention period set by default. This means that unless you delete a manual snapshot, it is not removed automatically. Removing manual snapshots you don't need uses fewer resources, which reduces your costs. If you want manual RDS snapshots to expire automatically, you can set a retention period ("expiration") with AWS Backup. To keep long-term snapshots of MariaDB, MySQL, and PostgreSQL data, we recommend exporting snapshot data to Amazon Simple Storage Service (Amazon S3). You can also export specific tables or databases. This way, you can move data to "colder" longer-term archival storage instead of keeping it within your database.
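
As a hedged example, a snapshot export can be started with boto3 as follows; the ARNs, bucket, role, and KMS key are placeholders, and the IAM role must have permission to write to the target bucket:

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Export a manual snapshot to S3 so the data can move to colder storage,
# after which the snapshot itself can be deleted from RDS.
rds.start_export_task(
    ExportTaskIdentifier="orders-db-2022-08-archive",
    SourceArn="arn:aws:rds:us-east-1:123456789012:snapshot:orders-db-2022-08",
    S3BucketName="my-archive-bucket",
    S3Prefix="rds-exports/orders-db",
    IamRoleArn="arn:aws:iam::123456789012:role/rds-s3-export-role",
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555",
    # Optionally export only specific databases or tables:
    # ExportOnly=["mydb.public.orders"],
)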

Optimize long running queries

Identify and optimize queries that are resource intensive, because they can affect the overall performance of your application. Using the Performance Insights dashboard, specifically the Top Dimensions table, which displays the top SQL, waits, and hosts, you can view and download SQL queries to diagnose and investigate further.

Tuning Amazon RDS for MySQL with Performance Insights and this knowledge center article will help you optimize and tune queries in Amazon RDS for MySQL. The Optimizing and tuning queries in Amazon RDS PostgreSQL based on native and external tools and Improve query performance with parallel queries in Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition blog posts outline how to use native and external tools to optimize and tune Amazon RDS PostgreSQL queries, as well as improve query performance using the parallel query feature.

Improve database performance

You can improve your database performance by monitoring, identifying, and remediating anomalous performance issues. Instead of relying solely on a database administrator (DBA), you can use native AWS tools to continuously monitor and analyze database telemetry, as shown in the following table.

Service | CloudWatch Metric | Source
Amazon RDS | CPUUtilization, FreeStorageSpace | CloudWatch
Amazon Redshift | CPUUtilization, PercentageDiskSpaceUsed | CloudWatch
Amazon Aurora | CPUUtilization, FreeLocalStorage | Amazon RDS
Amazon DynamoDB | AccountProvisionedReadCapacityUtilization, AccountProvisionedWriteCapacityUtilization | CloudWatch
Amazon ElastiCache | CPUUtilization | CloudWatch

CloudWatch displays instance-level and account-level usage metrics for Amazon RDS. Create CloudWatch alarms that notify you when a metric value crosses a threshold you specify or when anomalous metric behavior is detected. Enable Enhanced Monitoring for real-time metrics from the operating system that the DB instance runs on.
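
For instance, a basic CPU utilization alarm for an RDS instance could be created like this; the instance identifier and SNS topic are placeholders:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when average CPU utilization stays above 80% for 15 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="rds-orders-db-high-cpu",
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "orders-db"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:db-alerts"],
)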

Amazon RDS Performance Insights collects performance metrics, such as database load, from each RDS DB instance. This data gives you a granular view of the databases’ activity every second. You can enable Performance Insights without causing downtime, reboot, or failover.

Amazon DevOps Guru for RDS uses the data from Performance Insights, Enhanced Monitoring, and CloudWatch to identify operational issues. It uses machine learning to detect and notify of database-related issues, including resource overutilization or misbehavior of certain SQL queries.

Conclusion

In this blog post, we discussed technology choices, design principles, and recommended actions to optimize and increase efficiency of your databases. As your data grows, it is important to scale your database capacity in line with your user load, remove redundant data, optimize database queries, and optimize database performance. Figure 2 shows an overview of the tools you can use to optimize your databases.


Figure 2. Tools you can use on AWS for optimization

Other blog posts in this series

Active Exploitation of Multiple Vulnerabilities in Zimbra Collaboration Suite

Post Syndicated from Caitlin Condon original https://blog.rapid7.com/2022/08/17/active-exploitation-of-multiple-vulnerabilities-in-zimbra-collaboration-suite/

Active Exploitation of Multiple Vulnerabilities in Zimbra Collaboration Suite

Over the past few weeks, five different vulnerabilities affecting Zimbra Collaboration Suite have come to our attention, one of which is unpatched, and four of which are being actively and widely exploited in the wild by well-organized threat actors. We urge organizations who use Zimbra to patch to the latest version on an urgent basis, and to upgrade future versions as quickly as possible once they are released.

Exploited RCE vulnerabilities

The following vulnerabilities can be used for remote code execution and are being exploited in the wild.

CVE-2022-30333

CVE-2022-30333 is a path traversal vulnerability in unRAR, Rarlab’s command line utility for extracting RAR file archives. CVE-2022-30333 allows an attacker to write a file anywhere on the target file system as the user that executes unrar. Zimbra Collaboration Suite uses a vulnerable implementation of unrar (specifically, the amavisd component, which is used to inspect incoming emails for spam and malware). Zimbra addressed this issue in 9.0.0 patch 25 and 8.8.15 patch 32 by replacing unrar with 7z.

Our research team has a full analysis of CVE-2022-30333 in AttackerKB. A Metasploit module is also available. Note that the server does not necessarily need to be internet-facing to be exploited — it simply needs to receive a malicious email.

CVE-2022-27924

CVE-2022-27924 is a blind Memcached injection vulnerability first analyzed publicly in June 2022. Successful exploitation allows an attacker to change arbitrary keys in the Memcached cache to arbitrary values. In the worst-case scenario, an attacker can steal a user’s credentials when a user attempts to authenticate. Combined with CVE-2022-27925, an authenticated remote code execution vulnerability, and CVE-2022-37393, a currently unpatched privilege escalation issue that was publicly disclosed in October 2021, capturing a user’s password can lead to remote code execution as the root user on an organization’s email server, which frequently contains sensitive data.

Our research team has a full analysis of CVE-2022-27924 in AttackerKB. Note that an attacker does need to know a username on the server in order to exploit CVE-2022-27924. According to Sonar, it is also possible to poison the cache for any user by stacking multiple requests.

CVE-2022-27925

CVE-2022-27925 is a directory traversal vulnerability in Zimbra Collaboration Suite versions 8.8.15 and 9.0 that allows an authenticated user with administrator rights to upload arbitrary files to the system. On August 10, 2022, security firm Volexity published findings from multiple customer compromise investigations that indicated CVE-2022-27925 was being exploited in combination with a zero-day authentication bypass, now assigned CVE-2022-37042, that allowed attackers to leverage CVE-2022-27925 without authentication.

CVE-2022-37042

As noted above, CVE-2022-37042 is a critical authentication bypass that arises from an incomplete fix for CVE-2022-27925. Zimbra patched CVE-2022-37042 in 9.0.0P26 and 8.8.15P33.

Unpatched privilege escalation CVE-2022-37393

In October of 2021, researcher Darren Martyn published an exploit for a zero-day root privilege escalation vulnerability in Zimbra Collaboration Suite. When successfully exploited, the vulnerability allows a user with a shell account as the zimbra user to escalate to root privileges. While this issue requires a local account on the Zimbra host, the previously mentioned vulnerabilities in this blog post offer plenty of opportunity to obtain it.

Our research team tested the privilege escalation in combination with CVE-2022-30333 and CVE-2022-27924 at the end of July 2022 and found that at the time, all versions of Zimbra were affected through at least 9.0.0 P25 and 8.8.15 P32. Rapid7 disclosed the vulnerability to Zimbra on July 21, 2022 and later assigned CVE-2022-37393 (still awaiting NVD analysis) to track it. A full analysis of CVE-2022-37393 is available in AttackerKB. A Metasploit module is also available.

Mitigation guidance

We strongly advise that all organizations who use Zimbra in their environments update to the latest available version (at time of writing, the latest versions available are 9.0.0 P26 and 8.8.15 P33) to remediate known remote code execution vectors. We also advise monitoring Zimbra’s release communications for future security updates, and patching on an urgent basis when new versions become available.

The AttackerKB analyses for CVE-2022-30333, CVE-2022-27924, and CVE-2022-37393 all include vulnerability details (including proofs of concept) and sample IOCs. Volexity’s blog also has information on how to look for webshells dropped on Zimbra instances, such as comparing the list of JSP files on a Zimbra instance with those present by default in Zimbra installations. They have published lists of valid JSP files included in Zimbra installations for the latest version of 8.8.15 and of 9.0.0 (at time of writing).

Finally, we recommend blocking internet traffic to Zimbra servers wherever possible and configuring Zimbra to block external Memcached, even on patched versions of Zimbra.

Rapid7 customers

Our engineering team is in the investigation phase of vulnerability check development and will assess the risk and customer needs for each vulnerability separately. We will update this blog with more information as it becomes available.

NEVER MISS A BLOG

Get the latest stories, expertise, and news about security today.

Additional reading:

Zoom Exploit on MacOS

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2022/08/zoom-exploit-on-macos.html

This vulnerability was reported to Zoom last December:

The exploit works by targeting the installer for the Zoom application, which needs to run with special user permissions in order to install or remove the main Zoom application from a computer. Though the installer requires a user to enter their password on first adding the application to the system, Wardle found that an auto-update function then continually ran in the background with superuser privileges.

When Zoom issued an update, the updater function would install the new package after checking that it had been cryptographically signed by Zoom. But a bug in how the checking method was implemented meant that giving the updater any file with the same name as Zoom’s signing certificate would be enough to pass the test—so an attacker could substitute any kind of malware program and have it be run by the updater with elevated privilege.

It seems that it’s not entirely fixed:

Following responsible disclosure protocols, Wardle informed Zoom about the vulnerability in December of last year. To his frustration, he says an initial fix from Zoom contained another bug that meant the vulnerability was still exploitable in a slightly more roundabout way, so he disclosed this second bug to Zoom and waited eight months before publishing the research.

EDITED TO ADD: Disclosure works. The vulnerability seems to be patched now.

Target your customers with ML based on their interest in a product or product attribute.

Post Syndicated from Pavlos Ioannou Katidis original https://aws.amazon.com/blogs/messaging-and-targeting/use-machine-learning-to-target-your-customers-based-on-their-interest-in-a-product-or-product-attribute/

Customer segmentation allows marketers to better tailor their efforts to specific subgroups of their audience. Businesses who employ customer segmentation can create and communicate targeted marketing messages that resonate with specific customer groups. Segmentation increases the likelihood that customers will engage with the brand, and reduces the potential for communications fatigue—that is, the disengagement of customers who feel like they’re receiving too many messages that don’t apply to them. For example, if your business wants to launch an email campaign about business suits, the target audience should only include people who wear suits.

This blog presents a solution that uses Amazon Personalize to generate highly personalized Amazon Pinpoint customer segments. Using Amazon Pinpoint, you can send messages to those customer segments via campaigns and journeys.

Personalizing Pinpoint segments

Marketers first need to understand their customers by collecting customer data such as key characteristics, transactional data, and behavioral data. This data helps to form buyer personas, understand how they spend their money, and what type of information they’re interested in receiving.

You can create two types of customer segments in Amazon Pinpoint: imported and dynamic. With both types of segments, you need to perform customer data analysis and identify behavioral patterns. After you identify the segment characteristics, you can build a dynamic segment that includes the appropriate criteria. You can learn more about dynamic and imported segments in the Amazon Pinpoint User Guide.

Businesses selling their products and services online could benefit from segments based on known customer preferences, such as product category, color, or delivery options. Marketers who want to promote a new product or inform customers about a sale on a product category can use these segments to launch Amazon Pinpoint campaigns and journeys, increasing the probability that customers will complete a purchase.

Building targeted segments requires you to obtain historical customer transactional data, and then invest time and resources to analyze it. This is where the use of machine learning can save time and improve the accuracy.

Amazon Personalize is a fully managed machine learning service that requires no prior ML knowledge to operate. It offers ready-to-use models, called recipes, for segment creation as well as product recommendations. Using Amazon Personalize USER_SEGMENTATION recipes, you can generate segments based on a product ID or a product attribute.

About this solution

The solution is based on the following reference architectures:

Both of these architectures are deployed as nested stacks along the main application to showcase how contextual segmentation can be implemented by integrating Amazon Personalize with Amazon Pinpoint.

High level architecture

Architecture Diagram

Once training data and training configuration are uploaded to the Personalize data bucket (1), an AWS Step Functions state machine is executed (2). This state machine implements a training workflow that provisions all required resources within Amazon Personalize. It trains a recommendation model (3a) based on the Item-Attribute-Affinity recipe. Once the solution is created, the workflow creates a batch segment job to get user segments (3b). The job configuration focuses on providing segments of users that are interested in action genre movies:

{ "itemAttributes": "ITEMS.genres = \"Action\"" }

When the batch segment job finishes, the result is uploaded to Amazon S3 (3c). The training workflow state machine publishes Amazon Personalize state changes on a custom event bus (4). An Amazon EventBridge rule listens for events indicating that a batch segment job has finished (5). Once this event is put on the event bus, a batch segment postprocessing workflow is executed as an AWS Step Functions state machine (6). This workflow reads and transforms the segment job output from Amazon Personalize (7) into a CSV file that can be imported as a static segment into Amazon Pinpoint (8). The CSV file contains only the Amazon Pinpoint endpoint IDs that refer to the corresponding users from the Amazon Personalize recommendation segment, in the following format:

Id
hcbmnng30rbzf7wiqn48zhzzcu4
tujqyruqu2pdfqgmcgkv4ux7qlu
keul5pov8xggc0nh9sxorldmlxc
lcuxhxpqh/ytkitku2zynrqb2ce

The mechanism for resolving an Amazon Pinpoint endpoint ID relies on the user ID that is set in Amazon Personalize also being referenced in each endpoint within Amazon Pinpoint through the user ID attribute.

State machine for getting Amazon Pinpoint endpoints
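As a hedged illustration of that lookup (the actual implementation is the Step Functions workflow pictured above and may differ), the following Python sketch resolves Amazon Pinpoint endpoint IDs for a list of Amazon Personalize user IDs and writes them in the single-column CSV format shown above. The application ID and user IDs are placeholders.

# Hypothetical sketch: map Amazon Personalize user IDs to Amazon Pinpoint
# endpoint IDs. The application ID and user IDs are placeholders.
import csv
import boto3

pinpoint = boto3.client('pinpoint')
APPLICATION_ID = '1234567890abcdef1234567890abcdef'  # Amazon Pinpoint project ID

def endpoint_ids_for_users(user_ids):
    """Yield every endpoint ID registered for the given user IDs."""
    for user_id in user_ids:
        response = pinpoint.get_user_endpoints(
            ApplicationId=APPLICATION_ID, UserId=user_id
        )
        for endpoint in response['EndpointsResponse']['Item']:
            yield endpoint['Id']

# Write the endpoint IDs in the single-column CSV format shown above.
with open('segment.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Id'])
    for endpoint_id in endpoint_ids_for_users(['1', '2', '3']):
        writer.writerow([endpoint_id])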

The workflow ensures that the segment file has a unique filename so that the segments within Amazon Pinpoint can be identified independently. Once the segment CSV file is uploaded to S3 (7), the segment import workflow creates a new imported segment within Amazon Pinpoint (8).
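For illustration only, the import into Amazon Pinpoint could be performed with a call along these lines; the project ID, bucket, role, and segment name are placeholders, and the solution's workflow may set additional options.

# Hypothetical sketch: import the segment CSV from S3 as an imported (static)
# segment in Amazon Pinpoint. All identifiers are placeholders.
import boto3

pinpoint = boto3.client('pinpoint')

response = pinpoint.create_import_job(
    ApplicationId='1234567890abcdef1234567890abcdef',
    ImportJobRequest={
        'Format': 'CSV',
        'S3Url': 's3://my-segment-import-bucket/pinpoint/ITEMSgenresAction_2022-08-22-10-00.csv',
        'RoleArn': 'arn:aws:iam::123456789012:role/PinpointSegmentImportRole',
        'DefineSegment': True,
        'RegisterEndpoints': False,
        'SegmentName': 'ITEMSgenresAction_2022-08-22-10-00',
    },
)
print(response['ImportJobResponse']['Id'])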

Datasets

The solution uses an artificially generated movie dataset called Bingewatch for demonstration purposes. The data is pre-processed to make it usable in the context of Amazon Personalize and Amazon Pinpoint. The pre-processed data consists of the following:

  • Interactions’ metadata created out of the Bingewatch ratings.csv
  • Items’ metadata created out of the Bingewatch movies.csv
  • Users’ metadata created out of the Bingewatch ratings.csv, enriched with invented email address and age data
  • Amazon Pinpoint endpoint data

Interactions’ dataset

The interaction dataset describes movie ratings from Bingewatch users. Each row describes a single rating by a user identified by a user id.

The EVENT_VALUE describes the actual rating from 1.0 to 5.0 and the EVENT_TYPE specifies that the rating resulted because a user watched this movie at the given TIMESTAMP, as shown in the following example:

USER_ID,ITEM_ID,EVENT_VALUE,EVENT_TYPE,TIMESTAMP
1,1,4.0,Watch,964982703 
2,3,4.0,Watch,964981247
3,6,4.0,Watch,964982224
...

Items’ dataset

The item dataset describes each available movie using a TITLE, RELEASE_YEAR, CREATION_TIMESTAMP and a pipe concatenated list of GENRES, as shown in the following example:

ITEM_ID,TITLE,RELEASE_YEAR,CREATION_TIMESTAMP,GENRES
1,Toy Story,1995,788918400,Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji,1995,788918400,Adventure|Children|Fantasy
3,Grumpier Old Men,1995,788918400,Comedy|Romance
...

Users’ dataset

The users dataset contains all known users identified by a USER_ID. This dataset contains artificially generated metadata that describe the users’ GENDER and AGE, as shown in the following example:

USER_ID,GENDER,E_MAIL,AGE
1,Female,[email protected],21
2,Female,[email protected],35
3,Male,[email protected],37
4,Female,[email protected],47
5,Agender,[email protected],50
...

Amazon Pinpoint endpoints

To map Amazon Pinpoint endpoints to users in Amazon Personalize, it is important to have a consistent user identifier. The mechanism for resolving an Amazon Pinpoint endpoint ID relies on the user ID in Amazon Personalize also being referenced in each endpoint within Amazon Pinpoint through the userId attribute, as shown in the following example:

User.UserId,ChannelType,User.UserAttributes.Gender,Address,User.UserAttributes.Age
1,EMAIL,Female,[email protected],21
2,EMAIL,Female,[email protected],35
3,EMAIL,Male,[email protected],37
4,EMAIL,Female,[email protected],47
5,EMAIL,Agender,[email protected],50
...

Solution implementation

Prerequisites

To deploy this solution, you must have the following:

Note: This solution creates an Amazon Pinpoint project with the name personalize. If you want to deploy this solution on an existing Amazon Pinpoint project, you will need to perform changes in the YAML template.

Deploy the solution

Step 1: Deploy the SAM solution

Clone the GitHub repository to your local machine (how to clone a GitHub repository). Navigate to the repository location on your local machine and run the following SAM CLI command:

sam deploy --stack-name contextual-targeting --guided

Fill in the fields as displayed below. Change the AWS Region to the AWS Region of your preference, where Amazon Pinpoint and Amazon Personalize are available. The Email parameter is used by Amazon Simple Notification Service (SNS) to send you an email notification when the Amazon Personalize job is complete.

Configuring SAM deploy
======================
        Looking for config file [samconfig.toml] :  Not found
        Setting default arguments for 'sam deploy'     =========================================
        Stack Name [sam-app]: contextual-targeting
        AWS Region [us-east-1]: eu-west-1
        Parameter Email []: [email protected]
        Parameter PEVersion [v1.2.0]:
        Parameter SegmentImportPrefix [pinpoint/]:
        #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
        Confirm changes before deploy [y/N]:
        #SAM needs permission to be able to create roles to connect to the resources in your template
        Allow SAM CLI IAM role creation [Y/n]:
        #Preserves the state of previously provisioned resources when an operation fails
        Disable rollback [y/N]:
        Save arguments to configuration file [Y/n]:
        SAM configuration file [samconfig.toml]:
        SAM configuration environment [default]:
        Looking for resources needed for deployment:
        Creating the required resources...
        [...]
        Successfully created/updated stack - contextual-targeting in eu-west-1
======================

Step 2: Import the initial segment to Amazon Pinpoint

We will import some initial and artificially generated endpoints into Amazon Pinpoint.

Run the following command with the AWS CLI on your local machine.

The command below is compatible with Linux:

SEGMENT_IMPORT_BUCKET=$(aws cloudformation describe-stacks --stack-name contextual-targeting --query 'Stacks[0].Outputs[?OutputKey==`SegmentImportBucket`].OutputValue' --output text)
aws s3 sync ./data/pinpoint s3://$SEGMENT_IMPORT_BUCKET/pinpoint

For Windows PowerShell use the command below:

$SEGMENT_IMPORT_BUCKET = (aws cloudformation describe-stacks --stack-name contextual-targeting --query 'Stacks[0].Outputs[?OutputKey==`SegmentImportBucket`].OutputValue' --output text)
aws s3 sync ./data/pinpoint s3://$SEGMENT_IMPORT_BUCKET/pinpoint

Step 3: Upload training data and configuration for Amazon Personalize

Now we are ready to train our initial recommendation model. This solution provides you with dummy training data as well as a training and inference configuration, which needs to be uploaded into the Amazon Personalize S3 bucket. Training the model can take between 45 and 60 minutes.

Run the following command with the AWS CLI on your local machine.

The command below is compatible with Linux:

PERSONALIZE_BUCKET=$(aws cloudformation describe-stacks --stack-name contextual-targeting --query 'Stacks[0].Outputs[?OutputKey==`PersonalizeBucketName`].OutputValue' --output text)
aws s3 sync ./data/personalize s3://$PERSONALIZE_BUCKET

For Windows PowerShell use the command below:

$PERSONALIZE_BUCKET = (aws cloudformation describe-stacks --stack-name contextual-targeting --query 'Stacks[0].Outputs[?OutputKey==`PersonalizeBucketName`].OutputValue' --output text)
aws s3 sync ./data/personalize s3://$PERSONALIZE_BUCKET

Step 4: Review the inferred segments from Amazon Personalize

Once the training workflow is completed, you should receive an email on the email address you provided when deploying the stack. The email should look like the one in the screenshot below:

SNS notification for Amazon Personalize job

Navigate to the Amazon Pinpoint Console > Your Project > Segments and you should see two imported segments: one named endpoints.csv, which contains all endpoints imported in Step 2, and one named ITEMSgenresAction_<date>-<time>.csv, which contains the IDs of the endpoints that Amazon Personalize inferred are interested in action movies.

Amazon Pinpoint segments created by the solution

You can engage with Amazon Pinpoint customer segments via Campaigns and Journeys. For more information on how to create and execute Amazon Pinpoint Campaigns and Journeys visit the workshop Building Customer Experiences with Amazon Pinpoint.

Next steps

Contextual targeting is not bound to a single channel, such as the email channel used in this solution. You can extend the batch-segmentation-postprocessing workflow to fit your engagement and targeting requirements.

For example, you could implement several branches based on the referenced endpoint channel types and create Amazon Pinpoint customer segments that can be engaged via Push Notifications, SMS, Voice Outbound and In-App.

Clean-up

To delete the solution, run the following commands with the AWS CLI.

The command below is compatible with Linux:

SEGMENT_IMPORT_BUCKET=$(aws cloudformation describe-stacks --stack-name contextual-targeting --query 'Stacks[0].Outputs[?OutputKey==`SegmentImportBucket`].OutputValue' --output text)
PERSONALIZE_BUCKET=$(aws cloudformation describe-stacks --stack-name contextual-targeting --query 'Stacks[0].Outputs[?OutputKey==`PersonalizeBucketName`].OutputValue' --output text)
aws s3 rm s3://$SEGMENT_IMPORT_BUCKET/ --recursive
aws s3 rm s3://$PERSONALIZE_BUCKET/ --recursive
sam delete

For Windows PowerShell use the command below:

$SEGMENT_IMPORT_BUCKET = (aws cloudformation describe-stacks --stack-name contextual-targeting --query 'Stacks[0].Outputs[?OutputKey==`SegmentImportBucket`].OutputValue' --output text)
$PERSONALIZE_BUCKET = (aws cloudformation describe-stacks --stack-name contextual-targeting --query 'Stacks[0].Outputs[?OutputKey==`PersonalizeBucketName`].OutputValue' --output text)
aws s3 rm s3://$SEGMENT_IMPORT_BUCKET/ --recursive
aws s3 rm s3://$PERSONALIZE_BUCKET/ --recursive
sam delete

Amazon Personalize resources such as dataset groups and datasets are not created via AWS CloudFormation, so you must delete them manually. Please follow the instructions in the official AWS documentation on how to clean up the created resources.

About the Authors

Pavlos Ioannou Katidis

Pavlos Ioannou Katidis

Pavlos Ioannou Katidis is an Amazon Pinpoint and Amazon Simple Email Service Specialist Solutions Architect at AWS. He loves to dive deep into his customer’s technical issues and help them design communication solutions. In his spare time, he enjoys playing tennis, watching crime TV series, playing FPS PC games, and coding personal projects.

Christian Bonzelet

Christian Bonzelet

Christian Bonzelet is an AWS Solutions Architect at DFL Digital Sports. He loves the challenge of providing highly scalable systems for millions of users, and collaborating with lots of people to design systems in front of a whiteboard. He has used AWS since 2013, when he built a voting system for a big live TV show in Germany. Since then, he has become a big fan of the cloud, AWS, and domain-driven design.

[$] From late-bound arguments to deferred computation, part 1

Post Syndicated from original https://lwn.net/Articles/904777/

Back in November, we looked at a Python proposal
to have function arguments with defaults that get
evaluated when the function is called, rather than when it is defined.
The article suggested that the discussion surrounding the proposal was
likely to continue on for a ways—which it did—but it had died down by the
end of last year. That all changed in mid-June, when the already voluminous
discussion of the feature picked up again; once again, some people thought that
applying the idea only to function arguments was too restrictive. Instead,
a more general mechanism to defer evaluation was touted as something that
could work for late-bound arguments while being useful for other use cases as
well.

Introducing AWS Glue interactive sessions for Jupyter

Post Syndicated from Zach Mitchell original https://aws.amazon.com/blogs/big-data/introducing-aws-glue-interactive-sessions-for-jupyter/

Interactive Sessions for Jupyter is a new notebook interface in the AWS Glue serverless Spark environment. Starting in seconds and automatically stopping compute when idle, interactive sessions provide an on-demand, highly-scalable, serverless Spark backend to Jupyter notebooks and Jupyter-based IDEs such as Jupyter Lab, Microsoft Visual Studio Code, JetBrains PyCharm, and more. Interactive sessions replace AWS Glue development endpoints for interactive job development with AWS Glue and offer the following benefits:

  • No clusters to provision or manage
  • No idle clusters to pay for
  • No up-front configuration required
  • No resource contention for the same development environment
  • Easy installation and usage
  • The exact same serverless Spark runtime and platform as AWS Glue extract, transform, and load (ETL) jobs

Getting started with interactive sessions for Jupyter

Installing interactive sessions is simple and only takes a few terminal commands. After you install it, you can run interactive sessions anytime within seconds of deciding to run. In the following sections, we walk you through installation on macOS and getting started in Jupyter.

To get started with interactive sessions for Jupyter on Windows, follow the instructions in Getting started with AWS Glue interactive sessions.

Prerequisites

These instructions assume you’re running Python 3.6 or later and have the AWS Command Line Interface (AWS CLI) properly running and configured. You use the AWS CLI to make API calls to AWS Glue. For more information on installing the AWS CLI, refer to Installing or updating the latest version of the AWS CLI.

Install AWS Glue interactive sessions on macOS and Linux

To install AWS Glue interactive sessions, complete the following steps:

  1. Open a terminal and run the following to install and upgrade Jupyter, Boto3, and AWS Glue interactive sessions from PyPi. If desired, you can install Jupyter Lab instead of Jupyter.
    pip3 install --user --upgrade jupyter boto3 aws-glue-sessions

  2. Run the following commands to identify the package install location and install the AWS Glue PySpark and AWS Glue Spark Jupyter kernels with Jupyter:
    SITE_PACKAGES=$(pip3 show aws-glue-sessions | grep Location | awk '{print $2}')
    jupyter kernelspec install $SITE_PACKAGES/aws_glue_interactive_sessions_kernel/glue_pyspark
    jupyter kernelspec install $SITE_PACKAGES/aws_glue_interactive_sessions_kernel/glue_spark

  3. To validate your install, run the following command:
    jupyter kernelspec list

In the output, you should see both the AWS Glue PySpark and the AWS Glue Spark kernels listed alongside the default Python3 kernel. It should look something like the following:

Available kernels:
  Python3		~/.venv/share/jupyter/kernels/python3
  glue_pyspark    /usr/local/share/jupyter/kernels/glue_pyspark
  glue_spark      /usr/local/share/jupyter/kernels/glue_spark

Choose and prepare IAM principals

Interactive sessions use two AWS Identity and Access Management (IAM) principals (user or role) to function. The first is used to call the interactive sessions APIs and is likely the same user or role that you use with the AWS CLI. The second is GlueServiceRole, the role that AWS Glue assumes to run your session. This is the same role as AWS Glue jobs; if you’re developing a job with your notebook, you should use the same role for both interactive sessions and the job you create.

Prepare the client user or role

In the case of local development, the first role is already configured if you can run the AWS CLI. If you can’t run the AWS CLI, follow these steps for setting up. If you often use the AWS CLI or Boto3 to interact with AWS Glue and have full AWS Glue permissions, you can likely skip this step.

  1. To validate this first user or role is set up, open a new terminal window and run the following code:
    aws sts get-caller-identity

    You should see a response like the following. If not, you may not have permissions to call AWS Security Token Service (AWS STS), or you don’t have the AWS CLI set up properly. If you simply get access denied calling AWS STS, you may continue if you know your user or role and its needed permissions.

    {
        "UserId": "ABCDEFGHIJKLMNOPQR",
        "Account": "123456789123",
        "Arn": "arn:aws:iam::123456789123:user/MyIAMUser"
    }
    
    {
        "UserId": "ABCDEFGHIJKLMNOPQR",
        "Account": "123456789123",
        "Arn": "arn:aws:iam::123456789123:role/myIAMRole"
    }

  2. Ensure your IAM user or role can call the AWS Glue interactive sessions APIs by attaching the AWSGlueConsoleFullAccess managed IAM policy to your user or role.

If your caller identity returned a user, run the following:

aws iam attach-user-policy --user-name <myIAMUser> --policy-arn arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess

If your caller identity returned a role, run the following:

aws iam attach-role-policy --role-name <myIAMRole> --policy-arn arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess

Prepare the AWS Glue service role for interactive sessions

You can specify the second principal, GlueServiceRole, either in the notebook itself by using the %iam_role magic or stored alongside the AWS CLI config. If you have a role that you typically use with AWS Glue jobs, this will be that role. If you don’t have a role you use for AWS Glue jobs, refer to Setting up IAM permissions for AWS Glue to set one up.

To set this role as the default role for interactive sessions, edit the AWS CLI credentials file and add glue_role_arn to the profile you intend to use.

  1. With a text editor, open ~/.aws/credentials.
    On Windows, use C:\Users\username\.aws\credentials.
  2. Look for the profile you use for AWS Glue; if you don’t use a profile, you’re looking for [Default].
  3. Add a line to the profile for the role you intend to use, such as glue_role_arn=<AWSGlueServiceRole>.
  4. I recommend adding a default Region to your profile if one is not specified already. You can do so by adding the line region=us-east-1, replacing us-east-1 with your desired Region.
    If you don’t add a Region to your profile, you’re required to specify the Region at the top of each notebook with the %region magic. When finished, your config should look something like the following:

    [Default]
    aws_access_key_id=ABCDEFGHIJKLMNOPQRST
    aws_secret_access_key=1234567890ABCDEFGHIJKLMNOPQRSTUVWZYX1234
    glue_role_arn=arn:aws:iam::123456789123:role/AWSGlueServiceRoleForSessions
    region=us-west-2

  5. Save the config.

Start Jupyter and an AWS Glue PySpark notebook

To start Jupyter and your notebook, complete the following steps:

  1. Run the following command in your terminal to open the Jupyter notebook in your browser:
    jupyter notebook

    Your browser should open and you’re presented with a page that looks like the following screenshot.

  2. On the New menu, choose Glue PySpark.

A new tab opens with a blank Jupyter notebook using the AWS Glue PySpark kernel.

Configure your notebook with magics

AWS Glue interactive sessions are configured with Jupyter magics. Magics are small commands prefixed with % at the start of Jupyter cells that provide shortcuts to control the environment. In AWS Glue interactive sessions, magics are used for all configuration needs, including:

  • %region – Region
  • %profile – AWS CLI profile
  • %iam_role – IAM role for the AWS Glue service role
  • %worker_type – Worker type
  • %number_of_workers – Number of workers
  • %idle_timeout – How long to allow a session to idle before stopping it
  • %additional_python_modules – Python libraries to install from pip

Magics are placed at the beginning of your first cell, before your code, to configure AWS Glue. To discover all the magics of interactive sessions, run %help in a cell and a full list is printed. With the exception of %%sql, running a cell of only magics doesn't start a session, but sets the configuration for the session that starts next when you run your first cell of code. For this post, we use magics to configure AWS Glue version 2.0, two G.2X workers, and a 60-minute idle timeout. Let's enter the following magics into our first cell and run it:

%glue_version 2.0
%number_of_workers 2
%worker_type G.2X
%idle_timeout 60


Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Setting Glue version to: 2.0
Previous number of workers: 5
Setting new number of workers to: 2
Previous worker type: G.1X
Setting new worker type to: G.2X

When you run magics, the output lets us know the values we’re changing along with their previous settings. Explicitly setting all your configuration in magics helps ensure consistent runs of your notebook every time and is recommended for production workloads.

Run your first code cell and author your AWS Glue notebook

Next, we run our first code cell. This is when a session is provisioned for use with this notebook. When interactive sessions are properly configured within an account, the session is completely isolated to this notebook. If you open another notebook in a new tab, it gets its own session on its own isolated compute. Run your code cell as follows:

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Authenticating with profile=default
glue_role_arn defined by user: arn:aws:iam::123456789123:role/AWSGlueServiceRoleForSessions
Attempting to use existing AssumeRole session credentials.
Trying to create a Glue session for the kernel.
Worker Type: G.2X
Number of Workers: 2
Session ID: 12345678-12fa-5315-a234-567890abcdef
Applying the following default arguments:
--glue_kernel_version 0.31
--enable-glue-datacatalog true
Waiting for session 12345678-12fa-5315-a234-567890abcdef to get into ready status...
Session 12345678-12fa-5315-a234-567890abcdef has been created

When you ran the first cell containing code, Jupyter invoked interactive sessions, provisioned an AWS Glue cluster, and sent the code to AWS Glue Spark. The notebook was given a session ID, as shown in the preceding code. We can also see the properties used to provision AWS Glue, including the IAM role that AWS Glue used to create the session, the number of workers and their type, and any other options that were passed as part of the creation.

Interactive sessions automatically initialize a Spark session as spark and SparkContext as sc; having Spark ready to go saves a lot of boilerplate code. However, if you want to convert your notebook to a job, spark and sc must be initialized and declared explicitly.
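For reference, a sketch of the standard boilerplate that a converted job script typically needs is shown below; the only job argument assumed here is the standard JOB_NAME parameter.

# Sketch of the standard AWS Glue job boilerplate a converted notebook needs:
# explicit SparkContext/GlueContext creation plus job init and commit.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... transformation code from the notebook goes here ...

job.commit()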

Work in the notebook

Now that we have a session up, let’s do some work. In this exercise, we look at population estimates from the AWS COVID-19 dataset, clean them up, and write the results to a table.

This walkthrough uses data from the COVID-19 data lake.

To make the data from the AWS COVID-19 data lake available in the Data Catalog in your AWS account, create an AWS CloudFormation stack using the following template.

If you’re signed in to your AWS account, deploy the CloudFormation stack by clicking the following Launch stack button:

BDB-2063-launch-cloudformation-stack

It fills out most of the stack creation form for you. All you need to do is choose Create stack. For instructions on creating a CloudFormation stack, see Get started.

When I’m working on a new data integration process, the first thing I often do is identify and preview the datasets I’m going to work on. If I don’t recall the exact location or table name, I typically open the AWS Glue console and search or browse for the table, then return to my notebook to preview it. With interactive sessions, there is a quicker way to browse the Data Catalog. We can use the %%sql magic to show databases and tables without leaving the notebook. For this example, the population table I want is in the COVID-19 dataset, but I don’t recall its exact name, so I use the %%sql magic to look it up:

%%sql
show tables in `covid-19`  -- Remember, dashes in names must be escaped with backticks.

+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
|covid-19|alleninstitute_co...|      false|
|covid-19|alleninstitute_me...|      false|
|covid-19|aspirevc_crowd_tr...|      false|
|covid-19|aspirevc_crowd_tr...|      false|
|covid-19|cdc_moderna_vacci...|      false|
|covid-19|cdc_pfizer_vaccin...|      false|
|covid-19|       country_codes|      false|
|covid-19|  county_populations|      false|
|covid-19|covid_knowledge_g...|      false|
|covid-19|covid_knowledge_g...|      false|
|covid-19|covid_knowledge_g...|      false|
|covid-19|covid_knowledge_g...|      false|
|covid-19|covid_knowledge_g...|      false|
|covid-19|covid_knowledge_g...|      false|
|covid-19|covid_testing_sta...|      false|
|covid-19|covid_testing_us_...|      false|
|covid-19|covid_testing_us_...|      false|
|covid-19|      covidcast_data|      false|
|covid-19|  covidcast_metadata|      false|
|covid-19|enigma_aggregatio...|      false|
+--------+--------------------+-----------+
only showing top 20 rows

Looking through the returned list, we see a table named county_populations. Let’s select from this table, sorting for the largest counties by population:

%%sql
select * from `covid-19`.county_populations sort by `population estimate 2018` desc limit 10

+--------------+-----+---------------+-----------+------------------------+
|            id|  id2|         county|      state|population estimate 2018|
+--------------+-----+---------------+-----------+------------------------+
|            Id|  Id2|         County|      State|    Population Estima...|
|0500000US01085| 1085|        Lowndes|    Alabama|                    9974|
|0500000US06057| 6057|         Nevada| California|                   99696|
|0500000US29189|29189|      St. Louis|   Missouri|                  996945|
|0500000US22021|22021|Caldwell Parish|  Louisiana|                    9960|
|0500000US06019| 6019|         Fresno| California|                  994400|
|0500000US28143|28143|         Tunica|Mississippi|                    9944|
|0500000US05051| 5051|        Garland|   Arkansas|                   99154|
|0500000US29079|29079|         Grundy|   Missouri|                    9914|
|0500000US27063|27063|        Jackson|  Minnesota|                    9911|
+--------------+-----+---------------+-----------+------------------------+

Our query returned data, but in an unexpected order. It looks like population estimate 2018 was sorted lexicographically, as if the values were strings. Let’s use an AWS Glue DynamicFrame to get the schema of the table and verify the issue:

# Create a DynamicFrame of county_populations and print its schema
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="covid-19", table_name="county_populations"
)
dyf.printSchema()

root
|-- id: string
|-- id2: string
|-- county: string
|-- state: string
|-- population estimate 2018: string

The schema shows population estimate 2018 to be a string, which is why our column isn’t sorting properly. We can use the apply_mapping transform in our next cell to correct the column type. In the same transform, we also clean up the column names and other column types: clarifying the distinction between id and id2, removing spaces from population estimate 2018 (conforming to Hive’s standards), and casting id2 as an integer for proper sorting. After validating the schema, we show the data with the new schema:

# Rename id2 to simple_id and convert to Int
# Remove spaces and rename population est. and convert to Long
mapped = dyf.apply_mapping(
    mappings=[
        ("id", "string", "id", "string"),
        ("id2", "string", "simple_id", "int"),
        ("county", "string", "county", "string"),
        ("state", "string", "state", "string"),
        ("population estimate 2018", "string", "population_est_2018", "long"),
    ]
)
mapped.printSchema()
 
root
|-- id: string
|-- simple_id: int
|-- county: string
|-- state: string
|-- population_est_2018: long


mapped_df = mapped.toDF()
mapped_df.show()

+--------------+---------+---------+-------+-------------------+
|            id|simple_id|   county|  state|population_est_2018|
+--------------+---------+---------+-------+-------------------+
|0500000US01001|     1001|  Autauga|Alabama|              55601|
|0500000US01003|     1003|  Baldwin|Alabama|             218022|
|0500000US01005|     1005|  Barbour|Alabama|              24881|
|0500000US01007|     1007|     Bibb|Alabama|              22400|
|0500000US01009|     1009|   Blount|Alabama|              57840|
|0500000US01011|     1011|  Bullock|Alabama|              10138|
|0500000US01013|     1013|   Butler|Alabama|              19680|
|0500000US01015|     1015|  Calhoun|Alabama|             114277|
|0500000US01017|     1017| Chambers|Alabama|              33615|
|0500000US01019|     1019| Cherokee|Alabama|              26032|
|0500000US01021|     1021|  Chilton|Alabama|              44153|
|0500000US01023|     1023|  Choctaw|Alabama|              12841|
|0500000US01025|     1025|   Clarke|Alabama|              23920|
|0500000US01027|     1027|     Clay|Alabama|              13275|
|0500000US01029|     1029| Cleburne|Alabama|              14987|
|0500000US01031|     1031|   Coffee|Alabama|              51909|
|0500000US01033|     1033|  Colbert|Alabama|              54762|
|0500000US01035|     1035|  Conecuh|Alabama|              12277|
|0500000US01037|     1037|    Coosa|Alabama|              10715|
|0500000US01039|     1039|Covington|Alabama|              36986|
+--------------+---------+---------+-------+-------------------+
only showing top 20 rows

With the data sorting correctly, we can write it to Amazon Simple Storage Service (Amazon S3) as a new table in the AWS Glue Data Catalog. We use the mapped DynamicFrame for this write because we didn’t modify any data past that transform:

# Create "demo" Database if none exists
spark.sql("create database if not exists demo")


# Set glueContext sink for writing new table
S3_BUCKET = "<S3_BUCKET>"
s3output = glueContext.getSink(
    path=f"s3://{S3_BUCKET}/interactive-sessions-blog/populations/",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    compression="snappy",
    enableUpdateCatalog=True,
    transformation_ctx="s3output",
)
s3output.setCatalogInfo(catalogDatabase="demo", catalogTableName="populations")
s3output.setFormat("glueparquet")
s3output.writeFrame(mapped)

Finally, we run a query against our new table to show our table created successfully and validate our work:

%%sql
select * from demo.populations

Convert notebooks to AWS Glue jobs with nbconvert

Jupyter notebooks are saved as .ipynb files. AWS Glue doesn’t currently run .ipynb files directly, so they need to be converted to Python scripts before they can be uploaded to Amazon S3 as jobs. Use the jupyter nbconvert command from a terminal to convert the script.

  1. Open a new terminal or PowerShell tab or window.
  2. cd to the working directory where your notebook is.
    This is likely the same directory where you ran jupyter notebook at the beginning of this post.
  3. Run the following bash command to convert the notebook, providing the correct file name for your notebook:
    jupyter nbconvert --to script <Untitled-1>.ipynb

  4. Run cat <Untitled-1>.py to view your new file.
  5. Upload the .py file to Amazon S3 using the following command, replacing the bucket, path, and file name as needed:
    aws s3 cp <Untitled-1>.py s3://<bucket>/<path>/<Untitled-1.py>

  6. Create your AWS Glue job with the following command.

Note that the magics aren’t automatically converted to job parameters when converting notebooks locally. You need to put in your job arguments correctly, or import your notebook to AWS Glue Studio and complete the following steps to keep your magic settings.

aws glue create-job \
    --name is_blog_demo \
    --role "<GlueServiceRole>" \
    --command '{"Name": "glueetl", "PythonVersion": "3", "ScriptLocation": "s3://<bucket>/<path>/<Untitled-1>.py"}' \
    --default-arguments '{"--enable-glue-datacatalog": "true"}' \
    --number-of-workers 2 \
    --worker-type G.2X

Run the job

After you have authored the notebook, converted it to a Python file, uploaded it to Amazon S3, and finally made it into an AWS Glue job, the only thing left to do is run it. Do so with the following terminal command:

aws glue start-job-run --job-name is_blog_demo --region us-east-1

Conclusion

AWS Glue interactive sessions offer a new way to interact with the AWS Glue serverless Spark environment. Set it up in minutes, start sessions in seconds, and only pay for what you use. You can use interactive sessions for AWS Glue job development, ad hoc data integration and exploration, or for large queries and audits. AWS Glue interactive sessions are generally available in all Regions that support AWS Glue.

To learn more and get started using AWS Glue Interactive Sessions visit our developer guide and begin coding in seconds.


About the author

Zach Mitchell is a Sr. Big Data Architect. He works within the product team to enhance understanding between product engineers and their customers while guiding customers through their journey to develop data lakes and other data solutions on AWS analytics services.

How Grillo Built a Low-Cost Earthquake Early Warning System on AWS

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/how-grillo-built-a-low-cost-earthquake-early-warning-system-on-aws/

It is estimated that 50 percent of the injuries caused when a high magnitude earthquake affects an area are because of falls or falling hazards. This means that most of these injuries could have been prevented if the population had a few seconds of warning to take cover. Grillo, a social impact enterprise focused on seismology, created a low-cost solution using AWS that senses earthquakes and alerts the population in real time about the dangers in the area.

Earthquakes can happen at any time, and there are two actions cities can take to mitigate the damages. First is structural refitting, that is, building structures that can resist earthquakes. This solution doesn’t apply to many areas because they require big investments. The second solution is to send an alert to the affected population before the shaking reaches them. Ten to sixty seconds can be enough time for people to take action by getting out of a building, taking cover, or turning off a dangerous machine.

Earthquake Early Warning (EEW) systems provide rapid detection of earthquakes and alert people at risk. However, because of the hardware, infrastructure, and technology involved, traditional EEW systems can cost hundreds of millions of US dollars to deploy—a cost too high for most countries.

Andrés Meira was living in Haiti during the 2010 earthquake that claimed over 100,000 human lives and left many people homeless and injured. It is estimated that the earthquake affected three million people. He later moved to Mexico, where in 2017, he experienced another high-magnitude earthquake. As a result, Andrés founded Grillo to develop an accessible EEW system, and its solution has been operating successfully in Mexico since 2017.

Grillo developed a low-cost EEW system using sensors and cloud computing. This system uses off-the-shelf sensors that are placed in buildings near seismically active zones. Grillo sensors cost approximately $300 USD, compared to the traditional seismometers that cost around $10,000 USD. Because of these inexpensive sensors, Grillo can offer a higher density of sensors, which reduces the time needed to issue an alert and gives people more time for action. This benefits the population because higher density increases the accuracy of the location detection, reduces false positives, and reduces times to alert.

How sensors are placed

How Grillo sensors are placed

Grillo’s sensors transmit data to the cloud as the shaking is happening. The cloud platform Grillo built on AWS uses machine learning models that can determine and alert in almost real time, with an average latency of 2 to 3 seconds if an earthquake is happening, depending on the data sent by the different sensors. When the cloud platform detects earthquake risk, it sends alerts to nearby populations via a native phone application, IoT loudspeakers placed in populated areas, or by SMS.

Grillo data flow

How data flows from the shaking to the end users

OpenEEW
In addition, Grillo founded the OpenEEW initiative to enable EEW systems for millions of people who live in areas with earthquake risks. The initiative publishes the sensor hardware schematics, firmware, dashboard, and other elements of the system as open source, under a permissive license that anyone can use freely.

As part of this initiative, Grillo also shares all the data produced by the sensors deployed in Mexico, Chile, Puerto Rico, and Costa Rica on the Registry of Open Data on AWS, so that different organizations can learn from it and train machine learning models.

Low cost sensor

Low-cost sensor

Grillo in Haiti
Haiti ranks among the countries with the highest seismic risk in the world. Large magnitude earthquakes hit Haiti in 2020 and 2021. Currently, Grillo is working to establish their low-cost EEW system in southern Haiti, where most of the large seismic events in the past decade have occurred. This area is home to over three million people.

Over the course of 2021, Grillo installed over 100 sensors in Puerto Rico. And during 2022, they have focused on deploying sensors in the nationwide cell tower network of Haiti. Also during this year, they will calibrate the machine learning models with data from the new sensors in order to correctly predict when there is earthquake risk. Finally, they will develop an SMS alert system with Digicel, a local telecommunication company. Grillo plans to complete the deployment of the south Haiti EEW system by the end of 2022.

School in southern Haiti where alarm systems are placed

School in southern Haiti where alarm systems are placed

Learn more
Grillo partnered with the AWS Disaster Response team to achieve their goals. AWS helped Grillo to migrate their initial system to AWS and provided expert technical assistance on how to use Amazon SageMaker and AWS IoT services. AWS also provided credits to run the system and financial help to build the sensors.

Check the AWS Disaster Response page to learn more about the projects they are currently working on. And visit the Grillo home page to learn more about their EEW system.

Marcia

From centralized architecture to decentralized architecture: How data sharing fine-tunes Amazon Redshift workloads

Post Syndicated from Jingbin Ma original https://aws.amazon.com/blogs/big-data/from-centralized-architecture-to-decentralized-architecture-how-data-sharing-fine-tunes-amazon-redshift-workloads/

Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that offers simple operations and high performance. It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Today, Amazon Redshift has become the most widely used cloud data warehouse.

With the significant growth of data for big data analytics over the years, some customers have asked how they should optimize Amazon Redshift workloads. In this post, we explore how to optimize workloads on Amazon Redshift clusters using Amazon Redshift RA3 nodes, data sharing, and pausing and resuming clusters. For more cost-optimization methods, refer to Getting the most out of your analytics stack with Amazon Redshift.

Key features of Amazon Redshift

First, let’s review some key features:

  • RA3 nodes – Amazon Redshift RA3 nodes are backed by a new managed storage model that gives you the power to separately optimize your compute power and your storage. They bring a few very important features, one of which is data sharing. RA3 nodes also support the ability to pause and resume, which allows you to easily suspend on-demand billing while the cluster is not being used.
  • Data sharing – Amazon Redshift data sharing allows you to extend the ease of use, performance, and cost benefits of Amazon Redshift in a single cluster to multi-cluster deployments while being able to share data. Data sharing enables instant, granular, and fast data access across Amazon Redshift clusters without the need to copy or move it. You can securely share live data with Amazon Redshift clusters in the same or different AWS accounts, and across Regions. You can share data at many levels, including schemas, tables, views, and user-defined functions. You can also share the most up-to-date and consistent information as it’s updated in Amazon Redshift Serverless. Data sharing also provides fine-grained access controls that you can tailor for different users and businesses that all need access to the data. However, data sharing in Amazon Redshift has a few limitations.

Solution overview

In this use case, our customer is heavily using Amazon Redshift as the data warehouse for their analytics workloads, and they have been enjoying the capability and convenience that Amazon Redshift brings to their business. They mainly use Amazon Redshift to store and process user behavioral data for BI purposes. The data has grown by hundreds of gigabytes daily in recent months, and employees from multiple departments continuously run queries against the Amazon Redshift cluster on their BI platform during business hours.

The company runs four major analytics workloads on a single Amazon Redshift cluster, because some data is used by all workloads:

  • Queries from the BI platform – Various queries run mainly during business hours.
  • Hourly ETL – This extract, transform, and load (ETL) job runs in the first few minutes of each hour. It generally takes about 40 minutes.
  • Daily ETL – This job runs twice a day during business hours, because the operation team needs to get daily reports before the end of the day. Each job normally takes between 1.5–3 hours. It’s the second-most resource-heavy workload.
  • Weekly ETL – This job runs in the early morning every Sunday. It’s the most resource-heavy workload. The job normally takes 3–4 hours.

Over the years, the analytics team has migrated to the RA3 family and increased the Amazon Redshift cluster to 12 nodes to keep the average runtime of queries from their BI tool within an acceptable range as the data size grew, especially when other workloads are running.

However, they have noticed that performance is reduced while running ETL tasks, and the duration of ETL tasks is long. Therefore, the analytics team wants to explore solutions to optimize their Amazon Redshift cluster.

Because CPU utilization spikes appear while the ETL tasks are running, the AWS team’s first thought was to separate workloads and relevant data into multiple Amazon Redshift clusters with different cluster sizes. By reducing the total number of nodes, we hoped to reduce the cost of Amazon Redshift.

After a series of conversations, the AWS team found that one of the reasons that the customer keeps all workloads on the 12-node Amazon Redshift cluster is to manage the performance of queries from their BI platform, especially while running ETL workloads, which have a big impact on the performance of all workloads on the Amazon Redshift cluster. The obstacle is that many tables in the data warehouse are required to be read and written by multiple workloads, and only the producer of a data share can update the shared data.

The challenge of dividing the Amazon Redshift cluster into multiple clusters is data consistency. Some tables need to be read by ETL workloads and written by BI workloads, and some tables are the opposite. Therefore, if we duplicate data into two Amazon Redshift clusters or only create a data share from the BI cluster to the reporting cluster, the customer will have to develop a data synchronization process to keep the data consistent between all Amazon Redshift clusters, and this process could be very complicated and unmaintainable.

After more analysis to gain an in-depth understanding of the customer’s workloads, the AWS team found that we could put tables into four groups, and proposed a multi-cluster, two-way data sharing solution. The purpose of the solution is to divide the workloads across separate Amazon Redshift clusters so that the cluster for periodic workloads can be paused and resumed to reduce Amazon Redshift running costs, while each cluster can still access the single copy of data its workloads require. The solution should meet the data consistency requirements without building a complicated data synchronization process.

The following diagram illustrates the old architecture (left) compared to the new multi-cluster solution (right).

Improve the old architecture (left) to the new multi-cluster solution (right)

Dividing workloads and data

Due to the characteristics of the four major workloads, we categorized workloads into two categories: long-running workloads and periodic-running workloads.

The long-running workloads are for the BI platform and hourly ETL jobs. Because the hourly ETL workload requires about 40 minutes to run, the gain is small even if we migrate it to an isolated Amazon Redshift cluster and pause and resume it every hour. Therefore, we leave it with the BI platform.

The periodic-running workloads are the daily and weekly ETL jobs. The daily job generally takes about 1 hour and 40 minutes to 3 hours, and the weekly job generally takes 3–4 hours.

Data sharing plan

The next step is identifying all data (tables) access patterns of each category. We identified four types of tables:

  • Type 1 – Tables are only read and written by long-running workloads
  • Type 2 – Tables are read and written by long-running workloads, and are also read by periodic-running workloads
  • Type 3 – Tables are read and written by periodic-running workloads, and are also read by long-running workloads
  • Type 4 – Tables are only read and written by periodic-running workloads

Fortunately, there is no table that is required to be written by all workloads. Therefore, we can separate the Amazon Redshift cluster into two Amazon Redshift clusters: one for the long-running workloads, and the other for periodic-running workloads with 20 RA3 nodes.

We created a two-way data share between the long-running cluster and the periodic-running cluster. For type 2 tables, we created a data share on the long-running cluster as the producer and the periodic-running cluster as the consumer. For type 3 tables, we created a data share on the periodic-running cluster as the producer and the long-running cluster as the consumer.

The following diagram illustrates this data sharing configuration.

The long-running cluster (producer) shares type 2 tables to the periodic-running cluster (consumer). The periodic-running cluster (producer’) shares type 3 tables to the long-running cluster (consumer’)

Build two-way data share across Amazon Redshift clusters

In this section, we walk through the steps to build a two-way data share across Amazon Redshift clusters. First, let’s take a snapshot of the original Amazon Redshift cluster, which became the long-running cluster later.

Take a snapshot of the long-running-cluster from the Amazon Redshift console

Now, let’s create a new Amazon Redshift cluster with 20 RA3 nodes for periodic-running workloads. Then we migrate the type 3 and type 4 tables to the periodic-running cluster. Make sure you choose the ra3 node type. (Amazon Redshift Serverless supports data sharing too, and it became generally available in July 2022, so it is also an option now.)

Create the periodic-running-cluster. Make sure you select the ra3 node type.

Create the long-to-periodic data share

The next step is to create the long-to-periodic data share. Complete the following steps:

  1. On the periodic-running cluster, get the namespace by running the following query:
SELECT current_namespace;

Make sure to record the namespace.

  2. On the long-running cluster, we run queries similar to the following:
CREATE DATASHARE ltop_share SET PUBLICACCESSIBLE TRUE;
ALTER DATASHARE ltop_share ADD SCHEMA public_long;
ALTER DATASHARE ltop_share ADD ALL TABLES IN SCHEMA public_long;
GRANT USAGE ON DATASHARE ltop_share TO NAMESPACE '[periodic-running-cluster-namespace]';
  3. We can validate the long-to-periodic data share using the following command:
SHOW datashares;
  4. After we validate the data share, we get the long-running cluster namespace with the following query:
SELECT current_namespace;

Make sure to record the namespace.

  5. On the periodic-running cluster, run the following command to load the data from the long-to-periodic data share with the long-running cluster namespace:
CREATE DATABASE ltop FROM DATASHARE ltop_share OF NAMESPACE '[long-running-cluster-namespace]';
  6. Confirm that we have read access to tables in the long-to-periodic data share.

Create the periodic-to-long data share

The next step is to create the periodic-to-long data share. We use the namespaces of the long-running cluster and the periodic-running cluster that we collected in the previous step.

  1. On the periodic-running cluster, run queries similar to the following to create the periodic-to-long data share:
CREATE DATASHARE ptol_share SET PUBLICACCESSIBLE TRUE;
ALTER DATASHARE ptol_share ADD SCHEMA public_periodic;
ALTER DATASHARE ptol_share ADD ALL TABLES IN SCHEMA public_periodic;
GRANT USAGE ON DATASHARE ptol_share TO NAMESPACE '[long-running-cluster-namespace]';
  2. Validate the data share using the following command:
SHOW datashares;
  3. On the long-running cluster, run the following command to load the data from the periodic-to-long data share using the periodic-running cluster namespace:
CREATE DATABASE ptol FROM DATASHARE ptol_share OF NAMESPACE '[periodic-running-cluster-namespace]';
  4. Check that we have read access to the tables in the periodic-to-long data share.

At this stage, we have separated workloads into two Amazon Redshift clusters and built a two-way data share across two Amazon Redshift clusters.

The next step is updating the code of different workloads to use the correct endpoints of two Amazon Redshift clusters and perform consolidated tests.

Pause and resume the periodic-running Amazon Redshift cluster

Let’s update the crontab scripts that run the periodic workloads. We make two updates; a minimal wrapper sketch follows the list.

  1. When the scripts start, check the cluster status and call the Amazon Redshift resume-cluster API to resume the periodic-running cluster if it is paused:
    aws redshift resume-cluster --cluster-identifier [periodic-running-cluster-id]

  2. After the workloads are finished, call the Amazon Redshift pause cluster API with the cluster ID to pause the cluster:
    aws redshift pause-cluster --cluster-identifier [periodic-running-cluster-id]
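As a minimal sketch of such a wrapper (the cluster identifier, polling interval, and ETL entry point are assumptions, not the customer's actual scripts), the two updates could be combined in Python as follows:

# Hypothetical sketch: resume the periodic-running cluster before the ETL job
# and pause it again afterwards. Cluster ID and ETL logic are placeholders.
import time
import boto3

CLUSTER_ID = 'periodic-running-cluster'
redshift = boto3.client('redshift')

def wait_until_available(cluster_id, delay=30):
    """Poll until the cluster reports the 'available' status."""
    while True:
        cluster = redshift.describe_clusters(ClusterIdentifier=cluster_id)['Clusters'][0]
        if cluster['ClusterStatus'] == 'available':
            return
        time.sleep(delay)

def run_periodic_etl():
    """Placeholder for the existing daily or weekly ETL logic."""
    pass

cluster = redshift.describe_clusters(ClusterIdentifier=CLUSTER_ID)['Clusters'][0]
if cluster['ClusterStatus'] == 'paused':
    redshift.resume_cluster(ClusterIdentifier=CLUSTER_ID)
wait_until_available(CLUSTER_ID)

try:
    run_periodic_etl()
finally:
    redshift.pause_cluster(ClusterIdentifier=CLUSTER_ID)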

Results

After we migrated the workloads to the new architecture, the company’s analytics team ran some tests to verify the results.

According to tests, the performance of all workloads improved:

  • The BI workload is about 100% faster during the ETL workload running periods
  • The hourly ETL workload is about 50% faster
  • The daily workload duration reduced to approximately 40 minutes, from a maximum of 3 hours
  • The weekly workload duration reduced to approximately 1.5 hours, from a maximum of 4 hours

All functionality works properly, and the cost of the new architecture increased by only approximately 13%, even though over 10% more data was added during the testing period.

Learnings and limitations

After we separated the workloads into different Amazon Redshift clusters, we discovered a few things:

  • The performance of the BI workloads was 100% faster because there was no resource competition with daily and weekly ETL workloads anymore
  • The duration of ETL workloads on the periodic-running cluster was reduced significantly because there were more nodes and no resource competition from the BI and hourly ETL workloads
  • Even when over 10% new data was added, the overall cost of the Amazon Redshift clusters only increased by 13%, due to using the cluster pause and resume function of the Amazon Redshift RA3 family

As a result, we saw a 70% price-performance improvement of the Amazon Redshift cluster.

However, there are some limitations of the solution:

  • To use the Amazon Redshift pause and resume function, the code for calling the Amazon Redshift pause and resume APIs must be added to all scheduled scripts that run ETL workloads on the periodic-running cluster
  • Amazon Redshift clusters require several minutes to finish pausing and resuming, although you’re not charged during these processes
  • The size of Amazon Redshift clusters can’t automatically scale in and out depending on workloads

Next steps

After improving performance significantly, we can explore the possibility of reducing the number of nodes of the long-running cluster to reduce Amazon Redshift costs.

Another possible optimization is using Amazon Redshift Spectrum to reduce the cost of Amazon Redshift on cluster storage. With Redshift Spectrum, multiple Amazon Redshift clusters can concurrently query and retrieve the same structured and semistructured dataset in Amazon Simple Storage Service (Amazon S3) without the need to make copies of the data for each cluster or having to load the data into Amazon Redshift tables.
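As a rough sketch of that idea (the external schema name, IAM role, cluster, database, and table names here are assumptions, not part of the customer's implementation), an external schema over the AWS Glue Data Catalog could be created and queried through the Amazon Redshift Data API like this:

# Hypothetical sketch: expose a Data Catalog database to Amazon Redshift
# Spectrum and query it through the Redshift Data API. All identifiers are
# placeholders; retrieving query results via get_statement_result is omitted.
import boto3

redshift_data = boto3.client('redshift-data')

create_schema_sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_demo
FROM DATA CATALOG DATABASE 'demo'
IAM_ROLE 'arn:aws:iam::123456789123:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

for sql in (create_schema_sql, 'SELECT count(*) FROM spectrum_demo.populations;'):
    redshift_data.execute_statement(
        ClusterIdentifier='long-running-cluster',
        Database='dev',
        DbUser='awsuser',
        Sql=sql,
    )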

Amazon Redshift Serverless was announced for preview in AWS re:Invent 2021 and became generally available in July 2022. Redshift Serverless automatically provisions and intelligently scales your data warehouse capacity to deliver best-in-class performance for all your analytics. You only pay for the compute used for the duration of the workloads on a per-second basis. You can benefit from this simplicity without making any changes to your existing analytics and BI applications. You can also share data for read purposes across different Amazon Redshift Serverless instances within or across AWS accounts.

Therefore, we can explore the possibility of removing the need to script for pausing and resuming the periodic-running cluster by using Redshift Serverless to make the management easier. We can also explore the possibility of improving the granularity of workloads.

Conclusion

In this post, we discussed how to optimize workloads on Amazon Redshift clusters using RA3 nodes, data sharing, and pausing and resuming clusters. We also explored a use case implementing a multi-cluster two-way data share solution to improve workload performance with a minimum code change. If you have any questions or feedback, please leave them in the comments section.


About the authors

Jingbin Ma

Jingbin Ma is a Sr. Solutions Architect at Amazon Web Services. He helps customers build well-architected applications using AWS services. He has many years of experience working in the internet industry, and his last role was CTO of a New Zealand IT company before joining AWS. He is passionate about serverless and infrastructure as code.

Chao Pan

Chao Pan is a Data Analytics Solutions Architect at Amazon Web Services. He’s responsible for the consultation and design of customers’ big data solution architectures. He has extensive experience in open-source big data. Outside of work, he enjoys hiking.
