Alex Kim: Why I joined Cloudflare

Post Syndicated from Alex Kim original https://blog.cloudflare.com/alex-kim-why-i-joined-cloudflare/

Alex Kim: Why I joined Cloudflare

Alex Kim: Why I joined Cloudflare

I am excited to announce that as of November 1, I have joined Cloudflare as Country Manager of South Korea to help build a better Internet and to expand Cloudflare’s growing customer, partner, and local teams in Korea. We just opened a new entity (after making Seoul our 23rd data center, more than 10 years ago)  and I am the first official employee of Cloudflare Korea LLC in Seoul, which is truly a great moment and privilege for me.

A little about me

I was born in Korea and was educated in Korea until middle school, then I decided to move to Toronto, Canada to study film making to become a movie director. I finished high school and obtained a university degree in Toronto, during which I had the opportunity to be exposed to various cultures, as well as learn and become well-versed in the English language. I think it was a great time to learn how diverse people in the world are. My dream of becoming a movie director has changed over time for many reasons, but I think it is no coincidence that I have a job where I have to produce results while collaborating and orchestrating with many people, much like a movie director.

In my career of about 18 years, I have had various experiences, including pre-sales, support, consultant, and field sales, starting with Java programmer. The lesson from this variety of experiences is that if you work with a sense of ownership all the time, you can be the best in the field, and you can get the best compliments from your customers.

I’ve worked in a small company where the whole company has been agile, and I’ve worked in large companies like SAP, Dell, Autodesk, and Akamai, working with many teams. New technology and the best technology are important, but I also learned that the most important thing is the environment where people can work together and have fun, because people make the results after all.

Besides work, I love music. I didn’t become a movie director, which was my childhood dream, but I relieve my stress by playing the piano and composing songs. In the past, I made a rock song for one of the companies I worked for, and when an opportunity presented itself,  we had a program where all the employees jumped in and sang my composition together. Unfortunately, I have not had enough time to make a lot of songs now, but if I have a chance, I would love to make a Cloudflare song and hope I can sing it together with my new colleagues.

Why Cloudflare

Korea has one of the highest smartphone and Internet penetration rates in the world. Korea is also one of the countries with the fastest Internet speeds in the world. On the other hand, the pace of cloud transformation, that is making such a big difference to so many companies, is still lagging behind. The reason is that there are many government regulations on public enterprises and finance industries. Fortunately, as the government has recently moved to ease many regulations, the pace of cloud transformation is expected to accelerate in the future.

As cloud transitions accelerate, enterprises need to pay attention to security, and few companies will be able to deploy security as easily and securely in a cloud environment as Cloudflare.

Korea is a country where the economy grows only when it exports a lot. Many startups and chaebol (conglomerate) companies often grow future-oriented industries such as metaverse in Korea first and then expand their business abroad. For customers leading this global industry, Cloudflare will act like a safe highway in an Internet environment. I’ve come to Cloudflare to be part of this meaningful work.

In addition, Cloudflare Korea has just been launched. Even though we’ve had a presence here through our data center for the last 10 years, there are still many companies that we still need to build relationships with. I want to spread the value of Cloudflare to the Korean market quickly and become a Supercloud evangelist. I would also like to help Korean customers — organizations and businesses across multiple industries — achieve great success and ensure they have the right technology and Internet infrastructure. In the next few years, I will work hard to establish Cloudflare as the most trusted cloud security company in Korea, as well as contribute to expanding the business and creating jobs in the country.

The vision for the future…

As the first Country Manager of Cloudflare Korea, I am very excited  to work for a company with unlimited growth potential. As the global economy slows down, customers will gravitate towards products and solutions that are more valuable and price competitive. I’m looking forward to meeting and working with more customers that will benefit from all that Cloudflare has to offer.

One of the biggest reasons I chose Cloudflare is that Cloudflare has big dreams and visions. In particular, I think the emergence of R2 will provide an extremely cost-effective solution to enterprises’ egress cost concerns, especially in economically challenging times.

In addition, Cloudflare is investing heavily to become the number one Zero Trust player. The VPN market is huge, and it has a lot of challenges (including user experience, speed, and security), and Zero Trust is still in its infancy but already showing its true potential. Cloudflare, which understands and invests in these huge markets, knows where to go in the marketplace.

Finally, the Supercloud is also an area that only Cloudflare can realize. Cloud security and Zero Trust are indispensable areas of the future, and I am very happy to join this futuristic company.

How dynamic data masking support in Amazon Redshift helps achieve data privacy and compliance

Post Syndicated from Rohit Vashishtha original https://aws.amazon.com/blogs/big-data/how-dynamic-data-masking-support-in-amazon-redshift-helps-achieve-data-privacy-and-compliance/

Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that offers simple operations and high performance. It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Today, Amazon Redshift is the most widely used cloud data warehouse.

Dynamic data masking (DDM) support (preview) in Amazon Redshift enables you to simplify the process of protecting sensitive data in your Amazon Redshift data warehouse. You can now use DDM to protect data based on your job role or permission rights and level of data sensitivity through a SQL interface. DDM support (preview) in Amazon Redshift enables you to hide, obfuscate, or pseudonymize column values within the tables in your data warehouse without incurring additional storage costs. It is configurable to allow you to define consistent, format-preserving, and irreversible masked data values.

DDM support (preview) in Amazon Redshift provides a native feature to support your need to mask data for regulatory or compliance requirements, or to increase internal privacy standards. Compared to static data masking where underlying data at rest gets permanently replaced or redacted, DDM support (preview) in Amazon Redshift enables you to temporarily manipulate the display of sensitive data in transit at query time based on user privilege, leaving the original data at rest intact. You control access to data through masking policies that apply custom obfuscation rules to a given user or role. That way, you can respond to changing privacy requirements without altering the underlying data or editing SQL queries.

With DDM support (preview) in Amazon Redshift, you can do the following:

  • Define masking policies that apply custom obfuscation policies (for example, masking policies to handle credit card, PII entries, HIPAA or GDPR needs, and more)
  • Transform the data at query time to apply masking policies
  • Attach masking policies to roles or users
  • Attach multiple masking policies with varying levels of obfuscation to the same column in a table and assign them to different roles with priorities to avoid conflicts
  • Implement cell-level masking by using conditional columns when creating your masking policy
  • Use masking policies to partially or completely redact data, or hash it by using user-defined functions (UDFs)

Here’s what our customers have to say on DDM support(private beta) in Amazon Redshift:

“Baffle delivers data-centric protection for enterprises via a data security platform that is transparent to applications and unique to data security. Our mission is to seamlessly weave data security into every data pipeline. Previously, to apply data masking to an Amazon Redshift data source, we had to stage the data in an Amazon S3 bucket. Now, by utilizing the Amazon Redshift Dynamic Data Masking capability, our customers can protect sensitive data throughout the analytics pipeline, from secure ingestion to responsible consumption reducing the risk of breaches.”

-Ameesh Divatia, CEO & co-founder of Baffle

“EnergyAustralia is a leading Australian energy retailer and generator, with a mission to lead the clean energy transition for customers in a way that is reliable, affordable and sustainable for all. We enable all corners of our business with Data & Analytics capabilities that are used to optimize business processes and enhance our customers’ experience. Keeping our customers’ data safe is a top priority across our teams. In the past, this involved multiple layers of custom built security policies that could make it cumbersome for analysts to find the data they require. The new AWS dynamic data masking feature will significantly simplify our security processes so we continue to keep customer data safe, while also reducing the administrative overhead.”

-William Robson, Data Solutions Design Lead, EnergyAustralia

Use case

For our use case, a retail company wants to control how they show credit card numbers to users based on their privilege. They also don’t want to duplicate the data for this purpose. They have the following requirements:

  • Users from Customer Service should be able to view the first six digits and the last four digits of the credit card for customer verification
  • Users from Fraud Prevention should be able to view the raw credit card number only if it’s flagged as fraud
  • Users from Auditing should be able to view the raw credit card number
  • All other users should not be able to view the credit card number

Solution overview

The solution encompasses creating masking policies with varying masking rules and attaching one or more to the same role and table with an assigned priority to remove potential conflicts. These policies may pseudonymize results or selectively nullify results to comply with retailers’ security requirements. We refer to multiple masking policies being attached to a table as a multi-modal masking policy. A multi-modal masking policy consists of three parts:

  • A data masking policy that defines the data obfuscation rules
  • Roles with different access levels depending on the business case
  • The ability to attach multiple masking policies on a user or role and table combination with priority for conflict resolution

The following diagram illustrates how DDM support (preview) in Amazon Redshift policies works with roles and users for our retail use case.

For a user with multiple roles, the masking policy with the highest attachment priority is used. For example, in the following example, Ken is part of the Public and FrdPrvnt role. Because the FrdPrvnt role has a higher attachment priority, card_number_conditional_mask will be applied.

Prerequisites

To implement this solution, you need to complete the following prerequisites:

  1. Have an AWS account.
  2. Have an Amazon Redshift cluster provisioned with DDM support (preview) or a serverless workgroup with DDM support (preview).
    1. Navigate to the provisioned or serverless Amazon Redshift console and choose Create preview cluster.
    2. In the create cluster wizard, choose the preview track.
  3. Have Superuser privilege, or the sys:secadmin role on the Amazon Redshift data warehouse created in step 2.

Preparing the data

To set up our use case, complete the following steps:

  1. On the Amazon Redshift console, choose Query editor v2 in Explorer.
    If you’re familiar with SQL Notebooks, you can download the Jupyter notebook for the demonstration, and import it to quickly get started.
  2. Create the table and populate contents.
  3. Create users.
    -- 1- Create the credit cards table
    CREATE TABLE credit_cards (
    customer_id INT,
    is_fraud BOOLEAN,
    credit_card TEXT
    );
    -- 2- Populate the table with sample values
    INSERT INTO credit_cards
    VALUES
    (100,'n', '453299ABCDEF4842'),
    (100,'y', '471600ABCDEF5888'),
    (102,'n', '524311ABCDEF2649'),
    (102,'y', '601172ABCDEF4675'),
    (102,'n', '601137ABCDEF9710'),
    (103,'n', '373611ABCDEF6352')
    ;
    --run GRANT to grant SELECT permission on the table
    GRANT SELECT ON credit_cards TO PUBLIC;
    --create four users
    CREATE USER Kate WITH PASSWORD '1234Test!';
    CREATE USER Ken  WITH PASSWORD '1234Test!';
    CREATE USER Bob  WITH PASSWORD '1234Test!';
    CREATE USER Jane WITH PASSWORD '1234Test!';

Implement the solution

To satisfy the security requirements, we need to make sure that each user sees the same data in different ways based on their granted privileges. To do that, we use user roles combined with masking policies as follows:

  1. Create user roles and grant different users to different roles:
    -- 1. Create User Roles
    CREATE ROLE cust_srvc_role;
    CREATE ROLE frdprvnt_role;
    CREATE ROLE auditor_role;
    -- note that public role exist by default.
    
    -- Grant Roles to Users
    GRANT ROLE cust_srvc_role to Kate;
    GRANT ROLE frdprvnt_role  to Ken;
    GRANT ROLE auditor_role   to Bob;
    -- note that regualr_user is attached to public role by default.

  2. Create masking policies:
    -- 2. Create Masking policies
    
    -- 2.1 create a masking policy that fully masks the credit card number
    CREATE MASKING POLICY Mask_CC_Full
    WITH (credit_card VARCHAR(256))
    USING ('XXXXXXXXXXXXXXXX');
    
    --2.2- Create a scalar SQL user-defined function(UDF) that partially obfuscates credit card number, only showing the first 6 digits and the last 4 digits
    CREATE FUNCTION REDACT_CREDIT_CARD (text)
      returns text
    immutable
    as $$
      select left($1,6)||'XXXXXX'||right($1,4)
    $$ language sql;
    
    
    --2.3- create a masking policy that applies the REDACT_CREDIT_CARD function
    CREATE MASKING POLICY Mask_CC_Partial
    WITH (credit_card VARCHAR(256))
    USING (REDACT_CREDIT_CARD(credit_card));
    
    -- 2.4- create a masking policy that will display raw credit card number only if it is flagged for fraud 
    CREATE MASKING POLICY Mask_CC_Conditional
    WITH (is_fraud BOOLEAN, credit_card VARCHAR(256))
    USING (CASE WHEN is_fraud 
                     THEN credit_card 
                     ELSE Null 
           END);
    
    -- 2.5- Create masking policy that will show raw credit card number.
    CREATE MASKING POLICY Mask_CC_Raw
    WITH (credit_card varchar(256))
    USING (credit_card);

  3. Attach the masking policies on the table or column to the user or role:
    -- 3. ATTACHING MASKING POLICY
    -- 3.1- make the Mask_CC_Full the default policy for all users
    --    all users will see this masking policy unless a higher priority masking policy is attached to them or their role
    
    ATTACH MASKING POLICY Mask_CC_Full
    ON credit_cards(credit_card)
    TO PUBLIC;
    
    -- 3.2- attach Mask_CC_Partial to the cust_srvc_role role
    --users with the cust_srvc_role role can see partial credit card information
    ATTACH MASKING POLICY Mask_CC_Partial
    ON credit_cards(credit_card)
    TO ROLE cust_srvc_role
    PRIORITY 10;
    
    -- 3.3- Attach Mask_CC_Conditional masking policy to frdprvnt_role role
    --    users with frdprvnt_role role can only see raw credit card if it is fraud
    ATTACH MASKING POLICY Mask_CC_Conditional
    ON credit_cards(credit_card)
    USING (is_fraud, credit_card)
    TO ROLE frdprvnt_role
    PRIORITY 20;
    
    -- 3.4- Attach Mask_CC_Raw masking policy to auditor_role role
    --    users with auditor_role role can see raw credit card numbers
    ATTACH MASKING POLICY Mask_CC_Raw
    ON credit_cards(credit_card)
    TO ROLE auditor_role
    PRIORITY 30;

Test the solution

Let’s confirm that the masking policies are created and attached.

  1. Check that the masking policies are created with the following code:
    -- 1.1- Confirm the masking policies are created
    SELECT * FROM svv_masking_policy;

  2. Check that the masking policies are attached:
    -- 1.2- Verify attached masking policy on table/column to user/role.
    SELECT * FROM svv_attached_masking_policy;

    Now we can test that different users can see the same data masked differently based on their roles.

  3. Test that the Customer Service agents can only view the first six digits and the last four digits of the credit card number:
    -- 1- Confirm that customer service agent can only view the first 6 digits and the last 4 digits of the credit card number
    SET SESSION AUTHORIZATION Kate;
    SELECT * FROM credit_cards;

  4. Test that the Fraud Prevention users can only view the raw credit card number when it’s flagged as fraud:
    -- 2- Confirm that Fraud Prevention users can only view fraudulent credit card number
    SET SESSION AUTHORIZATION Ken;
    SELECT * FROM credit_cards;

  5. Test that Auditor users can view the raw credit card number:
    -- 3- Confirm the auditor can view RAW credit card number
    SET SESSION AUTHORIZATION Bob;
    SELECT * FROM credit_cards;

  6. Test that general users can’t view any digits of the credit card number:
    -- 4- Confirm that regular users can not view any digit of the credit card number
    SET SESSION AUTHORIZATION Jane;
    SELECT * FROM credit_cards;

Modify the masking policy

To modify an existing masking policy, you must detach it from the role first and then drop and recreate it.

In our use case, the business changed direction and decided that Customer Service agents should only be allowed to view the last four digits of the credit card number.

  1. Detach and drop the policy:
    --reset session authorization to the default
    RESET SESSION AUTHORIZATION;
    --detach masking policy from the credit_cards table
    DETACH MASKING POLICY Mask_CC_Partial
    ON                    credit_cards(credit_card)
    FROM ROLE             cust_srvc_role;
    -- Drop the masking policy
    DROP MASKING POLICY Mask_CC_Partial;
    -- Drop the function used in masking
    DROP FUNCTION REDACT_CREDIT_CARD (TEXT);

  2. Recreate the policy and reattach the policy on the table or column to the intended user or role.Note that this time we created a scalar Python UDF. It’s possible to create a SQL, Python, and Lambda UDF based on your use case.
    -- Re-create the policy and re-attach it to role
    
    -- Create a user-defined function that partially obfuscates credit card number, only showing the last 4 digits
    CREATE FUNCTION REDACT_CREDIT_CARD (credit_card TEXT) RETURNS TEXT IMMUTABLE AS $$
        import re
        regexp = re.compile("^([0-9A-F]{6})[0-9A-F]{5,6}([0-9A-F]{4})")
        match = regexp.search(credit_card)
        if match != None:
            last = match.group(2)
        else:
            last = "0000"
        return "XXXXXXXXXXXX{}".format(last)
    $$ LANGUAGE plpythonu;
    
    --Create a masking policy that applies the REDACT_CREDIT_CARD function
    CREATE MASKING POLICY Mask_CC_Partial
    WITH (credit_card VARCHAR(256))
    USING (REDACT_CREDIT_CARD(credit_card));
    
    -- attach Mask_CC_Partial to the cust_srvc_role role
    -- users with the cust_srvc_role role can see partial credit card information
    ATTACH MASKING POLICY Mask_CC_Partial
    ON credit_cards(credit_card)
    TO ROLE cust_srvc_role
    PRIORITY 10;

  3. Test that Customer Service agents can only view the last four digits of the credit card number:
    -- Confirm that customer service agent can only view the last 4 digits of the credit card number
    SET SESSION AUTHORIZATION Kate;
    SELECT * FROM credit_cards;

Clean up

When you’re done with the solution, clean up your resources:

  1. Detach the masking policies from the table:
    -- Cleanup
    --reset session authorization to the default
    RESET SESSION AUTHORIZATION;
    
    --1.	Detach the masking policies from table
    DETACH MASKING POLICY Mask_CC_Full
    ON credit_cards(credit_card)
    FROM PUBLIC;
    DETACH MASKING POLICY Mask_CC_Partial
    ON credit_cards(credit_card)
    FROM ROLE cust_srvc_role;
    DETACH MASKING POLICY Mask_CC_Conditional
    ON credit_cards(credit_card)
    FROM ROLE frdprvnt_role;
    DETACH MASKING POLICY Mask_CC_Raw
    ON credit_cards(credit_card)
    FROM ROLE auditor_role;

  2. Drop the masking policies:
    -- 2.	Drop the masking policies 
    DROP MASKING POLICY Mask_CC_Full;
    DROP MASKING POLICY Mask_CC_Partial;
    DROP MASKING POLICY Mask_CC_Conditional;
    DROP MASKING POLICY Mask_CC_Raw;

  3. Revoke and drop each user and role:
    -- 3.	Revoke/Drop - role/user 
    REVOKE ROLE cust_srvc_role from Kate;
    REVOKE ROLE frdprvnt_role  from Ken;
    REVOKE ROLE auditor_role   from Bob;
    
    DROP ROLE cust_srvc_role;
    DROP ROLE frdprvnt_role;
    DROP ROLE auditor_role;
    
    DROP USER Kate;
    DROP USER Ken;
    DROP USER Bob;
    DROP USER Jane;

  4. Drop the function and table:
    -- 4.	Drop function and table 
    DROP FUNCTION REDACT_CREDIT_CARD (credit_card TEXT);
    DROP TABLE credit_cards;

Considerations and best practices

Consider the following:

  • Always create a default policy attached to the public user. If you create a new user, they will always have a minimum policy attached. It will enforce the intended security posture.
  • Remember that DDM policies in Amazon Redshift always follow invoker permissions convention, not definer (for more information, refer to Security and privileges for stored procedures ). That being said, the masking policies are applicable based on the user or role running it.
  • For best performance, create the masking functions using a scalar SQL UDF, if possible. The performance of scalar UDFs typically goes by the order of SQL to Python to Lambda, in that order. Generally, SQL UDF outperforms Python UDFs and the latter outperforms scalar Lambda UDFs.
  • DDM policies in Amazon Redshift are applied ahead of any predicate or join operations. For example, if you’re running a join on a masked column (per your access policy) to an unmasked column, the join will lead to a mismatch. That’s an expected behavior.
  • Always detach a masking policy from all users or roles before dropping it.
  • As of this writing, the solution has the following limitations:
    • You can apply a mask policy on tables and columns and attach it to a user or role, but groups are not supported.
    • You can’t create a mask policy on views, materialized views, and external tables.
    • The DDM support (preview) in Amazon Redshift is available in following regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm).

Performance benchmarks

Based on various tests performed on TPC-H datasets, we’ve found built-in functions to be more performant as compared to functions created externally using scalar Python or Lambda UDFs.

Expand the solution

You can take this solution further and set up a masking policy that restricts SSN and email address access as follows:

  • Customer Service agents accessing pre-built dashboards may only view the last four digits of SSNs and complete email addresses for correspondence
  • Analysts cannot view SSNs or email addresses
  • Auditing services may access raw values for SSNs as well as email addresses

For more information, refer to Use DDM support (preview) in Amazon Redshift for E-mail & SSN Masking.

Conclusion

In this post, we discussed how to use DDM support (preview) in Amazon Redshift to define configuration-driven, consistent, format-preserving, and irreversible masked data values. With DDM support (preview) in Amazon Redshift, you can control your data masking approach using familiar SQL language. You can take advantage of the Amazon Redshift role-based access control capability to implement different levels of data masking. You can create a masking policy to identify which column needs to be masked, and you have the flexibility of choosing how to show the masked data. For example, you can completely hide all the information of the data, replace partial real values with wildcard characters, or define your own way to mask the data using SQL expressions, Python, or Lambda UDFs. Additionally, you can apply a conditional masking based on other columns, which selectively protects the column data in a table based on the values in one or more columns.

We encourage you to create your own user defined functions for various use-cases and accomplish desired security posture using dynamic data masking support in Amazon Redshift.


About the Authors

Rohit Vashishtha is a Senior Analytics Specialist Solutions Architect at AWS based in Dallas, TX. He has more than 16 years of experience architecting, building, leading, and maintaining big data platforms. Rohit helps customers modernize their analytic workloads using the breadth of AWS services and ensures that customers get the best price/performance with the utmost security and data governance.

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based on Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytic solutions.

Variyam Ramesh is a Senior Analytics Specialist Solutions Architect at AWS based in Charlotte, NC. He is an accomplished technology leader helping customers conceptualize, develop, and deliver innovative analytic solutions.

Yanzhu Ji is a Product Manager in the Amazon Redshift team. She has experience in product vision and strategy in industry-leading data products and platforms. She has outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.

James Moore is a Technical Lead at Amazon Redshift focused on SQL features and security. His work over the last 10 years has spanned distributed systems, machine learning, and databases. He is passionate about building scalable software that enables customers to solve real-world problems.

Simplify data ingestion from Amazon S3 to Amazon Redshift using auto-copy

Post Syndicated from Jason Pedreza original https://aws.amazon.com/blogs/big-data/simplify-data-ingestion-from-amazon-s3-to-amazon-redshift-using-auto-copy/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze your data using standard SQL and your existing business intelligence (BI) tools. Tens of thousands of customers today rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, making it the most widely used cloud data warehouse.

Data ingestion is the process of getting data to Amazon Redshift. You can leverage one of the many zero-ETL integration methods to make data available in Amazon Redshift directly. However, if your data is in your Amazon S3 bucket, then you can simply load data from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift using the COPY command. A COPY command is the most efficient way to load a table from S3 because it uses the Amazon Redshift’s massively parallel processing (MPP) architecture to read and load data in parallel.

Amazon Redshift launched auto-copy support to simplify data loading from Amazon S3 into Amazon Redshift. You can now setup continuous file ingestion rules to track your Amazon S3 paths and automatically load new files without the need for additional tools or custom solutions. This also enables end users to have the latest data available in Amazon Redshift shortly after the source data is available.

This post shows you how to build automatic file ingestion pipelines in Amazon Redshift when source files are located on Amazon S3 by using a simple SQL command. In addition, we show you how to enable auto-copy using auto-copy jobs, how to monitor jobs, considerations, and best practices.

Overview of the auto-copy feature in Amazon Redshift

The auto-copy feature in Amazon Redshift leverages the S3 event integration to automatically load data into Amazon Redshift and simplifies automatic data loading from Amazon S3 with a simple SQL command. You can enable Amazon Redshift auto-copy by creating auto-copy jobs. A auto-copy job is a database object that stores, automates, and reuses the COPY statement for newly created files that land in the S3 folder.

The following diagram illustrates this process.

S3 event integration and auto-copy jobs have the following benefits:

  • Users can now load data from Amazon S3 automatically without having to build a pipeline or using an external framework
  • auto-copy jobs offer automatic and incremental data ingestion from an Amazon S3 location without the need to implement a custom solution
  • This functionality comes at no additional cost
  • Existing COPY statements can be converted into auto-copy jobs by appending the JOB CREATE <job_name> parameter
  • It keeps track of loaded files and minimizes data duplication.
  • It can be quickly set up using a simple SQL statement using your choice of JDBC/ODBC clients.
  • It has automatic error handling of bad quality data files.
  • It has a mechanism to load-once for each file. This means that there is no need to generate explicit manifest files.

Prerequisites

To get started with auto-copy, you need the following prerequisites:

  • An AWS account
  • An encrypted Amazon Redshift provisioned cluster or Amazon redshift serverless workgroup
  • An Amazon S3 bucket
  • Add following to the Amazon S3 bucket policy
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Auto-Copy-Policy-01",
                "Effect": "Allow",
                "Principal": {
                    "Service":"redshift.amazonaws.com"
                        
                    
                },
                "Action": [
                    "s3:GetBucketNotification",
                    "s3:PutBucketNotification",
                    "s3:GetBucketLocation"
                ],
                "Resource": "arn:aws:s3:::<<your-s3-bucket-name>>",
                "Condition": {
                    "StringEquals": {
                        "aws:SourceArn": "arn:aws:redshift:<region-name>:<aws-account-id>:integration:*",
                        "aws:SourceAccount": "<aws-account-id>"
                    }
                }
            }
        ]
    }

Set up Amazon S3 event Integration

An Amazon S3 event integration facilitates seamless and automated data ingestion from S3 buckets into an Amazon Redshift data warehouse, streamlining the process of transferring and storing data for analytical purposes

  1. Sign in to the AWS Management Console and Navigate to Amazon Redshift home page. Under Integrations section choose S3 event integrations

  2. Choose Create S3 event integration

  3. Enter Integration name and Description, choose Next

  4. Choose Browse S3 buckets, a dialog box pops up, select the Amazon S3 bucket and choose Continue

  5. Amazon S3 bucket is selected. Choose Next
  6. Choose Browse Redshift data warehouse

  7. Choose the Amazon Redshift data warehouse and choose Continue

  8. Then Amazon Redshift resource policy needs access to S3 event integration. In case of Resource policy error, check Fix it for me and choose Next

  9. Add Tags as required and choose Next

  10. Review changes and choose Create S3 event integration
  11. An S3 event integration is created. Wait until the status of S3 event integration is Active

Set up auto-copy jobs

In this section, we demonstrate how to automate data loading of files from Amazon S3 into Amazon Redshift. With the existing COPY syntax, we add the JOB CREATE parameter to perform a one-time setup for automatic file ingestion. See the following code:

COPY <table-name>
FROM 's3://<s3-object-path>'
[COPY PARAMETERS...]
JOB CREATE <job-name> [AUTO ON | OFF];

Auto ingestion is enabled by default on auto-copy jobs. Files already present at the S3 location will not be visible to the auto-copy job. Only files added after JOB creation are tracked by Amazon Redshift.

Automate ingestion from a single data source

With a auto-copy job, you can automate ingestion from a single data source by creating one job and specifying the path to the S3 objects that contain the data. The S3 object path can reference a set of folders that have the same key prefix.

In this example, we have multiple files that are being loaded on a daily basis containing the sales transactions across all the stores in the US. For this we can create a store_sales folder in the bucket.

The following code creates the store_sales table:

DROP TABLE IF EXISTS public.store_sales;
CREATE TABLE IF NOT EXISTS public.store_sales
(
  ss_sold_date_sk int4,            
  ss_sold_time_sk int4,     
  ss_item_sk int4 not null,      
  ss_customer_sk int4,           
  ss_cdemo_sk int4,              
  ss_hdemo_sk int4,         
  ss_addr_sk int4,               
  ss_store_sk int4,           
  ss_promo_sk int4,           
  ss_ticket_number int8 not null,        
  ss_quantity int4,           
  ss_wholesale_cost numeric(7,2),          
  ss_list_price numeric(7,2),              
  ss_sales_price numeric(7,2),
  ss_ext_discount_amt numeric(7,2),             
  ss_ext_sales_price numeric(7,2),              
  ss_ext_wholesale_cost numeric(7,2),           
  ss_ext_list_price numeric(7,2),               
  ss_ext_tax numeric(7,2),                 
  ss_coupon_amt numeric(7,2), 
  ss_net_paid numeric(7,2),   
  ss_net_paid_inc_tax numeric(7,2),             
  ss_net_profit numeric(7,2),
  primary key (ss_item_sk, ss_ticket_number)
 ) 
DISTKEY (ss_item_sk) 
SORTKEY (ss_sold_date_sk);

Next, we create the auto-copy job to automatically load the gzip-compressed files into the store_sales table:

COPY store_sales
FROM 's3://aws-redshift-s3-auto-copy-demo/store_sales'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
gzip delimiter '|' EMPTYASNULL
region 'us-east-1'
JOB CREATE job_store_sales AUTO ON;

Each day’s sales transactions are loaded to their own folder in Amazon S3.

Now upload the files for transaction sold on 2002-12-31. Each folder contains multiple gzip-compressed files.

Since the auto-copy job is already created, it automatically loads the gzip-compressed files located in the S3 object path specified in the COPY command to the store_sales table.

Let’s run a query to get the daily total of sales transactions across all the stores in the US:

SELECT ss_sold_date_sk, count(1)
  FROM store_sales
GROUP BY ss_sold_date_sk;

The output shown comes from the transactions sold on 2002-12-31.

The following day, incremental sales transactions data are loaded to a new folder in the same S3 object path.

As new files arrive to the same S3 object path, the auto-copy job automatically loads the unprocessed files to the store_sales table in an incremental fashion.

All new sales transactions for 2003-01-01 are automatically ingested, which can be verified by running the following query:

SELECT ss_sold_date_sk, count(1)
  FROM store_sales
GROUP BY ss_sold_date_sk;

Automate ingestion from multiple data sources

We can also load an Amazon Redshift table from multiple data sources. When using a pub/sub pattern where multiple S3 buckets populate data to an Amazon Redshift table, you have to maintain multiple data pipelines for each source/target combination. With new parameters in the COPY command, this can be automated to handle data loads efficiently.

In the following example, the Customer_1 folder has Green Cab Company sales data, and the Customer_2 folder has Red Cab Company sales data. We can use the COPY command with the JOB parameter to automate this ingestion process.

The following screenshot shows sample data stored in files. Each folder has similar data but for different customers.

The target for these files in this example is the Amazon Redshift table cab_sales_data.

Define the target table cab_sales_data:

DROP TABLE IF EXISTS cab_sales_data;
CREATE TABLE IF NOT EXISTS cab_sales_data
(
  vendorid                VARCHAR(4),
  pickup_datetime         TIMESTAMP,
  dropoff_datetime        TIMESTAMP,
  store_and_fwd_flag      VARCHAR(1),
  ratecode                INT,
  pickup_longitude        FLOAT4,
  pickup_latitude         FLOAT4,
  dropoff_longitude       FLOAT4,
  dropoff_latitude        FLOAT4,
  passenger_count         INT,
  trip_distance           FLOAT4,
  fare_amount             FLOAT4,
  extra                   FLOAT4,
  mta_tax                 FLOAT4,
  tip_amount              FLOAT4,
  tolls_amount            FLOAT4,
  ehail_fee               FLOAT4,
  improvement_surcharge   FLOAT4,
  total_amount            FLOAT4,
  payment_type            VARCHAR(4),
  trip_type               VARCHAR(4)
)
DISTSTYLE EVEN
SORTKEY (passenger_count,pickup_datetime);

You can define two auto-copy jobs as shown in the following code to handle and monitor the ingestion of sales data belonging to different customers, in our case Customer_1 and Customer_2. These jobs monitor the Customer_1 and Customer_2 folders and load new files that are added here.

COPY public.cab_sales_data
FROM 's3://aws-redshift-s3-auto-copy-demo/Customer_1'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
DATEFORMAT 'auto'
IGNOREHEADER 1
DELIMITER ','
IGNOREBLANKLINES
REGION 'us-east-1'
JOB CREATE job_green_cab AUTO ON;

COPY public.cab_sales_data
FROM 's3:// aws-redshift-s3-auto-copy-demo/Customer_2'
IAM_ROLE 'arn:aws:iam::**********:role/Redshift-S3'
DATEFORMAT 'auto'
IGNOREHEADER 1
DELIMITER ','
IGNOREBLANKLINES
REGION 'us-east-1'
JOB CREATE job_red_cab AUTO ON;

After setting up the two jobs, we can upload the relevant files into their respective folders. This will make sure that the data is loaded efficiently as soon as the files arrive. Each customer is assigned its own vendorid, as shown in the following output:

SELECT vendorid,
       sum(passenger_count) as total_passengers 
  FROM cab_sales_data
GROUP BY vendorid;

result 1

Manually run a auto-copy job

There might be scenarios wherein the auto-copy job needs to be paused, meaning it needs to stop looking for new files, for example, to fix a corrupted data pipeline at the data source.

In that case, either use the auto-copy job ALTER command to set AUTO to OFF or create a new auto-copy job with AUTO OFF. Once this is set, auto copy will no longer look for new files.

If necessary, users can manually invoke auto-copy job which will do the work and ingest if new files are found.

auto-copy job RUN <auto-copy job Name>

You can disable “AUTO ON” in the existing auto-copy job using the following command:

auto-copy job ALTER <auto-copy job Name> AUTO OFF

The following table compares the syntax and data duplication between a regular copy statement and the new auto-copy job

. Copy Auto-copy job
Syntax COPY <table-name>
FROM 's3://<s3-object-path>'
[COPY PARAMETERS...]
COPY <table-name>
FROM 's3://<s3-object-path>'
[COPY PARAMETERS...]
JOB CREATE <job-name>;
Data Duplication If it is run multiple times against the same S3 folder, it will load the data again, resulting in data duplication. It will not load the same file twice, preventing data duplication.

Error handling and monitoring for auto-copy jobs

auto-copy jobs continuously monitor the S3 folder specified during job creation and perform ingestion whenever new files are created. New files created under the S3 folder are loaded exactly once to avoid data duplication.

By default, if there are data or format issues with the specific files, the auto-copy job will fail to ingest the files with a load error and log details to the system tables. The auto-copy job will remain AUTO ON with new data files and will continue to ignore previously failed files.

Amazon Redshift provides the following system tables for users to monitor or troubleshoot auto-copy jobs as needed:

  • List auto-copy jobs – Use SYS_COPY_JOB to list the auto-copy jobs stored in the database:
SELECT * 
  FROM sys_copy_job;

  • Get a summary of a auto-copy job – Use the SYS_LOAD_HISTORY view to get the aggregate metrics of a auto-copy job operation by specifying the copy_job_id. It shows the aggregate metrics of the files that have been processed by a auto-copy job.
SELECT *
  FROM sys_load_history
 WHERE copy_job_id = 274978;

  • Get details of a auto-copy job – Use STL_LOAD_COMMITS to get the status and details of each file that was processed by a auto-copy job:
SELECT *
  FROM stl_load_commits
 WHERE copy_job_id = 274978
ORDER BY curtime ASC;

  • Get exception details of a auto-copy job – Use STL_LOAD_ERRORS to get the details of files that failed to ingest from a auto-copy job:
SELECT   query,
    slice,
    starttime , 
    filename,
    line_number,
    colname,
    type,
    err_code,
    err_reason,
    copy_job_id,
    raw_line,
    raw_field_value
  FROM stl_load_errors
 WHERE copy_job_id = 274978;

Auto-copy job best practices

In an auto-copy job, when a new file is detected and ingested (automatically or manually), Amazon Redshift stores the file name and doesn’t run this specific job when a new file is created with the same file name.

The following are the recommended best practices when working with files using the auto-copy job:

  • Use unique file names for each file in a auto-copy job (for example, 2022-10-15-batch-1.csv). However, you can use the same file name as long as it’s from different auto-copy jobs:
    • job_customerA_saless3://redshift-blogs/sales/customerA/2022-10-15-sales.csv
    • job_customerB_saless3://redshift-blogs/sales/customerB/2022-10-15-sales.csv
  • Do not update file contents. Do not overwrite existing files. Changes in existing files will not be reflected to the target table. The auto-copy job doesn’t pick up updated or overwritten files, so make sure they’re renamed as new file names for the auto-copy job to pick up.
  • Run regular COPY statements (not a job) if you need to ingest a file that was already processed by your auto-copy job. (COPY statement without a JOB CREATE syntax doesn’t track loaded files.) For example, this is helpful in scenarios where you don’t have control of the file name and the initial file received failed. The following figure shows a typical workflow in this case.

  • Delete and recreate your auto-copy job if you want to reset file tracking history and start over. You can drop auto-copy job using following command.
    auto-copy job DROP <auto-copy job Name>

auto-copy job considerations

Here are the main things to consider when using auto-copy:

For additional details on other considerations for auto-copy, refer to the AWS documentation.

Customer feedback

GE Aerospace is a global provider of jet engines, components, and systems for commercial and military aircraft. The company has been designing, developing, and manufacturing jet engines since World War I.

“GE Aerospace uses AWS analytics and Amazon Redshift to enable critical business insights that drive important business decisions. With the support for auto-copy from Amazon S3, we can build simpler data pipelines to move data from Amazon S3 to Amazon Redshift. This accelerates our data product teams’ ability to access data and deliver insights to end users. We spend more time adding value through data and less time on integrations.”

Alcuin Weidus Sr Principal Data Architect at GE Aerospace

Conclusion

This post demonstrated how to automate data ingestion from Amazon S3 to Amazon Redshift using the auto-copy feature. This new functionality helps make Amazon Redshift data ingestion easier than ever, and will allow SQL users to get access to the most recent data using a simple SQL command.

Users can begin ingesting data to Redshift from Amazon S3 with simple SQL commands and gain access to the most up-to-date data without the need for third-party tools or custom implementation.


About the authors

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 15+ years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Omama Khurshid is an Acceleration Lab Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build reliable, scalable, and efficient solutions. Outside of work, she enjoys spending time with her family, watching movies, listening to music, and learning new technologies.

Raza Hafeez is a Senior Product Manager at Amazon Redshift. He has over 13 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with data warehousing experience handling petabytes of data. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Nita Shah is an Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Eren Baydemir, a Technical Product Manager at AWS, has 15 years of experience in building customer-facing products and is currently focusing on data lake and file ingestion topics in the Amazon Redshift team. He was the CEO and co-founder of DataRow, which was acquired by Amazon in 2020.

Eesha Kumar is an Analytics Solutions Architect with AWS. He works with customers to realize the business value of data by helping them build solutions using the AWS platform and tools.

Satish Sathiya is a Senior Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.

 Hangjian Yuan is a Software Development Engineer at Amazon Redshift. He’s passionate about analytical databases and focuses on delivering cutting-edge streaming experiences for customers.

About Anomalous Data Transfer detection in InsightIDR

Post Syndicated from Dina Durutlic original https://blog.rapid7.com/2022/12/07/about-anomalous-data-transfer-detection-in-insightidr/

About Anomalous Data Transfer detection in InsightIDR

By Shivangi Pandey

Shivangi is a Senior Product Manager for D&R at Rapid7.

Data exfiltration is an unauthorized movement or transfer of data occurring on an organization’s network. This can occur when a malicious actor gains access to a corporation’s network with the intention of stealing or leaking data.

Data exfiltration can also be carried out by inside actors moving data outside of the network accidentally, by uploading corporate files to their personal cloud – or deliberately to leak information that harms the organization.

Identifying this cyber risk is integral to securing your organization’s network.

Of course, attackers use multiple methods

Some use phishing scams to trick users into inputting personal login information into spoofed domains so that they can use the appropriate credentials to infiltrate the network. Once on to the network, the malicious actor can send the files they were searching for outside of their network using remote desktop, SSH, etc.

Another method? Ignoring security controls of a network. For example, employees may download unauthorized software for ease of use, but unintentionally allow a third party to gain access to sensitive information that was not meant to leave the network. People may use personal accounts and devices for work related tasks just because it’s easy. A malicious inside actor can also circumvent security controls to leak information outside of the network.

With many organizations moving to a hybrid model of work, it’s more important than ever to prevent data exfiltration, intended or unintended. This can be done by educating your employees of appropriate conduct when it comes to data usage and data sharing within and outside of your network. Education about common attack vectors attackers may use to steal their credentials will also help your employees keep your network secure. Additionally, education around what devices can access your network will make it easier to monitor whether a data breach is about to occur. Finally, assigning certain privileges based on employee functions will help.

Being able to detect data exfiltration is incredibly important for an organization’s environment and essential to your organization’s security posture. One of our new detections, Anomalous Data Transfer, provides you with the visibility into possible occurrences of data exfiltration within your network.

Rapid7s approach for detecting Anomalous Data Transfers

Anomalous Data Transfer is an InsightIDR detection which utilizes network flow data, produced by the Insight Network Sensor, to identify and mark unusual transfers of data and behavior. The detection identifies anomalously large transfers of data sent by assets out of a network, and outputs data exfiltration alerts

The model dynamically derives a baseline for each asset based on its active periods over 30 days, and each hour, will output network activity that is anomalously high as compared to that baseline as a candidate for further investigation. This process effectively acts as a filter, reducing millions of network connections into a few candidate alerts to bring to the attention of a security analyst.

Further contextual information is included in each candidate alert to help a security team make informed decisions about how to investigate the possible occurrence of data exfiltration.

The user has the ability to tune exceptions for which anomalous data transfer alerts are shown by going into Managed Detections. The user can tune exception rules for Anomalous Data Transfer with the following attributes: Organization, Certificate, and Source IP/Subnet. This allows for the analysts to focus on alerts that are well tailored to their organization’s environment.

CVE-2022-4261: Rapid7 Nexpose Update Validation Issue (FIXED)

Post Syndicated from Tod Beardsley original https://blog.rapid7.com/2022/12/07/cve-2022-4261-rapid7-nexpose-update-validation-issue-fixed/

CVE-2022-4261: Rapid7 Nexpose Update Validation Issue (FIXED)

On November 14, 2022, Rapid7’s product engineering team discovered that the mechanism used to validate the source of an update file was unreliable. This failure involving the internal cryptographic validation of received updates was designated as CVE-2022-4261, and is an instance of CWE-494. Rapid7’s estimate of the CVSSv3.1 base rating for this vulnerability for most environments is 4.4 (Medium). This issue is resolved in the regular December 7, 2022 release.

Product Description

Rapid7 Nexpose is an on-premise vulnerability scanner, used by many enterprises around the world to assess and manage the vulnerability exposures present in their networks. You can read more about Nexpose at our website.

Note that CVE-2022-4261 only affects the on-premise Nexpose product, and does not affect InsightVM.

Credit

This issue was discovered by Rapid7 Principal Software Engineer Emmett Kelly and validated by the Rapid7 Nexpose product team. It is being disclosed in accordance with Rapid7’s vulnerability disclosure policy.

Exploitation

Exploitation of this issue is complex, and requires an attacker already in a privileged position in the network. By understanding these complications, we believe our customers will be better able to make appropriate judgements on the risk of delaying this update, perhaps due to established change control procedures.

In order to exploit CVE-2022-4261, an attacker would first need to be in a position to provide a malicious update to Nexpose, either through a privileged position on the network, on the local computer that runs Nexpose (with sufficient privileges to initiate an update), or by convincing a Nexpose administrator to apply a maliciously-crafted update through social engineering. Once applied, the update could introduce new functionality to Nexpose that would benefit the attacker.

Impact

Given the requirement of a privileged position on the network or local machine, exploiting CVE-2022-4261, in most circumstances, is academic. Such an adversary is likely to already have many other (and often easier) choices when it comes to leveraging this position to cause trouble on the target network. In the case of a local machine compromise (which is the most likely attack scenario), the attacker could use this position to instead create a fairly permanent ingress avenue to the internal network and exercise the usual lateral movement options documented as ATT&CK technique T1557.

Remediation

Disabling automatic updates completely removes the risk of exploitation of CVE-2022-4261. That said, most Nexpose administrators already employ Nexpose’s automated updates, and should apply updates either on their already established automated schedules or as soon as it’s convenient to do so.

Nexpose administrators that are especially concerned that they will be targeted during their next update, or who believe they have already been compromised by persistent attackers, should disable automatic updates and use the documented Managing Updates without an Internet Connection procedure to fix this issue, after manually validating the authenticity of the update package.

Fixing an update system with an update is always fairly complex, given the chicken-and-egg nature of the problem being addressed, as well as the risks involved in using an update system to fix an update system. So, it is out of an abundance of caution that we are publishing this advisory today to ensure that customers who rely on automatic updates are made plainly aware of this issue and can plan accordingly.

Disclosure Timeline

  • Mon, Nov 14, 2022: Issue discovered by Emett Kelly, and validated by the Nexpose product team.
  • Thu, Dec 1, 2022: CVE-2022-4261 reserved by Rapid7.
  • Wed, Dec 7, 2022 : This disclosure and update 6.6.172 released.

How to secure your SaaS tenant data in DynamoDB with ABAC and client-side encryption

Post Syndicated from Jani Muuriaisniemi original https://aws.amazon.com/blogs/security/how-to-secure-your-saas-tenant-data-in-dynamodb-with-abac-and-client-side-encryption/

If you’re a SaaS vendor, you may need to store and process personal and sensitive data for large numbers of customers across different geographies. When processing sensitive data at scale, you have an increased responsibility to secure this data end-to-end. Client-side encryption of data, such as your customers’ contact information, provides an additional mechanism that can help you protect your customers and earn their trust.

In this blog post, we show how to implement client-side encryption of your SaaS application’s tenant data in Amazon DynamoDB with the Amazon DynamoDB Encryption Client. This is accomplished by leveraging AWS Identity and Access Management (IAM) together with AWS Key Management Service (AWS KMS) for a more secure and cost-effective isolation of the client-side encrypted data in DynamoDB, both at run-time and at rest.

Encrypting data in Amazon DynamoDB

Amazon DynamoDB supports data encryption at rest using encryption keys stored in AWS KMS. This functionality helps reduce operational burden and complexity involved in protecting sensitive data. In this post, you’ll learn about the benefits of adding client-side encryption to achieve end-to-end encryption in transit and at rest for your data, from its source to storage in DynamoDB. Client-side encryption helps ensure that your plaintext data isn’t available to any third party, including AWS.

You can use the Amazon DynamoDB Encryption Client to implement client-side encryption with DynamoDB. In the solution in this post, client-side encryption refers to the cryptographic operations that are performed on the application-side in the application’s Lambda function, before the data is sent to or retrieved from DynamoDB. The solution in this post uses the DynamoDB Encryption Client with the Direct KMS Materials Provider so that your data is encrypted by using AWS KMS. However, the underlying concept of the solution is not limited to the use of the DynamoDB Encryption Client, you can apply it to any client-side use of AWS KMS, for example using the AWS Encryption SDK.

For detailed information about using the DynamoDB Encryption Client, see the blog post How to encrypt and sign DynamoDB data in your application. This is a great place to start if you are not yet familiar with DynamoDB Encryption Client. If you are unsure about whether you should use client-side encryption, see Client-side and server-side encryption in the Amazon DynamoDB Encryption Client Developer Guide to help you with the decision.

AWS KMS encryption context

AWS KMS gives you the ability to add an additional layer of authentication for your AWS KMS API decrypt operations by using encryption context. The encryption context is one or more key-value pairs of additional data that you want associated with AWS KMS protected information.

Encryption context helps you defend against the risks of ciphertexts being tampered with, modified, or replaced — whether intentionally or unintentionally. Encryption context helps defend against both an unauthorized user replacing one ciphertext with another, as well as problems like operational events. To use encryption context, you specify associated key-value pairs on encrypt. You must provide the exact same key-value pairs in the encryption context on decrypt, or the operation will fail. Encryption context is not secret, and is not an access-control mechanism. The encryption context is a means of authenticating the data, not the caller.

The Direct KMS Materials Provider used in this blog post transparently generates a unique data key by using AWS KMS for each item stored in the DynamoDB table. It automatically sets the item’s partition key and sort key (if any) as AWS KMS encryption context key-value pairs.

The solution in this blog post relies on the partition key of each table item being defined in the encryption context. If you encrypt data with your own implementation, make sure to add your tenant ID to the encryption context in all your AWS KMS API calls.

For more information about the concept of AWS KMS encryption context, see the blog post How to Protect the Integrity of Your Encrypted Data by Using AWS Key Management Service and EncryptionContext. You can also see another example in Exercise 3 of the Busy Engineer’s Document Bucket Workshop.

Attribute-based access control for AWS

Attribute-based access control (ABAC) is an authorization strategy that defines permissions based on attributes. In AWS, these attributes are called tags. In the solution in this post, ABAC helps you create tenant-isolated access policies for your application, without the need to provision tenant specific AWS IAM roles.

If you are new to ABAC, or need a refresher on the concepts and the different isolation methods, see the blog post How to implement SaaS tenant isolation with ABAC and AWS IAM.

Solution overview

If you are a SaaS vendor expecting large numbers of tenants, it is important that your underlying architecture can cost effectively scale with minimal complexity to support the required number of tenants, without compromising on security. One way to meet these criteria is to store your tenant data in a single pooled DynamoDB table, and to encrypt the data using a single AWS KMS key.

Using a single shared KMS key to read and write encrypted data in DynamoDB for multiple tenants reduces your per-tenant costs. This may be especially relevant to manage your costs if you have users on your organization’s free tier, with no direct revenue to offset your costs.

When you use shared resources such as a single pooled DynamoDB table encrypted by using a single KMS key, you need a mechanism to help prevent cross-tenant access to the sensitive data. This is where you can use ABAC for AWS. By using ABAC, you can build an application with strong tenant isolation capabilities, while still using shared and pooled underlying resources for storing your sensitive tenant data.

You can find the solution described in this blog post in the aws-dynamodb-encrypt-with-abac GitHub repository. This solution uses ABAC combined with KMS encryption context to provide isolation of tenant data, both at rest and at run time. By using a single KMS key, the application encrypts tenant data on the client-side, and stores it in a pooled DynamoDB table, which is partitioned by a tenant ID.

Solution Architecture

Figure 1: Components of solution architecture

Figure 1: Components of solution architecture

The presented solution implements an API with a single AWS Lambda function behind an Amazon API Gateway, and implements processing for two types of requests:

  1. GET request: fetch any key-value pairs stored in the tenant data store for the given tenant ID.
  2. POST request: store the provided key-value pairs in the tenant data store for the given tenant ID, overwriting any existing data for the same tenant ID.

The application is written in Python, it uses AWS Lambda Powertools for Python, and you deploy it by using the AWS CDK.

It also uses the DynamoDB Encryption Client for Python, which includes several helper classes that mirror the AWS SDK for Python (Boto3) classes for DynamoDB. This solution uses the EncryptedResource helper class which provides Boto3 compatible get_item and put_item methods. The helper class is used together with the KMS Materials Provider to handle encryption and decryption with AWS KMS transparently for the application.

Note: This example solution provides no authentication of the caller identity. See chapter “Considerations for authentication and authorization” for further guidance.

How it works

Figure 2: Detailed architecture for storing new or updated tenant data

Figure 2: Detailed architecture for storing new or updated tenant data

As requests are made into the application’s API, they are routed by API Gateway to the application’s Lambda function (1). The Lambda function begins to run with the IAM permissions that its IAM execution role (DefaultExecutionRole) has been granted. These permissions do not grant any access to the DynamoDB table or the KMS key. In order to access these resources, the Lambda function first needs to assume the ResourceAccessRole, which does have the necessary permissions. To implement ABAC more securely in this use case, it is important that the application maintains clear separation of IAM permissions between the assumed ResourceAccessRole and the DefaultExecutionRole.

As the application assumes the ResourceAccessRole using the AssumeRole API call (2), it also sets a TenantID session tag. Session tags are key-value pairs that can be passed when you assume an IAM role in AWS Simple Token Service (AWS STS), and are a fundamental core building block of ABAC on AWS. When the session credentials (3) are used to make a subsequent request, the request context includes the aws:PrincipalTag context key, which can be used to access the session’s tags. The chapter “The ResourceAccessRole policy” describes how the aws:PrincipalTag context key is used in IAM policy condition statements to implement ABAC for this solution. Note that for demonstration purposes, this solution receives the value for the TenantID tag directly from the request URL, and it is not authenticated.

The trust policy of the ResourceAccessRole defines the principals that are allowed to assume the role, and to tag the assumed role session. Make sure to limit the principals to the least needed for your application to function. In this solution, the application Lambda function is the only trusted principal defined in the trust policy.

Next, the Lambda function prepares to encrypt or decrypt the data (4). To do so, it uses the DynamoDB Encryption Client. The KMS Materials Provider and the EncryptedResource helper class are both initialized with sessions by using the temporary credentials from the AssumeRole API call. This allows the Lambda function to access the KMS key and DynamoDB table resources, with access restricted to operations on data belonging only to the specific tenant ID.

Finally, using the EncryptedResource helper class provided by the DynamoDB Encryption Library, the data is written to and read from the DynamoDB table (5).

Considerations for authentication and authorization

The solution in this blog post intentionally does not implement authentication or authorization of the client requests. Instead, the requested tenant ID from the request URL is passed as the tenant identity. Your own applications should always authenticate and authorize tenant requests. There are multiple ways you can achieve this.

Modern web applications commonly use OpenID Connect (OIDC) for authentication, and OAuth for authorization. JSON Web Tokens (JWTs) can be used to pass the resulting authorization data from client to the application. You can validate a JWT when using AWS API Gateway with one of the following methods:

  1. When using a REST or a HTTP API, you can use a Lambda authorizer
  2. When using a HTTP API, you can use a JWT authorizer
  3. You can validate the token directly in your application code

If you write your own authorizer code, you can pick a popular open source library or you can choose the AWS provided open source library. To learn more about using a JWT authorizer, see the blog post How to secure API Gateway HTTP endpoints with JWT authorizer.

Regardless of the chosen method, you must be able to map a suitable claim from the user’s JWT, such as the subject, to the tenant ID, so that it can be used as the session tag in this solution.

The ResourceAccessRole policy

A critical part of the correct operation of ABAC in this solution is with the definition of the IAM access policy for the ResourceAccessRole. In the following policy, be sure to replace <region>, <account-id>, <table-name>, and <key-id> with your own values.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:DescribeTable",
                "dynamodb:GetItem",
                "dynamodb:PutItem"
            ],
            "Resource": [
                "arn:aws:dynamodb:<region>:<account-id>:table/<table-name>",
           ],
            "Condition": {
                "ForAllValues:StringEquals": {
                    "dynamodb:LeadingKeys": [
                        "${aws:PrincipalTag/TenantID}"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:GenerateDataKey",
            ],
            "Resource": "arn:aws:kms:<region>:<account-id>:key/<key-id>",
            "Condition": {
                "StringEquals": {
                    "kms:EncryptionContext:tenant_id": "${aws:PrincipalTag/TenantID}"
                }
            }
        }
    ]
}

The policy defines two access statements, both of which apply separate ABAC conditions:

  1. The first statement grants access to the DynamoDB table with the condition that the partition key of the item matches the TenantID session tag in the caller’s session.
  2. The second statement grants access to the KMS key with the condition that one of the key-value pairs in the encryption context of the API call has a key called tenant_id with a value that matches the TenantID session tag in the caller’s session.

Warning: Do not use a ForAnyValue or ForAllValues set operator with the kms:EncryptionContext single-valued condition key. These set operators can create a policy condition that does not require values you intend to require, and allows values you intend to forbid.

Deploying and testing the solution

Prerequisites

To deploy and test the solution, you need the following:

Deploying the solution

After you have the prerequisites installed, run the following steps in a command line environment to deploy the solution. Make sure that your AWS CLI is configured with your AWS account credentials. Note that standard AWS service charges apply to this solution. For more information about pricing, see the AWS Pricing page.

To deploy the solution into your AWS account

  1. Use the following command to download the source code:
    git clone https://github.com/aws-samples/aws-dynamodb-encrypt-with-abac
    cd aws-dynamodb-encrypt-with-abac

  2. (Optional) You will need an AWS CDK version compatible with the application (2.37.0) to deploy. The simplest way is to install a local copy with npm, but you can also use a globally installed version if you already have one. To install locally, use the following command to use npm to install the AWS CDK:
    npm install [email protected]

  3. Use the following commands to initialize a Python virtual environment:
    python3 -m venv demoenv
    source demoenv/bin/activate
    python3 -m pip install -r requirements.txt

  4. (Optional) If you have not used AWS CDK with this account and Region before, you first need to bootstrap the environment:
    npx cdk bootstrap

  5. Use the following command to deploy the application with the AWS CDK:
    npx cdk deploy

  6. Make note of the API endpoint URL https://<api url>/prod/ in the Outputs section of the CDK command. You will need this URL for the next steps.
    Outputs:
    DemoappStack.ApiEndpoint4F160690 = https://<api url>/prod/

Testing the solution with example API calls

With the application deployed, you can test the solution by making API calls against the API URL that you captured from the deployment output. You can start with a simple HTTP POST request to insert data for a tenant. The API expects a JSON string as the data to store, so make sure to post properly formatted JSON in the body of the request.

An example request using curl -command looks like:

curl https://<api url>/prod/tenant/<tenant-name> -X POST --data '{"email":"<[email protected]>"}'

You can then read the same data back with an HTTP GET request:

curl https://<api url>/prod/tenant/<tenant-name>

You can store and retrieve data for any number of tenants, and can store as many attributes as you like. Each time you store data for a tenant, any previously stored data is overwritten.

Additional considerations

A tenant ID is used as the DynamoDB table’s partition key in the example application in this solution. You can replace the tenant ID with another unique partition key, such as a product ID, as long as the ID is consistently used in the IAM access policy, the IAM session tag, and the KMS encryption context. In addition, while this solution does not use a sort key in the table, you can modify the application to support a sort key with only a few changes. For more information, see Working with tables and data in DynamoDB.

Clean up

To clean up the application resources that you deployed while testing the solution, in the solution’s home directory, run the command cdk destroy.

Then, if you no longer plan to deploy to this account and Region using AWS CDK, you can also use the AWS CloudFormation console to delete the bootstrap stack (CDKToolKit).

Conclusion

In this post, you learned a method for simple and cost-efficient client-side encryption for your tenant data. By using the DynamoDB Encryption Client, you were able to implement the encryption with less effort, all while using a standard Boto3 DynamoDB Table resource compatible interface.

Adding to the client-side encryption, you also learned how to apply attribute-based access control (ABAC) to your IAM access policies. You used ABAC for tenant isolation by applying conditions for both the DynamoDB table access, as well as access to the KMS key that is used for encryption of the tenant data in the DynamoDB table. By combining client-side encryption with ABAC, you have increased your data protection with multiple layers of security.

You can start experimenting today on your own by using the provided solution. If you have feedback about this post, submit comments in the Comments section below. If you have questions on the content, consider submitting them to AWS re:Post

Want more AWS Security news? Follow us on Twitter.

Jani Muuriaisniemi

Jani is a Principal Solutions Architect at Amazon Web Services based out of Helsinki, Finland. With more than 20 years of industry experience, he works as a trusted advisor with a broad range of customers across different industries and segments, helping the customers on their cloud journey.

Gain visibility into your Amazon MSK cluster by deploying the Conduktor Platform

Post Syndicated from Stéphane Maarek original https://aws.amazon.com/blogs/big-data/gain-visibility-into-your-amazon-msk-cluster-by-deploying-the-conduktor-platform/

This is a guest post by AWS Data Hero and co-founder of Conduktor, Stephane Maarek.

Deploying Apache Kafka on AWS is now easier, thanks to Amazon Managed Streaming for Apache Kafka (Amazon MSK). In a few clicks, it provides you with a production-ready Kafka cluster on which you can run your applications and create data streams.

Apache Kafka is an open-source project, and no official user interfaces are available. The lack of visibility into Apache Kafka is a factor in the slow development of applications.

The recent announcement of the Conduktor Platform makes Amazon MSK operations simple, and you can solve Kafka issues end to end with solutions for testing, monitoring, data quality, governance, and security.

You can use the Conduktor Platform to monitor both types of MSK clusters, provisioned and serverless. In this post, we demonstrate how to use AWS Identity and Access Management (IAM) based security to administer our MSK cluster.

Solution overview

We look at how we can deploy the Conduktor Platform on Amazon MSK in a production-ready deployment so you can try it out today.

The solution is fully serverless and customizable. Everything is deployed using AWS CloudFormation templates.

The source code and CloudFormation templates used in this post are available in the GitHub repo.

To implement this solution, we complete the following high-level steps:

  1. Deploy a CloudFormation template to create our customized Docker image for the Conduktor Platform using AWS CodeBuild.
  2. Optionally, deploy an MSK cluster in provisioned or serverless mode using a CloudFormation template.
  3. Deploy the Conduktor Platform as an AWS Fargate container against our MSK cluster using a CloudFormation template.

Create a customized configuration for the Conduktor Platform

The Conduktor Platform uses a YAML configuration file to define the cluster connection endpoints. Therefore, we must create a customized Docker image of the Conduktor Platform that is able to connect to a cluster on Amazon MSK with a customized YAML file. For this, we use CodeBuild, and we store our configuration files in Amazon Simple Storage Service (Amazon S3). The final image is stored in Amazon Elastic Container Registry (Amazon ECR). The following diagram illustrates this workflow.

  1. Deploy the first CloudFormation template to create the following resources:
    • An S3 bucket to store our configuration files.
    • An ECR repository to store our final Docker image.
    • A CodeBuild project to build that Docker image.
    • An IAM role and policy to allow CodeBuild to perform the build.

Now we need to upload our files into Amazon S3.

  1. Upload the following files:
    • The file buildspec.yml, which is used by CodeBuild to build our primary Docker image.
    • The Dockerfile, which contains instructions on how to build our final Docker image.
    • The folder conduktor-platform-config (as is), which contains the configuration files to connect to Amazon MSK.

  1. At this stage, you can customize the conduktor-platform.yaml file, allowing you to connect to one MSK cluster:

Alternatively, you can connect to multiple MSK clusters or external ones by specifying multiple Kafka bootstrap servers, as shown in the following code. You can also use the same configuration file to specify the schema registry URL, Kafka Connect connection details, and SSO.

A single-Region Conduktor Platform deployment can work for multi-Region MSK clusters, although natural latency is expected. For latency-sensitive usage, you can deploy this solution in every Region in which you’re using Amazon MSK.

After uploading the files and configurations in your S3 bucket, let’s run CodeBuild to generate a new image.

  1. On the CodeBuild console, navigate to the project and choose Start build.

The build should complete in about 3 minutes.

The final image is pushed to Amazon ECR thanks to the script hosted in our build-spec.yml script run by CodeBuild. We’re now done with our first step. Your Conduktor Platform setup can now fully connect to your MSK cluster.

Start the MSK cluster

If you already have an MSK cluster set up with IAM access control, you can skip this step. If not, you can create one using the provided CloudFormation template.

From the MSK cluster (the new one or existing one), retrieve two essential pieces of information:

We use IAM access control so that we only need to use IAM policies to connect to our cluster.

If you’re using another security mechanism (such as SASL/SCRAM), you need to modify the Conduktor configuration files with the right properties, upload them back into Amazon S3, and rebuild the Conduktor image using CodeBuild.

Conduktor supports every single Kafka authentication method, including the ones supported by Amazon MSK: IAM access control, mutual TLS authentication, and user name/password using SASL/SCRAM.

Deploy the Conduktor Platform on Amazon ECS with Fargate

The last step is to deploy the Conduktor Platform. For this, we prefer running serverless solutions using Amazon Elastic Container Service (Amazon ECS) with Fargate. This allows you to right-size your containers in the future in case your usage of Conduktor grows over time.

Conduktor stores persistent data in the /var/conduktor file system folder, to store configuration, cache computation results, store logs, and run an internal database (for example, if you start creating data masking rules). For the persistence layer, we use Amazon Elastic File System (Amazon EFS), an elastic network file system that can be mounted on Fargate to provide a persistence layer.

Finally, we expose our Fargate container through an Application Load Balancer, giving us a public static DNS endpoint to expose the Conduktor Platform and giving us complete control over the network security to access the Conduktor Platform. The following diagram illustrates our architecture.

Deploy the Conduktor Platform on Amazon ECS with Fargate

We deploy our last CloudFormation file and specify some important parameters:

  • MSKBookstrapServersURL – This parameter is necessary to tell Conduktor which MSK cluster to connect to
  • MSKSecurityGroupID – The MSK security group is necessary to allow the template to add a security group ingress rule to it, thereby allowing our ECS task
  • PublicSubnetIDs – The public subnet IDs are for your Application Load Balancer
  • SubnetIDs – The subnet IDs are for your ECS task and can be the same subnets or private subnets (as long as they have access to the MSK cluster and the other public subnets)
  • VpcID – This is the VPC you’re deploying to

After deploying the template, on the Output tab of the stack, you can find the Application Load Balancer URL.

We use this URL and log in to the Conduktor Platform with the user name [email protected] and password password. These login credentials can be changed using the YAML configuration file, and you can even enable SSO and LDAP.

On the Conduktor console, you can start creating topics, producing data, consuming data, and much more! AWS Glue Schema Registry support is coming soon, and Confluent Schema Registry compatibility is already available.

Clean up

To clean up your AWS account, perform the following steps in order:

  1. Delete the third CloudFormation template (3 – create ECS Service.yaml).
  2. Delete the second CloudFormation template (2 – create MSK cluster.yaml).
  3. Empty the contents of your S3 bucket.
  4. Delete all your images in your ECR repository.
  5. Delete the first CloudFormation template (1 – base conduktor.yaml).

Conclusion

You can use the Conduktor Platform against as many MSK clusters as desired by editing the file conduktor-platform.yaml. You can even connect to your clusters running elsewhere, for example on Amazon Elastic Compute Cloud (Amazon EC2).

On our roadmap, we’re working on a complete integration with Amazon MSK, including AWS Glue Schema Registry support, Amazon MSK Connect support, and complete monitoring capabilities.

The Conduktor Platform offers a limited free tier with no time limit. Head to Conduktor’s Get Started page and create an account to start using the Platform alongside MSK clusters today.


About the Author

Stéphane Maarek is the co-founder of Conduktor. He is also the lead instructor on Udemy for learning Apache Kafka and AWS Certifications, having taught these technologies to over 1.5 million learners. Through Conduktor, he wants to democratize access to Apache Kafka and make its usage seamless and enterprise-ready.

Amazon EMR launches support for Amazon EC2 C6i, M6i, I4i, R6i and R6id instances to improve cost performance for Spark workloads by 6–33%

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/amazon-emr-launches-support-for-amazon-ec2-c6i-m6i-i4i-r6i-and-r6id-instances-to-improve-cost-performance-for-spark-workloads-by-6-33/

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over two times performance improvements over open-source Apache Spark and Presto, so that your applications run faster and at lower cost.

With Amazon EMR release 6.8, you can now use Amazon Elastic Compute Cloud (Amazon EC2) instances such as C6i, M6i, I4i, R6i, and R6id, which use the third-generation Intel Xeon scalable processors. Using these new instances with Amazon EMR improves cost-performance by an additional 5–33% over previous generation instances.

In this post, we describe how we estimated the cost-performance benefit from using Amazon EMR with these new instances compared to using equivalent previous generation instances.

Amazon EMR runtime performance improvements with EC2 I4i instances

We ran TPC-DS 3 TB benchmark queries on Amazon EMR 6.8 using the Amazon EMR runtime for Apache Spark (compatible with Apache Spark 3.3) with five node clusters of I4i instances with data in Amazon Simple Storage Service (Amazon S3), and compared it to equivalent sized I3 instances. We measured performance improvements using the total query runtime and geometric mean of query runtime across the TPC-DS 3 TB benchmark queries.

Our results showed between 36.41–44.39% improvement in total query runtime performance on I4i instance EMR clusters compared to equivalent I3 instance EMR clusters, and between 36–45.2% improvement in geometric mean. To measure cost improvement, we added up the Amazon EMR and Amazon EC2 cost per instance per hour (on-demand) and multiplied it by the total query runtime. Note that I4i 32XL instances were not benchmarked because I3 instances don’t have the 32 XL size available. We observed between 22.56–33.1% reduced instance hour cost on I4i instance EMR clusters compared to equivalent I3 instance EMR clusters to run the TPC-DS benchmark queries. All TPC-DS queries ran faster on I4i instance clusters compared to I3 instance clusters.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent I3 and I4i instance EMR clusters.

Instance Size 16 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5
Total query runtime on I3 (seconds) 4752.15457 4506.43694 7110.03042 11853.40336 21333.05743
Total query runtime on I4I (seconds) 2642.77407 2812.05517 4415.0023 7537.52779 12981.20251
Total query runtime improvement with I4I 44.39% 37.60% 37.90% 36.41% 39.15%
Geometric mean query runtime on I3 (sec) 34.99551 29.14821 41.53093 60.8069 95.46128
Geometric mean query runtime on I4I (sec) 19.17906 18.65311 25.66263 38.13503 56.95073
Geometric mean query runtime improvement with I4I 45.20% 36.01% 38.21% 37.29% 40.34%
EC2 I3 instance price ($ per hour) $4.990 $2.496 $1.248 $0.624 $0.312
EMR I3 instance price ($ per hour) $0.270 $0.270 $0.270 $0.156 $0.078
(EC2 + EMR) I3 instance price ($ per hour) $5.260 $2.766 $1.518 $0.780 $0.390
Cost of running on I3 ($ per instance) $6.943 $3.462 $2.998 $2.568 $2.311
EC2 I4I instance price ($ per hour) $5.491 $2.746 $1.373 $0.686 $0.343
EMR I4I price ($ per hour per instance) $1.373 $0.687 $0.343 $0.172 $0.086
(EC2 + EMR) I4I instance price ($ per hour) $6.864 $3.433 $1.716 $0.858 $0.429
Cost of running on I4I ($ per instance) $5.039 $2.681 $2.105 $1.795 $1.546
Total cost reduction with I4I including performance improvement -27.43% -22.56% -29.79% -30.09% -33.10%

The following graph shows per query improvements we observed on I4i 2XL instances with EMR Runtime for Spark on Amazon EMR version 6.8 compared to equivalent I3 2XL instances for the TPC-DS 3 TB benchmark.

Amazon EMR runtime performance improvements with EC2 M6i instances

M6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent M5 instances. Our test results showed between 13.45–29.52% improvement in total query runtime for seven different instance sizes within the instance family, and between 7.98–25.37% improvement in geometric mean. On cost comparison, we observed 7.98–25.37% reduced instance hour cost on M6i instance EMR clusters compared to M5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent M6i and M5 instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5 5 5
Total query runtime on M5 (seconds) 4027.58043 3782.10766 3348.05362 3516.4308 5621.22532 10075.45109 17278.15146
Total query runtime on M6I (seconds) 3106.43834 2665.70607 2714.69862 3043.5975 4195.02715 8226.88301 14515.50394
Total query runtime improvement with M6I 22.87% 29.52% 18.92% 13.45% 25.37% 18.35% 15.99%
Geometric mean query runtime M5 (sec) 30.45437 28.5207 23.95314 23.55958 32.95975 49.43178 75.95984
Geometric mean query runtime M6I (sec) 23.76853 19.21783 19.16869 19.9574 24.23012 39.09965 60.79494
Geometric mean query runtime improvement with M6I 21.95% 32.62% 19.97% 15.29% 26.49% 20.90% 19.96%
EC2 M5 instance price ($ per hour) $4.61 $3.07 $2.30 $1.54 $0.77 $0.38 $0.19
EMR M5 instance price ($ per hour) $0.27 $0.27 $0.27 $0.27 $0.19 $0.10 $0.05
(EC2 + EMR) M5 instance price ($ per hour) $4.88 $3.34 $2.57 $1.81 $0.96 $0.48 $0.24
Cost of running on M5 ($ per instance) $5.46 $3.51 $2.39 $1.76 $1.50 $1.34 $1.15
EC2 M6I instance price ($ per hour) $4.61 $3.07 $2.30 $1.54 $0.77 $0.38 $0.19
EMR M6I price ($ per hour per instance) $1.15 $0.77 $0.58 $0.38 $0.19 $0.10 $0.05
(EC2 + EMR) M6I instance price ($ per hour) $5.76 $3.84 $2.88 $1.92 $0.96 $0.48 $0.24
Cost of running on M6I ($ per instance) $4.97 $2.84 $2.17 $1.62 $1.12 $1.10 $0.97
Total cost reduction with M6I including performance improvement -8.92% -19.02% -9.28% -7.98% -25.37% -18.35% -15.99%

Amazon EMR runtime performance improvements with EC2 R6i instances

R6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5 instances. Our test results showed between 14.25–32.23% improvement in total query runtime for six different instance sizes within the instance family, and between 16.12–36.5% improvement in geometric mean. R5.xlarge instances didn’t have sufficient memory to run TPC-DS benchmark queries, and weren’t included in this comparison. On cost comparison, we observed 5.48–23.5% reduced instance hour cost on R6i instance EMR clusters compared to R5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6i and R5 instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2XL
Number of core instances in EMR cluster 5 5 5 5 5 5
Total query runtime on R5 (seconds) 4024.4737 3715.74432 3552.97298 3535.69879 5379.73168 9121.41532
Total query runtime on R6I (seconds) 2865.83169 2518.24192 2513.4849 3031.71973 4544.44854 6977.9508
Total query runtime improvement with R6I 28.79% 32.23% 29.26% 14.25% 15.53% 23.50%
Geometric mean query runtime R5 (sec) 30.59066 28.30849 25.30903 23.85511 32.33391 47.28424
Geometric mean query runtime R6I (sec) 21.87897 17.97587 17.54117 20.00918 26.6277 34.52817
Geometric mean query runtime improvement with R6I 28.48% 36.50% 30.69% 16.12% 17.65% 26.98%
EC2 R5 instance price ($ per hour) $6.0480 $4.0320 $3.0240 $2.0160 $1.0080 $0.5040
EMR R5 instance price ($ per hour) $0.2700 $0.2700 $0.2700 $0.2700 $0.2520 $0.1260
(EC2 + EMR) R5 instance price ($ per hour) $6.3180 $4.3020 $3.2940 $2.2860 $1.2600 $0.6300
Cost of running on R5 ($ per instance) $7.0630 $4.4403 $3.2510 $2.2452 $1.8829 $1.5962
EC2 R6I instance price ($ per hour) $6.0480 $4.0320 $3.0240 $2.0160 $1.0080 $0.5040
EMR R6I price ($ per hour per instance) $1.5120 $1.0080 $0.7560 $0.5040 $0.2520 $0.1260
(EC2 + EMR) R6I instance price ($ per hour) $7.5600 $5.0400 $3.7800 $2.5200 $1.2600 $0.6300
Cost of running on R6I ($ per instance) $6.0182 $3.5255 $2.6392 $2.1222 $1.5906 $1.2211
Total cost reduction with R6I including performance improvement -14.79% -20.60% -18.82% -5.48% -15.53% -23.50%

Amazon EMR runtime performance improvements with EC2 C6i instances

C6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent C5 instances. Our test results showed between 16.9–58.22% improvement in total query runtime for four different instance sizes within the instance family, and between 20.25–59.59% improvement in geometric mean. Only C6i 24, 12, 4, and 2xlarge sizes were benchmarked because C5 doesn’t have 32, 16 and 8 xlarge sizes. C5.xlarge instances didn’t have sufficient memory to run TPC-DS benchmark queries, and weren’t included in this comparison. On cost comparison, we observed 16.75–50.07% reduced instance hour cost on C6i instance EMR clusters compared to C5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent C6i and C5 instance EMR clusters.

Instance Size * 24 XL 12 XL 4 XL 2 XL
Number of core instances in EMR cluster 5 5 5 5
Total query runtime on C5 (seconds) 3435.59808 2900.84981 5945.12879 10173.00757
Total query runtime on C6I (seconds) 2711.16147 2471.86778 5195.30093 8787.43422
Total query runtime improvement with C6I 21.09% 14.79% 12.61% 13.62%
Geometric mean query runtime C5 (sec) 25.67058 20.06539 31.76582 46.78632
Geometric mean query runtime C6I (sec) 20.4458 17.14133 26.92196 39.32622
Geometric mean query runtime improvement with C6I 20.35% 14.57% 15.25% 15.95%
EC2 C5 instance price ($ per hour) $4.080 $2.040 $0.680 $0.340
EMR C5 instance price ($ per hour) $0.270 $0.270 $0.170 $0.085
(EC2 + EMR) C5 instance price ($ per hour) $4.35000 $2.31000 $0.85000 $0.42500
Cost of running on C5 ($ per instance) $4.15135 $1.86138 $1.40371 $1.20098
EC2 C6I instance price ($ per hour) $4.0800 $2.0400 $0.6800 $0.3400
EMR C6I price ($ per hour per instance) $1.02000 $0.51000 $0.17000 $0.08500
(EC2 + EMR) C6I instance price ($ per hour) $5.10000 $2.55000 $0.85000 $0.42500
Cost of running on C6I ($ per instance) $3.84081 $1.75091 $1.22667 $1.03741
Total cost reduction with C6I including performance improvement -7.48% -5.93% -12.61% -13.62%

Amazon EMR runtime performance improvements with EC2 R6id instances

R6id instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5D instances. Our test results showed between 11.8–28.7% improvement in total query runtime for five different instance sizes within the instance family, and between 15.1–32.0% improvement in geometric mean. R6ID 32 XL instances were not benchmarked because R5D instances don’t have these sizes available. On cost comparison, we observed 6.8–11.5% reduced instance hour cost on R6ID instance EMR clusters compared to R5D EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6id and R5d instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5 5 5
Total query runtime on R5D (seconds) 4054.4492975042 3691.7569385583 3598.6869168064 3532.7398928104 5397.5330161574 9281.2627059927 16862.8766838096
Total query runtime on R6ID (seconds) 2992.1198446983 2633.7131630720 2632.3186613402 2729.8860537867 4583.1040980373 7921.9960917943 14867.5391541445
Total query runtime improvement with R6ID 26.20% 28.66% 26.85% 22.73% 15.09% 14.65% 11.83%
Geometric mean query runtime R5D (sec) 31.0238156851 28.1432927726 25.7532157307 24.0596427675 32.5800246829 48.2306670294 76.6771994376
Geometric mean query runtime R6ID (sec) 22.8681174894 19.1282742957 18.6161830746 18.0498249257 25.9500918360 39.6580341258 65.0947323858
Geometric mean query runtime improvement with R6ID 26.29% 32.03% 27.71% 24.98% 20.35% 17.77% 15.11%
EC2 R5D instance price ($ per hour) $6.912000 $4.608000 $3.456000 $2.304000 $1.152000 $0.576000 $0.288000
EMR R5D instance price ($ per hour) $0.270000 $0.270000 $0.270000 $0.270000 $0.270000 $0.144000 $0.072000
(EC2 + EMR) R5D instance price ($ per hour) $7.182000 $4.878000 $3.726000 $2.574000 $1.422000 $0.720000 $0.360000
Cost of running on R5D ($ per instance) $8.088626 $5.002331 $3.724641 $2.525909 $2.132026 $1.856253 $1.686288
EC2 R6ID instance price ($ per hour) $7.257600 $4.838400 $3.628800 $2.419200 $1.209600 $0.604800 $0.302400
EMR R6ID price ($ per hour per instance) $1.814400 $1.209600 $0.907200 $0.604800 $0.302400 $0.151200 $0.075600
(EC2 + EMR) R6ID instance price ($ per hour) $9.072000 $6.048000 $4.536000 $3.024000 $1.512000 $0.756000 $0.378000
Cost of running on R6ID ($ per instance) $7.540142 $4.424638 $3.316722 $2.293104 $1.924904 $1.663619 $1.561092
Total cost reduction with R6ID including performance improvement -6.78% -11.55% -10.95% -9.22% -9.71% -10.38% -7.42%

Benchmarking methodology

The benchmark used in this post is derived from the industry-standard TPC-DS benchmark, and uses queries from the Spark SQL Performance Tests GitHub repo with the following fixes applied.

We calculated TCO by multiplying cost per hour by number of instances in the cluster and time taken to run the queries on the cluster. We used the on-demand pricing in the US East (N. Virginia) Region for all instances.

Conclusion

In this post, we described how we estimated the cost-performance benefit from using Amazon EMR with C6i, M6i, I4i, R6i, and R6id, instances compared to using equivalent previous generation instances. Using these new instances with Amazon EMR improves cost-performance by an additional 5–33%.


About the authors

AI MSAl MS is a product manager for Amazon EMR at Amazon Web Services.

Kyeonghyun Ryoo is a Software Development Engineer for EMR at Amazon Web Services. He primarily works on designing and building automation tools for internal teams and customers to maximize their productivity. Outside of work, he is a retired world champion in professional gaming who still enjoy playing video games.

[$] Composefs for integrity protection and data sharing

Post Syndicated from original https://lwn.net/Articles/917097/

A read-only filesystem that will transparently share file data between disparate
directory trees, while also providing integrity verification for the data
and the
directory metadata, was recently posted as an
RFC

to the linux-kernel mailing list. Composefs was developed
by Alexander Larsson (who posted it) and Giuseppe Scrivano for use by podman containers and OSTree (or “libostree” as it
is now known) root directories, but there are likely others who want the
abilities
it provides. So far, there has been little response, either with feedback or
complaints, but it is a small patch set (around 2K lines of code) and
generally self-contained since it is a filesystem, so it would not be a
surprise to see it appear in some upcoming kernel.

GitHub Availability Report: November 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2022-12-07-github-availability-report-november-2022/

In November, we experienced two incidents that resulted in degraded performance across GitHub services. This report also sheds light into an incident that impacted GitHub Codespaces in October.

November 25 16:34 UTC (lasting 1 hour and 56 minutes)

Our alerting systems detected an incident that impacted customers using GitHub Packages and Pages. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on cause and remediation in the January Availability Report, which we will publish the first Wednesday of January.

October 26 00:47 UTC (lasting 3 hours and 47 minutes)

On October 22, 2022 at 00:47 UTC, our alerting systems detected a decrease in success rates for creates and resumes of Codespaces in the East US region. We initially statused yellow, as the incident affected only the East US region. As the incident persisted for several hours, we provided guidance to customers in the affected region to manually change their location to a nearby healthy region at 01:55 UTC, and statused red at 2:34 UTC due to the prolonged outage.

During this incident, customers were unable to create or resume Codespaces in the East US region. Customers could manually select an alternate region in which to create Codespaces, but could not do so for resumes.

Codespaces uses a third-party database to store data for the service and the provider was experiencing an outage, which impacted Codespaces performance. We were unable to immediately communicate with our East US database because our service does not currently have any replication of its regional data. Our services in the East US region returned to healthy status as soon as Codespaces engineers were able to engage with the third party to help mitigate the outage.

We identified several ways to improve our database resilience to regional outages while working with the third party during this incident and in follow up internal discussions. We are implementing regional replication and failover so that we can mitigate this type of incident more quickly in the future.

November 3 16:10 UTC (lasting 1 hour and 2 minutes)

On November 3, 2022 at 16:10 UTC, our alerting systems detected an increase in the time it took GitHub Actions workflow runs to start. We initially statused GitHub Actions to red, and after assessing impact we statused to yellow at 16:11 UTC.

During this incident, customers experienced high latency in receiving webhook deliveries, starting GitHub Actions workflow runs, and receiving status updates for in-progress runs. They also experienced an increase in error responses from repositories, pull requests, Codespaces, and the GitHub API. At its peak, a majority of repositories attempting to run a GitHub Actions workflow experienced delays longer than five minutes.

GitHub Actions listens to webhooks to trigger workflow runs, and while investigating we found that the run start delays were caused by a backup in the webhooks queue. At 16:29 UTC, we scaled out and accelerated processing of the webhooks queue as a mitigation. By 17:12 UTC, the webhooks queue was fully drained and we statused back to green.

We found that the webhook delays were caused by an inefficient database query for checking repository security advisory access, which was triggered by a high volume of poorly optimized API calls. This caused a backup in background jobs running across GitHub, which is why multiple services were impacted in addition to webhooks and GitHub Actions.

Following our investigation, we fixed the inefficient query for the repository security advisory access. We also reviewed the rate limits for this particular endpoint (as well as limits in this area) to ensure they were in line with our performance expectations. Finally, we increased the default throttling of the webhooks queue to avoid potential backups in the future. As a longer-term improvement to our resiliency, we are investigating options to reduce the potential for other background jobs to impact GitHub Actions workflows. We’ll continue to run game days and conduct enhanced training for first responders to better assess impact for GitHub Actions and determine the appropriate level of statusing moving forward.

Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

Let’s Architect! Optimizing the cost of your architecture

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-optimizing-the-cost-of-your-architecture/

Written in collaboration with Ben Moses, AWS Senior Solutions Architect, and Michael Holtby, AWS Senior Manager Solutions Architecture


Designing an architecture is not a simple task. There are many dimensions and characteristics of a solution to consider, such as the availability, performance, or resilience.

In this Let’s Architect!, we explore cost optimization and ideas on how to rethink your AWS workloads, providing suggestions that span from compute to data transfer.

Migrating AWS Lambda functions to Arm-based AWS Graviton2 processors

AWS Graviton processors are custom silicon from Amazon’s Annapurna Labs. Based on the Arm processor architecture, they are optimized for performance and cost, which allows customers to get up to 34% better price performance.

This AWS Compute Blog post discusses some of the differences between the x86 and Arm architectures, as well as methods for developing Lambda functions on Graviton2, including performance benchmarking.

Many serverless workloads can benefit from Graviton2, especially when they are not using a library that requires an x86 architecture to run.

Take me to this Compute post!

Choosing Graviton2 for AWS Lambda function in the AWS console

Choosing Graviton2 for AWS Lambda function in the AWS console

Key considerations in moving to Graviton2 for Amazon RDS and Amazon Aurora databases

Amazon Relational Database Service (Amazon RDS) and Amazon Aurora support a multitude of instance types to scale database workloads based on needs. Both services now support Arm-based AWS Graviton2 instances, which provide up to 52% price/performance improvement for Amazon RDS open-source databases, depending on database engine, version, and workload. They also provide up to 35% price/performance improvement for Amazon Aurora, depending on database size.

This AWS Database Blog post showcases strategies for updating RDS DB instances to make use of Graviton2 with minimal changes.

Take me to this Database post!

Choose your instance class that leverages Graviton2, such as db.r6g.large (the “g” stands for Graviton2)

Choose your instance class that leverages Graviton2, such as db.r6g.large (the “g” stands for Graviton2)

Overview of Data Transfer Costs for Common Architectures

Data transfer charges are often overlooked while architecting an AWS solution. Considering data transfer charges while making architectural decisions can save costs. This AWS Architecture Blog post describes the different flows of traffic within a typical cloud architecture, showing where costs do and do not apply. For areas where cost applies, it shows best-practice strategies to minimize these expenses while retaining a healthy security posture.

Take me to this Architecture post!

Accessing AWS services in different Regions

Accessing AWS services in different Regions

Improve cost visibility and re-architect for cost optimization

This Architecture Blog post is a collection of best practices for cost management in AWS, including the relevant tools; plus, it is part of a series on cost optimization using an e-commerce example.

AWS Cost Explorer is used to first identify opportunities for optimizations, including data transfer, storage in Amazon Simple Storage Service and Amazon Elastic Block Store, idle resources, and the use of Graviton2 (Amazon’s Arm-based custom silicon). The post discusses establishing a FinOps culture and making use of Service Control Policies (SCPs) to control ongoing costs and guide deployment decisions, such as instance-type selection.

Take me to this Architecture post!

Applying SCPs on different environments for cost control

Applying SCPs on different environments for cost control

See you next time!

Thanks for joining us to discuss optimizing costs while architecting! This is the last Let’s Architect! post of 2022. We will see you again in 2023, when we explore even more architecture topics together.

Wishing you a happy holiday season and joyous new year!

Can’t get enough of Let’s Architect!?

Visit the Let’s Architect! page of the AWS Architecture Blog for access to the whole series.

Looking for more architecture content?

AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Securing Lambda Function URLs using Amazon Cognito, Amazon CloudFront and AWS WAF

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/compute/securing-lambda-function-urls-using-amazon-cognito-amazon-cloudfront-and-aws-waf/

This post is written by Madhu Singh (Solutions Architect), and Krupanidhi Jay (Solutions Architect).

Lambda function URLs is a dedicated HTTPs endpoint for a AWS Lambda function. You can configure a function URL to have two methods of authentication: IAM and NONE. IAM authentication means that you are restricting access to the function URL (and in-turn access to invoke the Lambda function) to certain AWS principals (such as roles or users). Authentication type of NONE means that the Lambda function URL has no authentication and is open for anyone to invoke the function.

This blog shows how to use Lambda function URLs with an authentication type of NONE and use custom authorization logic as part of the function code, and to only allow requests that present valid Amazon Cognito credentials when invoking the function. You also learn ways to protect Lambda function URL against common security threats like DDoS using AWS WAF and Amazon CloudFront.

Lambda function URLs provides a simpler way to invoke your function using HTTP calls. However, it is not a replacement for Amazon API Gateway, which provides advanced features like request validation and rate throttling.

Solution overview

There are four core components in the example.

1. A Lambda function with function URLs enabled

At the core of the example is a Lambda function with the function URLs feature enabled with the authentication type of NONE. This function responds with a success message if a valid authorization code is passed during invocation. If not, it responds with a failure message.

2. Amazon Cognito User Pool

Amazon Cognito user pools enable user authentication on websites and mobile apps. You can also enable publicly accessible Login and Sign-Up pages in your applications using Amazon Cognito user pools’ feature called the hosted UI.

In this example, you use a user pool and the associated Hosted UI to enable user login and sign-up on the website used as entry point. This Lambda function validates the authorization code against this Amazon Cognito user pool.

3. CloudFront distribution using AWS WAF

CloudFront is a content delivery network (CDN) service that helps deliver content to end users with low latency, while also improving the security posture for your applications.

AWS WAF is a web application firewall that helps protect your web applications or APIs against common web exploits and bots and AWS Shield is a managed distributed denial of service (DDoS) protection service that safeguards applications running on AWS. AWS WAF inspects the incoming request according to the configured Web Access Control List (web ACL) rules.

Adding CloudFront in front of your Lambda function URL helps to cache content closer to the viewer, and activating AWS WAF and AWS Shield helps in increasing security posture against multiple types of attacks, including network and application layer DDoS attacks.

4. Public website that invokes the Lambda function

The example also creates a public website built on React JS and hosted in AWS Amplify as the entry point for the demo. This website works both in authenticated mode and in guest mode. For authentication, the website uses Amazon Cognito user pools hosted UI.

Solution architecture

This shows the architecture of the example and the information flow for user requests.

In the request flow:

  1. The entry point is the website hosted in AWS Amplify. In the home page, when you choose “sign in”, you are redirected to the Amazon Cognito hosted UI for the user pool.
  2. Upon successful login, Amazon Cognito returns the authorization code, which is stored as a cookie with the name “code”. The user is redirected back to the website, which has an “execute Lambda” button.
  3. When the user choose “execute Lambda”, the value from the “code” cookie is passed in the request body to the CloudFront distribution endpoint.
  4. The AWS WAF web ACL rules are configured to determine whether the request is originating from the US or Canada IP addresses and to determine if the request should be allowed to invoke Lambda function URL origin.
  5. Allowed requests are forwarded to the CloudFront distribution endpoint.
  6. CloudFront is configured to allow CORS headers and has the origin set to the Lambda function URL. The request that CloudFront receives is passed to the function URL.
  7. This invokes the Lambda function associated with the function URL, which validates the token.
  8. The function code does the following in order:
    1. Exchange the authorization code in the request body (passed as the event object to Lambda function) to access_token using Amazon Cognito’s token endpoint (check the documentation for more details).
      1. Amazon Cognito user pool’s attributes like user pool URL, Client ID and Secret are retrieved from AWS Systems Manager Parameter Store (SSM Parameters).
      2. These values are stored in SSM Parameter Store at the time these resources are deployed via AWS CDK (see “how to deploy” section)
    2. The access token is then verified to determine its authenticity.
    3. If valid, the Lambda function returns a message stating user is authenticated as <username> and execution was successful.
    4. If either the authorization code was not present, for example, the user was in “guest mode” on the website, or the code is invalid or expired, the Lambda function returns a message stating that the user is not authorized to execute the function.
  9. The webpage displays the Lambda function return message as an alert.

Getting started

Pre-requisites:

Before deploying the solution, please follow the README from the GitHub repository and take the necessary steps to fulfill the pre-requisites.

Deploy the sample solution

1. From the code directory, download the dependencies:

$ npm install

2. Start the deployment of the AWS resources required for the solution:

$ cdk deploy

Note:

  • optionally pass in the –profile argument if needed
  • The deployment can take up to 15 minutes

3. Once the deployment completes, the output looks similar to this:

Open the amplifyAppUrl from the output in your browser. This is the URL for the demo website. If you don’t see the “Welcome to Compute Blog” page, the Amplify app is still building, and the website is not available yet. Retry in a few minutes. This website works either in an authenticated or unauthenticated state.

Test the authenticated flow

  1. To test the authenticated flow, choose “Sign In”.

2. In the sign-in page, choose on sign-up (for the first time) and create a user name and password.

3. To use an existing an user name and password, enter those credentials and choose login.

4. Upon successful sign-in or sign up, you are redirected back to the webpage with “Execute Lambda” button.

5. Choose this button. In a few seconds, an alert pop-up shows the logged in user and that the Lambda execution is successful.

Testing the unauthenticated flow

1. To test the unauthenticated flow, from the Home page, choose “Continue”.

2. Choose “Execute Lambda” and in a few seconds, you see a message that you are not authorized to execute the Lambda function.

Testing the geo-block feature of AWS WAF

1. Access the website from a Region other than US or Canada. If you are physically in the US or Canada, you may use a VPN service to connect to a Region other than US or Canada.

2. Choose the “Execute Lambda” button. In the Network trace of browser, you can see the call to invoke Lambda function was blocked with Forbidden response.

3. To try either the authenticated or unauthenticated flow again, choose “Return to Home Page” to go back to the home page with “Sign In” and “Continue” buttons.

Cleaning up

To delete the resources provisioned, run the cdk destroy command from the AWS CDK CLI.

Conclusion

In this blog, you create a Lambda function with function URLs enabled with NONE as the authentication type. You then implemented a custom authentication mechanism as part of your Lambda function code. You also increased the security of your Lambda function URL by setting it as Origin for the CloudFront distribution and using AWS WAF Geo and IP limiting rules for protection against common web threats, like DDoS.

For more serverless learning resources, visit Serverless Land.

The collective thoughts of the interwebz