Run a data processing job on Amazon EMR Serverless with AWS Step Functions

Post Syndicated from Siva Ramani original https://aws.amazon.com/blogs/big-data/run-a-data-processing-job-on-amazon-emr-serverless-with-aws-step-functions/

There are several infrastructure as code (IaC) frameworks available today to help you define your infrastructure, such as the AWS Cloud Development Kit (AWS CDK) or Terraform by HashiCorp. Terraform, an AWS Partner Network (APN) Advanced Technology Partner and member of the AWS DevOps Competency, is an IaC tool similar to AWS CloudFormation that allows you to create, update, and version your AWS infrastructure. Terraform provides a friendly syntax (similar to AWS CloudFormation) along with other features like planning (visibility into changes before they actually happen), graphing, and the ability to create templates that break infrastructure configurations into smaller chunks for better maintenance and reusability. We use the capabilities and features of Terraform to build an API-based ingestion process into AWS. Let’s get started!

In this post, we showcase how to build and orchestrate a Scala Spark application using Amazon EMR Serverless, AWS Step Functions, and Terraform. In this end-to-end solution, we run a Spark job on EMR Serverless that processes sample clickstream data in an Amazon Simple Storage Service (Amazon S3) bucket and stores the aggregation results in Amazon S3.

With EMR Serverless, you don’t have to configure, optimize, secure, or operate clusters to run applications. You will continue to get the benefits of Amazon EMR, such as open source compatibility, concurrency, and optimized runtime performance for popular data frameworks. EMR Serverless is suitable for customers who want ease in operating applications using open-source frameworks. It offers quick job startup, automatic capacity management, and straightforward cost controls.

Solution overview

We provide the Terraform infrastructure definition and the source code for an AWS Lambda function that ingests sample customer click events for an online website into an Amazon Kinesis Data Firehose delivery stream. The solution uses Kinesis Data Firehose to convert the incoming data into Parquet files (an open-source columnar file format for Hadoop) before pushing them to Amazon S3 using the AWS Glue Data Catalog. The generated Parquet logs in Amazon S3 are then processed by an EMR Serverless job, which writes a report of aggregate clickstream statistics to an S3 bucket. The EMR Serverless operation is triggered using Step Functions. The following diagram shows the sample architecture.

emr serverless application

The provided samples include the source code for building the infrastructure with Terraform and for running the Amazon EMR application. Setup scripts are provided to create the sample ingestion using Lambda for the incoming application logs. For a similar ingestion pattern sample, refer to Provision AWS infrastructure using Terraform (By HashiCorp): an example of web application logging customer data.

The following are the high-level steps and AWS services used in this solution:

  • The provided application code is packaged and built using Apache Maven.
  • Terraform commands are used to deploy the infrastructure in AWS.
  • The EMR Serverless application provides the option to submit a Spark job.
  • The solution uses two Lambda functions:
    • Ingestion – This function processes the incoming request and pushes the data into the Kinesis Data Firehose delivery stream (a sketch of this call appears after this list).
    • EMR Start Job – This function starts the EMR Serverless application. The EMR job process converts the ingested user click logs into output in another S3 bucket.
  • Step Functions triggers the EMR Start Job Lambda function, which submits the application to EMR Serverless for processing of the ingested log files.
  • The solution uses four S3 buckets:
    • Kinesis Data Firehose delivery bucket – Stores the ingested application logs in Parquet file format.
    • Loggregator source bucket – Stores the Scala code and JAR for running the EMR job.
    • Loggregator output bucket – Stores the EMR processed output.
    • EMR Serverless logs bucket – Stores the EMR process application logs.
  • Sample invoke commands (run as part of the initial setup process) insert the data using the ingestion Lambda function. The Kinesis Data Firehose delivery stream converts the incoming stream into a Parquet file and stores it in an S3 bucket.
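
The ingestion Lambda function described above is essentially a thin wrapper around the Kinesis Data Firehose PutRecord API. The following is a minimal sketch using the AWS SDK for Java v2; the delivery stream name and handler wiring are illustrative assumptions and will differ from the actual code in the sample repository.

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.firehose.FirehoseClient;
import software.amazon.awssdk.services.firehose.model.PutRecordRequest;
import software.amazon.awssdk.services.firehose.model.Record;

public class IngestionHandler {

    // Hypothetical delivery stream name; the sample repository derives the real name from Terraform outputs
    private static final String DELIVERY_STREAM = "clicklogger-dev-delivery-stream";

    private final FirehoseClient firehose = FirehoseClient.create();

    // Forwards the raw JSON click event to the Firehose delivery stream and returns the record ID
    public String handleRequest(String clickEventJson) {
        PutRecordRequest request = PutRecordRequest.builder()
                .deliveryStreamName(DELIVERY_STREAM)
                .record(Record.builder()
                        .data(SdkBytes.fromUtf8String(clickEventJson + "\n"))
                        .build())
                .build();
        return firehose.putRecord(request).recordId();
    }
}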

For this solution, we made the following design decisions:

  • We use Step Functions and Lambda in this use case to trigger the EMR Serverless application (a sketch of the job submission call appears after this list). In a real-world use case, the data processing application could be long running and may exceed Lambda’s timeout limits. In this case, you can use tools like Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Amazon MWAA is a managed orchestration service that makes it easier to set up and operate end-to-end data pipelines in the cloud at scale.
  • The Lambda code and EMR Serverless log aggregation code are developed using Java and Scala, respectively. You can use any supported languages in these use cases.
  • The AWS Command Line Interface (AWS CLI) V2 is required for querying EMR Serverless applications from the command line. You can also view these from the AWS Management Console. We provide a sample AWS CLI command to test the solution later in this post.
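
To make the orchestration step concrete, the following is a minimal sketch of how the EMR Start Job Lambda function can submit the Spark job to EMR Serverless using the AWS SDK for Java v2. The application ID, execution role ARN, S3 locations, and Spark parameters are placeholders; in the sample repository these values come from the Terraform outputs and environment variables.

import software.amazon.awssdk.services.emrserverless.EmrServerlessClient;
import software.amazon.awssdk.services.emrserverless.model.JobDriver;
import software.amazon.awssdk.services.emrserverless.model.SparkSubmit;
import software.amazon.awssdk.services.emrserverless.model.StartJobRunRequest;
import software.amazon.awssdk.services.emrserverless.model.StartJobRunResponse;

public class EmrStartJobHandler {

    private final EmrServerlessClient emr = EmrServerlessClient.create();

    // Submits the Scala log-aggregation JAR to an existing EMR Serverless application and returns the job run ID
    public String startJobRun(String applicationId, String executionRoleArn) {
        SparkSubmit sparkSubmit = SparkSubmit.builder()
                // Placeholder S3 paths; use the loggregator source, delivery, and output buckets created by Terraform
                .entryPoint("s3://<loggregator-source-bucket>/loggregator.jar")
                .entryPointArguments("s3://<firehose-delivery-bucket>/", "s3://<loggregator-output-bucket>/")
                .sparkSubmitParameters("--conf spark.executor.memory=2g --conf spark.executor.cores=2")
                .build();

        StartJobRunRequest request = StartJobRunRequest.builder()
                .applicationId(applicationId)
                .executionRoleArn(executionRoleArn)
                .jobDriver(JobDriver.builder().sparkSubmit(sparkSubmit).build())
                .build();

        StartJobRunResponse response = emr.startJobRun(request);
        return response.jobRunId();
    }
}

Step Functions can pass the application ID and execution role ARN to this function as part of the state input, and a subsequent state can poll GetJobRun until the job reaches a terminal state.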

Prerequisites

To use this solution, you must complete the following prerequisites:

  • Install the AWS CLI. For this post, we used version 2.7.18. This is required in order to run the aws emr-serverless AWS CLI commands from your local machine. Optionally, all the AWS services used in this post can be viewed and operated via the console.
  • Make sure Java is installed, and that JDK/JRE 8 is set in the environment path of your machine. For instructions, see the Java Development Kit.
  • Install Apache Maven. The Java Lambda functions are built using mvn package and are deployed using Terraform into AWS.
  • Install the Scala Build Tool. For this post, we used version 1.4.7. Make sure to download and install based on your operating system needs.
  • Set up Terraform. For steps, see Terraform downloads. We use version 1.2.5 for this post.
  • Have an AWS account.

Configure the solution

To spin up the infrastructure and the application, complete the following steps:

  1. Clone the following GitHub repository.
    The provided exec.sh shell script builds the Java application JAR (for the Lambda ingestion function) and the Scala application JAR (for the EMR processing) and deploys the AWS infrastructure that is needed for this use case.
  2. Run the following commands:
    $ chmod +x exec.sh
    $ ./exec.sh

    To run the commands individually, set the application deployment Region and account number, as shown in the following example:

    $ APP_DIR=$PWD
    $ APP_PREFIX=clicklogger
    $ STAGE_NAME=dev
    $ REGION=us-east-1
    $ ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account')

    The following commands build the Lambda application JAR with Maven and package the Scala application:

    $ cd $APP_DIR/source/clicklogger
    $ mvn clean package
    $ sbt reload
    $ sbt compile
    $ sbt package

  3. Deploy the AWS infrastructure using Terraform:
    $ terraform init
    $ terraform plan
    $ terraform apply --auto-approve

Test the solution

After you build and deploy the application, you can insert sample data for Amazon EMR processing. We use the following code as an example. The exec.sh script has multiple sample insertions for Lambda. The ingested logs are used by the EMR Serverless application job.

The sample AWS CLI invoke command inserts sample data for the application logs:

aws lambda invoke --function-name clicklogger-dev-ingestion-lambda --cli-binary-format raw-in-base64-out --payload '{"requestid":"OAP-guid-001","contextid":"OAP-ctxt-001","callerid":"OrderingApplication","component":"login","action":"load","type":"webpage"}' out

To validate the deployments, complete the following steps:

  1. On the Amazon S3 console, navigate to the bucket created as part of the infrastructure setup.
  2. Choose the bucket to view the files.
    You should see that data from the ingested stream was converted into a Parquet file.
  3. Choose the file to view the data.
    The following screenshot shows an example of our bucket contents.
    Now you can run Step Functions to validate the EMR Serverless application.
  4. On the Step Functions console, open clicklogger-dev-state-machine.
    The state machine shows the steps that trigger the Lambda function and the EMR Serverless application, as shown in the following diagram.
  5. Run the state machine.
  6. After the state machine runs successfully, navigate to the clicklogger-dev-output-bucket on the Amazon S3 console to see the output files.
  7. Use the AWS CLI to check the deployed EMR Serverless application:
    $ aws emr-serverless list-applications \
          | jq -r '.applications[] | select(.name=="clicklogger-dev-loggregrator-emr-<Your-Account-Number>").id'

  8. On the Amazon EMR console, choose Serverless in the navigation pane.
  9. Select clicklogger-dev-studio and choose Manage applications.
  10. The application created by the stack appears as clicklogger-dev-loggregator-emr-<Your-Account-Number>.
    Now you can review the EMR Serverless application output.
  11. On the Amazon S3 console, open the output bucket (us-east-1-clicklogger-dev-loggregator-output-).
    The EMR Serverless application writes the output based on the date partition, such as 2022/07/28/response.md. The following code shows an example of the file output:

    |*createdTime*|*callerid*|*component*|*count*
    |------------|-----------|-----------|-------
    *07-28-2022*|OrderingApplication|checkout|2
    *07-28-2022*|OrderingApplication|login|2
    *07-28-2022*|OrderingApplication|products|2

Clean up

The provided ./cleanup.sh script has the required steps to delete all the files from the S3 buckets that were created as part of this post. The terraform destroy command cleans up the AWS infrastructure that you created earlier. See the following code:

$ chmod +x cleanup.sh
$ ./cleanup.sh

To do the steps manually, you can also delete the resources via the AWS CLI:

# CLI commands to delete the Amazon S3 buckets

aws s3 rb s3://clicklogger-dev-emr-serverless-logs-bucket-<your-account-number> --force
aws s3 rb s3://clicklogger-dev-firehose-delivery-bucket-<your-account-number> --force
aws s3 rb s3://clicklogger-dev-loggregator-output-bucket-<your-account-number> --force
aws s3 rb s3://clicklogger-dev-loggregator-source-bucket-<your-account-number> --force

# Destroy the AWS Infrastructure 
terraform destroy --auto-approve

Conclusion

In this post, we built, deployed, and ran a data processing Spark job in EMR Serverless that interacts with various AWS services. We walked through deploying a Java Lambda function packaged with Maven and a Scala application for the EMR Serverless job, triggered with Step Functions, all defined as infrastructure as code. You can use any combination of applicable programming languages to build your Lambda functions and EMR job application. EMR Serverless can be triggered manually, automated, or orchestrated using AWS services like Step Functions and Amazon MWAA.

We encourage you to test this example and see for yourself how this overall application design works within AWS. Then, it’s just a matter of replacing your individual code base, packaging it, and letting EMR Serverless handle the process efficiently.

If you implement this example and run into any issues, or have any questions or feedback about this post, please leave a comment!

About the Authors

Sivasubramanian Ramani (Siva Ramani) is a Sr Cloud Application Architect at Amazon Web Services. His expertise is in application optimization and modernization, serverless solutions, and running Microsoft application workloads on AWS.

Naveen Balaraman is a Sr Cloud Application Architect at Amazon Web Services. He is passionate about containers, serverless applications, architecting microservices, and helping customers leverage the power of the AWS Cloud.

Upgrade Amazon EMR Hive Metastore from 5.X to 6.X

Post Syndicated from Jianwei Li original https://aws.amazon.com/blogs/big-data/upgrade-amazon-emr-hive-metastore-from-5-x-to-6-x/

If you are currently running Amazon EMR 5.X clusters, consider moving to Amazon EMR 6.X as it includes new features that help you improve performance and optimize costs. For instance, Apache Hive is two times faster with LLAP on Amazon EMR 6.X, and Spark 3 reduces costs by 40%. Additionally, Amazon EMR 6.x releases include Trino, a fast distributed SQL engine, and Iceberg, a high-performance open data format for petabyte-scale tables.

To upgrade Amazon EMR clusters from 5.X to 6.X release, a Hive Metastore upgrade is the first step before applications such as Hive and Spark can be migrated. This post provides guidance on how to upgrade Amazon EMR Hive Metastore from 5.X to 6.X as well as migration of Hive Metastore to the AWS Glue Data Catalog. As Hive 3 Metastore is compatible with Hive 2 applications, you can continue to use Amazon EMR 5.X with the upgraded Hive Metastore.

Solution overview

In the following section, we provide steps to upgrade the Hive Metastore schema using MySQL as the backend. For any other backends (such as MariaDB, Oracle, or SQL Server), update the commands accordingly.

There are two options to upgrade the Amazon EMR Hive Metastore:

  • Upgrade the Hive Metastore schema from 2.X to 3.X by using the Hive Schema Tool
  • Migrate the Hive Metastore to the AWS Glue Data Catalog

We walk through the steps for both options.

Pre-upgrade prerequisites

Before upgrading the Hive Metastore, you must complete the following prerequisite steps:

  1. Verify the Hive Metastore database is running and accessible.
    You should be able to run Hive DDL and DML queries successfully. Any errors or issues must be fixed before proceeding with the upgrade process. Use the following sample queries to test the database:

    create table testtable1 (id int, name string);
    insert into testtable1 values (1001, "user1");
    select * from testtable1;

  2. To get the Metastore schema version in the current EMR 5.X cluster, run the following command in the primary node:
    sudo hive --service schematool -dbType mysql -info

    The following code shows our sample output:

    $ sudo hive --service schematool -dbType mysql -info
    Metastore connection URL: jdbc:mysql://xxxxxxx.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true
    Metastore Connection Driver : org.mariadb.jdbc.Driver
    Metastore connection User: admin
    Hive distribution version: 2.3.0
    Metastore schema version:  2.3.0

  3. Stop the Metastore service and restrict access to the Metastore MySQL database.
    It’s very important that no one else accesses or modifies the contents of the Metastore database while you’re performing the schema upgrade. To stop the Metastore, use the following commands:

    $ sudo stop hive-server2
    $ sudo stop hive-hcatalog-server

    For Amazon EMR release 5.30 and 6.0 onwards (Amazon Linux 2 is the operating system for the Amazon EMR 5.30+ and 6.x release series), use the following commands:

    $ sudo systemctl stop hive-server2.service
    $ sudo systemctl stop hive-hcatalog-server.service

    You can also note the total number of databases and tables present in the Hive Metastore before the upgrade, and verify the number of databases and tables after the upgrade.

  4. To get the total number of tables and databases before the upgrade, run the following commands after connecting to the external Metastore database (assuming the Hive Metadata DB name is hive):
    $ mysql -u <username> -h <mysqlhost> --password
    mysql> use hive;
    mysql> select count(*) from DBS;
    mysql> select count(*) from TBLS;

  5. Take a backup or snapshot of the Hive database.
    This allows you to revert any changes made during the upgrade process if something goes wrong. If you’re using Amazon Relational Database Service (Amazon RDS), refer to Backing up and restoring an Amazon RDS instance for instructions.
  6. Take note of the Hive table storage location if data is stored in HDFS.

If all the table data is on Amazon Simple Storage Service (Amazon S3), then no action is needed. If HDFS is used as the storage layer for Hive databases and tables, then take a note of them. You will need to copy the files on HDFS to a similar path on the new cluster, and then verify or update the location attribute for databases and tables on the new cluster accordingly.

Upgrade the Amazon EMR Hive Metastore schema with the Hive Schema Tool

In this approach, you use the persistent Hive Metastore on a remote database (Amazon RDS for MySQL or Amazon Aurora MySQL-Compatible Edition). The following diagram shows the upgrade procedure.

EMR Hive Metastore Upgrade

To upgrade the Amazon EMR Hive Metastore from 5.X (Hive version 2.X) to 6.X (Hive version 3.X), we can use the Hive Schema Tool. The Hive Schema Tool is an offline tool for Metastore schema manipulation. You can use it to initialize, upgrade, and validate the Metastore schema. Run the following command to show the available options for the Hive Schema Tool:

sudo hive --service schematool -help

Be sure to complete the prerequisites mentioned earlier, including taking a backup or snapshot, before proceeding with the next steps.

  1. Note down the details of the existing Hive external Metastore to be upgraded.
    This includes the RDS for MySQL endpoint host name, database name (for this post, hive), user name, and password. You can do this through one of the following options:

    • Get the Hive Metastore DB information from the Hive configuration file – Log in to the EMR 5.X primary node, open the file /etc/hive/conf/hive-site.xml, and note the four properties:
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://{hostname}:3306/{dbname}?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.mariadb.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>{username}</value>
        <description>username to use against metastore database</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>{password}</value>
        <description>password to use against metastore database</description>
      </property>

    • Get the Hive Metastore DB information from the Amazon EMR console – Navigate to the EMR 5.X cluster, choose the Configurations tab, and note down the Metastore DB information.
      EMR Console for Configuration
  2. Create a new EMR 6.X cluster.
    To use the Hive Schema Tool, we need to create an EMR 6.X cluster. You can create a new EMR 6.X cluster via the Amazon EMR console or the AWS Command Line Interface (AWS CLI), without specifying external Hive Metastore details. This lets the EMR 6.X cluster launch successfully using the default Hive Metastore. For more information about EMR cluster management, refer to Plan and configure clusters.
  3. After your new EMR 6.X cluster is launched successfully and is in the waiting state, SSH to the EMR 6.X primary node and take a backup of /etc/hive/conf/hive-site.xml:
    sudo cp /etc/hive/conf/hive-site.xml /etc/hive/conf/hive-site.xml.bak

  4. Stop Hive services:
    sudo systemctl stop hive-hcatalog-server.service
    sudo systemctl stop hive-server2.service

    Now you update the Hive configuration and point it to the old hive Metastore database.

  5. Modify /etc/hive/conf/hive-site.xml and update the properties with the values you collected earlier:
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://{hostname}:3306/{dbname}?createDatabaseIfNotExist=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.mariadb.jdbc.Driver</value>
      <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>{username}</value>
      <description>username to use against metastore database</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>{password}</value>
      <description>password to use against metastore database</description>
    </property>

  6. On the same or a new SSH session, run the Hive Schema Tool to check that the Metastore is pointing to the old Metastore database:
    sudo hive --service schemaTool -dbType mysql -info

    The output should look as follows (old-hostname, old-dbname, and old-username are the values you changed):

    Metastore connection URL:     jdbc:mysql://{old-hostname}:3306/{old-dbname}?createDatabaseIfNotExist=true
    Metastore Connection Driver :     org.mariadb.jdbc.Driver
    Metastore connection User:     {old-username}
    Hive distribution version:     3.1.0
    Metastore schema version:      2.3.0
    schemaTool completed

    You can upgrade the Hive Metastore by passing the -upgradeSchema option to the Hive Schema Tool. The tool figures out the SQL scripts required to initialize or upgrade the schema and then runs those scripts against the backend database.

  7. Run the upgradeSchema command with -dryRun, which only lists the SQL scripts needed during the actual run:
    sudo hive --service schematool -dbType mysql -upgradeSchema -dryRun

    The output should look like the following code. It shows the Metastore upgrade path from the old version to the new version. You can find the upgrade order on the GitHub repo. In case of failure during the upgrade process, these scripts can be run manually in the same order.

    Metastore connection URL:     jdbc:mysql://{old-hostname}:3306/{old-dbname}?createDatabaseIfNotExist=true
    Metastore Connection Driver :     org.mariadb.jdbc.Driver
    Metastore connection User:     {old-username}
    Hive distribution version:     3.1.0
    Metastore schema version:      2.3.0
    schemaTool completed

  8. To upgrade the Hive Metastore schema, run the Hive Schema Tool with -upgradeSchema:
    sudo hive --service schematool -dbType mysql -upgradeSchema

    The output should look like the following code:

    Starting upgrade metastore schema from version 2.3.0 to 3.1.0
    Upgrade script upgrade-2.3.0-to-3.0.0.mysql.sql
    ...
    Completed upgrade-2.3.0-to-3.0.0.mysql.sql
    Upgrade script upgrade-3.0.0-to-3.1.0.mysql.sql
    ...
    Completed upgrade-3.0.0-to-3.1.0.mysql.sql
    schemaTool completed

    In case of any issues or failures, you can run the preceding command with the -verbose option. This prints all the queries being run in order and their output.

    sudo hive --service schemaTool -verbose -dbType mysql -upgradeSchema

    If you encounter any failures during this process and you want to upgrade your Hive Metastore by running the SQL yourself, refer to Upgrading Hive Metastore.

    If HDFS was used as storage for the Hive warehouse or any Hive DB location, you need to update the NameNode alias or URI with the new cluster’s HDFS alias.

  9. Use the following commands to update the HDFS NameNode alias (replace <new-loc> <old-loc> with the HDFS root location of the new and old clusters, respectively):
    hive --service metatool -updateLocation <new-loc> <old-loc>

    You can run the following command on any EMR cluster node to get the HDFS NameNode alias:

    hdfs getconf -confKey dfs.namenode.rpc-address

    At first, you can run with the -dryRun option, which displays all the changes without persisting them. For example:

    [hadoop@ip-172-31-87-188 ~]$ hive --service metatool -updateLocation hdfs://ip-172-31-50-80.ec2.internal:8020 hdfs://ip-172-31-87-188.ec2.internal:8020 -dryRun
    Initializing HiveMetaTool..
    Looking for LOCATION_URI field in DBS table to update..
    Dry Run of updateLocation on table DBS..
    old location: hdfs://ip-172-31-87-188.ec2.internal:8020/user/hive/warehouse/testdb.db new location: hdfs://ip-172-31-50-80.ec2.internal:8020/user/hive/warehouse/testdb.db
    old location: hdfs://ip-172-31-87-188.ec2.internal:8020/user/hive/warehouse/testdb_2.db new location: hdfs://ip-172-31-50-80.ec2.internal:8020/user/hive/warehouse/testdb_2.db
    old location: hdfs://ip-172-31-87-188.ec2.internal:8020/user/hive/warehouse new location: hdfs://ip-172-31-50-80.ec2.internal:8020/user/hive/warehouse
    Found 3 records in DBS table to update
    Looking for LOCATION field in SDS table to update..
    Dry Run of updateLocation on table SDS..
    old location: hdfs://ip-172-31-87-188.ec2.internal:8020/user/hive/warehouse/testtable1 new location: hdfs://ip-172-31-50-80.ec2.internal:8020/user/hive/warehouse/testtable1
    Found 1 records in SDS table to update

    However, if the new location needs to be changed to a different HDFS or S3 path, then use the following approach.

    First connect to the remote Hive Metastore database and run the following query to pull all the tables for a specific database and list their locations. Replace <HiveMetastore_DB> with the database name used for the Hive Metastore in the external database (for this post, hive), and default with the Hive database name:

    mysql> SELECT d.NAME as DB_NAME, t.TBL_NAME, t.TBL_TYPE, s.LOCATION FROM <HiveMetastore_DB>.TBLS t 
    JOIN <HiveMetastore_DB>.DBS d ON t.DB_ID = d.DB_ID JOIN <HiveMetastore_DB>.SDS s 
    ON t.SD_ID = s.SD_ID AND d.NAME='default';

    Identify the tables whose locations need to be updated, then run the ALTER TABLE command to update them. You can prepare a script or a chain of ALTER TABLE commands to update the locations for multiple tables.

    ALTER TABLE <table_name> SET LOCATION "<new_location>";

  10. Start and check the status of the Hive Metastore and HiveServer2:
    sudo systemctl start hive-hcatalog-server.service
    sudo systemctl start hive-server2.service
    sudo systemctl status hive-hcatalog-server.service
    sudo systemctl status hive-server2.service

Post-upgrade validation

Perform the following post-upgrade steps:

  1. Confirm the Hive Metastore schema is upgraded to the new version:
    sudo hive --service schemaTool -dbType mysql -validate

    The output should look like the following code:

    Starting metastore validation
    
    Validating schema version
    Succeeded in schema version validation.
    [SUCCESS]
    Validating sequence number for SEQUENCE_TABLE
    Succeeded in sequence number validation for SEQUENCE_TABLE.
    [SUCCESS]
    Validating metastore schema tables
    Succeeded in schema table validation.
    [SUCCESS]
    Validating DFS locations
    Succeeded in DFS location validation.
    [SUCCESS]
    Validating columns for incorrect NULL values.
    Succeeded in column validation for incorrect NULL values.
    [SUCCESS]
    Done with metastore validation: [SUCCESS]
    schemaTool completed

  2. Run the following Hive Schema Tool command to query the Hive schema version and verify that it’s upgraded:
    $ sudo hive --service schemaTool -dbType mysql -info
    Metastore connection URL:        jdbc:mysql://<host>:3306/<hivemetastore-dbname>?createDatabaseIfNotExist=true
    Metastore Connection Driver :    org.mariadb.jdbc.Driver
    Metastore connection User:       <username>
    Hive distribution version:       3.1.0
    Metastore schema version:        3.1.0

  3. Run some DML queries against old tables and ensure they run successfully (one way to script such a check is shown in the sketch after this list).
  4. Verify the table and database counts using the same commands mentioned in the prerequisites section, and compare the counts.
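
For step 3, one way to script such a check is a small JDBC client against HiveServer2 on the new cluster. The following is a minimal sketch, assuming HiveServer2 listens on the default port 10000 of the EMR 6.X primary node and that the hive-jdbc driver is on the classpath; adjust the host, credentials, and query for your environment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveUpgradeValidation {
    public static void main(String[] args) throws Exception {
        // Placeholder host name; use the DNS name of the new EMR 6.X primary node
        String url = "jdbc:hive2://<emr-6x-primary-node>:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // testtable1 was created in the pre-upgrade prerequisite step
             ResultSet rs = stmt.executeQuery("select count(*) from testtable1")) {
            while (rs.next()) {
                System.out.println("Row count after upgrade: " + rs.getLong(1));
            }
        }
    }
}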

The Hive Metastore schema migration process is complete, and you can start working on your new EMR cluster. If for some reason you want to relaunch the EMR cluster, then you just need to provide the Hive Metastore remote database that we upgraded in the previous steps using the options on the Amazon EMR Configurations tab.

Migrate the Amazon EMR Hive Metastore to the AWS Glue Data Catalog

The AWS Glue Data Catalog is flexible and reliable, and can reduce your operation cost. Moreover, the Data Catalog supports different versions of EMR clusters. Therefore, when you migrate your Amazon EMR 5.X Hive Metastore to the Data Catalog, you can use the same Data Catalog with any new EMR 5.8+ cluster, including Amazon EMR 6.x. There are some factors you should consider when using this approach; refer to Considerations when using AWS Glue Data Catalog for more information. The following diagram shows the upgrade procedure.
EMR Hive Metastore Migrate to Glue Data Catalog
To migrate your Hive Metastore to the Data Catalog, you can use the Hive Metastore migration script from GitHub. The following are the major steps for a direct migration.

Make sure all the table data is stored in Amazon S3 and not HDFS. Otherwise, tables migrated to the Data Catalog will have the table location pointing to HDFS, and you can’t query the table. You can check your table data location by connecting to the MySQL database and running the following SQL:

select SD_ID, LOCATION from SDS where LOCATION like 'hdfs%';

Make sure to complete the prerequisite steps mentioned earlier before proceeding with the migration. Ensure the EMR 5.X cluster is in a waiting state and all of its components are in service.

  1. Note down the details of the existing EMR 5.X cluster Hive Metastore database to be upgraded.
    As mentioned before, this includes the endpoint host name, database name, user name, and password. You can do this through one of the following options:

    • Get the Hive Metastore DB information from the Hive configuration file – Log in to the Amazon EMR 5.X primary node, open the file /etc/hive/conf/hive-site.xml, and note the following properties:
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://{hostname}:3306/{dbname}?createDatabaseIfNotExist=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>{username}</value>
      <description>username to use against metastore database</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>{password}</value>
      <description>password to use against metastore database</description>
    </property>

    • Get the Hive Metastore DB information from the Amazon EMR console – Navigate to the Amazon EMR 5.X cluster, choose the Configurations tab, and note down the Metastore DB information.

    EMR Console for Configuration

  2. On the AWS Glue console, create a connection to the Hive Metastore as a JDBC data source.
    Use the connection JDBC URL, user name, and password you gathered in the previous step. Specify the VPC, subnet, and security group associated with your Hive Metastore. You can find these on the Amazon EMR console if the Hive Metastore is on the EMR primary node, or on the Amazon RDS console if the Metastore is an RDS instance.
  3. Download two extract, transform, and load (ETL) job scripts from GitHub and upload them to an S3 bucket:
    wget https://raw.githubusercontent.com/aws-samples/aws-glue-samples/master/utilities/Hive_metastore_migration/src/hive_metastore_migration.py
    wget https://raw.githubusercontent.com/aws-samples/aws-glue-samples/master/utilities/Hive_metastore_migration/src/import_into_datacatalog.py

    If you configured AWS Glue to access Amazon S3 from a VPC endpoint, you must upload the script to a bucket in the same AWS Region where your job runs.

    Now you must create a job on the AWS Glue console to extract metadata from your Hive Metastore to migrate it to the Data Catalog.

  4. On the AWS Glue console, choose Jobs in the navigation pane.
  5. Choose Create job.
  6. Select Spark script editor.
  7. For Options, select Upload and edit an existing script.
  8. Choose Choose file and upload the import_into_datacatalog.py script you downloaded earlier.
  9. Choose Create.
    Glue Job script editor
  10. On the Job details tab, enter a job name (for example, Import-Hive-Metastore-To-Glue).
  11. For IAM Role, choose a role.
  12. For Type, choose Spark.
    Glue ETL Job details
  13. For Glue version, choose Glue 3.0.
  14. For Language, choose Python 3.
  15. For Worker type, choose G.1X.
  16. For Requested number of workers, enter 2.
    Glue ETL Job details
  17. In the Advanced properties section, for Script filename, enter import_into_datacatalog.py.
  18. For Script path, enter the S3 path you used earlier (just the parent folder).
    Glue ETL Job details
  19. Under Connections, choose the connection you created earlier.
    Glue ETL Job details
  20. For Python library path, enter the S3 path you used earlier for the file hive_metastore_migration.py.
  21. Under Job parameters, enter the following key-value pairs:
    • --mode: from-jdbc
    • --connection-name: EMR-Hive-Metastore
    • --region: us-west-2

    Glue ETL Job details

  22. Choose Save to save the job.
  23. Run the job on demand on the AWS Glue console.

If the job runs successfully, Run status should show as Succeeded. When the job is finished, the metadata from the Hive Metastore is visible on the AWS Glue console. Check the databases and tables listed to verify that they were migrated correctly.

Known issues

In some cases where the Hive Metastore schema version is on a very old release or some required metadata tables are missing, the upgrade process may fail. In this case, you can use the following steps to identify and fix the issue. Run the schemaTool upgradeSchema command with the -verbose option as follows:

sudo hive --service schemaTool -verbose -dbType mysql -upgradeSchema

This prints all the queries being run in order and their output:

jdbc:mysql://metastore.xxxx.us-west-1> CREATE INDEX PCS_STATS_IDX ON PART_COL_STATS (DB_NAME,TABLE_NAME,COLUMN_NAME,PARTITION_NAME) USING BTREE
Error: (conn=6831922) Duplicate key name 'PCS_STATS_IDX' (state=42000,code=1061)
Closing: 0: jdbc:mysql://metastore.xxxx.us-west-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true
org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!
Underlying cause: java.io.IOException : Schema script failed, errorcode 2

Note down the query and the error message, then take the required steps to address the issue. For example, depending on the error message, you may have to create the missing table or alter an existing table. Then you can either rerun the schemaTool upgradeSchema command, or you can manually run the remaining queries required for the upgrade. You can get the complete script that schemaTool runs from the following path on the primary node, /usr/lib/hive/scripts/metastore/upgrade/mysql/, or from GitHub.

Clean up

Running additional EMR clusters to perform the upgrade activity in your AWS account may incur additional charges. When you complete the Hive Metastore upgrade successfully, we recommend deleting the additional EMR clusters to save cost.

Conclusion

To upgrade Amazon EMR from 5.X to 6.X and take advantage of some features from Hive 3.X or Spark SQL 3.X, you have to upgrade the Hive Metastore first. If you’re using the AWS Glue Data Catalog as your Hive Metastore, you don’t need to do anything because the Data Catalog supports both Amazon EMR versions. If you’re using a MySQL database as the external Hive Metastore, you can upgrade by following the steps outlined in this post, or you can migrate your Hive Metastore to the Data Catalog.

There are some functional differences between the different versions of Hive, Spark, and Flink. If you have applications running on Amazon EMR 5.X, make sure to test them on Amazon EMR 6.X and validate their functional compatibility. We will cover application upgrades for Amazon EMR components in a future post.


About the authors

Jianwei Li is a Senior Analytics Specialist TAM. He provides consulting services to AWS enterprise support customers to design and build modern data platforms. He has more than 10 years of experience in the big data and analytics domain. In his spare time, he likes running and hiking.

Narayanan Venkateswaran is an Engineer in the AWS EMR group. He works on developing Hive in EMR. He has over 17 years of work experience in the industry across several companies including Sun Microsystems, Microsoft, Amazon and Oracle. Narayanan also holds a PhD in databases with focus on horizontal scalability in relational stores.

Partha Sarathi is an Analytics Specialist TAM at AWS based in Sydney, Australia. He brings 15+ years of technology expertise and helps enterprise customers optimize analytics workloads. He has extensively worked on both on-premises and cloud big data workloads, along with various ETL platforms, in his previous roles. He also actively conducts proactive operational reviews around analytics services like Amazon EMR, Redshift, and OpenSearch.

Krish is an Enterprise Support Manager responsible for leading a team of specialists in EMEA focused on BigData & Analytics, Databases, Networking and Security. He is also an expert in helping enterprise customers modernize their data platforms and inspire them to implement operational best practices. In his spare time, he enjoys spending time with his family, travelling, and video games.

[$] BPF as a safer kernel programming environment

Post Syndicated from original https://lwn.net/Articles/909095/

For better or worse, C is the lingua franca in the world of kernel engineering. The core logic of the Linux kernel is written entirely in C (with a bit of assembly), as are its drivers and modules. While C is rightfully celebrated for its powerful yet simple semantics, it is an older language that lacks many of the features present in modern languages such as Rust. The BPF subsystem, on the other hand, provides a programming environment that allows engineers to write programs that can run safely in kernel space. At the 2022 Linux Plumbers Conference in Dublin, Ireland, Alexei Starovoitov presented an overview of how BPF has evolved over the years to provide a new model for kernel programming.

What’s Up, Home? – How Zabbix Can Help You with Rising Electricity Bills

Post Syndicated from Janne Pikkarainen original https://blog.zabbix.com/whats-up-home-how-zabbix-can-help-you-with-rising-electricity-bills/23582/

Can you monitor your upcoming electricity bills with Zabbix? Of course, you can! By day, I am a monitoring technical lead in a global cyber security company. By night, I monitor my home with Zabbix & Grafana Labs and make some weird experiments with them. Welcome to my weekly blog about the project.

With the current world events, energy prices are soaring. But how much do I need to really pay next month for my electricity? Zabbix to the rescue!

(Yes, in Finland I can check that from my electricity company’s page, but where’s the fun in that?)

Fixed vs spot price

There are two kinds of electricity contracts you can subscribe to in Finland. With a fixed price, you can be sure your bill does not fluctuate that much from month to month, as you pay the same price per kilowatt-hour for every hour of the day. In this kind of deal, the electricity company adds some extra to each kilowatt-hour, so you will automatically pay some extra compared to the electricity market price, but at least you don’t get so severely surprised by market price peaks.

Then there’s the spot price, where you pay only the electricity market price. This can and will vary a lot depending on the hour of the day, but at least in theory, this is the cheapest option in the long run. But, if the market price goes WAY up, like it tends to do in the winter, and has now been peaking due to world events, this can add to your bill.

Nordpool, please respond

There’s Nord Pool (“Nord Pool runs the leading power market in Europe, offering day-ahead and intraday markets to our customers”), and there’s a Python library for accessing Nord Pool electricity prices. With it, I could get hour-by-hour prices, but for this experiment, let’s stick with the average kWh price. The example script on the GitHub page shows all kinds of data, and for fun let’s use Zabbix item preprocessing to parse the average price from its output.

I now have the script below running as a cron task every night, so my results are updated once per 24 hours.

So, Zabbix then reads the file contents, like in so many of my previous blog posts.

Next, let’s add some preprocessing. The regular expression part gets the Average value from the script output, and the custom multiplier (0.001) changes the value from “Euros per Megawatt-hour” to “Euros per Kilowatt-hour”, so it becomes the familiar value I see on electricity bills.

And… it’s working! As I know our average consumption, let’s add a new Grafana dashboard.

Four seasons

During summer, we don’t actually use very much electricity compared to our harsh winter; for example, keeping our garage “warm” (about +10C) during winter contributes to our electricity bill quite a lot.
Here’s a dashboard showing some guesstimations of how expensive the different seasons will be for us. Or, hopefully cheap, if the long overdue new Olkiluoto 3 nuclear plant finally could operate at its full capacity here in Finland.

The guesstimate above is missing some taxes and electricity transfer prices, so the reality will be a bit more expensive than this. Maybe I should also add some triggers to Zabbix to alert me about any really crazy price changes.

Anyway, now I can start gathering nuts for the cold winter as it seems that it will be an expensive one.

I have been working at Forcepoint since 2014 and I’m happy that my laptop does not consume too much electricity. — Janne Pikkarainen

This post was originally published on the author’s LinkedIn account.

The post What’s Up, Home? – How Zabbix Can Help You with Rising Electricity Bills appeared first on Zabbix Blog.

Security updates for Friday

Post Syndicated from original https://lwn.net/Articles/909208/

Security updates have been issued by Debian (bind9, expat, firefox-esr, mediawiki, and unzip), Fedora (qemu and thunderbird), Oracle (webkit2gtk3), SUSE (ardana-ansible, ardana-cobbler, ardana-tempest, grafana, openstack-heat-templates, openstack-horizon-plugin-gbp-ui, openstack-neutron-gbp, openstack-nova, python-Django1, rabbitmq-server, rubygem-puma, ardana-ansible, ardana-cobbler, grafana, openstack-heat-templates, openstack-murano, python-Django, rabbitmq-server, rubygem-puma, dpdk, freetype2, rubygem-rack, and virtualbox), and Ubuntu (etcd, libjpeg-turbo, linux-gcp, linux-gke, linux-raspi, linux-oem-5.17, linux-raspi-5.4, python-oauthlib, and python3.5).

GA Week 2022: what you may have missed

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/ga-week-2022-recap/

GA Week 2022: what you may have missed

Back in 2019, we worked on a chart for Cloudflare’s IPO S-1 document that showed major releases since Cloudflare was launched in 2010. Here’s that chart:

GA Week 2022: what you may have missed

Of course, that chart doesn’t show everything we’ve shipped, but the curve demonstrates a truth about a growing company: we keep shipping more and more products and services. Some of those things start with a beta, sometimes open and sometimes private. But all of them become generally available after the beta period.

Back in, say, 2014, we only had a few major releases per year. But as the years have progressed and the company has grown we have constant updates, releases and changes. This year a confluence of products becoming generally available in September meant it made sense to wrap them all up into GA Week.

GA Week has now finished, and the team is working to put the finishing touches on Birthday Week (coming this Sunday!), but here’s a recap of everything that we launched this week.

Monday (September 19)

| What launched | Summary | Available for? |
|---|---|---|
| Cloudforce One | Our threat operations and research team, Cloudforce One, is now open for business and has begun conducting threat briefings. | Enterprise |
| Improved Access Control: Domain Scoped Roles are now generally available | It is possible to scope your users’ access to specific domains with Domain Scoped Roles. This will allow all users access to roles, and the ability to access within zones. | Currently available to all Free plans, and coming to Enterprise shortly. |
| Account WAF now available to Enterprise customers | Users can manage and configure the WAF for all of their zones from a single pane of glass. This includes custom rulesets and managed rulesets (Core/OWASP and Managed). | Enterprise |
| Introducing Cloudflare Adaptive DDoS Protection – our new traffic profiling system for mitigating DDoS attacks | Cloudflare’s new Adaptive DDoS Protection system learns your unique traffic patterns and constantly adapts to protect you against sophisticated DDoS attacks. | Built into our Advanced DDoS product |
| Introducing Advanced DDoS Alerts | Cloudflare’s Advanced DDoS Alerts provide tailored and actionable notifications in real-time. | Built into our Advanced DDoS product |

Tuesday (September 20)

| What launched | Summary | Available for? |
|---|---|---|
| Detect security issues in your SaaS apps with Cloudflare CASB | By leveraging API-driven integrations, receive comprehensive visibility and control over SaaS apps to prevent data leaks, detect Shadow IT, block insider threats, and avoid compliance violations. | Enterprise Zero Trust |
| Cloudflare Data Loss Prevention now Generally Available | Data Loss Prevention is now available for Cloudflare customers, giving customers more options to protect their sensitive data. | Enterprise Zero Trust |
| Cloudflare One Partner Program acceleration | The Cloudflare One Partner Program gains traction with existing and prospective partners. | Enterprise Zero Trust |
| Isolate browser-borne threats on any network with WAN-as-a-Service | Defend any network from browser-borne threats with Cloudflare Browser Isolation by connecting legacy firewalls over IPsec / GRE. | Zero Trust |
| Cloudflare Area 1 – how the best Email Security keeps getting better | Cloudflare started using Area 1 in 2020 and later acquired the company in 2022. We were most impressed how phishing, responsible for 90+% of cyberattacks, basically became a non-issue overnight when we deployed Area 1. But our vision is much bigger than preventing phishing attacks. | Enterprise Zero Trust |

Wednesday (September 21)

| What launched | Summary | Available for? |
|---|---|---|
| R2 is now Generally Available | R2 gives developers object storage minus the egress fees. With the GA of R2, developers will be free to focus on innovation instead of worrying about the costs of storing their data. | All plans |
| Stream Live is now Generally Available | Stream live video to viewers at a global scale. | All plans |
| The easiest way to build a modern SaaS application | With Workers for Platforms, your customers can build custom logic to meet their needs right into your application. | Enterprise |
| Going originless with Cloudflare Workers – Building a Todo app – Part 1: The API | Today we go through Part 1 in a series on building completely serverless applications on Cloudflare’s Developer Platform. | Free for all Workers users |
| Store and Retrieve your logs on R2 | Log Storage on R2: a cost-effective solution to store event logs for any of our products! | Enterprise (as part of Logpush) |
| SVG support in Cloudflare Images | Cloudflare Images now supports storing and delivering SVG files. | Part of Cloudflare Images |

Thursday (September 22)

| What launched | Summary | Available for? |
|---|---|---|
| Regional Services Expansion | Cloudflare is launching the Data Localization Suite for Japan, India and Australia. | Enterprise |
| API Endpoint Management and Metrics are now GA | API Shield customers can save, update, and monitor the performance of API endpoints. | Enterprise |
| Cloudflare Zaraz supports Managed Components and DLP to make third-party tools private | Third party tools are the only thing you can’t control on your website, unless you use Managed Components with Cloudflare Zaraz. | Available on all plans |
| Logpush: now lower cost and with more visibility | Logpush jobs can now be filtered to contain only logs of interest. Also, you can receive alerts when jobs are failing, as well as get statistics on the health of your jobs. | Enterprise |

Of course, you won’t have to wait a year for more products to become GA. We’ll be shipping betas and making products generally available throughout the year. And we’ll continue iterating on our products so that all of them become leaders.

As we said at the start of GA Week:

“But it’s not just about making products work and be available, it’s about making the best-of-breed. We ship early and iterate rapidly. We’ve done this over the years for WAF, DDoS mitigation, bot management, API protection, CDN and our developer platform. Today, analyst firms such as Gartner, Forrester and IDC recognize us as leaders in all those areas.”

Now, onwards to Birthday Week!

Leaking Screen Information on Zoom Calls through Reflections in Eyeglasses

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2022/09/leaking-screen-information-on-zoom-calls-through-reflections-in-eyeglasses.html

Okay, it’s an obscure threat. But people are researching it:

“Our models and experimental results in a controlled lab setting show it is possible to reconstruct and recognize with over 75 percent accuracy on-screen texts that have heights as small as 10 mm with a 720p webcam.” That corresponds to 28 pt, a font size commonly used for headings and small headlines.

[…]

Being able to read reflected headline-size text isn’t quite the privacy and security problem of being able to read smaller 9 to 12 pt fonts. But this technique is expected to provide access to smaller font sizes as high-resolution webcams become more common.

“We found future 4k cameras will be able to peek at most header texts on almost all websites and some text documents,” said Long.

[…]

A variety of factors can affect the legibility of text reflected in a video conference participant’s glasses. These include reflectance based on the meeting participant’s skin color, environmental light intensity, screen brightness, the contrast of the text with the webpage or application background, and the characteristics of eyeglass lenses. Consequently, not every glasses-wearing person will necessarily provide adversaries with reflected screen sharing.

With regard to potential mitigations, the boffins say that Zoom already provides a video filter in its Background and Effects settings menu that consists of reflection-blocking opaque cartoon glasses. Skype and Google Meet lack that defense.

Research paper.

Integrating Amazon MemoryDB for Redis with Java-based AWS Lambda

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/integrating-amazon-memorydb-for-redis-with-java-based-aws-lambda/

This post is written by Mansi Y Doshi, Consultant and Aditya Goteti, Sr. Lead Consultant.

Enterprises are modernizing and migrating their applications to the AWS Cloud to improve scalability, reduce cost, innovate, and reduce the time to market for new features. Legacy applications are often built with an RDBMS as the only backend solution.

Modernizing legacy Java applications with microservices requires breaking down a single monolithic application into multiple independent services. Each microservice does a specific job and requires its own database to persist data, but one database does not fit all use cases. Modern applications require purpose-built databases catering to their specific needs and data models.

This post discusses some of the common use cases for one such data store, Amazon MemoryDB for Redis, which is built to provide durability and faster reads and writes.

Use cases

Modern tech stacks often begin with a backend that interacts with a durable database like MongoDB, Amazon Aurora, or Amazon DynamoDB for their data persistence needs.

But, as traffic volume increases, it often makes sense to introduce a caching layer like ElastiCache. This is populated with data by service logic each time a database read happens, such that the subsequent reads of the same data become faster. While ElastiCache is effective, you must manage and pay for two separate data sources for the same data. You must also write custom logic to handle the cache reads/writes besides the existing read/write logic used for durable databases.

While traditional databases like MySQL, Postgres and DynamoDB provide data durability at the cost of speed, transient data stores like ElastiCache trade durability for faster reads/writes (usually within microseconds). ElastiCache provides writes and strongly consistent reads on the primary node of each shard and eventually consistent reads from read replicas. There is a possibility that the latest data written to the primary node is lost during a failover, which makes ElastiCache fast but not durable.

MemoryDB addresses both these issues. It provides strong consistency on the primary node and eventual consistency reads on replica nodes. The consistency model of MemoryDB is like ElastiCache for Redis. However, in MemoryDB, data is not lost across failovers, allowing clients to read their writes from primaries regardless of node failures. Only data that is successfully persisted in the Multi-AZ transaction log is visible. Replica nodes are still eventually consistent. Because of its distributed transaction model, MemoryDB can provide both durability and microsecond response time.

MemoryDB is most ideal for services that are read-heavy and sensitive to latency, like configuration, search, authentication and leaderboard services. These must operate at microsecond read latency and still be able to persist the data for high availability and durability. Services like leaderboards, having millions of records, often break down the data into smaller chunks/batches and process them in parallel. This needs a data store that can perform calculations on the fly and also store results temporarily. Redis can process millions of operations per second and store temporary calculations for fast retrieval and also run other operations (like aggregations). Since Redis is single-threaded, from the command’s execution point of view, it also helps to avoid dirty writes and reads.
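
As a concrete illustration of the leaderboard pattern described above, the following Java sketch uses the Jedis client (version 4.x assumed) with a Redis sorted set to record scores and read the top entries. The cluster endpoint, key name, and player IDs are illustrative placeholders, not part of any real deployment.

import redis.clients.jedis.DefaultJedisClientConfig;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

public class LeaderboardExample {
    public static void main(String[] args) {
        // Placeholder cluster endpoint; MemoryDB enforces TLS, hence ssl(true)
        HostAndPort endpoint = new HostAndPort("<memorydb-cluster-endpoint>", 6379);
        try (JedisCluster client = new JedisCluster(endpoint,
                DefaultJedisClientConfig.builder().ssl(true).build())) {

            // ZADD keeps the leaderboard ordered by score as members are added or updated
            client.zadd("leaderboard", 1500, "player-1");
            client.zadd("leaderboard", 2300, "player-2");
            client.zadd("leaderboard", 900, "player-3");

            // Read the top 10 members, highest score first, and look up each score
            for (String member : client.zrevrange("leaderboard", 0, 9)) {
                System.out.println(member + " -> " + client.zscore("leaderboard", member));
            }
        }
    }
}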

Another use case is a configuration service, where users store, change, and retrieve their configuration data. In large distributed systems, there are often hundreds of independent services interacting with each other using well-defined REST APIs. These services depend on the configuration data to perform specific actions. The configuration service must serve the required information at a low latency to avoid being a bottleneck for the other dependent services.

MemoryDB can read at microsecond latencies durably. It also persists data across multiple Availability Zones. It uses multi-Availability Zone transaction logs to enable fast failover, database recovery, and node restarts. You can use it as a primary database without the need to maintain another cache to lower the data access latency. This also removes the need to maintain an additional caching service, which further reduces cost.

These use cases are a good fit for using MemoryDB. Next, you see how to access, store, and retrieve data in MemoryDB from your Java-based AWS Lambda function.

Overview

This blog shows how to build an Amazon MemoryDB cluster and integrate it with AWS Lambda. Amazon API Gateway and Lambda can be paired together to create a client-facing application, which can be easier to maintain, highly scalable, and secure. Both are fully managed services with no need to provision or manage servers. They can be cost effective when compared to running the application on servers for workloads with long idle periods. Using Lambda authorizers you can also write custom code to control access to your API.

Walkthrough

The following steps show how to provision an Amazon MemoryDB cluster along with Amazon VPC, subnets, security groups and integrate it with a Lambda function using Redis/Jedis Java client. Here, the Lambda function is configured to connect to the same VPC where MemoryDB is provisioned. The steps include provisioning through an AWS SAM template.

Prerequisites

  1. Create an AWS account if you do not already have one and log in.
  2. Configure your account and set up permissions to access MemoryDB.
  3. Install Java 8 or above.
  4. Install Maven.
  5. The Jedis Java client for Redis (added as a Maven dependency later in this post).
  6. Install the AWS SAM CLI if you do not already have it.

Creating the MemoryDB cluster

Refer to the serverless pattern for a quick setup and customize as required. The AWS SAM template creates VPC, subnets, security groups, the MemoryDB cluster, API Gateway, and Lambda.

To allow the Lambda function to access the MemoryDB cluster, the Lambda function's security group is added as an ingress source in the cluster's security group. The MemoryDB cluster is always launched in a VPC. If the subnet is not specified, the cluster is launched into your default Amazon VPC.

You can also use your existing VPC and subnets and customize the template accordingly. If you are creating a new VPC, you can change the CIDR block and other configuration values as needed. Make sure DNS hostnames and DNS support are enabled for the VPC. Use the optional parameters section to customize your templates; parameters let you pass custom values to your template each time you create or update a stack.

Recommendations

As your workload requirements change, you might want to increase the performance of your cluster or reduce costs by scaling the cluster in or out. To improve read or write performance, you can scale your cluster horizontally by increasing the number of read replicas (for read throughput) or shards (for write throughput).

To reduce cost when instances are over-provisioned, you can scale down vertically by moving to a smaller node type, or scale up to a larger node type to overcome CPU bottlenecks or memory pressure. Both vertical and horizontal scaling are applied with no downtime, and cluster restarts are not required. You can customize the following parameters in the memoryDBCluster resource as required.

NodeType: db.t4g.small
NumReplicasPerShard: 2
NumShards: 2

In MemoryDB, all writes are carried out on the primary node of a shard, while reads can be served from the replica nodes. Identifying the right number of read replicas, the node type, and the number of shards in a cluster is crucial for optimal performance and for avoiding the additional cost of over-provisioned resources. It's recommended to start with the minimum required resources and scale out as needed.

Replicas improve read scalability. It is recommended to have at least two read replicas per shard, but depending on the payload size and how read-heavy the workload is, more than two may be needed. Adding more read replicas than required gives no performance improvement and only adds cost. The following benchmarking was performed with the redis-benchmark tool, using only GET requests to simulate a read-heavy workload.

The metrics for both clusters were almost the same at 10 million requests with a 1 KB payload per request. When the payload size was increased to 5 KB and the number of GET requests to 20 million, the cluster with two primaries and two replicas could not process the load, whereas the second cluster processed it successfully. To get the sizing right, load testing on a staging/pre-production environment with a production-like load is recommended.

Creating a Lambda function and allowing access to the MemoryDB cluster

In the lambda-redis/HelloWorldFunction/pom.xml file, add the following dependency. This adds the Jedis Java client used to connect to the MemoryDB cluster:

<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>4.2.0</version>
</dependency>

The simplest way to connect the Lambda function to the MemoryDB cluster is by configuring it within the same VPC where the MemoryDB cluster was launched.

To create a Lambda function, add the following code in the template.yaml file in the Resources section:

HelloWorldFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: HelloWorldFunction
      Handler: helloworld.App::handleRequest
      Runtime: java8
      MemorySize: 512
      Timeout: 900 #seconds
      Events:
        HelloWorld:
          Type: Api
          Properties:
            Path: /hello
            Method: get
      VpcConfig:
        SecurityGroupIds:
          - !GetAtt lambdaSG.GroupId
        SubnetIds:
          - !GetAtt privateSubnetA.SubnetId
          - !GetAtt privateSubnetB.SubnetId
      Environment:
        Variables:
          ClusterAddress: !GetAtt memoryDBCluster.ClusterEndpoint.Address

Java code to access MemoryDB

  1. In your Java class, connect to Redis using the Jedis client:
    HostAndPort hostAndPort = new HostAndPort(System.getenv("ClusterAddress"), 6379);
    JedisCluster jedisCluster = new JedisCluster(Collections.singleton(hostAndPort), 5000, 5000, 2, null, null, new GenericObjectPoolConfig(), true);
  2. You can now perform set and get operations on Redis as follows:
    jedisCluster.set("test", "value");
    jedisCluster.get("test");
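Putting these pieces together, the following is a minimal sketch of what the helloworld.App handler referenced in the SAM template could look like. It assumes the aws-lambda-java-core and aws-lambda-java-events dependencies are present in pom.xml; the JSON response body is illustrative only.

package helloworld;

import java.util.Collections;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

public class App implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

    // Initialize the cluster client once per execution environment so the
    // connection pool is reused across invocations. The final argument enables
    // TLS, matching the cluster's in-transit encryption.
    private static final JedisCluster jedisCluster = new JedisCluster(
            Collections.singleton(new HostAndPort(System.getenv("ClusterAddress"), 6379)),
            5000, 5000, 2, null, null, new GenericObjectPoolConfig(), true);

    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent input, Context context) {
        // Write a value to MemoryDB and read it back.
        jedisCluster.set("test", "value");
        String value = jedisCluster.get("test");

        return new APIGatewayProxyResponseEvent()
                .withStatusCode(200)
                .withBody("{\"test\": \"" + value + "\"}");
    }
}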

JedisCluster maintains its own pool of connections and takes care of connection teardown. But you can also customize the configuration for closing idle connections using the GenericObjectPoolConfig object.
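As a hedged example, the following sketch shows how the pool might be tuned; the values are illustrative rather than recommendations, and the Duration-based setters assume the commons-pool2 version pulled in by Jedis 4.x (2.10 or later).

// Illustrative pool tuning; values are examples, not recommendations.
// Requires java.time.Duration and org.apache.commons.pool2.impl.GenericObjectPoolConfig.
GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
poolConfig.setMaxTotal(16);        // upper bound on connections per cluster node
poolConfig.setMaxIdle(8);          // idle connections kept around for reuse
poolConfig.setMinIdle(1);
poolConfig.setTestOnBorrow(true);                               // validate a connection before handing it out
poolConfig.setTimeBetweenEvictionRuns(Duration.ofSeconds(30));  // how often the evictor looks for idle connections
poolConfig.setMinEvictableIdleTime(Duration.ofSeconds(60));     // how long a connection may sit idle before it is closed

// Pass poolConfig to the JedisCluster constructor in place of new GenericObjectPoolConfig().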

Clean Up

To delete the entire stack, run the command "sam delete".

Conclusion

In this post, you learned how to provision a MemoryDB cluster and access it from Lambda. MemoryDB is suitable for applications requiring microsecond reads and single-digit millisecond writes along with durable storage. Accessing MemoryDB through Lambda and API Gateway removes the need to provision and maintain your own servers.

For more serverless learning resources, visit Serverless Land.

How to enable Private Access Tokens in iOS 16 and stop seeing CAPTCHAs

Post Syndicated from João Tomé original https://blog.cloudflare.com/how-to-enable-private-access-tokens-in-ios-16-and-stop-seeing-captchas/

You go to a website or service, but before access is granted, there's a visual challenge that forces you to select bikes, buses, or traffic lights in a set of images. That can be an exasperating experience. Now, if you have iOS 16 on your iPhone, those days could be over; the fix is just a one-time toggle away.

CAPTCHA = “Completely Automated Public Turing test to tell Computers and Humans Apart”

In 2021, we took direct steps to end the madness of CAPTCHAs, the checks that make sure you're human and not a bot, and that waste humanity about 500 years per day. In August 2022, we announced Private Access Tokens. With them, we're able to eliminate CAPTCHAs on iPhones, iPads, and Macs (with more to come) using open, privacy-preserving standards.

On September 12, iOS 16 became generally available (iPadOS 16 and macOS 13 should arrive in October), and in your device's settings there's a toggle that enables the Private Access Token (PAT) technology, which eliminates the need for those CAPTCHAs and automatically validates that you are a real human visiting a site. If you already have iOS 16, here's how to confirm that the toggle is "on" (usually it is):

Settings > Apple ID > Password & Security > Automatic Verification (should be enabled)

What will you get? A completely invisible, private way to validate yourself, and for a website, a way to automatically verify that real users are visiting the site without the horrible CAPTCHA user experience.

Visitors using operating systems that support these tokens, including the upcoming versions of iPadOS and macOS, can now prove they're human without completing a CAPTCHA or giving up personal data.

Let’s recap from our August 2022 announcement blog post what this means for different users:

If you’re an Internet user:

  • We’re helping make your mobile web experience more pleasant and more private.
  • You won’t see a CAPTCHA on a supported iOS or Mac device (other devices coming soon!) accessing the Cloudflare network.

If you’re a web or application developer:

  • You’ll know your users are humans coming from an authentic device and signed application, verified by the device vendor directly.
  • And you’ll validate users without maintaining a cumbersome SDK.

If you’re a Cloudflare customer:

  • You don’t have to do anything! Cloudflare will automatically ask for and use Private Access Tokens when using Managed Challenge.
  • Your visitors won’t see a CAPTCHA.

It’s all about simplicity, without compromising on privacy. The work done over a year was a collaboration between Cloudflare and Apple, Google, and other industry leaders to extend the Privacy Pass protocol with support for a new cryptographic token.

These tokens simplify application security for developers and security teams, and make legacy, third-party, SDK-based approaches to determining whether a human is using a device obsolete. They work for browsers, APIs called by browsers, and APIs called within apps. After Apple announced in August that PATs would be incorporated into iOS 16, iPadOS 16, and macOS 13, the process of ending CAPTCHAs got a big boost. And we expect additional vendors to announce support in the near future.

Cloudflare has already incorporated PATs into our Managed Challenge platform, so any customer using this feature will automatically take advantage of this new technology to improve the browsing experience for supported devices.

In our August in-depth blog post about PATs, you can learn more about why CAPTCHAs don't work in mobile environments, how PATs remove the need for them, and how sites that can't challenge a visitor with a CAPTCHA end up collecting private data instead.

Improved privacy

In that blog post, we also explain how Private Access Tokens vastly improve privacy by validating without fingerprinting. So, by partnering with third parties like device manufacturers, who already have the data that would help us validate a device, we are able to abstract portions of the validation process, and confirm data without actually collecting, touching, or storing that data ourselves. Rather than interrogating a device directly, we ask the device vendor to do it for us.

Most customers won’t have to do anything to utilize Private Access Tokens. Why? To take advantage of PATs, all you have to do is choose Managed Challenge rather than Legacy CAPTCHA as a response option in a Firewall rule. More than 65% of Cloudflare customers are already doing this.

Now, if you have iOS 16 on your iPhone, it’s your turn.
