Comic for 2024.09.07 – Proposal

Code review is in high demand, and short supply, for most open-source projects.
Reviewer time is precious, so any tool that can lighten the load is worth exploring.
That is why Jesse Brandeburg and Kamel Ayari decided to test whether
tools like ChatGPT could review patches to provide quick feedback to
contributors about common problems. In a
talk at the Netdev 0x18 conference this July, Brandeburg provided an overview of an
experiment using machine learning to review emails containing patches
sent to the netdev
mailing list. Large-language models (LLMs) will not be replacing human reviewers anytime
soon, but they may be a useful addition to help humans focus on deeper
reviews instead of simple rule violations.

Metasploit Weekly Wrap-Up 09/06/2024

2024-09-06 Jack Heysel

Post Syndicated from Jack Heysel original https://blog.rapid7.com/2024/09/06/metasploit-weekly-wrap-up-42/

Honey, I shrunk the PHP payloads

Metasploit Weekly Wrap-Up 09/06/2024

This release contains more PHP payload improvements from Julien Voisen. Last week we landed a PR from Julien that added a datastore option to the php/base64 encoder that when enabled, will use zlib to compress the payload which significantly reduced the size, bringing a payload of 4040 bytes down to a mere 1617 bytes. This week’s release includes a php/minify encoder which removes all unnecessary characters from the payload including comments, empty lines, leading spaces, trailing spaces, spaces after keywords and spaces before block openings. Using the php/minify encoder can take a payload of size 4052 bytes down to 2839 bytes. We’d like to thank Julien for their continued commitment to improving PHP payloads!

New module content (1)

PHP Minify Encoder

Author: Julien Voisin
Type: Encoder
Pull request: #19435 contributed by jvoisin
Path: php/minify

Description: This encoder minifies PHP payloads by removing spaces after keywords and before block openings. It removes comments, empty lines, new lines and leading and trailing spaces.

Enhancements and features (2)

#19368 from h00die-gr3y – This adjusts the exploit/multi/http/geoserver_unauth_rce_cve_2024_36401 to dynamically pull and test the feature_type list to establish an RCE. This will make the module more robust towards installations with different feature_type configurations.
#19401 from jvoisin – Add a mixin to get SPIP version and make use of it.

Bugs fixed (2)

#19381 from Takahiro-Yoko – This fixes the gitlab_login scanner so that it uses the proper datastore options Username and Password which are the standard for login scanners. Before this fix the scanner was using HttpUsername and HttpPassword and ignoring the datastore options Username and Password.
#19438 from cgranleese-r7 – Fixes a nil error if login is successful with ldap_login module.

Documentation

You can find the latest Metasploit documentation on our docsite at docs.metasploit.com.

Get it

As always, you can update to the latest Metasploit Framework with msfupdate
and you can get more details on the changes since the last blog post from
GitHub:

If you are a git user, you can clone the Metasploit Framework repo (master branch) for the latest.
To install fresh without using git, you can use the open-source-only Nightly Installers or the
commercial edition Metasploit Pro

A Quick Introduction to the NVIDIA GH200 aka Grace Hopper

2024-09-06 Patrick Kennedy

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/a-quick-introduction-to-the-nvidia-gh200-aka-grace-hopper-arm/

The NVIDIA GH200 or “Grace Hopper” is far from a single product. We have a quick guide so when someone says “GH200” you know what to look for

The post A Quick Introduction to the NVIDIA GH200 aka Grace Hopper appeared first on ServeTheHome.

NGINX has moved to Github

2024-09-06 jzb

Post Syndicated from jzb original https://lwn.net/Articles/989229/

The NGINX team has announced
that official NGINX open-source development has moved away from
Mercurial to GitHub, and
the project will now be taking contributions
in the form of pull requests:

Additionally, starting today, we will begin accepting bugs reports,
feature requests and enhancements directly through GitHub, under the
“Issues” tab. Moreover, we’ve moved our community forums to the GitHub
“Discussions” area, where you will now be able
to engage in conversation, ask, and answer questions.

[…] We understand that changes like these may require adjustment,
so to give you more time, we will continue accepting patches and
provide community support via mailing lists until December 31st, 2024.

Missy Franklin, Angela Ruggiero & Ashton Eaton | Olympic Panel | Talks at Google

2024-09-06 Talks at Google

Post Syndicated from Talks at Google original https://www.youtube.com/watch?v=3sSpFnzMzeU

YubiKey Side-Channel Attack

2024-09-06 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/09/yubikey-side-channel-attack.html

There is a side-channel attack against YubiKey access tokens that allows someone to clone a device. It’s a complicated attack, requiring the victim’s username and password, and physical access to their YubiKey—as well as some technical expertise and equipment.

Still, nice piece of security analysis.

Man pages maintenance suspended

2024-09-06 corbet

Post Syndicated from corbet original https://lwn.net/Articles/989215/

Alejandro Colomar, who has been maintaining the Linux man pages for the
last four years, has announced
that he will have to stop that work.

I’ve been doing it in my free time, and no company has sponsored
that work at all. At the moment, I cannot sustain this work
economically any more, and will temporarily and indefinitely stop
working on this project. If any company has interests in the
future of the project, I’d welcome an offer to sponsor my work
here; if so, please let me know.

The realtime preemption end game — for real this time

2024-09-06 corbet

Post Syndicated from corbet original https://lwn.net/Articles/989212/

Work on realtime preemption for the Linux kernel got its start almost exactly 20 years ago
(though it had its roots in earlier work, of course). It is fair to say
that finishing that job has taken a bit longer than anybody involved would
have expected. Now, though, Sebastian Andrzej Siewior has posted a brief
patch series making it possible to enable realtime preemption in the
mainline kernel on three architectures.

With the printk bits merged, PREEMPT_RT could be enabled on X86,
ARM64 and Risc-V. These three architectures merged required changes
over the years leaving me in a position where I have no essential
changes in the queue that would affect them.

Congratulations are due to the many developers who have worked on this
project for the last two decades.

Security updates for Friday

2024-09-06 corbet

Post Syndicated from corbet original https://lwn.net/Articles/989196/

Security updates have been issued by AlmaLinux (bubblewrap, flatpak), Debian (libxml2), Fedora (lua-mpack, mingw-python3, python-django, python-django4.2, python3.11, python3.13, and python3.9), Oracle (bubblewrap, flatpak), Red Hat (fence-agents, python-urllib3, resource-agents, and wget), Slackware (expat and mozilla), SUSE (buildah, chromium, firefox, gradle, java-1_8_0-ibm, kubernetes1.26, postgresql16, python-Django, python312-pip, and systemd), and Ubuntu (python-aiohttp).

This is the Most Unique Retro Game Convention

2024-09-06 LGR

Post Syndicated from LGR original https://www.youtube.com/watch?v=aUliKw3ZpoA

Explosion at Sea: 3 USS Princetons

2024-09-06 The History Guy: History Deserves to Be Remembered

Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=CHIocuMNeuU

Start With a Lie

2024-09-06 The Atlantic

Post Syndicated from The Atlantic original https://www.youtube.com/watch?v=MwscvKiQheg

Comic for 2024.09.06 – Invisible Paint

2024-09-06 Explosm.net

Post Syndicated from Explosm.net original https://explosm.net/comics/invisible-paint

New Cyanide and Happiness Comic

Where to Build a Naval Supply Depot

2024-09-06 The History Guy: History Deserves to Be Remembered

Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=vLQHy4ByoUc

Water Filtration

2024-09-06 xkcd.com

Post Syndicated from xkcd.com original https://xkcd.com/2982/

You'd think the most expensive part would be the quark-gluon plasma chamber, but it's actually usually the tube to the top of the atmosphere to carry the cosmic rays down.

Micron HBM3E 12-High 36GB Higher Capacity AI Accelerators Shipping

2024-09-06 Cliff Robinson

Post Syndicated from Cliff Robinson original https://www.servethehome.com/micron-hbm3e-12-high-36gb-higher-capacity-ai-accelerators-shipping/

Micron HBM3E 12-High will bring 36GB packages for higher capacity AI accelerators in the coming months as it starts to ship

The post Micron HBM3E 12-High 36GB Higher Capacity AI Accelerators Shipping appeared first on ServeTheHome.

Amazon Redshift data ingestion options

2024-09-05 Steve Phillips

Post Syndicated from Steve Phillips original https://aws.amazon.com/blogs/big-data/amazon-redshift-data-ingestion-options/

Amazon Redshift, a warehousing service, offers a variety of options for ingesting data from diverse sources into its high-performance, scalable environment. Whether your data resides in operational databases, data lakes, on-premises systems, Amazon Elastic Compute Cloud (Amazon EC2), or other AWS services, Amazon Redshift provides multiple ingestion methods to meet your specific needs. The currently available choices include:

The Amazon Redshift COPY command can load data from Amazon Simple Storage Service (Amazon S3), Amazon EMR, Amazon DynamoDB, or remote hosts over SSH. This native feature of Amazon Redshift uses massive parallel processing (MPP) to load objects directly from data sources into Redshift tables. Further, the auto-copy feature simplifies and automates data loading from Amazon S3 into Amazon Redshift.
Amazon Redshift federated queries run queries using source database compute with the results returned to Amazon Redshift.
Amazon Redshift zero-ETL integrations can load data from Amazon Aurora MySQL-Compatible Edition, Amazon Relational Database Service (Amazon RDS) for MySQL, Amazon RDS for PostgreSQL, and DynamoDB, with the added ability to perform transformations after loading.
The Amazon Redshift integration for Apache Spark combined with AWS Glue or Amazon EMR performs transformations before loading data into Amazon Redshift.
Amazon Redshift streaming supports ingestion of streaming sources, including Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon Data Firehose.
Finally, data can be loaded into Amazon Redshift with popular ETL tools like Informatica, Matillion and DBT Labs.

This post explores each option (as illustrated in the following figure), determines which are suitable for different use cases, and discusses how and why to select a specific Amazon Redshift tool or feature for data ingestion.

Amazon Redshift COPY command

The Redshift COPY command, a simple low-code data ingestion tool, loads data into Amazon Redshift from Amazon S3, DynamoDB, Amazon EMR, and remote hosts over SSH. It’s a fast and efficient way to load large datasets into Amazon Redshift. It uses massively parallel processing (MPP) architecture in Amazon Redshift to read and load large amounts of data in parallel from files or data from supported data sources. This allows you to utilize parallel processing by splitting data into multiple files, especially when the files are compressed.

Recommended use cases for the COPY command include loading large datasets and data from supported data sources. COPY automatically splits large uncompressed delimited text files into smaller scan ranges to utilize the parallelism of Amazon Redshift provisioned clusters and serverless workgroups. With auto-copy, automation enhances the COPY command by adding jobs for automatic ingestion of data.

COPY command advantages:

Performance – Efficiently loads large datasets from Amazon S3 or other sources in parallel with optimized throughput
Simplicity – Straightforward and user-friendly, requiring minimal setup
Cost-optimized – Uses Amazon Redshift MPP at a lower cost by reducing data transfer time
Flexibility – Supports file formats such as CSV, JSON, Parquet, ORC, and AVRO

Amazon Redshift federated queries

Amazon Redshift federated queries allow you to incorporate live data from Amazon RDS or Aurora operational databases as part of business intelligence (BI) and reporting applications.

Federated queries are useful for use cases where organizations want to combine data from their operational systems with data stored in Amazon Redshift. Federated queries allow querying data across Amazon RDS for MySQL and PostgreSQL data sources without the need for extract, transform, and load (ETL) pipelines. If storing operational data in a data warehouse is a requirement, synchronization of tables between operational data stores and Amazon Redshift tables is supported. In scenarios where data transformation is required, you can use Redshift stored procedures to modify data in Redshift tables.

Federated queries key features:

Real-time access – Enables querying of live data across discrete sources, such as Amazon RDS and Aurora, without the need to move the data
Unified data view – Provides a single view of data across multiple databases, simplifying data analysis and reporting
Cost savings – Eliminates the need for ETL processes to move data into Amazon Redshift, saving on storage and compute costs
Flexibility – Supports Amazon RDS and Aurora data sources, offering flexibility in accessing and analyzing distributed data

Amazon Redshift Zero-ETL integration

Aurora zero-ETL integration with Amazon Redshift allows access to operational data from Amazon Aurora MySQL-Compatible (and Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL in preview), and DynamoDB from Amazon Redshift without the need for ETL in near real time. You can use zero-ETL to simplify ingestion pipelines for performing change data capture (CDC) from an Aurora database to Amazon Redshift. Built on the integration of Amazon Redshift and Aurora storage layers, zero-ETL boasts simple setup, data filtering, automated observability, auto-recovery, and integration with either Amazon Redshift provisioned clusters or Amazon Redshift Serverless workgroups.

Zero-ETL integration benefits:

Seamless integration – Automatically integrates and synchronizes data between operational databases and Amazon Redshift without the need for custom ETL processes
Near real-time insights – Provides near real-time data updates, so the most current data is available for analysis
Ease of use – Simplifies data architecture by eliminating the need for separate ETL tools and processes
Efficiency – Minimizes data latency and provides data consistency across systems, enhancing overall data accuracy and reliability

Amazon Redshift integration for Apache Spark

The Amazon Redshift integration for Apache Spark, automatically included through Amazon EMR or AWS Glue, provides performance and security optimizations when compared to the community-provided connector. The integration enhances and simplifies security with AWS Identity and Access Management (IAM) authentication support. AWS Glue 4.0 provides a visual ETL tool for authoring jobs to read from and write to Amazon Redshift, using the Redshift Spark connector for connectivity. This simplifies the process of building ETL pipelines to Amazon Redshift. The Spark connector allows use of Spark applications to process and transform data before loading into Amazon Redshift. The integration minimizes the manual process of setting up a Spark connector and shortens the time needed to prepare for analytics and machine learning (ML) tasks. It allows you to specify the connection to a data warehouse and start working with Amazon Redshift data from your Apache Spark-based applications within minutes.

The integration provides pushdown capabilities for sort, aggregate, limit, join, and scalar function operations to optimize performance by moving only the relevant data from Amazon Redshift to the consuming Apache Spark application. Spark jobs are suitable for data processing pipelines and when you need to use Spark’s advanced data transformation capabilities.

With the Amazon Redshift integration for Apache Spark, you can simplify the building of ETL pipelines with data transformation requirements. It offers the following benefits:

High performance – Uses the distributed computing power of Apache Spark for large-scale data processing and analysis
Scalability – Effortlessly scales to handle massive datasets by distributing computation across multiple nodes
Flexibility – Supports a wide range of data sources and formats, providing versatility in data processing tasks
Interoperability – Seamlessly integrates with Amazon Redshift for efficient data transfer and queries

Amazon Redshift streaming ingestion

The key benefit of Amazon Redshift streaming ingestion is the ability to ingest hundreds of megabytes of data per second directly from streaming sources into Amazon Redshift with very low latency, supporting real-time analytics and insights. Supporting streams from Kinesis Data Streams, Amazon MSK, and Data Firehose, streaming ingestion requires no data staging, supports flexible schemas, and is configured with SQL. Streaming ingestion powers real-time dashboards and operational analytics by directly ingesting data into Amazon Redshift materialized views.

Amazon Redshift streaming ingestion unlocks near real-time streaming analytics with:

Low latency – Ingests streaming data in near real time, making streaming ingestion ideal for time-sensitive applications such as Internet of Things (IoT), financial transactions, and clickstream analysis
Scalability – Manages high throughput and large volumes of streaming data from sources such as Kinesis Data Streams, Amazon MSK, and Data Firehose
Integration – Integrates with other AWS services to build end-to-end streaming data pipelines
Continuous updates – Keeps data in Amazon Redshift continuously updated with the latest information from the data streams

Amazon Redshift ingestion use cases and examples

In this section, we discuss the details of different Amazon Redshift ingestion use cases and provide examples.

Redshift COPY use case: Application log data ingestion and analysis

Ingesting application log data stored in Amazon S3 is a common use case for the Redshift COPY command. Data engineers in an organization need to analyze application log data to gain insights into user behavior, identify potential issues, and optimize a platform’s performance. To achieve this, data engineers ingest log data in parallel from multiple files stored in S3 buckets into Redshift tables. This parallelization uses the Amazon Redshift MPP architecture, allowing for faster data ingestion compared to other ingestion methods.

The following code is an example of the COPY command loading data from a set of CSV files in an S3 bucket into a Redshift table:

COPY myschema.mytable
FROM 's3://my-bucket/data/files/'
IAM_ROLE ‘arn:aws:iam::1234567891011:role/MyRedshiftRole’
FORMAT AS CSV;

This code uses the following parameters:

mytable is the target Redshift table for data load
‘s3://my-bucket/data/files/‘ is the S3 path where the CSV files are located
IAM_ROLE specifies the IAM role required to access the S3 bucket
FORMAT AS CSV specifies that the data files are in CSV format

In addition to Amazon S3, the COPY command loads data from other sources, such as DynamoDB, Amazon EMR, remote hosts through SSH, or other Redshift databases. The COPY command provides options to specify data formats, delimiters, compression, and other parameters to handle different data sources and formats.

To get started with the COPY command, see Using the COPY command to load from Amazon S3.

Federated queries use case: Integrated reporting and analytics for a retail company

For this use case, a retail company has an operational database running on Amazon RDS for PostgreSQL, which stores real-time sales transactions, inventory levels, and customer information data. Additionally, a data warehouse runs on Amazon Redshift storing historical data for reporting and analytics purposes. To create an integrated reporting solution that combines real-time operational data with historical data in the data warehouse, without the need for multi-step ETL processes, complete the following steps:

Set up network connectivity. Make sure your Redshift cluster and RDS for PostgreSQL instance are in the same virtual private cloud (VPC) or have network connectivity established through VPC peering, AWS PrivateLink, or AWS Transit Gateway.
Create a secret and IAM role for federated queries:
1. In AWS Secrets Manager, create a new secret to store the credentials (user name and password) for your Amazon RDS for PostgreSQL instance.
2. Create an IAM role with permissions to access the Secrets Manager secret and the Amazon RDS for PostgreSQL instance.
3. Associate the IAM role with your Amazon Redshift cluster.
Create an external schema in Amazon Redshift:
1. Connect to your Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
2. Create an external schema that references your Amazon RDS for PostgreSQL instance:

CREATE EXTERNAL SCHEMA postgres_schema
FROM POSTGRES
DATABASE 'mydatabase'
SCHEMA 'public'
URI 'endpoint-for-your-rds-instance.aws-region.rds.amazonaws.com:5432'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRoleForRDS'
SECRET_ARN 'arn:aws:secretsmanager:aws-region:123456789012:secret:my-rds-secret-abc123';

Query tables in your Amazon RDS for PostgreSQL instance directly from Amazon Redshift using federated queries:

SELECT
    r.order_id,
    r.order_date,
    r.customer_name,
    r.total_amount,
    h.product_name,
    h.category
FROM
    postgres_schema.orders r
    JOIN redshift_schema.product_history h ON r.product_id = h.product_id
WHERE
    r.order_date >= '2024-01-01';

Create views or materialized views in Amazon Redshift that combine the operational data from federated queries with the historical data in Amazon Redshift for reporting purposes:

CREATE MATERIALIZED VIEW sales_report AS
SELECT
    r.order_id,
    r.order_date,
    r.customer_name,
    r.total_amount,
    h.product_name,
    h.category,
    h.historical_sales
FROM
    (
        SELECT
            order_id,
            order_date,
            customer_name,
            total_amount,
            product_id
        FROM
            postgres_schema.orders
    ) r
    JOIN redshift_schema.product_history h ON r.product_id = h.product_id;

With this implementation, federated queries in Amazon Redshift integrate real-time operational data from Amazon RDS for PostgreSQL instances with historical data in a Redshift data warehouse. This approach eliminates the need for multi-step ETL processes and enables you to create comprehensive reports and analytics that combine data from multiple sources.

To get started with Amazon Redshift federated query ingestion, see Querying data with federated queries in Amazon Redshift.

Zero-ETL integration use case: Near real-time analytics for an ecommerce application

Suppose an ecommerce application built on Aurora MySQL-Compatible manages online orders, customer data, and product catalogs. To perform near real-time analytics with data filtering on transactional data to gain insights into customer behavior, sales trends, and inventory management without the overhead of building and maintaining multi-step ETL pipelines, you can use zero-ETL integrations for Amazon Redshift. Complete the following steps:

Set up an Aurora MySQL cluster (must be running Aurora MySQL version 3.05-compatible with MySQL 8.0.32 or higher):
1. Create an Aurora MySQL cluster in your desired AWS Region.
2. Configure the cluster settings, such as the instance type, storage, and backup options.
Create a zero-ETL integration with Amazon Redshift:
1. On the Amazon RDS console, navigate to the Zero-ETL integrations
2. Choose Create integration and select your Aurora MySQL cluster as the source.
3. Choose an existing Redshift cluster or create a new cluster as the target.
4. Provide a name for the integration and review the settings.
5. Choose Create integration to initiate the zero-ETL integration process.
Verify the integration status:
1. After the integration is created, monitor the status on the Amazon RDS console or by querying the SVV_INTEGRATION and SYS_INTEGRATION_ACTIVITY system views in Amazon Redshift.
2. Wait for the integration to reach the Active state, indicating that data is being replicated from Aurora to Amazon Redshift.
Create analytics views:
1. Connect to your Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
2. Create views or materialized views that combine and transform the replicated data from Aurora for your analytics use cases:

CREATE MATERIALIZED VIEW orders_summary AS
SELECT
    o.order_id,
    o.customer_id,
    SUM(oi.quantity * oi.price) AS total_revenue,
    MAX(o.order_date) AS latest_order_date
FROM
    aurora_schema.orders o
    JOIN aurora_schema.order_items oi ON o.order_id = oi.order_id
GROUP BY
    o.order_id,
    o.customer_id;

Query the views or materialized views in Amazon Redshift to perform near real-time analytics on the transactional data from your Aurora MySQL cluster:

SELECT
	customer_id,
	SUM(total_revenue) AS total_customer_revenue,
	MAX(latest_order_date) AS most_recent_order
FROM
	orders_summary
GROUP BY
	customer_id
ORDER BY
	total_customer_revenue DESC;

This implementation achieves near real-time analytics for an ecommerce application’s transactional data using the zero-ETL integration between Aurora MySQL-Compatible and Amazon Redshift. The data automatically replicates from Aurora to Amazon Redshift, eliminating the need for multi-step ETL pipelines and supporting insights from the latest data quickly.

To get started with Amazon Redshift zero-ETL integrations, see Working with zero-ETL integrations. To learn more about Aurora zero-ETL integrations with Amazon Redshift, see Amazon Aurora zero-ETL integrations with Amazon Redshift.

Integration for Apache Spark use case: Gaming player events written to Amazon S3

Consider a large volume of gaming player events stored in Amazon S3. The events require data transformation, cleansing, and preprocessing to extract insights, generate reports, or build ML models. In this case, you can use the scalability and processing power of Amazon EMR to perform the required data changes using Apache Spark. After it’s processed, the transformed data must be loaded into Amazon Redshift for further analysis, reporting, and integration with BI tools.

In this scenario, you can use the Amazon Redshift integration for Apache Spark to perform the necessary data transformations and load the processed data into Amazon Redshift. The following implementation example assumes gaming player events in Parquet format are stored in Amazon S3 (s3://<bucket_name>/player_events/).

Launch an Amazon EMR (emr-6.9.0) cluster with Apache Spark (Spark 3.3.0) with Amazon Redshift integration with Apache Spark support.
Configure the necessary IAM role for accessing Amazon S3 and Amazon Redshift.
Add security group rules to Amazon Redshift to allow access to the provisioned cluster or serverless workgroup.
Create a Spark job that sets up a connection to Amazon Redshift, reads data from Amazon S3, performs transformations, and writes resulting data to Amazon Redshift. See the following code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
import os

def main():

	# Create a SparkSession
	spark = SparkSession.builder \
    		.appName("RedshiftSparkJob") \
    		.getOrCreate()

	# Set Amazon Redshift connection properties
	Redshift_jdbc_url = "jdbc:redshift://<redshift-endpoint>:<port>/<database>"
	redshift_table = "<schema>.<table_name>"
	temp_s3_bucket = "s3://<bucket_name>/temp/"
	iam_role_arn = "<iam_role_arn>"

	# Read data from Amazon S3
	s3_data = spark.read.format("parquet") \
    		.load("s3://<bucket_name>/player_events/")

	# Perform transformations
	transformed_data = s3_data.withColumn("transformed_column", lit("transformed_value"))

	# Write the transformed data to Amazon Redshift
	transformed_data.write \
    		.format("io.github.spark_redshift_community.spark.redshift") \
    		.option("url", redshift_jdbc_url) \
    		.option("dbtable", redshift_table) \
    		.option("tempdir", temp_s3_bucket) \
    		.option("aws_iam_role", iam_role_arn) \
    		.mode("overwrite") \
    		.save()

if __name__ == "__main__":
    main()

In this example, you first import the necessary modules and create a SparkSession. Set the connection properties for Amazon Redshift, including the endpoint, port, database, schema, table name, temporary S3 bucket path, and the IAM role ARN for authentication. Read data from Amazon S3 in Parquet format using the spark.read.format("parquet").load() method. Perform a transformation on the Amazon S3 data by adding a new column transformed_column with a constant value using the withColumn method and the lit function. Write the transformed data to Amazon Redshift using the write method and the io.github.spark_redshift_community.spark.redshift format. Set the necessary options for the Redshift connection URL, table name, temporary S3 bucket path, and IAM role ARN. Use the mode("overwrite") option to overwrite the existing data in the Amazon Redshift table with the transformed data.

To get started with Amazon Redshift integration for Apache Spark, see Amazon Redshift integration for Apache Spark. For more examples of using the Amazon Redshift for Apache Spark connector, see New – Amazon Redshift Integration with Apache Spark.

Streaming ingestion use case: IoT telemetry near real-time analysis

Imagine a fleet of IoT devices (sensors and industrial equipment) that generate a continuous stream of telemetry data such as temperature readings, pressure measurements, or operational metrics. Ingesting this data in real time to perform analytics to monitor the devices, detect anomalies, and make data-driven decisions requires a streaming solution integrated with a Redshift data warehouse.

In this example, we use Amazon MSK as the streaming source for IoT telemetry data.

Create an external schema in Amazon Redshift:
1. Connect to an Amazon Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
2. Create an external schema that references the MSK cluster:

CREATE EXTERNAL SCHEMA kafka_schema
FROM KAFKA
BROKER 'broker-1.example.com:9092,broker-2.example.com:9092'
TOPIC 'iot-telemetry-topic'
REGION 'us-east-1'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRoleForMSK';

Create a materialized view in Amazon Redshift:
1. Define a materialized view that maps the Kafka topic data to Amazon Redshift table columns.
2. CAST the streaming message payload data type to the Amazon Redshift SUPER type.
3. Set the materialized view to auto refresh.

CREATE MATERIALIZED VIEW iot_telemetry_view
AUTO REFRESH YES
AS SELECT
    kafka_partition,
    kafka_offset,
    kafka_timestamp_type,
    kafka_timestamp,
    CAST(kafka_value AS SUPER) payload
FROM kafka_schema.iot-telemetry-topic;

Query the iot_telemetry_view materialized view to access the real-time IoT telemetry data ingested from the Kafka topic. The materialized view will automatically refresh as new data arrives in the Kafka topic.

SELECT
    kafka_timestamp,
    payload:device_id,
    payload:temperature,
    payload:pressure
FROM iot_telemetry_view;

With this implementation, you can achieve near real-time analytics on IoT device telemetry data using Amazon Redshift streaming ingestion. As telemetry data is received by an MSK topic, Amazon Redshift automatically ingests and reflects the data in a materialized view, supporting query and analysis of the data in near real time.

To get started with Amazon Redshift streaming ingestion, see Streaming ingestion to a materialized view. To learn more about streaming and customer use cases, see Amazon Redshift Streaming Ingestion.

Conclusion

This post detailed the options available for Amazon Redshift data ingestion. The choice of data ingestion method depends on factors such as the size and structure of data, the need for real-time access or transformations, data sources, existing infrastructure, ease of use, and user skill-sets. Zero-ETL integrations and federated queries are suitable for simple data ingestion tasks or joining data between operational databases and Amazon Redshift analytics data. Large-scale data ingestion with transformation and orchestration benefit from Amazon Redshift integration with Apache Spark with Amazon EMR and AWS Glue. Bulk loading of data into Amazon Redshift regardless of dataset size fits perfectly with the capabilities of the Redshift COPY command. Utilizing streaming sources such as Kinesis Data Streams, Amazon MSK, or Data Firehose are ideal scenarios for utilizing AWS streaming services integration for data ingestion.

Evaluate the features and guidance provided for your data ingestion workloads and let us know your feedback in the comments.

About the Authors

Steve Phillips is a senior technical account manager at AWS in the North America region. Steve has worked with games customers for eight years and currently focuses on data warehouse architectural design, data lakes, data ingestion pipelines, and cloud distributed architectures.

Sudipta Bagchi is a Sr. Specialist Solutions Architect at Amazon Web Services. He has over 14 years of experience in data and analytics, and helps customers design and build scalable and high-performant analytics solutions. Outside of work, he loves running, traveling, and playing cricket.

Honey, I shrunk the PHP payloads

New module content (1)

PHP Minify Encoder

Enhancements and features (2)

Bugs fixed (2)

Documentation

Get it

Amazon Redshift COPY command

Amazon Redshift federated queries

Federated queries key features:

Amazon Redshift Zero-ETL integration

Zero-ETL integration benefits:

Amazon Redshift integration for Apache Spark

Amazon Redshift streaming ingestion

Amazon Redshift ingestion use cases and examples

Redshift COPY use case: Application log data ingestion and analysis

Federated queries use case: Integrated reporting and analytics for a retail company

Zero-ETL integration use case: Near real-time analytics for an ecommerce application

Integration for Apache Spark use case: Gaming player events written to Amazon S3

Streaming ingestion use case: IoT telemetry near real-time analysis

Conclusion

About the Authors

The collective thoughts of the interwebz