Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=MxFxlGkj9AE
Yearly Archives: 2024
Salt Typhoon’s Reach Continues to Grow
Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/12/salt-typhoons-reach-continues-to-grow.html
The US government has identified a ninth telecom that was successfully hacked by Salt Typhoon.
Best of “How To”: Make Small Talk
Post Syndicated from The Atlantic original https://www.youtube.com/watch?v=RSNmJGWGwsk
2024-12-30 taking stock
Post Syndicated from Vasil Kolev original https://vasil.ludost.net/blog/?p=3490
Taking stock…
(looking at the archive, I had no expectations for this year. That seems to have been right, because there was no chance I would have guessed a good many of the things…)
– The kids, as usual, keep growing. Since raising kids in Sofia is a slightly extreme experience, they started school in Burgas (where we found them a very good one);
– We managed to make FOSDEM 2024 happen. I never wrote anything on the topic, but I will probably write about the next one, because
– For most of the year we have been working on the next revision of FOSDEM's video-box. This has meant a lot of hardware work and, most recently, three days of assembly; anyone interested can take a look in the repo. I will try to write it up in more detail once we are done;
– OpenFest 2024 happened, in which I had no part, and during which a decent part of the team fell apart. Right now I have come back and am putting things in order (a core team is assembled, we have talked with a few sponsors, and there are even dates: 18 and 19 October at the tech park). I knew what I was getting into, I know it is unclear whether I should have done this to myself, and I could not leave things as they were;
– The OpenFest business was half the reason I wrote a talk this year, this time about teams. I would like my next talk to be about something technical, since I have at least 3 ideas;
– Work is interesting and plentiful (as usual). This year I managed to fill out my team and now we are doing a small reorganization;
– And lastly, chervarium passed away the week before last. We are thinking of raising a farewell glass to him on January 8 at ИББ…
Let's see 2025.
Crashed into the River
Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=LXvv9_9ZFKk
Time Capsule Instructions
Post Syndicated from xkcd.com original https://xkcd.com/3031/

digiblurDIY Live Giveaway of all things Smart Home & More
Post Syndicated from digiblur DIY original https://www.youtube.com/watch?v=FDUvSVfBDrU
Kernel prepatch 6.13-rc5
Post Syndicated from corbet original https://lwn.net/Articles/1003730/
The 6.13-rc5 kernel prepatch is out for testing. Linus says: “It’s been another week, but I’m happy to report that clearly most people actually seem to have been enjoying the holidays, because rc5 is tiny”.
Rough Rhinoplasty
Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=eojPv9nK6x4
STH Q4 2024 Letter from the Editor Re-aligning
Post Syndicated from Patrick Kennedy original https://www.servethehome.com/sth-q4-2024-letter-from-the-editor-re-aligning/
Every quarter, I like to do a small update to give our readers a behind-the-scenes look at what is happening. Often, there is a big difference between what folks see publicly and the inner workings of STH, so I like to peel that back. This quarter was a big lift on the growth side, and […]
The post STH Q4 2024 Letter from the Editor Re-aligning appeared first on ServeTheHome.
Cross-word Crazed
Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=5yj80iEoIzQ
Comic for 2024.12.29 – Dick Bigger
Post Syndicated from Explosm.net original https://explosm.net/comics/dick-bigger
New Cyanide and Happiness Comic
Vertiv Hydrogen Fuel Cell Quick Look
Post Syndicated from Patrick Kennedy original https://www.servethehome.com/vertiv-hydrogen-fuel-cell-quick-look/
We take a quick look at the Vertiv Hydrogen Fuel Cell solution that can generate 600kW of power and exhausts water and water vapor.
The post Vertiv Hydrogen Fuel Cell Quick Look appeared first on ServeTheHome.
Boston Invaded by Spacemen
Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=NlnixMHGuoE
Let’s talk about 2024 and my setup(s)
Post Syndicated from BeardedTinker original https://www.youtube.com/watch?v=NsLEzyi9Qu4
Gigaplus GP-S25-0802P PoE 8-port 2.5GbE and 2-port 10G Review
Post Syndicated from Rohit Kumar original https://www.servethehome.com/gigaplus-gp-s25-0802p-poe-8-port-2-5gbe-and-2-port-10g-review/
The Gigaplus GP-S25-0802P is an updated version of our favorite 2.5GbE switch with dual SFP+ 10G ports, now with PoE+ features.
The post Gigaplus GP-S25-0802P PoE 8-port 2.5GbE and 2-port 10G Review appeared first on ServeTheHome.
First Allied Major Amphibious Invasion of WWII
Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=iy3UlZ13xzs
Comic for 2024.12.28 – Time Machine
Post Syndicated from Explosm.net original https://explosm.net/comics/time-machine-2
New Cyanide and Happiness Comic
Peter Attia on the science and art of longevity
Post Syndicated from Talks at Google original https://www.youtube.com/watch?v=vFRdsQYnI00
Amazon EMR 7.5 runtime for Apache Spark and Iceberg can run Spark workloads 3.6 times faster than Spark 3.5.3 and Iceberg 1.6.1
Post Syndicated from Atul Payapilly original https://aws.amazon.com/blogs/big-data/amazon-emr-7-5-runtime-for-apache-spark-and-iceberg-can-run-spark-workloads-3-6-times-faster-than-spark-3-5-3-and-iceberg-1-6-1/
The Amazon EMR runtime for Apache Spark offers a high-performance runtime environment while maintaining 100% API compatibility with open source Apache Spark and Apache Iceberg table format. Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, Amazon EMR on AWS Outposts and AWS Glue all use the optimized runtimes.
In this post, we demonstrate the performance benefits of using the Amazon EMR 7.5 runtime for Spark and Iceberg compared to open source Spark 3.5.3 with Iceberg 1.6.1 tables on the TPC-DS 3TB benchmark v2.13.
Iceberg is a popular open source high-performance format for large analytic tables. Our benchmarks demonstrate that Amazon EMR can run TPC-DS 3 TB workloads 3.6 times faster, reducing the runtime from 1.54 hours to 0.42 hours. Additionally, the cost efficiency improves by 2.9 times, with the total cost decreasing from $16.00 to $5.39 when using Amazon Elastic Compute Cloud (Amazon EC2) On-Demand r5d.4xlarge instances, providing observable gains for data processing tasks.
This is a further 32% increase from the optimizations shipped in Amazon EMR 7.1 covered in a previous post, Amazon EMR 7.1 runtime for Apache Spark and Iceberg can run Spark workloads 2.7 times faster than Apache Spark 3.5.1 and Iceberg 1.5.2. Since then, we have added DataSource V2 support to eight more existing query optimizations in the EMR runtime for Spark.
In addition to these DataSource V2 specific improvements, we have made more optimizations to Spark operators since Amazon EMR 7.1 that also contribute to the additional speedup.
Benchmark results for Amazon EMR 7.5 compared to open source Spark 3.5.3 and Iceberg 1.6.1
To assess the Spark engine’s performance with the Iceberg table format, we performed benchmark tests using the 3 TB TPC-DS dataset, version 2.13 (our results derived from the TPC-DS dataset are not directly comparable to the official TPC-DS results due to setup differences). Benchmark tests for the EMR runtime for Spark and Iceberg were conducted on Amazon EMR 7.5 EC2 clusters vs open source Spark 3.5.3 and Iceberg 1.6.1 on EC2 clusters.
The setup instructions and technical details are available in our GitHub repository. To minimize the influence of external catalogs like AWS Glue and Hive, we used the Hadoop catalog for the Iceberg tables. This uses the underlying file system, specifically Amazon S3, as the catalog. We can define this setup by configuring the property spark.sql.catalog.<catalog_name>.type. The fact tables used the default partitioning by the date column, with the number of partitions ranging from 200 to 2,100. No precalculated statistics were used for these tables.
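As a rough illustration of the catalog setup described above, the following sketch builds the Spark properties for an Iceberg catalog of type hadoop and flattens them into spark-submit style arguments. The helper names, the catalog name hadoop_catalog, and the warehouse path are assumptions made for this example; the exact settings used in the benchmark are in the GitHub repository.

```python
def hadoop_catalog_conf(catalog_name, warehouse):
    """Return Spark properties for an Iceberg catalog of type 'hadoop'."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Iceberg's Spark catalog implementation class.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # 'hadoop' keeps table metadata directly in the warehouse file system.
        f"{prefix}.type": "hadoop",
        # Root location of the tables; an S3 path in this setup.
        f"{prefix}.warehouse": warehouse,
    }


def to_submit_args(conf):
    """Flatten properties into spark-submit style --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical catalog name and bucket placeholder, for illustration only.
conf = hadoop_catalog_conf("hadoop_catalog", "s3://<YOUR_S3_BUCKET>/warehouse/")
print(" ".join(to_submit_args(conf)))
```

Keeping the properties in a dictionary makes it easy to reuse the same catalog definition for both the Amazon EMR run and the open source run.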
We ran a total of 104 SparkSQL queries in three sequential rounds, and the average runtime of each query across these rounds was taken for comparison. The average runtime for the three rounds on Amazon EMR 7.5 with Iceberg enabled was 0.42 hours, demonstrating a 3.6-fold speed increase compared to open source Spark 3.5.3 and Iceberg 1.6.1. The following figure presents the total runtimes in seconds.

The following table summarizes the metrics.
| Metric | Amazon EMR 7.5 on EC2 | Amazon EMR 7.1 on EC2 | Open Source Spark 3.5.3 and Iceberg 1.6.1 |
| Average runtime in seconds | 1535.62 | 2033.17 | 5546.16 |
| Geometric mean over queries in seconds | 8.30046 | 10.13153 | 20.40555 |
| Cost* | $5.39 | $7.18 | $16.00 |
*Detailed cost estimates are discussed later in this post.
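The headline ratios can be rechecked from the table values. The sketch below is only arithmetic on the published numbers; the variable and function names are chosen for this example.

```python
# Average total runtimes in seconds, copied from the metrics table.
avg_runtime_s = {"emr_7_5": 1535.62, "emr_7_1": 2033.17, "oss": 5546.16}
# Geometric mean over queries in seconds, copied from the same table.
geomean_s = {"emr_7_5": 8.30046, "emr_7_1": 10.13153, "oss": 20.40555}


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


vs_oss = speedup(avg_runtime_s["oss"], avg_runtime_s["emr_7_5"])
vs_emr_7_1 = speedup(avg_runtime_s["emr_7_1"], avg_runtime_s["emr_7_5"])
geomean_vs_oss = speedup(geomean_s["oss"], geomean_s["emr_7_5"])

print(f"EMR 7.5 vs open source (total runtime): {vs_oss:.2f}x")
print(f"EMR 7.5 vs EMR 7.1 (total runtime): {vs_emr_7_1:.2f}x")
print(f"EMR 7.5 vs open source (geometric mean): {geomean_vs_oss:.2f}x")
```

The first ratio is the 3.6-fold speedup, and the second (about 1.32) corresponds to the further 32% improvement over Amazon EMR 7.1.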
The following chart demonstrates the per-query performance improvement of Amazon EMR 7.5 relative to open source Spark 3.5.3 and Iceberg 1.6.1. The extent of the speedup varies from one query to another, with the fastest up to 9.4 times faster for q93, with Amazon EMR outperforming open source Spark with Iceberg tables. The horizontal axis arranges the TPC-DS 3TB benchmark queries in descending order based on the performance improvement seen with Amazon EMR, and the vertical axis depicts the magnitude of this speedup as a ratio.
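The ordering used on the chart's horizontal axis can be sketched as follows. The query names and runtimes here are made up for illustration and are not benchmark results; only the method (ratio of open source runtime to Amazon EMR runtime, sorted in descending order) follows the description above.

```python
def per_query_speedup(baseline, candidate):
    """Return (query, speedup) pairs sorted from largest to smallest speedup."""
    ratios = {q: baseline[q] / candidate[q] for q in baseline if q in candidate}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Made-up per-query runtimes in seconds (hypothetical query names).
oss_runtime = {"query_a": 90.0, "query_b": 40.0, "query_c": 12.0}
emr_runtime = {"query_a": 10.0, "query_b": 20.0, "query_c": 10.0}

ranked = per_query_speedup(oss_runtime, emr_runtime)
for query, ratio in ranked:
    print(f"{query}: {ratio:.1f}x")
```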

Cost comparison
Our benchmark provides the total runtime and geometric mean data to assess the performance of Spark and Iceberg in a complex, real-world decision support scenario. For additional insights, we also examine the cost aspect. We calculate cost estimates using formulas that account for EC2 On-Demand instances, Amazon Elastic Block Store (Amazon EBS), and Amazon EMR expenses.
- Amazon EC2 cost (includes SSD cost) = number of instances * r5d.4xlarge hourly rate * job runtime in hours
- r5d.4xlarge hourly rate = $1.152 per hour in us-east-1
- Root Amazon EBS cost = number of instances * Amazon EBS per GB-hourly rate * root EBS volume size * job runtime in hours
- Amazon EMR cost = number of instances * r5d.4xlarge Amazon EMR cost * job runtime in hours
- r5d.4xlarge Amazon EMR cost = $0.27 per hour
- Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost
The calculations reveal that the Amazon EMR 7.5 benchmark yields a 2.9-fold cost efficiency improvement over open source Spark 3.5.3 and Iceberg 1.6.1 in running the benchmark job.
| Metric | Amazon EMR 7.5 | Amazon EMR 7.1 | Open Source Spark 3.5.3 and Iceberg 1.6.1 |
| Runtime in hours | 0.426 | 0.564 | 1.540 |
| Number of EC2 instances (includes primary node) | 9 | 9 | 9 |
| Amazon EBS size | 20 GB | 20 GB | 20 GB |
| Amazon EC2 (total runtime cost) | $4.35 | $5.81 | $15.97 |
| Amazon EBS cost | $0.01 | $0.01 | $0.04 |
| Amazon EMR cost | $1.02 | $1.36 | $0 |
| Total cost | $5.38 | $7.18 | $16.01 |
| Cost savings | Amazon EMR 7.5 is 2.9 times better | Amazon EMR 7.1 is 2.2 times better | Baseline |
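The cost formulas can be sketched in a few lines. The instance count, hourly rates, and runtimes (0.42 and 1.54 hours) come from this post. The EBS rate is not stated here, so the sketch assumes $0.10 per GB-month spread over 730 hours; treat that value, and the function name, as assumptions.

```python
EC2_RATE = 1.152            # r5d.4xlarge On-Demand, USD per hour in us-east-1
EMR_RATE = 0.27             # r5d.4xlarge Amazon EMR charge, USD per hour
EBS_RATE = 0.10 / 730       # ASSUMED: USD per GB-hour ($0.10 per GB-month)


def total_cost(runtime_hours, instances=9, ebs_gb=20, emr_rate=EMR_RATE):
    """Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost."""
    ec2 = instances * EC2_RATE * runtime_hours
    ebs = instances * EBS_RATE * ebs_gb * runtime_hours
    emr = instances * emr_rate * runtime_hours
    return ec2 + ebs + emr


emr_7_5_cost = total_cost(0.42)
# The open source cluster pays no Amazon EMR charge.
oss_cost = total_cost(1.54, emr_rate=0.0)

print(f"Amazon EMR 7.5: ${emr_7_5_cost:.2f}")
print(f"Open source:    ${oss_cost:.2f}")
print(f"Cost ratio:     {oss_cost / emr_7_5_cost:.2f}")
```

Under these assumptions the totals land within a cent of the figures quoted in this post; small differences against the table come from rounding each component before summing.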
In addition to the time-based metrics discussed so far, data from Spark event logs show that Amazon EMR scanned approximately 3.4 times less data from Amazon S3 and 4.1 times fewer records than the open source version in the TPC-DS 3 TB benchmark. This reduction in Amazon S3 data scanning contributes directly to cost savings for Amazon EMR workloads.
Run open source Spark benchmarks on Iceberg tables
We used separate EC2 clusters, each equipped with nine r5d.4xlarge instances, to test open source Spark 3.5.3 and Amazon EMR 7.5 on the Iceberg workload. The primary node had 16 vCPU and 128 GB of memory, and the eight worker nodes together had 128 vCPU and 1024 GB of memory. We conducted tests using the Amazon EMR default settings to showcase the typical user experience and minimally adjusted the settings of Spark and Iceberg to maintain a balanced comparison.
The following table summarizes the Amazon EC2 configurations for the primary node and eight worker nodes of type r5d.4xlarge.
| EC2 Instance | vCPU | Memory (GiB) | Instance Storage (GB) | EBS Root Volume (GB) |
| r5d.4xlarge | 16 | 128 | 2 x 300 NVMe SSD | 20 GB |
Prerequisites
The following prerequisites are required to run the benchmarking:
- Using the instructions in the emr-spark-benchmark GitHub repo, set up the TPC-DS source data in your S3 bucket and on your local computer.
- Build the benchmark application following the steps provided in Steps to build spark-benchmark-assembly application and copy the benchmark application to your S3 bucket. Alternatively, copy spark-benchmark-assembly-3.5.3.jar to your S3 bucket.
- Create Iceberg tables from the TPC-DS source data. Follow the instructions on GitHub to create Iceberg tables using the Hadoop catalog. For example, the following code uses an EMR 7.5 cluster with Iceberg enabled to create the tables:
Note the Hadoop catalog warehouse location and database name from the preceding step. We use the same Iceberg tables to run benchmarks with Amazon EMR 7.5 and open source Spark.
This benchmark application is built from the branch tpcds-v2.13_iceberg. If you’re building a new benchmark application, switch to the correct branch after downloading the source code from the GitHub repo.
Create and configure a YARN cluster on Amazon EC2
To compare Iceberg performance between Amazon EMR on Amazon EC2 and open source Spark on Amazon EC2, follow the instructions in the emr-spark-benchmark GitHub repo to create an open source Spark cluster on Amazon EC2 using Flintrock with eight worker nodes.
Based on the cluster selection for this test, the following configurations are used:
Make sure to replace the placeholder <private ip of primary node>, in the yarn-site.xml file, with the primary node’s IP address of your Flintrock cluster.
Run the TPC-DS benchmark with Spark 3.5.3 and Iceberg 1.6.1
Complete the following steps to run the TPC-DS benchmark:
- Log in to the open source cluster primary node using flintrock login $CLUSTER_NAME.
- Submit your Spark job:
  - Choose the correct Iceberg catalog warehouse location and database that has the created Iceberg tables.
  - The results are created in s3://<YOUR_S3_BUCKET>/benchmark_run.
  - You can track progress in /media/ephemeral0/spark_run.log.
Summarize the results
After the Spark job finishes, retrieve the test result file from the output S3 bucket at s3://<YOUR_S3_BUCKET>/benchmark_run/timestamp=xxxx/summary.csv/xxx.csv. This can be done either through the Amazon S3 console by navigating to the specified bucket location or by using the AWS Command Line Interface (AWS CLI). The Spark benchmark application organizes the data by creating a timestamp folder and placing a summary file within a folder labeled summary.csv. The output CSV files contain four columns without headers:
- Query name
- Median time
- Minimum time
- Maximum time
With the data from three separate test runs with one iteration each time, we can calculate the average and geometric mean of the benchmark runtimes.
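One way to compute these two numbers is sketched below. It assumes the four-column headerless layout described above and reads the second column (median time); the function names and the tiny sample data are invented for this example, and real input would be the contents of the downloaded summary files.

```python
import csv
import io
from statistics import fmean, geometric_mean


def load_median_times(csv_text):
    """Parse one headerless summary CSV: query name, median, minimum, maximum."""
    times = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if row:
            times[row[0]] = float(row[1])
    return times


def summarize(runs):
    """Average each query across runs, then return the total and geometric mean."""
    per_query_avg = {q: fmean([run[q] for run in runs]) for q in sorted(runs[0])}
    total = sum(per_query_avg.values())
    geomean = geometric_mean(list(per_query_avg.values()))
    return per_query_avg, total, geomean


# Invented sample data standing in for three downloaded summary files.
run_files = [
    "q1,2.0,2.0,2.0\nq2,9.0,9.0,9.0\n",
    "q1,4.0,4.0,4.0\nq2,9.0,9.0,9.0\n",
    "q1,6.0,6.0,6.0\nq2,9.0,9.0,9.0\n",
]
runs = [load_median_times(text) for text in run_files]
per_query_avg, total, geomean = summarize(runs)
print(f"total of per-query averages: {total}, geometric mean: {geomean}")
```

Because each run uses a single iteration, the median, minimum, and maximum columns should coincide, so reading the median column is sufficient.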
Run the TPC-DS benchmark with the EMR runtime for Spark
Most of the instructions are similar to Steps to run Spark Benchmarking with a few Iceberg-specific details.
Prerequisites
Complete the following prerequisite steps:
- Run aws configure to configure the AWS CLI shell to point to the benchmarking AWS account. Refer to Configure the AWS CLI for instructions.
- Upload the benchmark application JAR file to Amazon S3.
Deploy the EMR cluster and run the benchmark job
Complete the following steps to run the benchmark job:
- Use the AWS CLI command as shown in Deploy EMR on EC2 Cluster and run benchmark job to spin up an EMR on EC2 cluster. Make sure to enable Iceberg. See Create an Iceberg cluster for more details. Choose the correct Amazon EMR version, root volume size, and same resource configuration as the open source Flintrock setup. Refer to create-cluster for a detailed description of the AWS CLI options.
- Store the cluster ID from the response. We need this for the next step.
- Submit the benchmark job in Amazon EMR using add-steps from the AWS CLI:
  - Replace <cluster ID> with the cluster ID from Step 2.
  - The benchmark application is at s3://<your-bucket>/spark-benchmark-assembly-3.5.3.jar.
  - Choose the correct Iceberg catalog warehouse location and database that has the created Iceberg tables. This should be the same as the one used for the open source TPC-DS benchmark run.
  - The results will be in s3://<your-bucket>/benchmark_run.
Summarize the results
After the step is complete, you can see the summarized benchmark result at s3://<YOUR_S3_BUCKET>/benchmark_run/timestamp=xxxx/summary.csv/xxx.csv in the same way as the previous run and compute the average and geometric mean of the query runtimes.
Clean up
To prevent any future charges, delete the resources you created by following the instructions provided in the Cleanup section of the GitHub repository.
Summary
Amazon EMR continues to enhance the EMR runtime for Spark when used with Iceberg tables. With EMR 7.5, it runs the TPC-DS 3 TB v2.13 benchmark 3.6 times faster than open source Spark 3.5.3 and Iceberg 1.6.1. This is a further increase of 32% from EMR 7.1. We encourage you to keep up to date with the latest Amazon EMR releases to fully benefit from ongoing performance improvements.
To stay informed, subscribe to the AWS Big Data Blog’s RSS feed, where you can find updates on the EMR runtime for Spark and Iceberg, as well as tips on configuration best practices and tuning recommendations.
About the Authors
Atul Felix Payapilly is a software development engineer for Amazon EMR at Amazon Web Services.
Udit Mehrotra is an Engineering Manager for EMR at Amazon Web Services.