AWS Weekly Roundup: New AWS Heroes, Amazon API Gateway, Amazon Q and more (June 10, 2024)

2024-06-10 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-new-aws-heroes-amazon-api-gateway-amazon-q-and-more-june-10-2024/

In the last AWS Weekly Roundup, Channy reminded us that life has its ups and downs. It’s just how life is. But that doesn’t mean we should face it alone. Farouq Mousa, an AWS Community Builder, is fighting brain cancer, and the daughter of Allen Helton, an AWS Serverless Hero, is fighting leukemia.

If you have a moment, please visit their campaign pages and give your support.

Meanwhile, we’ve just finished a few AWS Summits in India, Korea, and Thailand. As always, I had so much fun working at the Developer Lounge with AWS Heroes, AWS Community Builders, and AWS User Group leaders. Here’s a photo of everyone there.

Last Week’s Launches
Here are some launches that caught my attention last week:

Welcome, new AWS Heroes! — Last week, we announced a new cohort of AWS Heroes, a worldwide group of AWS experts who go above and beyond to share knowledge and empower their communities.

Amazon API Gateway increased integration timeout limit — If you’re using Regional REST APIs and private REST APIs in Amazon API Gateway, you can now raise the integration timeout limit beyond 29 seconds. This allows you to run workloads that require longer timeouts.

Amazon Q offers inline completion in the command line — Now, Amazon Q Developer provides real-time AI-generated code suggestions as you type in your command line. As a regular command line interface (CLI) user, I’m really excited about this.

New common control library in AWS Audit Manager — This announcement helps you save time when mapping enterprise controls into AWS Audit Manager. Check out Danilo’s post, where he explains how you can simplify risk and compliance assessment with the new common control library.

Amazon Inspector container image scanning for Amazon CodeCatalyst and GitHub Actions — If you need to integrate software vulnerability checks into your CI/CD pipeline, you can use Amazon Inspector. With this native integration in GitHub Actions and Amazon CodeCatalyst, it streamlines your development pipeline.

Ingest streaming data with Amazon OpenSearch Ingestion and Amazon Managed Streaming for Apache Kafka — With this new capability, you can build more efficient data pipelines for your complex analytics use cases. You can now seamlessly index the data from your Amazon MSK Serverless clusters in Amazon OpenSearch Service.

Amazon Titan Text Embeddings V2 now available in Amazon Bedrock Knowledge Base — You can now embed your data into a vector database using Amazon Titan Text Embeddings V2. This helps you retrieve relevant information for various tasks.

Max tokens	8,192
Languages	100+ in pre-training
Fine-tuning supported	No
Normalization supported	Yes
Vector size	256, 512, 1,024 (default)

From Community.aws
Here are my three personal favorite posts from community.aws:

Upcoming AWS events
Check your calendars and sign up for these AWS and AWS Community events:

AWS Summits — Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Japan (June 20), Washington, DC (June 26–27), and New York (July 10).
AWS re:Inforce — Join us for AWS re:Inforce (June 10–12) in Philadelphia, PA. AWS re:Inforce is a learning conference focused on AWS security solutions, cloud security, compliance, and identity. Connect with the AWS teams that build the security tools and meet AWS customers to learn about their security journeys.
AWS Community Days — Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Midwest | Columbus (June 13), Sri Lanka (June 27), Cameroon (July 13), New Zealand (August 15), Nigeria (August 24), and New York (August 28).

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— Donnie

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

2024-06-10 Yonatan Dolan

Post Syndicated from Yonatan Dolan original https://aws.amazon.com/blogs/big-data/how-cloudinary-transformed-their-petabyte-scale-streaming-data-lake-with-apache-iceberg-and-aws-analytics/

This post is co-written with Amit Gilad, Alex Dickman and Itay Takersman from Cloudinary.

Enterprises and organizations across the globe want to harness the power of data to make better decisions by putting data at the center of every decision-making process. Data-driven decisions lead to more effective responses to unexpected events, increase innovation, and allow organizations to create better experiences for their customers. However, data services have historically held dominion over their customers’ data. Despite the potential separation of storage and compute in terms of architecture, they are often effectively fused together. This amalgamation empowers vendors with authority over a diverse range of workloads by virtue of owning the data. This authority extends across realms such as business intelligence, data engineering, and machine learning, thus limiting the tools and capabilities that can be used.

The landscape of data technology is swiftly advancing, driven frequently by projects led by the open source community in general and the Apache Software Foundation specifically. This evolving open source landscape gives customers complete control over data storage, processing engines, and permissions, significantly expanding the array of available options. This approach also encourages vendors to compete based on the value they provide to businesses, rather than relying on potential fusing of storage and compute. This fosters a competitive environment that prioritizes customer acquisition and prompts vendors to differentiate themselves through unique features and offerings that cater directly to the specific needs and preferences of their clientele.

A modern data strategy redefines and enables sharing data across the enterprise and allows for both reading and writing of a singular instance of the data using an open table format. The open table format accelerates companies’ adoption of a modern data strategy because it allows them to use various tools on top of a single copy of the data.

Cloudinary is a cloud-based media management platform that provides a comprehensive set of tools and services for managing, optimizing, and delivering images, videos, and other media assets on websites and mobile applications. It’s widely used by developers, content creators, and businesses to streamline their media workflows, enhance user experiences, and optimize content delivery.

In this blog post, we dive into different data aspects and how Cloudinary addresses the two concerns of vendor lock-in and cost-efficient data analytics by using Apache Iceberg, Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon EMR, and AWS Glue.

Short overview of Cloudinary’s infrastructure

Cloudinary’s infrastructure handles over 20 billion requests daily, with every request generating event logs. Various data pipelines process these logs, storing petabytes (PBs) of data per month; after processing, the data stored on Amazon S3 is loaded into Snowflake Data Cloud. These datasets serve as a critical resource for Cloudinary internal teams and data science groups, enabling detailed analytics and advanced use cases.

Until recently, this data was mostly prepared by automated processes and aggregated into results tables used by only a few internal teams. Cloudinary struggled to make this data available to additional teams that had more online, real-time, lower-granularity, dynamic usage requirements. Making petabytes of data accessible for ad hoc reports became a challenge as query time increased and costs skyrocketed along with growing compute resource requirements. Cloudinary data retention for the specific analytical data discussed in this post was defined as 30 days. However, new use cases drove the need for increased retention, which would have led to significantly higher cost.

Data flows from Cloudinary log providers into files written to Amazon S3, and each new file triggers an event that is pushed to Amazon Simple Queue Service (Amazon SQS). Those SQS events are ingested by a Spark application running on Amazon EMR, which parses and enriches the data. The processed logs are written in Apache Parquet format back to Amazon S3 and then automatically loaded to a Snowflake table using Snowpipe.

Why Cloudinary chose Apache Iceberg

Apache Iceberg is a high-performance table format for huge analytic workloads. Apache Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for processing engines such as Apache Spark, Trino, Apache Flink, Presto, Apache Hive, and Impala to safely work with the same tables at the same time.

A solution based on Apache Iceberg encompasses complete data management, featuring simple built-in table optimization capabilities within an existing storage solution. These capabilities, along with the ability to use multiple engines on top of a singular instance of data, help avoid the need for data movement between various solutions.

While exploring the various controls and options in configuring Apache Iceberg, Cloudinary had to adapt its data to use AWS Glue Data Catalog, as well as move a significant volume of data to Apache Iceberg on Amazon S3. At this point it became clear that costs would be significantly reduced, and while it had been a key factor since the planning phase, it was now possible to get concrete numbers. One example is that Cloudinary was now able to store 6 months of data for the same storage price that was previously paid for storing 1 month of data. This cost saving was achieved by using Amazon S3 storage tiers as well as improved compression (Zstandard), further enhanced by the fact that Parquet files were sorted.

Since Apache Iceberg is well supported by AWS data services and Cloudinary was already using Spark on Amazon EMR, they could integrate writing to Data Catalog and start an additional Spark cluster to handle data maintenance and compaction. As exploration continued with Apache Iceberg, some interesting performance metrics were found. For example, for certain queries, Athena runtime was 2x–4x faster than Snowflake.

Integration of Apache Iceberg

The integration of Apache Iceberg was done before loading data to Snowflake. The data is written to an Iceberg table using Apache Parquet data format and AWS Glue as the data catalog. In addition, a Spark application on Amazon EMR runs in the background handling compaction of the Parquet files to optimal size for querying through various tools such as Athena, Trino running on top of EMR, and Snowflake.

Challenges faced

Cloudinary faced several challenges while building its petabyte-scale data lake, including:

Determining optimal table partitioning
Optimizing ingestion
Solving the small files problem to improve query performance
Cost effectively maintaining Apache Iceberg tables
Choosing the right query engine

In this section, we describe each of these challenges and the solutions implemented to address them. Many of the tests of performance and of the volume of data scanned used Athena because it provides a simple-to-use, fully serverless, cost-effective interface without the need to set up infrastructure.

Determining optimal table partitioning

Apache Iceberg makes partitioning easier for the user by implementing hidden partitioning. Rather than forcing the user to supply a separate partition filter at query time, Iceberg tables can be configured to map regular columns to the partition keys. Users don’t need to maintain partition columns or even understand the physical table layout to get fast and accurate query results.

Iceberg has several partitioning options. Timestamps, for example, can be partitioned by year, month, day, or hour, and Iceberg keeps track of the relationship between a column value and its partition without requiring additional columns. Iceberg can also partition categorical column values by identity, hash buckets, or truncation.

In addition, Iceberg partitioning is user-friendly because partition layouts can evolve over time without breaking existing queries. For example, if a table uses daily partitions and the query pattern shifts to hourly filters, the partitions can be evolved to hourly ones to make queries more efficient. When a partition definition evolves, the data already in the table is unaffected, as is its metadata. Only data written after the evolution is partitioned with the new definition, and the metadata for this new data is kept separately.

When querying, each partition layout’s respective metadata is used to identify the files that need to be accessed; this is called split-planning. Split-planning is one of many Iceberg features made possible by the table metadata, which separates the physical storage from the logical one. This concept makes Iceberg extremely versatile.
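To make the idea of hidden partitioning concrete, here is a minimal Python sketch of how a partition value can be derived from a regular timestamp column and then used to skip files. It is an illustration rather than Iceberg's implementation: the function names and the file list are hypothetical, and only the day, hour, and truncate transforms are modeled (bucketing is left out).

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def day_transform(ts):
    """Whole days since 1970-01-01 UTC."""
    return (ts - EPOCH).days

def hour_transform(ts):
    """Whole hours since 1970-01-01T00:00 UTC."""
    return int((ts - EPOCH).total_seconds() // 3600)

def truncate_transform(value, width):
    """Keep the first `width` characters of a string value."""
    return value[:width]

def prune_files(files, transform, start, end):
    """Return paths of files whose partition value can hold rows in [start, end]."""
    lo, hi = transform(start), transform(end)
    return [f["path"] for f in files if lo <= f["partition"] <= hi]

# Hypothetical table: one data file per day for 8-12 June 2024.
files = [
    {
        "path": f"data/day={d}/part-0.parquet",
        "partition": day_transform(datetime(2024, 6, d, tzinfo=timezone.utc)),
    }
    for d in range(8, 13)
]

# The query filters on the timestamp column only; the partition
# filter is derived from it, so no partition column appears in the query.
start = datetime(2024, 6, 10, 13, 0, tzinfo=timezone.utc)
end = datetime(2024, 6, 11, 2, 0, tzinfo=timezone.utc)
selected = prune_files(files, day_transform, start, end)
print(selected)
```

Only the files for 10 and 11 June are selected, even though the filter was expressed purely in terms of the timestamp.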

Determining the correct partitioning is key when working with large data sets because it affects query performance and the amount of data being scanned. Because this migration was from existing tables from Snowflake native storage to Iceberg, it was crucial to test and provide a solution with the same or better performance for the existing workload and types of queries.

These tests were possible due to Apache Iceberg’s:

Hidden partitions
Partition transformations
Partition evolution

These features made it possible to alter table partitions and test which strategy worked best without rewriting data.

Here are a few partitioning strategies that were tested:

PARTITIONED BY (days(day), customer_id)
PARTITIONED BY (days(day), hour(timestamp))
PARTITIONED BY (days(day), bucket(N, customer_id))
PARTITIONED BY (days(day))

Each partitioning strategy that was reviewed generated significantly different results, both during writing and at query time. After careful analysis of the results, Cloudinary decided to partition the data by day and combine it with sorting, which allows them to sort data within partitions, as described in the compaction section.
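One reason the strategies behave so differently is the number of partitions each one creates per day, because every partition receives its own set of files. The following Python sketch counts partitions for the four layouts on a made-up day of traffic. The row counts, the choice of 16 buckets, and the CRC32 hash (used here only as a stand-in for a bucket hash) are assumptions for illustration, not Cloudinary's figures.

```python
import zlib

def bucket(num_buckets, value):
    """Stand-in for a hash bucket transform; CRC32 is used only for illustration."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

# Hypothetical single day of traffic: 1,000 customers, each active in every hour.
rows = [
    {"day": 19884, "hour": h, "customer_id": c}
    for h in range(24)
    for c in range(1000)
]

# Each layout maps a row to its partition key.
layouts = {
    "days(day), customer_id": lambda r: (r["day"], r["customer_id"]),
    "days(day), hour(timestamp)": lambda r: (r["day"], r["hour"]),
    "days(day), bucket(16, customer_id)": lambda r: (r["day"], bucket(16, r["customer_id"])),
    "days(day)": lambda r: (r["day"],),
}

partition_counts = {
    name: len({key(r) for r in rows}) for name, key in layouts.items()
}

for name, count in partition_counts.items():
    print(f"{name}: {count} partitions per day")
```

Partitioning by customer yields a thousand partitions per day in this example, while partitioning by day alone yields one, which leaves the within-partition layout to sorting.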

Optimizing ingestion

Cloudinary receives billions of events in files from its providers in various formats and sizes and stores those on Amazon S3, resulting in terabytes of data processed and stored every day.

Because the data doesn’t come in a consistent manner and it’s not possible to predict the incoming rate and file size of the data, it was necessary to find a way of keeping cost down while maintaining high throughput.

This was achieved by using Amazon EventBridge to push a notification for each received file into Amazon SQS, from which the files were processed in batches by Spark running on Amazon EMR. This made it possible to process the incoming data at high throughput and to scale clusters according to queue size while keeping costs down.
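Scaling according to queue size can be reduced to a small sizing rule. The Python sketch below is one possible rule under assumed numbers: the function name, the files-per-worker figure, and the worker limits are all hypothetical and are not taken from Cloudinary's setup.

```python
import math

def desired_workers(queue_depth, files_per_worker, min_workers, max_workers):
    """Pick a worker count proportional to the backlog, clamped to fixed limits."""
    needed = math.ceil(queue_depth / files_per_worker)
    return max(min_workers, min(max_workers, needed))

# Hypothetical limits: each worker handles 500 files per batch window.
for depth in (0, 4200, 100000):
    print(depth, "queued files ->", desired_workers(depth, 500, 2, 40), "workers")
```

The lower bound keeps a small cluster running when the queue is empty, and the upper bound caps cost during bursts.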

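Example of fetching up to 100 messages (files) from Amazon SQS with Spark, using 10 parallel receive calls of up to 10 messages each:

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.{Message, ReceiveMessageRequest}
import scala.collection.JavaConverters._
// One lazily created client per JVM, so each executor builds its own client.
object DistributedSQSReceiver { lazy val client = AmazonSQSClientBuilder.standard().withRegion("us-east-1").build() }
// A def, so every call polls SQS again; a single call returns at most 10 messages.
def getMessageBatch: List[Message] = DistributedSQSReceiver.client.receiveMessage(new ReceiveMessageRequest().withQueueUrl(queueUrl).withMaxNumberOfMessages(10)).getMessages.asScala.toList
// 10 parallel polls of up to 10 messages each yield up to 100 messages (files).
val messages = sparkSession.sparkContext.parallelize(1 to 10).map(_ => getMessageBatch).collect().flatten.toList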

When dealing with a high data ingestion rate for a specific partition prefix, Amazon S3 might potentially throttle requests and return a 503 status code (service unavailable). To address this scenario, Cloudinary used an Iceberg table property called write.object-storage.enabled, which incorporates a hash prefix into the stored Amazon S3 object path. This approach was deemed efficient and effectively mitigated Amazon S3 throttling problems.
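The effect of a hash prefix on the object path can be shown with a few lines of Python. This is a sketch of the general idea only: the path layout, the SHA-256 hash, and the eight-character prefix are assumptions made for the example, and the exact hash function and layout that Iceberg uses depend on the Iceberg version.

```python
import hashlib
from collections import Counter

def object_key(table_location, partition, file_name, hashed_prefix):
    """Build an object key, optionally inserting a hash component before the partition."""
    if not hashed_prefix:
        return f"{table_location}/data/{partition}/{file_name}"
    digest = hashlib.sha256(file_name.encode("utf-8")).hexdigest()[:8]
    return f"{table_location}/data/{digest}/{partition}/{file_name}"

def leading_prefixes(keys, depth):
    """Count keys per prefix made of the first `depth` path components."""
    return Counter("/".join(k.split("/")[:depth]) for k in keys)

# Hypothetical burst of 100 files written to the same daily partition.
names = [f"part-{i:05d}.parquet" for i in range(100)]
plain = [object_key("warehouse/logs", "day=2024-06-10", n, False) for n in names]
hashed = [object_key("warehouse/logs", "day=2024-06-10", n, True) for n in names]

plain_prefixes = leading_prefixes(plain, 4)
hashed_prefixes = leading_prefixes(hashed, 4)
print(len(plain_prefixes), "prefix without hashing")
print(len(hashed_prefixes), "prefixes with hashing")
```

Without the hash component, all writes for the day land under a single prefix; with it, the same files are spread over many prefixes, which is what relieves the request-rate pressure on any one of them.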

Solving the small file problem and improving query performance

In modern data architectures, stream processing engines such as Amazon EMR are often used to ingest continuous streams of data into data lakes using Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems:

It generates many small files that lead to longer query planning, which in turn can impact read performance.
Poor data clustering, which can make file pruning less effective. This typically occurs in the streaming process when there is insufficient new data to generate optimal file sizes for reading, such as 512 MB.

Because partitioning is a key factor in the number of files produced, and because Cloudinary’s data is time based and most queries use a time filter, Cloudinary decided to address the optimization of its data lake in multiple ways.

First, Cloudinary set all the necessary configurations that helped reduce the number of files while appending data in the table by setting write.target-file-size-bytes, which allows defining the default target file size. Setting spark.sql.shuffle.partitions in Spark can reduce the number of output files by controlling the number of partitions used during shuffle operations, which affects how data is distributed across tasks, consequently minimizing the number of output files generated after transformations or aggregations.

Because the above approach only addressed the small file problem but didn’t eliminate it entirely, Cloudinary used another capability of Apache Iceberg that can compact data files in parallel using Spark with the rewriteDataFiles action. This action combines small files into larger files to reduce metadata overhead and minimize the amount of Amazon S3 GetObject API operation usage.

Here is where it can get complicated. When running compaction, Cloudinary needed to choose which strategy to apply out of the three that Apache Iceberg offers; each one having its own advantages and disadvantages:

Binpack – simply rewrites smaller files to a target size
Sort – data sorting based on different columns
Z-order – a technique to colocate related data in the same set of files

At first, the Binpack compaction strategy was evaluated. This strategy is the fastest: it combines small files to reach the defined target file size. After running it, a significant improvement in query performance was observed.
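The grouping step behind a bin-pack style compaction can be sketched as a classic first-fit-decreasing packing. The Python below is a simplified model, not Iceberg's planner: the function name, the file sizes, and the 512 MB target are assumptions chosen for the example.

```python
def bin_pack(sizes_mb, target_mb):
    """Group files so that each group's total size stays within the target size."""
    bins = []  # each bin lists the input file sizes that would be rewritten together
    for size in sorted(sizes_mb, reverse=True):
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new output file.
            bins.append([size])
    return bins

# Hypothetical partition: 100 tiny files, four medium ones, and one larger file.
small_files = [5] * 100 + [60] * 4 + [200]
groups = bin_pack(small_files, target_mb=512)

print(len(small_files), "input files ->", len(groups), "output files")
print("output sizes (MB):", sorted(sum(g) for g in groups))
```

In this example 105 input files collapse into two output files, with no sorting or shuffling of rows, which is why this strategy is the cheapest to run.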

As mentioned previously, data was partitioned by day and most queries ran on a specific time range. Because data comes from external vendors and sometimes arrives late, it was noticed that when running queries on compacted days, a lot of data was being scanned, because the specific time range could reside across many files. The query engine (Athena, Snowflake, and Trino with Amazon EMR) needed to scan the entire partition to fetch only the relevant rows.

To increase query performance even further, Cloudinary decided to change the compaction process to use sort, so now data is partitioned by day and sorted by requested_at (timestamp when the action occurred) and customer ID.

This strategy is costlier for compaction because it needs to shuffle the data in order to sort it. However, after adopting this sort strategy, two things were noticeable: the same queries that ran before now scanned around 50 percent less data, and query run time was improved by 30 percent to 50 percent.
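Why sorting reduces the data scanned can be seen from per-file minimum and maximum statistics, which query engines use to skip files. The Python sketch below compares a deliberately extreme unsorted layout with a sorted one for a one-hour filter inside a daily partition. The row counts and file sizes are invented, and real late-arriving data is usually less scattered than a full shuffle.

```python
import random

def write_files(values, rows_per_file):
    """Split values into files and record each file's (min, max) column statistics."""
    files = []
    for i in range(0, len(values), rows_per_file):
        chunk = values[i:i + rows_per_file]
        files.append((min(chunk), max(chunk)))
    return files

def files_to_scan(files, lo, hi):
    """Count files whose min/max range overlaps the filter range [lo, hi]."""
    return sum(1 for fmin, fmax in files if fmax >= lo and fmin <= hi)

random.seed(7)
# Hypothetical day: 100 events for every minute of the day (144,000 rows).
arrival = [minute for minute in range(1440) for _ in range(100)]
random.shuffle(arrival)  # rows arrive out of order

unsorted_files = write_files(arrival, 6000)        # 24 files, arrival order
sorted_files = write_files(sorted(arrival), 6000)  # 24 files, sorted by time

# Filter on one hour of the day: minutes 600 to 659.
scanned_unsorted = files_to_scan(unsorted_files, 600, 659)
scanned_sorted = files_to_scan(sorted_files, 600, 659)
print("unsorted:", scanned_unsorted, "files; sorted:", scanned_sorted, "file")
```

Both layouts hold the same rows in the same number of files; only the sorted one lets the engine rule out files by their statistics.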

Cost effectively maintaining Apache Iceberg tables

Maintaining Apache Iceberg tables is crucial for optimizing performance, reducing storage costs, and ensuring data integrity. Iceberg provides several maintenance operations to keep your tables in good shape. By incorporating these operations, Cloudinary was able to manage its Iceberg tables cost-effectively.

Expire snapshots

Each write to an Iceberg table creates a new snapshot, or version, of a table. Snapshots can be used for time-travel queries, or the table can be rolled back to any valid snapshot.

Regularly expiring snapshots is recommended to delete data files that are no longer needed and to keep the size of table metadata small. Cloudinary decided to retain snapshots for up to 7 days to allow easier troubleshooting and handling of corrupted data, which sometimes arrives from external sources and isn’t identified upon arrival.

// expireOlderThan expects an absolute timestamp in epoch milliseconds, so the 7-day retention is subtracted from the current time.
SparkActions.get().expireSnapshots(iceTable).expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)).execute()
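The retention rule comes down to comparing each snapshot's timestamp with a cutoff expressed as an absolute time in epoch milliseconds. The Python sketch below models that selection with hypothetical snapshot IDs and dates; keeping the most recent snapshot unconditionally is a simplification added for the example.

```python
from datetime import datetime, timedelta, timezone

def to_millis(dt):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)

def snapshots_to_expire(snapshots, now, retention_days=7):
    """Return IDs of snapshots older than the cutoff, never the newest one."""
    cutoff_ms = to_millis(now - timedelta(days=retention_days))
    newest = max(snapshots, key=snapshots.get)
    return sorted(
        sid for sid, ts in snapshots.items() if ts < cutoff_ms and sid != newest
    )

now = datetime(2024, 6, 10, tzinfo=timezone.utc)
snapshots = {  # hypothetical snapshot ID -> commit time in epoch milliseconds
    "s1": to_millis(now - timedelta(days=10)),
    "s2": to_millis(now - timedelta(days=8)),
    "s3": to_millis(now - timedelta(days=6)),
    "s4": to_millis(now - timedelta(days=1)),
}

expired = snapshots_to_expire(snapshots, now)
print("expired:", expired)

# Using the bare retention length as the cutoff would point at
# 8 January 1970, so no snapshot would ever qualify.
bare_duration_ms = 7 * 24 * 60 * 60 * 1000
expired_with_bare_duration = [
    sid for sid, ts in snapshots.items() if ts < bare_duration_ms
]
print("expired with a bare duration as cutoff:", expired_with_bare_duration)
```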

Remove old metadata files

Iceberg keeps track of table metadata using JSON files. Each change to a table produces a new metadata file to provide atomicity.

Old metadata files are kept for history by default. Tables with frequent commits, like those written by streaming jobs, might need to regularly clean metadata files.

Configuring the following properties will make sure that only the latest ten metadata files are kept and anything older is deleted.

write.metadata.delete-after-commit.enabled=true 
write.metadata.previous-versions-max=10

Delete orphan files

In Spark and other distributed processing engines, when tasks or jobs fail, they might leave behind files that aren’t accounted for in the table metadata. Moreover, in certain instances, the standard snapshot expiration process might fail to identify files that are no longer necessary and not delete them.

Apache Iceberg offers a deleteOrphanFiles action that will take care of unreferenced files. This action might take a long time to complete if there are a large number of files in the data and metadata directories. A metadata or data file is considered orphan if it isn’t reachable by any valid snapshot. The set of actual files is built by listing the underlying storage using the Amazon S3 ListObjects operation, which makes this operation expensive. It’s recommended to run this operation periodically to avoid increased storage usage; however, too frequent runs can potentially offset this cost benefit.
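Conceptually, orphan detection is a set difference between the files found in storage and the files reachable from table metadata, with an age threshold to protect writes that are still in progress. The Python sketch below illustrates that logic only; the function name, file names, timestamps, and the three-day threshold are assumptions for the example.

```python
DAY_MS = 24 * 60 * 60 * 1000

def find_orphans(listed_files, referenced_files, now_ms, min_age_ms=3 * DAY_MS):
    """Return files present in storage, unreferenced by metadata, and old enough."""
    return sorted(
        path
        for path, modified_ms in listed_files.items()
        if path not in referenced_files and now_ms - modified_ms >= min_age_ms
    )

now_ms = 1_718_000_000_000
# Hypothetical storage listing: path -> last modified time in epoch milliseconds.
listed_files = {
    "data/a.parquet": now_ms - 30 * DAY_MS,     # referenced by a valid snapshot
    "data/b.parquet": now_ms - 20 * DAY_MS,     # left behind by a failed job
    "data/c.parquet": now_ms - 1 * DAY_MS,      # recent write, not yet committed
    "metadata/old.avro": now_ms - 15 * DAY_MS,  # no longer reachable
}
referenced_files = {"data/a.parquet"}

orphans = find_orphans(listed_files, referenced_files, now_ms)
print(orphans)
```

Building `listed_files` is the expensive part in practice, because it requires listing every object under the table location.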

The following diagram illustrates how critical this procedure is: it shows a run that removed 112 TB of storage.

Rewriting manifest files

Apache Iceberg uses metadata in its manifest list and manifest files to speed up query planning and to prune unnecessary data files. Manifests in the metadata tree are automatically compacted in the order that they’re added, which makes queries faster when the write pattern aligns with read filters.

If a table’s write pattern doesn’t align with the query read filter pattern, metadata can be rewritten to re-group data files into manifests using rewriteManifests.

While Cloudinary already had a compaction process that optimized data files, they noticed that manifest files also required optimization. In certain cases, Cloudinary reached over 300 manifest files, which were small (often under 8 MB in size), and because of late-arriving data, manifest files were pointing to data in different partitions. This caused query planning to run for 12 seconds for each query.

Cloudinary initiated a separate scheduled rewriteManifests process. After it ran, the number of manifest files was reduced to approximately 170, and because manifests were better aligned with query filters (based on partitions), query planning became roughly three times faster, at approximately 4 seconds.
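The effect of regrouping manifests by partition can be modeled by tracking the range of partitions each manifest covers. In the Python sketch below, late-arriving files make every manifest span several days, so a single-day query has to open more of them. The data volumes, the three-day lateness, and the manifest size are invented for illustration.

```python
def build_manifests(file_days, files_per_manifest):
    """Chunk data-file entries into manifests and record each manifest's day range."""
    manifests = []
    for i in range(0, len(file_days), files_per_manifest):
        chunk = file_days[i:i + files_per_manifest]
        manifests.append((min(chunk), max(chunk)))
    return manifests

def manifests_to_open(manifests, day):
    """Count manifests whose day range may contain files for the queried day."""
    return sum(1 for lo, hi in manifests if lo <= day <= hi)

# Hypothetical write order over 30 days: each day commits 100 files,
# of which 20 are late files belonging to the partition from 3 days earlier.
write_order = []
for day in range(30):
    for _ in range(10):
        write_order += [day] * 8 + [max(day - 3, 0)] * 2

before = build_manifests(write_order, 50)         # entries in commit order
after = build_manifests(sorted(write_order), 50)  # entries regrouped by partition

opened_before = manifests_to_open(before, 10)
opened_after = manifests_to_open(after, 10)
print("manifests opened for day 10:", opened_before, "->", opened_after)
```

The number of manifests is the same in both cases; the gain comes purely from aligning their contents with the partition filter.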

Choosing the right query engine

As part of Cloudinary’s exploration aimed at testing various query engines, they initially outlined several key performance indicators (KPIs) to guide their search, including support for Apache Iceberg alongside integration with existing data sources such as MySQL and Snowflake, the availability of a web interface for effortless one-time queries, and cost optimization. In line with these criteria, they opted to evaluate various solutions including Trino on Amazon EMR, Athena, and Snowflake with Apache Iceberg support (at that time it was available as a Private Preview). This approach allowed for the assessment of each solution against defined KPIs, facilitating a comprehensive understanding of their capabilities and suitability for Cloudinary’s requirements.

Two of the more quantifiable KPIs that Cloudinary was planning to evaluate were cost and performance. Cloudinary realized early in the process that different queries and usage types can potentially benefit from different runtime engines. They decided to focus on four runtime engines.

Engine	Details
Snowflake native	XL data warehouse on top of data stored within Snowflake
Snowflake with Apache Iceberg support	XL data warehouse on top of data stored in S3 in Apache Iceberg tables
Athena	On-demand mode
Amazon EMR Trino	Open source Trino on top of an eight-node (m6g.12xl) cluster

The test included four types of queries that represent different production workloads that Cloudinary is running. They’re ordered by size and complexity from the simplest one to the most heavy and complex.

Query	Description	Data scanned	Returned results set
Q1	Multi-day aggregation on a single tenant	Single digit GBs	<10 rows
Q2	Single-day aggregation by tenant across multiple tenants	Dozens of GBs	100 thousand rows
Q3	Multi-day aggregation across multiple tenants	Hundreds of GBs	<10 rows
Q4	Heavy series of aggregations and transformations on a multi-tenant dataset to derive access metrics	Single digit TBs	>1 billion rows

The following graphs show the cost and performance of the four engines across the different queries. To avoid chart scaling issues, all costs and query durations were normalized based on Trino running on Amazon EMR. Cloudinary considered Query 4 to be less suitable for Athena because it involved processing and transforming extremely large volumes of complex data.

Some important aspects to consider are:

Cost for EMR running Trino was derived based on query duration only, without considering cluster set up, which on average launches in just under 5 minutes.
Cost for Snowflake (both options) was derived based on query duration only, without considering cold start (more than 10 seconds on average) and a Snowflake warehouse minimum charge of 1 minute.
Cost for Athena was based on the amount of data scanned; Athena doesn’t require cluster set up and the query queue time is less than 1 second.
All costs are based on list on-demand (OD) prices.
Snowflake prices are based on Standard edition.

The above chart shows that, from a cost perspective, Amazon EMR running Trino on top of Apache Iceberg tables was superior to other engines, in certain cases up to ten times less expensive. However, Amazon EMR setup requires additional expertise and skills compared to the no-code, no infrastructure management offered by Snowflake and Athena.

In terms of query duration, it’s noticeable that there’s no clear engine of choice for all types of queries. In fact, Amazon EMR, which was the most cost-effective option, was only fastest in two out of the four query types. Another interesting point is that Snowflake’s performance on top of Apache Iceberg is almost on-par with data stored within Snowflake, which adds another great option for querying their Apache Iceberg data-lake. The following table shows the cost and time for each query and product.

Query	Amazon EMR Trino	Snowflake (XL)	Snowflake (XL) Iceberg	Athena
Q1	$0.01 / 5 seconds	$0.08 / 8 seconds	$0.07 / 8 seconds	$0.02 / 11 seconds
Q2	$0.12 / 107 seconds	$0.25 / 28 seconds	$0.35 / 39 seconds	$0.18 / 94 seconds
Q3	$0.17 / 147 seconds	$1.07 / 120 seconds	$1.88 / 211 seconds	$1.22 / 26 seconds
Q4	$6.43 / 1,237 seconds	$11.73 / 1,324 seconds	$12.71 / 1,430 seconds	N/A
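The normalization against Trino on Amazon EMR that was used for the charts can be reproduced from the table with a few lines of Python. The figures are the ones in the table; the dictionary layout and the function name are choices made for this sketch.

```python
# (cost in USD, duration in seconds) per query and engine, taken from the table.
results = {
    "Q1": {"EMR Trino": (0.01, 5), "Snowflake": (0.08, 8),
           "Snowflake Iceberg": (0.07, 8), "Athena": (0.02, 11)},
    "Q2": {"EMR Trino": (0.12, 107), "Snowflake": (0.25, 28),
           "Snowflake Iceberg": (0.35, 39), "Athena": (0.18, 94)},
    "Q3": {"EMR Trino": (0.17, 147), "Snowflake": (1.07, 120),
           "Snowflake Iceberg": (1.88, 211), "Athena": (1.22, 26)},
    "Q4": {"EMR Trino": (6.43, 1237), "Snowflake": (11.73, 1324),
           "Snowflake Iceberg": (12.71, 1430), "Athena": None},
}

def normalize(results, baseline="EMR Trino"):
    """Express every cost and duration as a multiple of the baseline engine."""
    normalized = {}
    for query, engines in results.items():
        base_cost, base_secs = engines[baseline]
        normalized[query] = {
            engine: None if value is None else (
                round(value[0] / base_cost, 2),
                round(value[1] / base_secs, 2),
            )
            for engine, value in engines.items()
        }
    return normalized

normalized = normalize(results)
for query, engines in normalized.items():
    print(query, engines)
```

A value above 1 means the engine was more expensive or slower than Trino on Amazon EMR for that query, and a value below 1 means it was cheaper or faster.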

Benchmarking conclusions

While every solution presents its own set of advantages and drawbacks—whether in terms of pricing, scalability, optimizing for Apache Iceberg, or the contrast between open source versus closed source—the beauty lies in not being constrained to a single choice. Embracing Apache Iceberg frees you from relying solely on a single solution. In certain scenarios where queries must be run frequently while scanning up to hundreds of gigabytes of data with an aim to evade warm-up periods and keep costs down, Athena emerged as the best choice. Conversely, when tackling hefty aggregations that demanded significant memory allocation while being mindful of cost, the preference leaned towards using Trino on Amazon EMR. Amazon EMR was significantly more cost efficient when running longer queries, because boot time cost could be discarded. Snowflake stood out as a great option when queries could be joined with other tables already residing within Snowflake. This flexibility allowed harnessing the strengths of each service, strategically applying them to suit the specific needs of various tasks without being confined to a singular solution.

In essence, the true power lies in the ability to tailor solutions to diverse requirements, using the strengths of different environments to optimize performance, cost, and efficiency.

Conclusion

Data lakes built on Amazon S3 and analytics services such as Amazon EMR and Amazon Athena, along with the open source Apache Iceberg framework, provide a scalable, cost-effective foundation for modern data architectures. They enable organizations to quickly construct robust, high-performance data lakes that support ACID transactions and analytics workloads. This combination is the most refined way to have an enterprise-grade open data environment. The availability of managed services and open source software helps companies to implement data lakes that meet their needs.

Since building a data lake solution on top of Apache Iceberg, Cloudinary has seen major enhancements. The data lake infrastructure enables Cloudinary to extend its data retention sixfold while lowering the cost of storage by over 25 percent. Furthermore, query costs dropped by 25–40 percent thanks to the efficient querying capabilities of Apache Iceberg and the query optimizations provided in Athena engine version 3, which is based on Trino. The ability to retain data for longer and to provide it to various stakeholders while reducing cost is a key component in allowing Cloudinary to be more data driven in its operations and decision-making processes.

Using a transactional data lake architecture that uses Amazon S3, Apache Iceberg, and AWS Analytics services can greatly enhance an organization’s data infrastructure. This allows for sophisticated analytics and machine learning, fueling innovation while keeping costs down and allowing the use of a plethora of tools and services without limits.

About the Authors

Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to leverage data, gain insights, and derive value. Yonatan is an Apache Iceberg evangelist.

Amit Gilad is a Senior Data Engineer on the Data Infrastructure team at Cloudinary. He is currently leading the strategic transition from traditional data warehouses to a modern data lakehouse architecture, utilizing Apache Iceberg to enhance scalability and flexibility.

Alex Dickman is a Staff Data Engineer on the Data Infrastructure team at Cloudinary. He focuses on engaging with various internal teams to consolidate the team’s data infrastructure and create new opportunities for data applications, ensuring robust and scalable data solutions for Cloudinary’s diverse requirements.

Itay Takersman is a Senior Data Engineer on the Data Infrastructure team at Cloudinary. He focuses on building resilient data flows and aggregation pipelines to support Cloudinary’s data requirements.

Exploring the 2024 EU Election: Internet traffic trends and cybersecurity insights

2024-06-10 João Tomé

Post Syndicated from João Tomé original https://blog.cloudflare.com/exploring-the-2024-eu-election-internet-traffic-trends-and-cybersecurity-insights

The 2024 European Parliament election took place June 6-9, 2024, with hundreds of millions of Europeans from the 27 countries of the European Union electing 720 members of the European Parliament. This was the first election after Brexit and without the UK, and it had an impact on the Internet. In this post, we will review some of the Internet traffic trends observed during the election days, as well as providing insight into cyberattack activity.

Elections matter, and as we have mentioned before (1, 2), 2024 is considered “the year of elections”, with voters going to the polls in at least 60 countries, as well as the 27 EU member states. That’s why we’re publishing a regularly updated election report on Cloudflare Radar. We’ve already included our analysis of recent elections in South Africa, India, Iceland, and Mexico, and provided a policy view on the EU elections.

The European Parliament election coincided with several other national or local elections in European Union member states, leading to direct consequences. For example, in Belgium, the prime minister announced his resignation, resulting in a drop in Internet traffic during the speech followed by a clear increase after the speech was over. In France, we saw a similar pattern with the announcement of legislative snap elections.

From analyzing patterns seen during previous elections in France and Brazil, we know that Internet traffic often decreases during voting hours, though not as significantly as during other major events like national holidays. This usual drop is typically followed by an increase in traffic as election results are announced.

Let’s start with a wider picture of the 2024 European Parliament election, focusing on the time of the biggest drop in Internet HTTP requests during the election days as compared to the previous week. Note that there were some national or local elections taking place at the same time, and European Union elections are known to have low turnout compared to national and local ones.

*Source: Cloudflare; created with Datawrapper*

Drops greater than 10% were observed only in the Czech Republic, Luxembourg, Slovakia, Cyprus, Belgium, Estonia, and Croatia. The table below includes the percentage that traffic dropped and the specific time during the election day it occurred. In countries with more than one election day, we considered the time and day of the biggest drop.

Countries	Elections day(s)	Local time	Drop in traffic %
Czech Republic	June 7 – 8	June 8, 14:30	-20%
Luxembourg	June 9	12:45	-18%
Slovakia	June 8	15:45; 19:00	-16%
Cyprus	June 9	10:00	-16%
Belgium	June 9	11:45	-14%
Estonia	June 7-9	June 9, 9:00	-13%
Croatia	June 9	18:00	-12%
Poland	June 9	18:00	-10%
Netherlands	June 6	10:15	-10%
Germany	June 9	13:45	-10%
Ireland	June 7	7:15	-9%
Finland	June 9	9:00	-9%
Portugal	June 9	15:45	-9%
Malta	June 8	12:15	-9%
Latvia	June 8	08:30, 16:15	-9%
Slovenia	June 9	18:00	-8%
Hungary	June 9	6:00	-8%
Austria	June 9	12:30	-7%
Italy	June 8 – 9	June 9, 16:00	-6%
France	June 9	13:30	-6%
Bulgaria	June 9	19:45	-5%
Greece	June 9	8:00	-5%
Spain	June 9	13:00	-4%
Lithuania	June 9	8:00	-3%
Romania	June 9	9:45	-1%
Denmark	June 9	–	–
Sweden	June 9	–	–

The data in the table above shows that Central European countries had the highest drop in Internet traffic, particularly the Czech Republic and Slovakia. Eastern Europe saw significant drops in Estonia and Poland. Southern Europe had consistent moderate drops across multiple countries, with Cyprus and Croatia showing higher losses. Northern Europe showed minimal to no traffic drop in Scandinavian countries, with Finland and Ireland experiencing moderate declines.

Looking at the specific (local) times of day during voting periods on election days, morning drops (06:00 – 10:00) were more common in Northern and Eastern Europe. Late morning to early afternoon drops (10:15 – 14:30) were predominantly observed in Western and Central Europe. Late afternoon drops (15:45 – 19:45) were more common in Central and Southern Europe.

Impact of notable announcements in Belgium and France

There’s more to say when we look at specific country trends. The 27 members of the European Union bring diversity in habits, languages, and cultures. That also impacted traffic, and this election in particular had a national impact in some of the countries.

In Belgium, national and regional elections took place on the same day, June 9. After polling stations closed at 16:00 local time (14:00 UTC), HTTP requests followed the typical pattern of increasing, peaking at 21:15 local time (19:15 UTC), with 7% more requests than the previous week. This trend was interrupted by Prime Minister Alexander De Croo’s speech at around 22:00 local time (20:00 UTC), admitting defeat in the national elections. This pattern is typical when important announcements are broadcast on TV, impacting Internet traffic.

How about France? President Emmanuel Macron announced at around 21:00 local time (19:00 UTC) that he would dissolve the national parliament for a snap legislative election. This followed the EU elections that gave a victory to his rival Marine Le Pen’s National Rally in the European Parliament vote. At the time of his speech, requests dropped 6% compared to the previous week, and increased right after Macron’s speech, peaking at 22:15 local time (20:15 UTC) with a 6% increase.

After voting ends, traffic increases

It was not only Belgium and France that had typical increases in HTTP requests at night when the first projections and results started to be announced. The same happened in the Netherlands, the first European country to vote in the 2024 European Parliament election, on Thursday, June 6. We have previously written about Dutch political websites being attacked on that day. Traffic was 4% higher than usual after 20:30 local time (18:30 UTC), and peaked at 01:15 with a 15% increase compared to the previous week.

Similar trends were seen in Italy on June 9, and in Germany on the same day. In Germany, at 21:45 (19:45 UTC), requests were already 8% higher, with a 23:00 (21:00 UTC) drop of 2% during election speeches, and a peak at 00:30 (22:30 UTC) with an 18% increase.

The same night-time trends were observed in other countries:

Slovakia had a peak increase of 24% at 23:45 local time (21:45 UTC) on June 8.
Spain saw a 21% peak increase at 21:00 local time (19:00 UTC) on June 9.
Poland had a 9% peak increase at 01:45 local time (23:45 UTC).
Portugal experienced a 29% peak increase at 00:15 local time (23:15 UTC).
Croatia had a 19% peak increase at 23:00 (21:00 UTC).
Slovenia had a 19% peak increase at 22:45 (20:45 UTC).
Lithuania had a 22% peak increase at 23:00 (20:00 UTC).
Estonia saw the highest peak increase, reaching 35% at 00:00 (21:00 UTC).

Growing interest in election information and news

Switching to domain trends, DNS traffic (using our 1.1.1.1 resolver) shows a more specific impact related to elections. Social media platforms invited users in Europe to vote, sometimes giving European or local websites as a reference. Here’s an example from Instagram:

Did this increase traffic to election-related sites in the European Union? Our DNS data shows a 26x peak growth at 19:00 UTC on Sunday, June 9, 2024. DNS traffic was already much higher compared to the previous week on June 8, with a peak growth of 8x at 17:00 UTC.

Looking at European news outlets’ domains, there was an initial 1.68x increase (compared to the previous week) at 13:00 UTC on June 9, 2024, and a second peak at 19:00 UTC.

For local election-results sites, there was a significant 55x peak growth at 22:00 UTC on June 9, 2024, compared to the previous week.

Government-focused cyberattacks

Focusing on attacks, as mentioned above, we recently published a blog post about the cyberattack on Dutch political-related websites that lasted two days – June 5 and 6. The main DDoS (Distributed Denial of Service) attack on June 5, the day before the Dutch election, reached 73,000 requests per second (rps).

Looking at government or state-related websites in the European Union in 2024, there have been several spikes in attacks targeting defense organizations, European courts, and educational institutions since the year started.

The main one was on February 25, 2024, when Cloudflare blocked a DDoS attack on a French government website that reached 420 million requests per hour and lasted over three hours.

Between January and June 2024, government sites in Belgium, France, and Germany were the main targets, receiving 49%, 25%, and 10% respectively of attack requests targeting EU government-related sites.

In a broader view, from January 1 to June 9, Cloudflare mitigated 8.6 billion threats to government websites in the EU, with 68% of those being DDoS threats. This amounts to an average of 53.42 million threats mitigated per day. These trends highlight the ongoing threat to critical infrastructure across Europe, with government sites frequently targeted by cyberattacks.
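The daily average quoted above can be reproduced with a short calculation. This is a minimal sketch that assumes the period is counted inclusively (January 1 through June 9, 2024) and takes the rounded 8.6 billion total at face value; the variable names are illustrative only.

```python
from datetime import date

# Figures taken from the paragraph above; the total is already rounded.
TOTAL_THREATS = 8.6e9
DDOS_SHARE = 0.68

start, end = date(2024, 1, 1), date(2024, 6, 9)
days = (end - start).days + 1  # count both the first and the last day

daily_average = TOTAL_THREATS / days
ddos_threats = TOTAL_THREATS * DDOS_SHARE

print(f"{days} days in the period")
print(f"{daily_average / 1e6:.2f} million threats mitigated per day")
print(f"{ddos_threats / 1e9:.2f} billion DDoS threats")
```

Counting 161 days gives roughly 53.42 million threats per day, which matches the figure in the text.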

Just before the elections

Focusing on the five weeks before the EU election, we didn’t see significant attacks on European election-related organizations. However, there were a few DDoS threats that targeted government sites from European Union member states. Notable instances include attacks on the Bulgarian government on June 6, the French government on May 11 and June 9, another in France on May 23, Sweden on May 18 and April 29, and Denmark on May 7.

These attacks were not very large compared to others mentioned. The largest targeted the Bulgarian government on June 6, with 122 million daily DDoS requests and a peak of 110,500 requests per second at 11:29 local time (08:29 UTC).

On election day in France, June 9, a French government website was also the target of a smaller attack, with 42,000 DDoS requests per second at 11:57 local time (09:57 UTC).

Conclusion

The 2024 European Parliament election had some clear impacts on Internet traffic, and cyber threats were looming in the weeks before, most notably the Dutch political-related attack around election day.

While voting led to typical drops in Internet traffic, the announcement of results and significant political events caused spikes in activity.

If you want to follow more trends and insights about the Internet and elections in particular, you can check Cloudflare Radar, and more specifically our new 2024 Elections Insights report, that we’re updating as elections take place throughout the year.

[$] P4TC hits a brick wall

2024-06-10 corbet

Post Syndicated from corbet original https://lwn.net/Articles/977310/

P4, short for “Programming
Protocol-independent Packet Processors”, is a programming language aimed at
networking devices; it is useful for the configuration of firewalls and
complicated routing architectures. Since a lot of advanced networking is
done with Linux systems, it stands to reason that there would be value in
supporting P4 and, indeed, an
implementation of P4 in the kernel’s traffic-control subsystem was
first posted by Jamal Hadi Salim at the beginning of 2023. After nearly
18 months, though, this feature has not been merged, and the chances
of that happening would appear to be getting worse.

perl v5.40.0 released

2024-06-10 corbet

Post Syndicated from corbet original https://lwn.net/Articles/977765/

Version 5.40.0 of the Perl language has been released. “Perl 5.40.0
represents approximately 11 months of development since Perl 5.38.0 and
contains approximately 160,000 lines of changes across 1,500 files from 75
authors“. Significant changes include a new __CLASS__
keyword, a :reader attribute for field variables, a new
“^^” logical-XOR operator (because two of those were not enough),
moving “try/catch” out of the experimental category, and more; see
this
page for lots of details.

Security updates for Monday

2024-06-10 jake

Post Syndicated from jake original https://lwn.net/Articles/977789/

Security updates have been issued by Fedora (galera and mariadb10.11), Mageia (0-plugins-base and plasma-workspace), Oracle (ruby:3.1 and ruby:3.3), Red Hat (bind, bind-dyndb-ldap, and dhcp), SUSE (apache2, glib2, libvirt, openssl-1_1, openssl-3, opera, python-Jinja2, python-requests, and squid), and Ubuntu (linux, linux-gcp, linux-gcp-5.15, linux-lowlatency,
linux-lowlatency-hwe-5.15, linux-xilinx-zynqmp, linux, linux-gcp, linux-gcp-6.5, linux-lowlatency,
linux-lowlatency-hwe-6.5, linux-raspi, linux, linux-ibm, linux-lowlatency, linux-raspi, linux-aws, linux-gcp, linux-azure, linux-azure-6.5, linux-starfive, linux-starfive-6.5, and linux-gke, linux-ibm, linux-intel-iotg, linux-oracle).

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

2024-06-10 Sudipta Mitra

Post Syndicated from Sudipta Mitra original https://aws.amazon.com/blogs/big-data/design-a-data-mesh-pattern-for-amazon-emr-based-data-lakes-using-aws-lake-formation-with-hive-metastore-federation/

In this post, we delve into the key aspects of using Amazon EMR for modern data management, covering topics such as data governance, data mesh deployment, and streamlined data discovery.

One of the key challenges in modern big data management is facilitating efficient data sharing and access control across multiple EMR clusters. Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. To address this challenge, organizations can deploy a data mesh using AWS Lake Formation that connects the multiple EMR clusters. With the AWS Glue Data Catalog federation to external Hive metastore feature, you can now apply data governance to the metadata residing across those EMR clusters and analyze them using AWS analytics services such as Amazon Athena, Amazon Redshift Spectrum, AWS Glue ETL (extract, transform, and load) jobs, EMR notebooks, EMR Serverless using Lake Formation for fine-grained access control, and Amazon SageMaker Studio. For detailed information on managing your Apache Hive metastore using Lake Formation permissions, refer to Query your Apache Hive metastore with AWS Lake Formation permissions.

In this post, we present a methodology for deploying a data mesh consisting of multiple Hive data warehouses across EMR clusters. This approach enables organizations to take advantage of the scalability and flexibility of EMR clusters while maintaining control and integrity of their data assets across the data mesh.

Use cases for Hive metastore federation for Amazon EMR

Hive metastore federation for Amazon EMR is applicable to the following use cases:

Governance of Amazon EMR-based data lakes – Producers generate data within their AWS accounts using an Amazon EMR-based data lake supported by EMRFS on Amazon Simple Storage Service (Amazon S3) and HBase. These data lakes require governance for access without the necessity of moving data to consumer accounts. The data resides on Amazon S3, which reduces the storage costs significantly.
Centralized catalog for published data – Multiple producers release data currently governed by their respective entities. For consumer access, a centralized catalog is necessary where producers can publish their data assets.
Consumer personas – Consumers include data analysts who run queries on the data lake, data scientists who prepare data for machine learning (ML) models and conduct exploratory analysis, as well as downstream systems that run batch jobs on the data within the data lake.
Cross-producer data access – Consumers may need to access data from multiple producers within the same catalog environment.
Data access entitlements – Data access entitlements involve implementing restrictions at the database, table, and column levels to provide appropriate data access control.

Solution overview

The following diagram shows how data from producers with their own Hive metastores (left) can be made available to consumers (right) using Lake Formation permissions enforced in a central governance account.

Producer and consumer are logical concepts used to indicate the production and consumption of data through a catalog. An entity can act both as a producer of data assets and as a consumer of data assets. The onboarding of producers is facilitated by sharing metadata, whereas the onboarding of consumers is based on granting permission to access this metadata.

The solution consists of multiple steps in the producer, catalog, and consumer accounts:

Deploy the AWS CloudFormation templates and set up the producer, central governance and catalog, and consumer accounts.
Test access to the producer cataloged Amazon S3 data using EMR Serverless in the consumer account.
Test access using Athena queries in the consumer account.
Test access using SageMaker Studio in the consumer account.

Producer

Producers create data within their AWS accounts using an Amazon EMR-based data lake and Amazon S3. Multiple producers then publish this data into a central catalog (data lake technology) account. Each producer account, along with the central catalog account, has either VPC peering or AWS Transit Gateway enabled to facilitate AWS Glue Data Catalog federation with the Hive metastore.

For each producer, an AWS Glue Hive metastore connector AWS Lambda function is deployed in the catalog account. This enables the Data Catalog to access Hive metastore information at runtime from the producer. The data lake locations (the S3 bucket location of the producers) are registered in the catalog account.

Central catalog

A catalog offers governed and secure data access to consumers. Federated databases are established within the catalog account’s Data Catalog using the Hive connection, managed by the catalog Lake Formation admin (LF-Admin). These federated databases in the catalog account are then shared by the data lake LF-Admin with the consumer LF-Admin of the external consumer account.

Data access entitlements are managed by applying access controls as needed at various levels, such as the database or table.

Consumer

The consumer LF-Admin grants the necessary permissions or restricted permissions to roles such as data analysts, data scientists, and downstream batch processing engine AWS Identity and Access Management (IAM) roles within its account.

Data access entitlements are managed by applying access control based on requirements at various levels, such as databases and tables.

Prerequisites

You need three AWS accounts with admin access to implement this solution. It is recommended to use test accounts. The producer account will host the EMR cluster and S3 buckets. The catalog account will host Lake Formation and AWS Glue. The consumer account will host EMR Serverless, Athena, and SageMaker notebooks.

Set up the producer account

Before you launch the CloudFormation stack, gather the following information from the catalog account:

Catalog AWS account ID (12-digit account ID)
Catalog VPC ID (for example, vpc-xxxxxxxx)
VPC CIDR (catalog account VPC CIDR; it should not overlap 10.0.0.0/16)

The VPC CIDR of the producer and catalog can’t overlap due to VPC peering and Transit Gateway requirements. The VPC CIDR should be a VPC from the catalog account where the AWS Glue metastore connector Lambda function will be eventually deployed.
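Before launching the stack, it can help to confirm that the catalog VPC CIDR does not collide with the producer VPC. The following is a minimal sketch using Python's standard `ipaddress` module; the candidate CIDRs are hypothetical, and only the 10.0.0.0/16 producer range comes from this post.

```python
import ipaddress

# The producer stack in this post creates its VPC with this CIDR.
PRODUCER_CIDR = ipaddress.ip_network("10.0.0.0/16")


def overlaps_producer(catalog_cidr: str) -> bool:
    """Return True if the catalog VPC CIDR overlaps the producer VPC CIDR."""
    return ipaddress.ip_network(catalog_cidr).overlaps(PRODUCER_CIDR)


# Hypothetical catalog CIDRs, for illustration only.
for candidate in ("10.0.128.0/20", "10.1.0.0/16", "172.31.0.0/16"):
    status = "overlaps the producer VPC" if overlaps_producer(candidate) else "ok"
    print(f"{candidate}: {status}")
```

Any candidate reported as overlapping would need a different range before VPC peering can be set up.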

The CloudFormation stack for the producer creates the following resources:

S3 bucket to host data for the Hive metastore of the EMR cluster.
VPC with the CIDR 10.0.0.0/16. Make sure there is no existing VPC with this CIDR in use.
VPC peering connection between the producer and catalog account.
Amazon Elastic Compute Cloud (Amazon EC2) security groups for the EMR cluster.
IAM roles required for the solution.
EMR 6.10 cluster launched with Hive.
Sample data downloaded to the S3 bucket.
A database and external tables, pointing to the downloaded sample data, in its Hive metastore.

Complete the following steps:

Launch the template PRODUCER.yml. It’s recommended to use an IAM role that has administrator privileges.
Gather the values for the following on the CloudFormation stack’s Outputs tab:
1. VpcPeeringConnectionId (for example, pcx-xxxxxxxxx)
2. DestinationCidrBlock (10.0.0.0/16)
3. S3ProducerDataLakeBucketName

Set up the catalog account

The CloudFormation stack for the catalog account creates the Lambda function for federation. Before you launch the template, on the Lake Formation console, add the IAM role and user deploying the stack as the data lake admin.

Then complete the following steps:

Launch the template CATALOG.yml.
For the RouteTableId parameter, use the catalog account VPC RouteTableId. This is the VPC where the AWS Glue Hive metastore connector Lambda function will be deployed.
On the stack’s Outputs tab, copy the value for LFRegisterLocationServiceRole (arn:aws:iam::account-id:role/role-name).
Confirm that the Data Catalog settings have the IAM access control options cleared and that the current cross-account version is set to 4.

Log in to the producer account and add the following bucket policy to the producer S3 bucket that was created during the producer account setup. Add the ARN of LFRegisterLocationServiceRole to the Principal section and provide the S3 bucket name under the Resource section.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::account-id:role/role-name"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::s3-bucket-name/*",
                "arn:aws:s3:::s3-bucket-name"
            ]
        }
    ]
}
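If you prefer to generate the bucket policy rather than edit it by hand, the following sketch renders the same policy from the two values you gathered. The role ARN and bucket name below are hypothetical placeholders, and the optional `apply_bucket_policy` helper assumes that boto3 is installed and that credentials for the producer account are configured.

```python
import json


def build_bucket_policy(role_arn: str, bucket_name: str) -> str:
    """Render the bucket policy shown above for a given role ARN and bucket."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",
                    f"arn:aws:s3:::{bucket_name}",
                ],
            }
        ],
    }
    return json.dumps(policy, indent=4)


def apply_bucket_policy(bucket_name: str, policy_json: str) -> None:
    """Attach the policy to the bucket (needs boto3 and producer credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("s3").put_bucket_policy(Bucket=bucket_name, Policy=policy_json)


# Hypothetical values; substitute the stack outputs gathered earlier.
policy_json = build_bucket_policy(
    "arn:aws:iam::111122223333:role/example-lf-register-role",
    "example-producer-bucket",
)
print(policy_json)
```

Building the document with `json.dumps` avoids hand-editing mistakes such as stray spaces inside the ARN.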

In the producer account, on the Amazon EMR console, navigate to the primary node EC2 instance to get the value for Private IP DNS name (IPv4 only) (for example, ip-xx-x-x-xx.us-west-1.compute.internal).

Switch to the catalog account and deploy the AWS Glue Data Catalog federation Lambda function (GlueDataCatalogFederation-HiveMetastore).

The default Region is set to us-east-1. Change it to your desired Region before deploying the function.

Use the VPC that was used as the CloudFormation input for the VPC CIDR. You can use the VPC’s default security group ID. If using another security group, make sure its outbound rules allow traffic to 0.0.0.0/0.

Next, you create a federated database in Lake Formation.

On the Lake Formation console, choose Data sharing in the navigation pane.
Choose Create database.

Provide the following information:
1. For Connection name, choose your connection.
2. For Database name, enter a name for your database.
3. For Database identifier, enter emrhms_salesdb (this is the database created on the EMR Hive metastore).
Choose Create database.

On the Databases page, select the database and on the Actions menu, choose Grant to grant describe permissions to the consumer account.

Under Principals, select External accounts and choose your account ARN.
Under LF-Tags or catalog resources, select Named Data Catalog resources and choose your database and table.
Under Table permissions, provide the following information:
1. For Table permissions, select Select and Describe.
2. For Grantable permissions, select Select and Describe.
Under Data permissions, select All data access.
Choose Grant.

On the Tables page, select your table and on the Actions menu, choose Grant to grant select and describe permissions.

Under Principals, select External accounts and choose your account ARN.
Under LF-Tags or catalog resources, select Named Data Catalog resources and choose your database.
Under Database permissions, provide the following information:
1. For Database permissions, select Create table and Describe.
2. For Grantable permissions, select Create table and Describe.
Choose Grant.
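The console grants above could also be scripted. The following is a rough sketch, not the method used in this post: it builds request dictionaries for the Lake Formation `GrantPermissions` API and only calls boto3 inside a helper that you invoke explicitly. The account ID, database name, and table name are hypothetical, and the sketch assumes the federated database behaves like any other Data Catalog database for grants.

```python
def table_grant(consumer_account_id: str, database: str, table: str) -> dict:
    """Select and Describe on a table, including grantable permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT", "DESCRIBE"],
        "PermissionsWithGrantOption": ["SELECT", "DESCRIBE"],
    }


def database_grant(consumer_account_id: str, database: str) -> dict:
    """Create table and Describe on a database, including grantable permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {"Database": {"Name": database}},
        "Permissions": ["CREATE_TABLE", "DESCRIBE"],
        "PermissionsWithGrantOption": ["CREATE_TABLE", "DESCRIBE"],
    }


def apply_grants(requests: list) -> None:
    """Send each grant (needs boto3 and catalog-account LF-Admin credentials)."""
    import boto3  # imported lazily so the builders work without boto3

    client = boto3.client("lakeformation")
    for request in requests:
        client.grant_permissions(**request)


# Hypothetical consumer account, federated database, and table names.
grants = [
    database_grant("111122223333", "example_federated_db"),
    table_grant("111122223333", "example_federated_db", "example_table"),
]
for grant in grants:
    print(grant["Resource"], grant["Permissions"])
```

Keeping the request builders separate from the API call makes it easy to review the grants before running `apply_grants(grants)`.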

Set up the consumer account

Consumers include data analysts who run queries on the data lake, data scientists who prepare data for ML models and conduct exploratory analysis, as well as downstream systems that run batch jobs on the data within the data lake.

The consumer account setup in this section shows how you can query the shared Hive metastore data using Athena for the data analyst persona, EMR Serverless to run batch scripts, and SageMaker Studio for the data scientist to further use data in the downstream model building process.

For EMR Serverless and SageMaker Studio, if you’re using the default IAM service role, add the required Data Catalog and Lake Formation IAM permissions to the role and use Lake Formation to grant table permission access to the role’s ARN.

Data analyst use case

In this section, we demonstrate how a data analyst can query the Hive metastore data using Athena. Before you get started, on the Lake Formation console, add the IAM role or user deploying the CloudFormation stack as the data lake admin.

Then complete the following steps:

Run the CloudFormation template CONSUMER.yml.
If the catalog and consumer accounts are not part of the same organization in AWS Organizations, navigate to the AWS Resource Access Manager (AWS RAM) console and manually accept the resources shared from the catalog account.
On the Lake Formation console, on the Databases page, select your database and on the Actions menu, choose Create resource link.

Under Database resource link details, provide the following information:
1. For Resource link name, enter a name.
2. For Shared database’s region, choose a Region.
3. For Shared database, choose your database.
4. For Shared database’s owner ID, enter the account ID.
Choose Create.
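As an alternative to the console, a resource link can be created through the AWS Glue `CreateDatabase` API by pointing a new database at the shared one. The sketch below is illustrative only: the link name, shared database name, owner account ID, and Region are hypothetical, and the API call sits in a helper that assumes boto3 and consumer-account credentials.

```python
def build_resource_link_input(
    link_name: str, shared_database: str, owner_account_id: str, region: str
) -> dict:
    """Build the DatabaseInput for a resource link to a shared database."""
    return {
        "Name": link_name,
        "TargetDatabase": {
            "CatalogId": owner_account_id,  # the catalog account that owns the database
            "DatabaseName": shared_database,
            "Region": region,
        },
    }


def create_resource_link(database_input: dict) -> None:
    """Create the link (needs boto3 and consumer-account credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("glue").create_database(DatabaseInput=database_input)


# Hypothetical values for illustration.
link_input = build_resource_link_input(
    "example_federated_db_link", "example_federated_db", "444455556666", "us-east-1"
)
print(link_input)
```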

Now you can use Athena to query the table on the consumer side, as shown in the following screenshot.

Batch job use case

Complete the following steps to set up EMR Serverless to run a sample Spark job to query the existing table:

On the Amazon EMR console, choose EMR Serverless in the navigation pane.
Choose Get started.

Choose Create and launch EMR Studio.

Under Application settings, provide the following information:
1. For Name, enter a name.
2. For Type, choose Spark.
3. For Release version, choose the current version.
4. For Architecture, select x86_64.
Under Application setup options, select Use custom settings.

Under Additional configurations, for Metastore configuration, select Use AWS Glue Data Catalog as metastore, then select Use Lake Formation for fine-grained access control.
Choose Create and start application.

On the application details page, on the Job runs tab, choose Submit job run.

Under Job details, provide the following information:
1. For Name, enter a name.
2. For Runtime role, choose Create new role.
3. Note the IAM role that gets created.
4. For Script location, enter the S3 bucket location created by the CloudFormation template (the script is emr-serverless-query-script.py).
Choose Submit job run.
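The same job submission could be scripted with the EMR Serverless `StartJobRun` API. This is a minimal sketch under stated assumptions: the application ID, role ARN, and bucket name are hypothetical placeholders (only the script file name comes from this post), and the API call lives in a helper that requires boto3 and consumer-account credentials.

```python
def build_job_run_request(
    application_id: str, runtime_role_arn: str, script_uri: str, name: str
) -> dict:
    """Build the arguments for an EMR Serverless Spark job run."""
    return {
        "applicationId": application_id,
        "executionRoleArn": runtime_role_arn,
        "name": name,
        "jobDriver": {"sparkSubmit": {"entryPoint": script_uri}},
    }


def submit_job_run(request: dict) -> str:
    """Start the job and return its run ID (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    response = boto3.client("emr-serverless").start_job_run(**request)
    return response["jobRunId"]


# Hypothetical identifiers; the script name is the one created by the template.
job_request = build_job_run_request(
    application_id="00example1234567",
    runtime_role_arn="arn:aws:iam::111122223333:role/example-emr-serverless-runtime",
    script_uri="s3://example-consumer-bucket/emr-serverless-query-script.py",
    name="hive-federation-query",
)
print(job_request)
```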

Add the following AWS Glue access policy to the IAM role created in the previous step (provide your Region and the account ID of your catalog account):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:CreateDatabase",
                "glue:GetDatabases",
                "glue:CreateTable",
                "glue:GetTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:CreatePartition",
                "glue:BatchCreatePartition",
                "glue:GetUserDefinedFunctions"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:123456789012:catalog",
                "arn:aws:glue:us-east-1:123456789012:database/*",
                "arn:aws:glue:us-east-1:123456789012:table/*/*"
            ]
        }
    ]
}

Add the following Lake Formation access policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "LakeFormation:GetDataAccess",
            "Resource": "*"
        }
    ]
}
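Because both policies need your Region and catalog account ID filled in, it can be convenient to generate them. The following sketch builds the two documents from those inputs and serializes them with `json.dumps`, which guarantees valid JSON. The Region and account ID shown are hypothetical, and the 12-digit check is an added safeguard rather than part of the original steps.

```python
import json

GLUE_ACTIONS = [
    "glue:GetDatabase", "glue:CreateDatabase", "glue:GetDatabases",
    "glue:CreateTable", "glue:GetTable", "glue:UpdateTable",
    "glue:DeleteTable", "glue:GetTables", "glue:GetPartition",
    "glue:GetPartitions", "glue:CreatePartition",
    "glue:BatchCreatePartition", "glue:GetUserDefinedFunctions",
]


def build_glue_policy(region: str, catalog_account_id: str) -> dict:
    """Glue access policy scoped to the catalog account's Data Catalog."""
    if not (catalog_account_id.isdigit() and len(catalog_account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    prefix = f"arn:aws:glue:{region}:{catalog_account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": GLUE_ACTIONS,
            "Resource": [f"{prefix}:catalog", f"{prefix}:database/*", f"{prefix}:table/*/*"],
        }],
    }


def build_lake_formation_policy() -> dict:
    """Policy that lets the role request data access from Lake Formation."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "lakeformation:GetDataAccess",
            "Resource": "*",
        }],
    }


# Hypothetical Region and catalog account ID.
glue_policy = build_glue_policy("us-east-1", "444455556666")
lf_policy = build_lake_formation_policy()
print(json.dumps(glue_policy, indent=4))
print(json.dumps(lf_policy, indent=4))
```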

On the Databases page, select the database and on the Actions menu, choose Grant to grant Lake Formation access to the EMR Serverless runtime role.
Under Principals, select IAM users and roles and choose your role.
Under LF-Tags or catalog resources, select Named Data Catalog resources and choose your database.
Under Resource link permissions, for Resource link permissions, select Describe.
Choose Grant.

On the Databases page, select the database and on the Actions menu, choose Grant on target.

Provide the following information:
1. Under Principals, select IAM users and roles and choose your role.
2. Under LF-Tags or catalog resources, select Named Data Catalog resources and choose your database and table.
3. Under Table permissions, for Table permissions, select Select.
4. Under Data permissions, select All data access.
Choose Grant.

Submit the job again by cloning it.
When the job is complete, choose View logs.

The output should look like the following screenshot.

Data scientist use case

For this use case, a data scientist queries the data through SageMaker Studio. Complete the following steps:

Set up SageMaker Studio.
Confirm that the domain user role has been granted permission by Lake Formation to SELECT data from the table.
Follow steps similar to the batch job use case to grant access.

The following screenshot shows an example notebook.

Clean up

We recommend deleting the CloudFormation stacks after use, because the deployed resources will incur costs. There are no prerequisites to delete the producer, catalog, and consumer CloudFormation stacks. To delete the Hive metastore connector stack on the catalog account (serverlessrepo-GlueDataCatalogFederation-HiveMetastore), first delete the federated database you created.

Conclusion

In this post, we explained how to create a federated Hive metastore for deploying a data mesh architecture with multiple Hive data warehouses across EMR clusters.

By using Data Catalog metadata federation, organizations can construct a sophisticated data architecture. This approach not only seamlessly extends your Hive data warehouse but also consolidates access control and fosters integration with various AWS analytics services. Through effective data governance and meticulous orchestration of the data mesh architecture, organizations can provide data integrity, regulatory compliance, and enhanced data sharing across EMR clusters.

We encourage you to check out the features of the AWS Glue Hive metastore federation connector and explore how to implement a data mesh architecture across multiple EMR clusters. To learn more and get started, refer to the following resources:

About the Authors

Sudipta Mitra is a Senior Data Architect for AWS and is passionate about helping customers build modern data analytics applications by making innovative use of the latest AWS services and their constantly evolving features. He is a pragmatic architect who works backwards from customer needs, making customers comfortable with the proposed solution and helping them achieve tangible business outcomes. His main areas of work are Data Mesh, Data Lake, Knowledge Graph, Data Security and Data Governance.

Deepak Sharma is a Senior Data Architect with the AWS Professional Services team, specializing in big data and analytics solutions. With extensive experience in designing and implementing scalable data architectures, he collaborates closely with enterprise customers to build robust data lakes and advanced analytical applications on the AWS platform.

Nanda Chinnappa is a Cloud Infrastructure Architect with AWS Professional Services. Nanda specializes in Infrastructure Automation, Cloud Migration, Disaster Recovery and Databases, including Amazon RDS and Amazon Aurora. He helps AWS customers adopt the AWS Cloud and realize their business outcomes by executing cloud computing initiatives.

Young people receive their data from space and Astro Pi certificates

2024-06-10 Fergus Kirkpatrick

Post Syndicated from Fergus Kirkpatrick original https://www.raspberrypi.org/blog/young-people-receive-their-data-from-space-and-astro-pi-certificates/

Across Europe and beyond, teams of young people are receiving data from the International Space Station (ISS) this week. That’s because they participated in the annual European Astro Pi Challenge, the unique programme we deliver in collaboration with ESA Education to give kids the chance to write code that runs in space.

The Astro Pi computers inside the International Space Station.

In this round of Astro Pi, over 26,400 young people took part across its two missions — Mission Space Lab and Mission Zero — and had their programs run on the Raspberry Pi computers on board the ISS.

Mission Space Lab teams find out the speed of the ISS

In Mission Space Lab, we asked young people to team up and write code to collect data on the ISS and calculate the speed at which the ISS is travelling. 236 teams wrote programs that passed all our tests and achieved flight status to run in space. And not only will the Mission Space Lab teams receive their participation certificates this week — they’ll also receive the data their programs captured on the ISS.

A picture of the Himalayas taken from space by the Astro Pi computers.

Many teams chose a feature extraction method to calculate the ISS’s speed, identifying two points on Earth from which to calculate the distance the ISS travelled over time. Using this method means using the high-quality camera on the Astro Pi computer to take some fantastic photos of Earth from the ISS’s World Observation Research Facility (WORF) window. Teams will receive these photos soon, which are unique views of Earth from space.

Feature extraction between two images.
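The arithmetic behind this method can be sketched in a few lines of Python. This is a minimal illustration, not the code any team flew: it assumes the feature matching step has already produced pairs of pixel coordinates for the same points on Earth in two photos, and the function names, the metres-per-pixel scale, and the time gap are all hypothetical inputs you would need to supply.

```python
import math


def mean_pixel_distance(points_1, points_2):
    """Average straight-line distance, in pixels, between matched points.

    points_1 and points_2 are equal-length lists of (x, y) pixel
    coordinates for the same ground features in photo 1 and photo 2.
    """
    if not points_1 or len(points_1) != len(points_2):
        raise ValueError("need two non-empty lists of matched points")
    total = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(points_1, points_2)
    )
    return total / len(points_1)


def estimate_speed_kmps(points_1, points_2, metres_per_pixel, seconds_between_photos):
    """Estimate speed in km/s from how far features moved between two photos.

    metres_per_pixel is an assumed image scale (how much ground one pixel
    covers); it depends on the camera, the lens, and the altitude.
    """
    distance_m = mean_pixel_distance(points_1, points_2) * metres_per_pixel
    return distance_m / seconds_between_photos / 1000


# Hypothetical example: two features that each moved 500 pixels in 10 seconds,
# with an assumed scale of 100 metres per pixel.
first_photo = [(0, 0), (100, 100)]
second_photo = [(300, 400), (400, 500)]
speed = estimate_speed_kmps(first_photo, second_photo, 100, 10)
```

Note that this measures how fast the ground appears to move under the camera. Because the ISS orbits above the surface, the distance it covers along its orbit is slightly longer than the matching distance on the ground, so a fuller solution would scale the result up accordingly.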

How fast does the ISS travel?

At its normal altitude, the ISS travels at about 7.66 km/s. Its speed varies with altitude, but the ISS’s boosters fire up if it dips too low.
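To see how altitude and speed are linked, here is a rough sketch using the standard formula for the speed of a circular orbit, v = sqrt(GM / (R + h)). The altitudes in the loop are assumptions chosen for illustration (the post does not state the ISS's altitude), and the real orbit is not perfectly circular.

```python
import math

GM_EARTH = 3.986004418e14   # Earth's gravitational parameter, in m^3/s^2
EARTH_RADIUS_M = 6_371_000  # mean radius of Earth, in metres


def circular_orbit_speed_kmps(altitude_m):
    """Speed in km/s of a circular orbit at altitude_m above Earth's surface."""
    orbit_radius_m = EARTH_RADIUS_M + altitude_m
    return math.sqrt(GM_EARTH / orbit_radius_m) / 1000


# Assumed altitudes in kilometres, for illustration only.
for altitude_km in (380, 400, 420):
    speed = circular_orbit_speed_kmps(altitude_km * 1000)
    print(f"{altitude_km} km: {speed:.2f} km/s")
```

For altitudes around 400 km this gives values within a few hundredths of the 7.66 km/s figure, and it shows that a lower orbit means a slightly higher speed.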

To help teams write programs that can adapt to some of these variances, and to show them the type of data they can collect, we gave them a programming tool we call Astro Pi Replay. Using this tool, teams can simulate how their program would run on the Astro Pi computers up in space.

The International Space Station orbiting Earth.

This is the first time we asked Mission Space Lab teams to focus on a particular scientific question. So how did they do? The graph below shows some of the speeds that teams’ programs estimated.

A graph showing the range of speeds calculated by Mission Space Lab teams.

As you can see, a variety of speeds were estimated, but the average is fairly close to the ISS’s actual speed. Teams did a great job trying to solve the question and working like real space scientists. Once they receive their data this week, they can check how accurate their speed estimate was.
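Checking an estimate against the actual speed comes down to a percentage error. The sketch below shows one way to do it; the list of estimates is made up for illustration and is not data from any team.

```python
from statistics import mean

ISS_SPEED_KMPS = 7.66  # the speed quoted in this post


def percent_error(estimate_kmps, actual_kmps=ISS_SPEED_KMPS):
    """How far an estimate is from the actual speed, as a percentage."""
    return abs(estimate_kmps - actual_kmps) / actual_kmps * 100


# Made-up estimates in km/s, for illustration only.
sample_estimates = [6.9, 7.4, 7.8, 8.3]
average = mean(sample_estimates)

for estimate in sample_estimates:
    print(f"{estimate} km/s is off by {percent_error(estimate):.1f}%")
print(f"average {average:.2f} km/s is off by {percent_error(average):.1f}%")
```

With these example numbers, individual estimates miss by several percent while their average lands much closer, which is the same pattern the graph suggests.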

Mission Zero pixel art lights up astronauts’ daily tasks

In Astro Pi Mission Zero, a coding activity suitable for beginners, 16,039 teams of young people created code to make pixel art inspired by nature. Nearly half (44%) of the 24,409 participants were girls! 15,942 of the Mission Zero teams had their code run on the ISS after we checked that it followed the rules.

Mission Zero Submissions

Every team whose program ran on the ISS — with their pixel art showing for the astronauts to see as they worked — will receive certificates with the time, date, and location coordinates of their Mission Zero run.

We’ve been so impressed with this year’s pixel art creations that we’ve picked some as new examples for next year’s Mission Zero coding guide. That means young people will be able to choose one of a few pixel images to start with and recreate or remix them for their program. More info on that is coming soon. Sign up to the Astro Pi newsletter so you don’t miss it.

Let’s get ready for September

Thank you and congratulations to everyone who took part in the missions this year, and our special thanks to all the amazing educators who ran Astro Pi activities with young people.

The south of Italy photographed from space by the Astro Pi computers.

For us, there is much to reflect on and celebrate from this year’s challenge. We’ve had the chance to run Mission Zero with young people in person and identify a few changes to help make the activity easier. As Mission Space Lab now involves simulating programs running on the ISS with our new Astro Pi Replay tool, we’ll be exploring how to improve this as well.

We hope to engage lots of previous and new participants in the Astro Pi Challenge when it starts up again in September. Sign up for the newsletter on astro-pi.org to be the first to hear about the new round.

The post Young people receive their data from space and Astro Pi certificates appeared first on Raspberry Pi Foundation.

Fala: America’s Number 1 Dog.

2024-06-10 The History Guy: History Deserves to Be Remembered

Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=DdP9D7g4usE

Exploiting Mistyped URLs

2024-06-10 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/06/exploiting-mistyped-urls.html

Interesting research: “Hyperlink Hijacking: Exploiting Erroneous URL Links to Phantom Domains”:

Abstract: Web users often follow hyperlinks hastily, expecting them to be correctly programmed. However, it is possible those links contain typos or other mistakes. By discovering active but erroneous hyperlinks, a malicious actor can spoof a website or service, impersonating the expected content and phishing private information. In “typosquatting,” misspellings of common domains are registered to exploit errors when users mistype a web address. Yet, no prior research has been dedicated to situations where the linking errors of web publishers (i.e. developers and content contributors) propagate to users. We hypothesize that these “hijackable hyperlinks” exist in large quantities with the potential to generate substantial traffic. Analyzing large-scale crawls of the web using high-performance computing, we show the web currently contains active links to more than 572,000 dot-com domains that have never been registered, what we term ‘phantom domains.’ Registering 51 of these, we see 88% of phantom domains exceeding the traffic of a control domain, with up to 10 times more visits. Our analysis shows that these links exist due to 17 common publisher error modes, with the phantom domains they point to free for anyone to purchase and exploit for under $20, representing a low barrier to entry for potential attackers.
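For publishers who want to audit their own pages for this kind of error, a rough sketch of the idea follows. It is not the paper's method: it only collects the hostnames that a page's links point to and reports the ones a caller-supplied check says are unregistered. The `is_registered` callback is a hypothetical placeholder; a real check would need a DNS or registry lookup, which is left out here.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkHostCollector(HTMLParser):
    """Collects the hostnames of all absolute links in <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative links and mailto: links have no hostname.
                host = urlsplit(value).hostname
                if host:
                    self.hosts.add(host)


def find_unregistered_hosts(html, is_registered):
    """Return the sorted hostnames linked from html that fail is_registered.

    is_registered is a hypothetical callable taking a hostname and
    returning True when the domain is known to be registered.
    """
    collector = LinkHostCollector()
    collector.feed(html)
    return sorted(host for host in collector.hosts if not is_registered(host))


# Example with a stubbed registration check and a deliberately mistyped link.
page = (
    '<a href="https://example.com/a">good</a>'
    '<a href="https://exmaple.com/b">typo</a>'
    '<a href="/local">relative</a>'
)
known_domains = {"example.com"}
suspects = find_unregistered_hosts(page, lambda host: host in known_domains)
```

The sketch compares full hostnames, so a production version would also need to reduce each hostname to its registrable domain before checking it.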