5 Ways Event Notifications Strengthens Your Backup Strategy Automatically

2024-12-19 David Johnson

Post Syndicated from David Johnson original https://www.backblaze.com/blog/5-ways-event-notifications-strengthens-your-backup-strategy-automatically/

“Our backups are good, right?”

If you’re responsible for backup operations, you’ve probably heard this question more times than you can count. While the answer should be a simple “yes,” staying on top of backup activities often involves checking multiple systems, reviewing logs, and maintaining manual tracking processes.

Today, I’m sharing five ways you can implement Backblaze Event Notifications into your data protection strategy to keep you and your team informed. If you’re interested in Event Notifications for other use cases, check out our posts for media production and application workflows.

Event Notifications for IT backup: Simplified automation

Event Notifications monitors your B2 Cloud Storage buckets for data changes that you designate—like completed backups, file deletions, or policy violations—and delivers real-time alerts where you want them. These alerts can trigger automated actions in any system that accepts webhooks, from PagerDuty to Zendesk to Slack channels and more.

Think of it as your storage system’s notification service: instead of discovering changes during routine recovery verification checks, you get instant awareness when something happens to the data in your buckets.

What are webhooks?

Webhooks, if you’re not familiar, are a way for applications to communicate with each other by automatically sending data when specific events occur, typically as HTTP POST requests with a JSON payload. What sets Backblaze Event Notifications apart is that it works with any service that accepts webhooks. This means you can integrate backup monitoring into your existing tools and processes, rather than being locked into specific vendors’ ecosystems.
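To make the idea concrete, here is a minimal Python sketch of what a webhook receiver might do with the JSON body of an incoming POST. The payload shape (an `events` list with `eventType`, `bucketName`, and `objectName` fields) and the sample values are assumptions for illustration; check the Event Notifications documentation for the actual schema.

```python
import json


def summarize_events(body: bytes) -> list[str]:
    """Turn a webhook JSON body into one readable line per event."""
    payload = json.loads(body)
    lines = []
    # Assumed shape: {"events": [{"eventType": ..., "bucketName": ..., "objectName": ...}]}
    for event in payload.get("events", []):
        lines.append(
            f'{event.get("eventType", "unknown")}: '
            f'{event.get("bucketName", "?")}/{event.get("objectName", "?")}'
        )
    return lines


# Hypothetical sample payload, standing in for the body of an HTTP POST.
sample = json.dumps(
    {
        "events": [
            {
                "eventType": "b2:ObjectCreated:Upload",
                "bucketName": "nightly-backups",
                "objectName": "db/2024-12-19.tar.gz",
            }
        ]
    }
).encode()

print(summarize_events(sample))
```

In a real deployment this function would sit behind whatever HTTP framework or automation tool receives the POST request.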

5 ways to stay in the know with your backup strategy

Here are specific, practical ways you can take advantage of Event Notifications for immediate benefits to your backup and archive workflows.

1. Backup verification and reporting

When your backup software writes files to B2 Cloud Storage, Event Notifications helps verify successful completion of backup jobs. Each time a backup file lands in a bucket, you’ll receive a notification with key details like file size, timestamp, and backup job name. By feeding this data directly into communication tools like Slack, you can maintain comprehensive logs of backup activity without manual checks.

Backup monitoring workflow

Gone are the days of discovering backup issues hours or days later during routine reviews: you’ll know exactly when backups are uploaded. You can configure custom alerts for backup size thresholds, receive immediate confirmation of successful backups, and, with the help of Zapier, raise an alert when Event Notifications does not trigger within a specified window, indicating that a backup was not uploaded.
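The two checks described above, a size threshold and a "no upload within the window" alert, can be sketched in a few lines of Python. This is a hypothetical illustration rather than a Backblaze or Zapier feature: the function name, the 24-hour window, and the 1 MB minimum are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def check_backups(uploads, now, window=timedelta(hours=24), min_bytes=1_000_000):
    """Return alert strings for a list of (timestamp, size_bytes) upload events."""
    # Keep only uploads that happened inside the monitoring window.
    recent = [(ts, size) for ts, size in uploads if timedelta(0) <= now - ts <= window]
    if not recent:
        return ["no backup uploaded within the window"]
    # Flag uploads that are suspiciously small.
    return [
        f"backup at {ts.isoformat()} is only {size} bytes"
        for ts, size in recent
        if size < min_bytes
    ]


now = datetime(2024, 12, 19, 12, 0, tzinfo=timezone.utc)
print(check_backups([(now - timedelta(hours=2), 5_000_000)], now))  # healthy backup: prints []
print(check_backups([], now))  # nothing uploaded: one alert
```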

2. Security and compliance monitoring

Event Notifications can help protect your backup data from unauthorized changes. Security teams can establish automated alerts for suspicious activities like mass deletions or modifications. These alerts integrate with your existing security information and event management (SIEM) systems to provide unified threat monitoring.

Security alert workflow

Beyond threat detection, Event Notifications enables preemptive policy enforcement. Teams can configure automatic notifications that guide employees when their actions might conflict with backup policies, such as modifying file names, moving files, or even deleting them. For persistent policy conflicts, managers can receive automated escalation alerts to address potential training needs or process gaps. This systematic approach helps maintain backup integrity through education and awareness before issues occur, rather than just detecting violations after the fact.
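One way to detect a mass deletion is a sliding-window counter over deletion events. The sketch below is a hypothetical example: the class name, the threshold of 50 deletions, and the 60-second window are assumptions you would tune for your own environment.

```python
from collections import deque


class DeletionMonitor:
    """Flags bursts of deletion events inside a sliding time window."""

    def __init__(self, max_deletes=50, window_seconds=60):
        self.max_deletes = max_deletes
        self.window_seconds = window_seconds
        self._times = deque()

    def record(self, timestamp: float) -> bool:
        """Record one deletion; return True if the threshold is exceeded."""
        self._times.append(timestamp)
        # Drop deletions that have fallen out of the window.
        while self._times and timestamp - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.max_deletes


monitor = DeletionMonitor(max_deletes=3, window_seconds=10)
print([monitor.record(t) for t in (0, 1, 2, 3)])  # the fourth deletion trips the alert
```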

3. Storage management automation

Storage management becomes more efficient when Event Notifications feeds activity data directly to your management tools. As files are uploaded to and removed from your buckets over time, Event Notifications provides valuable data that helps you analyze storage utilization trends and backup data growth patterns.

Data usage monitoring workflow

This constant flow of information empowers teams to anticipate capacity needs and optimize resource allocation. Moving from reactive to proactive storage management helps control costs by notifying you when backups become larger on average.
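As a rough illustration of spotting growth in backup size, the following sketch compares the mean size of the most recent backups with the mean of the earlier ones. The function name and the seven-backup window are assumptions, and the sizes would come from whatever store you log notification data to.

```python
from statistics import mean


def size_growth(sizes, recent=7):
    """Ratio of the mean of the last `recent` sizes to the mean of earlier sizes.

    Returns None when there is not enough history to compare.
    """
    if len(sizes) <= recent:
        return None
    return mean(sizes[-recent:]) / mean(sizes[:-recent])


# Hypothetical backup sizes in GB, oldest first.
history = [100, 100, 200, 200]
print(size_growth(history, recent=2))  # prints 2.0: recent backups are twice as large
```

A ratio well above 1.0 could feed an alert that prompts a capacity or cost review.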

4. Cross-bucket backup monitoring

Organizations using Cloud Replication or managing backups across multiple buckets gain valuable oversight through Event Notifications. This capability tracks file replication between regions and monitors backup activity across your entire footprint, giving you a comprehensive view of your distributed backup strategy. Teams can spot replication delays or issues immediately, rather than waiting for scheduled status checks.

Cloud Replication notification workflow

Understanding how data moves and grows across different locations ensures your distributed backup strategy performs as designed. Event Notifications makes it possible to track successful replications, monitor consistency between primary and replica buckets, and receive immediate alerts about any issues. This visibility is especially valuable for organizations maintaining geographic redundancy or managing complex multi-site backup strategies.
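A simple consistency check between a primary and a replica bucket could compare the object names seen in each bucket's notifications. This is a sketch under assumptions: the `objectName` field and the idea of collecting events per bucket are illustrative, not a documented workflow.

```python
def replication_gaps(primary_events, replica_events):
    """Return object names seen in the primary bucket but not yet in the replica."""
    replicated = {event["objectName"] for event in replica_events}
    return sorted(
        event["objectName"]
        for event in primary_events
        if event["objectName"] not in replicated
    )


primary = [{"objectName": "a.tar"}, {"objectName": "b.tar"}]
replica = [{"objectName": "a.tar"}]
print(replication_gaps(primary, replica))  # prints ['b.tar']
```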

5. Integration with IT workflows

Event Notifications connects seamlessly with existing IT tools and processes through standard webhooks. Backup events can automatically flow into ticketing systems like Jira Service Management, monitoring dashboards like Grafana, or team communication channels like Microsoft Teams and Mattermost. This integration means teams can manage backup operations through familiar tools and processes, without needing to constantly switch between different interfaces or learn new systems.

Data integration workflow

The result is streamlined operations without the need for separate backup monitoring systems, ensuring backup activities receive proper attention within normal IT procedures. Teams can create ServiceNow tickets for failed backups, update Jira boards with backup status, or send notifications to Teams channels—all automatically and in real-time.
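For example, forwarding an event to a chat channel mostly comes down to reshaping the payload. Slack incoming webhooks accept a JSON body with a `text` field; the event fields used here (`eventType`, `objectName`, `objectSize`) are assumptions about the notification schema, and the function only builds the message rather than sending it.

```python
import json


def to_slack_message(event: dict) -> str:
    """Build the JSON body for a Slack incoming webhook from one event."""
    size_mb = event.get("objectSize", 0) / 1_000_000
    text = (
        f'Backup event {event.get("eventType", "unknown")}: '
        f'{event.get("objectName", "?")} ({size_mb:.1f} MB)'
    )
    return json.dumps({"text": text})


# Hypothetical event for illustration.
event = {
    "eventType": "b2:ObjectCreated:Upload",
    "objectName": "db.tar.gz",
    "objectSize": 2_500_000,
}
print(to_slack_message(event))
```

The same pattern applies to ticketing tools: map the event fields onto whatever body that tool's webhook expects.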

Why Event Notifications makes sense for backup teams

Managing backup operations has traditionally meant juggling multiple monitoring tools and hoping you catch issues before they impact recovery capabilities. Event Notifications transforms this approach by providing:

Automated awareness: Replace manual checks with instant visibility into bucket changes.
Enhanced security: Track backup data access and modifications as they happen.
Simplified monitoring: Feed backup activity data directly to your management tools.
Better operations: Free up time to focus on improving backup strategies rather than monitoring them.
Flexible integration: Adapt backup monitoring to fit your existing processes, not the other way around.

How it works with your environment

Unlike traditional backup monitoring solutions that often require specific software for notification handling, Event Notifications works with any service that accepts webhooks. This fundamental difference means you aren’t locked into specific vendors’ ecosystems or forced to use particular monitoring tools.

Event Notifications is designed for reliability with at-least-once delivery, ensuring critical backup events are never missed. This reliability is especially important for teams building automated workflows that require consistency and transparency in their backup monitoring.
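At-least-once delivery means the same event can occasionally arrive more than once, so receivers that trigger actions (such as opening tickets) usually deduplicate. Here is a minimal in-memory sketch; it assumes each event carries a unique identifier, and a production system would persist the seen IDs and expire old ones.

```python
class EventDeduplicator:
    """Remembers event IDs so that repeated deliveries are processed once."""

    def __init__(self):
        self._seen = set()

    def is_new(self, event_id: str) -> bool:
        """Return True the first time an ID is seen, False on repeats."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        return True


dedup = EventDeduplicator()
for event_id in ["evt-1", "evt-2", "evt-1"]:  # hypothetical IDs; the last is a redelivery
    if dedup.is_new(event_id):
        print("processing", event_id)
```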

The pricing model is straightforward and predictable: Backblaze B2 Reserve customers receive unlimited notifications at no additional cost, while pay-as-you-go customers get 2,500 notifications free each day and pay just $0.004 per 10,000 additional calls. This transparent pricing applies regardless of which services you’re connecting to, enabling teams to build comprehensive backup monitoring without worrying about unpredictable costs.
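The pay-as-you-go arithmetic can be worked through in a short sketch. It uses the figures quoted above (2,500 free notifications per day, $0.004 per 10,000 additional calls) and assumes, for simplicity, that partial blocks of 10,000 are billed pro rata; actual billing granularity may differ.

```python
def daily_notification_cost(
    notifications: int,
    free_per_day: int = 2_500,
    price_per_10k: float = 0.004,
) -> float:
    """Estimated daily cost in USD, assuming pro-rata billing past the free tier."""
    billable = max(0, notifications - free_per_day)
    return billable / 10_000 * price_per_10k


print(daily_notification_cost(2_000))             # under the free tier: 0.0
print(f"{daily_notification_cost(102_500):.4f}")  # 100,000 billable calls: 0.0400
```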

Ready to automate your backup monitoring?

If you’re working with a Backblaze account manager, Event Notifications is already enabled; just ask them for setup guidance. Other existing customers can contact our Support team to request access.

New to Backblaze? Contact our Sales team to learn how Event Notifications can strengthen your backup operations.

Once enabled, visit the Event Notifications section in your B2 Cloud Storage buckets to configure your alerts. For detailed setup instructions and best practices, check out our Event Notifications documentation.

The post 5 Ways Event Notifications Strengthens Your Backup Strategy Automatically appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Fall 2024 SOC 1, 2, and 3 reports now available with 183 services in scope

2024-12-19 Paul Hong

Post Syndicated from Paul Hong original https://aws.amazon.com/blogs/security/fall-2024-soc-1-2-and-3-reports-now-available-with-183-services-in-scope/

We continue to expand the scope of our assurance programs at Amazon Web Services (AWS) and are pleased to announce that the Fall 2024 System and Organization Controls (SOC) 1, 2, and 3 reports are now available. The reports cover 183 services over the 12-month period from October 1, 2023 to September 30, 2024, so that customers have a full year of assurance with the reports. These reports demonstrate our continuous commitment to adhere to the heightened expectations for cloud service providers.

Going forward, we will issue SOC reports covering a 12-month period each quarter as follows:

Report	Period covered
Spring SOC 1, 2, and 3	April 1–March 31
Summer SOC 1	July 1–June 30
Fall SOC 1, 2, and 3	October 1–September 30
Winter SOC 1	January 1–December 31

Customers can download the Fall 2024 SOC 1, 2, and 3 reports through AWS Artifact, a self-service portal for on-demand access to AWS compliance reports. Sign in to AWS Artifact in the AWS Management Console, or learn more at Getting Started with AWS Artifact.

AWS strives to continuously bring services into the scope of its compliance programs to help you meet your architectural and regulatory needs. If you have questions or feedback about SOC compliance, reach out to your AWS account team.

To learn more about our compliance and security programs, see AWS Compliance Programs. As always, we value your feedback and questions; reach out to the AWS Compliance team through the Contact Us page.


Ultimate Dash Cam Comparison 2024: Viofo, Vantrue, Wolfbox, and more.

2024-12-19 The Hook Up

Post Syndicated from The Hook Up original https://www.youtube.com/watch?v=HswpRHu_X-k

After Stopping Sarafov

2024-12-19 Bozho

Post Syndicated from Bozho original https://blog.bozho.net/blog/4439

Stopping the procedure for electing a prosecutor general is the most urgent priority. Not even Geshev, the “loose cannon,” allowed himself such salvos at parliament.

Today we made progress on forming the committees, after adopting the rules of procedure yesterday. And I am moderately optimistic that there is a will in parliament to stop this procedure for electing a prosecutor general.

The question after that will be “now what?” The Supreme Judicial Council (SJC) will not give up on installing some next Sarafov. That is why the bill we submitted matters beyond the election of Sarafov. It significantly improves the rules and halts any election procedure if the National Assembly begins replacing the members of the SJC. But to take advantage of this change, a new SJC has to be elected, and that requires a lasting parliament. That is, a government.

After that, the prosecution service will not suddenly become Western European, effective, objective, and accountable. Many things will need to be fixed: in the Judiciary System Act, in the Criminal Procedure Code, and in the Special Intelligence Means Act. All of these are reforms that also require a lasting parliament.

With every paragraph, Peevski will lose instruments of influence. For example, if we introduce random case assignment for pretrial restraint measures and for authorizations of special intelligence means, it will suddenly turn out that Peevski has no guarantee that his prosecutors will be able to commit procedural abuse, sanctioned by the administrative heads he appointed and their deputies in the courts.

This and many other measures are part of the anti-corruption program we presented during the campaign. And we are trying to secure time and consensus for implementing it. Because otherwise they will simply find and install the next Sarafov.

The post After Stopping Sarafov was first published on БЛОГодаря.

Security updates for Thursday

2024-12-19 jzb

Post Syndicated from jzb original https://lwn.net/Articles/1002903/

Security updates have been issued by AlmaLinux (bluez, edk2:20220126gitbb1bba3d77, gstreamer1-plugins-base, gstreamer1-plugins-good, kernel, kernel-rt, mpg123, php:8.2, python3.11-urllib3, and tuned), Fedora (ColPack, glibc, golang-github-chainguard-dev-git-urls, golang-github-task, icecat, python-nbdime, python3.13, and python3.14), Mageia (kernel, kmod-xtables-addons, kmod-virtualbox, dwarves and kernel-linus), Red Hat (gstreamer1-plugins-base and gstreamer1-plugins-good), SUSE (curl, emacs, git-bug, glib2, helm, kernel, and traefik2), and Ubuntu (gst-plugins-base1.0, gst-plugins-good1.0, gstreamer1.0, libvpx, linux-gcp, phpunit, and yara).

Mailbox Insecurity

2024-12-19 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/12/mailbox-insecurity.html

It turns out that all cluster mailboxes in the Denver area have the same master key. So if someone robs a postal carrier, they can open any mailbox.

I get that a single master key makes the whole system easier, but it’s very fragile security.

HEMA accelerates their data governance journey with Amazon DataZone

2024-12-19 Luis Campos

Post Syndicated from Luis Campos original https://aws.amazon.com/blogs/big-data/hema-accelerates-their-data-governance-journey-with-amazon-datazone/

This post is cowritten by Tommaso Paracciani and Oghosa Omorisiagbon from HEMA.

Data has become an invaluable asset for businesses, offering critical insights to drive strategic decision-making and operational optimization. However, many companies today still struggle to effectively harness and use their data due to challenges such as data silos, lack of discoverability, poor data quality, and a lack of data literacy and analytical capabilities to quickly access and use data across the organization. To address these growing data management challenges, AWS customers are using Amazon DataZone, a data management service that makes it fast and effortless to catalog, discover, share, and govern data stored across AWS, on-premises, and third-party sources.

HEMA has been a household Dutch retail brand since 1926, providing daily convenience products with a unique design. HEMA’s more than 17,000 employees bring exclusive, sustainably designed products to more than 750 stores in the Netherlands, Belgium, Luxembourg, France, Germany, and Austria, with webstores available in all these countries. HEMA built its first ecommerce system on AWS in 2018, and 5 years later, its developers have the freedom to innovate and build software fast with their choice of tools in the AWS Cloud. Today, this is powering every part of the organization, from the customer-favorite online cake customization feature to democratizing data to drive business insight.

This post describes how HEMA used Amazon DataZone to build their data mesh and enable streamlined data access across multiple business areas. It explains HEMA’s unique journey of deploying Amazon DataZone, the key challenges they overcame, and the transformative benefits they have realized since deployment in May 2024. From establishing an enterprise-wide data inventory and improving data discoverability, to enabling decentralized data sharing and governance, Amazon DataZone has been a game changer for HEMA.

Data landscape at HEMA

After HEMA moved its entire data platform from on premises to the AWS Cloud, the wave of change presented a unique opportunity for the HEMA Data & Cloud function to invest in and commit to building a data mesh.

HEMA has a bespoke enterprise architecture, built around the concept of services. These services are individual software functionalities that fulfill a specific purpose within the company. Each service is hosted in a dedicated AWS account and is built and maintained by a product owner and a development team, as illustrated in the following figure.

HEMA runs over 400 services, and 20 of them run extract, transform, and load (ETL) pipelines with dedicated data resources, which produce and consume data assets shared across the data mesh.

Data management in a data mesh

Weeks after launch, HEMA’s data platform wasn’t where the company wanted it to be. Building an agile organization that runs on reliable and streamlined processes was the primary goal. Initially, the data inventories of different services were siloed within isolated environments, making data discovery and sharing across services manual and time-consuming for all teams involved.

Implementing robust data governance is challenging. In a data mesh architecture, this complexity is amplified by the organization’s decentralized nature. In this context, HEMA concluded that data governance was no longer a nice-to-have, but had become a foundational piece required to build a healthy data organization.

Why HEMA selected Amazon DataZone

By exploring the preview, HEMA saw how Amazon DataZone covered all the critical pillars of data management in a single solution. It was clear how Amazon DataZone would benefit both the technical teams and the business end-users. The technical organization could take advantage of a robust programmatic solution to manage the availability, accessibility, and quality of the data assets that make up the enterprise data catalog. The business end-users were given a tool to discover data assets produced within the mesh and seamlessly self-serve on their data sharing needs.

Features such as AI-generated metadata were key to providing end-users with reliable and use case-driven explanations of what a certain data product could provide and solve, while the subscription feature allowed them to start using a certain data asset within their own environment in a matter of seconds, as opposed to the existing lengthy and human-driven process.

These reasons, as well as the self-service capabilities, resulted in HEMA’s decision to adopt and roll out Amazon DataZone at the enterprise level.

Solution overview

The HEMA data landscape is multifaceted, with various teams across the organization using a range of technologies and systems, including Databricks. To effectively govern this complex data environment, HEMA has adopted a data mesh architecture on AWS. This architecture maintains a central intelligence platform (CIP) that enables the activities of both data producers and data consumers by providing the necessary platform and infrastructure. The overall structure can be represented in the following figure.

Each service uses two AWS accounts, one for pre-production and one for production. This separation means changes can be tested thoroughly before being deployed to live operations.

Amazon DataZone is the central piece in this architecture. It helps HEMA centralize all data assets across disparate data stacks into a single catalog. It plays a pivotal role in bridging the gap and integrating different systems, such as Databricks and native AWS services. The integration of Databricks Delta tables into Amazon DataZone is done using the AWS Glue Data Catalog. Delta tables’ technical metadata is stored in the Data Catalog, which is a native source for creating assets in the Amazon DataZone business catalog. Access control is enforced using AWS Lake Formation, which manages fine-grained access control and data sharing on data lake data. The following figure illustrates the data mesh architecture.

The Amazon DataZone implementation follows the same approach as individual services: HEMA maintains two distinct domain data catalogs: preprod-hema-data-catalog and prod-hema-data-catalog. These catalogs serve as the backbone for data sharing across pre-production and production accounts, allowing flexible access to data assets based on the environment’s needs.

The prod-hema-data-catalog is the production-grade catalog that supports data sharing across production services and, in some cases, pre-production services. This catalog only facilitates the production of data assets from production services (disallows publishing of assets belonging to pre-production services) and allows pre-production services to access production-grade data. The following diagram illustrates the architecture of both accounts.
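The production catalog rules described above can be expressed as two small predicates. This is only an illustration of the policy as stated in the text, not an Amazon DataZone API; the function names and the `"prod"` / `"preprod"` labels are hypothetical.

```python
def can_publish_to_prod_catalog(service_env: str) -> bool:
    """Only production services may publish assets to the production catalog."""
    return service_env == "prod"


def can_subscribe_in_prod_catalog(service_env: str) -> bool:
    """Both production and pre-production services may consume production data."""
    return service_env in {"prod", "preprod"}


for env in ("prod", "preprod"):
    print(env, can_publish_to_prod_catalog(env), can_subscribe_in_prod_catalog(env))
```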

To establish isolation between services in the data mesh, a project is dedicated to a unique service account. The environment profiles and environments are configured to be explicitly used only by the service. This Amazon DataZone configuration is managed centrally by the core team using AWS CloudFormation. After projects are created and configured by the central team, project teams have access to self-service capabilities to create their own environments according to their needs.

The following diagram illustrates the full workflow for onboarding HEMA service teams in Amazon DataZone.

The workflow includes the following steps:

1. A service team (either a data producer or a data consumer) initiates a request to the core data platform team to enable data sharing for their service accounts. This request is typically made when a service team has a use case where they need to either publish data to the catalog (for other teams to consume) or access data that another team has published.
2. After the request is received, the core data platform team assesses the requirements and initiates the creation of projects and environments in Amazon DataZone. This is done using AWS CloudFormation and a continuous integration and delivery (CI/CD) pipeline. The core data platform team makes sure that the appropriate AWS account (pre-production or production) is linked to the environment within the project in the respective catalogs.
3. After the projects and environments are set up, service teams can use Amazon DataZone features to produce and consume data assets:
   a. Producers (for example, Service A) can publish their data assets to the Data Catalog and approve or reject subscription requests.
   b. Consumers (for example, Service B) can search and access these published data assets using the Amazon DataZone catalog and request data access through subscription requests.
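The publish and subscribe part of this workflow can be pictured with a toy in-memory model. It is a conceptual sketch only: the `Catalog` class and its methods are hypothetical and do not correspond to the Amazon DataZone API.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy model of a catalog with producer-approved subscriptions."""

    assets: dict = field(default_factory=dict)    # asset name -> producer
    requests: dict = field(default_factory=dict)  # (asset, consumer) -> status

    def publish(self, producer: str, asset: str) -> None:
        self.assets[asset] = producer

    def request_subscription(self, consumer: str, asset: str) -> None:
        if asset not in self.assets:
            raise KeyError(f"unknown asset: {asset}")
        self.requests[(asset, consumer)] = "pending"

    def decide(self, producer: str, asset: str, consumer: str, approve: bool) -> None:
        # Only the producer that published the asset may approve or reject.
        if self.assets.get(asset) != producer:
            raise PermissionError("only the publishing producer can decide")
        if (asset, consumer) not in self.requests:
            raise KeyError("no such subscription request")
        self.requests[(asset, consumer)] = "approved" if approve else "rejected"


catalog = Catalog()
catalog.publish("service-a", "sales_daily")
catalog.request_subscription("service-b", "sales_daily")
catalog.decide("service-a", "sales_daily", "service-b", approve=True)
print(catalog.requests[("sales_daily", "service-b")])  # prints approved
```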

In a decentralized data mesh environment, there is a risk of service teams creating resources in service accounts they are not authorized to manage, which may lead to governance issues and data mismanagement. To address this challenge, HEMA followed two principles:

Amazon DataZone project structure – Each project contains resources that are solely managed by the service team (project members) responsible for it. Each service team’s project provides a clear boundary for the resources they manage.
Environment isolation – The core teams enforce governance policies in the Amazon DataZone configuration, allowing teams to only deploy resources within their own environments.

Adoption plan: Strategy

In HEMA’s data mesh, the catalog must be built in collaboration with all the services that produce data, so the key for the central data governance team was ideating an adoption plan that would add value to these teams, rather than disrupting the delivery of their projects. With that in mind, HEMA’s adoption strategy was designed on three core principles:

Launch it – Do not wait until you can ship to production a full-scale service that covers every single feature available. Instead, define an MVP that solves the most critical need for the business and make it available for the business as soon as you can.
Prove value – HEMA’s data team ran several internal seminars, and dedicated presentations with each of the involved teams to showcase how Amazon DataZone would simplify their data sharing needs. Do not tell them they must invest time to learn and start using a new service, but rather let them get drawn in by the new advantages of the new functionality and stimulate self-adoption.
Be there – This connects with what HEMA as a company stands for. Be close to the teams when they need support during the adoption stage, like HEMA is close to their customers whenever they need a new product for their lives. Create space for Q&A and develop a collaborative experience for everyone in their adoption curve.

Adoption plan: Action points

While deploying the adoption plan for a decentralized data marketplace using Amazon DataZone, HEMA followed a “start small, fine-tune, and iterate” approach. In practice, this meant that the Data & Cloud team started working with one business unit and then expanded to several business units, while focusing on one single feature: data asset subscription. To increase interest and adoption, this process was introduced for the core data assets that were most used in the company.

After this part of the process was well understood and embraced by everyone, the next step was to start supporting the data pipeline adaptation work needed for each business unit.

Finally, when all teams were onboarded and familiar with the subscription feature, HEMA moved to introduce the business units to the second critical feature: data publishing. In summary, HEMA released new features and allowed the domains to pick up the implementation at their preferred pace before moving onto the next one.

When adoption reached the point where all core data assets were being consumed through the Amazon DataZone catalog, the Lake Formation resource links previously used to share data across accounts were decommissioned. At the same time, the Data & Cloud team stopped sharing data between business units on their behalf, encouraging peer-to-peer data sharing, where teams talk to each other directly without involving a third party.

Results

The popularity of Amazon DataZone across the enterprise ramped up quickly, and all the involved business units started using the service daily to self-serve their needs. The existence of a central data catalog enabled teams to seamlessly search, discover, share, and subscribe to data assets produced within the business. Only a few months after launching the service, HEMA observed stunning statistics:

Over 200 data assets published to the catalog
Over 180 active subscriptions
Over 100 active users monthly
Over 20 business units (services) onboarded
Data sharing average turnaround time cut from 4 working days to a few seconds, without the support of any other team

Additionally, they saw massive benefits that can’t be represented by statistics. Above all, the ability to autonomously discover data produced by other teams is enabling a series of new use cases for the business, which weren’t even visible to them earlier due to the lack of awareness and visibility on what others were producing. For example, the data science team quickly developed a new predictive model for sales by reusing data already available in Amazon DataZone, instead of rebuilding it from scratch. This is resulting in an energized data organization, which can collaborate and contribute to shaping the future of HEMA’s data operations.

Conclusion

At HEMA, Amazon DataZone made data governance a reality, and so the company wants to implement new features in close collaboration with AWS, while continuing to work on the rollout of items that are already in HEMA’s roadmap. The team is continuously developing the service, launching a series of new features that will continue to improve the data operations:

Data quality scores – This feature helps data producers monitor and optimize their data assets, while consumers can see upfront the nuances of a certain asset that they might be using or are looking to use within their ETL pipelines
Data lineage – This feature allows consumers and the central governance team to trace data sources, transformation stages, and observe cross-organizational usage of data assets
Fine-grained access control – This feature enables producers to be in full control of what they share with other units, making sure that only the relevant pieces of a data asset are shared with the consuming teams

The long-term vision of HEMA is clear: Amazon DataZone is set to become the central solution for data sharing and data cataloging across the enterprise. Although Amazon DataZone is currently focused on supporting the teams running ETL pipelines, the plan is to extend the service to all the business teams that work with data, with the ultimate goal of streamlining their daily operations. Data is one of the most valuable resources a company has, and HEMA is determined to democratize its role by building an efficient data organization that relies on the most advanced data governance solution on the market.

About the authors

Luis Campos is the Data & AI Governance GTM Lead for the EMEA market at AWS, where he helps customers with their data strategies, starting with strong data governance, and applies his expertise in end-to-end data and analytics management. Luis is also a public speaking coach, is based in the Netherlands, and has two boys born 18 years apart, which has taught him to see problems from both ends of a spectrum.

Vincent Gromakowski is a Principal Analytics Solutions Architect at AWS where he enjoys solving customers’ data challenges. He uses his strong expertise on analytics, distributed systems and resource orchestration platform to be a trusted technical advisor for AWS customers.

Tommaso is the Head of Data & Cloud Platforms at HEMA. He joined the business with the goal of modernising the Data Organization by building a cloud-based Data Platform, hosted on AWS, which would power a Data Mesh architecture. With a strong passion for both technical and organizational challenges, Tommaso leads the Solution Architecture efforts as well as all core Data Management and Data Governance initiatives, for which he is also a passionate public speaker. Outside the office, Tommaso is a full-time dad with a passion for traveling and sports.

Oghosa Omorisiagbon is a Senior Data Engineer at HEMA. He focuses on leveraging AWS-native tools to optimise data pipelines, modernise HEMA’s data infrastructure and introduce reliable and scalable end-to-end data architecture solutions. Outside of work, he enjoys traveling, playing video games and outdoor activities.

Accelerate queries on Apache Iceberg tables through AWS Glue auto compaction

2024-12-19 Navnit Shukla

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/accelerate-queries-on-apache-iceberg-tables-through-aws-glue-auto-compaction/

Data lakes were originally designed to store large volumes of raw, unstructured, or semi-structured data at a low cost, primarily serving big data and analytics use cases. Over time, as organizations began to explore broader applications, data lakes have become essential for various data-driven processes beyond just reporting and analytics. Today, they play a critical role in syncing with customer applications, enabling the ability to manage concurrent data operations while maintaining the integrity and consistency of information. This shift includes not only storing batch data but also ingesting and processing near real-time data streams, allowing businesses to merge historical insights with live data to power more responsive and adaptive decision-making. However, this new data lake architecture brings challenges around managing transactional support and handling the influx of small files generated by real-time data streams. Traditionally, customers addressed these challenges by performing complex extract, transform, and load (ETL) processes, which often led to data duplication and increased complexity in data pipelines. Additionally, to cope with the proliferation of small files, organizations had to develop custom mechanisms to compact and merge these files, leading to the creation and maintenance of bespoke solutions that were difficult to scale and manage. As data lakes increasingly handle sensitive business data and transactional workloads, maintaining strong data quality, governance, and compliance becomes vital to maintaining trust and regulatory alignment.

To simplify these challenges, organizations have adopted open table formats (OTFs) like Apache Iceberg, which provide built-in transactional capabilities and mechanisms for compaction. OTFs, such as Iceberg, address key limitations in traditional data lakes by offering features like ACID transactions, which maintain data consistency across concurrent operations, and compaction, which helps manage the issue of small files by merging them efficiently. By using features like Iceberg’s compaction, OTFs streamline maintenance, making it straightforward to manage object and metadata versioning at scale. However, although OTFs reduce the complexity of maintaining efficient tables, they still require some regular maintenance to make sure tables remain in an optimal state.

In this post, we explore new features of the AWS Glue Data Catalog, which now supports improved automatic compaction of Iceberg tables for streaming data, making it straightforward for you to keep your transactional data lakes consistently performant. Enabling automatic compaction on Iceberg tables reduces metadata overhead on your Iceberg tables and improves query performance. Many customers have streaming data continuously ingested in Iceberg tables, resulting in a large number of delete files that track changes in data files. With this new feature, when you enable the Data Catalog optimizer, it constantly monitors table partitions, runs the compaction process for both data and delta or delete files, and regularly commits partial progress. The Data Catalog also now supports heavily nested complex data and supports schema evolution as you reorder or rename columns.

Automatic compaction with AWS Glue

Automatic compaction in the Data Catalog makes sure your Iceberg tables are always in optimal condition. The data compaction optimizer continuously monitors table partitions and invokes the compaction process when specific thresholds for the number of files and file sizes are met. For example, based on the Iceberg table configuration of the target file size, the compaction process will start and continue if the table or any of the partitions within the table have more than the default configuration (for example 100 files), each smaller than 75% of the target file size.
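The threshold rule described above can be sketched as a small predicate. This is only an illustration of the stated rule, not the optimizer's actual code: the function name and parameters are hypothetical, the 100-file and 75% values are the examples quoted in the text, and 512 MB is used as the target because it is Iceberg's default for write.target-file-size-bytes.

```python
from typing import Iterable

MB = 1024 * 1024


def needs_compaction(
    file_sizes_bytes: Iterable[int],
    target_file_size_bytes: int = 512 * MB,  # Iceberg default target file size
    min_small_files: int = 100,              # example threshold from the text
    small_file_ratio: float = 0.75,          # "smaller than 75% of the target"
) -> bool:
    """Return True when a table or partition holds more small files than the threshold.

    A file counts as small when it is below small_file_ratio * target size.
    Hypothetical helper: it mirrors the rule as described, not the AWS Glue internals.
    """
    small_limit = small_file_ratio * target_file_size_bytes
    small_files = sum(1 for size in file_sizes_bytes if size < small_limit)
    return small_files > min_small_files
```

With these defaults, a partition holding 150 files of 10 MB each qualifies, while 150 files of 400 MB each does not, because 400 MB is above the 384 MB cutoff.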

Iceberg supports two table modes: Merge-on-Read (MoR) and Copy-on-Write (CoW). These table modes provide different approaches for handling data updates and play a critical role in how data lakes manage changes and maintain performance:

Data compaction on Iceberg CoW – With CoW, any updates or deletes are directly applied to the table files. This means the entire dataset is rewritten when changes are made. Although this provides immediate consistency and simplifies reads (because readers only access the latest snapshot of the data), it can become costly and slow for write-heavy workloads due to the need for frequent rewrites. Announced during AWS re:Invent 2023, this feature focuses on optimizing data storage for Iceberg tables using the CoW mechanism. Compaction in CoW makes sure updates to the data result in new files being created, which are then compacted to improve query performance.
Data compaction on Iceberg MoR – Unlike CoW, MoR allows updates to be written separately from the existing dataset, and those changes are only merged when the data is read. This approach is beneficial for write-heavy scenarios because it avoids frequent full table rewrites. However, it can introduce complexity during reads because the system has to merge base and delta files as needed to provide a complete view of the data. MoR compaction, now generally available, allows for efficient handling of streaming data. It makes sure that while data is being continuously ingested, it’s also compacted in a way that optimizes read performance without compromising the ingestion speed.
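The difference between the two modes can be made concrete with a toy model. The sketch below is a deliberate simplification and not Iceberg's implementation: a data file is a list of rows, a delete file is a set of row positions, and both function names are made up for illustration.

```python
from typing import List, Set, Tuple


def read_merge_on_read(base_rows: List[str], deleted_positions: Set[int]) -> List[str]:
    """MoR read path: the reader filters out positions recorded in delete files."""
    return [row for pos, row in enumerate(base_rows) if pos not in deleted_positions]


def compact(base_rows: List[str], deleted_positions: Set[int]) -> Tuple[List[str], Set[int]]:
    """Compaction: rewrite the data file with deletes applied, leaving no delete files.

    After this step, readers get the same result without doing any merge work.
    """
    return read_merge_on_read(base_rows, deleted_positions), set()
```

In this model every read pays the filtering cost until compaction runs, which is the read-side overhead that automatic compaction is meant to remove for MoR tables.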

Whether you are using CoW, MoR, or a hybrid of both, one challenge remains consistent: maintenance around the growing number of small files generated by each transaction. AWS Glue automatic compaction addresses this by making sure your Iceberg tables remain efficient and performant across both table modes.

This post provides a detailed comparison of query performance between auto compacted and non-compacted Iceberg tables. By analyzing key metrics such as query latency and storage efficiency, we demonstrate how the automatic compaction feature optimizes data lakes for better performance and cost savings. This comparison will help guide you in making informed decisions on enhancing your data lake environments.

Solution overview

This blog post explores the performance benefits of the newly launched feature in AWS Glue that supports automatic compaction of Iceberg tables with MoR capabilities. We run two versions of the same architecture: one where the tables are auto compacted, and another without compaction. By comparing both scenarios, this post demonstrates the efficiency, query performance, and cost benefits of auto compacted tables vs. non-compacted tables in a simulated Internet of Things (IoT) data pipeline.

The following diagram illustrates the solution architecture.

The solution consists of the following components:

Amazon Elastic Compute Cloud (Amazon EC2) simulates continuous IoT data streams, sending them to Amazon MSK for processing
Amazon Managed Streaming for Apache Kafka (Amazon MSK) ingests and streams data from the IoT simulator for real-time processing
Amazon EMR Serverless processes streaming data from Amazon MSK without managing clusters, writing results to the Amazon S3 data lake
Amazon Simple Storage Service (Amazon S3) stores data using Iceberg’s MoR format for efficient querying and analysis
The Data Catalog manages metadata for the datasets in Amazon S3, enabling organized data discovery and querying through Amazon Athena
Amazon Athena queries data from the S3 data lake with two table options:
- Non-compacted table – Queries raw data from the Iceberg table
- Compacted table – Queries data optimized by automatic compaction for faster performance.

The data flow consists of the following steps:

The IoT simulator on Amazon EC2 generates continuous data streams.
The data is sent to Amazon MSK, which acts as a streaming table.
EMR Serverless processes streaming data and writes the output to Amazon S3 in Iceberg format.
The Data Catalog manages the metadata for the datasets.
Athena is used to query the data, either directly from the non-compacted table or from the compacted table after auto compaction.

In this post, we guide you through setting up an evaluation environment for AWS Glue Iceberg auto compaction performance using the following GitHub repository. The process involves simulating IoT data ingestion, deduplication, and querying performance using Athena.

Compaction IoT performance test

We simulated IoT data ingestion with over 20 billion events and used MERGE INTO for data deduplication across two time-based partitions, involving heavy partition reads and shuffling. After ingestion, we ran queries in Athena to compare performance between compacted and non-compacted tables using the MoR format. This test aims to have low latency on ingestion but will lead to hundreds of millions of small files.

We use the following table configuration settings:

'write.delete.mode'='merge-on-read'
'write.update.mode'='merge-on-read'
'write.merge.mode'='merge-on-read'
'write.distribution.mode'='none'

We use 'write.distribution.mode'='none' to lower the latency. However, it will increase the number of Parquet files. For other scenarios, you may want to use hash or range distribution write modes to reduce the file count.
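As a rough illustration of how these properties could be applied, the following sketch renders a Spark SQL ALTER TABLE statement from the four settings. The helper name and the table identifier iceberg_catalog.bigdata.employee are assumptions for the example; the streaming job in the repository may set the properties differently, for instance at table creation.

```python
from typing import Dict

# The four table properties listed above.
MOR_PROPERTIES: Dict[str, str] = {
    "write.delete.mode": "merge-on-read",
    "write.update.mode": "merge-on-read",
    "write.merge.mode": "merge-on-read",
    "write.distribution.mode": "none",
}


def alter_table_statement(table: str, properties: Dict[str, str]) -> str:
    """Render an ALTER TABLE ... SET TBLPROPERTIES statement (hypothetical helper)."""
    assignments = ",\n  ".join(f"'{key}'='{value}'" for key, value in properties.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n  {assignments}\n)"


if __name__ == "__main__":
    # In a Spark session with the Iceberg catalog configured, the string
    # could be passed to spark.sql(...).
    print(alter_table_statement("iceberg_catalog.bigdata.employee", MOR_PROPERTIES))
```

Each key and each value is quoted separately, which is the form Spark SQL expects for table properties.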

This test performs only append operations: we append new data to the table and run no delete operations.

The following table shows some metrics of the Athena query performance.

Query	Execution time (sec), employee (without compaction)	Execution time (sec), employeeauto (with compaction)	Performance improvement (%)	Data scanned (GB), employee (without compaction)	Data scanned (GB), employeeauto (with compaction)
`SELECT count(*) FROM "bigdata"."<tablename>"`	67.5896	3.8472	94.31%	0	0
`SELECT team, name, min(age) AS youngest_age FROM "bigdata"."<tablename>" GROUP BY team, name ORDER BY youngest_age ASC`	72.0152	50.4308	29.97%	33.72	32.96
`SELECT role, team, avg(age) AS average_age FROM bigdata."<tablename>" GROUP BY role, team ORDER BY average_age DESC`	74.1430	37.7676	49.06%	17.24	16.59
`SELECT name, age, start_date, role, team FROM bigdata."<tablename>" WHERE CAST(start_date as DATE) > CAST('2023-01-02' as DATE) and age > 40 ORDER BY start_date DESC limit 100`	70.3376	37.1232	47.22%	105.74	110.32

Because the previous test didn’t perform any delete operations on the table, we conduct a new test involving hundreds of thousands of such operations. We use the previously auto compacted table (employeeauto) as a base, noting that this table uses MoR for all operations.

We run a query that deletes the rows whose start_date falls on an even second:

DELETE FROM iceberg_catalog.bigdata.employeeauto
WHERE start_date BETWEEN 'start' AND 'end'
AND SECOND(start_date) % 2 = 0;

This query runs with table optimizations enabled, using an Amazon EMR Studio notebook. After running the queries, we roll back the table to its previous state for a performance comparison. Iceberg’s time-traveling capabilities allow us to restore the table. We then disable the table optimizations, rerun the delete query, and follow up with Athena queries to analyze performance differences. The following table summarizes our results.

Query	Execution time (sec), employee (without compaction)	Execution time (sec), employeeauto (with compaction)	Performance improvement (%)	Data scanned (GB), employee (without compaction)	Data scanned (GB), employeeauto (with compaction)
`SELECT count(*) FROM "bigdata"."<tablename>"`	29.820	8.71	70.77%	0	0
`SELECT team, name, min(age) as youngest_age FROM "bigdata"."<tablename>" GROUP BY team, name ORDER BY youngest_age ASC`	58.0600	34.1320	41.21%	33.27	19.13
`SELECT role, team, avg(age) AS average_age FROM bigdata."<tablename>" GROUP BY role, team ORDER BY average_age DESC`	59.2100	31.8492	46.21%	16.75	9.73
`SELECT name, age, start_date, role, team FROM bigdata."<tablename>" WHERE CAST(start_date as DATE) > CAST('2023-01-02' as DATE) and age > 40 ORDER BY start_date DESC limit 100`	68.4650	33.1720	51.55%	112.64	61.18
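The rollback between the two delete runs can be expressed with Iceberg's Spark procedures. The sketch below only builds the SQL text. It assumes the rollback_to_snapshot procedure and the snapshots metadata table are used, whereas the notebook in the repository may restore the table in another way, and the helper names are hypothetical.

```python
def snapshots_query(catalog: str, table: str) -> str:
    """SQL that lists snapshots so the pre-delete snapshot ID can be picked."""
    return (
        f"SELECT snapshot_id, committed_at, operation "
        f"FROM {catalog}.{table}.snapshots ORDER BY committed_at"
    )


def rollback_statement(catalog: str, table: str, snapshot_id: int) -> str:
    """SQL that calls Iceberg's rollback_to_snapshot procedure for the given snapshot."""
    if not isinstance(snapshot_id, int):
        raise TypeError("snapshot_id must be an integer")
    return f"CALL {catalog}.system.rollback_to_snapshot('{table}', {snapshot_id})"
```

In a PySpark notebook, each returned string could be passed to spark.sql(...). The catalog name iceberg_catalog and the table bigdata.employeeauto match the DELETE statement shown earlier.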

We analyze the following key metrics:

Query runtime – We compared the runtimes between compacted and non-compacted tables using Athena as the query engine and found significant performance improvements with both MoR for ingestion and appends and MoR for delete operations.
Data scanned evaluation – We compared compacted and non-compacted tables using Athena as the query engine and observed a reduction in data scanned for most queries. This reduction translates directly into cost savings.
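The improvement percentages in both tables appear to follow the usual relative-reduction formula, (baseline - compacted) / baseline. A minimal sketch, with a hypothetical function name, that reproduces two of the published figures from the first table:

```python
def improvement_pct(baseline_sec: float, compacted_sec: float) -> float:
    """Relative reduction in runtime, as a percentage of the non-compacted baseline."""
    if baseline_sec <= 0:
        raise ValueError("baseline_sec must be positive")
    return (baseline_sec - compacted_sec) / baseline_sec * 100


if __name__ == "__main__":
    # Runtimes taken from the first results table (append-only test).
    print(round(improvement_pct(67.5896, 3.8472), 2))   # count(*) query
    print(round(improvement_pct(72.0152, 50.4308), 2))  # min(age) query
```

The same function applied to the data scanned columns gives the relative reduction in scanned bytes, which is the figure that drives Athena cost.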

Prerequisites

To set up your own evaluation environment and test the feature, you need the following prerequisites:

A virtual private cloud (VPC) with at least two private subnets. For instructions, see Create a VPC.
An EC2 instance c5.xlarge using Amazon Linux 2023 running on one of those private subnets where you will launch the data simulator. For the security group, you can use the default for the VPC. For more information, see Get started with Amazon EC2.
An AWS Identity and Access Management (IAM) user with the correct permissions to create and configure all the required resources.

Set up Amazon S3 storage

Create an S3 bucket with the following structure:

s3bucket/
/jars
/employee.desc
/warehouse
/checkpoint
/checkpointAuto

Download the descriptor file employee.desc from the GitHub repo and place it in the S3 bucket.

Download the application from the releases page

Get the packaged application from the GitHub repo, then upload the JAR file to the jars directory on the S3 bucket. The warehouse folder stores the Iceberg data and metadata, and the checkpoint folder is used by the Structured Streaming checkpointing mechanism. Because we use two streaming job runs, one for compacted and one for non-compacted data, we also create a checkpointAuto folder.

Create a Data Catalog database

Create a database in the Data Catalog (for this post, we name our database bigdata). For instructions, see Getting started with the AWS Glue Data Catalog.

Create an EMR Serverless application

Create an EMR Serverless application with the following settings (for instructions, see Getting started with Amazon EMR Serverless):

Type: Spark
Version: 7.1.0
Architecture: x86_64
Java Runtime: Java 17
Metastore Integration: AWS Glue Data Catalog
Logs: Enable Amazon CloudWatch Logs if desired

Configure the network (VPC, subnets, and default security group) to allow the EMR Serverless application to reach the MSK cluster.

Take note of the application-id to use later for launching the jobs.

Create an MSK cluster

Create an MSK cluster on the Amazon MSK console. For more details, see Get started using Amazon MSK.

Use the custom create option with at least two brokers running Apache Kafka version 3.5.1 in Apache ZooKeeper mode, and instance type kafka.m7g.xlarge. Do not use public access; choose two private subnets to deploy it (one broker per subnet or Availability Zone, for a total of two brokers). For the security group, you can choose the default of the VPC; whichever group you use, make sure it allows the EMR Serverless application and the Amazon EC2 based producer to reach the cluster. For security, use PLAINTEXT (in production, you should secure access to the cluster). Choose 200 GB as storage size for each broker and do not enable tiered storage.

For the MSK cluster configuration, use the following settings:

auto.create.topics.enable=true
default.replication.factor=2
min.insync.replicas=2
num.io.threads=8
num.network.threads=5
num.partitions=32
num.replica.fetchers=2
replica.lag.time.max.ms=30000
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
unclean.leader.election.enable=true
zookeeper.session.timeout.ms=18000
compression.type=zstd
log.retention.hours=2
log.retention.bytes=10073741824

Configure the data simulator

Log in to your EC2 instance. Because it’s running on a private subnet, you can use an instance endpoint to connect. To create one, see Connect to your instances using EC2 Instance Connect Endpoint. After you log in, issue the following commands:

sudo yum install java-17-amazon-corretto-devel
wget https://archive.apache.org/dist/kafka/3.5.1/kafka_2.12-3.5.1.tgz
tar xzvf kafka_2.12-3.5.1.tgz

Create Kafka topics

Create two Kafka topics—remember that you need to change the bootstrap server with the corresponding client information. You can get this data from the Amazon MSK console on the details page for your MSK cluster.

cd kafka_2.12-3.5.1/bin/

./kafka-topics.sh --topic protobuf-demo-topic-pure-auto --bootstrap-server kafkaBootstrapString --create
./kafka-topics.sh --topic protobuf-demo-topic-pure --bootstrap-server kafkaBootstrapString --create

Launch job runs

Issue job runs for the non-compacted and auto compacted tables using the following AWS Command Line Interface (AWS CLI) commands. You can use AWS CloudShell to run the commands.

For the non-compacted table, you need to change the s3bucket value as needed and the application-id. You also need an IAM role (execution-role-arn) with the corresponding permissions to access the S3 bucket and to access and write tables on the Data Catalog.

aws emr-serverless start-job-run --application-id application-identifier --name job-run-name --execution-role-arn arn-of-emrserverless-role --mode 'STREAMING' --job-driver '{
"sparkSubmit": {
"entryPoint": "s3://s3bucket/jars/streaming-iceberg-ingest-1.0-SNAPSHOT.jar",
"entryPointArguments": ["true","s3://s3bucket/warehouse","s3://s3bucket/Employee.desc","s3://s3bucket/checkpoint","kafkaBootstrapString","true"],
"sparkSubmitParameters": "--class com.aws.emr.spark.iot.SparkCustomIcebergIngestMoR --conf spark.executor.cores=16 --conf spark.executor.memory=64g --conf spark.driver.cores=4 --conf spark.driver.memory=16g --conf spark.dynamicAllocation.minExecutors=3 --conf spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar --conf spark.dynamicAllocation.maxExecutors=5 --conf spark.sql.catalog.glue_catalog.http-client.apache.max-connections=3000 --conf spark.emr-serverless.executor.disk.type=shuffle_optimized --conf spark.emr-serverless.executor.disk=1000G --files s3://s3bucket/Employee.desc --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1"
}
}'

For the auto compacted table, you need to change the s3bucket value as needed, the application-id, and the kafkaBootstrapString. You also need an IAM role (execution-role-arn) with the corresponding permissions to access the S3 bucket and to access and write tables on the Data Catalog.

aws emr-serverless start-job-run --application-id application-identifier --name job-run-name --execution-role-arn arn-of-emrserverless-role --mode 'STREAMING' --job-driver '{
"sparkSubmit": {
"entryPoint": "s3://s3bucket/jars/streaming-iceberg-ingest-1.0-SNAPSHOT.jar",
"entryPointArguments": ["true","s3://s3bucket/warehouse","/home/hadoop/Employee.desc","s3://s3bucket/checkpointAuto","kafkaBootstrapString","true"],
"sparkSubmitParameters": "--class com.aws.emr.spark.iot.SparkCustomIcebergIngestMoRAuto --conf spark.executor.cores=16 --conf spark.executor.memory=64g --conf spark.driver.cores=4 --conf spark.driver.memory=16g --conf spark.dynamicAllocation.minExecutors=3 --conf spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar --conf spark.dynamicAllocation.maxExecutors=5 --conf spark.sql.catalog.glue_catalog.http-client.apache.max-connections=3000 --conf spark.emr-serverless.executor.disk.type=shuffle_optimized --conf spark.emr-serverless.executor.disk=1000G --files s3://s3bucket/Employee.desc --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1"
}
}'

Enable auto compaction

Enable auto compaction for the employeeauto table in AWS Glue. For instructions, see Enabling compaction optimizer.

Launch the data simulator

Download the JAR file to the EC2 instance and run the producer:

aws s3 cp s3://s3bucket/jars/streaming-iceberg-ingest-1.0-SNAPSHOT.jar .

Now you can start the protocol buffer producers.

For non-compacted tables, use the following commands:

java -cp streaming-iceberg-ingest-1.0-SNAPSHOT.jar com.aws.emr.proto.kafka.producer.ProtoProducer kafkaBootstrapString

For auto compacted tables, use the following commands:

java -cp streaming-iceberg-ingest-1.0-SNAPSHOT.jar com.aws.emr.proto.kafka.producer.ProtoProducerAuto kafkaBootstrapString

Test the solution in EMR Studio

For the delete test, we use an EMR Studio. For setup instructions, see Set up an EMR Studio. Next, you need to create an EMR Serverless interactive application to run the notebook; refer to Run interactive workloads with EMR Serverless through EMR Studio to create a Workspace.

Open the Workspace, select the interactive EMR Serverless application as the compute option, and attach it.

Download the Jupyter notebook, upload it to your environment, and run the cells using a PySpark kernel to run the test.

Clean up

This evaluation is for high-throughput scenarios and can lead to significant costs. Complete the following steps to clean up your resources:

Stop the Kafka producer EC2 instance.
Cancel the EMR job runs and delete the EMR Serverless application.
Delete the MSK cluster.
Delete the tables and database from the Data Catalog.
Delete the S3 bucket.

Conclusion

The Data Catalog has improved automatic compaction of Iceberg tables for streaming data, making it straightforward for you to keep your transactional data lakes always performant. Enabling automatic compaction on Iceberg tables reduces metadata overhead on your Iceberg tables and improves query performance.

Many customers have streaming data that is continuously ingested in Iceberg tables, resulting in a large set of delete files that track changes in data files. With this new feature, when you enable the Data Catalog optimizer, it constantly monitors table partitions and runs the compaction process for both data and delta or delete files and regularly commits the partial progress. The Data Catalog also has expanded support for heavily nested complex data and supports schema evolution as you reorder or rename columns.

In this post, we assessed the ingestion and query performance of simulated IoT data using AWS Glue Iceberg with auto compaction enabled. Our setup processed over 20 billion events, managing duplicates and late-arriving events, and employed a MoR approach for both ingestion/appends and deletions to evaluate the performance improvement and efficiency.

Overall, AWS Glue Iceberg with auto compaction proves to be a robust solution for managing high-throughput IoT data streams. These enhancements lead to faster data processing, shorter query times, and more efficient resource utilization, all of which are essential for any large-scale data ingestion and analytics pipeline.

For detailed setup instructions, see the GitHub repo.

About the Authors

Navnit Shukla serves as an AWS Specialist Solutions Architect with a focus on Analytics. He possesses a strong enthusiasm for assisting clients in discovering valuable insights from their data. Through his expertise, he constructs innovative solutions that empower businesses to arrive at informed, data-driven choices. Notably, Navnit Shukla is the accomplished author of the book titled Data Wrangling on AWS. He can be reached through LinkedIn.

Angel Conde Manjon is a Sr. PSA Specialist on Data & AI, based in Madrid, and focuses on EMEA South and Israel. He has previously worked on research related to data analytics and artificial intelligence in diverse European research projects. In his current role, Angel helps partners develop businesses centered on data and AI.

Amit Singh currently serves as a Senior Solutions Architect at AWS, specializing in analytics and IoT technologies. With extensive expertise in designing and implementing large-scale distributed systems, Amit is passionate about empowering clients to drive innovation and achieve business transformation through AWS solutions.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

[$] FESCo provenpackager sanction causes problems

2024-12-19 jzb

Post Syndicated from jzb original https://lwn.net/Articles/1002450/

The Fedora Engineering Steering Council (FESCo) has made a series of
missteps in deciding to revoke a longtime Fedora contributor’s provenpackager
status. FESCo made the decision during a closed session, based on private
complaints. It then publicly announced its decision, including the
contributor’s name, while only supplying a vague account of the
contributor’s actions. This has left the Fedora community with more
questions than answers, and raised a number of complaints about the
transparency of FESCo’s process. In addition, the sequence of events has
sparked discussions about package ownership, as well as when and how it’s
appropriate to push changes to packages that a developer doesn’t own.

Implement a custom subscription workflow for unmanaged Amazon S3 assets published with Amazon DataZone

2024-12-19 Somdeb Bhattacharjee

Post Syndicated from Somdeb Bhattacharjee original https://aws.amazon.com/blogs/big-data/implement-a-custom-subscription-workflow-for-unmanaged-amazon-s3-assets-published-with-amazon-datazone/

Organizational data is often fragmented across multiple lines of business, leading to inconsistent and sometimes duplicate datasets. This fragmentation can delay decision-making and erode trust in available data. Amazon DataZone, a data management service, helps you catalog, discover, share, and govern data stored across AWS, on-premises systems, and third-party sources. Although Amazon DataZone automates subscription fulfillment for structured data assets—such as data stored in Amazon Simple Storage Service (Amazon S3), cataloged with the AWS Glue Data Catalog, or stored in Amazon Redshift—many organizations also rely heavily on unstructured data. For these customers, extending the streamlined data discovery and subscription workflows in Amazon DataZone to unstructured data, such as files stored in Amazon S3, is critical.

For example, Genentech, a leading biotechnology company, has vast sets of unstructured gene sequencing data organized across multiple S3 buckets and prefixes. They need to enable direct access to these data assets for downstream applications efficiently, while maintaining governance and access controls.

In this post, we demonstrate how to implement a custom subscription workflow using Amazon DataZone, Amazon EventBridge, and AWS Lambda to automate the fulfillment process for unmanaged data assets, such as unstructured data stored in Amazon S3. This solution enhances governance and simplifies access to unstructured data assets across the organization.

Solution overview

For our use case, the data producer has unstructured data stored in S3 buckets, organized with S3 prefixes. We want to publish this data to Amazon DataZone as discoverable S3 data. On the consumer side, users need to search for these assets, request subscriptions, and access the data within an Amazon SageMaker notebook, using their own custom AWS Identity and Access Management (IAM) roles.

The proposed solution involves creating a custom subscription workflow that uses the event-driven architecture of Amazon DataZone. Amazon DataZone keeps you informed of key activities (events) within your data portal, such as subscription requests, updates, comments, and system events. These events are delivered through the EventBridge default event bus.

An EventBridge rule captures subscription events and invokes a custom Lambda function. This Lambda function contains the logic to manage access policies for the subscribed unmanaged asset, automating the subscription process for unstructured S3 assets. This approach streamlines data access while ensuring proper governance.

To learn more about working with events using EventBridge, refer to Events via Amazon EventBridge default bus.
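The rule that captures subscription events is defined by an event pattern. The sketch below shows one plausible pattern together with a simplified matcher that is handy for checking sample events locally. The source value aws.datazone and the detail-type string Subscription Created are assumptions to verify against the Amazon DataZone events documentation, and the matcher covers only exact top-level matches, not the full EventBridge pattern language.

```python
import json
from typing import Any, Dict, List

# Assumed pattern: verify the detail-type string against the DataZone event reference.
SUBSCRIPTION_CREATED_PATTERN: Dict[str, List[str]] = {
    "source": ["aws.datazone"],
    "detail-type": ["Subscription Created"],
}


def matches(event: Dict[str, Any], pattern: Dict[str, List[str]]) -> bool:
    """Simplified matcher: every pattern field must equal one of the allowed values."""
    return all(event.get(field) in allowed for field, allowed in pattern.items())


def pattern_as_json(pattern: Dict[str, List[str]]) -> str:
    """EventBridge rules take the pattern as a JSON string."""
    return json.dumps(pattern)
```

The JSON string is the form that the EventBridge console and the put-rule API accept for the event pattern.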

The solution architecture is shown in the following screenshot.

Custom subscription workflow architecture diagram

To implement the solution, we complete the following steps:

As a data producer, publish an unstructured S3 based data asset as S3ObjectCollectionType to Amazon DataZone.
For the consumer, create a custom AWS service environment in the consumer Amazon DataZone project and add a subscription target for the IAM role attached to a SageMaker notebook instance. Now, as a consumer, request access to the unstructured asset published in the previous step.
When the request is approved, capture the subscription created event using an EventBridge rule.
Invoke a Lambda function as the target for the EventBridge rule and pass the event payload to it:
The Lambda function does 2 things:
1. Fetches the asset details, including the Amazon Resource Name (ARN) of the S3 published asset and the IAM role ARN from the subscription target.
2. Uses the information to update the S3 bucket policy granting List/Get access to the IAM role.
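The second Lambda step, turning the asset ARN and the role ARN into bucket policy statements, can be sketched as a pure function. This is an assumption about the shape of the policy, not the post's actual Lambda code: the function names and Sid values are made up, only the standard aws partition is handled, and the caller still has to merge the statements into the existing bucket policy.

```python
from typing import Any, Dict, List, Tuple

S3_ARN_PREFIX = "arn:aws:s3:::"


def parse_s3_arn(arn: str) -> Tuple[str, str]:
    """Split an S3 ARN into (bucket, key prefix); the prefix may be empty."""
    if not arn.startswith(S3_ARN_PREFIX):
        raise ValueError(f"not an S3 ARN: {arn}")
    bucket, _, key_prefix = arn[len(S3_ARN_PREFIX):].partition("/")
    return bucket, key_prefix.rstrip("/")


def build_policy_statements(asset_arn: str, role_arn: str) -> List[Dict[str, Any]]:
    """Build List/Get statements that scope the role to the published prefix."""
    bucket, key_prefix = parse_s3_arn(asset_arn)
    bucket_arn = f"{S3_ARN_PREFIX}{bucket}"
    objects_arn = f"{bucket_arn}/{key_prefix}/*" if key_prefix else f"{bucket_arn}/*"

    list_statement: Dict[str, Any] = {
        "Sid": "DataZoneSubscriptionList",  # hypothetical Sid
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": "s3:ListBucket",
        "Resource": bucket_arn,
    }
    if key_prefix:
        # Restrict listing to keys under the subscribed prefix.
        list_statement["Condition"] = {"StringLike": {"s3:prefix": [f"{key_prefix}/*"]}}

    get_statement: Dict[str, Any] = {
        "Sid": "DataZoneSubscriptionGet",  # hypothetical Sid
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": "s3:GetObject",
        "Resource": objects_arn,
    }
    return [list_statement, get_statement]
```

Keeping this logic free of AWS calls makes it easy to unit test; the Lambda handler would read the current policy, append these statements, and write the policy back.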

Prerequisites

To follow along with the post, you should have an AWS account. If you don’t have one, you can sign up for one.

For this post, we assume you know how to create an Amazon DataZone domain and Amazon DataZone projects. For more information, see Create domains and Working with projects and environments in Amazon DataZone.

Also, for simplicity, we use the same IAM role for the Amazon DataZone admin (creating domains) as well as the producer and consumer personas.

Publish unstructured S3 data to Amazon DataZone

We have uploaded some sample unstructured data into an S3 bucket. This is the data that will be published to Amazon DataZone. You can use any unstructured data, such as an image or text file.

On the Properties tab of the S3 folder, note the ARN of the S3 bucket prefix.

Complete the following steps to publish the data:

Create an Amazon DataZone domain in the account and navigate to the domain portal using the link for Data portal URL.

DataZone domain creation

Create a new Amazon DataZone project (for this post, we name it unstructured-data-producer-project) for publishing the unstructured S3 data asset.
On the Data tab of the project, choose Create data asset.

Data asset creation

Enter a name for the asset.
For Asset type, choose S3 object collection.
For S3 location ARN, enter the ARN of the S3 prefix.

After you create the asset, you can add glossaries or metadata forms, but it’s not necessary for this post. You can publish the data asset so it’s now discoverable within the Amazon DataZone portal.

Set up the SageMaker notebook and SageMaker instance IAM role

Create an IAM role which will be attached to the SageMaker notebook instance. For the trust policy, allow SageMaker to assume this role and leave the Permissions tab blank. We refer to this role as the instance-role throughout the post.

SageMaker instance role

Next, create a SageMaker notebook instance from the SageMaker console. Attach the instance-role to the notebook instance.

SageMaker instance

Set up the consumer Amazon DataZone project, custom AWS service environment, and subscription target

Complete the following steps:

Log in to the Amazon DataZone portal and create a consumer project (for this post, we call it custom-blueprint-consumer-project), which will be used by the consumer persona to subscribe to the unstructured data asset.

Custom blueprint project name

We use the recently launched custom blueprints for AWS services for creating the environment in this consumer project. The custom blueprint allows you to bring your own environment IAM role to integrate your existing AWS resources with Amazon DataZone. For this post, we create a custom environment to directly integrate SageMaker notebook access from the Amazon DataZone portal.

Before you create the custom environment, create the environment IAM role that will be used in the custom blueprint. The role should have a trust policy as shown in the following screenshot. For the permissions, attach the AWS managed policy AmazonSageMakerFullAccess. We refer to this role as the environment-role throughout the post.

Custom Environment role

To create the custom environment, first enable the Custom AWS Service blueprint on the Amazon DataZone console.

Enable custom blueprint

Open the blueprint to create a new environment as shown in the following screenshot.
For Owning project, use the consumer project that you created earlier and for Permissions, use the environment-role.

Custom environment project and role

After you create the environment, open it to create a customized URL for the SageMaker notebook access.

SageMaker custom URL

Create a new custom AWS link and enter the URL from the SageMaker notebook.

You can find it by navigating to the SageMaker console and choosing Notebooks in the navigation pane.

Choose Customize to add the custom link.

Add the custom link

Next, create a subscription target in the custom environment to pass the instance role that needs access to the unstructured data.

A subscription target is an Amazon DataZone engineering concept that allows Amazon DataZone to fulfill subscription requests for managed assets by granting access based on the information defined in the target like domain-id, environment-id, or authorized-principals.

Currently, creation of subscription targets is only allowed using the AWS Command Line Interface (AWS CLI). You can use the command create-subscription-target to create the subscription target.

The following is an example JSON payload for the subscription target creation. Create it as a JSON file on your workstation (for this post, we call it blog-sub-target.json). Replace the domain ID and the environment ID with the corresponding values for your domain and environment.

{
"domainIdentifier": "<<your-domain-id>>",
"environmentIdentifier": "<<your-environment-id>>",
"name": "custom-s3-target-consumerenv",
"type": "GlueSubscriptionTargetType",
"manageAccessRole": "<<provide the environment-role here>>",
"applicableAssetTypes": ["S3ObjectCollectionAssetType"],
"provider": "Custom Provider",
"authorizedPrincipals": [ "<<provide the instance-role here>>"],
"subscriptionTargetConfig": [{
"formName": "GlueSubscriptionTargetConfigForm",
"content": "{\"databaseName\":\"customdb1\"}"
}]
}

You can get the domain ID from the user name button in the upper right of the Amazon DataZone data portal; it’s in the format dzd_<<some-random-characters>>.

For the environment ID, you can find it on the Settings tab of the environment within your consumer project.
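
As a minimal sketch, the payload file could also be generated from Python instead of edited by hand, which makes it harder to leave a placeholder unfilled. The function name build_subscription_target_payload and the example IDs and ARNs in the usage note are illustrative assumptions; the field names are the ones from the JSON payload above.

```python
import json


def build_subscription_target_payload(domain_id, environment_id,
                                      environment_role_arn, instance_role_arn,
                                      database_name="customdb1"):
    """Return the subscription target payload from this post as a dict."""
    values = (domain_id, environment_id, environment_role_arn, instance_role_arn)
    # Refuse empty values and unreplaced <<...>> placeholders
    for value in values:
        if not value or "<<" in value:
            raise ValueError(f"placeholder or empty value: {value!r}")
    return {
        "domainIdentifier": domain_id,
        "environmentIdentifier": environment_id,
        "name": "custom-s3-target-consumerenv",
        "type": "GlueSubscriptionTargetType",
        "manageAccessRole": environment_role_arn,
        "applicableAssetTypes": ["S3ObjectCollectionAssetType"],
        "provider": "Custom Provider",
        "authorizedPrincipals": [instance_role_arn],
        "subscriptionTargetConfig": [{
            "formName": "GlueSubscriptionTargetConfigForm",
            # The form content is itself a JSON string, so it is encoded separately
            "content": json.dumps({"databaseName": database_name}),
        }],
    }
```

To produce the file, you could call the function with your own values (for example, a domain ID such as dzd_exampledomain) and write the result with json.dump(payload, open("blog-sub-target.json", "w"), indent=2).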

Open an AWS CloudShell environment and upload the JSON payload file using the Actions option in the CloudShell terminal.
You can now create a new subscription target using the following AWS CLI command:

aws datazone create-subscription-target --cli-input-json file://blog-sub-target.json

Create subscription target

To verify the subscription target was created successfully, run the list-subscription-targets command from the AWS CloudShell environment:

aws datazone list-subscription-targets --domain-identifier <<domain-id>> --environment-identifier <<environment-id>>
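
The same check can be sketched in Python with the list_subscription_targets call that the Lambda function later in this post also uses. The helper name find_s3_subscription_targets is an assumption for illustration, and the sketch reads only the first page of results.

```python
S3_ASSET_TYPE = "S3ObjectCollectionAssetType"


def find_s3_subscription_targets(dz_client, domain_id, environment_id):
    """Return the subscription targets in an environment that apply to S3 assets."""
    response = dz_client.list_subscription_targets(
        domainIdentifier=domain_id,
        environmentIdentifier=environment_id,
    )
    return [
        target for target in response.get("items", [])
        if S3_ASSET_TYPE in target.get("applicableAssetTypes", [])
    ]


# Usage with real AWS credentials (not run here):
#   import boto3
#   dz = boto3.client("datazone")
#   for target in find_s3_subscription_targets(dz, "<<domain-id>>", "<<environment-id>>"):
#       print(target["name"], target["authorizedPrincipals"])
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real domain.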

Create a function to respond to subscription events

Now that you have the consumer environment and subscription target set up, the next step is to implement a custom workflow for handling subscription requests.

The simplest mechanism to handle subscription events is a Lambda function. The exact implementation may vary based on environment; for this post, we walk through the steps to create a simple function to handle subscription creation and cancellation.

On the Lambda console, choose Functions in the navigation pane.
Choose Create function.
Select Author from scratch.
For Function name, enter a name (for example, create-s3policy-for-subscription-target).
For Runtime, choose Python 3.12.
Choose Create function.

Author Lambda function

This should open the Code tab for the function and allow editing of the Python code for the function. Let’s look at some of the key components of a function to handle the subscription for unmanaged S3 assets.

Handle only relevant events

When the function gets invoked, we check to make sure it’s one of the events that’s relevant for managing access. Otherwise, the function can simply return a message without taking further action.

def lambda_handler(event, context):
    # Get the basic info about the event
    event_detail = event['detail']

    # Make sure it's one of the events we're interested in
    event_source = event['source']
    event_type = event['detail-type']

    if event_source != 'aws.datazone':
        return '{"Response" : "Not a DataZone event"}'
    elif event_type not in ['Subscription Created', 'Subscription Cancelled', 
                               'Subscription Revoked']:
        return '{"Response" : "Not a subscription created, cancelled, or revoked event"}'
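
To try this filter outside Lambda, the check can be pulled into a small helper, as in the following sketch. The helper name is_relevant_event and the two trimmed-down sample events are assumptions for illustration; real events carry many more attributes.

```python
RELEVANT_TYPES = {
    "Subscription Created",
    "Subscription Cancelled",
    "Subscription Revoked",
}


def is_relevant_event(event):
    """True only for DataZone subscription created, cancelled, or revoked events."""
    return (event.get("source") == "aws.datazone"
            and event.get("detail-type") in RELEVANT_TYPES)


# Hypothetical, trimmed-down events for illustration
created = {"source": "aws.datazone", "detail-type": "Subscription Created", "detail": {}}
unrelated = {"source": "aws.s3", "detail-type": "Object Created", "detail": {}}

print(is_relevant_event(created))    # True
print(is_relevant_event(unrelated))  # False
```

Using .get instead of direct indexing means an event that lacks either field is ignored instead of raising a KeyError.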

These subscription events should include both the domain ID and a request ID (among other attributes). You can use these to look up the details of the subscription request in Amazon DataZone:

sub_request = dz.get_subscription_request_details(
    domainIdentifier=domain_id,
    identifier=sub_request_id
)
asset_listing = sub_request['subscribedListings'][0]['item']['assetListing']
form_data = json.loads(asset_listing['forms'])
asset_id = asset_listing['entityId']
asset_version = asset_listing['entityRevision']
asset_type = asset_listing['entityType']
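
The snippet above assumes domain_id and sub_request_id are already available. A minimal sketch of pulling them out of the event follows, using the same keys as the completed function later in this post; the helper name extract_request_ids and the sample values are assumptions.

```python
def extract_request_ids(event):
    """Pull the domain, project, and subscription request IDs out of an event."""
    detail = event["detail"]
    return {
        "domain_id": detail["metadata"]["domain"],
        "project_id": detail["metadata"]["owningProjectId"],
        "sub_request_id": detail["data"]["subscriptionRequestId"],
    }


# Hypothetical event carrying only the fields this helper reads
sample_event = {
    "source": "aws.datazone",
    "detail-type": "Subscription Created",
    "detail": {
        "metadata": {"domain": "dzd_exampledomain", "owningProjectId": "exampleproject"},
        "data": {"subscriptionRequestId": "examplerequest"},
    },
}

ids = extract_request_ids(sample_event)
print(ids["domain_id"], ids["sub_request_id"])
```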

Part of the subscription request should include the ARN for the S3 bucket in question, so you can retrieve that:

    # We only want to take action if this is a S3 asset
    if asset_type == 'S3ObjectCollectionAssetType':
        # Get the bucket ARN from the form info for the asset
        bucket_arn = form_data['S3ObjectCollectionForm']['bucketArn']
        
        #Get the principal from the subscription target
        principal = get_principal(domain_id,project_id)

        try:
            # Get the bucket name from the ARN                    
            bucket_name_with_prefix = bucket_arn.split(':')[5]
            bucket_name = bucket_name_with_prefix.split('/')[0]
           
        except IndexError:
            response = '{"Response" : "Could not find bucket name in ARN"}'
            return response
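
The ARN handling can be checked in isolation. The following sketch is one possible stricter variant, not the function's actual code: the helper name split_bucket_arn is an assumption, and it returns the key prefix alongside the bucket name so that prefix-scoped assets can be recognized.

```python
def split_bucket_arn(bucket_arn):
    """Split an S3 ARN into (bucket_name, key_prefix)."""
    # Limit the split so that colons inside a key prefix are preserved
    parts = bucket_arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3":
        raise ValueError(f"not an S3 ARN: {bucket_arn!r}")
    bucket_name, _, prefix = parts[5].partition("/")
    if not bucket_name:
        raise ValueError(f"no bucket name in ARN: {bucket_arn!r}")
    return bucket_name, prefix


print(split_bucket_arn("arn:aws:s3:::example-bucket"))
print(split_bucket_arn("arn:aws:s3:::example-bucket/raw/images/"))
```

The first call returns an empty prefix; the second returns raw/images/ as the prefix.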

You can also use the Amazon DataZone API calls to get the environment associated with the project making the subscription request for this S3 asset. After retrieving the environment ID, you can check which IAM principals have been authorized to access unmanaged S3 assets using the subscription target:

        list_sub_target = dz.list_subscription_targets(
            domainIdentifier=domain_id,
            environmentIdentifier=environment_id,
            maxResults=50,
            sortBy='CREATED_AT',
            sortOrder='DESCENDING'
            )
        
        print('asset type:', list_sub_target['items'][0]['applicableAssetTypes'])
        
        if list_sub_target['items'][0]['applicableAssetTypes'] == ['S3ObjectCollectionAssetType']:
            role_arn = list_sub_target['items'][0]['authorizedPrincipals']
            print('role arn',role_arn)

If this is a new subscription, add the relevant IAM principal to the S3 bucket policy by appending a statement that allows the desired S3 actions on this bucket for the new principal:

        if event_type == 'Subscription Created':
            if bucket_arn[-1] == '/':
                statement_block.append({
                    'Sid' : sid_string,
                    'Action': S3_ACTION_STRING,
                    'Resource': [
                        bucket_arn,
                        bucket_arn + '*'
                    ],
                    'Effect': 'Allow',
                    'Principal': {'AWS': principal}
                })

Conversely, if this is a subscription being revoked or cancelled, remove the previously added statement from the bucket policy to make sure the IAM principal no longer has access:

        elif event_type == 'Subscription Cancelled' or event_type == 'Subscription Revoked':
            # Remove the statement from the policy if it's there
            # Made sure to handle case where there's no Sid for a statement
            pruned_statement_block = []
            for statement in statement_block:
                if 'Sid' not in statement or statement['Sid'] != sid_string:
                    pruned_statement_block.append(statement)
            statement_block = pruned_statement_block

The completed function should be able to handle adding or removing principals like IAM roles or users to a bucket policy. Be sure to handle cases where there is no existing bucket policy or where a cancellation means removing the only statement in the policy, meaning the entire bucket policy is no longer needed.
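
One way to cover both edge cases is to keep the policy edit as a pure function that returns None when nothing is left to write. The following is a sketch under stated assumptions: the name apply_subscription_change is hypothetical, and the default action is deliberately narrower than the s3:* used elsewhere in this post.

```python
import copy

SUBSCRIPTION_EVENTS = ("Subscription Created", "Subscription Cancelled",
                       "Subscription Revoked")


def apply_subscription_change(policy, event_type, sid, principal, bucket_arn,
                              actions=("s3:GetObject",)):
    """Return an updated copy of a bucket policy, or None if no statements remain.

    `policy` is the parsed bucket policy, or None when the bucket has none yet.
    """
    new_policy = copy.deepcopy(policy) if policy else {"Version": "2012-10-17"}
    statements = new_policy.get("Statement", [])
    if isinstance(statements, dict):
        # A policy may hold a single statement object instead of a list
        statements = [statements]

    if event_type in SUBSCRIPTION_EVENTS:
        # Drop any statement already carrying this Sid, so repeats are harmless
        statements = [s for s in statements if s.get("Sid") != sid]

    if event_type == "Subscription Created":
        # Same resource scoping as the snippet above: prefix ARNs end in "/"
        objects = bucket_arn + ("*" if bucket_arn.endswith("/") else "/*")
        statements.append({
            "Sid": sid,
            "Effect": "Allow",
            "Principal": {"AWS": principal},
            "Action": list(actions),
            "Resource": [bucket_arn, objects],
        })

    if not statements:
        return None  # caller should delete the bucket policy instead
    new_policy["Statement"] = statements
    return new_policy
```

A caller would then run delete_bucket_policy when the result is None and put_bucket_policy otherwise, and the function itself can be tested without touching Amazon S3.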

The following is an example of a completed function:

import json
import boto3
import os


dz = boto3.client('datazone')
s3 = boto3.client('s3')

# The list of actions to be permitted on the bucket in the newly granted policy
S3_ACTION_STRING = 's3:*'

def build_policy_statements(event_type, statement_block, principal, sub_request_id, bucket_arn):
        # Generate a Sid that should be unique
        sid_string = ''.join(c for c in f'DZ{principal}{sub_request_id}' if c.isalnum())
        # Add a new policy statement that gives the principal access to the whole bucket.
        # If it turns out something other than bucket ARN is allowed in asset, we can
        # get more granular than that
        # Sid that should be unique in case we need to handle unsubscribe
        print('statement block :',statement_block)
        if event_type == 'Subscription Created':
            if bucket_arn[-1] == '/':
                statement_block.append({
                    'Sid' : sid_string,
                    'Action': S3_ACTION_STRING,
                    'Resource': [
                        bucket_arn,
                        bucket_arn + '*'
                    ],
                    'Effect': 'Allow',
                    'Principal': {'AWS': principal}
                })
            else:
                statement_block.append({
                    'Sid' : sid_string,
                    'Action': S3_ACTION_STRING,
                    'Resource': [
                        bucket_arn,
                        bucket_arn + '/*'
                    ],
                    'Effect': 'Allow',
                    'Principal': {'AWS': principal}
                })
        elif event_type == 'Subscription Cancelled' or event_type == 'Subscription Revoked':
            # Remove the statement from the policy if it's there
            # Made sure to handle case where there's no Sid for a statement
            pruned_statement_block = []
            for statement in statement_block:
                if 'Sid' not in statement or statement['Sid'] != sid_string:
                    pruned_statement_block.append(statement)
            statement_block = pruned_statement_block
           

        return statement_block

def lambda_handler(event, context):
    """Lambda function reacting to DataZone subscribe events

    Parameters
    ----------
    event: dict, required
        Event Bridge Events Format

    context: object, required
        Lambda Context runtime methods and attributes

    Returns
    ------
        Simple response indicating success or failure reason
    """
    # Get the basic info about the event
    event_detail = event['detail']

    # Make sure it's one of the events we're interested in
    event_source = event['source']
    event_type = event['detail-type']

    if event_source != 'aws.datazone':
        return '{"Response" : "Not a DataZone event"}'
    elif event_type not in ['Subscription Created', 'Subscription Cancelled', 
                               'Subscription Revoked']:
        return '{"Response" : "Not a subscription created, cancelled, or revoked event"}'

    
    # get the domain_id and other information
    domain_id = event_detail['metadata']['domain']
    project_id = event_detail['metadata']['owningProjectId']
    sub_request_id = event_detail['data']['subscriptionRequestId']
    listing_id = event_detail['data']['subscribedListing']['id']
    listing_version = event_detail['data']['subscribedListing']['version']
    
    print('domain-id',domain_id)
    print('project-id:',project_id)
    
    sub_request = dz.get_subscription_request_details(
        domainIdentifier = domain_id,
        identifier= sub_request_id
    )
   
    # Retrieve info about the asset from the request
    asset_listing = sub_request['subscribedListings'][0]['item']['assetListing']
    form_data = json.loads(asset_listing['forms'])
    asset_id = asset_listing['entityId']
    asset_version = asset_listing['entityRevision']
    asset_type = asset_listing['entityType']

    # We only want to take action if this is a S3 asset
    if asset_type == 'S3ObjectCollectionAssetType':
        # Get the bucket ARN from the form info for the asset
        bucket_arn = form_data['S3ObjectCollectionForm']['bucketArn']
        
        #Get the principal from the subscription target
        principal = get_principal(domain_id,project_id)

        try:
            # Get the bucket name from the ARN                    
            bucket_name_with_prefix = bucket_arn.split(':')[5]
            bucket_name = bucket_name_with_prefix.split('/')[0]
           
        except IndexError:
            response = '{"Response" : "Could not find bucket name in ARN"}'
            return response

        # Get the current bucket policy, or else make a blank one if there currently
        # is no policy
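        try:
            bucket_policy = json.loads(s3.get_bucket_policy(Bucket=bucket_name)['Policy'])
        except s3.exceptions.ClientError as e:
            # Only a missing policy is expected here; treat it as an empty policy
            if e.response['Error']['Code'] == 'NoSuchBucketPolicy':
                bucket_policy = {'Statement': []}
            else:
                response = '{"Response" : "Could not get bucket policy"}'
                return response
        except Exception:
            response = '{"Response" : "Could not get bucket policy"}'
            return response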
        
        # Gets new policy with the subscribing principal either added or removed based on
        # event type
        new_policy_statements = build_policy_statements(event_type, bucket_policy['Statement'], principal, 
                                               sub_request_id, bucket_arn)

            
        # Write back the new policy. This can fail if the new policy is too big
        # or if for some reason the function role doesn't have rights to do this
        # If we removed the only policy statement, then just delete the policy
        try: 
            if not new_policy_statements:
                s3.delete_bucket_policy(Bucket = bucket_name)
            else:
                bucket_policy['Statement'] = new_policy_statements
                policy_string = json.dumps(bucket_policy)
                print('policy string :',policy_string)
                s3.put_bucket_policy(
                    Bucket=bucket_name,
                    Policy = policy_string
                )
        except Exception as e: 
            response = f'{{"Response" : "Error updating bucket policy: {e.args}"}}'
            return response
        
        # If we got here everything went as planned
        response = f'{{"Response" : "Updated policy for {bucket_name}"}}'
    else:
        response = '{"Response" : "Not an S3 asset"}'


    return response

def get_principal(domain_id,project_id):
    # Call list environments to get the environment id
    listenv_request = dz.list_environments(
        domainIdentifier = domain_id,
        projectIdentifier= project_id
    )
    
    # In our example environment, there is only one of these
    environment_id = listenv_request['items'][0]['id']

    # Get the role we want to give access to from the subscription target info
    list_sub_target = dz.list_subscription_targets(
        domainIdentifier=domain_id,
        environmentIdentifier=environment_id,
        maxResults=50,
        sortBy='CREATED_AT',
        sortOrder='DESCENDING'
        )

    if list_sub_target['items'][0]['applicableAssetTypes'] == ['S3ObjectCollectionAssetType']:
        role_arn = list_sub_target['items'][0]['authorizedPrincipals']
    else:
        role_arn = []

    return role_arn

Because this Lambda function is intended to manage bucket policies, the role assigned to it will need a policy that allows the following actions on any buckets it is intended to manage:

s3:GetBucketPolicy
s3:PutBucketPolicy
s3:DeleteBucketPolicy
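
As a sketch, an IAM policy document granting exactly these actions could be generated as follows. The helper name lambda_bucket_policy_permissions and the example bucket name are assumptions, and the ARNs assume the standard aws partition.

```python
import json

BUCKET_POLICY_ACTIONS = [
    "s3:GetBucketPolicy",
    "s3:PutBucketPolicy",
    "s3:DeleteBucketPolicy",
]


def lambda_bucket_policy_permissions(bucket_names):
    """Build an IAM policy document limited to the named buckets."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": BUCKET_POLICY_ACTIONS,
            # Bucket-level actions apply to the bucket ARN itself, not to objects
            "Resource": [f"arn:aws:s3:::{name}" for name in bucket_names],
        }],
    }


print(json.dumps(
    lambda_bucket_policy_permissions(["example-unstructured-data-bucket"]),
    indent=2,
))
```

Listing buckets explicitly, instead of using a wildcard resource, limits which bucket policies the function can rewrite.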

Now you have a function that is capable of editing bucket policies to add or remove the principals configured for your subscription targets, but you need something to invoke this function any time a subscription is created, cancelled, or revoked. In the next section, we cover how to use EventBridge to integrate this new function with Amazon DataZone.

Respond to subscription events in EventBridge

Amazon DataZone publishes information about each event that takes place within it to EventBridge. You can watch for any of these events and invoke actions based on matching predefined rules. In this case, we’re interested in asset subscriptions being created, cancelled, or revoked, because those will determine when we grant or revoke access to the data in Amazon S3.

On the EventBridge console, choose Rules in the navigation pane.

The default event bus should automatically be present; we use it for creating the Amazon DataZone subscription rule.

Choose Create rule.
In the Rule detail section, enter the following:
1. For Name, enter a name (for example, DataZoneSubscriptions).
2. For Description, enter a description that explains the purpose of the rule.
3. For Event bus, choose default.
4. Turn on Enable the rule on the selected event bus.
5. For Rule type, select Rule with an event pattern.
Choose Next.

EventBridge rule

In the Event source section, select AWS Events or EventBridge partner events as the source of the events.

Define Event source

In the Creation method section, select Custom Pattern (JSON editor) to enable exact specification of the events needed for this solution.

Choose custom pattern

In the Event pattern section, enter the following code:

{
"detail-type": ["Subscription Created", "Subscription Cancelled", "Subscription Revoked"],
"source": ["aws.datazone"]
}
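
To build intuition for what this pattern selects, the following sketch mimics the matching locally. It is a deliberate simplification, not EventBridge's implementation: the matches helper is hypothetical and handles only top-level fields with exact string values, which is all this pattern uses.

```python
import json

PATTERN = json.loads("""
{
  "detail-type": ["Subscription Created", "Subscription Cancelled", "Subscription Revoked"],
  "source": ["aws.datazone"]
}
""")


def matches(pattern, event):
    """Every pattern key must be present in the event with one of the listed values."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


print(matches(PATTERN, {"source": "aws.datazone", "detail-type": "Subscription Revoked"}))
print(matches(PATTERN, {"source": "aws.datazone", "detail-type": "Asset Added"}))
```

The first call prints True and the second prints False: values within one field are alternatives, while the fields themselves must all match.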

Define custom pattern JSON

Choose Next.

Now that we’ve defined the events to watch for, we can make sure those Amazon DataZone events get sent to the Lambda function we defined in the previous section.

On the Select target(s) page, enter the following for Target 1:
1. For Target types, select AWS service.
2. For Select a target, choose Lambda function.
3. For Function, choose create-s3policy-for-subscription-target.
Choose Skip to Review and create.

Define event target

On the Review and create page, choose Create rule.

Subscribe to the unstructured data asset

Now that you have the custom subscription workflow in place, you can test the workflow by subscribing to the unstructured data asset.

In the Amazon DataZone portal, search for the unstructured data asset you published by browsing the catalog.

Search unstructured asset

Subscribe to the unstructured data asset using the consumer project, which starts the Amazon DataZone approval workflow.

Subscribe to unstructured asset

You should get a notification for the subscription request; follow the link and approve it.

When the subscription is approved, it will invoke the custom EventBridge Lambda workflow, which will create the S3 bucket policies for the instance role to access the S3 object. You can verify that by navigating to the S3 bucket and reviewing the permissions.

Access the subscribed asset from the Amazon DataZone portal

Now that the consumer project has been given access to the unstructured asset, you can access it from the Amazon DataZone portal.

In the Amazon DataZone portal, open the consumer project and navigate to Environments.
Choose the SageMaker-Notebook environment.

Choose SageMaker notebook on the consumer project

In the confirmation pop-up, choose Open custom.

Choose Custom

This will redirect you to the SageMaker notebook assuming the environment role. You can see the SageMaker notebook instance.

Choose Open JupyterLab.

Open JupyterLab Notebook

Choose conda_python3 to launch a new notebook.

Launch Notebook

Add code to run get_object on the unstructured S3 data that you subscribed to earlier and run the cells.

Now, because the S3 bucket policy has been updated to allow the instance role access to the S3 objects, you should see the get_object call return an HTTPStatusCode of 200.
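
A minimal sketch of such a notebook cell follows. The helper name check_object_access is an assumption, and the bucket and key in the usage comment are placeholders for the asset you subscribed to.

```python
def check_object_access(s3_client, bucket, key):
    """Call get_object and return the HTTP status code from the response metadata."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["ResponseMetadata"]["HTTPStatusCode"]


# In the notebook (placeholders must be replaced before running):
#   import boto3
#   s3 = boto3.client("s3")
#   print(check_object_access(s3, "<<your-bucket>>", "<<your-object-key>>"))
```

If the bucket policy has not been updated yet, the get_object call raises a ClientError (typically AccessDenied) instead of returning a status code.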

Multi-account implementation

In the instructions so far, we’ve deployed everything in a single AWS account, but in larger organizations, resources can be distributed throughout AWS accounts, often managed by AWS Organizations. The same pattern can be applied in a multi-account environment, with some minor additions. Instead of directly acting on a bucket, the Lambda function in the domain account can assume a role in other accounts that contain S3 buckets to be managed. In each account with an S3 bucket containing assets, create a role that allows editing the bucket policy and has a trust policy referencing the Lambda role in the domain account as a principal.
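
The cross-account step can be sketched with the AWS STS assume_role call. The helper name assumed_role_client_kwargs, the session name, and the role ARN placeholder are assumptions; the role itself is the one you create in each bucket-owning account as described above.

```python
def assumed_role_client_kwargs(sts_client, role_arn,
                               session_name="datazone-bucket-policy"):
    """Assume a role in another account and return credentials as boto3 client kwargs."""
    credentials = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
    )["Credentials"]
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }


# Inside the Lambda function in the domain account (not run here):
#   import boto3
#   kwargs = assumed_role_client_kwargs(
#       boto3.client("sts"),
#       "arn:aws:iam::<<bucket-account-id>>:role/<<bucket-policy-role>>",
#   )
#   s3 = boto3.client("s3", **kwargs)
```

The resulting S3 client acts with the permissions of the assumed role, so the rest of the function can edit the bucket policy as in the single-account case.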

Clean up

If you’ve finished experimenting and don’t want to incur any further cost for the resources deployed, you can clean up the components as follows:

Delete the Amazon DataZone domain.
Delete the Lambda function.
Delete the SageMaker instance.
Delete the S3 bucket that hosted the unstructured asset.
Delete the IAM roles.

Conclusion

By implementing this custom workflow, organizations can extend the simplified subscription and access workflows provided by Amazon DataZone to their unstructured data stored in Amazon S3. This approach provides greater control over unstructured data assets, facilitating discovery and access across the enterprise.

We encourage you to try out the solution for your own use case, and share your feedback in the comments.

About the Authors

Somdeb Bhattacharjee is a Senior Solutions Architect specializing in data and analytics. He is part of the global Healthcare and Life Sciences industry at AWS, helping his customers modernize their data platform solutions to achieve their business outcomes.

Sam Yates is a Senior Solutions Architect in the Healthcare and Life Sciences business unit at AWS. He has spent most of the past two decades helping life sciences companies apply technology in pursuit of their missions to help patients. Sam holds BS and MS degrees in Computer Science.

Fish shell announces 4.0 release

2024-12-19 daroc

Post Syndicated from daroc original https://lwn.net/Articles/1002820/

fish is a shell with a custom language and several affordances not available out of the box in other shells, such as directory-sensitive command completion. Although the project does not normally make beta releases, the newly announced 4.0 release will have one in order to ensure that no problems were introduced after a major effort to switch the code base from C++ to Rust.

fish is a smart and user-friendly command line shell with clever features that just work, without needing an advanced degree in bash scriptology. Today we are announcing an open beta, inviting all users to try out the upcoming 4.0 release.

fish 4.0 is a big upgrade. It’s got lots of new features to make using the command line easier and more enjoyable, such as more natural key binding and expanded history search. And under the hood, we’ve rebuilt the foundation in Rust to embrace modern computing.

The role of email security in reducing user risk amid rising threats

2024-12-19 Ayush Kumar

Post Syndicated from Ayush Kumar original https://blog.cloudflare.com/the-role-of-email-security-in-reducing-user-risk-amid-rising-threats/

Phishing remains one of the most dangerous and persistent cyber threats for individuals and organizations. Modern attacks use a growing arsenal of deceptive techniques that bypass traditional secure email gateways (SEGs) and email authentication measures, targeting organizations, employees, and vendors. From business email compromise (BEC) to QR phishing and account takeovers, these threats are designed to exploit weaknesses across multiple communication channels, including email, Slack, Teams, SMS, and cloud drives.

Phishing remains the most popular attack vector for bad actors looking to gain unauthorized access or extract fraudulent payment, and it is estimated that 90% of all attacks start with a phishing email. However, as companies have shifted to using a multitude of apps to support communication and collaboration, attackers too have evolved their approach. Attackers now engage employees across a combination of channels in an attempt to build trust and pivot targeted users to less-secure apps and devices. Cloudflare is uniquely positioned to address this trend thanks to our integrated Zero Trust services, extensive visibility from protecting approximately 20% of all websites, and signals derived from processing billions of email messages a year.

Cloudflare recognizes that combating phishing requires an integrated approach and a more complete view of user-based risk. That’s why we’ve designed our email security solution to protect organizations before, during, and after message delivery, while also extending protection beyond email into the broader security ecosystem. Phishing is no longer just an email problem — it’s a multi-channel, cross-application threat.

Assessing holistic user risk

When it comes to protecting against user-based threats, Cloudflare employs a platform approach to security. Instead of forcing customers to rely on an array of fragmented tools that create unnecessary complexity and blind spots, we treat email security as part of an overall strategy for assessing and responding to user-related risk. Our email security solution works in tandem with our network solutions so that SOC teams can quickly ascertain what actions their users are performing outside of email. Given our extensive network visibility, our platform is not limited by API integrations, and can provide SOC teams with the best visibility and protection. This helps SOC teams not only combat phishing, but begin to identify and take action against a wider range of insider threats.

Within a single, unified dashboard, SOC teams can quickly review detailed information regarding the following questions, which we discuss in more detail below:

Who in the organization is being targeted?
Who are the attackers impersonating?
What risky behaviors are my users performing?

Who in the organization is being targeted?

Within the Cloudflare dashboard, SOC teams can view which users are the most targeted. This can help them determine which accounts should be hardened (e.g., MFA enforced), and identify risky users that should be monitored more closely for significant deviations in behavior. One way organizations can use this information is to require high-risk users to connect from a managed device. For instance, if they use CrowdStrike, we can require that these users be on a managed device and force a posture check before letting them access sensitive applications.

SOC teams can also dive into what types of attacks are hitting their users and at what frequency.

Customers can use these insights to adjust various platform policies, effectively blocking malicious content and securing sensitive resources. Above, we can see that attackers are frequently leveraging links to try to compromise users. Based on the link analysis we are seeing in email, SOC teams can use our gateway to block similar attacks, so that when attackers try to use other communication methods (LinkedIn, Teams, Slack, etc.) users will not be able to interact with those links.

To learn more about stopping these types of multichannel phishing attacks, please see our blog post, A wild week in phishing, and what it means for you.

Who are the attackers impersonating?

SOC teams can also get visibility into impersonation attempts within their email environment. Customers can see which users are being impersonated the most, and can use this information to build policies within our email security solution and broader set of Zero Trust services.

A list of frequently impersonated users can be added to the impersonation registry, which changes the sensitivity of our models to apply more scrutiny on messages coming from those users.

Given our unique position as a domain name registrar, customers can also report lookalike domains to Cloudflare for action to be taken against them. This helps prevent attackers from being able to impersonate our customers and negatively impact their reputation.

Finally, customers can also use our free DMARC management to track who is sending emails on their behalf. This information can be used to update SPF records and get customers to p=quarantine or p=reject so that their brand is more resistant to being spoofed.
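
To make the p=quarantine and p=reject targets concrete, the following sketch reads the policy tag from a DMARC TXT record. It is a simplified illustration, not part of any Cloudflare tooling: the function names and the example record are assumptions, and the parser ignores most of the validation a full DMARC implementation performs.

```python
def parse_dmarc(record):
    """Parse a DMARC TXT record into a dict of tag -> value."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            name, _, value = part.partition("=")
            tags[name.strip().lower()] = value.strip()
    if tags.get("v") != "DMARC1":
        raise ValueError("not a DMARC record")
    return tags


def is_enforcing(record):
    """True when the policy is quarantine or reject instead of none."""
    return parse_dmarc(record).get("p", "").lower() in {"quarantine", "reject"}


print(is_enforcing("v=DMARC1; p=none; rua=mailto:dmarc@example.com"))
print(is_enforcing("v=DMARC1; p=reject; rua=mailto:dmarc@example.com"))
```

The first record only monitors, so the check prints False; the second enforces rejection, so it prints True.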

What risky behaviors are my users performing?

Cloudflare provides visibility into user actions in several ways.

Within the email security solution, we can track internal messages and alert if we see any malicious or suspicious behaviors. This can be enhanced with our managed service offering, PhishGuard, which can alert admins when they see any type of behavior that indicates fraud (like Business Email Compromise), account takeover, or insider threats.

SOC teams can also take advantage of our CASB solution to view the different actions that users have performed. Actions are labeled with different risk levels to let teams know which findings are critical and require remediation.

Customers are also able to view data loss prevention (DLP) violations that users have incurred to see if there is any unauthorized egress of data. We provide the ability to automatically block this egress based on different policies within our platform, making sure there is no exfiltration of sensitive data.

We also enable organizations to put internal applications behind our Access solution. This prevents any users with improper permissions or a high risk level from accessing critical applications. Our dashboard then provides metrics on these logins to see how many failures we observed, so that SOC teams can investigate the user further.

These signals feed into our Unified Risk Score, which can be exported if needed to take automated actions within other platforms.

Increasing SOC productivity

With all of our functionality unified within a single interface and fed by one data lake, we see an increase in SOC productivity because teams no longer have to spend time building rules or flipping between disparate interfaces and workflows.

AI-driven email security

Unlike legacy secure email gateways, our email security solution is driven by predictive AI models which eliminate the need for creating and updating rules. These models are also more effective than reactive measures because they are fed by a massive volume of diverse data from across Cloudflare’s network. This means models are trained on emerging threats earlier and can identify new tactics with a higher accuracy than legacy systems.

Automated isolation

To further reduce the risk posed by users visiting potentially malicious websites, customers can isolate browser sessions using our natively integrated, clientless remote browser that runs on our global network. Within an isolated browsing session, SOC teams can prohibit various behaviors such as copy/paste, upload/download, keyboard inputs, and more. This decreases the risk of users accessing a website and performing an action which could compromise the organization.

Our browser isolation solution also decreases the time SOC teams need to maintain policies. Rather than adding domains and applications one by one, teams can choose to isolate based on content categories. These categories are based on our threat intelligence, and are constantly updated. This means that as new websites emerge, SOC teams do not have to spend the time to chase down and update the proper policy — rather, it is done automatically.

Automated blocking

While some websites might require running in an isolated browser to mitigate the risk of users encountering malicious content, others may need to be blocked altogether. Customers can use the same process listed above to block any website that could be risky for users based on tags. However, we allow admins to also provide feedback to users to increase awareness. This can be done via a custom block page that allows SOC teams to communicate with users about their risky behaviors, so that they take actions to curb this behavior in the future and alert their SOC teams to attacks that might be occurring.

What’s on the horizon for 2025

In 2024, our email security team focused on refining the user interface and improving the incident investigation experience. Looking ahead to 2025, we plan to introduce additional capabilities that deepen the integration of our email security solution with our SASE platform, delivering enhanced insight and protection against user-based threats.

Configurable browser isolation for email

Our Email Link Isolation feature currently applies to links we consider suspicious. However, we intend to allow customers to add customized configurations to meet their internal policies. This enhancement will provide more granular control over which websites users can access from an email message without using an isolated browser.

Outbound DLP for email

We will be releasing an add-in for Microsoft Outlook that will allow customers to use our DLP engine for inspecting outbound email messages. This client-side application enables customers to configure downstream policies that trigger action when a DLP policy is violated, all while minimizing disruption to existing email infrastructure.

Expanded user risk scoring

Cloudflare will be increasing the number of signals that feed into our user risk scores. This will enable SOC teams to create more policies within Cloudflare or to take automated actions externally based on the level of risk observed.

These are just a few examples of significant releases that will be coming in 2025. Please stay tuned to the Cloudflare blog where we will be announcing these releases as they happen.

Try Cloudflare Email Security today

We provide all organizations (whether a Cloudflare customer or not) with free access to our Retro Scan tool, allowing them to use our predictive AI models to scan existing inbox messages. Retro Scan will detect and highlight any threats found, enabling organizations to remediate them directly in their email accounts. With these insights, organizations can implement further controls, either using Cloudflare Email Security or their preferred solution, to prevent similar threats from reaching their inboxes in the future.

The Books We Read in High School (Part 1)

2024-12-19 The Atlantic

Post Syndicated from The Atlantic original https://www.youtube.com/watch?v=vW_TP33zX6E

“A Responsibility of Society”: Eva Yambor on Children with Down Syndrome in Austria

2024-12-19 Надежда Цекулова

Post Syndicated from Надежда Цекулова original https://www.toest.bg/otgovornost-na-obshtestvoto-eva-yambor-za-detsata-sus-sindrom-na-daun-v-avstria/

Eva is married to a Bulgarian, and the family lived in Bulgaria until 18 years ago, when their son Yoan was born. After learning that the child has Down syndrome, the parents decided to move to Austria to take advantage of what the Austrian system guarantees for the care, development, and education of children with developmental differences. Today Yoan works in a café, a social enterprise that began as a private initiative but received state support after proving its effectiveness.

A conversation about what matters most for children and people with developmental differences, namely being among us, in the series “Conversations on the Education of Special Children,” supported by Lidl Bulgaria.

We are presenting you in your role as a parent, so I would like to ask that we begin with your story as the parent of a child with Down syndrome.

I gave birth in Bulgaria almost 18 years ago. I was in a privileged position compared with most women giving birth, because I insisted that my own gynecologist and my own midwife be present at the delivery. I had lost two children by then, I was in a deep crisis, and I needed support.

When Yoan was born, he was received by a midwife from the hospital. She was the first to take him in her arms, and she almost screamed: “No, didn’t you have an amniocentesis?” Those were the words with which she welcomed Yoan into this world. She saw immediately that he had Down syndrome. But my gynecologist and my midwife dismissed that suspicion, as if the hospital midwife’s inappropriate reaction had startled them into not seeing the obvious. So at first I lived with the belief that my son was healthy.

I, too, was so in love with my son that I saw nothing unusual.

Didn’t the midwife’s question, “Didn’t you have an amniocentesis?”, nag at you?

No. For me, not having that test after two losses was a conscious choice. On the one hand, I did not want to take any risk. On the other, I believe every woman has the right to choose whether to have children. But my personal ethics do not allow me, once I have consciously created a child, to choose whether I want this particular child.

And so for more than six months I raised Yoan as the greatest gift. I had the incredible luck of never hearing “poor thing” from anyone, of being spared the anxiety of having given birth to a child with Down syndrome. And he developed exceptionally well in those first months. We were also lucky that he had none of the health problems that are common in children with the syndrome.

When did you learn about Yoan’s condition?

When he was 10 or 11 months old. By then I had already formed an image of him as the most wonderful child, and perhaps that is why the news did not crush me the way it does other parents. I realized how much of a difference that makes a little later. When we came to Austria, I was invited to take part in a survey of parents of children with the syndrome. In the questionnaire I saw the question “Did people congratulate you on the birth of your child?” Only then did I understand that families who have a child with Down syndrome often receive no congratulations from those close to them. I will never forget what I felt as I read that questionnaire, and the deep sense of how grateful I am that fate spared me this.

I remembered how, when I gave birth, I sent a message to all my contacts: “My happiness has a name: Yoan.” I asked myself whether I would have written it had I known he had Down syndrome. Today my answer is “yes,” but what would it have been on that day? In the first moment you do not know what awaits you, because these people are still not visible enough in society and we lack knowledge about them. There are isolated cases that only show us how much work still lies ahead: in the United States there is a very popular model, in Spain a well-known actor, in Sofia two young women work at a large hotel… That is wonderful, but it should not be the exception. It should be normal to see them everywhere, so that there is no fear in society and people are not afraid to congratulate a family into which a child with Down syndrome has been born.

I would like to take you back to the decision to leave Bulgaria after you learned that Yoan has Down syndrome. What attracted you to the Austrian system?

We are talking about a time almost 18 years ago. Back then, practice in Bulgaria was only just beginning to take shape. There were a handful of specialists working piecemeal. When I found out and started searching online, I immediately found information on where to go in Vienna, what financial support I would receive in Austria, which tests needed to be done, which medical check-ups were recommended, and how often. Besides, I am Austrian, my son was entitled to this care, and for us the choice to move was a natural one.

Who organizes this support: a non-governmental organization, a private initiative, or the state?

The first place a family like ours was sent to was an Institute for Down Syndrome, part of a state hospital. I received all services for Yoan with my health insurance card [a document certifying health insurance status, author’s note]. From there I was also referred to a parents’ association, but at that point I was not yet ready to take part in that community; they dealt mainly with problems in education and with the career prospects of older children. There were also services, such as sessions with a speech therapist, that I paid for myself, but part of the cost was later reimbursed and another part was deducted from my taxes.

There were, of course, private initiatives too, whose role is to build on what already exists. For example, a lady whom we now know well had just founded a training center for parents and teachers. She is a special education teacher by profession, and when she had a child with Down syndrome, she became determined and began collecting the available knowledge on working with such children and adapting it into practical programs: from parents knowing how to develop the child’s potential through everyday games, choice of toys, and activities, to how to teach these children mathematics, for example. Training at that center is paid out of pocket, but it is an entirely different level of care.

It is good that you mentioned mathematics, so that we can move on to education. In Bulgaria, in theory, children with Down syndrome should attend mainstream schools, where they receive additional specialized support. In practice, the result is not always good. What is your experience with the Austrian system?

Things are similar in education here as well. Kindergarten works well: the child can attend a mainstream group, where all the other children are typically developing. The child can also attend a specialized group, which is organized differently but still on an inclusive principle, meaning there are children with various disabilities but also healthy children. After that, however, it gets more complicated.

First, in my view a parent should be able to send the child to school later, because development is slower. But in Austria everyone starts school at six, and a deferral is possible for only one year. And another thing: parents have the right to send the child to a mainstream school, but if the specialized teachers do not have enough hours, there will be no one to work with the child, and the child will not be able to develop at their own pace. Because what we have seen from experience is that intensive contact and individual sessions are key to development. Children with Down syndrome are different from one another, just as we are: some learn easily, others with great, great difficulty. And for them to have a real education, it must be tailored to their individual pace and needs. That is why we decided to send our son to a private school.

There is a conceptual question tied to the dilemma of where children with developmental differences are better off: in a mainstream school, where they are among other children but there are fewer opportunities to work with them individually, or in specialized centers, where they have more individual sessions but no contact with the natural environment of their peers. Is this discussed in Austria?

A very, very good question, and a very difficult answer. Yes, this dilemma exists in Austria too. In principle, these specialized centers, where children with differences are isolated and we do not see them, should no longer exist; everyone should be able to learn together. The reality, however, is that if you are a special child attending a mainstream school and you have no special teacher beside you, that hinders your development.

In my view, more effort and investment are needed for the two to go hand in hand: for children to be in mainstream school and in contact with children without disabilities, but for that mainstream environment to have enough qualified staff to support them, and the right conditions. For example, children with developmental differences are often more sensitive and cannot cope all day long with so many people, so much noise and emotion. Mainstream schools should have a room where these children can have some of their individual specialized sessions. But I have to tell you, we have not achieved this in Austria, so there is no single answer.

I think that as a society we must always show that we are open and that there is room for all of us. Frankly, I believe that people without disabilities benefit the most from inclusive education. The presence of different children in class teaches the others tolerance and empathy and builds social and emotional skills that are so valuable in today’s world. The Austrian state, like the Bulgarian one, still has a great deal of work to do to make inclusive education truly effective for everyone.

Is this the state’s job?

It is one hundred percent the state’s task. It is another matter that when the state fails to do it, people roll up their sleeves and create initiatives themselves. In Austria, for example, the biggest problem with our children’s education arises when they turn 15. Up to that age education is compulsory and access for students with developmental differences is guaranteed, but after that it is not, and whether a child with special needs stays in class depends on the goodwill of the school principal. Last year we organized protests over this, and it became a major scandal, because it is precisely our children who most need to stay in school longer. I am convinced that this will change; it is the politicians’ job to change it. In the meantime, fortunately, there are many initiatives in Austria, and my son is part of one.

Educational initiatives?

Educational ones too, yes. Together with some of his teachers, we set up a café. There, Yoan and other children like him learn a trade and attend school at the same time. We tried to create a model in which they can gain a vocational qualification while continuing their studies. We set it up with a lot of sponsorship money and many donations. After that, however, we went through a special evaluation, and now there is state funding for this model: a state fund pays Yoan’s teachers for working with him. This is a typical example of how things work in Austria: a private initiative starts an activity, and once it proves its effectiveness, it can receive state support.

So this is common practice in Austria?

Yes, yes. It is good that the state understands that supporting such committed people is a big plus. And I am very optimistic that initiatives like these will also give the state an incentive to change its policies. Right now in Austria we have many workshops. These are specialized places where people with disabilities work, but there they are hidden; they are not among us. In this café, by contrast, my son is among people every day, and some customers go there specifically. There is a lady who travels half an hour by tram three times a week. When we asked her why she does it, she replied: “It is so cozy at your place, I love these children so much, and they welcome me so warmly!”

This is a wonderful balance, and society is ready for something like it. Austria has not yet done this as a state, has not turned it into policy, but this is the direction in which we need to work: seeing people with Down syndrome, people with disabilities, among us. I understand that it is difficult and that there are no ideal models, but the responsibility for continually improving the environment must not be shifted onto someone else; it belongs to society. That is very important.

This interview is part of a series of conversations about access to education for children from vulnerable groups. The project is made possible by Lidl Bulgaria’s largest social responsibility initiative, “You and Lidl for Our Tomorrow,” in partnership with the Workshop for Civic Initiatives Foundation, the Bulgarian Donors Forum, and the Association of European Journalists.

[$] LWN.net Weekly Edition for December 19, 2024

2024-12-19 corbet

Post Syndicated from corbet original https://lwn.net/Articles/1001869/

The LWN.net Weekly Edition for December 19, 2024 is available.

A Hyena in the White House

2024-12-19 The History Guy: History Deserves to Be Remembered

Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=4H_kEr5tMM8

Comic for 2024.12.19 – Plans

2024-12-19 Explosm.net

Post Syndicated from Explosm.net original https://explosm.net/comics/plans

New Cyanide and Happiness Comic

Inventor of the Rubik’s cube on the cube and how it relates to problem solving

2024-12-18 Talks at Google

Post Syndicated from Talks at Google original https://www.youtube.com/watch?v=f87Vff-Ik9A

AWS named Leader in the 2024 ISG Provider Lens report for Sovereign Cloud Infrastructure Services (EU)

2024-12-18 Marta Taggart

Post Syndicated from Marta Taggart original https://aws.amazon.com/blogs/security/aws-named-leader-in-the-2024-isg-provider-lens-report-for-sovereign-cloud-infrastructure-services-eu/

For the second year in a row, Amazon Web Services (AWS) is named as a Leader in the Information Services Group (ISG) Provider Lens Quadrant report for Sovereign Cloud Infrastructure Services (EU), published on December 18, 2024. ISG is a leading global technology research, analyst, and advisory firm that serves as a trusted business partner to more than 900 clients. This ISG report evaluates 19 providers of sovereign cloud infrastructure services in the multi public cloud environment and examines how they address the key challenges that enterprise clients face in the European Union (EU). ISG defines Leaders as providers who represent innovative strength and competitive stability.

ISG rated AWS ahead of other leading cloud providers on both the competitive strength and portfolio attractiveness axes, with the highest score on portfolio attractiveness. Competitive strength was assessed on multiple factors, including degree of awareness, core competencies, and go-to-market strategy. Portfolio attractiveness was assessed on multiple factors, including scope of portfolio, portfolio quality, strategy and vision, and local characteristics.

According to the ISG Provider Lens report, “AWS develops various innovative solutions to meet different sovereignty needs, guided by inputs from regulators, cybersecurity experts, partners and customers. These solutions address factors such as location, workload sensitivity and industry standards.”

Read the report to:

Gather insight on the factors that ISG believes will influence the sovereign cloud landscape in the EU.
Discover why AWS was named as a Leader with the highest score on portfolio attractiveness by ISG.
Learn what makes the AWS Cloud sovereign-by-design and how we continue to offer more control and more choice without compromising on the full power of AWS.

The recognition of AWS as a Leader in this report for the second year in a row is a testament to our efforts to help European customers and partners meet their digital sovereignty and resilience requirements. AWS continues to deliver on the AWS Digital Sovereignty Pledge, our commitment to offering AWS customers the most advanced set of sovereignty controls and features available in the cloud. Earlier this year, we announced plans to invest €7.8 billion in the AWS European Sovereign Cloud by 2040, building on our long-term commitment to Europe and ongoing support of the region’s sovereignty needs. The AWS European Sovereign Cloud, which will be a new, independent cloud for Europe, is set to launch by the end of 2025.

Download the full 2024 ISG Provider Lens Quadrant report for Sovereign Cloud Infrastructure Services (EU).

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

The Hasivo S600WP-5XGT-1SX-SE Might Be The Best 6-port 10GbE Switch with PoE

2024-12-18 Rohit Kumar

Post Syndicated from Rohit Kumar original https://www.servethehome.com/hasivo-s600wp-5xgt-1sx-se-review-6-port-10gbe-switch-with-poe-realtek/

The Hasivo S600WP-5XGT-1SX-SE is perhaps the cheapest 6-port 10Gbase-T and SFP+ web-managed switch with PoE capabilities.

The post The Hasivo S600WP-5XGT-1SX-SE Might Be The Best 6-port 10GbE Switch with PoE appeared first on ServeTheHome.
