How AWS is helping customers achieve their digital sovereignty and resilience goals

Post Syndicated from Max Peterson original https://aws.amazon.com/blogs/security/how-aws-is-helping-customers-achieve-their-digital-sovereignty-and-resilience-goals/

As we’ve innovated and expanded the Amazon Web Services (AWS) Cloud, we continue to prioritize making sure customers are in control and able to meet regulatory requirements anywhere they operate. With the AWS Digital Sovereignty Pledge, which is our commitment to offering all AWS customers the most advanced set of sovereignty controls and features available in the cloud, we are investing in an ambitious roadmap of capabilities for data residency, granular access restriction, encryption, and resilience. Today, I’ll focus on the resilience pillar of our pledge and share how customers are able to improve their resilience posture while meeting their digital sovereignty and resilience goals with AWS.

Resilience is the ability for any organization or government agency to respond to and recover from crises, disasters, or other disruptive events while maintaining its core functions and services. Resilience is a core component of sovereignty and it’s not possible to achieve digital sovereignty without it. Customers need to know that their workloads in the cloud will continue to operate in the face of natural disasters, network disruptions, and disruptions due to geopolitical crises. Public sector organizations and customers in highly regulated industries rely on AWS to provide the highest level of resilience and security to help meet their needs. AWS protects millions of active customers worldwide across diverse industries and use cases, including large enterprises, startups, schools, and government agencies. For example, the Swiss public transport organization BERNMOBIL improved its ability to protect data against ransomware attacks by using AWS.

Building resilience into everything we do

AWS has made significant investments in building and running the world’s most resilient cloud by building safeguards into our service design and deployment mechanisms and instilling resilience into our operational culture. We build to guard against outages and incidents, and account for them in the design of AWS services—so when disruptions do occur, their impact on customers and the continuity of services is as minimal as possible. To avoid single points of failure, we minimize interconnectedness within our global infrastructure. The AWS global infrastructure is geographically dispersed, spanning 105 Availability Zones (AZs) within 33 AWS Regions around the world. Each Region comprises multiple Availability Zones, and each AZ includes one or more discrete data centers with independent and redundant power infrastructure, networking, and connectivity. Availability Zones in a Region are meaningfully distant from each other, up to 60 miles (approximately 100 km) apart, to help prevent correlated failures, but close enough to use synchronous replication with single-digit millisecond latency. AWS is the only cloud provider to offer three or more Availability Zones within each of its Regions, providing more redundancy and better isolation to contain issues. Common points of failure, such as generators and cooling equipment, aren’t shared across Availability Zones and are designed to be supplied by independent power substations. To better isolate issues and achieve high availability, customers can partition applications across multiple Availability Zones in the same Region. Learn more about how AWS maintains operational resilience and continuity of service.

Resilience is deeply ingrained in how we design services. At AWS, the services we build must meet extremely high availability targets. We think carefully about the dependencies that our systems take. Our systems are designed to stay resilient even when those dependencies are impaired; we use what is called static stability to achieve this level of resilience. This means that systems operate in a static state and continue to operate as normal without needing to make changes during a failure or when dependencies are unavailable. For example, in Amazon Elastic Compute Cloud (Amazon EC2), after an instance is launched, it’s just as available as a physical server in a data center. The same property holds for other AWS resources such as virtual private clouds (VPCs), Amazon Simple Storage Service (Amazon S3) buckets and objects, and Amazon Elastic Block Store (Amazon EBS) volumes. Learn more in our Fault Isolation Boundaries whitepaper.

Information Services Group (ISG) cited strengthened resilience when naming AWS a Leader in its recent report, Provider Lens for Multi Public Cloud Services – Sovereign Cloud Infrastructure Services (EU): “AWS delivers its services through multiple Availability Zones (AZs). Clients can partition applications across multiple AZs in the same AWS region to enhance the range of sovereign and resilient options. AWS enables its customers to seamlessly transport their encrypted data between regions. This ensures data sovereignty even during geopolitical instabilities.”

AWS empowers governments of all sizes to safeguard digital assets in the face of disruptions. We proudly worked with the Ukrainian government to securely migrate data and workloads to the cloud immediately following Russia’s invasion, preserving vital government services that will be critical as the country rebuilds. We supported the migration of over 10 petabytes of data, which included data from 42 Ukrainian government authorities, 24 Ukrainian universities, a remote learning K–12 school serving hundreds of thousands of displaced children, and dozens of private sector companies.

For customers who are running workloads on-premises or for remote use cases, we offer solutions such as AWS Local Zones, AWS Dedicated Local Zones, and AWS Outposts. Customers deploy these solutions to help meet their needs in highly regulated industries. For example, to help meet the rigorous performance, resilience, and regulatory demands for the capital markets, Nasdaq used AWS Outposts to provide market operators and participants with added agility to rapidly adjust operational systems and strategies to keep pace with evolving industry dynamics.

Enabling you to build resilience into everything you do

Millions of customers trust that AWS is the right place to build and run their business-critical and mission-critical applications. We provide a comprehensive set of purpose-built resilience services, strategies, and architectural best practices that you can use to improve your resilience posture and meet your sovereignty goals. These services, strategies, and best practices are outlined in the AWS Resilience Lifecycle Framework across five stages—Set Objectives, Design and Implement, Evaluate and Test, Operate, and Respond and Learn. The Resilience Lifecycle Framework is modeled after a standard software development lifecycle, so you can easily incorporate resilience into your existing processes.

For example, you can use the AWS Resilience Hub to set your resilience objectives, evaluate your resilience posture against those objectives, and implement recommendations for improvement based on the AWS Well-Architected Framework and AWS Trusted Advisor. Within Resilience Hub, you can create and run AWS Fault Injection Service experiments, which allow you to test how your application will respond to certain types of disruptions. Recently, Pearson, a global provider of educational content, assessment, and digital services to learners and enterprises, used Resilience Hub to improve their application resilience.
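
For teams that want to script this kind of testing, an experiment can also be started programmatically. The following is a minimal sketch using the AWS SDK for Python (Boto3); the Region, experiment template ID, and tag are placeholders you would replace with values from your own AWS Fault Injection Service setup:

import uuid

import boto3

# Placeholder Region and experiment template ID; the template itself is created in
# AWS Fault Injection Service beforehand (for example, from a Resilience Hub recommendation).
fis = boto3.client("fis", region_name="us-east-1")

response = fis.start_experiment(
    clientToken=str(uuid.uuid4()),            # idempotency token
    experimentTemplateId="EXT1A2B3CEXAMPLE",  # hypothetical template ID
    tags={"purpose": "resilience-validation"},
)
print(response["experiment"]["id"], response["experiment"]["state"]["status"])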

Other AWS resilience services such as AWS Backup, AWS Elastic Disaster Recovery (AWS DRS), and Amazon Route 53 Application Recovery Controller (Route 53 ARC) can help you quickly respond and recover from disruptions. When Thomson Reuters, an international media company that provides solutions for tax, law, media, and government to clients in over 100 countries, wanted to improve data protection and application recovery for one of its business units, they adopted AWS DRS. AWS DRS provides Thomson Reuters with continuous replication, so changes made in the source environment were updated in the disaster recovery site within seconds.

Achieve your resilience goals with AWS and our AWS Partners

AWS offers multiple ways for you to achieve your resilience goals, including assistance from AWS Partners and AWS Professional Services. AWS Resilience Competency Partners specialize in improving the availability and resilience of customers’ critical workloads in the cloud. AWS Professional Services offers Resilience Architecture Readiness Assessments, which assess customer capabilities in eight critical domains—change management, disaster recovery, durability, observability, operations, redundancy, scalability, and testing—to identify gaps and areas for improvement.

We remain committed to continuing to enhance our range of sovereign and resilient options, allowing customers to sustain operations through disruption or disconnection. AWS will continue to innovate based on customer needs to help you build and run resilient applications in the cloud to keep up with the changing world.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.


Max Peterson

Max is the Vice President of AWS Sovereign Cloud. He leads efforts to ensure that all AWS customers around the world have the most advanced set of sovereignty controls, privacy safeguards, and security features available in the cloud. Before his current role, Max served as the VP of AWS Worldwide Public Sector (WWPS) and created and led the WWPS International Sales division, with a focus on empowering government, education, healthcare, aerospace and satellite, and nonprofit organizations to drive rapid innovation while meeting evolving compliance, security, and policy requirements. Max has over 30 years of public sector experience and served in other technology leadership roles before joining Amazon. Max earned both a Bachelor of Arts in Finance and a Master of Business Administration in Management Information Systems from the University of Maryland.

Use Amazon Verified Permissions for fine-grained authorization at scale

Post Syndicated from Abhishek Panday original https://aws.amazon.com/blogs/security/use-amazon-verified-permissions-for-fine-grained-authorization-at-scale/

Implementing user authentication and authorization for custom applications requires significant effort. For authentication, customers often use an external identity provider (IdP) such as Amazon Cognito. Yet, authorization logic is typically implemented in code. This code can be prone to errors, especially as permissions models become complex, and presents significant challenges when auditing permissions and deciding who has access to what. As a result, within Common Weakness Enumeration’s (CWE’s) list of the Top 25 Most Dangerous Software Weaknesses for 2023, four are related to incorrect authorization.

At re:Inforce 2023, we launched Amazon Verified Permissions, a fine-grained permissions management service for the applications you build. Verified Permissions centralizes permissions in a policy store and lets developers use those permissions to authorize user actions within their applications. Permissions are expressed as Cedar policies. You can learn more about the benefits of moving your permissions centrally and expressing them as policies in Policy-based access control in application development with Amazon Verified Permissions.

In this post, we explore how you can provide a faster and richer user experience while still authorizing all requests in the application. You will learn two techniques—batch authorization and response caching—to improve the efficiency of your applications. We describe how you can apply these techniques when listing authorized resources and actions and loading multiple components on webpages.

Use cases

You can use Verified Permissions to enforce permissions that determine what the user is able to see at the level of the user interface (UI), and what the user is permitted to do at the level of the API.

  1. UI permissions enable developers to control what a user is allowed to see in the application. Developers enforce permissions in the UI to control the list of resources a user can see and the actions they can take. For example, a UI-level permission in a banking application might determine whether a transfer funds button is enabled for a given account.
  2. API permissions enable developers to control what a user is allowed to do in an application. Developers control access to individual API calls made by an application on behalf of the user. For example, an API-level permission in a banking application might determine whether a user is permitted to initiate a funds transfer from an account.

Cedar provides consistent and readable policies that can be used at both the level of the UI and the API. For example, a single policy can be checked at the level of the UI to determine whether to show the transfer funds button and checked at the level of the API to determine authority to initiate the funds transfer.
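
As a rough illustration of that idea, the same Cedar-backed check can be called from both layers through the Verified Permissions IsAuthorized API. The following sketch uses the AWS SDK for Python (Boto3); the policy store ID, entity types, and action name are hypothetical placeholders rather than values from a real application:

import boto3

avp = boto3.client("verifiedpermissions")

def can_transfer_funds(user_id: str, account_id: str) -> bool:
    # One Cedar-backed decision, reused by both the UI and the API layer.
    response = avp.is_authorized(
        policyStoreId="ps-0example",                                   # placeholder
        principal={"entityType": "Bank::User", "entityId": user_id},   # hypothetical entity types
        action={"actionType": "Bank::Action", "actionId": "TransferFunds"},
        resource={"entityType": "Bank::Account", "entityId": account_id},
    )
    return response["decision"] == "ALLOW"

# UI level: decide whether to enable the transfer funds button.
show_transfer_button = can_transfer_funds("alice", "checking-001")

# API level: authorize the actual transfer request with the same check.
if not can_transfer_funds("alice", "checking-001"):
    raise PermissionError("Transfer not allowed")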

Challenges

Verified Permissions can be used for implementing fine-grained API permissions. Customer applications can use Verified Permissions to authorize API requests, based on centrally managed Cedar policies, with low latency. Applications authorize such requests by calling the IsAuthorized API of the service, and the response indicates whether the request is allowed or denied. Customers are happy with the latency of individual authorization requests, but have asked us to help them improve performance for use cases that require multiple authorization requests. They typically mention two use cases:

  • Compound authorization: Compound authorization is needed when one high-level API action involves many low-level actions, each of which has its own permissions. This requires the application to make multiple requests to Verified Permissions to authorize the user action. For example, in a banking application, loading a credit card statement requires three API calls: GetCreditCardDetails, GetCurrentStatement, and GetCreditLimit. This requires three calls to Verified Permissions, one for each API call.
  • UI permissions: Developers implement UI permissions by calling the same authorization API for every possible resource a principal can access. Each request involves an API call, and the UI can only be presented after all of them have completed. Alternatively, for a resource-centric view, the application can make the call for multiple principals to determine which ones have access.

Solution

In this post, we show you two techniques to optimize your application’s latency when enforcing API permissions and UI permissions.

  1. Batch authorization allows you to make up to 30 authorization decisions in a single API call. This feature was released in November 2023. See the what’s new post and API specifications to learn more.
  2. Response caching enables you to cache authorization responses in a policy enforcement point such as Amazon API Gateway, AWS AppSync, or AWS Lambda. You can cache responses using native enforcement point caches (for example, API Gateway caching) or managed caching services such as Amazon ElastiCache.

Solving for enforcing fine-grained permissions while delivering a great user experience

You can use UI permissions to authorize what resources and actions a user can view in an application. We see developers implementing these controls by first generating a small set of resources based on database filters and then further reducing the set down to authorized resources by checking permissions on each resource using Verified Permissions. For example, when a user of a business banking system tries to view balances on company bank accounts, the application first filters the list to the set of bank accounts for that company. The application then filters the list further to only include the accounts that the user is authorized to view by making an API request to Verified Permissions for each account in the list. With batch authorization, the application can make a single API call to Verified Permissions to filter the list down to the authorized accounts.
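
A minimal sketch of that filtering step with batch authorization might look like the following, using the AWS SDK for Python (Boto3). The policy store ID, entity types, and action name are assumptions for illustration, and a real application would also pass any entity data required by its schema:

import boto3

avp = boto3.client("verifiedpermissions")

def filter_viewable_accounts(user_id: str, account_ids: list[str]) -> list[str]:
    # One BatchIsAuthorized call (up to 30 decisions) replaces one IsAuthorized call per account.
    response = avp.batch_is_authorized(
        policyStoreId="ps-0example",  # placeholder
        requests=[
            {
                "principal": {"entityType": "Bank::User", "entityId": user_id},
                "action": {"actionType": "Bank::Action", "actionId": "ViewBalance"},
                "resource": {"entityType": "Bank::Account", "entityId": account_id},
            }
            for account_id in account_ids[:30]  # batch limit; page through larger lists
        ],
    )
    return [
        result["request"]["resource"]["entityId"]
        for result in response["results"]
        if result["decision"] == "ALLOW"
    ]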

Similarly, you can use UI permissions to determine which components of a page or which actions should be visible to users of the application. For example, a banking application might control which sub-products (such as credit card, bank account, or stock trading) are visible to a user, or display only authorized actions (such as transfer or change address) on an account overview page. Customers want to use Verified Permissions to determine which components of the page to display, but doing so can adversely impact the user experience (UX) if it requires multiple API calls to build the page. With batch authorization, you can make one call to Verified Permissions to determine permissions for all components of the page. This enables you to provide a richer experience in your applications by displaying only the components that the user is allowed to access while maintaining low page load latency.

Solving for enforcing permissions for every API call without impacting performance

Compound authorization is where a single user action results in a sequence of multiple authorization calls. You can use batch authorization combined with response caching to improve efficiency. The application makes a single batch authorization request to Verified Permissions to determine whether each of the component API calls is permitted, and the response is cached. This cache is then referenced for each component API call in the sequence.
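
One way to combine the two techniques is to make the batch call once per user action and keep the decisions in a short-lived cache that each component API call consults. The following in-memory sketch (Python/Boto3) is only meant to show the idea; an enforcement point would more typically rely on API Gateway caching or Amazon ElastiCache, and the policy store ID, entity types, action names, and TTL are assumptions:

import time

import boto3

avp = boto3.client("verifiedpermissions")

_decision_cache: dict = {}   # (user_id, action, statement_id) -> (decision, expiry)
CACHE_TTL_SECONDS = 60       # assumed TTL; tune to how quickly policy changes must take effect

def prime_decisions(user_id: str, statement_id: str) -> None:
    # One batch call covers every low-level action behind the "load statement" user action.
    actions = ["GetCreditCardDetails", "GetCurrentStatement", "GetCreditLimit"]
    response = avp.batch_is_authorized(
        policyStoreId="ps-0example",  # placeholder
        requests=[
            {
                "principal": {"entityType": "Bank::User", "entityId": user_id},
                "action": {"actionType": "Bank::Action", "actionId": action},
                "resource": {"entityType": "Bank::Statement", "entityId": statement_id},
            }
            for action in actions
        ],
    )
    expiry = time.time() + CACHE_TTL_SECONDS
    for result in response["results"]:
        action = result["request"]["action"]["actionId"]
        _decision_cache[(user_id, action, statement_id)] = (result["decision"], expiry)

def is_allowed(user_id: str, action: str, statement_id: str) -> bool:
    # Component API calls read the cached decision instead of calling Verified Permissions again.
    decision, expiry = _decision_cache.get((user_id, action, statement_id), ("DENY", 0.0))
    return decision == "ALLOW" and time.time() < expiry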

Sample application – Use cases, personas, and permissions

We’re using an online order management application for a toy store to demonstrate how you can apply batch authorization and response caching to improve UX and application performance.

One function of the application is to enable employees in a store to process online orders.

Personas

The application is used by two types of users:

  • Pack associates are responsible for picking, packing, and shipping orders. They’re assigned to a specific department.
  • Store managers are responsible for overseeing the operations of a store.

Use cases

The application supports these use cases:

  1. Listing orders: Users can list orders. A user should only see the orders for which they have view permissions.
    • Pack associates can list all orders of their department.
    • Store managers can list all orders of their store.

    Figure 1 shows orders for Julian, who is a pack associate in the Soft Toy department.

    Figure 1: Orders for Julian in the Soft Toy department


  2. Order actions: Users can take some actions on an order. The application enables the relevant UI elements based on the user’s permissions.
    • Pack associates can select Get Box Size and Mark as Shipped, as shown in Figure 2.
    • Store managers can select Get Box Size, Mark as Shipped, Cancel Order, and Route to different warehouse.
    Figure 2: Actions available to Julian as a pack associate


  3. Viewing an order: Users can view the details of a specific order. When a user views an order, the application loads the details, label, and receipt. Figure 3 shows the available actions for Julian who is a pack associate.
    Figure 3: Order Details for Julian, showing permitted actions


Policy design

The application uses Verified Permissions as a centralized policy store. These policies are expressed in Cedar. The application uses the Role management using policy templates approach for implementing role-based access controls. We encourage you to read best practices for using role-based access control in Cedar to understand if the approach fits your use case.

In the sample application, the policy template for the store manager role looks like the following:

permit (
        principal == ?principal,
        action in [
                avp::sample::toy::store::Action::"OrderActions",
                avp::sample::toy::store::Action::"AddPackAssociate",
                avp::sample::toy::store::Action::"AddStoreManager",
                avp::sample::toy::store::Action::"ListPackAssociates",
                avp::sample::toy::store::Action::"ListStoreManagers"
        ],
        resource in ?resource
);

When a user is assigned a role, the application creates a policy from the corresponding template by passing the user and store. For example, the policy created for a store manager is as follows:

permit (
    principal ==  avp::sample::toy::store::User::"test_user_pool|sub_store_manager_user", 
    action in  [
                avp::sample::toy::store::Action::"OrderActions",
                avp::sample::toy::store::Action::"AddPackAssociate",
                avp::sample::toy::store::Action::"AddStoreManager",
                avp::sample::toy::store::Action::"ListPackAssociates",
                avp::sample::toy::store::Action::"ListStoreManagers"
    ],
    resource in avp::sample::toy::store::Store::"toy store 1"
);
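
To create such a template-linked policy programmatically, the application could call the Verified Permissions CreatePolicy API along these lines (Python/Boto3); the policy store and policy template IDs are placeholders, and the sample application’s own implementation may differ:

import boto3

avp = boto3.client("verifiedpermissions")

response = avp.create_policy(
    policyStoreId="ps-0example",  # placeholder
    definition={
        "templateLinked": {
            "policyTemplateId": "pt-0example",  # placeholder ID of the store manager template
            "principal": {
                "entityType": "avp::sample::toy::store::User",
                "entityId": "test_user_pool|sub_store_manager_user",
            },
            "resource": {
                "entityType": "avp::sample::toy::store::Store",
                "entityId": "toy store 1",
            },
        }
    },
)
print(response["policyId"])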

To learn more about the policy design of this application, see the readme file of the application.

Use cases – Design and implementation

In this section, we discuss the high-level design, the challenges with a barebones integration, and how you can use the preceding techniques to reduce latency and costs.

Listing orders

Figure 4: Architecture for listing orders


As shown in Figure 4, the process to list orders is:

  1. The user accesses the application hosted in AWS Amplify.
  2. The user then authenticates through Amazon Cognito and obtains an identity token.
  3. The application uses Amplify to load the order page and calls the ListOrders API to load the orders.
  4. The API is hosted in API Gateway and protected by a Lambda authorizer function.
  5. The Lambda function collects entity information from an in-memory data store to formulate the isAuthorized request.
  6. Then the Lambda function invokes Verified Permissions to authorize the request. The function checks with Verified Permissions for each order in the data store for the ListOrders call. If Verified Permissions returns deny, the order is not provided to the user. If Verified Permissions returns allow, the order is included in the response.

Challenge

Figure 5 shows that the application called IsAuthorized multiple times, sequentially. Multiple sequential calls cause the page to be slow to load and increase infrastructure costs.

Figure 5: Graph showing repeated calls to IsAuthorized


Reduce latency using batch authorization

If you transition to using batch authorization, the application can receive up to 30 authorization decisions with a single API call to Verified Permissions. As you can see in Figure 6, the time to authorize was reduced from close to 800 ms to 79 ms, delivering a better overall user experience.

Figure 6: Reduced latency by using batch authorization


Order actions

Figure 7: Order actions architecture


As shown in Figure 7, the process to get authorized actions for an order is:

  1. The user goes to the application landing page on Amplify.
  2. The application calls the Order actions API at API Gateway.
  3. The application sends a request for the actions available on the order so that only authorized actions are displayed to the user.
  4. The Lambda function collects entity information from an in-memory data store to formulate the isAuthorized request.
  5. The Lambda function then checks with Verified Permissions for each order action. If Verified Permissions returns deny, the action is dropped. If Verified Permissions returns allow, the action is added to the list of order actions returned to the application and shown in the user’s UI.

Challenge

As you saw with listing orders, Figure 8 shows that the application is still calling IsAuthorized multiple times, sequentially. This means the page remains slow to load and infrastructure costs increase.

Figure 8: Graph showing repeated calls to IsAuthorized


Reduce latency using batch authorization

By transitioning to batch authorization for this use case as well, the application can receive all decisions with a single API call to Verified Permissions. As you can see from Figure 9, the time to authorize was reduced from close to 500 ms to 150 ms, delivering an improved user experience.

Figure 9: Graph showing results of layering batch authorization


Viewing an order

Figure 10: Order viewing architecture


The process to view an order, shown in Figure 10, is:

  1. The user accesses the application hosted in Amplify.
  2. The user authenticates through Amazon Cognito and obtains an identity token.
  3. The application calls three APIs hosted at API Gateway.
  4. The Get order details, Get label, and Get receipt APIs are called sequentially to load the UI for the user in the application.
  5. A Lambda authorizer protects each of these APIs and is invoked for each call.
  6. The Lambda function collects entity information from an in-memory data store to formulate the isAuthorized request.
  7. For each API, the following steps are repeated. The Lambda authorizer is invoked three times during page load.
    1. The Lambda function invokes Verified Permissions to authorize the request. If Verified Permissions returns deny, the request is rejected and an HTTP 403 (Forbidden) response is returned. If Verified Permissions returns allow, the request is moved forward.
    2. If the request is allowed, API Gateway calls the Lambda Order Management function to process the request. This is the primary Lambda function supporting the application and typically contains the core business logic of the application.

Challenge

In using the standard authorization pattern for this use case, the application calls Verified Permissions three times. This is because the user action to view an order requires compound authorization: each API call made by the application must be authorized. While this enforces least privilege, it impacts the page load and reload latency of the application.

Reduce latency using batch authorization and decision caching

You can use batch authorization and decision caching to reduce latency. In the sample application, the cache is maintained by API Gateway. As shown in Figure 11, applying these techniques to the console application results in only one call to Verified Permissions, reducing latency.

Figure 11: Batch authorization with decision caching architecture


The decision caching process, shown in Figure 11, is:

  1. The user accesses the application hosted in Amplify.
  2. The user then authenticates through Amazon Cognito and obtains an identity token.
  3. The application then calls three APIs hosted at API Gateway.
  4. When the Get order details API is invoked, its Lambda authorizer calls batch authorization to get authorization decisions for the requested action, Get order details, and the related actions, Get label and Get receipt (a simplified sketch of such an authorizer follows these steps).
  5. A Lambda authorizer protects each of the above-mentioned APIs but, because of batch authorization and decision caching, is invoked only once.
  6. The Lambda function collects entity information from an in-memory data store to formulate the isAuthorized request.
  7. The Lambda function invokes Verified Permissions to authorize the request. If Verified Permissions returns deny, the request is rejected and an HTTP 403 (Forbidden) response is returned. If Verified Permissions returns allow, the request is moved forward.
    1. API Gateway caches the authorization decision for all actions (the requested action and related actions).
    2. If the request is allowed by the Lambda authorizer function, API Gateway calls the order management Lambda function to process the request. This is the primary Lambda function supporting the application and typically contains the core business logic of the application.
    3. When subsequent APIs are called, API Gateway uses the cached authorization decisions and doesn’t invoke the Lambda authorizer function.
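
The following is a simplified sketch of what such a Lambda authorizer could look like for a REST API, using the AWS SDK for Python (Boto3). It assumes the caller’s identity has already been extracted from a validated Amazon Cognito token, uses hypothetical policy store, entity, and route values, and relies on API Gateway authorizer caching (configured through the authorizer TTL) rather than implementing its own cache; the sample application’s actual authorizer may differ:

import boto3

avp = boto3.client("verifiedpermissions")

# The requested action plus the related actions needed to render the order page,
# mapped to the (hypothetical) API routes they protect.
ORDER_VIEW_ACTIONS = {
    "GetOrderDetails": "orders/{id}",
    "GetLabel": "orders/{id}/label",
    "GetReceipt": "orders/{id}/receipt",
}

def handler(event, context):
    # Token validation is omitted in this sketch; a real authorizer would verify the JWT
    # and derive the user and order identifiers from the token and the request path.
    user_id = "test_user_pool|sub_pack_associate_user"   # hypothetical principal
    order_id = "order-1234"                              # hypothetical resource

    response = avp.batch_is_authorized(
        policyStoreId="ps-0example",  # placeholder
        requests=[
            {
                "principal": {"entityType": "avp::sample::toy::store::User", "entityId": user_id},
                "action": {"actionType": "avp::sample::toy::store::Action", "actionId": action},
                "resource": {"entityType": "avp::sample::toy::store::Order", "entityId": order_id},
            }
            for action in ORDER_VIEW_ACTIONS
        ],
    )
    allowed = {
        result["request"]["action"]["actionId"]
        for result in response["results"]
        if result["decision"] == "ALLOW"
    }

    # Return one IAM policy covering all three routes so API Gateway can cache it and
    # skip this authorizer for the follow-up Get label and Get receipt calls.
    arn_prefix = "/".join(event["methodArn"].split("/")[:2])  # arn:aws:execute-api:...:api-id/stage
    statements = [
        {
            "Action": "execute-api:Invoke",
            "Effect": "Allow" if action in allowed else "Deny",
            "Resource": f"{arn_prefix}/GET/{path.format(id=order_id)}",
        }
        for action, path in ORDER_VIEW_ACTIONS.items()
    ]
    return {
        "principalId": user_id,
        "policyDocument": {"Version": "2012-10-17", "Statement": statements},
    }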

Caching considerations

You’ve seen how you can use caching to implement fine-grained authorization at scale in your applications. This technique works well when your application has high cache hit rates, where authorization results are frequently loaded from the cache. Applications where users initiate the same action multiple times or follow a predictable sequence of actions will observe high cache hit rates. Another consideration is that caching can increase the delay between a policy update and its enforcement. We don’t recommend caching authorization decisions if your application requires policies to take effect quickly or your policies are time-dependent (for example, a policy that gives access between 10:00 AM and 2:00 PM).

Conclusion

In this post, we showed you how to implement fine-grained permissions in applications at scale using Verified Permissions. We covered how you can use batch authorization and decision caching to improve performance and ensure Verified Permissions remains a cost-effective solution for large-scale applications. We applied these techniques to a demo application, avp-toy-store-sample, which is available to you for hands-on testing. For more information about Verified Permissions, see the Amazon Verified Permissions product details and Resources.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.


Abhishek Panday

Abhishek is a product manager in the Amazon Verified Permissions team. He’s been working with AWS for more than two years and has been at Amazon for over five years. Abhishek enjoys working with customers to understand their challenges and building products to solve those challenges. Abhishek currently lives in Seattle and enjoys playing soccer, hiking, and cooking Indian cuisine.


Jeremy Ware

Jeremy is a Security Specialist Solutions Architect focused on Identity and Access Management. Jeremy and his team enable AWS customers to implement sophisticated, scalable, and secure architectures to solve business challenges. Jeremy has spent many years working to improve the security maturity at numerous global enterprises. In his free time, Jeremy loves to enjoy the outdoors with his family.

ASRock Rack SIENAD8-2L2T AMD EPYC 8004 Siena Motherboard Review

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/asrock-rack-sienad8-2l2t-amd-epyc-8004-siena-motherboard-review/

In our ASRock Rack SIENAD8-2L2T review, we see how this ATX-sized AMD EPYC 8004 motherboard delivers utility with a little added flair.

The post ASRock Rack SIENAD8-2L2T AMD EPYC 8004 Siena Motherboard Review appeared first on ServeTheHome.

Create an end-to-end data strategy for Customer 360 on AWS

Post Syndicated from Ismail Makhlouf original https://aws.amazon.com/blogs/big-data/create-an-end-to-end-data-strategy-for-customer-360-on-aws/

Customer 360 (C360) provides a complete and unified view of a customer’s interactions and behavior across all touchpoints and channels. This view is used to identify patterns and trends in customer behavior, which can inform data-driven decisions to improve business outcomes. For example, you can use C360 to segment and create marketing campaigns that are more likely to resonate with specific groups of customers.

In 2022, AWS commissioned a study conducted by the American Productivity and Quality Center (APQC) to quantify the Business Value of Customer 360. The following figure shows some of the metrics derived from the study. Organizations using C360 achieved 43.9% reduction in sales cycle duration, 22.8% increase in customer lifetime value, 25.3% faster time to market, and 19.1% improvement in net promoter score (NPS) rating.

Without C360, businesses face missed opportunities, inaccurate reports, and disjointed customer experiences, leading to customer churn. However, building a C360 solution can be complicated. A Gartner Marketing survey found that only 14% of organizations have successfully implemented a C360 solution, due to a lack of consensus on what a 360-degree view means, challenges with data quality, and the lack of a cross-functional governance structure for customer data.

In this post, we discuss how you can use purpose-built AWS services to create an end-to-end data strategy for C360 that unifies and governs customer data and addresses these challenges. We structure it around five pillars that power C360: data collection, unification, analytics, activation, and data governance. We also provide a solution architecture that you can use for your implementation.

The five pillars of a mature C360

When you embark on creating a C360, you work with multiple use cases, types of customer data, and users and applications that require different tools. Building a C360 on the right datasets, adding new datasets over time while maintaining data quality, and keeping the data secure requires an end-to-end data strategy for your customer data. You also need to provide tools that make it straightforward for your teams to build products that mature your C360.

We recommend building your data strategy around five pillars of C360, as shown in the following figure. This starts with basic data collection, unifying and linking data from various channels related to unique customers, and progresses towards basic to advanced analytics for decision-making, and personalized engagement through various channels. As you mature in each of these pillars, you progress towards responding to real-time customer signals.

The following diagram illustrates the functional architecture that combines the building blocks of a Customer Data Platform on AWS with additional components used to design an end-to-end C360 solution. This is aligned to the five pillars we discuss in this post.

Pillar 1: Data collection

As you start building your customer data platform, you have to collect data from various systems and touchpoints, such as your sales systems, customer support, web and social media, and data marketplaces. Think of the data collection pillar as a combination of ingestion, storage, and processing capabilities.

Data ingestion

You have to build ingestion pipelines based on factors like types of data sources (on-premises data stores, files, SaaS applications, third-party data), and flow of data (unbounded streams or batch data). AWS provides different services for building data ingestion pipelines:

  • AWS Glue is a serverless data integration service that ingests data in batches from on-premises databases and data stores in the cloud. It connects to more than 70 data sources and helps you build extract, transform, and load (ETL) pipelines without having to manage pipeline infrastructure. AWS Glue Data Quality checks for and alerts on poor data, making it straightforward to spot and fix issues before they harm your business.
  • Amazon AppFlow ingests data from software as a service (SaaS) applications like Google Analytics, Salesforce, SAP, and Marketo, giving you the flexibility to ingest data from more than 50 SaaS applications.
  • AWS Data Exchange makes it straightforward to find, subscribe to, and use third-party data for analytics. You can subscribe to data products that help enrich customer profiles, for example demographics data, advertising data, and financial markets data.
  • Amazon Kinesis ingests streaming events in real time from point-of-sales systems, clickstream data from mobile apps and websites, and social media data. You could also consider using Amazon Managed Streaming for Apache Kafka (Amazon MSK) for streaming events in real time.
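
As a small illustration of the streaming path, a web or mobile backend could publish clickstream events to a stream with a few lines of code using the AWS SDK for Python (Boto3); the stream name and event shape here are placeholders:

import json

import boto3

kinesis = boto3.client("kinesis")

def publish_click_event(customer_id: str, page: str) -> None:
    event = {"customer_id": customer_id, "event_type": "page_view", "page": page}
    kinesis.put_record(
        StreamName="c360-clickstream",           # placeholder stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=customer_id,                # keeps one customer's events ordered within a shard
    )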

The following diagram illustrates the different pipelines to ingest data from various source systems using AWS services.

Data storage

Structured, semi-structured, or unstructured batch data is stored in object storage because it is cost-efficient and durable. Amazon Simple Storage Service (Amazon S3) is a managed storage service with archiving features that can store petabytes of data with eleven 9s of durability. Streaming data with low-latency needs is stored in Amazon Kinesis Data Streams for real-time consumption. This allows immediate analytics and actions for various downstream consumers—as seen with Riot Games’ central Riot Event Bus.

Data processing

Raw data is often cluttered with duplicates and irregular formats. You need to process this to make it ready for analysis. If you are consuming batch data and streaming data, consider using a framework that can handle both. A pattern such as the Kappa architecture views everything as a stream, simplifying the processing pipelines. Consider using Amazon Managed Service for Apache Flink to handle the processing work. With Managed Service for Apache Flink, you can clean and transform the streaming data and direct it to the appropriate destination based on latency requirements. You can also implement batch data processing using Amazon EMR on open source frameworks such as Apache Spark at 3.5 times better performance than the self-managed version. The architecture decision of using a batch or streaming processing system will depend on various factors; however, if you want to enable real-time analytics on your customer data, we recommend using a Kappa architecture pattern.

Pillar 2: Unification

To link the diverse data arriving from various touchpoints to a unique customer, you need to build an identity processing solution that identifies anonymous logins, stores useful customer information, links them to external data for better insights, and groups customers in domains of interest. Although the identity processing solution helps build the unified customer profile, we recommend considering this as part of your data processing capabilities. The following diagram illustrates the components of such a solution.

The key components are as follows:

  • Identity resolution – Identity resolution is a deduplication solution, where records are matched to identify unique customers and prospects by linking multiple identifiers such as cookies, device identifiers, IP addresses, email IDs, and internal enterprise IDs to a known person or anonymous profile using privacy-compliant methods. This can be achieved using AWS Entity Resolution, which enables using rules and machine learning (ML) techniques to match records and resolve identities. Alternatively, you can build identity graphs using Amazon Neptune for a single unified view of your customers.
  • Profile aggregation – When you’ve uniquely identified a customer, you can build applications in Managed Service for Apache Flink to consolidate all their metadata, from name to interaction history. Then, you transform this data into a concise format. Instead of showing every transaction detail, you can offer an aggregated spend value and a link to their Customer Relationship Management (CRM) record. For customer service interactions, provide an average CSAT score and a link to the call center system for a deeper dive into their communication history.
  • Profile enrichment – After you resolve a customer to a single identity, enhance their profile using various data sources. Enrichment typically involves adding demographic, behavioral, and geolocation data. You can use third-party data products from AWS Marketplace delivered through AWS Data Exchange to gain insights on income, consumption patterns, credit risk scores, and many more dimensions to further refine the customer experience.
  • Customer segmentation – After uniquely identifying and enriching a customer’s profile, you can segment them based on demographics like age, spend, income, and location using applications in Managed Service for Apache Flink. As you advance, you can incorporate AI services for more precise targeting techniques.

After you have done the identity processing and segmentation, you need a storage capability to store the unique customer profile and provide search and query capabilities on top of it for downstream consumers to use the enriched customer data.

The following diagram illustrates the unification pillar for a unified customer profile and single view of the customer for downstream applications.

Unified customer profile

Graph databases excel in modeling customer interactions and relationships, offering a comprehensive view of the customer journey. If you are dealing with billions of profiles and interactions, you can consider using Neptune, a managed graph database service on AWS. Organizations such as Zeta and Activision have successfully used Neptune to store billions of unique identifiers per month and serve millions of queries per second at millisecond response time.

Single customer view

Although graph databases provide in-depth insights, they can be complex for regular applications. It is prudent to consolidate this data into a single customer view, serving as a primary reference for downstream applications, ranging from ecommerce platforms to CRM systems. This consolidated view acts as a liaison between the data platform and customer-centric applications. For such purposes, we recommend using Amazon DynamoDB for its adaptability, scalability, and performance, resulting in an up-to-date and efficient customer database. This database also accepts a high volume of writes from the activation systems, which learn new information about customers and feed it back into the profile.
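
As a minimal sketch of that single customer view, a DynamoDB table keyed on the resolved customer ID could be read by downstream applications and updated by activation systems as follows (Python/Boto3); the table and attribute names are assumptions:

import boto3

dynamodb = boto3.resource("dynamodb")
profiles = dynamodb.Table("customer-360-profiles")   # placeholder table keyed on customer_id

def get_profile(customer_id: str) -> dict:
    # Downstream applications fetch the consolidated view with a single key lookup.
    return profiles.get_item(Key={"customer_id": customer_id}).get("Item", {})

def record_activation_feedback(customer_id: str, channel: str, engaged: bool) -> None:
    # Activation systems write back what they learn about the customer.
    profiles.update_item(
        Key={"customer_id": customer_id},
        UpdateExpression="SET last_channel = :c, last_engaged = :e",
        ExpressionAttributeValues={":c": channel, ":e": engaged},
    )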

Pillar 3: Analytics

The analytics pillar defines capabilities that help you generate insights on top of your customer data. Your analytics strategy applies to the wider organizational needs, not just C360. You can use the same capabilities to serve financial reporting, measure operational performance, or even monetize data assets. Strategize based on how your teams explore data, run analyses, wrangle data for downstream requirements, and visualize data at different levels. Plan on how you can enable your teams to use ML to move from descriptive to prescriptive analytics.

The AWS modern data architecture shows a way to build a purpose-built, secure, and scalable data platform in the cloud. Learn from this to build querying capabilities across your data lake and the data warehouse.

The following diagram breaks down the analytics capability into data exploration, visualization, data warehousing, and data collaboration. Let’s find out what role each of these components plays in the context of C360.

Data exploration

Data exploration helps unearth inconsistencies, outliers, or errors. By spotting these early on, your teams can have cleaner data integration for C360, which in turn leads to more accurate analytics and predictions. Consider the personas exploring the data, their technical skills, and the time to insight. For instance, data analysts who know how to write SQL can directly query the data residing in Amazon S3 using Amazon Athena. Users interested in visual exploration can do so using AWS Glue DataBrew. Data scientists or engineers can use Amazon EMR Studio or Amazon SageMaker Studio to explore data from the notebook, and for a low-code experience, you can use Amazon SageMaker Data Wrangler. Because these services directly query S3 buckets, you can explore the data as it lands in the data lake, reducing time to insight.
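
For example, an analyst could run an ad hoc SQL query against data in the data lake along these lines (Python/Boto3); the database name, table, and results location are placeholders:

import time

import boto3

athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString="SELECT channel, COUNT(*) AS interactions FROM customer_events GROUP BY channel",
    QueryExecutionContext={"Database": "c360_lake"},                         # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # placeholder bucket
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]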

Visualization

Turning complex datasets into intuitive visuals unravels hidden patterns in the data, and is crucial for C360 use cases. With this capability, you can design reports for different levels catering to varying needs: executive reports offering strategic overviews, management reports highlighting operational metrics, and detailed reports diving into the specifics. Such visual clarity helps your organization make informed decisions across all tiers, centralizing the customer’s perspective.

The following screenshot shows a sample C360 dashboard built on Amazon QuickSight. QuickSight offers scalable, serverless visualization capabilities. You can benefit from its ML integrations for automated insights like forecasting and anomaly detection or natural language querying with Amazon Q in QuickSight, direct data connectivity from various sources, and pay-per-session pricing. With QuickSight, you can embed dashboards in external websites and applications, and the SPICE engine enables rapid, interactive data visualization at scale.

Data warehouse

Data warehouses are efficient in consolidating structured data from multifarious sources and serving analytics queries from a large number of concurrent users. Data warehouses can provide a unified, consistent view of a vast amount of customer data for C360 use cases. Amazon Redshift addresses this need by adeptly handling large volumes of data and diverse workloads. It provides strong consistency across datasets, allowing organizations to derive reliable, comprehensive insights about their customers, which is essential for informed decision-making. Amazon Redshift offers real-time insights and predictive analytics capabilities for analyzing data from terabytes to petabytes. With Amazon Redshift ML, you can embed ML on top of the data stored in the data warehouse with minimum development overhead. Amazon Redshift Serverless simplifies application building and makes it straightforward for companies to embed rich data analytics capabilities.

Data collaboration

You can securely collaborate and analyze collective datasets from your partners without sharing or copying one another’s underlying data using AWS Clean Rooms. You can bring together disparate data from across engagement channels and partner datasets to form a 360-degree view of your customers. AWS Clean Rooms can enhance C360 by enabling use cases like cross-channel marketing optimization, advanced customer segmentation, and privacy-compliant personalization. By safely merging datasets, it offers richer insights and robust data privacy, meeting business needs and regulatory standards.

Pillar 4: Activation

The value of data diminishes the older it gets, leading to higher opportunity costs over time. In a survey conducted by InterSystems, 75% of the organizations surveyed believed that untimely data inhibited business opportunities. In another survey, 58% of organizations (out of 560 respondents from the HBR Advisory Council and readers) stated they saw an increase in customer retention and loyalty using real-time customer analytics.

You achieve maturity in C360 when you build the ability to act in real time on all the insights acquired from the previous pillars. For example, at this maturity level, you can act on customer sentiment based on the context you automatically derived with an enriched customer profile and integrated channels. For this, you need to implement prescriptive decision-making on how to address the customer’s sentiment. To do this at scale, you have to use AI/ML services for decision-making. The following diagram illustrates the architecture to activate insights using ML for prescriptive analytics and AI services for targeting and segmentation.

Use ML for the decision-making engine

With ML, you can improve the overall customer experience—you can create predictive customer behavior models, design hyper-personalized offers, and target the right customer with the right incentive. You can build them using Amazon SageMaker, which features a suite of managed services mapped to the data science lifecycle, including data wrangling, model training, model hosting, model inference, model drift detection, and feature storage. SageMaker enables you to build and operationalize your ML models, infusing them back into your applications to produce the right insight to the right person at the right time.

Amazon Personalize supports contextual recommendations, through which you can improve the relevance of recommendations by generating them within a context—for instance, device type, location, or time of day. Your team can get started without any prior ML experience using APIs to build sophisticated personalization capabilities in a few clicks. For more information, see Customize your recommendations by promoting specific items using business rules with Amazon Personalize.
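
A contextual recommendation request might look like the following sketch (Python/Boto3); the campaign ARN, user ID, and context keys are placeholders that would need to match your own Amazon Personalize campaign and dataset schema:

import boto3

personalize_runtime = boto3.client("personalize-runtime")

response = personalize_runtime.get_recommendations(
    campaignArn="arn:aws:personalize:us-east-1:111122223333:campaign/c360-recommendations",  # placeholder
    userId="customer-42",                                    # the resolved customer identity
    numResults=5,
    context={"DEVICE": "mobile", "TIME_OF_DAY": "evening"},  # hypothetical context fields
)
recommended_items = [item["itemId"] for item in response["itemList"]]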

Activate channels across marketing, advertising, direct-to-consumer, and loyalty

Now that you know who your customers are and who to reach out to, you can build solutions to run targeting campaigns at scale. With Amazon Pinpoint, you can personalize and segment communications to engage customers across multiple channels. For example, you can use Amazon Pinpoint to build engaging customer experiences through various communication channels like email, SMS, push notifications, and in-app notifications.

Pillar 5: Data governance

Establishing the right governance that balances control and access gives users trust and confidence in data. Imagine offering promotions on products that a customer doesn’t need, or bombarding the wrong customers with notifications. Poor data quality can lead to such situations, and ultimately results in customer churn. You have to build processes that validate data quality and take corrective actions. AWS Glue Data Quality can help you build solutions that validate the quality of data at rest and in transit, based on predefined rules.

To set up a cross-functional governance structure for customer data, you need a capability for governing and sharing data across your organization. With Amazon DataZone, admins and data stewards can manage and govern access to data, and consumers such as data engineers, data scientists, product managers, analysts, and other business users can discover, use, and collaborate with that data to drive insights. It streamlines data access, letting you find and use customer data, promotes team collaboration with shared data assets, and provides personalized analytics either via a web app or an API on a portal. AWS Lake Formation makes sure data is accessed securely, guaranteeing the right people see the right data for the right reasons, which is crucial for effective cross-functional governance in any organization. Business metadata is stored and managed by Amazon DataZone and is underpinned by technical metadata and schema information, which is registered in the AWS Glue Data Catalog. This technical metadata is used both by governance services such as Lake Formation and Amazon DataZone, and by analytics services such as Amazon Redshift, Athena, and AWS Glue.

Bringing it all together

Using the following diagram as a reference, you can create projects and teams for building and operating different capabilities. For example, you can have a data integration team focus on the data collection pillar—you can then align functional roles, like data architects and data engineers. You can build your analytics and data science practices to focus on the analytics and activation pillars, respectively. Then you can create a specialized team for customer identity processing and for building the unified view of the customer. You can establish a data governance team with data stewards from different functions, security admins, and data governance policymakers to design and automate policies.

Conclusion

Building a robust C360 capability is fundamental for your organization to gain insights into your customer base. AWS Databases, Analytics, and AI/ML services can help streamline this process, providing scalability and efficiency. Following the five pillars to guide your thinking, you can build an end-to-end data strategy that defines the C360 view across the organization, makes sure data is accurate, and establishes cross-functional governance for customer data. You can categorize and prioritize the products and features you have to build within each pillar, select the right tool for the job, and build the skills you need in your teams.

Visit AWS for Data Customer Stories to learn how AWS is transforming customer journeys, from the world’s largest enterprises to growing startups.


About the Authors

Ismail Makhlouf is a Senior Specialist Solutions Architect for Data Analytics at AWS. Ismail focuses on architecting solutions for organizations across their end-to-end data analytics estate, including batch and real-time streaming, big data, data warehousing, and data lake workloads. He primarily works with organizations in retail, ecommerce, FinTech, HealthTech, and travel to achieve their business objectives with well-architected data platforms.

Sandipan Bhaumik (Sandi) is a Senior Analytics Specialist Solutions Architect at AWS. He helps customers modernize their data platforms in the cloud to perform analytics securely at scale, reduce operational overhead, and optimize usage for cost-effectiveness and sustainability.

Security updates for Tuesday

Post Syndicated from jake original https://lwn.net/Articles/966678/

Security updates have been issued by CentOS (kernel), Debian (firefox-esr), Fedora (webkitgtk), Mageia (curaengine & blender and gnutls), Red Hat (firefox, grafana, grafana-pcp, libreoffice, nodejs:18, and thunderbird), SUSE (glade), and Ubuntu (crmsh, debian-goodies, linux-aws, linux-aws-6.5, linux-aws-hwe, linux-azure, linux-azure-4.15, linux-oracle, linux-azure, linux-azure-5.4, linux-oracle, linux-oracle-5.15, pam, and thunderbird).

Applying Spot-to-Spot consolidation best practices with Karpenter

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/applying-spot-to-spot-consolidation-best-practices-with-karpenter/

This post is written by Robert Northard – AWS Container Specialist Solutions Architect, and Carlos Manzanedo Rueda – AWS WW SA Leader for Efficient Compute

Karpenter is an open source node lifecycle management project built for Kubernetes. In this post, you will learn how to use the new Spot-to-Spot consolidation functionality released in Karpenter v0.34.0, which helps further optimize your cluster. Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances are spare Amazon EC2 capacity available for up to 90% off compared to On-Demand prices. One difference between On-Demand and Spot is that Spot Instances can be interrupted by Amazon EC2 when the capacity is needed back. Karpenter’s built-in support for Spot Instances allows users to seamlessly implement Spot best practices and helps users optimize the cost of stateless, fault tolerant workloads. For example, when Karpenter observes a Spot interruption, it automatically starts a new node in response.

Karpenter provisions nodes in response to unschedulable pods based on aggregated CPU, memory, volume requests, and other scheduling constraints. Over time, Karpenter has added functionality to simplify instance lifecycle configuration, providing a termination controller, instance expiration, and drift detection. Karpenter also helps optimize Kubernetes clusters by selecting the optimal instances while still respecting Kubernetes pod-to-node placement nuances, such as nodeSelector, affinity and anti-affinity, taints and tolerations, and topology spread constraints.

The Kubernetes scheduler assigns pods to nodes based on their scheduling constraints. Over time, as workloads are scaled out and scaled in or as new instances join and leave, the cluster placement and instance load might end up not being optimal. In many cases, it results in unnecessary extra costs. Karpenter has a consolidation feature that improves cluster placement by identifying and taking action in situations such as:

  1. when a node is empty
  2. when a node can be removed as the pods that are running on it can be rescheduled into other existing nodes
  3. when the number of pods in a node has gone down and the node can now be replaced with a lower-priced and rightsized variant (which is shown in the following figure)

Karpenter consolidation, replacing one 2xlarge Amazon EC2 Instance with an xlarge Amazon EC2 Instance.

Karpenter versions prior to v0.34.0 only supported consolidation for Amazon EC2 On-Demand Instances. On-Demand consolidation allowed consolidating from On-Demand into Spot Instances and to lower-priced On-Demand Instances. However, once a pod was placed on a Spot Instance, Spot nodes were only removed when the nodes were empty. In v0.34.0, you can enable the feature gate to use Spot-to-Spot consolidation.

Solution overview

When launching Spot Instances, Karpenter uses the price-capacity-optimized allocation strategy when calling the Amazon EC2 Fleet API in instant mode (shown in the following figure) and passes in a selection of compute instance types based on the Karpenter NodePool configuration. The Amazon EC2 Fleet API in instant mode is a synchronous API call that immediately returns a list of instances that launched and any instances that could not be launched. For any instances that could not be launched, Karpenter might request alternative capacity or remove any soft Kubernetes scheduling constraints for the workload.

Karpenter instance orchestration

Spot-to-Spot consolidation needed an approach that was different from On-Demand consolidation. For On-Demand consolidation, rightsizing and lowest price are the main metrics used. For Spot-to-Spot consolidation to take place, Karpenter requires a diversified instance configuration (see the example NodePool defined in the walkthrough) with at least 15 instance types. Without this constraint, there would be a risk of Karpenter selecting an instance that has lower availability and, therefore, a higher frequency of interruption.

Prerequisites

The following prerequisites are required to complete the walkthrough:

  • Install an Amazon Elastic Kubernetes Service (Amazon EKS) cluster (version 1.29 or higher) with Karpenter (v0.34.0 or higher). The Karpenter Getting Started Guide provides steps for setting up an Amazon EKS cluster and adding Karpenter.
  • Enable replacement with Spot consolidation through the SpotToSpotConsolidation feature gate. This can be enabled during a helm install of the Karpenter chart by adding the --set settings.featureGates.spotToSpotConsolidation=true argument (see the example command after this list).
  • Install kubectl, the Kubernetes command line tool for communicating with the Kubernetes control plane API, and configure a kubectl context with Cluster Operator and Cluster Developer permissions.
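For reference, a minimal sketch of enabling the feature gate with Helm — the chart location is the commonly documented OCI registry, while the namespace and the reuse of existing values are assumptions, so adapt them to your installation:

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --reuse-values \
  --set settings.featureGates.spotToSpotConsolidation=true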

Walkthrough

The following walkthrough guides you through the steps for simulating Spot-to-Spot consolidation.

1. Create a Karpenter NodePool and EC2NodeClass

Create a Karpenter NodePool and EC2NodeClass. Replace the following placeholders with your own values. If you used the Karpenter Getting Started Guide to create your installation, then the discovery tag value is your cluster name.

  • Replace <karpenter-discovery-tag-value> with your subnet tag for Karpenter subnet and security group auto-discovery.
  • Replace <role-name> with the name of the AWS Identity and Access Management (IAM) role for node identity.
cat <<EOF > nodepool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        intent: apps
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c","m","r"]
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano","micro","small","medium"]
        - key: karpenter.k8s.aws/instance-hypervisor
          operator: In
          values: ["nitro"]
  limits:
    cpu: 100
    memory: 100Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
  subnetSelectorTerms:          
    - tags:
        karpenter.sh/discovery: "<karpenter-discovery-tag-value>"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "<karpenter-discovery-tag-value>"
  role: "<role-name>"
  tags:
    Name: karpenter.sh/nodepool/default
    IntentLabel: "apps"
EOF

kubectl apply -f nodepool.yaml

The NodePool definition demonstrates a flexible configuration with instances from the C, M, or R EC2 instance families. The configuration excludes the smallest instance sizes but is still diversified as much as possible. For example, this might be needed in scenarios where you deploy observability DaemonSets. If your workload has specific requirements, then see the supported well-known labels in the Karpenter documentation.

2. Deploy a sample workload

Deploy a sample workload by running the following command. This command creates a Deployment with five pod replicas using the pause container image:

cat <<EOF > inflate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 5
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      nodeSelector:
        intent: apps
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1
              memory: 1.5Gi
EOF
kubectl apply -f inflate.yaml

Next, check the Kubernetes nodes by running a kubectl get nodes CLI command. The capacity pool (instance type and Availability Zone) selected depends on any Kubernetes scheduling constraints and the available spare capacity, so it might differ from this example in the walkthrough. You can see Karpenter launched a new node of instance type c6g.2xlarge, an AWS Graviton2-based instance, in the eu-west-1c Availability Zone:

$ kubectl get nodes -L karpenter.sh/nodepool -L node.kubernetes.io/instance-type -L topology.kubernetes.io/zone -L karpenter.sh/capacity-type

NAME                                     STATUS   ROLES    AGE   VERSION               NODEPOOL   INSTANCE-TYPE   ZONE         CAPACITY-TYPE
ip-10-0-12-17.eu-west-1.compute.internal Ready    <none>   80s   v1.29.0-eks-a5ec690   default    c6g.2xlarge     eu-west-1c   spot

3. Scale in a sample workload to observe consolidation

To invoke a Karpenter consolidation event, scale the inflate deployment down to one replica. Run the following command:

kubectl scale --replicas=1 deployment/inflate 

Tail the Karpenter logs by running the following command. If you installed Karpenter in a different Kubernetes namespace, then replace the name for the -n argument in the command:

kubectl -n karpenter logs -l app.kubernetes.io/name=karpenter --all-containers=true -f --tail=20

After a few seconds, you should see the following disruption via consolidation message in the Karpenter logs. The message indicates the c6g.2xlarge Spot node has been targeted for replacement and Karpenter has passed the following 15 instance types—m6gd.xlarge, m5dn.large, c7a.xlarge, r6g.large, r6a.xlarge and 10 other(s)—to the Amazon EC2 Fleet API:

{"level":"INFO","time":"2024-02-19T12:09:50.299Z","logger":"controller.disruption","message":"disrupting via consolidation replace, terminating 1 candidates ip-10-0-12-181.eu-west-1.compute.internal/c6g.2xlarge/spot and replacing with spot node from types m6gd.xlarge, m5dn.large, c7a.xlarge, r6g.large, r6a.xlarge and 10 other(s)","commit":"17d6c05","command-id":"60f27cb5-98fa-40fb-8231-05b31fd41892"}

Check the Kubernetes nodes by running the following kubectl get nodes CLI command. You can see that Karpenter launched a new node of instance type c6g.large:

$ kubectl get nodes -L karpenter.sh/nodepool -L node.kubernetes.io/instance-type -L topology.kubernetes.io/zone -L karpenter.sh/capacity-type

NAME                                        STATUS   ROLES    AGE    VERSION               NODEPOOL   INSTANCE-TYPE   ZONE         CAPACITY-TYPE
ip-10-0-12-156.eu-west-1.compute.internal   Ready    <none>   2m1s   v1.29.0-eks-a5ec690   default    c6g.large       eu-west-1c   spot

Use kubectl get nodeclaims to list all objects of type NodeClaim and then describe the NodeClaim Kubernetes resource using kubectl get nodeclaim/<claim-name> -o yaml. In the NodeClaim .spec.requirements, you can also see the 15 instance types passed to the Amazon EC2 Fleet API:
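For convenience, the two commands described above, exactly as they would be typed (the claim name is a placeholder returned by the first command):

kubectl get nodeclaims
kubectl get nodeclaim/<claim-name> -o yaml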

apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
...
spec:
  nodeClassRef:
    name: default
  requirements:
  ...
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - c5.large
    - c5ad.large
    - c6g.large
    - c6gn.large
    - c6i.large
    - c6id.large
    - c7a.large
    - c7g.large
    - c7gd.large
    - m6a.large
    - m6g.large
    - m6gd.large
    - m7g.large
    - m7i-flex.large
    - r6g.large
...

What would happen if a Spot node could not be consolidated?

If a Spot node cannot be consolidated because there are not 15 instance types in the compute selection, then the following message will appear in the events for the NodeClaim object. You might get this event if you overly constrained your instance type selection:

Normal  Unconsolidatable   31s   karpenter  SpotToSpotConsolidation requires 15 cheaper instance type options than the current candidate to consolidate, got 1
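To inspect these events yourself, a minimal sketch (the claim name is a placeholder from the kubectl get nodeclaims output):

kubectl describe nodeclaim <claim-name>   # the Events section appears at the end of the output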

Spot best practices with Karpenter

The following are some best practices to consider when using Spot Instances with Karpenter.

  • Avoid overly constraining instance type selection: Karpenter selects Spot Instances using the price-capacity-optimized allocation strategy, which balances the price and availability of AWS spare capacity. Although a minimum of 15 instance types is needed, you should avoid constraining instance types as much as possible. By not constraining instance types, there is a higher chance of acquiring Spot capacity at large scale with a lower frequency of Spot Instance interruptions and at a lower cost.
  • Gracefully handle Spot interruptions and consolidation actions: Karpenter natively handles Spot interruption notifications by consuming events from an Amazon Simple Queue Service (Amazon SQS) queue, which is populated with Spot interruption notifications through Amazon EventBridge. As soon as Karpenter receives a Spot interruption notification, it gracefully drains the interrupted node of any running pods while also provisioning a new node onto which those pods can be scheduled. With Spot Instances, this process needs to complete within 2 minutes. For a pod with a termination grace period longer than 2 minutes, the old node will be interrupted before those pods are rescheduled. To test a replacement node, AWS Fault Injection Service (FIS) can be used to simulate Spot interruptions.
  • Carefully configure resource requests and limits for workloads: Rightsizing and optimizing your cluster is a shared responsibility. Karpenter effectively optimizes and scales infrastructure, but the end result depends on how well you have rightsized your pod requests and any other Kubernetes scheduling constraints. Karpenter does not consider limits or resource utilization. For most workloads with non-compressible resources, such as memory, it is generally recommended to set requests==limits, because if a workload tries to burst beyond the available memory of the host, an out-of-memory (OOM) error occurs. Karpenter consolidation can increase the probability of this, as it proactively tries to reduce total allocatable resources for a Kubernetes cluster (a minimal example of setting memory requests equal to limits is shown after this list). For help with rightsizing your Kubernetes pods, consider exploring Kubecost, Vertical Pod Autoscaler configured in recommendation mode, or an open source tool such as Goldilocks.
  • Configure metrics for Karpenter: Karpenter emits metrics in the Prometheus format, so consider using Amazon Managed Service for Prometheus to track interruptions caused by Karpenter Drift, consolidation, Spot interruptions, or other Amazon EC2 maintenance events. These metrics can be used to confirm that interruptions are not having a significant impact on your service’s availability and monitor NodePool usage and pod lifecycles. The Karpenter Getting Started Guide contains an example Grafana dashboard configuration.
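As a minimal sketch of the requests==limits recommendation for memory (the container name is illustrative, and the image reuses the one from the walkthrough), the resources block of a pod spec might look like this:

containers:
  - name: inflate                               # illustrative container name
    image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
    resources:
      requests:
        cpu: 1
        memory: 1.5Gi
      limits:
        memory: 1.5Gi                           # memory limit equals the request (non-compressible resource); CPU limit intentionally omitted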

You can learn more about other application best practices in the Reliability section of the Amazon EKS Best Practices Guide.

Cleanup

To avoid incurring future charges, delete any resources you created as part of this walkthrough. If you followed the Karpenter Getting Started Guide to set up a cluster and add Karpenter, follow the clean-up instructions in the Karpenter documentation to delete the cluster. Alternatively, if you already had a cluster with Karpenter, delete the resources created as part of this walkthrough:

kubectl delete -f inflate.yaml
kubectl delete -f nodepool.yaml

Conclusion

In this post, you learned how Karpenter can actively replace a Spot node with another, more cost-efficient Spot node. Karpenter can consolidate Spot nodes that strike the right balance between lower price and low-frequency interruptions when there are at least 15 selectable instance types, allowing it to balance price and availability.

To get started, check out the Karpenter documentation as well as Karpenter Blueprints, which is a repository including common workload scenarios following the best practices.

You can share your feedback on this feature by raising a GitHub issue.

NIS2, DORA, CER and GDPR: A Comparative Overview of Crucial EU Compliance Directives and Regulations

Post Syndicated from Editor original https://nebosystems.eu/comparative-guide-dora-gdpr-nis2-cer/

In the evolving regulatory landscape, organizations operating within the EU must navigate through a complex web of regulations and directives, including NIS2 (Network & Information System) Directive, CER (Critical Entities Resilience) Directive, DORA (Digital Operational Resilience Act) and GDPR (General Data Protection Regulation). Each of these frameworks has a distinct focus, from enhancing cybersecurity and operational resilience to protecting personal data and ensuring the resilience of critical entities.

This guide outlines the essential aspects of DORA (EU) 2022/2554, GDPR (EU) 2016/679, NIS2 (EU) 2022/2555 directive and the CER/RCE (EU) 2022/2557 directive, including their scope, objectives, key requirements, sanctions for non-compliance, implementation deadlines, technical and organizational measures, key differences and compliance intersections.

Scope and Applicability

  • NIS2 (Network & Information System) Directive: Applies to essential and important entities across various sectors, expanding the scope of its predecessor, the NIS Directive.
  • Essential entities include sectors such as energy (including electricity, oil, and gas), transport (air, rail, water, and road), banking, financial market infrastructures, health care, drinking water, wastewater, and digital infrastructure. Essential entities are those whose disruption would cause significant impacts on public safety, security, or economic or societal activities.
  • Important entities cover postal and courier services, waste management, the manufacture, production, and distribution of chemicals, food production, distribution, and sale, the manufacturing of medical devices, computers and electronics, machinery equipment, and motor vehicles, digital providers such as online marketplaces, online search engines, and social networking platforms, and certain entities within the public administration sector.
  • DORA (Digital Operational Resilience Act): Specifically focuses on the resilience of the financial sector to ICT risks, encompassing a wide range of entities that play pivotal roles in the financial ecosystem. This includes credit institutions, investment firms, insurance and reinsurance companies, payment institutions, electronic money institutions, crypto-asset service providers, central securities depositories, central counterparties, trading venues, managers of alternative investment funds and management companies of undertakings for collective investment in transferable securities (UCITS). Additionally, it covers ICT third-party service providers to these financial entities, emphasizing the importance of digital operational resilience not just within financial entities themselves but also within their extended digital supply chains.
  • GDPR (General Data Protection Regulation): Has a global reach, affecting any organization that processes personal data of EU citizens, focusing on data protection and privacy regardless of the sector.
  • CER (Critical Entities Resilience) Directive: Aims to enhance the resilience of critical entities operating in vital sectors within the EU, such as energy, transport, banking, financial market infrastructure, health, drinking water, waste water, public administration, space, digital infrastructure, and the production, processing, and distribution of food.

Objectives

  • NIS2 Directive: Seeks to significantly raise cybersecurity standards and improve incident response capabilities across the EU.
  • DORA: Ensures that the financial sector can withstand, respond to, and recover from ICT-related disruptions and threats.
  • GDPR: Protects EU citizens’ personal data, ensuring privacy and giving individuals control over their personal information.
  • CER Directive: Focuses on enhancing the overall resilience of entities that are critical to the maintenance of vital societal or economic activities against a range of non-cyber and cyber threats.

Key Requirements

  • NIS2 Directive: Mandates robust risk management measures, timely incident reporting, supply chain security and resilience testing among affected entities.
  • DORA: Requires financial entities to establish comprehensive ICT risk management frameworks, report significant ICT-related incidents, conduct resilience testing and manage risks related to third-party ICT service providers.
  • GDPR: Enforces principles such as lawful processing, data minimization and transparency; upholds data subjects’ rights; mandates data breach notifications; and requires data protection measures to be embedded in business processes.
  • CER Directive: Calls for national risk assessments, enhanced security measures, incident notification, and crisis management for critical entities, ensuring they can maintain essential services under adverse conditions.

Sanctions and Penalties

  • NIS2 Directive: The directive suggests Member States ensure that penalties for non-compliance are effective, proportionate, and dissuasive, but does not specify amounts, leaving it to individual Member States to set.
  • DORA: Specific sanctions and penalties are not detailed, implying that penalties would be defined at the Member State level or in subsequent regulatory guidance.
  • GDPR: Known for its strict penalties, organizations can face fines up to €20 million or 4% of their total global turnover, whichever is higher.
  • CER Directive: Similar to NIS2, the CER Directive leaves the specifics of sanctions and penalties to Member States, emphasizing the need for them to be effective, proportionate, and dissuasive.

Implementation Deadline Date

  • NIS2 Directive: Member States are required to transpose and apply the measures of the NIS2 Directive by 18 October 2024.
  • DORA: The regulation will become applicable from 17 January 2025, marking the deadline for entities within the financial sector to comply with its requirements.
  • GDPR: This regulation has been in effect since 25 May 2018, requiring immediate compliance from the effective date.
  • CER Directive: Similar to NIS2, the CER Directive must be transposed and applied by Member States by 18 October 2024.

Key Differences

While NIS2, DORA, GDPR and CER directives and regulations share common goals related to security and privacy, they differ significantly in their primary focus and applicability:

  • NIS2 Directive primarily enhances cybersecurity across various critical sectors, emphasizing sector-specific risk management and incident reporting.
  • DORA focuses on the financial sector’s digital operational resilience, detailing ICT risk management and third-party risk, specific to financial services.
  • GDPR is dedicated to personal data protection, granting extensive rights to individuals regarding their data, applicable across all sectors.
  • CER Directive aims to ensure the resilience of entities vital for societal and economic well-being, focusing on both cyber and physical resilience measures.

Overlapping Areas

Despite their differences, these frameworks overlap in several key areas, allowing for synergistic compliance efforts:

  • Risk Management: NIS2, DORA and the CER Directive all emphasize robust risk management, albeit with different focal points (cybersecurity, ICT and critical entity resilience, respectively).
  • Incident Reporting: NIS2 and DORA require incident reporting within their respective domains, which can streamline processes for entities covered by both.
  • Data Protection Measures: GDPR’s data protection principles can complement the cybersecurity measures under NIS2 and CER, enhancing overall data security.

Incident Response and Recovery

  • NIS2 Directive: Requires entities to have incident response capabilities in place, ensuring timely detection, analysis, and response to incidents. It emphasizes the need for recovery plans to restore services after an incident.
  • DORA: Mandates financial entities to establish and implement an incident management process capable of responding swiftly to ICT-related incidents, including recovery objectives, restoration of systems, and lessons learned activities.
  • GDPR: While not explicitly detailing incident response processes, GDPR mandates notification of personal data breaches to supervisory authorities and, in certain cases, to the affected individuals, highlighting the need for an effective response mechanism.
  • CER Directive: Stresses the importance of having incident response plans, ensuring critical entities can quickly respond to and recover from disruptive incidents, maintaining essential services.

Technical and Organizational Measures

  • NIS2 Directive: Entities should incorporate state-of-the-art cybersecurity solutions like advanced threat detection systems, comprehensive data encryption, secure network configurations and regular security assessments to safeguard sensitive information. Additional technical measures might include continuous monitoring and anomaly detection systems to identify suspicious activities in real time, and the implementation of Security Information and Event Management (SIEM) systems and next-generation firewalls (NGFWs). Organizational strategies involve establishing a robust cybersecurity governance framework, conducting frequent cybersecurity awareness training, and formulating clear policies for effective incident response and thorough business continuity planning.
  • DORA: For compliance with DORA, financial entities are advised to utilize secure communication protocols and robust encryption for protecting data during transmission and storage, supplemented by multi-factor authentication systems to enhance access security. Additional technical measures could involve the deployment of advanced cybersecurity tools like Security Information and Event Management (SIEM) systems for integrated threat analysis and response, and next-generation firewalls (NGFWs). On the organizational front, setting up a dedicated ICT risk management team, clearly defining cybersecurity roles, and embedding cybersecurity risk considerations into the overarching risk management framework are essential.
  • GDPR: In alignment with GDPR, technical safeguards such as strong data encryption, pseudonymization of personal data where feasible, and stringent access control mechanisms are pivotal. Expanding on these, additional technical measures may include the use of Data Loss Prevention (DLP) tools to prevent unauthorized data disclosure or loss and employing regular penetration testing to identify and rectify vulnerabilities. Organizational measures encompass the implementation of comprehensive data protection policies, conducting DPIAs for high-risk data processing activities, and appointing a Data Protection Officer in specific scenarios to oversee data protection strategies and compliance.
  • CER Directive: Adhering to the CER Directive involves applying network segmentation to isolate and protect critical systems, utilizing intrusion detection and prevention systems, and ensuring resilient data backup and recovery strategies. Enhancing these measures, technical strategies could also include the deployment of next-generation firewalls (NGFWs) and the use of automated patch management systems to ensure timely application of security updates. Organizational approaches include developing a detailed incident management plan, establishing a dedicated crisis management team, and conducting regular resilience testing and drills to validate and improve recovery processes.

Compliance Intersections and Synergies

While each framework has its unique focus, there are notable intersections, particularly in the areas of risk management, incident reporting, and the overarching emphasis on security and resilience. For instance, the risk management strategies advocated by NIS2 and the CER Directive can complement the ICT risk management framework of DORA. GDPR’s requirement for data protection by design and default can also support the cybersecurity measures outlined in NIS2 and CER, promoting a secure and privacy-focused operational environment. Furthermore, the incident reporting mechanisms mandated by both NIS2 and DORA underscore a shared commitment to transparency and accountability in the face of security incidents, which can drive improvements in organizational responses to breaches, including those involving personal data under GDPR.

This alignment not only streamlines compliance processes but also fortifies the organization’s overall security framework, enhancing its ability to protect against and respond to cyber threats and operational disruptions. By recognizing and acting upon these synergies, organizations can more effectively allocate resources, avoid duplicative efforts, and foster a culture of continuous improvement in cybersecurity and data protection practices.

Conclusion

Understanding the nuances and requirements of NIS2, DORA, GDPR, and the CER Directive is crucial for organizations operating within the EU, especially those that fall under the scope of multiple frameworks. By recognizing the overlaps and leveraging synergies between these regulations and directives, organizations can streamline their compliance efforts, enhance their operational resilience and data protection measures, and contribute to a safer, more secure digital and physical environment within the EU. This integrated approach not only ensures regulatory compliance but also builds a strong foundation of trust with customers, stakeholders, and regulatory bodies.

For streamlined compliance with EU directives like NIS2, DORA and GDPR, Nebosystems offers expert services tailored to your needs. Learn more about our cybersecurity solutions or get in touch directly.


References:

NIS2 (Network & Information System) Directive (EU) 2022/2555. EUR-Lex.

General Data Protection Regulation (EU) 2016/679. EUR-Lex.

Digital Operational Resilience Act (EU) 2022/2554. EUR-Lex.

Critical Entities Resilience Directive (EU) 2022/2557. EUR-Lex.

From .com to .beauty: The evolving threat landscape of unwanted email

Post Syndicated from João Tomé original https://blog.cloudflare.com/top-level-domains-email-phishing-threats


You’re browsing your inbox and spot an email that looks like it’s from a brand you trust. Yet, something feels off. This might be a phishing attempt, a common tactic where cybercriminals impersonate reputable entities — we’ve written about the top 50 most impersonated brands used in phishing attacks. One factor that can be used to help evaluate the email’s legitimacy is its Top-Level Domain (TLD) — the part of the email address that comes after the dot.

In this analysis, we focus on the TLDs responsible for a significant share of malicious or spam emails since January 2023. For the purposes of this blog post, we are considering malicious email messages to be equivalent to phishing attempts. With an average of 9% of 2023’s emails processed by Cloudflare’s Cloud Email Security service marked as spam and 3% as malicious, rising to 4% by year-end, we aim to identify trends and signal which TLDs have become more dubious over time. Keep in mind that our measurements represent where we observe data across the email delivery flow. In some cases, we may be observing after initial filtering has taken place, at a point where missed classifications are likely to cause more damage. The information derived from this analysis could serve as a guide for Internet users, corporations, and geeks like us who, like Internet detectives, search for clues to identify potential threats. To make this data readily accessible, Cloudflare Radar, our tool for Internet insights, now includes a new section dedicated to email security trends.

Cyber attacks often leverage the guise of authenticity, a tactic Cloudflare thwarted following a phishing scheme similar to the one that compromised Twilio in 2022. The US Cybersecurity and Infrastructure Security Agency (CISA) notes that 90% of cyber attacks start with phishing, and fabricating trust is a key component of successful malicious attacks. We see there are two forms of authenticity that attackers can choose to leverage when crafting phishing messages, visual and organizational. Attacks that leverage visual authenticity rely on attackers using branding elements, like logos or images, to build credibility. Organizationally authentic campaigns rely on attackers using previously established relationships and business dynamics to establish trust and be successful.

Our findings from 2023 reveal that recently introduced generic TLDs (gTLDs), including several linked to the beauty industry, are predominantly used both for spam and malicious attacks. These TLDs, such as .uno, .sbs, and .beauty, all introduced since 2014, have seen over 95% of their emails flagged as spam or malicious. Also, it’s important to note that in terms of volume, “.com” accounts for 67% of all spam and malicious emails (more on that below).

| TLD | 2023 Spam % | 2023 Malicious % | 2023 Spam + malicious % | TLD creation |
| --- | --- | --- | --- | --- |
| .uno | 62% | 37% | 99% | 2014 |
| .sbs | 64% | 35% | 98% | 2021 |
| .best | 68% | 29% | 97% | 2014 |
| .beauty | 77% | 20% | 97% | 2021 |
| .top | 74% | 23% | 97% | 2014 |
| .hair | 78% | 18% | 97% | 2021 |
| .monster | 80% | 17% | 96% | 2019 |
| .cyou | 34% | 62% | 96% | 2020 |
| .wiki | 69% | 26% | 95% | 2014 |
| .makeup | 32% | 63% | 95% | 2021 |

Email and Top-Level Domains history

In 1971, Ray Tomlinson sent the first networked email over ARPANET, using the @ character in the address. Five decades later, email remains relevant but also a key entry point for attackers.

Before the advent of the World Wide Web, email standardization and growth in the 1980s, especially within academia and military communities, led to interoperability. Fast forward 40 years, and this interoperability is once again a hot topic, with platforms like Threads, Mastodon, and other social media services aiming for the open communication that Jack Dorsey envisioned for Twitter. So, in 2024, it’s clear that social media, messaging apps like Slack, Teams, Google Chat, and others haven’t killed email, just as “video didn’t kill the radio star.”

The structure of a domain name.

The domain name system, managed by ICANN, encompasses a variety of TLDs, from the classic “.com” (1985) to the newer generic options. There are also country-code TLDs (ccTLDs), where the Internet Assigned Numbers Authority (IANA) is responsible for determining an appropriate trustee for each ccTLD. An extensive 2014 expansion by ICANN was designed to “increase competition and choice in the domain name space,” introducing numerous new options for specific professional, business, and informational purposes, which in turn also opened up new possibilities for phishing attempts.

3.4 billion unwanted emails

Cloudflare’s Cloud Email Security service is helping protect our customers, and that also comes with insights. In 2022, Cloudflare blocked 2.4 billion unwanted emails, and in 2023 that number rose to over 3.4 billion unwanted emails, 26% of all messages processed. This total includes spam, malicious, and “bulk” emails (the practice of sending a single email message, solicited or unsolicited, to a large number of recipients simultaneously). That means an average of 9.3 million per day, 6,500 per minute, and 108 per second.

Bear in mind that new customers also make the numbers grow — in this case, driving a 42% increase in unwanted emails from 2022 to 2023. But this gives a sense of scale in this email area. Those unwanted emails can include malicious attacks that are difficult to detect, are becoming more frequent, and can have devastating consequences for the individuals and businesses that fall victim to them. Below, we’ll give more details on email threats, where malicious messages account for almost 3% of emails averaged across all of 2023, with a growth trend during the year and higher percentages in its final months. Let’s take a closer look.

Top phishing TLDs (and types of TLDs)

First, let’s start with a 2023 overview of top-level domains with a high percentage of spam and malicious messages. Despite excluding TLDs with fewer than 20,000 emails, our analysis covers unwanted emails considered to be spam and malicious from more than 350 different TLDs (and yes, there are many more).

A quick overview highlights the TLDs with the highest rates of spam and malicious attacks as a proportion of their outbound email, those with the largest volume share of spam or malicious emails, and those with the highest malicious-only and spam-only rates. It reveals that newer TLDs, especially those associated with the beauty industry (generally available since 2021 and serving a booming industry), have the highest rates as a proportion of their emails. However, it’s relevant to recognize that “.com” accounts for 67% of all spam and malicious emails. Malicious emails often originate from recently created generic TLDs like “.bar”, “.makeup”, or “.cyou”, as well as certain country-code TLDs (ccTLDs) employed beyond their geographical implications.

| TLD (highest % of spam + malicious) | Spam + mal % | TLD (volume share of spam + malicious) | Share of volume | TLD (highest % of malicious) | Malicious % | TLD (highest % of spam) | Spam % |
| --- | --- | --- | --- | --- | --- | --- | --- |
| .uno | 99% | .com | 67% | .bar | 70% | .autos | 93% |
| .sbs | 98% | .shop | 5% | .makeup | 63% | .today | 92% |
| .best | 97% | .net | 4% | .cyou | 62% | .directory | 91% |
| .beauty | 97% | .no | 3% | .ml | 56% | .boats | 87% |
| .top | 97% | .org | 2% | .tattoo | 54% | .center | 85% |
| .hair | 97% | .ru | 1% | .om | 47% | .monster | 80% |
| .monster | 96% | .jp | 1% | .cfd | 46% | .lol | 79% |
| .cyou | 96% | .click | 1% | .skin | 39% | .hair | 78% |
| .wiki | 95% | .beauty | 1% | .uno | 37% | .shop | 78% |
| .makeup | 95% | .cn | 1% | .pw | 37% | .beauty | 77% |

Focusing on volume share, “.com” dominates the spam + malicious list at 67%, and is joined in the top 3 by another “classic” gTLD, “.net”, at 4%. They also lead by volume when we look separately at the malicious (68% of all malicious emails are “.com” and “.net”) and spam (71%) categories, as shown below. All of the generic TLDs introduced since 2014 represent 13.4% of spam and malicious and over 14% of only malicious emails. These new TLDs (most of them are only available since 2016) are notable sources of both spam and malicious messages. Meanwhile, country-code TLDs contribute to more than 12% of both categories of unwanted emails.

This breakdown highlights the critical role of both established and new generic TLDs, which surpass older ccTLDs in terms of malicious emails, pointing to the changing dynamics of email-based threats.

| Type of TLDs | Spam | Malicious | Spam + malicious |
| --- | --- | --- | --- |
| ccTLDs | 13% | 12% | 12% |
| .com and .net only | 71% | 68% | 71% |
| new gTLDs | 13% | 14% | 13.4% |

That said, “.shop” deserves a highlight of its own. The TLD, which has been available since 2016, is #2 by volume of spam and malicious emails, accounting for 5% of all of those emails. It also represents, when we separate those two categories, 5% of all malicious emails, and 5% of all spam emails. As we’re going to see below, its influence is growing.

Full 2023 top 50 spam & malicious TLDs list

For a more detailed perspective, below we present the top 50 TLDs with the highest percentages of spam and malicious emails during 2023. We also include a breakdown of those two categories.

It’s noticeable that even outside the top 10, other recent generic TLDs rank highly, such as “.autos” (#1 in the spam list), “.today”, “.bid”, or “.cam”. TLDs that suggest entertainment, fun, or leisure (including “.fun” itself) also occupy positions in our top 50 ranking.

2023 Top 50 spam & malicious TLDs (by highest %)

| Rank | TLD | Spam % | Malicious % | Spam + malicious % |
| --- | --- | --- | --- | --- |
| 1 | .uno | 62% | 37% | 99% |
| 2 | .sbs | 64% | 35% | 98% |
| 3 | .best | 68% | 29% | 97% |
| 4 | .beauty | 77% | 20% | 97% |
| 5 | .top | 74% | 23% | 97% |
| 6 | .hair | 78% | 18% | 97% |
| 7 | .monster | 80% | 17% | 96% |
| 8 | .cyou | 34% | 62% | 96% |
| 9 | .wiki | 69% | 26% | 95% |
| 10 | .makeup | 32% | 63% | 95% |
| 11 | .autos | 93% | 2% | 95% |
| 12 | .today | 92% | 3% | 94% |
| 13 | .shop | 78% | 16% | 94% |
| 14 | .bid | 74% | 18% | 92% |
| 15 | .cam | 67% | 25% | 92% |
| 16 | .directory | 91% | 0% | 91% |
| 17 | .icu | 75% | 15% | 91% |
| 18 | .ml | 33% | 56% | 89% |
| 19 | .lol | 79% | 10% | 89% |
| 20 | .skin | 49% | 39% | 88% |
| 21 | .boats | 87% | 1% | 88% |
| 22 | .tattoo | 34% | 54% | 87% |
| 23 | .click | 61% | 27% | 87% |
| 24 | .ltd | 70% | 17% | 86% |
| 25 | .rest | 74% | 11% | 86% |
| 26 | .center | 85% | 0% | 85% |
| 27 | .fun | 64% | 21% | 85% |
| 28 | .cfd | 39% | 46% | 84% |
| 29 | .bar | 14% | 70% | 84% |
| 30 | .bio | 72% | 11% | 84% |
| 31 | .tk | 66% | 17% | 83% |
| 32 | .yachts | 58% | 23% | 81% |
| 33 | .one | 63% | 17% | 80% |
| 34 | .ink | 68% | 10% | 78% |
| 35 | .wf | 76% | 1% | 77% |
| 36 | .no | 76% | 0% | 76% |
| 37 | .pw | 39% | 37% | 75% |
| 38 | .site | 42% | 31% | 73% |
| 39 | .life | 56% | 16% | 72% |
| 40 | .homes | 62% | 10% | 72% |
| 41 | .services | 67% | 2% | 69% |
| 42 | .mom | 63% | 5% | 68% |
| 43 | .ir | 37% | 29% | 65% |
| 44 | .world | 43% | 21% | 65% |
| 45 | .lat | 40% | 24% | 64% |
| 46 | .xyz | 46% | 18% | 63% |
| 47 | .ee | 62% | 1% | 62% |
| 48 | .live | 36% | 26% | 62% |
| 49 | .pics | 44% | 16% | 60% |
| 50 | .mobi | 41% | 19% | 60% |

Change in spam & malicious TLD patterns

Let’s look at TLDs where spam + malicious emails comprised the largest share of total messages from that TLD, and how that list of TLDs changed from the first half of 2023 to the second half. This shows which TLDs were most problematic at different times during the year.

Several TLDs climbed in the rankings for the percentage of spam and malicious emails from July to December 2023, compared with January to June. Generic TLDs “.uno”, “.makeup” and “.directory” appeared in the top list, and in higher positions, for the first time in the last six months of the year.

| TLD (January – June 2023) | Spam + malicious % | TLD (July – Dec 2023) | Spam + malicious % |
| --- | --- | --- | --- |
| .click | 99% | .uno | 99% |
| .best | 99% | .sbs | 98% |
| .yachts | 99% | .beauty | 97% |
| .hair | 99% | .best | 97% |
| .autos | 99% | .makeup | 95% |
| .wiki | 98% | .monster | 95% |
| .today | 98% | .directory | 95% |
| .mom | 98% | .bid | 95% |
| .sbs | 97% | .top | 93% |
| .top | 97% | .shop | 92% |
| .monster | 97% | .today | 92% |
| .beauty | 97% | .cam | 92% |
| .bar | 96% | .cyou | 92% |
| .rest | 95% | .icu | 91% |
| .cam | 95% | .boats | 88% |
| .homes | 94% | .wiki | 88% |
| .pics | 94% | .rest | 88% |
| .lol | 94% | .hair | 87% |
| .quest | 93% | .fun | 87% |
| .cyou | 93% | .cfd | 86% |
| .ink | 92% | .skin | 85% |
| .shop | 92% | .ltd | 84% |
| .skin | 91% | .one | 83% |
| .ltd | 91% | .center | 83% |
| .tattoo | 91% | .services | 81% |
| .no | 90% | .lol | 78% |
| .ml | 90% | .wf | 78% |
| .center | 90% | .pw | 76% |
| .store | 90% | .life | 76% |
| .icu | 89% | .click | 75% |

From the rankings, it’s clear that the recent generic TLDs have the highest spam and malicious percentage of all emails. The top 10 TLDs in both halves of 2023 are all recent and generic, with several introduced since 2021.

Reasons for the prominence of these gTLDs include the availability of domain names that can seem legitimate or mimic well-known brands, as we explain in this blog post. Cybercriminals often use popular or catchy words. Some gTLDs allow anonymous registration. Their low cost and the delay in updated security systems to recognize new gTLDs as spam and malicious sources also play a role — note that, as we’ve seen, cyber criminals also like to change TLDs and methods.

The impact of a lawsuit?

There’s also been a change in the types of domains with the highest malicious percentage in 2023, possibly due to Meta’s lawsuit against Freenom, filed in December 2022 and refiled in March 2023. Freenom provided domain name registry services for free in five ccTLDs, which wound up being used for purposes beyond local businesses or content: “.cf” (Central African Republic), “.ga” (Gabon), “.gq” (Equatorial Guinea), “.ml” (Mali), and “.tk” (Tokelau). However, Freenom stopped new registrations during 2023 following the lawsuit, and in February 2024, announced its decision to exit the domain name business.

Focusing on Freenom TLDs, which appeared in our top 50 ranking only in the first half of 2023, we see a clear shift. Since October, these TLDs have become less relevant in terms of all emails, including malicious and spam percentages. In February 2023, they accounted for 0.17% of all malicious emails we tracked, and 0.04% of all spam and malicious. Their presence has decreased since then, becoming almost non-existent in email volume in September and October, similar to other analyses.

TLDs ordered by volume of spam + malicious

In addition to looking at their share, another way to examine the data is to identify the TLDs with a higher volume of spam and malicious emails — the next table is ordered that way. This means we can show more familiar (and much older) TLDs, such as “.com”. We’ve included the percentage of all emails in any given TLD that are classified as spam or malicious, as well as the combined spam + malicious percentage, to spotlight those that may require more caution. For instance, among high-volume TLDs, “.shop”, “.no”, “.click”, “.beauty”, “.top”, “.monster”, “.autos”, and “.today” stand out with a higher spam and malicious percentage (and also a higher malicious-only percentage).

In the realm of country-code TLDs, Norway’s “.no” leads in spam, followed by China’s “.cn”, Russia’s “.ru”, Ukraine’s “.ua”, and Anguilla’s “.ai”, which recently has been used more for artificial intelligence-related domains than for the country itself.

TLDs where spam + malicious messages represent more than 20% of all emails in that TLD stand out — already what we consider a high share for domains with a large volume of email.

TLDs with more spam + malicious emails (in volume) in 2023

| Rank | TLD | Spam % | Malicious % | Spam + mal % |
| --- | --- | --- | --- | --- |
| 1 | .com | 3.6% | 0.8% | 4.4% |
| 2 | .shop | 77.8% | 16.4% | 94.2% |
| 3 | .net | 2.8% | 1.0% | 3.9% |
| 4 | .no | 76.0% | 0.3% | 76.3% |
| 5 | .org | 3.3% | 1.8% | 5.2% |
| 6 | .ru | 15.2% | 7.7% | 22.9% |
| 7 | .jp | 3.4% | 2.5% | 5.9% |
| 8 | .click | 60.6% | 26.6% | 87.2% |
| 9 | .beauty | 77.0% | 19.9% | 96.9% |
| 10 | .cn | 25.9% | 3.3% | 29.2% |
| 11 | .top | 73.9% | 22.8% | 96.6% |
| 12 | .monster | 79.7% | 16.8% | 96.5% |
| 13 | .de | 13.0% | 2.1% | 15.2% |
| 14 | .best | 68.1% | 29.4% | 97.4% |
| 15 | .gov | 0.6% | 2.0% | 2.6% |
| 16 | .autos | 92.6% | 2.0% | 94.6% |
| 17 | .ca | 5.2% | 0.5% | 5.7% |
| 18 | .uk | 3.2% | 0.8% | 3.9% |
| 19 | .today | 91.7% | 2.6% | 94.3% |
| 20 | .io | 3.6% | 0.5% | 4.0% |
| 21 | .us | 5.7% | 1.9% | 7.6% |
| 22 | .co | 6.3% | 0.8% | 7.1% |
| 23 | .biz | 27.2% | 14.0% | 41.2% |
| 24 | .edu | 0.9% | 0.2% | 1.1% |
| 25 | .info | 20.4% | 5.4% | 25.8% |
| 26 | .ai | 28.3% | 0.1% | 28.4% |
| 27 | .sbs | 63.8% | 34.5% | 98.3% |
| 28 | .it | 2.5% | 0.3% | 2.8% |
| 29 | .ua | 37.4% | 0.6% | 38.0% |
| 30 | .fr | 8.5% | 1.0% | 9.5% |

The curious case of “.gov” email spoofing

When we concentrate our research on message volume to identify TLDs with more malicious emails blocked by our Cloud Email Security service, we discover a trend related to “.gov”.

| TLD (ordered by malicious email volume) | % of all malicious emails |
| --- | --- |
| .com | 63% |
| .net | 5% |
| .shop | 5% |
| .org | 3% |
| .gov | 2% |
| .ru | 2% |
| .jp | 2% |
| .click | 1% |
| .best | 0.9% |
| .beauty | 0.8% |

The first three TLDs, “.com” (63%), “.net” (5%), and “.shop” (5%), were previously seen in our rankings and are not surprising. However, in fourth place is “.org”, known for being used by non-profit and other similar organizations, but with an open registration policy. In fifth place is “.gov”, used only by the US government and administered by CISA. Our investigation suggests that it appears in the ranking because of typical attacks in which cybercriminals pretend to send from a legitimate address (email spoofing, the creation of email messages with a forged sender address). In this case, they use “.gov” addresses when launching attacks.

The spoofing behavior linked to “.gov” is similar to that of other TLDs. It includes fake senders failing SPF validation and other DNS-based authentication methods, along with various other types of attacks. An email failing SPF, DKIM, and DMARC checks typically indicates that a malicious sender is using an unauthorized IP, domain, or both. So, there are more straightforward ways to block spoofed emails without examining their content for malicious elements.
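As an illustration of those DNS-based checks (the domain below is a placeholder, not one observed in this data), you can inspect a domain’s published SPF and DMARC policies with standard DNS lookups:

dig +short TXT agency.example.gov          # SPF record, e.g. "v=spf1 include:... -all"
dig +short TXT _dmarc.agency.example.gov   # DMARC policy, e.g. "v=DMARC1; p=reject; ..."

A receiving mail server compares the connecting IP address and the signing domain against these published records, so a message that fails SPF, DKIM, and DMARC alignment can be rejected or quarantined before any content inspection takes place.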

Ranking TLDs by proportions of malicious and spam email in 2023

In this section, we have included two lists: one ranks TLDs by the highest percentage of malicious emails — those you should exercise greater caution with; the second ranks TLDs by just their spam percentage. These contrast with the previous top 50 list ordered by combined spam and malicious percentages. In the case of malicious emails, the top 3 with the highest percentage are all generic TLDs. The #1 was “.bar”, with 70% of all emails being categorized as malicious, followed by “.makeup”, and “.cyou” — marketed as the phrase “see you”.

The malicious list also includes some country-code TLDs (ccTLDs) not primarily used for country-related topics, like .ml (Mali), .om (Oman), and .pw (Palau), as well as other ccTLDs such as .ir (Iran), .kg (Kyrgyzstan), and .lk (Sri Lanka).

In the spam realm, “.autos” (93%) and other generic TLDs such as “.today” and “.directory” take the first three spots, all with shares over 90%.

| TLD (2023, ordered by malicious email %) | Malicious % | TLD (2023, ordered by spam email %) | Spam % |
| --- | --- | --- | --- |
| .bar | 70% | .autos | 93% |
| .makeup | 63% | .today | 92% |
| .cyou | 62% | .directory | 91% |
| .ml | 56% | .boats | 87% |
| .tattoo | 54% | .center | 85% |
| .om | 47% | .monster | 80% |
| .cfd | 46% | .lol | 79% |
| .skin | 39% | .hair | 78% |
| .uno | 37% | .shop | 78% |
| .pw | 37% | .beauty | 77% |
| .sbs | 35% | .no | 76% |
| .site | 31% | .wf | 76% |
| .store | 31% | .icu | 75% |
| .best | 29% | .bid | 74% |
| .ir | 29% | .rest | 74% |
| .lk | 27% | .top | 74% |
| .work | 27% | .bio | 72% |
| .click | 27% | .ltd | 70% |
| .wiki | 26% | .wiki | 69% |
| .live | 26% | .best | 68% |
| .cam | 25% | .ink | 68% |
| .lat | 24% | .cam | 67% |
| .yachts | 23% | .services | 67% |
| .top | 23% | .tk | 66% |
| .world | 21% | .sbs | 64% |
| .fun | 21% | .fun | 64% |
| .beauty | 20% | .one | 63% |
| .mobi | 19% | .mom | 63% |
| .kg | 19% | .uno | 62% |
| .hair | 18% | .homes | 62% |

How it stands in 2024: new higher-risk TLDs

2024 has seen new players enter the high-risk zone for unwanted emails. In this list we have only included the new TLDs that weren’t in the top 50 during 2023 and joined the list in January 2024. New entrants include Samoa’s “.ws”, Indonesia’s “.id” (also used for its “identification” meaning), and the Cocos Islands’ “.cc”. These ccTLDs, often used for more than just country-related purposes, have shown high percentages of malicious emails, ranging from 20% (.cc) to 95% (.ws) of all their emails.

January 2024: Newer TLDs in the top 50 list

| TLD | Spam % | Malicious % | Spam + mal % |
| --- | --- | --- | --- |
| .ws | 3% | 95% | 98% |
| .company | 96% | 0% | 96% |
| .digital | 72% | 2% | 74% |
| .pro | 66% | 6% | 73% |
| .tz | 62% | 4% | 65% |
| .id | 13% | 39% | 51% |
| .cc | 25% | 21% | 46% |
| .space | 32% | 8% | 40% |
| .enterprises | 2% | 37% | 40% |
| .lv | 30% | 1% | 30% |
| .cn | 26% | 3% | 29% |
| .jo | 27% | 1% | 28% |
| .info | 21% | 5% | 26% |
| .su | 20% | 5% | 25% |
| .ua | 23% | 1% | 24% |
| .museum | 0% | 24% | 24% |
| .biz | 16% | 7% | 24% |
| .se | 23% | 0% | 23% |
| .ai | 21% | 0% | 21% |

Overview of email threat trends since 2023

With Cloudflare’s Cloud Email Security, we gain insight into the broader email landscape over the past months. The spam percentage of all emails stood at 8.58% in 2023. As mentioned before, keep in mind with these percentages that our protection typically kicks in after other email providers’ filters have already removed some spam and malicious emails.

How about malicious emails? Almost 3% of all emails were flagged as malicious during 2023, with the highest percentages occurring in Q4. Here’s how that malicious share evolved, including the January and February 2024 perspective:

The week before Christmas and the first week of 2024 experienced a significant spike in malicious emails, reaching an average of 7% and 8% across the weeks, respectively. Not surprisingly, there was a noticeable decrease during Christmas week, when it dropped to 3%. Other significant increases in the percentage of malicious emails were observed the week before Valentine’s Day, the first week of September (coinciding with returns to work and school in the Northern Hemisphere), and late October.

Threat categories in 2023

We can also look at the different types of threats in 2023. Links were present in 49% of all threats. Other categories included extortion (36%), identity deception (27%), credential harvesting (23%), and brand impersonation (18%). These categories are defined and explored in detail in Cloudflare’s 2023 phishing threats report. Extortion saw the most growth in Q4, especially in November and December, reaching 38% of all threats, up from 7% in Q1 2023.

Other trends: Attachments are still popular

Other, less “threatening” trends show that 20% of all emails included attachments, while 82% contained links in the body. Additionally, 31% were composed in plain text, and 18% featured HTML, which allows for enhanced formatting and visuals. 39% of all emails used remote content.

Conclusion: Be cautious, prepared, safe

The landscape of spam and malicious (or phishing) emails constantly evolves alongside technology, the Internet, user behaviors, use cases, and cybercriminals. As we’ve seen through Cloudflare’s Cloud Email Security insights, new generic TLDs have emerged as preferred channels for these malicious activities, highlighting the need for vigilance when dealing with emails from unfamiliar domains.

There’s no shortage of advice on staying safe from phishing. Email remains a ubiquitous yet highly exploited tool in daily business operations. Cybercriminals often bait users into clicking malicious links within emails, a tactic used by both sophisticated criminal organizations and novice attackers. So, always exercise caution online.

Cloudflare’s Cloud Email Security provides insights that underscore the importance of robust cybersecurity infrastructure in fighting the dynamic tactics of phishing attacks.

If you want to learn more about email security, you can check Cloudflare Radar’s new email section, visit our Learning Center or reach out for a complimentary phishing risk assessment for your organization.

(Contributors to this blog post include Jeremy Eckman, Phil Syme, and Oren Falkowitz.)

How we’re creating more impact with Ada Computer Science

Post Syndicated from Ben Durbin original https://www.raspberrypi.org/blog/how-were-creating-more-impact-with-ada-computer-science/

We offer Ada Computer Science as a platform to support educators and learners alike. But we don’t take its usefulness for granted: as part of our commitment to impact, we regularly gather user feedback and evaluate all of our products, and Ada is no exception. In this blog, we share some of the feedback we’ve gathered from surveys and interviews with the people using Ada.

A secondary school age learner in a computing classroom.

What’s new on Ada?

Ada Computer Science is our online learning platform designed for teachers, students, and anyone interested in learning about computer science. If you’re teaching or studying a computer science qualification at school, you can use Ada Computer Science for classwork, homework, and revision. 

Launched last year as a partnership between us and the University of Cambridge, Ada’s comprehensive resources cover topics like algorithms, data structures, computational thinking, and cybersecurity. It also includes 1,000 self-marking questions, which both teachers and students can use to assess their knowledge and understanding. 

Throughout 2023, we continued to develop the support Ada offers. For example, we: 

  • Added over 100 new questions
  • Expanded code specimens to cover Java and Visual Basic as well as Python and C#
  • Added an integrated way of learning about databases through writing and executing SQL
  • Incorporated a beta version of an embedded Python editor with the ability to run code and compare the output with correct solutions 

A few weeks ago we launched two all-new topics about artificial intelligence (AI) and machine learning.

So far, all the content on Ada Computer Science is mapped to GCSE and A level exam boards in England, and we’ve just released new resources for the Scottish Qualification Authority’s Computer Systems area of study to support students in Scotland with their National 5 and Higher qualifications.

Who is using Ada?

Ada is being used by a wide variety of users, from at least 127 countries all across the globe. Countries where Ada is most popular include the UK, US, Canada, Australia, Brazil, India, China, Nigeria, Ghana, Kenya, Myanmar, and Indonesia.

Children in a Code Club in India.

Just over half of students using Ada are completing work set by their teacher. However, there are also substantial numbers of young people benefitting from using Ada for their own independent learning. So far, over half a million question attempts have been made on the platform.

How are people using Ada?

Students use Ada for a wide variety of purposes. The most common response in our survey was for revision, but students also use it to complete work set by teachers, to learn new concepts, and to check their understanding of computer science concepts.

Teachers also use Ada for a combination of their own learning, in the classroom with their students, and for setting work outside of lessons. They told us that they value Ada as a source of pre-made questions.

“I like having a bank of questions as a teacher. It’s tiring to create more. I like that I can use the finder and create questions very quickly.” — Computer science teacher, A level

“I like the structure of how it [Ada] is put together. [Resources] are really easy to find and being able to sort by exam board makes it really useful because… at A level there is a huge difference between exam boards.” — GCSE and A level teacher

What feedback are people giving about Ada?

Students and teachers alike were very positive about the quality and usefulness of Ada Computer Science. Overall, 89% of students responding to our survey agreed that Ada is useful for helping them to learn about computer science, and 93% of teachers agreed that it is high quality.

“The impact for me was just having a resource that I felt I always could trust.” — Head of Computer Science

A graph showing that students and teachers consider Ada Computer Science to be useful and high quality.

Most teachers also reported that using Ada reduces their workload, saving an average of 3 hours per week.

“[Quizzes] are the most useful because it’s the biggest time saving…especially having them nicely self-marked as well.” — GCSE and A level computer science teacher

Even more encouragingly, Ada users report a positive impact on their knowledge, skills, and attitudes to computer science. Teachers report that, as a result of using Ada, their computer science subject knowledge and their confidence in teaching has increased, and report similar benefits for their students.

“They can easily…recap and see how they’ve been getting on with the different topic areas.” — GCSE and A level computer science teacher

“I see they’re answering the questions and learning things without really realising it, which is quite nice.” — GCSE and A level computer science teacher

How do we use people’s feedback to improve the platform?

Our content team is made up of experienced computer science teachers, and we’re always updating the site in response to feedback from the teachers and students who use our resources. We receive feedback through support tickets, and we have a monthly meeting where we comb through every wrong answer that students entered to help us identify new misconceptions. We then use all of this to improve the content, and the feedback we give students on the platform.

A computer science teacher sits with students at computers in a classroom.

We’d love to hear from you

We’ll be conducting another round of surveys later this year, so when you see the link, please fill in the form. In the meantime, if you have any feedback or suggestions for improvements, please get in touch.

And if you’ve not signed up to Ada yet as a teacher or student, you can take a look right now over at adacomputerscience.org

The post How we’re creating more impact with Ada Computer Science appeared first on Raspberry Pi Foundation.

On Secure Voting Systems

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/03/on-secure-voting-systems.html

Andrew Appel shepherded a public comment—signed by twenty election cybersecurity experts, including myself—on best practices for ballot marking devices and vote tabulation. It was written for the Pennsylvania legislature, but it’s general in nature.

From the executive summary:

We believe that no system is perfect, with each having trade-offs. Hand-marked and hand-counted ballots remove the uncertainty introduced by use of electronic machinery and the ability of bad actors to exploit electronic vulnerabilities to remotely alter the results. However, some portion of voters mistakenly mark paper ballots in a manner that will not be counted in the way the voter intended, or which even voids the ballot. Hand-counts delay timely reporting of results, and introduce the possibility for human error, bias, or misinterpretation.

Technology introduces the means of efficient tabulation, but also introduces a manifold increase in complexity and sophistication of the process. This places the understanding of the process beyond the average person’s understanding, which can foster distrust. It also opens the door to human or machine error, as well as exploitation by sophisticated and malicious actors.

Rather than assert that each component of the process can be made perfectly secure on its own, we believe the goal of each component of the elections process is to validate every other component.

Consequently, we believe that the hallmarks of a reliable and optimal election process are hand-marked paper ballots, which are optically scanned, separately and securely stored, and rigorously audited after the election but before certification. We recommend state legislators adopt policies consistent with these guiding principles, which are further developed below.

Run large-scale simulations with AWS Batch multi-container jobs

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/run-large-scale-simulations-with-aws-batch-multi-container-jobs/

Industries like automotive, robotics, and finance are increasingly implementing computational workloads like simulations, machine learning (ML) model training, and big data analytics to improve their products. For example, automakers rely on simulations to test autonomous driving features, robotics companies train ML algorithms to enhance robot perception capabilities, and financial firms run in-depth analyses to better manage risk, process transactions, and detect fraud.

Some of these workloads, including simulations, are especially complicated to run due to their diversity of components and intensive computational requirements. A driving simulation, for instance, involves generating 3D virtual environments, vehicle sensor data, vehicle dynamics controlling car behavior, and more. A robotics simulation might test hundreds of autonomous delivery robots interacting with each other and other systems in a massive warehouse environment.

AWS Batch is a fully managed service that can help you run batch workloads across a range of AWS compute offerings, including Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Fargate, and Amazon EC2 Spot or On-Demand Instances. Traditionally, AWS Batch only allowed single-container jobs and required extra steps to merge all components into a monolithic container. It also did not allow using separate “sidecar” containers, which are auxiliary containers that complement the main application by providing additional services like data logging. This additional effort required coordination across multiple teams, such as software development, IT operations, and quality assurance (QA), because any code change meant rebuilding the entire container.

Now, AWS Batch offers multi-container jobs, making it easier and faster to run large-scale simulations in areas like autonomous vehicles and robotics. These workloads are usually divided between the simulation itself and the system under test (also known as an agent) that interacts with the simulation. These two components are often developed and optimized by different teams. With the ability to run multiple containers per job, you get the advanced scaling, scheduling, and cost optimization offered by AWS Batch, and you can use modular containers representing different components like 3D environments, robot sensors, or monitoring sidecars. In fact, customers such as IPG Automotive, MORAI, and Robotec.ai are already using AWS Batch multi-container jobs to run their simulation software in the cloud.

Let’s see how this works in practice using a simplified example and have some fun trying to solve a maze.

Building a Simulation Running on Containers
In production, you will probably use existing simulation software. For this post, I built a simplified version of an agent/model simulation. If you’re not interested in code details, you can skip this section and go straight to how to configure AWS Batch.

For this simulation, the world to explore is a randomly generated 2D maze. The agent has the task of exploring the maze to find a key and then reach the exit. In a way, it is a classic pathfinding problem involving three locations.

Here’s a sample map of a maze where I highlighted the start (S), end (E), and key (K) locations.

Sample ASCII maze map.

The separation of agent and model into two separate containers allows different teams to work on each of them separately. Each team can focus on improving their own part, for example, to add details to the simulation or to find better strategies for how the agent explores the maze.

Here’s the code of the maze model (app.py). I used Python for both examples. The model exposes a REST API that the agent can use to move around the maze and know if it has found the key and reached the exit. The maze model uses Flask for the REST API.

import json
import random
from flask import Flask, request, Response

ready = False

# How map data is stored inside a maze
# with size (width x height) = (4 x 3)
#
#    012345678
# 0: +-+-+ +-+
# 1: | |   | |
# 2: +-+ +-+-+
# 3: | |   | |
# 4: +-+-+ +-+
# 5: | | | | |
# 6: +-+-+-+-+
# 7: Not used

class WrongDirection(Exception):
    pass

class Maze:
    UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3
    OPEN, WALL = 0, 1
    

    @staticmethod
    def distance(p1, p2):
        (x1, y1) = p1
        (x2, y2) = p2
        return abs(y2-y1) + abs(x2-x1)


    @staticmethod
    def random_dir():
        return random.randrange(4)


    @staticmethod
    def go_dir(x, y, d):
        if d == Maze.UP:
            return (x, y - 1)
        elif d == Maze.RIGHT:
            return (x + 1, y)
        elif d == Maze.DOWN:
            return (x, y + 1)
        elif d == Maze.LEFT:
            return (x - 1, y)
        else:
            raise WrongDirection(f"Direction: {d}")


    def __init__(self, width, height):
        self.width = width
        self.height = height        
        self.generate()
        

    def area(self):
        return self.width * self.height
        

    def min_length(self):
        return self.area() / 5
    

    def min_distance(self):
        return (self.width + self.height) / 5
    

    def get_pos_dir(self, x, y, d):
        if d == Maze.UP:
            return self.maze[y][2 * x + 1]
        elif d == Maze.RIGHT:
            return self.maze[y][2 * x + 2]
        elif d == Maze.DOWN:
            return self.maze[y + 1][2 * x + 1]
        elif d ==  Maze.LEFT:
            return self.maze[y][2 * x]
        else:
            raise WrongDirection(f"Direction: {d}")


    def set_pos_dir(self, x, y, d, v):
        if d == Maze.UP:
            self.maze[y][2 * x + 1] = v
        elif d == Maze.RIGHT:
            self.maze[y][2 * x + 2] = v
        elif d == Maze.DOWN:
            self.maze[y + 1][2 * x + 1] = v
        elif d ==  Maze.LEFT:
            self.maze[y][2 * x] = v
        else:
            raise WrongDirection(f"Direction: {d}  Value: {v}")


    def is_inside(self, x, y):
        return 0 <= y < self.height and 0 <= x < self.width


    def generate(self):
        self.maze = []
        # Close all borders
        for y in range(0, self.height + 1):
            self.maze.append([Maze.WALL] * (2 * self.width + 1))
        # Get a random starting point on one of the borders
        if random.random() < 0.5:
            sx = random.randrange(self.width)
            if random.random() < 0.5:
                sy = 0
                self.set_pos_dir(sx, sy, Maze.UP, Maze.OPEN)
            else:
                sy = self.height - 1
                self.set_pos_dir(sx, sy, Maze.DOWN, Maze.OPEN)
        else:
            sy = random.randrange(self.height)
            if random.random() < 0.5:
                sx = 0
                self.set_pos_dir(sx, sy, Maze.LEFT, Maze.OPEN)
            else:
                sx = self.width - 1
                self.set_pos_dir(sx, sy, Maze.RIGHT, Maze.OPEN)
        self.start = (sx, sy)
        been = [self.start]
        pos = -1
        solved = False
        generate_status = 0
        old_generate_status = 0                    
        while len(been) < self.area():
            (x, y) = been[pos]
            sd = Maze.random_dir()
            for nd in range(4):
                d = (sd + nd) % 4
                if self.get_pos_dir(x, y, d) != Maze.WALL:
                    continue
                (nx, ny) = Maze.go_dir(x, y, d)
                if (nx, ny) in been:
                    continue
                if self.is_inside(nx, ny):
                    self.set_pos_dir(x, y, d, Maze.OPEN)
                    been.append((nx, ny))
                    pos = -1
                    generate_status = len(been) / self.area()
                    if generate_status - old_generate_status > 0.1:
                        old_generate_status = generate_status
                        print(f"{generate_status * 100:.2f}%")
                    break
                elif solved or len(been) < self.min_length():
                    continue
                else:
                    self.set_pos_dir(x, y, d, Maze.OPEN)
                    self.end = (x, y)
                    solved = True
                    pos = -1 - random.randrange(len(been))
                    break
            else:
                pos -= 1
                if pos < -len(been):
                    pos = -1
                    
        self.key = None
        while self.key is None:
            kx = random.randrange(self.width)
            ky = random.randrange(self.height)
            if (Maze.distance(self.start, (kx,ky)) > self.min_distance()
                and Maze.distance(self.end, (kx,ky)) > self.min_distance()):
                self.key = (kx, ky)


    def get_label(self, x, y):
        if (x, y) == self.start:
            c = 'S'
        elif (x, y) == self.end:
            c = 'E'
        elif (x, y) == self.key:
            c = 'K'
        else:
            c = ' '
        return c

                    
    def map(self, moves=[]):
        map = ''
        for py in range(self.height * 2 + 1):
            row = ''
            for px in range(self.width * 2 + 1):
                x = int(px / 2)
                y = int(py / 2)
                if py % 2 == 0: #Even rows
                    if px % 2 == 0:
                        c = '+'
                    else:
                        v = self.get_pos_dir(x, y, self.UP)
                        if v == Maze.OPEN:
                            c = ' '
                        elif v == Maze.WALL:
                            c = '-'
                else: # Odd rows
                    if px % 2 == 0:
                        v = self.get_pos_dir(x, y, self.LEFT)
                        if v == Maze.OPEN:
                            c = ' '
                        elif v == Maze.WALL:
                            c = '|'
                    else:
                        c = self.get_label(x, y)
                        if c == ' ' and [x, y] in moves:
                            c = '*'
                row += c
            map += row + '\n'
        return map


app = Flask(__name__)

@app.route('/')
def hello_maze():
    return "<p>Hello, Maze!</p>"

@app.route('/maze/map', methods=['GET', 'POST'])
def maze_map():
    if not ready:
        return Response(status=503, retry_after=10)
    if request.method == 'GET':
        return '<pre>' + maze.map() + '</pre>'
    else:
        moves = request.get_json()
        return maze.map(moves)

@app.route('/maze/start')
def maze_start():
    if not ready:
        return Response(status=503, retry_after=10)
    start = { 'x': maze.start[0], 'y': maze.start[1] }
    return json.dumps(start)

@app.route('/maze/size')
def maze_size():
    if not ready:
        return Response(status=503, retry_after=10)
    size = { 'width': maze.width, 'height': maze.height }
    return json.dumps(size)

@app.route('/maze/pos/<int:y>/<int:x>')
def maze_pos(y, x):
    if not ready:
        return Response(status=503, retry_after=10)
    pos = {
        'here': maze.get_label(x, y),
        'up': maze.get_pos_dir(x, y, Maze.UP),
        'down': maze.get_pos_dir(x, y, Maze.DOWN),
        'left': maze.get_pos_dir(x, y, Maze.LEFT),
        'right': maze.get_pos_dir(x, y, Maze.RIGHT),

    }
    return json.dumps(pos)


WIDTH = 80
HEIGHT = 20
maze = Maze(WIDTH, HEIGHT)
ready = True

The only requirement for the maze model (in requirements.txt) is the Flask module.

To create a container image running the maze model, I use this Dockerfile.

FROM --platform=linux/amd64 public.ecr.aws/docker/library/python:3.12-alpine

WORKDIR /app

COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

COPY . .

CMD [ "python3", "-m" , "flask", "run", "--host=0.0.0.0", "--port=5555"]

Here’s the code for the agent (agent.py). First, the agent asks the model for the size of the maze and the starting position. Then, it applies its own strategy to explore and solve the maze. In this implementation, the agent chooses its route at random, trying to avoid following the same path more than once.

import random
import requests
from requests.adapters import HTTPAdapter, Retry

HOST = '127.0.0.1'
PORT = 5555

BASE_URL = f"http://{HOST}:{PORT}/maze"

UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3
OPEN, WALL = 0, 1

s = requests.Session()

retries = Retry(total=10,
                backoff_factor=1)

s.mount('http://', HTTPAdapter(max_retries=retries))

r = s.get(f"{BASE_URL}/size")
size = r.json()
print('SIZE', size)

r = s.get(f"{BASE_URL}/start")
start = r.json()
print('START', start)

y = start['y']
x = start['x']

found_key = False
been = {(x, y)}
moves = [(x, y)]
moves_stack = [(x, y)]

while True:
    r = s.get(f"{BASE_URL}/pos/{y}/{x}")
    pos = r.json()
    if pos['here'] == 'K' and not found_key:
        print(f"({x}, {y}) key found")
        found_key = True
        been = {(x, y)}
        moves_stack = [(x, y)]
    if pos['here'] == 'E' and found_key:
        print(f"({x}, {y}) exit")
        break
    dirs = list(range(4))
    random.shuffle(dirs)
    for d in dirs:
        nx, ny = x, y
        if d == UP and pos['up'] == 0:
            ny -= 1
        if d == RIGHT and pos['right'] == 0:
            nx += 1
        if d == DOWN and pos['down'] == 0:
            ny += 1
        if d == LEFT and pos['left'] == 0:
            nx -= 1 

        if nx < 0 or nx >= size['width'] or ny < 0 or ny >= size['height']:
            continue

        if (nx, ny) in been:
            continue

        x, y = nx, ny
        been.add((x, y))
        moves.append((x, y))
        moves_stack.append((x, y))
        break
    else:
        if len(moves_stack) > 0:
            x, y = moves_stack.pop()
        else:
            print("No moves left")
            break

print(f"Solution length: {len(moves)}")
print(moves)

r = s.post(f'{BASE_URL}/map', json=moves)

print(r.text)

s.close()

The only dependency of the agent (in requirements.txt) is the requests module.

This is the Dockerfile I use to create a container image for the agent.

FROM --platform=linux/amd64 public.ecr.aws/docker/library/python:3.12-alpine

WORKDIR /app

COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

COPY . .

CMD [ "python3", "agent.py"]

You can easily run this simplified version of a simulation locally, but the cloud allows you to run it at larger scale (for example, with a much bigger and more detailed maze) and to test multiple agents to find the best strategy to use. In a real-world scenario, the improvements to the agent would then be deployed to a physical device such as a self-driving car or a robot vacuum cleaner.

Running a simulation using multi-container jobs
To run a job with AWS Batch, I need to configure three resources:

  • The compute environment in which to run the job
  • The job queue in which to submit the job
  • The job definition describing how to run the job, including the container images to use

In the AWS Batch console, I choose Compute environments from the navigation pane and then Create. Now, I have the choice of using Fargate, Amazon EC2, or Amazon EKS. Fargate allows me to closely match the resource requirements that I specify in the job definitions. However, simulations usually require access to a large but static amount of resources and use GPUs to accelerate computations. For this reason, I select Amazon EC2.

Console screenshot.

I select the Managed orchestration type so that AWS Batch can scale and configure the EC2 instances for me. Then, I enter a name for the compute environment and select the service-linked role (that AWS Batch created for me previously) and the instance role that is used by the ECS container agent (running on the EC2 instances) to make calls to the AWS API on my behalf. I choose Next.

Console screenshot.

In the Instance configuration settings, I choose the size and type of the EC2 instances. For example, I can select instance types that have GPUs or use the Graviton processor. I do not have specific requirements and leave all the settings to their default values. For Network configuration, the console already selected my default VPC and the default security group. In the final step, I review all configurations and complete the creation of the compute environment.

Now, I choose Job queues from the navigation pane and then Create. Then, I select the same orchestration type I used for the compute environment (Amazon EC2). In the Job queue configuration, I enter a name for the job queue. In the Connected compute environments dropdown, I select the compute environment I just created and complete the creation of the queue.

Console screenshot.
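If you prefer to script these two steps, here’s a minimal boto3 sketch of the same managed EC2 compute environment and connected job queue. The names, subnet, security group, and instance role below are placeholders for illustration and need to be replaced with values from your account.

import boto3

batch = boto3.client("batch")

# Managed EC2 compute environment; omitting serviceRole lets AWS Batch
# use its service-linked role, as in the console walkthrough above.
batch.create_compute_environment(
    computeEnvironmentName="maze-simulation-ce",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-0123456789abcdef0"],       # placeholder
        "securityGroupIds": ["sg-0123456789abcdef0"],  # placeholder
        "instanceRole": "ecsInstanceRole",             # ECS instance profile
    },
)

# Job queue connected to the compute environment created above.
# In practice, wait for the compute environment to become VALID first.
batch.create_job_queue(
    jobQueueName="maze-simulation-queue",
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "maze-simulation-ce"},
    ],
)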

I choose Job definitions from the navigation pane and then Create. As before, I select Amazon EC2 for the orchestration type.

To use more than one container, I disable the Use legacy containerProperties structure option and move to the next step. By default, the console creates a legacy single-container job definition if there’s already a legacy job definition in the account. That’s my case. For accounts without legacy job definitions, the console has this option disabled.

Console screenshot.

I enter a name for the job definition. Then, I consider which permissions this job requires. The container images I want to use for this job are stored in Amazon ECR private repositories. To allow AWS Batch to download these images to the compute environment, in the Task properties section, I select an Execution role that gives read-only access to the ECR repositories. I don’t need to configure a Task role because the simulation code is not calling AWS APIs. For example, if my code were uploading results to an Amazon Simple Storage Service (Amazon S3) bucket, I could select a role here that grants the permissions to do so.

In the next step, I configure the two containers used by this job. The first one is the maze-model. I enter the name and the image location. Here, I can specify the resource requirements of the container in terms of vCPUs, memory, and GPUs. This is similar to configuring containers for an ECS task.

Console screenshot.

I add a second container for the agent and enter name, image location, and resource requirements as before. Because the agent needs to access the maze as soon as it starts, I use the Dependencies section to add a container dependency. I select maze-model for the container name and START as the condition. If I don’t add this dependency, the agent container can fail before the maze-model container is running and able to respond. Because both containers are flagged as essential in this job definition, the overall job would terminate with a failure.

Console screenshot.

I review all configurations and complete the job definition. Now, I can start a job.
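For reference, here’s a sketch of what registering an equivalent two-container job definition could look like with boto3. The ecsProperties structure below is intended to mirror what the console builds for multi-container jobs; the image URIs and execution role ARN are placeholders, and it’s worth checking the RegisterJobDefinition API reference for the exact set of supported fields.

import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="maze-simulation",
    type="container",
    ecsProperties={
        "taskProperties": [
            {
                # Execution role with read-only access to the ECR repositories (placeholder ARN)
                "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
                "containers": [
                    {
                        "name": "maze-model",
                        "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/maze-model:latest",  # placeholder
                        "essential": True,
                        "resourceRequirements": [
                            {"type": "VCPU", "value": "1"},
                            {"type": "MEMORY", "value": "2048"},
                        ],
                    },
                    {
                        "name": "agent",
                        "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/maze-agent:latest",  # placeholder
                        "essential": True,
                        "resourceRequirements": [
                            {"type": "VCPU", "value": "1"},
                            {"type": "MEMORY", "value": "2048"},
                        ],
                        # Start the agent only after the maze-model container has started
                        "dependsOn": [
                            {"containerName": "maze-model", "condition": "START"},
                        ],
                    },
                ],
            }
        ]
    },
)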

In the Jobs section of the navigation pane, I submit a new job. I enter a name and select the job queue and the job definition I just created.

Console screenshot.

In the next steps, I don’t need to override any configuration and create the job. After a few minutes, the job has succeeded, and I have access to the logs of the two containers.

Console screenshot.
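The same submission can be scripted. Here’s a small boto3 sketch that submits the job to the queue and polls its status until it reaches a terminal state; the queue and job definition names match the earlier sketches and are placeholders for whatever you actually created.

import time

import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="maze-simulation-run",
    jobQueue="maze-simulation-queue",
    jobDefinition="maze-simulation",
)
job_id = response["jobId"]

# Poll until the job succeeds or fails
while True:
    job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
    status = job["status"]
    print(f"Job {job_id} is {status}")
    if status in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(30)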

The agent solved the maze, and I can get all the details from the logs. Here’s the job output, showing how the agent started, picked up the key, and then found the exit.

SIZE {'width': 80, 'height': 20}
START {'x': 0, 'y': 18}
(32, 2) key found
(79, 16) exit
Solution length: 437
[(0, 18), (1, 18), (0, 18), ..., (79, 14), (79, 15), (79, 16)]

In the map, the red asterisks (*) follow the path used by the agent between the start (S), key (K), and exit (E) locations.

ASCII-based map of the solved maze.

Increasing observability with a sidecar container
When running complex jobs using multiple components, it helps to have more visibility into what these components are doing. For example, if there is an error or a performance problem, this information can help you find where and what the issue is.

To instrument my application, I use AWS Distro for OpenTelemetry (ADOT), running the ADOT Collector as a sidecar container in the same job so that it can gather metrics and traces from the other containers.

Using telemetry data collected in this way, I can set up dashboards (for example, using CloudWatch or Amazon Managed Grafana) and alarms (with CloudWatch or Prometheus) that help me better understand what is happening and reduce the time to solve an issue. More generally, a sidecar container can help integrate telemetry data from AWS Batch jobs with your monitoring and observability platforms.
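As a sketch of how such a sidecar could be added, the entry below is a hypothetical extra item for the containers list in the job definition shown earlier, running the AWS Distro for OpenTelemetry Collector image alongside the simulation containers. The image tag and resource values are assumptions, and in practice you would also provide a collector configuration.

# Hypothetical additional entry for the "containers" list in the job
# definition: a non-essential sidecar running the ADOT Collector.
otel_sidecar = {
    "name": "otel-collector",
    "image": "public.ecr.aws/aws-observability/aws-otel-collector:latest",  # assumed tag
    # Not essential: if the sidecar stops, the simulation containers keep running
    "essential": False,
    "resourceRequirements": [
        {"type": "VCPU", "value": "1"},
        {"type": "MEMORY", "value": "512"},
    ],
}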

Things to know
AWS Batch support for multi-container jobs is available today in the AWS Management Console, AWS Command Line Interface (AWS CLI), and AWS SDKs in all AWS Regions where Batch is offered. For more information, see the AWS Services by Region list.

There is no additional cost for using multi-container jobs with AWS Batch. In fact, there is no additional charge for using AWS Batch itself. You only pay for the AWS resources you create to store and run your application, such as EC2 instances and Fargate containers. To optimize your costs, you can use Reserved Instances, Savings Plans, EC2 Spot Instances, and Fargate in your compute environments.

Using multi-container jobs accelerates development times by reducing job preparation efforts and eliminates the need for custom tooling to merge the work of multiple teams into a single container. It also simplifies DevOps by defining clear component responsibilities so that teams can quickly identify and fix issues in their own areas of expertise without distraction.

To learn more, see how to set up multi-container jobs in the AWS Batch User Guide.

Danilo

[$] Nix at SCALE

Post Syndicated from daroc original https://lwn.net/Articles/965631/

The first-ever NixCon in North America was co-located with SCALE this year. The event drew a mix of experienced Nix users and people new to the project. I attended talks that covered using Nix to build Docker images, upcoming changes to how NixOS performs early booting, and ideas for making the set of services provided in nixpkgs more useful for self-hosting. (LWN covered the relationship between Nix, NixOS, and nixpkgs in a recent article.) Near the end of the conference, a collection of Nix contributors gave a “State of the Union” covering the growth of the project and highlighting areas of concern.
