Cloudflare Calls: millions of cascading trees all the way down

Post Syndicated from Renan Dincer original https://blog.cloudflare.com/cloudflare-calls-anycast-webrtc


Following its initial announcement in September 2022, Cloudflare Calls is now in open beta and available in your Cloudflare Dashboard. Cloudflare Calls lets developers build real-time audio/video apps using WebRTC, and it abstracts away the complexity by turning the Cloudflare network into a singular SFU. In this post, we dig into how we make this possible.

WebRTC growing pains

WebRTC is the only way to send UDP traffic out of a web browser – everything else uses TCP.

As a developer, you need a UDP-based transport layer for applications demanding low latency and real-time feedback, such as audio/video conferencing and interactive gaming. This is because unlike WebSocket and other TCP-based solutions, UDP is not subject to head-of-line blocking, a frequent topic on the Cloudflare Blog.

When building a new video conferencing app, you typically start with a peer-to-peer web application using WebRTC, where clients exchange data directly. This approach is efficient for small-scale demos, but scalability issues arise as the number of participants increases. This is because each client needs to send its data to n-1 other clients, so the upload bandwidth required of every client grows linearly with the number of participants, and the total number of streams in the call grows quadratically.
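
To make the scaling argument concrete, here is a minimal sketch that counts streams rather than bytes. The function names are illustrative and not part of any WebRTC API, and it assumes every participant publishes exactly one stream.

```typescript
type Topology = "mesh" | "sfu";

// Outgoing copies of its own media that a single client has to upload.
function uploadStreamsPerClient(participants: number, topology: Topology): number {
  if (participants < 2) {
    return 0; // nobody to send to
  }
  // Mesh: one copy per other participant. SFU: a single copy, sent to the SFU.
  return topology === "mesh" ? participants - 1 : 1;
}

// Total one-way media streams in a full mesh: n * (n - 1).
function totalMeshStreams(participants: number): number {
  return participants * (participants - 1);
}
```

With 10 participants, a mesh asks each client to upload 9 copies (90 streams in total), while an SFU asks each client to upload one.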

Selective Forwarding Units (SFUs) play a pivotal role in scaling WebRTC applications. An SFU receives multiple media or data flows from participants and decides which streams should be forwarded to other participants, thus acting as a media stream routing hub. This mechanism significantly reduces bandwidth requirements and improves scalability by managing stream distribution based on network conditions and participant needs. Although this was not always the case when video calling on computers first became popular, SFUs today are usually hosted in the cloud rather than on clients’ home computers, because of the superior connectivity a data center offers.

A modern audio/video application thus quickly becomes complicated with the addition of this server side element. Since all clients connect to this central SFU server, there are numerous things to consider when you’re architecting and scaling a real-time application:

  • How close are the SFU server locations to the end-user clients, and how is a client assigned to a server?
  • Where is the SFU hosted, and if it’s hosted in the cloud, what are the egress costs from VMs?
  • How many participants can fit in a “room”? Are all participants sending and receiving data? With cameras on? Audio only?
  • Some SFUs require the use of custom SDKs. Which platforms do these run on and are they compatible with the application you’re trying to build?
  • Monitoring/reliability/other issues that come with running infrastructure

Some of these concerns, and the complexity of WebRTC infrastructure in general, have made the community look in different directions. However, it is clear that in 2024, WebRTC is alive and well with plenty of new and old uses. AI startups build characters that converse in real time, cars leverage WebRTC to stream live footage of their cameras to smartphones, and video conferencing tools are going strong.

WebRTC has been interesting to us for a while. Cloudflare Stream implemented WHIP and WHEP WebRTC video streaming protocols in 2022, which remain the lowest latency way to broadcast video. OBS Studio implemented WHIP broadcasting support as have a variety of software and hardware vendors alongside Cloudflare. In late 2022, we launched Cloudflare Calls in closed beta. When we blogged about it back then, we were very impressed with how WebRTC fared, and spoke to many customers about their pain points as well as creative ideas the existing browser APIs can foster. We also saw other WebRTC-based apps like Clubhouse rise in popularity and Twitter Spaces play a role in popular culture. Today, we see real-time applications of a different sort. Many AI projects have impressive demos with voice/video interactions. All of these apps are built with the same WebRTC APIs and system architectures.

We are confident that Cloudflare Calls is a new kind of WebRTC infrastructure you should try. When we set out to build Cloudflare Calls, we had a few ideas that we weren’t sure would work, but were worth trying:

  • Build every WebRTC component on Anycast with a single IP address for DTLS, ICE, STUN, SRTP, SCTP, etc.
  • Don’t force an SDK – WebRTC APIs by themselves are enough, and allow for the most novel uses to shine, because the best developers always find ways to hit the limits of SDKs.
  • Deploy in all 310+ cities Cloudflare operates in – use every Cloudflare server, not just a subset
  • Exchange offer and answer over HTTP between Cloudflare and the WebRTC client. This way there is only a single PeerConnection to manage.

Now we know this is all possible, because we made it happen, and we think it’s the best experience a developer can get with pure WebRTC.

Is Cloudflare Calls a real SFU?

Cloudflare is in the business of having computers in numerous places. Historically, our core competency was operating a caching HTTP reverse proxy, and we are very good at this. With Cloudflare Calls, we asked ourselves “how can we build a large distributed system that brings together our global network to form one giant stateful system that feels like a single machine?”

When using Calls, every PeerConnection automatically connects to the closest Cloudflare data center instead of a single server. Rather than connecting every client that needs to communicate with each other to a single server, anycast spreads out connections as much as possible to minimize last mile latency sourced from your ISP between your client and Cloudflare.

It’s good to minimize last mile latency because after the data enters Cloudflare’s control, the underlying media can be managed carefully and routed through the Cloudflare backbone. This is crucial for WebRTC applications where millisecond delays can significantly impact user experience. To give you a sense of the latency between Cloudflare’s data centers and end users, about 95% of the Internet-connected population is within 50ms of a Cloudflare data center. As I write this, I am about 20ms away, but in the past, I have been lucky enough to be connected to a great home Wi-Fi network less than 1ms away in Manhattan. “But you are just one user!” you might be thinking, so here is a chart from Cloudflare Radar showing recent global latency measurements:

This setup creates more opportunities to answer lost packets with retransmissions from a location close to the user, and more opportunities for bandwidth adjustments.

Eliminating SFU region selection

A traditional challenge in WebRTC infrastructure involves the manual selection of Selective Forwarding Units (SFUs) based on geographic location to minimize latency. Some systems solve this problem by selecting a location for the SFU after the first user joins the “room”. This makes routing inefficient when the rest of the participants in the conversation are clustered elsewhere. The anycast architecture of Calls eliminates this issue. When a client initiates a connection, BGP dynamically determines the closest data center. Each selected server only becomes responsible for the PeerConnection of the clients closest to it.

One might see this as actually a simpler way of managing servers, as there is no need to maintain a layer of WebRTC load balancing for traffic or CPU capacity between servers. However, anycast has its own challenges, and we couldn’t take a laissez-faire approach.

Steps to establishing a PeerConnection

One of the challenging parts in assigning a server to a client PeerConnection is supporting dual stack networking for backwards compatibility with clients that only support the old version of the Internet Protocol, IPv4.

Cloudflare Calls uses a single IP address per protocol, and our L4 load balancer directs packets to a single server per client by hashing the 4-tuple {client IP, client port, destination IP, destination port}. This means that a dual-stack client’s ICE connectivity check packets arrive at different servers: one for IPv4 and one for IPv6.
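
As a rough illustration of 4-tuple hashing, the sketch below maps a flow to one server out of a list. The post does not describe the hash function the load balancer actually uses, so the simple polynomial hash, the `FourTuple` type, and the function names here are all assumptions made for the example.

```typescript
interface FourTuple {
  clientIp: string;
  clientPort: number;
  destinationIp: string;
  destinationPort: number;
}

// Illustrative 32-bit polynomial hash over the tuple (not the real algorithm).
function hashTuple(t: FourTuple): number {
  const key = [t.clientIp, t.clientPort, t.destinationIp, t.destinationPort].join("|");
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0; // keep the value in unsigned 32-bit range
  }
  return h;
}

// The same tuple always maps to the same server; a different tuple may not.
function pickServer(t: FourTuple, servers: string[]): string {
  return servers[hashTuple(t) % servers.length];
}
```

Because the client IP and destination IP are part of the key, the IPv4 and IPv6 flows of one dual-stack client are hashed independently and can land on different servers.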

ICE is not the only protocol used for WebRTC; there are also STUN and TURN for connectivity establishment. Actual media bits are encrypted using DTLS, which carries most of the data during a session.

DTLS packets don’t have any identifiers in them that would indicate they belong to a specific connection (unlike QUIC’s connection ID field), so every server should be able to handle DTLS packets and get the necessary certificates to be able to decrypt them for processing. DTLS encryption is negotiated at the SDP layer using the HTTPS API.

The HTTPS API for Calls also lands on a different server than DTLS and ICE connectivity checks. Since DTLS packets need information from the SDP exchanged using the HTTPS API, and ICE connectivity checks depend on the HTTPS API for userFragment and password fields in the connectivity check packets, it would be very useful for all of these to be available in one server. Yet in our setup, they’re not.

Fippo and Gustavo of WebRTCHacks complained (gracefully noted) about slow replies to ICE connectivity checks in their great article as they were digging into our WHIP implementation right around our announcement in 2022:

Looking at the Wireshark dumps we see a surprisingly large amount of time pass between the first STUN request and the first STUN response – it was 1.8 seconds in the screenshot below.

In other tests, it was shorter, but still 600ms long.

After that, the DTLS packets do not get an immediate response, requiring multiple attempts. This ultimately leads to a call setup time of almost three seconds – way above the global average of 800ms Fippo has measured previously (for the complete handshake, 200ms for the DTLS handshake). For Cloudflare with their extensive network, we expected this to be way below that average.

Gustavo and Fippo observed our solution to this problem of different parts of the WebRTC negotiation landing on different servers. Since Cloudflare Calls unbundles the WebRTC protocol to make the entire network act like a single computer, at this critical moment, we need to form consensus across the network. We form consensus by configuring every server to handle any incoming PeerConnection just in time. When a packet arrives, if the server doesn’t know about it, it quickly learns about the negotiated parameters from another server, such as the ufrag and the DTLS fingerprint from the SDP, and responds with the appropriate response.
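
The just-in-time idea can be sketched as a local table with a fallback lookup. This is a simplified, synchronous model: `EdgeServer`, `PeerLookup`, and the way servers find each other are hypothetical stand-ins, since the post does not describe the real mechanism.

```typescript
interface NegotiatedParams {
  ufrag: string;
  dtlsFingerprint: string;
}

// Hypothetical stand-in for asking another server in the network.
type PeerLookup = (ufrag: string) => NegotiatedParams | undefined;

class EdgeServer {
  private known: Record<string, NegotiatedParams> = {};
  private lookupFromPeer: PeerLookup;
  remoteLookups = 0;

  constructor(lookupFromPeer: PeerLookup) {
    this.lookupFromPeer = lookupFromPeer;
  }

  // Called on the server that handled the HTTPS offer/answer exchange.
  remember(params: NegotiatedParams): void {
    this.known[params.ufrag] = params;
  }

  // Called when a connectivity check carrying this ufrag arrives.
  paramsFor(ufrag: string): NegotiatedParams | undefined {
    if (Object.prototype.hasOwnProperty.call(this.known, ufrag)) {
      return this.known[ufrag];
    }
    this.remoteLookups += 1;
    const learned = this.lookupFromPeer(ufrag);
    if (learned !== undefined) {
      this.known[ufrag] = learned; // later packets are answered locally
    }
    return learned;
  }
}
```

Only the first packet for a connection pays for the remote lookup; every later packet is answered from the local table, which matches the behavior described above where subsequent STUN messages are answered quickly.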

Getting faster

Even though we’ve sped up the process of forming consensus across the Cloudflare network, any delays incurred can still have weird side effects. For example, up until a few months ago, delays of a few hundred milliseconds caused slow connections in Chrome.

A connectivity check packet delayed by a few hundred milliseconds signals to Chrome that this is a high latency network, even though every other STUN message after that was replied to in less than 5-10ms. Chrome thus delays sending a USE-CANDIDATE attribute in the responses for a few seconds, degrading the user experience.

Fortunately, Chrome also sends DTLS ClientHello before USE-CANDIDATE (behavior we’ve seen only on Chrome), so to help speed up Chrome, Calls uses DTLS packets in place of STUN packets with USE-CANDIDATE attributes.

After solving this issue with Chrome, PeerConnections globally now take about 100-250ms to get connected. This includes all consensus management, STUN packets, and a complete DTLS handshake.

Sessions and Tracks are the building blocks of Cloudflare’s SFU, not rooms

Once a PeerConnection is established to Cloudflare, we call this a Session. Many media Tracks or DataChannels can be published using a single Session, which returns a unique ID for each. These then can be subscribed to over any other PeerConnection anywhere around the world using the unique ID. The tracks can be published or subscribed anytime during the lifecycle of the PeerConnection.

In the background, Cloudflare takes care of scaling through a fan-out architecture with cascading trees that are unique per track. This structure works by creating a hierarchy of nodes where the root node distributes the stream to intermediate nodes, which then fan out to end-users. This significantly reduces the bandwidth required at the source and ensures scalability by distributing the load across the network. This simple but powerful architecture allows developers to build anything from 1:1 video calls to large 1:many or many:many broadcasting scenarios with Calls.
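
A back-of-the-envelope sketch of why a cascading tree reduces load at the source: if no node forwards more than a fixed number of copies, the number of forwarding levels grows only logarithmically with the audience. The per-node `fanout` limit is an assumption for illustration; the post does not state how Calls sizes its trees.

```typescript
// Forwarding levels needed so that no node sends more than `fanout` copies.
function cascadeLevels(subscribers: number, fanout: number): number {
  if (fanout < 2) {
    throw new Error("fanout must be at least 2");
  }
  if (subscribers <= 0) {
    return 0;
  }
  let levels = 1;
  let reachable = fanout; // receivers reachable with a single level
  while (reachable < subscribers) {
    reachable *= fanout;
    levels += 1;
  }
  return levels;
}

// Upper bound on the copies the root node (closest to the publisher) sends.
function maxCopiesSentByRoot(subscribers: number, fanout: number): number {
  return Math.min(Math.max(subscribers, 0), fanout);
}
```

With a fanout of 10, a track with 1,000 subscribers needs three levels, and the root still sends at most 10 copies instead of 1,000.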

There is no “room” concept in Cloudflare Calls. Each client can add as many tracks into a PeerConnection as they’d like. The limit is the bandwidth available between Cloudflare and the client, which is practically limited by the client side every time. The signaling or the concept of a “room” is left to the application developer, who can choose to pull as many tracks as they’d like from the tracks they have pushed elsewhere into a PeerConnection. This allows developers to move participants into breakout rooms and then back into a plenary room, and then 1:1 rooms while keeping the same PeerConnection and MediaTracks active.

Cloudflare offers an unopinionated approach to bandwidth management, allowing for greater control in customizing logic to suit your business needs. There is no active bandwidth management or restriction on the number of tracks. The WebRTC Stats API provides a standardized way to access data on packet loss and possible congestion, enabling you to incorporate client-side logic based on this information. For instance, if poor Wi-Fi connectivity leads to degraded service, your front-end could inform the user through a notice and automatically reduce the number of video tracks for that client.
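
As one possible shape for such client-side logic, the sketch below computes a loss fraction from two snapshots of inbound RTP counters and picks how many video tracks to keep. The `packetsReceived` and `packetsLost` fields mirror the counters reported for inbound RTP streams by the WebRTC Stats API (obtained in a browser through `RTCPeerConnection.getStats()`); the thresholds and function names are arbitrary assumptions, not recommendations from Cloudflare.

```typescript
interface InboundSnapshot {
  packetsReceived: number;
  packetsLost: number;
}

// Fraction of expected packets lost between two snapshots of the same stream.
function lossFraction(prev: InboundSnapshot, curr: InboundSnapshot): number {
  const lost = Math.max(0, curr.packetsLost - prev.packetsLost);
  const received = Math.max(0, curr.packetsReceived - prev.packetsReceived);
  const expected = lost + received;
  return expected > 0 ? lost / expected : 0;
}

// Illustrative policy: halve the video tracks on moderate loss, keep one on heavy loss.
function videoTracksToKeep(current: number, loss: number): number {
  if (loss > 0.1) {
    return Math.min(current, 1);
  }
  if (loss > 0.03) {
    return Math.min(current, Math.max(1, Math.floor(current / 2)));
  }
  return current;
}
```

An application would take a snapshot every few seconds, feed consecutive snapshots to `lossFraction`, and unsubscribe from tracks (or show a notice) when the result crosses its chosen threshold.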

“NACK shield” at the edge

The Internet can’t guarantee timely and orderly delivery of packets, leading to the necessity of retransmission mechanisms, particularly in protocols like TCP. This ensures data eventually reaches its destination, despite possible delays. Real-time systems, however, need special consideration of these delays. A packet that is delayed past its deadline for rendering on the screen is worthless, but a packet that is lost can be recovered if it can be retransmitted within a very short period of time, on the order of milliseconds. This is where NACKs come into play.

A WebRTC client receiving data constantly checks for packet loss. When one or more packets don’t arrive at the expected time or a sequence number discontinuity is seen on the receiving buffer, a special NACK packet is sent back to the source in order to ask for a packet retransmission.
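
The sequence-number check can be sketched as follows. RTP sequence numbers are 16-bit and wrap around at 65536; the function name and the rule of treating a gap of half the number space or more as an old or reordered packet are simplifying assumptions for this example, and real receivers also use timers and reordering buffers.

```typescript
const SEQ_SPACE = 65536; // RTP sequence numbers are 16 bits wide

// Sequence numbers to request in a NACK after receiving `newSeq`,
// given that `lastSeq` was the highest sequence number seen so far.
function missingSequenceNumbers(lastSeq: number, newSeq: number): number[] {
  const gap = (newSeq - lastSeq + SEQ_SPACE) % SEQ_SPACE;
  // gap === 1 is the in-order case; a huge gap means an old or reordered packet.
  if (gap === 0 || gap >= SEQ_SPACE / 2) {
    return [];
  }
  const missing: number[] = [];
  for (let i = 1; i < gap; i++) {
    missing.push((lastSeq + i) % SEQ_SPACE);
  }
  return missing;
}
```

For example, seeing 13 right after 10 yields a request for 11 and 12, and seeing 1 right after 65534 yields 65535 and 0 because of the wraparound.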

In a peer-to-peer topology, a source that receives a NACK packet has to retransmit the missing packets itself, and it has to do so for every participant. When an SFU is used, the SFU could send NACKs back to the source, or keep a complex buffer for each client to handle retransmissions.

This gets more complicated with Cloudflare Calls, since both the publisher and the subscriber connect to Cloudflare, likely to different servers and also probably in different locations. In addition, there is a possibility of other Cloudflare data centers in the middle, either through Argo, or just as part of scaling to many subscribers on the same track.

It is common for SFUs to backpropagate NACK packets back to the source, losing valuable time to recover packets. Calls goes beyond this and can handle NACK packets in the location closest to the user, which decreases overall latency. The latency advantage gives more chance for the packet to be recovered compared to a centralized SFU or no NACK handling at all.

Since there may be several Cloudflare data centers between clients, packet loss within the Cloudflare network is also possible. We handle this by generating NACK packets in the network. With each hop that the packets take, the receiving end can generate NACK packets. The missing packets are then either recovered at that hop or the NACK is backpropagated to the publisher so that they can be recovered there.
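
One way to picture handling NACKs close to the user is a small cache of recently forwarded packets at each hop. This is a sketch under assumptions: the class, the fixed-size ring buffer, and the `payload` string are invented for illustration and say nothing about how Calls stores packets internally.

```typescript
interface RtpPacket {
  seq: number;
  payload: string;
}

class RetransmitCache {
  private slots: (RtpPacket | undefined)[] = [];
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Remember a forwarded packet, overwriting the oldest one in its slot.
  store(packet: RtpPacket): void {
    this.slots[packet.seq % this.capacity] = packet;
  }

  // Answer what we can locally; pass the rest of the NACK toward the publisher.
  handleNack(seqs: number[]): { resend: RtpPacket[]; forwardUpstream: number[] } {
    const resend: RtpPacket[] = [];
    const forwardUpstream: number[] = [];
    for (let i = 0; i < seqs.length; i++) {
      const cached = this.slots[seqs[i] % this.capacity];
      if (cached !== undefined && cached.seq === seqs[i]) {
        resend.push(cached);
      } else {
        forwardUpstream.push(seqs[i]);
      }
    }
    return { resend: resend, forwardUpstream: forwardUpstream };
  }
}
```

Packets found in the cache are retransmitted from the nearby location right away; only the sequence numbers that are no longer cached travel further back toward the publisher.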

Cloudflare Calls does TURN over Anycast too

Separately from the SFU, Calls also offers a TURN service. TURN relays act as relay points for traffic between WebRTC clients like the browser and SFUs, particularly in scenarios where direct communication is obstructed by NATs or firewalls. TURN maintains an allocation of public IP addresses and ports for each session, ensuring connectivity even in restrictive network environments.

Cloudflare Calls’ TURN service supports a few ports to help with misbehaving middleboxes and firewalls:

  • TURN-over-UDP over port 3478 (standard), and also port 53
  • TURN-over-TCP over ports 3478 and 80
  • TURN-over-TLS over ports 5349 and 443

TURN works the same way as the Calls SFU: it is available over anycast and always connects you to the closest data center.
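
For reference, the port and transport combinations listed above can be written as TURN URIs in an ICE server entry. In this sketch the host name and credentials are placeholders supplied by the caller, not real Cloudflare values, and `IceServerConfig` is a local type that simply mirrors the shape of a browser ICE server entry. Port 3478 is the registered TURN port and 5349 the registered port for TURN over TLS.

```typescript
interface IceServerConfig {
  urls: string[];
  username: string;
  credential: string;
}

// Build one ICE server entry covering every transport and port option.
function turnServerFor(host: string, username: string, credential: string): IceServerConfig {
  return {
    urls: [
      "turn:" + host + ":3478?transport=udp",
      "turn:" + host + ":53?transport=udp",
      "turn:" + host + ":3478?transport=tcp",
      "turn:" + host + ":80?transport=tcp",
      "turns:" + host + ":5349?transport=tcp",
      "turns:" + host + ":443?transport=tcp",
    ],
    username: username,
    credential: credential,
  };
}
```

In a browser, the returned object could be passed in the `iceServers` array of the `RTCPeerConnection` constructor; the ICE agent then tries the candidates and uses whichever path the network allows.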

Pricing and how to get started

Cloudflare Calls is now in open beta and available in your Cloudflare Dashboard. Depending on your use case, you can set up an SFU application and/or a TURN service with only a few clicks.

To kick off its open beta phase, Calls is available at no cost for a limited time. Starting May 15, 2024, customers will receive the first terabyte each month for free, with any usage beyond that charged at $0.05 per real-time gigabyte. Beta customers will be provided at least 30 days to upgrade from the free beta to a paid subscription. Additionally, there are no charges for in-bound traffic to Cloudflare. For volume pricing, talk to your account manager.
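
The usage-based pricing above works out as simple arithmetic. The sketch below assumes that one terabyte is counted as 1,000 gigabytes and that only outbound traffic is billed; both the constant names and that unit conversion are assumptions, and it ignores the free beta period and any volume pricing.

```typescript
const FREE_GB_PER_MONTH = 1000; // assumption: 1 TB counted as 1,000 GB
const USD_CENTS_PER_GB = 5; // $0.05 per real-time gigabyte

// Estimated monthly charge in US dollars for a given amount of outbound traffic.
function monthlyChargeUsd(egressGb: number): number {
  const billableGb = Math.max(0, egressGb - FREE_GB_PER_MONTH);
  return Math.round(billableGb * USD_CENTS_PER_GB) / 100;
}
```

Under these assumptions, 3,000 GB of outbound traffic in a month leaves 2,000 GB billable, or $100, while anything up to 1,000 GB costs nothing.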

Cloudflare Calls is ideal if you are building new WebRTC apps. If you have existing SFUs or TURN infrastructure, you may still consider using Calls alongside your existing infrastructure. Building a bridge to Calls from other places is not difficult as Cloudflare Calls supports standard WebRTC APIs and acts like just another WebRTC peer.

We understand that getting started with a new platform is difficult, so we’re also open sourcing our internal video conferencing app, Orange Meets. Orange Meets supports small and large conference calls by maintaining room state in Workers Durable Objects. It has screen sharing, client-side noise-canceling, and background blur. It is written with TypeScript and React and is available on GitHub.

We’re hiring

We think the current state of Cloudflare Calls enables many use cases. Calls already supports publishing and subscribing to media tracks and DataChannels. Soon, it will support features like simulcasting.

But we’re just scratching the surface and there is so much more to build on top of this foundation.

If you are passionate about WebRTC (and other real-time protocols!!), the Media Platform team building the Calls product at Cloudflare is hiring and would love to talk to you.

What’s New in Rapid7 Products & Services: Q1 2024 in Review

Post Syndicated from Margaret Wei original https://blog.rapid7.com/2024/04/04/whats-new-in-rapid7-products-services-q1-2024-in-review/

We kicked off 2024 with a continued focus on bringing security professionals (which if you’re reading this blog, is likely you!) the tools and functionality needed to anticipate risks, pinpoint threats, and respond faster with confidence. Below we’ve highlighted some key releases and updates from this past quarter across Rapid7 products and services—including InsightCloudSec, InsightVM, InsightIDR, Rapid7 Labs, and our managed services.

Anticipate Imminent Threats Across Your Environment

Monitor, remediate, and take down threats with Managed Digital Risk Protection (DRP)

Rapid7’s new Managed Digital Risk Protection (DRP) service provides expert monitoring and remediation of external threats across the clear, deep, and dark web to prevent attacks earlier.

Now available in our highest tier of Managed Threat Complete and as an add-on for all other Managed D&R customers, Managed DRP extends your team with Rapid7 security experts to:

  • Identify the first signs of a cyber threat to prevent a breach
  • Rapidly remediate and take down threats to minimize exposure
  • Protect against ransomware data leakage, phishing, credential leakage, and data leakage, and provide dark web monitoring

Read more about the benefits of Managed DRP in our blog here.

Ensure safe AI development in the cloud with Rapid7 AI/ML Security Best Practices

We’ve recently expanded InsightCloudSec’s support for GenAI development and training services (including AWS Bedrock, Azure OpenAI Service and GCP Vertex) to provide more coverage so teams can effectively identify, assess, and quickly act to resolve risks related to AI/ML development.

This expanded generative AI coverage enriches our proprietary compliance pack, Rapid7 AI/ML Security Best Practices, which continuously assesses your environment through event-driven harvesting to ensure your team is safely developing with AI in a manner that won’t leave you exposed to common risks like data leakage, model poisoning, and more.

As with all critical resources connected to your InsightCloudSec environment, these risks are enriched with Layered Context to automatically prioritize AI/ML risk based on exploitability and potential impact. They’re also continuously monitored for effective permissions and actual usage, so that permissions can be rightsized to align with least-privilege access (LPA). In addition to this extensive visibility, InsightCloudSec offers native automation to alert on and even remediate risk across your environment without the need for human intervention.

Stay ahead of emerging threats with insights and guidance from Rapid7 Labs

In the first quarter of this year, Rapid7 initiated the Emergent Threat Response (ETR) process for 12 different threats, including (but not limited to):

  • Zero-day exploitation of Ivanti Connect Secure and Ivanti Pulse Secure gateways, the former of which has historically been targeted by both financially motivated and state-sponsored threat actors in addition to low-skilled attackers.
  • Critical CVEs affecting outdated versions of Atlassian Confluence and VMware vCenter Server, both widely deployed products in corporate environments that have been high-value targets for adversaries, including in large-scale ransomware campaigns.
  • High-risk authentication bypass and remote code execution vulnerabilities in ConnectWise ScreenConnect, widely used software with potential for large-scale ransomware attacks, providing coverage before CVE identifiers were assigned.
  • Two authentication bypass vulnerabilities in JetBrains TeamCity CI/CD server that were discovered by Rapid7’s research team.

Rapid7’s ETR program is a cross-team effort to deliver fast, expert analysis alongside first-rate security content for the highest-priority security threats to help you understand any potential exposure and act quickly to defend your network. Keep up with future ETRs on our blog here.

Pinpoint Critical and Actionable Insights to Effectively and Confidently Respond

Introducing the newest tier of Managed Threat Complete

Since we released Managed Threat Complete last year, organizations all over the globe have unified their vulnerability management programs with their threat detection and response programs. Now, teams have a unified view into the full kill chain and a tailored service to turbocharge their program, mitigate the most pressing risks and eliminate threats.

Managed Threat Complete Ultimate goes beyond our previously available Managed Threat Complete bundles to include:

  • Managed Digital Risk Protection for monitoring and remediation of threats across the clear, deep, and dark web
  • Managed Vulnerability Management for clear guidance to remediate the highest priority risk
  • Velociraptor, Rapid7’s leading open-source DFIR framework: from monitoring and hunting to in-depth investigations into potential threats, access the tool that our Incident Response experts leverage on behalf of our managed customers
  • Ransomware Prevention for recognizing threats and stopping attacks before they happen with multi-layered prevention (coming soon – stay tuned)

Get to the data you need faster with new Log Search and Investigation features in InsightIDR

Our latest enhancements to Log Search and Investigations will help drive efficiency for your team and give you time back in your day-to-day—and when you really need it in the heat of an incident. Faster search times, easier-to-write queries, and intuitive recommendations will help you find event trends within your data and save you time without sacrificing results.

  • Triage investigations faster with log data readily accessible from the investigations timeline – with a click of the new “view log entry” button you’ll instantly see the context and log data behind an associated alert.
  • Create precise queries quickly with new automatic suggestions – as you type in Log Search, the query bar will automatically suggest the elements of LEQL that you can use in your query to get to the data you need—like users, IP addresses, and processes—faster.
  • Save time sifting through search results with new LEQL ‘select’ clause – define exactly what keys to return in the search results so you can quickly answer questions from log data and avoid superfluous information.

Easily view vital cloud alert context with Simplified Cloud Threat Alerts

This quarter we launched Simplified Cloud Threat Alerts within InsightIDR to make it easier to quickly understand what a cloud alert – like those from AWS GuardDuty – means, which can be a daunting task for even the most experienced analysts due to the scale and complexity of cloud environments.

With this new feature, you can view details and known issues with the resources (e.g. assets, users, etc.) implicated in the alert and have clarity on the steps that should be taken to appropriately respond to the alert. This will help you:

  • Quickly understand what a given cloud resource is, its intended purpose, what applications it supports and who “owns” it.
  • Get a clear picture around what an alert means, what next steps to take to verify the alert, or how to respond if the alert is in fact malicious.
  • Prioritize response efforts based on potential impact with insight into whether or not the compromised resource is misconfigured, has active vulnerabilities, or has been recently updated in a manner that signals potential pre-attack reconnaissance.

A growing library of actionable detections in InsightIDR

In Q1 2024 we added 1,349 new detection rules. See them in-product or visit the Detection Library for descriptions and recommendations.

Stay tuned!

As always, we’re continuing to work on exciting product enhancements and releases throughout the year. Keep an eye on our blog and release notes as we continue to highlight the latest in product and service investments at Rapid7.

Surveillance by the New Microsoft Outlook App

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/04/surveillance-by-the-new-microsoft-outlook-app.html

The ProtonMail people are accusing Microsoft’s new Outlook for Windows app of conducting extensive surveillance on its users. It shares data with advertisers, a lot of data:

The window informs users that Microsoft and those 801 third parties use their data for a number of purposes, including to:

  • Store and/or access information on the user’s device
  • Develop and improve products
  • Personalize ads and content
  • Measure ads and content
  • Derive audience insights
  • Obtain precise geolocation data
  • Identify users through device scanning

Commentary.

Careers in computer science: Two perspectives

Post Syndicated from Dan Fisher original https://www.raspberrypi.org/blog/careers-in-computer-science-two-perspectives/

As educators, it’s important that we showcase the wide range of career opportunities available in the field of computing, not only to inspire learners, but also to help them feel sure they’re choosing to study a subject that is useful for their future. For example, a survey from the BBC in September 2023 found that more than a quarter of UK teenagers often feel anxious, with “exams and school life” among the main causes. To help young people chart their career paths, we recently hosted two live webinars for National Careers Week in the UK.

Our goal for the webinars was to highlight the breadth of careers within computing and to provide insights from professionals who are pursuing their own diverse and rewarding paths. Each webinar featured engaging discussions and an interactive Q&A session with learners who use our Ada Computer Science platform. The learners could ask their own questions to get firsthand knowledge and perspectives from our guest speakers.

Our guest speakers

Jess Van Brummelen is a Human–Computer Interaction Research Scientist at Niantic, the video games company behind augmented reality game Pokémon Go. After developing an interest in programming during her undergraduate degree in mechanical engineering, she went on to complete a Master’s degree and PhD in computer science at MIT.

Ashley Edwards is a Senior Research Scientist at Google DeepMind, working on reinforcement learning. She received her PhD in 2019 from Georgia Tech, spent time as an intern at Google Brain, and worked as a research scientist at Uber AI Labs.

You can read extracts from our interviews with Jess and Ashley and watch the full videos below. Teachers have contacted us to say they’ll be using the webinars for careers-focused sessions with their students. We hope you will do the same!

Please note that we have edited the extracts below to add clarity.

Jess Van Brummelen

Hi Jess. What advice would you give to a student who is thinking about a career in human–computer interaction in the gaming industry?

In terms of HCI and gaming, I’d actually recommend that you keep gaming! It’s a small part of my job but it’s really important to understand what’s fun and enjoyable in games. Not only that; gaming can be great for learning to problem-solve — there’s been all sorts of research on the positive impact of gaming.

A second thing, going back to how I felt in my mechanical engineering classes, I really felt like an ‘other’ and not someone who is the standard computer scientist or engineer. I would encourage students to pursue their dreams anyway because it’s so important to have diversity in these types of careers, especially technology, because it goes out to so many different people and it can really affect society. It’s really important that the people who make it come from many different backgrounds and cultures so we can create technology that is better for everyone.

[From Owen, a student on the livestream] What’s the most impossible idea you’ve come up with while working at Niantic?

I’m currently publishing a paper addressing the question, ‘Can we guide people without using anything visual on their phone?’ That means using audio and haptic (technology that transmits information via touch, e.g. vibrations) prompts instead. We tried out different commands where the phone said ‘turn left’ and ‘turn right’, but we really wanted to test how to guide someone more specifically in a game environment. For example, if there was a hidden object on a wall in a game that a person couldn’t see, could we guide them to that object while they’re walking? So I ran a study where I guided people to scan a statue by moving around it. Scanning is the process of using the camera on your phone to scan an object in real life, which is then reconstructed on your phone. Scanning objects can trigger other augmented reality experiences within a game. For example, you might scan a real-life box in a room and this might trigger an animation of that box opening to reveal a secret within the game. We tested a lot of different things. For example, test subjects listened to music as they were walking and when they were on the right path, the music sounded really good. But when they were off the path, it sounded terrible. So it helped them to look for the right path. Then if you were pointing the phone in the wrong direction for scanning objects, you would get warning vibrations on the phone. So we did the study and we were hoping it would improve safety. It turns out it was neutral on improving safety — I think this is because it was such a novel system. People weren’t used to using it and still bumped into things! But it did make people better at scanning the objects, which was interesting.

Watch Jess’s full interview:

Ashley Edwards

Hi Ashley. Is there something you studied in school that you found to be more useful now than you ever thought it would be?

Maths! I always enjoyed doing maths, but I didn’t realise I would need it as a computer scientist. You see it popping up all the time, especially in machine learning. Having a strong knowledge of calculus and linear algebra is really helpful.

How do you train an AI model using machine learning?

You start by asking the question, ‘What is the problem I’m trying to solve?’ Then typically you need input data and the outputs you want to achieve, so you ask two more questions, ‘What data do I want to come in?’ and ‘What do I want to come out?’ Let’s say you decide to use a supervised learning model (a category of machine learning where labelled data sets are used to train algorithms to detect patterns and predict outcomes) to predict whether a photo contains a cat. You train the model using a giant set of images with labels that say either ‘This is a cat’ or ‘This isn’t a cat’. By training the model with the images, you get to a point where your model can analyse the features of any image and predict whether it contains a cat or not.
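As a rough illustration of the cat example above, the following Python sketch trains on a handful of labelled examples and then predicts a label for a new one. It is a toy under stated assumptions, not how a real image classifier works: each "image" is reduced to two made-up numbers (think ear pointiness and whisker length), and the nearest-centroid rule and the function names `train` and `predict` are chosen here purely for illustration.

```python
# Toy supervised learning: learn from labelled examples, then predict a new one.
# The features and numbers below are invented for illustration only.

def train(examples):
    """Average the feature vectors of each label (a nearest-centroid model)."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}


def predict(model, features):
    """Return the label whose average example is closest to the input."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(model, key=lambda label: squared_distance(model[label]))


# Labelled training data: (features, label) pairs.
labelled_images = [
    ((0.9, 0.8), "cat"),
    ((0.8, 0.9), "cat"),
    ((0.2, 0.1), "not a cat"),
    ((0.1, 0.3), "not a cat"),
]

model = train(labelled_images)
prediction = predict(model, (0.85, 0.7))  # an unseen example
print(prediction)
```

The labels do the teaching here: the model never sees a rule for what a cat is, only examples of what is and is not one.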

In my field of research, I work on something called reinforcement learning, which is where you train your model through trial and error and the use of ‘rewards’. Let’s imagine we are trying to train a robot. We might write a program that tells the robot, ‘I am going to give you a reward if you take the right step forward and it’s going to be a positive reward. If you fall over, I’m going to give you a negative reward.’ So you train the robot to prioritise the right behaviours to optimise the rewards it’s getting.
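The reward idea can be sketched with a very small program. The example below uses tabular Q-learning, one common reinforcement learning method, on an invented five-square corridor; the corridor, the reward values, and all names are assumptions made for illustration and are not taken from Ashley’s research. Reaching the right-hand end earns a positive reward, and stepping off the left-hand end (falling over) earns a negative one.

```python
import random

# Toy reinforcement learning (tabular Q-learning) on a five-square corridor.
# Square 0 means the robot fell over (reward -1); square 4 is the goal (reward +1).
LEFT, RIGHT = 0, 1
FALL, GOAL, START = 0, 4, 2


def step(state, action):
    """Move one square and return (next_state, reward, finished)."""
    next_state = state + 1 if action == RIGHT else state - 1
    if next_state == GOAL:
        return next_state, 1.0, True
    if next_state == FALL:
        return next_state, -1.0, True
    return next_state, 0.0, False


def train(episodes=2000, alpha=0.5, gamma=0.9, seed=0):
    """Learn q[state][action], the expected reward of each move, by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(5)]
    for _ in range(episodes):
        state, finished = START, False
        while not finished:
            action = rng.choice((LEFT, RIGHT))  # explore by trying moves at random
            next_state, reward, finished = step(state, action)
            future = 0.0 if finished else max(q[next_state])
            # Nudge the estimate towards the reward plus discounted future value.
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
    return q


q_table = train()
# Preferred move in each non-terminal square (1, 2 and 3).
policy = ["right" if q[RIGHT] > q[LEFT] else "left" for q in q_table[1:4]]
print(policy)
```

After training, the learned values favour stepping right in every square, which is the behaviour the rewards were designed to encourage.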

[From a student] Will I still need to learn to code in the future?

I think it is going to be very different in the future, but we’ll still need to learn how to build different types of algorithms and we’re going to need to understand the concepts behind coding as well. We’ll still need to ask questions like, ‘What is it that I want to build?’ and ‘Is this actually doing the correct thing?’

Watch Ashley’s full interview:

Broadening access

Jess and Ashley are forging successful careers through a combination of smart choices, hard work, talent, and a passion for technology. They also had access to opportunities to discover that passion and receive an education in this field. Too many young people around the world still don’t have these opportunities.

That is why we provide free resources and training to help schools broaden access to computing education. For example, our free learning platform, Ada Computer Science, provides students aged 14 to 19 with high-quality computing resources and interactive questions, written by experts from our team. To learn more, visit adacomputerscience.org.

The post Careers in computer science: Two perspectives appeared first on Raspberry Pi Foundation.

Grades: the (un)necessary evil

Post Syndicated from original https://www.toest.bg/otsenkite-ne-nuzhnoto-zlo/

Many of the topics concerning the education system and its modernisation are only now creeping into Bulgaria. Mixing different ages in one class, block scheduling of lessons for deeper learning, moving to a semi-digital environment with electronic textbooks: these are all things we hear about but do not see in action. Against the backdrop of this stagnation and reluctance to modernise, talking about grades in school is nothing short of a revolution, but this discussion has to start at some point.

In different countries, grades can not only be expressed differently (the Bulgarian 6 is an A in the United States) but also be divided into different levels: the equivalent of our supposedly six-point system in France, for example, is a 20-point system with a much more detailed gradation of assessment. What all these different systems have in common, however, are the quantitative and qualitative indicators contained in every grade.

Whether grades are written as letters or numbers does not matter: they serve to streamline communication between educational institutions, or between their parts and participants. What they do not do, however, is serve the individual learner and contribute to their progress. That is because an education system based on comparison and competition looks only for the quantitative expression of achievement and completely excludes the qualitative.

The current state

Grades and grading methods are yet another instrument that has not changed since at least our parents’ time. If you have a school-age child, you can ask them how they are examined.

Still, in 2024, we have a fair number of authoritarian teachers who stand the student next to their desk and treat them as if they were under interrogation.

Still, the (sometimes justified) helplessness of teachers is expressed in the classic punishment: ‘Take out a double sheet of paper’ (as if anyone could ever fill two sheets while stunned by the abrupt interruption of some fun in class).

Still, teachers are only human, and it is entirely natural for them to be subjective and to favour students they like, as well as to find it hard to acknowledge even a minimal positive change in a student they do not like as much.

However many grading regulations are drawn up and introduced, such as Ordinance No. 11 of 1 September 2016 on the assessment of students’ learning outcomes, none of them will be able to reduce the damage caused by subjective grading, by the chase for high marks (by parents and teachers), and by the endless private lessons and courses aimed at getting yet another child into a ‘good’ school.

If we nevertheless resign ourselves to the inevitability of grades, it is worth recognising that we have a six-point system that is not really six-point, because its grades run from poor (2) to excellent (6). In this system the 2 is most often used as a punishment, not as a tool that triggers established support mechanisms. In reality, instead of a 2 and help, the student gets a 3, to keep student numbers up and to meet quotas. The student learns that there is no one to help them, but at the same time learns that whatever they do, they will simply move on. At least until 12th grade; what comes after that need not be thought about.

I will quite deliberately not discuss the madness of the national external assessment, because it deserves a series of analyses of each of its participants, from its creators to the last worn-out parent who has filled a few more pockets in the private tutoring sector and drained the last reserves of desire and curiosity from a child who still has at least five more years of secondary school ahead.

If we had to sum up the present in two words, they would be unfair and coercive.

How it could be

Ever since grades have existed, they have been considered a tool for motivating the learner to perform better, and it is assumed that without them there would be no incentive to achieve goals. Scientifically speaking, however, there is vanishingly little evidence that grades make students more hard-working or make them study more. By contrast, there is a huge amount of evidence that grades themselves hinder successful performance in examinations and negatively affect the motivation to learn.

Psychology has long established that if we experience defeat and negative feelings in a given situation, we will consciously and unconsciously try to avoid ending up in it. So, for example, if every examination in subject X involves being humiliatingly called up in front of the whole class and a display of the teacher’s position of power, and at the same time we have a student who struggles with that subject, we should not be surprised when this student cannot keep up and gradually starts avoiding not only studying the subject but the lessons themselves.

For an answer we turn to motivational psychology, which distinguishes between intrinsic and extrinsic motivation. Grades are extrinsic motivation: every grade puts a label on us and assigns us the place we have ‘earned’. Far more important, however, is intrinsic motivation, which has three main building blocks: autonomy, competence, and relatedness.

Autonomy is the right to choose, the freedom to take control. At present every student is an object of their own education; they take no part in shaping any of the tasks they have to complete. Even independent assignments, such as a presentation or a project, are distorted and used as a tool for raising that same grade, and students know it. If we let children choose their own topic and go deeper in the direction that draws them out of personal interest rather than for the sake of achievement, we will accomplish much more than good marks.

By competence we mean here the opportunity to acquire new skills. Do 12 years of uniform memorisation of predetermined content develop new skills? Perhaps, especially skills for dodging boredom by any means. Teaching children to communicate, to cooperate, and to think critically are only the first steps. It is far more important that skills are developed; then knowledge will find its own way.

Relatedness is the emotional part, the feeling of being part of something. Each of us remembers that one teacher who inspired us to discover more about ourselves and the world and opened a new door to the future. Being accepted and respected as an individual always leads to a desire to respond in kind. Putting children above their grades is the support they need. And there is nothing wrong with the planned lesson occasionally being dropped because of a topic that has strongly affected the children and that they need to talk about.

A grade is only an expression of a momentary state. It does not show how good we are, whether we are smart, or whether we are successful. It shows who we are in that particular hour: distracted, anxious, and panicked, or calm, focused, and supported. More important than the grade are the ability and the desire to learn throughout our lives without seeing it as a burden.

And since giving grades is unavoidable, let it at least come with feedback. Respectful, humane, and individual feedback in one or a few sentences. When necessary, in conversation too.

The English word education, which covers both schooling and upbringing, comes from the Latin educo, educare, which among other things also means ‘to rear’. Rearing is among the processes we go through in order to become full members of the world around us, so that one day we can independently take responsibility for ourselves and others and uncover and develop our potential. The often-quoted proverb that it takes a whole village to raise a child is actually true. Besides their parents, a child relies on the other members of their family, on teachers (in kindergarten, in primary school, and at the later stages), on coaches, on the instructors who support each of their interests, on professors, on colleagues, managers, and many, many more.

Is it not absurd that we still believe education should consist of years of pouring facts over children who cannot wait to discover the world and their place in it, and of checking under a stern gaze how well those facts have been memorised? Can putting a label on every child motivate them, and if so, how and how much? For this we do not even need a sweeping reform, just more humanity. Because good grades and a successful life do not necessarily go hand in hand.

When we see people as they are, we make them worse. But if we treat them as though they were what they ought to be, we bring them to where they ought to go.

Johann Wolfgang von Goethe


The world is changing at breakneck speed. The professions in which the generations now starting their education will work have not yet been invented. Is our education system prepared to meet these challenges? What can and must change? And how?

Once a month, in the column ‘Възможното образование’ (‘The Possible Education’), we will talk about change as we would like to see it, about good examples, and about the directions in which the Bulgarian education system might do well to look.

This is the Astera Labs Aries 6 PCIe Gen6 and CXL Retimer

Post Syndicated from Cliff Robinson original https://www.servethehome.com/this-is-the-astera-labs-aries-6-pcie-gen6-and-cxl-retimer/

We saw the Astera Labs Aries 6 PCIe Gen6 and CXL retimer running at NVIDIA GTC 2024 and looking great compared to Broadcom’s planned chip

The post This is the Astera Labs Aries 6 PCIe Gen6 and CXL Retimer appeared first on ServeTheHome.

Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions

Post Syndicated from Andrea Filippo La Scola original https://aws.amazon.com/blogs/big-data/amazon-datazone-now-integrates-with-aws-glue-data-quality-and-external-data-quality-solutions/

Today, we are pleased to announce that Amazon DataZone is now able to present data quality information for data assets. This information empowers end-users to make informed decisions as to whether or not to use specific assets.

Many organizations already use AWS Glue Data Quality to define and enforce data quality rules on their data, validate data against predefined rules, track data quality metrics, and monitor data quality over time using artificial intelligence (AI). Other organizations monitor the quality of their data through third-party solutions.

Amazon DataZone now integrates directly with AWS Glue to display data quality scores for AWS Glue Data Catalog assets. Additionally, Amazon DataZone now offers APIs for importing data quality scores from external systems.

In this post, we discuss the latest features of Amazon DataZone for data quality, the integration between Amazon DataZone and AWS Glue Data Quality and how you can import data quality scores produced by external systems into Amazon DataZone via API.

Challenges

One of the most common questions we get from customers is related to displaying data quality scores in the Amazon DataZone business data catalog to let business users have visibility into the health and reliability of the datasets.

As data becomes increasingly crucial for driving business decisions, Amazon DataZone users are keenly interested in providing the highest standards of data quality. They recognize the importance of accurate, complete, and timely data in enabling informed decision-making and fostering trust in their analytics and reporting processes.

Amazon DataZone data assets can be updated at varying frequencies. As data is refreshed and updated, changes can happen through upstream processes that put it at risk of not maintaining the intended quality. Data quality scores help you understand if data has maintained the expected level of quality for data consumers to use (through analysis or downstream processes).

From a producer’s perspective, data stewards can now set up Amazon DataZone to automatically import the data quality scores from AWS Glue Data Quality (scheduled or on demand) and include this information in the Amazon DataZone catalog to share with business users. Additionally, you can now use new Amazon DataZone APIs to import data quality scores produced by external systems into the data assets.

With the latest enhancement, Amazon DataZone users can now accomplish the following:

  • Access insights about data quality standards directly from the Amazon DataZone web portal
  • View data quality scores on various KPIs, including data completeness, uniqueness, and accuracy
  • Get a holistic view of the quality and trustworthiness of their data

In the first part of this post, we walk through the integration between AWS Glue Data Quality and Amazon DataZone. We discuss how to visualize data quality scores in Amazon DataZone, enable AWS Glue Data Quality when creating a new Amazon DataZone data source, and enable data quality for an existing data asset.

In the second part of this post, we discuss how you can import data quality scores produced by external systems into Amazon DataZone via API. In this example, we use Amazon EMR Serverless in combination with the open source library Pydeequ to act as an external system for data quality.

Visualize AWS Glue Data Quality scores in Amazon DataZone

You can now visualize AWS Glue Data Quality scores in data assets that have been published in the Amazon DataZone business catalog and that are searchable through the Amazon DataZone web portal.

If the asset has AWS Glue Data Quality enabled, you can now quickly visualize the data quality score directly in the catalog search pane.

By selecting the corresponding asset, you can understand its content through the readme, glossary terms, and technical and business metadata. Additionally, the overall quality score indicator is displayed in the Asset Details section.

A data quality score serves as an overall indicator of a dataset’s quality, calculated based on the rules you define.

On the Data quality tab, you can access the details of data quality overview indicators and the results of the data quality runs.

The indicators shown on the Overview tab are calculated based on the results of the rulesets from the data quality runs.

Each rule is assigned an attribute that contributes to the calculation of the indicator. For example, rules that have the Completeness attribute will contribute to the calculation of the corresponding indicator on the Overview tab.

To filter data quality results, choose the Applicable column dropdown menu and choose your desired filter parameter.

You can also visualize column-level data quality starting on the Schema tab.

When data quality is enabled for the asset, the data quality results become available, providing insightful quality scores that reflect the integrity and reliability of each column within the dataset.

When you choose one of the data quality result links, you’re redirected to the data quality detail page, filtered by the selected column.

Data quality historical results in Amazon DataZone

Data quality can change over time for many reasons:

  • Data formats may change because of changes in the source systems
  • As data accumulates over time, it may become outdated or inconsistent
  • Data quality can be affected by human errors in data entry, data processing, or data manipulation

In Amazon DataZone, you can now track data quality over time to confirm reliability and accuracy. By analyzing the historical report snapshot, you can identify areas for improvement, implement changes, and measure the effectiveness of those changes.

Enable AWS Glue Data Quality when creating a new Amazon DataZone data source

In this section, we walk through the steps to enable AWS Glue Data Quality when creating a new Amazon DataZone data source.

Prerequisites

To follow along, you should have a domain for Amazon DataZone, an Amazon DataZone project, and a new Amazon DataZone environment (with a DataLakeProfile). For instructions, refer to Amazon DataZone quickstart with AWS Glue data.

You also need to define and run a ruleset against your data, which is a set of data quality rules in AWS Glue Data Quality. To set up the data quality rules and for more information on the topic, refer to the following posts:

After you create the data quality rules, make sure that Amazon DataZone has the permissions to access the AWS Glue database managed through AWS Lake Formation. For instructions, see Configure Lake Formation permissions for Amazon DataZone.

In our example, we have configured a ruleset against a table containing patient data within a healthcare synthetic dataset generated using Synthea. Synthea is a synthetic patient generator that creates realistic patient data and associated medical records that can be used for testing healthcare software applications.

The ruleset contains 27 individual rules (one of them failing), so the overall data quality score is 96%.

If you use Amazon DataZone managed policies, there is no action needed because these will get automatically updated with the needed actions. Otherwise, you need to allow Amazon DataZone to have the required permissions to list and get AWS Glue Data Quality results, as shown in the Amazon DataZone user guide.

Create a data source with data quality enabled

In this section, we create a data source and enable data quality. You can also update an existing data source to enable data quality. We use this data source to import metadata information related to our datasets. Amazon DataZone will also import data quality information related to the (one or more) assets contained in the data source.

  1. On the Amazon DataZone console, choose Data sources in the navigation pane.
  2. Choose Create data source.
  3. For Name, enter a name for your data source.
  4. For Data source type, select AWS Glue.
  5. For Environment, choose your environment.
  6. For Database name, enter a name for the database.
  7. For Table selection criteria, choose your criteria.
  8. Choose Next.
  9. For Data quality, select Enable data quality for this data source.

If data quality is enabled, Amazon DataZone will automatically fetch data quality scores from AWS Glue at each data source run.

  10. Choose Next.

Now you can run the data source.

While running the data source, Amazon DataZone imports the last 100 AWS Glue Data Quality run results. This information is now visible on the asset page and will be visible to all Amazon DataZone users after publishing the asset.

Enable data quality for an existing data asset

In this section, we enable data quality for an existing asset. This might be useful for users that already have data sources in place and want to enable the feature afterwards.

Prerequisites

To follow along, you should have already run the data source and produced an AWS Glue table data asset. Additionally, you should have defined a ruleset in AWS Glue Data Quality over the target table in the Data Catalog.

For this example, we ran the data quality job multiple times against the table, producing the related AWS Glue Data Quality scores, as shown in the following screenshot.

Import data quality scores into the data asset

Complete the following steps to import the existing AWS Glue Data Quality scores into the data asset in Amazon DataZone:

  1. Within the Amazon DataZone project, navigate to the Inventory data pane and choose the data source.

If you choose the Data quality tab, you can see that there’s still no information on data quality because AWS Glue Data Quality integration is not enabled for this data asset yet.

  2. On the Data quality tab, choose Enable data quality.
  3. In the Data quality section, select Enable data quality for this data source.
  4. Choose Save.

Now, back on the Inventory data pane, you can see a new tab: Data quality.

On the Data quality tab, you can see data quality scores imported from AWS Glue Data Quality.

Ingest data quality scores from an external source using Amazon DataZone APIs

Many organizations already use systems that calculate data quality by performing tests and assertions on their datasets. Amazon DataZone now supports importing third-party originated data quality scores via API, allowing users that navigate the web portal to view this information.

In this section, we simulate a third-party system pushing data quality scores into Amazon DataZone via APIs through Boto3 (Python SDK for AWS).

For this example, we use the same synthetic dataset as earlier, generated with Synthea.

The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

  1. Read a dataset of patients in Amazon Simple Storage Service (Amazon S3) directly from Amazon EMR using Spark.

The dataset is created as a generic S3 asset collection in Amazon DataZone.

  2. In Amazon EMR, perform data validation rules against the dataset.
  3. The metrics are saved in Amazon S3 to have a persistent output.
  4. Use Amazon DataZone APIs through Boto3 to push custom data quality metadata.
  5. End-users can see the data quality scores by navigating to the data portal.

Prerequisites

We use Amazon EMR Serverless and Pydeequ to run a fully managed Spark environment. To learn more about Pydeequ as a data testing framework, see Testing Data quality at scale with Pydeequ.

To allow Amazon EMR to send data to the Amazon DataZone domain, make sure that the IAM role used by Amazon EMR has the permissions to do the following:

  • Read from and write to the S3 buckets
  • Call the post_time_series_data_points action for Amazon DataZone:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Statement1",
                "Effect": "Allow",
                "Action": [
                    "datazone:PostTimeSeriesDataPoints"
                ],
                "Resource": [
                    "<datazone_domain_arn>"
                ]
            }
        ]
    }

Make sure that you added the EMR role as a project member in the Amazon DataZone project. On the Amazon DataZone console, navigate to the Project members page and choose Add members.

Add the EMR role as a contributor.

Ingest and analyze PySpark code

In this section, we analyze the PySpark code that we use to perform data quality checks and send the results to Amazon DataZone. You can download the complete PySpark script.

To run the script entirely, you can submit a job to EMR Serverless. The service will take care of scheduling the job and automatically allocating the resources needed, enabling you to track the job run statuses throughout the process.

You can submit a job to EMR from the Amazon EMR console using EMR Studio, or programmatically using the AWS CLI or one of the AWS SDKs.
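As a minimal sketch of the SDK route, the following Python code builds the arguments for the EMR Serverless StartJobRun operation and submits them through the Boto3 `emr-serverless` client. The helper names `build_job_run_request` and `submit_job`, the job name, and every placeholder value are assumptions made for illustration. The sketch also assumes that the application already exists and that the Pydeequ dependencies are available to the job, neither of which it sets up.

```python
def build_job_run_request(application_id, execution_role_arn, script_uri,
                          input_path, output_location,
                          name="patients-data-validation"):
    """Build keyword arguments for the EMR Serverless StartJobRun call."""
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "name": name,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": script_uri,
                # The script reads these as sys.argv[1] and sys.argv[2].
                "entryPointArguments": [input_path, output_location],
            }
        },
    }


def submit_job(request, region_name):
    """Submit the job run; requires AWS credentials and an existing application."""
    import boto3  # imported here so the builder above works without Boto3 installed

    client = boto3.client("emr-serverless", region_name=region_name)
    return client.start_job_run(**request)["jobRunId"]


# Placeholder values; replace them with your own resources.
request = build_job_run_request(
    application_id="<emr_serverless_application_id>",
    execution_role_arn="<emr_execution_role_arn>",
    script_uri="s3://<bucket_name>/scripts/<script_name>.py",
    input_path="s3://<bucket_name>/patients/patients.csv",
    output_location="s3://<bucket_name>/output/",
)
print(request["jobDriver"]["sparkSubmit"]["entryPointArguments"])

# To submit for real:
# job_run_id = submit_job(request, region_name="<domain_region>")
```

Keeping the request builder separate from the call makes it possible to inspect or log the arguments before anything is submitted.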

In Apache Spark, a SparkSession is the entry point for interacting with DataFrames and Spark’s built-in functions. The script starts by initializing a SparkSession:

import sys
import pydeequ
from pyspark.sql import SparkSession

with SparkSession.builder.appName("PatientsDataValidation") \
        .config("spark.jars.packages", pydeequ.deequ_maven_coord) \
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord) \
        .getOrCreate() as spark:

We read a dataset from Amazon S3. For increased modularity, you can use the script input to refer to the S3 path:

s3inputFilepath = sys.argv[1]
s3outputLocation = sys.argv[2]

df = spark.read.format("csv") \
            .option("header", "true") \
            .option("inferSchema", "true") \
            .load(s3inputFilepath) #s3://<bucket_name>/patients/patients.csv

Next, we set up a metrics repository. This can be helpful to persist the run results in Amazon S3.

from pydeequ.repository import FileSystemMetricsRepository
metricsRepository = FileSystemMetricsRepository(spark, s3_write_path)

Pydeequ allows you to create data quality rules using the builder pattern, a well-known software engineering design pattern, chaining instructions to instantiate a VerificationSuite object:

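from pydeequ.checks import Check, CheckLevel
from pydeequ.repository import ResultKey
from pydeequ.verification import VerificationResult, VerificationSuite

key_tags = {'tag': 'patient_df'}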
resultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags)

check = Check(spark, CheckLevel.Error, "Integrity checks")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .useRepository(metricsRepository) \
    .addCheck(
        check.hasSize(lambda x: x >= 1000) \
        .isComplete("birthdate")  \
        .isUnique("id")  \
        .isComplete("ssn") \
        .isComplete("first") \
        .isComplete("last") \
        .hasMin("healthcare_coverage", lambda x: x == 1000.0)) \
    .saveOrAppendResult(resultKey) \
    .run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()

The following is the output for the data validation rules:

+----------------+-----------+------------+----------------------------------------------------+-----------------+----------------------------------------------------+
|check           |check_level|check_status|constraint                                          |constraint_status|constraint_message                                  |
+----------------+-----------+------------+----------------------------------------------------+-----------------+----------------------------------------------------+
|Integrity checks|Error      |Error       |SizeConstraint(Size(None))                          |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(birthdate,None))|Success          |                                                    |
|Integrity checks|Error      |Error       |UniquenessConstraint(Uniqueness(List(id),None))     |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(ssn,None))      |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(first,None))    |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(last,None))     |Success          |                                                    |
|Integrity checks|Error      |Error       |MinimumConstraint(Minimum(healthcare_coverage,None))|Failure          |Value: 0.0 does not meet the constraint requirement!|
+----------------+-----------+------------+----------------------------------------------------+-----------------+----------------------------------------------------+

At this point, we want to insert these data quality values in Amazon DataZone. To do so, we use the post_time_series_data_points function in the Boto3 Amazon DataZone client.

The PostTimeSeriesDataPoints DataZone API allows you to insert new time series data points for a given asset or listing, without creating a new revision.

At this point, you might also want to have more information on which fields are sent as input for the API. You can use the APIs to obtain the specification for Amazon DataZone form types; in our case, it’s amazon.datazone.DataQualityResultFormType.

You can also use the AWS CLI to invoke the API and display the form structure:

aws datazone get-form-type --domain-identifier <your_domain_id> --form-type-identifier amazon.datazone.DataQualityResultFormType --region <domain_region> --output text --query 'model.smithy'

This output helps identify the required API parameters, including fields and value limits:

$version: "2.0"
namespace amazon.datazone
structure DataQualityResultFormType {
    @amazon.datazone#timeSeriesSummary
    @range(min: 0, max: 100)
    passingPercentage: Double
    @amazon.datazone#timeSeriesSummary
    evaluationsCount: Integer
    evaluations: EvaluationResults
}
@length(min: 0, max: 2000)
list EvaluationResults {
    member: EvaluationResult
}

@length(min: 0, max: 20)
list ApplicableFields {
    member: String
}

@length(min: 0, max: 20)
list EvaluationTypes {
    member: String
}

enum EvaluationStatus {
    PASS,
    FAIL
}

string EvaluationDetailType

map EvaluationDetails {
    key: EvaluationDetailType
    value: String
}

structure EvaluationResult {
    description: String
    types: EvaluationTypes
    applicableFields: ApplicableFields
    status: EvaluationStatus
    details: EvaluationDetails
}
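To catch mistakes before calling the API, you could check a payload against the limits in this model on the client side. The following is a minimal sketch, not an official validator: the function name `validate_data_quality_form` and the example payload are assumptions, and the checks cover only the range, length, and enum constraints visible in the Smithy output above.

```python
def validate_data_quality_form(form):
    """Return a list of problems found in a DataQualityResultFormType payload."""
    problems = []

    # @range(min: 0, max: 100) on passingPercentage
    percentage = form.get("passingPercentage")
    if percentage is not None and not 0 <= percentage <= 100:
        problems.append("passingPercentage must be between 0 and 100")

    # @length(min: 0, max: 2000) on EvaluationResults
    evaluations = form.get("evaluations", [])
    if len(evaluations) > 2000:
        problems.append("evaluations allows at most 2000 entries")

    for index, evaluation in enumerate(evaluations):
        # @length(min: 0, max: 20) on ApplicableFields and EvaluationTypes
        if len(evaluation.get("applicableFields", [])) > 20:
            problems.append(f"evaluations[{index}].applicableFields allows at most 20 entries")
        if len(evaluation.get("types", [])) > 20:
            problems.append(f"evaluations[{index}].types allows at most 20 entries")
        # enum EvaluationStatus { PASS, FAIL }
        status = evaluation.get("status")
        if status is not None and status not in ("PASS", "FAIL"):
            problems.append(f"evaluations[{index}].status must be PASS or FAIL")

    return problems


# Illustrative payload, shaped like the form described above.
example_form = {
    "passingPercentage": 85.71,
    "evaluationsCount": 1,
    "evaluations": [
        {
            "description": "MinimumConstraint - Minimum",
            "types": ["Minimum_custom"],
            "applicableFields": ["healthcare_coverage"],
            "status": "FAIL",
        }
    ],
}
problems = validate_data_quality_form(example_form)
print(problems)
```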

To send the appropriate form data, we need to convert the Pydeequ output to match the DataQualityResultFormType contract. This can be achieved with a Python function that processes the results.

For each DataFrame row, we extract information from the constraint column. For example, take the following constraint value:

CompletenessConstraint(Completeness(birthdate,None))

We convert it to the following:

{
  "constraint": "CompletenessConstraint",
  "statisticName": "Completeness_custom",
  "column": "birthdate"
}

Make sure to send an output that matches the KPIs that you want to track. In our case, we are appending _custom to the statistic name, resulting in the following format for KPIs:

  • Completeness_custom
  • Uniqueness_custom

In a real-world scenario, you might want to set a value that matches with your data quality framework in relation to the KPIs that you want to track in Amazon DataZone.
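
The conversion described above can be sketched as follows. This is a minimal illustration, assuming every constraint string has the same shape as the CompletenessConstraint example; parse_constraint and its suffix parameter are hypothetical names, and other constraint formats would need their own handling.

```python
import re

# Matches strings shaped like "CompletenessConstraint(Completeness(birthdate,None))".
# Group 1: constraint name, group 2: statistic name, group 3: column name.
_CONSTRAINT_RE = re.compile(r"^(\w+)\((\w+)\(([^,()]*)")


def parse_constraint(constraint: str, suffix: str = "_custom") -> dict:
    """Extract the constraint, statistic, and column from a constraint string."""
    match = _CONSTRAINT_RE.match(constraint.strip())
    if match is None:
        raise ValueError(f"Unrecognized constraint format: {constraint!r}")
    constraint_name, statistic, column = match.groups()
    return {
        "constraint": constraint_name,
        # The suffix distinguishes these KPIs from the built-in ones.
        "statisticName": statistic + suffix,
        "column": column.strip(),
    }


parsed = parse_constraint("CompletenessConstraint(Completeness(birthdate,None))")
```

Here, parsed holds the same three keys as the converted object shown earlier, with Completeness_custom as the statistic name.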

After applying a transformation function, we have a Python object for each rule evaluation:

..., {
   'applicableFields': ["healthcare_coverage"],
   'types': ["Minimum_custom"],
   'status': 'FAIL',
   'description': 'MinimumConstraint - Minimum - Value: 0.0 does not meet the constraint requirement!'
 },...

We also use the constraint_status column to compute the overall score:

(number of successful evaluations / total number of evaluations) * 100

In our example, this results in a passing percentage of 85.71%.
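
As a worked example of this formula, the following sketch computes the score from a list of status values. It assumes that the constraint_status column contains the strings Success and Failure, and that six of seven evaluations succeeded, which is one split that reproduces the 85.71% figure; passing_percentage is a hypothetical helper.

```python
def passing_percentage(statuses, success_value="Success"):
    """Return the share of successful evaluations as a percentage (0 to 100)."""
    total = len(statuses)
    if total == 0:
        # No evaluations: report 0 rather than dividing by zero.
        return 0.0
    passed = sum(1 for status in statuses if status == success_value)
    return round(passed / total * 100, 2)


# Assumed input: six successful evaluations and one failure.
score = passing_percentage(["Success"] * 6 + ["Failure"])
```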

We set this value in the passingPercentage input field along with the other information related to the evaluations in the input of the Boto3 method post_time_series_data_points:

import json

import boto3

# Instantiate the client library to communicate with Amazon DataZone Service
#
datazone = boto3.client(
    service_name='datazone', 
    region_name=<Region(String) example: us-east-1>
)

# Perform the API operation to push the Data Quality information to Amazon DataZone
#
datazone.post_time_series_data_points(
    domainIdentifier=<DataZone domain ID>,
    entityIdentifier=<DataZone asset ID>,
    entityType='ASSET',
    forms=[
        {
            "content": json.dumps({
                    "evaluationsCount":<Number of evaluations (number)>,
                    "evaluations": [<List of objects {
                        'description': <Description (String)>,
                        'applicableFields': [<List of columns involved (String)>],
                        'types': [<List of KPIs (String)>],
                        'status': <FAIL/PASS (string)>
                        }>
                     ],
                    "passingPercentage":<Score (number)>
                }),
            "formName": <Form name(String) example: PydeequRuleSet1>,
            "typeIdentifier": "amazon.datazone.DataQualityResultFormType",
            "timestamp": <Date (timestamp)>
        }
    ]
)

Boto3 invokes the Amazon DataZone APIs. In these examples, we used Boto3 and Python, but you can choose one of the AWS SDKs developed in the language you prefer.
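
To keep the API call short, the form entry can be assembled separately. The following sketch builds one element of the forms list from a list of evaluation objects; build_data_quality_form is a hypothetical helper, and the evaluation used here is the one from the earlier example.

```python
import json
from datetime import datetime, timezone


def build_data_quality_form(form_name, evaluations, passing_percentage, timestamp=None):
    """Build one entry for the forms parameter of post_time_series_data_points."""
    content = {
        "evaluationsCount": len(evaluations),
        "evaluations": evaluations,
        "passingPercentage": passing_percentage,
    }
    return {
        # The API expects the form content as a JSON string, not a dictionary.
        "content": json.dumps(content),
        "formName": form_name,
        "typeIdentifier": "amazon.datazone.DataQualityResultFormType",
        "timestamp": timestamp or datetime.now(timezone.utc),
    }


form = build_data_quality_form(
    "PydeequRuleSet1",
    [
        {
            "applicableFields": ["healthcare_coverage"],
            "types": ["Minimum_custom"],
            "status": "FAIL",
            "description": "MinimumConstraint - Minimum - Value: 0.0 does not meet the constraint requirement!",
        }
    ],
    85.71,
    timestamp=datetime(2024, 4, 3, tzinfo=timezone.utc),
)
```

The resulting dictionary can then be passed as forms=[form], together with your own domain and asset identifiers.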

After setting the appropriate domain and asset ID and running the method, we can check on the Amazon DataZone console that the asset data quality is now visible on the asset page.

We can observe that the overall score matches with the API input value. We can also see that we were able to add customized KPIs on the overview tab through custom types parameter values.

With the new Amazon DataZone APIs, you can load data quality rules from third-party systems into a specific data asset. With this capability, Amazon DataZone allows you to extend the types of indicators present in AWS Glue Data Quality (such as completeness, minimum, and uniqueness) with custom indicators.

Clean up

We recommend deleting any potentially unused resources to avoid incurring unexpected costs. For example, you can delete the Amazon DataZone domain and the EMR application you created during this process.

Conclusion

In this post, we highlighted the latest features of Amazon DataZone for data quality, empowering end-users with enhanced context and visibility into their data assets. Furthermore, we delved into the seamless integration between Amazon DataZone and AWS Glue Data Quality. You can also use the Amazon DataZone APIs to integrate with external data quality providers, enabling you to maintain a comprehensive and robust data strategy within your AWS environment.

To learn more about Amazon DataZone, refer to the Amazon DataZone User Guide.


About the Authors


Andrea Filippo is a Partner Solutions Architect at AWS supporting Public Sector partners and customers in Italy. He focuses on modern data architectures and helping customers accelerate their cloud journey with serverless technologies.

Emanuele is a Solutions Architect at AWS, based in Italy, after living and working for more than 5 years in Spain. He enjoys helping large companies with the adoption of cloud technologies, and his area of expertise is mainly focused on Data Analytics and Data Management. Outside of work, he enjoys traveling and collecting action figures.

Varsha Velagapudi is a Senior Technical Product Manager with Amazon DataZone at AWS. She focuses on improving data discovery and curation required for data analytics. She is passionate about simplifying customers’ AI/ML and analytics journey to help them succeed in their day-to-day tasks. Outside of work, she enjoys nature and outdoor activities, reading, and traveling.

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

Post Syndicated from Andries Engelbrecht original https://aws.amazon.com/blogs/big-data/use-apache-iceberg-in-your-data-lake-with-amazon-s3-aws-glue-and-snowflake/

This post is co-written with Andries Engelbrecht and Scott Teal from Snowflake.

Businesses are constantly evolving, and data leaders are challenged every day to meet new requirements. For many enterprises and large organizations, it is not feasible to have one processing engine or tool to deal with the various business requirements. They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture to accelerate the delivery of new solutions.

Customers are using AWS and Snowflake to develop purpose-built data architectures that provide the performance required for modern analytics and artificial intelligence (AI) use cases. Implementing these solutions requires data sharing between purpose-built data stores. This is why Snowflake and AWS are delivering enhanced support for Apache Iceberg to enable and facilitate data interoperability between data services.

Apache Iceberg is an open-source table format that provides reliability, simplicity, and high performance for large datasets with transactional integrity between various processing engines. In this post, we discuss the following:

  • Advantages of Iceberg tables for data lakes
  • Two architectural patterns for sharing Iceberg tables between AWS and Snowflake:
    • Manage your Iceberg tables with AWS Glue Data Catalog
    • Manage your Iceberg tables with Snowflake
  • The process of converting existing data lakes tables to Iceberg tables without copying the data

Now that you have a high-level understanding of the topics, let’s dive into each of them in detail.

Advantages of Apache Iceberg

Apache Iceberg is a distributed, community-driven, Apache 2.0-licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time. Apache Iceberg offers integrations with popular data processing frameworks such as Apache Spark, Apache Flink, Apache Hive, Presto, and more.

Iceberg tables maintain metadata to abstract large collections of files, providing data management features including time travel, rollback, data compaction, and full schema evolution, reducing management overhead. Originally developed at Netflix before being open sourced to the Apache Software Foundation, Apache Iceberg was a blank-slate design to solve common data lake challenges like user experience, reliability, and performance. It is now supported by a robust community of developers focused on continually improving the project and adding new features that serve real user needs and provide optionality.

Transactional data lakes built on AWS and Snowflake

Snowflake provides various integrations for Iceberg tables with multiple storage options, including Amazon S3, and multiple catalog options, including AWS Glue Data Catalog and Snowflake. AWS provides integrations for various AWS services with Iceberg tables as well, including AWS Glue Data Catalog for tracking table metadata. Combining Snowflake and AWS gives you multiple options to build out a transactional data lake for analytical and other use cases such as data sharing and collaboration. By adding a metadata layer to data lakes, you get a better user experience, simplified management, and improved performance and reliability on very large datasets.

Manage your Iceberg table with AWS Glue

You can use AWS Glue to ingest, catalog, transform, and manage the data on Amazon Simple Storage Service (Amazon S3). AWS Glue is a serverless data integration service that allows you to visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes in Iceberg format. With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. Snowflake integrates with AWS Glue Data Catalog to access the Iceberg table catalog and the files on Amazon S3 for analytical queries. This greatly improves performance and compute cost in comparison to external tables on Snowflake, because the additional metadata improves pruning in query plans.

You can use this same integration to take advantage of the data sharing and collaboration capabilities in Snowflake. This can be very powerful if you have data in Amazon S3 and need to enable Snowflake data sharing with other business units, partners, suppliers, or customers.

The following architecture diagram provides a high-level overview of this pattern.

The workflow includes the following steps:

  1. AWS Glue extracts data from applications, databases, and streaming sources. AWS Glue then transforms it and loads it into the data lake in Amazon S3 in Iceberg table format, while inserting and updating the metadata about the Iceberg table in AWS Glue Data Catalog.
  2. The AWS Glue crawler generates and updates Iceberg table metadata and stores it in AWS Glue Data Catalog for existing Iceberg tables on an S3 data lake.
  3. Snowflake integrates with AWS Glue Data Catalog to retrieve the snapshot location.
  4. In the event of a query, Snowflake uses the snapshot location from AWS Glue Data Catalog to read Iceberg table data in Amazon S3.
  5. Snowflake can query across Iceberg and Snowflake table formats. You can share data for collaboration with one or more accounts in the same Snowflake region. You can also use data in Snowflake for visualization using Amazon QuickSight, or use it for machine learning (ML) and artificial intelligence (AI) purposes with Amazon SageMaker.

Manage your Iceberg table with Snowflake

A second pattern also provides interoperability across AWS and Snowflake, but implements data engineering pipelines for ingestion and transformation to Snowflake. In this pattern, data is loaded to Iceberg tables by Snowflake through integrations with AWS services like AWS Glue or through other sources like Snowpipe. Snowflake then writes data directly to Amazon S3 in Iceberg format for downstream access by Snowflake and various AWS services, and Snowflake manages the Iceberg catalog that tracks snapshot locations across tables for AWS services to access.

Like the previous pattern, you can use Snowflake-managed Iceberg tables with Snowflake data sharing, but you can also use S3 to share datasets in cases where one party does not have access to Snowflake.

The following architecture diagram provides an overview of this pattern with Snowflake-managed Iceberg tables.

This workflow consists of the following steps:

  1. In addition to loading data via the COPY command, Snowpipe, and the native Snowflake connector for AWS Glue, you can integrate data via Snowflake Data Sharing.
  2. Snowflake writes Iceberg tables to Amazon S3 and updates metadata automatically with every transaction.
  3. Iceberg tables in Amazon S3 are queried by Snowflake for analytical and ML workloads using services like QuickSight and SageMaker.
  4. Apache Spark services on AWS can access snapshot locations from Snowflake via a Snowflake Iceberg Catalog SDK and directly scan the Iceberg table files in Amazon S3.

Comparing solutions

These two patterns highlight options available to data personas today to maximize their data interoperability between Snowflake and AWS using Apache Iceberg. But which pattern is ideal for your use case? If you’re already using AWS Glue Data Catalog and only require Snowflake for read queries, then the first pattern can integrate Snowflake with AWS Glue and Amazon S3 to query Iceberg tables. If you’re not already using AWS Glue Data Catalog and require Snowflake to perform reads and writes, then the second pattern is likely a good solution that allows for storing and accessing data from AWS.

Considering that reads and writes will probably operate on a per-table basis rather than the entire data architecture, it is advisable to use a combination of both patterns.

Migrate existing data lakes to a transactional data lake using Apache Iceberg

You can convert existing Parquet, ORC, and Avro-based data lake tables on Amazon S3 to Iceberg format to reap the benefits of transactional integrity while improving performance and user experience. There are several Iceberg table migration options (SNAPSHOT, MIGRATE, and ADD_FILES) for migrating existing data lake tables in-place to Iceberg format, which is preferable to rewriting all of the underlying data files—a costly and time-consuming effort with large datasets. In this section, we focus on ADD_FILES, because it’s useful for custom migrations.

For ADD_FILES options, you can use AWS Glue to generate Iceberg metadata and statistics for an existing data lake table and create new Iceberg tables in AWS Glue Data Catalog for future use without needing to rewrite the underlying data. For instructions on generating Iceberg metadata and statistics using AWS Glue, refer to Migrate an existing data lake to a transactional data lake using Apache Iceberg or Convert existing Amazon S3 data lake tables to Snowflake Unmanaged Iceberg tables using AWS Glue.

This option requires that you pause data pipelines while converting the files to Iceberg tables, which is a straightforward process in AWS Glue because the destination just needs to be changed to an Iceberg table.
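
For orientation, ADD_FILES is exposed in Apache Iceberg as the Spark procedure add_files. The following sketch only builds the CALL statement; the catalog name glue_catalog, the table name, and the S3 path are placeholder assumptions, the target Iceberg table must already exist, and build_add_files_call is a hypothetical helper rather than part of any AWS or Iceberg library.

```python
SUPPORTED_FORMATS = {"parquet", "orc", "avro"}


def build_add_files_call(catalog, target_table, source_path, file_format="parquet"):
    """Build the Spark SQL statement that registers existing files with an Iceberg table."""
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported file format: {file_format}")
    return (
        f"CALL {catalog}.system.add_files("
        f"table => '{target_table}', "
        f"source_table => '`{file_format}`.`{source_path}`')"
    )


# Placeholder names; replace them with your own catalog, table, and location.
statement = build_add_files_call(
    "glue_catalog", "db.customers", "s3://example-bucket/customers/"
)
# In a Spark session configured for Iceberg, this would run as spark.sql(statement).
```

Because the procedure only adds references to the existing files in the Iceberg metadata, the underlying data files are not rewritten.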

Conclusion

In this post, you saw the two architecture patterns for implementing Apache Iceberg in a data lake for better interoperability across AWS and Snowflake. We also provided guidance on migrating existing data lake tables to Iceberg format.

Sign up for AWS Dev Day on April 10 to get hands-on not only with Apache Iceberg, but also with streaming data pipelines with Amazon Data Firehose and Snowpipe Streaming, and generative AI applications with Streamlit in Snowflake and Amazon Bedrock.


About the Authors

Andries Engelbrecht is a Principal Partner Solutions Architect at Snowflake and works with strategic partners. He is actively engaged with strategic partners like AWS supporting product and service integrations as well as the development of joint solutions with partners. Andries has over 20 years of experience in the field of data and analytics.

Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data architectures on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions.

Brian Dolan joined Amazon as a Military Relations Manager in 2012 after his first career as a Naval Aviator. In 2014, Brian joined Amazon Web Services, where he helped Canadian customers from startups to enterprises explore the AWS Cloud. Most recently, Brian was a member of the Non-Relational Business Development team as a Go-To-Market Specialist for Amazon DynamoDB and Amazon Keyspaces before joining the Analytics Worldwide Specialist Organization in 2022 as a Go-To-Market Specialist for AWS Glue.

Nidhi Gupta is a Sr. Partner Solution Architect at AWS. She spends her days working with customers and partners, solving architectural challenges. She is passionate about data integration and orchestration, serverless and big data processing, and machine learning. Nidhi has extensive experience leading the architecture design and production release and deployments for data workloads.

Scott Teal is a Product Marketing Lead at Snowflake and focuses on data lakes, storage, and governance.

AlmaLinux OS – CVE-2024-1086 and XZ (AlmaLinux blog)

Post Syndicated from jzb original https://lwn.net/Articles/968299/

AlmaLinux has announced
updated kernels for AlmaLinux 8 and 9 to address CVE-2024-1086, a
use-after-free vulnerability in the kernel that could be exploited to
gain local privilege escalation. This is notable because the fix
marks a divergence between AlmaLinux and Red Hat Enterprise Linux (RHEL):

In January of this year, a kernel flaw was disclosed and named CVE-2024-1086.
This flaw is trivially exploitable on most RHEL-equivalent
systems. There are many proof-of-concept posts available now,
including one from our Infrastructure team lead, Jonathan Wright (Dealing
with CVE-2024-1086). In multi-user scenarios, this flaw is
especially problematic.

Though this was flagged as something to be fixed in Red Hat
Enterprise Linux, Red Hat has only rated this as a moderate
impact.

The AlmaLinux project would also like to note that it is not
impacted by the XZ backdoor. “Because enterprise Linux takes a bit
longer to adopt those updates (sometimes to the chagrin of our users),
the version of XZ that had the back door inserted hadn’t made it
further than Fedora in our ecosystem.”

Malcolm: Improvements to static analysis in the GCC 14 compiler

Post Syndicated from corbet original https://lwn.net/Articles/968297/

David Malcolm writes
about some static-analyzer features
that are coming in the GCC 14
release.

Solving the halting problem?

Obviously I’m kidding with the title here, but for GCC 14 I’ve
implemented a new warning: -Wanalyzer-infinite-loop that’s able to
detect some simple cases of infinite loops.

See also: this report from the 2023 GNU
Tools Cauldron.

Simplify your query management with search templates in Amazon OpenSearch Service

Post Syndicated from Arun Lakshmanan original https://aws.amazon.com/blogs/big-data/simplify-your-query-management-with-search-templates-in-amazon-opensearch-service/

Amazon OpenSearch Service is an Apache-2.0-licensed distributed search and analytics suite offered by AWS. This fully managed service allows organizations to secure data, perform keyword and semantic search, analyze logs, alert on anomalies, explore interactive log analytics, implement real-time application monitoring, and gain a more profound understanding of their information landscape. OpenSearch Service provides the tools and resources needed to unlock the full potential of your data. With its scalability, reliability, and ease of use, it’s a valuable solution for businesses seeking to optimize their data-driven decision-making processes and improve overall operational efficiency.

This post delves into the transformative world of search templates. We unravel the power of search templates in revolutionizing the way you handle queries, providing a comprehensive guide to help you navigate through the intricacies of this innovative solution. From optimizing search processes to saving time and reducing complexities, discover how incorporating search templates can elevate your query management game.

Search templates

Search templates let developers define complex queries within OpenSearch and reuse them across application scenarios, which removes query-generation logic from the application code. You can also modify your queries without recompiling the application. Search templates in OpenSearch use mustache, a logic-free templating language. A mustache-based search template has a query structure and placeholders for the variable values, written in double curly braces ({{}}), which are replaced with their actual values at runtime. You use the _search/template API to run a template, specifying the values that OpenSearch should use. Stored search templates can be reused by name.

Mustache enables you to generate dynamic filters or queries based on the values passed in the search request, making your search requests more flexible and powerful.

In the following example, the search template runs the query in the “source” block by passing in the values for the field and value parameters from the “params” block:

GET /myindex/_search/template
{
  "source": {
    "query": {
      "bool": {
        "must": [
          {
            "match": {
              "{{field}}": "{{value}}"
            }
          }
        ]
      }
    }
  },
  "params": {
    "field": "place",
    "value": "sweethome"
  }
}

You can store templates in the cluster with a name and refer to them in a search instead of attaching the template to each request. You use the PUT _scripts API to publish a template to the cluster. Let’s say you have an index of books, and you want to search for books by publication date, rating, and price. You could create and publish a search template as follows:

PUT /_scripts/find_book
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "bool": {
          "must": [
            {
              "range": {
                "publish_date": {
                  "gte": "{{gte_date}}"
                }
              }
            },
            {
              "range": {
                "rating": {
                  "gte": "{{gte_rating}}"
                }
              }
            },
            {
              "range": {
                "price": {
                  "lte": "{{lte_price}}"
                }
              }
            }
          ]
        }
      }
    }
  }
}

In this example, you define a search template called find_book that uses the mustache template language with defined placeholders for the gte_date, gte_rating, and lte_price parameters.

To use the search template stored in the cluster, you can send a request to OpenSearch with the appropriate parameters. For example, you can search for books that have been published in the last year, with a rating of at least 4.0, and priced at $20 or less:

POST /books/_search/template
{
  "id": "find_book",
  "params": {
    "gte_date": "now-1y",
    "gte_rating": 4.0,
    "lte_price": 20
  }
}

This query will return all books from the books index that have been published in the last year, with a rating of at least 4.0, and a price of at most $20.
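
On the application side, calling a stored template reduces to sending its name and parameters. The following is a minimal sketch that builds the request path and body without sending them; template_search_request is a hypothetical helper, and the resulting values can be passed to whichever HTTP or OpenSearch client you use.

```python
import json


def template_search_request(index, template_id, params):
    """Return the path and JSON body for a stored search template call."""
    path = f"/{index}/_search/template"
    body = json.dumps({"id": template_id, "params": params})
    return path, body


path, body = template_search_request(
    "books",
    "find_book",
    {"gte_date": "now-1y", "gte_rating": 4.0, "lte_price": 20},
)
```

The application code contains no query logic, so the query can change in the cluster without touching this function.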

Default values in search templates

Default values are values that are used for search parameters when the query that engages the template doesn’t specify values for them. In the context of the find_book example, you can set default values for the from, size, and gte_date parameters in case they are not provided in the search request. To set default values, you can use the following mustache template:

PUT /_scripts/find_book
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "publish_date": {
                  "gte": "{{gte_date}}{{^gte_date}}now-1y{{/gte_date}}"
                }
              }
            },
            {
              "range": {
                "rating": {
                  "gte": "{{gte_rating}}"
                }
              }
            },
            {
              "range": {
                "price": {
                  "lte": "{{lte_price}}"
                }
              }
            }
          ]
        }
      },
      "from": "{{from}}{{^from}}0{{/from}}",
      "size": "{{size}}{{^size}}2{{/size}}"
    }
  }
}

In this template, the {{from}}, {{size}}, and {{gte_date}} parameters are placeholders that can be filled in with specific values when the template is used in a search. If no value is specified for {{from}}, {{size}}, and {{gte_date}}, OpenSearch uses the default values of 0, 2, and now-1y, respectively. This means that if a user searches for books without specifying from, size, and gte_date, the search returns just two books that match the search criteria and were published in the last year.
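
To see why the pattern {{name}}{{^name}}default{{/name}} yields a default, it helps to simulate the substitution. The following Python sketch is a deliberately simplified stand-in for mustache that handles only plain variables and inverted sections; it is not the renderer that OpenSearch uses, and render_with_defaults is a hypothetical name.

```python
import re

# {{^name}}fallback{{/name}} is an inverted section: rendered only when name is absent.
_INVERTED = re.compile(r"\{\{\^(\w+)\}\}(.*?)\{\{/\1\}\}", re.DOTALL)
_VARIABLE = re.compile(r"\{\{(\w+)\}\}")


def _is_absent(value):
    # Simplification: missing, false, and empty values count as absent; 0 does not.
    return value is None or value is False or value == "" or value == []


def render_with_defaults(template, params):
    def fill_inverted(match):
        name, fallback = match.groups()
        return fallback if _is_absent(params.get(name)) else ""

    def fill_variable(match):
        value = params.get(match.group(1))
        return "" if _is_absent(value) else str(value)

    # Resolve inverted sections first, then substitute the remaining variables.
    return _VARIABLE.sub(fill_variable, _INVERTED.sub(fill_inverted, template))


default_size = render_with_defaults("{{size}}{{^size}}2{{/size}}", {})
explicit_size = render_with_defaults("{{size}}{{^size}}2{{/size}}", {"size": 5})
```

When size is missing, the variable renders as an empty string and the inverted section contributes the fallback; when size is supplied, the opposite happens, so exactly one of the two parts produces output.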

You can also use the render API as follows if you have a stored template and want to validate it:

POST _render/template
{
  "id": "find_book",
  "params": {
    "gte_date": "now-1y",
    "gte_rating": 4.0,
    "lte_price": 20
  }
}

Conditions in search templates

A conditional statement allows you to control the flow of your search template. It’s often used to include or exclude parts of the search request depending on which parameters are passed. The syntax is as follows:

{{#condition}}
  ... content to include if the condition is true ...
{{/condition}}

The following example searches for books based on the gte_date, gte_rating, and lte_price parameters and an optional is_available parameter. The condition is used to include the term query only if the is_available parameter is present in the search request. If the is_available parameter is not present, the term query is skipped.

GET /books/_search/template
{
  "source": """{
    "query": {
      "bool": {
        "must": [
        {{#is_available}}
        {
          "term": {
            "in_stock": "{{is_available}}"
          }
        },
        {{/is_available}}
          {
            "range": {
              "publish_date": {
                "gte": "{{gte_date}}"
              }
            }
          },
          {
            "range": {
              "rating": {
                "gte": "{{gte_rating}}"
              }
            }
          },
          {
            "range": {
              "price": {
                "lte": "{{lte_price}}"
              }
            }
          }
        ]
      }
    }
  }""",
  "params": {
    "gte_date": "now-3y",
    "gte_rating": 4.0,
    "lte_price": 20,
    "is_available": true
  }
}

By using a conditional statement in this way, you can make your search requests more flexible and efficient by only including the necessary filters when they are needed.

To make the query valid inside the JSON, it needs to be escaped with triple quotes (""") in the payload.
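
The effect of the conditional block can be simulated outside of OpenSearch. The following Python sketch is a simplified stand-in for mustache that handles only {{#name}} sections and plain variables, without the value escaping a real renderer performs; render_conditional is a hypothetical name and the template is a shortened version of the example above.

```python
import json
import re

# Matches {{#name}}...{{/name}} sections and plain {{name}} variables.
_SECTION = re.compile(r"\{\{#(\w+)\}\}(.*?)\{\{/\1\}\}", re.DOTALL)
_VARIABLE = re.compile(r"\{\{(\w+)\}\}")


def _to_text(value):
    # Render booleans the way JSON spells them.
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def render_conditional(template, params):
    def fill_section(match):
        name, body = match.groups()
        # Keep the section body only when the parameter is present and truthy.
        return body if params.get(name) else ""

    def fill_variable(match):
        return _to_text(params.get(match.group(1), ""))

    return _VARIABLE.sub(fill_variable, _SECTION.sub(fill_section, template))


TEMPLATE = (
    '{"must": ['
    '{{#is_available}}{"term": {"in_stock": "{{is_available}}"}},{{/is_available}}'
    '{"range": {"price": {"lte": "{{lte_price}}"}}}'
    ']}'
)

with_stock = json.loads(render_conditional(TEMPLATE, {"lte_price": 20, "is_available": True}))
without_stock = json.loads(render_conditional(TEMPLATE, {"lte_price": 20}))
```

Both results parse as JSON because the comma belongs to the conditional block: it appears only when the term query does, and a range query always follows it.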

Loops in search templates

A loop is a feature of mustache templates that allows you to iterate over an array of values and run the same code block for each item in the array. It’s often used to generate a dynamic list of filters or queries based on the values passed in the search request. The syntax is as follows:

{{#array}}
  ... content to render for each item, with {{.}} as the current item ...
{{/array}}

The following example searches for books based on a title ({{name}}) and an array of categories ({{list}}) to filter the search results. The mustache loop is used to generate a match filter for each item in the list array.

GET books/_search/template
{
  "source": """{
    "query": {
      "bool": {
        "must": [
        {{#list}}
        {
          "match": {
            "category": "{{.}}"
          }
        },
        {{/list}}
          {
          "match": {
            "title": "{{name}}"
          }
        }
        ]
      }
    }
  }""",
  "params": {
    "name": "killer",
    "list": ["Classics", "comics", "Horror"]
  }
}

The search request is rendered as follows:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category": "Classics"
          }
        },
        {
          "match": {
            "category": "comics"
          }
        },
        {
          "match": {
            "category": "Horror"
          }
        },
        {
          "match": {
            "title": "killer"
          }
        }
      ]
    }
  }
}

The loop has generated a match filter for each item in the categories array, resulting in a more flexible and efficient search request that filters by multiple categories. By using the loops, you can generate dynamic filters or queries based on the values passed in the search request, making your search requests more flexible and powerful.
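
The expansion of a loop can also be simulated outside of OpenSearch. The following Python sketch is a simplified stand-in for mustache, without the value escaping a real renderer performs. It assumes the implicit iterator {{.}} for the current item and a comma after each generated clause so that the result is valid JSON; render_loop is a hypothetical name.

```python
import json
import re

# Matches {{#name}}...{{/name}} sections.
_SECTION = re.compile(r"\{\{#(\w+)\}\}(.*?)\{\{/\1\}\}", re.DOTALL)


def render_loop(template, params):
    def expand(match):
        name, body = match.groups()
        items = params.get(name) or []
        # Repeat the section body once per item; {{.}} stands for the current item.
        return "".join(body.replace("{{.}}", str(item)) for item in items)

    rendered = _SECTION.sub(expand, template)
    # Substitute the remaining scalar parameters.
    for name, value in params.items():
        if not isinstance(value, list):
            rendered = rendered.replace("{{" + name + "}}", str(value))
    return rendered


TEMPLATE = (
    '{"query": {"bool": {"must": ['
    '{{#list}}{"match": {"category": "{{.}}"}},{{/list}}'
    '{"match": {"title": "{{name}}"}}'
    ']}}}'
)

query = json.loads(
    render_loop(TEMPLATE, {"name": "killer", "list": ["Classics", "comics", "Horror"]})
)
clauses = query["query"]["bool"]["must"]
```

Here, clauses contains one category match per list item followed by the title match, and an empty list leaves only the title match.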

Advantages of using search templates

The following are key advantages of using search templates:

  • Maintainability – By separating the query definition from the application code, search templates make it straightforward to manage changes to the query or tune search relevancy. You don’t have to compile and redeploy your application.
  • Consistency – You can construct search templates that allow you to design standardized query patterns and reuse them throughout your application, which can help maintain consistency across your queries.
  • Readability – Because templates can be constructed using a more terse and expressive syntax, complicated queries are straightforward to test and debug.
  • Testing – Search templates can be tested and debugged independently of the application code, facilitating simpler problem-solving and relevancy tuning without having to re-deploy the application. You can easily create A/B testing with different templates for the same search.
  • Flexibility – Search templates can be quickly updated or adjusted to account for modifications to the data or search specifications.

Best practices

Consider the following best practices when using search templates:

  • Before deploying your template to production, make sure it is fully tested. You can test the effectiveness and correctness of your template with example data. We highly recommend running the application tests that use these templates before publishing.
  • Search templates allow for the addition of input parameters, which you can use to modify the query to suit the needs of a particular use case. Reusing the same template with varied inputs is made simpler by parameterizing the inputs.
  • Manage the templates in an external source control system.
  • Avoid hard-coding values inside the query—instead, use defaults.

Conclusion

In this post, you learned the basics of search templates, a powerful feature of OpenSearch, and how templates help streamline search queries and improve performance. With search templates, you can build more robust search applications in less time.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.

Stay tuned for more exciting updates and new features in OpenSearch Service.


About the authors

Arun Lakshmanan is a Search Specialist with Amazon OpenSearch Service based out of Chicago, IL. He has over 20 years of experience working with enterprise customers and startups. He loves to travel and spend quality time with his family.

Madhan Kumar Baskaran works as a Search Engineer at AWS, specializing in Amazon OpenSearch Service. His primary focus involves assisting customers in constructing scalable search applications and analytics solutions. Based in Bengaluru, India, Madhan has a keen interest in data engineering and DevOps.

[$] A memory model for Rust code in the kernel

Post Syndicated from corbet original https://lwn.net/Articles/967049/

The Rust programming language differs from C in many ways; those
differences tend to be what users admire in the language. But those
differences can also lead to an impedance mismatch when Rust code is
integrated into a C-dominated system, and it can be even worse in the
kernel, which is not a typical C program. Memory models are a case in
point. A programming language’s view of memory is sufficiently fundamental
and arcane that many developers never have to learn much about it. It is
hard to maintain that sort of blissful ignorance while working in the
kernel, though, so a recent discussion of how to choose a memory model for
kernel code in Rust is of interest.

KDE6 release: D-Bus and Polkit Galore (SUSE security team blog)

Post Syndicated from corbet original https://lwn.net/Articles/968220/

The SUSE Security Team Blog is carrying a
detailed article
on SUSE’s review of the KDE6 release.

The SUSE security team restricts the installation of system wide
D-Bus services and Polkit policies in openSUSE distributions and
derived SUSE products. Any package that ships these features needs
to be reviewed by us first, before it can be added to production
repositories.

In November, openSUSE KDE packagers approached us with a long list
of KDE components for an upcoming KDE6 major release. The packages
needed adjusted D-Bus and Polkit whitelistings due to renamed
interfaces or other breaking changes. Looking into this many
components at once was a unique experience that also led to new
insights, which will be discussed in this article.

Security updates for Wednesday

Post Syndicated from jzb original https://lwn.net/Articles/968218/

Security updates have been issued by Debian (py7zr), Fedora (biosig4c++ and podman), Oracle (kernel, kernel-container, and ruby:3.1), Red Hat (.NET 7.0, bind9.16, curl, expat, grafana, grafana-pcp, kernel, kernel-rt, kpatch-patch, less, opencryptoki, and postgresql-jdbc), and Ubuntu (cacti).

Redict 7.3.0 released

Post Syndicated from corbet original https://lwn.net/Articles/968183/

The first stable release of Redict, a fork of the Redis in-memory database
under a copyleft license, has been announced.

You may be wondering why Redict would be of interest to you,
particularly when compared with Valkey,
another Redis fork that was announced on Thursday.

In technical terms, we are focusing on stability and long-term
maintenance, and on achieving excellence within our current
scope. We believe that Redict is near feature-complete and that it
is more valuable to our users if we take a conservative stance to
innovation and focus on long-term reliability instead. This is in
part a choice we’ve made to distinguish ourselves from Valkey,
whose commercial interests are able to invest more resources into
developing more radical innovations, but also an acknowledgement of
a cultural difference between our projects, in that the folks
behind Redict place greater emphasis on software with a finite
scope and ambitions towards long-term stability rather than
focusing on long-term growth in scope and complexity.

Improving Cloudflare Workers and D1 developer experience with Prisma ORM

Post Syndicated from Jon Harrell (Guest Author) original https://blog.cloudflare.com/prisma-orm-and-d1


Working with databases can be difficult. Developers face increasing data complexity and needs beyond simple create, read, update, and delete (CRUD) operations. Unfortunately, these issues also compound on themselves: developers have a harder time iterating in an increasingly complex environment. Cloudflare Workers and D1 help by reducing time spent managing infrastructure and deploying applications, and Prisma provides a great experience for your team to work and interact with data.  

Together, Cloudflare and Prisma make it easier than ever to deploy globally available apps with a focus on developer experience. To further that goal, Prisma Object Relational Mapper (ORM) now natively supports Cloudflare Workers and D1 in Preview. With version 5.12.0 of Prisma ORM you can now interact with your data stored in D1 from your Cloudflare Workers with the convenience of the Prisma Client API. Learn more and try it out now.

What is Prisma?

From writing to debugging, SQL queries take a long time and slow developer productivity. Even before writing queries, modeling tables can quickly become unwieldy, and migrating data is a nerve-wracking process. Prisma ORM looks to resolve all of these issues by providing an intuitive data modeling language, an automated migration workflow, and a developer-friendly and type-safe client for JavaScript and TypeScript, allowing developers to focus on what they enjoy: developing!

Prisma is focused on making working with data easy. Alongside an ORM, Prisma offers Accelerate and Pulse, products built on Cloudflare that cover needs from connection pooling, to query caching, to real-time type-safe database subscriptions.

How to get started with Prisma ORM, Cloudflare Workers, and D1

To get started with Prisma ORM and D1, first create a basic Cloudflare Workers app. This guide starts with the “Hello World” Worker example app, but any Workers example app will work. If you don’t have a project yet, start by creating a new one. Give your project a memorable name, like my-d1-prisma-app, then select the “Hello World” Worker and TypeScript. For now, choose not to deploy; we will deploy after setting up D1 and Prisma ORM.

npm create cloudflare@latest

Next, move into your newly created project and make sure that dependencies are installed:

cd my-d1-prisma-app && npm install

After dependencies are installed, we can move on to the D1 setup.

First, create a new D1 database for your app.

npx wrangler d1 create prod-prisma-d1-app
.
.
.

[[d1_databases]]
binding = "DB" # i.e. available in your Worker on env.DB
database_name = "prod-prisma-d1-app"
database_id = "<unique-ID-for-your-database>"

The section starting with [[d1_databases]] is the binding configuration needed in your wrangler.toml for your Worker to communicate with D1. Add that now:

# wrangler.toml
name = "my-d1-prisma-app"
main = "src/index.ts"
compatibility_date = "2024-03-20"
compatibility_flags = ["nodejs_compat"]

[[d1_databases]]
binding = "DB" # i.e. available in your Worker on env.DB
database_name = "prod-prisma-d1-app"
database_id = "<unique-ID-for-your-database>"

Your application now has D1 available! Next, add Prisma ORM to manage your queries, schema and migrations! To add Prisma ORM, first make sure the latest version is installed. Prisma ORM versions 5.12.0 and up support Cloudflare Workers and D1.

npm install prisma@latest @prisma/client@latest @prisma/adapter-d1

Now run npx prisma init in order to create the necessary files to start with. Since D1 uses SQLite’s SQL dialect, we set the provider to be sqlite.

npx prisma init --datasource-provider sqlite

This will create a few files, but the one to look at first is your Prisma schema file, available at prisma/schema.prisma:

// schema.prisma
// This is your Prisma schema file,
// learn more about it in the docs: https://pris.ly/d/prisma-schema

generator client {
  provider = "prisma-client-js"
}

datasource db {
  provider = "sqlite"
  url      = env("DATABASE_URL")
}

Before you can create any models, first enable the driverAdapters Preview feature. This will allow the Prisma Client to use an adapter to communicate with D1.

// schema.prisma
// This is your Prisma schema file,
// learn more about it in the docs: https://pris.ly/d/prisma-schema

generator client {
  provider = "prisma-client-js"
+ previewFeatures = ["driverAdapters"]
}

datasource db {
  provider = "sqlite"
  url      = env("DATABASE_URL")
}

Now you are ready to create your first model! In this app, you will be creating a “ticker”, a mainstay of many classic Internet sites.

Add a new model to your schema, Visit, which will track that an individual visited your site. A Visit is a simple model that will have a unique ID and the time at which an individual visited your site.

// This is your Prisma schema file,
// learn more about it in the docs: https://pris.ly/d/prisma-schema

generator client {
  provider        = "prisma-client-js"
  previewFeatures = ["driverAdapters"]
}

datasource db {
  provider = "sqlite"
  url      = env("DATABASE_URL")
}

+ model Visit {
+   id        Int      @id @default(autoincrement())
+   visitTime DateTime @default(now())
+ }

Now that you have a schema and a model, let’s create a migration. First, use wrangler to generate an empty migration file, then use prisma migrate diff to fill it. If prompted, select “yes” to create a migrations folder at the root of your project.

npx wrangler d1 migrations create prod-prisma-d1-app init
 ⛅️ wrangler 3.36.0
-------------------
✔ No migrations folder found. Set `migrations_dir` in wrangler.toml to choose a different path.
Ok to create /path/to/your/project/my-d1-prisma-app/migrations? … yes
✅ Successfully created Migration '0001_init.sql'!

The migration is available for editing here
/path/to/your/project/my-d1-prisma-app/migrations/0001_init.sql

npx prisma migrate diff --script --from-empty --to-schema-datamodel ./prisma/schema.prisma >> migrations/0001_init.sql

The npx prisma migrate diff command computes the difference between an empty database (--from-empty) and your Prisma schema, and prints it as a SQL script. The >> redirect appends that script to the migration file that wrangler created in the migrations directory.

-- 0001_init.sql
-- Migration number: 0001 	 2024-03-21T22:15:50.184Z
-- CreateTable
CREATE TABLE "Visit" (
    "id" INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
    "visitTime" DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);

Now you can migrate your local and remote D1 database instances using wrangler and re-generate your Prisma Client to begin making queries.

npx wrangler d1 migrations apply prod-prisma-d1-app --local
npx wrangler d1 migrations apply prod-prisma-d1-app --remote
npx prisma generate

Import PrismaClient and PrismaD1, declare the binding for your D1 database in the Env interface, and you’re ready to use Prisma in your application.

// src/index.ts
import { PrismaClient } from "@prisma/client";
import { PrismaD1 } from "@prisma/adapter-d1";

export interface Env {
  DB: D1Database,
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const adapter = new PrismaD1(env.DB);
    const prisma = new PrismaClient({ adapter });
    const { pathname } = new URL(request.url);

    if (pathname === '/') {
      const numVisitors = await prisma.visit.count();
      return new Response(
        `You have had ${numVisitors} visitors!`
      );
    }

    return new Response('');
  },
};

You may notice that there are always 0 visitors, because nothing creates a Visit yet. Add another route that creates a new visit whenever someone requests /visit:

// src/index.ts
import { PrismaClient } from "@prisma/client";
import { PrismaD1 } from "@prisma/adapter-d1";

export interface Env {
  DB: D1Database,
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const adapter = new PrismaD1(env.DB);
    const prisma = new PrismaClient({ adapter });
    const { pathname } = new URL(request.url);

    if (pathname === '/') {
      const numVisitors = await prisma.visit.count();
      return new Response(
        `You have had ${numVisitors} visitors!`
      );
    } else if (pathname === '/visit') {
      const newVisitor = await prisma.visit.create({ data: {} });
      return new Response(
        `You visited at ${newVisitor.visitTime}. Thanks!`
      );
    }

    return new Response('');
  },
};

Your app is now set up to record visits and report how many visitors you have had!
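
If you want to exercise the routing logic without a D1 database, one option is to pull it into a plain function that depends only on the two operations the Worker uses. The sketch below is one way to do that, not part of the official setup: VisitStore, RouteResult, handleRoute, and createFakeVisitStore are hypothetical names, and returning 404 for unknown paths is an assumption that differs from the empty response in the Worker above.

```typescript
// Minimal stand-in for the two prisma.visit operations used by the Worker.
// All names in this sketch are hypothetical.
interface VisitStore {
  count(): Promise<number>;
  create(): Promise<{ visitTime: Date }>;
}

interface RouteResult {
  status: number;
  body: string;
}

// Same routing as the Worker, expressed against the VisitStore interface.
async function handleRoute(pathname: string, visits: VisitStore): Promise<RouteResult> {
  if (pathname === "/") {
    const numVisitors = await visits.count();
    return { status: 200, body: `You have had ${numVisitors} visitors!` };
  }
  if (pathname === "/visit") {
    const newVisitor = await visits.create();
    return {
      status: 200,
      body: `You visited at ${newVisitor.visitTime.toISOString()}. Thanks!`,
    };
  }
  // Assumption: unknown paths report 404 rather than an empty 200 response.
  return { status: 404, body: "Not found" };
}

// In-memory fake for local tests; it needs neither D1 nor Prisma.
function createFakeVisitStore(): VisitStore {
  const rows: { visitTime: Date }[] = [];
  return {
    count: async () => rows.length,
    create: async () => {
      const row = { visitTime: new Date() };
      rows.push(row);
      return row;
    },
  };
}
```

Inside the Worker, the same function could be called with a thin wrapper around the Prisma Client, for example an object whose count calls prisma.visit.count() and whose create calls prisma.visit.create({ data: {} }), with the result turned into a Response.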

Summary and further reading

We were able to build a simple app easily with Cloudflare Workers, D1 and Prisma ORM, but the benefits don’t stop there! Check the official documentation for information on using Prisma ORM with D1 along with workflows for migrating your data, and even extending the Prisma Client for your specific needs.
