Surveillance by the New Microsoft Outlook App

2024-04-04 Bruce Schneier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/04/surveillance-by-the-new-microsoft-outlook-app.html

The ProtonMail people are accusing Microsoft’s new Outlook for Windows app of conducting extensive surveillance on its users. It shares data with advertisers, a lot of data:

The window informs users that Microsoft and those 801 third parties use their data for a number of purposes, including to:

Store and/or access information on the user’s device
Develop and improve products
Personalize ads and content
Measure ads and content
Derive audience insights
Obtain precise geolocation data
Identify users through device scanning

Commentary.

Food Delivery Apps: Last Week Tonight with John Oliver (HBO)

2024-04-04 LastWeekTonight

Post Syndicated from LastWeekTonight original https://www.youtube.com/watch?v=aFsfJYWpqII

Careers in computer science: Two perspectives

2024-04-04 Dan Fisher

Post Syndicated from Dan Fisher original https://www.raspberrypi.org/blog/careers-in-computer-science-two-perspectives/

As educators, it’s important that we showcase the wide range of career opportunities available in the field of computing, not only to inspire learners, but also to help them feel sure they’re choosing to study a subject that is useful for their future. For example, a survey from the BBC in September 2023 found that more than a quarter of UK teenagers often feel anxious, with “exams and school life” among the main causes. To help young people chart their career paths, we recently hosted two live webinars for National Careers Week in the UK.

Our goal for the webinars was to highlight the breadth of careers within computing and to provide insights from professionals who are pursuing their own diverse and rewarding paths. Each webinar featured engaging discussions and an interactive Q&A session with learners who use our Ada Computer Science platform. The learners could ask their own questions to get firsthand knowledge and perspectives from our guest speakers.

Our guest speakers

Jess Van Brummelen is a Human–Computer Interaction Research Scientist at Niantic, the video games company behind augmented reality game Pokémon Go. After developing an interest in programming during her undergraduate degree in mechanical engineering, she went on to complete a Master’s degree and PhD in computer science at MIT.

Ashley Edwards is a Senior Research Scientist at Google DeepMind, working on reinforcement learning. She received her PhD in 2019 from Georgia Tech, spent time as an intern at Google Brain, and worked as a research scientist at Uber AI Labs.

You can read extracts from our interviews with Jess and Ashley and watch the full videos below. Teachers have contacted us to say they’ll be using the webinars for careers-focused sessions with their students. We hope you will do the same!

Please note that we have edited the extracts below to add clarity.

Jess Van Brummelen

Hi Jess. What advice would you give to a student who is thinking about a career in human–computer interaction in the gaming industry?

In terms of HCI and gaming, I’d actually recommend that you keep gaming! It’s a small part of my job but it’s really important to understand what’s fun and enjoyable in games. Not only that; gaming can be great for learning to problem-solve — there’s been all sorts of research on the positive impact of gaming.

A second thing, going back to how I felt in my mechanical engineering classes, I really felt like an ‘other’ and not someone who is the standard computer scientist or engineer. I would encourage students to pursue their dreams anyway because it’s so important to have diversity in these types of careers, especially technology, because it goes out to so many different people and it can really affect society. It’s really important that the people who make it come from many different backgrounds and cultures so we can create technology that is better for everyone.

[From Owen, a student on the livestream] What’s the most impossible idea you’ve come up with while working at Niantic?

I’m currently publishing a paper addressing the question, ‘Can we guide people without using anything visual on their phone?’ That means using audio and haptic (technology that transmits information via touch, e.g. vibrations) prompts instead. We tried out different commands where the phone said ‘turn left’ and ‘turn right’, but we really wanted to test how to guide someone more specifically in a game environment. For example, if there was a hidden object on a wall in a game that a person couldn’t see, could we guide them to that object while they’re walking? So I ran a study where I guided people to scan a statue by moving around it. Scanning is the process of using the camera on your phone to scan an object in real life, which is then reconstructed on your phone. Scanning objects can trigger other augmented reality experiences within a game. For example, you might scan a real-life box in a room and this might trigger an animation of that box opening to reveal a secret within the game. We tested a lot of different things. For example, test subjects listened to music as they were walking and when they were on the right path, the music sounded really good. But when they were off the path, it sounded terrible. So it helped them to look for the right path. Then if you were pointing the phone in the wrong direction for scanning objects, you would get warning vibrations on the phone. So we did the study and we were hoping it would improve safety. It turns out it was neutral on improving safety — I think this is because it was such a novel system. People weren’t used to using it and still bumped into things! But it did make people better at scanning the objects, which was interesting.

Watch Jess’s full interview:

Ashley Edwards

Hi Ashley. Is there something you studied in school that you found to be more useful now than you ever thought it would be?

Maths! I always enjoyed doing maths, but I didn’t realise I would need it as a computer scientist. You see it popping up all the time, especially in machine learning. Having a strong knowledge of calculus and linear algebra is really helpful.

How do you train an AI model using machine learning?

You start by asking the question, ‘What is the problem I’m trying to solve?’ Then typically you need input data and the outputs you want to achieve, so you ask two more questions, ‘What data do I want to come in?’ and ‘What do I want to come out?’ Let’s say you decide to use a supervised learning model (a category of machine learning where labelled data sets are used to train algorithms to detect patterns and predict outcomes) to predict whether a photo contains a cat. You train the model using a giant set of images with labels that say either ‘This is a cat’ or ‘This isn’t a cat’. By training the model with the images, you get to a point where your model can analyse the features of any image and predict whether it contains a cat or not.

In my field of research, I work on something called reinforcement learning, which is where you train your model through trial and error and the use of ‘rewards’. Let’s imagine we are trying to train a robot. We might write a program that tells the robot, ‘I am going to give you a reward if you take the right step forward and it’s going to be a positive reward. If you fall over, I’m going to give you a negative reward.’ So you train the robot to prioritise the right behaviours to optimise the rewards it’s getting.

[From a student] Will I still need to learn to code in the future?

I think it is going to be very different in the future, but we’ll still need to learn how to build different types of algorithms and we’re going to need to understand the concepts behind coding as well. We’ll still need to ask questions like, ‘What is it that I want to build?’ and ‘Is this actually doing the correct thing?’

Watch Ashley’s full interview:

Broadening access

Jess and Ashley are forging successful careers not only through a combination of smart choices, hard work, talent, and a passion for technology; they also had access to opportunities to discover their passion and receive an education in this field. Too many young people around the world still don’t have these opportunities.

That is why we provide free resources and training to help schools broaden access to computing education. For example, our free learning platform, Ada Computer Science, provides students aged 14 to 19 with high-quality computing resources and interactive questions, written by experts from our team. To learn more, visit adacomputerscience.org.

The post Careers in computer science: Two perspectives appeared first on Raspberry Pi Foundation.

Оценките – (не)нужното зло

2024-04-04

Post Syndicated from original https://www.toest.bg/otsenkite-ne-nuzhnoto-zlo/

Оценките – (не)нужното зло

Много от темите, засягащи образователната система и нейното актуализиране, тепърва се промъкват в България. Смесването на различни възрасти в едно занятие, блоковото разпределение на часовете с цел по-качествено научаване, преминаването в полудигитална среда с електронни учебници – все неща, за които чуваме, но не виждаме в действие. На фона на тази стагнация и нежелание за модернизация да говорим за оценките в училище е чиста революция, но все някога тази дискусия трябва да започне.

В различните държави оценките могат не само да бъдат изразени различно (българската шестица в Съединените щати е оценка А), но и да са разделени на различни степени – еквивалент на нашата уж шестобална система във Франция например е 20-точкова система с много по-детайлно разделение на оценяването. Общото между всички тези различни системи обаче са количествените и качествените показатели, които се съдържат във всяка оценка.

Дали оценките се пишат с букви или цифри, няма значение – те служат за рационализиране на комуникационния процес между учебните институции или между техните части и участници. Това, което не правят обаче, е да служат на отделния учещ и да допринасят за неговия напредък. Защото основаващата се на сравнение и съревнование учебна система търси единствено количествения израз на постиженията и напълно изключва качествения.

Актуално състояние

Оценките и начините на оценяване са поредният инструмент, който не се е променял поне от времето на родителите ни. Ако имате дете в училищна възраст, може да го разпитате как го изпитват.

Все още, през 2024 г., разполагаме с прилично количество авторитарни учители, които изправят ученика до бюрото си и се държат с него като на разпит.

Все още (понякога оправданото) безсилие на учителите се изразява в класическото наказание: „Извадете по един двоен лист“ (все едно някой някога е можел да изпише два листа, докато се намира в ступор от рязкото прекъсване на някое забавление в клас).

Все още учителите са просто хора и е съвсем естествено да са субективни и да дават предимство на ученици, към които изпитват симпатия, както и да им е трудно да признаят макар и минимална положителна промяна у някой ученик, когото не харесват толкова.

Колкото и правилници за оценяване да бъдат съставени и въведени, като например Наредба № 11 от 1 септември 2016 г. за оценяване на резултатите от обучението на учениците, никой от тях няма да може да намали щетите от субективното оценяване, от гоненето на успех (от родители и учители), от безкрайните уроци и курсове, целящи да вкарат поредното дете в „хубаво“ училище.

Ако все пак се примирим с неизбежността на оценките, е хубаво да си дадем сметка, че имаме шестобална система, която реално не е шестобална, защото оценките в нея са от слаб 2 до отличен 6. Двойката в тази система се използва най-често за наказание, а не като инструмент, който води до включването на изработени механизми за помощ. Защото в реалността вместо двойка и помощ идва тройката – за запазване на броя на учениците и за изпълнението на квотите. Ученикът научава, че няма кой да му помогне, но същевременно се научава, че независимо какво прави, той просто ще премине нататък. Поне до 12. клас – за след това не е нужно да се мисли.

Съвсем умишлено няма да говоря за безумието на националното външно оценяване, защото то е достойно за поредица анализи на всеки от участниците в него – от създателите му до последния изтерзан родител, напълнил още няколко джоба в сферата на частните уроци и школи и изпразнил и последните резерви от желание и любопитство на детето, което тепърва ще има още поне пет години гимназиален курс пред себе си.

Ако трябва да съберем настоящето в две думи, те биха били несправедливо и насилствено.

Как може да бъде

Откакто има оценки, се смята, че те са инструмент за мотивация учещият да се представя по-добре, а ако ги няма, ще липсва стимул за постигане на целите. Научно погледнато обаче, има нищожно малко доказателства, че оценките карат учениците да са по-трудолюбиви и да учат повече. За сметка на това има огромно количество доказателства, че точно оценките възпрепятстват успешното представяне при изпитване и влияят негативно на мотивацията за учене.

Психологията отдавна е доказала, че ако изживяваме поражение и негативни чувства в дадена ситуация, ние съзнателно и несъзнателно ще се опитваме да избегнем попадането в нея. Така например, ако всяко изпитване по предмет Х е свързано с унизителното изкарване пред целия клас и демонстрирането на силната позиция на учителя, а в същото време имаме ученик, който среща затруднения по дадения предмет, то не би трябвало да се чудим, когато този ученик не може да насмогне със знанията си и постепенно започне да избягва не само ученето по предмета, но и самите часове.

Обръщаме се към мотивационната психология за отговор, която прави разлика между вътрешна и външна мотивация. Външната мотивация са оценките – всяка оценка ни поставя етикет и ни отрежда мястото, което сме „заслужили“. Много по-важна обаче е вътрешната мотивация, която има три основни изграждащи я елемента: самостоятелност, компетентност и свързаност.

Самостоятелността е правото на избор, свободата да поемеш контрол. Всеки ученик в момента е обект на обучението си; не участва по никакъв начин във формирането на нито една от задачите, които трябва да изпълни. Дори самостоятелните задачи, като презентация или проект, са изкривени и се използват като инструмент за повишаване на същата тази оценка и учениците знаят това. Ако позволим на децата да изберат темата си, да се задълбочат в посоката, която ги влече заради личния им интерес, а не заради постижението, ще постигнем много повече от добър успех.

Под компетентност тук разбираме възможността да се усвояват нови умения. Дали 12-годишното еднотипно заучаване на предварително зададено съдържание развива нови умения? Може би, особено умения да се заобиколи всячески скуката. Да научим децата да общуват, да си сътрудничат и да мислят критично са само първите стъпки. Много по-важно е да бъдат развити умения – тогава знанието самò ще намери пътя си.

Свързаността е емоционалната част, чувството да си част от нещо. Всеки от нас има спомен за онзи учител, който го е вдъхновил да открие повече за себе си и света и е отворил нова врата към бъдещето. Да бъдеш приет и да бъдеш уважаван като отделен човек винаги води до желание да отвърнеш със същото. Да поставим децата над оценките им е подкрепата, от която се нуждаят. И няма нищо страшно плануваният урок понякога да отпадне заради тема, която силно е повлияла върху децата и те се нуждаят да говорят по нея.

Оценката е само израз на моментно състояние. Тя не показва колко сме добри, дали сме умни, дали сме успешни. Тя показва кои сме в точно този час – разсеяни, притеснени и паникьосани или спокойни, съсредоточени и подкрепени. По-важни от оценката са умението и желанието да учим през целия си живот, без да го приемаме като тегоба.

И понеже поставянето на оценка е неизбежно, нека поне се прави с обратна връзка. Уважителна, човешка и индивидуална обратна връзка в едно или няколко изречения. Когато е нужно – и в разговор.

Английската думата education, която означава и „образование“, и „възпитание“, произлиза от латинското educo, educare, което освен всичко друго означава и „отглеждам“. Отглеждането е сред процесите, през които минаваме, за да станем пълноценни членове на заобикалящия ни свят, за да можем един ден самостоятелно да поемем отговорност за себе си и околните и да разкрием и развием потенциала си. Често цитираната поговорка, че е нужно цяло село, за да се отгледа едно дете, всъщност е вярна. Освен на родителите си, детето разчита на останалите членове на семейството си, на учителите – в детската градина, в началното училище, в следващите степени, на треньорите си, на подкрепящите го във всеки негов интерес обучители, на професори, на колеги, ръководители и много, много още.

Не е ли абсурдно все още да смятаме, че образованието трябва да се състои от дългогодишно изсипване на факти върху деца, които нямат търпение да открият света и мястото си в него, и проверяването на степента на запомняне на тези факти под строг поглед? Дали слагането на етикет на всяко дете може да го мотивира и ако да – как и колко? За това дори не ни е нужна генерална реформа, а просто повече човечност. Защото не е задължително добрите оценки и успешният живот да вървят ръка за ръка.

Когато виждаме хората такива, каквито са, ги правим по-лоши. Но ако се отнасяме с тях, сякаш са това, което трябва да бъдат, тогава ги отвеждаме там, където трябва да отидат.

Йохан Волфганг фон Гьоте

Светът се променя с бясна скорост. Професиите, в които ще се развиват поколенията, започващи днес образователния си път, все още не са измислени. Подготвена ли е нашата образователна система, за да отговори на тези предизвикателства? Какво може и трябва да се промени? А как?

Веднъж месечно в рубриката „Възможното образование“ ще говорим за промяната – такава, каквато искаме да я видим, за добрите примери и за посоките, в които може би е добре да обърне поглед българската образователна система.

[$] LWN.net Weekly Edition for April 4, 2024

2024-04-04 corbet

Post Syndicated from corbet original https://lwn.net/Articles/966925/

The LWN.net Weekly Edition for April 4, 2024 is available.

This is the Astera Labs Aries 6 PCIe Gen6 and CXL Retimer

2024-04-04 Cliff Robinson

Post Syndicated from Cliff Robinson original https://www.servethehome.com/this-is-the-astera-labs-aries-6-pcie-gen6-and-cxl-retimer/

We saw the Astera Labs Aries 6 PCIe Gen6 and CXL retimer running at NVIDIA GTC 2024 and looking great compared to Broadcom’s planned chip

The post This is the Astera Labs Aries 6 PCIe Gen6 and CXL Retimer appeared first on ServeTheHome.

Financial expert on the types of investments you should have and how you should handle trends

2024-04-03 Talks at Google

Post Syndicated from Talks at Google original https://www.youtube.com/watch?v=YbNFROdr7Z8

Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions

2024-04-03 Andrea Filippo La Scola

Post Syndicated from Andrea Filippo La Scola original https://aws.amazon.com/blogs/big-data/amazon-datazone-now-integrates-with-aws-glue-data-quality-and-external-data-quality-solutions/

Today, we are pleased to announce that Amazon DataZone is now able to present data quality information for data assets. This information empowers end-users to make informed decisions as to whether or not to use specific assets.

Many organizations already use AWS Glue Data Quality to define and enforce data quality rules on their data, validate data against predefined rules, track data quality metrics, and monitor data quality over time using artificial intelligence (AI). Other organizations monitor the quality of their data through third-party solutions.

Amazon DataZone now integrates directly with AWS Glue to display data quality scores for AWS Glue Data Catalog assets. Additionally, Amazon DataZone now offers APIs for importing data quality scores from external systems.

In this post, we discuss the latest features of Amazon DataZone for data quality, the integration between Amazon DataZone and AWS Glue Data Quality and how you can import data quality scores produced by external systems into Amazon DataZone via API.

Challenges

One of the most common questions we get from customers is related to displaying data quality scores in the Amazon DataZone business data catalog to let business users have visibility into the health and reliability of the datasets.

As data becomes increasingly crucial for driving business decisions, Amazon DataZone users are keenly interested in providing the highest standards of data quality. They recognize the importance of accurate, complete, and timely data in enabling informed decision-making and fostering trust in their analytics and reporting processes.

Amazon DataZone data assets can be updated at varying frequencies. As data is refreshed and updated, changes can happen through upstream processes that put it at risk of not maintaining the intended quality. Data quality scores help you understand if data has maintained the expected level of quality for data consumers to use (through analysis or downstream processes).

From a producer’s perspective, data stewards can now set up Amazon DataZone to automatically import the data quality scores from AWS Glue Data Quality (scheduled or on demand) and include this information in the Amazon DataZone catalog to share with business users. Additionally, you can now use new Amazon DataZone APIs to import data quality scores produced by external systems into the data assets.

With the latest enhancement, Amazon DataZone users can now accomplish the following:

Access insights about data quality standards directly from the Amazon DataZone web portal
View data quality scores on various KPIs, including data completeness, uniqueness, accuracy
Make sure users have a holistic view of the quality and trustworthiness of their data.

In the first part of this post, we walk through the integration between AWS Glue Data Quality and Amazon DataZone. We discuss how to visualize data quality scores in Amazon DataZone, enable AWS Glue Data Quality when creating a new Amazon DataZone data source, and enable data quality for an existing data asset.

In the second part of this post, we discuss how you can import data quality scores produced by external systems into Amazon DataZone via API. In this example, we use Amazon EMR Serverless in combination with the open source library Pydeequ to act as an external system for data quality.

Visualize AWS Glue Data Quality scores in Amazon DataZone

You can now visualize AWS Glue Data Quality scores in data assets that have been published in the Amazon DataZone business catalog and that are searchable through the Amazon DataZone web portal.

If the asset has AWS Glue Data Quality enabled, you can now quickly visualize the data quality score directly in the catalog search pane.

By selecting the corresponding asset, you can understand its content through the readme, glossary terms, and technical and business metadata. Additionally, the overall quality score indicator is displayed in the Asset Details section.

A data quality score serves as an overall indicator of a dataset’s quality, calculated based on the rules you define.

On the Data quality tab, you can access the details of data quality overview indicators and the results of the data quality runs.

The indicators shown on the Overview tab are calculated based on the results of the rulesets from the data quality runs.

Each rule is assigned an attribute that contributes to the calculation of the indicator. For example, rules that have the Completeness attribute will contribute to the calculation of the corresponding indicator on the Overview tab.

To filter data quality results, choose the Applicable column dropdown menu and choose your desired filter parameter.

You can also visualize column-level data quality starting on the Schema tab.

When data quality is enabled for the asset, the data quality results become available, providing insightful quality scores that reflect the integrity and reliability of each column within the dataset.

When you choose one of the data quality result links, you’re redirected to the data quality detail page, filtered by the selected column.

Data quality historical results in Amazon DataZone

Data quality can change over time for many reasons:

Data formats may change because of changes in the source systems
As data accumulates over time, it may become outdated or inconsistent
Data quality can be affected by human errors in data entry, data processing, or data manipulation

In Amazon DataZone, you can now track data quality over time to confirm reliability and accuracy. By analyzing the historical report snapshot, you can identify areas for improvement, implement changes, and measure the effectiveness of those changes.

Enable AWS Glue Data Quality when creating a new Amazon DataZone data source

In this section, we walk through the steps to enable AWS Glue Data Quality when creating a new Amazon DataZone data source.

Prerequisites

To follow along, you should have a domain for Amazon DataZone, an Amazon DataZone project, and a new Amazon DataZone environment (with a DataLakeProfile). For instructions, refer to Amazon DataZone quickstart with AWS Glue data.

You also need to define and run a ruleset against your data, which is a set of data quality rules in AWS Glue Data Quality. To set up the data quality rules and for more information on the topic, refer to the following posts:

After you create the data quality rules, make sure that Amazon DataZone has the permissions to access the AWS Glue database managed through AWS Lake Formation. For instructions, see Configure Lake Formation permissions for Amazon DataZone.

In our example, we have configured a ruleset against a table containing patient data within a healthcare synthetic dataset generated using Synthea. Synthea is a synthetic patient generator that creates realistic patient data and associated medical records that can be used for testing healthcare software applications.

The ruleset contains 27 individual rules (one of them failing), so the overall data quality score is 96%.

If you use Amazon DataZone managed policies, there is no action needed because these will get automatically updated with the needed actions. Otherwise, you need to allow Amazon DataZone to have the required permissions to list and get AWS Glue Data Quality results, as shown in the Amazon DataZone user guide.

Create a data source with data quality enabled

In this section, we create a data source and enable data quality. You can also update an existing data source to enable data quality. We use this data source to import metadata information related to our datasets. Amazon DataZone will also import data quality information related to the (one or more) assets contained in the data source.

On the Amazon DataZone console, choose Data sources in the navigation pane.
Choose Create data source.
For Name, enter a name for your data source.
For Data source type, select AWS Glue.
For Environment, choose your environment.
For Database name, enter a name for the database.
For Table selection criteria, choose your criteria.
Choose Next.
For Data quality, select Enable data quality for this data source.

If data quality is enabled, Amazon DataZone will automatically fetch data quality scores from AWS Glue at each data source run.

Choose Next.

Now you can run the data source.

While running the data source, Amazon DataZone imports the last 100 AWS Glue Data Quality run results. This information is now visible on the asset page and will be visible to all Amazon DataZone users after publishing the asset.

Enable data quality for an existing data asset

In this section, we enable data quality for an existing asset. This might be useful for users that already have data sources in place and want to enable the feature afterwards.

Prerequisites

To follow along, you should have already run the data source and produced an AWS Glue table data asset. Additionally, you should have defined a ruleset in AWS Glue Data Quality over the target table in the Data Catalog.

For this example, we ran the data quality job multiple times against the table, producing the related AWS Glue Data Quality scores, as shown in the following screenshot.

Import data quality scores into the data asset

Complete the following steps to import the existing AWS Glue Data Quality scores into the data asset in Amazon DataZone:

Within the Amazon DataZone project, navigate to the Inventory data pane and choose the data source.

If you choose the Data quality tab, you can see that there’s still no information on data quality because AWS Glue Data Quality integration is not enabled for this data asset yet.

On the Data quality tab, choose Enable data quality.
In the Data quality section, select Enable data quality for this data source.
Choose Save.

Now, back on the Inventory data pane, you can see a new tab: Data quality.

On the Data quality tab, you can see data quality scores imported from AWS Glue Data Quality.

Ingest data quality scores from an external source using Amazon DataZone APIs

Many organizations already use systems that calculate data quality by performing tests and assertions on their datasets. Amazon DataZone now supports importing third-party originated data quality scores via API, allowing users that navigate the web portal to view this information.

In this section, we simulate a third-party system pushing data quality scores into Amazon DataZone via APIs through Boto3 (Python SDK for AWS).

For this example, we use the same synthetic dataset as earlier, generated with Synthea.

The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

Read a dataset of patients in Amazon Simple Storage Service (Amazon S3) directly from Amazon EMR using Spark.

The dataset is created as a generic S3 asset collection in Amazon DataZone.

In Amazon EMR, perform data validation rules against the dataset.
The metrics are saved in Amazon S3 to have a persistent output.
Use Amazon DataZone APIs through Boto3 to push custom data quality metadata.
End-users can see the data quality scores by navigating to the data portal.

Prerequisites

We use Amazon EMR Serverless and Pydeequ to run a fully managed Spark environment. To learn more about Pydeequ as a data testing framework, see Testing Data quality at scale with Pydeequ.

To allow Amazon EMR to send data to the Amazon DataZone domain, make sure that the IAM role used by Amazon EMR has the permissions to do the following:

Read from and write to the S3 buckets

Call the post_time_series_data_points action for Amazon DataZone:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": [
                "datazone:PostTimeSeriesDataPoints"
            ],
            "Resource": [
                "<datazone_domain_arn>"
            ]
        }
    ]
}

Make sure that you added the EMR role as a project member in the Amazon DataZone project. On the Amazon DataZone console, navigate to the Project members page and choose Add members.

Add the EMR role as a contributor.

Ingest and analyze PySpark code

In this section, we analyze the PySpark code that we use to perform data quality checks and send the results to Amazon DataZone. You can download the complete PySpark script.

To run the script entirely, you can submit a job to EMR Serverless. The service will take care of scheduling the job and automatically allocating the resources needed, enabling you to track the job run statuses throughout the process.

You can submit a job to EMR within the Amazon EMR console using EMR Studio or programmatically, using the AWS CLI or using one of the AWS SDKs.

In Apache Spark, a SparkSession is the entry point for interacting with DataFrames and Spark’s built-in functions. The script will start initializing a SparkSession:

with SparkSession.builder.appName("PatientsDataValidation") \
        .config("spark.jars.packages", pydeequ.deequ_maven_coord) \
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord) \
        .getOrCreate() as spark:

We read a dataset from Amazon S3. For increased modularity, you can use the script input to refer to the S3 path:

s3inputFilepath = sys.argv[1]
s3outputLocation = sys.argv[2]

df = spark.read.format("csv") \
            .option("header", "true") \
            .option("inferSchema", "true") \
            .load(s3inputFilepath) #s3://<bucket_name>/patients/patients.csv

Next, we set up a metrics repository. This can be helpful to persist the run results in Amazon S3.

metricsRepository = FileSystemMetricsRepository(spark, s3_write_path)

Pydeequ allows you to create data quality rules using the builder pattern, which is a well-known software engineering design pattern, concatenating instruction to instantiate a VerificationSuite object:

key_tags = {'tag': 'patient_df'}
resultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags)

check = Check(spark, CheckLevel.Error, "Integrity checks")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .useRepository(metricsRepository) \
    .addCheck(
        check.hasSize(lambda x: x >= 1000) \
        .isComplete("birthdate")  \
        .isUnique("id")  \
        .isComplete("ssn") \
        .isComplete("first") \
        .isComplete("last") \
        .hasMin("healthcare_coverage", lambda x: x == 1000.0)) \
    .saveOrAppendResult(resultKey) \
    .run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()

The following is the output for the data validation rules:

+----------------+-----------+------------+----------------------------------------------------+-----------------+----------------------------------------------------+
|check           |check_level|check_status|constraint                                          |constraint_status|constraint_message                                  |
+----------------+-----------+------------+----------------------------------------------------+-----------------+----------------------------------------------------+
|Integrity checks|Error      |Error       |SizeConstraint(Size(None))                          |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(birthdate,None))|Success          |                                                    |
|Integrity checks|Error      |Error       |UniquenessConstraint(Uniqueness(List(id),None))     |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(ssn,None))      |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(first,None))    |Success          |                                                    |
|Integrity checks|Error      |Error       |CompletenessConstraint(Completeness(last,None))     |Success          |                                                    |
|Integrity checks|Error      |Error       |MinimumConstraint(Minimum(healthcare_coverage,None))|Failure          |Value: 0.0 does not meet the constraint requirement!|
+----------------+-----------+------------+----------------------------------------------------+-----------------+----------------------------------------------------+

At this point, we want to insert these data quality values in Amazon DataZone. To do so, we use the post_time_series_data_points function in the Boto3 Amazon DataZone client.

The PostTimeSeriesDataPoints DataZone API allows you to insert new time series data points for a given asset or listing, without creating a new revision.

At this point, you might also want to have more information on which fields are sent as input for the API. You can use the APIs to obtain the specification for Amazon DataZone form types; in our case, it’s amazon.datazone.DataQualityResultFormType.

You can also use the AWS CLI to invoke the API and display the form structure:

aws datazone get-form-type --domain-identifier <your_domain_id> --form-type-identifier amazon.datazone.DataQualityResultFormType --region <domain_region> --output text --query 'model.smithy'

This output helps identify the required API parameters, including fields and value limits:

$version: "2.0"
namespace amazon.datazone
structure DataQualityResultFormType {
    @amazon.datazone#timeSeriesSummary
    @range(min: 0, max: 100)
    passingPercentage: Double
    @amazon.datazone#timeSeriesSummary
    evaluationsCount: Integer
    evaluations: EvaluationResults
}
@length(min: 0, max: 2000)
list EvaluationResults {
    member: EvaluationResult
}

@length(min: 0, max: 20)
list ApplicableFields {
    member: String
}

@length(min: 0, max: 20)
list EvaluationTypes {
    member: String
}

enum EvaluationStatus {
    PASS,
    FAIL
}

string EvaluationDetailType

map EvaluationDetails {
    key: EvaluationDetailType
    value: String
}

structure EvaluationResult {
    description: String
    types: EvaluationTypes
    applicableFields: ApplicableFields
    status: EvaluationStatus
    details: EvaluationDetails
}

To send the appropriate form data, we need to convert the Pydeequ output to match the DataQualityResultsFormType contract. This can be achieved with a Python function that processes the results.

For each DataFrame row, we extract information from the constraint column. For example, take the following code:

CompletenessConstraint(Completeness(birthdate,None))

We convert it to the following:

{
  "constraint": "CompletenessConstraint",
  "statisticName": "Completeness_custom",
  "column": "birthdate"
}

Make sure to send an output that matches the KPIs that you want to track. In our case, we are appending _custom to the statistic name, resulting in the following format for KPIs:

Completeness_custom
Uniqueness_custom

In a real-world scenario, you might want to set a value that matches with your data quality framework in relation to the KPIs that you want to track in Amazon DataZone.

After applying a transformation function, we have a Python object for each rule evaluation:

..., {
   'applicableFields': ["healthcare_coverage"],
   'types': ["Minimum_custom"],
   'status': 'FAIL',
   'description': 'MinimumConstraint - Minimum - Value: 0.0 does not meet the constraint requirement!'
 },...

We also use the constraint_status column to compute the overall score:

(number of success / total number of evaluation) * 100

In our example, this results in a passing percentage of 85.71%.

We set this value in the passingPercentage input field along with the other information related to the evaluations in the input of the Boto3 method post_time_series_data_points:

import boto3

# Instantiate the client library to communicate with Amazon DataZone Service
#
datazone = boto3.client(
    service_name='datazone', 
    region_name=<Region(String) example: us-east-1>
)

# Perform the API operation to push the Data Quality information to Amazon DataZone
#
datazone.post_time_series_data_points(
    domainIdentifier=<DataZone domain ID>,
    entityIdentifier=<DataZone asset ID>,
    entityType='ASSET',
    forms=[
        {
            "content": json.dumps({
                    "evaluationsCount":<Number of evaluations (number)>,
                    "evaluations": [<List of objects {
                        'description': <Description (String)>,
                        'applicableFields': [<List of columns involved (String)>],
                        'types': [<List of KPIs (String)>],
                        'status': <FAIL/PASS (string)>
                        }>
                     ],
                    "passingPercentage":<Score (number)>
                }),
            "formName": <Form name(String) example: PydeequRuleSet1>,
            "typeIdentifier": "amazon.datazone.DataQualityResultFormType",
            "timestamp": <Date (timestamp)>
        }
    ]
)

Boto3 invokes the Amazon DataZone APIs. In these examples, we used Boto3 and Python, but you can choose one of the AWS SDKs developed in the language you prefer.

After setting the appropriate domain and asset ID and running the method, we can check on the Amazon DataZone console that the asset data quality is now visible on the asset page.

We can observe that the overall score matches with the API input value. We can also see that we were able to add customized KPIs on the overview tab through custom types parameter values.

With the new Amazon DataZone APIs, you can load data quality rules from third-party systems into a specific data asset. With this capability, Amazon DataZone allows you to extend the types of indicators present in AWS Glue Data Quality (such as completeness, minimum, and uniqueness) with custom indicators.

Clean up

We recommend deleting any potentially unused resources to avoid incurring unexpected costs. For example, you can delete the Amazon DataZone domain and the EMR application you created during this process.

Conclusion

In this post, we highlighted the latest features of Amazon DataZone for data quality, empowering end-users with enhanced context and visibility into their data assets. Furthermore, we delved into the seamless integration between Amazon DataZone and AWS Glue Data Quality. You can also use the Amazon DataZone APIs to integrate with external data quality providers, enabling you to maintain a comprehensive and robust data strategy within your AWS environment.

To learn more about Amazon DataZone, refer to the Amazon DataZone User Guide.

About the Authors

Andrea Filippo is a Partner Solutions Architect at AWS supporting Public Sector partners and customers in Italy. He focuses on modern data architectures and helping customers accelerate their cloud journey with serverless technologies.

Emanuele is a Solutions Architect at AWS, based in Italy, after living and working for more than 5 years in Spain. He enjoys helping large companies with the adoption of cloud technologies, and his area of expertise is mainly focused on Data Analytics and Data Management. Outside of work, he enjoys traveling and collecting action figures.

Varsha Velagapudi is a Senior Technical Product Manager with Amazon DataZone at AWS. She focuses on improving data discovery and curation required for data analytics. She is passionate about simplifying customers’ AI/ML and analytics journey to help them succeed in their day-to-day tasks. Outside of work, she enjoys nature and outdoor activities, reading, and traveling.

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

2024-04-03 Andries Engelbrecht

Post Syndicated from Andries Engelbrecht original https://aws.amazon.com/blogs/big-data/use-apache-iceberg-in-your-data-lake-with-amazon-s3-aws-glue-and-snowflake/

This is post is co-written with Andries Engelbrecht and Scott Teal from Snowflake.

Businesses are constantly evolving, and data leaders are challenged every day to meet new requirements. For many enterprises and large organizations, it is not feasible to have one processing engine or tool to deal with the various business requirements. They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture to accelerate the delivery of new solutions.

Customers are using AWS and Snowflake to develop purpose-built data architectures that provide the performance required for modern analytics and artificial intelligence (AI) use cases. Implementing these solutions requires data sharing between purpose-built data stores. This is why Snowflake and AWS are delivering enhanced support for Apache Iceberg to enable and facilitate data interoperability between data services.

Apache Iceberg is an open-source table format that provides reliability, simplicity, and high performance for large datasets with transactional integrity between various processing engines. In this post, we discuss the following:

Advantages of Iceberg tables for data lakes
Two architectural patterns for sharing Iceberg tables between AWS and Snowflake:
- Manage your Iceberg tables with AWS Glue Data Catalog
- Manage your Iceberg tables with Snowflake
The process of converting existing data lakes tables to Iceberg tables without copying the data

Now that you have a high-level understanding of the topics, let’s dive into each of them in detail.

Advantages of Apache Iceberg

Apache Iceberg is a distributed, community-driven, Apache 2.0-licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time. Apache Iceberg offers integrations with popular data processing frameworks such as Apache Spark, Apache Flink, Apache Hive, Presto, and more.

Iceberg tables maintain metadata to abstract large collections of files, providing data management features including time travel, rollback, data compaction, and full schema evolution, reducing management overhead. Originally developed at Netflix before being open sourced to the Apache Software Foundation, Apache Iceberg was a blank-slate design to solve common data lake challenges like user experience, reliability, and performance, and is now supported by a robust community of developers focused on continually improving and adding new features to the project, serving real user needs and providing them with optionality.

Transactional data lakes built on AWS and Snowflake

Snowflake provides various integrations for Iceberg tables with multiple storage options, including Amazon S3, and multiple catalog options, including AWS Glue Data Catalog and Snowflake. AWS provides integrations for various AWS services with Iceberg tables as well, including AWS Glue Data Catalog for tracking table metadata. Combining Snowflake and AWS gives you multiple options to build out a transactional data lake for analytical and other use cases such as data sharing and collaboration. By adding a metadata layer to data lakes, you get a better user experience, simplified management, and improved performance and reliability on very large datasets.

Manage your Iceberg table with AWS Glue

You can use AWS Glue to ingest, catalog, transform, and manage the data on Amazon Simple Storage Service (Amazon S3). AWS Glue is a serverless data integration service that allows you to visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes in Iceberg format. With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. Snowflake integrates with AWS Glue Data Catalog to access the Iceberg table catalog and the files on Amazon S3 for analytical queries. This greatly improves performance and compute cost in comparison to external tables on Snowflake, because the additional metadata improves pruning in query plans.

You can use this same integration to take advantage of the data sharing and collaboration capabilities in Snowflake. This can be very powerful if you have data in Amazon S3 and need to enable Snowflake data sharing with other business units, partners, suppliers, or customers.

The following architecture diagram provides a high-level overview of this pattern.

The workflow includes the following steps:

AWS Glue extracts data from applications, databases, and streaming sources. AWS Glue then transforms it and loads it into the data lake in Amazon S3 in Iceberg table format, while inserting and updating the metadata about the Iceberg table in AWS Glue Data Catalog.
The AWS Glue crawler generates and updates Iceberg table metadata and stores it in AWS Glue Data Catalog for existing Iceberg tables on an S3 data lake.
Snowflake integrates with AWS Glue Data Catalog to retrieve the snapshot location.
In the event of a query, Snowflake uses the snapshot location from AWS Glue Data Catalog to read Iceberg table data in Amazon S3.
Snowflake can query across Iceberg and Snowflake table formats. You can share data for collaboration with one or more accounts in the same Snowflake region. You can also use data in Snowflake for visualization using Amazon QuickSight, or use it for machine learning (ML) and artificial intelligence (AI) purposes with Amazon SageMaker.

Manage your Iceberg table with Snowflake

A second pattern also provides interoperability across AWS and Snowflake, but implements data engineering pipelines for ingestion and transformation to Snowflake. In this pattern, data is loaded to Iceberg tables by Snowflake through integrations with AWS services like AWS Glue or through other sources like Snowpipe. Snowflake then writes data directly to Amazon S3 in Iceberg format for downstream access by Snowflake and various AWS services, and Snowflake manages the Iceberg catalog that tracks snapshot locations across tables for AWS services to access.

Like the previous pattern, you can use Snowflake-managed Iceberg tables with Snowflake data sharing, but you can also use S3 to share datasets in cases where one party does not have access to Snowflake.

The following architecture diagram provides an overview of this pattern with Snowflake-managed Iceberg tables.

This workflow consists of the following steps:

In addition to loading data via the COPY command, Snowpipe, and the native Snowflake connector for AWS Glue, you can integrate data via the Snowflake Data Sharing.
Snowflake writes Iceberg tables to Amazon S3 and updates metadata automatically with every transaction.
Iceberg tables in Amazon S3 are queried by Snowflake for analytical and ML workloads using services like QuickSight and SageMaker.
Apache Spark services on AWS can access snapshot locations from Snowflake via a Snowflake Iceberg Catalog SDK and directly scan the Iceberg table files in Amazon S3.

Comparing solutions

These two patterns highlight options available to data personas today to maximize their data interoperability between Snowflake and AWS using Apache Iceberg. But which pattern is ideal for your use case? If you’re already using AWS Glue Data Catalog and only require Snowflake for read queries, then the first pattern can integrate Snowflake with AWS Glue and Amazon S3 to query Iceberg tables. If you’re not already using AWS Glue Data Catalog and require Snowflake to perform reads and writes, then the second pattern is likely a good solution that allows for storing and accessing data from AWS.

Considering that reads and writes will probably operate on a per-table basis rather than the entire data architecture, it is advisable to use a combination of both patterns.

Migrate existing data lakes to a transactional data lake using Apache Iceberg

You can convert existing Parquet, ORC, and Avro-based data lake tables on Amazon S3 to Iceberg format to reap the benefits of transactional integrity while improving performance and user experience. There are several Iceberg table migration options (SNAPSHOT, MIGRATE, and ADD_FILES) for migrating existing data lake tables in-place to Iceberg format, which is preferable to rewriting all of the underlying data files—a costly and time-consuming effort with large datasets. In this section, we focus on ADD_FILES, because it’s useful for custom migrations.

For ADD_FILES options, you can use AWS Glue to generate Iceberg metadata and statistics for an existing data lake table and create new Iceberg tables in AWS Glue Data Catalog for future use without needing to rewrite the underlying data. For instructions on generating Iceberg metadata and statistics using AWS Glue, refer to Migrate an existing data lake to a transactional data lake using Apache Iceberg or Convert existing Amazon S3 data lake tables to Snowflake Unmanaged Iceberg tables using AWS Glue.

This option requires that you pause data pipelines while converting the files to Iceberg tables, which is a straightforward process in AWS Glue because the destination just needs to be changed to an Iceberg table.

Conclusion

In this post, you saw the two architecture patterns for implementing Apache Iceberg in a data lake for better interoperability across AWS and Snowflake. We also provided guidance on migrating existing data lake tables to Iceberg format.

Sign up for AWS Dev Day on April 10 to get hands-on not only with Apache Iceberg, but also with streaming data pipelines with Amazon Data Firehose and Snowpipe Streaming, and generative AI applications with Streamlit in Snowflake and Amazon Bedrock.

About the Authors

Andries Engelbrecht is a Principal Partner Solutions Architect at Snowflake and works with strategic partners. He is actively engaged with strategic partners like AWS supporting product and service integrations as well as the development of joint solutions with partners. Andries has over 20 years of experience in the field of data and analytics.

Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data architectures on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions.

Brian Dolan joined Amazon as a Military Relations Manager in 2012 after his first career as a Naval Aviator. In 2014, Brian joined Amazon Web Services, where he helped Canadian customers from startups to enterprises explore the AWS Cloud. Most recently, Brian was a member of the Non-Relational Business Development team as a Go-To-Market Specialist for Amazon DynamoDB and Amazon Keyspaces before joining the Analytics Worldwide Specialist Organization in 2022 as a Go-To-Market Specialist for AWS Glue.

Nidhi Gupta is a Sr. Partner Solution Architect at AWS. She spends her days working with customers and partners, solving architectural challenges. She is passionate about data integration and orchestration, serverless and big data processing, and machine learning. Nidhi has extensive experience leading the architecture design and production release and deployments for data workloads.

Scott Teal is a Product Marketing Lead at Snowflake and focuses on data lakes, storage, and governance.

AlmaLinux OS – CVE-2024-1086 and XZ (AlmaLinux blog)

2024-04-03 jzb

Post Syndicated from jzb original https://lwn.net/Articles/968299/

AlmaLinux has announced
updated kernels for AlmaLinux 8 and 9 to address CVE-2024-1086, a
use-after-free vulnerability in the kernel that could be exploited to
gain local privilege escalation. This is notable because the fix
marks a divergence between AlmaLinux and Red Hat Enterprise Linux (RHEL):

In January of this year, a kernel flaw was disclosed and named CVE-2024-1086.
This flaw is trivially exploitable on most RHEL-equivalent
systems. There are many proof-of-concept posts available now,
including one from our Infrastructure team lead, Jonathan Wright (Dealing
with CVE-2024-1086). In multi-user scenarios, this flaw is
especially problematic.

Though this was flagged as something to be fixed in Red Hat
Enterprise Linux, Red Hat has only rated this as a moderate
impact.

The AlmaLinux project would also like to note that it is not
impacted by the XZ backdoor. “Because enterprise Linux takes a bit
longer to adopt those updates (sometimes to the chagrin of our users),
the version of XZ that had the back door inserted hadn’t made it
further than Fedora in our ecosystem.“

Malcolm: Improvements to static analysis in the GCC 14 compiler

2024-04-03 corbet

Post Syndicated from corbet original https://lwn.net/Articles/968297/

David Malcolm writes
about some static-analyzer features that are coming in the GCC 14
release.

Solving the halting problem?

Obviously I’m kidding with the title here, but for GCC 14 I’ve
implemented a new warning: -Wanalyzer-infinite-loop that’s able to
detect some simple cases of infinite loops.

See also: this report from the 2023 GNU
Tools Cauldron.

Simplify your query management with search templates in Amazon OpenSearch Service

2024-04-03 Arun Lakshmanan

Post Syndicated from Arun Lakshmanan original https://aws.amazon.com/blogs/big-data/simplify-your-query-management-with-search-templates-in-amazon-opensearch-service/

Amazon OpenSearch Service is an Apache-2.0-licensed distributed search and analytics suite offered by AWS. This fully managed service allows organizations to secure data, perform keyword and semantic search, analyze logs, alert on anomalies, explore interactive log analytics, implement real-time application monitoring, and gain a more profound understanding of their information landscape. OpenSearch Service provides the tools and resources needed to unlock the full potential of your data. With its scalability, reliability, and ease of use, it’s a valuable solution for businesses seeking to optimize their data-driven decision-making processes and improve overall operational efficiency.

This post delves into the transformative world of search templates. We unravel the power of search templates in revolutionizing the way you handle queries, providing a comprehensive guide to help you navigate through the intricacies of this innovative solution. From optimizing search processes to saving time and reducing complexities, discover how incorporating search templates can elevate your query management game.

Search templates

Search templates empower developers to articulate intricate queries within OpenSearch, enabling their reuse across various application scenarios, eliminating the complexity of query generation in the code. This flexibility also grants you the ability to modify your queries without requiring application recompilation. Search templates in OpenSearch use the mustache template, which is a logic-free templating language. Search templates can be reused by their name. A search template that is based on mustache has a query structure and placeholders for the variable values. You use the _search API to query, specifying the actual values that OpenSearch should use. You can create placeholders for variables that will be changed to their true values at runtime. Double curly braces ({{}}) serve as placeholders in templates.

Mustache enables you to generate dynamic filters or queries based on the values passed in the search request, making your search requests more flexible and powerful.

In the following example, the search template runs the query in the “source” block by passing in the values for the field and value parameters from the “params” block:

GET /myindex/_search/template
 { 
      "source": {   
         "query": { 
             "bool": {
               "must": [
                 {
                   "match": {
                    "{{field}}": "{{value}}"
                 }
             }
        ]
     }
    }
  },
 "params": {
    "field": "place",
    "value": "sweethome"
  }
}

You can store templates in the cluster with a name and refer to them in a search instead of attaching the template in each request. You use the PUT _scripts API to publish a template to the cluster. Let’s say you have an index of books, and you want to search for books with publication date, ratings, and price. You could create and publish a search template as follows:

PUT /_scripts/find_book
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "bool": {
          "must": [
            {
              "range": {
                "publish_date": {
                  "gte": "{{gte_date}}"
                }
              }
            },
            {
              "range": {
                "rating": {
                  "gte": "{{gte_rating}}"
                }
              }
            },
            {
              "range": {
                "price": {
                  "lte": "{{lte_price}}"
                }
              }
            }
          ]
        }
      }
    }
  }
}

In this example, you define a search template called find_book that uses the mustache template language with defined placeholders for the gte_date, gte_rating, and lte_price parameters.

To use the search template stored in the cluster, you can send a request to OpenSearch with the appropriate parameters. For example, you can search for products that have been published in the last year with ratings greater than 4.0, and priced less than $20:

POST /books/_search/template
{
  "id": "find_book",
  "params": {
    "gte_date": "now-1y",
    "gte_rating": 4.0,
    "lte_price": 20
  }
}

This query will return all books that have been published in the last year, with a rating of at least 4.0, and a price less than $20 from the books index.

Default values in search templates

Default values are values that are used for search parameters when the query that engages the template doesn’t specify values for them. In the context of the find_book example, you can set default values for the from, size, and gte_date parameters in case they are not provided in the search request. To set default values, you can use the following mustache template:

PUT /_scripts/find_book
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "publish_date": {
                  "gte": "{{gte_date}}{{^gte_date}}now-1y{{/gte_date}}"
                }
              }
            },
            {
              "range": {
                "rating": {
                  "gte": "{{gte_rating}}"
                }
              }
            },
            {
              "range": {
                "price": {
                  "lte": "{{lte_price}}"
                }
              }
            }
          ]
        },
        "from": "{{from}}{{^from}}0{{/from}}",
        "size": "{{size}}{{^size}}2{{/size}}"
      }
    }
  }
}

In this template, the {{from}}, {{size}}, and {{gte_date}} parameters are placeholders that can be filled in with specific values when the template is used in a search. If no value is specified for {{from}}, {{size}}, and {{gte_date}}, OpenSearch uses the default values of 0, 2, and now-1y, respectively. This means that if a user searches for products without specifying from, size, and gte_date, the search will return just two products matching the search criteria for 1 year.

You can also use the render API as follows if you have a stored template and want to validate it:

POST _render/template
{
  "id": "find_book",
  "params": {
    "gte_date": "now-1y",
    "gte_rating": 4.0,
    "lte_price": 20
  }
}

Conditions in search templates

The conditional statement that allows you to control the flow of your search template based on certain conditions. It’s often used to include or exclude certain parts of the search request based on certain parameters. The syntax as follows:

{{#Any condition}}
  ... code to execute if the condition is true ...
{{/Any}}

The following example searches for books based on the gte_date, gte_rating, and lte_price parameters and an optional stock parameter. The if condition is used to include the condition_block/term query only if the stock parameter is present in the search request. If the is_available parameter is not present, the condition_block/term query will be skipped.

GET /books/_search/template
{
  "source": """{
    "query": {
      "bool": {
        "must": [
        {{#is_available}}
        {
          "term": {
            "in_stock": "{{is_available}}"
          }
        },
        {{/is_available}}
          {
            "range": {
              "publish_date": {
                "gte": "{{gte_date}}"
              }
            }
          },
          {
            "range": {
              "rating": {
                "gte": "{{gte_rating}}"
              }
            }
          },
          {
            "range": {
              "price": {
                "lte": "{{lte_price}}"
              }
            }
          }
        ]
      }
    }
  }""",
  "params": {
    "gte_date": "now-3y",
    "gte_rating": 4.0,
    "lte_price": 20,
    "is_available": true
  }
}

By using a conditional statement in this way, you can make your search requests more flexible and efficient by only including the necessary filters when they are needed.

To make the query valid inside the JSON, it needs to be escaped with triple quotes (""") in the payload.

Loops in search templates

A loop is a feature of mustache templates that allows you to iterate over an array of values and run the same code block for each item in the array. It’s often used to generate a dynamic list of filters or queries based on the values passed in the search request. The syntax is as follows:

{{#list item in array}}
  ... code to execute for each item ...
{{/list}}

The following example searches for books based on a query string ({{query}}) and an array of categories to filter the search results. The mustache loop is used to generate a match filter for each item in the categories array.

GET books/_search/template
{
  "source": """{
    "query": {
      "bool": {
        "must": [
        {{#list}}
        {
          "match": {
            "category": "{{list}}"
          }
        }
        {{/list}}
          {
          "match": {
            "title": "{{name}}"
          }
        }
        ]
      }
    }
  }""",
  "params": {
    "name": "killer",
    "list": ["Classics", "comics", "Horror"]
  }
}

The search request is rendered as follows:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "killer"
          }
        },
        {
          "match": {
            "category": "Classics"
          }
        },
        {
          "match": {
            "category": "comics"
          }
        },
        {
          "match": {
            "category": "Horror"
          }
        }
      ]
    }
  }
}

The loop has generated a match filter for each item in the categories array, resulting in a more flexible and efficient search request that filters by multiple categories. By using the loops, you can generate dynamic filters or queries based on the values passed in the search request, making your search requests more flexible and powerful.

Advantages of using search templates

The following are key advantages of using search templates:

Maintainability – By separating the query definition from the application code, search templates make it straightforward to manage changes to the query or tune search relevancy. You don’t have to compile and redeploy your application.
Consistency – You can construct search templates that allow you to design standardized query patterns and reuse them throughout your application, which can help maintain consistency across your queries.
Readability – Because templates can be constructed using a more terse and expressive syntax, complicated queries are straightforward to test and debug.
Testing – Search templates can be tested and debugged independently of the application code, facilitating simpler problem-solving and relevancy tuning without having to re-deploy the application. You can easily create A/B testing with different templates for the same search.
Flexibility – Search templates can be quickly updated or adjusted to account for modifications to the data or search specifications.

Best practices

Consider the following best practices when using search templates:

Before deploying your template to production, make sure it is fully tested. You can test the effectiveness and correctness of your template with example data. It is highly recommended to run the application tests that use these templates before publishing.
Search templates allow for the addition of input parameters, which you can use to modify the query to suit the needs of a particular use case. Reusing the same template with varied inputs is made simpler by parameterizing the inputs.
Manage the templates in an external source control system.
Avoid hard-coding values inside the query—instead, use defaults.

Conclusion

In this post, you learned the basics of search templates, a powerful feature of OpenSearch, and how templates help streamline search queries and improve performance. With search templates, you can build more robust search applications in less time.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.

Stay tuned for more exciting updates and new features in OpenSearch Service.

About the authors

Arun Lakshmanan is a Search Specialist with Amazon OpenSearch Service based out of Chicago, IL. He has over 20 years of experience working with enterprise customers and startups. He loves to travel and spend quality time with his family.

Madhan Kumar Baskaran works as a Search Engineer at AWS, specializing in Amazon OpenSearch Service. His primary focus involves assisting customers in constructing scalable search applications and analytics solutions. Based in Bengaluru, India, Madhan has a keen interest in data engineering and DevOps.

Four stable kernel updates

2024-04-03 jzb

Post Syndicated from jzb original https://lwn.net/Articles/968250/

The 6.8.3, 6.7.12, 6.6.24, and 6.1.84 stable kernel updates have been
released. Each contains an important set of fixes. Note that 6.7.12 is
the final release for the 6.7.y series, and that branch is now
end-of-life. Users should move to the 6.8.y branch.

[$] A memory model for Rust code in the kernel

2024-04-03 corbet

Post Syndicated from corbet original https://lwn.net/Articles/967049/

The Rust programming language differs from C in many ways; those
differences tend to be what users admire in the language. But those
differences can also lead to an impedance mismatch when Rust code is
integrated into a C-dominated system, and it can be even worse in the
kernel, which is not a typical C program. Memory models are a case in
point. A programming language’s view of memory is sufficiently fundamental
and arcane that many developers never have to learn much about it. It is
hard to maintain that sort of blissful ignorance while working in the
kernel, though, so a recent discussion of how to choose a memory model for
kernel code in Rust is of interest.

KDE6 release: D-Bus and Polkit Galore (SUSE security team blog)

2024-04-03 corbet

Post Syndicated from corbet original https://lwn.net/Articles/968220/

The SUSE Security Team Blog is carrying a
detailed article on SUSE’s review of the KDE6 release.

The SUSE security team restricts the installation of system wide
D-Bus services and Polkit policies in openSUSE distributions and
derived SUSE products. Any package that ships these features needs
to be reviewed by us first, before it can be added to production
repositories.

In November, openSUSE KDE packagers approached us with a long list
of KDE components for an upcoming KDE6 major release. The packages
needed adjusted D-Bus and Polkit whitelistings due to renamed
interfaces or other breaking changes. Looking into this many
components at once was a unique experience that also led to new
insights, which will be discussed in this article.

Security updates for Wednesday

2024-04-03 jzb

Post Syndicated from jzb original https://lwn.net/Articles/968218/

Security updates have been issued by Debian (py7zr), Fedora (biosig4c++ and podman), Oracle (kernel, kernel-container, and ruby:3.1), Red Hat (.NET 7.0, bind9.16, curl, expat, grafana, grafana-pcp, kernel, kernel-rt, kpatch-patch, less, opencryptoki, and postgresql-jdbc), and Ubuntu (cacti).

Redict 7.3.0 released

2024-04-03 corbet

Post Syndicated from corbet original https://lwn.net/Articles/968183/

The first stable release of Redict, a fork of the Redis in-memory database
under a copyleft license, has been announced.

You may be wondering why Redict would be of interest to you,
particularly when compared with Valkey,
another Redis fork that was announced on Thursday.

In technical terms, we are focusing on stability and long-term
maintenance, and on achieving excellence within our current
scope. We believe that Redict is near feature-complete and that it
is more valuable to our users if we take a conservative stance to
innovation and focus on long-term reliability instead. This is in
part a choice we’ve made to distinguish ourselves from Valkey,
whose commercial interests are able to invest more resources into
developing more radical innovations, but also an acknowledgement of
a cultural difference between our projects, in that the folks
behind Redict place greater emphasis on software with a finite
scope and ambitions towards long-term stability rather than
focusing on long-term growth in scope and complexity.

Improving Cloudflare Workers and D1 developer experience with Prisma ORM

2024-04-03 Jon Harrell (Guest Author)

Post Syndicated from Jon Harrell (Guest Author) original https://blog.cloudflare.com/prisma-orm-and-d1

Working with databases can be difficult. Developers face increasing data complexity and needs beyond simple create, read, update, and delete (CRUD) operations. Unfortunately, these issues also compound on themselves: developers have a harder time iterating in an increasingly complex environment. Cloudflare Workers and D1 help by reducing time spent managing infrastructure and deploying applications, and Prisma provides a great experience for your team to work and interact with data.

Together, Cloudflare and Prisma make it easier than ever to deploy globally available apps with a focus on developer experience. To further that goal, Prisma Object Relational Mapper (ORM) now natively supports Cloudflare Workers and D1 in Preview. With version 5.12.0 of Prisma ORM you can now interact with your data stored in D1 from your Cloudflare Workers with the convenience of the Prisma Client API. Learn more and try it out now.

What is Prisma?

From writing to debugging, SQL queries take a long time and slow developer productivity. Even before writing queries, modeling tables can quickly become unwieldy, and migrating data is a nerve-wracking process. Prisma ORM looks to resolve all of these issues by providing an intuitive data modeling language, an automated migration workflow, and a developer-friendly and type-safe client for JavaScript and TypeScript, allowing developers to focus on what they enjoy: developing!

Prisma is focused on making working with data easy. Alongside an ORM, Prisma offers Accelerate and Pulse, products built on Cloudflare that cover needs from connection pooling, to query caching, to real-time type-safe database subscriptions.

How to get started with Prisma ORM, Cloudflare Workers, and D1

To get started with Prisma ORM and D1, first create a basic Cloudflare Workers app. This guide will start with the ”Hello World” Worker example app, but any Workers example app will work. If you don’t have a project yet, start by creating a new one. Name your project something memorable, like my-d1-prisma-app and select “Hello World” worker and TypeScript. For now, we will choose to not deploy and will wait until after we have set up D1 and Prisma ORM.

npm create cloudflare@latest

Next, move into your newly created project and make sure that dependencies are installed:

cd my-d1-prisma-app && npm install

After dependencies are installed, we can move on to the D1 setup.

First, create a new D1 database for your app.

npx wrangler d1 create prod-prisma-d1-app
.
.
.

[[d1_databases]]
binding = "DB" # i.e. available in your Worker on env.DB
database_name = "prod-prisma-d1-app"
database_id = "<unique-ID-for-your-database>"

The section starting with [[d1_databases]] is the binding configuration needed in your wrangler.toml for your Worker to communicate with D1. Add that now:

// wrangler.toml
name="my-d1-prisma-app"
main = "src/index.ts"
compatibility_date = "2024-03-20"
compatibility_flags = ["nodejs_compat"]

[[d1_databases]]
binding = "DB" # i.e. available in your Worker on env.DB
database_name = "prod-prisma-d1-app"
database_id = "<unique-ID-for-your-database>"

Your application now has D1 available! Next, add Prisma ORM to manage your queries, schema and migrations! To add Prisma ORM, first make sure the latest version is installed. Prisma ORM versions 5.12.0 and up support Cloudflare Workers and D1.

npm install prisma@latest @prisma/client@latest @prisma/adapter-d1

Now run npx prisma init in order to create the necessary files to start with. Since D1 uses SQLite’s SQL dialect, we set the provider to be sqlite.

npx prisma init --datasource-provider sqlite

This will create a few files, but the one to look at first is your Prisma schema file, available at prisma/schema.prisma

// schema.prisma
// This is your Prisma schema file,
// learn more about it in the docs: https://pris.ly/d/prisma-schema

generator client {
  provider = "prisma-client-js"
}

datasource db {
  provider = "sqlite"
  url  = env("DATABASE_URL")
}

Before you can create any models, first enable the driverAdapters Preview feature. This will allow the Prisma Client to use an adapter to communicate with D1.

// schema.prisma
// This is your Prisma schema file,
// learn more about it in the docs: https://pris.ly/d/prisma-schema

generator client {
  provider = "prisma-client-js"
+ previewFeatures = ["driverAdapters"]
}

datasource db {
  provider = "sqlite"
  url      = env("DATABASE_URL")
}

Now you are ready to create your first model! In this app, you will be creating a “ticker”, a mainstay of many classic Internet sites.

Add a new model to your schema, Visit, which will track that an individual visited your site. A Visit is a simple model that will have a unique ID and the time at which an individual visited your site.

// This is your Prisma schema file,
// learn more about it in the docs: https://pris.ly/d/prisma-schema

generator client {
  provider        = "prisma-client-js"
  previewFeatures = ["driverAdapters"]
}

datasource db {
  provider = "sqlite"
  url      = env("DATABASE_URL")
}

+ model Visit {
+   id        Int      @id @default(autoincrement())
+   visitTime DateTime @default(now())
+ }

Now that you have a schema and a model, let’s create a migration. First use wrangler to generate an empty migration file and prisma migrate to fill it. If prompted, select “yes” to create a migrations folder at the root of your project.

npx wrangler d1 migrations create prod-prisma-d1-app init
 ⛅️ wrangler 3.36.0
-------------------
✔ No migrations folder found. Set `migrations_dir` in wrangler.toml to choose a different path.
Ok to create /path/to/your/project/my-d1-prisma-app/migrations? … yes
✅ Successfully created Migration '0001_init.sql'!

The migration is available for editing here
/path/to/your/project/my-d1-prisma-app/migrations/0001_init.sql

npx prisma migrate diff --script --from-empty --to-schema-datamodel ./prisma/schema.prisma >> migrations/0001_init.sql

The npx prisma migrate diff command takes the difference between your database (which is currently empty) and the Prisma schema. It then saves this difference to a new file in the migrations directory.

// 0001_init.sql
-- Migration number: 0001 	 2024-03-21T22:15:50.184Z
-- CreateTable
CREATE TABLE "Visit" (
    "id" INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
    "visitTime" DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP

Now you can migrate your local and remote D1 database instances using wrangler and re-generate your Prisma Client to begin making queries.

npx wrangler d1 migrations apply prod-prisma-d1-app --local
npx wrangler d1 migrations apply prod-prisma-d1-app --remote
npx prisma generate

Make sure to import PrismaClient and PrismaD1, define the binding for your D1 database, and you’re ready to use Prisma in your application.

// src/index.ts
import { PrismaClient } from "@prisma/client";
import { PrismaD1 } from "@prisma/adapter-d1";

export interface Env {
  DB: D1Database,
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const adapter = new PrismaD1(env.DB);
    const prisma = new PrismaClient({ adapter });
    const { pathname } = new URL(request.url);

    if (pathname === '/') {
      const numVisitors = await prisma.visit.count();
      return new Response(
        `You have had ${numVisitors} visitors!`
      );
    }

    return new Response('');
  },
};

You may notice that there’s always 0 visitors. Add another route to create a new visitor whenever someone visits the /visit route

// src/index.ts
import { PrismaClient } from "@prisma/client";
import { PrismaD1 } from "@prisma/adapter-d1";

export interface Env {
  DB: D1Database,
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const adapter = new PrismaD1(env.DB);
    const prisma = new PrismaClient({ adapter });
    const { pathname } = new URL(request.url);

    if (pathname === '/') {
      const numVisitors = await prisma.visit.count();
      return new Response(
        `You have had ${numVisitors} visitors!`
      );
    } else if (pathname === '/visit') {
      const newVisitor = await prisma.visit.create({ data: {} });
      return new Response(
        `You visited at ${newVisitor.visitTime}. Thanks!`
      );
    }

    return new Response('');
  },
};

Your app is now set up to record visits and report how many visitors you have had!

Summary and further reading

We were able to build a simple app easily with Cloudflare Workers, D1 and Prisma ORM, but the benefits don’t stop there! Check the official documentation for information on using Prisma ORM with D1 along with workflows for migrating your data, and even extending the Prisma Client for your specific needs.

Data Anywhere with Pipelines, Event Notifications, and Workflows

2024-04-03 Matt Silverlock

Post Syndicated from Matt Silverlock original https://blog.cloudflare.com/data-anywhere-events-pipelines-durable-execution-workflows

Data is fundamental to any real-world application: the database storing your user data and inventory, the analytics tracking sales events and/or error rates, the object storage with your web assets and/or the Parquet files driving your data science team, and the vector database enabling semantic search or AI-powered recommendations for your users.

When we first announced Workers back in 2017, and then Workers KV, Cloudflare R2, and D1, it was obvious that the next big challenge to solve for developers would be in making it easier to ingest, store, and query the data needed to build scalable, full-stack applications.

To that end, as part of our quest to make building stateful, distributed-by-default applications even easier, we’re launching our new Event Notifications service; a preview of our upcoming streaming ingestion product, Pipelines; and a sneak peek into our take on durable execution, Workflows.

Event-based architectures

When you’re writing data — whether that’s new data, changing existing data, or deleting old data — you often want to trigger other, asynchronous work to run in response. That could be processing user-driven uploads, updating search indexes as the underlying data changes, or removing associated rows in your SQL database when content is removed.

In order to make these event-driven workflows far easier to build across Cloudflare, we’re launching the first step towards a wider Event Notifications platform across Cloudflare, starting with notifications support in R2.

You can read more in the deep-dive on Event Notifications for R2, but in a nutshell: you can configure changes to content in any R2 bucket to write directly to a Queue, allowing you to reliably consume those events in a Worker or to pull from compute in a legacy cloud.

Event Notifications for R2 are just the beginning, though. There are many kinds of events you might want to trigger as a developer — these are just some of the event types we’re planning to support:

Changes (writes) to key-value pairs in your Workers KV namespaces.
Updates to your D1 databases, including changed rows or triggers.
Deployments to your Cloudflare Workers

Consuming event notifications from a single Worker is just one approach, though. As you start to consume events, you may want to trigger multi-step workflows that execute reliably, resume from errors or exceptions, and ensure that previous steps aren’t duplicated or repeated unnecessarily. An event notification framework turns out to be just the thing needed to drive a workflow engine that executes durably…

Making it even easier to ingest data

When we launched Cloudflare R2, our object storage service, we knew that supporting the de facto-standard S3 API was critical in order to allow developers to bring the tooling and services they already had over to R2. But the S3 API is designed to be simple: at its core, it provides APIs for upload, download, multipart and metadata operations, and many tools don’t support the S3 API.

What if you want to batch clickstream data from your web services so that it’s efficient (and cost-effective) to query by your analytics team? Or partition data by customer ID, merchant ID, or locale within a structured data format like JSON?

Well, we want to help solve this problem too, and so we’re announcing Pipelines, an upcoming streaming ingestion service designed to ingest data at scale, aggregate it, and write it directly to R2, without you having to manage infrastructure, partitions, runners, or worry about durability.

With Pipelines, creating a globally scalable ingestion endpoint that can ingest tens-of-thousands of events per second doesn’t require any code:

$ wrangler pipelines create clickstream-ingest-prod --batch-size="1MB" --batch-timeout-secs=120 --batch-on-json-key=".merchantId" --destination-bucket="prod-cs-data"

✅ Successfully created new pipeline "clickstream-ingest-prod"
📥 Created endpoints:
➡ HTTPS: https://d458dbe698b8eef41837f941d73bc5b3.pipelines.cloudflarestorage.com/clickstream-ingest-prod
➡ WebSocket: wss://d458dbe698b8eef41837f941d73bc5b3.pipelines.cloudflarestorage.com:8443/clickstream-ingest-prod
➡ Kafka: d458dbe698b8eef41837f941d73bc5b3.pipelines.cloudflarestorage.com:9092 (topic: clickstream-ingest-prod)

As you can see here, we’re already thinking about how to make Pipelines protocol-agnostic: write from a HTTP client, stream events over a WebSocket, and/or redirect your existing Kafka producer (and stop having to manage and scale Kafka) directly to Pipelines.

But that’s just the beginning of our vision here. Scalable ingestion and simple batching is one thing, but what about if you have more complex needs? Well, we have a massively scalable compute platform (Cloudflare Workers) that can help address this too.

The code below is just an initial exploration for how we’re thinking about an API for running transforms over streaming data. If you’re aware of projects like Apache Beam or Flink, this programming model might even look familiar:

export default {    
   // Pipeline handler is invoked when batch criteria are met
   async pipeline(stream: StreamPipeline, env: Env, ctx: ExecutionContext): Promise<StreamingPipeline> {
      // ...
      return stream
         // Type: transform(label: string, transformFunc: TransformFunction): Promise<StreamPipeline>
         // Each transform has a label that is used in metrics to provide
    // per-transform observability and debugging
         .transform("human readable label", (events: Array<StreamEvent>) => {
            return events.map((e) => ...)
         })
         .transform("another transform", (events: Array<StreamEvent>) => {
            return events.map((e) => ...)
         })
         .writeToR2({
            format: "json",
            bucket: "MY_BUCKET_NAME",
            prefix: somePrefix,
            batchSize: "10MB"
         })
   }
}

Specifically:

The Worker describes a pipeline of transformations (mapping, reducing, filtering) that operates over each subset of events (records)
You can call out to other services — including D1 or KV — in order to synchronously or asynchronously hydrate data or lookup values during your stream processing
We take care of scaling horizontally based on records-per-second and/or any concurrency settings you configure based on processing latency requirements.

We’ll be bringing Pipelines into open beta later in 2024, and it will initially launch with support for HTTP ingestion and R2 as a destination (sink), but we’re already thinking bigger.

We’ll be sharing more as Pipelines gets closer to release. In the meantime, you can register your interest and share your use-case, and we’ll reach out when Pipelines reaches open beta.

Durable Execution

If the term “Durable Execution” is new to you, don’t worry: the term comes from the desire to run applications that can resume execution from where they left off, even if the underlying host or compute fails (where the “durable” part comes from).

As we’ve continued to build out our data and AI platforms, we’ve been acutely aware that developers need ways to create reliable, repeatable workflows that operate over that data, turn unstructured data into structured data, trigger on fresh data (or periodically), and automatically retry, restart, and export metrics for each step along the way. The industry calls this Durable Execution: we’re just calling it Workflows.

What makes Workflows different from other takes on Durable Execution is that we provide the underlying compute as part of the platform. You don’t have to bring-your-own compute, or worry about scaling it or provisioning it in the right locations. Workflows runs on top of Cloudflare Workers – you write the workflow, and we take care of the rest.

Here’s an early example of writing a Workflow that generates text embeddings using Workers AI and stores them (ready to query) in Vectorize as new content is written to (or updated within) R2.

Each Workflow run is triggered by an Event Notification consumed from a Queue, but could also be triggered by a HTTP request, another Worker, or even scheduled on a timer.
Individual steps within the Workflow allow us to define individually retriable units of work: in this case, we’re reading the new objects from R2, creating text embeddings using Workers AI, and then inserting.
State is durably persisted between steps: each step can emit state, and Workflows will automatically persist that so that any underlying failures, uncaught exceptions or network retries can resume execution from the last successful step.
Every call to step() automatically emits metrics associated with the unique Workflow run, making it easier to debug within each step and/or break down your application into its smallest units of execution, without having to worry about observability.

Step-by-step, it looks like this:

Transforming this series of steps into real code, here’s what this would look like with Workflows:

import { Ai } from "@cloudflare/ai";
import { Workflow } from "cloudflare:workers";

export interface Env {
  R2: R2Bucket;
  AI: any;
  VECTOR_INDEX: VectorizeIndex;
}

export default class extends Workflow {
  async run(event: Event) {
    const ai = new Ai(this.env.AI);

    // List of keys to fetch from our incoming event notification
    const keysToFetch = event.messages.map((val) => {
      return val.object.key;
    });

    // The return value of each step is stored (the "durable" part
    // of "durable execution")
    // This ensures that state can be persisted between steps, reducing
    // the need to recompute results ($$, time) should subsequent
    // steps fail.
    const inputs = await this.ctx.run(
      // Each step has a user-defined label
      // Metrics are emitted as each step runs (to success or failure)
// with this label attached and available within per-Workflow
// analytics in near-real-time.
"read objects from R2", async () => {
      const objects = [];

      for (const key of keysToFetch) {
        const object = await this.env.R2.get(key);
        objects.push(await object.text());
      }

      return objects;
    });


    // Persist the output of this step.
    const embeddings = await this.ctx.run(
      "generate embeddings",
      async () => {
        const { data } = await ai.run("@cf/baai/bge-small-en-v1.5", {
          text: inputs,
        });

        if (data.length) {
          return data;
        } else {
          // Uncaught exceptions trigger an automatic retry of the step
          // Retries and timeouts have sane defaults and can be overridden
    // per step
          throw new Error("Failed to generate embeddings");
        }
      },
      {
        retries: {
          limit: 5,
          delayMs: 1000,
          backoff: "exponential",
        },
      }
    );

    await this.ctx.run("insert vectors", async () => {
      const vectors = [];

      keysToFetch.forEach((key, index) => {
        vectors.push({
          id: crypto.randomUUID(),
          // Our embeddings from the previous step
          values: embeddings[index].values, 
          // The path to each R2 object to map back to during
 	    // vector search
          metadata: { r2Path: key },
        });
      });

      return this.env.VECTOR_INDEX.upsert();
    });
  }
}

This is just one example of what a Workflow can do. The ability to durably execute an application, modeled as a series of steps, applies to a wide number of domains. You can apply this model of execution to a number of use-cases, including:

Deploying software: each step can define a build step and subsequent health check, gating further progress until your deployment meets your criteria for “healthy”.
Post-processing user data: triggering a workflow based on user uploads (e.g. to Cloudflare R2) that then subsequently parses that data asynchronously, redacts PII or sensitive data, writes the sanitized output, and triggers a notification via email, webhook, or mobile push.
Payment and batch workflows: aggregating raw customer usage data on a periodic schedule by querying your data warehouse (or Workers Analytics Engine), triggering usage or spend alerts, and/or generating PDF invoices.

Each of these use cases model tasks that you want to run to completion, minimize redundant retries by persisting intermediate state, and (importantly) easily observe success and failure.

We’ll be sharing more about Workflows during the second quarter of 2024 as we work towards an open (public!) beta. This includes how we’re thinking about idempotency and interactions with our storage, per-instance observability and metrics, local development, and templates to bootstrap common workflows.

Putting it together

We’ve often thought of Cloudflare’s own network as one massively scalable parallel data processing cluster: data centers in 310+ cities, with the ability to run compute close to users and/or close to data, keep it within the bounds of regulatory or compliance requirements, and most importantly, use our massive scale to enable our customers to scale as well.

Recapping, a fully-fledged data platform needs to enable three things:

Ingesting data: getting data into the platform (in the right format, from the right sources)
Storing data: securely, reliably, and durably.
Querying data: understanding and extracting insights from the data, and/or transforming it for use by other tools.

When we launched R2 we tackled the second part, but knew that we’d need to follow up with the first and third parts in order to make it easier for developers to get data in and make use of it.

If we look at how we can build a system that helps us solve each of these three parts together with Pipelines, Event Notifications, R2, and Workflows, we end up with an architecture that resembles this:

Specifically, we have Pipelines (1) scaling out to ingest data, batch it, filter it, and then durably store it in R2 (2) in a format that’s ready and optimized for querying. Workflows, ClickHouse, Databricks, or the query engine of your choice can then query (3) that data as soon as it’s ready — with “ready” being automatically triggered by an Event Notification as soon as the data is ingested and written to R2.

There’s no need to poll, no need to batch after the fact, no need to have your query engine slow down on data that wasn’t pre-aggregated or filtered, and no need to manage and scale infrastructure in order to keep up with load or data jurisdiction requirements. Create a Pipeline, write your data directly to R2, and query directly from it.

If you’re also looking at this and wondering about the costs of moving this data around, then we’re holding to one important principle: zero egress fees across all of our data products. Just as we set the stage for this with our R2 object storage, we intend to apply this to every data product we’re building, Pipelines included.

Start Building

We’ve shared a lot of what we’re building so that developers have an opportunity to provide feedback (including via our Developer Discord), share use-cases, and think about how to build their next application on Cloudflare.

R2 adds event notifications, support for migrations from Google Cloud Storage, and an infrequent access storage tier

2024-04-03 Matt DeBoard

Post Syndicated from Matt DeBoard original https://blog.cloudflare.com/r2-events-gcs-migration-infrequent-access

We’re excited to announce three new features for Cloudflare R2, our zero egress fee object storage platform:

Event Notifications: Automatically trigger Workers and take action when data in your R2 bucket changes.
Super Slurper for Google Cloud Storage: Easily migrate data from Google Cloud Storage to Cloudflare R2.
Infrequent Access Private Beta: Pay less to store data that isn’t frequently accessed. Now in private beta (sign up now).

Event Notifications Open Beta

The lifecycle of data often doesn’t stop immediately after upload to an R2 bucket – event data may need to be transformed and loaded into a data warehouse, media files may need to go through a post-processing step, etc. We’re releasing event notifications for R2 in open beta to enable building applications and workflows driven by your changing data.

Event notifications work by sending messages to your queue each time there is a change to your data. These messages are then received by a consumer Worker where you can then define any subsequent action that needs to be taken.

To get started enabling event notifications on your R2 bucket, you can run the following Wrangler command (replacing bucket_name and queue_name with your bucket and queue names respectively):

wrangler r2 bucket notification create <bucket_name> --event-type object-create --queue <queue_name>

For more information on how to set up event notifications on your R2 buckets today and limitations during beta, please refer to the documentation.

Super Slurper for Google Cloud Storage

Super Slurper can now migrate data from Google Cloud Storage (GCS) to Cloudflare R2. We released Super Slurper last year with the goal of making one-time comprehensive data migrations fast, reliable, and easy: there’s no need to spin up migration VMs and implement complicated retry logic. Since then, thousands of developers have used Super Slurper to migrate petabytes of data from AWS S3 to R2. Now Google Cloud Storage customers can migrate data to Cloudflare R2 to benefit from Cloudflare’s zero egress fees, whether you are permanently moving data to another provider or not.

To get started migrating data from GCS:

From the Cloudflare dashboard, select R2 > Data Migration.
Select Migrate files.
Select Google Cloud Storage for the source bucket provider.
Enter your bucket name and associated credentials and select Next.
Enter your R2 bucket name and associated credentials and select Next.
After you finish reviewing the details of your migration, select Migrate files.

You can view the status of your migration job at any time on the dashboard. For more information on how to use Super Slurper, please refer to the documentation here.

Infrequent Access Private Beta

We’re excited to introduce the private beta of our new Infrequent Access storage class. For use cases that involve data that isn’t frequently accessed (long tail user-generated content, logs, etc), Infrequent Access gives you the ability to pay less for storage while maintaining performance and durability.

Here’s an example of how you can upload an object to your R2 bucket with the new Infrequent Access storage class using Workers:

# wrangler.toml
[[r2_buckets]]
binding = 'MY_BUCKET'
bucket_name = '<YOUR_BUCKET_NAME>'

# index.ts
export default {
   async fetch(request: Request, env: Env): Promise<Response> {
      if (request.method === "PUT") {
         await env.MY_BUCKET.put("myobject", request.body, storageClass: "InfrequentAccess");
         return new Response("Put object successfully!");
      }
      return new Response("Not a PUT!");
   }
}

In addition to uploading objects directly to Infrequent Access, you can define an object lifecycle policy to move data to Infrequent Access after a period of time goes by and you no longer need to access your data as often. In the future, we plan to automatically optimize storage classes for data so you can avoid manually creating rules and better adapt to changing data access patterns.

For data stored in the Infrequent Access storage class, the pricing components will be similar to what you’re used to with R2: storage, Class A operations (writes, lists), Class B operations (reads), and data retrieval (processing). Data retrieval is charged per GB when data in the Infrequent Access storage class is retrieved and is what allows us to provide storage at a lower price. It reflects the additional computational resources required to fetch data from underlying storage optimized for less frequent access. And when the time comes, and you do need to use your data, there are still no egress fees.

Component	Price
Storage	$0.01 / GB-month
Class A Operations	$9.00 / million requests
Class B Operations	$0.90 / million requests
Data Retrieval (Processing)	$0.01 / GB
Egress (or Data Transfer)	$0 – No Charge

Are you interested in participating in the private beta for Infrequent Access?

Join the private beta waitlist to get access.

Have any feedback?

We would love to hear from you! To share your feedback about R2 and our data migration services, please join the Cloudflare Developer Discord. If you’re interested in learning more about R2, get started by visiting R2’s developer documentation or see how much you could save with our pricing calculator.

Our guest speakers

Jess Van Brummelen

Ashley Edwards

Broadening access

Актуално състояние

Как може да бъде

Challenges

Visualize AWS Glue Data Quality scores in Amazon DataZone

Data quality historical results in Amazon DataZone

Enable AWS Glue Data Quality when creating a new Amazon DataZone data source

Prerequisites

Create a data source with data quality enabled

Enable data quality for an existing data asset

Prerequisites

Import data quality scores into the data asset

Ingest data quality scores from an external source using Amazon DataZone APIs

Prerequisites

Ingest and analyze PySpark code

Clean up

Conclusion

About the Authors

Advantages of Apache Iceberg

Transactional data lakes built on AWS and Snowflake

Manage your Iceberg table with AWS Glue

Manage your Iceberg table with Snowflake

Comparing solutions

Migrate existing data lakes to a transactional data lake using Apache Iceberg

Conclusion

About the Authors

Search templates

Default values in search templates

Conditions in search templates

Loops in search templates

Advantages of using search templates

Best practices

Conclusion

About the authors

What is Prisma?

How to get started with Prisma ORM, Cloudflare Workers, and D1

Summary and further reading

Event-based architectures

Making it even easier to ingest data

Durable Execution

Putting it together

Start Building

Event Notifications Open Beta

Super Slurper for Google Cloud Storage

Infrequent Access Private Beta

Are you interested in participating in the private beta for Infrequent Access?

Have any feedback?

The collective thoughts of the interwebz