Tag Archives: artificial intelligence

Decoding the Social Effects Of Media with Machine Learning

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/decoding-the-social-effects-of-media-with-machine-learning/

What if media were optimized to benefit people? This thought-provoking question is at the core of Harmony Labs’ mission. A nonprofit organization headquartered in New York City, Harmony Labs strives to better understand the impact of media on society, and build communities and tools to reform and transform media systems.

As Brian Waniewski, Executive Director at Harmony Labs, puts it: “The media systems that we have now, for better or worse, have become outrage machines and sorting machines that put people into groups of like minds. The business incentive structures of these systems are such that the more outrage there is, the more profit there is. Political events across the world in recent years have borne out what these media systems produce, and it’s really pretty toxic, and pretty hard to get anything done within. There are all kinds of natural divisions between people, but these media systems tend to reinforce these divisions. So, the first question that we’re asking is, What’s the scope of this problem? And then, What can we do to solve it?”

Harmony Labs use data science and machine learning to answer these questions. Starting from user surveys and media data, they developed advanced natural language processing pipelines that can identify how social issues are represented in media, how they are consumed by different audiences, and what kind of influence that consumption has.

So how do you get that data in the first place? Brian says: “All the media data that we need lives inside private companies. We knew that data sharing would be central to our mission, and this is why we’re structured as a nonprofit. We are working in the public’s interest in a non-partisan way. We’ve done big data sharing deals with large companies, as well as startups scraping different corners of the media ecosystem that we’re interested in: internet TV, internet radio, and so on. We have about 10 data partners at the moment, and we’re always looking to expand.”

Thanks to their data partners, Harmony Labs has collected over 50 terabytes of diverse media: TV, web, mobile, song lyrics, closed captions, social media, and more. This definitely fits the definition of big data (volume, velocity, and variety). Working with languages like Golang, Python, and R, the Harmony Labs data science and engineering teams rely on AWS services such as Amazon Aurora, Amazon Athena, AWS Glue, and Amazon Elastic Kubernetes Service (EKS) to build their data ingestion and processing workflows.

Once the data is in-house, Harmony Labs make it safe, secure, and accessible to a network of academic researchers who use it to investigate the influence of media systems on politics, society, and culture. Laura Edelson is one of these researchers. A Ph.D. candidate in Computer Science at NYU’s Tandon School of Engineering, she studies online political communication and develops methods to identify inauthentic content and activity. Harmony Labs supported her on the Ad Observatory project, an exploration of political ads on Facebook.

Harmony Labs also work on their own projects, such as the Narrative Observatory. “A narrative is a story pattern that recurs across different kinds of stories and media. You’ll find them in song lyrics, TV shows, news articles, and more,” says Brian. The Narrative Observatory helps identify narratives on particular topics and track them over long periods of time and across different media types.

With initial funding from the Bill & Melinda Gates Foundation, Harmony Labs studied narratives linked to the topic of poverty and economic mobility in the United States. Collecting millions of documents (online news, social media, music), they first identified the main narratives present in media. Then, using segmentation techniques, behavioral data on over 50,000 Americans and surveys, Harmony Labs defined four audiences, as well as their dominant narrative, their core values, and their views on specific social issues. Finally, Harmony Labs studied how each audience consumed narratives.

To enable funders, partners, and media companies to gain a deeper understanding of the cultural spaces their audiences occupy, they built a fascinating website, obiaudiences.org, where you can pick an audience and view the associated media feed. In other words, you can literally see the world through someone else’s eyes: what issues they care most about, what media they read most, and so on. This helps to understand the perceptions that different people have on certain issues, and as Brian puts it: “If you’re trying to reach people, it’s important to understand the media world they inhabit, and what can be actually relevant to that world.”

Narrative Observatory

Recently, Harmony Labs led a project funded by the Mozilla Foundation on defining a healthy narrative for artificial intelligence (AI). Studying the TV consumption habits of over 80,000 US adults, and connecting them with closed-caption transcripts and ads, they identified and named the main media narratives on AI. Each narrative includes a definition of AI, the emotion that it creates in people, and whether they think that AI will lead to a happy or an unhappy ending.

Harmony Labs identified four main narratives on AI. Two are extremely negative and fear-inducing. “Tool of Tyranny” says that AI will be used by governments to oppress people. “Robot Overlords” says that we’ll never be able to control AI, and it will end up ruling us. At the other end of the spectrum, the “Wishes Granted” narrative is extremely positive: sure, we don’t understand AI, but it’s a magic wand that will solve all our issues. The last narrative, “Augmented Intelligence”, is more balanced: yes, AI is a great opportunity to improve our daily lives, but it’s also capable of being unfair and even dangerous. We are responsible for designing it, controlling it, and making sure it’s used to help us, not to hurt us.

Harmony Labs found that the “Wishes Granted” narrative was the most prominent (67%). It shines a positive light on AI, but its naive and over-optimistic vision can hide the legitimate questions that AI raises. Still, it’s a good starting point to engage audiences, educate them with the “Augmented Intelligence” narrative, and increase their awareness of both opportunities and challenges.

Closing this post, I’m wondering which AI narrative I’ve actually promoted here, willingly or not! What do you think? One thing is certain: Harmony Labs is using AI to help us understand how media influences us every day, and how we can create a more democratic society. This is important work, and we’re humbled that they picked AWS to help them reach their goals.

For more information on Harmony Labs, please visit harmonylabs.org and harmonylabs.medium.com.

– Julien

Improve the performance of Lambda applications with Amazon CodeGuru Profiler

Post Syndicated from Vishaal Thanawala original https://aws.amazon.com/blogs/devops/improve-performance-of-lambda-applications-amazon-codeguru-profiler/

As businesses expand applications to reach more users and devices, they risk higher latencies that could degrade the customer experience. Slow applications can frustrate or even turn away customers. Developers therefore increasingly face the challenge of ensuring that their code runs as efficiently as possible, since application performance can make or break businesses.

Amazon CodeGuru Profiler helps developers improve application performance by analyzing the application’s runtime behavior. CodeGuru Profiler analyzes single- and multi-threaded applications and then generates visualizations to help developers understand the sources of latency. Afterwards, CodeGuru Profiler provides recommendations to help resolve the root cause.

CodeGuru Profiler recently began providing recommendations for applications written in Python. Additionally, the new automated onboarding process for AWS Lambda functions makes it even easier to use CodeGuru Profiler with serverless applications built on AWS Lambda.

This post highlights these new features by explaining how to set up and utilize CodeGuru Profiler on an AWS Lambda function written in Python.

Prerequisites

This post focuses on improving the performance of an application built with AWS Lambda, so it’s important to understand which Lambda functions work best with CodeGuru Profiler. You will get the most out of CodeGuru Profiler on long-duration Lambda functions (>10 seconds) or frequently invoked shorter-duration Lambda functions (~100 milliseconds). Because CodeGuru Profiler requires five minutes of runtime data before the Lambda container is recycled, very short-duration Lambda functions with an execution time of 1-10 milliseconds may not provide sufficient data for CodeGuru Profiler to generate meaningful results.

The automated CodeGuru Profiler onboarding process, which automatically creates the profiling group for you, supports Lambda functions running on Java 8 (Amazon Corretto), Java 11, and Python 3.8 runtimes. Additional runtimes, without the automated onboarding process, are supported and can be found in the Java documentation and the Python documentation.

Getting Started

Let’s quickly demonstrate the new Lambda onboarding process and the new Python recommendations. This example assumes you have already created a Lambda function, so we will just walk through the process of turning on CodeGuru Profiler and viewing results. If you don’t already have a Lambda function created, you can create one by following these setup instructions. If you would like to replicate this example, the code we used can be found on GitHub.

  1. On the AWS Lambda Console page, open your Lambda function. For this example, we’re using a function with a Python 3.8 runtime.

 This image shows the Lambda console page for the Lambda function we are referencing.

2. Navigate to the Configuration tab, go to the Monitoring and operations tools page, and click Edit on the right side of the page.

This image shows the instructions to open the page to turn on profiling

3. Scroll down to “Amazon CodeGuru Profiler” and click the button next to “Code profiling” to turn it on. After enabling Code profiling, click Save.

This image shows how to turn on CodeGuru Profiler

4. Verify that CodeGuru Profiler has been turned on within the Monitoring and operations tools page.

This image shows how to validate Code profiling has been turned on

That’s it! You can now navigate to CodeGuru Profiler within the AWS console and begin viewing results.

Viewing your results

CodeGuru Profiler requires 5 minutes of Lambda runtime data to generate results. After your Lambda function provides this runtime data, which may require multiple invocations if your Lambda function has a short runtime, the results will display within the “Profiling group” page in the CodeGuru Profiler console. The profiling group will be given a default name (i.e., aws-lambda-<lambda-function-name>), and it will take approximately 15 minutes after CodeGuru Profiler receives the runtime data before it appears on this page.

This image shows where you can see the profiling group created after turning on CodeGuru Profiler on your Lambda function

After the profile appears, customers can view their profiling results by analyzing the flame graphs. Additionally, after approximately 1 hour, customers will receive their first recommendation set (if applicable). For more information on how to read the CodeGuru Profiler results, see Investigating performance issues with Amazon CodeGuru Profiler.

The images below show the two flame graphs (CPU Utilization and Latency) generated from profiling the Lambda function. Note that the highlighted horizontal bar (also referred to as a frame) in both images corresponds to one of the three frames that generates a recommendation. We’ll dive into more details on the recommendation in the following sections.

CPU Utilization Flame Graph:

This image shows the CPU Utilization flame graph for the Lambda function

Latency Flame Graph:

This image shows the Latency flame graph for the Lambda function

Here are the three recommendations generated from the above Lambda function:

This image shows the 3 recommendations generated by CodeGuru Profiler

Addressing a recommendation

Let’s dive deeper into an example recommendation. The first recommendation above notices that the Lambda function is spending more than the normal amount of runnable time (6.8% vs. <1%) creating AWS SDK service clients. It recommends ensuring that the function doesn’t unnecessarily create AWS SDK service clients, which wastes CPU time.

This image shows the detailed recommendation for 'Recreation of AWS SDK service clients'

Based on the suggested resolution step, we made a quick and easy code change, moving the client creation outside of the Lambda handler function. This ensures that we don’t create unnecessary AWS SDK clients. The code change below shows how we would resolve the issue.

import boto3
...

# Create the AWS SDK clients once, outside the handler, so they are reused
# across warm invocations of the same Lambda execution environment.
s3_client = boto3.client('s3')
cw_client = boto3.client('cloudwatch')

def lambda_handler(event, context):
    ...

Reviewing your savings

After making each of the three changes recommended above by CodeGuru Profiler, look at the new flame graphs to see how the changes impacted the application’s profile. You’ll notice below that we no longer see the previously wide frames for the boto3 clients, put_metric_data, or the logger in the S3 API call.

CPU Utilization Flame Graph:

This image shows the CPU Utilization flame graph after making the recommended changes

Latency Flame Graph:

This image shows the Latency flame graph after making the recommended changes

Moreover, we can run the Lambda function for one day (1439 invocations) and review the results in Lambda Insights to understand our total savings. After every recommendation was addressed, this Lambda function, configured with 128 MB of memory and a 10-second timeout, used 10% less CPU time and showed lower maximum memory usage and network IO, leading to a 30% drop in GB-seconds (GB-s). Decreasing GB-s leads to a 30% lower duration charge, as explained in AWS Lambda Pricing.
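
As a rough illustration of how GB-seconds drive the duration charge, the short sketch below computes billed GB-seconds before and after an optimization. The invocation count matches the one-day test above, but the per-invocation durations are hypothetical values chosen only to show the arithmetic, not measurements from this function.

# Hypothetical illustration of the Lambda duration charge (GB-seconds).
memory_gb = 128 / 1024      # 128 MB expressed in GB
invocations = 1439          # one day of traffic, as in the test above

def gb_seconds(avg_duration_s):
    return memory_gb * avg_duration_s * invocations

before = gb_seconds(avg_duration_s=2.0)   # assumed average duration before the changes
after = gb_seconds(avg_duration_s=1.4)    # assumed average duration after the changes

print(f"before: {before:.1f} GB-s, after: {after:.1f} GB-s, "
      f"reduction: {(1 - after / before) * 100:.0f}%")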

Latency (before and after CodeGuru Profiler):

The graph below displays the duration change while running the Lambda function for 1 day.

This image shows the duration change while running the Lambda function for 1 day.

Cost (before and after CodeGuru Profiler):

The graph below displays the function cost change while running the Lambda function for 1 day.

This image shows the function cost change while running the lambda function for 1 day.

Conclusion

This post showed how developers can easily onboard onto CodeGuru Profiler in order to improve the performance of their serverless applications built on AWS Lambda. Get started with CodeGuru Profiler by visiting the CodeGuru console.

All across Amazon, numerous teams have utilized CodeGuru Profiler’s technology to generate performance optimizations for customers. It has also reduced infrastructure costs, saving millions of dollars annually.

Onboard your Python applications onto CodeGuru Profiler by following the instructions in the documentation. If you’re interested in the Python agent, note that it is open source on GitHub, and the repository also includes additional demo applications that use the agent.

About the authors

Headshot of Vishaal

Vishaal is a Product Manager at Amazon Web Services. He enjoys getting to know his customers’ pain points and transforming them into innovative product solutions.

Headshot of Mirela

Mirela is a Software Development Engineer at Amazon Web Services on the CodeGuru Profiler team in London. Previously, she worked on Amazon.com teams, and she enjoys working on products that help customers improve their code and infrastructure.

Headshot of Parag

Parag is a Software Development Engineer at Amazon Web Services on the CodeGuru Profiler team in Seattle. Previously, he worked on Amazon.com’s catalog and selection team. He enjoys working on large-scale distributed systems and products that help customers build scalable and cost-effective applications.

Automate Document Processing in Logistics using AI

Post Syndicated from Manikanth Pasumarti original https://aws.amazon.com/blogs/architecture/automate-document-processing-in-logistics-using-ai/

Multi-modal transportation is one of the biggest developments in the logistics industry. There has been a successful collaboration across different transportation partners in supply chain freight forwarding for many decades. But there’s still a considerable overhead of paperwork processing for each leg of the trip. Tens of billions of documents are processed in ocean freight forwarding alone. Using manual labor to process these documents (purchase orders, invoices, bills of lading, delivery receipts, and more) is both expensive and error-prone.

In this blog post, we’ll address how to automate the document processing in the logistics industry. We’ll also show you how to integrate it with a centralized workflow management.

Automated document processing architecture

Figure 1. Architecture of document processing workflow

The solution workflow shown in Figure 1 is as follows:

  1. Documents that belong to the same transaction are collected in an S3 bucket
  2. The document processing workflow is initiated
  3. The workflow orchestration is as follows:
    • Document is processed via automation
    • Relevant entities are extracted
    • Extracted data is reviewed
    • Order data is consolidated

This architecture uses Amazon Simple Storage Service (S3) for document storage, and Amazon Simple Queue Service (SQS) for workflow initiation. Amazon Textract is used for text extraction, Amazon Comprehend for entity extraction, and Amazon Augmented AI (A2I) for human review. Human review helps ensure correct results in cases of low-confidence predictions.

We use AWS Step Functions to orchestrate the document processing workflow. Step Functions also helps improve application resiliency with less code.

AWS Lambda functions are used to do the following (a minimal sketch of the first two tasks follows the list):

  • Detect if all required documents for a given transaction are available in Amazon S3
  • Kick off the process by creating an Amazon SQS message
  • Detect a new processing job from a generated SQS message
  • Extract text from PDFs using a Step Function
  • Extract entities from generated text using a Step Function
  • Control data completeness and accuracy
  • Initiate a human loop when needed using a Step Function
  • Consolidate the data collected from documents
  • Store the data into the database
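
As an illustration of the first two tasks above, here is a minimal sketch of a Lambda handler that checks whether all documents for a transaction are present in S3 and, if so, queues a processing job in SQS. The bucket layout, required document names, and queue URL are hypothetical placeholders, not the exact implementation used in this solution.

import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

REQUIRED_DOCS = {"invoice.pdf", "customs_authorization.pdf"}  # hypothetical names
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/doc-processing"  # placeholder

def lambda_handler(event, context):
    # S3 event for a newly uploaded document; objects are keyed by transaction ID.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    transaction_id = record["object"]["key"].split("/")[0]

    # List everything uploaded so far for this transaction.
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=f"{transaction_id}/")
    uploaded = {obj["Key"].split("/")[-1] for obj in listing.get("Contents", [])}

    # Only queue a processing job once all required documents are present.
    if REQUIRED_DOCS.issubset(uploaded):
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "transaction_id": transaction_id}),
        )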

Document ingestion and classification

There are several data ingestion options available such as AWS Transfer Family, AWS DataSync, and Amazon Kinesis Data Firehose. Choose the appropriate ingestion blueprints based on the type of data sources. Typical real-time ingestion blueprints include AWS Lambda processing and an Amazon CloudWatch event. The batch pipeline can leverage AWS Step Functions. This can be used to orchestrate the Lambda function that initiates the document processing workflow.

Here are some things to consider when building your document ingestion and storage solution:

  • Choose your bucket strategy. Amazon S3 is an object store. Analyze your data pipeline ingestion carefully and choose the correct S3 bucket strategy for each document type (bills, supplier invoices, and others.)
  • Organize your data. The data is organized in S3 buckets by layer: Raw, Staging, and Processed. Each layer has its own bucket policy and access control.
  • Build a creation tool. This is an automated data lake bucket/folder structure tool, based on your data ingestion requirements. You can use this same structure for user-created data.
  • Define data security requirements. Do this before you begin the ingestion process. Before ingesting new or current data sources into AWS, secure access to the data.
  • Review the security credentials needed for access. After copying these credentials into AWS Systems Manager (SSM) Parameter Store, apply an AWS Key Management Service (KMS) key to encrypt them. The encrypted value is stored in SSM and used for authentication (see the sketch below).
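
As a sketch of that last step, the snippet below stores a data source credential as an encrypted SecureString parameter in SSM Parameter Store and reads it back with decryption. The parameter name, value, and KMS key alias are placeholders.

import boto3

ssm = boto3.client("ssm")

# Store the data source credential encrypted with a customer-managed KMS key.
# The parameter name and key alias are placeholders.
ssm.put_parameter(
    Name="/ingestion/source-a/api-key",
    Value="example-secret-value",
    Type="SecureString",
    KeyId="alias/ingestion-kms-key",
    Overwrite=True,
)

# The pipeline later reads and decrypts the value at ingestion time.
secret = ssm.get_parameter(Name="/ingestion/source-a/api-key", WithDecryption=True)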

Document processing workflow

Overview

The workflow checks the input buckets until it detects all the document types necessary for a complete dataset. In our case, these are the invoice document and the customs authorization form. Once both are detected, it generates a job request as a message in Amazon SQS. A Lambda function then processes the message and kicks off the Step Functions flow (see Figure 2). The state machine then initiates the document processing, text extraction, and optional human review steps. AWS Step Functions is well suited to our use case due to its ability to manage long-running workflows.

Figure 2. Visual workflow of document processing in AWS Step Functions

Entity extraction

For each document, entities are extracted using Amazon Textract and Amazon Comprehend. These entities can include date, company, address, bill of materials, total cost, and invoice number.

Following is a sample invoice document that is fed to Amazon Textract, which extracts the form data and creates key-value pairs.

Figure 3. Highlighted different entities in the sample invoice document

See Figure 4 for an example of the key-value pairs extracted for the sample invoice. The keys here represent the form labels (“SHIP TO”) and the values represent form values (shipping address).

Figure 4. Key-value pairs of the invoice data, extracted by Amazon Textract

Amazon Textract also generates a raw text output that contains the entire text, as shown in Figure 5 following.

Figure 5. Raw text output of the invoice data extracted by Amazon Textract
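
Both the key-value pairs in Figure 4 and the raw text in Figure 5 can be obtained programmatically from Amazon Textract. The sketch below shows the general call shape for a single-page image stored in S3; the bucket and object names are placeholders, multi-page PDFs typically go through the asynchronous StartDocumentAnalysis API instead, and the block-relationship parsing needed to pair keys with values is omitted for brevity.

import boto3

textract = boto3.client("textract")

# Analyze a single-page invoice image stored in S3 (names are placeholders).
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "example-doc-bucket", "Name": "invoice.png"}},
    FeatureTypes=["FORMS"],
)

# KEY_VALUE_SET blocks hold the form fields shown in Figure 4;
# LINE blocks hold the raw text shown in Figure 5.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])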

To achieve a higher degree of confidence, Amazon Comprehend is used to identify and extract the custom entities. Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to identify and extract insights and entities from text data. You can train Amazon Comprehend to identify entities relevant to your organization. These can be product names, part numbers, department names, or other entities. You can also train Amazon Comprehend to categorize documents or assign relevant labels to text.

An Amazon Comprehend entity recognizer comes with a set of pre-built entity types, and you can add custom entities to match specific business needs. Some of the entities we want to identify are addresses and company names, so we trained a custom recognizer to detect company names and addresses (see Figure 6).

Figure 6. Training details of custom entity recognizer

Figure 7 shows the resulting output from Amazon Comprehend:

Figure 7. Amazon Comprehend entity recognition output

The sample invoice in Figure 3 is processed top to bottom and left to right, so we know that the first company and address belong to the billing company, and the second set belongs to the shipment recipient. Along with detecting custom entities, Amazon Comprehend also outputs the confidence score of each extracted result.

Confidence scores can vary depending on how close the training data is to the actual data. In the preceding example, the first company entity came back with a score of 0.941. Let’s assume that we have set a minimum confidence score of 0.95. Anything below that threshold should be reviewed by a human. The following section describes this last step of our workflow.

Human review

Amazon Augmented AI (A2I) allows you to create and manage human loops. A human loop is a manual review task that gets assigned to a workforce. The workforce can be public, such as Amazon Mechanical Turk, or private, such as an internal team or a paid contractor. In our example, we created a private workforce to review the entities we were not confident about. Figure 8 shows an example of the user interface that the reviewers use to assign entities to the proper text sections.
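
When an entity’s confidence score falls below the chosen threshold, a human loop can be started programmatically. The sketch below assumes a flow definition has already been created in Amazon A2I; the loop name, flow definition ARN, and input structure are placeholders rather than the exact ones used in this solution.

import json
import boto3

a2i = boto3.client("sagemaker-a2i-runtime")

CONFIDENCE_THRESHOLD = 0.95

def maybe_start_review(entity, transaction_id):
    # Route only low-confidence entities to the private workforce.
    if entity["Score"] >= CONFIDENCE_THRESHOLD:
        return None
    return a2i.start_human_loop(
        HumanLoopName=f"entity-review-{transaction_id}",
        FlowDefinitionArn="arn:aws:sagemaker:us-east-1:123456789012:flow-definition/entity-review",
        HumanLoopInput={"InputContent": json.dumps({"entity": entity})},
    )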

Figure 8. Manual review interface of Amazon A2I

Review tasks can be automatically submitted to the workforce based on dynamic criteria, after both AI-related steps are completed. It can be used to review the text detected by Amazon Textract when key data elements are missing (such as order amount or quantity). It can also review entities after invoking Amazon Comprehend.

Figure 9. Consolidated dataset of processed invoice and customs authorization data

After the manual review step, data can be consolidated (as shown in Figure 9) and stored into a relational database. It can also be shared with other business units such as Accounting or Customer Services. You can apply the same process to other document types such as custom forms, which are linked to the same transaction. This allows us to process and combine information that comes from disparate paper sources more efficiently.

Conclusion

This post demonstrates how document processing can be automated to process business documentation by using Amazon Textract, Amazon Comprehend and Amazon Augmented AI.

Deploying an automated solution in the logistics industry takes away the undifferentiated heavy lifting involved in manual document processing. This helps cut down delivery delays and track missed deliveries. By providing a comprehensive view of the shipment, it increases the efficiency of back-office processing. It can also further simplify data collection for audit purposes.

To learn more:

Using AI to Scale Spear Phishing

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/08/using-ai-to-scale-spear-phishing.html

The problem with spear phishing is that it takes time and creativity to create individualized enticing phishing emails. Researchers are using GPT-3 to attempt to solve that problem:

The researchers used OpenAI’s GPT-3 platform in conjunction with other AI-as-a-service products focused on personality analysis to generate phishing emails tailored to their colleagues’ backgrounds and traits. Machine learning focused on personality analysis aims to predict a person’s proclivities and mentality based on behavioral inputs. By running the outputs through multiple services, the researchers were able to develop a pipeline that groomed and refined the emails before sending them out. They say that the results sounded “weirdly human” and that the platforms automatically supplied surprising specifics, like mentioning a Singaporean law when instructed to generate content for people living in Singapore.

While they were impressed by the quality of the synthetic messages and how many clicks they garnered from colleagues versus the human-composed ones, the researchers note that the experiment was just a first step. The sample size was relatively small and the target pool was fairly homogenous in terms of employment and geographic region. Plus, both the human-generated messages and those generated by the AI-as-a-service pipeline were created by office insiders rather than outside attackers trying to strike the right tone from afar.

It’s just a matter of time before this is really effective. Combine it with voice and video synthesis, and you have some pretty scary scenarios. The real risk isn’t that AI-generated phishing emails are as good as human-generated ones, it’s that they can be generated at much greater scale.

Defcon presentation and slides. Another news article.

Field Notes: Building an automated scene detection pipeline for Autonomous Driving – ADAS Workflow

Post Syndicated from Kevin Soucy original https://aws.amazon.com/blogs/architecture/field-notes-building-an-automated-scene-detection-pipeline-for-autonomous-driving/

This Field Notes blog post from 2020 explains how to build an Autonomous Driving Data Lake using this Reference Architecture. Many organizations face the challenge of ingesting, transforming, labeling, and cataloging massive amounts of data to develop automated driving systems. In this re:Invent session, we explored an architecture to solve this problem using Amazon EMR, Amazon S3, Amazon SageMaker Ground Truth, and more. You’ll learn how BMW Group collects over 1 billion km of anonymized perception data from its worldwide connected fleet of customer vehicles to develop safe and performant automated driving systems.

Architecture Overview

The objective of this post is to describe how to design and build an end-to-end Scene Detection pipeline.

This architecture integrates an event-driven ROS bag ingestion pipeline running Docker containers on Amazon Elastic Container Service (ECS). It includes a scalable batch processing pipeline based on Amazon EMR and Spark. The solution also leverages AWS Fargate, Spot Instances, Amazon Elastic File System (EFS), AWS Glue, Amazon S3, and Amazon Athena.

Figure 1 – Architecture Showing how to build an automated scene detection pipeline for Autonomous Driving

The data included in this demo was produced by one vehicle across four different drives in the United States. Because the ROS bag files produced by the vehicle’s on-board software contain very complex data, such as Lidar point clouds, the files are usually very large (1+ TB files are not uncommon).

These files usually need to be split into smaller chunks before being processed, as is the case in this demo. These files also may need to have post-processing algorithms applied to them, like lane detection or object detection.

In our case, the ROS bag files are split into approximately 10GB chunks and include topics for post-processed lane detections before they land in our S3 bucket. Our scene detection algorithm assumes the post processing has already been completed. The bag files include object detections with bounding boxes, and lane points representing the detected outline of the lanes.

Prerequisites

This post uses an AWS Cloud Development Kit (CDK) stack written in Python. You should follow the instructions in the AWS CDK Getting Started guide to set up your environment so you are ready to begin.

You can also use the config.json to customize the names of your infrastructure items, to set the sizing of your EMR cluster, and to customize the ROS bag topics to be extracted.

You will also need to be authenticated into an AWS account with permissions to deploy resources before executing the deploy script.

Deployment

The full pipeline can be deployed with one command: `bash deploy.sh deploy true`. The progress of the deployment can be followed on the command line, but also in the CloudFormation section of the AWS console. Once deployed, the user must upload two or more bag files to the rosbag-ingest bucket to initiate the pipeline.

The default configuration requires two bag files to be processed before an EMR pipeline is initiated. You also have to manually initiate the AWS Glue Crawler to be able to explore the parquet data with tools like Athena or QuickSight.

ROS bag ingestion with ECS Tasks, Fargate, and EFS

This solution provides an end-to-end scene detection pipeline for ROS bag files, ingesting the ROS bag files from S3, and transforming the topic data to perform scene detection in PySpark on EMR. This then exposes scene descriptions via DynamoDB to downstream consumers.

The pipeline starts with an S3 bucket (Figure 1 – #1) where incoming ROS bag files can be uploaded from local copy stations as needed. We recommend using AWS Direct Connect for a private, high-throughput connection to the cloud.

This ingestion bucket is configured to initiate S3 notifications each time an object with the suffix “.bag” is created. An AWS Lambda function then initiates a Step Function for orchestrating the ECS task, and the bucket and bag file prefix are passed to the ECS task as environment variables in the container.
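
A minimal sketch of that trigger Lambda might look like the following; the state machine ARN comes from an environment variable here, and the real pipeline forwards the bucket and key to the ECS task as container environment variables.

import json
import os
import urllib.parse
import boto3

sfn = boto3.client("stepfunctions")

def lambda_handler(event, context):
    # Parse the S3 notification for the newly uploaded ".bag" object.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Start the state machine that orchestrates the ECS extraction task.
    sfn.start_execution(
        stateMachineArn=os.environ["STATE_MACHINE_ARN"],
        input=json.dumps({"bucket": bucket, "bag_key": key}),
    )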

The ECS task (Figure 1 – #2) runs serverless, leveraging Fargate as the capacity provider. This avoids the need to provision and autoscale EC2 instances in the ECS cluster. Each ECS task processes exactly one bag file. We use Amazon Elastic File System (EFS) to provide virtually unlimited file storage to the container, in order to easily work with larger bag files. The container uses the open-source bagpy Python library to extract structured topic data (for example, GPS, detections, and inertial measurement data). The topic data is uploaded as parquet files to S3, partitioned by topic and source bag file. The application writes metadata about each file, such as the topic names found in the file and the number of messages per topic, to a DynamoDB table (Figure 1 – #4).
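
As an illustration of the extraction step, here is a minimal sketch that uses bagpy to pull one topic into a pandas DataFrame and write it out as parquet. The bag path, topic name, bucket, and key layout are placeholders, writing parquet requires pyarrow or fastparquet, and the real task also records per-topic metadata in DynamoDB.

import boto3
import pandas as pd
from bagpy import bagreader

s3 = boto3.client("s3")

# Bag file downloaded to the EFS mount by the ECS task (path is a placeholder).
bag = bagreader("/mnt/efs/drive-0001.bag")

# bagpy extracts the topic to a CSV file and returns its path.
csv_path = bag.message_by_topic("/vehicle/gps/fix")
df = pd.read_csv(csv_path)

# Write the topic as parquet, partitioned by topic and source bag file.
local_parquet = "/tmp/gps_fix.parquet"
df.to_parquet(local_parquet)
s3.upload_file(local_parquet, "example-topic-bucket",
               "topic=vehicle_gps_fix/bag=drive-0001/data.parquet")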

This module deploys an AWS Glue Crawler configured to crawl this bucket of topic parquet files. These files populate the AWS Glue Data Catalog with the schemas of each topic table and make this data accessible in Athena, Glue jobs, QuickSight, and Spark on EMR. We use the AWS Glue Data Catalog (Figure 1 – #5) as a permanent Hive Metastore.

Figure 2 – Glue Data Catalog of parquet datasets on S3

Figure 3 – Run ad-hoc queries against the Glue tables using Amazon Athena

The topic parquet bucket also has an S3 notification configured for all newly created objects, which is consumed by an EMR-trigger Lambda function (Figure 1 – #5). This Lambda function is responsible for keeping track of bag files and their respective parquet files in DynamoDB (Figure 1 – #6). Once in DynamoDB, bag files are assigned to batches, initiating the EMR batch processing step function. Metadata about each batch, including the step function execution ARN, is stored in DynamoDB.

Figure 4 – EMR pipeline orchestration with AWS Step Functions

The EMR batch processing step function (Figure 1 – #7) orchestrates the entire EMR pipeline, from provisioning an EMR cluster using the open-source EMR Launch CDK library, to submitting PySpark steps to the cluster, to terminating the cluster and handling failures.

Batch Scene Analytics with Spark on EMR

There are two PySpark applications running on our cluster. The first performs synchronization of ROS bag topics for each bag file. As the various sensors in the vehicle have different frequencies, we resample their signals to a uniform frequency of one signal per 100 ms per sensor. This makes it easier to work with the data.

We compute the minimum and maximum timestamp in each bag file and construct a unified timeline. For each 100 ms interval, we take the most recent signal per sensor and assign it to that timestamp. After this is performed, the data looks more like a normal relational table and is easier to query and analyze.
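
The following PySpark sketch shows that resampling idea, assuming a flattened topic table with one row per signal and a timestamp column in milliseconds; the S3 paths and column names are placeholders, not the exact schema used in this pipeline.

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("topic-sync").getOrCreate()

# One row per signal: (bag_file, sensor, timestamp_ms, payload) -- placeholder schema.
df = spark.read.parquet("s3://example-topic-bucket/topics/")

# Assign each signal to a 100 ms bucket on the unified timeline.
bucketed = df.withColumn("bucket_ms", (F.col("timestamp_ms") / 100).cast("long") * 100)

# Keep only the most recent signal per sensor within each bucket.
w = Window.partitionBy("bag_file", "sensor", "bucket_ms").orderBy(F.col("timestamp_ms").desc())
latest = (bucketed
          .withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn"))

# Pivot so each 100 ms timestamp becomes one row with a column per sensor.
synchronized = latest.groupBy("bag_file", "bucket_ms").pivot("sensor").agg(F.first("payload"))
synchronized.write.mode("overwrite").parquet("s3://example-topic-bucket/synchronized/")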

Figure 5 – Batch Scene Analytics with Spark on EMR

Scene Detection and Labeling in PySpark

The second Spark application enriches the synchronized topic dataset (Figure 1 – #8), analyzing the detected lane points and the object detections. The goal is to perform a simple lane assignment algorithm for objects detected by the on-board ML models, and to save this enriched dataset (Figure 1 – #9) back to S3 for easy access by analysts and data scientists.

Figure 9 – Object Lane Assignment example

Figure 9 – Synchronized topics enriched with object lane assignments

Finally, the last step takes this enriched dataset (Figure 1 – #9) to summarize specific scenes or sequences where a person was identified as being in a lane. The output of this pipeline includes two new tables as parquet files on S3 – the synchronized topic dataset (Figure 1 – #8) and the synchronized topic dataset enriched with object lane assignments (Figure 1 – #9), as well as a DynamoDB table with scene metadata for all person-in-lane scenarios (Figure 1 – #10).

Scene Metadata

The Scene Metadata DynamoDB table (Figure 1 – #10) can be queried directly to find sequences of events, as will be covered in a follow up post for visually debugging scene detection algorithms using WebViz/RViz. Using WebViz, we were able to detect that the on-board object detection model labels Crosswalks and Walking Signs as “person” even when a person is not crossing the street, for example:

Example DynamoDB item from the Scene Metadata table

Example DynamoDB item from the Scene Metadata table

Figure 10 – Example DynamoDB item from the Scene Metadata table

These scene descriptions can also be converted to OpenSCENARIO format and pushed to an Elasticsearch cluster to support more complex scenario-based searches, for example for downstream simulation use cases or for visualization in QuickSight. An example of syncing DynamoDB tables to Elasticsearch using DynamoDB Streams and Lambda can be found here (https://aws.amazon.com/blogs/compute/indexing-amazon-dynamodb-content-with-amazon-elasticsearch-service-using-aws-lambda/). As DynamoDB is a NoSQL data store, we can enrich the Scene Metadata table with scene parameters, for example the maximum or minimum speed of the car during the identified event sequence, without worrying about breaking schema changes. It is also straightforward to save a DataFrame from PySpark to DynamoDB using open-source libraries.

As a final note, the modules are built to be exactly that, modular. The three modules that are easily isolated are:

  1. the ECS Task pipeline for extracting ROS bag topic data to parquet files
  2. the EMR Trigger Lambda for tracking incoming files, creating batches, and initiating a batch processing step function
  3. the EMR Pipeline for running PySpark applications leveraging Step Functions and EMR Launch

Clean Up

To clean up the deployment, you can run `bash deploy.sh destroy false`. Some resources like S3 buckets and DynamoDB tables may have to be manually emptied and deleted via the console to be fully removed.

Limitations

The bagpy library used in this pipeline does not yet support complex or non-structured data types like images or LIDAR data. Therefore, its usage is limited to data that can be stored in a tabular CSV format before being converted to parquet.

Conclusion

In this post, we showed how to build an end-to-end Scene Detection pipeline at scale on AWS to perform scene analytics and scenario detection with Spark on EMR from raw vehicle sensor data. In a subsequent blog post, we will cover how to extract and catalog images from ROS bag files, create a labeling job with Amazon SageMaker Ground Truth, and then train a machine learning model to detect cars.

Recommended Reading: Field Notes: Building an Autonomous Driving and ADAS Data Lake on AWS

Extract Insights From Customer Conversations with Amazon Transcribe Call Analytics

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/extract-insights-from-customer-conversations-with-amazon-transcribe-call-analytics/

In 2017, we launched Amazon Transcribe, an automatic speech recognition (ASR) service that makes it easy to add speech-to-text capabilities to any application. Today, I’m very happy to announce the availability of Amazon Transcribe Call Analytics, a new feature that lets you easily extract valuable insights from customer conversations with a single API call.

Each discussion with potential or existing customers is an opportunity to learn about their needs and expectations. For example, it’s important for customer service teams to figure out the main reasons why customers are calling them, and measure customer satisfaction during these calls. Likewise, salespeople try to gauge customer interest, and their reaction to a particular sales pitch.

Thus, many customers and partners would like to add call analytics capabilities to different applications, regardless of their contact center provider. They often need to analyze more than phone calls, for example web-based audio and video calls. So far, they’ve typically done this by stitching AI services and dedicated ML models together, and they’ve asked us for a simpler solution.

We got to work and built Amazon Transcribe Call Analytics, a new addition to Transcribe and a key enhancement to AWS Contact Center Intelligence. If you can’t wait to try it, feel free to jump now to the AWS console. If you’d like to learn more, read on!

Introducing Amazon Transcribe Call Analytics
Based on ASR implemented in Transcribe, Transcribe Call Analytics adds natural language processing (NLP) capabilities specifically trained on customer calls, and optimized to provide highly accurate call transcripts and actionable insights. With a simple API call, developers can now easily add call analytics to any application, and extract customer insights from conversations without having to build AI pipelines and train custom ML models.

Key features of Transcribe Call Analytics include:

  • Timestamped turn-by-turn call transcription in 21 languages.
  • Issue detection, which picks up the shortest set of contiguous words in a conversation turn that represents the reason why the customer is calling. This works out of the box without any configuration or training.
  • Call categorization based on conversational characteristics:
    • Matching specific words and phrases,
    • Detecting non-talk time,
    • Detecting interruptions,
    • Analyzing sentiment for the customer and the agent.
  • Call characteristics such as:
    • How quickly and loudly a customer or agent is speaking,
    • Detecting non-talk time,
    • Detecting interruptions.
  • Redaction of sensitive data from the text transcript and the corresponding audio file.

For example, you can create rules to flag calls where customers interrupt the agent, exhibit negative sentiment, and say “I want to speak with the manager”. These calls certainly did not go well, and are worth analyzing in detail! You can also look for calls where agents don’t use pre-defined greetings (“Welcome to ACME Support, how can I help you today?”) within the first 15 seconds, to measure script compliance and help supervisors identify agent coaching opportunities. Another popular scenario is to create rules that flag mentions of your specific products and services (“Your ACME Turbo 2000 vacuum cleaner isn’t working like it should”), in order to pick up any emerging trends you’d need to be aware of.
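
Categories like these can also be created programmatically. The sketch below uses a hypothetical category name and phrase to flag calls where the customer interrupts the agent, expresses negative sentiment, and asks for a manager; it follows the CreateCallAnalyticsCategory rule structure, but the exact rules you need will differ.

import boto3

transcribe = boto3.client("transcribe")

transcribe.create_call_analytics_category(
    CategoryName="supervisor-escalation-demo",  # hypothetical name
    Rules=[
        # The customer interrupted the agent at some point during the call.
        {"InterruptionFilter": {"ParticipantRole": "CUSTOMER"}},
        # The customer expressed negative sentiment.
        {"SentimentFilter": {"Sentiments": ["NEGATIVE"], "ParticipantRole": "CUSTOMER"}},
        # The customer asked for a manager.
        {"TranscriptFilter": {
            "TranscriptFilterType": "EXACT",
            "ParticipantRole": "CUSTOMER",
            "Targets": ["I want to speak with the manager"],
        }},
    ],
)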

Last but not least, you can further process the text transcript with other AI services such as Amazon Translate, or with custom NLP models built with Amazon SageMaker.

Now, let’s do a quick demo.

Extracting Insights with Amazon Transcribe Call Analytics
Here’s a fictitious support call, where a lady calls her bank to report that she’s lost her credit and debit cards. The sound file is a stereo WAV file (16-bit, 8 kHz).

Transcribe Call Analytics requires that the agent and the customer are recorded on separate channels. We also need to tell the service which channel is the agent channel. In a stereo file, the left channel is usually the first channel (channel #0), and the right channel is the second one (channel #1). This is the case for this call.

If you’re not sure which is which, you can easily use the versatile ffmpeg open source tool to extract each channel to a separate audio file.

$ ffmpeg -i demo-call.wav -map_channel 0.0.0 channel0.wav -map_channel 0.0.1 channel1.wav

You can use the same technique to extract audio channels from other file types, such as video files, and recombine them to a stereo audio file. You’ll find more information in the ffmpeg documentation.

Now that I’m sure that the agent is in channel #1, I use the AWS CLI to upload the audio file to an S3 bucket.

$ aws s3 cp launch-call.wav s3://jsimon-transcribe-useast1/demo-call.wav --region us-east-1

Opening the Transcribe Call Analytics console, I see that call category templates are available.

Call categories

I decide to create one for supervisor escalations. Then, with a couple of clicks, I create a custom call category named welcome-message, to check if the agent starts the call with an appropriate welcome. I could add several phrases to check for if needed. We recommend that you use short sentences to minimize the chance of filler words popping up (‘hmm’, ‘err’, and so on).

Call category

Then, I create a call analytics job using the general model available in Transcribe. I also enable automatic language detection.

Creating a job

Then, I define the location of the audio file in S3, flagging channel #1 as the agent channel.

Creating a job

I decide to store the transcript in the default S3 bucket created by Transcribe in my account. I could also use my own bucket if needed. Then, I pick an AWS Identity and Access Management (IAM) role with sufficient permissions, and I launch the job.
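
The same job can also be started with a single API call instead of the console. This is a minimal sketch with placeholder bucket, job, and role names, flagging channel #1 as the agent as in the walkthrough above.

import boto3

transcribe = boto3.client("transcribe")

transcribe.start_call_analytics_job(
    CallAnalyticsJobName="demo-call-analytics-job",  # placeholder name
    Media={"MediaFileUri": "s3://example-bucket/demo-call.wav"},
    OutputLocation="s3://example-bucket/analytics-output/",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/TranscribeCallAnalyticsRole",
    ChannelDefinitions=[
        {"ChannelId": 0, "ParticipantRole": "CUSTOMER"},
        {"ChannelId": 1, "ParticipantRole": "AGENT"},
    ],
)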

A minute or so later, the job is complete. The console contains a preview of the text transcript, as well as a link to the full JSON transcript.

Viewing the transcript

As the agent used the proper welcome sentence in the first 15 seconds, the call is tagged with the category I created earlier.

Call categories

In the downloaded JSON transcript, each sentence in the conversation is enriched with metadata such as per-word loudness, measured on a 0-100 range with 100 being extremely loud. Here’s the first sentence:

"BeginOffsetMillis":440,"EndOffsetMillis":4960,
"Sentiment":"NEUTRAL",
"ParticipantRole":"AGENT",
"LoudnessScores":[78.68,80.4,81.91,78.95,82.34],
"Content":"Hello and thank you for calling the bank. This is Ashley speaking, how may I help you today?"

Looking at the next sentence, I see that Transcribe Call Analytics automatically detected what the customer issue is. The corresponding text is in bold:

"Content": "Hi um uh you just need to cancel my card. Um I have a debit card and a credit card.",
"IssuesDetected":[{"UnredactedCharacterOffsets":{"Begin": 26,"End": 40}}. . .

At the end of the transcript, I see global call statistics (duration, talk time, words per minute, matched categories). Transcribe also gives me overall sentiment information, measured from -5 (extremely negative) to +5 (extremely positive), as well as a breakdown into four quarters.

"Sentiment":{"OverallSentiment":{"AGENT":2.6,"CUSTOMER":0.2},
"SentimentByPeriod":{"QUARTER":
{"AGENT":[
{"Score":1.9,"BeginOffsetMillis":0,"EndOffsetMillis":68457},
{"Score":-0.7,"BeginOffsetMillis":68457,"EndOffsetMillis":136915},
{"Score":5.0,"BeginOffsetMillis":136915,"EndOffsetMillis":205372},
{"Score":3.0,"BeginOffsetMillis":205372,"EndOffsetMillis":273830}],
"CUSTOMER":[
{"Score":-1.7,"BeginOffsetMillis":0,"EndOffsetMillis":68165},
{"Score":0.0,"BeginOffsetMillis":68165,"EndOffsetMillis":136330},
{"Score":0.0,"BeginOffsetMillis":136330,"EndOffsetMillis":204495},
{"Score":2.1,"BeginOffsetMillis":204495,"EndOffsetMillis":272660}]}}}

We can see that the customer started the call with negative sentiment, moving quickly to neutral sentiment, and ending the call with positive sentiment. This is a good sign that the call was handled satisfactorily, and that the customer problem was solved.
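
To spot the same trend across many calls, a short script can pull the per-quarter customer sentiment out of each transcript. This is a minimal sketch assuming the sentiment block shown above; the filename is a placeholder, and the block’s exact location in the JSON may differ, so adjust the lookup to your transcript.

import json

def summarize_sentiment(sentiment):
    # 'sentiment' is the "Sentiment" block shown above.
    print("Overall:", sentiment["OverallSentiment"])
    for quarter in sentiment["SentimentByPeriod"]["QUARTER"]["CUSTOMER"]:
        start_s = quarter["BeginOffsetMillis"] / 1000
        end_s = quarter["EndOffsetMillis"] / 1000
        print(f"Customer sentiment {quarter['Score']:+.1f} from {start_s:.0f}s to {end_s:.0f}s")

# Load a downloaded transcript (placeholder filename) and locate the sentiment block.
with open("demo-call-analytics.json") as f:
    doc = json.load(f)
summarize_sentiment(doc.get("ConversationCharacteristics", doc)["Sentiment"])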

If you’d like to convert the transcript to a Word document with additional visualizations, my colleague Andrew Kane has built a nice tool and made it available on GitHub. Here’s a sample report produced by his tool.

Andrew's tool

AWS Customers and Partners Are Using Amazon Transcribe Call Analytics

Ben Rigby, the SVP, Global Head of Product & Engineering, Artificial Intelligence, Automation, and Workforce at Talkdesk, told us, “Our customers are processing millions of customer service calls in their contact centers a year and have a critical need to extract actionable conversation insights to ensure positive business outcomes. As an AWS Contact Center Intelligence partner, we further enhanced our call transcription capabilities with Amazon Transcribe. With the launch of Amazon Transcribe Call Analytics, we’re excited to add even more AI capabilities to our Speech Analytics and QM Assist products. These deeper insights can provide agents and supervisors with the data they need to improve the speed and quality of their customer service while boosting workforce productivity.”

Praphul Kumar, the Chief Product Officer of SuccessKPI, adds, “Amazon Transcribe Call Analytics API enables us to add ML-based capabilities to our platform faster and at a lower cost. This new API removes the need to integrate multiple AI services together and develop custom machine learning models in certain areas. With Transcribe Call Analytics, we will be able to provide conversation insights such as sentiment, non-talk time, and call categories to gauge agent performance. This helps to drive better call outcomes, reduce agent turnover, uncover agent coaching opportunities, and measure call script compliance. Combining AWS services into SuccessKPI’s experience analytics platform was a no brainer. We are looking forward to bringing this valuable capability into the hands of large enterprises and government agencies.”

Getting Started
A single API call is all it takes to extract rich insights from your customer conversations. You can start using Amazon Transcribe Call Analytics today in the following regions:

  • US West (Oregon), US East (N. Virginia),
  • Canada (Central),
  • Europe (London), Europe (Frankfurt),
  • Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Seoul), Asia Pacific (Tokyo), Asia Pacific (Sydney).

Please give this new feature a try in the AWS console, and let us know what you think. We always look forward to your feedback! You can send it through your usual AWS Support contacts or post it on the AWS Forum for Amazon Transcribe.

One last thing: if you’re looking for an easy-to-use omnichannel cloud contact center, you should definitely take a look at Amazon Connect and its ML-powered analytics, Contact Lens.

– Julien

Educating young people in AI, machine learning, and data science: new seminar series

Post Syndicated from Sue Sentance original https://www.raspberrypi.org/blog/ai-machine-learning-data-science-education-seminars/

A recent Forbes article reported that over the last four years, the use of artificial intelligence (AI) tools in many business sectors has grown by 270%. AI has a history dating back to Alan Turing’s work in the 1940s, and we can define AI as the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings.

A woman explains a graph on a computer screen to two men.
Recent advances in computing technology have accelerated the rate at which AI and data science tools are coming to be used.

Four key areas of AI are machine learning, robotics, computer vision, and natural language processing. Other advances in computing technology mean we can now store and efficiently analyse colossal amounts of data (big data); consequently, data science was formed as an interdisciplinary field combining mathematics, statistics, and computer science. Data science is often presented as intertwined with machine learning, as data scientists commonly use machine learning techniques in their analysis.

Venn diagram showing the overlaps between computer science, AI, machine learning, statistics, and data science.
Computer science, AI, statistics, machine learning, and data science are overlapping fields. (Diagram from our forthcoming free online course about machine learning for educators)

AI impacts everyone, so we need to teach young people about it

AI and data science have recently received huge amounts of attention in the media, as machine learning systems are now used to make decisions in areas such as healthcare, finance, and employment. These AI technologies cause many ethical issues, for example as explored in the film Coded Bias. This film describes the fallout of researcher Joy Buolamwini’s discovery that facial recognition systems do not identify dark-skinned faces accurately, and her journey to push for the first-ever piece of legislation in the USA to govern against bias in the algorithms that impact our lives. Many other ethical issues concerning AI exist and, as highlighted by UNESCO’s examples of AI’s ethical dilemmas, they impact each and every one of us.

Three female teenagers and a teacher use a computer together.
We need to make sure that young people understand AI technologies and how they impact society and individuals.

So how do such advances in technology impact the education of young people? In the UK, a recent Royal Society report on machine learning recommended that schools should “ensure that key concepts in machine learning are taught to those who will be users, developers, and citizens” — in other words, every child. The AI Roadmap published by the UK AI Council in 2020 declared that “a comprehensive programme aimed at all teachers and with a clear deadline for completion would enable every teacher confidently to get to grips with AI concepts in ways that are relevant to their own teaching.” As of yet, very few countries have incorporated any study of AI and data science in their school curricula or computing programmes of study.

A teacher and a student work on a coding task at a laptop.
Our seminar speakers will share findings on how teachers can help their learners get to grips with AI concepts.

Partnering with The Alan Turing Institute for a new seminar series

Here at the Raspberry Pi Foundation, AI, machine learning, and data science are important topics both in our learning resources for young people and educators, and in our programme of research. So we are delighted to announce that starting this autumn we are hosting six free, online seminars on the topic of AI, machine learning, and data science education, in partnership with The Alan Turing Institute.

A woman teacher presents to an audience in a classroom.
Everyone with an interest in computing education research is welcome at our seminars, from researchers to educators and students!

The Alan Turing Institute is the UK’s national institute for data science and artificial intelligence and does pioneering work in data science research and education. The Institute conducts many different strands of research in this area and has a special interest group focused on data science education. As such, our partnership around the seminar series enables us to explore our mutual interest in the needs of young people relating to these technologies.

This promises to be an outstanding series drawing from international experts who will share examples of pedagogic best practice […].

Dr Matt Forshaw, The Alan Turing Institute

Dr Matt Forshaw, National Skills Lead at The Alan Turing Institute and Senior Lecturer in Data Science at Newcastle University, says: “We are delighted to partner with the Raspberry Pi Foundation to bring you this seminar series on AI, machine learning, and data science. This promises to be an outstanding series drawing from international experts who will share examples of pedagogic best practice and cover critical topics in education, highlighting ethical, fair, and safe use of these emerging technologies.”

Our free seminar series about AI, machine learning, and data science

At our computing education research seminars, we hear from a range of experts in the field and build an international community of researchers, practitioners, and educators interested in this important area. Our new free series of seminars runs from September 2021 to February 2022, with some excellent and inspirational speakers:

  • Tues 7 September: Dr Mhairi Aitken from The Alan Turing Institute will share a talk about AI ethics, setting out key ethical principles and how they apply to AI before discussing the ways in which these relate to children and young people.
  • Tues 5 October: Professor Carsten Schulte, Yannik Fleischer, and Lukas Höper from Paderborn University in Germany will use a series of examples from their ProDaBi programme to explore whether and how AI and machine learning should be taught differently from other topics in the computer science curriculum at school. The speakers will suggest that these topics require a paradigm shift for some teachers, and that this shift has to do with the changed role of algorithms and data, and of the societal context.
  • Tues 3 November: Professor Matti Tedre and Dr Henriikka Vartiainen from the University of Eastern Finland will focus on machine learning in the school curriculum. Their talk will map the emerging trajectories in educational practice, theory, and technology related to teaching machine learning in K-12 education.
  • Tues 7 December: Professor Rose Luckin from University College London will be looking at the breadth of issues impacting the teaching and learning of AI.
  • Tues 11 January: We’re delighted that Dr Dave Touretzky and Dr Fred Martin (Carnegie Mellon University and University of Massachusetts Lowell, respectively) from the AI4K12 Initiative in the USA will present some of the key insights into AI that the researchers hope children will acquire, and how they see K-12 AI education evolving over the next few years.
  • Tues 1 February: Speaker to be confirmed

How you can join our online seminars

All seminars start at 17:00 UK time (18:00 Central European Time, 12 noon Eastern Time, 9:00 Pacific Time) and take place in an online format, with a presentation, breakout discussion groups, and a whole-group Q&A.

Sign up now and we’ll send you the link to join on the day of each seminar — don’t forget to put the dates in your diary!

In the meantime, you can explore some of our educational resources related to machine learning and data science.

The post Educating young people in AI, machine learning, and data science: new seminar series appeared first on Raspberry Pi.

Hiding Malware in ML Models

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/07/hiding-malware-in-ml-models.html

Interesting research: “EvilModel: Hiding Malware Inside of Neural Network Models”.

Abstract: Delivering malware covertly and detection-evadingly is critical to advanced malware campaigns. In this paper, we present a method that delivers malware covertly and detection-evadingly through neural network models. Neural network models are poorly explainable and have a good generalization ability. By embedding malware into the neurons, malware can be delivered covertly with minor or even no impact on the performance of neural networks. Meanwhile, since the structure of the neural network models remains unchanged, they can pass the security scan of antivirus engines. Experiments show that 36.9MB of malware can be embedded into a 178MB-AlexNet model within 1% accuracy loss, and no suspicious are raised by antivirus engines in VirusTotal, which verifies the feasibility of this method. With the widespread application of artificial intelligence, utilizing neural networks becomes a forwarding trend of malware. We hope this work could provide a referenceable scenario for the defense on neural network-assisted attacks.

News article.

Paging Doctor Cloud! Amazon HealthLake Is Now Generally Available

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/paging-doctor-cloud-amazon-healthlake-is-now-generally-available/

At AWS re:Invent 2020, we previewed Amazon HealthLake, a fully managed, HIPAA-eligible service that allows healthcare and life sciences customers to aggregate their health information from different silos and formats into a structured, centralized AWS data lake, and extract insights from that data with analytics and machine learning (ML). Today, I’m very happy to announce that Amazon HealthLake is generally available to all AWS customers.

The ability to store, transform, and analyze health data quickly and at any scale is critical in driving high-quality health decisions. In their daily practice, doctors need a complete chronological view of patient history to identify the best course of action. During an emergency, giving medical teams the right information at the right time can dramatically improve patient outcomes. Likewise, healthcare and life sciences researchers need high-quality, normalized data that they can analyze and build models with, to identify population health trends or drug trial recipients.

Traditionally, most health data has been locked in unstructured text such as clinical notes, and stored in IT silos. Heterogeneous applications, infrastructure, and data formats have made it difficult for practitioners to access patient data, and extract insights from it. We built Amazon HealthLake to solve that problem.

If you can’t wait to get started, you can jump to the AWS console for Amazon HealthLake now. If you’d like to learn more, read on!

Introducing Amazon HealthLake
Amazon HealthLake is backed by fully-managed AWS infrastructure. You won’t have to procure, provision, or manage a single piece of IT equipment. All you have to do is create a new data store, which only takes a few minutes. Once the data store is ready, you can immediately create, read, update, delete, and query your data. HealthLake exposes a simple REST Application Programming Interface (API) available in the most popular languages, which customers and partners can easily integrate in their business applications.

Security is job zero at AWS. By default, HealthLake encrypts data at rest with AWS Key Management Service (KMS). You can use an AWS-managed key or your own key. KMS is designed so that no one, including AWS employees, can retrieve your plaintext keys from the service. For data in transit, HealthLake uses industry-standard TLS 1.2 encryption end to end.

At launch, HealthLake supports both structured and unstructured text data typically found in clinical notes, lab reports, insurance claims, and so on. The service stores this data in the Fast Healthcare Interoperability Resource (FHIR, pronounced ‘fire’) format, a standard designed to enable exchange of health data. HealthLake is compatible with the latest revision (R4) and currently supports 71 FHIR resource types, with additional resources to follow.

If your data is already in FHIR format, great! If not, you can convert it yourself, or rely on partner solutions available in AWS Marketplace. At launch, HealthLake includes validated connectors for Redox, HealthLX, Diameter Health, and InterSystems applications. They make it easy to convert your HL7v2, CCDA, and flat file data to FHIR, and to upload it to HealthLake.

As data is uploaded, HealthLake uses integrated natural language processing to extract entities present in your documents and stores the corresponding metadata. These entities include anatomy, medical conditions, medications, protected health information, tests, treatments, and procedures. They are also matched to industry-standard ICD-10-CM and RxNorm codes.

After you’ve uploaded your data, you can start querying it, by assigning parameter values to FHIR resources and extracted entities. Whether you need to access information on a single patient, or want to export many documents to build a research dataset, all it takes is a single API call.

Let’s do a quick demo.

Querying FHIR Data in Amazon HealthLake
Opening the AWS console for HealthLake, I click on ‘Create a Data Store’. Then, I simply pick a name for my data store, and decide to encrypt it with an AWS managed key. I also tick the box that preloads sample synthetic data, which is a great way to quickly kick the tires of the service without having to upload my own data.

Creating a data store
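If you prefer to script this step, the data store can also be created programmatically. Below is a minimal sketch using the AWS SDK for Python (boto3); the data store name is an assumption for illustration, and you should check the HealthLake API reference for the full set of options (such as customer-managed KMS keys).

import boto3

healthlake = boto3.client("healthlake")

# Create an R4 FHIR data store preloaded with synthetic (Synthea) sample data,
# mirroring the 'preload sample data' checkbox in the console.
response = healthlake.create_fhir_datastore(
    DatastoreName="my-demo-datastore",  # assumed name, for illustration only
    DatastoreTypeVersion="R4",
    PreloadDataConfig={"PreloadDataType": "SYNTHEA"},
)

datastore_id = response["DatastoreId"]

# The data store takes a few minutes to become ACTIVE; check its status.
status = healthlake.describe_fhir_datastore(DatastoreId=datastore_id)
print(status["DatastoreProperties"]["DatastoreStatus"])

The same DescribeFHIRDatastore response also includes the data store’s HTTPS endpoint, which is what the queries below are sent to.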

After a few minutes, the data store is active, and I can send queries to its HTTPS endpoint. In the example below, I look for clinical notes (and clinical notes only) that contain the ICD-10-CM entity for ‘hypertension’ with a confidence score of 99% or more. Under the hood, the AWS console is sending an HTTP GET request to the endpoint. I highlighted the corresponding query string.

Querying HealthLake

The query runs in seconds. Examining the JSON response in my browser, I see that it contains two documents. For each one, I can see lots of information: when it was created, which organization owns it, who the author is, and more. I can also see that HealthLake has automatically extracted a long list of entities, with names, descriptions, and confidence scores, and added them to the document.

HealthLake entities

The document is attached in the response in base64 format.

HealthLake document

Saving the string to a text file, and decoding it with a command-line tool, I see the following:

Mr Nesser is a 52 year old Caucasian male with an extensive past medical history that includes coronary artery disease , atrial fibrillation , hypertension , hyperlipidemia , presented to North ED with complaints of chills , nausea , acute left flank pain and some numbness in his left leg

This document is spot on. As you can see, it’s really easy to query and retrieve data stored in Amazon HealthLake.
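The same kind of search can be sent programmatically. The sketch below signs a FHIR search request with SigV4 using botocore and sends it with the requests library; the endpoint URL, region, and query parameters are placeholders for illustration, and the exact search syntax for filtering on extracted entities should be checked against the HealthLake documentation.

import boto3
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

region = "us-east-1"  # assumed region for this sketch
credentials = boto3.Session().get_credentials()

# Placeholder endpoint; DescribeFHIRDatastore returns the real one for your data store.
endpoint = "https://healthlake.us-east-1.amazonaws.com/datastore/EXAMPLE_DATASTORE_ID/r4/"

# Search DocumentReference resources; the parameters here are illustrative only.
url = endpoint + "DocumentReference"
params = {"_count": "10"}

request = AWSRequest(method="GET", url=url, params=params)
SigV4Auth(credentials, "healthlake", region).add_auth(request)

response = requests.get(url, params=params, headers=dict(request.headers))
print(response.status_code)
print(response.json())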

Analyzing Data Stored in Amazon HealthLake
You can export data from HealthLake, store it in an Amazon Simple Storage Service (Amazon S3) bucket and use it for analytics and ML tasks. For example, you could transform your data with AWS Glue, query it with Amazon Athena, and visualize it with Amazon QuickSight. You could also use this data to build, train and deploy ML models on Amazon SageMaker.

The following blog posts show you end-to-end analytics and ML workflows based on data stored in HealthLake:

Last but not least, this self-paced workshop will show you how to import and export data with HealthLake, process it with AWS Glue and Amazon Athena, and build an Amazon QuickSight dashboard.

Now, let’s see what our customers are building with HealthLake.

Customers Are Already Using Amazon HealthLake
Based in Chicago, Rush University Medical Center is an early adopter of HealthLake. They used it to build a public health analytics platform on behalf of the Chicago Department of Public Health. The platform aggregates, combines, and analyzes multi-hospital data related to patient admissions, discharges and transfers, electronic lab reporting, hospital capacity, and clinical care documents for COVID-19 patients who are receiving care in and across Chicago hospitals. 17 of the 32 hospitals in Chicago are currently submitting data, and Rush plans to integrate all 32 hospitals by this summer. You can learn more in this blog post.

Recently, Rush launched another project to identify communities that are most exposed to high blood pressure risks, understand the social determinants of health, and improve healthcare access. For this purpose, they collect all sorts of data, such as clinical notes, ambulatory blood pressure measurements from the community, and Medicare claims data. This data is then ingested into HealthLake and stored in FHIR format for further analysis.

Dr. Hota

Says Dr. Bala Hota, Vice President and Chief Analytics Officer at Rush University Medical Center: “We don’t have to spend time building extraneous items or reinventing something that already exists. This allows us to move to the analytics phase much quicker. Amazon HealthLake really accelerates the sort of insights that we need to deliver results for the population. We don’t want to be spending all our time building infrastructure. We want to deliver the insights.


Cortica is on a mission to revolutionize healthcare for children with autism and other developmental differences. Today, Cortica uses HealthLake to store all patient data in a standardized, secured, and compliant manner. Building ML models with that data, they can track the progress of their patients with sentiment analysis, and they can share with parents the progress their children are making in speech development and motor skills. Cortica can also validate the effectiveness of treatment models and optimize medication regimens.

Ernesto DiMarinoErnesto DiMarino, Head of Enterprise Applications and Data at Cortica told us: “In a matter of weeks rather than months, Amazon HealthLake empowered us to create a centralized platform that securely stores patients’ medical history, medication history, behavioral assessments, and lab reports. This platform gives our clinical team deeper insight into the care progression of our patients. Using predefined notebooks in Amazon SageMaker with data from Amazon HealthLake, we can apply machine learning models to track and prognosticate each patient’s progression toward their goals in ways not otherwise possible. Through this technology, we can also share HIPAA-compliant data with our patients, researchers, and healthcare partners in an interoperable manner, furthering important research into autism treatment.

MEDHOST provides products and services to more than 1,000 healthcare facilities of all types and sizes. These customers want to develop solutions to standardize patient data in FHIR format and build dashboards and advanced analytics to improve patient care, but that is difficult and time consuming today.

Says Pandian Velayutham, Sr. Director Of Engineering at MEDHOST: “With Amazon HealthLake we can meet our customers’ needs by creating a compliant FHIR data store in just days rather than weeks with integrated natural language processing and analytics to improve hospital operational efficiency and provide better patient care.


Getting Started
Amazon HealthLake is available today in the US East (N. Virginia), US East (Ohio), and US West (Oregon) Regions.

Give our self-paced workshop a try, and let us know what you think. As always, we look forward to your feedback. You can send it through your usual AWS Support contacts, or post it on the AWS Forums.

– Julien

AI-Piloted Fighter Jets

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/06/ai-piloted-fighter-jets.html

News from Georgetown’s Center for Security and Emerging Technology:

China Claims Its AI Can Beat Human Pilots in Battle: Chinese state media reported that an AI system had successfully defeated human pilots during simulated dogfights. According to the Global Times report, the system had shot down several PLA pilots during a handful of virtual exercises in recent years. Observers outside China noted that while reports coming out of state-controlled media outlets should be taken with a grain of salt, the capabilities described in the report are not outside the realm of possibility. Last year, for example, an AI agent defeated a U.S. Air Force F-16 pilot five times out of five as part of DARPA’s AlphaDogfight Trial (which we covered at the time). While the Global Times report indicated plans to incorporate AI into future fighter planes, it is not clear how far away the system is from real-world testing. At the moment, the system appears to be used only for training human pilots. DARPA, for its part, is aiming to test dogfights with AI-piloted subscale jets later this year and with full-scale jets in 2023 and 2024.

Amazon CodeGuru Reviewer Updates: New Java Detectors and CI/CD Integration with GitHub Actions

Post Syndicated from Alex Casalboni original https://aws.amazon.com/blogs/aws/amazon_codeguru_reviewer_updates_new_java_detectors_and_cicd_integration_with_github_actions/

Amazon CodeGuru allows you to automate code reviews and improve code quality, and thanks to the new pricing model announced in April you can get started with a lower and fixed monthly rate based on the size of your repository (up to 90% less expensive). CodeGuru Reviewer helps you detect potential defects and bugs that are hard to find in your Java and Python applications, using the AWS Management Console, AWS SDKs, and AWS CLI.

Today, I’m happy to announce that CodeGuru Reviewer natively integrates with the tools that you use every day to package and deploy your code. This new CI/CD experience allows you to trigger code quality and security analysis as a step in your build process using GitHub Actions.

Although the CodeGuru Reviewer console still serves as an analysis hub for all your onboarded repositories, the new CI/CD experience allows you to integrate CodeGuru Reviewer more deeply with your favorite source code management and CI/CD tools.

And that’s not all! Today we’re also releasing 20 new security detectors for Java to help you identify even more issues related to security and AWS best practices.

A New CI/CD Experience for CodeGuru Reviewer
As a developer or development team, you push new code every day and want to identify security vulnerabilities early in the development cycle, ideally at every push. During a pull-request (PR) review, all the CodeGuru recommendations will appear as a comment, as if you had another pair of eyes on the PR. These comments include useful links to help you resolve the problem.

When you push new code or schedule a code review, recommendations will appear in the Security > Code scanning alerts tab on GitHub.

Let’s see how to integrate CodeGuru Reviewer with GitHub Actions.

First of all, create a .yml file in your repository under .github/workflows/ (or update an existing action). This file will contain all of your action’s steps. Let’s go through the individual steps.

The first step is configuring your AWS credentials. You want to do this securely, without storing any credentials in your repository’s code, using the Configure AWS Credentials action. This action allows you to configure an IAM role that GitHub will use to interact with AWS services. This role will require a few permissions related to CodeGuru Reviewer and Amazon S3. You can attach the AmazonCodeGuruReviewerFullAccess managed policy to the action role, in addition to s3:GetObject, s3:PutObject and s3:ListBucket.

This first step will look as follows:

- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v1
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    aws-region: eu-west-1

The access key and secret key correspond to your IAM role and will be used to interact with CodeGuru Reviewer and Amazon S3.

Next, you add the CodeGuru Reviewer action and a final step to upload the results:

- name: Amazon CodeGuru Reviewer Scanner
  uses: aws-actions/codeguru-reviewer
  if: ${{ always() }} 
  with:
    build_path: target # build artifact(s) directory
    s3_bucket: 'codeguru-reviewer-myactions-bucket'  # S3 Bucket starting with "codeguru-reviewer-*"
- name: Upload review result
  if: ${{ always() }}
  uses: github/codeql-action/upload-sarif@v1
  with:
    sarif_file: codeguru-results.sarif.json

The CodeGuru Reviewer action requires two input parameters:

  • build_path: Where your build artifacts are in the repository.
  • s3_bucket: The name of an S3 bucket that you’ve created previously, used to upload the build artifacts and analysis results. It’s a customer-owned bucket so you have full control over access and permissions, in case you need to share its content with other systems.

Now, let’s put all the pieces together.

Your .yml file should look like this:

name: CodeGuru Reviewer GitHub Actions Integration
on: [pull_request, push, schedule]
jobs:
  CodeGuru-Reviewer-Actions:
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-2
      - name: Amazon CodeGuru Reviewer Scanner
        uses: aws-actions/codeguru-reviewer
        if: ${{ always() }} 
        with:
          build_path: target # build artifact(s) directory
          s3_bucket: 'codeguru-reviewer-myactions-bucket'  # S3 Bucket starting with "codeguru-reviewer-*"
      - name: Upload review result
        if: ${{ always() }}
        uses: github/codeql-action/upload-sarif@v1
        with:
          sarif_file: codeguru-results.sarif.json

It’s important to remember that the S3 bucket name needs to start with codeguru-reviewer- and that these actions can be configured to run with the pull_request, push, or schedule triggers (check out the GitHub Actions documentation for the full list of events that trigger workflows). Also keep in mind that there are minor differences in how you configure GitHub-hosted runners and self-hosted runners, mainly in the credentials configuration step. For example, if you run your GitHub Actions in a self-hosted runner that already has access to AWS credentials, such as an EC2 instance, then you don’t need to provide any credentials to this action (check out the full documentation for self-hosted runners).

Now, when you push a change or open a PR, CodeGuru Reviewer will comment on your code changes with a few recommendations.

Or you can schedule a daily or weekly repository scan and check out the recommendations in the Security > Code scanning alerts tab.

New Security Detectors for Java
In December last year, we launched the Java Security Detectors for CodeGuru Reviewer to help you find and remediate potential security issues in your Java applications. These detectors are built with machine learning and automated reasoning techniques, trained on over 100,000 Amazon and open-source code repositories, and based on the decades of expertise of the AWS Application Security (AppSec) team.

For example, some of these detectors will look at potential leaks of sensitive information or credentials through excessively verbose logging, exception handling, and storing passwords in plaintext in memory. The security detectors also help you identify several web application vulnerabilities such as command injection, weak cryptography, weak hashing, LDAP injection, path traversal, secure cookie flag, SQL injection, XPATH injection, and XSS (cross-site scripting).

The new security detectors for Java can identify security issues with the Java Servlet APIs and web frameworks such as Spring. Some of the new detectors will also help you with security best practices for AWS APIs when using services such as Amazon S3, IAM, and AWS Lambda, as well as libraries and utilities such as Apache ActiveMQ, LDAP servers, SAML parsers, and password encoders.

Available Today at No Additional Cost
The new CI/CD integration and security detectors for Java are available today at no additional cost, excluding the storage on S3, which can be estimated based on the size of your build artifacts and the frequency of code reviews. Check out the CodeGuru Reviewer Action in the GitHub Marketplace and the Amazon CodeGuru pricing page to find pricing examples based on the new pricing model we launched last month.

We’re looking forward to hearing your feedback, launching more detectors to help you identify potential issues, and integrating with even more CI/CD tools in the future.

You can learn more about the CI/CD experience and configuration in the technical documentation.

Alex

Amazon SageMaker Named as the Outright Leader in Enterprise MLOps Platforms

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/amazon-sagemaker-named-as-the-outright-leader-in-enterprise-mlops-platforms/

Over the last few years, Machine Learning (ML) has proven its worth in helping organizations increase efficiency and foster innovation. As ML matures, the focus naturally shifts from experimentation to production. ML processes need to be streamlined, standardized, and automated to build, train, deploy, and manage models in a consistent and reliable way. Perennial IT concerns such as security, high availability, scaling, monitoring, and automation also become critical. Great ML models are not going to do much good if they can’t serve fast and accurate predictions to business applications, 24/7 and at any scale.

In November 2017, we launched Amazon SageMaker to help ML Engineers and Data Scientists not only build the best models, but also operate them efficiently. Striving to give our customers the most comprehensive service, we’ve since then added hundreds of features covering every step of the ML lifecycle, such as data labeling, data preparation, feature engineering, bias detection, AutoML, training, tuning, hosting, explainability, monitoring, and automation. We’ve also integrated these features in our web-based development environment, Amazon SageMaker Studio.

Thanks to the extensive ML capabilities available in SageMaker, tens of thousands of AWS customers across all industry segments have adopted ML to accelerate business processes, create innovative user experiences, improve revenue, and reduce costs. Examples include Engie (energy), Deliveroo (food delivery), SNCF (railways), Nerdwallet (financial services), Autodesk (computer-aided design), Formula 1 (auto racing), as well as our very own Amazon Fulfillment Technologies and Amazon Robotics.

Today, we’re happy to announce that in his latest report on Enterprise MLOps Platforms, Bradley Shimmin, Chief Analyst at Omdia, paid SageMaker this compliment: “AWS is the outright leader in the Omdia comparative review of enterprise MLOps platforms. Across almost every measure, the company significantly outscored its rivals, delivering consistent value across the entire ML lifecycle. AWS delivers highly differentiated functionality that targets highly impactful areas of concern for enterprise AI practitioners seeking to not just operationalize but also scale AI across the business.

OMDIA

You can download the full report to learn more.

Getting Started
Curious about Amazon SageMaker? The developer guide will show you how to set it up and start running your notebooks in minutes.

As always, we look forward to your feedback. You can send it through your usual AWS Support contacts or post it on the AWS Forum for Amazon SageMaker.

– Julien

Amazon Redshift ML Is Now Generally Available – Use SQL to Create Machine Learning Models and Make Predictions from Your Data

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-redshift-ml-is-now-generally-available-use-sql-to-create-machine-learning-models-and-make-predictions-from-your-data/

With Amazon Redshift, you can use SQL to query and combine exabytes of structured and semi-structured data across your data warehouse, operational databases, and data lake. Now that AQUA (Advanced Query Accelerator) is generally available, you can improve the performance of your queries by up to 10 times with no additional costs and no code changes. In fact, Amazon Redshift provides up to three times better price/performance than other cloud data warehouses.

But what if you want to go a step further and process this data to train machine learning (ML) models and use these models to generate insights from data in your warehouse? For example, to implement use cases such as forecasting revenue, predicting customer churn, and detecting anomalies? In the past, you would need to export the training data from Amazon Redshift to an Amazon Simple Storage Service (Amazon S3) bucket, and then configure and start a machine learning training process (for example, using Amazon SageMaker). This process required many different skills and usually more than one person to complete. Can we make it easier?

Today, Amazon Redshift ML is generally available to help you create, train, and deploy machine learning models directly from your Amazon Redshift cluster. To create a machine learning model, you use a simple SQL query to specify the data you want to use to train your model, and the output value you want to predict. For example, to create a model that predicts the success rate for your marketing activities, you define your inputs by selecting the columns (in one or more tables) that include customer profiles and results from previous marketing campaigns, and the output column you want to predict. In this example, the output column could be one that shows whether a customer has shown interest in a campaign.

After you run the SQL command to create the model, Redshift ML securely exports the specified data from Amazon Redshift to your S3 bucket and calls Amazon SageMaker Autopilot to prepare the data (pre-processing and feature engineering), select the appropriate pre-built algorithm, and apply the algorithm for model training. You can optionally specify the algorithm to use, for example XGBoost.

Architectural diagram.

Redshift ML handles all of the interactions between Amazon Redshift, S3, and SageMaker, including all the steps involved in training and compilation. When the model has been trained, Redshift ML uses Amazon SageMaker Neo to optimize the model for deployment and makes it available as a SQL function. You can use the SQL function to apply the machine learning model to your data in queries, reports, and dashboards.

Redshift ML now includes many new features that were not available during the preview, including Amazon Virtual Private Cloud (VPC) support. For example:

Architectural diagram.

  • You can also create SQL functions that use existing SageMaker endpoints to make predictions (remote inference). In this case, Redshift ML batches calls to the endpoint to speed up processing.

Before looking into how to use these new capabilities in practice, let’s see the difference between Redshift ML and similar features in AWS databases and analytics services.

  • Amazon Redshift ML – Data: data warehouse, federated relational databases, and S3 data lake (with Redshift Spectrum). Training from SQL: yes, using Amazon SageMaker Autopilot. Predictions using SQL functions: yes; a model can be imported and executed inside the Amazon Redshift cluster, or invoked using a SageMaker endpoint.
  • Amazon Aurora ML – Data: relational database (compatible with MySQL or PostgreSQL). Training from SQL: no. Predictions using SQL functions: yes, using a SageMaker endpoint. A native integration with Amazon Comprehend for sentiment analysis is also available.
  • Amazon Athena ML – Data: S3 data lake; other data sources can be used through Athena Federated Query. Training from SQL: no. Predictions using SQL functions: yes, using a SageMaker endpoint.

Building a Machine Learning Model with Redshift ML
Let’s build a model that predicts if customers will accept or decline a marketing offer.

To manage the interactions with S3 and SageMaker, Redshift ML needs permissions to access those resources. I create an AWS Identity and Access Management (IAM) role as described in the documentation. I use RedshiftML for the role name. Note that the trust policy of the role allows both Amazon Redshift and SageMaker to assume the role to interact with other AWS services.

From the Amazon Redshift console, I create a cluster. In the cluster permissions, I associate the RedshiftML IAM role. When the cluster is available, I load the same dataset used in this super interesting blog post that my colleague Julien wrote when SageMaker Autopilot was announced.

The file I am using (bank-additional-full.csv) is in CSV format. Each line describes a direct marketing activity with a customer. The last column (y) describes the outcome of the activity (if the customer subscribed to a service that was marketed to them).

Here are the first few lines of the file. The first line contains the headers.

age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no

I store the file in one of my S3 buckets. The S3 bucket is used to unload data and store SageMaker training artifacts.

Then, using the Amazon Redshift query editor in the console, I create a table to load the data.

CREATE TABLE direct_marketing (
	age DECIMAL NOT NULL, 
	job VARCHAR NOT NULL, 
	marital VARCHAR NOT NULL, 
	education VARCHAR NOT NULL, 
	credit_default VARCHAR NOT NULL, 
	housing VARCHAR NOT NULL, 
	loan VARCHAR NOT NULL, 
	contact VARCHAR NOT NULL, 
	month VARCHAR NOT NULL, 
	day_of_week VARCHAR NOT NULL, 
	duration DECIMAL NOT NULL, 
	campaign DECIMAL NOT NULL, 
	pdays DECIMAL NOT NULL, 
	previous DECIMAL NOT NULL, 
	poutcome VARCHAR NOT NULL, 
	emp_var_rate DECIMAL NOT NULL, 
	cons_price_idx DECIMAL NOT NULL, 
	cons_conf_idx DECIMAL NOT NULL, 
	euribor3m DECIMAL NOT NULL, 
	nr_employed DECIMAL NOT NULL, 
	y BOOLEAN NOT NULL
);

I load the data into the table using the COPY command. I can use the same IAM role I created earlier (RedshiftML) because I am using the same S3 bucket to import and export the data.

COPY direct_marketing 
FROM 's3://my-bucket/direct_marketing/bank-additional-full.csv' 
DELIMITER ',' IGNOREHEADER 1
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
REGION 'us-east-1';

Now, I create the model straight from the SQL interface using the new CREATE MODEL statement:

CREATE MODEL direct_marketing
FROM direct_marketing
TARGET y
FUNCTION predict_direct_marketing
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'my-bucket'
);

In this SQL command, I specify the parameters required to create the model:

  • FROM – I select all the rows in the direct_marketing table, but I can replace the name of the table with a nested query (see example below).
  • TARGET – This is the column that I want to predict (in this case, y).
  • FUNCTION – The name of the SQL function to make predictions.
  • IAM_ROLE – The IAM role assumed by Amazon Redshift and SageMaker to create, train, and deploy the model.
  • S3_BUCKET – The S3 bucket where the training data is temporarily stored, and where model artifacts are stored if you choose to retain a copy of them.

Here I am using a simple syntax for the CREATE MODEL statement. For more advanced users, other options are available, such as the following (a sketch combining them appears after this list):

  • MODEL_TYPE – To use a specific model type for training, such as XGBoost or multilayer perceptron (MLP). If I don’t specify this parameter, SageMaker Autopilot selects the appropriate model class to use.
  • PROBLEM_TYPE – To define the type of problem to solve: regression, binary classification, or multiclass classification. If I don’t specify this parameter, the problem type is discovered during training, based on my data.
  • OBJECTIVE – The objective metric used to measure the quality of the model. This metric is optimized during training to provide the best estimate from data. If I don’t specify a metric, the default behavior is to use mean squared error (MSE) for regression, the F1 score for binary classification, and accuracy for multiclass classification. Other available options are F1Macro (to apply F1 scoring to multiclass classification) and area under the curve (AUC). More information on objective metrics is available in the SageMaker documentation.
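As a sketch of how these options fit together, the statement below combines MODEL_TYPE, PROBLEM_TYPE, and OBJECTIVE, and submits the SQL through the Amazon Redshift Data API with boto3. The cluster identifier, database, and user are placeholders, and the chosen option values are just one reasonable combination for this dataset, not the settings used elsewhere in this post.

import boto3

redshift_data = boto3.client("redshift-data")

# An advanced variant of the earlier CREATE MODEL statement. XGBOOST,
# BINARY_CLASSIFICATION, and 'f1' are illustrative choices; SageMaker Autopilot
# picks sensible defaults when they are omitted.
create_model_sql = """
CREATE MODEL tuned_direct_marketing
FROM direct_marketing
TARGET y
FUNCTION predict_tuned_direct_marketing
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
MODEL_TYPE XGBOOST
PROBLEM_TYPE BINARY_CLASSIFICATION
OBJECTIVE 'f1'
SETTINGS (
  S3_BUCKET 'my-bucket'
);
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # placeholder
    Database="dev",                           # placeholder
    DbUser="awsuser",                         # placeholder
    Sql=create_model_sql,
)
print(response["Id"])  # statement ID; the model training itself runs asynchronously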

Depending on the complexity of the model and the amount of data, it can take some time for the model to be available. I use the SHOW MODEL command to see when it is available:

SHOW MODEL direct_marketing

When I execute this command using the query editor in the console, I get the following output:

Console screenshot.

As expected, the model is currently in the TRAINING state.

When I created this model, I selected all the columns in the table as input parameters. I wonder what happens if I create a model that uses fewer input parameters? I am in the cloud and I am not slowed down by limited resources, so I create another model using a subset of the columns in the table:

CREATE MODEL simple_direct_marketing
FROM (
        SELECT age, job, marital, education, housing, contact, month, day_of_week, y
 	  FROM direct_marketing
)
TARGET y
FUNCTION predict_simple_direct_marketing
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'my-bucket'
);

After some time, my first model is ready, and I get this output from SHOW MODEL. The actual output in the console spans multiple pages; I merged the results here to make it easier to follow:

Console screenshot.

From the output, I see that the model has been correctly recognized as BinaryClassification, and F1 has been selected as the objective. The F1 score is a metric that considers both precision and recall. It returns a value between 0 (lowest possible score) and 1 (perfect precision and recall). The final score for the model (validation:f1) is 0.79. In this table I also find the name of the SQL function (predict_direct_marketing) that has been created for the model, its parameters and their types, and an estimate of the training cost.
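As a quick reminder of what that number means, F1 is the harmonic mean of precision and recall. Here is a tiny sketch, with made-up precision and recall values just to illustrate the arithmetic:

def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall: 1.0 is perfect, 0.0 is the worst.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with illustrative values only (not taken from the model output above).
print(round(f1_score(0.82, 0.76), 2))  # 0.79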

When the second model is ready, I compare the F1 scores. The F1 score of the second model (0.66) is lower than that of the first (0.79). However, with fewer parameters, the SQL function is easier to apply to new data. As is often the case with machine learning, I have to find the right balance between complexity and usability.

Using Redshift ML to Make Predictions
Now that the two models are ready, I can make predictions using SQL functions. Using the first model, I check how many false positives (wrong positive predictions) and false negatives (wrong negative predictions) I get when applying the model on the same data used for training:

SELECT predict_direct_marketing, y, COUNT(*)
  FROM (SELECT predict_direct_marketing(
                   age, job, marital, education, credit_default, housing,
                   loan, contact, month, day_of_week, duration, campaign,
                   pdays, previous, poutcome, emp_var_rate, cons_price_idx,
                   cons_conf_idx, euribor3m, nr_employed), y
          FROM direct_marketing)
 GROUP BY predict_direct_marketing, y;

The result of the query shows that the model is better at predicting negative rather than positive outcomes. In fact, even though the number of true negatives is much larger than the number of true positives, there are many more false positives than false negatives. I added some comments in green and red to the following screenshot to clarify the meaning of the results.

Console screenshot.

Using the second model, I see how many customers might be interested in a marketing campaign. Ideally, I should run this query on new customer data, not the same data I used for training.

SELECT COUNT(*)
  FROM direct_marketing
 WHERE predict_simple_direct_marketing(
           age, job, marital, education, housing,
           contact, month, day_of_week) = true;

Wow, looking at the results, there are more than 7,000 prospects!

Console screenshot.

Availability and Pricing
Redshift ML is available today in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), US West (N. California), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (Paris), Europe (Stockholm), Asia Pacific (Hong Kong), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), and South America (São Paulo). For more information, see the AWS Regional Services list.

With Redshift ML, you pay only for what you use. When training a new model, you pay for the Amazon SageMaker Autopilot and S3 resources used by Redshift ML. When making predictions, there is no additional cost for models imported into your Amazon Redshift cluster, as in the example I used in this post.

Redshift ML also allows you to use existing Amazon SageMaker endpoints for inference. In that case, the usual SageMaker pricing for real-time inference applies. Here you can find a few tips on how to control your costs with Redshift ML.

To learn more, you can see this blog post from when Redshift ML was announced in preview and the documentation.

Start getting better insights from your data with Redshift ML.

Danilo

Forwarding emails automatically based on content with Amazon Simple Email Service

Post Syndicated from Murat Balkan original https://aws.amazon.com/blogs/messaging-and-targeting/forwarding-emails-automatically-based-on-content-with-amazon-simple-email-service/

Introduction

Email is one of the most popular channels consumers use to interact with support organizations. In its most basic form, consumers send their email to a catch-all email address, where it is further dispatched to the correct support group. Often, this requires a person to inspect content manually. Some IT organizations even have a dedicated support group that handles triaging the incoming emails before assigning them to specialized support teams. Triaging each email can be challenging, and delays in email routing and support processes can reduce customer satisfaction. By using Amazon Simple Email Service’s deep integration with Amazon S3, AWS Lambda, and other AWS services, you can automate the task of categorizing and routing emails. This automation results in increased operational efficiency and reduced costs.

This blog post shows you how a serverless application receives emails with Amazon SES and delivers them to an Amazon S3 bucket. The application uses Amazon Comprehend to identify the dominant language of the message body. It then looks up that language in an Amazon DynamoDB table to find the email address of the support group that specializes in it. As the last step, it forwards the email via Amazon SES to its destination. Archiving incoming emails to Amazon S3 also enables further processing or auditing.

Architecture

By completing the steps in this post, you will create a system that uses the architecture illustrated in the following image:

Architecture showing how to forward emails by content using Amazon SES

The flow of events starts when a customer sends an email to the generic support email address, for example info@YOUR_DOMAIN_NAME_HERE. Amazon SES receives this email through a receipt rule. As per the rule, incoming messages are written to a specified Amazon S3 bucket with a given prefix.

This bucket and prefix are configured with S3 Events to trigger a Lambda function on object creation events. The Lambda function reads the email object, parses the contents, and sends them to Amazon Comprehend for language detection.

The Lambda function then looks up the detected language code in an Amazon DynamoDB table, which contains the mappings between language codes and support group email addresses for these languages. One support group could answer English emails, while another support group answers French emails. The Lambda function determines the destination address and forwards the same email to it. If the lookup does not return a destination address, or if the language could not be detected, the email is forwarded to a catch-all email address specified during the application deployment.
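To make the flow concrete, here is a heavily simplified sketch of such a Lambda handler. It is not the BlogEmailForwarder code from the repository; it assumes the table is named language-lookup and that the from and catch-all addresses are passed in as environment variables, and it skips the header rewriting a production forwarder would need.

import email
import os

import boto3

s3 = boto3.client("s3")
comprehend = boto3.client("comprehend")
dynamodb = boto3.resource("dynamodb")
ses = boto3.client("ses")

TABLE_NAME = os.environ.get("TABLE_NAME", "language-lookup")
FROM_ADDRESS = os.environ["FROM_EMAIL_ADDRESS"]    # e.g. info@YOUR_DOMAIN_NAME_HERE
CATCH_ALL = os.environ["CATCH_ALL_EMAIL_ADDRESS"]  # e.g. catchall@YOUR_DOMAIN_NAME_HERE


def handler(event, context):
    # Triggered by the S3 object-created event for a new inbound email.
    record = event["Records"][0]["s3"]
    obj = s3.get_object(Bucket=record["bucket"]["name"], Key=record["object"]["key"])
    raw = obj["Body"].read()

    message = email.message_from_bytes(raw)
    body = first_text_part(message)

    # Detect the dominant language of the message body.
    language_code = None
    if body.strip():
        languages = comprehend.detect_dominant_language(Text=body[:5000])["Languages"]
        if languages:
            language_code = languages[0]["LanguageCode"]

    # Look up the support group for that language; fall back to the catch-all address.
    destination = CATCH_ALL
    if language_code:
        item = dynamodb.Table(TABLE_NAME).get_item(Key={"language": language_code}).get("Item")
        if item:
            destination = item["destination"]

    # Forward the original message. A production version would rewrite the
    # From/Reply-To headers to satisfy SES sending requirements.
    ses.send_raw_email(Source=FROM_ADDRESS, Destinations=[destination], RawMessage={"Data": raw})


def first_text_part(message):
    # Return the first text/plain part of the message, or an empty string.
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            return part.get_payload(decode=True).decode(errors="ignore")
    return ""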

In this example, Amazon SES hosts the destination email addresses used for forwarding, but this is not a requirement. External email servers can also receive the forwarded emails.

Prerequisites

To use Amazon SES for receiving email messages, you need to verify a domain that you own. Refer to the documentation to verify your domain in the Amazon SES console. If you do not have a domain name, you can register one with Amazon Route 53.

Deploying the Sample Application

Clone this GitHub repository to your local machine and install and configure AWS SAM with a test AWS Identity and Access Management (IAM) user.

You will use AWS SAM to deploy the remaining parts of this serverless architecture.

The AWS SAM template creates the following resources:

  • An Amazon DynamoDB mapping table (language-lookup) that contains information about language codes and associates them with destination email addresses.
  • An AWS Lambda function (BlogEmailForwarder) that reads the email content, parses it, detects the language, looks up the forwarding destination email address, and sends the email.
  • An Amazon S3 bucket, which will store the incoming emails.
  • IAM roles and policies.

To start the AWS SAM deployment, navigate to the root directory of the repository you downloaded, where the template.yaml AWS SAM template resides. AWS SAM also requires you to specify an Amazon Simple Storage Service (Amazon S3) bucket to hold the deployment artifacts. If you haven’t already created a bucket for this purpose, create one now. Refer to the documentation to learn how to create an Amazon S3 bucket. The bucket should be readable and writable by an AWS Identity and Access Management (IAM) user.

At the command line, enter the following command to package the application:

sam package --template template.yaml --output-template-file output_template.yaml --s3-bucket BUCKET_NAME_HERE

In the preceding command, replace BUCKET_NAME_HERE with the name of the Amazon S3 bucket that should hold the deployment artifacts.

AWS SAM packages the application and copies it into this Amazon S3 bucket.

When the AWS SAM package command finishes running, enter the following command to deploy the package:

sam deploy --template-file output_template.yaml --stack-name blogstack --capabilities CAPABILITY_IAM --parameter-overrides FromEmailAddress=info@YOUR_DOMAIN_NAME_HERE CatchAllEmailAddress=catchall@YOUR_DOMAIN_NAME_HERE

In the preceding command, replace YOUR_DOMAIN_NAME_HERE with the domain name you verified with Amazon SES. The same domain also applies to other commands and configurations introduced later.

This example uses "blogstack" as the stack name; you can change it to any other name you want. When you run this command, AWS SAM shows the progress of the deployment.

Configure the Sample Application

Now that you have deployed the application, you will configure it.

Configuring Receipt Rules

To deliver incoming messages to the Amazon S3 bucket, you need to create a rule set and a receipt rule under it.

Note: This blog uses Amazon SES console to create the rule sets. To create the rule sets with AWS CloudFormation, refer to the documentation.

  1. Navigate to the Amazon SES console. From the left navigation choose Rule Sets.
  2. Choose the Create a Receipt Rule button in the right pane.
  3. Add info@YOUR_DOMAIN_NAME_HERE as the first recipient address by entering it into the text box and choosing Add Recipient.


Choose the Next Step button to move on to the next step.

  4. On the Actions page, select S3 from the Add action drop-down to reveal the S3 action’s details. Select the S3 bucket that was created by the AWS SAM template. It is in the format your_stack_name-inboxbucket-randomstring. You can find the exact name in the outputs section of the AWS SAM deployment under the key name InboxBucket, or by visiting the AWS CloudFormation console. Set the Object key prefix to info/. This tells Amazon SES to add this prefix to all messages destined for this recipient address. This way, you can re-use the same bucket for different recipients.

Choose the Next Step button to move on to the next step.

On the Rule Details page, give this rule a name in the Rule name field. This example uses the name info-recipient-rule. Leave the rest of the fields with their default values.

Choose the Next Step button to move on to the next step.

  5. Review your settings on the Review page and finalize rule creation by choosing Create Rule.

  6. In this example, you will be hosting the destination email addresses in Amazon SES rather than forwarding the messages to an external email server. This way, you will be able to see the forwarded messages in your Amazon S3 bucket under different prefixes. To host the destination email addresses, you need to create different rules under the default rule set. Create three additional rules for the catchall@YOUR_DOMAIN_NAME_HERE, english@YOUR_DOMAIN_NAME_HERE, and french@YOUR_DOMAIN_NAME_HERE email addresses by repeating steps 2 to 5. For the Amazon S3 prefixes, use catchall/, english/, and french/ respectively.


Configuring Amazon DynamoDB Table

To configure the Amazon DynamoDB table that is used by the sample application

  1. Navigate to the Amazon DynamoDB console and open the Tables view. Inspect the table created by the AWS SAM application.

The language-lookup table is where languages and their support group mappings are kept. You need to create an item for each language, and an item that holds the default destination email address to use when no language match is found. Amazon Comprehend supports more than 60 different languages. You can check the documentation for the supported languages and add their language codes to this lookup table to enhance this application.

  2. To start inserting items, choose the language-lookup table to open the table overview page.
  3. Select the Items tab and choose Create item. From the dropdown, select Text. Add the following JSON content and choose Save to create your first mapping object. While adding the following object, replace the destination attribute’s value with an email address you own. The email messages will be forwarded to that address.

{
  "language": "en",
  "destination": "english@YOUR_DOMAIN_NAME_HERE"
}

Lastly, create an item for French language support.

{
  "language": "fr",
  "destination": "french@YOUR_DOMAIN_NAME_HERE"
}
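If you prefer to insert these items with a script instead of the console, a minimal boto3 sketch follows. It assumes the deployed table is reachable under the name language-lookup; if your stack generated a different physical table name, use that instead.

import boto3

table = boto3.resource("dynamodb").Table("language-lookup")

# One item per supported language; each destination must be an address you own.
table.put_item(Item={"language": "en", "destination": "english@YOUR_DOMAIN_NAME_HERE"})
table.put_item(Item={"language": "fr", "destination": "french@YOUR_DOMAIN_NAME_HERE"})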

Testing

Now that the application is deployed and configured, you will test it.

  1. Use your favorite email client to send the following email to the info@YOUR_DOMAIN_NAME_HERE address.

Subject: I need help

Body:

Hello, I’d like to return the shoes I bought from your online store. How can I do this?

After the email is sent, navigate to the Amazon S3 console to inspect the contents of the Amazon S3 bucket that is backing the Amazon SES rule sets. You can also check the AWS Lambda logs in the Amazon CloudWatch console to confirm that the Lambda function was triggered and ran successfully. You should receive an email with the same content at the address you defined for the English language.

  2. Next, send another email with similar content, this time in French.

Subject: j’ai besoin d’aide

Body:

Bonjour, je souhaite retourner les chaussures que j’ai achetées dans votre boutique en ligne. Comment puis-je faire ceci?


If a message is not matched to a language in the lookup table, the Lambda function forwards it to the catch-all email address that you provided during the AWS SAM deployment.

You can inspect the new email objects under the english/, french/, and catchall/ prefixes to observe the forwarding behavior.

Continue experimenting with the sample application by sending emails with different contents to the info@YOUR_DOMAIN_NAME_HERE address, or by adding other language codes and email address combinations to the mapping table. You can find the available languages and their codes in the documentation. When adding support for a new language, don’t forget to associate a new email address and Amazon S3 bucket prefix by defining a new rule.

Cleanup

To clean up the resources you used in your account,

  1. Navigate to the Amazon S3 console and delete the inbox bucket’s contents. You can find the name of this bucket in the outputs section of the AWS SAM deployment under the key name InboxBucket, or by visiting the AWS CloudFormation console.
  2. Navigate to AWS CloudFormation console and delete the stack named “blogstack”.
  3. After the stack is deleted, remove the domain from Amazon SES. To do this, navigate to the Amazon SES console and choose Domains from the left navigation. Select the domain you want to remove and choose the Remove button to remove it from Amazon SES.
  4. From the Amazon SES console, navigate to Rule Sets from the left navigation. In the Active Rule Set section, choose the View Active Rule Set button and delete all the rules you have created by selecting each rule and choosing Action, Delete.
  5. On the Rule Sets page, choose the Disable Active Rule Set button to stop listening for incoming email messages.
  6. On the Rule Sets page, in the Inactive Rule Sets section, delete the only rule set by selecting it and choosing Action, Delete.
  7. Navigate to the CloudWatch console and from the left navigation choose Logs, Log groups. Find the log group that belongs to the BlogEmailForwarderFunction resource and delete it by selecting it and choosing Actions, Delete log group(s).
  8. Finally, delete the Amazon S3 bucket you used for packaging and deploying the AWS SAM application.


Conclusion

This solution shows how to use Amazon SES to classify email messages by the dominant content language and forward them to the respective support groups. You can use the same techniques to implement similar scenarios. For example, you can forward emails based on custom key entities, like product codes, or you can remove PII information from emails before forwarding them, with the help of Amazon Comprehend.

With its native integrations with AWS services, Amazon SES allows you to enhance your email applications with different AWS Cloud capabilities easily.

To learn more about email forwarding with Amazon SES, see the documentation and AWS blogs.

AWS DeepRacer League’s 2021 Season Launches With New Open and Pro Divisions

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/aws-deepracer-leagues-2021-season-launches-with-new-open-and-pro-divisions/

As a developer, I have been hearing a lot of stories lately about how companies have solved their business problems using machine learning (ML), so one of my goals for 2021 is to learn more about it.

For the last few years, I have been using artificial intelligence (AI) services such as Amazon Rekognition, Amazon Comprehend, and others extensively. AI services provide a simple API to solve common ML problems such as image recognition, text to speech, and sentiment analysis of text. When using these high-level APIs, you don’t need to understand how the underlying ML model works, nor do you have to train it or maintain it in any way.

Even though those services are great and I can solve most of my business cases with them, I want to understand how ML algorithms work, and that is how I started tinkering with AWS DeepRacer.

AWS DeepRacer, a service that helps you learn reinforcement learning (RL), has been around since 2018. RL is an advanced ML technique that takes a very different approach to training models than other ML methods. Basically, it can learn very complex behavior without requiring any labeled training data, and it can make short-term decisions while optimizing for a long-term goal.

AWS DeepRacer is an autonomous 1/18th scale race car designed to test RL models by racing virtually in the AWS DeepRacer console or physically on a track at AWS and customer events. AWS DeepRacer is for developers of all skill levels, even if you don’t have any ML experience. When learning RL using AWS DeepRacer, you can take part in the AWS DeepRacer League where you get experience with machine learning in a fun and competitive environment.

Over the past year, the AWS DeepRacer League’s races have gone completely virtual and participants have competed for different kinds of prizes. However, the competition has become dominated by experts and newcomers haven’t had much of a chance to win.

The 2021 season introduces new skill-based Open and Pro racing divisions, where racers of all skill levels have five times more opportunities to win rewards than in previous seasons.

Image of the leagues in the console

How the New AWS DeepRacer Racing Divisions Work

The 2021 AWS DeepRacer league runs from March 1 through the end of October. When it kicks off, all participants will enter the Open division, a place to have fun and develop your RL knowledge with other community members.

At the end of every month, the top 10% of the Open division leaderboard will advance to the Pro division for the remainder of the season; they’ll also receive a Pro Welcome kit full of AWS DeepRacer swag. Pro division racers can win DeepRacer Evo cars and AWS DeepRacer merchandise such as hats and T-shirts.

At the end of every month, the top 16 racers in the Pro division will compete against each other in a live race in the console. That race will determine who will advance that month to the 2021 Championship Cup at re:Invent 2021.

The monthly Pro division winner gets an expenses-paid trip to re:Invent 2021 and participates in the Championship Cup to get a chance to win a Machine Learning education sponsorship worth $20k.

In both divisions, you can collect digital rewards, including vehicle customizations and accessories which will be released to participants once the winners are announced each month. 

You can start racing in the Open division any time during the 2021 season. Get started here!

New Racer Profiles Increase the Fun

At the end of March, you will be able to create a new racer profile with an avatar and show the world which country you are representing.

I hope to see you in the new AWS DeepRacer season, where I’ll start in the Open division as MaVi.

Start racing today and train your first model for free! 

Marcia

Improving the CPU and latency performance of Amazon applications using AWS CodeGuru Profiler

Post Syndicated from Neha Gupta original https://aws.amazon.com/blogs/devops/improving-the-cpu-and-latency-performance-of-amazon-applications-using-aws-codeguru-profiler/

Amazon CodeGuru Profiler is a developer tool powered by machine learning (ML) that helps identify an application’s most expensive lines of code and provides intelligent recommendations to optimize them. You can use it to identify application performance issues and troubleshoot latency and CPU utilization problems in your application.

You can use CodeGuru Profiler to optimize performance for any application running on AWS Lambda, Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), AWS Fargate, or AWS Elastic Beanstalk, and on premises.

This post gives a high-level overview of how CodeGuru Profiler has reduced CPU usage and latency by approximately 50% and saved around $100,000 a year for a particular Amazon retail service.

Technical and business value of CodeGuru Profiler

CodeGuru Profiler is simple to use: just turn it on and let it run in the background, then review its findings and implement the relevant changes.
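For a Java service, one way to turn the profiler on is to start the agent from your application code. The following is a minimal sketch, assuming the codeguru-profiler-java-agent library is on the classpath; the profiling group name is a placeholder, and the exact package and builder names should be checked against the current CodeGuru Profiler documentation.

import software.amazon.codeguruprofilerjavaagent.Profiler;

public class Application {

    public static void main(String[] args) {
        // Start the CodeGuru Profiler agent in the background; it samples the JVM
        // and submits profiles to the named profiling group.
        Profiler.builder()
                .profilingGroupName("MyProfilingGroup") // placeholder name
                .build()
                .start();

        // ... the rest of the application runs as usual ...
    }
}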

It’s fairly low cost, and unlike traditional profiling tools that take up a lot of CPU and RAM, CodeGuru Profiler adds less than 1% CPU overhead to applications and typically uses no more than 100 MB of memory.

You can run it in a pre-production environment to verify that changes have no impact on your application’s key metrics.

It automatically detects performance anomalies in application stack traces that start consuming more CPU or showing increased latency. It also provides visualizations and recommendations on how to fix performance issues, along with the estimated cost of running inefficient code. Detecting anomalies early prevents issues from escalating in production, and helps you prioritize remediation by giving you enough time to fix the issue before it impacts your service’s availability and your customers’ experience.

How we used CodeGuru Profiler at Amazon

Amazon has onboarded many of its applications to CodeGuru Profiler, which has resulted in annual savings of millions of dollars and in latency improvements. In this post, we discuss how we used CodeGuru Profiler on an Amazon Prime service, where a simple code change resulted in savings of around $100,000 for the year.

Opportunity to improve

After a change to one of our data sources caused its payload size to increase, we expected a slight increase in our service latency, but what we saw was higher than expected. Because CodeGuru Profiler is easy to integrate, we were able to quickly make and deploy the changes needed to get it running in our production environment.

After loading the profile in Amazon CodeGuru Profiler, it was immediately apparent from the visualization that a very large portion of the service’s CPU time was being taken up by Jackson deserialization (37% across the two call sites). It was also interesting that most of the blocking calls in the program (shown in blue) were happening in the jackson.databind method _createAndCacheValueDeserializer.

Flame graphs represent the relative amount of time that the CPU spends at each point in the call graph. The wider a frame is, the more CPU usage it corresponds to.

The following flame graph is from before the performance improvements were implemented.

The Flame Graph before the deployment

Looking at the source for _createAndCacheValueDeserializer confirmed that there was a synchronized block. From within it, _createAndCache2 was called, which performed the actual addition to the cache. Adding to the cache was guarded by a boolean condition, with a comment indicating that caching would only be enabled for custom deserializers if @JsonCachable was set.

Solution

Checking the documentation for @JsonCachable confirmed that this annotation looked like the correct solution for this performance issue. After we deployed a quick change to add @JsonCachable to our four custom deserializers, we observed that no visible time was spent in _createAndCacheValueDeserializer.
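The post doesn’t show the deserializers themselves, so the following is only an illustrative sketch of the shape of the fix; the deserializer and its target type are hypothetical. With Jackson 1.x the opt-in is the @JsonCachable class annotation described above, while in jackson-databind 2.x the equivalent opt-in is overriding isCachable(), which is what the sketch shows.

import java.io.IOException;
import java.time.Instant;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.DeserializationContext;
import com.fasterxml.jackson.databind.JsonDeserializer;

// Hypothetical custom deserializer, standing in for the four custom deserializers in the service.
public class EpochMillisInstantDeserializer extends JsonDeserializer<Instant> {

    @Override
    public Instant deserialize(JsonParser parser, DeserializationContext context) throws IOException {
        return Instant.ofEpochMilli(parser.getValueAsLong());
    }

    // Opt in to caching so Jackson reuses this deserializer instead of rebuilding it
    // (and contending on the lock in _createAndCacheValueDeserializer) on every call.
    @Override
    public boolean isCachable() {
        return true;
    }
}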

Results

Adding a one-line annotation in four different places made the code run twice as fast. Previously, the code held a lock while it recreated the same deserializers for every call, which allowed only one of the four CPU cores to be used and caused latency and inefficiency. Reusing the deserializers avoided the repeated work and saved us a lot of resources.

After the CodeGuru Profiler recommendations were implemented, the share of CPU time spent in Jackson dropped from 37% to 5% across the two call paths, and there was no visible blocking. With the blocking removed, we could run a higher load on our hosts and reduce the fleet size, saving approximately $100,000 a year in Amazon EC2 costs.

The following flame graph shows performance after the deployment.

The Flame Graph after the deployment

Metrics

The following graph shows that CPU usage was reduced by almost 50%. The blue line shows the CPU usage the week before we implemented the CodeGuru Profiler recommendations, and the green line shows the reduced usage after deployment. We could later safely scale down the fleet to reduce costs, while still having better performance than before the change.

Average Fleet CPU Utilization

 

The following graph shows the server latency, which also dropped by almost 50%, from 100 milliseconds to 50 milliseconds, as depicted in the initial portion of the graph. The orange line depicts p99, green p99.9, and blue p50 (median latency).

Server Latency

 

Conclusion

With a few changed lines of code and a half-hour investigation, we removed the bottleneck, which lowered resource utilization and allowed us to decrease the fleet size. We have seen many similar cases; in one instance, a change of literally six characters of inefficient code reduced CPU usage from 99% to 5%.

Across Amazon, CodeGuru Profiler has been used internally by various teams and has resulted in millions of dollars of savings and performance improvements. You can use CodeGuru Profiler for quick insights into performance issues in your application. The more efficient your code and application are, the less costly they are to run. You can find potential savings for any application running in production and significantly reduce infrastructure costs using CodeGuru Profiler. Reducing fleet size, latency, and CPU usage is a major win.

About the Authors

Neha Gupta

Neha Gupta is a Solutions Architect at AWS and has 16 years of experience as a database architect/DBA. Apart from work, she’s outdoorsy and loves to dance.

Ian Clark

Ian is a Senior Software Engineer with the Last Mile organization at Amazon. In his spare time, he enjoys exploring the Vancouver area with his family.

Machine learning and depth estimation using Raspberry Pi

Post Syndicated from David Plowman original https://www.raspberrypi.org/blog/machine-learning-and-depth-estimation-using-raspberry-pi/

One of our engineers, David Plowman, describes machine learning and shares news of a Raspberry Pi depth estimation challenge run by ETH Zürich (Swiss Federal Institute of Technology).

Spoiler alert – it’s all happening virtually, so you can definitely make the trip and attend, or maybe even enter yourself.

What is Machine Learning?

Machine Learning (ML) and Artificial Intelligence (AI) are some of the top engineering-related buzzwords of the moment, and foremost among current ML paradigms is probably the Artificial Neural Network (ANN).

ANNs involve millions of tiny calculations, merged together in a giant, biologically inspired network – hence the name. These networks typically have millions of parameters that control each calculation, and they must be optimised for every different task at hand.

This process of optimising the parameters so that a given set of inputs correctly produces a known set of outputs is known as training, and is what gives rise to the sense that the network is “learning”.

A popular type of ANN used for processing images is the Convolutional Neural Network. Many small calculations are performed on groups of input pixels to produce each output pixel.

Machine Learning frameworks

A number of well known companies produce free ML frameworks that you can download and use on your own computer. The network training procedure runs best on machines with powerful CPUs and GPUs, but even using one of these pre-trained networks (known as inference) can be quite expensive.

One of the most popular frameworks is Google’s TensorFlow (TF), and since this is rather resource intensive, they also produce a cut-down version optimised for less powerful platforms. This is TensorFlow Lite (TFLite), which can be run effectively on Raspberry Pi.

Depth estimation

ANNs have proven very adept at a wide variety of image processing tasks, most notably object classification and detection, but also depth estimation. This is the process of taking one or more images and working out how far away every part of the scene is from the camera, producing a depth map.

Here’s an example:

Depth estimation example using a truck

The image on the right shows, by the brightness of each pixel, how far away the objects in the original (left-hand) image are from the camera (darker = nearer).

We distinguish between stereo depth estimation, which starts with a stereo pair of images (taken from marginally different viewpoints; here, parallax can be used to inform the algorithm), and monocular depth estimation, working from just a single image.

The applications of such techniques should be clear, ranging from robots that need to understand and navigate their environments, to the fake bokeh effects beloved of many modern smartphone cameras.

Depth Estimation Challenge

CVPR conference logo

We were very interested then to learn that, as part of the CVPR (Computer Vision and Pattern Recognition) 2021 conference, Andrey Ignatov and Radu Timofte of ETH Zürich were planning to run a Monocular Depth Estimation Challenge. They are specifically targeting the Raspberry Pi 4 platform running TFLite, and we are delighted to support this effort.

For more information, or indeed if any technically minded readers are interested in entering the challenge, please visit:

The conference and workshops are all taking place virtually in June, and we’ll be sure to update our blog with some of the results and models produced for Raspberry Pi 4 by the competing teams. We wish them all good luck!

The post Machine learning and depth estimation using Raspberry Pi appeared first on Raspberry Pi.

Improving AWS Java applications with Amazon CodeGuru Reviewer

Post Syndicated from Rajdeep Mukherjee original https://aws.amazon.com/blogs/devops/improving-aws-java-applications-with-amazon-codeguru-reviewer/

Amazon CodeGuru Reviewer is a machine learning (ML)-based AWS service that provides automated code review comments on your Java and Python applications. Powered by program analysis and ML, CodeGuru Reviewer detects hard-to-find bugs and inefficiencies in your code and leverages best practices learned from across millions of lines of open-source and Amazon code. You can start analyzing your code through pull requests and full repository analysis (for more information, see Automating code reviews and application profiling with Amazon CodeGuru).

The recommendations generated by CodeGuru Reviewer for Java fall into the following categories:

  • AWS best practices
  • Concurrency
  • Security
  • Resource leaks
  • Other specialized categories such as sensitive information leaks, input validation, and code clones
  • General best practices on data structures, control flow, exception handling, and more

We expect the recommendations to benefit beginners as well as expert Java programmers.

In this post, we showcase CodeGuru Reviewer recommendations related to using the AWS SDK for Java. For in-depth discussion of other specialized topics, see our posts on concurrency, security, and resource leaks. For Python applications, see Raising Python code quality using Amazon CodeGuru.

The AWS SDK for Java simplifies the use of AWS services by providing a set of features that are consistent and familiar for Java developers. The SDK has more than 250 AWS service clients, which are available on GitHub. Service clients include services like Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, Amazon Kinesis, Amazon Elastic Compute Cloud (Amazon EC2), AWS IoT, and Amazon SageMaker. These service clients provide more than 6,000 operations that you can use to access AWS services. With such rich and diverse services and APIs, developers may not always be aware of the nuances of AWS API usage. These nuances may not be important at the beginning, but they become critical as scale increases and the application evolves or diversifies. This is why CodeGuru Reviewer has an AWS best practices category of recommendations, which makes you aware of certain features of AWS APIs so your code can be more correct and performant.

The first part of this post focuses on the key features of the AWS SDK for Java as well as API patterns in AWS services. The second part of this post demonstrates using CodeGuru Reviewer to improve code quality for Java applications that use the AWS SDK for Java.

AWS SDK for Java

The AWS SDK for Java supports higher-level abstractions for simplified development and provides support for cross-cutting concerns such as credential management, retries, data marshaling, and serialization. In this section, we describe a few key features that are supported in the AWS SDK for Java. Additionally, we discuss key API patterns in AWS services, such as batching and pagination.

The AWS SDK for Java has the following features:

  • Waiters – Waiters are utility methods that make it easy to wait for a resource to transition into a desired state. Waiters make it easier to abstract the polling logic into a simple API call. The waiters interface provides a custom delay strategy to control the sleep time between retries, as well as a custom condition on whether polling of a resource should be retried. The AWS SDK for Java also offers an async variant of waiters.
  • Exceptions – The AWS SDK for Java uses runtime (or unchecked) exceptions instead of checked exceptions in order to give you fine-grained control over the errors you want to handle and to prevent scalability issues inherent with checked exceptions in large applications. Broadly, the AWS SDK for Java has two types of exceptions (see the sketch after this list):
    • AmazonClientException – Indicates that a problem occurred inside the Java client code, either while trying to send a request to AWS or while trying to parse a response from AWS. For example, the AWS SDK for Java throws an AmazonClientException if no network connection is available when you try to call an operation on one of the clients.
    • AmazonServiceException – Represents an error response from an AWS service. For example, if you try to terminate an EC2 instance that doesn’t exist, Amazon EC2 returns an error response, and all the details of that response are included in the AmazonServiceException that’s thrown. In some cases, a subclass of AmazonServiceException is thrown to allow you fine-grained control over handling error cases through catch blocks.
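The following is a minimal sketch of how these two exception types are typically distinguished in application code. The EC2 call is only an example, and the client construction and error handling are illustrative rather than prescriptive.

import com.amazonaws.AmazonClientException;
import com.amazonaws.AmazonServiceException;
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.DescribeInstancesResult;

public class Ec2ExceptionHandlingExample {

    public static void main(String[] args) {
        AmazonEC2 ec2Client = AmazonEC2ClientBuilder.defaultClient();
        try {
            DescribeInstancesResult result = ec2Client.describeInstances();
            System.out.println("Reservations returned: " + result.getReservations().size());
        } catch (AmazonServiceException e) {
            // The request reached EC2, but the service returned an error response.
            System.err.println("Service error: " + e.getErrorCode() + " - " + e.getErrorMessage());
        } catch (AmazonClientException e) {
            // The request never produced a valid response, for example because
            // no network connection was available.
            System.err.println("Client error: " + e.getMessage());
        }
    }
}

Because AmazonServiceException extends AmazonClientException, the more specific catch block comes first.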

The API has the following patterns:

  • Batching – A batch operation provides you with the ability to perform a single CRUD operation (create, read, update, delete) on multiple resources, for example sending several messages to an Amazon SQS queue in a single sendMessageBatch call.
  • Pagination – Many AWS operations return paginated results when the response object is too large to return in a single response. To enable you to perform pagination, the request and response objects for many service clients in the SDK provide a continuation token (typically named NextToken) to indicate additional results.

AWS best practices

Now that we have summarized the SDK-specific features and API patterns, let’s look at the CodeGuru Reviewer recommendations on AWS API use.

The CodeGuru Reviewer recommendations for the AWS SDK for Java range from detecting outdated or deprecated APIs to warning about API misuse, missing pagination, authentication and exception scenarios, and using efficient API alternatives. In this section, we discuss a few examples patterned after real code.

Handling pagination

Over 1,000 APIs from more than 150 AWS services have pagination operations. The pagination best practice rule in CodeGuru covers all the pagination operations. In particular, the pagination rule checks if the Java application correctly fetches all the results of the pagination operation.

The response of a pagination operation in AWS SDK for Java 1.0 contains a token that has to be used to retrieve the next page of results. In the following code snippet, you make a call to listTables(), a DynamoDB ListTables operation, which can only return up to 100 table names per page. This code might not produce complete results because the operation returns paginated results instead of all results.

public void getDynamoDbTable() {
        AmazonDynamoDBClient client = new AmazonDynamoDBClient();
        List<String> tables = client.listTables().getTableNames();
        System.out.println(tables);
}

CodeGuru Reviewer detects the missing pagination in the code snippet and makes the following recommendation to add another call to check for additional results.

Screenshot of recommendations for introducing pagination checks

You can accept the recommendation and add logic to get the next page of table names by checking whether a token (LastEvaluatedTableName in the ListTablesResult) is included in each response page. If such a token is present, it’s used in a subsequent request to fetch the next page of results. See the following code:

public void getDynamoDbTable() {
        AmazonDynamoDBClient client = new AmazonDynamoDBClient();
        ListTablesRequest listTablesRequest = new ListTablesRequest();
        boolean done = false;
        while (!done) {
            ListTablesResult listTablesResult = client.listTables(listTablesRequest);
            System.out.println(listTablesResult.getTableNames());
            String lastEvaluatedTableName = listTablesResult.getLastEvaluatedTableName();
            if (lastEvaluatedTableName == null) {
                done = true;
            } else {
                listTablesRequest.setExclusiveStartTableName(lastEvaluatedTableName);
            }
        }
}

Handling failures in batch operation calls

Batch operations are common with many AWS services that process bulk requests. Batch operations can succeed without throwing exceptions even if some items in the request fail. Therefore, a recommended practice is to explicitly check for any failures in the result of the batch APIs. Over 40 APIs from more than 20 AWS services have batch operations. The best practice rule in CodeGuru Reviewer covers all the batch operations. In the following code snippet, you make a call to sendMessageBatch, a batch operation from Amazon SQS, but it doesn’t handle any errors returned by that batch operation:

public void flush(final String sqsEndPoint,
                     final List<SendMessageBatchRequestEntry> batch) {
    final AmazonSQS sqsClient = AmazonSQSClientBuilder.standard().build();
    if (batch.isEmpty()) {
        return;
    }
    sqsClient.sendMessageBatch(sqsEndPoint, batch);
}

CodeGuru Reviewer detects this issue and makes the following recommendation to check the return value for failures.

Screenshot of recommendations for batch operations

You can accept this recommendation and add logging for the complete list of messages that failed to send, in addition to throwing an SQSUpdateException. See the following code:

public void flush(final String sqsEndPoint,
                     final List<SendMessageBatchRequestEntry> batch) {
    final AmazonSQS sqsClient = AmazonSQSClientBuilder.standard().build();
    if (batch.isEmpty()) {
        return;
    }
    SendMessageBatchResult result = sqsClient.sendMessageBatch(sqsEndPoint, batch);
    final List<BatchResultErrorEntry> failed = result.getFailed();
    if (!failed.isEmpty()) {
           final String failedMessage = failed.stream()
                         .map(batchResultErrorEntry ->
                            String.format("…", batchResultErrorEntry.getId(),
                            batchResultErrorEntry.getMessage()))
                         .collect(Collectors.joining(","));
           throw new SQSUpdateException("Error occurred while sending messages to SQS::" + failedMessage);
    }
}

Exception handling best practices

Amazon S3 is one of the most popular AWS services with our customers. A frequent operation with this service is to upload a stream-based object through an Amazon S3 client. Stream-based uploads might encounter occasional network connectivity or timeout issues, and the best practice to address such a scenario is to properly handle the corresponding ResetException error. ResetException extends SdkClientException, which subsequently extends AmazonClientException. Consider the following code snippet, which lacks such exception handling:

private void uploadInputStreamToS3(String bucketName, 
                                   InputStream input, 
                                   String key, ObjectMetadata metadata) 
                         throws SdkClientException {
    final AmazonS3 amazonS3Client = AmazonS3ClientBuilder.defaultClient();
    PutObjectRequest putObjectRequest =
          new PutObjectRequest(bucketName, key, input, metadata);
    amazonS3Client.putObject(putObjectRequest);
}

In this case, CodeGuru Reviewer correctly detects the missing handling of the ResetException error and suggests possible solutions.

Screenshot of recommendations for handling exceptions

This recommendation is rich in that it provides alternatives to suit different use cases. The most common handling uses File or FileInputStream objects, but in other cases explicit handling of mark and reset operations is necessary to reliably avoid a ResetException.

You can fix the code by explicitly setting a predefined read limit using the setReadLimit method of RequestClientOptions. Its default value is 128 KB. Setting the read limit to one byte greater than the size of the stream reliably avoids a ResetException.

For example, if the maximum expected size of a stream is 100,000 bytes, set the read limit to 100,001 (100,000 + 1) bytes. The mark and reset always work for 100,000 bytes or less. However, this might cause some streams to buffer that number of bytes into memory.

The fix reliably avoids ResetException when uploading an object of type InputStream to Amazon S3:

private void uploadInputStreamToS3(String bucketName, InputStream input, 
                                   String key, ObjectMetadata metadata) 
                             throws SdkClientException {
        final AmazonS3 amazonS3Client = AmazonS3ClientBuilder.defaultClient();
        // Read limit = maximum expected stream size (100,000 bytes) plus 1 byte
        final int READ_LIMIT = 100001;
        PutObjectRequest putObjectRequest =
                new PutObjectRequest(bucketName, key, input, metadata);
        putObjectRequest.getRequestClientOptions().setReadLimit(READ_LIMIT);
        amazonS3Client.putObject(putObjectRequest);
}

Replacing custom polling with waiters

A common activity when you’re working with services that are eventually consistent (such as DynamoDB) or that have a lead time for creating resources (such as Amazon EC2) is to wait for a resource to transition into a desired state. The AWS SDK provides the Waiters API, a convenient and efficient feature that abstracts the polling logic into a simple API call. If you’re not aware of this feature, you might come up with custom, and potentially inefficient, polling logic to determine whether a particular resource has transitioned into a desired state.

The following code appears to be waiting for the status of EC2 instances to change to shutting-down or terminated inside a while (true) loop:

private boolean terminateInstance(final String instanceId, final AmazonEC2 ec2Client)
    throws InterruptedException {
    long start = System.currentTimeMillis();
    while (true) {
        try {
            DescribeInstanceStatusResult describeInstanceStatusResult = 
                            ec2Client.describeInstanceStatus(new DescribeInstanceStatusRequest()
                            .withInstanceIds(instanceId).withIncludeAllInstances(true));
            List<InstanceStatus> instanceStatusList = 
                       describeInstanceStatusResult.getInstanceStatuses();
            long finish = System.currentTimeMillis();
            long timeElapsed = finish - start;
            if (timeElapsed > INSTANCE_TERMINATION_TIMEOUT) {
                break;
            }
            if (instanceStatusList.size() < 1) {
                Thread.sleep(WAIT_FOR_TRANSITION_INTERVAL);
                continue;
            }
            String currentState = instanceStatusList.get(0).getInstanceState().getName();
            if ("shutting-down".equals(currentState) || "terminated".equals(currentState)) {
                return true;
             } else {
                 Thread.sleep(WAIT_FOR_TRANSITION_INTERVAL);
             }
        } catch (AmazonServiceException ex) {
            throw ex;
        }
        …
 }

CodeGuru Reviewer detects the polling scenario and recommends that you use the waiters feature to improve the efficiency of such programs.

Screenshot of recommendations for introducing waiters feature

Based on the recommendation, the following code uses the waiters feature available in the AWS SDK for Java. The custom polling logic is replaced with a waiter obtained from ec2Client.waiters(), which is then run with a call to waiter.run(…) that accepts the request and an optional custom polling strategy. The run function polls synchronously until the resource transitions into the desired state or the retries are exhausted. The SDK throws a WaiterTimedOutException if the resource doesn’t transition into the desired state even after a certain number of retries. The fixed code is simpler and more efficient, abstracting the polling logic into a single API call:

public void terminateInstance(final String instanceId, final AmazonEC2 ec2Client)
    throws InterruptedException {
    Waiter<DescribeInstancesRequest> waiter = ec2Client.waiters().instanceTerminated();
    ec2Client.terminateInstances(new TerminateInstancesRequest().withInstanceIds(instanceId));
    try {
        waiter.run(new WaiterParameters()
              .withRequest(new DescribeInstancesRequest()
              .withInstanceIds(instanceId))
              .withPollingStrategy(new PollingStrategy(new MaxAttemptsRetryStrategy(60), 
                    new FixedDelayStrategy(5))));
    } catch (WaiterTimedOutException e) {
        List<InstanceStatus> instanceStatusList = ec2Client.describeInstanceStatus(
               new DescribeInstanceStatusRequest()
                        .withInstanceIds(instanceId)
                        .withIncludeAllInstances(true))
                        .getInstanceStatuses();
        String state;
        if (instanceStatusList != null && instanceStatusList.size() > 0) {
            state = instanceStatusList.get(0).getInstanceState().getName();
        }
    }
}

Service-specific best practice recommendations

In addition to the SDK operation-specific recommendations for the AWS SDK for Java discussed above, there are various AWS service-specific best practice recommendations for services such as Amazon S3, Amazon EC2, DynamoDB, and more, where CodeGuru Reviewer can help improve Java applications that use AWS service clients. For example, CodeGuru can detect the following:

  • Resource leaks in Java applications that use high-level libraries, such as the Amazon S3 TransferManager
  • Deprecated methods in various AWS services
  • Missing null checks on the response of the GetItem API call in DynamoDB
  • Missing error handling in the output of the PutRecords API call in Kinesis
  • Anti-patterns such as binding the SNS subscribe or createTopic operation with the Publish operation

Conclusion

This post introduced how to use CodeGuru Reviewer to improve the use of the AWS SDK in Java applications. CodeGuru is now available for you to try. For pricing information, see Amazon CodeGuru pricing.

Amazon Lex Introduces an Enhanced Console Experience and New V2 APIs

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/amazon-lex-enhanced-console-experience/

Today, the Amazon Lex team has released a new console experience that makes it easier to build, deploy, and manage conversational experiences. Along with the new console, we have also introduced new V2 APIs, including continuous streaming capability. These improvements allow you to reach new audiences, have more natural conversations, and develop and iterate faster.

The new Lex console and V2 APIs make it easier to build and manage bots, focusing on three main benefits. First, you can add a new language to a bot at any time and manage all the languages through the lifecycle of design, test, and deployment as a single resource. The new console experience allows you to quickly move between different languages to compare and refine your conversations. I’ll demonstrate later how easy it was to add French to my English bot.

Second, V2 APIs simplify versioning. The new Lex console and V2 APIs provide a simple information architecture where the bot intents and slot types are scoped to a specific language. Versioning is performed at the bot level so that resources such as intents and slot types do not have to be versioned individually. All resources within the bot (language, intents, and slot types) are archived as part of the bot version creation. This new way of working makes it easier to manage bots.

Lastly, you have additional builder productivity tools and capabilities that give you more flexibility and control over your bot design process. You can now save partially completed work as you develop different bot elements and as you script, test, and tune your configuration. This gives you more flexibility as you iterate through bot development. For example, you can save a slot that refers to a deleted slot type. In addition to saving partially completed work, you can quickly navigate across the configuration without getting lost. The new Conversation flow capability allows you to maintain your orientation as you move across the different intents and slot types.

In addition to the enhanced console and APIs, we are providing a new streaming conversation API. Natural conversations are punctuated with pauses and interruptions. For example, a customer may ask to pause the conversation or hold the line while looking up information, such as retrieving credit card details before making a bill payment. With streaming conversation APIs, you can handle such pauses and interruptions directly as you configure the bot. Overall, the design and implementation of the conversation is simplified and easier to manage. The bot builder can quickly enhance the conversational capability of virtual contact center agents or smart assistants.
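To make the V2 runtime APIs concrete, here is a minimal sketch of sending a text utterance to a V2 bot with the renamed text API, RecognizeText (see Things to know below), using the AWS SDK for Java 2.x. The bot ID, alias ID, session ID, and utterance are placeholders, and the builder method names follow the parameters of the V2 runtime API.

import software.amazon.awssdk.services.lexruntimev2.LexRuntimeV2Client;
import software.amazon.awssdk.services.lexruntimev2.model.RecognizeTextRequest;
import software.amazon.awssdk.services.lexruntimev2.model.RecognizeTextResponse;

public class TalkReviewClient {

    public static void main(String[] args) {
        LexRuntimeV2Client lexClient = LexRuntimeV2Client.create();

        // Bot ID, alias ID, and session ID are placeholders for your own bot's values.
        RecognizeTextRequest request = RecognizeTextRequest.builder()
                .botId("BOT_ID")
                .botAliasId("BOT_ALIAS_ID")
                .localeId("en_GB")
                .sessionId("user-session-1")
                .text("I would like to book a talk review")
                .build();

        RecognizeTextResponse response = lexClient.recognizeText(request);
        response.messages().forEach(message -> System.out.println(message.content()));
    }
}

The new streaming capability is a separate conversation API; this sketch covers only the simple request and response text path.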

Let’s create a new bot and explore how some of Lex’s new console and streaming API features provide an improved bot building experience.

Building a bot
I head over to the new V2 Lex console and click on Create bot to start things off.

I select that I want to Start with an example and select the MakeAppointment example.

Over the years, I have spoken at many conferences, so I now offer to review talks that other community members are producing. Since these speakers are often in different time zones, it can be complicated to organize the various appointments for the different types of reviews that I offer. So I have decided to build a bot to streamline the process. I give my bot the name TalkReview and provide a description. I also select Create a role with basic Amazon Lex permissions and use this as my runtime role.

I must add at least one language to my bot, so I start with English (GB). I also select the text-to-speech voice that I want to use should my bot require voice interaction rather than just text.

During the creation, there is a new button that allows me to Add another language. I click on this to add French (FR) to my bot. You can add languages during creation as I am doing here, or you can add additional languages later on as your bot becomes more popular and needs to work with new audiences.

I can now start defining intents for my bot, and I can begin the iterative process of building and testing my bot. I won’t go into all of the details of how to create a bot or show you all of the intents I added, as we have better tutorials that can show you that step-by-step, but I will point out a few new features that make this new enhanced console really compelling.

The new Conversation flow provides you with a visual flow of the conversation, so you can see the sample utterances you provide and how your conversation might work in the real world. I love this feature because you can click on the various elements, and it will take you to where you can make changes. For example, I can click on the prompt What type of review would you like to schedule, and I am taken to the place where I can edit this prompt.

The new console has a very well thought-out approach to versioning a bot. At any time, on the Bot versions screen, I can click Create version, and it will take a snapshot of the bot’s current configuration. I can then associate that version with an alias. For example, in my application, I have an alias called Production. This Production alias is associated with Version 1, but at any time I could switch it to use a different version or even roll back to a previous version if I discover problems.

The testing experience is now very streamlined. Once I have built the bot, I can click the test button at the bottom right of the screen, start speaking to the bot, and test the experience. You can also expand the Inspect window, which gives you details about the conversation’s state, and you can explore the raw JSON inputs and outputs.

Things to know
Here are a couple of important things to keep in mind when you use the enhanced console:

  • Integration with Amazon Connect – Currently, bots built in the new console cannot be integrated with Amazon Connect contact flows. We plan to provide this integration as part of the near-term roadmap. You can use the current console and existing APIs to create and integrate bots with Amazon Connect.
  • Pricing – You only pay for what you use. The charges remain the same for existing audio and text APIs, renamed as RecognizeUtterance and RecognizeText. For the new Streaming capabilities, please refer to the pricing detail here.
  • All existing APIs and bots will continue to be supported. The newly announced features are only available in the new console and V2 APIs.

Go Build
The enhanced Lex console is available now, and you can start using it today. The enhanced experience and V2 APIs are available in all existing Regions and support all current languages. So, please give this console a try and let us know what you think. To learn more, check out the documentation for the console and the streaming API.

Happy Building!
— Martin

Raspberry Pi LEGO sorter

Post Syndicated from Ashley Whittaker original https://www.raspberrypi.org/blog/raspberry-pi-lego-sorter/

Raspberry Pi is at the heart of this AI-powered, automated sorting machine that is capable of recognising and sorting any LEGO brick.

And its maker Daniel West believes it to be the first of its kind in the world!

Best ever

This mega-machine was two years in the making and is a LEGO creation itself, built from over 10,000 LEGO bricks.

A beast of 10,000 bricks

It can sort any LEGO brick you place in its input bucket into one of 18 output buckets, at the rate of one brick every two seconds.

While Daniel was inspired by previous LEGO sorters, his creation is a huge step up from them: it can recognise absolutely every LEGO brick ever created, even bricks it has never seen before. Hence the ‘universal’ in the name ‘universal LEGO sorting machine’.

Hardware

There we are, tucked away, just doing our job

Software

The artificial intelligence algorithm behind the LEGO sorting is a convolutional neural network, the go-to for image classification.

What makes Daniel’s project a ‘world first’ is that he trained his classifier using 3D model images of LEGO bricks, which is how the machine can classify absolutely any LEGO brick it’s faced with, even if it has never seen it in real life before.

We LOVE a thorough project video, and we love TWO of them even more

Daniel has made a whole extra video (above) explaining how the AI in this project works. He shouts out all the open-source software he used to run the Raspberry Pi Camera Module, access 3D training images, and more at this point in the video.

LEGO brick separation

The vibration plate in action, feeding single parts into the scanner

Daniel needed the input bucket to carefully pick out a single LEGO brick from the mass he chucks in at once.

This is achieved with a primary and secondary belt slowly pushing parts onto a vibration plate. The vibration plate uses a super fast LEGO motor to shake the bricks around so they aren’t sitting on top of each other when they reach the scanner.

Scanning and sorting

A side view of the LEGO sorting machine showing a large white chute built from LEGO bricks
The underside of the beast

A Raspberry Pi Camera Module captures video of each brick, which a Raspberry Pi 3 Model B+ then processes and wirelessly sends to a more powerful computer that runs the neural network classifying the parts.

The classification decision is then sent back to the sorting machine so it can spit the brick, using a series of servo-controlled gates, into the right output bucket.

Extra-credit homework

A front view of the LEGO sorter with the sorting boxes visible underneath
In all its bricky beauty, with the 18 output buckets visible at the bottom

Daniel is such a boss maker that he wrote not one, but two further reading articles for those of you who want to deep-dive into this mega LEGO creation:

The post Raspberry Pi LEGO sorter appeared first on Raspberry Pi.