[$] The bogus CVE problem

Post Syndicated from jake original https://lwn.net/Articles/944209/

The “Common Vulnerabilities and Exposures” (CVE) system was launched late in the previous century (September 1999) to track vulnerabilities in software. Over the years since, it has had a somewhat checkered reputation, along with some attempts to replace it, but CVE numbers are still the only effective way to track vulnerabilities. While that can certainly be useful, the CVE-assignment (and severity-scoring) process is not without its problems. The prominence of CVE numbers, and the consequent increase in “reputation” for a reporter, have combined to create a system that can be—and is—actively gamed. Meanwhile, the organizations that oversee the system are ultimately not doing a particularly stellar job.

GitHub Availability Report: August 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-09-13-github-availability-report-august-2023/

In August, we experienced two incidents that resulted in degraded performance across GitHub services.

August 15 16:58 UTC (lasting 4 hours 29 minutes)

On August 15 at 16:58 UTC, GitHub started experiencing increasing delays in an internal job queue used to process webhooks. We statused GitHub Webhooks to yellow at 17:24 UTC. During this incident, customers experienced webhook delivery delays of up to 4.5 hours.

We determined that the delays were caused by a significant and sustained spike in webhook deliveries. This caused a backup of our webhooks deliveries queue. We mitigated the issue by blocking events from sources of the increased load, which allowed the system to gradually recover as we processed the backlog of events. In response to this and other recent webhooks incidents, we made improvements that allow us to handle a higher amount of traffic and absorb load spikes without increasing delivery latency. We also improved our ability to manage load sources to prevent and more quickly mitigate any impact to our service.

August 29 02:36 UTC (lasting 49 minutes)

On August 29 at 02:36 UTC, GitHub systems experienced widespread delays in background job processing. This prevented webhook deliveries, GitHub Actions, and other asynchronously triggered workloads throughout the system from running immediately as normal. While workloads were delayed by up to an hour, no data was lost, and systems ultimately recovered and resumed timely operation.

The component of our job queueing service responsible for dispatching jobs to workers failed due to an interaction between unexpected CPU throttling and short session timeouts for a Kafka consumer group. The Kafka consumer ended up stuck in a loop, unable to stabilize before timing out and restarting the coordination process. While the service continued to accept and record incoming work, it was unable to pass jobs on to workers until we mitigated the issue by shifting the load to the standby service and redeploying the primary service. We have extended our monitoring to allow quicker diagnosis of this failure mode, and are pursuing additional changes to prevent reoccurrence.
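The interplay between heartbeats and session timeouts is what makes this failure mode possible: if a consumer is CPU-throttled and misses its heartbeat window, the broker evicts it and forces a group rebalance. As a hedged illustration (GitHub has not published its client or settings), here is roughly where those timeouts live in a Kafka consumer configuration, using the confluent-kafka Python client; the broker, group, and topic names are placeholders.

# Minimal sketch of the consumer-group timeouts implicated in this incident.
# This is an assumption for illustration, not GitHub's actual configuration.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker-1:9092",   # hypothetical broker address
    "group.id": "job-dispatcher",           # hypothetical consumer group
    # A short session.timeout.ms makes the broker evict a consumer that is
    # slow to heartbeat (for example, while CPU-throttled), triggering a
    # rebalance; too short a value can cause the rebalance loop described above.
    "session.timeout.ms": 45000,
    # Heartbeats should be well below the session timeout (commonly about 1/3).
    "heartbeat.interval.ms": 15000,
    # Time allowed between poll() calls before the consumer is considered dead.
    "max.poll.interval.ms": 300000,
})
consumer.subscribe(["job-queue-events"])    # hypothetical topic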


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: August 2023 appeared first on The GitHub Blog.

Let’s Architect! Leveraging in-memory databases

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-leveraging-in-memory-databases/

In-memory databases play a critical role in modern computing, particularly in reducing the strain on existing resources, scaling workloads efficiently, and minimizing the cost of infrastructure. The advanced performance capabilities of in-memory databases make them vital for demanding applications characterized by voluminous data, real-time analytics, and rapid response requirements.

In this edition of Let’s Architect!, we introduce caching strategies and examine case studies that use Amazon Web Services (AWS), like Amazon ElastiCache or Amazon MemoryDB for Redis, in real workloads where customers share the reasoning behind their approaches. It is very important to understand the context for leveraging a specific solution or pattern, and many common questions can be answered with these resources.

Caching challenges and strategies

Many services built at Amazon rely on caching systems in the background to speed up performance, meet low-latency requirements, and avoid overloading source databases and other microservices. Operating caches and adding them to our systems may present complex challenges in terms of monitoring, data consistency, and load on the other components of the system. Indeed, a cache can provide big benefits, but it’s also a new component to run and keep healthy. Furthermore, engineers may need to use empirical methods to choose the cache size, expiration policy, and eviction policy: we always have to perform tests and use metrics to tune the setup.
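To make those tuning knobs concrete, here is a minimal cache-aside sketch in Python with a bounded size (LRU eviction) and a TTL expiration policy. It is an illustration of the pattern, not the implementation described in the Builders' Library article; the loader callback stands in for a read from the source database.

# Minimal cache-aside sketch: bounded size (LRU eviction) plus per-item TTL.
import time
from collections import OrderedDict

class TTLCache:
    def __init__(self, max_items=1024, ttl_seconds=60):
        self.max_items = max_items          # eviction policy: least recently used
        self.ttl = ttl_seconds              # expiration policy
        self._data = OrderedDict()          # key -> (value, expires_at)

    def get(self, key, loader):
        entry = self._data.get(key)
        if entry and entry[1] > time.time():
            self._data.move_to_end(key)     # mark as recently used
            return entry[0]                 # cache hit
        value = loader(key)                 # cache miss: read the source of truth
        self._data[key] = (value, time.time() + self.ttl)
        self._data.move_to_end(key)
        if len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict the least recently used item
        return value

# Usage: cache.get("user:42", lambda k: db_lookup(k))  # db_lookup is hypothetical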

With this Amazon Builder’s Library resource, you can learn strategies for using caching in your architecture and best practices directly from Amazon’s engineers.

Take me to this Amazon Builder’s Library article!

Strategies applied in Amazon applications at scale, explained and contextualized by Amazon engineers

How Yahoo cost optimizes their in-memory workloads with AWS

Discover how Yahoo effectively leverages the power of Amazon ElastiCache and data tiering to process an astounding 1.3 million advertising data events per second, all while generating savings of up to 50% on their overall bill.

Data tiering is an ingenious method to scale up to hundreds of terabytes of capacity by intelligently managing data. It achieves this by automatically shifting the least-recently accessed data between RAM and high-performance SSDs.

In this video, you will gain insights into how data tiering operates and how you can unlock ultra-fast speeds and seamless scalability for your workloads in a cost-efficient manner. Furthermore, you can also learn how it’s implemented under the hood.
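Data tiering is enabled when the cluster is created and requires the r6gd node family. As a hedged sketch (the names and sizes below are placeholders, not Yahoo's configuration), creating a tiered Redis replication group with boto3 might look like this:

# Hedged sketch: create an ElastiCache for Redis replication group with data
# tiering enabled. All identifiers are hypothetical placeholders.
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

elasticache.create_replication_group(
    ReplicationGroupId="ad-events-cache",              # hypothetical name
    ReplicationGroupDescription="Ad events cache with data tiering",
    Engine="redis",
    CacheNodeType="cache.r6gd.xlarge",                 # r6gd family is required for tiering
    NumCacheClusters=2,                                # one primary plus one replica
    DataTieringEnabled=True,                           # cold keys move from RAM to local SSD
    AutomaticFailoverEnabled=True,
)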

Take me to this re:Invent 2022 video!

A snapshot of how Yahoo architecture leverages Amazon ElastiCache

Use MemoryDB to build real-time applications for performance and durability

MemoryDB is a robust, durable database marked by microsecond reads, low single-digit millisecond writes, scalability, and fortified enterprise security. It guarantees an impressive 99.99% availability, coupled with instantaneous recovery without any data loss.

In this session, we explore multiple use cases across sectors, such as Financial Services, Retail, and Media & Entertainment, like payment processing, message brokering, and durable session store applications. Moreover, through a practical demonstration, you can learn how to utilize MemoryDB to establish a microservices message broker for a Media & Entertainment application.
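Because MemoryDB speaks the Redis API, the message-broker pattern from the demo can be sketched with redis-py and Redis Streams. This is a hedged, single-connection illustration: the endpoint, stream, and group names are placeholders, and a sharded MemoryDB cluster would typically need a cluster-aware client.

# Sketch: MemoryDB (Redis-compatible, TLS in transit) as a durable message
# broker using Redis Streams. Endpoint and names are hypothetical.
import redis

r = redis.Redis(
    host="clustercfg.my-memorydb.xxxxxx.memorydb.us-east-1.amazonaws.com",  # placeholder
    port=6379,
    ssl=True,  # MemoryDB requires encryption in transit
)

# Producer: append an order event to a stream (persisted via the multi-AZ log).
r.xadd("orders", {"order_id": "1234", "status": "created"})

# Consumer group: read new events and acknowledge them once processed.
try:
    r.xgroup_create("orders", "fulfillment", id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists
for _stream, messages in r.xreadgroup("fulfillment", "worker-1", {"orders": ">"}, count=10):
    for msg_id, fields in messages:
        # ... process the event here ...
        r.xack("orders", "fulfillment", msg_id)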

Take me to this AWS Online Tech Talks video!

A sample use case for retail application

Samsung SmartThings powers home automation with Amazon MemoryDB

MemoryDB offers the kind of ultra-fast performance that only an in-memory database can deliver, curtailing latency to microseconds and processing 160+ million requests per second—without data loss. In this re:Invent 2022 session, you will understand why Samsung SmartThings selected MemoryDB as the engine to power the next generation of their IoT device connectivity platform, one that processes millions of events every day.

You can also discover the intricate design of MemoryDB and how it ensures data durability without compromising the performance of in-memory operations, thanks to the utilization of a multi-AZ transactional log. This session is an enlightening deep-dive into durable, in-memory data operations.

Take me to this re:Invent 2022 video!

The architecture leveraged by Samsung SmartThings using Amazon MemoryDB for Redis

Amazon ElastiCache: In-memory datastore fundamentals, use cases and examples

In this edition of AWS Online Tech Talks, explore Amazon ElastiCache, a managed service that facilitates the seamless setup, operation, and scaling of widely used, open-source–compatible, in-memory datastores in the cloud environment. This service positions you to develop data-intensive applications or enhance the performance of your existing databases through high-throughput, low-latency, in-memory datastores. Learn how it is leveraged for caching, session stores, gaming, geospatial services, real-time analytics, and queuing functionalities.

This course can help cultivate a deeper understanding of Amazon ElastiCache, and how it can be used to accelerate your data processing while maintaining robustness and reliability.

Take me to this AWS Online Tech Talks course!

A free training course to increase your skills and leverage better in-memory databases

See you next time!

Thanks for joining us to discuss in-memory databases! In 2 weeks, we’ll talk about SQL databases.

To find all the blogs from this series, visit the Let’s Architect! list of content on the AWS Architecture Blog.

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

Post Syndicated from Ravi Itha original https://aws.amazon.com/blogs/big-data/simplify-operational-data-processing-in-data-lakes-using-aws-glue-and-apache-hudi/

The Analytics specialty practice of AWS Professional Services (AWS ProServe) helps customers across the globe with modern data architecture implementations on the AWS Cloud. A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. It focuses on defining standards and patterns to integrate data producers and consumers and move data between data lakes and purpose-built data stores securely and efficiently. Out of the many data producer systems that feed data to a data lake, operational databases are most prevalent, where operational data is stored, transformed, analyzed, and finally used to enhance business operations of an organization. With the emergence of open storage formats such as Apache Hudi and its native support from AWS Glue for Apache Spark, many AWS customers have started adding transactional and incremental data processing capabilities to their data lakes.

AWS has invested in native service integration with Apache Hudi and published technical content to enable you to use Apache Hudi with AWS Glue (for example, refer to Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 1: Getting Started). In AWS ProServe-led customer engagements, the use cases we work on usually come with technical complexity and scalability requirements. In this post, we discuss a common use case in relation to operational data processing and the solution we built using Apache Hudi and AWS Glue.

Use case overview

AnyCompany Travel and Hospitality wanted to build a data processing framework to seamlessly ingest and process data coming from operational databases (used by reservation and booking systems) in a data lake before applying machine learning (ML) techniques to provide a personalized experience to its users. Due to the sheer volume of direct and indirect sales channels the company has, its booking and promotions data are organized in hundreds of operational databases with thousands of tables. Of those tables, some are larger (in terms of record volume) than others, and some are updated more frequently than others. In the data lake, the data is organized into the following storage zones:

  1. Source-aligned datasets – These have an identical structure to their counterparts at the source
  2. Aggregated datasets – These datasets are created based on one or more source-aligned datasets
  3. Consumer-aligned datasets – These are derived from a combination of source-aligned, aggregated, and reference datasets enriched with relevant business and transformation logic, usually fed as inputs to ML pipelines or other consumer applications

The following are the data ingestion and processing requirements:

  1. Replicate data from operational databases to the data lake, including insert, update, and delete operations.
  2. Keep the source-aligned datasets up to date (typically within the range of 10 minutes to a day) in relation to their counterparts in the operational databases, ensuring analytics pipelines refresh consumer-aligned datasets for downstream ML pipelines in a timely fashion. Moreover, the framework should consume compute resources as efficiently as possible relative to the size of the operational tables.
  3. To minimize DevOps and operational overhead, the company wanted to templatize the source code wherever possible. For example, to create source-aligned datasets in the data lake for 3,000 operational tables, the company didn’t want to deploy 3,000 separate data processing jobs. The smaller the number of jobs and scripts, the better.
  4. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.

As you can guess, the Apache Hudi framework can address the first requirement, so we put our emphasis on the other requirements. We begin with a data lake reference architecture, followed by an overview of the operational data processing framework. We then delve into the framework components and walk through their design and implementation, using our open-source solution on GitHub. Finally, we test the framework and summarize how it meets the aforementioned requirements.

Data lake reference architecture

Let’s begin with a big picture: a data lake solves a variety of analytics and ML use cases dealing with internal and external data producers and consumers. The following diagram represents a generic data lake architecture. To ingest data from operational databases to an Amazon Simple Storage Service (Amazon S3) staging bucket of the data lake, either AWS Database Migration Service (AWS DMS) or any AWS partner solution from AWS Marketplace that has support for change data capture (CDC) can fulfill the requirement. AWS Glue is used to create source-aligned and consumer-aligned datasets, and separate AWS Glue jobs handle the feature engineering part of ML engineering and operations. Amazon Athena is used for interactive querying and AWS Lake Formation is used for access controls.

Data Lake Reference Architecture

Operational data processing framework

The operational data processing (ODP) framework contains three components: File Manager, File Processor, and Configuration Manager. Each component runs independently to solve a portion of the operational data processing use case. We have open-sourced this framework on GitHub—you can clone the code repo and inspect it while we walk you through the design and implementation of the framework components. The source code is organized in three folders, one for each component, and if you customize and adopt this framework for your use case, we recommend promoting these folders as separate code repositories in your version control system. Consider using the following repository names:

  1. aws-glue-hudi-odp-framework-file-manager
  2. aws-glue-hudi-odp-framework-file-processor
  3. aws-glue-hudi-odp-framework-config-manager

With this modular approach, you can independently deploy the components to your data lake environment by following your preferred CI/CD processes. As illustrated in the preceding diagram, these components are deployed in conjunction with a CDC solution.

Component 1: File Manager

File Manager detects files emitted by a CDC process such as AWS DMS and tracks them in an Amazon DynamoDB table. As shown in the following diagram, it consists of an Amazon EventBridge event rule, an Amazon Simple Queue Service (Amazon SQS) queue, an AWS Lambda function, and a DynamoDB table. The EventBridge rule uses Amazon S3 Event Notifications to detect the arrival of CDC files in the S3 bucket. The event rule forwards the object event notifications to the SQS queue as messages. The File Manager Lambda function consumes those messages, parses the metadata, and inserts the metadata into the DynamoDB table odpf_file_tracker. These records will then be processed by File Processor, which we discuss in the next section.
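As a simplified sketch (the real implementation and its exact DynamoDB schema are in the GitHub repo), a File Manager-style Lambda handler might look like the following; it assumes the EventBridge "Object Created" event shape is delivered in the SQS message body, and the attribute names are illustrative only.

# Simplified sketch of a File Manager-style Lambda handler: consume S3 object
# notifications delivered through SQS and record one item per CDC file.
import json
from datetime import datetime, timezone

import boto3

tracker = boto3.resource("dynamodb").Table("odpf_file_tracker")

def handler(event, context):
    # Each SQS record wraps one EventBridge "Object Created" notification (assumed shape).
    for sqs_record in event["Records"]:
        detail = json.loads(sqs_record["body"]).get("detail", {})
        bucket = detail.get("bucket", {}).get("name")
        key = detail.get("object", {}).get("key")
        if not bucket or not key:
            continue  # not an object-created event; skip
        tracker.put_item(Item={
            "file_id": f"s3://{bucket}/{key}",          # illustrative key schema
            "bucket_name": bucket,
            "object_key": key,
            "file_ingestion_status": "raw_file_landed",
            "created_at": datetime.now(timezone.utc).isoformat(),
        })
    return {"tracked": len(event["Records"])}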

ODPF Component: File Manager

Component 2: File Processor

File Processor is the workhorse of the ODP framework. It processes files from the S3 staging bucket, creates source-aligned datasets in the raw S3 bucket, and adds or updates metadata for the datasets (AWS Glue tables) in the AWS Glue Data Catalog.

We use the following terminology when discussing File Processor:

  1. Refresh cadence – This represents the data ingestion frequency (for example, 10 minutes). It is usually paired with an AWS Glue worker type (one of G.1X, G.2X, G.4X, G.8X, G.025X, and so on) and a batch size.
  2. Table configuration – This includes the Hudi configuration (primary key, partition key, pre-combined key, and table type (Copy on Write or Merge on Read)), table data storage mode (historical or current snapshot), S3 bucket used to store source-aligned datasets, AWS Glue database name, AWS Glue table name, and refresh cadence.
  3. Batch size – This numeric value is used to split tables into smaller batches and process their respective CDC files in parallel. For example, a configuration of 50 tables with a 10-minute refresh cadence and a batch size of 5 results in a total of 10 AWS Glue job runs, each processing CDC files for 5 tables.
  4. Table data storage mode – There are two options:
    • Historical – This table in the data lake stores historical updates to records (always append).
    • Current snapshot – This table in the data lake stores latest versioned records (upserts) with the ability to use Hudi time travel for historical updates.
  5. File processing state machine – It processes CDC files that belong to tables that share a common refresh cadence.
  6. EventBridge rule association with the file processing state machine – We use a dedicated EventBridge rule for each refresh cadence with the file processing state machine as target.
  7. File processing AWS Glue job – This is a configuration-driven AWS Glue extract, transform, and load (ETL) job that processes CDC files for one or more tables.

File Processor is implemented as a state machine using AWS Step Functions. Let’s use an example to understand this. The following diagram illustrates a run of the File Processor state machine with a configuration that includes 18 operational tables, a refresh cadence of 10 minutes, a batch size of 5, and an AWS Glue worker type of G.1X.
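As a minimal sketch of the batching arithmetic in this example (the actual Batch Manager implementation is in the GitHub repo), 18 tables with a batch size of 5 produce four batches, which the state machine then fans out over a Map state:

# Minimal sketch: split 18 table configurations into batches of 5.
def build_batches(tables, batch_size):
    """Split a list of table configurations into fixed-size batches."""
    return [tables[i:i + batch_size] for i in range(0, len(tables), batch_size)]

tables = [f"table_{n}" for n in range(1, 19)]   # 18 operational tables
batches = build_batches(tables, batch_size=5)
print(len(batches))               # 4
print([len(b) for b in batches])  # [5, 5, 5, 3]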

ODP framework component: File Processor

The workflow includes the following steps:

  1. The EventBridge rule triggers the File Processor state machine every 10 minutes.
  2. Being the first state in the state machine, the Batch Manager Lambda function reads configurations from DynamoDB tables.
  3. The Lambda function creates four batches: three of them will be mapped to five operational tables each, and the fourth one is mapped to three operational tables. Then it feeds the batches to the Step Functions Map state.
  4. For each item in the Map state, the File Processor Trigger Lambda function will be invoked, which in turn runs the File Processor AWS Glue job.
  5. Each AWS Glue job performs the following actions:
    • Checks the status of an operational table and acquires a lock when it is not being processed by any other job; the odpf_file_processing_tracker DynamoDB table is used for this purpose (see the lock-acquisition sketch after this list). When a lock is acquired for the first time, the job inserts a record in the DynamoDB table with the status updating_table; otherwise, it updates the record.
    • Processes the CDC files for the given operational table from the S3 staging bucket and creates a source-aligned dataset in the S3 raw bucket. It also updates technical metadata in the AWS Glue Data Catalog.
    • Updates the status of the operational table to completed in the odpf_file_processing_tracker table. In case of processing errors, it updates the status to refresh_error and logs the stack trace.
    • It also inserts this record into the odpf_file_processing_tracker_history DynamoDB table along with additional details such as insert, update, and delete row counts.
    • Moves the records that belong to successfully processed CDC files from odpf_file_tracker to the odpf_file_tracker_history table with file_ingestion_status set to raw_file_processed.
    • Moves to the next operational table in the given batch.
    • Note: a failure to process CDC files for one of the operational tables of a given batch does not impact the processing of other operational tables.
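The lock-acquisition check in step 5 can be sketched with a DynamoDB conditional write: the job either inserts or updates the tracking item with the status updating_table, or backs off if another run currently holds the table. This is a hedged illustration; the key and attribute names below are simplified, and the framework's actual schema is in the GitHub repo.

# Hedged sketch of per-table locking with a DynamoDB conditional write.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("odpf_file_processing_tracker")

def try_acquire_lock(table_name, job_run_id):
    try:
        table.put_item(
            Item={
                "raw_table_name": table_name,      # illustrative key name
                "status": "updating_table",
                "locked_by": job_run_id,
            },
            # Succeed only if no other run currently holds the table.
            ConditionExpression="attribute_not_exists(raw_table_name) OR #s = :done",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":done": "completed"},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another job is processing this table
        raise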

Component 3: Configuration Manager

Configuration Manager is used to insert configuration details to the odpf_batch_config and odpf_raw_table_config tables. To keep this post concise, we provide two architecture patterns in the code repo and leave the implementation details to you.

Solution overview

Let’s test the ODP framework by replicating data from 18 operational tables to a data lake and creating source-aligned datasets with 10-minute refresh cadence. We use Amazon Relational Database Service (Amazon RDS) for MySQL to set up an operational database with 18 tables, upload the New York City Taxi – Yellow Trip Data dataset, set up AWS DMS to replicate data to Amazon S3, process the files using the framework, and finally validate the data using Amazon Athena.

Create S3 buckets

For instructions on creating an S3 bucket, refer to Creating a bucket. For this post, we create the following buckets:

  1. odpf-demo-staging-EXAMPLE-BUCKET – You will use this to migrate operational data using AWS DMS
  2. odpf-demo-raw-EXAMPLE-BUCKET – You will use this to store source-aligned datasets
  3. odpf-demo-code-artifacts-EXAMPLE-BUCKET – You will use this to store code artifacts

Deploy File Manager and File Processor

Deploy File Manager and File Processor by following instructions from this README and this README, respectively.

Set up Amazon RDS for MySQL

Complete the following steps to set up Amazon RDS for MySQL as the operational data source:

  1. Provision Amazon RDS for MySQL. For instructions, refer to Create and Connect to a MySQL Database with Amazon RDS.
  2. Connect to the database instance using MySQL Workbench or DBeaver.
  3. Create a database (schema) by running the SQL command CREATE DATABASE taxi_trips;.
  4. Create 18 tables by running the SQL commands in the ops_table_sample_ddl.sql script.

Populate data to the operational data source

Complete the following steps to populate data to the operational data source:

  1. To download the New York City Taxi – Yellow Trip Data dataset for January 2021 (Parquet file), navigate to NYC TLC Trip Record Data, expand 2021, and choose Yellow Taxi Trip records. A file called yellow_tripdata_2021-01.parquet will be downloaded to your computer.
  2. On the Amazon S3 console, open the bucket odpf-demo-staging-EXAMPLE-BUCKET and create a folder called nyc_yellow_trip_data.
  3. Upload the yellow_tripdata_2021-01.parquet file to the folder.
  4. Navigate to the bucket odpf-demo-code-artifacts-EXAMPLE-BUCKET and create a folder called glue_scripts.
  5. Download the file load_nyc_taxi_data_to_rds_mysql.py from the GitHub repo and upload it to the folder (a simplified sketch of what such a script might do appears after this list).
  6. Create an AWS Identity and Access Management (IAM) policy called load_nyc_taxi_data_to_rds_mysql_s3_policy. For instructions, refer to Creating policies using the JSON editor. Use the odpf_setup_test_data_glue_job_s3_policy.json policy definition.
  7. Create an IAM role called load_nyc_taxi_data_to_rds_mysql_glue_role. Attach the policy created in the previous step.
  8. On the AWS Glue console, create a connection for Amazon RDS for MySQL. For instructions, refer to Adding a JDBC connection using your own JDBC drivers and Setting up a VPC to connect to Amazon RDS data stores over JDBC for AWS Glue. Name the connection as odpf_demo_rds_connection.
  9. In the navigation pane of the AWS Glue console, choose Glue ETL jobs, Python Shell script editor, and Upload and edit an existing script under Options.
  10. Choose the file load_nyc_taxi_data_to_rds_mysql.py and choose Create.
  11. Complete the following steps to create your job:
    • Provide a name for the job, such as load_nyc_taxi_data_to_rds_mysql.
    • For IAM role, choose load_nyc_taxi_data_to_rds_mysql_glue_role.
    • Set Data processing units to 1/16 DPU.
    • Under Advanced properties, Connections, select the connection you created earlier.
    • Under Job parameters, add the following parameters:
      • input_sample_data_path = s3://odpf-demo-staging-EXAMPLE-BUCKET/nyc_yellow_trip_data/yellow_tripdata_2021-01.parquet
      • schema_name = taxi_trips
      • table_name = table_1
      • rds_connection_name = odpf_demo_rds_connection
    • Choose Save.
  12. On the Actions menu, run the job.
  13. Go back to your MySQL Workbench or DBeaver and validate the record count by running the SQL command select count(1) row_count from taxi_trips.table_1. You will get an output of 1369769.
  14. Populate the remaining 17 tables by running the SQL commands from the populate_17_ops_tables_rds_mysql.sql script.
  15. Get the row count from the 18 tables by running the SQL commands from the ops_data_validation_query_rds_mysql.sql script. The following screenshot shows the output.
    Record volumes (for 18 Tables) in Operational Database
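The load_nyc_taxi_data_to_rds_mysql.py script referenced in step 5 is available in the repo; as a rough, simplified sketch of what such a Glue Python shell job might do (assuming the AWS SDK for pandas, awswrangler, is available to the job), it reads the Parquet file from Amazon S3 and writes the rows to MySQL through the Glue connection:

# Simplified sketch (not the actual repo script) of a Glue Python shell job
# that loads a Parquet file from S3 into a MySQL table via a Glue connection.
import sys

import awswrangler as wr
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, [
    "input_sample_data_path",
    "schema_name",
    "table_name",
    "rds_connection_name",
])

# Read the NYC taxi Parquet file from the S3 staging bucket into a DataFrame.
df = wr.s3.read_parquet(path=args["input_sample_data_path"])

# Write the rows to the MySQL table through the Glue connection created earlier.
con = wr.mysql.connect(connection=args["rds_connection_name"])
wr.mysql.to_sql(
    df=df,
    con=con,
    schema=args["schema_name"],
    table=args["table_name"],
    mode="overwrite",
)
con.close()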

Configure DynamoDB tables

Complete the following steps to configure the DynamoDB tables:

  1. Download file load_ops_table_configs_to_ddb.py from the GitHub repo and upload it to the folder glue_scripts in the S3 bucket odpf-demo-code-artifacts-EXAMPLE-BUCKET.
  2. Create an IAM policy called load_ops_table_configs_to_ddb_ddb_policy. Use the odpf_setup_test_data_glue_job_ddb_policy.json policy definition.
  3. Create an IAM role called load_ops_table_configs_to_ddb_glue_role. Attach the policy created in the previous step.
  4. On the AWS Glue console, choose Glue ETL jobs, Python Shell script editor, and Upload and edit an existing script under Options.
  5. Choose the file load_ops_table_configs_to_ddb.py and choose Create.
  6. Complete the following steps to create a job:
    • Provide a name, such as load_ops_table_configs_to_ddb.
    • For IAM role, choose load_ops_table_configs_to_ddb_glue_role.
    • Set Data processing units to 1/16 DPU.
    • Under Job parameters, add the following parameters
      • batch_config_ddb_table_name = odpf_batch_config
      • raw_table_config_ddb_table_name = odpf_demo_taxi_trips_raw
      • aws_region = your Region (for example, us-west-1)
    • Choose Save.
  7. On the Actions menu, run the job.
  8. On the DynamoDB console, get the item count from the tables. You will find 1 item in the odpf_batch_config table and 18 items in the odpf_demo_taxi_trips_raw table.

Set up a database in AWS Glue

Complete the following steps to create a database:

  1. On the AWS Glue console, under Data catalog in the navigation pane, choose Databases.
  2. Create a database called odpf_demo_taxi_trips_raw.

Set up AWS DMS for CDC

Complete the following steps to set up AWS DMS for CDC:

  1. Create an AWS DMS replication instance. For Instance class, choose dms.t3.medium.
  2. Create a source endpoint for Amazon RDS for MySQL.
  3. Create a target endpoint for Amazon S3. To configure the S3 endpoint settings, use the JSON definition from dms_s3_endpoint_setting.json.
  4. Create an AWS DMS task.
    • Use the source and target endpoints created in the previous steps.
    • To create AWS DMS task mapping rules, use the JSON definition from dms_task_mapping_rules.json.
    • Under Migration task startup configuration, select Automatically on create.
  5. When the AWS DMS task starts running, you will see a task summary similar to the following screenshot.
    DMS Task Summary
  6. In the Table statistics section, you will see an output similar to the following screenshot. Here, the Full load rows and Total rows columns are important metrics whose counts should match the record volumes of the 18 tables in the operational data source.
    DMS Task Statistics
  7. As a result of successful full load completion, you will find Parquet files in the S3 staging bucket—one Parquet file per table in a dedicated folder, similar to the following screenshot. Similarly, you will find 17 such folders in the bucket.
    DMS Output in S3 Staging Bucket for Table 1

File Manager output

The File Manager Lambda function consumes messages from the SQS queue, extracts metadata for the CDC files, and inserts one item per file to the odpf_file_tracker DynamoDB table. When you check the items, you will find 18 items with file_ingestion_status set to raw_file_landed, as shown in the following screenshot.

CDC Files in File Tracker DynamoDB Table

File Processor output

  1. At the next 10-minute mark after the EventBridge rule is activated, the rule triggers the File Processor state machine. On the Step Functions console, you will notice that the state machine is invoked, as shown in the following screenshot.
    File Processor State Machine Run Summary
  2. As shown in the following screenshot, the Batch Generator Lambda function creates four batches and constructs a Map state for parallel running of the File Processor Trigger Lambda function.
    File Processor State Machine Run Details
  3. Then, the File Processor Trigger Lambda function runs the File Processor Glue Job, as shown in the following screenshot.
    File Processor Glue Job Parallel Runs
  4. Then, you will notice that the File Processor Glue Job runs create source-aligned datasets in Hudi format in the S3 raw bucket. For Table 1, you will see an output similar to the following screenshot. There will be 17 such folders in the S3 raw bucket.
    Data in S3 raw bucket
  5. Finally, in AWS Glue Data Catalog, you will notice 18 tables created in the odpf_demo_taxi_trips_raw database, similar to the following screenshot.
    Tables in Glue Database

Data validation

Complete the following steps to validate the data:

  1. On the Amazon Athena console, open the query editor, and select a workgroup or create a new workgroup.
  2. Choose AwsDataCatalog for Data source and odpf_demo_taxi_trips_raw for Database.
  3. Run the raw_data_validation_query_athena.sql SQL query. You will get an output similar to the following screenshot.
    Raw Data Validation via Amazon Athena

Validation summary: The counts in Amazon Athena match the counts of the operational tables, which proves that the ODP framework has processed all the files and records successfully. This concludes the demo. To test additional scenarios, refer to Extended Testing in the code repo.

Outcomes

Let’s review how the ODP framework addressed the aforementioned requirements.

  1. As discussed earlier in this post, by logically grouping tables by refresh cadence and associating them with EventBridge rules, we ensured that the source-aligned tables are refreshed by the File Processor AWS Glue jobs according to their cadence. With the AWS Glue worker type configuration setting, we selected the appropriate compute resources for the AWS Glue job runs.
  2. By applying table-specific configurations (from odpf_batch_config and odpf_raw_table_config) dynamically, we were able to use one AWS Glue job to process CDC files for 18 tables.
  3. You can use this framework to support a variety of data migration use cases that require quicker data migration from on-premises storage systems to data lakes or analytics platforms on AWS. You can reuse File Manager as is and customize File Processor to work with other storage frameworks such as Apache Iceberg, Delta Lake, and purpose-built data stores such as Amazon Aurora and Amazon Redshift.
  4. To understand how the ODP framework met the company’s disaster recovery (DR) design criterion, we first need to understand the DR architecture strategy at a high level. The DR architecture strategy has the following aspects:
    • One AWS account and two AWS Regions are used for primary and secondary environments.
    • The data lake infrastructure in the secondary Region is kept in sync with the one in the primary Region.
    • Data is stored in S3 buckets, metadata is stored in the AWS Glue Data Catalog, and access controls in Lake Formation are replicated from the primary to the secondary Region.
    • The data lake source and target systems have their respective DR environments.
    • CI/CD tooling (version control, CI server, and so on) is to be made highly available.
    • The DevOps team needs to be able to deploy CI/CD pipelines of analytics frameworks (such as this ODP framework) to either the primary or secondary Region.
    • As you can imagine, disaster recovery on AWS is a vast subject, so we keep our discussion to the last design aspect.

By designing the ODP framework with three components and externalizing operational table configurations to DynamoDB global tables, the company was able to deploy the framework components to the secondary Region (in the rare event of a single-Region failure) and continue to process CDC files from the point it last processed in the primary Region. Because the CDC file tracking and processing audit data is replicated to the DynamoDB replica tables in the secondary Region, the File Manager microservice and File Processor can seamlessly run.
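As a hedged sketch of how one of those tracking tables could be made a global table, adding a secondary-Region replica with boto3 might look like the following. This uses the global tables version 2019.11.21 replica API; the table must satisfy the usual prerequisites (such as DynamoDB Streams being enabled), and the table and Region names here are examples.

# Hedged sketch: add a secondary-Region replica to a framework tracking table.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")  # primary Region

dynamodb.update_table(
    TableName="odpf_file_processing_tracker",
    ReplicaUpdates=[
        {"Create": {"RegionName": "us-west-2"}},  # secondary-Region replica
    ],
)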

Clean up

When you’re finished testing this framework, you can delete the provisioned AWS resources to avoid any further charges.

Conclusion

In this post, we took a real-world operational data processing use case and presented the framework we developed at AWS ProServe. We hope this post and the operational data processing framework using AWS Glue and Apache Hudi will expedite your journey in integrating operational databases into your modern data platforms built on AWS.


About the authors

Ravi-IthaRavi Itha is a Principal Consultant at AWS Professional Services with specialization in data and analytics and generalist background in application development. Ravi helps customers with enterprise data strategy initiatives across insurance, airlines, pharmaceutical, and financial services industries. In his 6-year tenure at Amazon, Ravi has helped the AWS builder community by publishing approximately 15 open-source solutions (accessible via GitHub handle), four blogs, and reference architectures. Outside of work, he is passionate about reading India Knowledge Systems and practicing Yoga Asanas.

srinivas-kandiSrinivas Kandi is a Data Architect at AWS Professional Services. He leads customer engagements related to data lakes, analytics, and data warehouse modernizations. He enjoys reading history and civilizations.

Securely process near-real-time data from Amazon MSK Serverless using an AWS Glue streaming ETL job with IAM authentication

Post Syndicated from Shubham Purwar original https://aws.amazon.com/blogs/big-data/securely-process-near-real-time-data-from-amazon-msk-serverless-using-an-aws-glue-streaming-etl-job-with-iam-authentication/

Streaming data has become an indispensable resource for organizations worldwide because it offers real-time insights that are crucial for data analytics. The escalating velocity and magnitude of collected data has created a demand for real-time analytics. This data originates from diverse sources, including social media, sensors, logs, and clickstreams, among others. With streaming data, organizations gain a competitive edge by promptly responding to real-time events and making well-informed decisions.

In streaming applications, a prevalent approach involves ingesting data through Apache Kafka and processing it with Apache Spark Structured Streaming. However, managing, integrating, and authenticating the processing framework (Apache Spark Structured Streaming) with the ingestion framework (Kafka) poses significant challenges, necessitating a managed and serverless framework. For example, integrating and authenticating a client like Spark Structured Streaming with Kafka brokers and ZooKeeper nodes using manual TLS requires certificate and keystore management, which is not an easy task and requires solid knowledge of TLS setup.

To address these issues effectively, we propose using Amazon Managed Streaming for Apache Kafka (Amazon MSK), a fully managed Apache Kafka service that offers a seamless way to ingest and process streaming data. In this post, we use Amazon MSK Serverless, a cluster type for Amazon MSK that makes it possible for you to run Apache Kafka without having to manage and scale cluster capacity. To further enhance security and streamline authentication and authorization processes, MSK Serverless enables you to handle both authentication and authorization using AWS Identity and Access Management (IAM) in your cluster. This integration eliminates the need for separate mechanisms for authentication and authorization, simplifying and strengthening data protection. For example, when a client tries to write to your cluster, MSK Serverless uses IAM to check whether that client is an authenticated identity and also whether it is authorized to produce to your cluster.

To process data effectively, we use AWS Glue, a serverless data integration service that uses the Spark Structured Streaming framework and enables near-real-time data processing. An AWS Glue streaming job can handle large volumes of incoming data from MSK Serverless with IAM authentication. This powerful combination ensures that data is processed securely and swiftly.

This post demonstrates how to build an end-to-end implementation that processes data from MSK Serverless using an AWS Glue streaming extract, transform, and load (ETL) job with IAM authentication to connect to MSK Serverless from the AWS Glue job, and then queries the data using Amazon Athena.

Solution overview

The following diagram illustrates the architecture that you implement in this post.

The workflow consists of the following steps:

  1. Create an MSK Serverless cluster with IAM authentication and an EC2 Kafka client as the producer to ingest sample data into a Kafka topic. For this post, we use the kafka-console-producer.sh Kafka console producer client.
  2. Set up an AWS Glue streaming ETL job to process the incoming data (a condensed sketch of such a job script appears after this list). This job extracts data from the Kafka topic, loads it into Amazon Simple Storage Service (Amazon S3), and creates a table in the AWS Glue Data Catalog. By continuously consuming data from the Kafka topic, the ETL job ensures it remains synchronized with the latest streaming data. Moreover, the job incorporates the checkpointing functionality, which tracks the processed records, enabling it to resume processing seamlessly from the point of interruption in the event of a job run failure.
  3. Following the data processing, the streaming job stores data in Amazon S3 and generates a Data Catalog table. This table acts as a metadata layer for the data. To interact with the data stored in Amazon S3, you can use Athena, a serverless and interactive query service. Athena lets you run SQL queries on the data, facilitating seamless exploration and analysis.
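The core of such a streaming script (see step 2 above) can be sketched as follows. This is a condensed, hedged illustration rather than the exact script deployed by the CloudFormation template; the connection name, schema classification, and bucket paths are placeholders.

# Condensed sketch of an AWS Glue streaming script reading the MSK Serverless
# topic through the IAM-authenticated Glue connection and writing to Amazon S3.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the topic as a streaming DataFrame through the Glue Kafka connection,
# which carries the MSK bootstrap servers and the IAM authentication setting.
stream_df = glue_context.create_data_frame.from_options(
    connection_type="kafka",
    connection_options={
        "connectionName": "msk-serverless-connection",  # placeholder connection name
        "topicName": "msk-serverless-blog",
        "classification": "csv",                        # assumed format of nycflights.csv
        "startingOffsets": "earliest",
        "inferSchema": "true",
        "typeOfData": "kafka",
    },
    transformation_ctx="stream_df",
)

def process_batch(batch_df, batch_id):
    # Persist each micro-batch to the output bucket in Parquet format.
    if not batch_df.rdd.isEmpty():
        batch_df.write.mode("append").parquet("s3://EXAMPLE-OUTPUT-BUCKET/nycflights/")

glue_context.forEachBatch(
    frame=stream_df,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": "s3://EXAMPLE-OUTPUT-BUCKET/checkpoints/",
    },
)
job.commit()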

For this post, we create the solution resources in the us-east-1 Region using AWS CloudFormation templates. In the following sections, we show you how to configure your resources and implement the solution.

Configure resources with AWS CloudFormation

In this post, you use the following two CloudFormation templates. The advantage of using two templates is that you can decouple the creation of the ingestion resources from the processing resources according to your use case, for example if you only need to create the processing resources.

  • vpc-mskserverless-client.yaml – This template sets up the data ingestion resources, such as a VPC, an MSK Serverless cluster, and an S3 bucket
  • gluejob-setup.yaml – This template sets up the data processing resources, such as the AWS Glue table, database, connection, and streaming job

Create data ingestion resources

The vpc-mskserverless-client.yaml stack creates a VPC, private and public subnets, security groups, S3 VPC Endpoint, MSK Serverless cluster, EC2 instance with Kafka client, and S3 bucket. To create the solution resources for data ingestion, complete the following steps:

  1. Launch the stack vpc-mskserverless-client using the CloudFormation template:
  2. Provide the parameter values as listed below.
    • EnvironmentName – Environment name that is prefixed to resource names
    • PrivateSubnet1CIDR – IP range (CIDR notation) for the private subnet in the first Availability Zone
    • PrivateSubnet2CIDR – IP range (CIDR notation) for the private subnet in the second Availability Zone
    • PublicSubnet1CIDR – IP range (CIDR notation) for the public subnet in the first Availability Zone
    • PublicSubnet2CIDR – IP range (CIDR notation) for the public subnet in the second Availability Zone
    • VpcCIDR – IP range (CIDR notation) for this VPC
    • InstanceType – Instance type for the EC2 instance (sample value: t2.micro)
    • LatestAmiId – AMI used for the EC2 instance (sample value: /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2)
  1. When the stack creation is complete, retrieve the EC2 instance PublicDNS from the vpc-mskserverless-client stack’s Outputs tab.

The stack creation process can take around 15 minutes to complete.

  1. On the Amazon EC2 console, access the EC2 instance that you created using the CloudFormation template.
  2. Choose the EC2 instance whose InstanceId is shown on the stack’s Outputs tab.

Next, you log in to the EC2 instance using Session Manager, a capability of AWS Systems Manager.

  1. On the Amazon EC2 console, select the instance ID, and on the Session Manager tab, choose Connect.


After you log in to the EC2 instance, you create a Kafka topic in the MSK Serverless cluster from the EC2 instance.

  1. In the following export command, provide the MSKBootstrapServers value from the vpc-mskserverless-client stack output for your endpoint:
    $ sudo su - ec2-user
    $ BS=<your-msk-serverless-endpoint (e.g.) boot-xxxxxx.yy.kafka-serverless.us-east-1.a>

  2. Run the following command on the EC2 instance to create a topic called msk-serverless-blog. The Kafka client is already installed in the ec2-user home directory (/home/ec2-user).
    $ /home/ec2-user/kafka_2.12-2.8.1/bin/kafka-topics.sh \
    --bootstrap-server $BS \
    --command-config /home/ec2-user/kafka_2.12-2.8.1/bin/client.properties \
    --create --topic msk-serverless-blog \
    --partitions 1
    
    Created topic msk-serverless-blog

After you confirm the topic creation, you can push the data to the MSK Serverless.

  1. Run the following command on the EC2 instance to create a console producer to produce records to the Kafka topic. (For source data, we use nycflights.csv downloaded at the ec2-user home directory /home/ec2-user.)
$ /home/ec2-user/kafka_2.12-2.8.1/bin/kafka-console-producer.sh \
--broker-list $BS \
--producer.config /home/ec2-user/kafka_2.12-2.8.1/bin/client.properties \
--topic msk-serverless-blog < nycflights.csv

Next, you set up the data processing service resources, specifically AWS Glue components like the database, table, and streaming job to process the data.

Create data processing resources

The gluejob-setup.yaml CloudFormation template creates a database, table, AWS Glue connection, and AWS Glue streaming job. Retrieve the values for VpcId, GluePrivateSubnet, GlueconnectionSubnetAZ, SecurityGroup, S3BucketForOutput, and S3BucketForGlueScript from the vpc-mskserverless-client stack’s Outputs tab to use in this template. Complete the following steps:

  1. Launch the stack gluejob-setup:

  2. Provide parameter values as listed below.
    • EnvironmentName – Environment name that is prefixed to resource names (sample value: Gluejob-setup)
    • VpcId – ID of the VPC for the security group; use the VPC ID created with the first stack (refer to the first stack’s outputs)
    • GluePrivateSubnet – Private subnet used for creating the AWS Glue connection (refer to the first stack’s outputs)
    • SecurityGroupForGlueConnection – Security group used by the AWS Glue connection (refer to the first stack’s outputs)
    • GlueconnectionSubnetAZ – Availability Zone for the first private subnet used for the AWS Glue connection
    • GlueDataBaseName – Name of the AWS Glue Data Catalog database (sample value: glue_kafka_blog_db)
    • GlueTableName – Name of the AWS Glue Data Catalog table (sample value: blog_kafka_tbl)
    • S3BucketNameForScript – Bucket name for the AWS Glue ETL script; use the S3 bucket name from the previous stack (for example, aws-gluescript-${AWS::AccountId}-${AWS::Region}-${EnvironmentName})
    • GlueWorkerType – Worker type for the AWS Glue job (sample value: G.1X)
    • NumberOfWorkers – Number of workers in the AWS Glue job (sample value: 3)
    • S3BucketNameForOutput – Bucket name for writing data from the AWS Glue job (sample value: aws-glueoutput-${AWS::AccountId}-${AWS::Region}-${EnvironmentName})
    • TopicName – MSK topic name that needs to be processed (sample value: msk-serverless-blog)
    • MSKBootstrapServers – Kafka bootstrap server (sample value: boot-30vvr5lg.c1.kafka-serverless.us-east-1.amazonaws.com:9098)

The stack creation process can take around 1–2 minutes to complete. You can check the Outputs tab for the stack after the stack is created.

In the gluejob-setup stack, we created a Kafka type AWS Glue connection, which consists of broker information like the MSK bootstrap server, the topic name, and the VPC in which the MSK Serverless cluster is created. Most importantly, it specifies the IAM authentication option, which enables AWS Glue to authenticate and authorize using IAM while consuming the data from the MSK topic. For further clarity, you can examine the AWS Glue connection and the associated AWS Glue table generated through AWS CloudFormation.

After successfully creating the CloudFormation stack, you can now proceed with processing data using the AWS Glue streaming job with IAM authentication.

Run the AWS Glue streaming job

To process the data from the MSK topic using the AWS Glue streaming job that you set up in the previous section, complete the following steps:

  1. On the CloudFormation console, choose the stack gluejob-setup.
  2. On the Outputs tab, retrieve the name of the AWS Glue streaming job from the GlueJobName row. In the following screenshot, the name is GlueStreamingJob-glue-streaming-job.

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Search for the AWS Glue streaming job named GlueStreamingJob-glue-streaming-job.
  3. Choose the job name to open its details page.
  4. Choose Run to start the job.
  5. On the Runs tab, confirm if the job ran without failure.

  1. Retrieve the OutputBucketName from the gluejob-setup template outputs.
  2. On the Amazon S3 console, navigate to the S3 bucket to verify the data.

  1. On the AWS Glue console, choose the AWS Glue streaming job you ran, then choose Stop job run.

Because this is a streaming job, it will continue to run indefinitely until manually stopped. After you verify the data is present in the S3 output bucket, you can stop the job to save cost.

Validate the data in Athena

After the AWS Glue streaming job has successfully created the table for the processed data in the Data Catalog, follow these steps to validate the data using Athena:

  1. On the Athena console, navigate to the query editor.
  2. Choose the Data Catalog as the data source.
  3. Choose the database and table that the AWS Glue streaming job created.
  4. To validate the data, run the following query to find the flight number, origin, and destination that covered the highest distance in a year:
SELECT distinct(flight),distance,origin,dest,year from "glue_kafka_blog_db"."output" where distance= (select MAX(distance) from "glue_kafka_blog_db"."output")

The following screenshot shows the output of our example query.

Clean up

To clean up your resources, complete the following steps:

  1. Delete the CloudFormation stack gluejob-setup.
  2. Delete the CloudFormation stack vpc-mskserverless-client.

Conclusion

In this post, we demonstrated a use case for building a serverless ETL pipeline for streaming with IAM authentication, which allows you to focus on the outcomes of your analytics. You can also modify the AWS Glue streaming ETL code in this post with transformations and mappings to ensure that only valid data gets loaded to Amazon S3. This solution enables you to harness the prowess of AWS Glue streaming, seamlessly integrated with MSK Serverless through the IAM authentication method. It’s time to act and revolutionize your streaming processes.

Appendix

This section provides more information about how to create the AWS Glue connection on the AWS Glue console, which helps establish the connection to the MSK Serverless cluster and allow the AWS Glue streaming job to authenticate and authorize using IAM authentication while consuming the data from the MSK topic.

  1. On the AWS Glue console, in the navigation pane, under Data catalog, choose Connections.
  2. Choose Create connection.
  3. For Connection name, enter a unique name for your connection.
  4. For Connection type, choose Kafka.
  5. For Connection access, select Amazon managed streaming for Apache Kafka (MSK).
  6. For Kafka bootstrap server URLs, enter a comma-separated list of bootstrap server URLs. Include the port number. For example, boot-xxxxxxxx.c2.kafka-serverless.us-east-1.amazonaws.com:9098.

  1. For Authentication, choose IAM Authentication.
  2. Select Require SSL connection.
  3. For VPC, choose the VPC that contains your data source.
  4. For Subnet, choose the private subnet within your VPC.
  5. For Security groups, choose a security group to allow access to the data store in your VPC subnet.

Security groups are associated with the elastic network interface (ENI) attached to your subnet. You must choose at least one security group with a self-referencing inbound rule for all TCP ports.

  1. Choose Save changes.

After you create the AWS Glue connection, you can use the AWS Glue streaming job to consume data from the MSK topic using IAM authentication.


About the authors

Shubham Purwar is a Cloud Engineer (ETL) at AWS Bengaluru specialized in AWS Glue and Amazon Athena. He is passionate about helping customers solve issues related to their ETL workload and implement scalable data processing and analytics pipelines on AWS. In his free time, Shubham loves to spend time with his family and travel around the world.

Nitin Kumar is a Cloud Engineer (ETL) at AWS with a specialization in AWS Glue. He is dedicated to assisting customers in resolving issues related to their ETL workloads and creating scalable data processing and analytics pipelines on AWS.

Preview – Connect Foundation Models to Your Company Data Sources with Agents for Amazon Bedrock

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/preview-connect-foundation-models-to-your-company-data-sources-with-agents-for-amazon-bedrock/

In July, we announced the preview of agents for Amazon Bedrock, a new capability for developers to create generative AI applications that complete tasks. Today, I’m happy to introduce a new capability to securely connect foundation models (FMs) to your company data sources using agents.

With a knowledge base, you can use agents to give FMs in Bedrock access to additional data that helps the model generate more relevant, context-specific, and accurate responses without continuously retraining the FM. Based on user input, agents identify the appropriate knowledge base, retrieve the relevant information, and add the information to the input prompt, giving the model more context information to generate a completion.

Knowledge Base for Amazon Bedrock

Agents for Amazon Bedrock use a concept known as retrieval augmented generation (RAG) to achieve this. To create a knowledge base, specify the Amazon Simple Storage Service (Amazon S3) location of your data, select an embedding model, and provide the details of your vector database. Bedrock converts your data into embeddings and stores your embeddings in the vector database. Then, you can add the knowledge base to agents to enable RAG workflows.

For the vector database, you can choose between vector engine for Amazon OpenSearch Serverless, Pinecone, and Redis Enterprise Cloud. I’ll share more details on how to set up your vector database later in this post.

Primer on Retrieval Augmented Generation, Embeddings, and Vector Databases

RAG isn’t a specific set of technologies but a concept for providing FMs access to data they didn’t see during training. Using RAG, you can augment FMs with additional information, including company-specific data, without continuously retraining your model.

Continuously retraining your model is not only compute-intensive and expensive, but as soon as you’ve retrained the model, your company might have already generated new data, and your model has stale information. RAG addresses this issue by providing your model access to additional external data at runtime. Relevant data is then added to the prompt to help improve both the relevance and the accuracy of completions.

This data can come from a number of data sources, such as document stores or databases. A common implementation for document search is converting your documents, or chunks of the documents, into vector embeddings using an embedding model and then storing the vector embeddings in a vector database, as shown in the following figure.

Knowledge Base for Amazon Bedrock

The vector embedding includes the numeric representations of text data within your documents. Each embedding aims to capture the semantic or contextual meaning of the data. Each vector embedding is put into a vector database, often with additional metadata such as a reference to the original content the embedding was created from. The vector database then indexes the vectors, which can be done using a variety of approaches. This indexing enables quick retrieval of relevant data.

Compared to traditional keyword search, vector search can find relevant results without requiring an exact keyword match. For example, if you search for “What is the cost of product X?” and your documents say “The price of product X is […]”, then keyword search might not work because “price” and “cost” are two different words. With vector search, the relevant result is returned because “price” and “cost” are semantically similar; they have the same meaning. Vector similarity is calculated using distance metrics such as Euclidean distance, cosine similarity, or dot product similarity.
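As a tiny, self-contained illustration of those metrics (real embeddings from a model such as Amazon Titan Embeddings have hundreds or thousands of dimensions; these toy three-dimensional vectors are made up for clarity):

# Toy illustration of cosine similarity, Euclidean distance, and dot product.
import numpy as np

price_of_x = np.array([0.12, 0.91, 0.33])   # embedding of "The price of product X is ..."
cost_query = np.array([0.10, 0.88, 0.35])   # embedding of "What is the cost of product X?"
unrelated  = np.array([0.95, 0.02, 0.11])   # embedding of an unrelated sentence

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(price_of_x, cost_query))            # close to 1.0: semantically similar
print(cosine(price_of_x, unrelated))             # much lower: semantically distant
print(np.linalg.norm(price_of_x - cost_query))   # Euclidean distance (small here)
print(float(np.dot(price_of_x, cost_query)))     # dot product similarity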

The vector database is then used within the prompt workflow to efficiently retrieve external information based on an input query, as shown in the figure below.

Knowledge Base for Amazon Bedrock

The workflow starts with a user input prompt. Using the same embedding model, you create a vector embedding representation of the input prompt. This embedding is then used to query the database for similar vector embeddings to return the most relevant text as the query result.

The query result is then added to the prompt, and the augmented prompt is passed to the FM. The model uses the additional context in the prompt to generate the completion, as shown in the following figure.
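The same workflow can be summarized in a short, vendor-neutral sketch. The embed, vector_search, and invoke_model helpers below are placeholders for an embedding model, a vector database query, and a foundation model call, all of which agents and knowledge bases for Amazon Bedrock manage for you.

# Skeleton of the prompt-augmentation (RAG) workflow shown in the figure.
def embed(text):
    raise NotImplementedError("call your embedding model here")          # placeholder

def vector_search(query_embedding, top_k=3):
    raise NotImplementedError("query your vector database here")         # placeholder

def invoke_model(prompt):
    raise NotImplementedError("call your foundation model here")         # placeholder

def answer(user_prompt):
    query_embedding = embed(user_prompt)           # 1. embed the input prompt
    passages = vector_search(query_embedding)      # 2. retrieve the most similar chunks
    context = "\n".join(passages)
    augmented_prompt = (                           # 3. add the results to the prompt
        "Use the following context to answer the question.\n"
        f"Context:\n{context}\n\nQuestion: {user_prompt}"
    )
    return invoke_model(augmented_prompt)          # 4. generate the completion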

Knowledge Stores for Amazon Bedrock

Similar to the fully managed agents experience I described in the blog post on agents for Amazon Bedrock, the knowledge base for Amazon Bedrock manages the data ingestion workflow, and agents manage the RAG workflow for you.

Get Started with Knowledge Bases for Amazon Bedrock

You can add a knowledge base by specifying a data source such as Amazon S3, selecting an embedding model such as Amazon Titan Embeddings to convert the data into vector embeddings, and choosing a destination vector database to store the vector data. Bedrock takes care of creating, storing, managing, and updating your embeddings in the vector database.

If you add knowledge bases to an agent, the agent will identify the appropriate knowledge base based on user input, retrieve the relevant information, and add the information to the input prompt, providing the model with more context information to generate a response, as shown in the figure below. All information retrieved from knowledge bases comes with source attribution to improve transparency and minimize hallucinations.

Knowledge Base for Amazon Bedrock
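
If you want to query a knowledge base directly, without going through an agent, the bedrock-agent-runtime client exposes retrieval APIs for that. The following is a hedged sketch of a single retrieve-and-generate call; the parameter and response field names are my assumptions based on the preview documentation, and the knowledge base ID and model ARN are placeholders.

    import boto3

    bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

    # Placeholder knowledge base ID and model ARN.
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": "Until when can I file a tax extension?"},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "EXAMPLEKBID",
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2",
            },
        },
    )

    print(response["output"]["text"])                 # the generated answer
    for citation in response.get("citations", []):    # source attribution for transparency
        print(citation)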

Let me walk you through those steps in more detail.

Create a Knowledge Base for Amazon Bedrock
Let’s assume you’re a developer at a tax consulting company and want to provide users with a generative AI application—a TaxBot—that can answer US tax filing questions. You first create a knowledge base that holds the relevant tax documents. Then, you configure an agent in Bedrock with access to this knowledge base and integrate the agent into your TaxBot application.

To get started, open the Bedrock console, select Knowledge base in the left navigation pane, then choose Create knowledge base.

Knowledge Base for Amazon Bedrock

Step 1 – Provide knowledge base details. Enter a name for the knowledge base and a description (optional). You also must select an AWS Identity and Access Management (IAM) runtime role with a trust policy for Amazon Bedrock, permissions to access the S3 bucket you want the knowledge base to use, and read/write permissions to your vector database. You can also assign tags as needed.

Knowledge Base for Amazon Bedrock
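
Creating the runtime role up front can save a round trip through the console. Here is a hedged sketch using boto3: the trust policy lets the Amazon Bedrock service assume the role, while the role name and description are placeholders I chose for illustration. The S3 and vector database permissions mentioned above would still need to be attached as separate policies scoped to your own resources.

    import json
    import boto3

    iam = boto3.client("iam")

    # Trust policy allowing the Bedrock service to assume this role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "bedrock.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }

    role = iam.create_role(
        RoleName="BedrockKnowledgeBaseRole",  # placeholder name
        AssumeRolePolicyDocument=json.dumps(trust_policy),
        Description="Runtime role for a knowledge base for Amazon Bedrock",
    )
    print(role["Role"]["Arn"])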

Step 2 – Set up data source. Enter a data source name and specify the Amazon S3 location for your data. Supported data formats include .txt, .md, .html, .doc and .docx, .csv, .xls and .xlsx, and .pdf files. You can also provide an AWS Key Management Service (AWS KMS) key to allow Bedrock to decrypt and encrypt your data and another AWS KMS key for transient data storage while Bedrock is converting your data into embeddings.

Choose the embedding model, such as Amazon Titan Embeddings – Text, and your vector database. For the vector database, as mentioned earlier, you can choose between vector engine for Amazon OpenSearch Serverless, Pinecone, or Redis Enterprise Cloud.

Knowledge Base for Amazon Bedrock

Important note on the vector database: Amazon Bedrock does not create a vector database on your behalf. You must create a new, empty vector database from the list of supported options and provide the vector database index name as well as the index field and metadata field mappings. This vector database must be reserved for exclusive use with Amazon Bedrock.

Let me show you what the setup looks like for vector engine for Amazon OpenSearch Serverless. Assuming you’ve set up an OpenSearch Serverless collection as described in the Developer Guide and this AWS Big Data Blog post, provide the ARN of the OpenSearch Serverless collection, specify the vector index name, and the vector field and metadata field mapping.

Knowledge Base for Amazon Bedrock
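
Because Bedrock expects an existing, empty vector index, you create it yourself in the OpenSearch Serverless collection first. A hedged sketch with the opensearch-py client is shown below; the collection endpoint is a placeholder, and the index name, field names, HNSW settings, and the 1,536-dimension value (the output size of Titan Embeddings) are assumptions you should adjust to your own configuration.

    import boto3
    from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

    region = "us-east-1"
    credentials = boto3.Session().get_credentials()
    auth = AWSV4SignerAuth(credentials, region, "aoss")  # "aoss" = OpenSearch Serverless

    client = OpenSearch(
        hosts=[{"host": "EXAMPLE1234.us-east-1.aoss.amazonaws.com", "port": 443}],  # placeholder
        http_auth=auth,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection,
    )

    index_body = {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                # Field names must match the mappings you give the knowledge base.
                "bedrock-kb-vector": {
                    "type": "knn_vector",
                    "dimension": 1536,
                    "method": {"name": "hnsw", "engine": "faiss", "space_type": "l2"},
                },
                "bedrock-kb-text": {"type": "text"},
                "bedrock-kb-metadata": {"type": "text"},
            }
        },
    }

    client.indices.create(index="bedrock-kb-index", body=index_body)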

The configuration for Pinecone and Redis Enterprise Cloud is similar. Check out this Pinecone blog post and this Redis Inc. blog post for more details on how to set up and prepare their vector database for Bedrock.

Step 3 – Review and create. Review your knowledge base configuration and choose Create knowledge base.

Knowledge Base for Amazon Bedrock

Back on the knowledge base details page, choose Sync for the newly created data source (and again whenever you add new data to the data source) to start the ingestion workflow, which converts your Amazon S3 data into vector embeddings and upserts the embeddings into the vector database. Depending on the amount of data, this workflow can take some time.

Knowledge Base for Amazon Bedrock
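
The Sync button has an API counterpart as well. Below is a hedged sketch of starting and polling an ingestion job with the bedrock-agent client; the operation names, response fields, and status values match my understanding of the API, and the IDs are placeholders.

    import time
    import boto3

    bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

    # Placeholder IDs from the knowledge base and data source you created.
    job = bedrock_agent.start_ingestion_job(
        knowledgeBaseId="EXAMPLEKBID",
        dataSourceId="EXAMPLEDSID",
    )
    job_id = job["ingestionJob"]["ingestionJobId"]

    # Poll until the ingestion job finishes; duration depends on the amount of data.
    while True:
        status = bedrock_agent.get_ingestion_job(
            knowledgeBaseId="EXAMPLEKBID",
            dataSourceId="EXAMPLEDSID",
            ingestionJobId=job_id,
        )["ingestionJob"]["status"]
        if status in ("COMPLETE", "FAILED"):
            print(status)
            break
        time.sleep(30)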

Next, I’ll show you how to add the knowledge base to an agent configuration.

Add a Knowledge Base to Agents for Amazon Bedrock
You can add a knowledge base when creating or updating an agent for Amazon Bedrock. Create an agent as described in this AWS News Blog post on agents for Amazon Bedrock.

For my tax bot example, I’ve created an agent called “TaxBot,” selected a foundation model, and provided these instructions for the agent in step 2: “You are a helpful and friendly agent that answers US tax filing questions for users.” In step 4, you can now select a previously created knowledge base and provide instructions for the agent describing when to use this knowledge base.

Knowledge Base for Amazon Bedrock

These instructions are very important as they help the agent decide whether or not a particular knowledge base should be used for retrieval. The agent will identify the appropriate knowledge base based on user input and available knowledge base instructions.

For my tax bot example, I added the knowledge base “TaxBot-Knowledge-Base” together with these instructions: “Use this knowledge base to answer tax filing questions.”
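
Attaching the knowledge base and its instructions to an agent can also be done through the API. The call below reflects my understanding of the bedrock-agent client; the parameter names (in particular passing the instructions as the description) and the IDs are assumptions to verify against the current documentation.

    import boto3

    bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

    bedrock_agent.associate_agent_knowledge_base(
        agentId="EXAMPLEAGENTID",        # placeholder
        agentVersion="DRAFT",            # the working draft of the agent
        knowledgeBaseId="EXAMPLEKBID",   # placeholder
        description="Use this knowledge base to answer tax filing questions.",
    )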

Once you’ve finished the agent configuration, you can test your agent and how it’s using the added knowledge base. Note how the agent provides a source attribution for information pulled from knowledge bases.

Knowledge Base for Amazon Bedrock
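
Outside the console test window, you can exercise the agent with the bedrock-agent-runtime client. The response comes back as an event stream; the sketch below shows one way to read it, based on my understanding of the streaming shape, with placeholder IDs, so treat the field names as assumptions.

    import boto3

    bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

    response = bedrock_agent_runtime.invoke_agent(
        agentId="EXAMPLEAGENTID",        # placeholder
        agentAliasId="EXAMPLEALIASID",   # placeholder
        sessionId="test-session-1",      # any stable ID to keep conversation context
        inputText="Until when can I file a tax extension?",
    )

    # The completion is streamed back as chunks of bytes.
    answer = ""
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk:
            answer += chunk["bytes"].decode("utf-8")
    print(answer)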

Learn the Fundamentals of Generative AI
Generative AI with large language models (LLMs) is an on-demand, three-week course for data scientists and engineers who want to learn how to build generative AI applications with LLMs, including RAG. It’s the perfect foundation to start building with Amazon Bedrock. Enroll in Generative AI with LLMs today.

Sign up to Learn More about Amazon Bedrock (Preview)
Amazon Bedrock is currently available in preview. Reach out through your usual AWS support contacts if you’d like access to knowledge bases for Amazon Bedrock as part of the preview. We’re regularly providing access to new customers. To learn more, visit the Amazon Bedrock Features page and sign up to learn more about Amazon Bedrock.

— Antje

AWS achieves HDS certification in two additional Regions

Post Syndicated from Janice Leung original https://aws.amazon.com/blogs/security/aws-achieves-hds-certification-in-two-additional-regions-2/

Amazon Web Services (AWS) is pleased to announce that two additional AWS Regions—Middle East (UAE) and Europe (Zurich)—have been granted the Health Data Hosting (Hébergeur de Données de Santé, HDS) certification, increasing the scope to 20 global AWS Regions.

The Agence Française de la Santé Numérique (ASIP Santé), the French governmental agency for health, introduced the HDS certification to strengthen the security and protection of personal health data. By achieving this certification, AWS demonstrates our commitment to adhere to the heightened expectations for cloud service providers.

The following 20 Regions are in scope for this certification:

  • US East (Ohio)
  • US East (Northern Virginia)
  • US West (Northern California)
  • US West (Oregon)
  • Asia Pacific (Jakarta)
  • Asia Pacific (Seoul)
  • Asia Pacific (Mumbai)
  • Asia Pacific (Singapore)
  • Asia Pacific (Sydney)
  • Asia Pacific (Tokyo)
  • Canada (Central)
  • Europe (Frankfurt)
  • Europe (Ireland)
  • Europe (London)
  • Europe (Milan)
  • Europe (Paris)
  • Europe (Stockholm)
  • Europe (Zurich)
  • Middle East (UAE)
  • South America (São Paulo)

The HDS certification demonstrates that AWS provides a framework for technical and governance measures that secure and protect personal health data, governed by French law. Our customers who handle personal health data can continue to manage their workloads in HDS-certified Regions with confidence.

Independent third-party auditors evaluated and certified AWS on September 8, 2023. The Certificate of Compliance demonstrating AWS compliance status is available on the Agence du Numérique en Santé (ANS) website and AWS Artifact. AWS Artifact is a self-service portal for on-demand access to AWS compliance reports. Sign in to AWS Artifact in the AWS Management Console, or learn more at Getting Started with AWS Artifact.

For up-to-date information, including when additional Regions are added, see the AWS Compliance Programs page and choose HDS.

AWS strives to continuously meet your architectural and regulatory needs. If you have questions or feedback about HDS compliance, reach out to your AWS account team.

To learn more about our compliance and security programs, see AWS Compliance Programs. As always, we value your feedback and questions; reach out to the AWS Compliance team through the Contact Us page.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security news? Follow us on Twitter.

Author

Janice Leung

Janice is a Security Assurance Audit Program Manager at AWS, based in New York. She leads security audits across Europe and previously worked in security assurance and technology risk management in the financial industry for 11 years.

Building companies means building careers: why I joined Cloudflare as Chief People Officer

Post Syndicated from Michele Yetman original http://blog.cloudflare.com/why-i-joined-cloudflare-as-chief-people-officer/


One piece of advice I received early in my career was to get into a transformative industry. Those words have followed me ever since, and it’s a goal I’ve encouraged many others to pursue.

For me, it meant first launching into biotechnology where I learned my passion for working with deeply technical and disruptive businesses doing things that hadn’t been done before.

I later joined Amazon at a time when it was best known as a retailer instead of a technology company as it is today. While there, I led HR for some of their most technical businesses from eCommerce to AWS. As all these businesses scaled over the next decade, I became increasingly focused, and then finally fully dedicated to, leading HR for AWS. During that time, I had the opportunity to serve as a thought partner to the AWS CEO and leadership team as the organization grew from 400 employees to 30,000.

It was at this point in my career that I realized my passion for scaling a company with practices that reinforce the mission and building programs with intention to nurture the culture. To have any impact, all this work must be in support of promoting a diverse and inclusive workplace that values individual and group differences to ensure all employees, across a diversity of backgrounds and perspectives, feel valued, welcome, and integrated.

Later, I took all those learnings to Tableau as Chief Human Resource Officer (CHRO) before it was acquired by Salesforce. Like AWS, Tableau was ready to begin its next phase of growth and I designed and implemented the next generation talent strategy that supported the long-term growth plans for the business.

Today, as a two-time Chief People Officer and with experience and passion for scaling large global companies, building is in my DNA. That’s why Cloudflare’s bold mission to “help to build a better Internet” excites me. I am a builder at my core, and am humbled for the opportunity to help build again and to be selected to join the team for this next phase of the Cloudflare journey.

There are three things that stood out to me and made this next step an easy one:

The technology

Cloudflare caught my attention as a next-generation disruptor in the tech industry, offering security, availability, and scalability for applications with no trade-off in speed. The advancement of generative AI, and Cloudflare's partnership with many of these businesses as customers, puts the company front and center at a transformative time in the tech space.

The values

It was evident with each interview how deeply the company values both its customers and its people. The transparency and keen attention to scaling what is already a very special place to work came through in every conversation. That really resonated with me. I always view my work through the lens of how it impacts people, leaders, and customers. For me, putting both customers and people at the center of everything we do on Cloudflare’s growing People team is paramount as we create the people strategy that supports achieving the company’s high-growth business objectives.

The culture

Too often, company culture is defined by, and confined to, employee handbooks and posters on the wall. In rarer cases, it's something that is both innate and carefully nurtured by leaders who demonstrate their values through their actions. That’s what I’ve found at Cloudflare. The transparency, trust, and curiosity of this organization is energizing. The people I’ve had the opportunity to engage with have been warm, humble, and bright.

As I launch in my new position here at Cloudflare, my priority will be to first learn as much as I can about the people-first culture and ensure the work we do within the People team is aligned. I’m excited to bring my own curiosity and passion for disruptive and transformative companies to a new team to help Cloudflare scale and flourish, and help build amazing careers for all of our people.

Restore Like Never Before: Introducing Backblaze Computer Backup v9.0

Post Syndicated from Yev original https://www.backblaze.com/blog/restore-like-never-before-introducing-backblaze-computer-backup-v9-0/


Get ready. The release of Backblaze Computer Backup 9.0 is rolling out now through the end of September.

Backblaze Computer Backup 9.0 is available today in early access, and restoring your files is about to get a whole lot easier.

What’s New in Backblaze Computer Backup 9.0?

Whether you’re a longtime user or just getting started with Backblaze, version 9.0 provides you with an unparalleled backup and restore solution. With our latest release, you get our most requested feature: a dedicated restore app for both macOS and Windows clients that makes the process of restoring your data even more intuitive, seamless, and streamlined than before. The new version also comes with essential bug fixes and performance improvements to keep your backup experience ahead of the curve for both security and speed.

Backblaze Restore App: macOS and Windows Highlights

Whether you’re using our macOS or Windows clients, you can now recover your important data with even more ease.

Here’s a peek into some of the new features we have in store with our new Restore Client App: 

  • Simplified restore initiation process. When you’ve lost important files, the last thing you want is a demanding process sitting between you and restoring your data. With the restore app, you authenticate your Backblaze account and initiate the restore directly from your desktop. Once authenticated, you can browse your file tree and kick off the restore process immediately.
  • No limits on restore size. There are no limits on restore sizes in the restore app. Conserving disk space matters, and you shouldn’t have to worry about downloading a .zip file and then needing enough additional space to unzip it.

If you’re interested in a comprehensive tutorial on how to use the new restore app, we’re here to guide you. Let us walk you through the process.

We’re excited that our version 9.0 release complements your already robust methods of accessing your data. To access your backup from anywhere, you can log in to www.backblaze.com to initiate a restore, and use our iOS and Android apps to access your files on the go.

Backblaze v9.0 Is Available in Early Access Today: September 13, 2023

We will be taking feedback and slowly auto-updating all users in the coming weeks, but if you can’t wait and want to download the early access release now on your Mac or PC:

  1. Go to: https://www.backblaze.com/status/backup-beta
  2. Select your operating system and download the v9.0 app.
  3. Install the early access release on your computer.

Please note, since this is in early access you might hit some bugs. Please reach out to our Support Team if you have any questions or if you want to give feedback—we always like to know how things are going.

The post Restore Like Never Before: Introducing Backblaze Computer Backup v9.0 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

A delay in Toest’s restart. And an appeal for help

Post Syndicated from Тоест original https://www.toest.bg/zavavyane-na-restarta-apel-za-pomosht/

Throughout these five and a half years, Toest (Тоест) has always existed on an extremely tight budget and with a short horizon. That horizon has now shrunk to a single month. We resume publishing on October 1, with an uncertain future and an urgent need for support.

What happened?

Toest’s financial situation has never been rosy, and over the years we have survived out of sheer stubbornness. We are endlessly grateful for every lev of reader support, but crowdfunding income, besides being far from sufficient, is also unpredictable: at any moment our donors can pause or resume their monthly micro-donations. So over the years we saved what we could and managed to build a small rainy-day reserve (spread across the five years, that reserve amounted to roughly 170 leva per month).

At the same time, having the same person juggle several roles, together with an enormous amount of volunteer work, led to exhaustion and demotivation, while author fees, modest to begin with, stayed at the same level throughout all five years.

A fundamental change was unavoidable.

At the beginning of 2023, we restructured our operations and distributed roles more sensibly in order to be more efficient. We brought in two managing editors to take care of the content. In the meantime, we adjusted fees so that they reach at least 50% of the levels at other media outlets. We used the saved reserve to buy ourselves time and try to bounce back.

We met our fifth birthday with an entirely new website, an expanded team of editors and authors, and new columns that quickly found loyal readers. We created donor packages with special gifts and bonuses for readers, and shortly before the start of summer we introduced the “Гайо & Тоест” cause-driven chocolate. Meanwhile, we applied (on our own and in partnership with other organizations) to six funding programs for cultural and media projects. In other words,

we never stop looking for additional sources of funding of every kind.

Unfortunately, none of the six funding programs worked out. We have not given up on further applications, but it is important to note that most programs fund a narrowly defined project (a topic, a column, a series of articles), whereas the point and value of an independent media outlet is to cover the broader range of issues that matter to society, not to focus on the specific topic for which it received funding.

That is why, above all, we must secure the outlet’s existence as a whole.

How much does Toest cost per month?

Producing each individual article is not just a matter of the author writing it. It takes careful fact-checking and diligent editorial work to shape the text into a high-quality, useful, and linguistically sound piece of journalism.

Beyond creating content, there is also a range of administrative work: technical maintenance of the site, managing the payment systems, keeping documentation and accounts, and paying for external services such as accounting, banking, annual fees for various platforms, hosting, and the domain. In other words, costs that keep running regardless of whether we are publishing or not.

If we were to pay the team decently and comparably to other media outlets and account for all real costs, we would need at least 10,000 leva per month. Until we reach that amount, the entire Toest team is determined to keep working at the current pay levels, because we all believe in the purpose of this project, in the quality we offer, and in the need for this free and independent space for calm, high-quality public debate.

Why don’t we just run ads?

Ads (especially online, and especially in Bulgaria) are too cheap, and to make them worthwhile you need a lot of them; ad banners would have to take over a large part of the site. They also do not appear on their own: someone has to sell and service them, find advertisers, negotiate terms, and handle the technical delivery. Those costs eat up a substantial share of the revenue. For a small outlet, this makes the whole exercise pointless, especially for an outlet like Toest, with a small number of long analytical pieces, which flatly refuses to rely on bombastic headlines to chase clicks at any cost and secure mass traffic.

And the most important reason: independence. If an outlet is tied to advertisers, the range of topics and problems that find a place on its platform narrows. The way the team approaches its work changes too, because censorship also breeds self-censorship.

An example: Toest’s piece on A1’s unethical practices was not covered or developed by other media, nor even shared by the outlets that usually republish our articles. Why? Because A1 is a huge advertiser. And even if a given outlet or media group has no ties to A1 (or to an advertiser of similar scale), it hopes to have them in the future. Here the saying “He who pays the piper calls the tune” applies in full force, and it is precisely that force which crushes a media outlet’s independence.

Why don’t we put the content behind a paywall and live off reading subscriptions?

Locking up the content and restricting it to the circle of people who can pay for it is discriminatory and runs counter to the basic principle of journalism: to inform society, the whole of society. Reducing a media outlet to a closed space for a chosen few (regardless of the subscription price) subverts its purpose and creates a financial barrier you must cross in order to be well informed. Creating “information castes” is dangerous and wrong, especially in times like these, when free access to reliable, verified, and well-presented journalism is of key importance for the development of societies.

What comes next?

Maintaining Toest’s usual quality and volume of articles at the current fees requires 5,500 leva per month. Monthly recurring donations from readers currently amount to about 3,000 leva.

First of all, we need to close this gap of 2,500 leva per month in order to keep the team and continue at the current pace and quality. Here we will be counting heavily on you, dear readers and current donors of Toest. Tell one more person about us: why you like us and why it matters to you that we continue, in your own words and with your own arguments. Share that you support us and ask them to do the same, through one of our donor packages, which include plenty of sweet (quite literally) gifts.

If each of you convinces just one more person to support us, we will reach the first critical amount.

In the meantime, we continue to look for other sources of funding as well. The next step is to turn to businesses for support.

We know there are many successful Bulgarian businesses that operate at a world-class level and are competitive, even leaders in their field on the global market. Businesses that are independent of public procurement and of the goodwill of those in power. Businesses that clearly understand that a good environment for entrepreneurship is directly linked to a healthy public environment, and that free media are critically important to building it. And not least: businesses that would support quality journalism without expecting the standard “media coverage” of ready-made press releases and sponsored articles in return.

At Toest, we can offer you much more than that.

We will approach your story individually and professionally, observing all journalistic standards, so that the account of your successes and of the products you create is useful and interesting both to our readers and to your potential customers and employees.

Media outlets cannot be a business venture that turns a profit. By design they have other goals: to inform, to ask the hard questions, to engage with the problems that matter to society, to hold power to account.

It is not only in Bulgaria that media outlets face financial difficulties. In The Guardian’s article from May 16 of this year, you can read about the bankruptcy of Vice; about the demise of BuzzFeed and the layoffs at major global media outlets; about advertising-based business models that no longer work, because too negligible a share is left for the outlets themselves; and about the pressure on media projects to take on all sorts of work outside journalism simply in order to survive.

The existence of independent media is enormously important for the progress of a society and is directly linked to a country’s economic development. Consuming quality journalism (important topics, verified sources, in-depth reporting, good language, literate writing) builds critical thinking and creates proactive, socially engaged people.

These are the people we want to hire, the people we want to do business with, the people we would rather share the road with as drivers. The people with whom we can build some kind of shared future.

Security updates for Wednesday

Post Syndicated from corbet original https://lwn.net/Articles/944354/

Security updates have been issued by Debian (e2guardian), Fedora (libeconf), Red Hat (dmidecode, kernel, kernel-rt, keylime, kpatch-patch, libcap, librsvg2, linux-firmware, and qemu-kvm), Slackware (mozilla), SUSE (chromium and shadow), and Ubuntu (cups, dotnet6, dotnet7, file, flac, and ruby-redcloth).

Zero-Click Exploit in iPhones

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/09/zero-click-exploit-in-iphones.html

Make sure you update your iPhones:

Citizen Lab says two zero-days fixed by Apple today in emergency security updates were actively abused as part of a zero-click exploit chain (dubbed BLASTPASS) to deploy NSO Group’s Pegasus commercial spyware onto fully patched iPhones.

The two bugs, tracked as CVE-2023-41064 and CVE-2023-41061, allowed the attackers to infect a fully-patched iPhone running iOS 16.6 and belonging to a Washington DC-based civil society organization via PassKit attachments containing malicious images.

“We refer to the exploit chain as BLASTPASS. The exploit chain was capable of compromising iPhones running the latest version of iOS (16.6) without any interaction from the victim,” Citizen Lab said.

“The exploit involved PassKit attachments containing malicious images sent from an attacker iMessage account to the victim.”

Simplifying Digital Transformation with Marianna Portela

Post Syndicated from Michael Kammer original https://blog.zabbix.com/simplifying-digital-transformation-with-marianna-portela/26609/

To help everyone in our community get up to speed with Zabbix Summit speakers and their topics, we’re continuing our series of interviews and sitting down for a chat with Marianna Portela of Brazilian mass media conglomerate Globo. Read on to get a preview of her Summit speech topic and see how she uses Zabbix to bring massive live events to millions of users around the globe.

Please tell us a bit about yourself and your work.

I’m a tech lead at Globo, the largest media group in Latin America. It includes over-the-air broadcasting, television and film production, a pay television subscription service, streaming media, publishing, and online services.

How long have you been using Zabbix? What kind of daily Zabbix tasks are you involved in at your company?

I have been working at Globo for 15 years. I’ve been involved in monitoring for 11 of those years, and I’ve been using Zabbix for 10. I help monitor the applications that generate data for live events, and I use Zabbix to generate metrics that support decision-making related to better content delivery quality.

Can you name a few of the specific challenges that Zabbix has helped you solve?

Zabbix allows us to empower our users and supports our entire digital transformation – including many things related to Globoplay streaming. It also helps us monitor live event infrastructure, like the Olympics and World Cup. Previously, when there were technical issues during live events, we would try to figure out what happened after the fact, but no longer – Zabbix gives us a proactive analysis of potential occurrences within live production.

Can you give us a sneak peek at what we can expect to hear during your Zabbix Summit speech?

I’m planning to talk about how we use Zabbix to help ensure quality monitoring of live production, which is the part of Globo that handles any type of live event and generates data for things like games. I’ll introduce how we started with actual infrastructure monitoring and how this digital transformation at Globo began, specifically how we managed to enter new areas like content generation, especially live content. Then I’ll also discuss some specifics of how we monitor live event infrastructure.

The post Simplifying Digital Transformation with Marianna Portela appeared first on Zabbix Blog.
