A new and improved AWS CDK construct for Amazon DynamoDB tables

Post Syndicated from Anirudh Sharma original https://aws.amazon.com/blogs/devops/a-new-and-improved-aws-cdk-construct-for-amazon-dynamodb-tables/

Recently, we launched a new AWS Cloud Development Kit (CDK) construct for Amazon DynamoDB tables, known as TableV2. This construct provides a number of new features in addition to what the original construct offered, enabling CDK authors to create global tables, simplifying the configuration of global secondary indexes and auto scaling, as well as supporting AWS CloudFormation drift detection and import operations. We believe that this new construct will make it easier for organizations to build and manage their DynamoDB tables at scale, in addition to providing more flexibility and control over the configuration of tables.

AWS CDK is a framework for defining cloud infrastructure in code and provisioning it through AWS CloudFormation. Developers can use any of the supported programming languages to define reusable cloud components known as Constructs. A construct is a reusable and programmable component that represents AWS resources. CDK translates the high-level constructs defined by you into equivalent AWS CloudFormation templates. CloudFormation provisions the resources specified in the template, streamlining the usage of Infrastructure as a Code (IaC) on AWS.

In this post we’ll explore:

  • The reasoning behind the creation of a new L2 construct for DynamoDB tables.
  • Features of new L2 constructs along with examples.
  • The benefits of leveraging this new construct in terms of scalability, flexibility, and simplicity.

By understanding the reasons behind its development and exploring its capabilities through practical examples, you will gain a comprehensive understanding of how this new L2 construct can enhance their DynamoDB experience. Let’s dive in.

Background

The original DynamoDB L2 Table construct is a powerful and versatile tool for creating and managing DynamoDB tables. It allows you to easily define the schema of your table, as well as the provisioned throughput and replicas. It also supports features like global tables, secondary indexes, and streams.

However, the Table construct uses a custom resource to add replicas to the primary table. This means that a separate Lambda function is created as the resource provider in addition to the Table resources (primary table and any replicas). This can be cumbersome to manage and can lead to drift detection issues.

The new TableV2 construct is an abstraction built on top of the GlobalTable L1 construct. It uses the CloudFormation resource AWS::DynamoDB::GlobalTable to create and manage DynamoDB tables. This has two important benefits:

  1. CloudFormation is in control and aware of all replicas that make up the Global Table, which means you will experience drift detection across all the replicas. With the original table construct, CloudFormation was not aware of any replicas since this was being handled through the Lambda function being used as a resource provider.
  2. No extra resource (Lambda function) is created when replicas are configured with TableV2. This eliminates the need to manage an extra resource and the risk of troubleshooting issues that may arise with the custom resource. TableV2 simplifies the setup and maintenance of DynamoDB tables by using native CloudFormation constructs to directly manage replicas, without the need for a Lambda function. This results in a more efficient and streamlined experience for users.

The new TableV2 construct provides more fine-grained control to customers over the replicas created as part of the Global Table. Specifically, customers can specify properties like contributor insights, deletion protection, point-in-time recovery, table class, read capacity, and global secondary index options on a per-replica basis.

This means that customers can tailor their table setup to meet their specific needs and optimize their overall experience with the Global Table feature. For example, a customer might want to enable contributor insights for all replicas, but only enable deletion protection for the primary replica. Or, a customer might want to use a different table class for each replica, depending on the expected workload.

The new TableV2 construct also offers greater flexibility and customization options by allowing customers to specify these properties on a per-replica basis. This can be helpful for customers who need to have different configurations for their replicas, or who want to fine-tune the performance and availability of their tables.

In the next section, we will explore each of these properties in more detail and how they can be specified in the new construct.

Features Walk-through

The new TableV2 construct is the recommended CDK DynamoDB construct for creating both single tables and global tables. In this section, we will review some specific aspects of the TableV2 construct and how they can be implemented. The walkthrough will cover features like Replicas, Billing, and Encryption, providing a comprehensive understanding of its capabilities.

Replicas

One of the most important benefits of the new L2 construct is the ability to configure properties on a per-replica basis. For example, the following code creates a global DynamoDB table with contributor insights and point-in-time recovery enabled for the table:

import * as cdk from 'aws-cdk-lib';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'Stack', { env: { region: 'us-west-2' } });

const globalTable = new dynamodb.TableV2(stack, 'GlobalTable', {
  partitionKey: { name: 'pk', type: dynamodb.AttributeType.STRING },
  contributorInsights: true,
  pointInTimeRecovery: true,
  replicas: [
    {
      region: 'us-east-1',
      tableClass: dynamodb.TableClass.STANDARD_INFREQUENT_ACCESS,
      pointInTimeRecovery: false,
    },
    {
      region: 'us-east-2',
      contributorInsights: false,
    },
  ],
});

// This is an ITableV2 instance for the replica table in us-east-1
const replica = globalTable.replica('us-east-1');

This code creates two replicas, one in the us-east-1 region and one in the us-east-2 region. For the replica in the us-east-1 region, we disable point-in-time recovery and set the table class to STANDARD_INFREQUENT_ACCESS. For the replica in the us-east-2 region, we disable contributor insights. The TableV2 construct also enables users to work with individual instances of the replicas in a global table via the replica() method. We see how this can be utilized from the above code where an ITableV2 instance representing the replica in us-east-1 is returned.

This is particularly useful for the grant() and metric() methods. For example, the following code gives a user write access to a replica in us-east-1 region:

import { Construct } from 'constructs';
import { App, Stack, StackProps } from 'aws-cdk-lib';
import { ITableV2, TableV2 } from 'aws-cdk-lib/aws-dynamodb';
import { AttributeType } from 'aws-cdk-lib/aws-dynamodb';
import * as iam from 'aws-cdk-lib/aws-iam';


class FooStack extends Stack {
  public readonly globalTable: TableV2;

  public constructor(scope: Construct, id: string, props: StackProps) {
    super(scope, id, props);

    this.globalTable = new TableV2(this, 'GlobalTable', {
      partitionKey: { name: 'pk', type: AttributeType.STRING },
      replicas: [
        { region: 'us-east-1' },
        { region: 'us-east-2' },
      ],
    });
  }
}

interface BarStackProps extends StackProps {
  readonly replicaTable: ITableV2;
}

class BarStack extends Stack {
  public constructor(scope: Construct, id: string, props: BarStackProps) {
    super(scope, id, props);
    const user = new iam.User(this, 'User')

    // user is given grantWriteData permissions to replica in us-east-1
    props.replicaTable.grantWriteData(user);
  }
}

const app = new App();

const fooStack = new FooStack(app, 'FooStack', { env: { region: 'us-west-2', account: process.env.CDK_DEFAULT_ACCOUNT } });
const barStack = new BarStack(app, 'BarStack', {
  replicaTable: fooStack.globalTable.replica('us-east-1'),
  env: { region: 'us-east-1', account: process.env.CDK_DEFAULT_ACCOUNT },
});

Before the replica() method was introduced, grant methods on the original Table construct applied to the primary table and all replicas. This was because there was no way to pull out a specific replica. This limited a user’s ability to grant a specific principal read, write, or read/write permission to a specific replica. The replica() method enables granting specific permissions to individual replicas in a global table. It maintains consistent behavior across all methods in the ITableV2 interface, including grants and metrics.

Billing

Table billing is easily configured using the onDemand() or provisioned() static methods of the Billing class. If provisioned billing is configured, the user must provide read and write capacity, which can be easily configured using the fixed() or autoscaled() static methods of the Capacity class.

For example, to configure on-demand billing:

import * as cdk from 'aws-cdk-lib';
import { AttributeType, Billing, TableClass, TableV2 } from 'aws-cdk-lib/aws-dynamodb';
import { Construct } from 'constructs';


export class DynamodbStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    new TableV2(this, 'DynamoDBTable', {
      partitionKey: { name: 'id', type: AttributeType.STRING},
      replicas: [
        {region: 'us-east-2'},
        {region: 'us-west-1'}
      ],
      billing: Billing.onDemand(),
      tableClass: TableClass.STANDARD
    })
  }
}

To configure provisioned billing:

import * as cdk from 'aws-cdk-lib';
import { AttributeType, Billing, Capacity, TableClass, TableV2 } from 'aws-cdk-lib/aws-dynamodb';
import { Construct } from 'constructs';

export class DynamodbStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    new TableV2(this, 'DynamoDBTable', {
      partitionKey: { name: 'id', type: AttributeType.STRING},
      replicas: [
        {region: 'us-east-2'},
        {region: 'us-west-1'}
      ],
      billing: Billing.provisioned({
        readCapacity: Capacity.fixed(5),
        writeCapacity: Capacity.autoscaled({maxCapacity: 10})
      }),
      tableClass: TableClass.STANDARD
    })
  }
}

Note that with the previous Table construct, users had to set a billingMode property and configure readCapacity and writeCapacity as separate properties. Additionally, configuring autoscaled capacity required calling the autoScaleReadCapacity() or autoScaleWriteCapacity() method on an instance of the Table construct. Lastly, since readCapacity, writeCapacity, and billingMode were all individual properties, a user had to know not to provision read and write capacity for a table with PAY_PER_REQUEST billing mode. With the new Billing class, the user is guided into providing necessary properties via the onDemand() and provisioned() static methods.

Encryption

The TableEncryptionV2 class allows you to provide your own KMS keys for each replica instead of using the default AWS owned keys, thus encrypting every replica with a custom KMS key. This provides more granular control over the encryption of your DynamoDB tables.

Here is an example of how to use the TableEncryptionV2 class to encrypt each replica of a global table with a custom KMS key:

import * as cdk from 'aws-cdk-lib';
import { AttributeType, Billing, BillingMode, Capacity, TableBaseV2, TableEncryptionV2, TableV2 } from 'aws-cdk-lib/aws-dynamodb';
import { IKey, Key } from 'aws-cdk-lib/aws-kms';
import { Construct } from 'constructs';

interface KMSkeys extends cdk.StackProps {
  kmsuswest1: IKey;
  kmsuseast2: IKey;
}

export class GlobalTableStack extends cdk.Stack {
  //public readonly globalTable: TableV2;
  constructor(scope: Construct, id: string, props: KMSkeys) {
    super(scope, id, props);

    const replicaTableKeys = {
      "us-west-1": props.kmsuswest1.keyArn,
      "us-east-2": props.kmsuseast2.keyArn
    }
    const TableKMSKey=new Key(this, 'TableKMSKey', {
      alias: 'KMSuswest2Stack',
    }
    )

    new TableV2(this, 'GlobalTable', {
    tableName: 'FooTableFour',
    encryption: TableEncryptionV2.customerManagedKey(TableKMSKey,replicaTableKeys),

    partitionKey: {
    name: 'FooHashKey',
    type: AttributeType.STRING,
    },
    replicas: [
    {
      region: 'us-west-1',  
    },
    {
      region: 'us-east-2',
    },
  ],
    })
  }
}

The ability to provide custom KMS keys for each replica can help to improve the security of your DynamoDB tables. It also gives you more control over the encryption of your data. This can help you to meet specific compliance requirements.

Conclusion

In this post, I introduced the new AWS CDK TableV2 construct, highlighting its advantages over the original construct. Notably, TableV2 enables drift detection for replica tables and eliminates the need for an extra Lambda function custom resource. I delved into practical implementations, focusing on three key aspects: Replicas, Billing, and Encryption.

To summarize, TableV2 marks a substantial improvement over the original construct. Its user experience provides significant improvement over the original construct in several ways, such as:

  • Direct support for global tables: TableV2 makes it easy to create and manage global DynamoDB tables.
  • Easier configuration of global secondary indexes and Autoscaling: TableV2 provides a simplified and streamlined process for configuring global secondary indexes and Autoscaling.
  • More granular control over replicas: TableV2 allows you to configure properties on a per-replica basis, giving you more control over the performance and availability of your tables.
  • Improved API design and user experience: TableV2 improves the API design and user experience by implementing new classes for billing, capacity, and encryption.

Overall, TableV2 is a powerful and flexible construct that makes it easier to build and manage DynamoDB tables at scale. It is the preferred CDK DynamoDB construct for creating both single tables and global tables. If you are looking for a powerful and flexible way to build and manage DynamoDB tables, TableV2 is the perfect choice for you.

If you’re new to CDK and eager to get started, we highly recommend checking out the CDK documentation and the CDK workshop.

Anirudh Sharma

Anirudh is a Cloud Support Engineer 2 with an extensive background in DevOps offerings at AWS, and he is also a Subject Matter Expert in AWS ElasticBeanstalk and AWS CodeDeploy services. He loves helping customers and learning new services and technologies. He also loves travelling and has a goal to visit Japan someday. He is a Golden State Warriors fan and loves spending time with his family.

Announcing Generative AI CDK Constructs

Post Syndicated from Michael Tran original https://aws.amazon.com/blogs/devops/announcing-generative-ai-cdk-constructs/

Announced by Werner Vogels in his 2023 re:Invent Keynote, Generative AI CDK Constructs, an open-source extension of the AWS Cloud Development Kit (AWS CDK), provides well-architected multi-service patterns to quickly and efficiently create repeatable infrastructure required for generative AI projects on AWS. Our initial release includes five CDK constructs enabling key generative AI capabilities like question and answering, summarization, data ingestion for Retrieval Augmented Generation (RAG), and model deployment capabilities.

Simplify and Accelerate the Development of Applications with Generative AI

Developing generative AI applications is particularly challenging due to the rapidly evolving nature of the underlying technologies. As a result, developers are faced with the challenge of keeping up with changing paradigms and best practices. This often leads to varied and disjointed patterns as developers try crafting their own solutions from scratch. For instance, there are already hundreds of implementations of patterns like retrieval augmented generation(RAG), summarization, or chatbots.

AWS introduced the Cloud Development Kit (CDK), an open-source software development framework to define cloud infrastructure in code and provision it through AWS CloudFormation. CDK empowers developers to use high-level construct libraries that encapsulate AWS best practices, allowing for the creation of cloud applications without needing to know every detail of the AWS services.

A ‘construct‘ in CDK terminology is a building block that represents an AWS resource or a combination of AWS resources configured together. These constructs can range from low-level (individual resources) to high-level (complete architectures), enabling developers to compose and share their cloud application models as code.

Our library harnesses the power of CDK, offering pre-built constructs designed to expedite the deployment of generative AI applications. These constructs use LLMs/FMs available in Amazon Bedrock facilitating easier modeling and rapid deployment of cloud architectures tailored for generative AI. With these constructs available in both Python and Typescript, developers can leverage AWS services to build industry-agnostic solutions swiftly, regardless of their expertise level in generative AI. The constructs are designed to integrate smoothly with existing CDK applications, offering a scalable approach to enhancing generative AI capabilities across various business verticals.

Getting Started with the Constructs

Prerequisites

You can install the CDK constructs in your preferred language:

  • Typescript app: npm i @cdklabs/generative-ai-cdk-constructs
  • Python app: pip install cdklabs.generative-ai-cdk-constructs

Within your existing CDK application, import the new library and get access to available constructs:

  • Typescript app: import * as genai from '@cdklabs/generative-ai-cdk-constructs';
  • Python app: import cdklabs.generative-ai-cdk-constructs as genai

Example

aws-qa-appsync-opensearch is one of the constructs available in the generative-ai-cdk-constructs library. This construct provides a question answering workflow using Amazon Bedrock and a provisioned Amazon OpenSearch cluster. Amazon Cognito is required to authenticate calls to the AWS AppSync GraphQL API created by the construct.

Typescript app:


import { Construct } from 'constructs';
import { Stack, StackProps } from 'aws-cdk-lib';
import * as os from 'aws-cdk-lib/aws-opensearchservice';
import * as cognito from 'aws-cdk-lib/aws-cognito';
import { QaAppsyncOpensearch, QaAppsyncOpensearchProps } from '@cdklabs/generative-ai-cdk-constructs';


class MyStack extends Stack {
    // get an existing OpenSearch provisioned cluster
    const osDomain = os.Domain.fromDomainAttributes(this, 'osdomain', {
        domainArn: 'arn:aws:es:us-east-1:XXXXXX',
        domainEndpoint: 'https://XXXXX.us-east-1.es.amazonaws.com'
    });
    
    // get an existing userpool 
    const cognitoPoolId = 'us-east-1_XXXXX';
    const userPoolLoaded = cognito.UserPool.fromUserPoolId(this, 'myuserpool', cognitoPoolId);
   
    // Create a QA Appsync OpenSearch construct 
    const ragSource = new QaAppsyncOpensearch(
        this,
        'QaAppsyncOpensearch',
        {
            existingOpensearchDomain: osDomain,
            openSearchIndexName: 'demoindex',
            cognitoUserPool: userPoolLoaded
        }
    )

Python app:

from aws_cdk import (
    Stack,
    aws_opensearchservice as os,
    aws_cognito as cognito
)
from constructs import Construct
from cdklabs.generative_ai_cdk_constructs import QaAppsyncOpensearch

class MyStack(Stack):
    def __init__(self, scope: Construct, id: str, **kwargs):
        super().__init__(scope, id, **kwargs)

        # Get an existing OpenSearch provisioned cluster
        os_domain = os.Domain.from_domain_attributes(self, 'osdomain',
            domain_arn='arn:aws:es:us-east-1:XXXXXX',
            domain_endpoint='https://XXXXX.us-east-1.es.amazonaws.com'
        )

        # Get an existing user pool
        cognito_pool_id = 'us-east-1_XXXXX'
        user_pool_loaded = cognito.UserPool.from_user_pool_id(self, 'myuserpool', cognito_pool_id)

        # Create a QA Appsync OpenSearch construct
        rag_source = QaAppsyncOpensearch(
            self, 
            'QaAppsyncOpensearch',
            existing_opensearch_domain=os_domain,
            open_search_index_name='demoindex',
            cognito_user_pool=user_pool_loaded
        )

Refer to the documentation for additional guidance on a particular construct: Catalog

Currently, the following constructs that are available:

Feature Description
Data ingestion pipeline Ingestion pipeline providing a RAG (retrieval augmented generation) source for storing documents in a knowledge base.
Question answering Question answering with a large language model (Anthropic Claude V2) using a RAG (retrieval augmented generation) source and/or long context.
Summarization Document summarization with a large language model (Anthropic Claude V2).
Lambda layer Python Lambda layer providing dependencies and utilities to develop generative AI applications on AWS.
SageMaker model deployment Deploy a foundation model from Amazon SageMaker JumpStart, Hugging Face, or an S3 location to an Amazon SageMaker endpoint.

Explore More Constructs

The library includes constructs for data ingestion pipelines, question answering workflows, document summarization, Lambda layer for generative AI applications, and SageMaker model deployment.

Next Steps

Visit the Generative AI CDK Constructs GitHub repository for a full list of constructs and documentation. For practical examples, check out the AWS samples repository. Your feedback and contributions are welcome on our GitHub repository to enhance the capabilities of Generative AI CDK Constructs.

Alain Krok

Alain Krok is a Senior Solutions Architect with a passion for emerging technologies. His past experience includes designing and implementing IIoT solutions for the oil and gas industry and working on robotics projects. He enjoys pushing the limits and indulging in extreme sports when he is not designing software.

Dinesh Sajwan

Dinesh Sajwan is a Senior Solutions Architect. His passion for emerging technologies allows him to stay on the cutting edge and identify new ways to apply the latest advancements to solve even the most complex business problems. His diverse expertise and enthusiasm for both technology and adventure position him as a uniquely creative problem-solver.

Michael Tran

Michael Tran is a Sr. Solutions Architect with Prototyping Acceleration team at Amazon Web Services. He provides technical guidance and helps customers innovate by showing the art of the possible on AWS. He specializes in building prototypes in the AI/ML space. You can contact him @Mike_Trann on Twitter.

 

Generative AI Infrastructure at AWS

Post Syndicated from Betsy Chernoff original https://aws.amazon.com/blogs/compute/generative-ai-infrastructure-at-aws/

Building and training generative artificial intelligence (AI) models, as well as predicting and providing accurate and insightful outputs requires a significant amount of infrastructure.

There’s a lot of data that goes into generating the high-quality synthetic text, images, and other media outputs that large-language models (LLMs), as well as foundational models (FMs), create. To start, the data set generally has somewhere around one billion variables present in the model that it was trained on (also known as parameters). To process that massive amount of data (think: petabytes), it can take hundreds of hardware accelerators (which are incorporated into purpose-built ML silicon or GPUs).

Given how much data is required for an effective LLM, it becomes costly and inefficient if an organization can’t access the data for these models as quickly as their GPUs/ML silicon are processing it. Selecting infrastructure for generative AI workloads impacts everything from cost to performance to sustainability goals to the ease of use. To successfully run training and inference for FMs organizations need:

  1. Price-performant accelerated computing (including the latest GPUs and dedicated ML Silicon) to power large generative AI workloads.
  2. High-performance and low-latency cloud storage that’s built to keep accelerators highly utilized.
  3. The most performant and cutting-edge technologies, networking, and systems to support the infrastructure for a generative AI workload.
  4. The ability to build with cloud services that can provide seamless integration across generative AI applications, tools, and infrastructure.

Overview of compute, storage, & networking for generative AI

Amazon Elastic Compute Cloud (Amazon EC2) accelerated computing portfolio (including instances powered by GPUs and purpose-built ML silicon) offers the broadest choice of accelerators to power generative AI workloads.

To keep the accelerators highly utilized, they need constant access to data for processing. AWS provides this fast data transfer from storage (up to hundreds of GBs/TBs of data throughput) with Amazon FSx for Lustre and Amazon S3.

Accelerated computing instances combined with differentiated AWS technologies such as the AWS Nitro System, up to 3,200 Gbps of Elastic Fabric Adapter (EFA) networking, as well as exascale computing with Amazon EC2 UltraClusters helps to deliver the most performant infrastructure for generative AI workloads.

Coupled with other managed services such as Amazon SageMaker HyperPod and Amazon Elastic Kubernetes Service (Amazon EKS), these instances provide developers with the industry’s best platform for building and deploying generative AI applications.

This blog post will focus on highlighting announcements across Amazon EC2 instances, storage, and networking that are centered around generative AI.

AWS compute enhancements for generative AI workloads

Training large FMs requires extensive compute resources and because every project is different, a broad set of options are needed so that organization of all sizes can iterate faster, train more models, and increase accuracy. In 2023, there were a lot of launches across the AWS compute category that supported both training and inference workloads for generative AI.

One of those launches, Amazon EC2 Trn1n instances, doubled the network bandwidth (compared to Trn1 instances) to 1600 Gbps of Elastic Fabric Adapter (EFA). That increased bandwidth delivers up to 20% faster time-to-train relative to Trn1 for training network-intensive generative AI models, such as LLMs and mixture of experts (MoE).

Watashiha offers an innovative and interactive AI chatbot service, “OGIRI AI,” which uses LLMs to incorporate humor and offer a more relevant and conversational experience to their customers. “This requires us to pre-train and fine-tune these models frequently. We pre-trained a GPT-based Japanese model on the EC2 Trn1.32xlarge instance, leveraging tensor and data parallelism,” said Yohei Kobashi, CTO, Watashiha, K.K. “The training was completed within 28 days at a 33% cost reduction over our previous GPU based infrastructure. As our models rapidly continue to grow in complexity, we are looking forward to Trn1n instances which has double the network bandwidth of Trn1 to speed up training of larger models.”

AWS continues to advance its infrastructure for generative AI workloads, and recently announced that Trainium2 accelerators are also coming soon. These accelerators are designed to deliver up to 4x faster training than first generation Trainium chips and will be able to be deployed in EC2 UltraClusters of up to 100,000 chips, making it possible to train FMs and LLMs in a fraction of the time, while improving energy efficiency up to 2x.

AWS has continued to invest in GPU infrastructure over the years, too. To date, NVIDIA has deployed 2 million GPUs on AWS, across the Ampere and Grace Hopper GPU generations. That’s 3 zetaflops, or 3,000 exascale super computers. Most recently, AWS announced the Amazon EC2 P5 Instances that are designed for time-sensitive, large-scale training workloads that use NVIDIA CUDA or CuDNN and are powered by NVIDIA H100 Tensor Core GPUs. They help you accelerate your time to solution by up to 4x compared to previous-generation GPU-based EC2 instances, and reduce cost to train ML models by up to 40%. P5 instances help you iterate on your solutions at a faster pace and get to market more quickly.

And to offer easy and predictable access to highly sought-after GPU compute capacity, AWS launched Amazon EC2 Capacity Blocks for ML. This is the first consumption model from a major cloud provider that lets you reserve GPUs for future use (up to 500 deployed in EC2 UltraClusters) to run short duration ML workloads.

AWS is also simplifying training with Amazon SageMaker HyperPod, which automates more of the processes required for high-scale fault-tolerant distributed training (e.g., configuring distributed training libraries, scaling training workloads across thousands of accelerators, detecting and repairing faulty instances), speeding up training by as much as 40%. Customers like Perplexity AI elastically scale beyond hundreds of GPUs and minimize their downtime with SageMaker HyperPod.

Deep-learning inference is another example of how AWS is continuing its cloud infrastructure innovations, including the low-cost, high-performance Amazon EC2 Inf2 instances powered by AWS Inferentia2. These instances are designed to run high-performance deep-learning inference applications at scale globally. They are the most cost-effective and energy-efficient option on Amazon EC2 for deploying the latest innovations in generative AI.

Another example is with Amazon SageMaker, which helps you deploy multiple models to the same instance so you can share compute resources—reducing inference cost by 50%. SageMaker also actively monitors instances that are processing inference requests and intelligently routes requests based on which instances are available—achieving 20% lower inference latency (on average).

AWS invests heavily in the tools for generative AI workloads. For AWS ML silicon, AWS has focused on AWS Neuron, the software development kit (SDK) that helps customers get the maximum performance from Trainium and Inferentia. Neuron supports the most popular publicly available models, including Llama 2 from Meta, MPT from Databricks, Mistral from mistral.ai, and Stable Diffusion from Stability AI, as well as 93 of the top 100 models on the popular model repository Hugging Face. It plugs into ML frameworks like PyTorch and TensorFlow, and support for JAX is coming early this year. It’s designed to make it easy for AWS customers to switch from their existing model training and inference pipelines to Trainium and Inferentia with just a few lines of code.

Cloud storage on AWS enhancements for generative AI

Another way AWS is accelerating the training and inference pipelines is with improvements to storage performance—which is not only critical when thinking about the most common ML tasks (like loading training data into a large cluster of GPUs/accelerators), but also for checkpointing and serving inference requests. AWS announced several improvements to accelerate the speed of storage requests and reduce the idle time of your compute resources—which allows you to run generative AI workloads faster and more efficiently.

To gather more accurate predictions, generative AI workloads are using larger and larger datasets that require high-performant storage at scale to handle the sheer volume in of data.

With Amazon S3 Express One Zone a new storage class purpose-built to high-performance and low-latency object storage for an organizations most frequently accessed data, making it ideal for request-intensive operations like ML training and inference. Amazon S3 Express One Zone is the lowest-latency cloud object storage available, with data access speed up to 10x faster and request costs up to 50% lower than Amazon S3 Standard, from any AWS Availability Zone within an AWS Region.

AWS continues to optimize data access speeds for ML frameworks too. Recently, Amazon S3 Connector for PyTorch launched, which loads training data up to 40% faster than with the existing PyTorch connectors to Amazon S3. While most customers can meet their training and inference requirements using Mountpoint for Amazon S3 or Amazon S3 Connector for PyTorch, some are also building and managing their own custom data loaders. To deliver the fastest data transfer speeds between Amazon S3, and Amazon EC2 Trn1, P4d, and P5 instances, AWS recently announced the ability to automatically accelerate Amazon S3 data transfer in the AWS Command Line Interface (AWS CLI) and Python SDK. Now, training jobs download training data from Amazon S3 up to 3x faster and customers like Scenario are already seeing great results, with a 5x throughput improvement to model download times without writing a single line of code.

To meet the changing performance requirements that training generative AI workloads can  require, Amazon FSx for Lustre announced throughput scaling on-demand. This is particularly useful for model training because it enables you to adjust the throughput tier of your file systems to meet these requirements with greater agility and lower cost.

EC2 networking enhancements for generative AI

Last year, AWS introduced EC2 UltraCluster 2.0, a flatter and wider network fabric that’s optimized specifically for the P5 instance and future ML accelerators. It allows us to reduce latency by 16% and supports up to 20,000 GPUs, with up to 10x the overall bandwidth. In a traditional cluster architecture, as clusters get physically bigger, latency will also generally increase. But, with UltraCluster 2.0, AWS is increasing the size while reducing latency, and that’s exciting.

AWS is also continuing to help you make your network more efficient. Take for example a recent launch with Amazon EC2 Instance Topology API. It gives you an inside look at the proximity between your instances, so you can place jobs strategically. Optimized job scheduling means faster processing for distributed workloads. Moving jobs that exchange data the most frequently to the same physical location in a cluster can eliminate multiple hops in the data path. As models push boundaries, this type of software innovation is key to getting the most out of your hardware.

In addition to Amazon Q (a generative AI powered assistant from AWS), AWS also launched Amazon Q networking troubleshooting (preview).

You can ask Amazon Q to assist you in troubleshooting network connectivity issues caused by network misconfiguration in your current AWS account. For this capability, Amazon Q works with Amazon VPC Reachability Analyzer to check your connections and inspect your network configuration to identify potential issues. With Amazon Q network troubleshooting, you can ask questions about your network in conversational English—for example, you can ask, “why can’t I SSH to my server,” or “why is my website not accessible”.

Conclusion

AWS is bringing customers even more choice for their infrastructure, including price-performant, sustainability focused, and ease-of-use options. Last year, AWS capabilities across this stack solidified our commitment to meeting the customer focus and goal of: Making generative AI accessible to customers of all sizes and technical abilities so they can get to reinventing and transforming what is possible.

Additional resources

[$] OpenBSD system-call pinning

Post Syndicated from daroc original https://lwn.net/Articles/959562/


Return-oriented programming
(ROP) attacks are hard to defend against.
Partial mitigations such as address-space layout randomization, stack
canaries, and other techniques are commonly deployed to try and frustrate
ROP attacks. Now, OpenBSD is experimenting with a new
mitigation that makes it harder for attackers to make system
calls, although some security researchers have expressed doubt that it will
prove effective at stopping real-world attacks.
In his
announcement message, Theo de Raadt said that this work
makes some specific low-level attack
methods unfeasable on OpenBSD, which will force the use of other methods.

Mastering market dynamics: Transforming transaction cost analytics with ultra-precise Tick History – PCAP and Amazon Athena for Apache Spark

Post Syndicated from Pramod Nayak original https://aws.amazon.com/blogs/big-data/mastering-market-dynamics-transforming-transaction-cost-analytics-with-ultra-precise-tick-history-pcap-and-amazon-athena-for-apache-spark/

This post is cowritten with Pramod Nayak, LakshmiKanth Mannem and Vivek Aggarwal from the Low Latency Group of LSEG.

Transaction cost analysis (TCA) is widely used by traders, portfolio managers, and brokers for pre-trade and post-trade analysis, and helps them measure and optimize transaction costs and the effectiveness of their trading strategies. In this post, we analyze options bid-ask spreads from the LSEG Tick History – PCAP dataset using Amazon Athena for Apache Spark. We show you how to access data, define custom functions to apply on data, query and filter the dataset, and visualize the results of the analysis, all without having to worry about setting up infrastructure or configuring Spark, even for large datasets.

Background

Options Price Reporting Authority (OPRA) serves as a crucial securities information processor, collecting, consolidating, and disseminating last sale reports, quotes, and pertinent information for US Options. With 18 active US Options exchanges and over 1.5 million eligible contracts, OPRA plays a pivotal role in providing comprehensive market data.

On February 5, 2024, the Securities Industry Automation Corporation (SIAC) is set to upgrade the OPRA feed from 48 to 96 multicast channels. This enhancement aims to optimize symbol distribution and line capacity utilization in response to escalating trading activity and volatility in the US options market. SIAC has recommended that firms prepare for peak data rates of up to 37.3 GBits per second.

Despite the upgrade not immediately altering the total volume of published data, it enables OPRA to disseminate data at a significantly faster rate. This transition is crucial for addressing the demands of the dynamic options market.

OPRA stands out as one the most voluminous feeds, with a peak of 150.4 billion messages in a single day in Q3 2023 and a capacity headroom requirement of 400 billion messages over a single day. Capturing every single message is critical for transaction cost analytics, market liquidity monitoring, trading strategy evaluation, and market research.

About the data

LSEG Tick History – PCAP is a cloud-based repository, exceeding 30 PB, housing ultra-high-quality global market data. This data is meticulously captured directly within the exchange data centers, employing redundant capture processes strategically positioned in major primary and backup exchange data centers worldwide. LSEG’s capture technology ensures lossless data capture and uses a GPS time-source for nanosecond timestamp precision. Additionally, sophisticated data arbitrage techniques are employed to seamlessly fill any data gaps. Subsequent to capture, the data undergoes meticulous processing and arbitration, and is then normalized into Parquet format using LSEG’s Real Time Ultra Direct (RTUD) feed handlers.

The normalization process, which is integral to preparing the data for analysis, generates up to 6 TB of compressed Parquet files per day. The massive volume of data is attributed to the encompassing nature of OPRA, spanning multiple exchanges, and featuring numerous options contracts characterized by diverse attributes. Increased market volatility and market making activity on the options exchanges further contribute to the volume of data published on OPRA.

The attributes of Tick History – PCAP enable firms to conduct various analyses, including the following:

  • Pre-trade analysis – Evaluate potential trade impact and explore different execution strategies based on historical data
  • Post-trade evaluation – Measure actual execution costs against benchmarks to assess the performance of execution strategies
  • Optimized execution – Fine-tune execution strategies based on historical market patterns to minimize market impact and reduce overall trading costs
  • Risk management – Identify slippage patterns, identify outliers, and proactively manage risks associated with trading activities
  • Performance attribution – Separate the impact of trading decisions from investment decisions when analyzing portfolio performance

The LSEG Tick History – PCAP dataset is available in AWS Data Exchange and can be accessed on AWS Marketplace. With AWS Data Exchange for Amazon S3, you can access PCAP data directly from LSEG’s Amazon Simple Storage Service (Amazon S3) buckets, eliminating the need for firms to store their own copy of the data. This approach streamlines data management and storage, providing clients immediate access to high-quality PCAP or normalized data with ease of use, integration, and substantial data storage savings.

Athena for Apache Spark

For analytical endeavors, Athena for Apache Spark offers a simplified notebook experience accessible through the Athena console or Athena APIs, allowing you to build interactive Apache Spark applications. With an optimized Spark runtime, Athena helps the analysis of petabytes of data by dynamically scaling the number of Spark engines is less than a second. Moreover, common Python libraries such as pandas and NumPy are seamlessly integrated, allowing for the creation of intricate application logic. The flexibility extends to the importation of custom libraries for use in notebooks. Athena for Spark accommodates most open-data formats and is seamlessly integrated with the AWS Glue Data Catalog.

Dataset

For this analysis, we used the LSEG Tick History – PCAP OPRA dataset from May 17, 2023. This dataset comprises the following components:

  • Best bid and offer (BBO) – Reports the highest bid and lowest ask for a security at a given exchange
  • National best bid and offer (NBBO) – Reports the highest bid and lowest ask for a security across all exchanges
  • Trades – Records completed trades across all exchanges

The dataset involves the following data volumes:

  • Trades – 160 MB distributed across approximately 60 compressed Parquet files
  • BBO – 2.4 TB distributed across approximately 300 compressed Parquet files
  • NBBO – 2.8 TB distributed across approximately 200 compressed Parquet files

Analysis overview

Analyzing OPRA Tick History data for Transaction Cost Analysis (TCA) involves scrutinizing market quotes and trades around a specific trade event. We use the following metrics as part of this study:

  • Quoted spread (QS) – Calculated as the difference between the BBO ask and the BBO bid
  • Effective spread (ES) – Calculated as the difference between the trade price and the midpoint of the BBO (BBO bid + (BBO ask – BBO bid)/2)
  • Effective/quoted spread (EQF) – Calculated as (ES / QS) * 100

We calculate these spreads before the trade and additionally at four intervals after the trade (just after, 1 second, 10 seconds, and 60 seconds after the trade).

Configure Athena for Apache Spark

To configure Athena for Apache Spark, complete the following steps:

  1. On the Athena console, under Get started, select Analyze your data using PySpark and Spark SQL.
  2. If this is your first time using Athena Spark, choose Create workgroup.
  3. For Workgroup name¸ enter a name for the workgroup, such as tca-analysis.
  4. In the Analytics engine section, select Apache Spark.
  5. In the Additional configurations section, you can choose Use defaults or provide a custom AWS Identity and Access Management (IAM) role and Amazon S3 location for calculation results.
  6. Choose Create workgroup.
  7. After you create the workgroup, navigate to the Notebooks tab and choose Create notebook.
  8. Enter a name for your notebook, such as tca-analysis-with-tick-history.
  9. Choose Create to create your notebook.

Launch your notebook

If you have already created a Spark workgroup, select Launch notebook editor under Get started.


After your notebook is created, you will be redirected to the interactive notebook editor.


Now we can add and run the following code to our notebook.

Create an analysis

Complete the following steps to create an analysis:

  • Import common libraries:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
  • Create our data frames for BBO, NBBO, and trades:
bbo_quote = spark.read.parquet(f"s3://<bucket>/mt=bbo_quote/f=opra/dt=2023-05-17/*")
bbo_quote.createOrReplaceTempView("bbo_quote")
nbbo_quote = spark.read.parquet(f"s3://<bucket>/mt=nbbo_quote/f=opra/dt=2023-05-17/*")
nbbo_quote.createOrReplaceTempView("nbbo_quote")
trades = spark.read.parquet(f"s3://<bucket>/mt=trade/f=opra/dt=2023-05-17/29_1.parquet")
trades.createOrReplaceTempView("trades")
  • Now we can identify a trade to use for transaction cost analysis:
filtered_trades = spark.sql("select Product, Price,Quantity, ReceiptTimestamp, MarketParticipant from trades")

We get the following output:

+---------------------+---------------------+---------------------+-------------------+-----------------+ 
|Product |Price |Quantity |ReceiptTimestamp |MarketParticipant| 
+---------------------+---------------------+---------------------+-------------------+-----------------+ 
|QQQ 230518C00329000|1.1700000000000000000|10.0000000000000000000|1684338565538021907,NYSEArca|
|QQQ 230518C00329000|1.1700000000000000000|20.0000000000000000000|1684338576071397557,NASDAQOMXPHLX|
|QQQ 230518C00329000|1.1600000000000000000|1.0000000000000000000|1684338579104713924,ISE|
|QQQ 230518C00329000|1.1400000000000000000|1.0000000000000000000|1684338580263307057,NASDAQOMXBX_Options|
|QQQ 230518C00329000|1.1200000000000000000|1.0000000000000000000|1684338581025332599,ISE|
+---------------------+---------------------+---------------------+-------------------+-----------------+

We use the highlighted trade information going forward for the trade product (tp), trade price (tpr), and trade time (tt).

  • Here we create a number of helper functions for our analysis
def calculate_es_qs_eqf(df, trade_price):
    df['BidPrice'] = df['BidPrice'].astype('double')
    df['AskPrice'] = df['AskPrice'].astype('double')
    df["ES"] = ((df["AskPrice"]-df["BidPrice"])/2) - trade_price
    df["QS"] = df["AskPrice"]-df["BidPrice"]
    df["EQF"] = (df["ES"]/df["QS"])*100
    return df

def get_trade_before_n_seconds(trade_time, df, seconds=0, groupby_col = None):
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] < nseconds].groupby(groupby_col).last()
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    ret_df = ret_df.reset_index()
    return ret_df

def get_trade_after_n_seconds(trade_time, df, seconds=0, groupby_col = None):
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] > nseconds].groupby(groupby_col).first()
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    ret_df = ret_df.reset_index()
    return ret_df

def get_nbbo_trade_before_n_seconds(trade_time, df, seconds=0):
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] < nseconds].iloc[-1:]
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    return ret_df

def get_nbbo_trade_after_n_seconds(trade_time, df, seconds=0):
    nseconds=seconds*1000000000
    nseconds += trade_time
    ret_df = df[df['ReceiptTimestamp'] > nseconds].iloc[:1]
    ret_df['BidPrice'] = ret_df['BidPrice'].astype('double')
    ret_df['AskPrice'] = ret_df['AskPrice'].astype('double')
    return ret_df
  • In the following function, we create the dataset that contains all the quotes before and after the trade. Athena Spark automatically determines how many DPUs to launch for processing our dataset.
def get_tca_analysis_via_df_single_query(trade_product, trade_price, trade_time):
    # BBO quotes
    bbos = spark.sql(f"SELECT Product, ReceiptTimestamp, AskPrice, BidPrice, MarketParticipant FROM bbo_quote where Product = '{trade_product}';")
    bbos = bbos.toPandas()

    bbo_just_before = get_trade_before_n_seconds(trade_time, bbos, seconds=0, groupby_col='MarketParticipant')
    bbo_just_after = get_trade_after_n_seconds(trade_time, bbos, seconds=0, groupby_col='MarketParticipant')
    bbo_1s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=1, groupby_col='MarketParticipant')
    bbo_10s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=10, groupby_col='MarketParticipant')
    bbo_60s_after = get_trade_after_n_seconds(trade_time, bbos, seconds=60, groupby_col='MarketParticipant')
    
    all_bbos = pd.concat([bbo_just_before, bbo_just_after, bbo_1s_after, bbo_10s_after, bbo_60s_after], ignore_index=True, sort=False)
    bbos_calculated = calculate_es_qs_eqf(all_bbos, trade_price)

    #NBBO quotes
    nbbos = spark.sql(f"SELECT Product, ReceiptTimestamp, AskPrice, BidPrice, BestBidParticipant, BestAskParticipant FROM nbbo_quote where Product = '{trade_product}';")
    nbbos = nbbos.toPandas()

    nbbo_just_before = get_nbbo_trade_before_n_seconds(trade_time,nbbos, seconds=0)
    nbbo_just_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=0)
    nbbo_1s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=1)
    nbbo_10s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=10)
    nbbo_60s_after = get_nbbo_trade_after_n_seconds(trade_time, nbbos, seconds=60)

    all_nbbos = pd.concat([nbbo_just_before, nbbo_just_after, nbbo_1s_after, nbbo_10s_after, nbbo_60s_after], ignore_index=True, sort=False)
    nbbos_calculated = calculate_es_qs_eqf(all_nbbos, trade_price)

    calc = pd.concat([bbos_calculated, nbbos_calculated], ignore_index=True, sort=False)
    
    return calc
  • Now let’s call the TCA analysis function with the information from our selected trade:
tp = "QQQ 230518C00329000"
tpr = 1.16
tt = 1684338579104713924
c = get_tca_analysis_via_df_single_query(tp, tpr, tt)

Visualize the analysis results

Now let’s create the data frames we use for our visualization. Each data frame contains quotes for one of the five time intervals for each data feed (BBO, NBBO):

bbo = c[c['MarketParticipant'].isin(['BBO'])]
bbo_bef = bbo[bbo['ReceiptTimestamp'] < tt]
bbo_aft_0 = bbo[bbo['ReceiptTimestamp'].between(tt,tt+1000000000)]
bbo_aft_1 = bbo[bbo['ReceiptTimestamp'].between(tt+1000000000,tt+10000000000)]
bbo_aft_10 = bbo[bbo['ReceiptTimestamp'].between(tt+10000000000,tt+60000000000)]
bbo_aft_60 = bbo[bbo['ReceiptTimestamp'] > (tt+60000000000)]

nbbo = c[~c['MarketParticipant'].isin(['BBO'])]
nbbo_bef = nbbo[nbbo['ReceiptTimestamp'] < tt]
nbbo_aft_0 = nbbo[nbbo['ReceiptTimestamp'].between(tt,tt+1000000000)]
nbbo_aft_1 = nbbo[nbbo['ReceiptTimestamp'].between(tt+1000000000,tt+10000000000)]
nbbo_aft_10 = nbbo[nbbo['ReceiptTimestamp'].between(tt+10000000000,tt+60000000000)]
nbbo_aft_60 = nbbo[nbbo['ReceiptTimestamp'] > (tt+60000000000)]

In the following sections, we provide example code to create different visualizations.

Plot QS and NBBO before the trade

Use the following code to plot the quoted spread and NBBO before the trade:

fig = px.bar(title="Quoted Spread Before The Trade",
    x=bbo_bef.MarketParticipant,
    y=bbo_bef['QS'],
    labels={'x': 'Market', 'y':'Quoted Spread'})
fig.add_hline(y=nbbo_bef.iloc[0]['QS'],
    line_width=1, line_dash="dash", line_color="red",
    annotation_text="NBBO", annotation_font_color="red")
%plotly fig

Plot QS for each market and NBBO after the trade

Use the following code to plot the quoted spread for each market and NBBO immediately after the trade:

fig = px.bar(title="Quoted Spread After The Trade",
    x=bbo_aft_0.MarketParticipant,
    y=bbo_aft_0['QS'],
    labels={'x': 'Market', 'y':'Quoted Spread'})
fig.add_hline(
    y=nbbo_aft_0.iloc[0]['QS'],
    line_width=1, line_dash="dash", line_color="red",
    annotation_text="NBBO", annotation_font_color="red")
%plotly fig

Plot QS for each time interval and each market for BBO

Use the following code to plot the quoted spread for each time interval and each market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['QS']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['QS']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['QS']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['QS']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['QS'])])
fig.update_layout(barmode='group',title="BBO Quoted Spread Per Market/TimeFrame",
    xaxis={'title':'Market'},
    yaxis={'title':'Quoted Spread'})
%plotly fig

Plot ES for each time interval and market for BBO

Use the following code to plot the effective spread for each time interval and market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['ES']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['ES']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['ES']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['ES']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['ES'])])
fig.update_layout(barmode='group',title="BBO Effective Spread Per Market/TimeFrame",
    xaxis={'title':'Market'}, 
    yaxis={'title':'Effective Spread'})
%plotly fig

Plot EQF for each time interval and market for BBO

Use the following code to plot the effective/quoted spread for each time interval and market for BBO:

fig = go.Figure(data=[
    go.Bar(name="before trade", x=bbo_bef.MarketParticipant.unique(), y=bbo_bef['EQF']),
    go.Bar(name="0s after trade", x=bbo_aft_0.MarketParticipant.unique(), y=bbo_aft_0['EQF']),
    go.Bar(name="1s after trade", x=bbo_aft_1.MarketParticipant.unique(), y=bbo_aft_1['EQF']),
    go.Bar(name="10s after trade", x=bbo_aft_10.MarketParticipant.unique(), y=bbo_aft_10['EQF']),
    go.Bar(name="60s after trade", x=bbo_aft_60.MarketParticipant.unique(), y=bbo_aft_60['EQF'])])
fig.update_layout(barmode='group',title="BBO Effective/Quoted Spread Per Market/TimeFrame",
    xaxis={'title':'Market'}, 
    yaxis={'title':'Effective/Quoted Spread'})
%plotly fig

Athena Spark calculation performance

When you run a code block, Athena Spark automatically determines how many DPUs it requires to complete the calculation. In the last code block, where we call the tca_analysis function, we are actually instructing Spark to process the data, and we then convert the resulting Spark dataframes into Pandas dataframes. This constitutes the most intensive processing part of the analysis, and when Athena Spark runs this block, it shows the progress bar, elapsed time, and how many DPUs are processing data currently. For example, in the following calculation, Athena Spark is utilizing 18 DPUs.

When you configure your Athena Spark notebook, you have the option of setting the maximum number of DPUs that it can use. The default is 20 DPUs, but we tested this calculation with 10, 20, and 40 DPUs to demonstrate how Athena Spark automatically scales to run our analysis. We observed that Athena Spark scales linearly, taking 15 minutes and 21 seconds when the notebook was configured with a maximum of 10 DPUs, 8 minutes and 23 seconds when the notebook was configured with 20 DPUs, and 4 minutes and 44 seconds when the notebook was configured with 40 DPUs. Because Athena Spark charges based on DPU usage, at a per-second granularity, the cost of these calculations is similar, but if you set a higher maximum DPU value, Athena Spark can return the result of the analysis much faster. For more details on Athena Spark pricing please click here.

Conclusion

In this post, we demonstrated how you can use high-fidelity OPRA data from LSEG’s Tick History-PCAP to perform transaction cost analytics using Athena Spark. The availability of OPRA data in a timely manner, complemented with accessibility innovations of AWS Data Exchange for Amazon S3, strategically reduces the time to analytics for firms looking to create actionable insights for critical trading decisions. OPRA generates about 7 TB of normalized Parquet data each day, and managing the infrastructure to provide analytics based on OPRA data is challenging.

Athena’s scalability in handling large-scale data processing for Tick History – PCAP for OPRA data makes it a compelling choice for organizations seeking swift and scalable analytics solutions in AWS. This post shows the seamless interaction between the AWS ecosystem and Tick History-PCAP data and how financial institutions can take advantage of this synergy to drive data-driven decision-making for critical trading and investment strategies.


About the Authors

Pramod Nayak is the Director of Product Management of the Low Latency Group at LSEG. Pramod has over 10 years of experience in the financial technology industry, focusing on software development, analytics, and data management. Pramod is a former software engineer and passionate about market data and quantitative trading.

LakshmiKanth Mannem is a Product Manager in the Low Latency Group of LSEG. He focuses on data and platform products for the low-latency market data industry. LakshmiKanth helps customers build the most optimal solutions for their market data needs.

Vivek Aggarwal is a Senior Data Engineer in the Low Latency Group of LSEG. Vivek works on developing and maintaining data pipelines for processing and delivery of captured market data feeds and reference data feeds.

Alket Memushaj is a Principal Architect in the Financial Services Market Development team at AWS. Alket is responsible for technical strategy, working with partners and customers to deploy even the most demanding capital markets workloads to the AWS Cloud.

InsightAppSec: Improving Scan Speed and Performance

Post Syndicated from Shane Queeney original https://blog.rapid7.com/2024/01/31/insightappsec-improving-scan-speed-and-performance/

InsightAppSec: Improving Scan Speed and Performance

When scanning a web application in InsightAppSec, you might see it take several hours, if not several days, to run. This can be due to the size of your web app, but plenty of settings in your scan configuration can be modified to help scans complete faster.

The first setting is Info -> Enable Incremental Scanning. Incremental scanning will take the crawl map of the previous scan and only hit new or updated web pages. It uses the crawl signature of a page, and if it doesn’t exist in the previous scan, or is different, then the scanner will attack it. This is especially useful if you have InsightAppsec as part of your CI/CD pipeline process, and you don’t need to run a full scan every time a new build gets created. Just be aware that you will see fewer vulnerabilities on your web app because we are scanning fewer pages. It is still recommended you run periodic full scans of your web app without incremental scanning enabled to ensure maximum visibility into the vulnerability findings.

https://docs.rapid7.com/insightappsec/scan-your-app/#scan-only-the-new-and-updated-links-since-the-last-scan

InsightAppSec: Improving Scan Speed and Performance

Your second option is Scan Scope -> Crawling Restrictions. This allows you to explicitly limit certain pages or directories from being crawled. For example, if you have product manuals in multiple languages, we don’t necessarily want to attack the same pages several times. We can explicitly allow one directory to be attacked while excluding another. In the example in the screenshot below, we are allowing the scan to hit the /manuals/EN/ directory while excluding all other directories under /manuals/.

https://docs.rapid7.com/insightappsec/scan-scope/#crawling-restrictions

InsightAppSec: Improving Scan Speed and Performance

Another trick with the crawl restrictions is the ability to tell InsightAppSec to only scan specific pages or directories. First, add the directories to the Scan Config URLs under your scan scope, and then specify the target directory under the Crawling Restrictions.

You can optionally add the constraints to the Attack Restrictions so the scan still crawls the target pages, but doesn’t attack them.

If you have a very large web app, an advanced use case would be to create several scan configurations, specify different directories in each one, and kick off multiple scans against your application at the same time. Be careful because this will multiply the traffic being sent to your web server for each scan configuration you create.

https://docs.rapid7.com/insightappsec/how-to-configure-scan-scope/#scan-scope-configuration-example

InsightAppSec: Improving Scan Speed and Performance
InsightAppSec: Improving Scan Speed and Performance

Next, we have Authentication -> Additional Settings. While you might not encounter this for all of your web applications, sometimes InsightAppSec will show that it was constantly detecting logouts in the scan logs. When this happens, the scan will attempt to log back in again, and if it keeps detecting these false session losses over and over, it can cause the scan to take a very long time to complete.

To fix this, we can adjust the Session Loss Regex or the Session Loss Header Regex depending on where the logs say the logout match was found, either in the response header or response body.

If the issue is in the header, it is recommended to delete the Session Loss Header Regex value and keep it blank. You can then change the Session Loss Regex to something that only exists on your login form. Some examples include: Remember my email|Forgot password?|Remember me|Keep me signed in|Stay signed in.

If the issue is in the body, you can remove the string that is causing the problem from the Session Loss Regex and add in additional strings from the example above for more accurate detection.

https://docs.rapid7.com/insightappsec/authentication/#configure-additional-settings-using-regex

InsightAppSec: Improving Scan Speed and Performance
InsightAppSec: Improving Scan Speed and Performance

We now have Custom Options -> Performance, which you can adjust to increase the scan speed, but at the expense of hitting your web application harder. The two most common fields to adjust are the Max Bandwidth and Max Concurrent Requests.You should start by doubling these values and seeing how your web application responds to the increased traffic. If there are no issues, you can then continue to increase the values to what is shown in the screenshot below.

Optionally, you can adjust the URL Retry Attempts and the Min Delay Between Requests, but these have a much higher likelihood to cause problems for your web application or cause you to miss certain pages. Always work with your app developers to ensure that you don’t accidentally take down your application.

https://docs.rapid7.com/insightappsec/custom-options/#performance

InsightAppSec: Improving Scan Speed and Performance

The next setting is under Custom Options -> Advanced Options -> ScanConfig -> JavaScriptEngine. This allows you to change the default engine that is used for your scan. Chrome is currently set by default, as it uses the older crawling engine. For faster and more accurate scanning, especially on modern web applications, this value should be switched to Chromium to benefit from Rapid7’s latest crawling engine.

https://docs.rapid7.com/insightappsec/advanced-options

https://docs.rapid7.com/insightappsec/advanced-options/#Engine-version-75

InsightAppSec: Improving Scan Speed and Performance

Also, under Advanced Options, if you don’t want to scan specific page extensions, you can add them under either CrawlConfig or AttackerConfig -> DenyListExtensionList.

InsightAppSec: Improving Scan Speed and Performance

Optionally, if your scans are still running too slow, you can further adjust the Crawl Configuration and Attacker Configuration under the Advanced Options. More info can be found at the link below.

https://docs.rapid7.com/insightappsec/common-crawling-issues/#how-can-i-increase-scan-speed

If you need to set a maximum amount of time for the scan to run, you can adjust the MaxScanTimeInMinutes under CrawlConfig. This stops the scan at a certain time, whether the scan is complete or not. This can cause your scan to miss directories and to not get a complete picture of your web application vulnerabilities, so only use it if absolutely necessary.

InsightAppSec: Improving Scan Speed and Performance
InsightAppSec: Improving Scan Speed and Performance

If you want to run faster validation scans, click into the scan results and click on Validate Scan, in the upper right, to only search for the vulnerabilities found during that scan. This has the added benefit of automatically adjusting the vulnerability status if the vulnerability is not found again. Just remember this isn’t running a full scan against your web application, so you won’t be discovering any new vulnerabilities.

https://docs.rapid7.com/insightappsec/scan-your-app/#test-vulnerability-remediation-by-re-running-a-scanv

https://docs.rapid7.com/insightappsec/test-vulnerability-remediation

InsightAppSec: Improving Scan Speed and Performance

There are many options for speeding up the amount of time it takes to run a scan. As always, if you continue to have issues with scan time or anything else scanning related, don’t hesitate to contact Rapid7 support for more assistance.

A locally exploitable glibc vulnerability

Post Syndicated from corbet original https://lwn.net/Articles/960289/

Qualys has disclosed
a vulnerability in the GNU C Library that can be exploited by a local
attacker for root access. It was introduced in the 2.37 release, and also
backported to 2.36.

For example, we confirmed that Debian 12 and 13, Ubuntu 23.04 and
23.10, and Fedora 37 to 39 are vulnerable to this buffer
overflow. Furthermore, we successfully exploited an up-to-date,
default installation of Fedora 38 (on amd64): a Local Privilege
Escalation, from any unprivileged user to full root. Other
distributions are probably also exploitable.

Vulnerable systems with untrusted users should probably be updated in a
timely manner.

Възходът на крайната десница в Германия – как и защо?

Post Syndicated from Марина Лякова original https://www.toest.bg/vuzhodut-na-kraynata-desnitsa-v-germaniya-kak-i-zashto/

Възходът на крайната десница в Германия – как и защо?

През последната седмица в Германия се проведоха десетки демонстрации, в които се включиха над един милион души от цялата страна. Те бяха подкрепени от политици от целия партиен спектър, неправителствени организации, представители на църквите и на профсъюзите. Причината – на тайна сбирка на дяснорадикални лидери в Потсдам с участието на представители на „Алтернатива за Германия“ (АзГ) и на „Съюза на ценностите“ (фракция на Християндемократическия съюз) е обсъждана идеята за насилствена масова „ремиграция“. Зад тази сложно звучаща дума всъщност се прикрива идеята за повсеместно насилствено прогонване на чужденци и дори на германски граждани с чуждестранен произход от Германия. Срещата е разкрита и документирана от неправителствената организация „Коректив“

Днес в Германия живеят 23 млн. души с миграционен произход – мигранти или техни деца. Преобладаващата част от децата имат германско гражданство и са преминали през германската образователна система. Сред тези 23 милиона са и около 450 000 български граждани, както и българи с германско или с двойно гражданство. Въпреки че заплахата да се реализират такива планове граничи с абсурд, събитието предизвика мощен отпор. Като опасност се отчита дори самото говорене за насилствено екстрадиране. 

Добре известно е, че думите трасират пътя към действията

Точно тези думи, но и споменът за радикалните целенасочени убийства на девет чужденци в периода от 2000 до 2007 г. от организацията „Националсоциалистически ъндърграунд“ повдигна и множество въпроси. Защо в Германия, една държава, в която споменът за националсоциализма е травматичен и жив и която от края на Втората световна война е инвестирала милиони в борбата с различни екстремистки течения, радикалните нагласи продължават да процъфтяват? Те, разбира се, винаги са съществували. Но избуяването им днес e необичайно силно.

Според актуалните социологически данни АзГ би спечелила 22% от подадените гласове, ако изборите за германски парламент се състояха тази неделя. С това тя би станала втора политическа сила в страната, измествайки социалдемократи, либерали и зелени. В Източна Германия се очаква партията да бъде дори първа политическа сила, изпреварвайки Християндемократическия съюз.

Предвид предстоящите избори за Европейски парламент и за регионални парламенти в три източногермански провинции, тези данни нямат единствено прогнозна стойност, а са индикатор за сериозна промяна в обществените настроения, но и във властовия баланс. Това се потвърждава и от публичния годишен доклад на Службата за защита на Конституцията. Става дума за една по-широка тенденция, обхванала много страни от Западна Европа.

Но какви са причините?

Не е тайна, че от няколко години най-мощната европейска икономика – тази на Федералната република – се задъхва. COVID-19, Зелената сделка и войната в Украйна са основните фактори, които оскъпиха производството и съответно намалиха износа на германски стоки в чужбина. Трудностите с глобалните доставки на стоки и суровини – преди три години заради пандемията, а днес поради кризата в Червено море – допълнително усложняват ситуацията. 

И ако по време на COVID-19 германската държава все още можеше да си позволи чрез различни програми да облекчава товара върху производители и потребители, то последвалата руска инвазия в Украйна и самоограниченията, наложени от Партията на зелените (затваряне на атомните и на въглищните централи, скъпоструващите програми за повишаване на енергийната ефективност), както и повишената цена на газа вследствие вноса на LPG сякаш ограничиха възможностите за проактивни компенсиращи мерки.

На този фон много по-ясно изпъква един сериозен проблем, съществуващ от поне три десетилетия – нарастването на социалните неравенства в германското общество и намаляването на покупателната способност на средните и по-ниските социални прослойки. В множество свои изследвания и доклади известният социолог проф. Михаел Хартман показва как през последните 30 години вследствие на данъчни промени, които облагодетелстват предимно заможните и имотни граждани, социалното разслоение в Германия непрекъснато се увеличава. 

За да задържат повече фирми в Германия, правителствата след епохата на Хелмут Кол, независимо от идеологическата си обагреност, намаляват корпоративния данък или го поддържат на ниско ниво. С цел предотвратяване на емиграцията и съответно преместването на мястото на данъчната задълженост на заможните прослойки, се намалява и данък общ доход. С това драстично спадат приходите в хазната и съответно – възможността за инвестиции, но и за поддържане на по-слабо социално разслоение. А инфлацията, особено висока през последните четири години, на практика изяжда увеличенията на заплатите и фактически намалява покупателната способност.

Разбира се, средното ниво на доходите в Германия продължава да бъде едно от най-високите в света. Но също така с просто око се вижда, че инфраструктурата на страната е далеч от блясъка, с който се отличаваше в началото на 90-те. Държавните железници хронично закъсняват – едва 66% от всички влакове през 2023 г. са пристигнали по разписание. Много общини са потънали в дългове и нямат възможност да инвестират в нова инфраструктура. Места в детските градини в големите градове няма, а опашките пред социалните кухни стават по-дълги. 

Социалното разслоение е видимо и сред пенсионерите – средната брутна пенсия е около 1543 евро на месец. В България тези доходи все още изглеждат твърде високи, но съотнесени към разходите, необходими за живот в Германия, те са проблематично ниски. Настаняване в дом за възрастни хора струва между 3500 и 4700 евро на месец, като само част от тях се покриват от застраховката, а двустайно жилище в голям град само с много усилия може да бъде наето на цена под 1300 евро. На практика един пенсионер трудно може да си позволи тези разходи – на помощ идват децата или социалните служби. 

Парадоксът обаче е, че подкрепилите тайната сбирка на АзГ принадлежат към съвършено различна социална прослойка. Сред тях са материално добре обезпечени лица, например предприемачи и политици, които са далеч от социалната нищета. Очевидно не само икономическото развитие е причина за нарастване на социалното напрежение в германското общество. 

Демографията и миграцията 

Поради ниската раждаемост населението на Германия от десетилетия застарява. От години броят на жителите е устойчив – около 84 млн., но тази стабилност се дължи предимно на имиграцията, а не на естествения прираст. Вследствие на демографското развитие редица длъжности, които не са от предпочитаните, изпитват хроничен недостиг на кадри. Въпреки сравнително доброто заплащане в Германските държавни железници има около 20 000 незаети работни места. Отчаяно се търсят водопроводчици, медицински работници, лекари, продавачи в магазини, шофьори на автобуси и разбира се, ИТ специалисти. 

Може ли обаче миграцията да бъде решение? Вероятно само отчасти. Зависи от мигриращите, но и от възможността на едно общество да интегрира бързо и успешно новодошли хора. Германските политици години наред упорито отказваха да признаят, че Германия е имиграционна държава и че процесът, започнал с набирането на гастарбайтери в края на 50-те години, е необратим. Пред желаещите да имигрират и работят през 90-те години източноевропейци се поставяха непреодолими бюрократични и визови бариери. Един неизползван потенциал, намерил нов дом отвъд океана.

Вследствие на войните в Сирия, Афганистан и Ирак и на едно замислено като хуманитарно и временно решение през 2015 г. Германия отвори границите си за милиони, търсещи убежище. По силата на Дъблинското споразумение, но и на националното законодателство, германската държава нямаше законово задължение да приеме тези хора, но го направи в знак на солидарност с претоварените държави на външните граници на ЕС. А вероятно и с убеждението, че това отваряне би могло да бъде част от решението на демографския проблем. 

И макар днес, близо 10 години след събитията от лятото на 2015 г., 56% от имигриралите през 2015–2016 г. да имат постоянно работно място, интегрирането на тези хора на пазара на труда все още е свързано с препятствия. Твърде големи са психическите травми на бягащите от войни, насилие и глад, невинаги достатъчна е квалификацията им. В много случаи отнема години тези хора да бъдат обучени.

Пропагандата и въпросите за идентичността

Дяснорадикалните екстремисти се фокусират именно върху проблемите. Те трудно биха признали случаите на успешна интеграция. Факт е обаче, че днес Германия е дом на хиляди успешно интегрирани лекари, учители, икономисти, юристи, а дори и министри с миграционен произход. Да си припомним, че една от най-популярните ваксини срещу COVID-19 – „Пфайзер/Бионтех“ – беше създадена именно от деца на турски мигранти в германския град Майнц. 

Но АзГ са впили поглед в тъмната страна, а тя съществува – Германия има сериозен проблем с екстрадирането на неполучилите бежански статут. Хиляди, подали молби за убежище, подлежат на екстрадиция след отклоняване на молбите им. Въпреки че нямат право да живеят в страната, те не могат да бъдат изведени извън Германия – документите им за самоличност липсват и родните им държави отказват да ги припознаят като свои граждани. Единствената възможност е те да продължават да получават социални помощи и да чакат деня, в който екстрадицията би могла да се реализира. Левите пледират за легализацията и интеграцията им, а преобладаващата част от десните са за ограничаване на престоя им и настояват, че Германия по принцип има нужда от миграция, но не от такава миграция. 

Сериозно препятствие се оказват и културните различия. Чрез масирана интернет пропаганда, на която и германското общество е изложено, се акцентира върху сексуалните насилия, извършвани от мюсюлмани, и върху пренебрежителното им отношение към жените – сякаш в обществата, в които няма мигранти от ислямски страни, такива проблеми не съществуват. 

АзГ обаче не спира да повтаря, че Германия губи идентичността си с приемането на толкова много мигранти. И още изказвания в този дух, които могат да бъдат перифразирани най-общо така: „Ние трябва да бъдем себе си, да изгоним чуждите. Политиците ви предадоха, изберете нас – ние ще направим всичко по-различно и по-добре. Няма да плащаме за чужди войни и за скъп газ. Ако екстрадираме милиони мигранти, цените на наемите ще спаднат, защото търсенето на жилища ще намалее.“ 

В това популистко говорене се включва и крайнолевият „Съюз на Сара Вагенкнехт“, който се отцепи от левицата именно поради несъгласие с миграционната политика. И зад двете партии се прокрадва сянката на Москва. Въпреки това думите им звучат примамливо, особено на хора, които чувстват, че стандартът им на живот е спаднал, без да могат да си обяснят причините. 

На този фон е лесно за предизвикателствата да бъдат обвинявани традиционните партии и мигрантите. Ако обаче някой някога реализира абсурдната и престъпна идея да прогони милиони хора с миграционен произход от страната, в Германия няма да има кой да ти продаде дори хляб. 

А ние, българите?

Ако се съди по изборните резултати, мнозина гласуващи в Германия българи сякаш са склонни да се подведат по популистките тези на АзГ и на българските ѝ аналози. Те обаче забравят – ако някой ден бъдат въведени ограничения за мигрантите, ако „излезем“ от Европейския съюз, за да бъдем „суверенни“ и „независими“, както пледират АзГ, а и някои български партии, например „Възраждане“, то нашите сънародници, работещи в Германия, ще бъдат най-силно засегнати. На теория може би е примамливо да сочиш чужденците с пръст, докато ти самият не се окажеш част от тях.

Security updates for Wednesday

Post Syndicated from corbet original https://lwn.net/Articles/960248/

Security updates have been issued by Debian (bind9 and glibc), Fedora (ncurses), Gentoo (containerd, libaom, and xorg-server, xwayland), Mageia (python-pillow and zlib), Oracle (grub2 and tomcat), Red Hat (avahi, c-ares, container-tools:3.0, curl, firefox, frr, kernel, kernel-rt, kpatch-patch, libfastjson, libmicrohttpd, linux-firmware, oniguruma, openssh, perl-HTTP-Tiny, python-pip, python-urllib3, python3, rpm, samba, sqlite, tcpdump, thunderbird, tigervnc, and virt:rhel and virt-devel:rhel modules), SUSE (python-Pillow, slurm, slurm_20_02, slurm_20_11, slurm_22_05, slurm_23_02, and xen), and Ubuntu (libde265, linux-nvidia, mysql-8.0, openldap, pillow, postfix, and xorg-server, xwayland).

LangChain Support for Workers AI, Vectorize and D1

Post Syndicated from Ricky Robinett http://blog.cloudflare.com/author/ricky/ original https://blog.cloudflare.com/langchain-support-for-workers-ai-vectorize-and-d1


During Developer Week, we announced LangChain support for Cloudflare Workers. Langchain is an open-source framework that allows developers to create powerful AI workflows by combining different models, providers, and plugins using a declarative API — and it dovetails perfectly with Workers for creating full stack, AI-powered applications.

Since then, we’ve been working with the LangChain team on deeper integration of many tools across Cloudflare’s developer platform and are excited to share what we’ve been up to.

Today, we’re announcing five new key integrations with LangChain:

  1. Workers AI Chat Models: This allows you to use Workers AI text generation to power your chat model within your LangChain.js application.
  2. Workers AI Instruct Models: This allows you to use Workers AI models fine-tuned for instruct use-cases, such as Mistral and CodeLlama, inside your Langchain.js application.
  3. Text Embeddings Models: If you’re working with text embeddings, you can now use Workers AI text embeddings with LangChain.js.
  4. Vectorize Vector Store: When working with a Vector database and LangChain.js, you now have the option of using Vectorize, Cloudflare’s powerful vector database.
  5. Cloudflare D1-Backed Chat Memory: For longer-term persistence across chat sessions, you can swap out LangChain’s default in-memory chatHistory that backs chat memory classes like BufferMemory for a Cloudflare D1 instance.

With the addition of these five Cloudflare AI tools into LangChain, developers have powerful new primitives to integrate into new and existing AI applications. With LangChain’s expressive tooling for mixing and matching AI tools and models, you can use Vectorize, Cloudflare AI’s text embedding and generation models, and Cloudflare D1 to build a fully-featured AI application in just a few lines of code.

This is a full persistent chat app powered by an LLM in 10 lines of code–deployed to @Cloudflare Workers, powered by @LangChainAI and @Cloudflare D1.

You can even pass in a unique sessionId and have completely user/session-specific conversations 🤯 https://t.co/le9vbMZ7Mc pic.twitter.com/jngG3Z7NQ6

— Kristian Freeman (@kristianf_) September 20, 2023

Getting started with a Cloudflare + LangChain + Nuxt Multi-source Chatbot template

You can get started by using LangChain’s Cloudflare Chatbot template: https://github.com/langchain-ai/langchain-cloudflare-nuxt-template

This application shows how various pieces of Cloudflare Workers AI fit together and expands on the concept of retrieval augmented generation (RAG) to build a conversational retrieval system that can route between multiple data sources, choosing the one more relevant based on the incoming question. This method helps cut down on distraction from off-topic documents getting pulled in by a vector store’s similarity search, which could occur if only a single database were used.

The base version runs entirely on the Cloudflare Workers AI stack with the Llama 2-7B model. It uses:

  • A chat variant of Llama 2-7B run on Cloudflare Workers AI
  • A Cloudflare Workers AI embeddings model
  • Two different Cloudflare Vectorize DBs (though you could add more!)
  • Cloudflare Pages for hosting
  • LangChain.js for orchestration
  • Nuxt + Vue for the frontend

The two default data sources are a PDF detailing some of Cloudflare’s features and a blog post by Lilian Weng at OpenAI that talks about autonomous agents.

The bot will classify incoming questions as being about Cloudflare, AI, or neither, and draw on the corresponding data source for more targeted results. Everything is fully customizable – you can change the content of the ingested data, the models used, and all prompts!

And if you have access to the LangSmith beta, the app also has tracing set up so that you can easily see and debug each step in the application.

We can’t wait to see what you build

We can’t wait to see what you all build with LangChain and Cloudflare. Come tell us about it in discord or on our community forums.

CFPB’s Proposed Data Rules

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/01/cfpbs-proposed-data-rules.html

In October, the Consumer Financial Protection Bureau (CFPB) proposed a set of rules that if implemented would transform how financial institutions handle personal data about their customers. The rules put control of that data back in the hands of ordinary Americans, while at the same time undermining the data broker economy and increasing customer choice and competition. Beyond these economic effects, the rules have important data security benefits.

The CFPB’s rules align with a key security idea: the decoupling principle. By separating which companies see what parts of our data, and in what contexts, we can gain control over data about ourselves (improving privacy) and harden cloud infrastructure against hacks (improving security). Officials at the CFPB have described the new rules as an attempt to accelerate a shift toward “open banking,” and after an initial comment period on the new rules closed late last year, Rohit Chopra, the CFPB’s director, has said he would like to see the rule finalized by this fall.

Right now, uncountably many data brokers keep tabs on your buying habits. When you purchase something with a credit card, that transaction is shared with unknown third parties. When you get a car loan or a house mortgage, that information, along with your Social Security number and other sensitive data, is also shared with unknown third parties. You have no choice in the matter. The companies will freely tell you this in their disclaimers about personal information sharing: that you cannot opt-out of data sharing with “affiliate” companies. Since most of us can’t reasonably avoid getting a loan or using a credit card, we’re forced to share our data. Worse still, you don’t have a right to even see your data or vet it for accuracy, let alone limit its spread.

The CFPB’s simple and practical rules would fix this. The rules would ensure people can obtain their own financial data at no cost, control who it’s shared with and choose who they do business with in the financial industry. This would change the economics of consumer finance and the illicit data economy that exists today.

The best way for financial services firms to meet the CFPB’s rules would be to apply the decoupling principle broadly. Data is a toxic asset, and in the long run they’ll find that it’s better to not be sitting on a mountain of poorly secured financial data. Deleting the data is better for their users and reduces the chance they’ll incur expenses from a ransomware attack or breach settlement. As it stands, the collection and sale of consumer data is too lucrative for companies to say no to participating in the data broker economy, and the CFPB’s rules may help eliminate the incentive for companies to buy and sell these toxic assets. Moreover, in a free market for financial services, users will have the option to choose more responsible companies that also may be less expensive, thanks to savings from improved security.

Credit agencies and data brokers currently make money both from lenders requesting reports and from consumers requesting their data and seeking services that protect against data misuse. The CFPB’s new rules—and the technical changes necessary to comply with them—would eliminate many of those income streams. These companies have many roles, some of which we want and some we don’t, but as consumers we don’t have any choice in whether we participate in the buying and selling of our data. Giving people rights to their financial information would reduce the job of credit agencies to their core function: assessing risk of borrowers.

A free and properly regulated market for financial services also means choice and competition, something the industry is sorely in need of. Equifax, Transunion and Experian make up a longstanding oligopoly for credit reporting. Despite being responsible for one of the biggest data breaches of all time in 2017, the credit bureau Equifax is still around—illustrating that the oligopolistic nature of this market means that companies face few consequences for misbehavior.

On the banking side, the steady consolidation of the banking sector has resulted in a small number of very large banks holding most deposits and thus most financial data. Behind the scenes, a variety of financial data clearinghouses—companies most of us have never heard of—get breached all the time, losing our personal data to scammers, identity thieves and foreign governments.

The CFPB’s new rules would require institutions that deal with financial data to provide simple but essential functions to consumers that stand to deliver security benefits. This would include the use of application programming interfaces (APIs) for software, eliminating the barrier to interoperability presented by today’s baroque, non-standard and non-programmatic interfaces to access data. Each such interface would allow for interoperability and potential competition. The CFPB notes that some companies have tried to claim that their current systems provide security by being difficult to use. As security experts, we disagree: Such aging financial systems are notoriously insecure and simply rely upon security through obscurity.

Furthermore, greater standardization and openness in financial data with mechanisms for consumer privacy and control means fewer gatekeepers. The CFPB notes that a small number of data aggregators have emerged by virtue of the complexity and opaqueness of today’s systems. These aggregators provide little economic value to the country as a whole; they extract value from us all while hindering competition and dynamism. The few new entrants in this space have realized how valuable it is for them to present standard APIs for these systems while managing the ugly plumbing behind the scenes.

In addition, by eliminating the opacity of the current financial data ecosystem, the CFPB is able to add a new requirement of data traceability and certification: Companies can only use consumers’ data when absolutely necessary for providing a service the consumer wants. This would be another big win for consumer financial data privacy.

It might seem surprising that a set of rules designed to improve competition also improves security and privacy, but it shouldn’t. When companies can make business decisions without worrying about losing customers, security and privacy always suffer. Centralization of data also means centralization of control and economic power and a decline of competition.

If this rule is implemented it will represent an important, overdue step to improve competition, privacy and security. But there’s more that can and needs to be done. In time, we hope to see more regulatory frameworks that give consumers greater control of their data and increased adoption of the technology and architecture of decoupling to secure all of our personal data, wherever it may be.

This essay was written with Barath Raghavan, and was originally published in Cyberscoop.

ASRock Rack ALTRAD8UD-1L2T Review This is the Ampere Arm Motherboard You Want

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/asrock-rack-altrad8ud-1l2t-review-this-is-the-ampere-arm-motherboard-you-want/

In our ASRock Rack ALTRAD8UD-1L2T review, we see why this is the Ampere Arm motherboard that so many have been waiting for

The post ASRock Rack ALTRAD8UD-1L2T Review This is the Ampere Arm Motherboard You Want appeared first on ServeTheHome.

Rethinking Stream Processing: Data Exploration

Post Syndicated from Grab Tech original https://engineering.grab.com/rethinking-streaming-processing-data-exploration

Introduction

In this digital age, companies collect multitudes of data that enable the tracking of business metrics and performance. Over the years, data analytics tools for data storage and processing have evolved from the days of Excel sheets and macros to more advanced Map Reduce model tools like Spark, Hadoop, and Hive. This evolution has allowed companies, including Grab, to perform modern analytics on the data ingested into the Data Lake, empowering them to make better data-driven business decisions. This form of data will be referenced within this document as “Offline Data”.

With innovations in stream processing technology like Spark and Flink, there is now more interest in unlocking value from streaming data. This form of continuously-generated data in high volume will be referenced within this document as “Online Data”. In the context of Grab, the streaming data is usually materialised as Kafka topics (“Kafka Stream”) as the result of stream processing in its framework. This data is largely unexplored until they are eventually sunk into the Data Lake as Offline Data, part of the data journey (see Figure 1 below). This induces some data latency before the data can be used by data analysts to inform decisions.

Figure 1. Simplified data journey for Offline Data vs. Online Data, from data generation to data analysis.

As seen in Figure 1 above, the Time to Value (“TTV”) of Online Data is shorter as compared to that of Offline Data in a simplified data journey from data generation to data analysis where complexities of data cleaning and transformation have been removed. This is because the role of the data analyst or data scientist (“Data End User”) has been enabled forward to the Kafka stage for Online Data instead of the Data Lake stage for Offline Data. We recognise that allowing earlier data exploration on Online Data allows Data End Users to build context around the data inputs they are using in an earlier stage. This can help them process Offline Data more meaningfully in subsequent stages. We are interested in opening up the possibility for Data End Users to at least explore the Online Data before they architect a full solution to clean and/or process the data directly or more efficiently post-ingestion into the Data Lake. After their data exploration, the users would have more information to decide whether to spin up a stream processing pipeline for Online Data, or to continue processing Offline Data with their current solution, but with a more refined understanding and logic strategy against their source data inputs. However, of course, in this blog, we acknowledge that not all analysis on Online Data could be done in this manner.

Problem statement

Online Data is underutilised within Grab mainly because of, among other reasons, difficulty in performing data exploration on data that is not yet properly stored in the Data Lake.

For the purpose of this blog post, we will focus only on the problem of exploration of Online Data because this problem is the precursor to allowing us to fully democratise such data.

The problem of data exploration manifests itself when Data End Users need to find the proper data inputs to base and develop their data models. These users would then often need to parse through a multitude of documentation and connect with multiple upstream data producers, to know the range of data signals that are currently available and understand what each data signal is trying to measure.

Given the ephemeral nature of Online Data, this implies that the lack of correct tool adoption to seamlessly perform quick tests with application logic on Online Data disincentivises the Data End Users to work on these Online Data. Testing such logic on Offline Data is generally much easier since iteration testing on the exact same dataset is possible.

This difficulty in performing data exploration including ad hoc queries on Online Data has therefore made development of stream processing applications hard for Data End Users, creating headwinds in Grab’s aim to evolve from making data-driven business decisions to also making data-driven operation decisions. Doing both would allow Grab to react much quicker to abrupt changes in its business landscape.

Adoption of Zeppelin notebook environment

To address the difficulty in performing data exploration on Online Data, we have adopted Apache Zeppelin, a web-based notebook that enables data-drive, interactive data analytics with the support of multiple interpreters to work with various data processing backends e.g. Spark, Flink. The full solution of the adopted Zeppelin notebook environment is enabled seamlessly within our internal data-streaming platform, through its control plane. If you are interested, you may check out our previous blog post titled An elegant platform for more details on the abovementioned streaming platform and its control plane.

Figure 2. Zeppelin login page via web-based notebook environment.

As seen from Figure 2 above, after successful creation of the Zeppelin cluster, users can log in with their generated credentials delivered to them via the integrated instant messenger, and start using the notebook environment.

Figure 3. Zeppelin programme flow in the notebook environment.

Figure 3 above explains the Zeppelin notebook programme flow as follows:

  • The users enter their queries into the notebook session and run querying statements interactively with the established web-based notebook session.
  • The queries are passed to the Flink interpreter within the cluster to generate the Flink job as a Jar file, to be then submitted to a Flink session cluster.
  • When the Flink session cluster job manager receives the job, it would spin up the corresponding Flink task managers (workers) to run the application and retrieve the results.
  • The query results would then be piped back to the notebook session, to be displayed back to the user on the notebook session.

Data query and visualisation

Figure 4. Example of simple select query of data on Kafka.
Note: All variable names, schema, values, and other details used in this article are only created for use as examples.

Flink has a planned roadmap to create a unified streaming language for both stream processing and data analytics. In line with the roadmap, we have based our Zeppelin solution on supporting Structured Query Language (“SQL”) as the query language of choice as seen in Figure 4 above. Data End Users can now write queries in SQL, which is a language that they are comfortable with, and perform adequate data exploration.

As discussed in this section, data exploration on streaming data at the Kafka stage by adopting the right tool enables Data End Users to seamlessly have visibility to quickly understand the current schema of a Kafka topic (explained more in the next section. This kind of data exploration also enables Data End Users to understand the type of data the Kafka topic represents, such as the ability to determine if a country code data field is in alpha-2 or alpha-3 format while the data is still part of streaming data. This might seem inconsequential and immediately identifiable even in Offline Data, but by enabling data exploration at an earlier stage in the data journey for Online Data, Data End Users have the opportunity to react much more quickly. For example, a change of expected country code format from the data producer would usually lead to errors in the downstream joins or other stream processing pipelines due to incompatible parsing or filtering of the modified country codes. Instead of waiting for the data to be ingested to Offline Data, users can investigate the issue with Online Data retrieved from Kafka.

Figure 5. Simple visualisation of queried data on Zeppelin’s notebook environment.
Note: All variable names, schema, values, and other details used in this article are only created for use as examples.

Besides query features, Zeppelin notebook provides simple visualisation and analytics of the off-the-shelf data as presented above in Figure 5. Furthermore, users are now able to perform interactive ad hoc queries on Online Data. These queries will eventually become much more advanced and/or effective SQL queries to be deployed as a streaming pipeline later on in the data journey. This reduces the inertia in setting up a separate development environment or learning other programming languages like Java or Scala during the development of streaming pipelines. With Zeppelin’s notebook environment, our Data End Users are more empowered to quickly derive value from Online Data.

Need for a more dynamic table schema derivation process

For the Data End Users performing data exploration on Online Data, we see a need for these users to derive the Data Definition Language (“DDL”) associated with a Kafka stream at an earlier stage of the data journey. Within Grab, even though Kafka streams are transmitted in Protobuf format and are thus structured, both the schema and the corresponding DDL changes are added over time as new fields. Typically, the data producer (service owners) and the data engineers responsible for the data ingestion pipeline coordinate to perform such updates. Since the Data End Users are not involved in such schema update processes nor do they directly interact with the data producers, many of them find the discovery of changes in the current Kafka stream schema an issue. Granted that this is an issue our metadata platform is actively solving using Datahub, we hope to also solve the challenge by being able to derive the DDL more dynamically within the tooling, for data exploration on Online Data to reduce friction.

Figure 6. Common functions to derive DDL of a Kafka Stream in SQL.
Note: All variable names, schema, values, and other details used in this article are only created for use as examples.

As seen from Figure 6 above, we have an integrated tooling for Data End Users to derive the DDL associated with a Kafka stream using SQL language. A Kafka stream in Grab’s context is a logical concept describing a Kafka topic, associating it with its metadata like Kafka bootstrap servers and associated Java class created by Protoc. This tool maps the Protobuf schema definition of a Kafka stream to a DDL, allowing it to be expressed and used in SQL language. This reduces the manual effort involved in creating these table definitions from scratch based on the associated Protobuf schema. Users can now derive the DDL associated with a Kafka stream more easily.

Mitigating risks arising from data exploration on Online Data – data access authorisation/audit

While we rethink stream processing and are open to options that enable data exploration on Online Data as mentioned above, we realised that new security requirements related to data access authorisation and maintaining proper audit trail have emerged. Even with Personally Identifiable Information (PII) obfuscation enforcement by our streaming pipeline, it means we need to implement stricter guardrails in place along with audit trails to ensure users only have access to what they are allowed to, and this access can be removed in a break-glass scenario. If you are interested, you may check out our previous blog post titled PII masking for privacy-grade machine learning for more details about how we enforce PII masking on machine learning data streaming pipelines.

To enable data access authorisation, we utilised Strimzi, the operator of running Kafka on Kubernetes. We integrated Strimzi’s Open Policy Agent (OPA) with Kafka to define policies that authorise specific read-only user access to specific Kafka Topics. The identification of users is done via mutualTLS (mTLS) connection with our Kafka clusters, where their user details are part of the SSL certificate details used for authentication.

With these tools in place, each user’s request to explore Online Data would be properly logged, and each data access can be controlled by an OPA policy managed by a central team.

If you are interested, you may check out our previous post Zero trust with Kafka where we discussed our efforts to continue strengthening the security of our data-streaming platform.

Impact

With the proliferation of our data-streaming platform, we expect to see improvements in the way our data becomes gradually democratised. We have already been receiving use cases from the Data End Users who are interested in validating a chain of events on Online Data, i.e. retrieving information of all events associated with a particular booking, which is not currently something that can be done easily.

More importantly, the tools in place for data exploration on Online Data form the foundation required for us to embark on our next step of the stream processing journey. This foundation makes the development and validation of the stream processing logic much quicker. This occurs when ad hoc queries in a notebook environment are possible, removing the need for local developer environment setups and the need to go through the whole pipeline deployment process for eventual validation of the developed logic. We believe that this would prove to reduce our lead time in creating stream processing pipelines significantly.

What’s next?

Our next step is to rethink further how our stream processing pipelines are defined and start to provision SQL as the unified streaming language of our pipelines. This helps facilitate better discussion between upstream data producers, data engineers, and Data End Users, since SQL is the common language among these stakeholders.

We will also explore handling schema discovery in a more controlled manner by utilising a Hive catalogue to store our Kafka table definitions. This removes the need for users to retrieve and run the table DDL statement for every session, making the data exploration experience even more seamless.

References

[1] Apache Zeppelin | Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R and more.

[2] An elegant platform | Grab engineering blog.

[3] Apache Flink | Roadmap on Unified SQL Platform.

[4] ISO | ISO 3166 Country Codes.

[5] Protobuf (Protocol Buffers)| Language-neutral, platform-neutral extensible mechanisms for serializing structured data.

[6] Datahub | Extensible metadata platform that enables data discovery, data observability and federated governance to help tame the complexity of your data ecosystem.

[7] Protoc | Protocol buffer compiler installation.

[8] PII masking for privacy-grade machine learning | Grab engineering blog.

[9] Zero trust with Kafka | Grab engineering blog.

[10] Open Policy Agent (OPA) | Policy-based control for cloud native environments.

[11] Strimzi | Using Open Policy Agent with Strimzi and Apache Kafka.

[12] Confluent Documentation | Configure mTLS authentication and RBAC for kafka brokers.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!