Authentication Bypasses in MOVEit Transfer and MOVEit Gateway

Post Syndicated from Ryan Emmons original https://blog.rapid7.com/2024/06/25/etr-authentication-bypasses-in-moveit-transfer-and-moveit-gateway/

Authentication Bypasses in MOVEit Transfer and MOVEit Gateway

On June 25, 2024, Progress Software published information on two new vulnerabilities in MOVEit Transfer and MOVEit Gateway: CVE-2024-5806, a high-severity authentication bypass affecting the MOVEit Transfer SFTP service in a default configuration, and CVE-2024-5805, a critical SFTP-associated authentication bypass vulnerability affecting MOVEit Gateway. Attackers can exploit these improper authentication vulnerabilities to bypass SFTP authentication and gain access to MOVEit Transfer and Gateway.

CVE-2024-5806 is an improper authentication vulnerability affecting the MOVEit Transfer SFTP service that can lead to authentication bypass. Rapid7 researchers tested a MOVEit Transfer 2023.0.1 instance, which appeared to be vulnerable in the default configuration. As of June 25, the known criteria for exploitation are threefold: that attackers have knowledge of an existing username, that the target account can authenticate remotely, and that the SFTP service is exposed. It’s possible that attackers may spray usernames to identify valid accounts. Rapid7 recommends installing the vendor-provided patches for CVE-2024-5806 on an emergency basis, without waiting for a regular patch cycle to occur.

According to Progress Software’s advisory, CVE-2024-5805 is a critical authentication bypass vulnerability that affects the SFTP feature of the MOVEit Gateway software in version 2024.0.0; earlier versions do not appear to be vulnerable, which likely limits available attack surface area. MOVEit Gateway is an optional component designed to proxy traffic to and from MOVEit Transfer instances. A patch is available for CVE-2024-5805 and should be applied on an emergency basis for organizations running MOVEit Gateway.

Progress MOVEit is an enterprise file transfer suite, which inherently makes it a highly desirable target for threat actors. Since enterprise file transfer software typically holds a large volume of confidential data, smash-and-grab attackers target these solutions to extort victims. In June 2023, an unauthenticated attack chain targeting MOVEit Transfer was widely exploited by the Cl0p ransomware group. Shodan queries indicate that there are approximately 1,000 public-facing MOVEit Transfer SFTP servers and approximately 70 public-facing MOVEit Gateway SFTP servers. (Note that not all of these may be vulnerable to these latest CVEs.)

Notably, Rapid7 observed that installers for the patched (latest) version of the MOVEit Transfer have been available on VirusTotal since at least June 11, 2024. Vulnerability details and proof-of-concept exploit code are publicly available for MOVEit Transfer CVE-2024-5806 as of June 25, 2024.

Mitigation guidance

MOVEit customers should apply vendor-provided updates for both vulnerabilities immediately.

The following versions of MOVEit Transfer are vulnerable to CVE-2024-5806:

The advisory notes that “Customers using the MOVEit Cloud environment were patched and are no longer vulnerable to this exploit.”

Only MOVEit Gateway 2024.0.0 is vulnerable to CVE-2024-5805, per the vendor advisory. The vulnerability is fixed in MOVEit Gateway 2024.0.1. The advisory indicates that “MOVEit Cloud does not use MOVEit Gateway, so no further action is needed by MOVEit Cloud customers.”

Rapid7 customers

InsightVM and Nexpose customers will be able to assess their exposure to CVE-2024-5805 and CVE-2024-5806 with authenticated vulnerability checks expected to be available in today’s (June 25) content release.

Darktable 4.8.0 released

Post Syndicated from jzb original https://lwn.net/Articles/979666/

Version
4.8.0
of the darktable
photo editor has been released. Changes include performance
improvements for large collections, addition of more EXIF fields in
the image information module, and two new modules for image
composition: Enlarge Canvas and Overlay. Enlarge Canvas allows adding
areas to an image, while Overlay allows adding new content by
overlaying pixels from the current image or another image. LWN last
looked at darktable in
2022. Users are “strongly advised” to make a backup of their
configuration and library before upgrading, as they will not be
compatible with darktable 4.6.

Takeaways From The Take Command Summit: Understanding Modern Cyber Attacks

Post Syndicated from Emma Burdett original https://blog.rapid7.com/2024/06/25/takeaways-from-the-take-command-summit-understanding-modern-cyber-attacks/

Takeaways From The Take Command Summit: Understanding Modern Cyber Attacks

In today’s cybersecurity landscape, staying ahead of evolving threats is crucial. The State of Security Panel from our Take Command summit held May 21st delved into how artificial intelligence (AI) is reshaping cyber attacks and defenses.

The discussion highlighted the dual role of AI in cybersecurity, presenting both challenges and solutions. To learn more about these insights and protect your organization from sophisticated threats, watch the full video.

Key takeaways from the 30 minute panel:

  1. AI-Enhanced Attacks: Friendly Hacker and CEO of SocialProof Security Rachel Tobac highlighted the growing use of AI by attackers, stating, “Eight times out of ten, I’m using AI tools during my attacks.” AI helps create convincing phishing emails and scripts, making attacks more efficient and scalable.
  2. Voice Cloning and Deepfakes: Attackers are now using AI for voice cloning and deep fakes, making it vital for organizations to verify identities through multiple communication channels. Rachel continued, “We can even do a deep fake, live during a Teams or Zoom call to trick somebody.”
  3. Cloud Vulnerabilities: Rapid7’s Chief Security Officer Jaya Baloo pointed out that roughly  45% of data breaches are due to cloud issues, caused by misconfigurations and vulnerabilities, making cloud security a critical focus.

“Professional paranoia is something that I think we should hold dear to us,” Jaya Bayloo, Chief Security Officer, Rapid7

Watch the full video here.

[$] Making containers bootable for fun and profit

Post Syndicated from jzb original https://lwn.net/Articles/979182/

Dan Walsh, Stef Walter, and Colin Walters all walk into a
presentation and Walter asks, “why would
you want to boot your containers?” This isn’t the setup for some technology joke, this is part of the trio’s
keynote at
DevConf.cz
in Brno, Czech Republic on June 14 about bootable containers
(bootc). The talk, which was streamed to YouTube for those of us who
didn’t attend DevConf.cz in person, provided a solid overview of bootc
and the problems it is intended to solve. The idea behind bootc is to
make creating operating-system images just as easy as creating
application-container images while using the same tools.

Perform reindexing in Amazon OpenSearch Serverless using Amazon OpenSearch Ingestion

Post Syndicated from Utkarsh Agarwal original https://aws.amazon.com/blogs/big-data/perform-reindexing-in-amazon-opensearch-serverless-using-amazon-opensearch-ingestion/

Amazon OpenSearch Serverless is a serverless deployment option for Amazon OpenSearch Service that makes it straightforward to run search and analytics workloads without managing infrastructure. Customers using OpenSearch Serverless often need to copy documents between two indexes within the same collection or across different collections. This primarily arises from two scenarios:

  • Reindexing – You frequently need to update or modify index mapping due to evolving data needs or schema changes
  • Disaster recovery – Although OpenSearch Serverless data is inherently durable, you may want to copy data across AWS Regions for added redundancy and resiliency

Amazon OpenSearch Ingestion had recently introduced a feature supporting OpenSearch as a source. OpenSearch Ingestion, a fully managed, serverless data collector, facilitates real-time ingestion of log, metric, and trace data into OpenSearch Service domains and OpenSearch Serverless collections. We can leverage this feature to address these two scenarios, by reading the data from an OpenSearch Serverless Collection. This capability allows you to effortlessly copy data between indexes, making data management tasks more streamlined and eliminating the need for custom code.

In this post, we outline the steps to copy data between two indexes in the same OpenSearch Serverless collection using the new OpenSearch source feature of OpenSearch Ingestion. This is particularly useful for reindexing operations where you want to change your data schema. OpenSearch Serverless and OpenSearch Ingestion are both serverless services that enable you to seamlessly handle your data workflows, providing optimal performance and scalability.

Solution overview

The following diagram shows the flow of copying documents from the source index to the destination index using an OpenSearch Ingestion pipeline.

Implementing the solution consists of the following steps:

  1. Create an AWS Identity and Access Management (IAM) role to use as an OpenSearch Ingestion pipeline role.
  2. Update the data access policy attached to the OpenSearch Serverless collection.
  3. Create an OpenSearch Ingestion pipeline that simply copies data from one index to another, or you can even create an index template using the OpenSearch Ingestion pipeline to define explicit mapping, and then copy the data from the source index to the destination index with the defined mapping applied.

Prerequisites

To get started, you must have an active OpenSearch Serverless collection with an index that you want to reindex (copy). Refer to Creating collections to learn more about creating a collection.

When the collection is ready, note the following details:

  • The endpoint of the OpenSearch Serverless collection
  • The name of the index from which the documents need to be copied
  • If the collection is defined as a VPC collection, note down the name of the network policy attached to the collection

You use these details in the ingestion pipeline configuration.

Create an IAM role to use as a pipeline role

An OpenSearch Ingestion pipeline needs certain permissions to pull data from the source and write to its sink. For this walkthrough, both the source and sink are the same, but if the source and sink collections are different, modify the policy accordingly.

Complete the following steps:

  1. Create an IAM policy (opensearch-ingestion-pipeline-policy) that provides permission to read and send data to the OpenSearch Serverless collection. The following is a sample policy with least privileges (modify {account-id}, {region}, {collection-id} and {collection-name} accordingly):
    {
        "Version": "2012-10-17",
        "Statement": [{
                "Action": [
                    "aoss:BatchGetCollection",
                    "aoss:APIAccessAll"
                ],
                "Effect": "Allow",
                "Resource": "arn:aws:aoss:{region}:{account-id}:collection/{collection-id}"
            },
            {
                "Action": [
                    "aoss:CreateSecurityPolicy",
                    "aoss:GetSecurityPolicy",
                    "aoss:UpdateSecurityPolicy"
                ],
                "Effect": "Allow",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "aoss:collection": "{collection-name}"
                    }
                }
            }
        ]
    }

  2. Create an IAM role (opensearch-ingestion-pipeline-role) that the OpenSearch Ingestion pipeline will assume. While creating the role, use the policy you created (opensearch-ingestion-pipeline-policy). The role should have the following trust relationship (modify {account-id} and {region} accordingly):
    {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Service": "osis-pipelines.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "{account-id}"
                },
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:osis:{region}:{account-id}:pipeline/*"
                }
            }
        }]
    }

  3. Record the ARN of the newly created IAM role (arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role).

Update the data access policy attached to the OpenSearch Serverless collection

After you create the IAM role, you need to update the data access policy attached to the OpenSearch Serverless collection. Data access policies control access to the OpenSearch operations that OpenSearch Serverless supports, such as PUT <index> or GET _cat/indices. To perform the update, complete the following steps:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose Collections.
  2. From the list of the collections, choose your OpenSearch Serverless collection.
  3. On the Overview tab, in the Data access section, choose the associated policy.
  4. Choose Edit.
  5. Edit the policy in the JSON editor to add the following JSON rule block in the existing JSON (modify {account-id} and {collection-name} accordingly):
    {
        "Rules": [{
            "Resource": [
                "index/{collection-name}/*"
            ],
            "Permission": [
                "aoss:CreateIndex",
                "aoss:UpdateIndex",
                "aoss:DescribeIndex",
                "aoss:ReadDocument",
                "aoss:WriteDocument"
            ],
            "ResourceType": "index"
        }],
        "Principal": [
            "arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role"
        ],
        "Description": "Provide access to OpenSearch Ingestion Pipeline Role"
    }

You can also use the Visual Editor method to choose Add another rule and add the preceding permissions for arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role.

  1. Choose Save.

Now you have successfully allowed the OpenSearch Ingestion role to perform OpenSearch operations against the OpenSearch Serverless collection.

Create and configure the OpenSearch Ingestion pipeline to copy the data from one index to another

Complete the following steps:

  1. On the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
  2. Choose Create a pipeline.
  3. In Choose Blueprint, select OpenSearchDataMigrationPipeline.
  4. For Pipeline name, enter a name (for example, sample-ingestion-pipeline).
  5. For Pipeline capacity, you can define the minimum and maximum capacity to scale up the resources. For this walkthrough, you can use the default value of 2 Ingestion OCUs for Min capacity and 4 Ingestion OCUs for Max capacity. However, you can even choose different values as OpenSearch Ingestion automatically scales your pipeline capacity according to your estimated workload, based on the minimum and maximum Ingestion OpenSearch Compute Units (Ingestion OCUs) that you specify.
  6. Update the following information for the source:
    1. Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless collection that was copied as part of prerequisites.
    2. Uncomment include and index_name_regex, and specify the name of the index that will act as the source (in this demo, we’re using logs-2024.03.01).
    3. Uncomment region under aws and specify the AWS Region where your OpenSearch Serverless collection is (for example, us-east-1).
    4. Uncomment sts_role_arn under aws and specify the role that has permission to read data from the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role). This is the same role that was added in the data access policy of the collection.
    5. Update the serverless flag to true.
    6. If the OpenSearch Serverless collection has VPC access, uncomment serverless_options and network_policy_name and specify the name of the network policy used for the collection.
    7. Uncomment scheduling, interval, index_read_count, and start_time and modify these parameters accordingly.
      Using these parameters makes sure the OpenSearch Ingestion pipeline processes the indexes multiple times (to pick up new documents).
      Note – If the collection specified in the sink is of the Time series or Vector search type, you can keep the scheduling, interval, index_read_count, and start_time parameters commented.
  1. Update the following information for the sink:
    1. Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless collection.
    2. Uncomment sts_role_arn under aws and specify the role that has permission to write data into the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role). This is the same role that was added in the data access policy of the collection.
    3. Update the serverless flag to true.
    4. If the OpenSearch Serverless collection has VPC access, uncomment serverless_options and network_policy_name and specify the name of the network policy used for the collection.
    5. Update the value for index and provide the index name to which you want to transfer the documents (for example, new-logs-2024.03.01).
    6. For document_id, you can get the ID from the document metadata in the source and use the same in the target.
      However, it is important to note that custom document IDs are only supported for the Search type of collection. If your collection is of the Time Series or Vector Search type, you should comment out the document_id line.
    7. (Optional) The values for bucket, region and sts_role_arn keys within the dlq section can be modified to capture any failed requests in an S3 bucket.
      Note – Additional permission to opensearch-ingestion-pipeline-role needs to be given, if configuring DLQ. Please refer Writing to a dead-letter queue, for the changes required.
      For this walkthrough, you will not set up a DLQ. You can remove the entire dlq block.
  1. Now click on Validate pipeline to validate the pipeline configuration.
  2. For Network settings, choose your preferred setting:
    1. Choose VPC access and select your VPC, subnet, and security group to set up the access privately. Choose this option if the OpenSearch Serverless collection has VPC access. AWS recommends using a VPC endpoint for all production workloads.
    2. Choose Public to use public access. For this walkthrough, we select Public because the collection is also accessible from public network.
  3. For Log Publishing Option, you can either create a new Amazon CloudWatch group or use an existing CloudWatch group to write the ingestion logs. This provides access to information about errors and warnings raised during the operation, which can help during troubleshooting. For this walkthrough, choose Create new group.
  4. Choose Next, and verify the details you specified for your pipeline settings.
  5. Choose Create pipeline.

It will take a couple of minutes to create the ingestion pipeline. After the pipeline is created, you will see the documents in the destination index, specified in the sink (for example, new-logs-2024.03.01). After all the documents are copied, you can validate the number of documents by using the count API.

When the process is complete, you have the option to stop or delete the pipeline. If you choose to keep the pipeline running, it will continue to copy new documents from the source index according to the defined schedule, if specified.

In this walkthrough, the endpoint defined in the hosts parameter under source and sink of the pipeline configuration belonged to the same collection which was of the Search type. If the collections are different, you need to modify the permissions for the IAM role (opensearch-ingestion-pipeline-role) to allow access to both collections. Additionally, make sure you update the data access policy for both the collections to grant access to the OpenSearch Ingestion pipeline.

Create an index template using the OpenSearch Ingestion pipeline to define mapping

In OpenSearch, you can define how documents and their fields are stored and indexed by creating a mapping. The mapping specifies the list of fields for a document. Every field in the document has a field type, which defines the type of data the field contains. OpenSearch Service dynamically maps data types in each incoming document if an explicit mapping is not defined. However, you can use the template_type parameter with the index-template value and template_content with JSON of the content of the index-template in the pipeline configuration to define explicit mapping rules. You also need to define the index_type parameter with the value as custom.

The following code shows an example of the sink portion of the pipeline and the usage of index_type, template_type, and template_content:

sink:
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "<<OpenSearch-Serverless-Collection-Endpoint>>" ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role"
          # Provide the region of the domain.
          region: "us-east-1"
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
          serverless: true
          # serverless_options:
            # Specify a name here to create or update network policy for the serverless collection
            # network_policy_name: "network-policy-name"
        # This will make it so each document in the source cluster will be written to the same index in the destination cluster
        index: "new-logs-2024.03.01"
        index_type: custom
        template_type: index-template
        template_content: >
          {
            "template" : {
              "mappings" : {
                "properties" : {
                  "Data" : {
                    "type" : "text"
                  },
                  "EncodedColors" : {
                    "type" : "binary"
                  },
                  "Type" : {
                    "type" : "keyword"
                  },
                  "LargeDouble" : {
                    "type" : "double"
                  }          
                }
              }
            }
          }
        # This will make it so each document in the source cluster will be written with the same document_id in the destination cluster
        document_id: "${getMetadata(\"opensearch-document_id\")}"
        # Enable the 'distribution_version' setting if the AWS OpenSearch Service domain is of version Elasticsearch 6.x
        # distribution_version: "es6"
        # Enable and switch the 'enable_request_compression' flag if the default compression setting is changed in the domain. See https://docs.aws.amazon.com/opensearch-service/latest/developerguide/gzip.html
        # enable_request_compression: true/false
        # Enable the S3 DLQ to capture any failed requests in an S3 bucket
        # dlq:
          # s3:
            # Provide an S3 bucket
            # bucket: "<<your-dlq-bucket-name>>"
            # Provide a key path prefix for the failed requests
            # key_path_prefix: "<<logs/dlq>>"
            # Provide the region of the bucket.
            # region: "<<us-east-1>>"
            # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
            # sts_role_arn: "<<arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role>>"

Or you can create the index first, with the mapping in the collection before you start the pipeline.

If you want to create a template using an OpenSearch Ingestion pipeline, you need to provide aoss:UpdateCollectionItems and aoss:DescribeCollectionItems permission for the collection in the data access policy for the pipeline role (opensearch-ingestion-pipeline-role). The updated JSON block for the rule would look like the following:

{
    "Rules": [
      {
        "Resource": [
          "collection/{collection-name}"
        ],
        "Permission": [
          "aoss:UpdateCollectionItems",
          "aoss:DescribeCollectionItems"
        ],
        "ResourceType": "collection"
      },
      {
        "Resource": [
          "index/{collection-name}/*"
        ],
        "Permission": [
          "aoss:CreateIndex",
          "aoss:UpdateIndex",
          "aoss:DescribeIndex",
          "aoss:ReadDocument",
          "aoss:WriteDocument"
        ],
        "ResourceType": "index"
      }
    ],
    "Principal": [
      "arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role"
    ],
    "Description": "Provide access to OpenSearch Ingestion Pipeline Role"
  }

Conclusion

In this post, we showed how to use an OpenSearch Ingestion pipeline to copy data from one index to another in an OpenSearch Serverless collection. OpenSearch Ingestion also allows you to perform transformation of data using various processors. AWS offers various resources for you to quickly start building pipelines using OpenSearch Ingestion. You can use various built-in pipeline integrations to quickly ingest data from Amazon DynamoDB, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Security Lake, Fluent Bit, and many more. You can use the following OpenSearch Ingestion blueprints to build data pipelines with minimal configuration changes.


About the Authors

Utkarsh Agarwal is a Cloud Support Engineer in the Support Engineering team at Amazon Web Services. He specializes in Amazon OpenSearch Service. He provides guidance and technical assistance to customers thus enabling them to build scalable, highly available, and secure solutions in the AWS Cloud. In his free time, he enjoys watching movies, TV series, and of course, cricket. Lately, he has also been attempting to master the art of cooking in his free time – the taste buds are excited, but the kitchen might disagree.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Uncover social media insights in real time using Amazon Managed Service for Apache Flink and Amazon Bedrock

Post Syndicated from Francisco Morillo original https://aws.amazon.com/blogs/big-data/uncover-social-media-insights-in-real-time-using-amazon-managed-service-for-apache-flink-and-amazon-bedrock/

With over 550 million active users, X (formerly known as Twitter) has become a useful tool for understanding public opinion, identifying sentiment, and spotting emerging trends. In an environment where over 500 million tweets are sent each day, it’s crucial for brands to effectively analyze and interpret the data to maximize their return on investment (ROI), which is where real-time insights play an essential role.

Amazon Managed Service for Apache Flink helps you to transform and analyze streaming data in real time with Apache Flink. Apache Flink supports stateful computation over a large volume of data in real time with exactly-once consistency guarantees. Moreover, Apache Flink’s support for fine-grained control of time with highly customizable window logic enables the implementation of the advanced business logic required for building a streaming data platform. Stream processing and generative artificial intelligence (AI) have emerged as powerful tools to harness the potential of real time data. Amazon Bedrock, along with foundation models (FMs) such as Anthropic Claude on Amazon Bedrock, empowers a new wave of AI adoption by enabling natural language conversational experiences.

In this post, we explore how to combine real-time analytics with the capabilities of generative AI and use state-of-the-art natural language processing (NLP) models to analyze tweets through queries related to your brand, product, or topic of choice. It goes beyond basic sentiment analysis and allows companies to provide actionable insights that can be used immediately to enhance customer experience. These include:

  • Identifying rising trends and discussion topics related to your brand
  • Conducting granular sentiment analysis to truly understand customers’ opinions
  • Detecting nuances such as emojis, acronyms, sarcasm, and irony
  • Spotting and addressing concerns proactively before they spread
  • Guiding product development based on feature requests and feedback
  • Creating targeted customer segments for information campaigns

This post takes a step-by-step approach to showcase how you can use Retrieval Augmented Generation (RAG) to reference real-time tweets as a context for large language models (LLMs). RAG is the process of optimizing the output of an LLM so it references an authoritative knowledge base outside of its training data sources before generating a response. LLMs are trained on vast volumes of data and use billions of parameters to generate original output for tasks such as answering questions, translating languages, and completing sentences. RAG extends the already powerful capabilities of LLMs to specific domains or an organization’s internal knowledge base, all without the need to retrain the model. It’s a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.

Solution overview

In this section, we explain the flow and architecture of the application. We divide the flow of the application into two parts:

  • Data ingestion – Ingest data from streaming sources, convert it to vector embeddings, and then store them in a vector database
  • Insights retrieval – Invoke an LLM with the user queries to retrieve insights on tweets using the RAG technique

Data ingestion

The following diagram describes the data ingestion flow:

  1. Process feeds from streaming sources, such as social media feeds, Amazon Kinesis Data Streams, or Amazon Managed Service for Apache Kafka (Amazon MSK).
  2. Convert streaming data to vector embeddings in real time.
  3. Store them in a vector database.

Data is ingested from a streaming source (for example, X) and processed using an Apache Flink application. Apache Flink is an open source stream processing framework. It provides powerful streaming capabilities, enabling real-time processing, stateful computations, fault tolerance, high throughput, and low latency. Apache Flink is used to process the streaming data, perform deduplication, and invoke an embedding model to create vector embeddings.

Vector embeddings are numerical representations that capture the relationships and meaning of words, sentences, and other data types. These vector embeddings will be used for semantic search or neural search to retrieve relevant information that will be used as context for the LLM to evaluate the response. After the text data is converted into vectors, the vectors are persisted in an Amazon OpenSearch Service domain, which will be used as a vector database. Unlike traditional relational databases with rows and columns, data points in a vector database are represented by vectors with a fixed number of dimensions, which are clustered based on similarity.

OpenSearch Service offers scalable and efficient similarity search capabilities tailored for handling large volumes of dense vector data. OpenSearch Service seamlessly integrates with other AWS services, enabling you to build robust data pipelines within AWS. As a fully managed service, OpenSearch Service alleviates the operational overhead of managing the underlying infrastructure, while providing essential features like approximate k-Nearest Neighbor (k-NN) search algorithms, dense vector support, and robust monitoring and logging tools through Amazon CloudWatch. These capabilities make OpenSearch Service a suitable solution for applications that require fast and accurate similarity-based retrieval tasks using vector embeddings.

This design enables real-time vector embedding, making it ideal for AI-driven applications.

Insights retrieval

The following diagram shows the flow from the user side, where the user places a query through the frontend and gets a response from the LLM model using the retrieved vector database documents as the context provided in the prompt.

As shown in the preceding figure, to retrieve insights from the LLM, first you need to receive a query from the user. The text query is then converted into vector embeddings using the same model that was used before for the tweets. It’s important to make sure the same embedding model is used for both ingestion and search. The vector embeddings are then used to perform a semantic search in the vector database to obtain the related vectors and associated text. This serves as the context for the prompt. Next, the previous conversation history (if any) is added to the prompt. This serves as the conversation history for the model. Finally, the user’s question is also included in the prompt and the LLM is invoked to get the response.

For the purpose of this post, we don’t take into consideration the conversation history or store it for later use.

Solution architecture

Now that you understand the overall process flow, let’s analyze the following architecture using AWS services step by step.

The first part of the preceding figure shows the data ingestion process:

  1. A user authenticates with Amazon Cognito.
  2. The user connects to the Streamlit frontend and configures the following parameters: query terms, API bearer token, and frequency to retrieve tweets.
  3. Managed Service for Apache Flink is used to consume and process the tweets in real time and stores in Apache Flink’s state the parameters for making the API requests received from the frontend application.
  4. The streaming application uses Apache Flink’s async I/O to invoke the Amazon Titan Embeddings model through the Amazon Bedrock API.
  5. Amazon Bedrock returns a vector embedding for each tweet.
  6. The Apache Flink application then writes the vector embedding with the original text of the tweet into an OpenSearch Service k-NN index.

The remainder of the architecture diagram shows the insights retrieval process:

  1. A user sends a query through the Streamlit frontend application.
  2. An AWS Lambda function is invoked by Amazon API Gateway, passing the user query as input.
  3. The Lambda function uses LangChain to orchestrate the RAG process. As a first step, the function invokes the Amazon Titan Embeddings model on Amazon Bedrock to create a vector embedding for the question.
  4. Amazon Bedrock returns the vector embedding for the question.
  5. As a second step in the RAG orchestration process, the Lambda function performs a semantic search in OpenSearch Service and retrieves the relevant documents related to the question.
  6. OpenSearch Service returns the relevant documents containing the tweet text to the Lambda function.
  7. As a last step in the LangChain orchestration process, the Lambda function augments the prompt, adding the context and using few-shot prompting. The augmented prompt, including instructions, examples, context, and query, is sent to the Anthropic Claude model through the Amazon Bedrock API.
  8. Amazon Bedrock returns the answer to the question in natural language to the Lambda function.
  9. The response is sent back to the user through API Gateway.
  10. API Gateway provides the response to the user question in the Streamlit application.

The solution is available in the GitHub repo. Follow the README file to deploy the solution.

Now that you understand the overall flow and architecture, let’s dive deeper into some of the key steps to understand how it works.

Amazon Bedrock chatbot UI

The Amazon Bedrock chatbot Streamlit application is designed to provide insights from tweets, whether they are real tweets ingested from the X API or simulated tweets or messages from the My Social Media application.

In the Streamlit application, we can provide the parameters that will be used to make the API requests to the X Developer API and pull the data from X. We developed an Apache Flink application that adjusts the API requests based on the provided parameters.

As parameters, you need to provide the following:

  • Bearer token for API authorization – This is obtained from the X Developer platform when you sign up to use the APIs.
  • Query terms to be used to filter the tweets consumed – You can use the search operators available in the X documentation.
  • Frequency of the request – The X basic API only allows you to make a request every 15 seconds. If a lower interval is set, the application won’t pull data.

The parameters are sent to Kinesis Data Streams through API Gateway and are consumed by the Apache Flink application.

My Social Media UI

The My Social Media application is a Streamlit application that serves as an additional UI. Through this application, users can compose and send messages, simulating the experience of posting on a social media site. These messages are then ingested into an AWS data pipeline consisting of API Gateway, Kinesis Data Streams, and an Apache Flink application. The Apache Flink application processes the incoming messages, invokes an Amazon Bedrock embedding model, and stores the data in an OpenSearch Service cluster.

To accommodate both real X data and simulated data from the My Social Media application, we’ve set up separate indexes within the OpenSearch Service cluster. This separation allows users to choose which data source they want to analyze or query. The Streamlit application features a sidebar option called Use X Index that acts as a toggle. When this option is enabled, the application queries and analyzes data from the index containing real tweets ingested from the X API. If the option is disabled, the application queries and displays data from the index containing messages sent through the My Social Media application.

Apache Flink is used because of its ability to scale with the increasing volume of tweets. The Apache Flink application is responsible for performing data ingestion as explained previously. Let’s dive into the details of the flow.

Consume data from X

We use Apache Flink to process the API parameters sent from the Streamlit UI. We store the parameters in Apache Flink’s state, which allows us to modify and update the parameters without having to restart the application. We use the ProcessFunction to use Apache Flink’s internal timers to schedule the frequency of requests to fetch tweets. In this post, we use X’s Recent search API, which allows us to access filtered public tweets posted over the last 7 days. The API response is paginated and returns a maximum of 100 tweets on each request in reverse chronological order. If there are more tweets to be consumed, the response of the previous request will return a token, which needs to be used in the next API call. After we receive the tweets from the API, we apply the following transformations:

  • Filter out the empty tweets (tweets without any text).
  • Partition the set of tweets by author ID. This helps distribute the processing to multiple subtasks in Apache Flink.
  • Apply a deduplication logic to only process tweets that haven’t been processed. For this, we store the already processed tweet ID in Apache Flink’s state and match and filter out the tweets that have already been processed. We store the tweets’ ID grouped by author ID, which can cause the state size of the application to increase. Because the API only provides tweets from the last 7 days when invoked, we have introduced a time-to-live (TTL) of 7 days so we don’t grow the application’s state indefinitely. You can modify this based on your requirements.
  • Convert tweets into JSON objects for a later Amazon Bedrock API invocation.

Create vector embeddings

The vector embeddings are created by invoking the Amazon Titan Embeddings model through the Amazon Bedrock API. Asynchronous invocations of external APIs are important performance considerations when building a stream processing architecture. Synchronous calls increase latency, reduce throughput, and can be a bottleneck for overall processing.

To invoke the Amazon Bedrock API, you will use the Amazon Bedrock Runtime dependency in Java, which provides an asynchronous client that allows us invoke Amazon Bedrock models asynchronously through the BedrockRuntimeAsyncClient. This is invoked to create the embeddings. For this we use Apache Flink’s async I/O to make asynchronous requests to external APIs. Apache Flink’s async I/O is a library within Apache Flink that allows you to write asynchronous, non-blocking operators for stream processing applications, enabling better utilization of resources and higher throughput. We provide the asynchronous function to be called, the timeout duration that determines how long an asynchronous operation can take before it’s considered failed, and the maximum number of requests that should be in progress at any point in time. Limiting the number of concurrent requests makes sure that the operator won’t accumulate an ever-growing backlog of pending requests. However, this can cause backpressure after the capacity is exhausted. Because we use the timestamp of creation when we ingest into OpenSearch Service and so order won’t affect our results, we can use Apache Flink’s async I/O unordered function, allowing us to have better throughput and performance. See the following code:

DataStream<JSONObject> resultStream = AsyncDataStream

.unorderedWait(inputJSON, new BedRockEmbeddingModelAsyncTweetFunction(), 15000, TimeUnit.MILLISECONDS, 1000)
.uid("tweet-async-function");

Let’s have a closer look into the Apache Flink async I/O function. The following is within the CompletableFuture Java class:

  1. First, we create the Amazon Bedrock Runtime async client:
BedrockRuntimeAsyncClient runtime = BedrockRuntimeAsyncClient.builder()
.region(Region.of(region))  // Use the specified AWS region 
.build();
  1. We then extract the tweet for the event and build the payload that we will send to Amazon Bedrock:
String stringBody = jsonObject.getString("tweet");

ArrayList<String> stringList = new ArrayList<>();


stringList.add(stringBody);


JSONObject jsonBody = new JSONObject()
.put("inputText", stringBody);


SdkBytes body = SdkBytes.fromUtf8String(jsonBody.toString());
  1. After we have the payload, we can call the InvokeModel API and invoke Amazon Titan to create the vector embeddings for the tweets:
InvokeModelRequest request = InvokeModelRequest.builder()
        
.modelId("amazon.titan-embed-text-v1")
        
.contentType("application/json")
        
.accept("*/*")
        
.body(body)
        
.build();

CompletableFuture<InvokeModelResponse> futureResponse = runtime.invokeModel(request);
  1. After receiving the vector, we append the following fields to the output JSONObject:
    1. Cleaned tweet
    2. Tweet creation timestamp
    3. Number of likes of the tweet
    4. Number of retweets
    5. Number of views from the tweet (impressions)
    6. Tweet ID
// Extract and process the response when it is available
JSONObject response = new JSONObject(
        futureResponse.join().body().asString(StandardCharsets.UTF_8)
);

// Add additional fields related to tweet data to the response
response.put("tweet", jsonObject.get("tweet"));
response.put("@timestamp", jsonObject.get("created_at"));
response.put("likes", jsonObject.get("likes"));
response.put("retweet_count", jsonObject.get("retweet_count"));
response.put("impression_count", jsonObject.get("impression_count"));
response.put("_id", jsonObject.get("_id"));

return response;

This will return the embeddings, original text, additional fields, and the number of tokens used for the embedding. In our connector, we are only consuming messages in English, as well as ignoring messages that are retweets from other tweets.

The same processing steps are replicated for messages coming from the My Social Media application (manually ingested).

Store vector embeddings in OpenSearch Service

We use OpenSearch Service as a vector database for semantic search. Before we can write the data into OpenSearch Service, we need to create an index that supports semantic search. We are using the k-NN plugin. The vector database index mapping should have the following properties for storing vectors for similarity search:

"embeddings": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }

The key parameters are as follows:

  • type – This specifies that the field will hold vector data for a k-NN similarity search. The value should be knn_vector.
  • dimension – The number of dimensions for each vector. This must match the model dimension. In this case we use 1,536 dimensions, the same as the Amazon Titan Text Embeddings v1 model.
  • method – Defines the algorithm and parameters for indexing and searching the vectors:
    • name – The identifier for the nearest neighbor method. We use hierarchical navigable small worlds (HNSW)—a hierarchical proximity graph approach—to run a approximate k-NN (A-NN) search because standard k-NN is not a scalable approach.
    • space_type – The vector space used to calculate the distance between vectors. It supports multiple space type. The default value is 12.
    • engine – The approximate k-NN library to use for indexing and search. The available libraries are faiss, nmslib, and Lucene.
    • ef_construction – The size of the dynamic list used during k-NN graph creation. Higher values result in a more accurate graph but slower indexing speed.
    • m – The number of bidirectional links that the plugin creates for each new element. Increasing and decreasing this value can have a large impact on memory consumption. Keep this value between 2–100.

Standard k-NN search methods compute similarity using a brute-force approach that measures the nearest distance between a query and a number of points, which produces exact results. This works well for most applications. However, in the case of extremely large datasets with high dimensionality, this creates a scaling problem that reduces the efficiency of the search. The approximate k-NN search methods used by OpenSearch Service use approximate nearest neighbor (ANN) algorithms from the nmslib, faiss, and Lucene libraries to power k-NN search. These search methods employ ANN to improve search latency for large datasets. Of the three search methods the k-NN plugin provides, this method offers the best search scalability for large datasets. This approach is the preferred method when a dataset reaches hundreds of thousands of vectors. For more information about the different methods and their trade-offs, refer to Comprehensive Guide To Approximate Nearest Neighbors Algorithms.

To use the k-NN plugin’s approximate search functionality, we must first create a k-NN index with index.knn set to true:

    "settings" : {
      "index" : {
        "knn": true,
        "number_of_shards" : "5",
        "number_of_replicas" : "1"
      }
    }

After we have our indexes created, we can sink the data from our Apache Flink application into OpenSearch.

RetrievalQA using Lambda and LangChain implementation

For this part, we take an input question from the user and invoke a Lambda function. The Lambda function retrieves relevant tweets from OpenSearch Service as context and generates an answer using the LangChain RAG chain RetrievalQA. LangChain is a framework for developing applications powered by language models.

First, some setup. We instantiate the bedrock-runtime client that will allow the Lambda function to invoke the models:

bedrock_runtime = boto3.client("bedrock-runtime", "us-east-1")

embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", client=bedrock_runtime)

The BedrockEmbeddings class uses the Amazon Bedrock API to generate embeddings for the user’s input question. It strips new line characters from the text. Notice that we need to pass as an argument the instantiation of the bedrock_runtime client and the model ID for the Amazon Titan Text Embeddings v1 model.

Next, we instantiate the client for the OpenSearchVectorSeach LangChain class that will allow the Lambda function to connect to the OpenSearch Service domain and perform the semantic search against the previously indexed X embeddings. For the embedding function, we’re passing the embeddings model that we defined previously. This will be used during the LangChain orchestration process:

os_client = OpenSearchVectorSearch(
        index_name=aos_index,
        embedding_function=embeddings,
        http_auth=(os.environ['aosUser'], os.environ['aosPassword']),
        opensearch_url=os.environ['aosDomain'],
        timeout=300,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection,
        )

We need to define the LLM model from Amazon Bedrock to use for text generation. The temperature is set to 0 to reduce hallucinations:

model_kwargs={"temperature": 0, "max_tokens": 4096}

llm = BedrockChat(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    client=bedrock_runtime,
    model_kwargs=model_kwargs
)

Next, in our Lambda function, we create the prompt to instruct the model on the specific task of analyzing hundreds of tweets in the context. To normalize the output, we use a prompt engineering technique called few-shot prompting. Few-shot prompting allows language models to learn and generate responses based on a small number of examples or demonstrations provided in the prompt itself. In this approach, instead of training the model on a large dataset, we provide a few examples of the desired task or output within the prompt. These examples serve as a guide or conditioning for the model, enabling it to understand the context and the desired format or pattern of the response. When presented with a new input after the examples, the model can then generate an appropriate response by following the patterns and context established by the few-shot demonstrations in the prompt.

As part of the prompt, we then provide examples of questions and answers, so the chatbot can follow the same pattern when used (see the Lambda function to view the complete prompt):

template = """As a helpful agent that is an expert analysing tweets, please answer the question using only the provided tweets from the context in <context></context> tags. If you don't see valuable information on the tweets provided in the context in <context></context> tags, say you don't have enough tweets related to the question. Cite the relevant context you used to build your answer. Print in a bullet point list the top most influential tweets from the context at the end of the response.
    
    Find below some examples:
    <example1>
    question: 
    What are the main challenges or concerns mentioned in tweets about using Bedrock as a generative AI service on AWS, and how can they be addressed?
    
    answer:
    Based on the tweets provided in the context, the main challenges or concerns mentioned about using Bedrock as a generative AI service on AWS are:

1.	...
2.	...
3.	...
4.	...
...
    
    To address these concerns:

1.	...
2.	...
3.	...
4.	...
...

    Top tweets from context:

    [1] ...
    [2] ...
    [3] ...
    [4] ...

    </example1>
    
    <example2>
    ...
    </example2>
    
    Human: 
    
    question: {question}
    
    <context>
    {context}
    </context>
    
    Assistant:"""

    prompt = PromptTemplate(input_variables=["context","question"], template=template)

We then create the RetrievalQA LangChain chain using the prompt template, Anthropic Claude on Amazon Bedrock, and the OpenSearch Service retriever configured previously. The RetrievalQA LangChain chain will orchestrate the following RAG steps:

  • Invoke the text embedding model to create a vector for the user’s question
  • Perform a semantic search on OpenSearch Service using the vector to retrieve the relevant tweets to the user’s question (k=200)
  • Invoke the LLM model using the augmented prompt containing the prompt template, context (stuffed retrieved tweets) and question
chain = RetrievalQA.from_chain_type(
    llm=llm,
    verbose=True,
    chain_type="stuff",
    retriever=os_client.as_retriever(
        search_type="similarity",
        search_kwargs={
            "k": 200, 
            "space_type": "l2", 
            "vector_field": "embeddings", 
            "text_field": text_field
        }
    ),
    chain_type_kwargs = {"prompt": prompt}
)

Finally, we run the chain:

answer = chain.invoke({"query": message})

The response from the LLM is sent back to the user application. As shown in the following screenshot:

Considerations

You can extend the solution provided in this post. When you do, consider the following suggestions:

  • Configure index retention and rollover in OpenSearch Service to manage index lifecycle and data retention effectively
  • Incorporate chat history into the chatbot to provide richer context and improve the relevance of LLM responses
  • Add filters and hybrid search with the possibility to modify the weight given to the keyword and semantic search to enhance search on RAG
  • Modify the TTL for Apache Flink’s state to match your requirements (the solution in this post uses 7 days)
  • Enable logging to API Gateway and in the Streamlit application.

Summary

This post demonstrates how to combine real-time analytics with generative AI capabilities to analyze tweets related to a brand, product, or topic of interest. It uses Amazon Managed Service for Apache Flink to process tweets from the X API, create vector embeddings using the Amazon Titan Embeddings model on Amazon Bedrock, and store the embeddings in an OpenSearch Service index configured for vector similarity search—all these steps happen in real time.

The post also explains how users can input queries through a Streamlit frontend application, which invokes a Lambda function. This Lambda function retrieves relevant tweets from OpenSearch Service by performing semantic search on the stored embeddings using the LangChain RetrievalQA chain. As a result, it generates insightful answers using the Anthropic Claude LLM on Amazon Bedrock.

The solution enables identifying trends, conducting sentiment analysis, detecting nuances, addressing concerns, guiding product development, and creating targeted customer segments based on real-time X data.

To get started with generative AI, visit Generative AI on AWS for information about industry use cases, tools to build and scale generative AI applications, as well as the post Exploring real-time streaming for generative AI Applications for other use cases for streaming with generative AI.


About the Authors

Francisco Morillo is a Streaming Solutions Architect at AWS, specializing in real-time analytics architectures. With over five years in the streaming data space, Francisco has worked as a data analyst for startups and as a big data engineer for consultancies, building streaming data pipelines. He has deep expertise in Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink. Francisco collaborates closely with AWS customers to build scalable streaming data solutions and advanced streaming data lakes, ensuring seamless data processing and real-time insights.

Sergio Garcés Vitale is a Senior Solutions Architect at AWS, passionate about generative AI. With over 10 years of experience in the telecommunications industry, where he helped build data and observability platforms, Sergio now focuses on guiding Retail and CPG customers in their cloud adoption, as well as customers across all industries and sizes in implementing Artificial Intelligence use cases.

Subham Rakshit is a Streaming Specialist Solutions Architect for Analytics at AWS based in the UK. He works with customers to design and build search and streaming data platforms that help them achieve their business objective. Outside of work, he enjoys spending time solving jigsaw puzzles with his daughter.

Best practices working with self-hosted GitHub Action runners at scale on AWS

Post Syndicated from Shilpa Sharma original https://aws.amazon.com/blogs/devops/best-practices-working-with-self-hosted-github-action-runners-at-scale-on-aws/

Overview

GitHub Actions is a continuous integration and continuous deployment platform that enables the automation of build, test and deployment activities for your workload. GitHub Self-Hosted Runners provide a flexible and customizable option to run your GitHub Action pipelines. These runners allow you to run your builds on your own infrastructure, giving you control over the environment in which your code is built, tested, and deployed. This reduces your security risk and costs, and gives you the ability to use specific tools and technologies that may not be available in GitHub hosted runners. In this blog, I explore security, performance and cost best practices to take into consideration when deploying GitHub Self-Hosted Runners to your AWS environment.

Best Practices

Understand your security responsibilities

GitHub Self-hosted runners, by design, execute code defined within a GitHub repository, either through the workflow scripts or through the repository build process. You must understand that the security of your AWS runner execution environments are dependent upon the security of your GitHub implementation. Whilst a complete overview of GitHub security is outside the scope of this blog, I recommended that before you begin integrating your GitHub environment with your AWS environment, you review and understand at least the following GitHub security configurations.

  • Federate your GitHub users, and manage the lifecycle of identities through a directory.
  • Limit administrative privileges of GitHub repositories, and restrict who is able to administer permissions, write to repositories, modify repository configurations or install GitHub Apps.
  • Limit control over GitHub Actions runner registration and group settings
  • Limit control over GitHub workflows, and follow GitHub’s recommendations on using third-party actions
  • Do not allow public repositories access to self-hosted runners

Reduce security risk with short-lived AWS credentials

Make use of short-lived credentials wherever you can. They expire by default within 1 hour, and you do not need to rotate or explicitly revoke them. Short lived credentials are created by the AWS Security Token Service (STS). If you use federation to access your AWS account, assume roles, or use Amazon EC2 instance profiles and Amazon ECS task roles, you are using STS already!

In almost all cases, you do not need long-lived AWS Identity and Access Management (IAM) credentials (access keys) even for services that do not “run” on AWS – you can extend IAM roles to workloads outside of AWS without requiring you to manage long-term credentials. With GitHub Actions, we suggest you use OpenID Connect (OIDC). OIDC is a decentralized authentication protocol that is natively supported by STS using sts:AssumeRoleWithWebIdentity, GitHub and many other providers. With OIDC, you can create least-privilege IAM roles tied to individual GitHub repositories and their respective actions. GitHub Actions exposes an OIDC provider to each action run that you can utilize for this purpose.

Short lived AWS credentials with GitHub self-hosted runners

Short lived AWS credentials with GitHub self-hosted runners

If you have many repositories that you wish to grant an individual role to, you may run into a hard limit of the number of IAM roles in a single account. While I advocate solving this problem with a multi-account strategy, you can alternatively scale this approach by:

  • using attribute based access control (ABAC) to match claims in the GitHub token (such as repository name, branch, or team) to the AWS resource tags.
  • using role based access control (RBAC) by logically grouping the repositories in GitHub into Teams or applications to create fewer subset of roles.
  • use an identity broker pattern to vend credentials dynamically based on the identity provided to the GitHub workflow.

Use Ephemeral Runners

Configure your GitHub Action runners to run in “ephemeral” mode, which creates (and destroys) individual short-lived compute environments per job on demand. The short environment lifespan and per-build isolation reduces the risk of data leakage , even in multi-tenanted continuous integration environments, as each build job remains isolated from others on the underlying host.

As each job runs on a new environment created on demand, there is no need for a job to wait for an idle runner, simplifying auto-scaling. With the ability to scale runners on demand, you do not need to worry about turning build infrastructure off when it is not needed (for example out of office hours), giving you a cost-efficient setup. To optimize the setup further, consider allowing developers to tag workflows with instance type tags and launch specific instance types that are optimal for respective workflows.

There are a few considerations to take into account when using ephemeral runners:

  • A job will remain queued until the runner EC2 instance has launched and is ready. This can take up to 2 minutes to complete. To speed up this process, consider using an optimized AMI with all prerequisites installed.
  • Since each job is launched on a fresh runner, utilizing caching on the runner is not possible. For example, Docker images and libraries will always be pulled from source.

Use Runner Groups to isolate your runners based on security requirements

By using ephemeral runners in a single GitHub runner group, you are creating a pool of resources in the same AWS account that are used by all repositories sharing this runner group. Your organizational security requirements may dictate that your execution environments must be isolated further, such as by repository or by environment (such as dev, test, prod).

Runner groups allow you to define the runners that will execute your workflows on a repository-by-repository basis. Creating multiple runner groups not only allow you to provide different types of compute environments, but allow you to place your workflow executions in locations within AWS that are isolated from each other. For example, you may choose to locate your development workflows in one runner group and test workflows in another, with each ephemeral runner group being deployed to a different AWS account.

Runners by definition execute code on behalf of your GitHub users. At a minimum, I recommend that your ephemeral runner groups are contained within their own AWS account and that this AWS account has minimal access to other organizational resources. When access to organizational resources is required, this can be given on a repository-by-repository basis through IAM role assumption with OIDC, and these roles can be given least-privilege access to the resources they require.

Optimize runner start up time using Amazon EC2 warm-pools

Ephemeral runners provide strong build isolation, simplicity and security. Since the runners are launched on demand, the job will be required to wait for the runner to launch and register itself with GitHub. While this usually happens in under 2 minutes, this wait time might not be acceptable in some scenarios.

We can use a warm pool of pre-registered ephemeral runners to reduce the wait time. These runners will listen to the incoming GitHub workflow events actively and as soon as an incoming workflow event is queued, it is picked up readily by the warm pool of registered EC2 runners.

While there can be multiple strategies to manage the warm pool, I recommend the following strategy which uses AWS Lambda for scaling up and scaling down the ephemeral runners:

GitHub self-hosted runners warm pool flow

GitHub self-hosted runners warm pool flow

A GitHub workflow event is created on a trigger like push of code in a master repository or a merge of pull request. This event triggers a Lambda function via webhook and Amazon API Gateway endpoint. The Lambda function helps in validating the GitHub workflow event payload and log events for observability & building metrics. It can be used optionally to replenish the warm pool. There are separate backend Lambda functions to launch, scale up and scale down the warm pool of EC2 instances. The EC2 instances or runners are registered with GitHub at the time of launch. The registered runners listens for incoming GitHub work flow events using GitHub’s internal job queue and as soon as workflow events are triggered, its assigned by GitHub to one of the runners in warm pool for job execution. The runner is automatically de-registered once the job completes. A job can be a build, or deploy request as defined in your GitHub workflow.

With warm pool in place, it is expected to help reduce wait time by 70-80%.

Considerations

  • Increased complexity as there is a possibility of over provisioning runners. This will depend on how long a runner EC2 instance requires to launch and reach a ready state and how frequently the scale up Lambda is configured to run. For example, if the scale up Lambda runs every 1 minute and the EC2 runner requires 2 minutes to launch, then the scale up Lambda will launch 2 instances. The mitigation is to use Auto scaling groups to manage the EC2 warm pool and desired capacity with predictive scaling policies tying back to incoming GitHub workflow events i.e. build job requests.
  • This strategy may have to be revised when supporting Windows or Mac based runners given the spin up times can vary.

Use an optimized AMI to speed up the launch of GitHub self-hosted runners

Amazon Machine Images (AMI) provide a pre-configured, optimized image that can be used to launch the runner EC2 instance. By using AMIs, you will be able to reduce the launch time of a new runner since dependencies and tools are already installed. Consistency across builds is guaranteed due to all instances running the same version of dependencies and tools. Runners will benefit from increased stability and security compliance as images are tested and approved before being distributed for use as runner instances.

When building an AMI for use as a GitHub self-hosted runner the following considerations need to be made:

  • Choosing the right OS base image for the builds. This will depend on your tech stack and toolset.
  • Install the GitHub runner app as part of the image. Ensure automatic runner updates are enabled to reduce the overhead of managing running versions. In case a specific runner version must be used you can disable automatic runner updates to avoid untested changes. Keep in mind, if disabled, a runner will need to be updated manually within 30 days of a new version becoming available.
  • Install build tools and dependencies from trusted sources.
  • Ensure runner logs are captured and forwarded to your security information and event management (SIEM) of choice.
  • The runner requires internet connectivity to access GitHub. This may require configuring proxy settings on the instance depending on your networking setup.
  • Configure any artifact repositories the runner requires. This includes sources and authentication.
  • Automate the creation of the AMI using tools such as EC2 Image Builder to achieve consistency.

Use Spot instances to save costs

The cost associated with scaling up the runners as well as maintaining a hot pool can be minimized using Spot Instances, which can result in savings up to 90% compared to On-Demand prices. However, there could be requirements where we can have longer running builds or batch jobs that cannot tolerate the spot instance terminating on 2 minutes notice. So, having a mixed pool of instances will be a good option where such jobs should be routed to on-demand EC2 instances and the rest on the Spot instances to cater for diverse build needs. This can be done by assigning labels to the runner during launch /registration. In that case, the on-demand instances will be launched and we can a savings plan in place to get cost benefits.

Record runner metrics using Amazon CloudWatch for Observability

It is vital for the observability of the overall platform to generate metrics for the EC2 based GitHub self-hosted runners. Examples of the GitHub runners metrics can be: the number of GitHub workflow events queued or completed in a minute, or number of EC2 runners up and available in the warm pool etc.

We can log the triggered workflow events and runner logs in Amazon CloudWatch and then use CloudWatch embedded metrics to collect metrics such as number of workflow events queued, in progress and completed. Using elements like “started_at” and “completed_at” timings which are part of workflow event payload we can calculate build wait time.

As an example, below is the sample incoming GitHub workflow event logged in Amazon Cloud Watch Logs

<p> </p><p><code>{</code></p><p><code>"hostname": "xxx.xxx.xxx.xxx",</code></p><p><code>"requestId": "aafddsd55-fgcf555",</code></p><p><code>"date": "2022-10-11T05:50:35.816Z",</code></p><p><code>"logLevel": "info",</code></p><p><code>"logLevelId": 3,</code></p><p><code>"filePath": "index.js",</code></p><p><code>"fullFilePath": "/var/task/index.js",</code></p><p><code>"fileNa<a class="ab-item" href="https://aws-blogs-prod.amazon.com/devops/" aria-haspopup="true">AWS DevOps Blog</a>me": "index.js",</code></p><p><code>"lineNumber": 83889,</code></p><p><code>"columnNumber": 12,</code></p><p><code>"isConstructor": false,</code></p><p><code>"functionName": "handle",</code></p><p><code>"argumentsArray": [</code></p><p><code>"Processing Github event",</code></p><p><code>"{\"event\":\"workflow_job\",\"repository\":\"testorg-poc/github-actions-test-repo\",\"action\":\"queued\",\"name\":\"jobname-buildanddeploy\",\"status\":\"queued\",\"started_at\":\"2022-10-11T05:50:33Z\",\"completed_at\":null,\"conclusion\":null}"</code></p><p><code>]</code></p><p><code>}</code></p>

In order to use the logged elements of above log into metrics by capturing \”status\”:\”queued\”,\”repository\”:\”testorg-poc/github-actions-test-repo\c, \”name\”:\”jobname-buildanddeploy\” ,and workflow \”event\” , one can build embedded metrics in the application code or AWS metrics Lambda using any of the cloud watch metrics client library Creating logs in embedded metric format using the client libraries – Amazon CloudWatch based on the language of your choice listed.c

Essentially what one of those libraries will do under the hood is map elements from Log event into dimension fields so cloud watch can then read and generate a metric using that.

console.log(<br />      JSON.stringify({<br />        message: '[Embedded Metric]', // Identifier for metric logs in CW logs<br />        build_event_metric: 1, // Metric Name and value<br />        status: `${status}`, // Dimension name and value<br />        eventName: `${eventName}`,<br />        repository: `${repository}`,<br />        name: `${name}`,<br />        <br />        _aws: {<br />          Timestamp: Date.now(),<br />          CloudWatchMetrics: [<br />            {<br />              Namespace: `demo_2`,<br />              Dimensions: [['status','eventName','repository','name']],<br />              Metrics: [<br />                {<br />                  Name: 'build_event_metric',<br />                  Unit: 'Count',<br />                },<br />              ],<br />            },<br />          ],<br />        },<br />      })<br />    );

A sample architecture:

Consumption of GitHub webhook events

Consumption of GitHub webhook events

The cloud watch metrics can be published to your dashboards or forwarded to any external tool based on requirements. Once we have metrics, CloudWatch alarms and notifications can be configured to manage pool exhaustion.

Conclusion

In this blog post, we outlined several best practices covering security, scalability and cost efficiency when using GitHub Actions with EC2 self-hosted runners on AWS. We covered how using short-lived credentials combined with ephemeral runners will reduce security and build contamination risks. We also showed how runners can be optimized for faster startup and job execution AMIs and warm EC2 pools. Last but not least, cost efficiencies can be maximized by using Spot instances for runners in the right scenarios.

Resources:

Not all “open source” AI models are actually open (Nature)

Post Syndicated from corbet original https://lwn.net/Articles/979609/

Nature looks
at a recent paper
on the openness of “open-source” language
models.

It is not yet clear how many of these models will fit the EU’s
definition of open source. Under the act, this would refer to
models that are released under a “free and open” licence that, for
example, allows users to modify a model but says nothing about
access to training data. Refining this definition will probably
form “a single pressure point that will be targeted by corporate
lobbies and big companies”, the paper says.

From Top Dogs to Unified Pack

Post Syndicated from Rapid7 original https://blog.rapid7.com/2024/06/25/from-top-dogs-to-unified-pack-2/

Embracing a consolidated security ecosystem

From Top Dogs to Unified Pack

Authored by Ralph Wascow

Cybersecurity is as unpredictable as it is rewarding. Each day often presents a new set of challenges and responsibilities, particularly as organizations accelerate digital transformation efforts. This means you and your cyber team may find yourselves navigating a complex landscape of multi-cloud environments and evolving compliance requirements.

So how does that translate into what cyber professionals have to deal with on a daily basis?

A Day in the Life of a Security Professional

In the Trenches

The responsibility of safeguarding sensitive data and protecting that very same data can create a constant pressure to stay one step ahead – of many things. Teams defending environments often face high stress levels and tight deadlines. Unsurprisingly, the demand for skilled security leaders often outpaces the supply of personnel. This is where an array of tools and solutions are introduced to support those teams. And while there are many positives to be had, security teams are often overrun by an array of solutions and vendors, creating increased complexity and vulnerabilities in their organisation’s risk posture.

Multiple Vendors Often Means More Work

Using different vendors and solutions for various security functions can help keep things fresh, but it can also be time-consuming and cumbersome. And rather than help teams, it may lead to a decrease in performance. With each platform and tool requiring its own resources, the overall efficiency of your infrastructure and processes may suffer. These performance issues can impact critical business operations and hinder productivity. For instance, by the time you receive a threat alert, the attacker could already be hard at work.

Security analysts require a streamlined work environment that enables them to understand the root cause of alerts from any source with a single click. They shouldn’t have to waste time switching between multiple tools to investigate and remediate potential threats. And when belts start tightening and resources become scarce, managing multiple vendors with different payment cycles can become frustrating.

It pays to find ways to create a security ecosystem without sacrificing the efficacy of its components. By reducing the number of disparate cyber solutions, security professionals can optimise effectiveness and efficiency, subsequently enhancing security posture and reducing their risk profile.

What are the Benefits of a Unified Security Ecosystem?

Widening visibility into your entire IT environment strengthens threat detection capabilities, allowing security teams to minimise the impact of potential cyberattacks. In fact, 41% of organisations surveyed by Gartner say consolidating security solutions improved their risk posture. For some organisations still clinging to the status quo of best of breed solutions, consider the following consolidation benefits when trying to gain executive-buy-in.

Identify Systems and Applications at Risk

A robust vulnerability management program should be your first port of call to help identify any systems or applications potentially at risk. It provides your security team with critical insight into potential weaknesses in your IT infrastructure and overall network. Importantly, it will enable you to properly manage and patch vulnerabilities that pose risks to the network, protecting your organisation from the possibility of a breach.

Safeguard an Evolving Landscape with Real-time Monitoring

Continuous scanning and testing of applications are vital components of a robust security strategy. Consolidating your security tech stack into a centralised ecosystem offers the ability to monitor your infrastructure in real-time and receive in-depth reports for better cross-team collaboration. Actionable insight gained will give you and your security team the autonomy you need to stay ahead of evolving risks and proactively address potential vulnerabilities.

Broaden Visibility and Contextual Understanding

Avoid leaving your security team with isolated alerts that require manual investigation and correlation. Integrating data from multiple sources, including endpoints, networks, cloud environments, and applications offers a comprehensive view and analysis of threats across different layers of the IT environment. This holistic approach allows for better correlation of data across various vectors, uncovering complex attack patterns that might otherwise go unnoticed. Consider broadening your context with threat intelligence, providing information about actor groups, typical targets, TTP’s, and more.

Automate Threat Hunting and Distinguish Friend from Foe

In the face of ever-evolving threats, automating threat hunting becomes a crucial capability. By integrating automation within your consolidated security ecosystem, you’ll be able to quickly discern whether incoming threats are benign or malicious. Streamlined processes allow for efficient identification of potential risks, enabling you and your team to prioritize your efforts for activities that require human effort.

Prioritize Risk and Simplify Workflows

The sheer volume of security alerts can overwhelm even the most robust security operations. A consolidated security ecosystem mitigates this challenge by automatically grouping related alerts and prioritising events that demand immediate attention. Unifying and visualizing activities in one place more rapidly identify the root causes of threats and their potential impact. Armed with this knowledge, you can assess the scope of an incident efficiently, build a timeline of the attack, and take swift, targeted action to effectively neutralize the threat.

Swiftly Investigate with End-to-end Digital Forensics

Incident resolution demands a thorough understanding of the attack’s entry point and the ability to track down any traces left by adversaries. With a consolidated security ecosystem, conduct swift and comprehensive investigations using end-to-end digital forensics and review key artefacts such as event logs, registry keys, and browser history across your entire IT environment — significantly enhancing your incident response capabilities. A full view of attacker activity can help you determine the extent of the compromise, identify weaknesses in your defenses, and take appropriate remedial actions.

Coordinate Responses with Remediation and Policy Enforcement

Enable coordinated responses and future-proof defences by integrating prevention technologies across your entire tech stack. Leverage communication between various security components and take decisive action against active threats in real-time. For example, an attack blocked on the network can automatically update policies on endpoints, ensuring consistent security measures across your infrastructure. This proactive approach to security ultimately reduces the risk of successful cyberattacks.

Consolidate to Mitigate

With a rapidly changing threat landscape, consolidation offers the security improvements your organisation needs to give it the balance of power. Simplifying and streamlining your cybersecurity solutions begins with gaining visibility into your tech stack. This enables your team to identify where consolidation can improve your team’s productivity and effectiveness in detecting and mitigating risk.

How Rapid7 Can Help: Managed Threat Complete

Managed Threat Complete offers a simplified security stack, fuelling your D&R program to give you a 24x7x365 SOC, IR, XDR technology, SIEM, SOAR, threat intelligence, and unlimited VRM in a single service. This ensures your environment is monitored round-the-clock and end-to-end by an elite SOC that works transparently with your in-house team, helping to further expand your resources.

Learn more.

Security updates for Tuesday

Post Syndicated from corbet original https://lwn.net/Articles/979606/

Security updates have been issued by AlmaLinux (python3.11), Debian (composer), Fedora (thunderbird), Mageia (chromium-browser-stable, python-aiohttp, python-gunicorn, python-werkzeug, and virtualbox), Oracle (libreswan and python3.11), Red Hat (git, kpatch-patch, python3.11, python3.9, and thunderbird), and SUSE (avahi, ghostscript, grafana and mybatis, hdf5, kernel, openssl-1_1-livepatches, python-docker, and wget).

Is Ransomware Protection Working?

Post Syndicated from Bozho original https://techblog.bozho.net/is-ransomware-protection-working/

Recently an organization I’m familiar with suffered a ransomware attack. A LockBit ransomware infected a workstation and spread to a few servers. The attack was contained thanks to a quick incident response, but there was one important thing: the organization had a top-notch EDR installed. It’s in the Gartner’s leader quadrant for endpoint protection platforms. And it didn’t stop it. I won’t mention names, because the point here is not to attack a particular brand, but the fact that the attackers created a different build of the LockBit codebase meant that the signature-based detection didn’t work and so the files on the endpoints were encrypted. Not only that – attackers downloaded port scanners and executed password spraying on the network, in order to pivot to the servers, without much trouble caused by the EDR.

Ransomware is the most widespread type of attack, we have a security industry that’s trying to cope with those attacks for a long time, and in 2024 it’s sufficient to just rebuild the ransomware from source in order to not get caught. That’s a problem. To add insult to injury, the malware executable was called lockbit.exe_.

I think we need to think of better ways to tackle ransomware. I’m not an antivirus/EDR in-depth expert, but I have been thinking the following: why doesn’t the operating system have a built-in anti-ransomware functionality? The malware can stop endpoint agents, but it can’t stop the OS (well, it can stop OS services or modify OS dlls, I agree). I was thinking of native NTFS listeners to detect ransomware behavior – very fast enumeration and modification of files. Then I read this article titled “Microsoft Can Fix Ransomware Tomorrow” which echoed my thoughts. It’s not that simple as the title implies, and the article goes to argue that it’s hard, but something has to be done.

This may seriously impact the EDR market – if the top threat – ransomware – is automatically contained by built-in functionality, why purchase an EDR? Obviously, there are many other reasons to do so, but every budget cut will start from the EDR. Ransomware has been a successful driver for more investment in security. To prevent putting too much functionality in the OS, EDRs can tap into the detection capability of the OS to take further actions.

But ransomware works in a very simple way. Detecting mass-encryption of files is easy. Rate-limiting it is reasonable. You may be aware, but CPUs have AES-specific instructions that improve the performance of AES encryption – the fact that these instructions are used may be a further indicator of an ongoing ransomware attack.

I think we have to tap into the lower levels of our software stack – file system, OS, CPU – in order to tackle ransomware. It’s clear that omnipotent endpoint agents don’t exist, no matter how much “AI magic” you sprinkle ontop of them in the data sheet (I remember once asking another EDR vendor on a conference how exactly their AI is working. My impression from the answer was: it’s rules).

As I said, I’m no AV/EDR expert, and I expect comments like “But we are already doing that”. I’m sure APIs like this one are utilized by AV/EDRs, but they may be too slow or too heavy to be used. And it may mean that this APIs can be optimized for the ransomware-detection usecase. I don’t have a ready answer (otherwise I’d be an EDR vendor), but I’d welcome a serious discussion on that. We can’t be in a situation where purchasing an expensive security tool doesn’t reliably solve the most prominent threat – “off-the-shelf” ransomware.

The post Is Ransomware Protection Working? appeared first on Bozho's tech blog.

Create anytime, anywhere with OctoStudio

Post Syndicated from Mitch Resnick & Natalie Rusk original https://www.raspberrypi.org/blog/octostudio-app/

Today our friends Mitch Resnick and Natalie Rusk from MIT’s Lifelong Kindergarten group tell you about OctoStudio, their free mobile app for children to create with code. Find their companion article for teachers in the upcoming issue of Hello World magazine, out for free on Monday 1 July.

When people see our new OctoStudio coding app, they often say that it reminds them of Scratch, the world’s most popular coding platform for kids. That’s not surprising, since the group of us developing OctoStudio were also involved in creating Scratch, with its distinctive building-block approach to programming. But there’s an important difference.

A young person connects coding blocks in their OctoStudio phone app.
A young person connects coding blocks to animate their OctoStudio project. Credit: MIT Media Lab

The difference is that we designed OctoStudio specifically for mobile phones and tablets, based on requests from educators in communities where children and families don’t have access to laptops and desktop computers, but do have access to mobile devices. 

OctoStudio takes advantage of special features of mobile phones and tablets, such as built-in sensors, so young people can create projects that respond to shaking or tilting, or even ‘beam’ signals between devices. And because of the small size of mobile devices, children and families can create projects anytime anywhere, and integrate digital coding with physical making.

OctoStudio makes it easy for beginners to start creating. Children can choose a character from a diverse collection of emojis, draw their own in the OctoStudio paint editor, or take and edit a photo. With just a couple coding blocks, they can make their characters move, jump, speak, or glow — and respond to shaking, tilting, or tapping on the phone or tablet:

A short OctoStudio blocks script.
A short OctoStudio blocks script.

Since our Lifelong Kindergarten group at the MIT Media Lab launched OctoStudio as a free app in October 2023, we’ve been delighted by the creativity and diversity of projects that children around the world have created with OctoStudio. As examples, we’d like to share with you three different projects from three different continents.

Getting active with OctoStudio 

When Xavier, a 10-year-old in Rwanda, started using OctoStudio, he was intrigued with the ‘When I shake’ block. He realized that he could create a step tracker project, by sensing how the phone shook each time he took a step. 

From the emoji library in OctoStudio, Xavier selected a rabbit, and he programmed it to grow a little bit each time he took a step. The more steps, the bigger the rabbit. To test the project, Xavier ran around in a circle. When he looked at the rabbit again, he saw how big it had grown and exclaimed: “Now it’s mega huge!” After finishing his project, Xavier made and posted a video tutorial to show others how to make their own step tracker using only 5 coding blocks.

Making creatures come to life on screen

One popular way to get started with OctoStudio is to make a favorite animal out of craft materials, take a photo of it, then bring your creation to life on the screen with OctoStudio coding blocks. As part of the Brazilian Creative Learning Network, educators Renato Barboza and Simone Lederman offer creative learning workshops in which children design creatures using a combination of natural materials and modeling clay. In these ‘fantastical creatures’ workshops, facilitators ask questions to encourage participants to design not only the creatures, but also develop ideas about how their creatures interact within their environment.

A girl holds up a winged creature she has grafted.

For example, two sisters created imaginary creatures, one with long sticks for arms, the other with big eyes and wings made from leaves. The sisters then took photos and made their creatures come to life in OctoStudio, making them jump, glow, and fly. They recorded sounds and explained more about their creatures, including where they live and what they like to eat.

A child uses the OctoStudio app on a mobile phone.

Beaming between devices

OctoStudio also opens up the possibility of projects involving multiple mobile devices, using the new ‘beam’ block to send signals between the devices (via Bluetooth). For example, children can make a character in a story or game look like it’s jumping from one device to another by sending a beam signal when the character reaches the edge of the screen.

Thawin, an elementary school student in Thailand, decided to use the ‘beam’ block to create a project about caring for the environment. He embedded one tablet in a cardboard cutout of a watering can, and programmed it to beam a signal each time he shook it as if he were sprinkling water. Then, he added a tree emoji to another tablet, and programmed the tree to grow each time it received a beam signal. He proudly shared his project with his classmates: each time someone shook the watering can, the tree grew.

Get started with OctoStudio

To get started with OctoStudio, you can download it for free from app stores for Android and iOS phones and tablets. The app is translated into more than 25 languages, and comes with sample projects and mini-tutorials. 

Here are some resources for learning and exploring more:

You can share your OctoStudio stories, photos, and videos on social media using @octostudioapp or #octostudio. We can’t wait to hear about your and your children’s experiences!

The post Create anytime, anywhere with OctoStudio appeared first on Raspberry Pi Foundation.

Monitoring Configuration Backups with Zabbix and Oxidized

Post Syndicated from Brian van Baekel original https://blog.zabbix.com/monitoring-configuration-backups-with-zabbix-and-oxidized/28260/

As a Zabbix partner, we help customers worldwide with all their Zabbix needs, ranging from building a simple template all the way to (massive) turn-key implementations, trainings, and support contracts. Quite often during projects, we get the question, “How about making configuration backups of our network equipment? We need this, as another tool was also capable of doing this!”

The answer is always the same – yes, but no. Yes, technically it is possible to get your configuration backups in Zabbix. It’s not even that hard to set up initially. However, you really should not want configuration backups. Zabbix is not made for them, and you will run into limitations within minutes. As you can imagine, the customer is never happy with this limitation, and some actively start to question where we think the limitation is to see if it is a limitation for them as well. So we simply set up an SSH agent item and get that config out:

Voila! Once per hour Zabbix will log in to that device, execute the command ‘show full-configuration,’ and get back the result. Frankly, it just works. You check the Monitoring -> Latest data section of the host and see that there is data for this item. Problem solved, right?

No. As a matter of fact, this is where the problems start. Zabbix allows us to store up to 64KB of data in a item value. The above screenshot is of a (small) fortigate firewall and the config if stored in a text file is just over 1.1MB. So, Zabbix truncates the data, which renders the backup useless –  restore will never work. At the same time, Zabbix is not sanitizing the output, so all secrets are left in it.

To make it even worse, it’s challenging to make a diff of different config versions/revisions – that feature is just not there. Most of the time, the customer is at this point convinced that Zabbix is not the right tool and the next question pops up – “Now what? How can we fix this?” This is where our added value is presented, as we do have a solution here which is rather affordable (free) as well.

The solution is Oxidized, which is basically Rancid on steroids. This project started years ago and is released under the Apache 2.0 license. We found it by accident, started playing around with it, and never left it. The project is available on Github (https://github.com/ytti/oxidized) and written in Ruby. Incidentally, if you (or your company) have Ruby devs and want to give something back to the community, the Oxidized project is looking for extra maintainers!

At this point, we show our customers the GUI of Oxidized, which in our case involves just making backups of a few firewalls:

So we have the name, the model, and (in this case) just one group. The status shows whether the backup was successful or not, the last update and when the last change was detected. At the same time, under actions, we can get the full config file, look at previous revisions(and diff them) combined with a ‘execute now’ option.

Looking at the versions, it’s simply showing this:

This is already giving us a nice idea of what is going on. We see the versions and dates at a glance, but the moment we check the diff option, we can easily see what was actually changed:

The perfect solution, except that it is not integrated with Zabbix. That means double administration and a lot of extra work, combined with the inevitable errors – devices not added, credential mismatches, connection errors, etc. Luckily, we can easily change the format of the above information from GUI to json by just adding ‘.json’ at the end of the url:

http://<IP/DNS>:<PORT>/nodes.json

This will give the following output:

[
  {
    "name": "fw-mid-01",
    "full_name": "fw-mid-01",
    "ip": "192.168.4.254",
    "group": "default",
    "model": "FortiOS",
    "last": {
      "start": "2024-06-13 06:46:14 UTC",
      "end": "2024-06-13 06:46:54 UTC",
      "status": "no_connection",
      "time": 40.018852483
    },
    "vars": null,
    "mtime": "unknown",
    "status": "no_connection",
    "time": "2024-06-13 06:46:54 UTC"
  },
  {
    "name": "FW-HUNZE",
    "full_name": "FW-HUNZE",
    "ip": "192.168.0.254",
    "group": "default",
    "model": "FortiOS",
    "last": {
      "start": "2024-06-13 06:46:54 UTC",
      "end": "2024-06-13 06:47:04 UTC",
      "status": "success",
      "time": 10.029043912
    },
    "vars": null,
    "mtime": "2024-06-13 06:47:05 UTC",
    "status": "success",
    "time": "2024-06-13 06:47:04 UTC"
  }
]

As you might know, Zabbix is perfectly capable of parsing json formats and creating items and triggers out of them. A master item, dependent lld (https://blog.zabbix.com/low-level-discovery-with-dependent-items/13634/), and within minutes you’ve got Oxidized making configuration backups while Zabbix is monitoring and alerting on the status:

At this point we’re getting close to a nice integration, but we haven’t overcome the double configuration management yet.

Oxidized can read its configuration from multiple sources, including a CSV file, SQL, SQLite, MySQL or HTTP. The easiest is a CSV file – just make sure you’ve got all information in the correct column and it works. An example:

Oxidized config:

source:
  default: csv
  csv:
    file: /var/lib/oxidized/router.db
    delimiter: !ruby/regexp /:/
    map:
      name: 0
      ip: 1
      model: 2
      username: 3
      password: 4
    vars_map:
      enable: 5

CSV file:

rtr01.local:192.168.1.1:ios:oxidized:5uP3R53cR3T:T0p53cR3t

Great, now we have to configure 2 places (Zabbix and Oxidized) and get a username/password cleartext in a CSV file. What about SQL as a source, and letting it connect to Zabbix? From there we should be able to get information regarding the hostname, but somehow we need the credentials as well. That’s not a default piece of information in Zabbix, but UserMacros can save us here.

So on our host we add 2 extra macros:

At the same time, we need to tell Oxidized what kind of device it is. There are multiple ways of doing this, obviously. A tag, a usermacro, hostgroups, you name it. In order to do this, we place a tag on the host:

Now we make sure that Oxidized is only taking hosts with the tag ‘oxidized’ and extract from them the host name, IP address, model, username, and password:

+-----------+----------------+---------+------------+------------------------------+
| host      | ip             | model   | username   | password                     |
+-----------+----------------+---------+------------+------------------------------+
| fw-mid-01 | 192.168.4.254  | fortios | <redacted> | <redacted>                   |
| FW-HUNZE  | 192.168.0.254  | fortios | <redacted> | <redacted>                   |
+-----------+----------------+---------+------------+------------------------------+

This way, we simply add our host in Zabbix, add the SSH credentials, and Oxidized will pick it up the next time a backup is scheduled. Zabbix will immediately start monitoring the status of those jobs and alert you if something fails.

This blog post is not meant as a complete integration write down, but rather as a way to give some insight into how we as a partner operate in the field, taking advantage of the flexibility of the products we work with. This post should give you enough information to build it yourself, but of course we’re always available to help you or just build it as part of our consultancy offering.

 

The post Monitoring Configuration Backups with Zabbix and Oxidized appeared first on Zabbix Blog.