Tag Archives: cloud

Zero Configuration Service Mesh with On-Demand Cluster Discovery

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/zero-configuration-service-mesh-with-on-demand-cluster-discovery-ac6483b52a51

by David Vroom, James Mulcahy, Ling Yuan, Rob Gulewich

In this post we discuss Netflix’s adoption of service mesh: some history, motivations, and how we worked with Kinvolk and the Envoy community on a feature that streamlines service mesh adoption in complex microservice environments: on-demand cluster discovery.

A brief history of IPC at Netflix

Netflix was early to the cloud, particularly for large-scale companies: we began the migration in 2008, and by 2010, Netflix streaming was fully run on AWS. Today we have a wealth of tools, both OSS and commercial, all designed for cloud-native environments. In 2010, however, nearly none of it existed: the CNCF wasn’t formed until 2015! Since there were no existing solutions available, we needed to build them ourselves.

For Inter-Process Communication (IPC) between services, we needed the rich feature set that a mid-tier load balancer typically provides. We also needed a solution that addressed the reality of working in the cloud: a highly dynamic environment where nodes are coming up and down, and services need to quickly react to changes and route around failures. To improve availability, we designed systems where components could fail separately and avoid single points of failure. These design principles led us to client-side load-balancing, and the 2012 Christmas Eve outage solidified this decision even further. During these early years in the cloud, we built Eureka for Service Discovery and Ribbon (internally known as NIWS) for IPC. Eureka solved the problem of how services discover what instances to talk to, and Ribbon provided the client-side logic for load-balancing, as well as many other resiliency features. These two technologies, alongside a host of other resiliency and chaos tools, made a massive difference: our reliability improved measurably as a result.

Eureka and Ribbon presented a simple but powerful interface, which made adopting them easy. In order for a service to talk to another, it needs to know two things: the name of the destination service, and whether or not the traffic should be secure. The abstractions that Eureka provides for this are Virtual IPs (VIPs) for insecure communication, and Secure VIPs (SVIPs) for secure. A service advertises a VIP name and port to Eureka (e.g., myservice, port 8080), or an SVIP name and port (e.g., myservice-secure, port 8443), or both. IPC clients are instantiated targeting that VIP or SVIP, and the Eureka client code handles the translation of that VIP to a set of IP and port pairs by fetching them from the Eureka server. The client can also optionally enable IPC features like retries or circuit breaking, or stick with a set of reasonable defaults.
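
The post does not show the client code, so the following is only a rough illustration of the kind of lookup the Eureka client performs under the hood, using the open-source Eureka REST API (the base URL is a placeholder, and the exact endpoints of Netflix’s internal deployment may differ):

curl -s -H 'Accept: application/json' \
  'http://eureka.example.internal:8080/eureka/v2/vips/myservice'          # instances registered for the VIP
curl -s -H 'Accept: application/json' \
  'http://eureka.example.internal:8080/eureka/v2/svips/myservice-secure'  # instances registered for the SVIP

The IPC client caches this registry information locally and load-balances across the returned IP and port pairs.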

A diagram showing an IPC client in a Java app directly communicating to hosts registered as SVIP A. Host and port information for SVIP A is fetched from Eureka by the IPC client.

In this architecture, service to service communication no longer goes through the single point of failure of a load balancer. The downside is that Eureka is a new single point of failure as the source of truth for what hosts are registered for VIPs. However, if Eureka goes down, services can continue to communicate with each other, though their host information will become stale over time as instances for a VIP come up and down. The ability to run in a degraded but available state during an outage is still a marked improvement over completely stopping traffic flow.

Why mesh?

The above architecture has served us well over the last decade, though changing business needs and evolving industry standards have added more complexity to our IPC ecosystem in a number of ways. First, we’ve grown the number of different IPC clients. Our internal IPC traffic is now a mix of plain REST, GraphQL, and gRPC. Second, we’ve moved from a Java-only environment to a polyglot one: we now also support Node.js, Python, and a variety of OSS and off-the-shelf software. Third, we’ve continued to add more functionality to our IPC clients: features such as adaptive concurrency limiting, circuit breaking, hedging, and fault injection have become standard tools that our engineers reach for to make our system more reliable. Compared to a decade ago, we now support more features, in more languages, in more clients. Keeping feature parity between all of these implementations and ensuring that they all behave the same way is challenging: what we want is a single, well-tested implementation of all of this functionality, so we can make changes and fix bugs in one place.

This is where service mesh comes in: we can centralize IPC features in a single implementation, and keep per-language clients as simple as possible: they only need to know how to talk to the local proxy. Envoy is a great fit for us as the proxy: it’s a battle-tested OSS product in use at high scale across the industry, with many critical resiliency features, and good extension points for when we need to extend its functionality. The ability to configure proxies via a central control plane is a killer feature: this allows us to dynamically configure client-side load balancing as if it were a central load balancer, but still avoids a load balancer as a single point of failure in the service to service request path.

Moving to mesh

Once we decided that moving to service mesh was the right bet to make, the next question became: how should we go about moving? We decided on a number of constraints for the migration. First: we wanted to keep the existing interface. The abstraction of specifying a VIP name plus secure serves us well, and we didn’t want to break backwards compatibility. Second: we wanted to automate the migration and to make it as seamless as possible. These two constraints meant that we needed to support the Discovery abstractions in Envoy, so that IPC clients could continue to use them under the hood. Fortunately, Envoy had ready-to-use abstractions for this. VIPs could be represented as Envoy Clusters, and proxies could fetch them from our control plane using the Cluster Discovery Service (CDS). The hosts in those clusters are represented as Envoy Endpoints, and could be fetched using the Endpoint Discovery Service (EDS).

We soon ran into a stumbling block to a seamless migration: Envoy requires that clusters be specified as part of the proxy’s config. If service A needs to talk to clusters B and C, then you need to define clusters B and C as part of A’s proxy config. This can be challenging at scale: any given service might communicate with dozens of clusters, and that set of clusters is different for every app. In addition, Netflix is always changing: we’re constantly adding new initiatives like live streaming, ads and games, and evolving our architecture. This means the clusters that a service communicates with will change over time. There are a number of different approaches to populating cluster config that we evaluated, given the Envoy primitives available to us:

  1. Get service owners to define the clusters their service needs to talk to. This option seems simple, but in practice, service owners don’t always know, or want to know, what services they talk to. Services often import libraries provided by other teams that talk to multiple other services under the hood, or communicate with other operational services like telemetry and logging. This means that service owners would need to know how these auxiliary services and libraries are implemented under the hood, and adjust config when they change.
  2. Auto-generate Envoy config based on a service’s call graph. This method is simple for pre-existing services, but is challenging when bringing up a new service or adding a new upstream cluster to communicate with.
  3. Push all clusters to every app: this option was appealing in its simplicity, but back of the napkin math quickly showed us that pushing millions of endpoints to each proxy wasn’t feasible.

Given our goal of a seamless adoption, each of these options had significant enough downsides that we explored another option: what if we could fetch cluster information on-demand at runtime, rather than predefining it? At the time, the service mesh effort was still being bootstrapped, with only a few engineers working on it. We approached Kinvolk to see if they could work with us and the Envoy community in implementing this feature. The result of this collaboration was On-Demand Cluster Discovery (ODCDS). With this feature, proxies can look up cluster information the first time they attempt to connect to a cluster, rather than having all of the clusters predefined in config.

With this capability in place, we needed to give the proxies cluster information to look up. We had already developed a service mesh control plane that implements the Envoy XDS services. We then needed to fetch service information from Eureka in order to return it to the proxies. We represent Eureka VIPs and SVIPs as separate Envoy Cluster Discovery Service (CDS) clusters (so service myservice may have clusters myservice.vip and myservice.svip). Individual hosts in a cluster are represented as separate Endpoint Discovery Service (EDS) endpoints. This allows us to reuse the same Eureka abstractions, and IPC clients like Ribbon can move to mesh with minimal changes. With both the control plane and data plane changes in place, the flow works as follows:

  1. Client request comes into Envoy
  2. Extract the target cluster based on the Host / :authority header (the header used here is configurable, but this is our approach). If that cluster is known already, jump to step 7
  3. The cluster doesn’t exist, so we pause the in-flight request
  4. Make a request to the Cluster Discovery Service (CDS) endpoint on the control plane. The control plane generates a customized CDS response based on the service’s configuration and Eureka registration information
  5. Envoy gets back the cluster (CDS), which triggers a pull of the endpoints via Endpoint Discovery Service (EDS). Endpoints for the cluster are returned based on Eureka status information for that VIP or SVIP
  6. Client request unpauses
  7. Envoy handles the request as normal: it picks an endpoint using a load-balancing algorithm and issues the request

This flow is completed in a few milliseconds, but only on the first request to the cluster. Afterward, Envoy behaves as if the cluster was defined in the config. Critically, this system allows us to seamlessly migrate services to service mesh with no configuration required, satisfying one of our main adoption constraints. The abstraction we present continues to be VIP name plus secure, and we can migrate to mesh by configuring individual IPC clients to connect to the local proxy instead of the upstream app directly. We continue to use Eureka as the source of truth for VIPs and instance status, which allows us to support a heterogeneous environment of some apps on mesh and some not while we migrate. There’s an additional benefit: we can keep Envoy memory usage low by only fetching data for clusters that we’re actually communicating with.
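
To make steps 4 and 5 of the flow more concrete, the CDS response for an on-demand lookup essentially contains a cluster definition that points at EDS for its endpoints. The sketch below uses Envoy’s v3 API field names with placeholder values; the control-plane cluster name and timeout are assumptions, and the post does not show the actual responses Netflix’s control plane generates:

{
  "name": "myservice.svip",
  "type": "EDS",
  "connect_timeout": "1s",
  "eds_cluster_config": {
    "eds_config": {
      "resource_api_version": "V3",
      "api_config_source": {
        "api_type": "GRPC",
        "transport_api_version": "V3",
        "grpc_services": [ { "envoy_grpc": { "cluster_name": "mesh_control_plane" } } ]
      }
    }
  }
}

The EDS response that follows is a ClusterLoadAssignment for the same cluster name, carrying the individual host addresses taken from Eureka for that VIP or SVIP.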

A diagram showing an IPC client in a Java app communicating through Envoy to hosts registered as SVIP A. Cluster and endpoint information for SVIP A is fetched from the mesh control plane by Envoy. The mesh control plane fetches host information from Eureka.

There is a downside to fetching this data on-demand: this adds latency to the first request to a cluster. We have run into use-cases where services need very low-latency access on the first request, and adding a few extra milliseconds adds too much overhead. For these use-cases, the services need to either predefine the clusters they communicate with, or prime connections before their first request. We’ve also considered pre-pushing clusters from the control plane as proxies start up, based on historical request patterns. Overall, we feel the reduced complexity in the system justifies the downside for a small set of services.

We’re still early in our service mesh journey. Now that we’re using it in earnest, there are many more Envoy improvements that we’d love to work with the community on. The porting of our adaptive concurrency limiting implementation to Envoy was a great start — we’re looking forward to collaborating with the community on many more. We’re particularly interested in the community’s work on incremental EDS. EDS endpoints account for the largest volume of updates, and this puts undue pressure on both the control plane and Envoy.

We’d like to give a big thank-you to the folks at Kinvolk for their Envoy contributions: Alban Crequy, Andrew Randall, Danielle Tal, and in particular Krzesimir Nowak for his excellent work. We’d also like to thank the Envoy community for their support and razor-sharp reviews: Adi Peleg, Dmitri Dolguikh, Harvey Tuch, Matt Klein, and Mark Roth. It’s been a great experience working with you all on this.

This is the first in a series of posts on our journey to service mesh, so stay tuned. If this sounds like fun, and you want to work on service mesh at scale, come work with us — we’re hiring!



Putting the Bare Metal Server in the PhoenixNAP Bare Metal Cloud

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/putting-the-bare-metal-server-in-the-phoenixnap-bare-metal-cloud-intel-xeon-sapphire-rapids-supermicro/

We install a special server into the PhoenixNAP Bare Metal Cloud so we can show an instance’s lifecycle from hardware to operation


Let’s Architect! Architecting with custom chips and accelerators

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-custom-chips-and-accelerators/

It’s hard to imagine a world without computer chips. They are at the heart of the devices that we use to work and play every day. Currently, Amazon Web Services (AWS) is offering customers the next generation of computer chips, with lower cost, higher performance, and a reduced carbon footprint.

This edition of Let’s Architect! focuses on custom computer chips, accelerators, and technologies developed by AWS, such as the AWS Nitro System, custom-designed Arm-based AWS Graviton processors that support data-intensive workloads, and the AWS Trainium and AWS Inferentia chips optimized for machine learning training and inference.

In this post, we discuss these new AWS technologies, their main characteristics, and how to take advantage of them in your architecture.

Deliver high performance ML inference with AWS Inferentia

As deep learning models become increasingly large and complex, the cost of training them increases, as does the inference time for serving them.

With AWS Inferentia, machine learning practitioners can deploy complex neural-network models that are built and trained on popular frameworks, such as TensorFlow, PyTorch, and MXNet, on AWS Inferentia-based Amazon EC2 Inf1 instances.

This video introduces you to the main concepts of AWS Inferentia, a custom chip designed to reduce both cost and latency for inference. To speed up inference, AWS Inferentia shares a model across multiple chips, places the pieces inside the on-chip cache, and then streams the data through a pipeline for low-latency predictions.

The presenters walk through the structure of the chip and software considerations, and share anecdotes from the Amazon Alexa team, who use AWS Inferentia to serve predictions. If you want to learn more about high throughput coupled with low latency, explore Achieve 12x higher throughput and lowest latency for PyTorch Natural Language Processing applications out-of-the-box on AWS Inferentia on the AWS Machine Learning Blog.

AWS Inferentia shares a model across different chips to speed up inference

AWS Lambda Functions Powered by AWS Graviton2 Processor – Run Your Functions on Arm and Get Up to 34% Better Price Performance

AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service, without provisioning or managing servers. Lambda uses a high-availability compute infrastructure and performs all of the administration of the compute resources, including server- and operating-system maintenance, capacity provisioning, and automatic scaling and logging.

AWS Graviton processors are designed to deliver the best price performance for cloud workloads. AWS Graviton3 processors are the latest in the AWS Graviton processor family and provide up to 25% higher compute performance, up to two times higher floating-point performance, and up to two times faster cryptographic workload performance compared with AWS Graviton2 processors. This means you can migrate AWS Lambda functions to Graviton in minutes and get as much as 19% better performance at approximately 20% lower cost (compared with x86).
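
To illustrate how small that migration can be for functions without architecture-specific dependencies, switching an existing function to Graviton is a single configuration change (the function name below is a placeholder; functions with native binaries also need their artifacts rebuilt for arm64):

aws lambda update-function-configuration \
  --function-name my-function \
  --architectures arm64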

Comparison between x86 and Arm/Graviton2 results for the AWS Lambda function computing prime numbers

Powering next-gen Amazon EC2: Deep dive on the Nitro System

The AWS Nitro System is a collection of building-block technologies that includes AWS-built hardware offload and security components. It is powering the next generation of Amazon EC2 instances, with a broadening selection of compute, storage, memory, and networking options.

In this session, dive deep into the Nitro System, reviewing its design and architecture, exploring new innovations to the Nitro platform, and understanding how it allows for faster innovation and increased security while reducing costs.

Traditionally, hypervisors protect the physical hardware and BIOS; virtualize the CPU, storage, and networking; and provide a rich set of management capabilities. With the AWS Nitro System, AWS breaks apart those functions and offloads them to dedicated hardware and software.

AWS Nitro System separates functions and offloads them to dedicated hardware and software, in place of a traditional hypervisor

How Amazon migrated a large ecommerce platform to AWS Graviton

In this re:Invent 2021 session, we learn about the benefits Amazon’s ecommerce Datapath platform has realized with AWS Graviton.

With a range of 25%-40% performance gains across 53,000 Amazon EC2 instances worldwide for Prime Day 2021, the Datapath team is lowering their internal costs with AWS Graviton’s improved price performance. Explore the software updates that were required to achieve this and the testing approach used to optimize and validate the deployments. Finally, learn about the Datapath team’s migration approach that was used for their production deployment.

AWS Graviton2: core components

See you next time!

Thanks for exploring custom computer chips, accelerators, and technologies developed by AWS. Join us in a couple of weeks when we talk more about architectures and the daily challenges faced while working with distributed systems.

Looking for more architecture content?

AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Let’s Architect! Modern data architectures

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-modern-data-architectures/

With the rapid growth in data coming from data platforms and applications, and the continuous improvements in state-of-the-art machine learning algorithms, data are becoming key assets for companies.

Modern data architectures include data mesh—a recent style that represents a paradigm shift, in which data is treated as a product and data architectures are designed around business domains. This type of approach supports the idea of distributed data, where each business domain focuses on the quality of the data it produces and exposes to the consumers.

In this edition of Let’s Architect!, we focus on data mesh and how it is designed on AWS, plus other approaches to adopt modern architectural patterns.

Design a data mesh architecture using AWS Lake Formation and AWS Glue

Domain Driven Design (DDD) is a software design approach where a solution is divided into domains aligned with business capabilities, software, and organizational boundaries. Unlike software architectures, most data architectures are often designed around technologies rather than business domains.

In this blog, you can learn about data mesh, an architectural pattern that applies the principles of DDD to data architectures. Data are organized into domains and considered the product that each team owns and offers for consumption.

A data mesh design organizes around data domains. Each domain owns multiple data products with their own data and technology stacks

Building Data Mesh Architectures on AWS

In this video, discover how to use the data mesh approach on AWS: specifically, how to implement certain design patterns for building a data mesh architecture with AWS services in the cloud.

This is a pragmatic presentation to get a quick understanding of data mesh fundamentals, the benefits/challenges, and the AWS services that you can use to build it. This video provides additional context to the aforementioned blog post and includes several examples of the benefits of modern data architectures.

This diagram demonstrates the pattern for sharing data catalogs between producer domains and consumer domains

Build a modern data architecture on AWS with Amazon AppFlow, AWS Lake Formation, and Amazon Redshift

In this blog, you can learn how to build a modern data strategy using AWS managed services to ingest data from sources like Salesforce. Also discussed is how to automatically create metadata catalogs and share data seamlessly between the data lake and data warehouse, plus how to create alerts in the event of an orchestrated data workflow failure.

The second part of the post explains how a data warehouse can be built by using an agile data modeling pattern, as well as how ELT jobs were quickly developed, orchestrated, and configured to perform automated data quality testing.

A data platform architecture and the subcomponents used to build it

AWS Lake Formation Workshop

With a modern data architecture on AWS, architects and engineers can rapidly build scalable data lakes; use a broad and deep collection of purpose-built data services; and ensure compliance via unified data access, security, and governance. As data mesh is a modern architectural pattern, you can build it using a service like AWS Lake Formation.

Familiarize yourself with new technologies and services not only by learning how they work, but also by building prototypes and projects to gain hands-on experience. This workshop allows builders to become familiar with the features of AWS Lake Formation and its integrations with other AWS services.

A data catalog is a key component in a data mesh architecture. AWS Glue crawlers interact with data stores and other elements to populate the data catalog

See you next time!

Thanks for joining our discussion on data mesh! See you in a couple of weeks when we talk more about architectures and the challenges that we face every day while working with distributed systems.

Looking for more architecture content?

AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Streamlining evidence collection with AWS Audit Manager

Post Syndicated from Nicholas Parks original https://aws.amazon.com/blogs/security/streamlining-evidence-collection-with-aws-audit-manager/

In this post, we will show you how to deploy a solution into your Amazon Web Services (AWS) account that enables you to simply attach manual evidence to controls using AWS Audit Manager. Making evidence-collection as seamless as possible minimizes audit fatigue and helps you maintain a strong compliance posture.

As an AWS customer, you can use APIs to deliver high quality software at a rapid pace. If you have compliance-focused teams that rely on manual, ticket-based processes, you might find it difficult to document audit changes as those changes increase in velocity and volume.

As your organization works to meet audit and regulatory obligations, you can save time by incorporating audit compliance processes into a DevOps model. You can use modern services like Audit Manager to make this easier. Audit Manager automates evidence collection and generates reports, which helps reduce manual auditing efforts and enables you to scale your cloud auditing capabilities along with your business.

AWS Audit Manager uses services such as AWS Security Hub, AWS Config, and AWS CloudTrail to automatically collect and organize evidence, such as resource configuration snapshots, user activity, and compliance check results. However, for controls represented in your software or processes without an AWS service-specific metric to gather, you need to manually create and provide documentation as evidence to demonstrate that you have established organizational processes to maintain compliance. The solution in this blog post streamlines these types of activities.

Solution architecture

This solution creates an HTTPS API endpoint, which allows integration with other software development lifecycle (SDLC) solutions, IT service management (ITSM) products, and clinical trial management systems (CTMS) solutions that capture trial process change amendment documentation (in the case of pharmaceutical companies who use AWS to build robust pharmacovigilance solutions). The endpoint can also be a backend microservice to an application that allows contract research organizations (CRO) investigators to add their compliance supporting documentation.

In this solution’s current form, you can submit an evidence file payload along with the assessment and control details to the API, and this solution will tie all the information together for the audit report. This post and solution are directed toward engineering teams who are looking for a way to accelerate evidence collection. To maximize the effectiveness of this solution, your engineering team will also need to collaborate with cross-functional groups, such as audit and business stakeholders, to design a process and service that constructs and sends the message(s) to the API and to scale out usage across the organization.

To download the code for this solution, and the configuration that enables you to set up auto-ingestion of manual evidence, see the aws-audit-manager-manual-evidence-automation GitHub repository.

Architecture overview

In this solution, you use AWS Serverless Application Model (AWS SAM) templates to build the solution and deploy to your AWS account. See Figure 1 for an illustration of the high-level architecture.

Figure 1. The architecture of the AWS Audit Manager automation solution

The SAM template creates resources that support the following workflow:

  1. A client can call an Amazon API Gateway endpoint by sending a payload that includes assessment details and the evidence payload.
  2. An AWS Lambda function implements the API to handle the request.
  3. The Lambda function uploads the evidence to an Amazon Simple Storage Service (Amazon S3) bucket (3a) and uses AWS Key Management Service (AWS KMS) to encrypt the data (3b).
  4. The Lambda function also initializes the AWS Step Functions workflow.
  5. Within the Step Functions workflow, a Standard Workflow calls two Lambda functions. The first looks for a matching control within an assessment, and the second updates the control within the assessment with the evidence.
  6. When the Step Functions workflow concludes, it sends a notification for success or failure to subscribers of an Amazon Simple Notification Service (Amazon SNS) topic.

Deploy the solution

The project available in the aws-audit-manager-manual-evidence-automation GitHub repository contains source code and supporting files for a serverless application you can deploy with the AWS SAM command line interface (CLI). It includes the following files and folders:

  • src: Code for the application’s Lambda implementation of the Step Functions workflow. It also includes a Step Functions definition file.
  • template.yml: A template that defines the application’s AWS resources.

Resources for this project are defined in the template.yml file. You can update the template to add AWS resources through the same deployment process that updates your application code.

Prerequisites

This solution assumes the following:

  1. AWS Audit Manager is enabled.
  2. You have already created an assessment in AWS Audit Manager.
  3. You have the necessary tools to use the AWS SAM CLI (see details in the list that follows).

For more information about setting up Audit Manager and selecting a framework, see Getting started with Audit Manager in the blog post AWS Audit Manager Simplifies Audit Preparation.

The AWS SAM CLI is an extension of the AWS CLI that adds functionality for building and testing Lambda applications. The AWS SAM CLI uses Docker to run your functions in an Amazon Linux environment that matches Lambda. It can also emulate your application’s build environment and API.

To use the AWS SAM CLI, you need the following tools:

  • AWS SAM CLI: Install the AWS SAM CLI
  • Node.js: Install Node.js 14, including the npm package management tool
  • Docker: Install Docker Community Edition

To deploy the solution

  1. Open your terminal and use the following command to create a folder to clone the project into, then navigate to that folder. Be sure to replace <FolderName> with your own value.

    mkdir Desktop/<FolderName> && cd $_

  2. Clone the project into the folder you just created by using the following command.

    git clone https://github.com/aws-samples/aws-audit-manager-manual-evidence-automation.git

  3. Navigate into the newly created project folder by using the following command.

    cd aws-audit-manager-manual-evidence-automation

  4. In the AWS SAM shell, use the following command to build the source of your application.

    sam build

  5. In the AWS SAM shell, use the following command to package and deploy your application to AWS. Be sure to replace <DOC-EXAMPLE-BUCKET> with your own unique S3 bucket name.

    sam deploy --guided --parameter-overrides paramBucketName=<DOC-EXAMPLE-BUCKET>

  6. When prompted, enter the AWS Region where AWS Audit Manager was configured. For the rest of the prompts, leave the default values.
  7. To activate the IAM authentication feature for API Gateway, override the default value by passing the following parameter override.

    paramUseIAMwithGateway=AWS_IAM

To test the deployed solution

After you deploy the solution, run an invocation like the one below for an assessment (using curl). Be sure to replace <YOURAPIENDPOINT> and <AWS REGION> with your own values.

curl --location --request POST \
'https://<YOURAPIENDPOINT>.execute-api.<AWS REGION>.amazonaws.com/Prod' \
--header 'x-api-key: ' \
--form 'payload=@"<PATH TO FILE>"' \
--form 'AssessmentName="GxP21cfr11"' \
--form 'ControlSetName="General requirements"' \
--form 'ControlIdName="11.100(a)"'

Check to see that your file is correctly attached to the control for your assessment.

Form-data interface parameters

The API implements a form-data interface that expects four parameters:

  1. AssessmentName: The name for the assessment in Audit Manager. In this example, the AssessmentName is GxP21cfr11.
  2. ControlSetName: The display name for a control set within an assessment. In this example, the ControlSetName is General requirements.
  3. ControlIdName: A particular control within a control set. In this example, the ControlIdName is 11.100(a).
  4. Payload: The file representing evidence to be uploaded.

As a refresher of Audit Manager concepts, evidence is collected for a particular control. Controls are grouped into control sets. Control sets can be grouped into a particular framework. The assessment is considered an implementation, or an instance, of the framework. For more information, see AWS Audit Manager concepts and terminology.
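
The post doesn’t show the Lambda code, but attaching manual evidence to a control in this way corresponds to Audit Manager’s BatchImportEvidenceToAssessmentControl API, which is presumably what the second Lambda function calls after the first one has found the matching control. For reference, an equivalent AWS CLI call looks roughly like the following; all IDs and the S3 path are placeholders:

aws auditmanager batch-import-evidence-to-assessment-control \
  --assessment-id 11111111-2222-3333-4444-555555555555 \
  --control-set-id "General requirements" \
  --control-id aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee \
  --manual-evidence s3ResourcePath=s3://DOC-EXAMPLE-BUCKET/evidence/change-record.pdf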

To clean up the deployed solution

To clean up the solution, use the following commands to delete the AWS CloudFormation stack and your S3 bucket. Be sure to replace <YourStackId> and <DOC-EXAMPLE-BUCKET> with your own values.

aws cloudformation delete-stack --stack-name <YourStackId>
aws s3 rb s3://<DOC-EXAMPLE-BUCKET> --force

Conclusion

This solution provides a way to improve coordination between your software delivery organization and compliance professionals. It allows your organization to continuously deliver new updates without overwhelming your security professionals with manual audit review tasks.

Next steps

There are various ways to extend this solution.

  1. Update the API Lambda implementation to be a webhook for your favorite software development lifecycle (SDLC) or IT service management (ITSM) solution.
  2. Modify the steps within the Step Functions state machine to more closely match your unique compliance processes.
  3. Use AWS CodePipeline to start Step Functions state machines natively, or integrate a variation of this solution with any continuous compliance workflow that you have.

Learn more about AWS Audit Manager, DevOps, and AWS for Health, and start building!

 

Nicholas Parks

Nicholas has been using AWS since 2010 across various enterprise verticals including healthcare, life sciences, financial, retail, and telecommunications. Nicholas focuses on modernizations in pursuit of new revenue as well as application migrations. He specializes in Lean, DevOps cultural change, and Continuous Delivery.

Brian Tang

Brian Tang is an AWS Solutions Architect based out of Boston, MA. He has 10 years of experience helping enterprise customers across a wide range of industries complete digital transformations by migrating business-critical workloads to the cloud. His core interests include DevOps and serverless-based solutions. Outside of work, he loves rock climbing and playing guitar.

Exposing a Kafka Cluster via a VPC Endpoint Service

Post Syndicated from Grab Tech original https://engineering.grab.com/exposing-kafka-cluster

In large organisations, it is a common practice to isolate the cloud resources of different verticals. Amazon Web Services (AWS) Virtual Private Cloud (VPC) is a convenient way of doing so. At Grab, while our core AWS services reside in a main VPC, a number of Grab Tech Families (TFs) have their own dedicated VPC. One such example is GrabKios. Previously known as “Kudo”, GrabKios was acquired by Grab in 2017 and has always been residing in its own AWS account and dedicated VPC.

In this article, we explore how we exposed an Apache Kafka cluster across multiple Availability Zones (AZs) in Grab’s main VPC, to producers and consumers residing in the GrabKios VPC, via a VPC Endpoint Service. This design is part of Coban, Grab’s unified stream processing platform.

There are several ways of enabling communication between applications across distinct VPCs; VPC peering is the most straightforward and affordable option. However, it potentially exposes the entire VPC networks to each other, needlessly increasing the attack surface.

Security has always been one of Grab’s top concerns and with Grab’s increasing growth, there is a need to deprecate VPC peering and shift to a method of only exposing services that require remote access. The AWS VPC Endpoint Service allows us to do exactly that for TCP/IPv4 communications within a single AWS region.

Compared to VPC peering, setting up a VPC Endpoint Service is already relatively complex. On top of that, we need to expose an Apache Kafka cluster via such an endpoint, which comes with an extra challenge: Apache Kafka requires clients, called producers and consumers, to be able to deterministically establish a TCP connection to all brokers forming the cluster, not just any one of them.

Last but not least, we need a design that optimises performance and cost by limiting data transfer across AZs.

Note: All variable names, port numbers and other details used in this article are only used as examples.

Architecture overview

As shown in this diagram, the Kafka cluster resides in the service provider VPC (Grab’s main VPC) while local Kafka producers and consumers reside in the service consumer VPC (GrabKios VPC).

In Grab’s main VPC, we created a Network Load Balancer (NLB) and set it up across all three AZs, enabling cross-zone load balancing. We then created a VPC Endpoint Service associated with that NLB.

Next, we created a VPC Endpoint Network Interface in the GrabKios VPC, also set up across all three AZs, and attached it to the remote VPC endpoint service in Grab’s main VPC. Apart from this, we also created a Route 53 Private Hosted Zone .grab and a CNAME record kafka.grab that points to the VPC Endpoint Network Interface hostname.

Lastly, we configured producers and consumers to use kafka.grab:10000 as their Kafka bootstrap server endpoint, 10000/tcp being an arbitrary port of our choosing. We will explain the significance of these in later sections.
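
As a quick illustration (kcat is used later in this article to inspect the cluster from the provider side), a client in the GrabKios VPC can verify its bootstrap endpoint by listing the cluster metadata:

$ kcat -L -b kafka.grab:10000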


Network Load Balancer setup

On the NLB in Grab’s main VPC, we set up the corresponding bootstrap listener on port 10000/tcp, associated with a target group containing all of the Kafka brokers forming the cluster. But this listener alone is not enough.

As mentioned earlier, Apache Kafka requires producers and consumers to be able to deterministically establish a TCP connection to all brokers. That’s why we created one listener for every broker in the cluster, incrementing the TCP port number for each new listener, so each broker endpoint would have the same name but with different port numbers, e.g. kafka.grab:10001 and kafka.grab:10002.

We then associated each listener with a dedicated target group containing only the targeted Kafka broker, so that remote producers and consumers could differentiate between the brokers by their TCP port number.

The following listeners and associated target groups were set up on the NLB:

  • 10000/tcp (bootstrap) -> 9094/tcp @ [broker 101, broker 201, broker 301]
  • 10001/tcp -> 9094/tcp @ [broker 101]
  • 10002/tcp -> 9094/tcp @ [broker 201]
  • 10003/tcp -> 9094/tcp @ [broker 301]
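
The article notes further down that this infrastructure is deployed with Terraform; purely as an illustration, the dedicated listener and target group for broker 101 could be created with AWS CLI calls along these lines (the ARNs, VPC ID and instance ID are placeholders):

$ aws elbv2 create-target-group --name kafka-broker-101 --protocol TCP --port 9094 \
    --vpc-id <VPC ID> --target-type instance
$ aws elbv2 register-targets --target-group-arn <TARGET GROUP ARN> --targets Id=<BROKER 101 INSTANCE ID>
$ aws elbv2 create-listener --load-balancer-arn <NLB ARN> --protocol TCP --port 10001 \
    --default-actions Type=forward,TargetGroupArn=<TARGET GROUP ARN>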

Security Group rules

In the Kafka brokers’ Security Group (SG), we added an ingress SG rule allowing 9094/tcp traffic from each of the three private IP addresses of the NLB. As mentioned earlier, the NLB was set up across all three AZs, with each having its own private IP address.

On the GrabKios VPC (consumer side), we created a new SG and attached it to the VPC Endpoint Network Interface. We also added ingress rules to allow all producers and consumers to connect to tcp/10000-10003.

Kafka setup

Kafka brokers typically come with a listener on port 9092/tcp, advertising the brokers by their private IP addresses. We kept that default listener so that local producers and consumers in Grab’s main VPC could still connect directly.

$ kcat -L -b 10.0.0.1:9092
 3 brokers:
 broker 101 at 10.0.0.1:9092 (controller)  
 broker 201 at 10.0.0.2:9092
 broker 301 at 10.0.0.3:9092
... truncated output ...

We also configured all brokers with an additional listener on port 9094/tcp that advertises the brokers by:

  • Their shared private name kafka.grab.
  • Their distinct TCP ports previously set up on the NLB’s dedicated listeners.
$ kcat -L -b 10.0.0.1:9094
 3 brokers:
 broker 101 at kafka.grab:10001 (controller)  
 broker 201 at kafka.grab:10002
 broker 301 at kafka.grab:10003
... truncated output ...

Note that there is a difference in how the brokers’ endpoints are advertised in the two outputs above. The latter enables connecting to any particular broker from the GrabKios VPC via the VPC Endpoint Service.
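
The post does not show the broker configuration itself, but the dual-listener setup described above corresponds to server.properties entries roughly like the following for broker 101 (the listener names and security protocols are assumptions):

listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9094
advertised.listeners=INTERNAL://10.0.0.1:9092,EXTERNAL://kafka.grab:10001
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
inter.broker.listener.name=INTERNAL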

It would definitely be possible to advertise the brokers directly with the remote VPC Endpoint Interface hostname instead of kafka.grab, but relying on such a private name presents at least two advantages.

First, it decouples the Kafka deployment in the service provider VPC from the infrastructure deployment in the service consumer VPC. Second, it makes the Kafka cluster easier to expose to other remote VPCs, should we need it in the future.

Limiting data transfer across Availability Zones

At this stage of the setup, our Kafka cluster is fully reachable from producers and consumers in the GrabKios VPC. Yet, the design is not optimal.

When a producer or a consumer in the GrabKios VPC needs to connect to a particular broker, it uses its individual endpoint made up of the shared name kafka.grab and the broker’s dedicated TCP port.

The shared name arbitrarily resolves into one of the three IP addresses of the VPC Endpoint Network Interface, one for each AZ.

Hence, there is a fair chance that the obtained IP address is neither in the client’s AZ nor in that of the target Kafka broker. The probability of this happening can be as high as 2/3 when both client and broker reside in the same AZ and 1/3 when they do not.

While that is of little concern for the initial bootstrap connection, it becomes a serious drawback for actual data transfer, impacting the performance and incurring unnecessary data transfer cost.

For this reason, we created three additional CNAME records in the Private Hosted Zone in the GrabKios VPC, one for each AZ, with each pointing to the VPC Endpoint Network Interface zonal hostname in the corresponding AZ:

  • kafka-az1.grab
  • kafka-az2.grab
  • kafka-az3.grab

Note that we used az1, az2, az3 instead of the typical AWS 1a, 1b, 1c suffixes, because the latter’s mapping is not consistent across AWS accounts.
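
Again purely as an illustration (the AWS infrastructure is managed with Terraform), one such zonal record could be created with a Route 53 change batch like this, where the hosted zone ID and the zonal VPC Endpoint DNS name are placeholders:

$ aws route53 change-resource-record-sets --hosted-zone-id <HOSTED ZONE ID> --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "kafka-az1.grab",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "<ZONAL VPC ENDPOINT DNS NAME FOR AZ 1>"}]
      }
    }]
  }'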

We also reconfigured each Kafka broker in Grab’s main VPC by setting their 9094/tcp listener to advertise brokers by their new zonal private names.

$ kcat -L -b 10.0.0.1:9094
 3 brokers:
 broker 101 at kafka-az1.grab:10001 (controller)  
 broker 201 at kafka-az2.grab:10002
 broker 301 at kafka-az3.grab:10003
... truncated output ...

Our private zonal names are shared by all brokers in the same AZ while TCP ports remain distinct for each broker. However, this is not clearly shown in the output above because our cluster has only three brokers, one in each AZ.

The previous common name kafka.grab remains in the GrabKios VPC’s Private Hosted Zone and allows connections to any broker via an arbitrary, likely non-optimal route. GrabKios VPC producers and consumers still use that highly-available endpoint to initiate bootstrap connections to the cluster.


Future improvements

For this setup, scalability is our main challenge. If we add a new broker to this Kafka cluster, we would need to:

  • Assign a new TCP port number to it.
  • Set up a new dedicated listener on that TCP port on the NLB.
  • Configure the newly spun up Kafka broker to advertise its service with the same TCP port number and the private zonal name corresponding to its AZ.
  • Add the new broker to the target group of the bootstrap listener on the NLB.
  • Update the network SG rules on the service consumer side to allow connections to the newly allocated TCP port.

We rely on Terraform to dynamically deploy all AWS infrastructure and on Jenkins and Ansible to deploy and configure Apache Kafka. There is limited overhead but there are still a few manual actions due to a lack of integration. These include transferring newly allocated TCP ports and their corresponding EC2 instances’ IP addresses to our Ansible inventory, committing them to our codebase, and triggering a Jenkins job that deploys the new Kafka broker.

Another concern of this setup is that it is only applicable for AWS. As we are aiming to be multi-cloud, we may need to port it to Microsoft Azure and leverage the Azure Private Link service.

In both cases, running Kafka on Kubernetes with the Strimzi operator would be helpful in addressing the scalability challenge and reducing our dependence on one particular cloud provider. We will explain how this solution has helped us address these challenges in a future article.


Special thanks to David Virgil Naranjo whose blog post inspired this work.


Join us

Grab is a leading superapp in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across over 400 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Monitoring Juniper Mist wireless network

Post Syndicated from Brian van Baekel original https://blog.zabbix.com/monitoring-juniper-mist-wireless-network/19093/

As a Premium Zabbix partner, Opensource ICT Solutions builds Zabbix solutions all over the world. That means we have customers with a broad variety of requirements and ideas about how to monitor things, which metrics are important, and how to alert on them. When one of those customers approaches us with a request for something we have never done before, it’s a challenge. And we love challenges! This blog post will cover one such challenge that we solved some time ago.

Quanza is a leading infrastructure operator offering a broad portfolio of services to completely take over the management of networks, data centers and cloud services. With more than 70 colleagues and at least as many specializations, everyone at Quanza works towards the same goal: designing, building, and operating an optimal IT infrastructure. Exactly like you would expect it… and then some. Quanza understands that you prefer to focus on your own innovation. By continuously mapping out your wishes, Quanza provides customized solutions that keep your network up and running 24×7. Today and in the future.

With a relentless focus on mission-critical environments, often of relevance to society, Quanza has an impressive line-up of customers. Some enterprises that chose to partner up with Quanza are SURF, Payvision, the Volksbank, and the Amsterdam Internet Exchange (AMS-IX), one of the world’s largest internet hubs.

Recently, customers started asking Quanza to embed Juniper MIST products for wired and wireless networks in their service portfolio. In order to fully support the network’s lifecycle (build, operate and innovate), the Juniper MIST products will need to be monitored by their 24×7 NOC. This is where we came into play, with our Zabbix knowledge.

We quickly decided to combine the knowledge Quanza has of the Juniper MIST equipment and API and our Zabbix knowledge to build the best possible monitoring solution.

SNMP or cloud?

The Juniper MIST solution is a cloud-based solution that provides a single pane of management for Juniper Networks products. As it’s cloud-based, it’s not a “traditional” network solution, and SNMP is not an option for device monitoring: the devices communicate only with “the cloud”, and we cannot access them directly the way we used to with traditional network equipment.

So, we started to investigate other options. One of the most common options right now is talking to some sort of API and pulling the metrics from it. With the Zabbix “HTTP agent” item type, this is no problem at all. Unfortunately, that’s not how the MIST API works: it pushes data instead of letting you pull it (pulling is technically possible, but it doesn’t scale at all). Now, the Zabbix HTTP agent item type does allow trapping, but only in the specific Zabbix sender format, and the MIST API does not send data in that format.

This means we have a problem. SNMP is not available. Pulling data is not a viable, scalable option. Pushing the data is an option, but Zabbix does not understand that.

Since Zabbix is not some proprietary monitoring tool that is completely closed and far too static, there is always a solution, as long as you’re creative enough.

Getting data into Zabbix

We needed some middleware. Something that was able to receive that data from MIST and convert it into something that we can push into Zabbix.

That’s exactly what we did. We, together with Quanza, built a middleware that uses an API token to authenticate against the MIST API endpoint. Once the authentication is successful, the middleware is allowed to subscribe to certain “channels”. These channels provide event and performance data. You can compare it with MQTT, where a subscription to channels/topics is needed to get the information you are interested in.

Mist Middleware explained

  • Step 1: Authenticate using an API token
  • Step 2: Subscribe to channels
  • Step 3: Receive performance and event data
  • Step 4: Filter out only the relevant (performance) data for Zabbix
  • Step 5: Push the data into Zabbix (a sketch of this step follows below)
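
A minimal sketch of what step 5 can look like, assuming the middleware hands values to a Zabbix trapper item via zabbix_sender (the server address, host name and item key below are assumptions; the post does not disclose the actual transport or key names):

# Push one AP's performance payload into its trapper item on the Zabbix server.
zabbix_sender -z zabbix.example.com -p 10051 \
  -s "AP-SITE01-01" \
  -k mist.performance.raw \
  -o "$(cat ap_performance.json)"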

Once we had this in place, the MIST part was finished. We had our data and were able to push it into the monitoring solution.

Parsing in Zabbix

So, right now we have the data available for Zabbix. Time to find a neat way to use it. As the environments (both inventory and the types of equipment that are used) might be dynamic, we definitely do not want to apply any manual work to monitor newly added sites/equipment.

That means that low-level discovery rules are pretty much the only viable solution.

Here we go:

Describing host prototypes

Within Zabbix, we configure one host (the Discovery host) and apply a template to that host, with exactly one LLD rule: query our middleware and, based on the information received, create new hosts (Host prototypes).

The data that is received looks like this:

{
"NODEID":"<NODEID>",
"NAME":"AP-<SITE>-<NUMBER>",
"SITENAME":"<SITENAME>",
"SITEID":"<SITEID>",
"MAC":"<MAC ADDRESS>",
"ORGNAME":"<ORG NAME>"
},

Those new/discovered hosts will have the names of the AP and corresponding organization and location (in Mist: site). We also link a template to the discovered host and add it to a Host group with the variables we’ll need later, such as the organization, site name, siteID etc.

So, we need to parse those JSON elements. Luckily, Zabbix provides, within the LLD rule config, the option to parse this into LLD macros; for example, the node ID is parsed into {#ID} with the JSONPath $.NODEID:

LLD macro configuration
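
In the LLD rule, the macro mappings are along these lines ({#ID} and $.NODEID come from the text above, while the remaining macro names are assumptions based on the fields used by the host prototypes):

{#ID}        $.NODEID
{#NAME}      $.NAME
{#SITENAME}  $.SITENAME
{#SITEID}    $.SITEID
{#ORGNAME}   $.ORGNAME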

Once this process is complete, we have a new host per AP. Of course, there is no data on those hosts yet, and having each of them query the middleware or Mist directly would be a bad idea: scalability would become extremely problematic with more than a few organizations and sites configured in the Mist environment. As we’re building this with a big network integrator, scalability matters, and we do not want to risk a noticeable performance impact by using polling.

How about pushing data from the middleware into Zabbix? Once the middleware receives the data from Mist, it parses and filters it, and then pushes out whatever must be sent to Zabbix. We decided the best option is to push data per host, as those hosts are already available in Zabbix.

Now we should ensure two things:

    • Do not overwhelm Zabbix with data being pushed in
    • Get all the data in with the least number of ‘pushes’ into Zabbix

Again, the flexibility of Zabbix is extremely useful here. On the AP hosts, there is a template with exactly 1 trapper item: receive performance data. From there, everything will be handled by the Zabbix ‘Master/Dependent’ item concept. We then extract data like temperatures, CPU load, memory usage, etc.

At the same time, we receive data regarding network usage (interface statistics) and radio information. As we do not know upfront how many network interfaces and radio’s there are on a particular Access Point, we do not want to hard-code such information. Here we are combining the concept of low-level discovery with dependent items (The following blog post covers the logic behind such an approach: Low-Level Discovery with Dependent items – Zabbix Blog)

Using ‘low-level discovery with dependent items’, all relevant items are created ‘dynamically’, in such a way that a change on the MIST side (for example, a new type of Access Point) doesn’t require changes on the Zabbix side. Monitoring starts within minutes and you’ll never miss any problem that might arise!
Just to give you an idea of the flow:
The Master Item receives a JSON payload like this (only a small portion is shown here) pushed into it from the middleware:

{
"mac":"<MAC ADDRESS>",
"model":"<MODEL>",
"port_stat":{
"eth0":{
"up":true,
"speed":1000,
"full_duplex":true,
"tx_bytes":37291245,
"tx_pkts":169443,
"rx_bytes":123742282,
"rx_pkts":779249,
"rx_errors":0,
"rx_peak_bps":14184,
"tx_peak_bps":5444
}
},
"cpu_util":2,
"cpu_user":652611,
"cpu_system":901455,
"radio_stat":{
"band_5":{
"num_clients":<CLIENTS>,
"channel":<CHANNEL>,
"bandwidth":0,
"power":0,
"tx_bytes":0,
"tx_pkts":0,
"rx_bytes":0,
"rx_pkts":0,
"noise_floor":<NOISE>,
"disabled":true,
"usage":"5",
"util_all":0,
"util_tx":0,
"util_rx_in_bss":0,
"util_rx_other_bss":0,
"util_unknown_wifi":0,
"util_non_wifi":0
}
},
"env_stat":{
"cpu_temp":<CPU TEMP>,
"ambient_temp":<AMBIENT TEMP>,
"humidity":0,
"attitude":0,
"pressure":0
}
}

Within the Master item, we’re basically not parsing anything; it’s just there to receive the values and push them into the dependent items. In the dependent items, we start “cherry-picking” only those metrics that we would like to see. As the data is in JSON format, the “JSONPath” preprocessing step comes in handy. At the same time, we’re looking at efficiency, so a second step is added: discard unchanged with heartbeat (1d):

Example: getting out the statistics of the 2.4 GHz band radio:

Item prototype preprocessing
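
The two preprocessing steps on such a dependent item look roughly like this, here using the 5 GHz branch of the payload shown above (in an item prototype the band would normally be substituted via an LLD macro, which is an assumption on our part):

1. JSONPath: $.radio_stat.band_5.num_clients
2. Discard unchanged with heartbeat: 1d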

Of course, this has to be done with all items.

So far, we’ve heavily focussed on the technical part, but Zabbix does have quite a few options to visualize the data as well. As we’re waiting on the next LTS release, we have only set up a very small dashboard with a few widgets. One of the better ones:

number_clients

Here we’re using the new graph type widget, but instead of plotting the number of clients per AP, we’re plotting a dataset with an “aggregate” function. Of course, if we look at the dashboard widgets, there are many more things that can be visualized…

Efficiency and security considerations

As we were building this, we had 2 main considerations:

    • Efficiency
    • Security

Efficiency, because we anticipate that Quanza will be responsible for quite a few MIST environments on top of the current ones in the near future, combined with a strict limit on the number of allowed calls against the MIST API. As such, it is really important to keep the number of API calls as low as possible. Next to that, with every new Access Point added, the load on the Zabbix server increases. That is not really a problem, as Zabbix is perfectly capable of monitoring thousands of metrics simultaneously, though it has its limits, and you do not want to hit those limits in a production environment where the only solution is migrating to beefier hardware.

Security-wise, this challenge had a few things going on, since we’re talking to an externally exposed API. MIST can invoke webhooks, which might have been a bit easier (we explored it, but there were of course other things to keep in mind while going down that road); however, the main concern was the requirement that Zabbix, or an interface to Zabbix, be exposed to the internet. That didn’t look too appealing and would have required a bit more maintenance. The preferable solution was to create the middleware, where we have full control over which queries are executed, how the API token is protected, which connections are established, and so on.

Conclusion

Although this challenge was far from trivial, together with Quanza we created a scalable, secure, and dynamic solution. Zabbix is flexible enough to facilitate the tricks required to provide reliable monitoring and alerting in an efficient and secure manner. We strongly believe the only limitation is your own creativity, and this case proves that once again.

Quanza can now ensure the availability of their customers’ Juniper MIST-based networks, and in case something breaks, their 24×7 manned NOC will be able to take whatever action is required to restore the availability of the customers’ networks – all thanks to the flexibility of Zabbix.


Snaring the Bad Folks

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/snaring-the-bad-folks-66726a1f4c80

Project by Netflix’s Cloud Infrastructure Security team (Alex Bainbridge, Mike Grima, Nick Siow)

Cloud security is a hard problem, but an even harder one is cloud security at scale. In recent years we’ve seen several cloud-focused data breaches, and evidence shows that threat actors are becoming more advanced with their techniques, goals, and tooling. With 2021 set to be a new high for the number of data breaches, it was plainly evident that we needed to evolve our approach to cloud infrastructure security.

In 2020, we decided to reinvent how we handle cloud security findings by redefining how we write and respond to cloud detections. We knew that given our scale, we needed to rely heavily on automations and that we needed to build our solutions using battle tested scalable infrastructure.

Introducing Snare

Snare Logo

Snare is our Detection, Enrichment, and Response platform for handling cloud security related findings at Netflix. Snare is responsible for receiving millions of records a minute, analyzing, alerting, and responding to them. Snare also provides a space for our security engineers to track what’s going on, drill down into various findings, follow their investigation flow, and ensure that findings are reaching their proper resolution. Snare can be broken down into the following parts: Detection, Enrichment, Reporting & Management, and Remediation.

Snare Finding Lifecycle

Overview

Snare was built from the ground up to be scalable to manage Netflix’s massive scale. We currently process tens of millions of log records every minute and analyze these events to perform in-house custom detections. We collect findings from a number of sources, which includes AWS Security Hub, AWS Config Rules, and our own in-house custom detections. Once ingested, findings are then enriched and processed with additional metadata collected from Netflix’s internal data sources. Finally, findings are checked against suppression rules and routed to our control plane for triaging and remediation.

Where We Are Today

We’ve developed, deployed, and operated Snare for almost a year, and since then, we’ve seen tremendous improvements while handling our cloud security findings. A number of findings are auto-remediated; others use Slack alerts to loop in the on-call engineer, who triages via the Snare UI. One major improvement was a direct time savings for our detection squad. Utilizing Snare, we were able to perform more granular tuning and aggregation of findings, leading to an average 73.5% reduction in our false positive finding volume across our ingestion streams. With this additional time, we were able to focus on new detections and new features for Snare.

Speaking of new detections, we’ve more than doubled the number of our in-house detections, and onboarded several detection solutions from security vendors. The Snare framework enables us to write detections quickly and efficiently with all of the plumbing and configurations abstracted away from us. Detection authors only need to be concerned with their actual detection logic, and everything else is handled for them.

Simple Snare Root User Detection
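
Snare’s detection framework itself is internal to Netflix, but purely as an illustration of the idea (and not Snare’s actual API), a root user detection over CloudTrail-style events could conceptually boil down to something like this:

from typing import Optional

def detect_root_user_activity(event: dict) -> Optional[dict]:
    """Flag API or console activity performed with the AWS account root user."""
    identity = event.get("userIdentity", {})
    if identity.get("type") != "Root":
        return None
    # Everything else (routing, enrichment, suppression) would be handled by the platform.
    return {
        "title": "AWS root user activity detected",
        "severity": "HIGH",
        "account_id": event.get("recipientAccountId"),
        "event_name": event.get("eventName"),
        "region": event.get("awsRegion"),
    }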

As for security vendors, we’ve most notably worked with AWS to ensure that services like GuardDuty and Security Hub are first class citizens when it comes to detection sources. Integration with Security Hub was a critical design decision from the start due to the high amount of leverage we get from receiving all of the AWS Security findings in a normalized format and in a centralized location. Security Hub has played an integral role in our platform, and made evaluations of AWS security services and new features easy to try out and adopt. Our plumbing between Security Hub and Snare is managed through AWS Organizations as well as EventBridge rules deployed in every region and account to aid in aggregating all findings into our centralized Snare platform.

High Level Security Service Plumbing
Example AWS Security Finding from our testing/sandbox account In Snare UI

One area in which we are investing heavily is automated remediation. We’ve explored a few different options, ranging from fully automated remediations and manually triggered remediations to automated playbooks for additional data gathering during incident triage. We decided to employ AWS Step Functions as our execution environment due to the unique DAGs we could build and the simple “wait”/”task token” functionality, which allows us to involve humans when necessary for approval/input.

Building on top of Step Functions, we created a 4-step remediation process: pre-processing, decision, remediation, and post-processing. Pre/post-processing can be used for managing out-of-band resource checks, or any work that needs to be done to ensure a successful remediation. The decision step is used to perform a final pre-flight check before remediation. This can involve reaching out to a human, verifying the resource is still around, and so on. The remediation step is where we perform our actual remediation. We’ve been able to use this to great success, with infrastructure-wide misconfigured resources being automatically fixed in near real time, and it has enabled the creation of new fully automated incident response playbooks. We’re still exploring new ways we might be able to use this, and are excited for how we might evolve our approach in the near future.
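
As a rough illustration of that four-stage layout (not Netflix’s actual definition; the Lambda names and ARNs are placeholders), a Step Functions state machine with a human-approval decision step could be sketched in Amazon States Language, expressed here as a Python dict:

REMEDIATION_DEFINITION = {
    "StartAt": "PreProcess",
    "States": {
        "PreProcess": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:<ACCOUNT_ID>:function:pre-process",
            "Next": "Decision",
        },
        "Decision": {
            # waitForTaskToken pauses the execution until a human (or another system)
            # responds via SendTaskSuccess/SendTaskFailure with the provided token.
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
            "Parameters": {
                "FunctionName": "request-approval",
                "Payload": {"input.$": "$", "taskToken.$": "$$.Task.Token"},
            },
            "Next": "Remediate",
        },
        "Remediate": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:<ACCOUNT_ID>:function:remediate",
            "Next": "PostProcess",
        },
        "PostProcess": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:<ACCOUNT_ID>:function:post-process",
            "End": True,
        },
    },
}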

Step Function DAG for S3 Public Access Block Remediation

Diagram from a remediation to enable S3’s public access block on a non-compliant bucket. Each choice stage allows for dynamic routing to a variety of different stages based on the output of the previous function. Wait stages are used when human intervention/approval is needed.
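
For the S3 example above, the remediation stage itself could come down to a small Lambda handler along these lines; the event shape (a bucket_name field in the remediation payload) is an assumption, while put_public_access_block is the standard S3 API call:

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Bucket name is assumed to arrive in the remediation payload.
    bucket = event["bucket_name"]
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    return {"status": "remediated", "bucket": bucket}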

Extensible Learnings

We’ve come a long way in our journey, and we’ve had numerous learning opportunities that we wanted to collect and share. Hopefully, we’ve already made the mistakes and learned from those experiences, so that you don’t have to repeat them.

Information is Key

Home-grown context and metadata streams are invaluable for a detection and response program. By uniting detections and context, you’re able to unlock a new world of possibilities: reducing false positives, creating new detections that rely on business-specific context, and better tailoring your severities and automated remediation decisions to your desired risk appetite. A common theme we’ve often encountered is the need to bring additional context into various stages of our pipeline, so make sure to plan for that from the get-go.

Step Functions for Remediations

Step Functions provide a highly extensible and unique platform to create remediations. Utilizing the AWS CDK, we were able to build a platform that enables us to easily roll out new remediations. While creating our remediation platform, we explored SSM Automation Runbooks. While SSM Automation Runbooks have great potential for remediating simple issues, we found they weren’t flexible enough to cover a wide spread of our needs, nor did they offer some of the more advanced features we were looking for, such as reaching out to humans. Step Functions gave us the right amount of flexibility, control, and ease of use to be a great asset for the Snare platform.

Closing Thoughts

We’ve come a long way in a year, and we still have a number of interesting things on the horizon. We’re looking at continuing to create new, more advanced features and detections for Snare to reduce cloud security risks in order to keep up with all of the exciting things happening here at Netflix. Make sure to check out some of our other recent blog posts!

Special Thanks

Special thanks to everyone who helped to contribute and provide feedback during the design and implementation of Snare. Notably Shannon Morrison, Sapna Solanki, Jason Schroth from our partner team Detection Engineering, as well as some of the folks from AWS — Prateek Sharma & Ely Kahn. Additional thanks to the rest of our Cloud Infrastructure Security team (Hee Won Kim, Joseph Kjar, Steven Reiling, Patrick Sanders, Srinath Kuruvadi) for their support and help with Snare features, processes, and design decisions!



Zabbix 6.0 LTS – The next great leap in monitoring by Alexei Vladishev / Zabbix Summit Online 2021

Post Syndicated from Alexei Vladishev original https://blog.zabbix.com/zabbix-6-0-lts-the-next-great-leap-in-monitoring-by-alexei-vladishev-zabbix-summit-online-2021/17683/

The Zabbix Summit Online 2021 keynote speech by Zabbix founder and CEO Alexei Vladishev focuses on the role of Zabbix in modern, dynamic IT infrastructures. The keynote speech also highlights the major milestones leading up to Zabbix 6.0 LTS and together we take a look at the future of Zabbix.

The full recording of the speech is available on the official Zabbix Youtube channel.

Digital transformation journey
Infrastructure monitoring challenges
Zabbix – Universal Open Source enterprise-level monitoring solution
Cost-Effectiveness
Deploy Anywhere
Monitor Anything
Monitoring of Kubernetes and Hybrid Clouds
Data collection and Aggregation
Security on all levels
Powerful Solution for MSPs
Scalability and High Availability
Machine learning and Statistical analysis
More value to users
New visualization capabilities
IoT monitoring
Infrastructure as code
Tags for classification
What’s next?
Advanced event correlation engine
Multi DC Monitoring
Zabbix Release Schedule
Zabbix Roadmap
Questions

Digital transformation journey

First, let’s talk about how Zabbix plays a role as a part of the Digital Transformation journey for many companies.

As IT infrastructures evolve, there are many ongoing challenges. Most larger companies, for example, have a set of legacy systems that need to be integrated with more modern systems. This results in a mix of legacy and new technologies and protocols. This means that most management and monitoring tools need to support all of these technologies – Zabbix is no exception here.

Hybrid clouds, containers, and container orchestration systems such as K8s and OpenShift have also played an immense part in the digital transformation of enterprises. It has been a major paradigm shift – from physical machines to virtual machines, to containers and hybrid clouds. We certainly must provide the required set of technologies to monitor such environments and the monitoring endpoints unique to them.

The rapid increase in the complexity of IT infrastructures caused by the two previous points requires our tools to be a lot more scalable than before. We have many more moving parts, likely located in different locations that we need to stay aware of. This also means that any downtime is not acceptable – this is why the high availability of our tools is also vital to us.

Let’s not forget that with increased complexity, many new potential security attack vectors arise and our tools need to support features that can help us with minimizing the security risks.

But making our infrastructures more agile usually comes at a very real financial cost. We must not forget that most of the time we are working with a dedicated budget for our tools and procedures.

Infrastructure monitoring challenges

The increase in the complexity of IT infrastructures also poses multiple monitoring challenges that we have to strive to overcome:

  • Requirements for scalability and high availability for our tools
    • The growing number of devices and networks as well as the increased complexity of IT infrastructures
  • Increasingly complex infrastructures often force us to utilize multiple tools to obtain the required metrics
    • This leads to a requirement for a single pane of glass to enable centralized monitoring
  • Collecting values is often not enough – we need to be able to leverage the collected data to gain the most value out of it
  • We need a solution that can deliver centralized visualization and reporting based on the obtained data
  • Our tools need to be hand-picked so that they can deliver the best ROI in an already complex infrastructure

Zabbix – Universal Open Source enterprise-level monitoring solution

Zabbix is a Universal free and Open Source enterprise-level monitoring solution. The tool comes at absolutely no cost and is available for everyone to try out and use. Zabbix provides the monitoring of modern IT infrastructures on multiple levels.

Universal is the term that we are focusing on. Given the open-source nature of the product, Zabbix can be used in infrastructures of different sizes – from small and medium organizations to large, globe-spanning enterprises. Zabbix is also capable of delivering monitoring of the whole IT stack – from hardware and network monitoring to high-level monitoring such as Business Service monitoring and more.

Cost-Effectiveness

Zabbix delivers a large set of enterprise-grade features at no cost! Features such as 2FA and single sign-on come with no price tag, and there are no restrictions on data collection methods, the number of monitored devices and services, or database size.

  • Exceptionally low total cost of ownership
    • Free and Open Source solution with quality and security in mind
    • Backed by reliable vendors, a global partner network, and commercial services, such as the 24/7 support
    • No limitations regarding how you use the software
    • Free and readily available documentation, HOWTOs, community resources, videos, and more.
    • Zabbix engineers are easy to find and hire for your organization
    • Cost is fully under your control – Zabbix Commercial services are under fixed-price agreements

Deploy Anywhere

Our users always have the choice of where and how they wish to deploy Zabbix. Official packages are available for the most popular operating systems, such as RHEL, Oracle Linux, Ubuntu, Raspberry Pi OS, and more. With official Helm charts, you can also quickly deploy Zabbix in a Kubernetes cluster or in your OpenShift instance. We also provide official Docker container images with pre-installed Zabbix components that you can deploy in your environment.

We also provide one-click deployment options for multiple cloud service providers, such as Amazon AWS, Microsoft Azure, Google Cloud, OpenStack, and many others.

Monitor Anything

With Zabbix, you can monitor anything – from legacy solutions to modern systems. With a large selection of official solutions and substantial community backing our users can be sure that they can find a suitable approach to monitor their IT infrastructure components. There are hundreds of ready-to-use monitoring solutions by Zabbix.

Whenever you deploy a new IT solution in your enterprise, you will want to tie it together with the existing toolset. Zabbix provides many out-of-the-box integrations for the most popular ticketing and alerting systems.

Recently we have introduced advanced search capabilities for the Zabbix integrations page, which allows you to quickly look up the integrations that currently exist on the market. If you visit the Zabbix integrations page and look up a specific vendor or tool, you will see a list of both the official solutions supported by Zabbix and a long list of community solutions backed by our users, partners, and customers.

Monitoring of Kubernetes and Hybrid Clouds

Nowadays many existing companies are considering migrating their existing infrastructure to either solutions such as Kubernetes or OpenShift, or utilizing cloud service providers such as Amazon AWS or Microsoft Azure.

I am proud to announce that, with the release of Zabbix 6.0 LTS, Zabbix will officially support out-of-the-box monitoring of OpenShift and Kubernetes clusters.

Data collection and Aggregation

Let’s cover a few recent features that improve the out-of-the-box flexibility of Zabbix by a large margin.

Synthetic monitoring is a feature that was introduced a year ago in Zabbix version 5.2 and it has already become quite popular with our user base. The feature enables monitoring of different devices and solutions over the HTTP protocol. By using synthetic monitoring Zabbix can connect to your HTTP endpoints, such as cloud APIs, Kubernetes, and OpenShift APIs, and other HTTP endpoints, collect the metrics and then process them to extract the required information. Synthetic monitoring is extremely transparent and flexible – it can be fine-tuned to communicate with any HTTP endpoints.

Another major feature introduced in Zabbix 5.4 is the new trigger syntax. This enables our users to define much more flexible trigger expressions, supporting many new problem detection use cases. In addition, we can use this syntax to perform flexible data aggregation operations. For example, we can now aggregate data filtered by wildcards, tags, and host groups, instead of specifying individual items. This is extremely valuable for monitoring complex infrastructures, such as Kubernetes or cloud environments. At the same time, the new syntax is a lot simpler to learn and understand than the old trigger syntax.
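
To give a quick feel for the difference: a trigger threshold that used to be written as {myhost:system.cpu.util.last()}>90 now becomes last(/myhost/system.cpu.util)>90, and an aggregate calculated item across a whole host group can be expressed roughly like this (an illustrative sketch; check the documentation for the exact set of foreach functions):

avg(last_foreach(/*/system.cpu.util?[group="Kubernetes nodes"]))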

Security on all levels

Many companies are concerned about security and data protection when it comes to the tools that they are using in their day-to-day tasks. I’m happy to tell you that Zabbix follows the highest security standards when it comes to the development and usage of the product.

Zabbix is secure by design. In the diagram below you can see all of the Zabbix components, all of which are interconnected, like Zabbix Agent, Server, Proxy, Database, and Frontend. All of the communication between different Zabbix components can be encrypted by using strong encryption protocols like TLS.

If you’re using Zabbix Agent, the agent does not require root privileges. You can run Zabbix Agent under a normal user with all of the necessary user level restrictions in place. Zabbix agent can also be restricted with metric allow and deny lists, so it has access only to the metrics which are permitted for collection by your company policies.
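
For example, a couple of lines in zabbix_agentd.conf are enough to block remote command execution and restrict file access (the path below is purely illustrative):

DenyKey=system.run[*]
DenyKey=vfs.file.contents[/etc/*]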

The connections between the Zabbix database backend and the Zabbix Frontend and Zabbix Server also support encryption as of version 5.0 LTS.

As for the frontend component – users can add an additional security layer for their Zabbix frontends by configuring 2FA and SSO logins. Zabbix 6.0 LTS also introduces flexible login password complexity requirements, which can reduce the security breach risk if your frontend is exposed to the internet. To ensure that Zabbix meets the highest standards of the company security compliance, the new Audit log, introduced in Zabbix 6.0 LTS, is capable of logging all of the Zabbix Frontend and Zabbix Server operations.

For an additional security layer – sensitive information like Usernames, Passwords, API keys can be stored in an external vault. Currently, Zabbix supports secret storage in the HashiCorp Vault. Support for the CyberArk vault will be added in the Zabbix 6.2 release.

Another Zabbix feature – the Zabbix API, is often used for the automation of day-to-day configuration workflows, as well as custom integrations and data migration tasks. Zabbix 5.4 added the ability to create API tokens for particular frontend users with pre-defined token expiration dates.

In Zabbix 5.2 we added another layer for the Zabbix Frontend user permissions – User Roles. Now it is possible to define granular user roles with different types of rights and privileges, assigned to specific types of users in your organization. With User Roles, we can define which parts of the Zabbix UI the specific user role has access to and which UI actions the members of this role can perform. This can be combined with API method restrictions which can also be defined for a particular role.

Powerful Solution for MSPs

When we combine all of these features, we can see how Zabbix becomes a powerful solution for MSP customers. MSPs can use Zabbix as an added-value service: they can provide a monitoring service for their customers and get additional revenue out of it. It is possible to build a customer portal by combining User Roles for read-only access to dashboards, a customized UI using the rebranding option that was just introduced in Zabbix 6.0 LTS, and SLA reporting together with scheduled PDF reports, so that customers can receive reports on a daily, weekly, or monthly basis.

Scalability and High Availability

With a growing number of devices and ever-increasing network complexity, Scalability and High availability are extremely important requirements.

Zabbix provides Load balancing options for Zabbix UI and Zabbix API. In order to scale the Zabbix Frontend and Zabbix API, we can simply deploy additional Zabbix Frontend nodes, thus introducing redundancy and high availability.

Zabbix 6.0 LTS comes with out-of-the-box support for the Zabbix Server High Availability cluster. If one of the Zabbix Server nodes goes down, Zabbix will automatically switch to one of the standby nodes. And the best thing about the Zabbix Server High Availability cluster: it takes only 5 minutes to get it up and running. The HA cluster is very easy to configure and use.
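
Joining a node to the cluster really does boil down to a couple of parameters in zabbix_server.conf (the values below are just examples):

HANodeName=zabbix-node-1
NodeAddress=192.168.0.11:10051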

One of the features in our future roadmap is introducing support for the History API to work with different time-series DB backends for extra efficiency and scalability. Another feature that we would like to implement in the future is load balancing for Zabbix Servers and Zabbix Proxies. Combining all of these features would truly make Zabbix a cloud-native application with unlimited horizontal scalability.

Machine learning and Statistical analysis

Defining static trigger thresholds is a relatively simple task, but it doesn’t scale too well in dynamic environments. With Machine Learning and Statistical Analysis, we can analyze our data trends and perform anomaly detection. This has been greatly extended in Zabbix 6.0 LTS with Anomaly Detection and Baseline Monitoring functionality.

Zabbix 6.0 adds an extended set of functions for trend analysis and trend prediction. These support multiple flexible parameters, such as the ability to define seasonality for your data analysis. This is another way to get additional insights out of the data collected by Zabbix.

More value to users

When I think about the direction that Zabbix is headed in, and look at the Zabbix roadmap, one of the main questions I ask is “How can we deliver more value to our enterprise users?”

In Zabbix 6.0 LTS we made some major steps to make Zabbix fit not only for infrastructure monitoring but also for Business Service monitoring – the monitoring of services that we provide for our end users or internal company users. Zabbix 6.0 LTS comes with complex service level object definitions, real-time SLA reporting, multi-tenancy options, Business Service alerting options, and root cause and impact analysis.

New visualization capabilities

It is important to present the collected data in a human-readable way. That’s why we invest a lot of time and effort in order to improve the native visualization capabilities. In Zabbix 6.0 LTS we have introduced Geographical Maps together with additional widgets for TOP N reporting and templated and multi-page dashboards.

The introduction of reports in Zabbix 5.2 allowed our users to leverage their Zabbix Dashboards to generate scheduled PDF reports with respect to user permissions. Our users can generate daily, weekly, monthly or yearly reports and send them to their infrastructure administrators or customers.

IoT monitoring

With the introduction of support for Modbus and MQTT protocols, Zabbix can be used to monitor IoT devices and obtain environmental information from different sensors such as temperature, humidity, and more. In addition, Zabbix can now be used to monitor factory equipment, building management systems, IoT gateways, and more.

Infrastructure as code

With IT infrastructures growing in scale, automation is more important than ever. For this reason, many companies prefer preserving and deploying their infrastructure as code. With support for the YAML format for our templates, you can keep them in a git repository and, by utilizing CI/CD tools, deploy your templates automatically.

This enables our users to manage their templates in a central location – the git repository – which helps with change management and versioning, and then deploy the templates to Zabbix by using CI/CD tools.
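
As a sketch of what such a CI/CD step could look like, a pipeline job might push a YAML template through the configuration.import API method. The URL, the API token, and the import rules below are examples rather than a prescribed setup:

import requests

ZABBIX_API = "https://zabbix.example.com/api_jsonrpc.php"
API_TOKEN = "<ZABBIX_API_TOKEN>"

def import_template(path):
    with open(path, encoding="utf-8") as f:
        source = f.read()
    payload = {
        "jsonrpc": "2.0",
        "method": "configuration.import",
        "params": {
            "format": "yaml",
            "source": source,
            "rules": {
                "templates": {"createMissing": True, "updateExisting": True},
                "items": {"createMissing": True, "updateExisting": True},
                "triggers": {"createMissing": True, "updateExisting": True},
            },
        },
        "auth": API_TOKEN,
        "id": 1,
    }
    response = requests.post(ZABBIX_API, json=payload)
    response.raise_for_status()
    print(response.json())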

Tags for classification

Over the past few versions, we have made a major push to support tags for most Zabbix entities. The switch from applications to tags in Zabbix 5.4 made the tool much more flexible. Tags can now be used for the classification of items, triggers, hosts, business services. The tags that the users define can also be used in alerting, filtering, and reporting.

What’s next?

You’re probably wondering – what’s coming next? What are the main vectors for the future development of Zabbix?

First off – we will continue to invest in usability. While the tool is made by professionals for professionals, it is important for us to make using the tool as easy as possible. Improvements to the Zabbix Frontend, general usability, and UX can be expected very soon.

We plan to continue to invest in the visualization and reporting capabilities of Zabbix. We want all data collected by our monitoring tool to be presented in a single pane of glass. This way our users can see the full picture of their environment, along with the root cause analysis for the ongoing problems they face, and get the most out of the data that Zabbix collects.

Extending the scope of monitoring is an ongoing process for us. We would like to implement additional features for compliance monitoring. I think that we will be able to introduce a solution for application performance monitoring very soon. We’d like to make log monitoring more powerful and comprehensive. Monitoring of public and private clouds is also very important to us, given the current IT paradigms.

We’d like to make sure that Zabbix is absolutely extendable on all levels. While we can already extend Zabbix with different types of plugins, webhooks, and UI modules there’s more to come in the near future.

The topic of high availability, scalability, and load balancing is extremely important to us. We will continue building on the existing foundations to make Zabbix a truly cloud-native solution.

Advanced event correlation engine

Advanced event processing is a really important topic. When we talk about a monitoring solution, we pay a lot of attention to the number of metrics that we are collecting. We mustn’t forget that, for large-scale environments, the number of events that we generate based on those metrics is also extremely important. We need to keep control of and manage the ever-growing number of different events coming from different sources. This is why we would like to focus on noise reduction and, specifically, root cause analysis.

For this reason, we can expect Zabbix to introduce an advanced event correlation model in the future. This model should have the ability to filter and deduplicate the events as well as perform event enrichment, thus leading to a much better root cause analysis.

Multi DC Monitoring

Currently, Multi DC monitoring can be done with Zabbix by deploying a distributed Zabbix instance that utilizes Zabbix proxies. But there are use cases, where it would be more beneficial to have multiple Zabbix servers deployed across different datacenters – all reporting to a single location for centralized event processing, centralized visualization, and reporting as well as centralized dashboards. This is something that is coming soon to Zabbix.

Zabbix Release Schedule

Of course, the burning question is – when is Zabbix 6.0 LTS going to be released? And we are very close to finalizing the next LTS release. I would expect Zabbix 6.0 LTS to be officially released in January 2022.

As for Zabbix 6.2 and 6.4 – these releases are still planned for Q2 and Q4, 2022. The next LTS release – Zabbix 7.0 LTS is planned to be released in Q2, 2023.

Zabbix Roadmap

If you want to follow the development of Zabbix – we have a special page just for that – the Zabbix Roadmap. Here you can find up-to-date information about the development plans for Zabbix 6.2, 6.4, and 7.0 LTS. The Roadmap also represents the current development status of Zabbix 6.0 LTS.

Questions

Q: What would you say is the main benefit of why users should migrate from Zabbix 5.0/4.0 or older versions to 6.0 LTS?

A: I think that Zabbix 6.0 LTS is a very different product – even when you compare it with the relatively recent Zabbix 5.0 LTS. It comes with many improvements, some of which I mentioned here in my keynote. For example, Business Service monitoring provides huge added value to enterprise customers.

With the new trigger syntax and the new functions related to anomaly detection and baseline monitoring our users can get much more out of the data that they already have in their monitoring tool.

The new visualization options – multiple new widgets, geographical maps, scheduled PDF reporting provide a lot of added value to our end-users and to their customers as well.

Q: Any plans to make changes on the Zabbix DB backend level – make it more scaleable or completely redesign it?

A: Right now we keep all of our information in a relational database such as MySQL or PostgreSQL. We have added the support for TimescaleDB which brings some huge advantages to our users, thanks to improved data storage and performance efficiency.

But we still have users that wish to connect different storage engines to Zabbix – maybe specifically optimized to keep time-series data. Actually, this is already on our roadmap. Our plan is to introduce a unified API for historical data so that if you wish to attach your own storage, we just have to deploy a plugin that will communicate both with our historical API and also talk to the storage engine of your choosing. This feature is coming and is already on our Roadmap.

Q: What is your personal favorite feature? Something that you 100% wanted to see implemented in Zabbix 6.0 LTS?

A: I see Zabbix 6.0 LTS as a combination of Zabbix 5.2, 5.4, and finally the features introduced directly in Zabbix 6.0 LTS. Personally, I think that my favorite features in Zabbix 6.0 LTS are features that make up the latest implementation of Anomaly detection.

We could be at the very beginning of exploring more advanced machine learning and statistical analysis capabilities, but I’m pretty sure that with every new release of Zabbix there will be new features related to machine learning, anomaly detection, and trend prediction.

This could provide a way for Zabbix to generate and share insights with our users. Analysis of what’s happening with your system, with your metrics – how the metrics in your system behave.

Deploying Zabbix in Amazon Web Services cloud platform

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/deploying-zabbix-in-amazon-web-services-cloud-platform/17283/

With the rapid evolution and proliferation of different cloud services, many organizations have decided to move parts of their infrastructures from on-prem to cloud. As an essential part of your infrastructure, Zabbix is no exception – you always have the option to either deploy Zabbix on-prem or select from one of the many supported cloud service providers to deploy your Zabbix Server or Zabbix Proxy on.

In this blog post, let’s look at how we can quickly deploy Zabbix Server and Zabbix Proxy nodes in Amazon Web Services cloud platform.

Deploying the Zabbix Server in AWS

Let’s begin with the Zabbix download page. Under the Zabbix Cloud Images section, select the AWS cloud vendor and then the Cloud Image you wish to deploy. Let’s start with Zabbix Server 5.0 with MySQL DB backend and Nginx Web server backend for our frontend.

Next, we will be redirected to the AWS marketplace, where we will have to subscribe to the Zabbix Server 5.0 image.

Once we have subscribed to the Zabbix Server image, we can continue with the deployment configuration.

Next, we must select our Region, Zabbix minor version (usually the latest available), and the Fulfillment option. Once that is done, we can finalize the launch configuration.

Select the preferred Launch option, EC2 Instance Type, VPC, and Subnet settings on the Launch page.

Next – We have to select or create a security group.

We also have to select or generate an EC2 key pair – make sure to save your private key in a safe location!

Note that creating a security group based on seller settings does not guarantee that the group will have an inbound SSH access rule! Make sure to double-check the security group and manually add the SSH inbound rule if it hasn’t yet been added. We will need to access this instance via SSH to obtain the initial frontend login credentials!

Once you click on the Launch button, the deployment process for your Zabbix application will be initiated.

Accessing the application

Let’s open up the Instances section and open our newly deployed Zabbix instance.

We can access the Zabbix frontend by opening the Public IPv4 address or Public IPv4 DNS of the Zabbix instance.

Note that the Zabbix frontend password is still unknown to us. Recall how I mentioned that we would need to access the instance via SSH to obtain the frontend password. Let’s do so now.

Write down the login credentials and use them to log in to the Zabbix instance.

Accessing the database

In case we wish to access the Zabbix database backend, we can do so from the command line. The Zabbix database can be accessed using the root user, which by default works without explicitly providing a password.

The MySQL root password is stored in /root/.my.cnf configuration file.
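
Since the MySQL client reads that file automatically when run as root, connecting is as simple as (assuming the default database name of zabbix):

mysql zabbix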

Modifying the Zabbix Frontend timezone

By default, the Zabbix frontend uses the “UTC” timezone. If you need to change it, edit php_value[date.timezone] PHP variable in /etc/php-fpm.d/zabbix.conf and restart php-fpm process:

systemctl restart php-fpm

Zabbix proxy

If you wish to deploy a Zabbix proxy instance in your AWS cloud, the deployment steps are very much the same. Most likely, you will still require SSH access if you wish to perform some configuration changes in the Zabbix proxy configuration file.

Note that, by default, the SQLite proxy database is stored in /tmp/zabbix_proxy.sqlite3

As always, don’t forget to point the proxy at your Zabbix server instance by modifying the Server parameter in the Zabbix proxy configuration file, located in /etc/zabbix/zabbix_proxy.conf
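
For example (the restart command assumes the standard systemd unit name):

# /etc/zabbix/zabbix_proxy.conf
Server=<ZABBIX SERVER IP OR DNS>

systemctl restart zabbix-proxy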

And that’s all! With just a few clicks, we are able to deploy a fully functional Zabbix instance or a small Zabbix proxy to distribute or scale our monitoring. Don’t forget that AWS is just one of the many cloud service providers you can use with Official Zabbix images. If you have any questions about the AWS deployment – you are very much encouraged to leave a comment under this blog post.

If you wish to learn more about the Zabbix Monitoring solution, check out the official documentation https://www.zabbix.com/documentation/current/manual/quickstart.

How We Cut GrabFood.com’s Page JavaScript Asset Sizes by 3x

Post Syndicated from Grab Tech original https://engineering.grab.com/grabfood-bundle-size

Introduction

Every week, GrabFood.com’s cloud infrastructure serves over 1TB of network egress and 175 million requests, which drives up our cloud costs. To minimise cloud costs, we had to look at optimising (and reducing) GrabFood.com’s bundle size.

Any reduction in bundle size helps with:

  • Faster site loads! (especially for locations with lower mobile broadband speeds)
  • Cost savings for users: Less data required for each site load
  • Cost savings for Grab: Less network egress required to serve users
  • Faster build times: Fewer dependencies -> less code for webpack to bundle -> faster builds
  • Smaller builds: Fewer dependencies -> less code -> smaller builds

After applying the 7 webpack bundle optimisations, we were able to yield the following improvements:

  • 7% faster page load time from 2600ms to 2400ms
  • 66% faster JS static asset load time from 180ms to 60ms
  • 3x smaller JS static assets from 750KB to 250KB
  • 1.5x less network egress from 1800GB to 1200GB
  • 20% less for CloudFront costs from $1750 to $1400
  • 1.4x smaller bundle from 40MB to 27MB
  • 3.6x faster build time from ~2000s to ~550s

Solution

One of the biggest factors influencing bundle size is dependencies. As mentioned earlier, fewer dependencies mean fewer lines of code to compile, which result in a smaller bundle size. Thus, to optimise GrabFood.com’s bundle size, we had to look into our dependencies.

Tldr;

Jump to Step C: Reducing your Dependencies to see the 7 strategies we used to cut down our bundle size.

Step A: Identify Your Dependencies

In this step, we need to ask ourselves ‘what are our largest dependencies?’. We used the webpack-bundle-analyzer to inspect GrabFood.com’s bundles. This gave us an overview of all our dependencies and we could easily see which bundle assets were the largest.

Our grabfood.com bundle analyzer output
  • For Next.js, you should use @next/bundle-analyzer instead.
  • Bundle analysis output allows us to easily inspect what’s in our bundle.

What to look out for:

I: Large dependencies (fairly obvious, because the box size will be large)

II: Duplicate dependencies (same library that is bundled multiple times across different assets)

III: Dependencies that look like they don’t belong (e.g. Why is ‘elliptic’ in my frontend bundle?)

What to avoid:

  • Isolating dependencies that are very small (e.g. <20kb). Not worth focusing on this due to very meagre returns.
    • E.g. Business logic like your React code
    • E.g. Small node dependencies

Step B: Investigate the Usage of Your Dependencies (Where are my Dependencies Used?)

In this step, we are trying to answer this question: “Given a dependency, which files and features are making use of it?”.

Our grabfood.com bundle analyzer output
Image source

There are two broad approaches that can be used to identify how our dependencies are used:

I: Top-down approach: “Where does our project use dependency X?”

  • Conceptually identify which feature(s) requires the use of dependency X.
  • E.g. Given that we have ‘jwt-simple’ as a dependency, which set of features in my project requires JWT encoding/decoding?

II: Bottom-up approach: “How did dependency X get used in my project?”

  • Trace dependencies by manually tracing import() and require() statements
  • Alternatively, use dependency visualisation tools such as dependency-cruiser to identify file interdependencies. Note that output can quickly get noisy for any non-trivial project, so use it for inspecting small groups of files (e.g. single domains).

Our recommendation is to use a mix of both Top-down and Bottom-up approaches to identify and isolate dependencies.

Dos:

  • Be methodical when tracing dependencies: Use a document to track your progress as you manually trace inter-file dependencies.
  • Use dependency visualisation tools like dependency-cruiser to quickly view a given file’s dependencies.
  • Consult Dr. Google if you get stuck somewhere, especially if the dependencies are buried deep in a dependency tree i.e. non-1st-degree dependencies (e.g. “Why webpack includes elliptic bn.js modules in bundle”)

Don’ts:

  • Stick to a single approach – Know when to switch between Top-down and Bottom-up approaches to narrow down the search space.

Step C: Reducing Your Dependencies

Now that you know what your largest dependencies are and where they are used, the next step is figuring out how you can shrink your dependencies.

Our grabfood.com bundle analyzer output
Image source

Here are 7 strategies that you can use to reduce your dependencies:

  1. Lazy load large dependencies and less-used dependencies
  2. Unify instances of duplicate modules
  3. Use libraries that are exported in ES Modules format
  4. Replace libraries whose features are already available on the Browser Web API
  5. Avoid large dependencies by changing your technical approach
  6. Avoid using node dependencies or libraries that require node dependencies
  7. Optimise your external dependencies

Note: These strategies have been listed in ascending order of difficulty – focus on the easy wins first 🙂

1. Lazy Load Large Dependencies and Less-used Dependencies

“When a file adds +2MB worth of dependencies”, Image source

Similar to how lazy loading is used to break down large React pages to improve page performance, we can also lazy load libraries that are rarely used, or are not immediately used until prior to certain user actions.

Before:


const crypto = require('crypto')

const computeHash = (value, secret) => {

 return crypto.createHmac(value, secret)

}

After:


const computeHash = async (value, secret) => {

 const crypto = await import('crypto')

 return crypto.createHmac(value, secret)

}

Example:

  • Scenario: Use of Anti-abuse library prior to sensitive API calls
  • Action: Instead of bundling the anti-abuse library together with the main page asset, we opted to lazy load the library only when we needed to use it (i.e. load the library just before making certain sensitive API calls).
  • Results: Saved 400KB on the main page asset.

Notes:

  • Any form of lazy loading will incur some latency for the user, since the lazy-loaded asset must be fetched over the network before it can be used.

2. Unify Instances of Duplicate Modules

Image source

If you see the same dependency appearing in multiple assets, consider unifying these duplicate dependencies under a single entrypoint.

Before:


// ComponentOne.jsx

import GrabMaps from 'grab-maps'

// ComponentTwo.jsx

import GrabMaps, { Marker } from 'grab-maps'

After:


// grabMapsImportFn.js

const grabMapsImportFn = () => import('grab-maps')

// ComponentOne.tsx

const grabMaps = await grabMapsImportFn()

const GrabMaps = grabMaps.default

// ComponentTwo.tsx

const grabMaps = await grabMapsImportFn()

const GrabMaps = grabMaps.default

const Marker = grabMaps.Marker

Example:

  • Scenario: Duplicate ‘grab-maps’ dependencies in bundle
  • Action: We observed that we were bundling the same ‘grab-maps’ dependency in 4 different assets so we refactored the application to use a single entrypoint, ensuring that we only bundled one instance of ‘grab-maps’.
  • Results: Saved 2MB on total bundle size.

Notes:

  • Alternative approach: Manually define a new cacheGroup to target a specific module (see more) with ‘enforce:true’, in order to force webpack to always create a separate chunk for the module. Useful for cases where the single dependency is very large (i.e. >100KB), or when asynchronously loading a module isn’t an option.
  • Certain libraries that appear in multiple assets (e.g. antd) should not be mistaken as identical dependencies. You can verify this by inspecting each module with one another. If the contents are different, then webpack has already done its job of tree-shaking the dependency and only importing code used by our code.
  • Webpack relies on the import() statement to identify that a given module is to be explicitly bundled as a separate chunk (see more).

3. Use Libraries that are Exported in ES Modules Format

“Did you say ‘tree-shaking’?”, Image source
  • If a given library has a variant with an ES Module distribution, use that variant instead.
  • ES Modules allows webpack to perform tree-shaking automatically, allowing you to save on your bundle size because unused library code is not bundled.
  • Use bundlephobia to quickly ascertain if a given library is tree-shakeable (e.g. ‘lodash-es’ vs ‘lodash’)

Before:


import { get } from 'lodash'

After:


import { get } from 'lodash-es'

Example:

  • Use Case: Using Lodash utilities
  • Action: Instead of using the standard ‘lodash’ library, you can swap it out with ‘lodash-es’, which is bundled using ES Modules and is functionally equivalent.
  • Results: Saved 0KB – We were already directly importing individual Lodash functions (e.g. ‘lodash/get’), therefore importing only the code we need. Still, ES Modules is a more convenient way to go about this 👍.

Notes:

  • Alternative approach: Use babel plugins (e.g. ‘babel-plugin-transform-imports’) to transform your import statements at build time to selectively import specific code for a given library.

4. Replace Libraries whose Features are Already Available on the Browser Web API

“When you replace axios with fetch”, Image source

If you are relying on libraries for functionality that is available on the Web API, you should revise your implementation to leverage on the Web API, allowing you to skip certain libraries when bundling, thus saving on bundle size.

Before:


import axios from 'axios'

const getEndpointData = async () => {

 const response = await axios.get('/some-endpoint')

 return response

}

After:


const getEndpointData = async () => {

 const response = await fetch('/some-endpoint')

 return response

}

Example:

  • Use Case: Replacing axios with fetch() in the anti-abuse library
  • Action: We observed that our anti-abuse library was relying on axios to make web requests. Since our web app is only targeting modern browsers – most of which support fetch() (with the notable exception of IE) – we refactored the library’s code to use fetch() exclusively.
  • Results: Saved 15KB on anti-abuse library size.

5. Avoid Large Dependencies by Changing your Technical Approach

Image source

If it is acceptable to change your technical approach, we can avoid using certain dependencies altogether.

Before:


import jwt from 'jwt-simple'

const encodeCookieData = (data) => {

 const result = jwt.encode(data, 'some-secret')

 return result

}

After:


const encodeCookieData = (data) => {

 const result = JSON.stringify(data)

 return result

}

Example:

  • Scenario: Encoding for browser cookie persistence
  • Action: As we needed to store certain user preferences in the user’s browser, we previously opted to use JWT encoding; this involved signing JWTs on the client side, which has a hard dependency on ‘crypto’. We revised the implementation to use plain JSON encoding instead, removing the need for ‘crypto’.
  • Results: Saved 250KB per page asset, 13MB in total bundle size.

6. Avoid Using Node Dependencies or Libraries that Require Node Dependencies

“When someone does require(‘crypto’)”, Image source

You should not need to use node-related dependencies, unless your application relies on a node dependency directly or indirectly.

Examples of node dependencies: ‘Buffer’, ‘crypto’, ‘https’ (see more)

Before:


import jwt from 'jsonwebtoken'

const decodeJwt = async (value) => {

 const result = await new Promise((resolve) => {

 jwt.verify(value, 'some-secret', (err, decoded) => resolve(decoded))

 })

 return result

}

After:


import jwt_decode from 'jwt-decode'

const decodeJwt = (value) => {

 const result = jwt_decode(value)

 return result

}

Example:

  • Scenario: Decoding JWTs on the client side
  • Action: In terms of JWT usage on the client side, we only need to decode JWTs – we do not need any logic related to encoding JWTs. Therefore, we can opt to use libraries that perform just decoding (e.g. ‘jwt-decode’) instead of libraries (e.g. ‘jsonwebtoken’) that performs the full suite of JWT-related operations (e.g. signing, verifying).
  • Results: Same as in Point 5: Example. (i.e. no need to decode JWTs anymore, since we aren’t using JWT encoding for browser cookie persistence)

7. Optimise your External Dependencies

“Team: Can you reduce the bundle size further? You: (nervous grin)“, Image source

We can do a deep-dive into our dependencies to identify possible size optimisations by applying all the aforementioned techniques. If your size optimisation changes get accepted, regardless of whether it’s publicly (e.g. GitHub) or privately hosted (own company library), it’s a win-win for everybody! 🥳

Example:

  • Scenario: Creating custom ‘node-forge’ builds for our Anti-abuse library
  • Action: Our Anti-abuse library only uses certain features of ‘node-forge’. Thankfully, the ‘node-forge’ maintainers have provided an easy way to make custom builds that only bundle selective features (see more).
  • Results: Saved 85KB in Anti-abuse library size and reduced bundle size for all other dependent projects.

Step D: Verify that You have Modified the Dependencies

“Now… where did I put that needle?”, Image source

So, you’ve found some opportunities for major bundle size savings, that’s great!

But as always, it’s best to be methodical to measure the impact of your changes, and to make sure no features have been broken.

  1. Perform your code changes
  2. Build the project again and open the bundle analysis report
  3. Verify the state of a given dependency
    • Deleted dependency – you should not be able to find the dependency
    • Lazy-loaded dependency – you should see the dependency bundled as a separate chunk
    • Non-duplicated dependency – you should only see a single chunk for the non-duplicated dependency
  4. Run tests to make sure you didn’t break anything (i.e. unit tests, manual tests)

Other Considerations

Preventive Measures

  • Periodically monitor your bundle size to identify increases in bundle size
  • Periodically monitor your site load times to identify increases in site load times

Webpack Configuration Options

  1. Disable bundling node modules with ‘node: false’
    • Only if your project doesn’t already include libraries that rely on node modules.
    • Allows for fast detection when someone tries to use a library that requires node modules, as the build will fail
  2. Experiment with ‘cacheGroups’
    • Most default configurations of webpack do a pretty good job of identifying and bundling the most commonly used dependencies into a single chunk (usually called vendor.js)
    • You can experiment with webpack optimisation options to see if you get better results
  3. Experiment with import() ‘Magic Comments’
    • You may experiment with import() magic comments to modify the behaviour of specific import() statements, although the default setting will do just fine for most cases.

If you can’t remove the dependency:

  • For all dependencies that must be used, it’s probably best to lazy load all of them so you won’t block the page’s initial rendering (see more).

Conclusion

Image source

To summarise, here’s how you can go about this business of reducing your bundle size.

Namely…

  1. Identify Your Dependencies
  2. Investigate the Usage of Your Dependencies
  3. Reduce Your Dependencies
  4. Verify that You have Modified the Dependencies

And by using these 7 strategies…

  1. Lazy load large dependencies and less-used dependencies
  2. Unify instances of duplicate modules
  3. Use libraries that are exported in ES Modules format
  4. Replace libraries whose features are already available on the Browser Web API
  5. Avoid large dependencies by changing your technical approach
  6. Avoid using node dependencies
  7. Optimise your external dependencies

You can have…

  • Faster page load time (smaller individual pages)
  • Smaller bundle (fewer dependencies)
  • Lower network egress costs (smaller assets)
  • Faster builds (fewer dependencies to handle)

Now armed with this information, may your eyes be keen, your bundles be lean, your sites be fast, and your cloud costs be low! 🚀 ✌️


Special thanks to Han Wu, Melvin Lee, Yanye Li, and Shujuan Cheong for proofreading this article. 🙂


Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Choosing a CI/CD approach: AWS Services with BigHat Biosciences

Post Syndicated from Mike Apted original https://aws.amazon.com/blogs/devops/choosing-ci-cd-aws-services-bighat-biosciences/

Founded in 2019, BigHat Biosciences’ mission is to improve human health by reimagining antibody discovery and engineering to create better antibodies faster. Their integrated computational + experimental approach speeds up antibody design and discovery by combining high-speed molecular characterization with machine learning technologies to guide the search for better antibodies. They apply these design capabilities to develop new generations of safer and more effective treatments for patients suffering from today’s most challenging diseases. Their platform, from wet lab robots to cloud-based data and logistics plane, is woven together with rapidly changing BigHat-proprietary software. BigHat uses continuous integration and continuous deployment (CI/CD) throughout their data engineering workflows and when training and evaluating their machine learning (ML) models.

 

BigHat Biosciences Logo

 

In a previous post, we discussed the key considerations when choosing a CI/CD approach. In this post, we explore BigHat’s decisions and motivations in adopting managed AWS CI/CD services. You may find that your organization has commonalities with BigHat and some of their insights may apply to you. Throughout the post, considerations are informed and choices are guided by the best practices in the AWS Well-Architected Framework.

How did BigHat decide what they needed?

Making decisions on appropriate (CI/CD) solutions requires understanding the characteristics of your organization, the environment you operate in, and your current priorities and goals.

“As a company designing therapeutics for patients rather than software, the role of technology at BigHat is to enable a radically better approach to develop therapeutic molecules,” says Eddie Abrams, VP of Engineering at BigHat. “We need to automate as much as possible. We need the speed, agility, reliability and reproducibility of fully automated infrastructure to enable our company to solve complex problems with maximum scientific rigor while integrating best in class data analysis. Our engineering-first approach supports that.”

BigHat possesses a unique insight to an unsolved problem. As an early stage startup, their core focus is optimizing the fully integrated platform that they built from the ground-up to guide the design for better molecules. They respond to feedback from partners and learn from their own internal experimentation. With each iteration, the quality of what they’re creating improves, and they gain greater insight and improved models to support the next iteration. More than anything, they need to be able to iterate rapidly. They don’t need any additional complexity that would distract from their mission. They need uncomplicated and enabling solutions.

They also have to take into consideration the regulatory requirements that apply to them as a company, the data they work with and its security requirements; and the market segment they compete in. Although they don’t control these factors, they can control how they respond to them, and they want to be able to respond quickly. It’s not only speed that matters in designing for security and compliance, but also visibility and track-ability. These often overlooked and critical considerations are instrumental in choosing a CI/CD strategy and platform.

“The ability to learn faster than your competitors may be the only sustainable competitive advantage,” says Cindy Alvarez in her book Lean Customer Development.

The tighter the feedback loop, the easier it is to make a change. Rapid iteration allows BigHat to easily build upon what works, and make adjustments as they identify avenues that won’t lead to success.

Feature set

CI/CD applies to more than the traditional use case of delivering software in the classic fashion. BigHat applies CI/CD in their data engineering workflows and in training their ML models, and uses automated solutions in all aspects of their workflow. Automation further supports turning what they have created internally into advances in antibody design and development, and ultimately into safer, more effective treatments.

“We see a broadening of the notion of what can come under CI/CD,” says Abrams. “We use automated solutions wherever possible including robotics to perform scaled assays. The goal in tightening the loop is to improve precision and speed, and reduce latency and lag time.”

BigHat reached the conclusion that they would adopt managed service offerings wherever possible, including in their CI/CD tooling and other automation initiatives.

“The phrase ‘undifferentiated heavy lifting’ has always resonated,” says Abrams. “Building, scaling, and operating core software and infrastructure are hard problems, but solving them isn’t itself a differentiating advantage for a therapeutics company. But whether we can automate that infrastructure, and how we can use that infrastructure at scale on a rock solid control plane to provide our custom solutions iteratively, reliably and efficiently absolutely does give us an edge. We need an end-to-end, complete infrastructure solution that doesn’t force us to integrate a patchwork of solutions ourselves. AWS provides exactly what we need in this regard.”

Reducing risk

Startups can be full of risk, with the upside being potential future reward. They face risk in finding the right problem, in finding a solution to that problem, and in finding a viable customer base to buy that solution.

A key priority for early-stage startups is removing risk from as many areas of the business as possible. Any step a startup can take to remove risk without commensurately limiting reward makes them more viable. The more risk a startup can drive out of its hypothesis, the more likely its success, in part because it becomes more attractive to customers, employees, and investors alike. The more likely their product solves a customer’s problem, the more willing that customer is to give it a chance. Likewise, they become more attractive to investors than alternative startups that carry greater risk in reaching their next major milestone.

Adopting managed services for CI/CD accomplishes this goal in several ways. The most important advantage remains speed: the core functionality can be stood up very quickly because it’s an existing service, and customers have a large body of reference examples and documentation showing how to use it. Managed services also insulate teams from the need to configure and then operate the underlying infrastructure, so the team remains focused on its differentiation and core value proposition.

“We are automated right up to the organizational level and because of this, running those services ourselves represents operational risk,” says Abrams. “The largest day-to-day infrastructure risk to us is having the business stalled while something is not working. Do I want to operate these services, and focus my staff on that? There is no guarantee I can just throw more compute at a self-managed software service I’m running and make it scale effectively. There is no guarantee that if one datacenter is having a network or electrical problem that I can simply switch to another datacenter. I prefer AWS manages those scale and uptime problems.”

Embracing an opinionated model

BigHat is a startup with a singular focus on using ML to reduce the time and difficulty of designing antibodies and other therapeutic proteins. By adopting managed services, they have removed the burden of implementing and maintaining CI/CD systems.

Accepting the opinionated guardrails of the managed service approach allows, and to a degree reinforces, the focus on what makes a startup unique. Rather than being focused on performance tuning, making decisions on what OS version to use, or which of the myriad optional puzzle pieces to put together, they can use a well-integrated set of tools built to work with each other in a defined fashion.

The opinionated model means best practices are baked into the toolchain. Instead of hiring for specialized administration skills, they can hire for specialized biotech skills.

“The only degrees of freedom I care about are the ones that improve our technologies and reduce the time, cost, and risk of bringing a therapeutic to market,” says Abrams. “We focus on exactly where we can gain operational advantages by simply adopting managed services that already embrace the Well-Architected Framework. If we had to tackle all of these engineering needs with limited resources, we would be spending into a solved problem. Before AWS, startups just didn’t do these sorts of things very well. Offloading this effort to a trusted partner is pretty liberating.”

Beyond the reduction in operational concerns, BigHat can also expect continuous improvement of each service over time, delivered automatically by the provider. For their use case, they will likely derive more benefit at less cost over time, without additional investment on their part.

Overview of solution

BigHat builds on a set of key managed AWS services, shown in the reference architecture below.

 

BigHat Reference Architecture

Security

Managed services are supported, owned, and operated by the provider. This allows BigHat to leave concerns like patching and securing the underlying infrastructure and services to the provider. BigHat still has responsibilities under the shared responsibility model, but their scope of concern is significantly narrowed: the surface area they’re responsible for is reduced, helping to minimize risk. Choosing a partner with best-in-class observability, tracking, compliance, and auditing tools is critical for any company that manages sensitive data.

Cost advantages

A startup must also make strategic decisions about where to deploy the capital it has raised from investors. Vendor-managed services bring a consumption-based model and let the startup decide where it wants to spend. This is often referred to as an operational expense (OpEx) model, in other words “pay as you go”, like a utility. It stands in contrast to a large upfront investment of both time and capital to build these tools. Avoiding the extensive engineering effort needed to stand these tools up, and the continued investment to evolve them, acts as a form of capital expenditure (CapEx) avoidance. Startups can allocate their capital where it matters most to them.

“This is corporate-level changing stuff,” says Abrams. “We engage in a weekly leadership review of cost budgets. Operationally I can set the spending knob where I want it monthly, weekly or even daily, and avoid the risks involved in traditional capacity provisioning.”

The right tool for the right time

A key consideration for BigHat was the ability to extend the provider-managed tools, where needed, with functionality from the wider ecosystem. This covers capabilities that the core managed services don’t provide, while maintaining a focus on product development rather than on operating these tools.

Startups must also ask themselves what they need now versus what they will need in the future. As their needs change and grow, they can augment, extend, and replace the tools they have chosen to meet the new requirements. Starting with a vendor-managed service is not a one-way door; it’s an opportunity to defer investment in building and operating these capabilities yourself until that investment is justified. The fast time to value of starting with managed services doesn’t leave a startup with a sunk cost that limits future options.

“You have to think about the degree you want to adopt a hybrid model for the services you run. Today we aren’t running any software or services that require us to run our own compute instances. It’s very rare we run into something that is hard to do using just the services AWS already provides. Where our needs are not met, we can communicate them to AWS and we can choose to wait for them on their roadmap, which we have done in several cases, or we can elect to do it ourselves,” says Abrams. “This freedom to tweak and expand our service model at will is incomparably liberating.”

Conclusion

BigHat Biosciences was able to make an informed decision by considering the priorities of the business at this stage of its lifecycle. They adopted and embraced opinionated, provider-managed tooling, which allowed them to inherit a largely best-practice set of technologies and processes, de-risk their operations, and focus on product velocity and customer feedback, all while maintaining future flexibility. This delivers significantly more value to the business at its current stage.

“We believe that the underlying engineering, the underlying automation story, is an advantage that applies to every aspect of what we do for our customers,” says Abrams. “By taking those advantages into every aspect of the business, we deliver on operations in a way that provides a competitive advantage a lot of other companies miss by not thinking about it this way.”

About the authors

Mike is a Principal Solutions Architect with the Startup Team at Amazon Web Services. He is a former founder, current mentor, and enjoys helping startups live their best cloud life.

 

 

 

Sean is a Senior Startup Solutions Architect at AWS. Before AWS, he was Director of Scientific Computing at the Howard Hughes Medical Institute.

Lift and shift your Zabbix to Oracle Cloud with MySQL database service

Post Syndicated from Vittorio Cioe original https://blog.zabbix.com/lift-and-shift-your-zabbix-to-oracle-cloud-with-mysql-database-service/12792/

 

If you are tired of administering the infrastructure on your own and would rather spend your time on actual monitoring than on costly platform upgrades, you can easily lift and shift your MySQL-based Zabbix installation stack to Oracle Cloud.

Contents

I. Moving to the Cloud (1:46)
II. Moving Zabbix to Oracle Cloud (2:41)

1. Planning migration (3:22)
2. Migrating Zabbix to Oracle Cloud (6:17)
3. Migrating the database to MySQL Database Service (8:47)

III. Questions & Answers (15:12)

Moving to the Cloud

Data is increasingly moving to the cloud: consumer data first, followed by enterprise data, as enterprises are always a bit slower to adopt new technologies.

Data moving to the cloud

Oracle Cloud Infrastructure (OCI) is ranked fourth among cloud providers in Gartner’s Magic Quadrant for Cloud Infrastructure, based on ‘Completeness of Vision’ and ‘Ability to Execute’.

OCI is available in 26 regions, with 26 data centers across the world and 12 more planned, and holds more than 24 industry and regional certifications.

Moving Zabbix to Oracle Cloud

With Zabbix in the Oracle Cloud you can:

  1. get the latest updates on the technology stack, minimizing downtime and service windows.
  2. convert the time you spend managing your monitoring platform into the time you spend monitoring your platforms.
  3. leverage the most secure and cost-effective cloud platform in the market, including security information and security updates made available by OCI.

Planning migration

To plan an effective migration of an on-premise Zabbix installation (clients, proxies, management server, interface, and database), we need to migrate the last three components. Basically, we need:

  • the server configuration;
  • the on-premise network topology of clients and proxies, to understand what can communicate with the outside directly and what would go over a VPN; and
  • the database.

Migration requirements

We also need to set up the following in the OCI tenancy:

  • MySQL Database System,
  • Compute instance for the Zabbix Server,
  • storage for database and backup,
  • networking/load balancing.

The target architecture involves setting up a VPN from your data center to your Oracle Cloud tenancy and deploying a load balancer, the Zabbix Server redundantly across availability domains, and the MySQL database in a separate subnet.

Required Components:
• Cloud Networking,
• Zabbix Cloud Image,
• MySQL Database Service,
• VPN Connection for client/proxies.

Oracle Cloud target architecture for Zabbix

You can also have a lighter setup, for instance with proxies communicating over TLS connections across the Internet or directly with the Zabbix Server in the Oracle Cloud, and the Zabbix Server interfacing with the database. Here you will need fewer elements: server, database, and VCN. A proxy configuration sketch follows the diagram below.

Oracle Cloud target architecture for Zabbix — a simpler solution
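As an illustration of the proxy-over-TLS option, a pre-shared key (PSK) setup on the proxy side could look roughly like the excerpt below. This is only a sketch: the server address, proxy hostname, PSK identity, and file path are placeholder values, and the proxy must also be registered with matching PSK settings in the Zabbix frontend.

  # /etc/zabbix/zabbix_proxy.conf (excerpt, hypothetical values)
  # Public address of the Zabbix Server running in OCI:
  Server=zabbix.example-tenancy.oraclecloud.com
  # Must match the proxy name configured in the Zabbix frontend:
  Hostname=onprem-proxy-01
  # Encrypt connections to the server with a pre-shared key:
  TLSConnect=psk
  TLSPSKIdentity=onprem-proxy-01-psk
  TLSPSKFile=/etc/zabbix/zabbix_proxy.psk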

Migrating Zabbix to Oracle Cloud

Zabbix migration to the Oracle Cloud is straightforward.

1. Before you begin:

  • set up tenancy and compartments,
  • set up cloud networking — public and private VCN.

2. Zabbix deployment on the VM:

  • select one-click deployment or DIY — use the official Zabbix OCI Marketplace Image or deploy an OCI Compute Instance and install manually,
  • choose the desired Compute ‘shape’ during deployment.

3. Configuration:

  • start the instance,
  • edit the config file,
  • point to the database with its IP address, username, and password (to do that, you’ll need to open several ports in the cloud network via the GUI); a minimal example follows this list.
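A minimal sketch of that configuration step, assuming a hypothetical MySQL Database Service private endpoint at 10.0.2.15 and a dedicated zabbix database user (all values are placeholders for your environment):

  # /etc/zabbix/zabbix_server.conf (excerpt, hypothetical values)
  DBHost=10.0.2.15
  DBName=zabbix
  DBUser=zabbix
  DBPassword=********
  DBPort=3306

In a setup like this, the VCN security lists would typically need to allow MySQL traffic on TCP 3306 between the Zabbix Server subnet and the database subnet, plus the usual Zabbix ports (10050/10051) for agents and proxies.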

The OCI infrastructure allows for multiple choices. The Zabbix Server is lightweight in terms of the resources it requires, so in the majority of cases a sufficiently powerful VM will be enough. Otherwise, the full range of Oracle Cloud compute options is available.

Compute services for any enterprise use case

In the Oracle Cloud you’ll also have the bare metal option (physical machines dedicated to a single customer), the Kubernetes container engine, and a lot of fast storage possibilities, which end up being quite cheap.

Migrating the database to MySQL Database Service

MySQL Database Service is the managed offering for MySQL in Oracle Cloud, fully developed, managed, and supported by the MySQL team. It is secure and provides the latest features, as it leverages the Oracle Cloud, which has been rated by various sources as one of the most secure cloud platforms.

In addition, the platform is built on the MySQL Enterprise Edition binaries, so it is fully compatible with the platform you might be using. Finally, it costs way less on a yearly basis than a full-blown on-premise MySQL Enterprise subscription.

MySQL Database Service — 100% developed, managed, and supported by the MySQL team

Considerations before migration

Before you begin:

  • check your MySQL 8.0 compatibility,
  • check your database size (to assess the time needed to migrate), and
  • plan a service window.

High-level migration plan

  1. Set up cloud networking.
  2. Set up your (on-premise) networking secure connection (to communicate with the cloud).
  3. Create MySQL Database Service DB System with storage.
  4. Move the data using MySQL Shell Dump & Load utility.

Creating MySQL DB system with just a few clicks

  • Create a customized configuration.
  • Start the wizard to create DB system.
  • Select Virtual Cloud Network (VCN).
  • Select subnet to place your MySQL endpoint.
  • Select MySQL configuration (or create customized instances for your workload).
  • The shape for the DB System (CPU and RAM) will be set automatically.
  • Select the size of the storage for data and backup.
  • Create a backup policy or accept the default.

Creating MySQL instances

You can use the MySQL Shell Upgrade Checker utility to check compatibility with MySQL 8.0.

util.checkForServerUpgrade()
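For context, a minimal way to run the checker from MySQL Shell in JavaScript mode; the connection URI below is a placeholder for your on-premise Zabbix database:

  // Run in MySQL Shell (\js mode); host and user are hypothetical.
  shell.connect('zabbix@onprem-zabbix-db:3306')
  util.checkForServerUpgrade()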

Loading the data

To move the data, you can use the MySQL Shell Dump & Load utility, which supports multi-threading and is callable via JavaScript methods from MySQL Shell.

So, you can dump the data onto what can be a bastion machine, and then load it into your instance in the cloud. Loading a database of several gigabytes takes several minutes, so plan the service maintenance window accordingly.

In addition, the utility is easy to use. You just need to connect to an instance and dump.
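A minimal sketch of that flow, run from MySQL Shell in JavaScript mode. Hostnames, credentials, the dump directory, and the thread count are placeholders; the ocimds option is shown because it checks the dump for MySQL Database Service compatibility and may suggest additional compatibility flags:

  // 1) From a bastion host that can reach the on-premise database:
  shell.connect('zabbix@onprem-zabbix-db:3306')
  util.dumpInstance('/backups/zabbix-dump', {threads: 8, ocimds: true})

  // 2) Reconnect to the MySQL Database Service endpoint in OCI and load the dump:
  shell.connect('admin@10.0.2.15:3306')
  util.loadDump('/backups/zabbix-dump', {threads: 8})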

MySQL Shell Dump & Load

The operation is pretty straightforward and the migration time will depend on the size of the database.

Free trial

You can have a test drive of the MySQL Database Service with $300 in cloud credits, which you can spend in the Oracle Cloud on MySQL Database Service or other cloud services.

 

Questions & Answers

Question. Do you help with migrating the databases from older versions to MySQL 8.0?

Answer. Yes, this is the thing we normally do for our customers — providing guidance, though data migration is normally straightforward.

Question. Does the database size matter? How efficient is MySQL Shell Dump? What if my database is terabytes in size?

Answer. The MySQL Shell Dump & Load utility is much more efficient than MySQL Dump used to be. The database size still matters: a database that is terabytes in size will require more time, though still far less than it used to take.