Tag Archives: serverless architectures

How Launchpad from Pega enables secure SaaS extensibility with AWS Lambda

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/architecture/how-launchpad-from-pega-enables-secure-saas-extensibility-with-aws-lambda/

Large organizations increasingly adopt software as a service (SaaS) solutions to focus on business priorities, reduce infrastructure management overhead, and optimize costs. These organizations expect SaaS vendors to provide customizability facilities for tailoring the solution behavior according to their needs. Although traditional approaches like feature flags and webhooks offer some flexibility, they often fall short of providing a high degree of customizability. A new emerging pattern in this space is tenant-supplied custom code execution, which allows tenants to inject their own code into specific workflow points, enabling deep customization while preserving the core SaaS solutions’ integrity and security.

In this post, we share how Pegasystems (Pega) built Launchpad, its new SaaS development platform, to solve a core challenge in multi-tenant environments: enabling secure customer customization. By running tenant code in isolated environments with AWS Lambda, Launchpad offers its customers a secure, scalable foundation, eliminating the need for bespoke code customizations.

Solution overview

Launchpad, which is built on AWS, is an end-to-end platform on which software providers can build, launch, and operate workflow-centric B2B SaaS applications and AI solutions. It provides a managed, secure, scalable cloud environment for hosting multi-tenant applications and data. It accelerates the build experience with generative AI-powered low code tools, prebuilt capabilities, and subscriber-level configuration. Being a multi-tenant platform at its core, Launchpad had to maintain stringent isolation across tenants in its architecture.

One of the requirements Launchpad had was to allow their tenants to augment the workflows natively by providing custom code. Some common scenarios included communicating with external systems with proprietary non-industry-standard protocols, reuse of existing business logic, and SDK-based custom code development.

The solution necessitated the ability for tenants to provide custom code that would implement the required business logic, which Launchpad would be executing. This required architecting a secure runtime environment for custom code execution that maintains the highest degree of cross-tenant isolation within the multi-tenant architecture, at the same time allowing sufficient access to platform APIs and services. It was essential to build an architecture that would decouple the environment running tenant code from the core SaaS platform, as illustrated in the following diagram.

Architecting the solution topology

To achieve the required high level of compute isolation for running code provided by different tenants, Launchpad has adopted Lambda functions in its architecture as the secure ephemeral compute environment. Each untrusted code snippet provided by tenants is bootstrapped as a stand-alone Lambda function, with strong Firecracker-based isolation across different functions and execution environments addressing Launchpad’s requirements. This isolation provides dedicated resources, customizable access permissions, independent monitoring and operations, and automatic scaling for each function, while maintaining complete separation from other functions and their execution environments, as illustrated in the following diagram.

With Lambda being a serverless compute service, adopting it for the Launchpad architecture yielded several significant benefits. The major business benefit was that tenants could implement thousands of custom workflow augmentations on their own simply by providing code snippets, instead of the Launchpad engineering team being responsible for implementing them in the core platform code. Other benefits included:

  • Managed runtimes – AWS handles patching and updating the underlying infrastructure, operating system, and runtimes for customers, reducing the potential attack surface.
  • Fine-grained permissions – Each function can have its own set of access policies to tightly control what resources and actions it can access.
  • No need to pre-provision and pay for overprovisioned capacity – Lambda functions scale up and down automatically based on traffic patterns.
  • Built-in monitoring – Lambda functions emit detailed metrics, logs, and traces through Amazon CloudWatch and AWS X-Ray out of the box, making it straightforward to monitor tenant code execution.

To further reduce risks, Launchpad runs these Lambda functions with untrusted code in a dedicated AWS account. This account is separated from the core SaaS platform account. When end-users create a new function in the Launchpad authoring portal, they upload their code and specify the code handler to be executed during the invocation. Users can also map function input and output to Launchpad fields for further processing to enable an even higher degree of customizability and integration. The multi-tenant authoring service is a Control Plane component that runs as a microservice on the Amazon Elastic Kubernetes Service (Amazon EKS) cluster and uses the Lambda API for function lifecycle management, as illustrated in the following diagram. After a function resource is created, it can be used for further invocations.

Runtime architecture

At runtime, when Launchpad needs to invoke a function, it calls the Lambda Invoke API. Before the function is invoked, the multi-tenant runtime service performs a tenancy check to make sure the request is coming from an authorized tenant by doing the token validation. After a successful validation, the service invokes the required Lambda function. To invoke functions hosted in a different AWS account, the multi-tenant runtime service uses an AWS Identity and Access Management (IAM) role to assume the required permissions and invokes the Lambda service using the AWS SDK. The sequence of interactions is shown in the following architecture diagram.

The workflow consists of the following steps:

  1. An incoming user request reaches the application gateway service.
  2. The application gateway authenticates the request using the tenancy security service.
  3. After it’s authenticated, the request is forwarded to the multi-tenant runtime service
  4. The multi-tenant runtime service validates the supplied token and performs a tenancy check. This makes sure tenants can only invoke own functions they have permissions for (for example, functions they own).
  5. The multi-tenant runtime service pod assumes the IAM role required for invoking the tenant-specific Lambda function in a different AWS account.
  6. The multi-tenant runtime service pod invokes the required Lambda function.

Invoking the platform API from custom code is as straightforward as connecting to any external API. The custom code can authenticate with the platform using OAuth2. To facilitate the authentication, the developer can pass along the credentials as input parameters to the function from the core platform. Then the developer can create a corresponding record (isolated by tenant) in the platform that stores the credentials per function, and pass credentials as input parameters during invocation.

Distributed architecture observability

Operating a distributed architecture that runs untrusted code across multiple AWS accounts requires a comprehensive observability strategy. Launchpad’s approach combines centralized logging and monitoring with cross-account aggregation to provide a unified operational view of the platform.

The monitoring architecture uses CloudWatch Metrics to observe the Lambda functions, aggregating them through a centralized observability layer. This setup empowers platform operators to correlate Lambda function metrics with the core platform services running on Amazon EKS. Launchpad also collects per-function telemetry like function invocations, error rates, and execution time, which allows them to observe per-tenant metrics. These telemetry dimensions enable both a platform-wide and tenant-specific monitoring perspective.

For logging and troubleshooting, Launchpad implements a unified logging pipeline that aggregates Lambda function logs with application gateway and runtime service logs. Each request flowing through the system carries a correlation ID, so operators can trace execution paths across the core SaaS services and into the tenant functions running in the AWS account running tenant Lambda functions.

With this multi-layer observability architecture, Launchpad can maintain operational excellence while running tenant code securely at scale. Regular operational reviews drive continuous improvements in monitoring coverage and incident response procedures. Having per-tenant Lambda functions make it possible for Launchpad to use tenant-specific cost allocation tags, further empowering them to understand the cost footprint of running tenant custom code.

Best practices

When building a SaaS solution, maintaining a unified core code base is essential for scalability and manageability. Implementing per-tenant variations within the core platform code can lead to maintenance complexity and technical debt. Instead, architect your SaaS solution to have extension points, which allow your tenants to inject their custom code at specific points in the workflow, enabling customization without compromising the platform’s maintainability. This pattern makes sure the core SaaS platform remains clean and standardized while offering the flexibility that customers demand.

Additional best practices include:

  • Use separate accounts for running Lambda functions with untrusted tenant-provided code to make sure it’s isolated from your core SaaS platform code.
  • Grant absolute minimum required access permissions to the execution role assigned to the function. The custom code running within the execution environment gets permissions defined in the execution role when making requests to AWS API endpoints. If the function doesn’t need to reach out to AWS API endpoints, remove all permissions from the execution role and add an explicit AWSDenyAll policy.
  • Use separate Lambda functions for each code snippet and each tenant. This will provide the highest degree of cross-tenant isolation. Resources are not reused across different functions and execution environments.
  • Use Lambda layers in case you need to add a layer of vendor-provided code in order to keep it separated from the untrusted tenant-provided code.
  • Implement additional security controls, such as using Amazon Virtual Private Cloud (Amazon VPC) constructs to restrict network access and VPC Flow Logs for network activity monitoring.

Conclusion

The implementation of a secure untrusted code execution environment within SaaS platforms addresses a critical need for tenant customization while maintaining architectural integrity. Lambda offers a built-in isolation model, fine-grained security controls, and serverless scalability, so SaaS providers such as Launchpad can address the requirements of executing tenant-provided code in a multi-tenant environment and offer robust customization capabilities while maintaining strict security boundaries and operational efficiency. This architectural pattern enables providers to focus on core platform development while confidently supporting tenant-specific workflows through the secure and scalable Lambda execution environment.

To learn more, refer to the Security Overview of AWS Lambda white paper. For additional serverless architectural patterns, see Serverlessland.com.


About the authors

How CyberArk is streamlining serverless governance by codifying architectural blueprints

Post Syndicated from Anton Aleksandrov original https://aws.amazon.com/blogs/architecture/how-cyberark-is-streamlining-serverless-governance-by-codifying-architectural-blueprints/

This post was co-written with Ran Isenberg, Principal Software Architect at CyberArk and an AWS Serverless Hero.

Serverless architectures enable agility and simplified cloud resource management. Organizations embracing serverless architectures build robust, distributed cloud applications. As organizations grow and the number of development teams increases, maintaining architectural consistency, standardization, and governance across projects becomes crucial.

In this post, you will discover how CyberArk, a leading identity security company, efficiently implements serverless architecture governance, reduces duplicative efforts, and saves months of development time by codifying architectural blueprints. This approach helps to prevent redundant efforts and promotes uniform architectural standards, facilitating the seamless adoption of organizational best practices and governance across diverse teams.

Overview

The risk of duplicative efforts and architectural inconsistencies is particularly pronounced in large organizations, especially for requirements unrelated to specific business domains owned by individual teams. Diverse approaches to Infrastructure-as-Code, CI/CD, observability, and security can lead to inconsistent implementations across teams. Application developers should focus on delivering business value efficiently, rather than navigating the complexities of building and operating distributed architectures while adhering to organizational best practices. To achieve this, you need an approach that empowers developers and provides guardrails to ensure vetted architectural patterns are consistently applied. This solution should enable accelerated delivery without sacrificing agility and innovation.

Some organizations implement internal wiki consolidating architectural guidance. While well-intentioned, relying solely on documentation assumes development teams diligently follow the guidelines, which often requires manual validation and limits scalability. To overcome this limitation, organizations should adopt a scalable approach that codifies, automates, and promotes architectural best practices. This mechanism allows developers to focus on delivering business-domain value and drives standardized operational excellence, governance, and organizational policies adherence.

Introducing serverless blueprints

CyberArk engineering team had over 900 developers. It was looking for ways to ensure they build their serverless services based on vetted architectural and security best practices with fully automated governance controls enforcement. The solution came in the form of codified architecture blueprints and automated tooling.

Serverless architectures are composed using loosely coupled services, integrated based on the application requirements. Application developers use IaC tools such as AWS CDK and HashiCorp Terraform to define their serverless architectures and integration patterns. CyberArk has augmented the IaC with governance tools, such as cdk-nag, AWS Config, and AWS Control Tower. With these complementary tools in place, they’ve built serverless blueprints which include architectural definitions based on organizational best practices, as well as automatically applied governance controls

To illustrate this, consider a simple serverless architecture pattern. In this common pattern, an SQS queue serves as the event source for a Lambda function, which parses incoming messages and updates an Amazon S3 bucket.

A simple serverless architecture with SQS Queue, Lambda function, and S3 Bucket

Figure 1. A simple serverless architecture with SQS Queue, Lambda function, and S3 Bucket

While this pattern seems simple, turning it into an enterprise-ready service requires additional effort. You must consider aspects like resiliency, security, governance, observability, and coding best practices. Let’s examine several examples codified in architectural blueprints at CyberArk.

Error-handling best practices

Your services should be resilient. Retries can help to overcome occasional network hiccups, but you also need to handle scenarios when your function consistently fails to process particular messages (known as poison message) – for example, because of a code bug. This can lead to endless processing loops, data loss, and potential extra charges. To address this, a blueprint can implement a failure handling mechanism with a dead letter queue, alerting, and redrive. This pattern is straightforward to implement and adds extra resiliency to your architecture. It is also generic and does not contain any business domain code. This is a typical example of an architectural pattern that can be codified in a blueprint and reused across development teams.

The simple serverless architecture with added resiliency best practices

Figure 2. The simple serverless architecture with added resiliency best practices

Security best practices

Another example is securing S3 buckets. Organizations must enforce S3 security best practices, such as enabling access logs, blocking public access, and enabling encryption at rest. Codifying these guardrails in architectural blueprints adds an extra layer that allows your developers to comply with organization standards without having to explicitly implement adherence to each best practice and policy on their own.

The simple serverless architecture with added security best practices

Figure 3. The simple serverless architecture with added security best practices

The following code snippet uses AWS CDK to create an S3 bucket with common best practices:

def _create_bucket(self, server_access_logs_bucket: s3.Bucket, is_production_env: bool) -> s3.Bucket:
    # Create an S3 bucket with AWS-managed keys encryption
    bucket = s3.Bucket(
        self,
        constants.BUCKET_NAME,
        versioned=True if is_production_env else False,
        encryption=s3.BucketEncryption.S3_MANAGED,
        block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
        enforce_ssl=True,
        server_access_logs_bucket=server_access_logs_bucket, 
        # redacted
    )

Additional security best practices you can codify in your blueprints include the principle of least privilege access, VPC-attachment, and code signing for sensitive Lambda functions, and using KMS keys for encryption.

Lambda best practices

Your Lambda functions are another example of where blueprints can help. By providing a function blueprint implementing the baseline for capabilities like observability, idempotency, and batch processing out-of-the-box, you enable developers to focus on their business domain code.

Layered view of a Lambda function in CyberArk’s serverless architecture blueprint

Figure 4. Layered view of a Lambda function in CyberArk’s serverless architecture blueprint

CyberArk embeds Powertools for AWS Lambda, a toolkit that implements serverless best practices to increase developer velocity, into their blueprints. The following code snippets embed Powertools for enabling enhanced observability and implementing batch processing.

# CDK code
lambda_function = lambda.Function(
    environment={
        constants.POWERTOOLS_SERVICE_NAME: constants.SERVICE_NAME,
        constants.POWER_TOOLS_LOG_LEVEL: 'INFO',  
    },
    tracing=lambda.Tracing.ACTIVE,
    layers=["powertools-layer"],
    log_format=lambda.LogFormat.JSON.value,
    system_log_level=lambda.SystemLogLevel.INFO.value
    # redacted
)

# Function handler code
processor = BatchProcessor(event_type=EventType.SQS, model=OrderSqsRecord)

@logger.inject_lambda_context
@metrics.log_metrics
@tracer.capture_lambda_handler(capture_response=False)
def lambda_handler(event, context: LambdaContext):
    return process_partial_response(
        event=event,
        record_handler=record_handler,
        processor=processor,
        context=context,
)

Governance controls

Blueprints are not static; they evolve as you adopt new best practices and governance policies. Developers start with a vetted blueprint but can deviate as they evolve their serverless apps. To enable continuous adherence, it is important to use a combination of organizational governance tools, such as AWS Control Tower and Service Control Policies, and architecture blueprints that embed governance controls automatically enforced by CI/CD. This ensures that any architectural modification will be validated for adhering to organizational standards.

AWS defines proactive controls as mechanisms that prevent developers from deploying resources that violate governance policies. Detective controls are mechanisms that detect, log, and alert on resource or configuration changes that violate governance policies.

Applying governance controls at all stages of CI/CD

Figure 5. Applying governance controls at all stages of CI/CD

Depending on the IaC tool, you can leverage different types of governance tools for proactive control enforcement. The following screenshot shows a proactive control violation identified during CI/CD via the cdk-nag framework. You can see cdk-nag throwing an error for the stack deployment due to Lambda execution role being assigned wild-card permissions.

Exception thrown by cdk-nag for using wildcard permissions

Figure 6. Exception thrown by cdk-nag for using wildcard permissions

See the practical guide for implementing serverless governance.

Sample code

Ran Isenberg has open-sourced a sample Lambda Handler Cookbook blueprint illustrating some of the patterns CyberArk has adopted.

Additional serverless architecture patterns you might consider implementing in your blueprints are server-side encryption for an Amazon SNS topic with an encrypted Amazon SQS queue subscribed, auto-adjusting provisioned concurrency for Lambda functions, secure Serverless Aurora Cluster with bastion host, and more.

See more patterns implemented at serverlessland.com and cdkpatterns.com

Conclusion

Translating architectural and security best practices into modular IaC definitions, such as CDK constructs or Terraform modules, is a scalable and reusable technique that allows CyberArk to reduce duplicative efforts and save months of development time. Using IaC tools like AWS CDK or Terraform, augmented with governance tools like cdk-nag or checkov, enabled CyberArk to share implementation best practices and encode governance policies into architectural blueprints. Development teams adopting these blueprints do not need to reinvent the wheel, each trying to solve the same problem on their own. Instead, they leverage the knowledge codified in the blueprint.

Further reading

Tenant portability: Move tenants across tiers in a SaaS application

Post Syndicated from Aman Lal original https://aws.amazon.com/blogs/architecture/tenant-portability-move-tenants-across-tiers-in-a-saas-application/

In today’s fast-paced software as a service (SaaS) landscape, tenant portability is a critical capability for SaaS providers seeking to stay competitive. By enabling seamless movement between tiers, tenant portability allows businesses to adapt to changing needs. However, manual orchestration of portability requests can be a significant bottleneck, hindering scalability and requiring substantial resources. As tenant volumes and portability requests grow, this approach becomes increasingly unsustainable, making it essential to implement a more efficient solution.

This blog post delves into the significance of tenant portability and outlines the essential steps for its implementation, with a focus on seamless integration into the SaaS serverless reference architecture. The following diagram illustrates the tier change process, highlighting the roles of tenants and admins, as well as the impact on new and existing services in the architecture. The subsequent sections will provide a detailed walkthrough of the sequence of events shown in this diagram.

Incorporating tenant portability within a SaaS serverless reference architecture

Figure 1. Incorporating tenant portability within a SaaS serverless reference architecture

Why do we need tenant portability?

  • Flexibility: Tier upgrades or downgrades initiated by the tenant help align with evolving customer demand, preferences, budget, and business strategies. These tier changes generally alter the service contract between the tenant and the SaaS provider.
  • Quality of service: Generally initiated by the SaaS admin in response to a security breach or when the tenant is reaching service limits, these incidents might require tenant migration to maintain service level agreements (SLAs).

High-level portability flow

Tenant portability is generally achieved through a well-orchestrated process that ensures seamless tier transitions. This process comprises of the following steps:

High-level tenant portability flow

Figure 2. High-level tenant portability flow

  1. Port identity stores: Evaluate the need for migrating the tenant’s identity store to the target tier. In scenarios where the existing identity store is incompatible with the target tier, you’ll need to provision a new destination identity store and administrative users.
  2. Update tenant configuration: SaaS applications store tenant configuration details such as tenant identifier and tier that are required for operation.
  3. Resource management: Initiate deployment pipelines to provision resources in the target tier and update infrastructure-tenant mapping tables.
  4. Data migration: Migrate tenant data from the old tier to the newly provisioned target tier infrastructure.
  5. Cutover: Redirect tenant traffic to the new infrastructure, enabling zero-downtime utilization of updated resources.

Consideration walkthrough

We’ll now delve into each step of the portability workflow, highlighting key considerations for a successful implementation.

1. Port identity stores

The key consideration for porting identity is migrating user identities while maintaining a consistent end-user experience, without requiring password resets or changes to user IDs.

Create a new identity store and associated application client that the frontend can use; after that, we’ll need a mechanism to migrate users. In the reference architecture using Amazon Cognito, a silo refers to each tenant having its own user pool, while a pool refers to multiple tenants sharing a user pool through user groups.

To ensure a smooth migration process, it’s important to communicate with users and provide them with options to avoid password resets. One approach is to notify users to log in before a deadline to avoid password resets. Employ just-in-time migration, enabling password retention during login for uninterrupted user experience with existing passwords.

However, this requires waiting for all users to migrate, potentially leading to a prolonged migration window. As a complementary measure, after the deadline, the remaining users can be migrated by using bulk import, which enforces password resets. This ensures a consistent migration within a defined timeframe, albeit inconveniencing some users.

2. Update tenant configuration

SaaS providers rely on metadata stores to maintain all tenant-related configuration. Updates to tenant metadata should be completed carefully during the porting process. When you update the tenant configuration for the new tier, two key aspects must be considered:

  • Retain tenant IDs throughout the porting process to ensure smooth integration of tenant logging, metrics, and cost allocation post-migration, providing a continuous record of events.
  • Establish new API keys and a throttling mechanism tailored to the new tier to accommodate higher usage limits for the tenants.

To handle this, a new tenant portability service can be introduced in the SaaS reference architecture. This service assigns a different AWS API Gateway usage plan to the tenant based on the requested tier change, and orchestrates calls to other downstream services. Subsequently, the existing tenant management service will need an extension to handle tenant metadata updates (tier, user-pool-id, app-client-id) based on the incoming porting request.

3. Resource management

Successful portability hinges on two crucial aspects during infrastructure provisioning:

  • Ensure tenant isolation constructs are respected in the porting process through mechanisms to prevent cross-tenant access. Either role-based access control (RBAC) or attribute-based-access control (ABAC) can be used to ensure this. ABAC isolation is generally easier to manage during porting if the tenant identifier is preserved, as in the previous step.
  • Ensure instrumentation and metric collection are set up correctly in the new tier. Recreate identical metric filters to ensure monitoring visibility for SaaS operations.

To handle infrastructure provisioning and deprovisioning in the reference architecture, extend the tenant provisioning service:

  • Update the tenant-stack mapping table to record migrated tenant stack details.
  • Initiate infrastructure provisioning or destruction pipelines as needed (for example, to run destruction pipelines after the data migration and user cutover steps).

Finally, ensure new resources comply with required compliance standards by applying relevant security configurations and deploying a compliant version of the application.

By addressing these aspects, SaaS providers can ensure a seamless transition while maintaining tenant isolation and operational continuity.

4. Data migration

The data migration strategy is heavily influenced by architectural decisions such as the storage engine and isolation approach. Minimizing user downtime during migration requires a focus on accelerating the migration process, maintaining service availability, and setting up a replication channel for incremental updates. Additionally, it’s crucial to address schema changes made by tenants in a silo model to ensure data integrity and avoid data loss when transitioning to a pool model.

Extending the reference architecture, a new data porting service can be introduced to enable Amazon DynamoDB data migration between different tiers. DynamoDB partition migration can be accomplished through multiple approaches, including AWS Glue, custom scripts, or duplicating DynamoDB tables and bulk-deleting partitions. We recommend a hybrid approach to achieve zero-downtime migration. This solution applies only when the DynamoDB schema remains consistent across tiers. If the schema has changed, a custom solution is required for data migration.

5. Cutover

The cutover phase involves redirecting users to the new infrastructure, disabling continuous data replication, and ensuring that compliance requirements are met. This includes running tests or obtaining audits/certifications, especially when moving to high-sensitivity silos. After a successful cutover, cleanup activities are necessary, including removing temporary infrastructure and deleting historical tenant data from the previous tier. However, before deleting data, ensure that audit trails are preserved and compliant with regulatory requirements, and that data deletion aligns with organizational policies.

Conclusion

In conclusion, portability is a vital feature for multi-tenant SaaS. It allows tenants to move data and configurations between tiers effortlessly and can be incorporated in reference architecture as above. Key considerations include maintaining consistent identities, staying compliant, reducing downtime and automating the process.