Tag Archives: Access Control

Effortless enterprise authentication at Grab: Dex in action

2025-05-23 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/dex-in-action

Introduction

Grab, Southeast Asia’s leading superapp, has created many internal applications to support its diverse range of internal and external business needs. Authentication¹ and authorisation² serve as fundamental components of application development, as robust identity and access management are essential for all systems.

We recognised the need for a centralised internal system to manage access, authentication, and authorisation. This system would streamline access management, ensure compliance with audit requirements, enhance developer velocity, and simplify authentication and authorisation processes for both developers and business operations.

Grab created Concedo to fulfill this requirement by providing a mechanism for services to configure their access control based on their specific role to permission matrix (R2PM)³. This allows for quick and easy integration with Concedo, enabling developers to expedite the shipping of their systems without investing excessive time in building the authentication and authorisation module.

The authentication mechanism, based on Google’s OAuth2.0⁴, includes custom features that enhance identity for service integration. However, this customisation isn’t standard, creating integration challenges with external platforms like Databricks and Datadog. These platforms then use their own authentication and authorisation, resulting in a fragmented and undesirable sign-on experience for users.

Figure 1. Undesired user sign-on experience due to fragmented authentication approaches.

The inconsistency in user experience also resulted in complications. The lack of standardisation led to difficulties in establishing authentication and authorisation for individual applications. Additionally, it created substantial administrative overhead due to the necessity of managing multiple identities. The absence of standardisation also hindered transparency in access control across all applications.

This led us to inquire how a standardised protocol could be established to function seamlessly across all applications, regardless of whether they were developed internally or sourced from external platforms.

Figure 2. Desired state, having something in between the different identity providers (IdP).

Choosing among industry standards

We wanted to build a platform to serve both authentication and authorisation, providing a seamless integration and user sign-on experience. We then asked ourselves, “What are the current industry standards we can leverage on?”.

Security Assertion Markup Language (SAML): An authentication protocol which leverages heavily on session cookies to manage each authentication session.
Open Authorisation (OAuth): An authorisation protocol which focuses on granting access for particular details rather than providing user identity information.
OpenID Connect (OIDC)⁵: An authentication protocol built on OAuth 2.0, enabling single sign-on (SSO). OIDC unifies and standardises user authentication, making it a solution for organisations with numerous applications.

OIDC enhances user experience by redirecting them to an identity provider (IdP) like Google or Microsoft for authentication when accessing an application. Upon successful verification, the IdP sends a secure token with the user’s identity information back to the application, granting access without the need for additional credentials.

With OIDC, authentication and authorisation are fully implemented, enabling seamless integration across platforms, including mobile, API, and browser-based applications, while also providing SSO functionality.

Figure 3. Desired state with the protocol decided.

OIDC seemed like an ideal solution, but it came with potential drawbacks:

OIDC relies on trusting a third-party authentication service. Any disruption to this service could result in downtime.
Compromised credentials could affect access to multiple services.

In the following section, we will explore our strategies in mitigating these challenges effectively.

Implementing the chosen standard

With OIDC chosen as the standard, the focus shifted to implementation.

We have always been a supporter of open source projects. Rather than building a platform from the ground up, we leveraged existing solutions while seeking opportunities to contribute back to the open source community.

The team explored Cloud Native Computing Foundation (CNCF) projects and discovered Dex – A federated OpenID connect provider that aims to allow integration of any IdP into an application using OIDC. Dex was selected as our open-source platform of choice due to its alignment with our high-level objectives.

Figure 4. Desired state with Dex as the platform foundation.

How Dex works

Figure 5. High level architecture of Dex. [Source](https://dexidp.io/docs/)

When a user or machine tries to access a protected application or service, they are redirected to Dex for authentication. Dex acts as a middleman (identity aggregator) between the user and various IdPs to establish an authenticated session.

Figure 6. Simplified sequence diagram of how authentication works for Dex.

Dex’s key features include enabling SSO experiences, allowing users to access multiple applications after authenticating through a single provider. Dex also supports multiple IdP use cases and provides standardised OIDC authentication tokens.

Dex implementation separated application authentication concerns, established a single source of truth for identity, enabled new IdP additions, ensured adherence to security best practices, and provided scalability for deployments of all sizes.

How Dex is streamlining authentication and authorisation

Token delegation

When services communicate with each other, one service often assigns an identity to ensure that authorisation can be carried out on a specific service. For example, in figure 7, a service account or robot account is typically used as an identity so that service B can identify the caller.

Figure 7. Service identification through service account.

Although service accounts are the recommended approach for enabling Service B to identify the caller, they come with challenges that must be addressed:

Service account compromise: Service accounts often have high-level privileges and typically broad access to Service B. If compromised, they pose a significant security risk, making careful management essential.
Access control issue: The other approach creates unnecessary complexity by requiring Service A to handle user-level permissions for Service B. This violates the principle of separation of concerns.

To address this issue, Dex introduced a token exchange feature.

Figure 8. Token exchange example with trusted peers established.

The token exchange process involves two main components; token minting and trust relationship.

Token minting

The user (Alice) logs into Service A.
Service A, acting as a trusted peer, is authorised to mint tokens.
Service A generates a token valid for both Service A and Service B. This is reflected in the token’s “aud” (audience) field: “aud”: “serviceA serviceB”

Trust relationship

Service B must be configured to trust Service A as a peer.
Service B accepts tokens minted by Service A.

This approach differs from the service account-based scenario by using a trust-based peer relationship. Service A is authorised to mint tokens for Service B providing a more sophisticated but preferred method. The token is properly scoped for both services, ensuring a clear audit trail of token issuance, while reducing token manipulation risks.

Kill switch

As highlighted earlier,

OIDC relies on trusting a third-party authentication service. Any disruption to this service could result in downtime.

Dex’s ability to support multiple IdPs enables traffic to be shifted to a different IdP if one, such as Google, experiences an outage. This “kill switch” mechanism ensures that integrated services are not disrupted and do not require any changes to mitigate the issue. It is only triggered during specific IdP outages.

Figure 9. Trigger kill switch without having other services changing from their end.

Looking forward

Following the successful implementation of Dex as the unified authentication provider, the next phase in enhancing our identity and access management infrastructure is to leverage this robust identity foundation to establish a unified and simplified authorisation model. This initiative is driven by the recognition that the current authorisation landscape remains fragmented and complex, leading to potential inefficiencies and security vulnerabilities.

By centralising authorisation and aligning it with the unified identity provided by Dex, we can streamline access control, improve user experience, and strengthen security across our applications and services. This will involve consolidating authorisation policies, standardising access control mechanisms, and simplifying the management of user permissions.

Shoutout to the awesome Concedo team for driving Dex integration and to our leadership for steering the way toward a simpler, unified authentication and authorisation journey!

Join us

Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people everyday to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line – we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Definition of terms

Authentication: Who you are. Making sure you are who you say you are by verifying your identity. ↩
Authorisation: What you can do. Defining the resources or actions you are allowed to access or perform after your identity has been verified. ↩
Role-to-Permission Matrix (R2PM): A structured framework used to map roles within an organisation to the permissions or access rights each role has in a system, application, or process. This matrix serves as a critical component in access control and identity management, ensuring that users have appropriate access based on their roles while minimising the risk of unauthorised access. ↩
Open Authorisation (OAuth 2.0): Protocol for authorisation. For example, Google Login on third-party portals allows your identity to remain with Google, but third-party portals can obtain limited access to specific data such as your profile photo. ↩
OpenID Connect (OIDC): Identity protocol built on top of OAuth 2.0. On top of authorisation provided by OAuth 2.0, it verifies and provides a trusted identity. ↩

IAM Policies and Bucket Policies and ACLs! Oh, My! (Controlling Access to S3 Resources)

2023-07-07 Kai Zhao

Post Syndicated from Kai Zhao original https://aws.amazon.com/blogs/security/iam-policies-and-bucket-policies-and-acls-oh-my-controlling-access-to-s3-resources/

Updated on July 6, 2023: This post has been updated to reflect the current guidance around the usage of S3 ACL and to include S3 Access Points and the Block Public Access for accounts and S3 buckets.

Updated on April 27, 2023: Amazon S3 now automatically enables S3 Block Public Access and disables S3 access control lists (ACLs) for all new S3 buckets in all AWS Regions.

Updated on January 8, 2019: Based on customer feedback, we updated the third paragraph in the “What about S3 ACLs?” section to clarify permission management.

In this post, we will discuss Amazon S3 Bucket Policies and IAM Policies and its different use cases. This post will assist you in distinguishing between the usage of IAM policies and S3 bucket policies. We will also discuss how these policies integrate with some default S3 bucket security settings like automatically enabling S3 Block Public Access and disabling S3 access control lists (ACLs).

IAM policies vs. S3 bucket policies

AWS access is managed by setting IAM policies and linking them to IAM identities (users, groups of users, or roles) or AWS resources. A policy is an object in AWS that when associated with an identity or resource, defines their permissions. IAM policies specify what actions are allowed or denied on what AWS resources (e.g. user Alice can read objects from the “Production” bucket but can’t write objects in the “Dev” bucket whereas user Bob can have full access to S3).

S3 bucket policies, on the other hand, are resource-based policies that you can use to grant access permissions to your Amazon S3 buckets and the objects in it. S3 bucket policies can allow or deny requests based on the elements in the policy.(e.g. allow user Alice to PUT but not DELETE objects in the bucket).

Note: You attach S3 bucket policies at the bucket level (i.e. you can’t attach a bucket policy to an S3 object), but the permissions specified in the bucket policy apply to all the objects in the bucket. You can also specify permissions at the object level by putting an object as the resource in the Bucket policy.

IAM policies and S3 bucket policies are both used for access control and they’re both written in JSON using the AWS access policy language. Let’s look at an example policy of each type:

Sample S3 Bucket Policy

This S3 bucket policy enables any IAM principal (user or role) in account 111122223333 to use the Amazon S3 GET Bucket (List Objects) operation.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": ["arn:aws:iam::111122223333:root"]
      },
      "Action": "s3:ListBucket",
      "Resource": ["arn:aws:s3:::my_bucket"]
    }
  ]
}

This S3 bucket policy enables the IAM role ‘Role-name’ under the account 111122223333 to use the Amazon S3 GET Bucket (List Objects) operation.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/Role-name"
      },
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my_bucket"
    }
  ]
}

Sample IAM Policy

This IAM policy grants the IAM principal it is attached to permission to perform any S3 operation on the contents of the bucket named “my_bucket”.

{
  "Version": "2012-10-17",
  "Statement":[{
    "Effect": "Allow",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my_bucket/*"]
    }
  ]
}

Note that the S3 bucket policy includes a “Principal” element, which lists the principals that bucket policy controls access for. The “Principal” element is unnecessary in an IAM policy, because the principal is by default the entity that the IAM policy is attached to.

S3 bucket policies (as the name would imply) only control access to S3 resources for the bucket they’re attached to, whereas IAM policies can specify nearly any AWS action. One of the neat things about AWS is that you can actually apply both IAM policies and S3 bucket policies simultaneously, with the ultimate authorization being the least-privilege union of all the permissions (more on this in the section below titled “How does authorization work with multiple access control mechanisms?”).

When to use IAM policies vs. S3 policies

Use IAM policies if:

You need to control access to AWS services other than S3. IAM policies will be easier to manage since you can centrally manage all of your permissions in IAM, instead of spreading them between IAM and S3.
You have numerous S3 buckets each with different permissions requirements. IAM policies will be easier to manage since you don’t have to define a large number of S3 bucket policies and can instead rely on fewer, more detailed IAM policies.
You prefer to keep access control policies in the IAM environment.

Use S3 bucket policies if:

You want a simple way to grant cross-account access to your S3 environment, without using IAM roles.
Your IAM policies bump up against the size limit (up to 2 kb for users, 5 kb for groups, and 10 kb for roles). S3 supports bucket policies of up 20 kb.
You prefer to keep access control policies in the S3 environment.
You want to apply common security controls to all principals who interact with S3 buckets, such as restricting the IP addresses or VPC a bucket can be accessed from.

If you’re still unsure of which to use, consider which audit question is most important to you:

If you’re more interested in “What can this user do in AWS?” then IAM policies are probably the way to go. You can easily answer this by looking up an IAM user and then examining their IAM policies to see what rights they have.
If you’re more interested in “Who can access this S3 bucket?” then S3 bucket policies will likely suit you better. You can easily answer this by looking up a bucket and examining the bucket policy.

Whichever method you choose, we recommend staying as consistent as possible. Auditing permissions becomes more challenging as the number of IAM policies and S3 bucket policies grows.

What about S3 ACLs?

An S3 ACL is a sub-resource that’s attached to every S3 bucket and object. It defines which AWS accounts or groups are granted access and the type of access. You can attach S3 ACLs to both buckets and individual objects within a bucket to manage permissions for those objects. As a general rule, AWS recommends using S3 bucket policies or IAM policies for access control. S3 ACLs is a legacy access control mechanism that predates IAM. By default, Object Ownership is set to the Bucket owner enforced setting and all ACLs are disabled, as can be seen below.

A majority of modern use cases in Amazon S3 no longer require the use of ACLs, and we recommend that you keep ACLs disabled by applying the Bucket owner enforced setting. This approach simplifies permissions management: you can use policies to more easily control access to every object in your bucket, regardless of who uploaded the objects in your bucket. When ACLs are disabled, the bucket owner owns all the objects in the bucket and manages access to data exclusively using access management policies.

S3 bucket policies and IAM policies define object-level permissions by providing those objects in the Resource element in your policy statements. The statement will apply to those objects in the bucket. Consolidating object-specific permissions into one policy (as opposed to multiple S3 ACLs) makes it simpler for you to determine effective permissions for your users and roles.

You can disable ACLs on both newly created and already existing buckets. For newly created buckets, ACLs are disabled by default. In the case of an existing bucket that already has objects in it, after you disable ACLs, the object and bucket ACLs are no longer part of an access evaluation, and access is granted or denied on the basis of policies.

S3 Access Points and S3 Access

In some cases customers have use cases with complex entitlement: Amazon s3 is used to store shared datasets where data is aggregated and accessed by different applications, individuals or teams for different use cases. Managing access to this shared bucket requires a single bucket policy that controls access for dozens to hundreds of applications with different permission levels. As an application set grows, the bucket policy becomes more complex, time consuming to manage, and needs to be audited to make sure that changes don’t have an unexpected impact on another application.

These customers need additional policy space for access to their data, and that buckets. To support these use cases, Amazon S3 provides a feature called Amazon S3 Access Points. Amazon S3 access points simplify data access for any AWS service or customer application that stores data in S3.

Access points are named network endpoints that are attached to buckets that you can use to perform S3 object operations, such as GetObject and PutObject. Each access point has distinct permissions and network controls that S3 applies for any request that is made through that access point. Each access point enforces a customized access point policy that works in conjunction with the bucket policy that is attached to the underlying bucket.

Amazon S3 access points support AWS Identity and Access Management (IAM) resource policies that allow you to control the use of the access point by resource, user, or other conditions. For an application or user to be able to access objects through an access point, both the access point and the underlying bucket must permit the request.

Note that Adding an S3 access point to a bucket doesn’t change the bucket’s ehaviour when the bucket is accessed directly through the bucket’s name or Amazon Resource Name (ARN). All existing operations against the bucket will continue to work as before. Restrictions that you include in an access point policy apply only to requests made through that access point.

Sample Access point policy

This access point policy grants the IAM user Alice permissions to GET and PUT objects through the access point ‘my-access-point’ in account 111122223333.

{
  “Version”: “2012-10-17”,
  “Statement”:[{
    “Effect”: “Allow”,
    “Principal”: { “AWS”: “arn:aws:iam::111122223333:user/Alice” },
    “Action”: [“s3:GetObject”, “s3:PutObject”],
    “Resource”: “arn:aws:s3:us-west-2:111122223333:accesspoint/my-access-point/object/*”
    }
  ]
}

Blocking Public Access for accounts and buckets

Public access is granted to buckets and objects through access control lists (ACLs), bucket policies, access point policies, or all. In order to ensure that public access to this bucket and its objects is blocked, you can turn on Block all public on both the bucket level or the account level.

The Amazon S3 Block Public Access feature provides settings for access points, buckets, and accounts to help you manage public access to Amazon S3 resources. By default, new buckets, access points, and objects don’t allow public access. However, users can modify bucket policies, access point policies, or object permissions to allow public access. S3 Block Public Access settings override these policies and permissions so that you can limit public access to these resources.

With S3 Block Public Access, account administrators and bucket owners can easily set up centralized controls to limit public access to their Amazon S3 resources that are enforced regardless of how the resources are created.

If you apply a setting to an account, it applies to all buckets and access points that are owned by that account. Similarly, if you apply a setting to a bucket, it applies to all access points associated with that bucket.

Block Public Access for buckets

These settings apply only to this bucket and its access points. AWS recommends that you turn on Block all public access, but before applying any of these settings, ensure that your applications will work correctly without public access. If you require some level of public access to this bucket or objects within, you can customize the individual settings below to suit your specific storage use cases.

You can use the S3 console, AWS CLI, AWS SDKs, and REST API to grant public access to one or more buckets. This setting is on by default at the account creation, as can be seen below (using the S3 console).

Turning off this session will create a warning in the account, as AWS recommends this setting to be turned un unless public access is required for specific and verified use cases such as static website hosting.

This setting can also be turned on for existing buckets. In the AWS Management Console this is done by opening the Amazon S3 console at https://console.aws.amazon.com/s3/, choosing the name of the bucket you want, choosing the Permissions tab. And Choosing Edit to change the public access settings for the bucket.

Block Public Access for accounts

In order to ensure that public access to all your S3 buckets and objects is blocked, turn on Block all public access. These settings apply account-wide for all current and future buckets and access points. AWS recommends that you turn on Block all public access, but before applying any of these settings, ensure that your applications will work correctly without public access. If you require some level of public access to your buckets or objects, you can customize the individual settings below to suit your specific storage use cases.

You can use the S3 console, AWS CLI, AWS SDKs, and REST API to configure block public access settings for all the buckets in your account. This setting can be turned on in the AWS Management Console by opening the Amazon S3 console at https://console.aws.amazon.com/s3/, and clicking Block Public Access setting for this account on the left panel. And Choosing Edit to change the public access settings for the bucket.

When working with AWS organizations, you can prevent people from modifying the Block Public Access on the account level by adding a Service control policy (SCP) that denies editing this. An example of such a SCP can be seen below:

{
  “Version”: “2012-10-17”,
  “Statement”:[{
    “Sid”: “DenyTurningOffBlockPublicAccessForThisAccount”,
    “Effect”: “Deny”,
    “Action”: “s3:PutAccountPublicAccessBlock”,
    “Resource”: “arn:aws:s3:::*”
    }
  ]
}

How does authorization work with multiple access control mechanisms?

Whenever an AWS principal issues a request to S3, the authorization decision depends on the union of all the IAM policies, S3 bucket policies, and S3 ACLs that apply as well as if Block Public Access is enabled on either the account, bucket or access point.

In accordance with the principle of least-privilege, decisions default to DENY and an explicit DENY always trumps an ALLOW. For example, if an IAM policy grants access to an object, the S3 bucket policies denies access to that object, and there is no S3 ACL, then access will be denied. Similarly, if no method specifies an ALLOW, then the request will be denied by default. Only if no method specifies a DENY and one or more methods specify an ALLOW will the request be allowed.

When Amazon S3 receives a request to access a bucket or an object, it determines whether the bucket or the bucket owner’s account has a block public access setting applied. If the request was made through an access point, Amazon S3 also checks for block public access settings for the access point. If there is an existing block public access setting that prohibits the requested access, Amazon S3 rejects the request.

This diagram illustrates the authorization process.

We hope that this post clarifies some of the confusion around the various ways you can control access to your S3 environment.

Using IAM Access Analyzer for S3 to review bucket access

Another interesting feature that can be used is IAM Access Analyzer for S3 to review bucket access. You can use IAM Access Analyzer for S3 to review buckets with bucket ACLs, bucket policies, or access point policies that grant public access. IAM Access Analyzer for S3 alerts you to buckets that are configured to allow access to anyone on the internet or other AWS accounts, including AWS accounts outside of your organization. For each public or shared bucket, you receive findings that report the source and level of public or shared access.

In IAM Access Analyzer for S3, you can block all public access to a bucket with a single click. You can also drill down into bucket-level permission settings to configure granular levels of access. For specific and verified use cases that require public or shared access, you can acknowledge and record your intent for the bucket to remain public or shared by archiving the findings for the bucket.

Additional Resources

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Zero traffic cost for Kafka consumers

2023-07-07 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/zero-traffic-cost

Introduction

Coban, Grab’s real-time data streaming platform team, has been building an ecosystem around Kafka, serving all Grab verticals. Along with stability and performance, one of our priorities is also cost efficiency.

In this article, we explain how the Coban team has substantially reduced Grab’s annual cost for data streaming by enabling Kafka consumers to fetch from the closest replica.

Problem statement

The Grab platform is primarily hosted on AWS cloud, located in one region, spanning over three Availability Zones (AZs). When it comes to data streaming, both the Kafka brokers and Kafka clients run across these three AZs.

Figure 1 – Initial design, consumers fetching from the partition leader

Figure 1 shows the initial design of our data streaming platform. To ensure high availability and resilience, we configured each Kafka partition to have three replicas. We have also set up our Kafka clusters to be rack-aware (i.e. 1 “rack” = 1 AZ) so that all three replicas reside in three different AZs.

The problem with this design is that it generates staggering cross-AZ network traffic. This is because, by default, Kafka clients communicate only with the partition leader, which has a 67% probability of residing in a different AZ.

This is a concern as we are charged for cross-AZ traffic as per AWS’s network traffic pricing model. With this design, our cross-AZ traffic amounted to half of the total cost of our Kafka platform.

The Kafka cross-AZ traffic for this design can be broken down into three components as shown in Figure 1:

Producing (step 1): Typically, a single service produces data to a given Kafka topic. Cross-AZ traffic occurs when the producer does not reside in the same AZ as the partition leader it is producing data to. This cross-AZ traffic cost is minimal, because the data is transferred to a different AZ at most once (excluding retries).
Replicating (step 2): The ingested data is replicated from the partition leader to the two partition followers, which reside in two other AZs. The cost of this is also relatively small, because the data is only transferred to a different AZ twice.
Consuming (step 3): Most of the cross-AZ traffic occurs here because there are many consumers for a single Kafka topic. Similar to the producers, the consumers incur cross-AZ traffic when they do not reside in the same AZ as the partition leader. However, on the consuming side, cross-AZ traffic can occur as many times as there are consumers (on average, two-thirds of the number of consumers). The solution described in this article addresses this particular component of the cross-AZ traffic in the initial design.

Solution

Kafka 2.3 introduced the ability for consumers to fetch from partition replicas. This opens the door to a more cost-efficient design.

Figure 2 – Target design, consumers fetching from the closest replica

Step 3 of Figure 2 shows how consumers can now consume data from the replica that resides in their own AZ. Implementing this feature requires rack-awareness and extra configurations for both the Kafka brokers and consumers. We will describe this in the following sections.

The Coban journey

Kafka upgrade

Our journey started with the upgrade of our legacy Kafka clusters. We decided to upgrade them directly to version 3.1, in favour of capturing bug fixes and optimisations over version 2.3. This was a safe move as version 3.1 was deemed stable for almost a year and we projected no additional operational cost for this upgrade.

To perform an online upgrade with no disruptions for our users, we broke down the process into three stages.

Stage 1: Upgrading Zookeeper. All versions of Kafka are tested by the community with a specific version of Zookeeper. To ensure stability, we followed this same process. The upgraded Zookeeper would be backward compatible with the pre-upgrade version of Kafka which was still in use at this early stage of the operation.
Stage 2: Rolling out the upgrade of Kafka to version 3.1 with an explicit backward-compatible inter-broker protocol version (inter.broker.protocol.version). During this progressive rollout, the Kafka cluster is temporarily composed of brokers with heterogeneous Kafka versions, but they can communicate with one another because they are explicitly set up to use the same inter-broker protocol version. At this stage, we also upgraded Cruise Control to a compatible version, and we configured Kafka to import the updated cruise-control-metrics-reporter JAR file on startup.
Stage 3: Upgrading the inter-broker protocol version. This last stage makes all brokers use the most recent version of the inter-broker protocol. During the progressive rollout of this change, brokers with the new protocol version can still communicate with brokers on the old protocol version.

Configuration

Enabling Kafka consumers to fetch from the closest replica requires a configuration change on both Kafka brokers and Kafka consumers. They also need to be aware of their AZ, which is done by leveraging Kafka rack-awareness (1 “rack” = 1 AZ).

Brokers

In our Kafka brokers’ configuration, we already had broker.rack set up to distribute the replicas across different AZs for resiliency. Our Ansible role for Kafka automatically sets it with the AZ ID that is dynamically retrieved from the EC2 instance’s metadata at deployment time.

- name: Get availability zone ID
  uri:
    url: http://169.254.169.254/latest/meta-data/placement/availability-zone-id
    method: GET
    return_content: yes
  register: ec2_instance_az_id

Note that we use AWS AZ IDs (suffixed az1, az2, az3) instead of the typical AWS AZ names (suffixed 1a, 1b, 1c) because the latter’s mapping is not consistent across AWS accounts.

Also, we added the new replica.selector.class parameter, set with value org.apache.kafka.common.replica.RackAwareReplicaSelector, to enable the new feature on the server side.

Consumers

On the Kafka consumer side, we mostly rely on Coban’s internal Kafka SDK in Golang, which streamlines how service teams across all Grab verticals utilise Coban Kafka clusters. We have updated the SDK to support fetching from the closest replica.

Our users only have to export an environment variable to enable this new feature. The SDK then dynamically retrieves the underlying host’s AZ ID from the host’s metadata on startup, and sets a new client.rack parameter with that information. This is similar to what the Kafka brokers do at deployment time.

We have also implemented the same logic for our non-SDK consumers, namely Flink pipelines and Kafka Connect connectors.

Impact

We rolled out fetching from the closest replica at the turn of the year and the feature has been progressively rolled out on more and more Kafka consumers since then.

Figure 3 – Variation of our cross-AZ traffic before and after enabling fetching from the closest replica

Figure 3 shows the relative impact of this change on our cross-AZ traffic, as reported by AWS Cost Explorer. AWS charges cross-AZ traffic on both ends of the data transfer, thus the two data series. On the Kafka brokers’ side, less cross-AZ traffic is sent out, thereby causing the steep drop in the dark green line. On the Kafka consumers’ side, less cross-AZ traffic is received, causing the steep drop in the light green line. Hence, both ends benefit by fetching from the closest replica.

Throughout the observeration period, we maintained a relatively stable volume of data consumption. However, after three months, we observed a substantial 25% drop in our cross-AZ traffic compared to December’s average. This reduction had a direct impact on our cross-AZ costs as it directly correlates with the cross-AZ traffic volume in a linear manner.

Caveats

Increased end-to-end latency

After enabling fetching from the closest replica, we have observed an increase of up to 500ms in end-to-end latency, that comes from the producer to the consumers. Though this is expected by design, it makes this new feature unsuitable for Grab’s most latency-sensitive use cases. For these use cases, we retained the traditional design whereby consumers fetch directly from the partition leaders, even when they reside in different AZs.

Figure 4 – End-to-end latency (99th percentile) of one of our streams, before and after enabling fetching from the closest replica

Inability to gracefully isolate a broker

We have also verified the behaviour of Kafka clients during a broker rotation; a common maintenance operation for Kafka. One of the early steps of our corresponding runbook is to demote the broker that is to be rotated, so that all of its partition leaders are drained and moved to other brokers.

In the traditional architecture design, Kafka clients only communicate with the partition leaders, so demoting a broker gracefully isolates it from all of the Kafka clients. This ensures that the maintenance is seamless for them. However, by fetching from the closest replica, Kafka consumers still consume from the demoted broker, as it keeps serving partition followers. When the broker effectively goes down for maintenance, those consumers are suddenly disconnected. To work around this, they must handle connection errors properly and implement a retry mechanism.

Potentially skewed load

Another caveat we have observed is that the load on the brokers is directly determined by the location of the consumers. If they are not well balanced across all of the three AZs, then the load on the brokers is similarly skewed. At times, new brokers can be added to support an increasing load on an AZ. However, it is undesirable to remove any brokers from the less loaded AZs as more consumers can suddenly relocate there at any time. Having these additional brokers and underutilisation of existing brokers on other AZs can also impact cost efficiency.

Figure 5 – Average CPU utilisation by AZ of one of our critical Kafka clusters

Figure 5 shows the CPU utilisation by AZ for one of our critical Kafka clusters. The skewage is visible after 01/03/2023. To better manage this skewage in load across AZs, we have updated our SDK to expose the AZ as a new metric. This allows us to monitor the skewness of the consumers and take measures proactively, for example, moving some of them to different AZs.

What’s next?

We have implemented the feature to fetch from the closest replica on all our Kafka clusters and all Kafka consumers that we control. This includes internal Coban pipelines as well as the managed pipelines that our users can self-serve as part of our data streaming offering.

We are now evangelising and advocating for more of our users to adopt this feature.

Beyond Coban, other teams at Grab are also working to reduce their cross-AZ traffic, notably, Sentry, the team that is in charge of Grab’s service mesh.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Migrating from Role to Attribute-based Access Control

2023-03-09 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/migrating-to-abac

Grab has always regarded security as one of our top priorities; this is especially important for data platform teams. We need to control access to data and resources in order to protect our consumers and ensure compliance with various, continuously evolving security standards.

Additionally, we want to keep the process convenient, simple, and easily scalable for teams. However, as Grab continues to grow, we have more services and resources to manage and it becomes increasingly difficult to keep the process frictionless. That’s why we decided to move from Role-Based Access Control (RBAC) to Attribute-Based Access Control (ABAC) for our Kafka Control Plane (KCP).

In this article, you will learn how Grab’s streaming data platform team (Coban) deleted manual role and permission management of hundreds of roles and resources, and reduced operational overhead of requesting or approving permissions to zero by moving from RBAC to ABAC.

Introduction

Kafka is widely used across Grab teams as a streaming platform. For decentralised Kafka resource (e.g. topic) management, teams have the right to create, update, or delete based on their needs. As the data platform team, we implemented a KCP to ensure that these operations are only performed by authorised parties, especially on multi-tenant Kafka clusters.

For internal access management, Grab uses its own Identity and Access Management (IAM) service, based on RBAC, to support authentication and authorisation processes:

Authentication verifies the identity of a user or service, for example, if the provided token is valid or expired.
Authorisation determines their access rights, for example, whether users can only update and/or delete their own Kafka topics.

In RBAC, roles, permissions, actions, resources, and the relationships between them need to be defined in the IAM service. They are used to determine whether a user can access a certain resource.

In the following example, we can see how IAM concepts come together. The Coban engineer role belongs to the Engineering-coban group and has permission to update the retention topic. Any engineer added to the Engineering-coban group will also be able to update the topic retention.

Following the same concept, each team using the KCP has its own roles, permissions, and resources created in the system. However, there are some disadvantages to this approach:

It leads to a significant growth in the number of access control artifacts both platform and user teams need to manage, and increased time and effort to debug access control issues. We start off by finding which group the engineer belongs to and locating the group that should be used for KCP, and then trace to role and permissions.
All group membership access requests of new joiners need to be reviewed and approved by their direct managers. This leads to a lot of backlog as new joiners might have multiple groups to join and managers might not be able to review them timely. In some cases, roles need to be re-applied or renewed every 90 days, which further adds to the delay.
Group memberships are not updated to reflect active members in the team, leaving some engineers with access they don’t need and others with access they should have but don’t.

Solution

With ABAC, access management becomes a lot easier. Any new joiner to a specific team gets the same access rights as everyone on that team – no need for manual approval from a manager. However, for ABAC to work, we need these components in place:

User attributes: Who is the subject (actor) of a request?
Resource attributes: Which object (resource) does the actor want to deal with?
Evaluation engine: How do we decide if the actor is allowed to perform the action on the resource?

User attributes

All users have certain attributes depending on the department or team they belong to. This data is then stored and synced automatically with the human resource management system (HRMS) tool, which acts as a source of truth for Grab-wide data, every time a user switches teams, roles, or leaves the company.

Resource attributes

Resource provisioning is an authenticated operation. This means that KCP knows who sent the requests and what each request/action is about. Similarly, resource attributes can be derived from their creators. For new resource provisioning, it is possible to capture the resource tags and store them after authentication. For existing resources, a major challenge was the need to backfill the tagging and ensure a seamless transition from the user’s perspective. In the past, all resource provisioning operations were done by a centralised platform team and most of the existing resource attributes are still under platform team’s ownership.

Evaluation engine

We chose to use Open Policy Agent (OPA) as our policy evaluation engine mainly for its wide community support, applicable feature set, and extensibility to other tools and platforms in our system. This is also currently used by our team for Kafka authorisation. The policies are written in Rego, the default language supported by OPA.

Architecture and implementation

With ABAC, the access control process looks like this:

User attributes

Authentication is handled by the IAM service. In the /generate_token call, a user requests an authentication token from KCP before calling an authenticated endpoint. KCP then calls IAM to generate a token and returns it to the user.

In the /create_topic call, the user includes the generated token in the request header. KCP takes the token and verifies the token validity with IAM. User attributes are then extracted from the token payload for later use in request authorisation.

Some of the common attributes we use for our policy are user identifier, department code, and team code, which provide details like a user’s department and work scope.

When it comes to data governance and central platform and identity teams, one of the major challenges was standardising the set of attributes to be used for clear and consistent ABAC policies across platforms so that their lifecycle and changes could be governed. This was an important shift in the mental model for attribute management over the RBAC model.

Resource attributes

For newly created resources, attributes will be derived from user attributes that are captured during the authentication process.

Previously with RBAC, existing resources did not have the required attributes. Since migrating to ABAC, the implementation has tagged newly created resources and ensured that their attributes are up to standard. IAM was also still doing the actual authorisation using RBAC.

It is also important to note that we collaborated with data governance teams to backfill Kafka resource ownership. Having accurate ownership of resources like data lake or Kafka topics enabled us to move toward a self-service model and remove bottlenecks from centralised platform teams.

After identifying most of the resource ownership, we started switching over to ABAC. The transition was smooth and had no impact on user experience. The remaining unidentified resources were tagged to lost-and-found and could be reclaimed by service teams when they needed permission to manage them.

Open Policy Agent

The most common question when implementing the policy is “how do you define ownership by attributes?”. With respect to the principle of least privilege, each policy must be sufficiently strict to limit access to only the relevant parties. In the end, we aligned as an organisation on defining ownership by department and team.

We created a simple example below to demonstrate how to define a policy:

package authz

import future.keywords

default allow = false

allow if {
        input.endpoint == "updateTopic"
        is_owner(input.resource_attributes)
}

is_owner(md) if {
        md.department == input.user_attributes.department
        md.team == input.user_attributes.team
}

In this example, we start with denying access to everyone. If the updateTopic endpoint is called and the department and team attributes between user and resource are matched, access is allowed.

With a similar scenario, we would need 1 role, 1 action, 1 resource, and 1 mapping (a.k.a permission) between action and resource. We will need to keep adding resources and permissions when we have new resources created. Compared to the policy above, no other changes are required.

With ABAC, there are no further setup or access requests needed when a user changes teams. The user will be tagged to different attributes, automatically granted access to the new team’s resources, and excluded from the previous team’s resources.

Another consideration we had was making sure that the policy is well-written and transparent in terms of change history. We decided to include this as part of our application code so every change is accounted for in the unit test and review process.

Authorisation

The last part of the ABAC process is authorisation logic. We added the logic to the middleware so that we could make a call to OPA for authorisation.

To ensure token validity after authentication, KCP extracts user attributes from the token payload and fetches resource attributes from the resource store. It combines the request metadata such as method and endpoint, along with the user and resource attributes into an OPA request. OPA then evaluates the request based on the redefined policy above and returns a response.

Auditability

For ABAC authorisation, there are two key areas of consideration:

Who made changes to the policy, who deployed, and when the change was made
Who accessed what resource and when

We manage policies in a dedicated GitLab repository and changes are submitted via merge requests. Based on the commit history, we can easily tell who made changes, reviewed, approved, and deployed the policy.

For resource access, OPA produces a decision log containing user attributes, resource attributes, and the authorisation decision for every call it serves. The log is kept for five days in Kibana for debugging purposes, then moved to S3 where it is kept for 28 days.

Impact

The move to ABAC authorisation has improved our controls as compared to the previous RBAC model, with the biggest impact being fewer resources to manage. Some other benefits include:

Optimised resource allocation: Discarded over 200 roles, 200 permissions, and almost 3000 unused resources from IAM services, simplifying our debugging process. Now, we can simply check the user and resource attributes as needed.
Simplified resource management: In the three months we have been using ABAC, about 600 resources have been added without any increase in complexity for authorisation, which is significantly lesser than the RBAC model.
Reduction in delays and waiting time: Engineers no longer have to wait for approval for KCP access.
Better governance over resource ownership and costs: ABAC allowed us to have a standardised and accurate tagging system of almost 3000 resources.

Learnings

Although ABAC does provide significant improvements over RBAC, it comes with its own caveats:

It needs a reliable and comprehensive attribute tagging system to function properly. This only became possible after roughly three months of identifying and tagging the ownership of existing resources by both automated and manual methods.
Tags should be kept up to date with the company’s growth. Teams could lose access to their resources if they are wrongly tagged. It needs a mechanism to keep up with changes, or people will unexpectedly lose access when user and resource attributes are changed.

What’s next?

To keep up with organisational growth, KCP needs to start listening to the IAM stream, which is where all IAM changes are published. This will allow KCP to regularly update user attributes and refresh resource attributes when restructuring occurs, allowing authorisation to be done with the right data.
Constant collaboration with HR to ensure that we maintain sufficient user attributes (no extra unused information) that remain clean so ABAC works as expected.

References

Join us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Improved access controls: API access can now be selectively disabled

2023-01-11 Joseph So

Post Syndicated from Joseph So original https://blog.cloudflare.com/improved-api-access-control/

Improved access controls: API access can now be selectively disabled

Starting today, it is possible to selectively scope API access to your account to specific users.

We are making it easier for account owners to view and manage the access their users have on an account by allowing them to restrict API access to the account. Ensuring users have the least amount of access they need, and maximizing visibility of the access is critical, and our move today is another step in this direction.

When Cloudflare was first introduced, a single user had access to a single account. As we have been adopted by larger enterprises, the need to maximize access granularity and retain control of an account has become progressively more important. Nowadays, enterprises using Cloudflare could have tens or hundreds of users on an account, some of which need to do account configuration, and some that do not. In addition, to centralize the configuration of the account, some enterprises have a need for service accounts, or those shared between several members of an organization.

While account owners have always been able to restrict access to an account by their users, they haven’t been able to view the keys and tokens created by their users. Restricting use of the API is the first step in a direction that will allow account owners a single control plane experience to manage their users’ access.

Steps to secure an account

The safest thing to do to reduce risk is to scope every user to the minimal amount of access required, and the second is to monitor what they do with their access.

While a dashboard login has some degree of non-repudiation, especially when being protected by multiple factors and an SSO configuration, an API key or token can be leaked, and no further authentication factors will block the use of this credential. Therefore, in order to reduce the attack surface, we can limit what the token can do.

A Cloudflare account owner can now access their members page, and turn API access on or off for specific users, as well as account wide.

This feature is available for our enterprise users starting today.

Moving forward

On our journey to making the account management experience safer, and more granular, we will continue to increase the level of control account owners have over their accounts. Building these API restrictions is a first step on the way to allowing account-owned API tokens (which will limit the need to have personal tokens), as well as increasing general visibility of tokens among account members.

Zero trust with Kafka

2022-12-07 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/zero-trust-with-kafka

Introduction

Grab’s real-time data platform team, also known as Coban, has been operating large-scale Kafka clusters for all Grab verticals, with a strong focus on ensuring a best-in-class-performance and 99.99% availability.

Security has always been one of Grab’s top priorities and as fraudsters continue to evolve, there is an increased need to continue strengthening the security of our data streaming platform. One of the ways of doing this is to move from a pure network-based access control to state-of-the-art security and zero trust by default, such as:

Authentication: The identity of any remote systems – clients and servers – is established and ascertained first, prior to any further communications.
Authorisation: Access to Kafka is granted based on the principle of least privilege; no access is given by default. Kafka clients are associated with the whitelisted Kafka topics and permissions – consume or produce – they strictly need. Also, granted access is auditable.
Confidentiality: All in-transit traffic is encrypted.

Solution

We decided to use mutual Transport Layer Security (mTLS) for authentication and encryption. mTLS enables clients to authenticate servers, and servers to reciprocally authenticate clients.

Kafka supports other authentication mechanisms, like OAuth, or Salted Challenge Response Authentication Mechanism (SCRAM), but we chose mTLS because it is able to verify the peer’s identity offline. This verification ability means that systems do not need an active connection to an authentication server to ascertain the identity of a peer. This enables operating in disparate network environments, where all parties do not necessarily have access to such a central authority.

We opted for Hashicorp Vault and its PKI engine to dynamically generate clients and servers’ certificates. This enables us to enforce the usage of short-lived certificates for clients, which is a way to mitigate the potential impact of a client certificate being compromised or maliciously shared. We said zero trust, right?

For authorisation, we chose Policy-Based Access Control (PBAC), a more scalable solution than Role-Based Access Control (RBAC), and the Open Policy Agent (OPA) as our policy engine, for its wide community support.

To integrate mTLS and the OPA with Kafka, we leveraged Strimzi, the Kafka on Kubernetes operator. In a previous article, we have alluded to Strimzi and hinted at how it would help with scalability and cloud agnosticism. Built-in security is undoubtedly an additional driver of our adoption of Strimzi.

Server authentication

We first set up a single Root Certificate Authority (CA) for each environment (staging, production, etc.). This Root CA, in blue on the diagram, is securely managed by the Hashicorp Vault cluster. Note that the color of the certificates, keys, signing arrows and signatures on the diagrams are consistent throughout this article.

To secure the cluster’s internal communications, like the communications between the Kafka broker and Zookeeper pods, Strimzi sets up a Cluster CA, which is signed by the Root CA (step 1). The Cluster CA is then used to sign the individual Kafka broker and zookeeper certificates (step 2). Lastly, the Root CA’s public certificate is imported into the truststores of both the Kafka broker and Zookeeper (step 3), so that all pods can mutually verify their certificates when authenticating one with the other.

Strimzi’s embedded Cluster CA dynamically generates valid individual certificates when spinning up new Kafka and Zookeeper pods. The signing operation (step 2) is handled automatically by Strimzi.

For client access to Kafka brokers, Strimzi creates a different set of intermediate CA and server certificates, as shown in the next diagram.

Figure 2 – Server authentication process for client access to Kafka brokers

The same Root CA from Figure 1 now signs a different intermediate CA, which the Strimzi community calls the Client CA (step 1). This naming is misleading since it does not actually sign any client certificates, but only the server certificates (step 2) that are set up on the external listener of the Kafka brokers. These server certificates are for the Kafka clients to authenticate the servers. This time, the Root CA’s public certificate will be imported into the Kafka Client truststore (step 3).

Client authentication

For client authentication, the Kafka client first needs to authenticate to Hashicorp Vault and request an ephemeral certificate from the Vault PKI engine (step 1). Vault then issues a certificate and signs it using its Root CA (step 2). With this certificate, the client can now authenticate to Kafka brokers, who will use the Root CA’s public certificate already in their truststore, as previously described (step 3).

CA tree

Putting together the three different authentication processes we have just covered, the CA tree now looks like this. Note that this is a simplified view for a single environment, a single cluster, and two clients only.

Figure 4 – Complete certificate authority tree

As mentioned earlier, each environment (staging, production, etc.) has its own Root CA. Within an environment, each Strimzi cluster has its own pair of intermediate CAs: the Cluster CA and the Client CA. At the leaf level, the Zookeeper and Kafka broker pods each have their own individual certificates.

On the right side of the diagram, each Kafka client can get an ephemeral certificate from Hashicorp Vault whenever they need to connect to Kafka. Each team or application has a dedicated Vault PKI role in Hashicorp Vault, restricting what can be requested for its certificate (e.g., Subject, TTL, etc.).

Strimzi deployment

We heavily use Terraform to manage and provision our Kafka and Kafka-related components. This enables us to quickly and reliably spin up new clusters and perform cluster scaling operations.

Under the hood, Strimzi Kafka deployment is a Kubernetes deployment. To increase the performance and the reliability of the Kafka cluster, we create dedicated Kubernetes nodes for each Strimzi Kafka broker and each Zookeeper pod, using Kubernetes taints and tolerations. This ensures that all resources of a single node are dedicated solely to either a single Kafka broker or a single Zookeeper pod.

We also decided to go with a single Kafka cluster by Kubernetes cluster to make the management easier.

Client setup

Coban provides backend microservice teams from all Grab verticals with a popular Kafka SDK in Golang, to standardise how teams utilise Coban Kafka clusters. Adding mTLS support mostly boils down to upgrading our SDK.

Our enhanced SDK provides a default mTLS configuration that works out of the box for most teams, while still allowing customisation, e.g., for teams that have their own Hashicorp Vault Infrastructure for compliance reasons. Similarly, clients can choose among various Vault auth methods such as AWS or Kubernetes to authenticate to Hashicorp Vault, or even implement their own logic for getting a valid client certificate.

To mitigate the potential risk of a user maliciously sharing their application’s certificate with other applications or users, we limit the maximum Time-To-Live (TTL) for any given certificate. This also removes the overhead of maintaining a Certificate Revocation List (CRL). Additionally, our SDK stores the certificate and its associated private key in memory only, never on disk, hence reducing the attack surface.

In our case, Hashicorp Vault is a dependency. To prevent it from reducing the overall availability of our data streaming platform, we have added two features to our SDK – a configurable retry mechanism and automatic renewal of clients’ short-lived certificates when two thirds of their TTL is reached. The upgraded SDK also produces new metrics around this certificate renewal process, enabling better monitoring and alerting.

Authorisation

For authorisation, we set up the Open Policy Agent (OPA) as a standalone deployment in the Kubernetes cluster, and configured Strimzi to integrate the Kafka brokers with that OPA.

OPA policies – written in the Rego language – describe the authorisation logic. They are created in a GitLab repository along with the authorisation rules, called data sources (step 1). Whenever there is a change, a GitLab CI pipeline automatically creates a bundle of the policies and data sources, and pushes it to an S3 bucket (step 2). From there, it is fetched by the OPA (step 3).

When a client – identified by its TLS certificate’s Subject – attempts to consume or produce a Kafka record (step 4), the Kafka broker pod first issues an authorisation request to the OPA (step 5) before processing the client’s request. The outcome of the authorisation request is then cached by the Kafka broker pod to improve performance.

As the core component of the authorisation process, the OPA is deployed with the same high availability as the Kafka cluster itself, i.e. spread across the same number of Availability Zones. Also, we decided to go with one dedicated OPA by Kafka cluster instead of having a unique global OPA shared between multiple clusters. This is to reduce the blast radius of any OPA incidents.

For monitoring and alerting around authorisation, we submitted an Open Source contribution in the opa-kafka-plugin project in order to enable the OPA authoriser to expose some metrics. Our contribution to the open source code allows us to monitor various aspects of the OPA, such as the number of authorised and unauthorised requests, as well as the cache hit-and-miss rates. Also, we set up alerts for suspicious activity such as unauthorised requests.

Finally, as a platform team, we need to make authorisation a scalable, self-service process. Thus, we rely on the Git repository’s permissions to let Kafka topics’ owners approve the data source changes pertaining to their topics.

Teams who need their applications to access a Kafka topic would write and submit a JSON data source as simple as this:

{
 "example_topic": {
   "read": [
     "clientA.grab",
     "clientB.grab"
   ],
   "write": [
     "clientB.grab"
   ]
 }
}

GitLab CI unit tests and business logic checks are set up in the Git repository to ensure that the submitted changes are valid. After that, the change would be submitted to the topic’s owner for review and approval.

What’s next?

The performance impact of this security design is significant compared to unauthenticated, unauthorised, plaintext Kafka. We observed a drop in throughput, mostly due to the low performance of encryption and decryption in Java, and are currently benchmarking different encryption ciphers to mitigate this.

Also, on authorisation, our current PBAC design is pretty static, with a list of applications granted access for each topic. In the future, we plan to move to Attribute-Based Access Control (ABAC), creating dynamic policies based on teams and topics’ metadata. For example, teams could be granted read and write access to all of their own topics by default. Leveraging a versatile component such as the OPA as our authorisation controller enables this evolution.

Join us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Masking field values with Amazon Elasticsearch Service

2021-01-11 Prashant Agrawal

Post Syndicated from Prashant Agrawal original https://aws.amazon.com/blogs/security/masking-field-values-with-amazon-elasticsearch-service/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that you can use to deploy, secure, and run Elasticsearch cost-effectively at scale. The service provides support for open-source Elasticsearch APIs, managed Kibana, and integration with Logstash and other AWS services. Amazon ES provides a deep security model that spans many layers of interaction and supports fine-grained access control at the cluster, index, document, and field level, on a per-user basis. The service’s security plugin integrates with federated identity providers for Kibana login.

A common use case for Amazon ES is log analytics. Customers configure their applications to store log data to the Elasticsearch cluster, where the data can be queried for insights into the functionality and use of the applications over time. In many cases, users reviewing those insights should not have access to all the details from the log data. The log data for a web application, for example, might include the source IP addresses of incoming requests. Privacy rules in many countries require that those details be masked, wholly or in part. This post explains how to set up field masking within your Amazon ES domain.

Field masking is an alternative to field-level security that lets you anonymize the data in a field rather than remove it altogether. When creating a role, add a list of fields to mask. Field masking affects whether you can see the contents of a field when you search. You can use field masking to either perform a random hash or pattern-based substitution of sensitive information from users, who shouldn’t have access to that information.

When you use field masking, Amazon ES creates a hash of the actual field values before returning the search results. You can apply field masking on a per-role basis, supporting different levels of visibility depending on the identity of the user making the query. Currently, field masking is only available for string-based fields. A search result with a masked field (clientIP) looks like this:

{
  "_index": "web_logs",
  "_type": "_doc",
  "_id": "1",
  "_score": 1,
  "_source": {
    "agent": "Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1",
    "bytes": 0,
    "clientIP": "7e4df8d4df7086ee9c05efe1e21cce8ff017a711ee9addf1155608ca45d38219",
    "host": "www.example.com",
    "extension": "txt",
    "geo": {
      "src": "EG",
      "dest": "CN",
      "coordinates": {
        "lat": 35.98531194,
        "lon": -85.80931806
      }
    },
    "machine": {
      "ram": 17179869184,
      "os": "win 7"
    }
  }
}

To follow along in this post, make sure you have an Amazon ES domain with Elasticsearch version 6.7 or higher, sample data loaded (this example uses the web logs data supplied by Kibana), and access to Kibana through a role with administrator privileges for the domain.

Configure field masking

Field masking is managed by defining specific access controls within the Kibana visualization system. You’ll need to create a new Kibana role, define the fine-grained access-control privileges for that role, specify which fields to mask, and apply that role to specific users.

You can use either the Kibana console or direct-to-API calls to set up field masking. In our first example, we’ll use the Kibana console.

To configure field masking in the Kibana console

Log in to Kibana, choose the Security pane, and then choose Roles, as shown in Figure 1.

Figure 1: Choose security roles
Choose the plus sign (+) to create a new role, as shown in Figure 2.

Figure 2: Create role
Choose the Index Permissions tab, and then choose Add index permissions, as shown in Figure 3.

Figure 3: Set index permissions
Add index patterns and appropriate permissions for data access. See the Amazon ES documentation for details on configuring fine-grained access control.
Once you’ve set Index Patterns, Permissions: Action Groups, Document Level Security Query, and Include or exclude fields, you can use the Anonymize fields entry to mask the clientIP, as shown in Figure 4.

Figure 4: Anonymize field
Choose Save Role Definition.
Next, you need to create one or more users and apply the role to the new users. Go back to the Security page and choose Internal User Database, as shown in Figure 5.

Figure 5: Select Internal User Database
Choose the plus sign (+) to create a new user, as shown in Figure 6.

Figure 6: Create user
Add a username and password, and under Open Distro Security Roles, select the role es-mask-role, as shown in Figure 7.

Figure 7: Select the username, password, and roles
Choose Submit.

If you prefer, you can perform the same task by using the Amazon ES REST API using Kibana dev tools.

Use the following API to create a role as described in below snippet and shown in Figure 8.

PUT _opendistro/_security/api/roles/es-mask-role
{
  "cluster_permissions": [],
  "index_permissions": [
    {
      "index_patterns": [
        "web_logs"
      ],
      "dls": "",
      "fls": [],
      "masked_fields": [
        "clientIP"
      ],
      "allowed_actions": [
        "data_access"
      ]
    }
  ]
}

Sample response:

{
  "status": "CREATED",
  "message": "'es-mask-role' created."
}

Figure 8: API to create Role

Use the following API to create a user with the role as described in below snippet and shown in Figure 9.

PUT _opendistro/_security/api/internalusers/es-mask-user
{
  "password": "xxxxxxxxxxx",
  "opendistro_security_roles": [
    "es-mask-role"
  ]
}

Sample response:

{
  "status": "CREATED",
  "message": "'es-mask-user' created."
}

Figure 9: API to create User

Verify field masking

You can verify field masking by running a simple search query using Kibana dev tools (GET web_logs/_search) and retrieving the data first by using the kibana_user (with no field masking), and then by using the es-mask-user (with field masking) you just created.

Query responses run by the kibana_user (all access) have the original values in all fields, as shown in Figure 10.

Figure 10: Retrieval of the full clientIP data with kibana_user

Figure 11, following, shows an example of what you would see if you logged in as the es-mask-user. In this case, the clientIP field is hidden due to the es-mask-role you created.

Figure 11: Retrieval of the masked clientIP data with es-mask-user

Use pattern-based field masking

Rather than creating a hash, you can use one or more regular expressions and replacement strings to mask a field. The syntax is <field>::/<regular-expression>/::<replacement-string>.

You can use either the Kibana console or direct-to-API calls to set up pattern-based field masking. In the following example, clientIP is masked in such a way that the last three parts of the IP address are masked by xxx using the pattern is clientIP::/[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}$/::xxx.xxx.xxx>. You see only the first part of the IP address, as shown in Figure 12.

Figure 12: Anonymize the field with a pattern

Run the search query to verify that the last three parts of clientIP are masked by custom characters and only the first part is shown to the requester, as shown in Figure 13.

Figure 13: Retrieval of the masked clientIP (according to the defined pattern) with es-mask-user

Conclusion

Field level security should be the primary approach for ensuring data access security – however if there are specific business requirements that cannot be met with this approach, then field masking may offer a viable alternative. By using field masking, you can selectively allow or prevent your users from seeing private information such as personally identifying information (PII) or personal healthcare information (PHI). For more information about fine-grained access control, see the Amazon Elasticsearch Service Developer Guide.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Elasticsearch Service forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Automate domain join for Amazon EC2 instances from multiple AWS accounts and Regions

2020-12-10 Sanjay Patel

Post Syndicated from Sanjay Patel original https://aws.amazon.com/blogs/security/automate-domain-join-for-amazon-ec2-instances-multiple-aws-accounts-regions/

As organizations scale up their Amazon Web Services (AWS) presence, they are faced with the challenge of administering user identities and controlling access across multiple accounts and Regions. As this presence grows, managing user access to cloud resources such as Amazon Elastic Compute Cloud (Amazon EC2) becomes increasingly complex. AWS Directory Service for Microsoft Active Directory (also known as an AWS Managed Microsoft AD) makes it easier and more cost-effective for you to manage this complexity. AWS Managed Microsoft AD is built on highly available, AWS managed infrastructure. Each directory is deployed across multiple Availability Zones, and monitoring automatically detects and replaces domain controllers that fail. In addition, data replication and automated daily snapshots are configured for you. You don’t have to install software, and AWS handles all patching and software updates. AWS Managed Microsoft AD enables you to leverage your existing on-premises user credentials to access cloud resources such as the AWS Management Console and EC2 instances.

This blog post describes how EC2 resources launched across multiple AWS accounts and Regions can automatically domain-join a centralized AWS Managed Microsoft AD. The solution we describe in this post is implemented for both Windows and Linux instances. Removal of Computer objects from Active Directory upon instance termination is also implemented. The solution uses Amazon DynamoDB to centrally store account and directory information in a central security account. We also provide AWS CloudFormation templates and platform-specific domain join scripts for you to use with AWS Lambda as a quick start solution.

Architecture

The following diagram shows the domain-join process for EC2 instances across multiple accounts and Regions using AWS Managed Microsoft AD.

Figure 1: EC2 domain join architecture

The event flow works as follows:

An EC2 instance is launched in a peered virtual private cloud (VPC) of a workload or security account. VPCs that are hosting EC2 instances need to be peered with the VPC that contains AWS Managed Microsoft AD to enable network connectivity with Active Directory.
An Amazon CloudWatch Events rule detects an EC2 instance in the “running” state.
The CloudWatch event is forwarded to a regional CloudWatch event bus in the security account.
If the CloudWatch event bus is in the same Region as AWS Managed Microsoft AD, it delivers the event to an Amazon Simple Queue Service (Amazon SQS) queue, referred to as the domain-join queue in this post.
If the CloudWatch event bus is in a different Region from AWS Managed Microsoft AD, it delivers the event to an Amazon Simple Notification Service (Amazon SNS) topic. The event is then delivered to the domain-join queue described in step 4, through the Amazon SNS topic subscription.
Messages in the domain-join queue are held for five minutes to allow for EC2 instances to stabilize after they reach the “running” state. This delay allows time for installation of additional software components and agents through the use of EC2 user data and AWS Systems Manager Distributor.
After the holding period is over, messages in the domain-join queue invoke the AWS AD Join/Leave Lambda function. The Lambda function does the following:
1. Retrieves the AWS account ID that originated the event from the message and retrieves account-specific configurations from a DynamoDB table. This configuration identifies AWS Managed Microsoft AD domain controller IPs, credentials required to perform EC2 domain join, and an AWS Identity and Access Management (IAM) role that can be assumed by the Lambda function to invoke AWS Systems Manager Run Command.
2. If needed, uses AWS Security Token Service (AWS STS) and prepares a cross-account access session.
3. Retrieves EC2 instance information, such as the instance state, platform, and tags, and validates the instance state.
4. Retrieves platform-specific domain-join scripts that are deployed with the Lambda function’s code bundle, and configures invocation of those scripts by using data read from the DynamoDB table (bash script for Linux instances and PowerShell script for Windows instances).
5. Uses AWS Systems Manager Run Command to invoke the domain-join script on the instance. Run Command enables you to remotely and securely manage the configuration of your managed instances.
6. The domain-join script runs on the instance. It uses script parameters and instance attributes to configure the instance and perform the domain join. The adGroupName tag value is used to configure the Active Directory user group that will have permissions to log in to the instance. The instance is rebooted to complete the domain join process. Various software components are installed on the instance when the script runs. For the Linux instance, sssd, realmd, krb5, samba-common, adcli, unzip, and packageit are installed. For the Windows instance, the RDS-RD-Server feature is installed.

Removal of EC2 instances from AWS Managed Microsoft AD upon instance termination follows a similar sequence of steps. Each instance that is domain joined creates an Active Directory domain object under the “Computer” hierarchy. This domain object needs to be removed upon instance termination so that a new instance that uses the same private IP address in the subnet (at a future time) can successfully domain join and enable instance access with Active Directory credentials. Removal of the Active Directory Computer object is done by running the leaveDomaini.ps1 script (included with this blog) through Run Command on the Active Directory Tools instance identified in Figure 1.

Prerequisites and setup

To build the solution outlined in this post, you need:

AWS Managed Microsoft AD with an appropriate DNS name (for example, example.com). For more information about getting started with AWS Managed Microsoft AD, see Create Your AWS Managed Microsoft AD directory.
AD Tools. To install AD Tools and use it to create the required users:
1. Launch a Windows EC2 instance in the same account and Region, and domain-join it with the directory you created in the previous step. Log in to the instance through Remote Desktop Protocol (RDP) and install AD Tools as described in Installing the Active Directory Administration Tools.
2. After the AD Tools are installed, launch the AD Users & Computers application to create domain users, and assign those users to an Active Directory security group (for example, my_UserGroup) that has permission to access domain-joined instances.
3. Create a least-privileged user for performing domain joins as described in Delegate Directory Join Privileges for AWS Managed Microsoft AD. The identity of this user is stored in the DynamoDB table and read by the AD Join Lambda function to invoke Active Directory join scripts.
4. Store the password for the least-privileged user in an encrypted Systems Manager parameter. The password for this user is stored in the secure string System Manager parameter and read by the AD Join Lambda function at runtime while processing Amazon SQS messages.
5. Assign a unique tag key and value to identify the AD Tools instance. This instance will be invoked by the Lambda function to delete Computer objects from Active Directory upon termination of domain-joined instances.
All VPCs that are hosting EC2 instances to be domain joined must be peered with the VPC that hosts the relevant AWS Managed Microsoft AD. Alternatively, AWS Transit Gateway could be used to establish this connectivity.
In addition to having network connectivity to the AWS Managed Microsoft AD domain controllers, domain join scripts that run on EC2 instances must be able to resolve relevant Active Directory resource records. In this solution, we leverage Amazon Route 53 Outbound Resolver to forward DNS queries to the AWS Managed Microsoft AD DNS servers, while still preserving the default DNS capabilities that are available to the VPC. Learn more about deploying Route 53 Outbound Resolver and resolver rules to resolve your directory DNS name to DNS IPs.
Each domain-join EC2 instance must have a Systems Manager Agent (SSM Agent) installed and an IAM role that provides equivalent permissions as provided by the AmazonEC2RoleforSSM built-in policy. The SSM Agent is used to allow domain-join scripts to run automatically. See Working with SSM Agent for more information on installing and configuring SSM Agents on EC2 instances.

Solution deployment

The steps in this section deploy AD Join solution components by using the AWS CloudFormation service.

The CloudFormation template provided with this solution (mad_auto_join_leave.json) deploys resources that are identified in the security account’s AWS Region that hosts AWS Managed Microsoft AD (the top left quadrant highlighted in Figure 1). The template deploys a DynamoDB resource with 5 read and 5 write capacity units. This should be adjusted to match your usage. DynamoDB also provides the ability to auto-scale these capacities. You will need to create and deploy additional CloudFormation stacks for cross-account, cross-Region scenarios.

To deploy the solution

Create a versioned Amazon Simple Storage Service (Amazon S3) bucket to store a zip file (for example, adJoinCode.zip) that contains Python Lambda code and domain join/leave bash and PowerShell scripts. Upload the source code zip file to an S3 bucket and find the version associated with the object.
Navigate to the AWS CloudFormation console. Choose the appropriate AWS Region, and then choose Create Stack. Select With new resources.
Choose Upload a template file (for this solution, mad_auto_join_leave.json), select the CloudFormation stack file, and then choose Next.
Enter the stack name and values for the other parameters, and then choose Next.

Figure 2: Defining the stack name and parameters

The parameters are defined as follows:

S3CodeBucket: The name of the S3 bucket that holds the Lambda code zip file object.
adJoinLambdaCodeFileName: The name of the Lambda code zip file that includes Lambda Python code, bash, and Powershell scripts.
adJoinLambdaCodeVersion: The S3 Version ID of the uploaded Lambda code zip file.
DynamoDBTableName: The name of the DynamoDB table that will hold account configuration information.
CreateDynamoDBTable: The flag that indicates whether to create a new DynamoDB table or use an existing table.
ADToolsHostTagKey: The tag key of the Windows EC2 instance that has AD Tools installed and that will be used for removal of Active Directory Computer objects upon instance termination.
ADToolsHostTagValue: The tag value for the key identified by the ADToolsHostTagKey parameter.
Acknowledge creation of AWS resources and choose to continue to deploy AWS resources through AWS CloudFormation.The CloudFormation stack creation process is initiated, and after a few minutes, upon completion, the stack status is marked as CREATE_COMPLETE. The following resources are created when the CloudFormation stack deploys successfully:
- An AD Join Lambda function with associated scripts and IAM role.
- A CloudWatch Events rule to detect the “running” and “terminated” states for EC2 instances.
- An SQS event queue to hold the EC2 instance “running” and “terminated” events.
- CloudWatch event mapping to the SQS event queue and further to the Lambda function.
- A DynamoDB table to hold the account configuration (if you chose this option).

The DynamoDB table hosts account-level configurations. Account-specific configuration is required for an instance from a given account to join the Active Directory domain. Each DynamoDB item contains the account-specific configuration shown in the following table. Storing account-level information in the DynamoDB table provides the ability to use multiple AWS Managed Microsoft AD directories and group various accounts accordingly. Additional account configurations can also be stored in this table for implementation of various centralized security services (instance inspection, patch management, and so on).

Attribute	Description
accountId	AWS account number
adJoinUserName	User ID with AD Join permissions
adJoinUserPwParam	Encrypted Systems Manager parameter containing the AD Join user’s password
dnsIP1	Domain controller 1 IP address2
dnsIP2	Domain controller 2 IP address
assumeRoleARN	Amazon Resource Name (ARN) of the role assumed by the AD Join Lambda function

Following is an example of how you could insert an item (row) in a DynamoDB table for an account.

aws dynamodb put-item --table-name <DynamoDB-Table-Name> --item file://itemData.json

where itemData.json is as follows.

{
    "accountId": { "S": "123412341234" },
    " adJoinUserName": { "S": "ADJoinUser" },
    " adJoinUserPwParam": { "S": "ADJoinUser-PwParam" },
    "dnsName": { "S": "example.com" },
    "dnsIP1": { "S": "192.0.2.1" },
    "dnsIP2": { "S": "192.0.2.2" },
    "assumeRoleARN": { "S": "arn:aws:iam::111122223333:role/adJoinLambdaRole" }
}

(Update with your own values as appropriate for your environment.)

In the preceding example, adJoinLambdaRole is assumed by the AD Join Lambda function (if needed) to establish cross-account access using AWS Security Token Service (AWS STS). The role needs to provide sufficient privileges for the AD Join Lambda function to retrieve instance information and run cross-account Systems Manager commands.

adJoinUserName identifies a user with the minimum privileges to do the domain join; you created this user in the prerequisite steps.

adJoinUserPwParam identifies the name of the encrypted Systems Manager parameter that stores the password for the AD Join user. You created this parameter in the prerequisite steps.

Solution test

After you successfully deploy the solution using the steps in the previous section, the next step is to test the deployed solution.

To test the solution

Navigate to the AWS EC2 console and launch a Linux instance. Launch the instance in a public subnet of the available VPC.
Choose an IAM role that gives at least AmazonEC2RoleforSSM permissions to the instance.
Add an adGroupName tag with the value that identifies the name of the Active Directory security group whose members should have access to the instance.
Make sure that the security group associated with your instance has permissions for your IP address to log in to the instance by using the Secure Shell (SSH) protocol.
Wait for the instance to launch and perform the Active Directory domain join. You can navigate to the AWS SQS console and observe a delayed message that represents the CloudWatch instance “running” event. This message is processed after five minutes; after that you can observe the Lambda function’s message processing log in CloudWatch logs.
Log in to the instance with Active Directory user credentials. This user must be the member of the Active Directory security group identified by the adGroupName tag value. Following is an example login command.
```
ssh ‘[email protected]’@<public-dns-name|public-ip-address>
```
Similarly, launch a Windows EC2 instance to validate the Active Directory domain join by using Remote Desktop Protocol (RDP).
Terminate domain-joined instances. Log in to the AD Tools instance to validate that the Active Directory Computer object that represents the instance is deleted.

The AD Join Lambda function invokes Systems Manager commands to deliver and run domain join scripts on the EC2 instances. The AWS-RunPowerShellScript command is used for Microsoft Windows instances, and the AWS-RunShellScript command is used for Linux instances. Systems Manager command parameters and execution status can be observed in the Systems Manager Run Command console.

The AD user used to perform the domain join is a least-privileged user, as described in Delegate Directory Join Privileges for AWS Managed Microsoft AD. The password for this user is passed to instances by way of SSM Run Commands, as described above. The password is visible in the SSM Command history log and in the domain join scripts run on the instance. Alternatively, all script parameters can be read locally on the instance through the “adjoin” encrypted SSM parameter. Refer to the domain join scripts for details of the “adjoin” SSM parameter.

Additional information

Directory sharing

AWS Managed Microsoft AD can be shared with other AWS accounts in the same Region. Learn how to use this feature and seamlessly domain join Microsoft Windows EC2 instances and Linux instances.

autoadjoin tag

Launching EC2 instances with an autoadjoin tag key with a “false” value excludes the instance from the automated Active Directory join process. You might want to do this in scenarios where you want to install additional agent software before or after the Active Directory join process. You can invoke domain join scripts (bash or PowerShell) by using user data or other means. However, you’ll need to reboot the instance and re-run scripts to complete the domain join process.

Summary

In this blog post, we demonstrated how you could automate the Active Directory domain join process for EC2 instances to AWS Managed Microsoft AD across multiple accounts and Regions, and also centrally manage this configuration by using AWS DynamoDB. By adopting this model, administrators can centrally manage Active Directory–aware applications and resources across their accounts.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Aligning IAM policies to user personas for AWS Security Hub

2020-11-02 Vaibhawa Kumar

Post Syndicated from Vaibhawa Kumar original https://aws.amazon.com/blogs/security/aligning-iam-policies-to-user-personas-for-aws-security-hub/

AWS Security Hub provides you with a comprehensive view of your security posture across your accounts in Amazon Web Services (AWS) and gives you the ability to take action on your high-priority security alerts. There are several different user personas that use Security Hub, and they typically require different AWS Identity and Access Management (IAM) permissions. Those personas include: a security administrator; a security analyst or engineer on a central security team or in a Cloud Center of Excellence; and a Developer Operations (DevOps) engineer or application builder who is the primary owner of an AWS account. In this post, we show how to deploy sample IAM policies for these three personas.

The first persona, a security administrator or cloud system administrator (sysadmin), is responsible for setting up and configuring Security Hub, and they typically need access to all of the Security Hub APIs to do this work. As part of this work, the sysadmin enables Security Hub on various accounts and Regions, decides which standards and controls should be enabled on which accounts, enables product integrations, creates insights, sets up custom actions and automated remediations, and configures IAM policies for other users.

The second persona, a security analyst or engineer using Security Hub, is part of a central security team, often part of a Cloud Center of Excellence. We often see that cloud security efforts are centralized in Cloud Centers of Excellence within a company. These security analysts or engineers typically have access to the master account in Security Hub and can view and take action on findings from any of the connected member accounts. They typically are not configuring Security Hub, so they don’t need permissions to do so.

The third persona is a DevOps engineer or application builder. This user needs the ability to view findings and take action only on the findings associated with their account. For cloud workloads, security is often decentralized down to these users. Security Hub enables them to take more proactive responsibility for the security of their own account by directly viewing and taking action on findings in their account. They typically don’t need permissions to set up and configure Security Hub, because that is done by a central sysadmin.

Overview

The following reference architecture presents an overview of a Security Hub master-member account structure and three personas: a security administrator, a security analyst/engineer, and a DevOps engineer.

Figure 1: Reference architecture

In this blog post, we show how you can create and use the following AWS managed and customer managed IAM policies to support these three personas:

The sysadmin persona needs permissions to configure and manage Security Hub, account memberships, insights, and integrations, and to create remediations and take actions and perform record and workflow updates. The AWS managed IAM policy called AWSSecurityHubFullAccess provides the permissions for this persona. An IAM user or role with these permissions can deploy and configure Security Hub in master and member accounts. They can also update findings. The sysadmin also requires permissions to configure AWS Config and Amazon CloudWatch event rules to set up automated responses and remediations.
The security analyst persona needs permissions to read, list, and describe findings, standards, controls, and products; to update findings; and to create and update insights for Security Hub resources in the master account. The AWS managed IAM policy called AWSSecurityHubReadOnlyAccess provides the permissions needed for read, list, and describe actions, and a customer managed policy will be attached to give permissions to create and update insights and to update findings.
The DevOps engineer persona needs the same permissions as the security analyst persona, but they will only have the ability to access their own Security Hub member AWS account(s) and won’t have access to the master account.

Depending on your specific use case, you might want to provide additional permissions to the security analyst and DevOps engineer personas. For example, you might want to also grant them permissions to create custom actions by using the UpdateActionTarget API. In that case, you should also ensure that they have appropriate permissions to create CloudWatch event rules. You can also restrict these personas to only be able to update certain fields in findings (for example, only update workflow status but not severity) by using IAM context keys.

Prerequisites

You must have already enabled Security Hub with one account as master and other associated accounts as members. You will use the following AWS services:

Implementation

To create the required customer managed policies and associate them to users and roles, you will perform these tasks, described in more detail later in this section:

Create a customer managed policy and associate it with the user and role for the security analyst persona in the Security Hub master account, along with the AWSSecurityHubReadOnlyAccess AWS managed policy.
Create a customer managed policy and associate it with the user and role for the DevOps persona in the Security Hub member account, along with the AWSSecurityHubReadOnlyAccess AWS managed policy.
Create a sysadmin user and role and associate it with the AWS managed policy for AWSSecurityHubFullAccess, along with AWS Config and CloudWatch event rule permissions in the Security Hub master account.

The following policy JSON script is for those two customer managed policies.

Security Hub - Security Analyst policy 
MasterCustomer Managed Policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "SecurityAnalystMasterCMP",
            "Effect": "Allow",
            "Action": [
                "securityhub:UpdateInsight",
                "securityhub:CreateInsight",
	  			"securityhub:BatchUpdateFindings"
            ],
            "Resource": "*"
        }
    ]
}

Security Hub – DevOps Engineer policy: 
MemberCustomer Managed Policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DevOpsMemberCMP",
            "Effect": "Allow",
            "Action": [
                "securityhub:UpdateInsight",
                "securityhub:CreateInsight",
                "securityhub:BatchUpdateFindings"
            ],
            "Resource": "*"
        }
    ]
}

Step 1: Create the user and role for the security analyst persona

First, create a customer managed policy and associate it with the user and role for the security analyst persona in the Security Hub master account, along with the AWSSecurityHubReadOnlyAccess AWS managed policy.

To create the IAM policy, user, and role (console method)

Sign in to the AWS Management Console in the Security Hub master account and open the IAM console.
In the IAM console navigation pane, choose Policies, and then choose Create Policies.
Choose the JSON tab.
Copy the Security Hub – Security Analyst policy JSON shown earlier in this section, and paste it into the visual editor. When you are finished, choose Review policy.
On the Review policy page, enter a name and a description (optional) for the policy that you’re creating. Review the policy summary to see the permissions that are granted by your policy. Then choose Create policy to save your work.
Follow the instructions in the AWS Identity and Access Management User Guide to create a user and role, and attach the policy you just created.
In the IAM console navigation pane, choose the user or role (or both) that you just created, and on the Permissions tab, choose Add Permissions.
Choose the permissions category Attach existing policies directly to filter and select the managed policy AWSSecurityHubReadOnlyAccess. Review the permissions and then choose Add permissions.
Save all the changes and try using the user and role after few minutes.

Step 2: Create the user and role for the DevOps persona

Next, create a customer managed policy and associate it with the user and role for the DevOps persona in the Security Hub member account, along with the AWSSecurityHubReadOnlyAccess AWS managed policy.

To create the IAM policy, user, and role (console method)

Sign in to the AWS Management Console in the Security Hub member account and open the IAM console.
In the IAM console navigation pane, choose Policies and then choose Create Policies.
Choose the JSON tab.
Copy the Security Hub – DevOps Engineer policy JSON shown earlier in this section, and paste it into the visual editor. When you are finished, choose Review policy.
On the Review policy page, enter a name and a description (optional) for the policy that you are creating. Review the policy summary to see the permissions that are granted by your policy. Then choose Create policy to save your work.
Follow the instructions in the AWS Identity and Access Management User Guide to create a user and role, and attach the policy you just created.
In the IAM console navigation pane, choose the user or role (or both) that you just created, and on the Permissions tab, choose Add Permissions.
Choose the permissions category Attach existing policies directly to filter and select the managed policy AWSSecurityHubReadOnlyAccess. Review the permissions and then choose Add permissions.
Save all the changes and try using the user and role after few minutes.

Step 3: Create the user and role for the sysadmin persona

Next, create a user and role for the sysadmin persona and associate it with the AWS managed policies AWSSecurityHubFullAccess, CloudWatchEventsFullAccess, and full access to AWS Config in the Security Hub master account.

To create the IAM user and role (console method)

Sign in to the AWS Management Console in the Security Hub member account and open the IAM console.
Follow the instructions in the AWS Identity and Access Management User Guide to create a user and role.
In the IAM console navigation pane, choose the user or role (or both) that you just created, and on the Permissions tab, choose Add Permissions.
Choose the permissions category Attach existing policies directly to filter and select managed policies, and attach the managed policies AWSSecurityHubFullAccess and CloudWatchEventsFullAccess.
Create another policy to grant full access to AWS Config as described in the AWS Config Developer Guide, and attach the policy to the user or role. Review the permissions, and choose Add permissions.
Save all the changes and try using the user and role after few minutes.

Test the users and roles in the Security Hub master and member accounts

Finally, test the three users and roles that you created for the respective personas in the preceding steps.

To test the users and roles

Security analyst user and role:
1. Sign in to the AWS Management Console in the master account and open Security Hub.
2. Make sure that Security Hub UI features (such as the Summary, Security Standard, Insights, Findings, and Integrations) are rendered so that the Security Analyst can view them.
3. Navigate to a finding and change the workflow status as described in the topic Setting the workflow status for findings.
DevOps engineer user and role:
1. Sign in to the AWS Management Console in the member account and open Security Hub.
2. Make sure that Security Hub UI features (such as the Summary, Security Standard, Insights, Findings, and Integrations) are rendered so that the DevOps Engineer can view them.
3. Create a custom insight for the member account and other attribute groupings.
Sysadmin user and role:
1. Sign in to the AWS Management Console in the master or member account, and open Security Hub.
2. Try admin-related operations such as create a custom action, invite another account, and so on.

Adding policy conditions

You might want to further restrict the permissions for the security analyst and DevOps personas. IAM policies for Security Hub’s BatchUpdateFindings API enable you to specify conditions to prevent a user from making any update to a specific finding field. The following example disallows setting the Workflow Status field to Suppressed.

{
	"Sid": "CMPCondition",
	"Effect": "Deny",
	"Action": "securityhub:BatchUpdateFindings",
	"Resource": "*",
	"Condition": {
		"StringEquals": {
			"securityhub:ASFFSyntaxPath/Workflow.Status": "SUPPRESSED"
		}
	}
}

Summary

In this post, we showed you how to align AWS managed and customer managed IAM policies to different user personas, so that you can allow different users to access Security Hub with least privilege permissions. Security Hub enables both central security teams and individual DevOps engineers to understand and improve the security posture of the AWS accounts in their organization.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security Hub forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Securing and managing multi-cloud Presto Clusters with Grab’s DataGateway

2020-08-24 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/data-gateway

Introduction

Data is the lifeblood of Grab and the insights we gain from it drive all the most critical business decisions made by Grabbers and our leaders every day.

Grab’s Data Engineering (DE) team is responsible for maintaining the data platform, which consists of data pipelines, job schedulers, and the query/computation engines that are the key components for generating insights from data. SQL is the core language for analytics at Grab and as of early 2020, our Presto platform serves about 200 user groups that add up to 500 users who run 350,000 queries every day. These queries span across 10,000 tables that process up to 1PB of data daily.

In 2016, we started the DataGateway project to enable us to manage data access for the hundreds of Grabbers who needed access to Presto for their work. Since then, DataGateway has grown to become much more than just an access control mechanism for Presto. In this blog, we want to share what we’ve achieved since the initial launch of the project.

The problems we wanted to solve

As we were reviewing the key challenges around data access in Grab and assessing possible solutions, we came up with this prioritized list of user requirements we wanted to work on:

Use a single endpoint to serve everyone.
Manage user access to clusters, schemas, tables, and fields.
Provide seamless user experience when presto clusters are scaled up/down, in/out, or provisioned/decommissioned.
Capture audit trail of user activities.

To provide Grabbers with the critical need of interactive querying, as well as performing extract, transform, load (ETL) jobs, we evaluated several technologies. Presto was among the ones we evaluated, and was what we eventually chose although it didn’t meet all of our requirements out of the box. In order to address these gaps, we came up with the idea of a security gateway for the Presto compute engine that could also act as a load balancer/proxy, this is how we ended up creating the DataGateway.

DataGateway is a service that sits between clients and Presto clusters. It is essentially a smart HTTP proxy server that is an abstraction layer on top of the Presto clusters that handles the following actions:

Parse incoming SQL statements to get requested schemas, tables, and fields.
Manage user Access Control List (ACL) to limit users’ data access by checking against the SQL parsing results.
Manage users’ cluster access.
Redirect users’ traffic to the authorized clusters.
Show meaningful error messages to users whenever the query is rejected or exceptions from clusters are encountered.

Anatomy of DataGateway

The DataGateway’s key components are as follows:

API Service
SQL Parser
Auth framework
Administration UI

We leveraged Kubernetes to run all these components as microservices.

API Service

This is the component that manages all users and cluster-facing processes. We integrated this service with the Presto API, which means it appears to be the same as a Presto cluster to a client. It accepts query requests from clients, gets the parsing result and runs authorization from the SQL Parser and the Auth Framework.

If everything is good to go, the API Service forwards queries to the assigned clusters and continues the entire query process.

Auth Framework

This handles both authentication and authorization requests. It stores the ACL of users and communicates with the API Service and the SQL Parser to run the entire authentication process. But why is it a microservice instead of a module in API Service, you ask? It’s because we keep evolving the security checks at Grab to ensure that everything is compliant with our security requirements, especially when dealing with data.

We wanted to make it flexible to fulfill ad-hoc requests from the security team without affecting the API Service. Furthermore, there are different authentication methods out there that we might need to deal with (OAuth2, SSO, you name it). The API Service supports multiple authentication frameworks that enable different authentication methods for different users.

SQL Parser

This is a SQL parsing engine to get schema, tables, and fields by reading SQL statements. Since Presto SQL parsing works differently in each version, we would compile multiple SQL Parsers that are identical to the Presto clusters we run. The SQL Parser becomes the single source of truth.

Admin UI

This is a UI for Presto administrators to manage clusters and user access, as well as to select an authentication framework, making it easier for the administrators to deal with the entire ecosystem.

How we deployed DataGateway using Kubernetes

In the past couple of years, we’ve had significant growth in workloads from analysts and data scientists. As we were very enthusiastic about Kubernetes, DataGateway was chosen as one of the earliest services for deployment in Kubernetes. DataGateway in Kubernetes is known to be highly available and fully scalable to handle traffic from users and systems.

We also tested the HPA feature of Kubernetes, which is a dynamic scaling feature to scale in or out the number of pods based on actual traffic and resource consumption.

*Figure 2. DataGateway deployment using Kubernetes*

Functionality of DataGateway

This section highlights some of the ways we use DataGateway to manage our Presto ecosystem efficiently.

Restrict users based on Schema/Table level access

In a setup where a Presto cluster is deployed on AWS Amazon Elastic MapReduce (EMR) or Elastic Kubernetes Service (EKS), we configure an IAM role and attach it to the EMR or EKS nodes. The IAM role is set to limit the access to S3 storage. However, the IAM only provides bucket-level and file-level control; it doesn’t meet our requirements to have schema, table, and column-level ACLs. That’s how DataGateway is found useful in such scenarios.

One of the DataGateway services is an SQL Parser. As previously covered, this is a service that parses and digs out schemas and tables involved in a query. The API service receives the parsing result and checks against the ACL of users, and decides whether to allow or reject the query. This is a remarkable improvement in our security control since we now have another layer to restrict access, on top of the S3 storage. We’ve implemented an SQL-based access control down to table level.

As shown in the Figure 3, user A is trying run a SQL statement select * from locations.cities. The SQL Parser reads the statement and tells the API service that user A is trying to read data from the table cities in the schema locations. Then, the API service checks against the ACL of user A. The service finds that user A has only read access to table countries in schema locations. Eventually, the API service denies this attempt because user A doesn’t have read access to table cities in the schema locations.

*Figure 3. An example of how to check user access to run SQL statements*

The above flow shows an access denied result because the user doesn’t have the appropriate permissions.

Seamless User Experience during the EMR migration

We use AWS EMR to deploy Presto as an SQL query engine since deployment is really easy. However, without DataGateway, any EMR operations such as terminations, new cluster deployment, config changes, and version upgrades, would require quite a bit of user involvement. We would sometimes need users to make changes on their side. For example, request users to change the endpoints to connect to suitable clusters.

With DataGateway, ACLs exist for each of the user accounts. The ACL includes the list of EMR clusters that users are allowed to access. As a Presto access management platform, here the DataGateway redirects user traffics to an appropriate cluster based on the ACL, like a proxy. Users always connect to the same endpoint we offer, which is the DataGateway. To switch over from one cluster to another, we just need to edit the cluster ACL and everything is handled seamlessly.

*Figure 4. Cluster switching using DataGateway*

Figure 4 highlights the case when we’re switching EMR from one cluster to another. No changes are required from users.

We executed the migration of our entire Presto platform from an AWS EMR instance to another AWS EMR instance using the same methodology. The migrations were executed with little to no disruption for our users. We were able to move 40 clusters with hundreds of users. They were able to issue millions of queries daily in a few phases over a couple of months.

In most cases, users didn’t have to make any changes on their end, they just continued using Presto as usual while we made the changes in the background.

Multi-Cloud Data Lake/Presto Cluster maintenance

Recently, we started to build and maintain data lakes not just in one cloud, but two – in AWS and Azure. Since most end-users are AWS-based, and each team has their own AWS sub-account to run their services and workloads, it would be a nightmare to bridge all the connections and access routes between these two clouds from end-to-end, sub-account by sub-account.

Here, the DataGateway plays the role of the multi-cloud gateway. Since all end-users’ AWS sub-accounts have peered to DataGateway’s network, everything becomes much easier to handle.

For end-users, they retain the same Presto connection profile. The DE team then handles the connection setup from DataGateway to Azure, and also the deployment of Presto clusters in Azure.

When all is set, end-users use the same endpoint to DataGateway. We offer a feature called Cluster Switch that allows users to switch between AWS Presto cluster and Azure Presto Cluster on the fly by filling in parameters on the connection string. This feature allows users to switch to their target Presto cluster without any endpoint changes. The switch works instantly whenever they do the change. That means users can run different queries in different clusters based on their requirements.

This feature has helped the DE team to maintain Presto Cluster easily. We can spin up different Presto clusters for different teams, so that each team has their own query engine to run their queries with dedicated resources.

*Figure 5. Sub-account connections and Queries*

Figure 5 shows an example of how sub-accounts connect to DataGateway and run queries on resources in different clouds and clusters.

*Figure 6. Sample scenario without DataGateway*

Figure 6 shows a scenario of what would happen if DataGatway doesn’t exist. Each of the accounts would have to maintain its own connections, Virtual Private C loud (VPC) peering, and express link to connect to our Presto resources.

Summary

DataGateway is playing a key role in Grab’s entire Presto ecosystem. It helps us manage user access and cluster selections on a single endpoint, ensuring that everyone is running their Presto queries on the same place. It also helps distribute workload to different types and versions of Presto clusters.

When we started to deploy the DataGateway on Kubernetes, our vision for the Presto ecosystem underwent an epic change as it further motivated us to continuously improve. Since then, we’ve had new ideas on deployment method/pipeline, microservice implementations, scaling strategy, resource control, we even made use of Kubernetes and designed an on-demand, container-based Presto cluster provisioning engine. We’ll share this in another engineering blog, so do stay tuned!.

We also made crucial enhancements on data access control as we extended Presto’s access controls down to the schema/table-level.

In day-to-day operations, especially when we started to implement data lake in multiple clouds, DataGateway solved a lot of implementation issues. DataGateway made it simpler to switch a user’s Presto cluster from one cloud to another or allow a user to use a different Presto cluster using parameters. DataGateway allowed us to provide a seamless experience to our users.

Looking forward, we’ve more and more ideas for our Presto ecosystem, such Spark DataGateway or AWS Athena integrations, to keep our data safe at any time and to provide our users with a smoother experience when dealing with data used for analysis or research.

Authored by Vinnson Lee on behalf of the Presto Development Team at Grab – Edwin Law, Qui Hieu Nguyen, Rahul Penti, Wenli Wan, Wang Hui and the Data Engineering Team.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.