Tag Archives: identity-management

How to monitor and track failed logins for your AWS Managed Microsoft AD

Post Syndicated from Tekena Orugbani original https://aws.amazon.com/blogs/security/how-to-monitor-and-track-failed-logins-for-your-aws-managed-microsoft-ad/

AWS Directory Service for Microsoft Active Directory provides customers with the ability to review security logs on their AWS Managed Microsoft AD domain controllers by either using a domain management Amazon Elastic Compute Cloud (Amazon EC2) instance or by forwarding domain controller security event logs to Amazon CloudWatch Logs.

You can further improve visibility by monitoring Windows login activities on your AWS Managed Microsoft AD domain-joined EC2 instances, and in this blog post, I show you how. Monitoring and tracking Windows security events on these instances can reveal unexpected activities so that you can take proactive remediation action.

For example, every time there is an unsuccessful attempt to log in to a domain-joined EC2 instance or on-premises server by using an AWS Managed Microsoft AD user or a local account, an “Audit Failure” Windows security event with ID 4625 is recorded on the EC2 instance itself. The event data includes details of the account name, workstation name, and source network address. Unsuccessful attempts to log in to non–domain-joined EC2 instances and servers are handled the same way. You can track and monitor these events on an ongoing basis across your fleet of Windows EC2 instances by using the solution described here.
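To make this concrete, the following is an abridged illustration of the kind of data that a 4625 event carries (the values shown are hypothetical):

Log Name:      Security
Event ID:      4625
Keywords:      Audit Failure
Task Category: Logon
Description:   An account failed to log on.
    Account Name:            jdoe
    Workstation Name:        EC2AMAZ-EXAMPLE
    Source Network Address:  10.0.0.25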

Solution overview

Figure 1 shows the workflow for the solution.

Figure 1: Solution architecture

The workflow steps are as follows:

  1. An Amazon CloudWatch agent that is running on the EC2 instances sends the Windows security event logs to Amazon CloudWatch.
  2. CloudWatch filters the logs based on the filter you specify. When the configured threshold is met, CloudWatch posts an alert to an SNS topic.
  3. Amazon Simple Notification Service (Amazon SNS) invokes an AWS Lambda function.
  4. The Lambda function scans through the events and determines which EC2 instance(s) generated the security events at a frequency that satisfies the configured threshold. It discards any other instances listed in the events that don’t meet the specified criteria. The function sends an email to the configured email address with a high-level description of the event logs and the instance(s) that generated them.
  5. Amazon Simple Email Service (Amazon SES) delivers the emails to the specified mailbox.

Note: Although this example uses email notification via Amazon SES to monitor failed logins, there are opportunities to extend the solution. For example, you can integrate with a Security Information and Event Management (SIEM) tool that may potentially be integrated with a ticketing service and/or some automation or incident response process when a set threshold for failed logins is breached.

Prerequisites

Before you deploy the solution, you must complete the following steps:

  1. Create AWS Identity and Access Management (IAM) roles for use with the CloudWatch agent
  2. Sign up for Amazon SES
  3. Verify the sender and recipient email addresses that you’ll use to send and receive email notifications

Deploy the solution

The solution I present here involves four main steps:

  1. Install and configure the CloudWatch agent for your EC2 instances.
  2. Create a metric filter in CloudWatch.
  3. Create a CloudWatch alarm based on the metric filter and add SNS notification.
  4. Create a Lambda function and subscribe the function to the SNS topic.

Step 1: Install and configure the CloudWatch agent for all your EC2 instances

The first step is to create an AWS Systems Manager parameter to contain the JSON configuration for the CloudWatch agent that runs on the EC2 instances. You’ll then use Systems Manager Run Command to install the CloudWatch agent on the instances and to apply the configuration in the Parameter Store to the CloudWatch agent.
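If you prefer to script this step rather than use the console, the following sketch shows the equivalent calls with the AWS SDK for JavaScript (v2, the same SDK that the Lambda function later in this post uses). The parameter name matches my example; the instance tag key and values are assumptions that you should replace with your own targeting.

const aws = require('aws-sdk');
const ssm = new aws.SSM();

// The CloudWatch agent JSON configuration shown in step 4 of the procedure
// below, loaded from a local file for this sketch
const agentConfig = require('./amazon-cloudwatch-windows.json');

async function installAndConfigureAgent() {
    // Create the Systems Manager parameter that holds the agent configuration
    await ssm.putParameter({
        Name: 'AmazonCloudWatch-Windows',
        Type: 'String',
        Tier: 'Standard',
        Value: JSON.stringify(agentConfig)
    }).promise();

    // Install the CloudWatch agent on the targeted instances
    await ssm.sendCommand({
        DocumentName: 'AWS-ConfigureAWSPackage',
        Parameters: { action: ['Install'], name: ['AmazonCloudWatchAgent'] },
        Targets: [{ Key: 'tag:Environment', Values: ['Production'] }]  // assumed tag
    }).promise();

    // Apply the configuration from the Parameter Store to the agent
    // (in practice, wait for the install command to complete first)
    await ssm.sendCommand({
        DocumentName: 'AmazonCloudWatch-ManageAgent',
        Parameters: {
            action: ['configure'],
            mode: ['ec2'],
            optionalConfigurationSource: ['ssm'],
            optionalConfigurationLocation: ['AmazonCloudWatch-Windows'],
            optionalRestart: ['yes']
        },
        Targets: [{ Key: 'tag:Environment', Values: ['Production'] }]
    }).promise();
}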

To install and configure the CloudWatch agent

  1. Open the AWS Systems Manager console and in the navigation pane, choose Parameter Store to create a new Systems Manager parameter.
  2. Give your parameter a name. In my example, I named my parameter AmazonCloudWatch-Windows.
  3. For Tier, choose Standard. For Type, choose String. For Data type, choose Text.
  4. For the value of the parameter, enter the following JSON configuration and choose Create Parameter.

    Note: This JSON configuration creates a log group in CloudWatch with the name /aws/SecurityAuditLogs. If you would prefer to use another log group name, you can modify the JSON configuration. Also, if you already have a Systems Manager parameter named AmazonCloudWatch-Windows, you can use any other name of your choice.

    { "logs": { "logs_collected": { "windows_events": { "collect_list": [ { "event_format": "xml", "event_levels": [ "VERBOSE", "INFORMATION", "WARNING", "ERROR", "CRITICAL" ], "event_name": "Security", "log_group_name": "/aws/SecurityAuditLogs", "log_stream_name": "{instance_id}" } ] } } }, "metrics": { "metrics_collected": { "statsd": { "metrics_aggregation_interval": 60, "metrics_collection_interval": 10, "service_address": ":8125" } } } }
    

    The Parameter details page should look similar to the following.

    Figure 2: Create the Systems Manager parameter for the CloudWatch agent

  5. Next, you’ll use Run Command to install and configure the CloudWatch agent. In the navigation pane, choose Run Command.
  6. On the Run a command page, in the search box, enter Document name prefix: Equals: AWS-ConfigureAWSPackage. Press Enter and select the document that appears.
  7. Under Command parameters, for Name, enter AmazonCloudWatchAgent.

    Figure 3: Install the CloudWatch agent on the instances

  8. Under Targets, specify your EC2 instances based on their tags, or choose them manually, and then choose Run.
  9. To configure the CloudWatch agent, choose Run Command again. On the Run a command screen, enter Document name prefix: Equals: AmazonCloudWatch-ManageAgent. Press Enter and select the document that appears.
  10. Under Command parameters, for Optional Configuration Location, enter the name of the Systems Manager parameter you created earlier. In my example, I used the name AmazonCloudWatch-Windows. Keep the defaults for the other settings.

    Figure 4: Configure the CloudWatch agent on the instances

  11. Under Targets, specify your EC2 instances based on their tags, or choose them manually, and then choose Run.

Step 2: Create a metric filter in CloudWatch

After the completion of the tasks in Step 1, your EC2 instances should now be sending logs to a log group in Amazon CloudWatch called /aws/SecurityAuditLogs. The log group should have log streams named after the EC2 instances that are sending the logs to CloudWatch. The next step is to create a metric filter to filter the noise from the logs.

To create a metric filter

  1. Open the CloudWatch console and in the left navigation menu, choose Log Groups.
  2. Select the check box next to the /aws/SecurityAuditLogs log group, choose Actions, and then choose Create metric filter.
  3. On the Define pattern page, enter Audit Failure, keep the defaults for the other settings, and then choose Next.
  4. Enter values for Filter name, Metric namespace, Metric name, and Metric value, and then choose Next to create the metric filter.

 

Figure 5: Create a CloudWatch metric filter
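If you want to create the metric filter programmatically, here is a hedged sketch using the AWS SDK for JavaScript. The filter name matches the one referenced in Step 3; the metric namespace and metric name are assumptions from my example:

const aws = require('aws-sdk');
const cwl = new aws.CloudWatchLogs();

// Count each log event whose text contains the term "Audit Failure" as 1
cwl.putMetricFilter({
    logGroupName: '/aws/SecurityAuditLogs',
    filterName: 'WindowsSecurityAuditFailures',
    filterPattern: '"Audit Failure"',
    metricTransformations: [{
        metricName: 'WindowsSecurityAuditFailures',  // assumed metric name
        metricNamespace: 'WindowsSecurityLogs',      // assumed namespace
        metricValue: '1'
    }]
}, function (err) {
    if (err) console.error('Could not create metric filter:', err);
});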

Step 3: Create a CloudWatch alarm based on the metric filter and add SNS notification

In this step, you set a threshold for how many “Audit Failure” events you want to allow within a period of time before triggering an alarm.

To create the CloudWatch alarm and add SNS notification

  1. Open the Amazon Simple Notification Service console and in the left navigation menu, choose Topics.
  2. Choose Create topic, and then choose Standard.
  3. Provide a name for your topic, and then choose Create topic. In my example, I named the topic WindowsSecurityLogsAlarmNotifications.
  4. Open the CloudWatch console, choose Log groups, and select the /aws/SecurityAuditLogs log group.
  5. Choose the Metric filters tab, select the check box next to the WindowsSecurityAuditFailures filter you just created, and choose Create alarm.
  6. On the Specify metric and conditions page, set the parameters as follows:
    1. For Statistic, choose Sample count.
    2. For Period, choose 5 minutes.
    3. For Threshold type, choose Static.
    4. For Define the alarm condition, choose Greater > threshold.
    5. For Define the threshold value, specify the threshold number of failed login attempts that will cause a notification to be sent.

      Note: In my example, I’ve specified to be notified after five failed login attempts. You should determine the appropriate threshold to use, based on your organization’s security policies.

     

    Figure 6: Create a CloudWatch alarm

  7. On the Configure actions page, choose Next.
  8. Choose In alarm, choose Select an existing SNS topic, and then select the SNS topic you created earlier in this procedure.
  9. Specify a name for the alarm, and then choose Create Alarm.
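The topic and alarm can also be created with the AWS SDK for JavaScript. This sketch mirrors my example (a threshold of five failed attempts in a 5-minute period); the alarm name, metric name, and namespace are assumptions that must match your metric filter from Step 2:

const aws = require('aws-sdk');
const sns = new aws.SNS();
const cw = new aws.CloudWatch();

async function createTopicAndAlarm() {
    // Create the notification topic (name from my example)
    const { TopicArn } = await sns.createTopic({
        Name: 'WindowsSecurityLogsAlarmNotifications'
    }).promise();

    // Alarm when more than 5 "Audit Failure" samples arrive within 5 minutes
    await cw.putMetricAlarm({
        AlarmName: 'WindowsSecurityAuditFailuresAlarm',  // assumed name
        Namespace: 'WindowsSecurityLogs',                // must match the metric filter
        MetricName: 'WindowsSecurityAuditFailures',      // must match the metric filter
        Statistic: 'SampleCount',
        Period: 300,
        EvaluationPeriods: 1,
        Threshold: 5,
        ComparisonOperator: 'GreaterThanThreshold',
        TreatMissingData: 'notBreaching',
        AlarmActions: [TopicArn]
    }).promise();
}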

Step 4: Create a Lambda function and subscribe the function to the SNS topic

CloudWatch alarm messages are predefined, can’t be modified, and don’t provide details based on CloudWatch streams. Additionally, a CloudWatch alarm will trigger when a combination of failed login attempts on two or more instances meets the threshold. For instance, in my example, when there are three failed attempts on one instance and two failed attempts on a second instance all within a 5-minute period, a CloudWatch alarm will be triggered.

The purpose of the Lambda function that you’ll create in this step is to validate whether the triggered alarms meet the specified threshold on a per-instance basis before the function sends an email notification to the designated email address. When a CloudWatch alarm is triggered, the function reads through the CloudWatch logs and filters the logs based on CloudWatch log streams that meet the specified threshold for the alarm. If no individual CloudWatch log stream (that is, no individual instance or server) meets the threshold, the function won’t send a notification. The function only sends a notification if it determines that one or more instances have each met the specified threshold. The function also provides more information about the failed login attempts when it does send you an email.

To create the Lambda function and subscribe it to the SNS topic

  1. Open the AWS Lambda console and choose Create function.
  2. Choose Author from scratch, and provide a name for your function. Under Runtime, select Node.js 14.x, and then choose Create function.
  3. Double-click index.js, replace the code with the following code, and then choose Deploy.
    var aws = require('aws-sdk');
    var cwl = new aws.CloudWatchLogs();
    var ses = new aws.SES();
    let alarmThreshold = process.env.ALARM_THRESHOLD;
    
    exports.handler = function(event, context) {
        var message = JSON.parse(event.Records[0].Sns.Message);
        var alarmName = message.AlarmName;
        var oldState = message.OldStateValue;
        var newState = message.NewStateValue;
        var reason = message.NewStateReason;
        var requestParams = {
            metricName: message.Trigger.MetricName,
            metricNamespace: message.Trigger.Namespace
        };
        cwl.describeMetricFilters(requestParams, function(err, data) {
            if(err) console.error('Error is:', err);
            else {
                console.log('Metric Filter data is:', data);
                getInstanceIdsAndSendEmail(message, data);
            }
        });
    };
    
    function getInstanceIdsAndSendEmail(message, metricFilterData) {
        var timestamp = Date.parse(message.StateChangeTime);
        var offset = message.Trigger.Period * message.Trigger.EvaluationPeriods * 1000;
        var metricFilter = metricFilterData.metricFilters[0];
        var dictInstances = {};
        var arrayInstances = [];
        var instancesFinalList = [];
        var key;
        var val;
        // Getting the Instance Ids
        var paramsForInstanceId = {
            'logGroupName' : metricFilter.logGroupName,
            'filterPattern' : metricFilter.filterPattern ? metricFilter.filterPattern : "",
             'startTime' : timestamp - offset,
             'endTime' : timestamp
        };
        cwl.filterLogEvents(paramsForInstanceId, function (err, data){
            if (err) {
                console.error('Filtering failure:', err);
            } else {
                var events = data.events;
                for (var i in events) {
                    var InstanceId = JSON.stringify(events[i]['logStreamName']);
                    arrayInstances.push(InstanceId);
                }
                console.log('Array Instance is:', arrayInstances);
                for (var i = 0; i < arrayInstances.length; i++) {
                    var instId = arrayInstances[i];
                    dictInstances[instId] = dictInstances[instId] ? dictInstances[instId] + 1 : 1;
                }
                console.log('Instance(s) and number of audit failure occurrences:', dictInstances);
                for([key, val] of Object.entries(dictInstances)) {
                    if (val > alarmThreshold) {
                        instancesFinalList.push(key.replace(/['"]+/g, ''));
                    }
                }
                console.log('Instance(s) with failure audit that exceed the threshold:', instancesFinalList);
                getLogsAndSendEmail(message, metricFilterData, instancesFinalList);
            }
        });
    }
    
    function getLogsAndSendEmail(message, metricFilterData, logStreamNames_Instance) {
        var timestamp = Date.parse(message.StateChangeTime);
        var offset = message.Trigger.Period * message.Trigger.EvaluationPeriods * 1000;
        var metricFilter = metricFilterData.metricFilters[0];
    
        // Retrieve the matching events for only the instances that exceeded
        // the threshold, then email a summary of those events
        var paramsForEmail = {
            'logGroupName' : metricFilter.logGroupName,
            'filterPattern' : metricFilter.filterPattern ? metricFilter.filterPattern : "",
            'startTime' : timestamp - offset,
            'endTime' : timestamp,
            'logStreamNames' : logStreamNames_Instance
        };
        cwl.filterLogEvents(paramsForEmail, function (err, data){
            if (err) {
                console.error('Filtering failure:', err);
            } else {
                console.log("===SENDING EMAIL===");
                ses.sendEmail(generateEmailContent(data, message), function(err, data){
                    if(err) console.error(err);
                    else {
                        console.log("===EMAIL SENT===");
                        console.log(data);
                    }
                });
            }
        });
    }
    
    function generateEmailContent(data, message) {
        var events = data.events;
        let senderEmail = process.env.SENDER_EMAIL;
        let recipientEmail = process.env.RECIPIENT_EMAIL.split(",");
        console.log('Recipient is: ', recipientEmail);
        console.log('Events are:', events);
        var style = '<style> pre {color: red;} </style>';
        var logData = '<br/>Logs:<br/>' + style;
        for (var i in events) {
            logData += '<pre>Instance:' + JSON.stringify(events[i]['logStreamName'])  + '</pre>';
            logData += '<pre>Message:' + JSON.stringify(events[i]['message']) + '</pre><br/>';
        }
        var date = new Date(message.StateChangeTime);
        var text = 'Alarm Name: ' + '<b>' + message.AlarmName + '</b><br/>' + 
                   'Message: ' + 'There has been an unusually high number of Windows Security Audit Failure events for the instance(s) with details below. Please review the event logs <br/>' +
                   'Account ID: ' + message.AWSAccountId + '<br/>'+
                   'Region: ' + message.Region + '<br/>'+
                   'Alarm Time: ' + date.toString() + '<br/>'+
                   logData;
        var subject = 'Alarm Triggered - ' + message.AlarmName;
        var emailContent = {
            Destination: {
                ToAddresses: recipientEmail
            },
    
            Message: {
                Body: {
                    Html: {
                        Data: text
                    }
                },
                Subject: {
                    Data: subject
                }
            },
            Source: senderEmail
        };
        return emailContent;
    }
    

  4. Choose Add trigger, and in the drop-down list, choose SNS.
  5. Under SNS topic, select the SNS topic you created in Step 3, and then choose Add.
     
    Figure 7: Create the AWS Lambda function

  6. Choose the Configuration tab, and then choose Environment variables. Choose Edit to add the environment variables for ALARM_THRESHOLD, RECIPIENT_EMAIL, and SENDER_EMAIL, and then choose Save.
     
    Figure 8: The Lambda environment variables

Note: The variables’ keys must be set exactly as ALARM_THRESHOLD, RECIPIENT_EMAIL, and SENDER_EMAIL; otherwise, the code will fail. For the recipient, you can specify a single email address or multiple addresses separated by commas, as shown in Figure 8, provided that the addresses are verified as described in the Prerequisites section.
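To test the function from the Lambda console before a real alarm fires, you can invoke it with a test event shaped like the SNS notification that a CloudWatch alarm publishes. The following is a hedged sample that contains only the fields the function actually reads; all values are placeholders:

{
  "Records": [
    {
      "Sns": {
        "Message": "{\"AlarmName\":\"WindowsSecurityAuditFailuresAlarm\",\"AWSAccountId\":\"111122223333\",\"Region\":\"US East (N. Virginia)\",\"OldStateValue\":\"OK\",\"NewStateValue\":\"ALARM\",\"NewStateReason\":\"Threshold Crossed\",\"StateChangeTime\":\"2021-06-01T12:00:00.000+0000\",\"Trigger\":{\"MetricName\":\"WindowsSecurityAuditFailures\",\"Namespace\":\"WindowsSecurityLogs\",\"Period\":300,\"EvaluationPeriods\":1}}"
      }
    }
  ]
}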

Next, create an IAM policy, which you’ll attach to a role that will be assumed by the Lambda function. This policy provides permissions to perform the DescribeMetricFilters, FilterLogEvents, and SendEmail API calls that are necessary for the function to work. It also provides permissions to create a log group and log stream in CloudWatch for the Lambda function, so that you can review the logs if the Lambda function fails to run properly.

To create the IAM policy

  1. Sign in to the IAM console, and in the navigation bar, choose Policies.
  2. In the content pane, choose Create policy, and then choose JSON.
  3. Replace the content with the following policy. Make sure to replace the placeholders with the ARN of the Lambda function, the ARNs for the CloudWatch log groups, and the ARN of your verified SES sender email address.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "ses:SendEmail",
                "Resource": "<arn-of-verified-ses-email-sender>"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:DescribeMetricFilters"
                ],
                "Resource": "<arn-for-CloudWatch-log-groups>"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:FilterLogEvents"
                ],
                "Resource": "<arn-of-CloudWatch-log-group-created-in-step-1>"
    
            },
            {
                "Effect": "Allow",
                "Action": "logs:CreateLogGroup",
                "Resource": "<arn-for-CloudWatch-log-groups>"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogStream",
                    "logs:PutLogEvents"
                ],
                "Resource": [
                    "<arn-of-lambda-function:*>"
                ]
            }
        ]
    }
    

    Here is how it appears in my example. Note that FilterSendSecurityEvents is the name of my Lambda function and /aws/SecurityAuditLogs is the name of my log group created in Step 1.

    Figure 9: Example policy for the IAM role to be attached to the Lambda function

  4. Choose Review policy, specify a name and a description for the policy, and then choose Create policy.

Next, create an IAM role and attach this policy.

To create the IAM role and attach the policy

  1. In the IAM console navigation bar, choose Roles, and then choose Create role.
  2. Under Choose the service that will use this role, choose Lambda, and then choose Next: Permissions.
  3. On the next page, select the policy you just created, and then choose Next: Tags. Add an optional tag, and then choose Next: Review.
  4. Specify a name and description for the role, and then choose Create role.
  5. To attach this role to the Lambda function, go to the AWS Lambda console. Navigate to the Lambda function, and choose Configurations.
  6. Choose Permissions, and then under Execution role, choose Edit.
  7. On the Edit basic settings page, under Existing role, select the role you just created, and then choose Save.

And that’s it! You will now be notified whenever there are “Audit Failure” events that reach the threshold you set on a per-instance basis for your AWS Managed Microsoft AD domain-joined instances. If you installed and configured the CloudWatch agent on non–domain-joined instances in Step 1, then you’ll also get notifications for “Audit Failure” events that are generated by failed login attempts that use local accounts.

Conclusion

In this post, I showed you how you can proactively track and monitor Windows security audit failures across your AWS Managed Microsoft AD domain-joined EC2 instances. This helps provide greater visibility into Windows login activities for administrators, so that they can take action to maintain the security of their server fleet. This solution can also be extended to potentially trigger an automation workflow or incident response process in the event of unexpected events.

Although this blog has specifically targeted AWS Managed Microsoft AD domain-joined instances, the procedure here also applies to standalone EC2 instances or on-premises servers that are configured to send logs to CloudWatch.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Directory Service forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Tekena Orugbani

Tekena is a Cloud Support Engineer at the AWS Cape Town office. He has many years of experience working with Windows Systems, virtualization/cloud technologies, and directory services. When he’s not helping customers make the most of their cloud investments, he enjoys hanging out with his family and watching Premier League football (soccer).

Edge Authentication and Token-Agnostic Identity Propagation

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/edge-authentication-and-token-agnostic-identity-propagation-514e47e0b602

by AIM Team Members Karen Casella, Travis Nelson, Sunny Singh; with prior art and contributions by Justin Ryan, Satyajit Thadeshwar

As most developers can attest, dealing with security protocols and identity tokens, as well as user and device authentication, can be challenging. Imagine having multiple protocols, multiple tokens, 200M+ users, and thousands of device types, and the problem can explode in scope. A few years ago, we decided to address this complexity by spinning up a new initiative, and eventually a new team, to move the complex handling of user and device authentication, and various security protocols and tokens, to the edge of the network, managed by a set of centralized services, and a single team. In the process, we changed end-to-end identity propagation within the network of services to use a cryptographically-verifiable token-agnostic identity object.

Read on to learn more about this journey and how we have been able to:

  • Reduce complexity for service owners, who no longer need to have knowledge of and responsibility for terminating security protocols and dealing with myriad security tokens,
  • Improve security by delegating token management to services and teams with expertise in this area, and
  • Improve audit-ability and forensic analysis.

How We Got Here

Netflix started as a website that allowed members to manage their DVD queue. This website was later enhanced with the capability to stream content. Streaming devices came a bit later, but these initial devices were limited in capability. Over time, devices increased in capability and functions that were once only accessible on the website became accessible through streaming devices. The scale of the Netflix service was growing rapidly, with over 2,000 device types supported.

Services supporting these functions now had an increased burden of being able to understand multiple tokens and security protocols in order to identify the user and device and authorize access to those functions. The whole system was quite complex, and starting to become brittle. Plus, the architecture of the Edge tier was evolving to a PaaS (platform as a service) model, and we had some tough decisions to make about how, and where, to handle identity tokens.

Complexity: Multiple Services Handling Auth Tokens

To demonstrate the complexity of the system, following is a description of how the user login flow worked prior to the changes described in this article:

At the highest level, the steps involved in this (greatly simplified) flow are as follows:

  1. User enters their credentials and the Netflix client transmits the credentials, along with the ESN of the device, to the Edge gateway, AKA Zuul.
  2. Zuul redirects the user call to the API /login endpoint.
  3. The API server orchestrates backend systems to authenticate the user.
  4. Upon successful authentication of the claims provided, the API server sends a cookie response back upstream, including the customerId (a Long), the ESN (a String) and an expiration directive.
  5. Zuul sends the Cookies back to the Netflix client.

This model had some problems, e.g.:

  • Externally valid tokens were being minted deep down in the stack and they needed to be propagated all the way upstream, opening possibilities for them to be logged inappropriately or potentially mismanaged.
  • Upstream systems had to reopen the tokens to identify the user logging in and potentially manage multiple parallel identity data structures, which could easily get out of sync.

Multiple Protocols & Tokens

The example above shows one flow, dealing with one protocol (HTTP/S) and one type of token (Cookies). There are several protocols and tokens in use across the Netflix streaming product, as summarized below:

These tokens were consumed by, and potentially mutated by, several systems within the Netflix streaming ecosystem, for example:

To complicate things further, there were multiple methods for transmitting these tokens, or the data contained therein, from system to system. In some cases, tokens were cracked open and identity data elements extracted as simple primitives or strings to be used in API calls, or passed from system to system via request context headers, or even as URL parameters. There were no checks in place to ensure the integrity of the tokens or the data contained therein.

At Netflix Scale

Meanwhile, the scale at which Netflix operated grew exponentially. At the time of this article, Netflix has 200M+ subscribers, with over a billion devices. We are serving over 2.5 million requests per second, a large percentage of which require some form of authentication. In the old architecture, each of these requests resulted in an API call to authenticate the claims presented with the request, as shown:

EdgePaas Enters the Picture

To further complicate the situation, the Edge Engineering team was in the middle of migrating from an old API server architecture to a new PaaS-based approach. As we migrated to EdgePaaS, front-end services were moved from the Java-based API to a BFF (backend for frontend), aka NodeQuark, as shown:

This model enables front-end engineers to own and operate their services outside of the core API framework. However, this introduced another layer of complexity — how would these NodeQuark services deal with identity tokens? NodeQuark services are written in JavaScript and terminating a protocol as complex as MSL would have been difficult and wasteful, as would replicating all of the logic for token management.

So, Where Were We Again?

To summarize, we found ourselves with a complex and inefficient solution for handling authentication and identity tokens at massive scale. We had multiple types and sources of identity tokens, each requiring special handling, the logic for which was replicated in various systems. Critical identity data was being propagated throughout the server ecosystem in an inconsistent fashion.

Edge Authentication to the Rescue

We realized that in order to solve this problem, a unified identity model was needed. We would need to process authentication tokens (and protocols) further upstream. We did this by moving authentication and protocol termination to the edge of the network, and created a new integrity-protected token-agnostic identity object to propagate throughout the server ecosystem.

Moving Authentication to the Edge

Keeping in mind our objectives to improve security and reduce complexity, and ultimately provide a better user experience, we strategized on how to centralize device authentication operations and user identification and authentication token management to the services edge.

At a high-level, Zuul (cloud gateway) was to become the termination point for token inspection and payload encryption/decryption. In the case that Zuul would be unable to handle these operations (a small percentage), e.g., if tokens were not present, needed to be renewed, or were otherwise invalid, Zuul would delegate those operations to a new set of Edge Authentication Services to handle cryptographic key exchange and token creation or renewal.

Edge Authentication Services

Edge Authentication Services (EAS) is both an architectural concept of moving authentication and identification of devices and users higher up on the stack to the cloud edge, as well as a suite of services that have been developed to handle each token type.

EAS is functionally a series of filters that run in Zuul, which may call out to external services to support their domain, e.g., to a service to handle MSL tokens or another for Cookies. EAS also covers the read-only processing of tokens to create Passports (more on that later).

The basic pattern for how EAS handles requests is as follows:

For each request coming into the Netflix service, the EAS Inbound Filter in Zuul inspects the tokens provided by the device client and either passes through the request to the Passport Injection Filter, or delegates to one of the Edge Authentication Services to process. The Passport Injection Filter generates a token-agnostic identity to propagate down through the rest of the server ecosystem. On the response path, the EAS Outbound Filter, with help from the Edge Authentication Services as needed, generates the tokens needed to send back to the client device.
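In rough pseudocode, the inbound decision looks like the following. This is an illustrative sketch only; the real filters are Java code running inside Zuul, and all names here are hypothetical:

// Hypothetical sketch of the EAS Inbound Filter decision
async function inboundFilter(request) {
    const token = inspectTokens(request);  // e.g., Cookies or MSL tokens
    if (token && token.isValid() && !token.isExpired()) {
        // Common case: handled entirely in Zuul
        return passportInjectionFilter(request, token);
    }
    // Missing, expired, or invalid token: delegate to the Edge
    // Authentication Service that owns this token type
    const renewed = await delegateToEas(token, request);
    return passportInjectionFilter(request, renewed);
}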

The system architecture now takes the form of:

Notice that tokens never traverse past the Edge gateway / EAS boundary. The MSL security protocol is terminated at the Edge and all tokens are cracked open and identity data is propagated through the server ecosystem in a token-agnostic manner.

A Note on Resilience

On the happy path, Zuul is able to process the large percentage of tokens that are valid and not expired, and the Edge Auth Services handle the remainder of the requests.

The EAS services are designed to be fault tolerant, e.g., in the case where Zuul identifies that Cookies are valid, but expired, and the renewal call to EAS fails or is latent:

In this failure scenario, the EAS filter in Zuul will be lenient and allow the resolved identity to be propagated and will indicate that the renewal call should be rescheduled on the next request.

Token-Agnostic Identity (Passport)

An easily mutable identity structure would not suffice because that would mean passing less trusted identities from service to service. A token-agnostic identity structure was needed.

We introduced an identity structure called “Passport” which allowed us to propagate the user and device identity information in a uniform way. The Passport is also a kind of token, but there are many benefits to using an internal structure that differs from external tokens. However, downstream systems still need access to the user and device identity.

A Passport is a short-lived identity structure created at the Edge for each request, i.e., it is scoped to the life of the request and it is completely internal to the Netflix ecosystem. These are generated in Zuul via a set of Identity Filters. A Passport contains both user & device identity, is in protobuf format, and is integrity protected by HMAC.

Passport Structure

As noted above, the Passport is modeled as a Protocol Buffer. At the highest level, the definition of the Passport is as follows:

message Passport {
   Header header = 1;
   UserInfo user_info = 2;
   DeviceInfo device_info = 3;
   Integrity user_integrity = 4;
   Integrity device_integrity = 5;
}

The Header element communicates the name of the service that created the Passport. What’s more interesting is what is propagated related to the user and device.

User & Device Information

The UserInfo element contains all of the information required to identify the user on whose behalf requests are being made, with the DeviceInfo element containing all of the information required for the device on which the user is visiting Netflix:

message UserInfo {
    Source source = 1;
    int64 created = 2;
    int64 expires = 3;
    Int64Wrapper customer_id = 4;
        … (some internal stuff) …
    PassportAuthenticationLevel authentication_level = 11;
    repeated UserAction actions = 12;
}
message DeviceInfo {
    Source source = 1;
    int64 created = 2;
    int64 expires = 3;
    StringValue esn = 4;
    Int32Value device_type = 5;
    repeated DeviceAction actions = 7;
    PassportAuthenticationLevel authentication_level = 8;
        … (some more internal stuff) …
}

Both UserInfo and DeviceInfo carry the Source and PassportAuthenticationLevel for the request. The Source list is a classification of claims, with the protocol being used and the services used to validate the claims. The PassportAuthenticationLevel is the level of trust that we put into the authentication claims.

enum Source {
    NONE = 0;
    COOKIE = 1;
    COOKIE_INSECURE = 2;
    MSL = 3;
    PARTNER_TOKEN = 4;
}
enum PassportAuthenticationLevel {
    LOW = 1; // untrusted transport
    HIGH = 2; // secure tokens over TLS
    HIGHEST = 3; // MSL or user credentials
}

Downstream applications can use these values to make Authorization and/or user experience decisions.
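For example, a downstream service could gate a sensitive operation on the trust level carried in the Passport. This is an illustrative sketch, not Netflix's actual API; the accessor names are hypothetical:

// Hypothetical: require the highest trust level for account-level changes
function canChangeAccountSettings(passport) {
    const level = passport.getUserInfo().getAuthenticationLevel();
    return level === PassportAuthenticationLevel.HIGHEST;  // MSL or user credentials
}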

Passport Integrity

The integrity of the Passport is protected via an HMAC (hash-based message authentication code), which is a specific type of MAC involving a cryptographic hash function and a secret cryptographic key. It may be used to simultaneously verify both the data integrity and authenticity of a message.

User and device integrity are defined as:

message Integrity {
    int32 version = 1;
    string key_name = 2;
    bytes hmac = 3;
}

Version 1 of the Integrity element uses SHA-256 for the HMAC, which is encoded as a ByteArray. Future versions of Integrity may use a different hash function or encoding. In version 1, the HMAC field contains the 256 bits from MacSpec.SHA_256.

Integrity protection guarantees that Passport fields are not mutated after the Passport is created. Client applications can use the Passport Introspector to check the integrity of the Passport before using any of the values contained therein.
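As a concrete illustration of the HMAC pattern (not Netflix's implementation), here is how a service might sign and verify serialized Passport bytes using Node.js's built-in crypto module:

const crypto = require('crypto');

// Sign the serialized Passport bytes with a secret key
function sign(passportBytes, secretKey) {
    return crypto.createHmac('sha256', secretKey).update(passportBytes).digest();
}

// Verify, in constant time, that the bytes were not mutated after signing
function verify(passportBytes, hmac, secretKey) {
    const expected = sign(passportBytes, secretKey);
    return expected.length === hmac.length && crypto.timingSafeEqual(expected, hmac);
}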

Passport Introspector

The Passport object itself is opaque; clients can use the Passport Introspector to extract the Passport from the headers and retrieve the contents inside it. The Passport Introspector is a wrapper over the Passport binary data. Clients create an Introspector via a factory and then have access to basic accessor methods:

public interface PassportIntrospector {
    Long getCustomerId();
    Long getAccountOwnerId();
    String getEsn();
    Integer getDeviceTypeId();
    String getPassportAsString();
}

Passport Actions

In the Passport protocol buffer definition shown above, there are Passport Actions defined:

message UserInfo {
    repeated UserAction actions = 12;
}
message DeviceInfo {
    repeated DeviceAction actions = 7;
}

Passport Actions are explicit signals sent by downstream services, when an update to user or device identity has been performed. The signal is used by EAS to either create or update the corresponding type of token.

Login Flow, Revisited

Let’s wrap up with an example of all of these solutions working together.

With the movement of authentication and protocol termination to the Edge, and the introduction of Passports as identity, the Login Flow described earlier has morphed into the following:

  1. User enters their credentials and the Netflix client transmits the credentials, along with the ESN of the device, to the Edge gateway, AKA Zuul.
  2. Identity filters running in Zuul generate a device-bound Passport and pass it along to the API /login endpoint.
  3. The API server propagates the Passport to the mid-tier services responsible for authenticating the user.
  4. Upon successful authentication of the claims provided, these services create a Passport Action and send it, along with the original Passport, back upstream to API and Zuul.
  5. Zuul makes a call to the Cookie Service to resolve the Passport and Passport Actions and sends the Cookies back to the Netflix client.

Key Benefits and Learnings

Simplified Authorization

One of the reasons there were external tokens flowing into downstream systems was because authorization decisions often depend on authentication claims in tokens and the trust associated with each token type. In our Passport structure, we have assigned levels to this trust, meaning that systems requiring authorization decisions can write sensible rules around the Passport instead of replicating the trust rules in code across many services.

An Explicit and Extensible Identity Model

Having a structure that is the canonical identity is very useful. Alternatives where identity primitives are passed around are brittle and hard to debug. If the customer identity changed from service A to service D in a call chain, who changed it? Once the identity structure is passed through all key systems, it is relatively easy to add new external token types, new trust levels, or new ways to represent identity.

Operational Concerns and Visibility

Having a structure, like Passport, allows you to define the services that can write a Passport and other services can validate it. When the Passport is propagated and when we see it in logs, we can open it up, validate it, and know what the identity is. We also know the provenance of the Passport, and can trace it back to where it entered the system. This makes the debugging of any identity-related anomalies much easier.

Reduced Downstream System Complexity & Load

Passing a uniform structure to downstream systems means that those systems can easily look up the device and user identity, using an introspection library. Instead of having separate handling for each type of external token, they can use the common structure.

By offloading token processing from these systems to the central Edge Authentication Services, downstream systems saw significant gains in CPU, request latency, and garbage collection metrics, all of which help reduce cluster footprint and cloud costs. The following examples of these gains are from the primary API service.

In the prior implementation, it was necessary to incur decryption/termination costs twice per request because we needed the ability to route at the edge but also needed rich termination in the downstream service. Some of the performance improvement is due to consolidation of this — MSL requests now only need to be processed once.

CPU to RPS Ratio

Offloading token processing resulted in a 30% reduction in CPU cost per request and a 40% reduction in load average. The following graph shows the CPU to RPS ratio, where lower is better:

API Response Time

Response times for all calls on the API service showed significant improvement, with a 30% reduction in average latency and a 20% drop in 99th percentile latency:

Garbage Collection

The API service also saw a significant reduction in GC pressure and GC pause times, as shown in the Stop The World Garbage Collection metrics:

Developer Velocity

Abstracting these authentication and identity-related concerns away from the developers of microservices means that they can focus on their core domain. Changes in this area are now done once, and in one set of specialized services, versus being distributed across multiple.

What’s Next?

Strong(er) Authentication

We are currently expanding the Edge Authentication Services to support Multi-Factor Authentication via a new service called “Resistor”. We selectively introduce the second factor for connections that are suspicious, based on machine learning models. As we onboard new flows, we are introducing new factors, e.g., one-time passwords (OTP) sent to email or phone, push notifications to mobile devices, and third-party authenticator applications. We may also explore opt-in Multi-Factor Authentication for users who desire the added security on their accounts.

Flexible Authorization

Now that we have a verified identity flowing through the system, we can use that as a strong signal for authorization decisions. Last year, we started to explore a new Product Access Strategy (PACS) and are currently working on moving it into production for several new experiences in the Netflix streaming product. PACS recently powered the experience access control for the Streamfest, a weekend of free Netflix in India.

Want More?

Team members presented this work at QCon San Francisco, where they gave two of the three most-attended talks at the conference.

The authors are members of the Netflix Access & Identity Management team. We pride ourselves on being experts at distributed systems development, operations and identity management. And, we’re hiring Senior Software Engineers! Reach out on LinkedIn if you are interested.
