Using JSONPath effectively in AWS Step Functions

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-jsonpath-effectively-in-aws-step-functions/

This post is written by Dhiraj Mahapatro, Senior Serverless Specialist SA, Serverless.

AWS Step Functions uses Amazon States Language (ASL), a JSON-based, structured language used to define state machines. ASL uses paths for input and output processing between states. Paths follow JSONPath syntax.

JSONPath provides the capability to select parts of JSON structures similar to how XPath expressions select nodes of XML documents. Step Functions provides the data flow simulator, which helps in modeling input and output path processing using JSONPath.

This blog post explains how you can effectively use JSONPath in a Step Functions workflow. It shows how you can separate concerns between states by specifically identifying input to and output from each state. It also explains how you can use advanced JSONPath expressions for filtering and mapping JSON content.

Overview

The sample application in this blog is based on a use case in the insurance domain. A new potential customer signs up with an insurance company by creating an account. The customer provides their basic information and their interests in the types of insurance they may shop for later.

The information provided by the potential insurance customer is accepted by the insurance company’s new account application for processing. This application is built using Step Functions, which accepts provided input as a JSON payload and applies the following business logic:

Example application architecture

  1. Verify the identity of the user.
  2. Verify the address of the user.
  3. Approve the new account application if the checks pass.
  4. Upon approval, insert user information into the Amazon DynamoDB Accounts table.
  5. Collect home insurance interests and store in an Amazon SQS queue.
  6. Send email notification to the user about the application approval.
  7. Deny the new account application if the checks fail.
  8. Send an email notification to the user about the application denial.

Deploying the application

Before deploying the solution, you need an AWS account, Git, and the AWS SAM CLI installed.

To deploy:

  1. From a terminal window, clone the GitHub repo:
    git clone git@github.com:aws-samples/serverless-account-signup-service.git
  2. Change directory:
    cd ./serverless-account-signup-service
  3. Download and install dependencies:
    sam build
  4. Deploy the application to your AWS account:
    sam deploy --guided
  5. During the guided deployment process, enter a valid email address for the parameter “Email” to receive email notifications.
  6. Once deployed, a confirmation email is sent to the provided email address from SNS. Confirm the subscription by clicking the link in the email.
    Email confirmation

To run the application using the AWS CLI, replace <StepFunctionArnHere> with the state machine ARN from the deployment output:

aws stepfunctions start-execution \
  --state-machine-arn <StepFunctionArnHere> \
  --input "{\"data\":{\"firstname\":\"Jane\",\"lastname\":\"Doe\",\"identity\":{\"email\":\"[email protected]\",\"ssn\":\"123-45-6789\"},\"address\":{\"street\":\"123 Main St\",\"city\":\"Columbus\",\"state\":\"OH\",\"zip\":\"43219\"},\"interests\":[{\"category\":\"home\",\"type\":\"own\",\"yearBuilt\":2004},{\"category\":\"auto\",\"type\":\"car\",\"yearBuilt\":2012},{\"category\":\"boat\",\"type\":\"snowmobile\",\"yearBuilt\":2020},{\"category\":\"auto\",\"type\":\"motorcycle\",\"yearBuilt\":2018},{\"category\":\"auto\",\"type\":\"RV\",\"yearBuilt\":2015},{\"category\":\"home\",\"type\":\"business\",\"yearBuilt\":2009}]}}"

Paths in Step Functions

Here is the sample payload structure:

{
  "data": {
    "firstname": "Jane",
    "lastname": "Doe",
    "identity": {
      "email": "[email protected]",
      "ssn": "123-45-6789"
    },
    "address": {
      "street": "123 Main St",
      "city": "Columbus",
      "state": "OH",
      "zip": "43219"
    },
    "interests": [
      {"category": "home", "type": "own", "yearBuilt": 2004},
      {"category": "auto", "type": "car", "yearBuilt": 2012},
      {"category": "boat", "type": "snowmobile", "yearBuilt": 2020},
      {"category": "auto", "type": "motorcycle", "yearBuilt": 2018},
      {"category": "auto", "type": "RV", "yearBuilt": 2015},
      {"category": "home", "type": "business", "yearBuilt": 2009}
    ]
  }
}

The payload has data about the new user (identity and address information) and the user’s interests in the types of insurance.

The Compute Blog post on using the data flow simulator elaborates on how to use Step Functions paths. To summarize how paths work:

  1. InputPath – What input does a task need?
  2. Parameters – How does the task need the structure of the input to be?
  3. ResultSelector – What to choose from the task’s output?
  4. ResultPath – Where to put the chosen output?
  5. OutputPath – What output to send to the next state?

The key idea is that the input of downstream states depends on the output of previous states. JSONPath expressions help structure the input and output between states.
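
As a reference, this sketch (not taken from the sample application; the resource, function ARN, and paths are placeholders) shows where each of these fields sits inside a Task state definition:

"Example Task": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "InputPath": "$.data",
  "Parameters": {
    "FunctionName": "<YourFunctionArn>",
    "Payload.$": "$"
  },
  "ResultSelector": {
    "value.$": "$.Payload"
  },
  "ResultPath": "$.result",
  "OutputPath": "$",
  "End": true
}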

Using JSONPath inside paths

This is how paths are used in the sample application for each type.

InputPath

The first two main tasks in the Step Functions state machine validate the identity and the address of the user. Since the two validations are unrelated, they can run independently by using a parallel state.

Each state needs the identity and address information provided by the input payload. There is no requirement to provide interests to those states, so InputPath can help answer “What input does a task need?”.

Inside the Check Identity state:

"InputPath": "$.data.identity"

Inside the Check Address state:

"InputPath": "$.data.address"

Parameters

What should the input of the underlying task look like? Check Identity and Check Address each use their respective AWS Lambda function. When a Lambda function, or any other AWS service integration, is used as a task, the state should follow the request syntax of the corresponding service.

For a Lambda function as a task, the state should provide the FunctionName and an optional Payload as parameters. For the Check Identity state, the parameters section looks like:

"Parameters": {
    "FunctionName": "${CheckIdentityFunctionArn}",
    "Payload.$": "$"
}

Here, Payload is the entire identity JSON object provided by InputPath.
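
For the sample payload, with the InputPath above, the parameters that the state machine builds for the Lambda integration look like the following (the function ARN is shown here as its template placeholder), so the Check Identity function receives just the identity object as its event:

{
  "FunctionName": "${CheckIdentityFunctionArn}",
  "Payload": {
    "email": "[email protected]",
    "ssn": "123-45-6789"
  }
}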

ResultSelector

Once the Check Identity task is invoked, the Lambda function successfully validates the user’s identity and responds with an approval response:

{
  "ExecutedVersion": "$LATEST",
  "Payload": {
    "statusCode": "200",
    "body": "{\"approved\": true,\"message\": \"identity validation passed\"}"
  },
  "SdkHttpMetadata": {
    "HttpHeaders": {
      "Connection": "keep-alive",
      "Content-Length": "43",
      "Content-Type": "application/json",
      "Date": "Thu, 16 Apr 2020 17:58:15 GMT",
      "X-Amz-Executed-Version": "$LATEST",
      "x-amzn-Remapped-Content-Length": "0",
      "x-amzn-RequestId": "88fba57b-adbe-467f-abf4-daca36fc9028",
      "X-Amzn-Trace-Id": "root=1-5e989cb6-90039fd8971196666b022b62;sampled=0"
    },
    "HttpStatusCode": 200
  },
  "SdkResponseMetadata": {
    "RequestId": "88fba57b-adbe-467f-abf4-daca36fc9028"
  },
  "StatusCode": 200
}

The identity validation approval must be provided to the downstream states for additional processing. However, the downstream states only need the Payload.body from the preceding JSON.

You can use a combination of an intrinsic function and ResultSelector to choose attributes from the task’s output:

"ResultSelector": {
  "identity.$": "States.StringToJson($.Payload.body)"
}

ResultSelector takes the JSON string at $.Payload.body, applies States.StringToJson to convert the string to JSON, and stores the result in a new attribute named identity:

"identity": {
    "approved": true,
    "message": "identity validation passed"
}

When the Check Identity and Check Address states finish their work and exit, the step output from each branch is captured in a JSON array, which becomes the step output of the parallel state. Reconcile the results from this JSON array using the ResultSelector that is available on the parallel state:

"ResultSelector": {
    "identityResult.$": "$[0].result.identity",
    "addressResult.$": "$[1].result.address"
}
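
Abridged, the raw output array of the parallel state looks like the following sketch (the full data object in each branch output is elided and shown as …). The $[0] and $[1] selectors above pick the identity and address results out of this array:

[
  {
    "data": { … },
    "result": {
      "identity": { "approved": true, "message": "identity validation passed" }
    }
  },
  {
    "data": { … },
    "result": {
      "address": { "approved": true, "message": "address validation passed" }
    }
  }
]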

ResultPath

After ResultSelector, where should the identity result go to in the initial payload? The downstream states need access to the actual input payload in addition to the results from the previous state. ResultPath provides the mechanism to extend the initial payload to add results from the previous state.

ResultPath: "$.result" informs the state machine that any result selected from the task output (actual output if none specified) should go under result JSON attribute and result should get added to the incoming payload. The output from ResultPath looks like:

{
  "data": {
    "firstname": "Jane",
    "lastname": "Doe",
    "identity": {
      "email": "[email protected]",
      "ssn": "123-45-6789"
    },
    "address": {
      "street": "123 Main St",
      "city": "Columbus",
      "state": "OH",
      "zip": "43219"
    },
    "interests": [
      {"category":"home", "type":"own", "yearBuilt":2004},
      {"category":"auto", "type":"car", "yearBuilt":2012},
      {"category":"boat", "type":"snowmobile","yearBuilt":2020},
      {"category":"auto", "type":"motorcycle","yearBuilt":2018},
      {"category":"auto", "type":"RV", "yearBuilt":2015},
      {"category":"home", "type":"business", "yearBuilt":2009}
    ]
  },
  "result": {
    "identity": {
      "approved": true,
      "message": "identity validation passed"
    }
  }
}

The preceding JSON contains the result of the operation, while the incoming payload remains intact for business logic in downstream states.

This pattern ensures that the previous state keeps the payload hydrated for the next state. Use these combinations of paths across all states to make sure that each state has all the information needed.
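
Putting these pieces together, the Check Identity task state might look like the following sketch. The lambda:invoke resource, the FunctionName substitution, and the End field are illustrative; the definition in the sample repository may differ in details such as retries and transitions:

"Check Identity": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "InputPath": "$.data.identity",
  "Parameters": {
    "FunctionName": "${CheckIdentityFunctionArn}",
    "Payload.$": "$"
  },
  "ResultSelector": {
    "identity.$": "States.StringToJson($.Payload.body)"
  },
  "ResultPath": "$.result",
  "End": true
}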

Along with the parallel state’s ResultSelector, an appropriate ResultPath is needed on the parallel state to hold the results from both Check Identity and Check Address, so that the following results JSON object is added to the payload:

"results": {
  "addressResult": {
    "approved": true,
    "message": "address validation passed"
  },
  "identityResult": {
    "approved": true,
    "message": "identity validation passed"
  }
}

With this approach for all of the downstream states, the input payload is still intact and the state machine has collected results from each state in results.
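
A minimal sketch of that ResultPath on the parallel state, assuming the combined object should land under a top-level results attribute:

"ResultPath": "$.results"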

OutputPath

When returning results from the state machine, you ideally do not want to send the entire input payload back to the caller of the Step Functions workflow. You can use OutputPath to select a portion of the state output as the end result. OutputPath determines what output to send to the next state.

In the sample application, the last states (Approved Message and Deny Message) define OutputPath as:

"OutputPath": "$.results"

The output from the state machine is:

{
  "addressResult": {
    "approved": true,
    "message": "address validation passed"
  },
  "identityResult": {
    "approved": true,
    "message": "identity validation passed"
  },
  "accountAddition": {
    "statusCode": 200
  },
  "homeInsuranceInterests": {
    "statusCode": 200
  },
  "sendApprovedNotification": {
    "statusCode": 200
  }
}

This response strategy is also effective when using a Synchronous Express Workflow for this business logic.
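
For example, if the same logic were deployed as an Express state machine, the caller could start a synchronous execution and receive this trimmed response directly. The ARN placeholder and the input.json file below are assumptions for illustration; input.json would contain the same payload used earlier:

aws stepfunctions start-sync-execution \
  --state-machine-arn <ExpressStepFunctionArnHere> \
  --input file://input.json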

Advanced JSONPath

You can declaratively use advanced JSONPath expressions to apply logic without writing imperative code in utility functions.

Let’s focus on the interests that the new customer has asked for in the input payload. The Step Functions state machine has a state that focuses on interests in the “home” insurance category. Once the new account application is approved and added to the database successfully, the application captures home insurance interests. It adds the home-related details to the HomeInterestsQueue SQS queue and transitions to the Approved Message state.

The interests JSON array has the information about insurance interests. An effective way to get the home-related details is to filter the interests array on the category “home”. You can try this in the data flow simulator:

Data flow simulator

You can apply additional filter expressions to filter data according to your use case. To learn more, visit the data flow simulator blog.

Inside the state machine JSON, the Home Insurance Interests task has:

"InputPath": "$..interests[?(@.category==home)]"

It uses advanced JSONPath with $.. notation and [?(@.category==home)] filters.
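
Applied to the sample payload, this filter selects only the entries whose category is home, so the Home Insurance Interests task receives:

[
  {"category": "home", "type": "own", "yearBuilt": 2004},
  {"category": "home", "type": "business", "yearBuilt": 2009}
]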

Using advanced JSONPath expressions is not limited to home insurance interests and can be extended to other categories and business logic.
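
For example, a filter following the same pattern (a sketch, not part of the sample application) could select the auto-related interests instead:

"InputPath": "$..interests[?(@.category==auto)]"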

Cleanup

To delete the sample application, use the latest version of the AWS SAM CLI and run:

sam delete

Conclusion

This post uses a sample application to highlight effective use of JSONPath and data filtering strategies that can be used in Step Functions.

JSONPath provides the flexibility to work on JSON objects and arrays inside the Step Functions state machine, reducing the amount of utility code. It allows developers to build state machines by separating concerns for each state’s input and output data. Advanced JSONPath expressions help write declarative filtering logic without imperative utility code, optimizing both cost and performance.

For more serverless learning resources, visit Serverless Land.

[$] A viable solution for Python concurrency

Post Syndicated from original https://lwn.net/Articles/872869/rss

Concerns over the performance of programs written in Python are often overstated — for some use cases, at least. But there is no getting around the problem imposed by the infamous global interpreter lock (GIL), which severely limits the concurrency of multi-threaded Python code. Various efforts to remove the GIL have been made over the years, but none have come anywhere near the point where they would be considered for inclusion into the CPython interpreter. Now, though, Sam Gross has entered the arena with a proof-of-concept implementation that may solve the problem for real.

Plasma 25th Anniversary Edition released

Post Syndicated from original https://lwn.net/Articles/872952/rss

The KDE project is celebrating its 25th anniversary with a special release
of the Plasma desktop.

This time around, Plasma renews its looks and, not only do you get a new wallpaper, but also a gust of fresh air from an updated theme: Breeze – Blue Ocean. The new Breeze theme makes KDE apps and tools not only more attractive, but also easier to use both on the desktop and your phone and tablet.

Of course, looks are not the only thing you can expect from Plasma 25AE: extra speed, increased reliability and new features have also found their way into the app launcher, the software manager, the Wayland implementation, and most other Plasma tools and utilities.

Lots of details can be found in the changelog.

Security updates for Thursday

Post Syndicated from original https://lwn.net/Articles/872945/rss

Security updates have been issued by Mageia (golang, grilo, mediawiki, plib, python-flask-restx, python-mpmath, thunderbird, and xstream/xmlpull/mxparser), Oracle (389-ds-base, grafana, httpd:2.4, kernel, libxml2, and openssl), Red Hat (httpd), and SUSE (kernel).

Privacy-Preserving Compromised Credential Checking

Post Syndicated from Luke Valenta original https://blog.cloudflare.com/privacy-preserving-compromised-credential-checking/

Today we’re announcing a public demo and an open-sourced Go implementation of a next-generation, privacy-preserving compromised credential checking protocol called MIGP (“Might I Get Pwned”, a nod to Troy Hunt’s “Have I Been Pwned”). Compromised credential checking services are used to alert users when their credentials might have been exposed in data breaches. Critically, the ‘privacy-preserving’ property of the MIGP protocol means that clients can check for leaked credentials without leaking any information to the service about the queried password, and only a small amount of information about the queried username. Thus, not only can the service inform you when one of your usernames and passwords may have become compromised, but it does so without exposing any unnecessary information, keeping credential checking from becoming a vulnerability itself. The ‘next-generation’ property comes from the fact that MIGP advances upon the current state of the art in credential checking services by allowing clients to not only check if their exact password is present in a data breach, but to check if similar passwords have been exposed as well.

For example, suppose your password last year was amazon20$, and you change your password each year (so your current password is amazon21$). If last year’s password got leaked, MIGP could tell you that your current password is weak and guessable as it is a simple variant of the leaked password.

The MIGP protocol was designed by researchers at Cornell Tech and the University of Wisconsin-Madison, and we encourage you to read the paper for more details. In this blog post, we provide motivation for why compromised credential checking is important for security hygiene, and how the MIGP protocol improves upon the current generation of credential checking services. We then describe our implementation and the deployment of MIGP within Cloudflare’s infrastructure.

Our MIGP demo and public API are not meant to replace existing credential checking services today, but rather demonstrate what is possible in the space. We aim to push the envelope in terms of privacy and are excited to employ some cutting-edge cryptographic primitives along the way.

The threat of data breaches

Data breaches are rampant. The regularity of news articles detailing how tens or hundreds of millions of customer records have been compromised has made us almost numb to the details. Perhaps we all hope to stay safe just by being a small fish in the middle of a very large school of similar fish that is being predated upon. But we can do better than just hope that our particular authentication credentials are safe. We can actually check those credentials against known databases of the very same compromised user information we learn about from the news.

Many of the security breaches we read about involve leaked databases containing user details. In the worst cases, user data entered during account registration on a particular website is made available (often offered for sale) after a data breach. Think of the addresses, password hints, credit card numbers, and other private details you have submitted via an online form. We rely on the care taken by the online services in question to protect those details. On top of this, consider that the same (or quite similar) usernames and passwords are commonly used on more than one site. Our information across all of those sites may be as vulnerable as the site with the weakest security practices. Attackers take advantage of this fact to actively compromise accounts and exploit users every day.

Credential stuffing is an attack in which malicious parties use leaked credentials from an account on one service to attempt to log in to a variety of other services. These attacks are effective because of the prevalence of reused credentials across services and domains. After all, who hasn’t at some point had a favorite password they used for everything? (Quick plug: please use a password manager like LastPass to generate unique and complex passwords for each service you use.)

Website operators have (or should have) a vested interest in making sure that users of their service are using secure and non-compromised credentials. Given the sophistication of techniques employed by malevolent actors, the standard requirement to “include uppercase, lowercase, digit, and special characters” really is not enough (and can be actively harmful according to NIST’s latest guidance). We need to offer better options to users that keep them safe and preserve the privacy of vulnerable information. Dealing with account compromise and recovery is an expensive process for all parties involved.

Users and organizations need a way to know if their credentials have been compromised, but how can they do it? One approach is to scour dark web forums for data breach torrent links, download and parse gigabytes or terabytes of archives to your laptop, and then search the dataset to see if their credentials have been exposed. This approach is not workable for the majority of Internet users and website operators, but fortunately there’s a better way — have someone with terabytes to spare do it for you!

Making compromise checking fast and easy

This is exactly what compromised credential checking services do: they aggregate breach datasets and make it possible for a client to determine whether a username and password are present in the breached data. Have I Been Pwned (HIBP), launched by Troy Hunt in 2013, was the first major public breach alerting site. It provides a service, Pwned Passwords, where users can efficiently check if their passwords have been compromised. The initial version of Pwned Passwords required users to send the full password hash to the service to check if it appears in a data breach. In a 2018 collaboration with Cloudflare, the service was upgraded to allow users to run range queries over the password dataset, leaking only the salted hash prefix rather than the entire hash. Cloudflare continues to support the HIBP project by providing CDN and security support for organizations to download the raw Pwned Password datasets.

The HIBP approach was replicated by Google Password Checkup (GPC) in 2019, with the primary difference that GPC alerts are based on username-password pairs instead of passwords alone, which limits the rate of false positives. Enzoic and Microsoft Password Monitor are two other similar services. This year, Cloudflare also released Exposed Credential Checks as part of our Web Application Firewall (WAF) to help inform opted-in website owners when login attempts to their sites use compromised credentials. In fact, we use MIGP on the backend for this service to ensure that plaintext credentials never leave the edge server on which they are being processed.

Most standalone credential checking services work by having a user submit a query containing their password’s or username-password pair’s hash prefix. However, this leaks some information to the service, which could be problematic if the service turns out to be malicious or is compromised. In a collaboration with researchers at Cornell Tech published at CCS’19, we showed just how damaging this leaked information can be. Malevolent actors with access to the data shared with most credential checking services can drastically improve the effectiveness of password-guessing attacks. This left open the question: how can you do compromised credential checking without sharing (leaking!) vulnerable credentials to the service provider itself?

What does a privacy-preserving credential checking service look like?

In the aforementioned CCS’19 paper, we proposed an alternative system in which only the hash prefix of the username is exposed to the MIGP server (independent work out of Google and Stanford proposed a similar system). No information about the password leaves the user device, alleviating the risk of password-guessing attacks. These credential checking services help to preserve password secrecy, but still have a limitation: they can only alert users if the exact queried password appears in the breach.

The present evolution of this work, Might I Get Pwned (MIGP), proposes a next-generation similarity-aware compromised credential checking service that supports checking if a password similar to the one queried has been exposed in the data breach. This approach supports the detection of credential tweaking attacks, an advanced version of credential stuffing.

Credential tweaking takes advantage of the fact that many users, when forced to change their password, use simple variants of their original password. Rather than just attempting to log in using an exact leaked password, say ‘password123’, a credential tweaking attacker might also attempt to log in with easily-predictable variants of the password such as ‘password124’ and ‘password123!’.

There are two main mechanisms described in the MIGP paper to add password variant support: client-side generation and server-side precomputation. With client-side generation, the client simply applies a series of transform rules to the password to derive the set of variants (e.g., truncating the last letter or adding a ‘!’ at the end), and runs multiple queries to the MIGP service with each username and password variant pair. The second approach is server-side precomputation, where the server applies the transform rules to generate the password variants when encrypting the dataset, essentially treating the password variants as additional entries in the breach dataset. The MIGP paper describes tradeoffs between the two approaches and techniques for generating variants in more detail. Our demo service includes variant support via server-side precomputation.

Breach extraction attacks and countermeasures

One challenge for credential checking services is breach extraction attacks, in which an adversary attempts to learn username-password pairs that are present in the breach dataset (which might not be publicly available) so that they can attempt to use them in future credential stuffing or tweaking attacks. Similarity-aware credential checking services like MIGP can make these attacks more effective, since adversaries can potentially check for more breached credentials per API query. Fortunately, additional measures can be incorporated into the protocol to help counteract these attacks. For example, if it is problematic to leak the number of ciphertexts in a given bucket, dummy entries and padding can be employed, or an alternative length-hiding bucket format can be used. Slow hashing and API rate limiting are other common countermeasures that credential checking services can deploy to slow down breach extraction attacks. For instance, our demo service applies the memory-hard slow hash algorithm scrypt to credentials as part of the key derivation function to slow down these attacks.

Let’s now get into the nitty-gritty of how the MIGP protocol works. For readers not interested in the cryptographic details, feel free to skip to the demo below!

MIGP protocol

There are two parties involved in the MIGP protocol: the client and the server. The server has access to a dataset of plaintext breach entries (username-password pairs), and a secret key used for both the precomputation and the online portions of the protocol. In brief, the client performs some computation over the username and password and sends the result to the server; the server then returns a response that allows the client to determine if their password (or a similar password) is present in the breach dataset.

Full protocol description from the MIGP paper: clients learn if their credentials are in the breach dataset, leaking only the hash prefix of the queried username to the server

Precomputation

At a high level, the MIGP server partitions the breach dataset into buckets based on the hash prefix of the username (the bucket identifier), which is usually 16-20 bits in length.

During the precomputation phase of the MIGP protocol, the server derives password variants, encrypts entries, and stores them in buckets based on the hash prefix of the username

We use server-side precomputation as the variant generation mechanism in our implementation. The server derives one ciphertext for each exact username-password pair in the dataset, and an additional ciphertext per password variant. A bucket consists of the set of ciphertexts for all breach entries and variants with the same username hash prefix. For instance, suppose there are n breach entries assigned to a particular bucket. If we compute m variants per entry, counting the original entry as one of the variants, there will be n*m ciphertexts stored in the bucket. This introduces a large expansion in the size of the processed dataset, so in practice it is necessary to limit the number of variants computed per entry. Our demo server stores 10 ciphertexts per breach entry in the input: the exact entry, eight variants (see Appendix A of the MIGP paper), and a special variant for allowing username-only checks.

Each ciphertext is the encryption of a username-password (or password variant) pair along with some associated metadata. The metadata describes whether the entry corresponds to an exact password appearing in the breach, or a variant of a breached password. The server derives a per-entry secret key pad using a key derivation function (KDF) with the username-password pair and server secret as inputs, and uses XOR encryption to derive the entry ciphertext. The bucket format also supports storing optional encrypted metadata, such as the date the breach was discovered.

Input:
  Secret sk       // Server secret key
  String u        // Username
  String w        // Password (or password variant)
  Byte mdFlag     // Metadata flag
  String mdString // Optional metadata string

Output:
  String C        // Ciphertext

function Encrypt(sk, u, w, mdFlag, mdString):
  padHdr=KDF1(u, w, sk)
  padBody=KDF2(u, w, sk)
  zeros=[0] * KEY_CHECK_LEN
  C=XOR(padHdr, zeros || mdFlag) || mdString.length || XOR(padBody, mdString)
  return C

The precomputation phase only needs to be done rarely, such as when the MIGP parameters are changed (in which case the entire dataset must be re-processed), or when new breach datasets are added (in which case the new data can be appended to the existing buckets).

Online phase

During the online phase of the MIGP protocol, the client requests a bucket of encrypted breach entries corresponding to the queried username, and with the server’s help derives a key that allows it to decrypt an entry corresponding to the queried credentials

The online phase of the MIGP protocol allows a client to check if a username-password pair (or variant) appears in the server’s breach dataset, while only leaking the hash prefix of the username to the server. The client and server engage in an OPRF protocol message exchange to allow the client to derive the per-entry decryption key, without leaking the username and password to the server, or the server’s secret key to the client. The client then computes the bucket identifier from the queried username and downloads the corresponding bucket of entries from the server. Using the decryption key derived in the previous step, the client scans through the entries in the bucket attempting to decrypt each one. If the decryption succeeds, this signals to the client that their queried credentials (or a variant thereof) are in the server’s dataset. The decrypted metadata flag indicates whether the entry corresponds to the exact password or a password variant.

The MIGP protocol solves many of the shortcomings of existing credential checking services with its solution that avoids leaking any information about the client’s queried password to the server, while also providing a mechanism for checking for similar password compromise. Read on to see the protocol in action!

MIGP demo

As the state of the art in attack methodologies evolve with new techniques such as credential tweaking, so must the defenses. To that end, we’ve collaborated with the designers of the MIGP protocol to prototype and deploy the MIGP protocol within Cloudflare’s infrastructure.

Our MIGP demo server is deployed at migp.cloudflare.com, and runs entirely on top of Cloudflare Workers. We use Workers KV for efficient storage and retrieval of buckets of encrypted breach entries, capping out each bucket size at the current KV value limit of 25MB. In our instantiation, we set the username hash prefix length to 20 bits, so that there are a total of 2^20 (or just over 1 million) buckets.

There are currently two ways to interact with the demo MIGP service: via the browser client at migp.cloudflare.com, or via the Go client included in our open-sourced MIGP library. As shown in the screenshots below, the browser client displays the request from your device and the response from the MIGP service. You should take caution to not input any sensitive credentials in a third-party service (feel free to use the test credentials [email protected] and password1 for the demo).

Keep in mind that “absence of evidence is not evidence of absence”, especially in the context of data breaches. We intend to periodically update the breach datasets used by the service as new public breaches become available, but no breach alerting service will be able to provide 100% accuracy in assuring that your credentials are safe.

See the MIGP demo in action in the attached screenshots. Note that in all cases, the username ([email protected]) and corresponding username prefix hash (000f90f4) remain the same, so the client retrieves the exact same bucket contents from the server each time. However, the blindElement parameter in the client request differs per request, allowing the client to decrypt different bucket elements depending on the queried credentials.

  • Example query in which the credentials are exposed in the breach dataset
  • Example query in which similar credentials were exposed in the breach dataset
  • Example query in which the username is present in the breach dataset
  • Example query in which the credentials are not found in the dataset

Open-sourced MIGP library

We are open-sourcing our implementation of the MIGP library under the BSD-3 License. The code is written in Go and is available at https://github.com/cloudflare/migp-go. Under the hood, we use Cloudflare’s CIRCL library for OPRF support and Go’s supplementary cryptography library for scrypt support. Check out the repository for instructions on setting up the MIGP client to connect to Cloudflare’s demo MIGP service. Community contributions and feedback are welcome!

Future directions

In this post, we announced our open-sourced implementation and demo deployment of MIGP, a next-generation breach alerting service. Our deployment is intended to lead the way for other credential compromise checking services to migrate to a more privacy-friendly model, but is not itself currently meant for production use. However, we identify several concrete steps that can be taken to improve our service in the future:

  • Add more breach datasets to the database of precomputed entries
  • Increase the number of variants in server-side precomputation
  • Add library support in more programming languages to reach a broader developer base
  • Hide the number of ciphertexts per bucket by padding with dummy entries
  • Add support for efficient client-side variant checking by batching API calls to the server

For exciting future research directions that we are investigating — including one proposal to remove the transmission of plaintext passwords from client to server entirely — take a look at https://blog.cloudflare.com/research-directions-in-password-security.

We are excited to share and build upon these ideas with the wider Internet community, and hope that our efforts impact positive change in the password security ecosystem. We are particularly interested in collaborating with stakeholders in the space to develop, test, and deploy next-generation protocols to improve user security and privacy. You can reach us with questions, comments, and research ideas at [email protected]. For those interested in joining our team, please visit our Careers Page.

Unbuckling the narrow waist of IP: Addressing Agility for Names and Web Services

Post Syndicated from Marwan Fayed original https://blog.cloudflare.com/addressing-agility/

At large operational scales, IP addressing stifles innovation in network- and web-oriented services. For every architectural change, and certainly when starting to design new systems, the first set of questions we are forced to ask are:

  • Which block of IP addresses do or can we use?
  • Do we have enough in IPv4? If not, where or how can we get them?
  • How do we use IPv6 addresses, and does this affect other uses of IPv6?
  • Oh, and what careful plan, checks, time, and people do we need for migration?

Having to stop and worry about IP addresses costs time, money, resources. This may sound surprising, given the visionary and resilient advent of IP, 40+ years ago. By their very design, IP addresses should be the last thing that any network has to think about. However, if the Internet has laid anything bare, it’s that small or seemingly unimportant weaknesses — often invisible or impossible to see at design time — always show up at sufficient scale.

One thing we do know: “more addresses” should never be the answer. In IPv4 that type of thinking only contributes to their scarcity, driving up further their market prices. IPv6 is absolutely necessary, but only one part of the solution. For example, in IPv6, the best practice says that the smallest allocation, just for personal use, is /56 — that’s 2^72 or about 4,722,000,000,000,000,000,000 addresses. I certainly can’t reason about numbers that large. Can you?

In this blog post, we’ll explain why IP addressing is a problem for web services, the underlying causes, and then describe an innovative solution that we’re calling Addressing Agility, alongside the lessons we’ve learned. The best part of all may be the kinds of new systems and architectures enabled by Addressing Agility. The full details are available in our recent paper from ACM SIGCOMM 2021. As a preview, here is a summary of some of the things we learned:

It’s true! There is no limit to the number of names that can appear on any single address; the address of any name can change with every new query, anywhere; and address changes can be made for any reason, be it service provisioning or policy or performance evaluation, or others we’ve yet to encounter…

Explained below are the reasons this is all true, the way we get there, and the reasons these lessons matter for HTTP and TLS services of any size. The key insight on which we build: in the design of the Internet Protocol (IP), much like in the global postal system, addresses have never been, should never be, and in no way ever are, needed to represent names. We just sometimes treat addresses as if they do. Instead, this work shows that all names should share all of their addresses, any set of their addresses, or even just one address.

The narrow waist is a funnel, but also a choke point

Decades-old conventions artificially tie IP addresses to names and resources. This is understandable since the architecture and software that drive the Internet evolved from a setting in which one computer had one name and (most often) one network interface card. It would be natural, then, for the Internet to evolve such that one IP address would be associated with names and software processes.

Among end clients and network carriers, where there is little need for names and less need for listening processes, these IP bindings have little impact. However, the name and process conventions create strong limitations on all content hosting, distribution, and content-service providers (CSPs). Once assigned to names, interfaces, and sockets, addresses become largely static and require effort, planning, and care to change if change is possible at all.

The “narrow waist” of IP has enabled the Internet, but much like TCP has been to transport protocols and HTTP to application protocols, IP has become a stifling bottleneck to innovation. The idea is depicted by the figure below, in which we see that otherwise separate communication bindings (with names) and connection bindings (with interfaces and sockets) create transitive relationships between them.

The transitive lock is hard to break, because changing either can have an impact on the other. Moreover, service providers often use IP addresses to represent policies and service levels that themselves exist independently of names. Ultimately the IP bindings are one more thing to think about — and for no good reason.

Let’s put this another way. When thinking of new designs, new architectures, or just better resource allocations, the first set of questions should never be “which IP addresses do we use?” or “do we have IP addresses for this?” Questions like these and their answers slow development and innovation.

We realised that IP bindings are not only artificial but, according to the original visionary RFCs and standards, also incorrect. In fact, the notion of IP addresses as being representative of anything other than reachability runs counter to their original design. In the original RFC and related drafts, the architects are explicit, “A distinction is made between names, addresses, and routes. A name indicates what we seek. An address indicates where it is. A route indicates how to get there.” Any association to IP of information like SNI or HTTP host in higher-layer protocols is a clear violation of the layering principle.

Of course none of our work exists in isolation. It does, however, complete a long-standing evolution to decouple IP addresses from their conventional use, an evolution that consists of standing on the shoulders of giants.

The Evolving Past…

Looking backwards over the last 20 years, it’s easy to see that a quest for addressing agility has been ongoing for some time, and one in which Cloudflare has been deeply invested.

The decades-old one-to-one binding between IP and network card interfaces was first broken a few years ago when Google’s Maglev combined Equal Cost MultiPath (ECMP) and consistent hashing to disseminate traffic from one ‘virtual’ IP address among many servers. As an aside, according to the original Internet Protocol RFCs, this use of IP is proscribed and there is nothing virtual about it.

Many similar systems have since emerged at GitHub, Facebook, and elsewhere, including our very own Unimog. More recently, Cloudflare designed a new programmable sockets architecture called bpf_sk_lookup to decouple IP addresses from sockets and processes.

But what about those names? The value of ‘virtual hosting’ was cemented in 1997 when HTTP 1.1 defined the host field as mandatory. This was the first official acknowledgement that multiple names can coexist on a single IP address, and was necessarily reproduced by TLS in the Server Name Indication field. These are absolute requirements since the number of possible names is greater than the number of IP addresses.

…Indicates an Agile Future

Looking ahead, Shakespeare was wise to ask, “What’s in a Name?” If the Internet could speak then it might say, “That name which we label by any other address would be just as reachable.”

If Shakespeare instead asked, “What is in an address?” then the Internet would similarly answer, “That address which we label by any other name would be just as reachable, too.”

A strong implication emerges from the truth of those answers: The mapping between names and addresses is any-to-any. If this is true then any address can be used to reach a name as long as a name is reachable at an address.

In fact, a version of many addresses for a name has been available since 1995 with the introduction of DNS-based load-balancing. Then why not all addresses for all names, or any addresses at any given time for all names? Or — as we’ll soon discover — one address for all names! But first let’s talk about the manner in which addressing agility is achieved.

Achieving Addressing Agility: Ignore names, map policies

The key to addressing agility is authoritative DNS — but not in the static name-to-IP mappings stored in some form of a record or lookup table. Consider that from any client’s perspective, the binding only appears ‘on-query’. For all practical uses of the mapping, the query’s response is the last possible moment in the lifetime of a request where a name can be bound to an address.

This leads to the observation that name mappings are actually made, not in some record or zone file, but at the moment the response is returned. It’s a subtle, but important distinction. Today’s DNS systems use a name to look up a set of addresses, and then sometimes use some policy to decide which specific address to return. The idea is shown in the figure below. When a query arrives, a lookup reveals the addresses associated with that name, and then returns one or more of those addresses. Often, additional policy or logic filters are used to narrow the address selection, such as service level or geo-regional coverage. The important detail is that addresses are identified with a name first, and policies are only applied afterwards.

(a) Conventional Authoritative DNS

(b) Addressing Agility

Addressing agility is achieved by inverting this relationship. Instead of IP addresses pre-assigned to a name, our architecture begins with a policy that may (or, in our case, may not) include a name. For example, a policy may be represented by attributes such as location and account type and ignore the name (which we did in our deployment). The attributes identify a pool of addresses that are associated with that policy. The pool itself may be isolated to that policy or have elements shared with other pools and policies. Moreover, all the addresses in the pool are equivalent. This means that any of the addresses may be returned — or even selected at random — without inspecting the DNS query name.

Now pause for a moment, because there are two really noteworthy implications that fall out of per-query responses:

i. IP addresses can be, and are, computed and assigned at runtime or query-time.

ii. The lifetime of the IP-to-name mapping is the larger of the ensuing connection lifetime and the TTL in downstream caches.

The outcome is powerful and means that the binding itself is otherwise ephemeral and can be changed without regard to previous bindings, resolvers, clients, or purpose. Also, scale is no issue, and we know because we deployed it at the edge.

IPv6 — new clothes, same emperor

Before talking about our deployment, let’s first address the proverbial elephant in the room: IPv6. The first thing to make clear is that everything — everything — discussed here in the context of IPv4 applies equally in IPv6. As is true of the global postal system, addresses are addresses, whether in Canada, Cambodia, Cameroon, Chile, or China — and that includes their relatively static, inflexible nature.

Despite this equivalence, the obvious question remains: surely all the reasons to pursue Addressing Agility are satisfied simply by changing to IPv6? Counter-intuitive as it may be, the answer is a definite, absolute no! IPv6 may mitigate address exhaustion, at least for the lifetimes of everyone alive today, but the abundance of IPv6 prefixes and addresses makes reasoning about their bindings to names and resources difficult.

The abundance of IPv6 addresses also risks inefficiencies because operators can take advantage of the bit length and large prefix sizes to embed meaning into the IP address. This is a powerful feature of IPv6, but also means many, many, addresses in any prefix will go unused.

To be clear, Cloudflare is demonstrably one of the biggest advocates of IPv6, and for good reasons, not least that the abundance of addresses ensures longevity. Even so, IPv6 changes little about the way addresses are tied to names and resources, whereas an address’ agility ensures flexibility and responsiveness for their lifetimes.

A Side-note: Agility is for Everyone

One last comment on the architecture and its transferability — Addressing Agility is usable, even desirable, for any service that operates authoritative DNS. Other content-oriented service providers are obvious contenders, but so too are smaller operators. Universities, enterprises, and governments are just a few examples of organizations that can operate their own authoritative services. So long as the operators are able to accept connections on the IP addresses that are returned, all are potential beneficiaries of addressing agility as a result.

Policy-based randomized addresses — at scale

We’ve been working with Addressing Agility live at the edge, with production traffic, since June 2020, as follows:

  • More than 20 million hostnames and services
  • All data centers in Canada (giving a reasonable population and multiple time zones)
  • /20 (4096 addresses) in IPv4 and /44 in IPv6
  • /24 (256 addresses) in IPv4 from January 2021 to June 2021
  • For every query, generate a random host-portion within the prefix.

After all, the true test of agility is most extreme when a random address is generated for every query that hits our servers. Then we decided to truly put the idea to the test. In June 2021, in our Montreal data center and soon after in Toronto, all 20+ million zones were mapped to one single address.

Over the course of one year, every query for a domain captured by the policy received an address selected at random — from a set of as few as 4096 addresses, then 256, and then one. Internally, we refer to the address set of one as Ao1, and we’ll return to this point later.

The measure of success: “Nothing to see here”

There may be a number of questions our readers are quietly asking themselves:

  • What did this break on the Internet?
  • What effect did this have on Cloudflare systems?
  • What would I see happening if I could?

The short answer to each question above is nothing. But — and this is important — address randomization does expose weaknesses in the designs of systems that rely on the Internet. The weaknesses always, every one, occur because the designers ascribe meaning to IP addresses beyond reachability. (And, if only incidentally, every one of those weaknesses is circumvented by the use of one address, or ‘Ao1’.)

To better understand the nature of “nothing”, let’s answer the above questions starting from the bottom of the list.

What would I see if I could?

The answer is shown by the example in the figure below. From all data centers in the “Rest of World” outside our deployment, a query for a zone returns the same addresses (such is Cloudflare’s global anycast system). In contrast, every query that lands in a deployment data center receives a random address. These can be seen below in successive dig commands to two different data centers.

For those who may be wondering about subsequent request traffic, yes, this means that servers are configured to accept connection requests for any of the 20+ million domains on all addresses in the address pool.

Ok, but surely Cloudflare’s surrounding systems needed modification?

Nope. This is a drop-in, transparent change to the data pipeline for authoritative DNS. Routing prefix advertisements in BGP, DDoS protection, load balancers, the distributed cache … no changes were required to any of them.

There is, however, one fascinating side effect: randomization is to IP addresses as a good hash function is to a hash table — it evenly maps an arbitrary size input to a fixed number of outputs. The effect can be seen by looking at measures of load-per-IP before and after randomization as in the graphs below, with data taken from 1% samples of requests at one data center over seven days.

(a) Before Addressing Agility

(b) Randomization on /20

(c) Randomization on /24

Before randomization, for only a small portion of Cloudflare’s IP space, (a) the difference between greatest and least requests per IP (y1-axis on the left) is three orders of magnitude; similarly, bytes per IP (y2-axis on the right) is almost six orders of magnitude. After randomization, (b) for all domains on a single /20 that previously occupied multiple /20s, these reduce to 2 and 3 orders of magnitude, respectively. Taking this one step further down to /24 in (c), per-query randomization of 20+ million zones onto 256 addresses reduces differences in load to small constant factors.

This might matter to any content service provider that might think about provisioning resources by IP address. A priori predictions of load generated by a customer can be hard. The above graphs are evidence that the best path forward is to give all the addresses to all the names.

Surely this breaks something on the wider Internet?

Here, too, the answer is no! Well, perhaps more precisely stated as, “no, randomization breaks nothing… but it can expose weaknesses in systems and their designs.”

Any system that might be affected by address randomization appears to have a prerequisite: some meaning is ascribed to the IP address beyond just reachability. Addressing Agility keeps and even restores the semantics of IP addresses and the core Internet architecture, but it will break software systems that make assumptions about their meaning.

Let’s first cover a few examples, why they don’t matter, and then follow with a small change to addressing agility that bypasses weaknesses (by using one single IP address):

  • HTTP Connection Coalescing enables a client to re-use existing connections to request resources from different origins. Clients such as Firefox that permit coalescing when the URI authority matches the connection are unaffected. However, clients that require a URI host to resolve to the same IP address as the given connection will fail.
  • Non-TLS or HTTP-based services may be affected. One example is ssh, which maintains a hostname-to-IP mapping in its known_hosts. This association, while understandable, is outdated and already broken given that many DNS records presently return more than one IP address.
  • Non-SNI TLS certificates require a dedicated IP address. Providers are forced to charge a premium because each address can only support a single certificate without SNI. The bigger issue, independent of IP, is the use of TLS without SNI. We have launched efforts to understand non-SNI to hopefully end this unfortunate legacy.
  • DDoS protections that rely on destination IPs may be hindered, initially. We would argue that addressing agility is beneficial for two reasons. First, IP randomization distributes the attack load across all addresses in use, effectively serving as a layer-3 load-balancer. Second, DoS mitigations often work by changing IP addresses, an ability that is inherent in Addressing Agility.

All for One, and One for All

We started with 20+ million zones bound to addresses across tens of thousands of addresses, and successfully served them from 4096 addresses in a /20 and then 256 addresses in a /24. Surely this trend raises the following question:

If randomization works over n addresses, then why not randomization over 1 address?

Indeed, why not? Recall from above the comment about randomization over IPs as being equivalent to a perfect hash function in a hash table. The thing about well-designed hash-based structures is that they preserve their properties for any size of the structure, even a size of 1. Such a reduction would be a true test of the foundations on which Addressing Agility is constructed.

So, test we did. From a /20 address set, to a /24 and then, from June 2021, to an address set of one /32, and equivalently a /128 (Ao1). It doesn’t just work. It really works. Issues that might be exposed by randomization are resolved by Ao1. For example, non-TLS or non-HTTP services have a reliable IP address (or at least one that is non-random until there is a policy change on the name). Also, HTTP connection coalescing falls out as if for free and, yes, we see increased levels of coalescing where Ao1 is being used.

But why in IPv6 where there are so many addresses?

One argument against binding to a single IPv6 address is that there is no need, because address exhaustion is unlikely. This is a pre-CIDR position that, we claim, is benign at best and irresponsible at worst. As mentioned above, the number of IPv6 addresses makes reasoning about them difficult. In lieu of asking why use a single IPv6 address, we should be asking, “why not?”

Are there upstream implications? Yes, and opportunities!

Ao1 reveals an entirely different set of implications from IP randomization that, arguably, gives us a window into the future of Internet routing and reachability by amplifying the effects that seemingly small actions might have.

Why? The number of possible variable-length names in the universe will always exceed the number of fixed-length addresses. This means that, by the pigeonhole principle, single IP addresses must be shared by multiple names, and different content from unrelated parties.

The possible upstream effects amplified by Ao1 are worth raising and are described below. So far, though, we’ve seen none of these in our evaluations, nor have they come up in communications with upstream networks.

  • Upstream Routing Errors are Immediate and Total. If all traffic arrives on a single address (or prefix), then upstream routing errors affect all content equally. (This is the reason Cloudflare returns two addresses in non-contiguous address ranges.) Note, however, the same is true of threat blocking.
  • Upstream DoS Protections could be triggered. It is conceivable that the concentration of requests and traffic on a single address could be perceived upstream as a DoS attack and trigger upstream protections that may exist.

In both cases, the actions are mitigated by Addressing Agility’s ability to change addresses en masse so quickly. Prevention is also possible, but requires open communication and discourse.

One last upstream effect remains:

  • Port exhaustion in IPv4 NAT might be accelerated, and is solved by IPv6! From the client side, the number of permissible concurrent connections to one address is upper-bounded by the size of a transport protocol’s port field, for example about 65K in TCP.

For example, in TCP on Linux this was an issue until recently (see this commit and IP_BIND_ADDRESS_NO_PORT in the ip(7) man page). In UDP the issue remains. In QUIC, connection identifiers can prevent port exhaustion, but they have to be used. So far, though, we have yet to see any evidence that this is an issue.
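
As a hedged illustration of that Linux mitigation (the socket option name and its fallback value come from ip(7) and <linux/in.h>; the addresses below are placeholders you would replace with real local and remote endpoints): binding without reserving a port lets the kernel defer source-port selection to connect() time, so uniqueness is enforced on the full 4-tuple rather than on the local address and port alone.

import socket

# Fall back to the numeric value from <linux/in.h> if this Python build
# does not expose the constant (24 is the Linux value for IP_BIND_ADDRESS_NO_PORT).
IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
s.bind(("198.51.100.10", 0))   # pin the source address, but defer port selection
s.connect(("192.0.2.1", 443))  # the source port is chosen here, per 4-tuple
print(s.getsockname())         # the ephemeral port the kernel finally assigned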

Even so — and here is the best part — to the best of our knowledge this is the only risk to one-address uses, and is also immediately resolved by migrating to IPv6. (So, ISPs and network administrators, go forth and implement IPv6!)

We’re just getting started!

And so we end as we began. With no limit to the number of names on any single IP address, and the ability to change the address per-query, for any reason, what could you build?

We are, indeed, just getting started! The flexibility and future-proofing enabled by Addressing Agility is enabling us to imagine, design, and build new systems and architectures. We’re planning BGP route leak detection and mitigation for anycast systems, measurement platforms, and more.

Further technical details on all the above, as well as acknowledgements to so many who helped make this possible, can be found in this paper and short talk. Even with these new possibilities, challenges remain. There are many open questions that include, but are in no way limited to the following:

  • What policies can be reasonably expressed or implemented?
  • Is there an abstract syntax or grammar with which to express them?
  • Could we use formal methods and verification to prevent erroneous or conflicting policies?

Addressing Agility is for everyone, and is even necessary for these ideas to succeed more widely. Input and ideas are welcomed at [email protected].

If you are a student enrolled in a PhD or equivalent research program and looking for an internship for 2022 in the USA, Canada, the EU, or the UK, we would also love to hear from you.

If you’re interested in contributing to projects like this or helping Cloudflare develop its traffic and address management systems, our Addressing Engineering team is hiring!

Research Directions in Password Security

Post Syndicated from Ian McQuoid original https://blog.cloudflare.com/research-directions-in-password-security/

As Internet users, we all deal with passwords every day. With so many different services, each with their own login systems, we have to somehow keep track of the credentials we use with each of these services. This situation leads some users to delegate credential storage to password managers like LastPass or a browser-based password manager, but this is far from universal. Instead, many people still rely on old-fashioned human memory, which has its limitations — leading to reused passwords and to security problems. This blog post discusses how Cloudflare Research is exploring how to minimize password exposure and thwart password attacks.

The Problem of Password Reuse

Because it’s too difficult to remember many distinct passwords, people often reuse them across different online services. When breached password datasets are leaked online, attackers can take advantage of these to conduct “credential stuffing attacks”. In a credential stuffing attack, an attacker tests breached credentials against multiple online login systems in an attempt to hijack user accounts. These attacks are highly effective because users tend to reuse the same credentials across different websites, and they have quickly become one of the most prevalent types of online guessing attacks. Automated attacks can be run at a large scale, testing out exposed passwords across multiple systems, under the assumption that some of these passwords will unlock accounts somewhere else (if they have been reused). When a data breach is detected, users of that service will likely receive a security notification and will reset that account password. However, if this password was reused elsewhere, they may easily forget that it needs to be changed for those accounts as well.

How can we protect against credential stuffing attacks? There are a number of methods that have been deployed — with varying degrees of success. Password managers address the problem of remembering a strong, unique password for every account, but many users have yet to adopt them. Multi-factor authentication is another potential solution — that is, using another form of authentication in addition to the username/password pair. This can work well, but has limits: for example, such solutions may rely on specialized hardware that not all clients have. Consumer systems are often reluctant to mandate multi-factor authentication, given concerns that people may find it too complicated to use; companies do not want to deploy something that risks impeding the growth of their user base.

Since there is no perfect solution, security researchers continue to try to find improvements. Two different approaches we will discuss in this blog post are hardening password systems using cryptographically secure keys, and detecting the reuse of compromised credentials, so they don’t leave an account open to guessing attacks.

Improved Authentication with PAKEs

Investigating how to securely authenticate a user just using what they can remember has been an important area in secure communication. To this end, the subarea of cryptography known as Password Authenticated Key Exchange (PAKE) came about. PAKEs deal with protocols for establishing cryptographically secure keys where the only source of authentication is a human memorizable (low-entropy, attacker-guessable) password — that is, the “what you know” side of authentication.

Before diving into the details, we’ll provide a high-level overview of the basic problem. Although passwords are typically protected in transit by being sent over HTTPS, servers handle them in plaintext to verify them once they arrive. Handling plaintext passwords increases security risk — for instance, they might get inadvertently logged and exposed. Ideally, the user’s password never gets sent to the server in the first place. This is where PAKEs come in — a means of verifying that the user and server share a password, ideally without revealing information about the password that could help attackers to discover or crack it.

A few words on PAKEs

PAKE protocols let two parties turn a password into a shared key. Each party only gets one guess at the password the other holds. If a user tries to log in to the wrong server with a PAKE, that server will not be able to turn around and impersonate the user. As such, PAKEs guarantee that communication with one of the parties is the only way for an attacker to test their (single) password guess. This may seem like an unneeded level of complexity when we could use already available tools like a key distribution mechanism along with password-over-TLS, but this puts a lot of trust in the service. You may trust a service with learning your password on that service, but what about if you accidentally use a password for a different service when trying to log in? Note the particular risks of a reused password: it is no longer just a secret shared between a user and a single service, but is now a secret shared between a user and multiple services. This therefore increases the password’s privacy sensitivity — a service should not know users’ account login information for other services.

A comparison of shared secrets between passwords over TLS versus PAKEs. With passwords over TLS, a service might learn passwords used on another service. This problem does not arise with PAKEs.

PAKE protocols are built with the assumption that the server isn’t always working in the best interest of the client and, even more, cannot use any kind of public-key infrastructure during login (although it doesn’t hurt to have both!). This precludes the user from sending their plaintext password (or any information that could be used to derive it —  in a computational sense) to the server during login.

PAKE protocols have expanded into new territory since the seminal EKE paper of Bellovin and Merritt, where the client and server both remembered a plaintext version of the password. As mentioned above, when the server stores the plaintext password, the client risks having the password logged or leaked. To address this, new protocols were developed, referred to as augmented, verifier-based, or asymmetric PAKEs (aPAKEs), where the server stored a modified version (similar to a hash) of the password instead of the plaintext password. This mirrors the way many of us were taught to store passwords in a database, specifically as a hash of the password with accompanying salt and pepper. However, in these cases, attackers can still use traditional methods of attack such as targeted rainbow tables. To avoid these kinds of attacks, a new kind of PAKE was born, the strong asymmetric PAKE (saPAKE).
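
For reference, the storage model described in that last sentence, a hash of the password with a per-user salt and a site-wide pepper, looks roughly like the sketch below. It uses the memory-hard scrypt function mentioned later in this post; it is conventional server-side password storage, not an aPAKE, and the pepper handling is deliberately simplified.

import hashlib
import hmac
import os

# Site-wide secret kept outside the database; a real deployment would load it
# from a secrets manager rather than an environment variable.
PEPPER = os.environ.get("PW_PEPPER", "dev-only-pepper").encode()

def register(password: str) -> dict:
    salt = os.urandom(16)  # per-user salt, stored alongside the hash
    digest = hashlib.scrypt(password.encode() + PEPPER, salt=salt, n=2**14, r=8, p=1)
    return {"salt": salt, "hash": digest}

def verify(password: str, record: dict) -> bool:
    candidate = hashlib.scrypt(password.encode() + PEPPER, salt=record["salt"], n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, record["hash"])

record = register("correct horse battery staple")
print(verify("correct horse battery staple", record))  # True
print(verify("wrong guess", record))                   # False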

OPAQUE was the first saPAKE and it guarantees defense against precomputation by hiding the password dictionary itself! It does this by replacing the noninteractive hash function with an interactive protocol referred to as an Oblivious Pseudorandom Function (OPRF) where one party inputs their “salt”, another inputs their “password”, and only the password-providing party learns the output of the function. The fact that the password-providing party learns nothing (computationally) about the salt prevents offline precomputation by disallowing an attacker from evaluating the function in their head.
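
To give a feel for the "oblivious" part, here is a toy walk-through of the blind, evaluate, and unblind steps behind a Diffie-Hellman-style OPRF. The group is deliberately tiny and the hash-to-group mapping is naive, so this only illustrates why the server sees a blinded value rather than the password and why the client never learns the server's key; real OPRFs, including the one used by OPAQUE, rely on standardized elliptic-curve groups and proper hash-to-group constructions.

import hashlib
import secrets

# Toy prime-order subgroup: p = 2q + 1 with q prime, g generates the order-q subgroup.
p, q, g = 467, 233, 4  # far too small for real use; illustration only

def hash_to_group(password: bytes) -> int:
    # Naive mapping for the demo; real protocols use a proper hash-to-group construction.
    e = int.from_bytes(hashlib.sha256(password).digest(), "big") % q
    return pow(g, e or 1, p)

# Client: blind the hashed password with a random exponent r.
pw = b"hunter2"
X = hash_to_group(pw)
r = secrets.randbelow(q - 1) + 1
blinded = pow(X, r, p)          # this is all the server ever sees

# Server: apply its secret OPRF key k (the "salt" it never reveals).
k = secrets.randbelow(q - 1) + 1
evaluated = pow(blinded, k, p)

# Client: unblind by raising to r^(-1) mod q, recovering X^k without learning k.
r_inv = pow(r, -1, q)
output = pow(evaluated, r_inv, p)
assert output == pow(X, k, p)   # same value as if X had been sent in the clear
print("OPRF output:", output)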

Another way to think about the three PAKE paradigms has to do with how each of them treats the password dictionary:

PAKE type | Password dictionary | Threat model
PAKE | The password dictionary is public and common to every user. | Without any guessing, the attacker learns the user’s password upon compromise of the server.
aPAKE | Each user gets their own password dictionary; a description of the dictionary (e.g., the “salt”) is leaked to the client when they attempt to log in. | The attacker must perform an independent precomputation for each client they want to attack.
saPAKE (e.g., OPAQUE) | Each user gets their own password dictionary; the server only provides an online interface (the OPRF) to the dictionary. | The adversary must wait until after they compromise the server to run an offline attack on the user’s password1.

OPAQUE also goes one step further and allows the user to perform the password transformation on their own device so that the server doesn’t see the plaintext password during registration either. Cloudflare Research has been involved with OPAQUE for a while now — for instance, you can read about our previous implementation work and demo if you want to learn more.

But OPAQUE is not a panacea: in the event of server compromise, the attacker can learn the salt that the server uses to evaluate the OPRF and can still run the same offline attack that was available in the aPAKE world, although this is now considerably more time-consuming and can be made increasingly difficult through the use of memory-hard hash functions like scrypt. This means that despite our best efforts, when a server is breached, the attacker can eventually come out with a list of plaintext passwords. Indeed, this attack is always inevitable as the attacker can always run the (sa)PAKE protocol in their head acting as both parties to test each password. With this being the case, we still need to take steps to defend against automated password attacks such as credential stuffing attacks and have ways of mitigating them.

Are You Overexposed?

To help detect and respond to credential stuffing, Cloudflare recently rolled out the Exposed Credential Checks feature on the Web Application Firewall (WAF), which can alert the origin if a user’s login credentials have appeared in a recent breach. Historically, compromised credential checking services have allowed users to be proactive against credential stuffing attacks when their username and password appear together in a breach. However, they do not account for recently proposed credential tweaking attacks, in which an attacker tries variants of a breached password, under the assumption that users often use slight modifications of the same password for different accounts, such as “sunshineFB”, “sunshineIG”, and so on. Therefore, compromised credential check services should incorporate methods of checking for credential tweaks.

Under the hood, Cloudflare’s Exposed Credential Checks feature relies on an underlying protocol deemed Might I Get Pwned (MIGP). MIGP uses the bucketization method proposed in Li et al. to avoid sending the plaintext username or password to the server while handling a large breach dataset. After receiving a user’s credentials, MIGP hashes the username and sends a portion of that hash as a “bucket identifier” to the server. The client and server can then perform a private membership test protocol to verify whether the user’s username/password pair appeared in that bucket, without ever having to send plaintext credentials to the server.
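
As a rough sketch of the bucketization idea (the hash choice and prefix length here are illustrative, not MIGP's actual parameters): the client hashes the username and reveals only a short prefix of that hash, so the server can return the corresponding bucket without learning exactly which user, or which password, the query concerns.

import hashlib

BUCKET_PREFIX_BITS = 16  # illustrative; real deployments tune this to balance privacy and bucket size

def bucket_id(username: str) -> str:
    digest = hashlib.sha256(username.strip().lower().encode()).hexdigest()
    return digest[: BUCKET_PREFIX_BITS // 4]  # each hex character carries 4 bits

# Only this short identifier leaves the client; the full username and the password do not.
print(bucket_id("[email protected]"))
# Many distinct usernames map to the same bucket, which is what provides the anonymity set.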

Unlike previous compromised credential check services, MIGP also enables credential tweaking checks by augmenting the original breach dataset with a set of password “variants”. For each leaked password, it generates a list of password variants, which are labeled as such to differentiate them from the original leaked password and added to the original dataset. For more information, you can check out the Cloudflare Research blog post detailing our open-source implementation and deployment of the MIGP protocol.  

Measuring Credential Compromises

The question remains, just how important are these exposed credential checks for detecting and preventing credential stuffing attacks in practice? To answer this question, the Research Team has initiated a study investigating login requests to our own Cloudflare dashboard. For this study, we are collecting the data logged by Cloudflare’s Exposed Credential Check feature (described above), designed to be privacy-preserving: this check does not reveal a password, but provides a “yes/no” response on whether the submitted credentials appear in our breach dataset. Along with this signal, we are looking at other fields that may be indicative of malicious behavior such as bot score and IP reputation. As this project develops, we plan to cluster the data to find patterns of different types of credential stuffing attacks that we can generalize to form attack fingerprints. We can then feed these fingerprints into the alert logs for the Cloudflare Detection & Response team to see if they provide useful information for the security analysts.

Additionally, we hope to investigate potential post-compromise behavior as it relates to these compromise check fields. After an attacker successfully hijacks an account, they may take a number of actions such as changing the password, revoking all valid access tokens, or setting up a malicious script. By analyzing compromised credential checks along with these signals, we may be able to better differentiate benign from malicious behavior.

Future directions: OPAQUE and MIGP combined

This post has discussed how we’re approaching the problem of preventing credential stuffing attacks from two different angles. Through the deployment and analysis of compromised credential checks, we aim to prevent account compromise by detecting and stopping credential stuffing attacks before they happen. In addition, in the case that a server does get compromised, the wider use of OPAQUE would help address the problem of leaking passwords to an attacker by avoiding the reception and storage of plaintext passwords on the server as well as preventing precomputation attacks.

However, there are still remaining research challenges to address. Notably, the current method for interfacing with MIGP still requires the server to either pass along a plaintext version of the client’s password, or trust the client to honestly communicate with the MIGP service on behalf of the server. If we want to leverage the security guarantees of OPAQUE (or generally an saPAKE) with the analytics and alert system provided by MIGP in a privacy-preserving way, we need additional mechanisms.

At first glance, the privacy-preserving goals of both protocols seem to be perfect matches for each other. Both OPAQUE and MIGP are built upon the idea of replacing traditional salted password hashes with an OPRF as a way of keeping the client’s plaintext passwords from ever leaving their device. However, the interfaces for both protocols rely on user-provided inputs which aren’t cryptographically tied to each other. This allows an attacker to provide a false password to MIGP while providing their actual password to the OPAQUE server. Further, the security analysis of each protocol assumes that its idealized building blocks are kept separate in an important way. This isn’t to say that the two protocols are incompatible; indeed, much of both protocols may be salvaged.

The next stage for password privacy will be an integration of these two protocols, such that a server can be made aware of credential stuffing attacks and of patterns of compromised account usage, protecting it against the compromise of other servers while providing the same privacy guarantees that OPAQUE does. Our goal is to allow you to protect yourself from other compromised servers while protecting your clients from a compromise of your server. Stay tuned for updates!

We’re always keen to collaborate with others to build more secure systems, and would love to hear from those interested in password research. You can reach us with questions, comments, and research ideas at [email protected]. For those interested in joining our team, please visit our Careers Page.


1There are other ways of constructing saPAKE protocols. The curious reader can see this CRYPTO 2019 paper for details.

Should we teach AI and ML differently to other areas of computer science? A challenge

Post Syndicated from Sue Sentance original https://www.raspberrypi.org/blog/research-seminar-data-centric-ai-ml-teaching-in-school/

Between September 2021 and March 2022, we’re partnering with The Alan Turing Institute to host a series of free research seminars about how to teach AI and data science to young people.

In the second seminar of the series, we were excited to hear from Professor Carsten Schulte, Yannik Fleischer, and Lukas Höper from the University of Paderborn, Germany, who presented on the topic of teaching AI and machine learning (ML) from a data-centric perspective. Their talk raised the question of whether and how AI and ML should be taught differently from other themes in the computer science curriculum at school.

Machine behaviour — a new field of study?

The rationale behind the speakers’ work is a concept they call hybrid interaction system, referring to the way that humans and machines interact. To explain this concept, Carsten referred to a 2019 article published in Nature by Iyad Rahwan and colleagues: Machine behaviour. The article’s authors propose that the study of AI agents (complex and simple algorithms that make decisions) should be a separate, cross-disciplinary field of study, because of the ubiquity and complexity of AI systems, and because these systems can have both beneficial and detrimental impacts on humanity, which can be difficult to evaluate. (Our previous seminar by Mhairi Aitken highlighted some of these impacts.) The authors state that to study this field, we need to draw on scientific practices from across different fields, as shown below:

Machine behaviour as a field sits at the intersection of AI engineering and behavioural science. Quantitative evidence from machine behaviour studies feeds into the study of the impact of technology, which in turn feeds questions and practices into engineering and behavioural science.
The interdisciplinarity of machine behaviour. (Image taken from Rahwan et al [1])

In establishing their argument, the authors compare the study of animal behaviour and machine behaviour, citing that both fields consider aspects such as mechanism, development, evolution and function. They describe how part of this proposed machine behaviour field may focus on studying individual machines’ behaviour, while collective machines and what they call ‘hybrid human-machine behaviour’ can also be studied. By focusing on the complexities of the interactions between machines and humans, we can think both about machines shaping human behaviour and humans shaping machine behaviour, and a sort of ‘co-behaviour’ as they work together. Thus, the authors conclude that machine behaviour is an interdisciplinary area that we should study in a different way to computer science.

Carsten and his team said that, as educators, we will need to draw on the parameters and frameworks of this machine behaviour field to be able to effectively teach AI and machine learning in school. They argue that our approach should be centred on data, rather than on code. I believe this is a challenge to those of us developing tools and resources to support young people, and that we should be open to these ideas as we forge ahead in our work in this area.

Ideas or artefacts?

In the interpretation of computational thinking popularised in 2006 by Jeannette Wing, computational thinking is introduced as being about ‘ideas, not artefacts’. When we, the computing education community, started to think about computational thinking, we moved from focusing on specific technology — and how to understand and use it — to the ideas or principles underlying the domain. The challenge now is: have we gone too far in that direction?

Carsten argued that, if we are to understand machine behaviour, and in particular, human-machine co-behaviour, which he refers to as the hybrid interaction system, then we need to be studying artefacts as well as ideas.

Throughout the seminar, the speakers reminded us to keep in mind artefacts, issues of bias, the role of data, and potential implications for the way we teach.

Studying machine learning: a different focus

In addition, Carsten highlighted a number of differences between learning ML and learning other areas of computer science, including traditional programming:

  1. The process of problem-solving is different. Traditionally, we might try to understand the problem, derive a solution in terms of an algorithm, then understand the solution. In ML, the data shapes the model, and we do not need a deep understanding of either the problem or the solution.
  2. Our tolerance of inaccuracy is different. Traditionally, we teach young people to design programs that lead to an accurate solution. However, the nature of ML means that there will be an error rate, which we strive to minimise. 
  3. The role of code is different. Rather than the code doing the work as in traditional programming, the code is only a small part of a real-world ML system. 

These differences imply that our teaching should adapt too.

A graphic demonstrating that in machine learning as compared to other areas of computer science, the process of problem-solving, tolerance of inaccuracy, and role of code is different.

ProDaBi: a programme for teaching AI, data science, and ML in secondary school

In Germany, education is devolved to state governments. Although computer science (known as informatics) was only last year introduced as a mandatory subject in lower secondary schools in North Rhine-Westphalia, where Paderborn is located, it has been taught at the upper secondary levels for many years. ProDaBi is a project that researchers have been running at Paderborn University since 2017, with the aim of developing a secondary school curriculum around data science, AI, and ML.

The ProDaBi curriculum includes:

  • Two modules for 11- to 12-year-olds covering decision trees and data awareness (ethical aspects), introduced this year
  • A short course for 13-year-olds covering aspects of artificial intelligence, through the game Hexapawn
  • A set of modules for 14- to 15-year-olds, covering data science, data exploration, decision trees, neural networks, and data awareness (ethical aspects), using Jupyter notebooks
  • A project-based course for 18-year-olds, including the above topics at a more advanced level, using Codap and Jupyter notebooks to develop practical skills through projects; this course has been running the longest and is currently in its fourth iteration

Although the ProDaBi project site is in German, an English translation is available.

Modules developed as part of the ProDaBi project.

Our speakers described example activities from three of the modules:

  • Hexapawn, a two-player game inspired by the work of Donald Michie in 1961. The purpose of this activity is to support learners in reflecting on the way the machine learns. Children can then relate the activity to the behavior of AI agents such as autonomous cars. An English version of the activity is available. 
  • Data cards, a series of activities to teach about decision trees. The cards are designed in a ‘Top Trumps’ style, and based on food items, with unplugged and digital elements. 
  • Data awareness, a module focusing on the amount of data an individual can generate as they move through a city, in this case through the mobile phone network. Children are encouraged to reflect on personal data in the context of the interaction between the human and data-driven artefact, and how their view of the world influences their interpretation of the data that they are given.

Questioning how we should teach AI and ML at school

There was a lot to digest in this seminar: challenging ideas and some new concepts, for me anyway. An important takeaway for me was how much we do not yet know about the concepts and skills we should be teaching in school around AI and ML, and about the approaches that we should be using to teach them effectively. Research such as that being carried out in Paderborn, demonstrating a data-centric approach, can really augment our understanding, and I’m looking forward to following the work of Carsten and his team.

Carsten and colleagues ended with this summary and discussion point for the audience:

“‘AI education’ requires developing an adequate picture of the hybrid interaction system — a kind of data-driven, emergent ecosystem which needs to be made explicitly to understand the transformative role as well as the technological basics of these artificial intelligence tools and how they are related to data science.”

You can catch up on the seminar, including the Q&A with Carsten and his colleagues, here:

Join our next seminar

This seminar really extended our thinking about AI education, and we look forward to introducing new perspectives from different researchers each month. At our next seminar on Tuesday 2 November at 17:00–18:30 BST / 12:00–13:30 EDT / 9:00–10:30 PDT / 18:00–19:30 CEST, we will welcome Professor Matti Tedre and Henriikka Vartiainen (University of Eastern Finland). The two Finnish researchers will talk about emerging trajectories in ML education for K-12. We look forward to meeting you there.

Carsten and his colleagues are also running a series of seminars on AI and data science: you can find out about these on their registration page.

You can increase your own understanding of machine learning by joining our latest free online course!


[1] Rahwan, I., Cebrian, M., Obradovich, N., Bongard, J., Bonnefon, J. F., Breazeal, C., … & Wellman, M. (2019). Machine behaviour. Nature, 568(7753), 477-486.

The post Should we teach AI and ML differently to other areas of computer science? A challenge appeared first on Raspberry Pi.

Our investigation prompted a reaction from the Minister of Justice over the missing citizenship of the head of the Sofia City Court: Yanaki Stoilov against the rulers of the backstage

Post Syndicated from Николай Марченко original https://bivol.bg/%D1%8F%D0%BD%D0%B0%D0%BA%D0%B8-%D1%81%D1%82%D0%BE%D0%B8%D0%BB%D0%BE%D0%B2-%D1%81%D1%80%D0%B5%D1%89%D1%83-%D0%B2%D0%BB%D0%B0%D0%B4%D0%B5%D1%82%D0%B5%D0%BB%D0%B8%D1%82%D0%B5-%D0%BD%D0%B0-%D0%B7%D0%B0.html

Thursday, 14 October 2021


Justice Minister Prof. Yanaki Stoilov has forwarded the information about the citizenship of judge Alexey Trifonov to the competent Judges’ Chamber of the Supreme Judicial Council (SJC). The same chamber which, without checking whether he…

Fact check: that "forensics" of the Mesa image is crazy

Post Syndicated from original https://blog.erratasec.com/2021/10/fact-check-that-forensics-of-mesa-image.html

Tina Peters, the elections clerk from Mesa County (Colorado) went rogue, creating a “disk-image” of the election server, and posting that image to the public Internet. Conspiracy theorists have been analyzing the disk-image trying to find anomalies supporting their conspiracy-theories. A recent example is this “forensics” report. In this blogpost, I debunk that report.

I suppose calling somebody a “conspiracy theorist” is insulting, but there’s three objective ways we can identify them as such.

The first is when they use the logic “everything we can’t explain is proof of the conspiracy“. In other words, since there’s no other rational explanation, the only remaining explanation is the conspiracy-theory. But there can be other possible explanations — just ones unknown to the person because they aren’t smart enough to understand them. We see that here: the person writing this report doesn’t understand some basic concepts, like “airgapped” networks.

This leads to the second way to recognize a conspiracy-theory, when it demands this one thing that’ll clear things up. Here, it’s demanding that a manual audit/recount of Mesa County be performed. But it won’t satisfy them. The Maricopa audit in neighboring Arizona, whose recount found no fraud, didn’t clear anything up — it just found more anomalies demanding more explanation. It’s like Obama’s birth certificate. The reason he ignored demands to show it was that first, there was no serious question (even if born in Kenya, he’d still be a natural born citizen — just like how Cruz was born in Canada and McCain in Panama), and second, showing the birth certificate wouldn’t change anything at all, as they’d just claim it was fake. There is no possibility of showing a birth certificate that can be proven isn’t fake.

The third way to objectively identify a conspiracy theory is when they repeat objectively crazy things. In this case, they keep demanding that the 2020 election be “decertified”. That’s not a thing. There is no regulation or law where that can happen. The most you can hope for is to use this information to prosecute the fraudster, prosecute the elections clerk who didn’t follow procedure, or convince legislators to change the rules for the next election. But there’s just no way to change the results of the last election even if wide spread fraud is now proven.

The document makes 6 individual claims. Let’s debunk them one-by-one.


#1 Data Integrity Violation

The report tracks some logs on how some votes were counted. It concludes:

If the reasons behind these findings cannot be adequately explained, then the county’s election results are indeterminate and must be decertified.

This neatly demonstrates two conditions I cited above. The analyst can’t explain the anomaly not because something bad happened, but because they don’t understand how Dominion’s voting software works. This demand for an explanation is a common attribute of conspiracy theories — the ignorant keep finding things they don’t understand and demand somebody else explain them.

Secondly, there’s the claim that the election results must be “decertified”. It’s something that Trump and his supporters believe is a thing, that somehow the courts will overturn the past election and reinstate Trump. This isn’t a rational claim. It’s not how the courts or the law works or the Constitution works.


#2 Intentional purging of Log Files

This is the issue that convinced Tina Peters to go rogue, that the normal Dominion software update gets rid of all the old system-log files. She leaked two disk-images, before and after the update, to show the disappearance of system-logs. She believes this violates the law demanding the “election records” be preserved. She claims because of this, the election can’t be audited.

Again, we are in crazy territory where they claim things that aren’t true. System-logs aren’t considered election records by any law or regulation. Moreover, they can’t be used to “audit” an election.

Currently, no state/county anywhere treats system-logs as election records (since they can’t be used for “audits”). Maybe this should be different. Maybe you can create a lawsuit where a judge rules that in future elections they must be treated as election records. Maybe you can convince legislatures to pass laws saying system-logs must be preserved. It’s not crazy to say this should be different in the future, it’s just crazy to say that past system-logs were covered under the rules.

And if you did change the rules, the way to preserve them wouldn’t be to let them sit on the C: boot-drive until they eventually rot and disappear (which will eventually happen no matter what). Instead, the process to preserve them would be to copy them elsewhere. The way Dominion works is that all election records that need to be preserved are copied over to the D: data drive.

Which means, by the way, that this entire forensics report is bogus. The Mesa disk image was only of the C: boot-drive, not of the D: data drive. Thus, it’s unable to say which records/logs were preserved or not. Everyone knows that system-logs probably weren’t, because they aren’t auditable election records, so you can still make the claim “system-logs weren’t preserved”. It’s just that you couldn’t make that claim based on a forensics of the C: boot-drive. Again, we are in the territory of crazy statements that mark something as a conspiracy-theory: weird claims about how reality works.

System-logs cannot be used to audit the vote. That’s confusing the word “audit” with “forensics”. The word “audit” implies you are looking for a definitive result, like whether the vote count was correct, or whether all procedures were followed. Forensics of system-logs can’t tell you that. Instead, they can only lead to indeterminate results.

That’s what you see here. This “forensics” report cannot make any definitive statement based upon the logs. It can find plenty of anomalies, meaning things the forensics investigator can’t understand. But none of that is positive proof of anything. If a hacker had flipped votes on this system, it’s unlikely we would have seen evidence in the log.

#3 Evidence of network connection

The report claims the computer was connected to a network. Of course this is true — it’s not a problem. The network was the one shown in the diagram below:

Specifically, this Mesa image was of the machine labeled “EMS Server” in the above diagram. From my forensics of the network logs, I can see that there are other computers on this network:

  1. Four ICC workstations (named ICC01 through ICC04)
  2. Two Adjudication Workstations (named ADJCLIENT01 and ADJCLINET03, I don’t know what happened to number 2).
  3. Two EMS Workstations (named EMSCLIENT01 and EMSCLIENT02).
  4. A printer, model Dell E310dw.
The word “airgapped” doesn’t mean the EMS Server is airgapped from any network, but that this entire little network is airgapped from anything else. The security of this network is physical security, the fact that nobody can enter the room who isn’t authorized.

I did my own forensics on the Mesa image and could find none of the normal signs that the server accessed the Internet, and pretty good evidence that most of the time, it was unconnected (it gets mad when it can’t find the Internet and produces logs stating this). This doesn’t mean I proved conclusively no Internet connection was ever made. It’s possible that somebody will find some new thing in that image that shows an Internet connection. It’s just that currently, there’s no reason to believe the “airgap” guarantee of security was violated.

The claimed evidence about the “Microsoft Report Server” is wrong.

#4 Lack of Software Updates

This is just stupid. The cybersecurity community does have this weird fetish demanding that every software update be applied immediately, but there are good reasons why they aren’t, and ways of mitigating the security risk when they can’t be applied.

Software updates sometimes break things. In sensitive environments where computers must be absolutely predictable, they aren’t applied. This includes live financial systems, medical equipment, and industrial control systems.

This also includes elections. It’s simply not acceptable to cancel or delay an election because a software update broke the computer.

This is why Dominion does what they call a “Trusted Build” process that wipes out the boot-drive (deleting system-logs). To update software, they build an entire new boot image with all the software in a tested, known state. They then apply that boot disk image to all the county machines, which replaces everything on the C: boot-drive with a new version of Windows and all the software. This leaves the D: data drive untouched, where the records are preserved.

If you didn’t do things this way, then sometimes elections would fail.

This is also why having an “airgapped” network is important. The voting machines aren’t going to have software updates regularly applied, so they need to be protected. Firewalls would also be another mitigation strategy.

#5 Existence of SQL Server Management Studio

This is just a normal part of having an SQL server installed.

Yes, in theory it would make it easy for somebody to change records in the database. But at the same time, such a thing is pretty easy even without SSMS installed. One way is command-line scripts.

#6 Referential Integrity

This “referential integrity” is a reliability concern, not an anti-hacking measure. It just means hackers would need only an extra step if they wanted to delete or change records.

Conclusion

Evidence is something that the expert understands. It’s something they can show, explain, and defend against challengers.

This report contained none of that. It contained instead anomalies the writer couldn’t explain.

Note that this doesn’t mean they weren’t an expert. Obviously, they needed enough expertise to get as far as they did. It’s just a consequence of conspiracy-theories: when searching for proof of your conspiracy-theory where there is none, you end up going off into the weeds, past your area of expertise.

Give that forensics image to any expert, and they’ll find anomalies they can’t explain. That includes me; I’ve posted some of them to Twitter and had other experts explain them to me. The difference is that I attributed the lack of an explanation to my own ignorance, not a conspiracy.

At some point, we have to call out conspiracy-theories for what they are. This isn’t defending the integrity of elections. If it were, it’d be proposing solutions for future elections. Instead, it’s an attack on the integrity of elections, fighting the peaceful transfer of power with unfounded conspiracy-theory claims.
And we can say this objectively. As I stated above, there are three objective tests. These are:
  • Anomalies that can’t be explained are claimed to be evidence — when in fact they come from simple ignorance.
  • Demands that something needs explaining, when it really doesn’t, and which won’t satisfy them anyway.
  • Statements of a world view (like that the election can be “decertified” or that system-logs are “election records”) that nobody agrees with.

Prometheus Conformance Program: First round of results

Post Syndicated from Richard "RichiH" Hartmann original https://prometheus.io/blog/2021/10/14/prometheus-conformance-results/


Today, we’re launching the Prometheus Conformance Program with the goal of ensuring interoperability between different projects and vendors in the Prometheus monitoring space. While the legal paperwork still needs to be finalized, we ran tests, and we consider the below our first round of results. As part of this launch Julius Volz updated his PromQL test results.

As a quick reminder: The program is called Prometheus Conformance, software can be compliant with specific tests, which result in a compatibility rating. The nomenclature might seem complex, but it allows us to speak about this topic without using endless word snakes.

Preamble

New Categories

We found that it’s quite hard to reason about what needs to be applied to what software. To help sort our thoughts, we created an overview, introducing four new categories we can put software into:

  • Metrics Exposers
  • Agents/Collectors
  • Prometheus Storage Backends
  • Full Prometheus Compatibility

Call for Action

Feedback is very much welcome. Maybe counter-intuitively, we want the community, not just Prometheus-team, to shape this effort. To help with that, we will launch a WG Conformance within Prometheus. As with WG Docs and WG Storage, those will be public and we actively invite participation.

As we alluded to recently, the maintainer/adoption ratio of Prometheus is surprisingly, or shockingly, low. In different words, we hope that the economic incentives around Prometheus Compatibility will entice vendors to assign resources in building out the tests with us. If you always wanted to contribute to Prometheus during work time, this might be the way; and a way that will have you touch a lot of highly relevant aspects of Prometheus. There’s a variety of ways to get in touch with us.

Register for being tested

You can use the same communication channels to get in touch with us if you want to register for being tested. Once the paperwork is in place, we will hand contact information and contract operations over to CNCF.

Test results

Full Prometheus Compatibility

We know what tests we want to build out, but we are not there yet. As announced previously, it would be unfair to hold this against projects or vendors. As such, test shims are defined as being passed. Due to the currently semi-manual nature of the PromQL tests Julius ran this week, Julius tested sending data through Prometheus Remote Write in most cases as part of PromQL testing. We’re reusing his results in more than one way here. This will be untangled soon, and more tests from different angles will keep ratcheting up the requirements and thus End User confidence.

It makes sense to look at projects and aaS offerings in two sets.

Projects

Passing

  • Cortex 1.10.0
  • M3 1.3.0
  • Promscale 0.6.2
  • Thanos 0.23.1

Not passing

VictoriaMetrics 1.67.0 is not passing and does not intend to pass. In the spirit of End User confidence, we decided to track their results while they position themselves as a drop-in replacement for Prometheus.

aaS

Passing

  • Chronosphere
  • Grafana Cloud

Not passing

  • Amazon Managed Service for Prometheus
  • Google Cloud Managed Service for Prometheus
  • New Relic
  • Sysdig Monitor

NB: As Amazon Managed Service for Prometheus is based on Cortex just like Grafana Cloud, we expect them to pass after the next update cycle.

Agent/Collector

Passing

  • Grafana Agent 0.19.0
  • OpenTelemetry Collector 0.37.0
  • Prometheus 2.30.3

Not passing

  • Telegraf 1.20.2
  • Timber Vector 0.16.1
  • VictoriaMetrics Agent 1.67.0

NB: We tested Vector 0.16.1 instead of 0.17.0 because there are no binary downloads for 0.17.0 and our test toolchain currently expects binaries.

[$] Scrutinizing bugs found by syzbot

Post Syndicated from original https://lwn.net/Articles/872649/rss

The syzbot kernel-fuzzing system finds an enormous number of bugs, but, since many of them may seem to be of a relatively low severity, they have a lower priority when contending for the attention of developers. A talk at the recent Linux Security Summit North America reported on some research that dug further into the bugs that syzbot has found; the results are rather worrisome. Rather than a pile of difficult- or impossible-to-exploit bugs, there are numerous, more serious problems lurking within.

Optimize performance and reduce costs for network analytics with VPC Flow Logs in Apache Parquet format

Post Syndicated from Radhika Ravirala original https://aws.amazon.com/blogs/big-data/optimize-performance-and-reduce-costs-for-network-analytics-with-vpc-flow-logs-in-apache-parquet-format/

VPC Flow Logs help you understand network traffic patterns, identify security issues, audit usage, and diagnose network connectivity on AWS. Customers often route their VPC flow logs directly to Amazon Simple Storage Service (Amazon S3) for long-term retention. You can then use a custom format conversion application to convert these text files into an Apache Parquet format to optimize the analytical processing of the log data and reduce the cost of log storage. This custom format conversion step added complexity, time to insight, and costs to the VPC flow log traffic analytics. Until today, VPC flow logs were delivered to Amazon S3 as raw text files in GZIP format.

Today, we’re excited to announce a new feature that delivers VPC flow logs in the Apache Parquet format, making it easier, faster, and more cost-efficient to analyze your VPC flow logs stored in Amazon S3. You can also deliver VPC flow logs to Amazon S3 with Hive-compatible S3 prefixes partitioned by the hour.

Apache Parquet is an open-source file format that stores data efficiently in columnar format, provides different encoding types, and supports predicate filtering. With good compression ratios and efficient encoding, VPC flow logs stored in Parquet reduce your Amazon S3 storage costs. When querying flow logs persisted in Parquet format with analytic frameworks, non-relevant data is skipped, requiring fewer reads on Amazon S3 and thereby improving query performance. To reduce query running time and cost with Amazon Athena and Amazon Redshift Spectrum, Apache Parquet is often the recommended file format.
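
As a small, hedged illustration of why the columnar layout helps (assuming pyarrow is installed and a delivered Parquet flow log object has been copied locally; the file name is a placeholder and the column names follow the default flow log fields): a Parquet reader can materialize only the columns a query touches and skip row groups that cannot satisfy a filter, instead of decompressing entire GZIP text files.

import pyarrow.parquet as pq

# Read only two columns, and skip row groups that cannot match the filter.
table = pq.read_table(
    "123456789012_vpcflowlogs_us-east-1_sample.log.parquet",  # placeholder local file
    columns=["srcaddr", "bytes"],
    filters=[("action", "=", "REJECT")],
)
print(table.num_rows, "rejected flows")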

In this post, we explore this new feature and how it can help you run performant queries on your flow logs with Athena.

Create flow logs with Parquet file format

To take advantage of this feature, simply create a new VPC flow log subscription with Amazon S3 as the destination using the AWS Management Console, AWS Command Line Interface (AWS CLI), or API. On the console, when creating new a VPC flow log subscription with Amazon S3, you can select one or more of the following options:

  • Log file format
  • Hive-compatible S3 prefixes
  • Partition logs by time

We now explore how each of these options can make processing and storage of flow logs more efficient.

Apache Parquet formatted files

By default, your logs are delivered in text format. To change to Parquet, for Log file format, select Parquet. This delivers your VPC flow logs to Amazon S3 in the Apache Parquet format.

Note the following considerations:

  • You can’t change existing flow logs to deliver logs in Parquet format. You need to create a new VPC flow log subscription with Parquet as the log file format.
  • Consider using a higher maximum aggregation interval (10 minutes) when aggregating flow packets to ensure larger Parquet files on Amazon S3.
  • Refer to Amazon CloudWatch pricing for pricing of log delivery in Apache Parquet format for VPC flow logs.

Hive-compatible partitions

Partitioning is a technique to organize your data to improve the efficiency of your query engine. Partitions aligned with the columns that are frequently used in the query filters can significantly lower your query response time. You can now specify that your flow logs be organized in Hive-compatible format. This allows you to run the MSCK REPAIR command in Athena to quickly and easily add new partitions as they get delivered into Amazon S3. Simply select Enable for Hive-compatible S3 prefix to set this up. This delivers the flow logs to Amazon S3 in the following path:

s3://my-flow-log-bucket/my-custom-flow-logs/AWSLogs/aws-account-id=123456789012/aws-service=vpcflowlogs/aws-region=us-east-1/year=2021/month=10/day=07/123456789012_vpcflowlogs_us-east-1_fl-06a0eeb1087d806aa_20211007T1930Z_d5ab7c14.log.parquet

Per-hour partitions

You can also organize your flow logs at a much more granular level by adding per-hour partitions. You should enable this feature if you constantly need to query large volumes of logs with a specific time frame as the predicate. Querying logs only during certain hours results in less data scanned, which translates to lower cost per query with engines such as Athena and Redshift Spectrum.

You can also set per-hour partitions via an API or the AWS CLI using the --destination-options parameter in create-flow-logs:

aws ec2 create-flow-logs \
--resource-type VPC \
--resource-ids vpc-001122333 \
--traffic-type ALL \
--log-destination-type s3 \
--log-destination arn:aws:s3:::my-flow-log-bucket/my-custom-flow-logs/ \
--destination-options FileFormat=parquet,HiveCompatiblePartitions=True,PerHourPartition=True

The following is a sample flow log file deposited into an hourly bucket. By default, the flow logs in Parquet are compressed using Gzip format, which has the highest compression ratio compared to other compression formats.

s3://my-flow-log-bucket/my-custom-flow-logs/AWSLogs/aws-account-id=123456789012/aws-service=vpcflowlogs/aws-region=us-east-1/year=2021/month=10/day=07/hour=19/123456789012_vpcflowlogs_us-east-1_fl-06a0eeb1087d806aa_20211007T1930Z_d5ab7c14.log.parquet
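
If you provision flow logs with an SDK instead of the CLI, the equivalent call looks roughly like the boto3 sketch below. The VPC ID, bucket ARN, and prefix are the placeholders from the CLI example above, and this assumes a boto3 version recent enough to expose the DestinationOptions parameter.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_flow_logs(
    ResourceType="VPC",
    ResourceIds=["vpc-001122333"],  # placeholder VPC ID from the CLI example
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::my-flow-log-bucket/my-custom-flow-logs/",
    DestinationOptions={
        "FileFormat": "parquet",
        "HiveCompatiblePartitions": True,
        "PerHourPartition": True,
    },
)
print(response["FlowLogIds"])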

Query with Athena

You can use the Athena integration for VPC Flow Logs from the Amazon VPC console to automate the Athena setup and query VPC flow logs in Amazon S3. This integration has now been extended to support these new flow log delivery options to Amazon S3.

To demonstrate querying flow logs in Parquet and in plain text in this blog, let’s start from the Amazon Athena console. We begin by creating an external table pointing to flow logs in Parquet.

Note that this feature supports specifying flow logs fields in Parquet’s native data types. This eliminates the need for you to cast your fields when querying the traffic logs.

Then run MSCK REPAIR TABLE.

Let’s run a sample query on these Parquet-based flow logs.
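
The console screenshots are not reproduced here, but the same steps can be scripted. The sketch below uses boto3 and assumes a table named vpc_flow_logs_parquet has already been created by the Athena integration, that the partition columns match the Hive-compatible prefixes shown earlier, and that you have an S3 location for query results; adjust the names to whatever the integration generated for you.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

def run(sql: str) -> str:
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(1)  # simple polling, good enough for a sketch

# Load newly delivered Hive-compatible partitions, then query one hour of traffic.
run("MSCK REPAIR TABLE vpc_flow_logs_parquet")
print(run("""
    SELECT srcaddr, dstaddr, SUM(bytes) AS total_bytes
    FROM vpc_flow_logs_parquet
    WHERE day = '07' AND hour = '19'
    GROUP BY srcaddr, dstaddr
    ORDER BY total_bytes DESC
    LIMIT 10
"""))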

Now, let’s create a table for flow logs delivered in plain text.

We add the partitions using the ALTER TABLE statement in Athena.

Run a simple flow logs query and note the time it took to run the query.

The Athena query run time with flow logs in Parquet (1.16 seconds) is much faster than the run time with flow logs in plain text (2.51 seconds).

For benchmarks that further describe the cost savings and performance improvements from persisting data in Parquet in granular partitions, see Top 10 Performance Tuning Tips for Amazon Athena.

Summary

You can now deliver your VPC flow logs to Amazon S3 with three new options:

  • In Apache Parquet formatted files
  • With Hive-compatible S3 prefixes
  • In hourly partitioned files

These delivery options make it faster, easier, and more cost-efficient to store and run analytics on your VPC flow logs. To learn more, visit VPC Flow Logs documentation. We hope you will give this feature a try and share your experience with us. Please send feedback to the AWS forum for Amazon VPC or through your usual AWS support contacts.


About the Authors

Radhika Ravirala is a Principal Streaming Architect at Amazon Web Services, where she helps customers craft distributed streaming applications using Amazon Kinesis and Amazon MSK. In her free time, she enjoys long walks with her dog, playing board games, and reading widely.

Vaibhav Katkade is a Senior Product Manager in the Amazon VPC team. He is interested in areas of network security and cloud networking operations. Outside of work, he enjoys cooking and the outdoors.

Introducing the new AWS Well-Architected Machine Learning Lens

Post Syndicated from Haleh Najafzadeh original https://aws.amazon.com/blogs/architecture/introducing-the-new-aws-well-architected-machine-learning-lens/

The AWS Well-Architected Framework provides you with a formal approach to compare your workloads against best practices. It also includes guidance on how to make improvements.

Machine learning (ML) algorithms discover and learn patterns in data, and construct mathematical models to predict future data. These solutions can revolutionize lives through better diagnoses of diseases, environmental protections, products and services transformation, and more.

Your ML models depend on the quality of input data to generate accurate results. As data changes over time, monitoring is required to continuously detect, correct, and mitigate issues. This improves accuracy and performance. It also may require you to retrain your model with the latest refined data.

Application workloads rely on step-by-step instructions to solve a problem. ML workloads enable algorithms to learn from data through an iterative and continuous cycle. We are announcing a brand-new version of the AWS Well-Architected Machine Learning Lens whitepaper. It complements and builds upon the Well-Architected Framework to address this difference between these two types of workloads.

The whitepaper provides you with a set of established cloud and technology agnostic best practices. You can apply this guidance and architectural principles when designing your ML workloads, or after your workloads have entered production as part of continuous improvement. The paper includes guidance and resources to help you implement these best practices on AWS.

The Well-Architected Machine Learning Lens components

The Lens includes four focus areas:

1. The Well-Architected Machine Learning Design Principles — A set of considerations that are used as the basis for a Well-Architected ML workload. These design principles are the guiding light for the collection of the best practices in the ML Lens.

2. The Well-Architected Machine Learning Lifecycle — This integrates the Well-Architected Framework into the Machine Learning Lifecycle, as shown in Figure 1.

    • The Well-Architected Framework pillars include:
      1. Operational Excellence
      2. Security
      3. Reliability
      4. Performance Efficiency
      5. Cost Optimization
    • The Machine Learning Lifecycle phases referenced in the ML Lens include:
      1. Business goal identification
      2. ML problem framing
      3. Data processing (data collection, data pre-processing, feature engineering)
      4. Model development (training, tuning, evaluation)
      5. Model deployment (prediction, inference)
      6. Model monitoring

Figure 1. Well-Architected Machine Learning Lifecycle

In the Well-Architected ML Lens whitepaper, the Well-Architected Machine Learning Lifecycle applies the Well-Architected Framework pillars to each of the lifecycle phases.

3. Cloud and technology agnostic best practices — These are best practices for each ML lifecycle phase across the Well-Architected Framework pillars. Best practices are accompanied by:

    • Implementation guidance that provides AWS implementation plans for each best practice with references to AWS technologies and resources.
    • Resources that provide a set of links to AWS documents, blogs, videos, and code examples supporting the best practices and their implementation plans.

4. ML Lifecycle architecture diagrams — These illustrate the processes, technologies, and components that support many of the best practices, as shown in Figure 2. They include feature stores, a model registry, a lineage tracker, an alarm manager, a scheduler, and more. Different pipeline technologies are illustrated using these architecture diagrams.

Figure 2. Machine Learning Lifecycle phases with expanded components

Where should you apply the Well-Architected Machine Learning Lens?

Use the Well-Architected ML Lens to:

  • Make informed decisions — Plan early and make informed decisions by reviewing best practices before a new workload design begins.
  • Build and deploy faster — Use the best practices to guide you through building new Well-Architected workloads across the ML lifecycle.
  • Lower or mitigate risks — Evaluate existing workloads regularly to identify, mitigate, and address potential issues early.
  • Learn AWS best practices — Use the provided implementation plans as guidance on implementing the best practices on AWS.

Conclusion

The new Well-Architected Machine Learning Lens whitepaper is available now. Use the Lens to help ensure that your ML workloads are architected with operational excellence, security, reliability, performance efficiency, and cost optimization in mind.

Special thanks to everyone across the AWS Solution Architecture and Machine Learning communities who contributed to the new AWS Well-Architected Machine Learning Lens. Their contributions brought diverse perspectives, expertise, and experience to its development.

[Security Nation] Michael Daniel on the Cyber Threat Alliance

Post Syndicated from Rapid7 original https://blog.rapid7.com/2021/10/13/security-nation-michael-daniel-on-the-cyber-threat-alliance/

In this episode of Security Nation, Jen and Tod chat with Michael Daniel, president and CEO of the Cyber Threat Alliance (CTA), as well as a co-chair on the IST’s Ransomware Task Force. After discussing Michael’s career in cybersecurity with the US government, they talk about what makes information sharing so hard in the security space and how the CTA has addressed this challenge in its efforts to promote better threat intelligence.

Stick around for the Rapid Rundown – with Tod on holiday (AKA vacation), Jen brings on Rapid7’s public policy guru Harley Geiger. They chat about the Cyber Incident Reporting Act, which is likely headed to a Senate floor vote and, if passed, would bring major changes to the reporting requirements around cybersecurity events for owners and operators of critical infrastructure.

Michael Daniel

Michael Daniel serves as the President and CEO of the Cyber Threat Alliance (CTA), a not-for-profit that enables high-quality cyber threat information sharing among cybersecurity organizations. Prior to CTA, Michael served for four years as US Cybersecurity Coordinator, leading US cybersecurity policy development, facilitating US government partnerships with the private sector and other nations, and coordinating significant incident response activities. From 1995 to 2012, Michael worked for the Office of Management and Budget, overseeing funding for the US Intelligence Community. Michael also works with the Aspen Cybersecurity Group, the World Economic Forum’s Partnership Against Cybercrime, and other organizations improving cybersecurity in the digital ecosystem. In his spare time, he enjoys running and martial arts.


Offloading SQL for Amazon RDS using the Heimdall Proxy

Post Syndicated from Antony Prasad Thevaraj original https://aws.amazon.com/blogs/architecture/offloading-sql-for-amazon-rds-using-the-heimdall-proxy/

Getting the maximum scale from your database often requires fine-tuning the application. This can increase time and incur cost – effort that could be used towards other strategic initiatives. The Heimdall Proxy was designed to intelligently manage SQL connections to help you get the most out of your database.

In this blog post, we demonstrate two SQL offload features offered by this proxy:

  1. Automated query caching
  2. Read/Write split for improved database scale

By leveraging the solution shown in Figure 1, you can save on development costs and accelerate the onboarding of applications into production.

Figure 1. Heimdall Proxy distributed, auto-scaling architecture

Why query caching?

For ecommerce websites with high read calls and infrequent data changes, query caching can drastically improve your Amazon Relational Database Service (RDS) scale. You can use Amazon ElastiCache to serve results. Retrieving data from the cache has a shorter access time, which reduces latency and improves I/O operations.

It can take developers considerable effort to create, maintain, and adjust TTLs for cache subsystems. The proxy technology covered in this article automates results caching in the grid cache chosen by the user, without code changes. What makes this solution unique is its distributed, scalable architecture. As your traffic grows, scaling is supported by simply adding proxies. Multiple proxies work together as a cohesive unit for caching and invalidation.

View video: Heimdall Data: Query Caching Without Code Changes

Why Read/Write splitting?

It can be fairly straightforward to configure a primary and read replica instance on the AWS Management Console. But it may be challenging for the developer to implement such a scale-out architecture.

Some of the issues they might encounter include:

  • Replication lag. A read-after-write query may return inconsistent data due to replication lag. Many applications require strong consistency.
  • DNS dependencies. Due to DNS caching, many connections can be routed to a single replica, creating uneven load distribution across replicas.
  • Network latency. When deploying Amazon RDS globally using the Amazon Aurora Global Database, it's difficult for the application to intelligently choose the optimal reader.

The Heimdall Proxy streamlines the ability to elastically scale out read-heavy database workloads. The Read/Write splitting supports:

  • ACID compliance. Determines the replication lag and knows when it is safe to access a database table, ensuring data consistency.
  • Database load balancing. Tracks the health of each DB instance and evenly distributes connections without relying on DNS.
  • Intelligent routing. Chooses the optimal reader based on the lowest latency to deliver local-like response times. Check out our Aurora Global Database blog.

View video: Heimdall Data: Scale-Out Amazon RDS with Strong Consistency

Customer use case: Tornado

Hayden Cacace, Director of Engineering at Tornado

Tornado is a modern web and mobile brokerage that empowers anyone who aspires to become a better investor.

Our engineering team was tasked to upgrade our backend such that it could handle a massive surge in traffic. With a 3-month timeline, we decided to use read replicas to reduce the load on the main database instance.

First, we migrated from Amazon RDS for PostgreSQL to Aurora for Postgres since it provided better data replication speed. But we still faced a problem – the amount of time it would take to update server code to use the read replicas would be significant. We wanted the team to stay focused on user-facing enhancements rather than server refactoring.

Enter the Heimdall Proxy: We evaluated a handful of options for a database proxy that could automatically do Read/Write splits for us with no code changes, and it became clear that Heimdall was our best option. It had the Read/Write splitting “out of the box” with zero application changes required. And it also came with database query caching built-in (integrated with Amazon ElastiCache), which promised to take additional load off the database.

Before the Tornado launch date, our load testing showed the new system handling several times more load than we were able to previously. We were using a primary Aurora Postgres instance and read replicas behind the Heimdall proxy. When the Tornado launch date arrived, the system performed well, with some background jobs averaging around a 50% hit rate on the Heimdall cache. This has really helped reduce the database load and improve the runtime of those jobs.

Using this solution, we now have a data architecture with additional room to scale. This allows us to continue to focus on enhancing the product for all our customers.

Download a free trial from the AWS Marketplace.

Resources

Heimdall Data, based in the San Francisco Bay Area, is an AWS Advanced Tier ISV partner. They have Amazon Service Ready designations for Amazon RDS and Amazon Redshift. Heimdall Data offers a database proxy that offloads SQL, improving database scale. Deployment does not require code changes. For other proxy options, consider the Amazon RDS Proxy, PgBouncer, PgPool-II, or ProxySQL.

How Viasat scaled their big data applications by migrating to Amazon EMR

Post Syndicated from Manoj Gundawar original https://aws.amazon.com/blogs/big-data/how-viasat-scaled-their-big-data-applications-by-migrating-to-amazon-emr/

This post is co-written with Manoj Gundawar from Viasat.

Viasat is a satellite internet service provider based in Carlsbad, CA, with operations across the United States and worldwide. Viasat’s ambition is to be the first truly global, scalable, broadband service provider with a mission to deliver connections that can change the world. Viasat operates across three main business segments: satellite services, commercial networks, and government systems, providing high-speed satellite broadband services and secure networking systems.

In this post, we discuss how migrating our on-premises big data workloads to Amazon EMR helped us achieve a fully managed cloud-native solution and freed us from the constraints of our legacy on-premises solution, so we can focus on business innovation with a lower TCO.

Challenge with the legacy big data environment

Viasat's big data application, Usage Data Mart, is a high-volume, low-latency Hadoop-based solution that ingests internet usage data from a multitude of source systems. It curates this data, aggregates it with data from other sources, and provides it in an optimized system that supports high-volume access from reporting and web interfaces. This critical application processes over 1.3 billion internet usage records per day, sourced from various upstream systems. Viasat customer support representatives use this data through APIs and reports to assist end users with any queries regarding their internet usage. Various internal teams and customers also use the data for usage accounting, tracking, compliance, and billing purposes.

The Usage Data Mart was previously implemented in an on-premises footprint of over 40 nodes, with three independent clusters all running a commercial distribution of Hadoop to process our big data workload. The legacy Hadoop environment required three times the data replication to achieve high availability, which resulted in a large infrastructure footprint. In this architecture, we had a data ingestion cluster to ingest data from a Kafka streaming service, MySQL, and Oracle databases. We had extract, transform, and load (ETL) jobs for each data source and data domain or model, and most of the ETL jobs filter, curate, canonicalize, and aggregate data (by specific keys) using the MapReduce framework, then load it into HBase as well as into HDFS or Amazon Simple Storage Service (Amazon S3) in Apache Parquet format. We used HBase to provide on-demand query and aggregation of data via REST APIs that used HBase coprocessors to apply aggregation and other business logic. Our independent reporting cluster queried Parquet files on HDFS and Amazon S3 with Apache Drill to generate a few dozen periodic reports. We kept the HBase and ETL clusters separate from the reporting cluster to avoid any resource contention.

Our legacy environment had challenges at various levels:

  • Hardware – As the hardware aged, we encountered hardware failures on a regular basis. Expiring warranties and faulty component replacements increased operational risk. The burden of managing hardware replacements, along with servers failing to restart after maintenance, imposed a serious risk to the business and made maintenance unpredictable and time-consuming.
  • Software – Yearly software license renewal costs, and the engineering effort involved in version upgrades to stay on a supported release, added operational complexity. Moreover, the commercial Hadoop distribution that we were running reached end of life in 2020.
  • Scaling – We needed to scale the on-premises hardware to meet growing business needs around additional satellite launches, which required forecasting and capacity planning. Delays in the procurement and shipment of hardware affected project timelines. Finally, increasing data usage and customer adoption of this portal presented serious scaling challenges for Viasat.

How migration to Amazon EMR helped solve this challenge

Viasat evaluated a few alternatives to modernize big data applications and determined that Amazon EMR would be the right platform for our requirements. The separate compute and storage architecture of Amazon EMR helped us address our challenges. The following are some of the key benefits we realized with migration to AWS:

  • Storage – Amazon EMR supports HBase on Amazon S3, where Amazon S3 is used as persistent storage for the HBase cluster, allowing us to scale our compute needs independently of storage. It also allows us to easily decommission HBase clusters, test upgrades, and optimize our total cost of ownership (see the configuration sketch after this list). For guidelines and best practices, see Migrating to Apache HBase on Amazon S3 on Amazon EMR.
  • DNS records – Because it’s so easy for us to provision new HBase clusters, we needed a way to maintain DNS records to make the cluster easily accessible. We use a simple AWS Lambda function to update DNS records during EMR cluster boot up. The DNS records are stored in our custom DNS service and we use a custom API to update the records.
  • Customization – Amazon EMR allows you to customize existing software on the cluster as well as install additional software. We use Apache Drill for reporting and, with EMR bootstrap actions, we were able to install Apache Drill on all the nodes of the reporting cluster to provide us with distributed reporting using Parquet files generated with a MapReduce job. Because our reporting uses different query patterns than the data in the HBase cluster, the Parquet files are written with a different partition optimized for reporting. We increased the default bucket size from approximately 8 MB to 16 MB and pointed Amazon EMR to private Amazon S3 endpoints to avoid traffic going through an external firewall, which increased performance.
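
To illustrate how these options fit together, the following is a minimal create-cluster sketch, not Viasat's actual command: the cluster name, release label, instance settings, bucket names, and bootstrap script path are all assumptions. It enables the HBase on Amazon S3 storage mode and attaches a bootstrap action that could install Apache Drill on each node; the Lambda-based DNS update happens outside the cluster and is not shown.

# Sketch only: names, release label, instance settings, and script path are assumptions.
aws emr create-cluster \
  --name "usage-data-mart-hbase" \
  --release-label emr-6.3.0 \
  --applications Name=Hadoop Name=HBase \
  --instance-type m5.2xlarge \
  --instance-count 10 \
  --use-default-roles \
  --configurations '[
    {"Classification": "hbase",
     "Properties": {"hbase.emr.storageMode": "s3"}},
    {"Classification": "hbase-site",
     "Properties": {"hbase.rootdir": "s3://my-hbase-root-bucket/usage-data-mart/"}}
  ]' \
  --bootstrap-actions '[
    {"Name": "InstallApacheDrill",
     "Path": "s3://my-bootstrap-bucket/install-apache-drill.sh"}
  ]'

Because the HBase data is persisted in Amazon S3, the cluster itself becomes disposable, which is what makes decommissioning, upgrades, and right-sizing straightforward.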

The following diagram depicts the process we followed for migration to Amazon EMR.

Viasat successfully migrated our on-premises big data applications to Amazon EMR in May 2021, following a lift-and-shift approach to move the big data workload to the cloud with minimal changes. Although we have an experienced big data team, we used AWS Infrastructure Event Management (IEM) to support queries on fine-tuning the Amazon EMR infrastructure within the migration timeline.

The following diagram outlines our new architecture. With this new architecture, we can process the same workloads with 50% of the compute footprint and approximately 50% of the costs compared to our on-premises clusters and still meet our SLAs.

Conclusion

With the new data platform, we can ingest additional data sources so that our analysts and customer service teams can gain insights and improve customer experience. As Viasat is launching new satellites and growing business in multiple new countries, we’re looking to stand up this solution in new AWS Regions (closest to the host country) and scale it as needed.

Questions or feedback? Send an email to [email protected].


About the Authors

Manoj Gundawar is a product owner at Viasat. He builds the product roadmap, provides architecture guidance, and manages the full software development lifecycle to build high-quality products with minimal TCO. He is passionate about delighting customers by providing innovative solutions, leveraging technology, agile methodology, and a continuous improvement mindset.

Archana Srinivasan is a Technical Account Manager within Enterprise Support at Amazon Web Services. Archana helps AWS customers leverage Enterprise Support entitlements to solve complex operational challenges and accelerate their cloud adoption.

Kiran Guduguntla is a WW Go-to-Market Specialist for Amazon EMR at AWS. He works with AWS customers across the globe to strategize, build, develop and deploy modern Big Data solutions. He is passionate about working with customers and helping them in their cloud journey. Kiran loves music, travel, food and watching football.
