Make your monorepo feel small with Git’s sparse index

Post Syndicated from Derrick Stolee original https://github.blog/2021-11-10-make-your-monorepo-feel-small-with-gits-sparse-index/

One way that Git scales to the largest monorepos is the sparse-checkout feature, which allows you to focus on a subset of the files. This is supposed to make it feel like you are actually in a small repository, even though you are contributing to a large repository.

There’s only one problem: the Git index is still large in a monorepo, and users can feel it. Until now.

The Git index is a critical data structure in Git. It serves as the “staging area” between the files you have on your filesystem and your commit history. When you run git add, the files from your working directory are hashed and stored as objects in the index, leading them to be “staged changes”. When you run git commit, the staged changes as stored in the index are used to create that new commit. When you run git checkout, Git takes the data from a commit and writes it to the working directory and the index.

The working directory, index, and commit history

In addition to storing your staged changes, the index also stores filesystem information about your working directory. This helps Git report changed files more quickly. One problem is that the index stores this information for every file at HEAD, even if those files are outside of the sparse-checkout definition. This means that the index can be much larger in a monorepo than it would be if your important subset of files was in its own repository.

Throughout the past year, the Git Fundamentals team at GitHub contributed a new feature to Git called the sparse index, which allows the index to focus on the files within the sparse-checkout cone. If you are in a repository that can use sparse-checkout, then you can enable the sparse index using these commands:

git sparse-checkout init --cone --sparse-index
git sparse-checkout set <dir1> <dir2> ... <dirN>

The size of the sparse index will scale with the number of files within your chosen directories, instead of the full size of your repository. When enabled with a number of other performance features, this can have a dramatic performance impact.

As we built the sparse index, we tested its performance against a large monorepo that has over two million files at HEAD, and with a sparse-checkout definition that populated the working directory with about 100,000 of those files. We then compared the performance between having a normal “full” index, a sparse index, and a repository that only contained the files matching the sparse-checkout definition.

Command performance by index and repository type

The chart above demonstrates the significant performance improvements enabled by the sparse index. The bottom bars for each Git command show the runtime without the sparse index. The middle bars show the runtime of the same commands with the sparse index enabled. The top bars show the runtime of these commands, except in a repository that only contained the files within the sparse-checkout cone, representing the theoretical optimum. Since the sparse index still contains pointers to the rest of the monorepo, there is still some overhead. This overhead is hardly noticeable, since the difference is at most 60 milliseconds, even in the worst case above.

Today, I will go deep into the design and implementation of the sparse index. In particular, I’ll focus on how the Git community made such a significant change to a critical data structure in a safe way. I will include links to the actual changes in the Git codebase as I go.

This post is going to go deep into the guts of Git. If you are unfamiliar with Git’s object model, then learn about commits, trees, and blobs before continuing.

You can also get an overview of the sparse index alongside several other advanced Git features in this presentation I gave with colleague Lessley Dennington at the GitHub Nova event.

First, let’s dig into the Git index to understand its structure and purpose. I’ll use the derrickstolee/trace2-flamechart repository as a concrete example (and to generate some of the figures). If you want to follow along with the Git commands shown, then clone that repository.

The Git index

The index file stores a list of every file at HEAD, along with the object ID for its blob and some metadata. This list of files is stored as a flat list and Git parses the index into an array.

You can expose the list of files in the index using the git ls-files command:

$ git ls-files
LICENSE
README.md
bin/index.js
examples/fetch/git-fetch-after.png
examples/fetch/git-fetch-after.svg
examples/fetch/git-fetch-after.txt
examples/fetch/git-fetch-before.png
examples/fetch/git-fetch-before.svg
examples/fetch/git-fetch-before.txt
examples/fetch/git-fetch-combined.png
examples/fetch/git-fetch-combined.svg
examples/maintenance/trace.png
examples/maintenance/trace.svg
examples/maintenance/trace.txt
package.json

In this repository, many of the files live in an examples/ directory, but you don’t actually need them for the functionality of the code, which lives in the bin/ directory. You can focus the repository only on the necessary files using the git sparse-checkout command:

$ ls
LICENSE     README.md     bin       examples   package.json
$ git sparse-checkout init --cone --sparse-index
$ git sparse-checkout set bin
$ ls
LICENSE     README.md     bin       package.json

Even though I used the sparse-checkout command to reduce the size of the working directory, my git ls-files command will return the same set of files as before. In fact, I can dig in a little more and expose some more information using git ls-files --debug.

$ git ls-files --debug
LICENSE
  ctime: 1634910503:287405820
  mtime: 1634910503:287405820
  dev: 16777220 ino: 119325319
  uid: 501  gid: 20
  size: 1098    flags: 0
README.md
  ctime: 1634910503:288090279
  mtime: 1634910503:288090279
  dev: 16777220 ino: 119325320
  uid: 501  gid: 20
  size: 934 flags: 0
bin/index.js
  ctime: 1634910767:828434033
  mtime: 1634910767:828434033
  dev: 16777220 ino: 119325520
  uid: 501  gid: 20
  size: 7292    flags: 0
examples/fetch/git-fetch-after.png
  ctime: 0:0
  mtime: 0:0
  dev: 0    ino: 0
  uid: 0    gid: 0
  size: 0   flags: 40004000
(...)

The above output is truncated, but it shows that each index entry contains additional filesystem information for each path. The last entry listed shows what happens for a file that is outside of the sparse-checkout definition: all of the filesystem information is removed and the flags entry has some bits enabled. These bits include a SKIP_WORKTREE bit that signifies that Git will not write that file to the working directory.

If these files are not written to disk, then why are they listed in the index at all? The reason is that Git still needs to understand the content that would be there if the index was expanded. Further, that information is used to generate a commit with the git commit command.

In the Git codebase, there is a test helper that can show additional information from the index: test-tool read-cache. (You won’t have this command if you just have normal Git installed.) Running it here, you can see that the index also stores the object IDs for every file:

$ test-tool read-cache --table
100644 blob 646521d0d6c070e6f15e0f5828be1127d3b75503    LICENSE
100644 blob b230b3a6e2d81d50dc00177e970a10726b5baf08    README.md
100755 blob 918533d51c7a5f91622311893dcfd40bfd4f43d7    bin/index.js
100644 blob e0f88531b916b92821476760672e8161b9954898    examples/fetch/git-fetch-after.png
100644 blob f4a523cd1acb0a9d2620970ad7a43405d6e305dc    examples/fetch/git-fetch-after.svg
100644 blob fc4e30dca5fcb0c3d2031dc82a43d5d644e26b41    examples/fetch/git-fetch-after.txt
100644 blob 15dc889965617df3b5a30cf01e52c491e41c59c1    examples/fetch/git-fetch-before.png
100644 blob 602bd5bcbd815914a035d0d4f0d2a3896f600de2    examples/fetch/git-fetch-before.svg
100644 blob bc40a8e4658d17c35de996f3655e737b85ce7ad9    examples/fetch/git-fetch-before.txt
100644 blob 356cdd36e0d78a62af8b010d25d658054bb6fdc7    examples/fetch/git-fetch-combined.png
100644 blob cc0c23f2c8a822c51a17c46268f38c2268b400ae    examples/fetch/git-fetch-combined.svg
100644 blob dfc0893d172d841d971e206461466db935b7c192    examples/maintenance/trace.png
100644 blob a10a876472e46c6ae58e6fc6e2adc64d4dae809b    examples/maintenance/trace.svg
100644 blob 8f5e8bfbc44674feb3aa96e0b7bf1bf717495658    examples/maintenance/trace.txt
100644 blob a4599a9e0a01c28a2c0a622457664fc8c55bfdf9    package.json

Notice in particular how every single row lists the object type as blob, meaning a file. Also, the bin/index.js file has executable permissions, so the file permission column shows 100755 instead of 100644 like the others. This is all important metadata to store for each index entry.

To visualize the index, the diagram below displays our blobs as boxes in a line in the order given above, but it represents the trees that connect the root tree to those blobs as triangles. Thus, the root tree has two subtrees for bin/ and examples/, and the examples/ subtree has two subtrees for fetch/ and maintenance/.

full index

This figure represents all of the links between the trees and blobs. However, the core index data structure stores only the list of blobs as a flat list. The nesting tree structure does not exist in the core of the file.

However, there is an extension to the format that includes the information of the nesting directories: the cache-tree extension. Each node of the cache-tree stores a list of sub-nodes and a range of index entries that are covered by the current node. Each node stores the object ID for the tree it represents.

The index and the cache tree

The root node always covers all of the blobs in the index. The contained nodes have ranges contained within that range.

Git commands such as git add update the cache-tree extension in order to make the next git commit command very fast. To create the new commit, Git can use the tree from the root of the cache-tree extension.

Many Git commands use the index in many different ways. Some commands compare the working directory to the index and update one or the other when there is a difference. Others compare the index and a commit. Some compare multiple indexes together. Some use the cache-tree extension to navigate the nested tree structure, but mostly they scan the flat list of files in the form of an array.

The index affects Git performance at scale

The index can be very large in monorepos. I will show Git performance data from an example monorepo that has over two million files at HEAD. Even using the latest compression techniques available, the index file is over 180 MB in this monorepo.

This has a significant effect on normal Git commands. Presented below is an annotated flamechart of a git status command with one of these large indexes. The x-axis represents time since the start of the command, and each rectangle represents a region of Git’s execution that is marked by its trace2 logging library.

`git status` with full index

Three regions are annotated here:

  1. The index is read from disk and parsed into memory.
  2. The working directory is compared to the index. This triggers a lazy initialization of some hash tables that are required for this effort.
  3. The modified index is written to disk.

Parsing is multi-threaded, but writes are not. This explains some of the differences in how long those actions take.

Clearly, the amount of data in the index is a significant portion of this command. This also affects other commands such as git add and git commit, which are expected to be fast.

This performance concern became abundantly clear when our monorepo customer wanted to group more dependencies within the monorepo. Some teams had isolated Git repositories that were hundreds of times smaller than the monorepo, but these repositories created packages that were consumed by the monorepo, causing complications in tracking dependencies. The hope was that they could merge into the monorepo and rely on sparse-checkout to make it feel like they were still working in a small repository. The user experience was actually much worse, and the root cause was the time it took to read and write the index.

The main culprit is that there are millions of index entries corresponding to files these users do not care about for their daily work. When they push to the server and create a pull request, the build machines can handle the massive scale of building the entire tree and verifying that the small change works within the larger whole of the monorepo. Users should not need to pay that cost.

As members of the Git Fundamentals team, we are very focused on Git performance, and the index has been on our minds for years. For example, the index is a bottleneck for the VFS for Git environment, but that environment has particular needs that prevent improvements in this area.

The biggest thing that has changed recently is the creation of “cone mode” sparse-checkout patterns. These use directory-based pattern matching instead of file-based pattern matching. While cone mode was originally designed to speed up pattern matching in the sparse-checkout feature, it has now unlocked a new way to shrink the index.

The sparse index

The sparse index differs from a normal “full” index in one aspect: it can store directory paths with the object ID for its tree object. This is in addition to the file paths which are paired with blob objects. Since the cone mode sparse-checkout patterns match on a directory level, we can determine that an entire directory is out of the sparse-checkout cone and replace all of its contained file paths with a single directory path.

Back in my example derrickstolee/trace2-flamegraph repository, you can enable the sparse index command and then use the test-tool read-cache tool to show the contents of the index.

$ git sparse-checkout init --cone --sparse-index
$ test-tool read-cache --table
100644 blob 646521d0d6c070e6f15e0f5828be1127d3b75503    LICENSE
100644 blob b230b3a6e2d81d50dc00177e970a10726b5baf08    README.md
100755 blob 918533d51c7a5f91622311893dcfd40bfd4f43d7    bin/index.js
040000 tree b395192a7adbf21793f9489f3623c117802b2043    examples/
100644 blob a4599a9e0a01c28a2c0a622457664fc8c55bfdf9    package.json

Similar to my previous visualizations, you can now see how the index contains a directory entry in addition to the four blobs.

sparse index

The sparse directory entries correspond to directories that are just outside of the sparse-checkout definition. These directories also have a cache-tree node whose range is only one entry: that sparse directory entry.

I can even display the full details of the --debug output for git ls-files. This currently requires a --sparse flag that I have implemented in my personal fork of Git, but a similar feature will eventually be available in the core Git client.

$ git ls-files --debug --sparse
LICENSE
  ctime: 1634910503:287405820
  mtime: 1634910503:287405820
  dev: 16777220 ino: 119325319
  uid: 501  gid: 20
  size: 1098    flags: 200000
README.md
  ctime: 1634910503:288090279
  mtime: 1634910503:288090279
  dev: 16777220 ino: 119325320
  uid: 501  gid: 20
  size: 934 flags: 200000
bin/index.js
  ctime: 1634910767:828434033
  mtime: 1634910767:828434033
  dev: 16777220 ino: 119325520
  uid: 501  gid: 20
  size: 7292    flags: 200000
examples/
  ctime: 0:0
  mtime: 0:0
  dev: 0    ino: 0
  uid: 0    gid: 0
  size: 0   flags: 40004000
package.json
  ctime: 1634910503:288676330
  mtime: 1634910503:288676330
  dev: 16777220 ino: 119325321
  uid: 501  gid: 20
  size: 680 flags: 200000

This output is not truncated as it was before, and you can see that the sparse directory entry for examples/ is the only one with blank filesystem data. It also has the same flags value as the sparse file entries did before.

By removing the number of index entries as well as reducing the average path length, you can shrink the index size significantly. In our example monorepo, most users will reduce their index size from 180 MB to less than 10 MB!

Back to our monorepo, let’s try that git status example again and create a new flamechart. Here, I compare the flamechart for git status with a full index followed by one with a sparse index.

Annotated `git status` flame chart

With the sparse index, the git status command drops from 1.3 seconds to under 200 milliseconds! In the flame chart above I highlighted some regions that have similar appearance in each run. These represent the parts of the git status command that are actually walking the working directory and doing work independent of the index size. Everything else is slower in the full index case entirely because of the size of the index!

Building the sparse index safely

Pruning the index at the directory level is a relatively simple idea with a rather complicated result: our flat list of paths now contains two types of Git objects! There are dozens of places in the Git codebase that interact directly with the index in subtly different ways, and all of them are expecting every index entry to point to a blob object.

In order to make such a change to a critical data structure, we needed to first create a compatibility layer. To safely interact with a sparse index, we needed a way to expand a sparse index to an equivalent full index. This way, code paths that have not been integrated and tested with a sparse index can still be used, even if the on-disk format is sparse.

In the Git codebase, we started by creating the ensure_full_index() method, which converts a sparse index into a full one. This method inspects the list for directory entries and replaces them with its contained file entries. Since the directory is outside of the sparse-checkout cone, we could ignore all of the filesystem metadata information and populate the list by traversing the tree objects under that directory. Before anything else happens, the ensure_full_index() method is called immediately after parsing the index so no interactions with the index happen until the sparse directories are removed.

Expanding to a full index

When Git expands a sparse index to a full one, it scans the entries in lexicographic order. If the entry is a file, then Git copies it to the new list. If the entry is a directory, then the tree at that location is passed to the read_tree_at() method to iterate over all contained blobs. For each contained blob, Git generates an index entry for the corresponding file. Finally, the index entry list is copied back into the index and the index is no longer sparse.

Once that protection was in place, we extended Git to write the sparse index format. When writing the index, a full index is converted to a sparse one in-memory using the convert_to_sparse() method.

Collapsing to a sparse index

To convert from a full index to a sparse one, Git uses the cache-tree extension to find the object ID for our new sparse directory entries. The existing file entries are copied, and Git inserts the directory entries as needed.

Once these steps were built, we could verify that the index size was shrinking to the scale we expected. Both of these steps were included in a single series that introduced the format and implemented the conversions.

While it is nice that the index size has shrunk, we couldn’t stop there. The index is a very compact data structure, so it is more efficient to read it from disk than to recreate it by parsing trees. The ensure_full_index() method takes noticeably longer to expand a sparse index to a full one than it would take to read a full one from disk. In order to gain the performance benefits of a sparse index, we needed to teach Git what to do when encountering sparse directory entries.

Before embarking on these integrations, we first set up more guardrails. A new setting, command_requires_full_index, was created which is enabled by default. This setting is used to trigger ensure_full_index() upon parsing a sparse index unless the Git command being run has explicitly disabled the setting. This allowed us to integrate with Git commands one-by-one without disrupting the behavior of other Git commands. In addition to these settings, we inserted calls to ensure_full_index() before most index interactions. to make sure that we were operating on a full index in any code path that might iterate over the index entries. This allowed us to find which code paths needed integration: we could debug a Git command with a breakpoint on ensure_full_index() and see the call stack that led to that expansion.

The first command to integrate with the sparse index was git status. In hindsight, this was a challenging command to use as a starting point, because it performs multiple index operations that are common to other commands. This became clear when integrating with git checkout and git commit because most of the work was already done in the git status integration.

Let’s explore some of the smaller interactions that needed special care with directory entries.

Example implementation detail: git diff

The git diff command can show what is different between different representations of a working directory. There are two interesting cases that involve the index: comparing the working directory to the index, and comparing the index to a commit.

With no other arguments, git diff shows the differences in the working directory compared to what is staged in the index. This algorithm is mostly simple to integrate with the sparse index: while walking the working directory, Git drills into a directory only if it exists. If the sparse directory entries in the sparse index do not appear as directories in the working directory, then it never tries to drill into the sparse directory entries. If Git finds that a sparse directory entry does exist in the filesystem as a real directory, then the ensure_full_index() method expands the index and Git continues as normal. This is not desired, so we did everything possible to make sure that these directories do not exist, including updating the sparse-checkout feature to delete ignored files outside the cone.

The git diff --cached command compares the files staged in the index with the commit at HEAD. Here, it is easier to have differences outside the sparse-checkout cone, such as when using git reset --soft to change the HEAD commit without changing the working directory or index. In this case, the git diff --cached command wants to compare the root tree for the HEAD commit to the files in the index. This can proceed normally for the files that exist in the sparse index, but when we reach a sparse directory entry, we see the tree object staged in the index as well as a tree object from the tree walk. At this point, we shift from a tree-vs-index comparison to a tree-vs-tree comparison of those subtrees. When that diff is complete, we can continue with the larger comparison.

One major benefit to the tree-vs-tree comparison is that it is easier to compare the same type of objects. The recursive comparison can also prune the walk when it finds two subtrees with the same object ID, as all of their content is the same at that point.

This change allows us to report on these differences not only in the git diff command, but also such diffs as are written during git status, git checkout, and git commit.

Implementation detail: ORT merge strategy

When beginning the sparse index work, there was a huge question that we did not know how to tackle: three-way merges. The default merge strategy, recursive, uses the index as a data structure during its computation. It was going to be difficult to reconcile that algorithm to work within the confines of a sparse index. In fact, many merges that a monorepo user runs would need to resolve merges outside of the sparse-checkout cone.

Luckily, another contributor, Elijah Newren, announced that he was creating a new merge strategy, named the ORT strategy, that did not use the index. We prioritized reviewing and testing that strategy so that we could take advantage of it. It turns out that it is also a better algorithm in general, so it will become the new default strategy with Git 2.34.

The critical feature of the ORT strategy was its replacement of the index with a recursive tree-like structure. That structure is built from the root tree and only creates subtrees for paths that have changed since the merge base. At the end, the ORT strategy creates an index to match its representation of the resulting merge commit.

Because of the ORT merge strategy, integrating the sparse index into git merge, git rebase, git cherry-pick, and git revert was very simple. We just needed to make sure the index that was created at the end of the merge was sparse from its original creation.

Different merge strategies and their performance

As we reported earlier, the ORT strategy improves over the recursive strategy in the typical case, but also the recursive strategy has significant outliers as shown in the box plot above. Enabling the sparse index on top of the ORT strategy provides even more improvements.

Without the ORT merge strategy, the sparse index work could have easily doubled in scope. For a detailed look at the ORT strategy and its many optimizations, take a look at Elijah’s six-part blog series:

Testing the sparse index

The sparse index was touching critical code and doing so in interesting ways. We needed a way to carefully test that these changes were as correct as possible. The Git test suite is substantial and has excellent coverage of most index operations. However, almost all of those tests do not use sparse-checkout, so we couldn’t immediately gain value in checking the sparse index by enabling it globally.

We created a test script that focused on testing the sparse index in a new way. The test starts by creating a repository with some interesting data shapes in it. Then, each test case starts by copying that repository into three new repositories. Those three repositories have different configurations:

  1. The repository as-is, without sparse-checkout.
  2. The repository with cone mode sparse-checkout enabled.
  3. The repository with cone mode sparse-checkout and sparse index enabled.

Then, each test case runs a number of Git commands against all three repositories, expecting the same output and results in the working directory. This allowed us to be confident that the changes we were making to enable the sparse index would have identical behavior with the other two cases.

Along the way, we found some interesting differences between sparse-checkout and full repositories. Several of these bugs have been fixed since. Sometimes, it was unclear whether the sparse-checkout feature should do the same thing as a normal repository, specifically when interacting with files outside of the sparse-checkout cone. This led to changing how some commands interact with sparse entries. In Git 2.34, some commands will need a --sparse flag in order to modify paths outside of the sparse-checkout cone.

In addition to these test scripts, we routinely ran the Scalar functional tests against our development branches, since many of those tests focus on special circumstances around the sparse-checkout feature. If a Scalar test would fail when the Git tests did not, then we would create a similar test in Git to prevent such a bug in the future.

Once we had integrated with a core set of Git commands, we also created an experimental release that contained early versions of these integrations. We provided this version to a subset of monorepo users to evaluate the performance. We found some interesting data from some of the users, but overall the results confirmed that the sparse index was going to significantly improve the user experience in the monorepo. The most important thing we discovered is that the sparse-checkout feature should remove ignored files outside of the cone, as those files cause the sparse index to expand to a full one, negating the performance benefits. There is an additional benefit in that the working directory shrinks even more by deleting these files.

The current state of the sparse index

Not all Git commands understand the sparse index. Those that have not been integrated trigger a compatibility check that converts a sparse index into a full one during the first index read. The integrated commands are ones that have been carefully tested with the sparse index. They likely received some code updates in order to properly handle a sparse index. These integrations were dispersed across the last few Git versions, and some only exist in the microsoft/git fork until we can complete contributing them to the core Git project.

scope

In June, Git 2.32.0 was released with an understanding of the sparse index format. In August, Git 2.33.0 included integrations with git status, git commit, and git checkout. Git 2.34.0 is slated for a November release with integrations for git add, git merge, git rebase, git cherry-pick, and git reset. In order to serve our monorepo users, we fast-tracked some integrations into a pre-release in our fork including integrations with git diff, git blame, git clean, git sparse-checkout, and git stash.

Based on our understanding of how users interact with a monorepo, the commands that are listed here are sufficient to cover almost all users’ needs. We expect that users who adopt the sparse index with these integrations will have a significantly improved experience from before. This all depends on the size of their monorepo and on the size of their sparse-checkout cone.

We will release this feature widely to our monorepo users with the sparse index on by default in the 2.34 release of our Git fork. The core Git project will keep the sparse index off by default until it has all of these features and the implementation has been stable for a few versions.

Looking to the future

At this point, we have covered all of the integrations we need to have a successful monorepo experience. There is more work to be done. In particular, we need to finish contributing the final integrations in our list. Upstream progress takes time and we are grateful for all of the feedback we have received from the community so far. There are more commands that could use integrations, as well.

For now, we are focused on ensuring that monorepo users transition to the sparse index without incident. If you are interested in the sparse index, then we believe it is in an excellent state to start using it. If you have any issues with its performance or stability, then please do not hesitate to create an issue or start a discussion. We will be there to help!

Through the sparse index, we broke through a significant barrier to monorepo scale. We continue to seek the next innovation that will help Git scale to the largest repos out there.

CVE-2021-43287 Allows Pre-Authenticated Build Takeover of GoCD Pipelines

Post Syndicated from Caitlin Condon original https://blog.rapid7.com/2021/11/10/cve-2021-43287-allows-pre-authenticated-build-takeover-of-gocd-pipelines/

CVE-2021-43287 Allows Pre-Authenticated Build Takeover of GoCD Pipelines

On October 26, 2021, open-source CI/CD solution GoCD released version 21.3.0, which included a fix for CVE-2021-43287, a critical information disclosure vulnerability whose exploitation allows unauthenticated attackers to leak configuration information, including build secrets and encryption keys. Both Rapid7 vulnerability researchers and community researchers were easily able to register a rogue agent, injecting themselves into GoCD builds and enabling full, pre-authenticated pipeline takeover. CVE-2021-43287 can be exploited with a single HTTP request.

While CVE-2021-43287 is still awaiting a formal CVSSv3 score and description, it’s no secret that CI/CD tooling and pipelines are high-value targets for both sophisticated and opportunistic attackers. GoCD customers should update to version 21.3.0 on an emergency basis, given the potential for exploitation to undermine the integrity of their software development pipelines. The US Cybersecurity and Infrastructure Security Agency (CISA) has also issued an alert and patch guidance. Rapid7’s vulnerability research team has a more detailed technical analysis of CVE-2021-43287 here.

Rapid7 customers

InsightVM and Nexpose customers can assess their exposure to CVE-2021-43287 with a remote vulnerability check available in the November 9, 2021 content release.

NEVER MISS A BLOG

Get the latest stories, expertise, and news about security today.

Security updates for Wednesday

Post Syndicated from original https://lwn.net/Articles/875708/rss

Security updates have been issued by Debian (openjdk-8 and samba), Fedora (community-mysql, firefox, and vim), openSUSE (binutils, kernel, and tinyxml), Red Hat (annobin, autotrace, babel, bind, binutils, bluez, compat-exiv2-026, container-tools:2.0, container-tools:3.0, container-tools:rhel8, cups, curl, dnf, dnsmasq, edk2, exiv2, file, file-roller, firefox, gcc, gcc-toolset-10-annobin, gcc-toolset-10-binutils, gcc-toolset-10-gcc, gcc-toolset-11-annobin, gcc-toolset-11-binutils, gcc-toolset-11-gcc, glib2, glibc, GNOME, gnutls and nettle, go-toolset:rhel8, grafana, graphviz, grilo, httpd:2.4, jasper, java-17-openjdk, json-c, kernel, kernel-rt, kexec-tools, kpatch-patch, lasso, libgcrypt, libjpeg-turbo, libsepol, libsolv, libssh, libtiff, libwebp, libX11, linuxptp, lua, mingw-glib2, mutt, ncurses, NetworkManager, openjpeg2, openssh, openssl, pcre, pcs, php:7.4, python-jinja2, python-lxml, python-pillow, python-pip, python-psutil, python27:2.7, python3, python36:3.6, python38:3.8 and python38-devel:3.8, python39:3.9 and python39-devel:3.9, qt5, resource-agents, rpm, rust-toolset:rhel8, spamassassin, sqlite, squid:4, tcpdump, tpm2-tools, vim, virt:rhel and virt-devel:rhel, and zziplib), and SUSE (binutils and kernel).

tCell by Rapid7 Supports the Newly Released .NET 6.0

Post Syndicated from Bria Grangard original https://blog.rapid7.com/2021/11/10/tcell-by-rapid7-supports-the-newly-released-net-6-0/

tCell by Rapid7 Supports the Newly Released .NET 6.0

We’re excited to share that we’ve coordinated our recent .NET and .NET Core agent releases with the brand new .NET 6.0 release from Microsoft.

What is tCell?

Since the founding of tCell by Rapid7, our web application and API protection solution, we’ve prided ourselves on providing both breadth and depth of coverage. We recognize the importance of shifting left in the software development lifecycle (SDLC), and we also see the equal importance of shifting right. This is why Rapid7 approaches application security from a holistic perspective that’s inclusive of testing, monitoring, and protecting to prevent the exploitation of web applications and APIs across the SDLC.

In addition to coverage throughout the SDLC, we recognize modern applications require modern AppSec solutions to meet them where they are. This is why our solution offers flexible deployment options including our App-Server agents, Web-Server agents, proxy agents, and CDN agents. So whether you’re securing a traditional application written in Java, you’re a Windows shop, or you’re deploying applications with a Kubernetes architecture, we’ve got you covered.

Sounds great… but what’s new?

Today, Microsoft announced the release of .NET 6, a significant release with an expected lifetime of 3 years and a slew of new features. Today, tCell also announced support for .NET 6. This is in line with a promise to our customers to support modern languages and frameworks, as soon as possible.

With this new tCell release, customers leveraging .NET or .NET Core frameworks will continue to be supported with our native .NET and .NET Core agents. Additionally, we’ve added functionality for new and existing tCell customers. For those customers who are leveraging tCell’s .NET or .NET Core agents 2.2+, they will now have full visibility into their API endpoints. What’s this mean? tCell will automatically enumerate your application’s API endpoints and show you attack attempts, attack types, and any app exceptions.

tCell by Rapid7 Supports the Newly Released .NET 6.0

This is helpful because it provides visibility into key API endpoints, what attacks were attempted, and what attacks were successful, automatically reducing the number of high alerts our customers receive. From here, customers can drill down for more context on the attack attempts and view detailed stack trace information.

tCell by Rapid7 Supports the Newly Released .NET 6.0

So, tell me more!

In addition to the real-time monitoring and detection for your web applications and APIs, tCell provides a robust set of use cases that it helps solve when leveraging either our .NET or .NET Core agents. Here are just a few examples:

  • Want visibility into your dev teams’ third-party packages and their associated vulnerabilities? Check out our runtime software composition analysis (SCA) feature, which will highlight any vulnerable third party code.
  • Want to make sure no one gets access to your user accounts? Attackers will often use a stolen list of credentials to login to an app and mimic real user behavior. With tCell’s Account Takeover capabilities, we monitor for these auth events and can take action on blocking anyone that is deemed a threat. We can also share a list of compromised users your team can follow up on.
  • Want to ensure your apps only load specific content? tCell makes it seamless and easy to enable Content Security Policies to enforce what content is loaded on your site and ensure a safe and happy user experience.
  • … And so much more, from specific actor detection to prevention of unauthorized executions of OS commands and app-level attacks.

We’re proud of our continued investment in helping customers secure their applications and APIs. Curious to learn more? Feel free to reach out to get a demo of tCell or try it out yourself!

Hacking the Sony Playstation 5

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/11/hacking-the-sony-playstation-5.html

I just don’t think it’s possible to create a hack-proof computer system, especially when the system is physically in the hands of the hackers. The Sony Playstation 5 is the latest example:

Hackers may have just made some big strides towards possibly jailbreaking the PlayStation 5 over the weekend, with the hacking group Fail0verflow claiming to have managed to obtain PS5 root keys allowing them to decrypt the console’s firmware.

[…]

The two exploits are particularly notable due to the level of access they theoretically give to the PS5’s software. Decrypted firmware ­ which is possible through Fail0verflow’s keys ­ would potentially allow for hackers to further reverse engineer the PS5 software and potentially develop the sorts of hacks that allowed for things like installing Linux, emulators, or even pirated games on past Sony consoles.

In 1999, Adam Shostack and I wrote a paper discussing the security challenges of giving people devices that included embedded secrets that needed to be kept from those people. We were writing about smart cards, but our lessons were general. And they’re no less applicable today.

Implementing interruption tolerance in Amazon EC2 Spot with AWS Fault Injection Simulator

Post Syndicated from Pranaya Anshu original https://aws.amazon.com/blogs/compute/implementing-interruption-tolerance-in-amazon-ec2-spot-with-aws-fault-injection-simulator/

This post is written by Steve Cole, WW SA Leader for EC2 Spot, and David Bermeo, Senior Product Manager for EC2.

On October 20, 2021, AWS released new functionality to the Amazon Fault Injection Simulator that supports triggering the interruption of Amazon EC2 Spot Instances. This functionality lets you test the fault tolerance of your software by interrupting instances on command. The triggered interruption will be preceded with a Rebalance Recommendation (RBR) and Instance Termination Notification (ITN) so that you can fully test your applications as if an actual Spot interruption had occurred.

In this post, we’ll provide two examples of how easy it has now become to simulate Spot interruptions and validate the fault-tolerance of an application or service. We will demonstrate testing an application through the console and a service via CLI.

Engineering use-case (console)

Whether you are building a Spot-capable product or service from scratch or evaluating the Spot compatibility of existing software, the first step in testing is identifying whether or not the software is tolerant of being interrupted.

In the past, one way this was accomplished was with an AWS open-source tool called the Amazon EC2 Metadata Mock. This tool let customers simulate a Spot interruption as if it had been delivered through the Instance Metadata Service (IMDS), which then let customers test how their code responded to an RBR or an ITN. However, this model wasn’t a direct plug-and-play solution with how an actual Spot interruption would occur, since the signal wasn’t coming from AWS. In particular, the method didn’t provide the centralized notifications available through Amazon EventBridge or Amazon CloudWatch Events that enabled off-instance activities like launching AWS Lambda functions or other orchestration work when an RBR or ITN was received.

Now, Fault Injection Simulator has removed the need for custom logic, since it lets RBR and ITN signals be delivered via the standard IMDS and event services simultaneously.

Let’s walk through the process in the AWS Management Console. We’ll identify an instance that’s hosting a perpetually-running queue worker that checks the IMDS before pulling messages from Amazon Simple Queue Service (SQS). It will be part of a service stack that is scaled in and out based on the queue depth. Our goal is to make sure that the IMDS is being polled properly so that no new messages are pulled once an ITN is received. The typical processing time of a message with this example is 30 seconds, so we can wait for an ITN (which provides a two minute warning) and need not act on an RBR.

First, we go to the Fault Injection Simulator in the AWS Management Console to create an experiment.

AWS Fault Injection Simulator start screen in AWS Management console

At the experiment creation screen, we start by creating an optional name (recommended for console use) and a description, and then selecting an IAM Role. If this is the first time that you’ve used Fault Injection Simulator, then you’ll need to create an IAM Role per the directions in the FIS IAM permissions documentation. I’ve named the role that we created ‘FIS.’ After that, I’ll select an action (interrupt) and identify a target (the instance).

Experiment template screen in AWS Management console with description “interrupt queue worker,” name “interrupt queue worker,” and IAM role “FIS.”

First, I name the action. The Action type I want is to interrupt the Spot Instance: aws:ec2:send-spot-instance-interruptions. In the Action parameters, we are given the option to set the duration. The minimum value here is two minutes, below which you will receive an error since Spot Instances will always receive a two minute warning. The advantage here is that, by setting the durationBeforeInterruption to a value above two minutes, you will get the RBR (an optional point for you to respond) and ITN (the actual two minute warning) at different points in time, and this lets you respond to one or both.

New action form in the create experiment template screen with name, description, action type and durationBeforeInterruption fields filled.

The target instance that we launched is depicted in the following screenshot. It is a Spot Instance that was launched as a persistent request with its interruption action set to ‘stop’ instead of ‘terminate.’ The option to stop a Spot Instance, introduced in 2020, will let us restart the instance, log in and retrieve logs, update code, and perform other work necessary to implement Spot interruption handling.

Now that an action has been defined, we configure the target. We have the option of naming the target, which we’ve done here to match the Name tagged on the EC2 instance ‘qWorker’. The target method we want to use here is Resource ID, and then we can either type or select the desired instance from a drop-down list. Selection mode will be ‘all’, as there is only one instance. If we were using tags, which we will in the next example, then we’d be able to select a count of instances, up to five, instead of just one.

Edit target form in the create experiment template screen with name, resource type, and resource id fields filled. Name is “qWorker,” Resource type is “aws:ec2:spot-Instance,” Target method is “Resource IDs” Resource IDs is “i-019ea405f8f81742b” and Selection mode is “All.”

Once you’ve saved the Action, the Target, and the Experiment, then you’ll be able to begin the experiment by selecting the ‘Start from the Action’ menu at the top right of the screen.

Fault Injection Simulator Experiment template summary screen in the AWS Management console highlights experiment template ID, stop conditions, description, creation time, IAM role, and last update time.

After the experiment starts, you’ll be able to observe its state by refreshing the screen. Generally, the process will take just seconds, and you should be greeted by the Completed state, as seen in the following screenshot.

Fault Injection Simulator Experiment summary screen in the AWS Management console shows status as “Completed.”

In the following screenshot, having opened an interruption log group created in CloudWatch Event Logs, we can see the JSON of the RBR.

CloudWatch Event Logs screen in the AWS Management Console with information on the EC2 Rebalance Recommendation.

Two minutes later, we see the ITN in the same log group.

Cloudwatch Event Logs screen in the AWS Management console with information on the Instance Termination Notification.

Another two minutes after the ITN, we can see the EC2 instance is in the process of stopping (or terminating, if you elect).

EC2 Instance List screen in the AWS Management console with information on the instance that is stopping.

Shortly after the stop is issued by EC2, we can see the instance stopped. It would now be possible to restart the instance and view logs, make code changes, or do whatever you find necessary before testing again.

EC2 Instance List screen in the AWS Management console with information on the Instance that is stopped.

Now that our experiment succeeded in interrupting our Spot Instance, we can evaluate the performance of the code running on the instance. It should have completed the processing of any messages already retrieved at the ITN, and it should have not pulled any new messages afterward.

This experiment can be saved for later use, but it will require selecting the specific instance each time that it’s run. We can also re-use the experiment template by using tags instead of an instance ID, as we’ll show in the next example. This shouldn’t prove troublesome for infrequent experiments, and especially those run through the console. Or, as we did in our example, you can set the instance interruption behavior to stop (versus terminate) and re-use the experiment as long as that particular instance continues to exist. When the experiments get more frequent, it might be advantageous to automate the process, possibly as part of the test phase of a CI/CD pipeline. Doing this is programmatically possible through the AWS CLI or SDK.

Operations use-case (CLI)

Once the developers of our product validate the single-instance fault tolerance, indicating that the target workload is capable of running on Spot Instances, then the next logical step is to deploy the product as a service on multiple instances. This will allow for more comprehensive testing of the service as a whole, and it is a key process in collecting valuable information, such as performance data, response times, error rates, and other metrics to be used in the monitoring of the service. Once data has been collected on a non-interrupted deployment, it is then possible to use the Spot interruption action of the Fault Injection Simulator to observe how well the service can handle RBR and ITN while running, and to see how those events influence the metrics collected previously.

When testing a service, whether it is launched as instances in an Amazon EC2 Auto Scaling group, or it is part of one of the AWS container services, such as Amazon Elastic Container Service (Amazon ECS) or the Amazon Elastic Kubernetes Service (EKS), EC2 Fleet, Amazon EMR, or across any instances with descriptive tagging, you now have the ability to trigger Spot interruptions to as many as five instances in a single Fault Injection Simulator experiment.

We’ll use tags, as opposed to instance IDs, to identify candidates for interruption to interrupt multiple Spot Instances simultaneously. We can further refine the candidate targets with one or more filters in our experiment, for example targeting only running instances if you perform an action repeatedly.

In the following example, we will be interrupting three instances in an Auto Scaling group that is backing a self-managed EKS node group. We already know the software will behave as desired from our previous engineering tests. Our goal here is to see how quickly EKS can launch replacement tasks and identify how the service as a whole responds during the event. In our experiment, we will identify instances that contain the tag aws:autoscaling:groupName with a value of “spotEKS”.

The key benefit here is that we don’t need a list of instance IDs in our experiment. Therefore, this is a re-usable experiment that can be incorporated into test automation without needing to make specific selections from previous steps like collecting instance IDs from the target Auto Scaling group.

We start by creating a file that describes our experiment in JSON rather than through the console:

{
    "description": "interrupt multiple random instances in ASG",
    "targets": {
        "spotEKS": {
            "resourceType": "aws:ec2:spot-instance",
            "resourceTags": {
                "aws:autoscaling:groupName": "spotEKS"
            },
            "selectionMode": "COUNT(3)"
        }
    },
    "actions": {
        "interrupt": {
            "actionId": "aws:ec2:send-spot-instance-interruptions",
            "description": "interrupt multiple instances",
            "parameters": {
                "durationBeforeInterruption": "PT4M"
            },
            "targets": {
            "SpotInstances": "spotEKS"
            }
        }
    },
    "stopConditions": [
        {
            "source": "none"
        }
    ],
    "roleArn": "arn:aws:iam::xxxxxxxxxxxx:role/FIS",
    "tags": {
        "Name": "multi-instance"
    }
}

Then we upload the experiment template to Fault Injection Simulator from the command-line.

aws fis create-experiment-template --cli-input-json file://experiment.json

The response we receive returns our template along with an ID, which we’ll need to execute the experiment.

{
    "experimentTemplate": {
        "id": "EXT3SHtpk1N4qmsn",
        ...
    }
}

We then execute the experiment from the command-line using the ID that we were given at template creation.

aws fis start-experiment --experiment-template-id EXT3SHtpk1N4qmsn

We then receive confirmation that the experiment has started.

{
    "experiment": {
        "id": "EXPaFhEaX8GusfztyY",
        "experimentTemplateId": "EXT3SHtpk1N4qmsn",
        "state": {
            "status": "initiating",
            "reason": "Experiment is initiating."
        },
        ...
    }
}

To check the status of the experiment as it runs, which for interrupting Spot Instances is quite fast, we can query the experiment ID for success or failure messages as follows:

aws fis get-experiment --id EXPaFhEaX8GusfztyY

And finally, we can confirm the results of our experiment by listing our instances through EC2. Here we use the following command-line before and after execution to generate pre- and post-experiment output:

aws ec2 describe-instances --filters\
 Name='tag:aws:autoscaling:groupName',Values='spotEKS'\
 Name='instance-state-name',Values='running'\
 | jq .Reservations[].Instances[].InstanceId | sort

We can then compare this to identify which instances were interrupted and which were launched as replacements.

< "i-003c8d95c7b6e3c63"
< "i-03aa172262c16840a"
< "i-02572fa37a61dc319"
---
> "i-04a13406d11a38ca6"
> "i-02723d957dc243981"
> "i-05ced3f71736b5c95"

Summary

In the previous examples, we have demonstrated through the console and command-line how you can use the Spot interruption action in the Fault Injection Simulator to ascertain how your software and service will behave when encountering a Spot interruption. Simulating Spot interruptions will help assess the fault-tolerance of your software and can assess the impact of interruptions in a running service. The addition of events can enable more tooling, and being able to simulate both ITNs and RBRs, along with the Capacity Rebalance feature of Auto scaling groups, now matches the end-to-end experience of an actual AWS interruption. Get started on simulating Spot interruptions in the console.

Token-based authentication for iOS applications with Amazon SNS

Post Syndicated from Talia Nassi original https://aws.amazon.com/blogs/compute/token-based-authentication-for-ios-applications-with-amazon-sns/

This post is co-written by Karen Hong, Software Development Engineer, AWS Messaging.

To use Amazon SNS to send mobile push notifications, you must provide a set of credentials for connecting to the supported push notification service (see prerequisites for push). For the Apple Push Notification service (APNs), SNS now supports using token-based authentication (.p8), in addition to the existing certificate-based method.

You can now use a .p8 file to create or update a platform application resource through the SNS console or programmatically. You can publish messages (directly or from a topic) to platform application endpoints configured for token-based authentication.

In this tutorial, you set up an example iOS application. You retrieve information from your Apple developer account and learn how to register a new signing key. Next, you use the SNS console to set up a platform application and a platform endpoint. Finally, you test the setup and watch a push notification arrive on your device.

Advantages of token-based authentication

Token-based authentication has several benefits compared to using certificates. The first is that you can use the same signing key from multiple provider servers (iOS,VoIP, and MacOS), and you can use one signing key to distribute notifications for all of your company’s application environments (sandbox, production). In contrast, a certificate is only associated with a particular subset of these channels.

A pain point for customers using certificate-based authentication is the need to renew certificates annually, an inconvenient procedure which can lead to production issues when forgotten. Your signing key for token-based authentication, on the other hand, does not expire.

Token-based authentication improves the security of your certificates. Unlike certificate-based authentication, the credential does not transfer. Hence, it is less likely to be compromised. You establish trust through encrypted tokens that are frequently regenerated. SNS manages the creation and management of these tokens.

You configure APNs platform applications for use with both .p8 and .p12 certificates, but only 1 authentication method is active at any given time.

Setting up your iOS application

To use token-based authentication, you must set up your application.

Prerequisites: An Apple developer account

  1. Create a new XCode project. Select iOS as the platform and use the App template.
    xcode project
  2. Select your Apple Developer Account team and your organization identifier.
    vscode details
  3. Go to Signing & Capabilities and select + Capability. This step creates resources on your Apple Developer Account.
    step 3
  4. Add the Push Notification Capability.
  5. In SNSPushDemoApp.swift , add the following code to print the device token and receive push notifications.
    import SwiftUI
    
    @main
    struct SNSPushDemoApp: App {
        
        @UIApplicationDelegateAdaptor private var appDelegate: AppDelegate
        
        var body: some Scene {
            WindowGroup {
                ContentView()
            }
        }
    }
    
    class AppDelegate: NSObject, UIApplicationDelegate, UNUserNotificationCenterDelegate {
        
        func application(_ application: UIApplication,
                         didFinishLaunchingWithOptions launchOptions: [UIApplication.LaunchOptionsKey : Any]? = nil) -> Bool {
            UNUserNotificationCenter.current().delegate = self
            return true
        }
        
        func application(_ application: UIApplication,
                         didRegisterForRemoteNotificationsWithDeviceToken deviceToken: Data) {
            let tokenParts = deviceToken.map { data in String(format: "%02.2hhx", data) }
            let token = tokenParts.joined()
            print("Device Token: \(token)")
        };
        
        func application(_ application: UIApplication, didFailToRegisterForRemoteNotificationsWithError error: Error) {
           print(error.localizedDescription)
        }
        
        func userNotificationCenter(_ center: UNUserNotificationCenter, willPresent notification: UNNotification, withCompletionHandler completionHandler: @escaping (UNNotificationPresentationOptions) -> Void) {
            completionHandler([.banner, .badge, .sound])
        }
    }
    
  6. In ContentView.swift, add the code to request authorization for push notifications and register for notifications.
    import SwiftUI
    
    struct ContentView: View {
        init() {
            requestPushAuthorization();
        }
        
        var body: some View {
            Button("Register") {
                registerForNotifications();
            }
        }
    }
    
    struct ContentView_Previews: PreviewProvider {
        static var previews: some View {
            ContentView()
        }
    }
    
    func requestPushAuthorization() {
        UNUserNotificationCenter.current().requestAuthorization(options: [.alert, .badge, .sound]) { success, error in
            if success {
                print("Push notifications allowed")
            } else if let error = error {
                print(error.localizedDescription)
            }
        }
    }
    
    func registerForNotifications() {
        UIApplication.shared.registerForRemoteNotifications()
    }
    
  7. Build and run the app on an iPhone. The push notification feature does not work with a simulator.
  8. On your phone, select allow notifications when the prompt appears. The debugger prints out “Push notifications allowed” if it is successful.
    allow notifications
  9. On your phone, choose the Register button. The debugger prints out the device token.
  10. You have set up an iOS application that can receive push notifications and prints the device token. We can now use this app to test sending push notifications with SNS configured for token-based authentication.

Retrieving your Apple resources

After setting up your application, you retrieve your Apple resources from your Apple developer account. There are four pieces of information you need from your Apple Developer Account: Bundle ID, Team ID, Signing Key, and Signing Key ID.

The signing key and signing key ID are credentials that you manage through your Apple Developer Account. You can register a new key by selecting the Keys tab under the Certificates, Identifiers & Profiles menu. Your Apple developer account provides the signing key in the form of a text file with a .p8 extension.

certs

Find the team ID under Membership Details. The bundle ID is the unique identifier that you set up when creating your application. Find this value in the Identifiers section under the Certificates, Identifiers & Profiles menu.

Amazon SNS uses a token constructed from the team ID, signing key, and signing key ID to authenticate with APNs for every push notification that you send. Amazon SNS manages tokens on your behalf and renews them when necessary (within an hour). The request header includes the bundle ID and helps identify where the notification goes.

Creating a new platform application using APNs token-based authentication

Prerequisites

In order to implement APNs token-based authentication, you must have:

  • An Apple Developer Account
  • A mobile application

To create a new platform application:

  1. Navigate to the Amazon SNS console and choose Push notifications. Then choose Create platform application.
    sns console
  2. Enter a name for your application. In the Push notification platform dropdown, choose Apple iOS/VoIP/Mac.
    sns name
  3. For the Push service, choose iOS, and for the Authentication method, choose Token. Select the check box labeled Used for development in sandbox. Then, input the fields from your Apple Developer Account.
    step 3
  4. You have successfully created a platform application using APNs token-based authentication.

Creating a new platform endpoint using APNs token-based authentication

A platform application stores credentials, sending configuration, and other settings but does not contain an exact sending destination. Create a platform endpoint resource to store the information to allow SNS to target push notifications to the proper application on the correct mobile device.

Any iOS application that is capable of receiving push notifications must register with APNs. Upon successful registration, APNs returns a device token that uniquely identifies an instance of an app. SNS needs this device token in order to send to that app. Each platform endpoint belongs to a specific platform application and uses the credentials and settings set in the platform application to complete the sending.

In this tutorial, you create the platform endpoint manually through the SNS console. In a real system, upon receiving the device token, you programmatically call SNS from your application server to create or update your platform endpoints.

These are the steps to create a new platform endpoint:

  1. From the details page of the platform application in the SNS console, choose Create application endpoint.
    appliation endpont
  2. From the iOS app that you set up previously, find the device token in the application logs. Enter the device token and choose Create application endpoint.
    application endpont details
  3. You have successfully created a platform application endpoint.
    application endpoint

Testing a push notification from your device

In this section, you test a push notification from your device.

  1. From the details page of the application endpoint you just created, (this is the page you end up at immediately after creating the endpoint), choose Publish message.
  2. Enter a message to send and choose Publish message.
    testing app
  3. The notification arrives on your iOS app.
    testing the app

Conclusion

Developers sending mobile push notifications can now use a .p8 key to authenticate an Apple device endpoint. Token-based authentication is more secure, and reduces operational burden of renewing the certificates every year. In this post, you learn how to set up your iOS application for mobile push using token-based authentication, by creating and configuring a new platform endpoint in the Amazon SNS console.

To learn more about APNs token-based authentication with Amazon SNS, visit the Amazon SNS Developer Guide. For more serverless content, visit Serverless Land.

Create a serverless event-driven workflow to ingest and process Microsoft data with AWS Glue and Amazon EventBridge

Post Syndicated from Venkata Sistla original https://aws.amazon.com/blogs/big-data/create-a-serverless-event-driven-workflow-to-ingest-and-process-microsoft-data-with-aws-glue-and-amazon-eventbridge/

Microsoft SharePoint is a document management system for storing files, organizing documents, and sharing and editing documents in collaboration with others. Your organization may want to ingest SharePoint data into your data lake, combine the SharePoint data with other data that’s available in the data lake, and use it for reporting and analytics purposes. AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months.

Organizations often manage their data on SharePoint in the form of files and lists, and you can use this data for easier discovery, better auditing, and compliance. SharePoint as a data source is not a typical relational database and the data is mostly semi structured, which is why it’s often difficult to join the SharePoint data with other relational data sources. This post shows how to ingest and process SharePoint lists and files with AWS Glue and Amazon EventBridge, which enables you to join other data that is available in your data lake. We use SharePoint REST APIs with a standard open data protocol (OData) syntax. OData advocates a standard way of implementing REST APIs that allows for SQL-like querying capabilities. OData helps you focus on your business logic while building RESTful APIs without having to worry about the various approaches to define request and response headers, query options, and so on.

AWS Glue event-driven workflows

Unlike a traditional relational database, SharePoint data may or may not change frequently, and it’s difficult to predict the frequency at which your SharePoint server generates new data, which makes it difficult to plan and schedule data processing pipelines efficiently. Running data processing frequently can be expensive, whereas scheduling pipelines to run infrequently can deliver cold data. Similarly, triggering pipelines from an external process can increase complexity, cost, and job startup time.

AWS Glue supports event-driven workflows, a capability that lets developers start AWS Glue workflows based on events delivered by EventBridge. The main reason to choose EventBridge in this architecture is because it allows you to process events, update the target tables, and make information available to consume in near-real time. Because frequency of data change in SharePoint is unpredictable, using EventBridge to capture events as they arrive enables you to run the data processing pipeline only when new data is available.

To get started, you simply create a new AWS Glue trigger of type EVENT and place it as the first trigger in your workflow. You can optionally specify a batching condition. Without event batching, the AWS Glue workflow is triggered every time an EventBridge rule matches, which may result in multiple concurrent workflows running. AWS Glue protects you by setting default limits that restrict the number of concurrent runs of a workflow. You can increase the required limits by opening a support case. Event batching allows you to configure the number of events to buffer or the maximum elapsed time before firing the particular trigger. When the batching condition is met, a workflow run is started. For example, you can trigger your workflow when 100 files are uploaded in Amazon Simple Storage Service (Amazon S3) or 5 minutes after the first upload. We recommend configuring event batching to avoid too many concurrent workflows, and optimize resource usage and cost.

To illustrate this solution better, consider the following use case for a wine manufacturing and distribution company that operates across multiple countries. They currently host all their transactional system’s data on a data lake in Amazon S3. They also use SharePoint lists to capture feedback and comments on wine quality and composition from their suppliers and other stakeholders. The supply chain team wants to join their transactional data with the wine quality comments in SharePoint data to improve their wine quality and manage their production issues better. They want to capture those comments from the SharePoint server within an hour and publish that data to a wine quality dashboard in Amazon QuickSight. With an event-driven approach to ingest and process their SharePoint data, the supply chain team can consume the data in less than an hour.

Overview of solution

In this post, we walk through a solution to set up an AWS Glue job to ingest SharePoint lists and files into an S3 bucket and an AWS Glue workflow that listens to S3 PutObject data events captured by AWS CloudTrail. This workflow is configured with an event-based trigger to run when an AWS Glue ingest job adds new files into the S3 bucket. The following diagram illustrates the architecture.

To make it simple to deploy, we captured the entire solution in an AWS CloudFormation template that enables you to automatically ingest SharePoint data into Amazon S3. SharePoint uses ClientID and TenantID credentials for authentication and uses Oauth2 for authorization.

The template helps you perform the following steps:

  1. Create an AWS Glue Python shell job to make the REST API call to the SharePoint server and ingest files or lists into Amazon S3.
  2. Create an AWS Glue workflow with a starting trigger of EVENT type.
  3. Configure CloudTrail to log data events, such as PutObject API calls to CloudTrail.
  4. Create a rule in EventBridge to forward the PutObject API events to AWS Glue when they’re emitted by CloudTrail.
  5. Add an AWS Glue event-driven workflow as a target to the EventBridge rule. The workflow gets triggered when the SharePoint ingest AWS Glue job adds new files to the S3 bucket.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Configure SharePoint server authentication details

Before launching the CloudFormation stack, you need to set up your SharePoint server authentication details, namely, TenantID, Tenant, ClientID, ClientSecret, and the SharePoint URL in AWS Systems Manager Parameter Store of the account you’re deploying in. This makes sure that no authentication details are stored in the code and they’re fetched in real time from Parameter Store when the solution is running.

To create your AWS Systems Manager parameters, complete the following steps:

  1. On the Systems Manager console, under Application Management in the navigation pane, choose Parameter Store.
    systems manager
  2. Choose Create Parameter.
  3. For Name, enter the parameter name /DATALAKE/GlueIngest/SharePoint/tenant.
  4. Leave the type as string.
  5. Enter your SharePoint tenant detail into the value field.
  6. Choose Create parameter.
  7. Repeat these steps to create the following parameters:
    1. /DataLake/GlueIngest/SharePoint/tenant
    2. /DataLake/GlueIngest/SharePoint/tenant_id
    3. /DataLake/GlueIngest/SharePoint/client_id/list
    4. /DataLake/GlueIngest/SharePoint/client_secret/list
    5. /DataLake/GlueIngest/SharePoint/client_id/file
    6. /DataLake/GlueIngest/SharePoint/client_secret/file
    7. /DataLake/GlueIngest/SharePoint/url/list
    8. /DataLake/GlueIngest/SharePoint/url/file

Deploy the solution with AWS CloudFormation

For a quick start of this solution, you can deploy the provided CloudFormation stack. This creates all the required resources in your account.

The CloudFormation template generates the following resources:

  • S3 bucket – Stores data, CloudTrail logs, job scripts, and any temporary files generated during the AWS Glue extract, transform, and load (ETL) job run.
  • CloudTrail trail with S3 data events enabled – Enables EventBridge to receive PutObject API call data in a specific bucket.
  • AWS Glue Job – A Python shell job that fetches the data from the SharePoint server.
  • AWS Glue workflow – A data processing pipeline that is comprised of a crawler, jobs, and triggers. This workflow converts uploaded data files into Apache Parquet format.
  • AWS Glue database – The AWS Glue Data Catalog database that holds the tables created in this walkthrough.
  • AWS Glue table – The Data Catalog table representing the Parquet files being converted by the workflow.
  • AWS Lambda function – The AWS Lambda function is used as an AWS CloudFormation custom resource to copy job scripts from an AWS Glue-managed GitHub repository and an AWS Big Data blog S3 bucket to your S3 bucket.
  • IAM roles and policies – We use the following AWS Identity and Access Management (IAM) roles:
    • LambdaExecutionRole – Runs the Lambda function that has permission to upload the job scripts to the S3 bucket.
    • GlueServiceRole – Runs the AWS Glue job that has permission to download the script, read data from the source, and write data to the destination after conversion.
    • EventBridgeGlueExecutionRole – Has permissions to invoke the NotifyEvent API for an AWS Glue workflow.
    • IngestGlueRole – Runs the AWS Glue job that has permission to ingest data into the S3 bucket.

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For pS3BucketName, enter the unique name of your new S3 bucket.
  5. Leave pWorkflowName and pDatabaseName as the default.

cloud formation 1

  1. For pDatasetName, enter the SharePoint list name or file name you want to ingest.
  2. Choose Next.

cloud formation 2

  1. On the next page, choose Next.
  2. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  3. Choose Create.

It takes a few minutes for the stack creation to complete; you can follow the progress on the Events tab.

You can run the ingest AWS Glue job either on a schedule or on demand. As the job successfully finishes and ingests data into the raw prefix of the S3 bucket, the AWS Glue workflow runs and transforms the ingested raw CSV files into Parquet files and loads them into the transformed prefix.

Review the EventBridge rule

The CloudFormation template created an EventBridge rule to forward S3 PutObject API events to AWS Glue. Let’s review the configuration of the EventBridge rule:

  1. On the EventBridge console, under Events, choose Rules.
  2. Choose the rule s3_file_upload_trigger_rule-<CloudFormation-stack-name>.
  3. Review the information in the Event pattern section.

event bridge

The event pattern shows that this rule is triggered when an S3 object is uploaded to s3://<bucket_name>/data/SharePoint/tablename_raw/. CloudTrail captures the PutObject API calls made and relays them as events to EventBridge.

  1. In the Targets section, you can verify that this EventBridge rule is configured with an AWS Glue workflow as a target.

event bridge target section

Run the ingest AWS Glue job and verify the AWS Glue workflow is triggered successfully

To test the workflow, we run the ingest-glue-job-SharePoint-file job using the following steps:

  1. On the AWS Glue console, select the ingest-glue-job-SharePoint-file job.

glue job

  1. On the Action menu, choose Run job.

glue job action menu

  1. Choose the History tab and wait until the job succeeds.

glue job history tab

You can now see the CSV files in the raw prefix of your S3 bucket.

csv file s3 location

Now the workflow should be triggered.

  1. On the AWS Glue console, validate that your workflow is in the RUNNING state.

glue workflow running status

  1. Choose the workflow to view the run details.
  2. On the History tab of the workflow, choose the current or most recent workflow run.
  3. Choose View run details.

glue workflow visual

When the workflow run status changes to Completed, let’s check the converted files in your S3 bucket.

  1. Switch to the Amazon S3 console, and navigate to your bucket.

You can see the Parquet files under s3://<bucket_name>/data/SharePoint/tablename_transformed/.

parquet file s3 location

Congratulations! Your workflow ran successfully based on S3 events triggered by uploading files to your bucket. You can verify everything works as expected by running a query against the generated table using Amazon Athena.

Sample wine dataset

Let’s analyze a sample red wine dataset. The following screenshot shows a SharePoint list that contains various readings that relate to the characteristics of the wine and an associated wine category. This is populated by various wine tasters from multiple countries.

redwine dataset

The following screenshot shows a supplier dataset from the data lake with wine categories ordered per supplier.

supplier dataset

We process the red wine dataset using this solution and use Athena to query the red wine data and supplier data where wine quality is greater than or equal to 7.

athena query and results

We can visualize the processed dataset using QuickSight.

Clean up

To avoid incurring unnecessary charges, you can use the AWS CloudFormation console to delete the stack that you deployed. This removes all the resources you created when deploying the solution.

Conclusion

Event-driven architectures provide access to near-real-time information and help you make business decisions on fresh data. In this post, we demonstrated how to ingest and process SharePoint data using AWS serverless services like AWS Glue and EventBridge. We saw how to configure a rule in EventBridge to forward events to AWS Glue. You can use this pattern for your analytical use cases, such as joining SharePoint data with other data in your lake to generate insights, or auditing SharePoint data and compliance requirements.


About the Author

Venkata Sistla is a Big Data & Analytics Consultant on the AWS Professional Services team. He specializes in building data processing capabilities and helping customers remove constraints that prevent them from leveraging their data to develop business insights.

Bringing AV1 Streaming to Netflix Members’ TVs

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/bringing-av1-streaming-to-netflix-members-tvs-b7fc88e42320

by Liwei Guo, Ashwin Kumar Gopi Valliammal, Raymond Tam, Chris Pham, Agata Opalach, Weibo Ni

AV1 is the first high-efficiency video codec format with a royalty-free license from Alliance of Open Media (AOMedia), made possible by wide-ranging industry commitment of expertise and resources. Netflix is proud to be a founding member of AOMedia and a key contributor to the development of AV1. The specification of AV1 was published in 2018. Since then, we have been working hard to bring AV1 streaming to Netflix members.

In February 2020, Netflix started streaming AV1 to the Android mobile app. The Android launch leveraged the open-source software decoder dav1d built by the VideoLAN, VLC, and FFmpeg communities and sponsored by AOMedia. We were very pleased to see that AV1 streaming improved members’ viewing experience, particularly under challenging network conditions.

While software decoders enable AV1 playback for more powerful devices, a majority of Netflix members enjoy their favorite shows on TVs. AV1 playback on TV platforms relies on hardware solutions, which generally take longer to be deployed.

Throughout 2020 the industry made impressive progress on AV1 hardware solutions. Semiconductor companies announced decoder SoCs for a range of consumer electronics applications. TV manufacturers released TVs ready for AV1 streaming. Netflix has also partnered with YouTube to develop an open-source solution for an AV1 decoder on game consoles that utilizes the additional power of GPUs. It is amazing to witness the rapid growth of the ecosystem in such a short time.

Today we are excited to announce that Netflix has started streaming AV1 to TVs. With this advanced encoding format, we are confident that Netflix can deliver an even more amazing experience to our members. In this techblog, we share some details about our efforts for this launch as well as the benefits we foresee for our members.

Enabling Netflix AV1 Streaming on TVs

Launching a new streaming format on TV platforms is not an easy job. In this section, we list a number of challenges we faced for this launch and share how they have been solved. As you will see, our “highly aligned, loosely coupled” culture played a key role in the success of this cross-functional project. The high alignment guides all teams to work towards the same goals, while the loose coupling keeps each team agile and fast paced.

Challenge 1: What is the best AV1 encoding recipe for Netflix streaming?

AV1 targets a wide range of applications with numerous encoding tools defined in the specification. This leads to unlimited possibilities of encoding recipes and we needed to find the one that works best for Netflix streaming.

Netflix serves movies and TV shows. Production teams spend tremendous effort creating this art, and it is critical that we faithfully preserve the original creative intent when streaming to our members. To achieve this goal, the Encoding Technologies team made the following design decisions about AV1 encoding recipes:

  • We always encode at the highest available source resolution and frame rate. For example, for titles where the source is 4K and high frame rate (HFR) such as “Formula 1: Drive to Survive”, we produce AV1 streams in 4K and HFR. This allows us to present the content exactly as creatively envisioned on devices and plans which support such high resolution and frame-rate playback.
  • All AV1 streams are encoded with 10 bit-depth even if AV1 Main Profile allows both 8 and 10 bit-depth. Almost all movies and TV shows are delivered to Netflix at 10 or higher bit-depth. Using 10-bit encoding can better preserve the creative intent and reduce the chances of artifacts (e.g., banding).
  • Dynamic optimization is used to adapt the recipe at the shot level and intelligently allocate bits. Streams on the Netflix service can easily be watched millions of times, and thus the optimization on the encoding side goes a long way in improving member experience. With dynamic optimization, we allocate more bits to more complex shots to meet Netflix’s high bar of visual quality, while encoding simple shots at the same high quality but with much fewer bits.

Challenge 2: How do we guarantee smooth AV1 playback on TVs?

We have a stream analyzer embedded in our encoding pipeline which ensures that all deployed Netflix AV1 streams are spec-compliant. TVs with an AV1 decoder also need to have decoding capabilities that meet the spec requirement to guarantee smooth playback of AV1 streams.

To evaluate decoder capabilities on these devices, the Encoding Technologies team crafted a set of special certification streams. These streams use the same production encoding recipes so they are representative of production streams, but have the addition of extreme cases to stress test the decoder. For example, some streams have a peak bitrate close to the upper limit allowed by the spec. The Client and UI Engineering team built a certification test with these streams to analyze both the device logs as well as the pictures rendered on the screen. Any issues observed in the test are flagged on a report, and if a gap in the decoding capability was identified, we worked with vendors to bring the decoder up to specification.

Challenge 3: How do we roll out AV1 encoding at Netflix scale?

Video encoding is essentially a search problem — the encoder searches the parameter space allowed by all encoding tools and finds the one that yields the best result. With a larger encoding tool set than previous codecs, it was no surprise that AV1 encoding takes more CPU hours. At the scale that Netflix operates, it is imperative that we use our computational resources efficiently; maximizing the impact of the CPU usage is a key part of AV1 encoding, as is the case with every other codec format.

The Encoding Technologies team took a first stab at this problem by fine-tuning the encoding recipe. To do so, the team evaluated different tools provided by the encoder, with the goal of optimizing the tradeoff between compression efficiency and computational efficiency. With multiple iterations, the team arrived at a recipe that significantly speeds up the encoding with negligible compression efficiency changes.

Besides speeding up the encoder, the total CPU hours could also be reduced if we can use compute resources more efficiently. The Performance Engineering team specializes in optimizing resource utilization at Netflix. Encoding Technologies teamed up with Performance Engineering to analyze the CPU usage pattern of AV1 encoding and based on our findings, Performance Engineering recommended an improved CPU scheduling strategy. This strategy improves encoding throughput by right-sizing jobs based on instance types.

Even with the above improvements, encoding the entire catalog still takes time. One aspect of the Netflix catalog is that not all titles are equally popular. Some titles (e.g., La Casa de Papel) have more viewing than others, and thus AV1 streams of these titles can reach more members. To maximize the impact of AV1 encoding while minimizing associated costs, the Data Science and Engineering team devised a catalog rollout strategy for AV1 that took into consideration title popularity and a number of other factors.

Challenge 4: How do we continuously monitor AV1 streaming?

With this launch, AV1 streaming reaches tens of millions of Netflix members. Having a suite of tools that can provide summarized metrics for these streaming sessions is critical to the success of Netflix AV1 streaming.

The Data Science and Engineering team built a number of dashboards for AV1 streaming, covering a wide range of metrics from streaming quality of experience (“QoE”) to device performance. These dashboards allow us to monitor and analyze trends over time as members stream AV1. Additionally, the Data Science and Engineering team built a dedicated AV1 alerting system which detects early signs of issues in key metrics and automatically sends alerts to teams for further investigation. Given AV1 streaming is at a relatively early stage, these tools help us be extra careful to avoid any negative member experience.

Quality of Experience Improvements

We compared AV1 to other codecs over thousands of Netflix titles, and saw significant compression efficiency improvements from AV1. While the result of this offline analysis was very exciting, what really matters to us is our members’ streaming experience.

To evaluate how the improved compression efficiency from AV1 impacts the quality of experience (QoE) of member streaming, A/B testing was conducted before the launch. Netflix encodes content into multiple formats and selects the best format for a given streaming session by considering factors such as device capabilities and content selection. Therefore, multiple A/B tests were created to compare AV1 with each of the applicable codec formats. In each of these tests, members with eligible TVs were randomly allocated to one of two cells, “control” and “treatment”. Those allocated to the “treatment” cell received AV1 streams while those allocated to the “control” cell received streams of the same codec format as before.

In all of these A/B tests, we observed improvements across many metrics for members in the “treatment” cell, in-line with our expectations:

Higher VMAF scores across the full spectrum of streaming sessions

  • VMAF is a video quality metric developed and open-sourced by Netflix, and is highly correlated to visual quality. Being more efficient, AV1 delivers videos with improved visual quality at the same bitrate, and thus higher VMAF scores.
  • The improvement is particularly significant among sessions that experience serious network congestion and the lowest visual quality. For these sessions, AV1 streaming improves quality by up to 10 VMAF without impacting the rebuffer rate.

More streaming at the highest resolution

  • With higher compression efficiency, the bandwidth needed for streaming is reduced and thus it is easier for playback to reach the highest resolution for that session.
  • For 4K eligible sessions, on average, the duration of 4K videos being streamed increased by about 5%.

Fewer noticeable drops in quality during playback

  • We want our members to have brilliant playback experiences, and our players are designed to adapt to the changing network conditions. When the current condition cannot sustain the current video quality, our players can switch to a lower bitrate stream to reduce the chance of a playback interruption. Given AV1 consumes less bandwidth for any given quality level, our players are able to sustain the video quality for a longer period of time and do not need to switch to a lower bitrate stream as much as before.
  • On some TVs, noticeable drops in quality were reduced by as much as 38%.

Reduced start play delay

  • On some TVs, with the reduced bitrate, the player can reach the target buffer level sooner to start the playback.
  • On average, we observed a 2% reduction in play delay with AV1 streaming.

Next Steps

Our initial launch includes a number of AV1 capable TVs as well as TVs connected with PS4 Pro. We are working with external partners to enable more and more devices for AV1 streaming. Another exciting direction we are exploring is AV1 with HDR. Again, the teams at Netflix are committed to delivering the best picture quality possible to our members. Stay tuned!

Acknowledgments

This is a collective effort with contributions from many of our colleagues at Netflix. We would like to thank

  • Andrey Norkin and Cyril Concolato for providing their insights about AV1 specifications.
  • Kyle Swanson for the work on reducing AV1 encoding complexity.
  • Anush Moorthy and Aditya Mavlankar for fruitful discussions about encoding recipes.
  • Frederic Turmel and his team for managing AV1 certification tests and building tools to automate device verification.
  • Susie Xia for helping improve resource utilization of AV1 encoding.
  • Client teams for integrating AV1 playback support and optimizing the experience.
  • The Partner Engineering team for coordinating with device vendors and investigating playback issues.
  • The Media Cloud Engineering team for accommodating the computing resources for the AV1 rollout.
  • The Media Content Playback team for providing tools for AV1 rollout management.
  • The Data Science and Engineering team for A/B test analysis, and for providing data to help us continuously monitor AV1.

If you are passionate about video technologies and interested in what we are doing at Netflix, come and chat with us! The Encoding Technologies team currently has a number of openings, and we can’t wait to have more stunning engineers joining us.

Senior Software Engineer, Encoding Technologies

Senior Software Engineer, Video & Image Encoding

Senior Software Engineer, Media Systems


Bringing AV1 Streaming to Netflix Members’ TVs was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

[$] Concurrency in Julia

Post Syndicated from original https://lwn.net/Articles/875367/rss

The Julia programming language has
it roots in high-performance scientific computing, so it is no surprise
that it has facilities for concurrent processing. Those features are not
well-known outside of the Julia community, though, so it is interesting to
see the different types of parallel and concurrent computation that the
language supports. In addition, the upcoming release of Julia
version 1.7
brings an improvement to the language’s
concurrent-computation palette,
in the form of “task
migration”.

How Roche democratized access to data with Google Sheets and Amazon Redshift Data API

Post Syndicated from Dr. Yannick Misteli original https://aws.amazon.com/blogs/big-data/how-roche-democratized-access-to-data-with-google-sheets-and-amazon-redshift-data-api/

This post was co-written with Dr. Yannick Misteli, João Antunes, and Krzysztof Wisniewski from the Roche global Platform and ML engineering team as the lead authors.

Roche is a Swiss multinational healthcare company that operates worldwide. Roche is the largest pharmaceutical company in the world and the leading provider of cancer treatments globally.

In this post, Roche’s global Platform and machine learning (ML) engineering team discuss how they used Amazon Redshift data API to democratize access to the data in their Amazon Redshift data warehouse with Google Sheets (gSheets).

Business needs

Go-To-Market (GTM) is the domain that lets Roche understand customers and create and deliver valuable services that meet their needs. This lets them get a better understanding of the health ecosystem and provide better services for patients, doctors, and hospitals. It extends beyond health care professionals (HCPs) to a larger Healthcare ecosystem consisting of patients, communities, health authorities, payers, providers, academia, competitors, etc. Data and analytics are essential to supporting our internal and external stakeholders in their decision-making processes through actionable insights.

For this mission, Roche embraced the modern data stack and built a scalable solution in the cloud.

Driving true data democratization requires not only providing business leaders with polished dashboards or data scientists with SQL access, but also addressing the requirements of business users that need the data. For this purpose, most business users (such as Analysts) leverage Excel—or gSheet in the case of Roche—for data analysis.

Providing access to data in Amazon Redshift to these gSheets users is a non-trivial problem. Without a powerful and flexible tool that lets data consumers use self-service analytics, most organizations will not realize the promise of the modern data stack. To solve this problem, we want to empower every data analyst who doesn’t have an SQL skillset with a means by which they can easily access and manipulate data in the applications that they are most familiar with.

The Roche GTM organization uses the Redshift Data API to simplify the integration between gSheets and Amazon Redshift, and thus facilitate the data needs of their business users for analytical processing and querying. The Amazon Redshift Data API lets you painlessly access data from Amazon Redshift with all types of traditional, cloud-native, and containerized, serverless web service-based applications and event-driven applications. Data API simplifies data access, ingest, and egress from languages supported with AWS SDK, such as Python, Go, Java, Node.js, PHP, Ruby, and C++ so that you can focus on building applications as opposed to managing infrastructure. The process they developed using Amazon Redshift Data API has significantly lowered the barrier for entry for new users without needing any data warehousing experience.

Use-Case

In this post, you will learn how to integrate Amazon Redshift with gSheets to pull data sets directly back into gSheets. These mechanisms are facilitated through the use of the Amazon Redshift Data API and Google Apps Script. Google Apps Script is a programmatic way of manipulating and extending gSheets and the data that they contain.

Architecture

It is possible to include publicly available JS libraries such as JQuery-builder provided that Apps Script is natively a cloud-based Javascript platform.

The JQuery builder library facilitates the creation of standard SQL queries via a simple-to-use graphical user interface. The Redshift Data API can be used to retrieve the data directly to gSheets with a query in place. The following diagram illustrates the overall process from a technical standpoint:

Even though AppsScript is, in fact, Javascript, the AWS-provided SDKs for the browser (NodeJS and React) cannot be used on the Google platform, as they require specific properties that are native to the underlying infrastructure. It is possible to authenticate and access AWS resources through the available API calls. Here is an example of how to achieve that.

You can use an access key ID and a secret access key to authenticate the requests to AWS by using the code in the link example above. We recommend following the least privilege principle when granting access to this programmatic user, or assuming a role with temporary credentials. Since each user will require a different set of permissions on the Redshift objects—database, schema, and table—each user will have their own user access credentials. These credentials are safely stored under the AWS Secrets Manager service. Therefore, the programmatic user needs a set of permissions that enable them to retrieve secrets from the AWS Secrets Manager and execute queries against the Redshift Data API.

Code example for AppScript to use Data API

In this section, you will learn how to pull existing data back into a new gSheets Document. This section will not cover how to parse the data from the JQuery-builder library, as it is not within the main scope of the article.

<script src="https://cdn.jsdelivr.net/npm/jQuery-QueryBuilder/dist/js/query-builder.standalone.min.js"></script>    
  1. In the AWS console, go to Secrets Manager and create a new secret to store the database credentials to access the Redshift Cluster: username and password. These will be used to grant Redshift access to the gSheets user.
  2. In the AWS console, create a new IAM user with programmatic access, and generate the corresponding Access Key credentials. The only set of policies required for this user is to be able to read the secret created in the previous step from the AWS Secrets Manager service and to query the Redshift Data API.

    Below is the policy document:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Action": [
            "secretsmanager:GetSecretValue",
            "secretsmanager:DescribeSecret"
          ],
          "Resource": "arn:aws:secretsmanager:*::secret:*"
        },
        {
          "Sid": "VisualEditor1",
          "Effect": "Allow",
          "Action": "secretsmanager:ListSecrets",
          "Resource": "*"
        },
        {
          "Sid": "VisualEditor2",
          "Effect": "Allow",
          "Action": "redshift-data:*",
          "Resource": "arn:aws:redshift:*::cluster:*"
        }
      ]
    }

  3. Access the Google Apps Script console. Create an aws.gs file with the code available here. This will let you perform authenticated requests to the AWS services by providing an access key and a secret access key.
  4. Initiate the AWS variable providing the access key and secret access key created in step 3.
    AWS.init("<ACCESS_KEY>", "<SECRET_KEY>");

  5. Request the Redshift username and password from the AWS Secrets Manager:
    function runGetSecretValue_(secretId) {
     
      var resultJson = AWS.request(
        	getSecretsManagerTypeAWS_(),
        	getLocationAWS_(),
        	'secretsmanager.GetSecretValue',
        	{"Version": getVersionAWS_()},
        	method='POST',
        	payload={          
          	"SecretId" : secretId
        	},
        	headers={
          	"X-Amz-Target": "secretsmanager.GetSecretValue",
          	"Content-Type": "application/x-amz-json-1.1"
        	}
      );
     
      Logger.log("Execute Statement result: " + resultJson);
      return JSON.parse(resultJson);
     
    }

  6. Query a table using the Amazon Redshift Data API:
    function runExecuteStatement_(sql) {
     
      var resultJson = AWS.request(
        	getTypeAWS_(),
        	getLocationAWS_(),
        	'RedshiftData.ExecuteStatement',
        	{"Version": getVersionAWS_()},
        	method='POST',
        	payload={
          	"ClusterIdentifier": getClusterIdentifierReshift_(),
          	"Database": getDataBaseRedshift_(),
          	"DbUser": getDbUserRedshift_(),
          	"Sql": sql
        	},
        	headers={
          	"X-Amz-Target": "RedshiftData.ExecuteStatement",
          	"Content-Type": "application/x-amz-json-1.1"
        	}
      ); 
     
      Logger.log("Execute Statement result: " + resultJson); 

  7. The result can then be displayed as a table in gSheets:
    function fillGsheet_(recordArray) { 
     
      adjustRowsCount_(recordArray);
     
      var rowIndex = 1;
      for (var i = 0; i < recordArray.length; i++) {  
           
    	var rows = recordArray[i];
    	for (var j = 0; j < rows.length; j++) {
      	var columns = rows[j];
      	rowIndex++;
      	var columnIndex = 'A';
         
      	for (var k = 0; k < columns.length; k++) {
           
        	var field = columns[k];       
        	var value = getFieldValue_(field);
        	var range = columnIndex + rowIndex;
        	addToCell_(range, value);
     
        	columnIndex = nextChar_(columnIndex);
     
      	}
     
    	}
     
      }
     
    }

  8. Once finished, the Apps Script can be deployed as an Addon that enables end-users from an entire organization to leverage the capabilities of retrieving data from Amazon Redshift directly into their spreadsheets. Details on how Apps Script code can be deployed as an Addon can be found here.

How users access Google Sheets

  1. Open a gSheet, and go to manage addons -> Install addon:
  2. Once the Addon is successfully installed, select the Addon menu and select Redshift Synchronization. A dialog will appear prompting the user to select the combination of database, schema, and table from which to load the data.
  3. After choosing the intended table, a new panel will appear on the right side of the screen. Then, the user is prompted to select which columns to retrieve from the table, apply any filtering operation, and/or apply any aggregations to the data.
  4. Upon submitting the query, app scripts will translate the user selection into a query that is sent to the Amazon Redshift Data API. Then, the returned data is transformed and displayed as a regular gSheet table:

Security and Access Management

In the scripts above, there is a direct integration between AWS Secrets Manager and Google Apps Script. The scripts above can extract the currently-authenticated user’s Google email address. Using this value and a set of annotated tags, the script can appropriately pull the user’s credentials securely to authenticate the requests made to the Amazon Redshift cluster. Follow these steps to set up a new user in an existing Amazon Redshift cluster. Once the user has been created, follow these steps for creating a new AWS Secrets Manager secret for your cluster. Make sure that the appropriate tag is applied with the key of “email” along with the corresponding user’s Google email address. Here is a sample configuration that is used for creating Redshift groups, users, and data shares via the Redshift Data API:

connection:
 redshift_super_user_database: dev
 redshift_secret_name: dev_
 redshift_cluster_identifier: dev-cluster
 redshift_secrets_stack_name: dev-cluster-secrets
 environment: dev
 aws_region: eu-west-1
 tags:
   - key: "Environment"
 	value: "dev"
users:
 - name: user1
   email: [email protected]
 data_shares:
 - name: test_data_share
   schemas:
 	- schema1
   redshift_namespaces:
 	- USDFJIL234234WE
group:
 - name: readonly
   users:
 	- user1
   databases:
 	- database: database1
   	exclude-schemas:
     	- public
     	- pg_toast
     	- catalog_history
   	include-schemas:
     	- schema1
   	grant:
     	- select

Operational Metrics and Improvement

Providing access to live data that is hosted in Redshift directly to the business users and enabling true self-service decrease the burden on platform teams to provide data extracts or other mechanisms to deliver up-to-date information. Additionally, by not having different files and versions of data circulating, the business risk of reporting different key figures or KPI can be reduced, and an overall process efficiency can be achieved.

The initial success of this add-on in GTM led to the extension of this to a broader audience, where we are hoping to serve hundreds of users with all of the internal and public data in the future.

Conclusion

In this post, you learned how to create new Amazon Redshift tables and pull existing Redshift tables into a Google Sheet for business users to easily integrate with and manipulate data. This integration was seamless and demonstrated how easy the Amazon Redshift Data API makes integration with external applications, such as Google Sheets with Amazon Redshift. The outlined use-cases above are just a few examples of how the Amazon Redshift Data API can be applied and used to simplify interactions between users and Amazon Redshift clusters.


About the Authors

Dr. Yannick Misteli is leading cloud platform and ML engineering teams in global product strategy (GPS) at Roche. He is passionate about infrastructure and operationalizing data-driven solutions, and he has broad experience in driving business value creation through data analytics.

João Antunes is a Data Engineer in the Global Product Strategy (GPS) team at Roche. He has a track record of deploying Big Data batch and streaming solutions for the telco, finance, and pharma industries.

Krzysztof Wisniewski is a back-end JavaScript developer in the Global Product Strategy (GPS) team at Roche. He is passionate about full-stack development from the front-end through the back-end to databases.

Matt Noyce is a Senior Cloud Application Architect at AWS. He works together primarily with Life Sciences and Healthcare customers to architect and build solutions on AWS for their business needs.

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Migrating to an Amazon Redshift Cloud Data Warehouse from Microsoft APS

Post Syndicated from Sudarshan Roy original https://aws.amazon.com/blogs/architecture/migrating-to-an-amazon-redshift-cloud-data-warehouse-from-microsoft-aps/

Before cloud data warehouses (CDWs), many organizations used hyper-converged infrastructure (HCI) for data analytics. HCIs pack storage, compute, networking, and management capabilities into a single “box” that you can plug into your data centers. However, because of its legacy architecture, an HCI is limited in how much it can scale storage and compute and continue to perform well and be cost-effective. Using an HCI can impact your business’s agility because you need to plan in advance, follow traditional purchase models, and maintain unused capacity and its associated costs. Additionally, HCIs are often proprietary and do not offer the same portability, customization, and integration options as with open-standards-based systems. Because of their proprietary nature, migrating HCIs to a CDW can present technical hurdles, which can impact your ability to realize the full potential of your data.

One of these hurdles includes using AWS Schema Conversion Tool (AWS SCT). AWS SCT is used to migrate data warehouses, and it supports several conversions. However, when you migrate Microsoft’s Analytics Platform System (APS) SQL Server Parallel Data Warehouse (PDW) platform using only AWS SCT, it results in connection errors due to the lack of server-side cursor support in Microsoft APS. In this blog post, we show you three approaches that use AWS SCT combined with other AWS services to migrate Microsoft’s Analytics Platform System (APS) SQL Server Parallel Data Warehouse (PDW) HCI platform to Amazon Redshift. These solutions will help you overcome elasticity, scalability, and agility constraints associated with proprietary HCI analytics platforms and future proof your analytics investment.

AWS Schema Conversion Tool

Though using AWS SCT only will result in server-side cursor errors, you can pair it with other AWS services to migrate your data warehouses to AWS. AWS SCT converts source database schema and code objects, including views, stored procedures, and functions, to be compatible with a target database. It highlights objects that require manual intervention. You can also scan your application source code for embedded SQL statements as part of database-schema conversion project. During this process, AWS SCT optimizes cloud-native code by converting legacy Oracle and SQL Server functions to their equivalent AWS service. This helps you modernize applications simultaneously. Once conversion is complete, AWS SCT can also migrate data.

Figure 1 shows a standard AWS SCT implementation architecture.

AWS SCT migration approach

Figure 1. AWS SCT migration approach

The next section shows you how to pair AWS SCT with other AWS services to migrate a Microsoft APS PDW to Amazon Redshift CDW. We prove you a base approach and two extensions to use for data warehouses with larger datasets and longer release outage windows.

Migration approach using SQL Server on Amazon EC2

The base approach uses Amazon Elastic Compute Cloud (Amazon EC2) to host a SQL Server in a symmetric multi-processing (SMP) architecture that is supported by AWS SCT, as opposed to Microsoft’s APS PDW’s massively parallel processing (MPP) architecture. By changing the warehouse’s architecture from MPP to SMP and using AWS SCT, you’ll avoid server-side cursor support errors.

Here’s how you’ll set up the base approach (Figure 2):

  1. Set up the SMP SQL Server on Amazon EC2 and AWS SCT in your AWS account.
  2. Set up Microsoft tools, including SQL Server Data Tools (SSDT), remote table copy, and SQL Server Integration Services (SSIS).
  3. Use the Application Diagnostic Utility (ADU) and SSDT to connect and extract table lists, indexes, table definitions, view definitions, and stored procedures.
  4. Generate data description languages (DDLs) using step 3 outputs.
  5. Apply these DDLs to the SMP SQL Server on Amazon EC2.
  6. Run AWS SCT against the SMP SQL database to begin migrating schema and data to Amazon Redshift.
  7. Extract data using remote table copy from source, which copies data into the SMP SQL Server.
  8. Load this data into Amazon Redshift using AWS SCT or AWS Database Migration Service (AWS DMS).
  9. Use SSIS to load delta data from source to the SMP SQL Server on Amazon EC2.
Base approach using SMP SQL Server on Amazon EC2

Figure 2. Base approach using SMP SQL Server on Amazon EC2

Extending the base approach

The base approach overcomes server-side issues you would have during a direct migration. However, many organizations host terabytes (TB) of data. To migrate such a large dataset, you’ll need to adjust your approach.

The following sections extend the base approach. They still use the base approach to convert the schema and procedures, but the dataset is handled via separate processes.

Extension 1: AWS Snowball Edge

Note: AWS Snowball Edge is a Region-specific service. Verify that the service is available in your Region before planning your migration. See Regional Table to verify availability.

Snowball Edge lets you transfer large datasets to the cloud at faster-than-network speeds. Each Snowball Edge device can hold up to 100 TB and uses 256-bit encryption and an industry-standard Trusted Platform Module to ensure security and full chain-of-custody for your data. Furthermore, higher volumes can be transferred by clustering 5–10 devices for increased durability and storage.

Extension 1 enhances the base approach to allow you to transfer large datasets (Figure 3) while simultaneously setting up an SMP SQL Server on Amazon EC2 for delta transfers. Here’s how you’ll set it up:

  1. Once Snowball Edge is enabled in the on-premises environment, it allows data transfer via network file system (NFS) endpoints. The device can then be used with standard Microsoft tools like SSIS, remote table copy, ADU, and SSDT.
  2. While the device is being shipped back to an AWS facility, you’ll set up an SMP SQL Server database on Amazon EC2 to replicate the base approach.
  3. After your data is converted, you’ll apply a converted schema to Amazon Redshift.
  4. Once the Snowball Edge arrives at the AWS facility, data is transferred to the SMP SQL Server database.
  5. You’ll subsequently run schema conversions and initial and delta loads per the base approach.
Solution extension that uses Snowball Edge for large datasets

Figure 3. Solution extension that uses Snowball Edge for large datasets

Note: Where sequence numbers overlap in the diagram is a suggestion to possible parallel execution

Extension 1 transfers initial load and later applies delta load. This adds time to the project because of longer cutover release schedules. Additionally, you’ll need to plan for multiple separate outages, Snowball lead times, and release management timelines.

Note that not all analytics systems are classified as business-critical systems, so they can withstand a longer outage, typically 1-2 days. This gives you an opportunity to use AWS DataSync as an additional extension to complete initial and delta load in a single release window.

Extension 2: AWS DataSync

DataSync speeds up data transfer between on-premises environments and AWS. It uses a purpose-built network protocol and a parallel, multi-threaded architecture to accelerate your transfers.

Figure 4 shows the solution extension, which works as follows:

  1. Create SMP MS SQL Server on EC2 and the DDL, as shown in the base approach.
  2. Deploy DataSync agent(s) in your on-premises environment.
  3. Provision and mount an NFS volume on the source analytics platform and DataSync agent(s).
  4. Define a DataSync transfer task after the agents are registered.
  5. Extract initial load from source onto the NFS mount that will be uploaded to Amazon Simple Storage Service (Amazon S3).
  6. Load data extracts into the SMP SQL Server on Amazon EC2 instance (created using base approach).
  7. Run delta loads per base approach, or continue using solution extension for delta loads.
Solution extension that uses DataSync for large datasets

Figure 4. Solution extension that uses DataSync for large datasets

Note: where sequence numbers overlap in the diagram is a suggestion to possible parallel execution

Transfer rates for DataSync depend on the amount of data, I/O, and network bandwidth available. A single DataSync agent can fully utilize a 10 gigabit per second (Gbps) AWS Direct Connect link to copy data from on-premises to AWS. As such, depending on initial load size, transfer window calculations must be done prior to finalizing transfer windows.

Conclusion

The approach and its extensions mentioned in this blog post provide mechanisms to migrate your Microsoft APS workloads to an Amazon Redshift CDW. They enable elasticity, scalability, and agility for your workload to future proof your analytics investment.

Related information

The collective thoughts of the interwebz