Tag Archives: api

Summarize devices that are not reachable

Post Syndicated from Aigars Kadiķis original https://blog.zabbix.com/summarize-devices-that-are-not-reachable/13219/

In this lab, we will list all devices that are not reachable by the monitoring tool. This is useful when we want to improve the overall monitoring experience and decrease the size of the queue (metrics that have not arrived at the instance).

Tools required for the job: Access to a database server or a Windows computer with PowerShell

To summarize devices that are not reachable at the moment, we can use a database query. It is tested and works on Zabbix 4.0 and 5.0, on both MySQL and PostgreSQL:

SELECT hosts.host,
       interface.ip,
       interface.dns,
       interface.useip,
       CASE interface.type
           WHEN 1 THEN 'ZBX'
           WHEN 2 THEN 'SNMP'
           WHEN 3 THEN 'IPMI'
           WHEN 4 THEN 'JMX'
       END AS "type",
       hosts.error
FROM hosts
JOIN interface ON interface.hostid=hosts.hostid
WHERE hosts.available=2
  AND interface.main=1
  AND hosts.status=0;
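For example, on a MySQL backend, the query can be saved to a file and run from the shell. This is a minimal sketch; it assumes the database is named 'zabbix' and the 'zabbix' user has read access:

# save the query above as unreachable.sql, then run:
mysql -t -u zabbix -p zabbix < unreachable.sql

The -t flag prints the result as an ASCII table, just like the interactive mysql prompt does.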

A very similar (but not exactly the same) outcome can be obtained via Windows PowerShell by contacting the Zabbix API. Try this snippet:

$headers = New-Object "System.Collections.Generic.Dictionary[[String],[String]]"
$headers.Add("Content-Type", "application/json")
$url = 'http://192.168.1.101/api_jsonrpc.php'
$user = 'api'
$password = 'zabbix'

# authorization
$key = Invoke-RestMethod $url -Method 'POST' -Headers $headers -Body "
{
    `"jsonrpc`": `"2.0`",
    `"method`": `"user.login`",
    `"params`": {
        `"user`": `"$user`",
        `"password`": `"$password`"
    },
    `"id`": 1
}
" | foreach { $_.result }
echo $key

# filter out unreachable Agent, SNMP, JMX, IPMI hosts
Invoke-RestMethod $url -Method 'POST' -Headers $headers -Body "
{
    `"jsonrpc`": `"2.0`",
    `"method`": `"host.get`",
    `"params`": {
        `"output`": [`"interfaces`",`"host`",`"proxy_hostid`",`"disable_until`",`"lastaccess`",`"errors_from`",`"error`"],
        `"selectInterfaces`": `"extend`",
        `"filter`": {`"available`": `"2`",`"status`":`"0`"}
    },
    `"auth`": `"$key`",
    `"id`": 1
}
" | foreach { $_.result }  | foreach { $_.interfaces } | Out-GridView

# log out
Invoke-RestMethod $url -Method 'POST' -Headers $headers -Body "
{
    `"jsonrpc`": `"2.0`",
    `"method`": `"user.logout`",
    `"params`": [],
    `"id`": 1,
    `"auth`": `"$key`"
}
"

Set valid credentials (URL, username, password) at the top of the code before executing it.

The benefit of PowerShell here is that Out-GridView allows some on-the-fly filtering of the results:

The exact meaning of the field ‘type’ can be understood by looking at the previous database query:

       CASE interface.type
           WHEN 1 THEN 'ZBX'
           WHEN 2 THEN 'SNMP'
           WHEN 3 THEN 'IPMI'
           WHEN 4 THEN 'JMX'
       END AS "type",

On Windows PowerShell, it is possible to save the unreachable hosts directly to a CSV file. To do that, in the code above, we need to change:

Out-GridView

to

Export-Csv c:\temp\unavailable.hosts.csv

Alright, this was the knowledge bit today. Let’s keep Zabbixing!

A Byzantine failure in the real world

Post Syndicated from Tom Lianza original https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/

An analysis of the Cloudflare API availability incident on 2020-11-02

When we review design documents at Cloudflare, we are always on the lookout for Single Points of Failure (SPOFs). Eliminating these is a necessary step in architecting a system you can be confident in. Ironically, when you’re designing a system with built-in redundancy, you spend most of your time thinking about how well it functions when that redundancy is lost.

On November 2, 2020, Cloudflare had an incident that impacted the availability of the API and dashboard for six hours and 33 minutes. During this incident, the success rate for queries to our API periodically dipped as low as 75%, and the dashboard experience was as much as 80 times slower than normal. While Cloudflare’s edge is massively distributed across the world (and kept working without a hitch), Cloudflare’s control plane (API & dashboard) is made up of a large number of microservices that are redundant across two regions. For most services, the databases backing those microservices are only writable in one region at a time.

Each of Cloudflare’s control plane data centers has multiple racks of servers. Each of those racks has two switches that operate as a pair—both are normally active, but either can handle the load if the other fails. Cloudflare survives rack-level failures by spreading the most critical services across racks. Every piece of hardware has two or more power supplies with different power feeds. Every server that stores critical data uses RAID 10 redundant disks or storage systems that replicate data across at least three machines in different racks, or both. Redundancy at each layer is something we review and require. So—how could things go wrong?

In this post we present a timeline of what happened, and how a difficult failure mode known as a Byzantine fault played a role in a cascading series of events.

2020-11-02 14:43 UTC: Partial Switch Failure

At 14:43, a network switch started misbehaving. Alerts began firing about the switch being unreachable to pings. The device was in a partially operating state: network control plane protocols such as LACP and BGP remained operational, while others, such as vPC, were not. The vPC link is used to synchronize ports across multiple switches, so that they appear as one large, aggregated switch to servers connected to them. At the same time, the data plane (or forwarding plane) was not processing and forwarding all the packets received from connected devices.

This failure scenario is completely invisible to the connected nodes, as each server only sees an issue for some of its traffic due to the load-balancing nature of LACP. Had the switch failed fully, all traffic would have failed over to the peer switch, as the connected links would’ve simply gone down, and the ports would’ve dropped out of the forwarding LACP bundles.

Six minutes later, the switch recovered without human intervention. But this odd failure mode led to further problems that lasted long after the switch had returned to normal operation.

2020-11-02 14:44 UTC: etcd Errors begin

The rack with the misbehaving switch included one server in our etcd cluster. We use etcd heavily in our core data centers whenever we need strongly consistent data storage that’s reliable across multiple nodes.

In the event that the cluster leader fails, etcd uses the RAFT protocol to maintain consistency and establish consensus to promote a new leader. In the RAFT protocol, cluster members are assumed to be either available or unavailable, and to provide accurate information or none at all. This works fine when a machine crashes, but is not always able to handle situations where different members of the cluster have conflicting information.

In this particular situation:

  • Network traffic between node 1 (in the affected rack) and node 3 (the leader) was being sent through the switch in the degraded state,
  • Network traffic between node 1 and node 2 was going through its working peer switch, and
  • Network traffic between node 2 and node 3 was unaffected.

This caused cluster members to have conflicting views of reality, known in distributed systems theory as a Byzantine fault. As a consequence of this conflicting information, node 1 repeatedly initiated leader elections, voting for itself, while node 2 repeatedly voted for node 3, which it could still connect to. This resulted in ties that did not promote a leader node 1 could reach. RAFT leader elections are disruptive, blocking all writes until they’re resolved, so this made the cluster read-only until the faulty switch recovered and node 1 could once again reach node 3.

2020-11-02 14:45 UTC: Database system promotes a new primary database

Cloudflare’s control plane services use relational databases hosted across multiple clusters within a data center. Each cluster is configured for high availability. The cluster setup includes a primary database, a synchronous replica, and one or more asynchronous replicas. This setup allows redundancy within a data center. For cross-datacenter redundancy, a similar high availability secondary cluster is set up and replicated in a geographically dispersed data center for disaster recovery. The cluster management system leverages etcd for cluster member discovery and coordination.

When etcd became read-only, two clusters were unable to communicate that they had a healthy primary database. This triggered the automatic promotion of a synchronous database replica to become the new primary. This process happened automatically and without error or data loss.

There was a defect in our cluster management system that required a rebuild of all database replicas whenever a new primary database was promoted. So, although the new primary database was available instantly, the replicas would take considerable time to become available, depending on the size of the database. For one of the clusters, service was restored quickly. Synchronous and asynchronous database replicas were rebuilt and started replicating successfully from the primary, and the impact was minimal.

For the other cluster, however, performant operation of that database required a replica to be online. Because this database handles authentication for API calls and dashboard activities, it takes a lot of reads, and one replica was heavily utilized to spare the primary the load. When this failover happened and no replicas were available, the primary was overloaded, as it had to take all of the load. This is when the main impact started.

Reduce Load, Leverage Redundancy

At this point we saw that our primary authentication database was overwhelmed and began shedding load from it. We dialed back the rate at which we push SSL certificates to the edge, send emails, and perform other tasks, to give it space to handle the additional load. Unfortunately, because of its size, we knew it would take several hours for a replica to be fully rebuilt.

A silver lining here is that every database cluster in our primary data center also has online replicas in our secondary data center. Those replicas are not part of the local failover process, and were online and available throughout the incident. The process of steering read-queries to those replicas was not yet automated, so we manually diverted API traffic that could leverage those read replicas to the secondary data center. This substantially improved our API availability.

The Dashboard

The Cloudflare dashboard, like most web applications, has the notion of a user session. When user sessions are created (each time a user logs in) we perform some database operations and keep data in a Redis cluster for the duration of that user’s session. Unlike our API calls, our user sessions cannot currently be moved across the ocean without disruption. As we took actions to improve the availability of our API calls, we were unfortunately making the user experience on the dashboard worse.

This is an area of the system that is currently designed to be able to fail over across data centers in the event of a disaster, but has not yet been designed to work in both data centers at the same time. After a first period in which users on the dashboard became increasingly frustrated, we failed the authentication calls fully back to our primary data center, and kept working on our primary database to ensure we could provide the best service levels possible in that degraded state.

2020-11-02 21:20 UTC: Database Replica Rebuilt

The instant the first database replica was rebuilt, it put itself back into service, and performance returned to normal levels. We re-ramped all of the services that had been turned down, so all asynchronous processing could catch up, and after a period of monitoring marked the end of the incident.

Redundant Points of Failure

The cascade of failures in this incident was interesting because each system, on its face, had redundancy. Moreover, no system fully failed—each entered a degraded state. That combination meant the chain of events that transpired was considerably harder to model and anticipate. It was frustrating yet reassuring that some of the possible failure modes were already being addressed.

A team was already working on fixing the limitation that requires a database replica rebuild upon promotion. Our user sessions system was inflexible in scenarios where we’d like to steer traffic around, and redesigning that was already in progress.

This incident also led us to revisit the configuration parameters we put in place for things that auto-remediate. In previous years, promoting a database replica to primary took far longer than we liked, so getting that process automated and able to trigger on a minute’s notice was a point of pride. At the same time, for at least one of our databases, the cure may be worse than the disease, and in fact we may not want to invoke the promotion process so quickly. Immediately after this incident we adjusted that configuration accordingly.

Byzantine Fault Tolerance (BFT) is a hot research topic. Solutions have been known since 1982, but have had to choose between a variety of engineering tradeoffs, including security, performance, and algorithmic simplicity. Most general-purpose cluster management systems choose to forgo BFT entirely and use protocols based on PAXOS, or simplifications of PAXOS such as RAFT, that perform better and are easier to understand than BFT consensus protocols. In many cases, a simple protocol that is known to be vulnerable to a rare failure mode is safer than a complex protocol that is difficult to implement correctly or debug.

The first uses of BFT consensus were in safety-critical systems such as aircraft and spacecraft controls. These systems typically have hard real time latency constraints that require tightly coupling consensus with application logic in ways that make these implementations unsuitable for general-purpose services like etcd. Contemporary research on BFT consensus is mostly focused on applications that cross trust boundaries, which need to protect against malicious cluster members as well as malfunctioning cluster members. These designs are more suitable for implementing general-purpose services such as etcd, and we look forward to collaborating with researchers and the open source community to make them suitable for production cluster management.

We are very sorry for the difficulty the outage caused, and are continuing to improve as our systems grow. We’ve since fixed the bug in our cluster management system, and are continuing to tune each of the systems involved in this incident to be more resilient to failures of their dependencies.  If you’re interested in helping solve these problems at scale, please visit cloudflare.com/careers.

Close problem automatically via Zabbix API

Post Syndicated from Aigars Kadiķis original https://blog.zabbix.com/close-problem-automatically-via-zabbix-api/12461/

Today we are talking about a use case where it’s impossible to find a proper way to write a recovery expression for a Zabbix trigger. In other words, we know how to identify problems, but there is no good way to detect when a problem is gone.

This mostly relates to huge environments, for example:

  • One log file with hundreds of patterns inside. We respect all of them. We need them all.
  • An SNMP trap item (snmptrap.fallback) with different patterns being written inside

In these situations, the trigger is most likely configured with “Event generation mode: Multiple”. In practice this means: every time a “problematic metric” hits the instance, it opens one more problem.

Goal:
I just need to receive an email about the record, then close the event.

As a workaround (let’s call it a solution here), we can define an action which will:

  1. contact an API endpoint
  2. manually acknowledge the event and close it

This functionality is possible because, when an event hits the action, the operation actually knows the event ID of the problem. The macro {EVENT.ID} saves the day.

To solve the problem, we need to define the API connection details as global macros:

     {$Z_API_PHP}=http://127.0.0.1/api_jsonrpc.php
    {$Z_API_USER}=api
{$Z_API_PASSWORD}=zabbix

NOTE
‘http://127.0.0.1/api_jsonrpc.php’ means the frontend runs on the same server as systemd:zabbix-server. If that is not the case, we need to use the front-end address of the Zabbix GUI and append ‘api_jsonrpc.php’.
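Before wiring these macros into an action, we can verify that the endpoint answers API calls at all. A quick sketch using the apiinfo.version method, which requires no authentication token:

curl -sk -X POST -H "Content-Type: application/json" -d '
{
    "jsonrpc": "2.0",
    "method": "apiinfo.version",
    "params": [],
    "id": 1
}
' http://127.0.0.1/api_jsonrpc.php

It should print the Zabbix version, confirming that the URL is correct.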

We will have 2 actions. The first one will deliver a notification to email:

After 1 minute, a second action will close the event:

This is the full bash snippet we must put inside the operation. No need to change anything; it works with copy and paste:

     url={$Z_API_PHP}
    user={$Z_API_USER}
password={$Z_API_PASSWORD}

# authorization
auth=$(curl -sk -X POST -H "Content-Type: application/json" -d "
{
	\"jsonrpc\": \"2.0\",
	\"method\": \"user.login\",
	\"params\": {
		\"user\": \"$user\",
		\"password\": \"$password\"
	},
	\"id\": 1,
	\"auth\": null
}
" $url | \
grep -E -o "([0-9a-f]{32,32})")

# acknowledge and close event
curl -sk -X POST -H "Content-Type: application/json" -d "
{
	\"jsonrpc\": \"2.0\",
	\"method\": \"event.acknowledge\",
	\"params\": {
		\"eventids\": \"{EVENT.ID}\",
		\"action\": 1,
		\"message\": \"Problem resolved.\"
	},
	\"auth\": \"$auth\",
	\"id\": 1
}" $url

# log out to invalidate the session key
curl -sk -X POST -H "Content-Type: application/json" -d "
{
    \"jsonrpc\": \"2.0\",
    \"method\": \"user.logout\",
    \"params\": [],
    \"id\": 1,
    \"auth\": \"$auth\"
}
" $url

Zabbix API scripting via curl and jq

Post Syndicated from Aigars Kadiķis original https://blog.zabbix.com/zabbix-api-scripting-via-curl-and-jq/12434/

In this lab we will use a bash environment and the utilities ‘curl’ and ‘jq’ to perform Zabbix API calls and do some scripting.

‘curl’ is a tool to exchange JSON messages over HTTP/HTTPS.
The ‘jq’ utility helps to locate and extract specific elements in the output.

To follow the lab we need to install ‘jq’:

# On CentOS7/RHEL7:
yum install epel-release && yum install jq

# On CentOS8/RHEL8:
dnf install jq

# On Ubuntu/Debian:
apt install jq

# On any 64-bit Linux platform:
curl -skL "https://github.com/stedolan/jq/releases/download/jq1.5/jq-linux64" -o /usr/bin/jq && chmod +x /usr/bin/jq

Obtaining an authorization token

In order to operate with API calls we need to:

  • Define an API endpoint. This is a URL, a PHP file designed to accept requests
  • Obtain an authorization token

If you execute API calls from the frontend server, then most likely the endpoint is one of:

url=http://127.0.0.1/api_jsonrpc.php
# or:
url=http://127.0.0.1/zabbix/api_jsonrpc.php

Setting the url variable is required before jumping to the next step. Test whether you have it configured:

echo $url

Every API call needs to carry an authorization token. To put a token in a variable, use the command:

auth=$(curl -s -X POST -H 'Content-Type: application/json-rpc' \
-d '
{"jsonrpc":"2.0","method":"user.login","params":
{"user":"api","password":"zabbix"},
"id":1,"auth":null}
' $url | \
jq -r .result
)

Note
Notice there is user ‘api’ with password ‘zabbix’. This is a dedicated user for API calls.

Check if you have a session key. It should be a 32-character hex string:

echo $auth
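For scripting, it is handy to fail fast when the login did not succeed. A small sketch that validates the token format before continuing:

# stop if user.login did not return a 32-character hex session key
if ! [[ "$auth" =~ ^[0-9a-f]{32}$ ]]; then
    echo "user.login failed, check url/user/password" >&2
    exit 1
fi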

Thought process

1) Visit the documentation page and pick an API method, for example alert.get:

{
"jsonrpc": "2.0",
"method": "alert.get",
"params": {
	"output": "extend",
	"actionids": "3"
},
"auth": "038e1d7b1735c6a5436ee9eae095879e",
"id": 1
}

2) Let’s use our favorite text editor and its built-in Find&Replace functionality to escape all double quotes:

{
\"jsonrpc\": \"2.0\",
\"method\": \"alert.get\",
\"params\": {
	\"output\": \"extend\",
	\"actionids\": \"3\"
},
\"auth\": \"038e1d7b1735c6a5436ee9eae095879e\",
\"id\": 1
}

NOTE
Don’t even think of doing this escaping manually by hand!

3) Replace the session key 038e1d7b1735c6a5436ee9eae095879e with our variable $auth:

{
\"jsonrpc\": \"2.0\",
\"method\": \"alert.get\",
\"params\": {
	\"output\": \"extend\",
	\"actionids\": \"3\"
},
\"auth\": \"$auth\",
\"id\": 1
}

4) Now let’s encapsulate the API command with curl:

curl -s -X POST \
-H 'Content-Type: application/json-rpc' \
-d " \

{
\"jsonrpc\": \"2.0\",
\"method\": \"alert.get\",
\"params\": {
	\"output\": \"extend\",
	\"actionids\": \"3\"
},
\"auth\": \"$auth\",
\"id\": 1
}

" $url

Executing the previous command should already print JSON content in response.
To make the output prettier, we can pipe it to jq .:

curl -s -X POST \
-H 'Content-Type: application/json-rpc' \
-d " \

{
\"jsonrpc\": \"2.0\",
\"method\": \"alert.get\",
\"params\": {
	\"output\": \"extend\",
	\"actionids\": \"3\"
},
\"auth\": \"$auth\",
\"id\": 1
}

" $url | jq .

Wrap everything together in one file

This is a ready-to-use snippet:

#!/bin/bash

# 1. set connection details
url=http://127.0.0.1/api_jsonrpc.php
user=api
password=zabbix

# 2. get authorization token
auth=$(curl -s -X POST \
-H 'Content-Type: application/json-rpc' \
-d " \
{
 \"jsonrpc\": \"2.0\",
 \"method\": \"user.login\",
 \"params\": {
  \"user\": \"$user\",
  \"password\": \"$password\"
 },
 \"id\": 1,
 \"auth\": null
}
" $url | \
jq -r '.result'
)

# 3. show triggers in problem state
curl -s -X POST \
-H 'Content-Type: application/json-rpc' \
-d " \
{
 \"jsonrpc\": \"2.0\",
    \"method\": \"trigger.get\",
    \"params\": {
        \"output\": \"extend\",
        \"selectHosts\": \"extend\",
        \"filter\": {
            \"value\": 1
        },
        \"sortfield\": \"priority\",
        \"sortorder\": \"DESC\"
    },
    \"auth\": \"$auth\",
    \"id\": 1
}
" $url | \
jq -r '.result'

# 4. logout user
curl -s -X POST \
-H 'Content-Type: application/json-rpc' \
-d " \
{
    \"jsonrpc\": \"2.0\",
    \"method\": \"user.logout\",
    \"params\": [],
    \"id\": 1,
    \"auth\": \"$auth\"
}
" $url

Conveniences

We can use https://jsonpathfinder.com/ to identify the path needed to extract an element.

For example, to list all Zabbix proxies we will use an API call:

curl -s -X POST \
-H 'Content-Type: application/json-rpc' \
-d " \
{
    \"jsonrpc\": \"2.0\",
    \"method\": \"proxy.get\",
    \"params\": {
        \"output\": [\"host\"]
    },
    \"auth\": \"$auth\",
    \"id\": 1
} 
" $url

It may print content like:

{"jsonrpc":"2.0","result":[{"host":"broceni","proxyid":"10387"},{"host":"mysql8mon","proxyid":"12066"},{"host":"riga","proxyid":"12585"}],"id":1}

Inside JSONPathFinder, by clicking in the right panel we can locate a sample element we need to extract:

It suggests the path ‘x.result[1].host’. This means that to extract all elements we can remove the number and use ‘.result[].host’ like this:

curl -s -X POST \
-H 'Content-Type: application/json-rpc' \
-d " \
{
    \"jsonrpc\": \"2.0\",
    \"method\": \"proxy.get\",
    \"params\": {
        \"output\": [\"host\"]
    },
    \"auth\": \"$auth\",
    \"id\": 1
} 
" $url | jq -r '.result[].host'

Now it prints only the proxy titles:

broceni
mysql8mon
riga
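As one more jq exercise, we can extract several fields per element and print them as tab-separated values. This sketch reuses the same proxy.get call (the proxyid property is always returned along with the requested output):

curl -s -X POST \
-H 'Content-Type: application/json-rpc' \
-d " \
{
    \"jsonrpc\": \"2.0\",
    \"method\": \"proxy.get\",
    \"params\": {
        \"output\": [\"host\"]
    },
    \"auth\": \"$auth\",
    \"id\": 1
}
" $url | jq -r '.result[] | [.proxyid, .host] | @tsv'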

That is it for today. Bye.

Zabbix API calls through Postman

Post Syndicated from Aigars Kadiķis original https://blog.zabbix.com/zabbix-api-calls-through-postman/12198/

Zabbix API calls can be performed through a graphical user interface (GUI); there is no need to jump into scripting. One application to perform API calls is Postman.

Benefits:

  • Available on Windows, Linux, or macOS
  • Save/synchronize your collections with a Google account
  • Can copy and paste examples from the official documentation page

Let’s go through the basic steps of how to perform API calls:

1st step – Grab the API method user.login and use a dedicated username and password to obtain a session token:

{
    "jsonrpc": "2.0",
    "method": "user.login",
    "params": {
        "user": "api",
        "password": "zabbix"
    },
    "id": 1
}

This is how it looks in Postman:

NOTE
We recommend using a dedicated user for API calls, for example, a user called “api”. Make sure the user type is set to “Zabbix Super Admin” so that through this user we can access any type of information.

2nd step – Use the API method trigger.get to list all triggers in the problem state:

{
    "jsonrpc": "2.0",
    "method": "trigger.get",
    "params": {
        "output": [
            "triggerid",
            "description",
            "priority"
        ],
        "filter": {
            "value": 1
        },
        "sortfield": "priority",
        "sortorder": "DESC"
    },
    "auth": "<session key>",
    "id": 1
}

Replace “<session key>” inside the API snippet to make it work, then click the “Send” button. It will list all triggers in the problem state on the right side:

Postman conveniences – Environments

Environments are “a must” if you:

  • Have separate test, development, and production Zabbix instances
  • Plan to migrate Zabbix to the next version (4.0 to 5.0), so it’s better to test all API calls beforehand

In the top right corner, there is a Manage Environments button. Let’s click it.

Now Create an environment:

Each environment must consist of a url and an auth key:

Now we have one definition, prod. We can close the window with [X]:

In order to work with your new environment, select a newly created profile prod. It’s required to substitute Zabbix API endpoint with {{url}} and plot {{auth}} to serve as a dynamic authorization key:

NOTE
Every time we notice that an API procedure no longer works, all we need to do is enter the Manage Environments section and set a new session token.

Topic in video format:
https://youtu.be/B14tsDUasG8?t=2513

Automated Origin CA for Kubernetes

Post Syndicated from Terin Stock original https://blog.cloudflare.com/automated-origin-ca-for-kubernetes/

In 2016, we launched the Cloudflare Origin CA, a certificate authority optimized for making it easy to secure the connection between Cloudflare and an origin server. Running our own CA has allowed us to support fast issuance and renewal, simple and effective revocation, and wildcard certificates for our users.

Out of the box, managing TLS certificates and keys within Kubernetes can be challenging and error prone. The secret resources have to be constructed correctly, as components expect secrets with specific fields. Some forms of domain verification require manually rotating secrets to pass. Once you’re successful, don’t forget to renew before the certificate expires!

cert-manager is a project to fill this operational gap, providing Kubernetes resources that manage the lifecycle of a certificate. Today we’re releasing origin-ca-issuer, an extension to cert-manager integrating with Cloudflare Origin CA to easily create and renew certificates for your account’s domains.

Origin CA Integration

Creating an Issuer

After installing cert-manager and origin-ca-issuer, you can create an OriginIssuer resource. This resource creates a binding between cert-manager and the Cloudflare API for an account. Different issuers may be connected to different Cloudflare accounts in the same Kubernetes cluster.

apiVersion: cert-manager.k8s.cloudflare.com/v1
kind: OriginIssuer
metadata:
  name: prod-issuer
  namespace: default
spec:
  signatureType: OriginECC
  auth:
    serviceKeyRef:
      name: service-key
      key: key

This creates a new OriginIssuer named “prod-issuer” that issues certificates using ECDSA signatures, and the secret “service-key” in the same namespace is used to authenticate to the Cloudflare API.
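The referenced secret is not created by the issuer; it must exist beforehand. A minimal sketch for creating it with kubectl, assuming you have your Origin CA service key at hand (the secret name and the key field must match the serviceKeyRef above):

kubectl --namespace default create secret generic service-key \
  --from-literal=key=<your Origin CA service key>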

Signing an Origin CA Certificate

After creating an OriginIssuer, we can now create a Certificate with cert-manager. This defines the domains, including wildcards, that the certificate should be issued for, how long the certificate should be valid, and when cert-manager should renew the certificate.

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com
  namespace: default
spec:
  # The secret name where cert-manager
  # should store the signed certificate.
  secretName: example-com-tls
  dnsNames:
    - example.com
  # Duration of the certificate.
  duration: 168h
  # Renew a day before the certificate expiration.
  renewBefore: 24h
  # Reference the Origin CA Issuer you created above,
  # which must be in the same namespace.
  issuerRef:
    group: cert-manager.k8s.cloudflare.com
    kind: OriginIssuer
    name: prod-issuer

Once created, cert-manager begins managing the lifecycle of this certificate, including creating the key material, crafting a certificate signature request (CSR), and constructing a certificate request that will be processed by the origin-ca-issuer.

When signed by the Cloudflare API, the certificate will be made available, along with the private key, in the Kubernetes secret specified within the secretName field. You’ll be able to use this certificate on servers proxied behind Cloudflare.
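Progress can be observed with the usual cert-manager tooling. A sketch, assuming the Certificate above:

# READY turns True once the Origin CA has signed the certificate
kubectl --namespace default get certificate example-com
kubectl --namespace default get certificaterequest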

Extra: Ingress Support

If you’re using an Ingress controller, you can use cert-manager’s Ingress support to automatically manage Certificate resources based on your Ingress resource.

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/issuer: prod-issuer
    cert-manager.io/issuer-kind: OriginIssuer
    cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
  name: example
  namespace: default
spec:
  rules:
    - host: example.com
      http:
        paths:
          - backend:
              serviceName: examplesvc
              servicePort: 80
            path: /
  tls:
    # specifying a host in the TLS section will tell cert-manager 
    # what DNS SANs should be on the created certificate.
    - hosts:
        - example.com
      # cert-manager will create this secret
      secretName: example-tls

Building an External cert-manager Issuer

An external cert-manager issuer is a specialized Kubernetes controller. There’s no direct communication between cert-manager and external issuers at all; this means that you can use any existing tools and best practices for developing controllers to develop an external issuer.

We’ve decided to use the excellent controller-runtime project to build origin-ca-issuer, running two reconciliation controllers.

Automated Origin CA for Kubernetes

OriginIssuer Controller

The OriginIssuer controller watches for creation and modification of OriginIssuer custom resources. The controllers create a Cloudflare API client using the details and credentials referenced. This client API instance will later be used to sign certificates through the API. The controller will periodically retry to create an API client; once it is successful, it updates the OriginIssuer’s status to be ready.

CertificateRequest Controller

The CertificateRequest controller watches for the creation and modification of cert-manager’s CertificateRequest resources. These resources are created automatically by cert-manager as needed during a certificate’s lifecycle.

The controller looks for CertificateRequests that reference a known OriginIssuer (this reference is copied by cert-manager from the originating Certificate resource) and ignores all resources that do not match. The controller then verifies that the OriginIssuer is in the ready state before transforming the certificate request into an API request using the previously created clients.

On a successful response, the signed certificate is added to the certificate request, which cert-manager will then use to create or update the secret resource. On an unsuccessful request, the controller will periodically retry.

Learn More

Up-to-date documentation and complete installation instructions can be found in our GitHub repository. Feedback and contributions are greatly appreciated. If you’re interested in Kubernetes at Cloudflare, including building controllers like these, we’re hiring.

Add Watermarks to your Cloudflare Stream Video Uploads

Post Syndicated from Rachel Chen original https://blog.cloudflare.com/add-watermarks-to-your-cloudflare-stream-video-uploads/

Since the launch of Cloudflare Stream, our customers have been asking for a programmatic way to add watermarks to their videos. We built the Watermarks API to support a wide range of use cases: from customers who simply want to tell Stream “can you put this watermark image to the top right of my video?” to customers with more detailed asks such as “can you put this watermark image in a way it doesn’t take up more than 10% of the original video and with 20% opacity?” All that and more is now available at no additional cost through the Watermarks API.

What is Cloudflare Stream?

Cloudflare Stream provides out-of-the-box video infrastructure so developers can bring their app ideas to market faster. While building a video streaming app, developers must ask themselves questions like:

  • Where do we store the videos affordably?
  • How do we encode the videos to support users with varying Internet speeds?
  • How do we maintain our video pipeline in the long term?

Cloudflare Stream is a single product that handles video encoding, storage, delivery and presentation (with the Stream Player). Stream lets developers launch their ideas faster while having the confidence the video infrastructure will scale with their app’s growth.

How the Watermark API works

The Watermark API lets you add a watermark to a video at the time of upload. It adds two new features to the Stream API:

  • A new /stream/watermarks endpoint that lets you create watermark profiles and returns a uid, a unique identifier for each watermark profile
  • Support for a watermark object containing the uid of the watermark profile that can be passed at the time of upload

Step 1: Creating a Watermark Profile

A watermark profile describes the nature of the watermark, including the image to use as a watermark and properties such as its positioning, padding and scale.

Add Watermarks to your Cloudflare Stream Video Uploads

In this example, we are going to create a watermark profile that places the Cloudflare logo to the lower left of the video:

curl --request POST \
  --url https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/stream/watermarks \
  --header 'content-type: application/json' \
  --header "x-auth-email: $CLOUDFLARE_EMAIL" \
  --header "x-auth-key: $CLOUDFLARE_KEY" \
  --data '{
  "url": "https://storage.googleapis.com/zaid-test/Watermarks%20Demo/cf-icon.png",
  "name": "Cloudflare Icon",
  "opacity": 0.5,
  "padding": 0.05,
  "scale": 0.1,
  "position": "lowerLeft"
}'

The response contains information about the watermark profile, including a uid that we will use in the next step:

{
  "result": {
    "uid": "a85d289c2e3f82701103620d16cd2408",
    "size": 9165,
    "height": 504,
    "width": 600,
    "created": "2020-09-03T20:43:56.337486Z",
    "downloadedFrom": "REDACTED_VIDEO_URL",
    "name": "Cloudflare Icon",
    "opacity": 0.5,
    "padding": 0.05,
    "scale": 0.1,
    "position": "lowerLeft"
  },
  "success": true,
  "errors": [],
  "messages": []
}
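Watermark profiles are reusable across uploads. To review the profiles already created on the account, the same endpoint can be queried with GET; a sketch using the same headers as above:

curl --request GET \
  --url https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/stream/watermarks \
  --header "x-auth-email: $CLOUDFLARE_EMAIL" \
  --header "x-auth-key: $CLOUDFLARE_KEY"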

Step 2: Apply the Watermark

We’ve created the watermark and are ready to use it. Below is a screengrab from the Built For This commercial. It contains no watermark:

We are going to upload the commercial and request Stream to add the logo from the previous step as a watermark:

curl --request POST \
  --url https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/stream/copy \
  --header 'content-type: application/json' \
  --header "x-auth-email: $EMAIL" \
  --header "x-auth-key: $AUTH_KEY" \
  --data '{
  "url": "https://storage.googleapis.com/zaid-test/Watermarks%20Demo/The%20Internet%20was%20BuiltForThis.mp4",
  "watermark": {
    "uid": "a85d289c2e3f82701103620d16cd2408"
  }
}'
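The copy request returns a uid for the new video. While Stream downloads and encodes it, the processing status (and the attached watermark profile) can be checked with a GET; in this sketch, $VIDEO_UID is a placeholder for the uid returned by the call above:

curl --request GET \
  --url https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/stream/$VIDEO_UID \
  --header "x-auth-email: $EMAIL" \
  --header "x-auth-key: $AUTH_KEY"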

Step 3: Your video, now with a watermark!

You’re done! You can watch the video with a watermark:



What’s next

Read the detailed Watermark API docs covering different use cases.

In future iterations, we plan to add support for animated watermarks. Additionally, we want to add Watermark support to the Stream Dashboard so you have a UI to manage and add watermarks.

Introducing GitHub’s OpenAPI Description

Post Syndicated from Marc-Andre Giroux original https://github.blog/2020-07-27-introducing-githubs-openapi-description/

The GitHub REST API has been through three major revisions since it was first released, only a month after the site was launched. We often receive feedback that our REST API is an inspiration to many for design, and that it’s an industry reference for what an API should look like. Today, we’re excited to announce an improvement to how developers can interact with the API. GitHub has open sourced an OpenAPI description of the REST API.

OpenAPI

The OpenAPI specification is a programming language agnostic standard that lets providers describe the interface of their HTTP APIs. This allows both humans and machines to discover the capabilities of an API without needing to first read documentation or understand the implementation. OpenAPI is a widely adopted industry standard and GitHub is proud to be part of the community and help push the standard forward.

Try it Out

The GitHub OpenAPI description contains more than 600 operations exposed in our API. For visual exploration of the API, you can load the description as a Postman Collection. Programmatically, the description can be used to generate mock servers, test suites, and bindings for languages not supported by Octokit.

The description is provided in two formats. The bundled version is preferred for most use cases, as it makes use of OpenAPI components for reuse and readability. For tooling that has poor support for inline references to components, we also provide a fully dereferenced version.
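As a quick way to explore both formats locally, the description can be cloned from GitHub. A sketch, assuming the github/rest-api-description repository (the directory layout may change while the description is in beta):

git clone https://github.com/github/rest-api-description.git
ls rest-api-description/descriptions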

Active Development

The description is currently in beta. Describing a 12-year-old API is no easy task. We’ve built this description using a mix of existing JSON schemas, documented examples, contract testing, and love. We expect to make the description even more complete and accurate as we go forward and as OpenAPI becomes central to our developer experience — internally and externally.

Quarterly releases of the description are available for GitHub Enterprise Server and GitHub Private Instances, with versions like v2.21. More frequent updates to the description will be available for GitHub.com.

How Can I Contribute?

We’re always looking to make our OpenAPI description more complete and accurate as well as making it easier to consume. If you’d like to help contribute to the description, check out our contributing guide. If something is not working for you, please file an Issue on the repository.

Building a complete OpenAPI description for the GitHub API was no easy task and would not have been possible without a great team. Thanks to Gregor Martynus for his initial work on describing the API, the Docs Engineering team for their amazing work around OpenAPI and documentation, Will Roden for his help validating the description with octo-go, as well as the folks at Redoc.ly who helped along the way.

Learn more about our REST API OpenAPI Description

*  The OpenAPI Initiative logo is a trademark of The Linux Foundation

Backblaze B2 and the S3 Compatible API on Cloudflare

Post Syndicated from Tim Obezuk original https://blog.cloudflare.com/backblaze-b2-and-the-s3-compatible-api-on-cloudflare/

In May 2020, Backblaze, a founding Bandwidth Alliance partner, announced S3 compatible APIs for their B2 Cloud Storage service. As a refresher, the Bandwidth Alliance is a group of forward-thinking cloud and networking companies that are committed to discounting or waiving data transfer fees for shared customers. Backblaze has been a proud partner since 2018. We are excited to see Backblaze introduce a new level of compatibility in their Cloud Storage service.

History of the S3 API

First let’s dive into the history of the S3 API and why it’s important for Cloudflare users.

Prior to 2006, before the mass migration to the Cloud, if you wanted to store content for your company you needed to build your own expensive and highly available storage platform that was large enough to store all your existing content with enough growth headroom for your business. AWS launched to help eliminate this model by renting their physical computing and storage infrastructure.

Amazon Simple Storage Service (S3) led the market by offering a scalable and resilient tool for storing unlimited amounts of data without building it yourself. It could be integrated into any application, but there was one catch: you couldn’t use any existing standard such as WebDAV, FTP or SMB; your application needed to interface with Amazon’s bespoke S3 API.

Fast forward to 2020 and the storage provider landscape has become highly competitive with many providers capable of providing petabyte (and exabyte) scale content storage at extremely low cost-per-gigabyte. However, Amazon S3 has remained a dominant player despite heavy competition and not being the most cost-effective player.

The broad adoption of the S3 API by developers in their codebases and internal systems has turned the S3 API into what WebDAV promised to be: the de facto standard HTTP file storage API.

Engineering costs of changing storage providers

With many code bases and legacy applications entrenched in the S3 API, the process of switching to a more cost-effective storage provider is not so easy. Companies need to weigh the cost of engineering time spent programming against a new storage API while also physically moving their data.

This engineering overhead has led many storage providers to natively support the S3 API, leveling the playing field and allowing companies to focus on picking the most cost-effective provider.

First-mile bandwidth costs and the Bandwidth Alliance

Cloudflare caches content in Points of Presence located in more than 200 cities around the world. This cached content is then handed to your Internet service provider (ISP) over low cost and often free Internet exchange connections in the same facility using mutual fibre optic cables. This cost saving is fairly well understood as the benefit of content delivery networks and has become highly commoditized.

What is less well understood is the first-mile cost of moving data from a storage provider to the content delivery network. Typically storage providers expect traffic to route via the Internet and will charge the consumer per-gigabyte of data transmitted. This is not the case for Cloudflare as we also share facilities and mutual fibre optic cables with many storage providers.

These shared interconnects created an opportunity to waive the cost of first-mile bandwidth between Cloudflare and many providers and is what prompted us to create the Bandwidth Alliance.

Media and entertainment companies serving user-generated content have a continuous supply of new content being moved over the first-mile from the storage provider to the content delivery network. The first-mile bandwidth cost adds up, and using a Bandwidth Alliance partner such as Backblaze can entirely eliminate it.

Using the S3 API in Cloudflare Workers

The Solutions Engineering team at Cloudflare is tasked with providing strategic technical guidance for our enterprise customers.

It’s not uncommon for developers to connect Cloudflare’s global network directly to their storage provider and directly serve content such as live and on-demand video without an intermediate web server.

For security purposes, engineers typically use Cloudflare Workers to sign each uncached request using the S3 API. Cloudflare Workers allows anyone to deploy code to our global network of 200+ Points of Presence in seconds and is built on top of Service Workers.

We’ve tested Backblaze B2’s S3 Compatible API in Cloudflare Workers using the same code tested for Amazon S3 buckets, and it works perfectly after changing only the target endpoint.

Creating a S3 Compatible Worker script

Here’s how it is done using Cloudflare Worker’s CLI tool Wrangler:

Generate a new project in Wrangler using a template intended for use with Amazon S3:

wrangler generate <projectname> https://github.com/obezuk/worker-signed-s3-template

This template uses aws4fetch, a fast, lightweight implementation of an S3 compatible signing library that is commonly used in Service Worker environments like Cloudflare Workers.

The template creates an index.js file with a standard request signing implementation:

import { AwsClient } from 'aws4fetch'

const aws = new AwsClient({
    "accessKeyId": AWS_ACCESS_KEY_ID,
    "secretAccessKey": AWS_SECRET_ACCESS_KEY,
    "region": AWS_DEFAULT_REGION
});

addEventListener('fetch', function(event) {
    event.respondWith(handleRequest(event.request))
});

async function handleRequest(request) {
    var url = new URL(request.url);
    url.hostname = AWS_S3_BUCKET;
    var signedRequest = await aws.sign(url);
    return await fetch(signedRequest, { "cf": { "cacheEverything": true } });
}

Environment Variables

Modify your wrangler.toml file to use your Backblaze B2 API Key ID and Secret:

[env.dev]
vars = { AWS_ACCESS_KEY_ID = "<BACKBLAZE B2 keyId>", 
AWS_SECRET_ACCESS_KEY = "<BACKBLAZE B2 secret>", 
AWS_DEFAULT_REGION = "", 
AWS_S3_BUCKET = "<BACKBLAZE B2 bucketName>.<BACKBLAZE B2 S3 Endpoint>"}

The AWS_S3_BUCKET environment variable is the combination of your bucket name, a period, and the S3 Endpoint. For a Backblaze B2 bucket named example-bucket and S3 Endpoint s3.us-west-002.backblazeb2.com, use example-bucket.s3.us-west-002.backblazeb2.com

The AWS_DEFAULT_REGION environment variable is interpreted from your S3 Endpoint; here it is us-west-002.

We recommend using Secret Environment variables to store your AWS_SECRET_ACCESS_KEY content when using this script in production.
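To sanity-check the key pair and endpoint outside of Workers, any S3 compatible client can be pointed at the bucket. A sketch using the standard AWS CLI, assuming it has been configured with the Backblaze B2 keyId and secret:

# list the bucket through Backblaze B2's S3 Compatible API
aws s3 ls s3://example-bucket \
  --endpoint-url https://s3.us-west-002.backblazeb2.com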

Preview your Cloudflare Worker

Next, run wrangler preview --env dev to open a preview window of your Worker script. My bucket contained a static website with adaptive streaming video content stored in a Backblaze B2 bucket.

Note: We permit caching of third party video content only for enterprise domains. Free/Pro/Biz users wanting to serve video content via Cloudflare may use Stream which delivers an end-to-end video delivery service.

Backblaze B2’s compatibility for the S3 API is an exciting update that has made their storage platform highly compatible with existing code bases and legacy systems. And, as a special offer to Cloudflare blog readers, Backblaze will pay the migration costs for transferring your data from S3 to Backblaze B2 (click here for more detail). With the cost of migration covered and compatibility for your existing workflows, it is now easier than ever to switch to a Bandwidth Alliance partner and save on first-mile costs. By doing so, you can slash your cloud bills, gain flexibility, and make no compromises to your performance.

To learn more, join us on May 14th for a webinar focused on getting you ultra fast worldwide content delivery.

Ready for changes with Hexagonal Architecture

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/ready-for-changes-with-hexagonal-architecture-b315ec967749

by Damir Svrtan and Sergii Makagon

As the production of Netflix Originals grows each year, so does our need to build apps that enable efficiency throughout the entire creative process. Our wider Studio Engineering Organization has built more than 30 apps that help content progress from pitch (aka screenplay) to playback: ranging from script content acquisition, deal negotiations and vendor management to scheduling, streamlining production workflows, and so on.

Highly integrated from the start

About a year ago, our Studio Workflows team started working on a new app that crosses multiple domains of the business. We had an interesting challenge on our hands: we needed to build the core of our app from scratch, but we also needed data that existed in many different systems.

Some of the data points we needed, such as data about movies, production dates, employees, and shooting locations, were distributed across many services implementing various protocols: gRPC, JSON API, GraphQL and more. Existing data was crucial to the behavior and business logic of our application. We needed to be highly integrated from the start.

Swappable data sources

One of the early applications for bringing visibility into our productions was built as a monolith. The monolith allowed for rapid development and quick changes while the knowledge of the space was non-existent. At one point, more than 30 developers were working on it, and it had well over 300 database tables.

Over time, applications evolved from broad service offerings towards being highly specialized. This resulted in a decision to decompose the monolith into specific services. The decision was not driven by performance issues, but by the need to set boundaries around all of these different domains and enable dedicated teams to develop domain-specific services independently.

Large amounts of the data we needed for the new app were still provided by the monolith, but we knew that the monolith would be broken up at some point. We were not sure about the timing of the breakup, but we knew that it was inevitable, and we needed to be prepared.

Thus, we could leverage some of the data from the monolith at first as it was still the source of truth, but be prepared to swap those data sources to new microservices as soon as they came online.

Leveraging Hexagonal Architecture

We needed to support the ability to swap data sources without impacting business logic, so we knew we needed to keep them decoupled. We decided to build our app based on principles behind Hexagonal Architecture and Uncle Bob’s Clean Architecture.

The idea of Hexagonal Architecture is to put inputs and outputs at the edges of our design. Business logic should not depend on whether we expose a REST or a GraphQL API, and it should not depend on where we get data from — a database, a microservice API exposed via gRPC or REST, or just a simple CSV file.

The pattern allows us to isolate the core logic of our application from outside concerns. Having our core logic isolated means we can easily change data source details without a significant impact or major code rewrites to the codebase.

One of the main advantages we also saw in having an app with clear boundaries is our testing strategy — the majority of our tests can verify our business logic without relying on protocols that can easily change.

Defining the core concepts

Borrowed from Hexagonal Architecture, the three main concepts that define our business logic are Entities, Repositories, and Interactors.

  • Entities are the domain objects (e.g., a Movie or a Shooting Location) — they have no knowledge of where they’re stored (unlike Active Record in Ruby on Rails or the Java Persistence API).
  • Repositories are the interfaces to getting entities as well as creating and changing them. They keep a list of methods that are used to communicate with data sources and return a single entity or a list of entities. (e.g. UserRepository)
  • Interactors are classes that orchestrate and perform domain actions — think of Service Objects or Use Case Objects. They implement complex business rules and validation logic specific to a domain action (e.g., onboarding a production)

With these three main types of objects, we are able to define business logic without any knowledge or care where the data is kept and how business logic is triggered. Outside of the business logic are the Data Sources and the Transport Layer:

  • Data Sources are adapters to different storage implementations.
    A data source might be an adapter to a SQL database (an Active Record class in Rails or JPA in Java), an Elasticsearch adapter, a REST API, or even an adapter to something simple such as a CSV file or a Hash. A data source implements methods defined on the repository and stores the implementation of fetching and pushing the data.
  • Transport Layer can trigger an interactor to perform business logic. We treat it as an input for our system. The most common transport layer for microservices is the HTTP API Layer and a set of controllers that handle requests. By having business logic extracted into interactors, we are not coupled to a particular transport layer or controller implementation. Interactors can be triggered not only by a controller, but also by an event, a cron job, or from the command line.
The dependency graph in Hexagonal Architecture goes inward.

With a traditional layered architecture, we would have all of our dependencies point in one direction, each layer above depending on the layer below. The transport layer would depend on the interactors, the interactors would depend on the persistence layer.

In Hexagonal Architecture all dependencies point inward — our core business logic does not know anything about the transport layer or the data sources. Still, the transport layer knows how to use interactors, and the data sources know how to conform to the repository interface.

With this, we are prepared for the inevitable changes to other Studio systems, and whenever that needs to happen, the task of swapping data sources is easy to accomplish.

Swapping data sources

The need to swap data sources came earlier than we expected — we suddenly hit a read constraint with the monolith and needed to switch a certain read for one entity to a newer microservice exposed over a GraphQL aggregation layer. Both the microservice and the monolith were kept in sync and had the same data, reading from one service or the other produced the same results.

We managed to transfer reads from a JSON API to a GraphQL data source within 2 hours.

The main reason we were able to pull it off so fast was due to the Hexagonal architecture. We didn’t let any persistence specifics leak into our business logic. We created a GraphQL data source that implemented the repository interface. A simple one-line change was all we needed to start reading from a different data source.

With a proper abstraction it was easy to change data sources

At that point, we knew that Hexagonal Architecture worked for us.

The great part about a one-line change is that it mitigates risks to the release. It is very easy to roll back in case a downstream microservice fails on initial deployment. This also enables us to decouple deployment and activation, as we can decide which data source to use through configuration.

Hiding data source details

One of the great advantages of this architecture is that we are able to encapsulate data source implementation details. We ran into a case where we needed an API call that did not yet exist — a service had an API to fetch a single resource but did not have bulk fetch implemented. After talking with the team providing the API, we realized this endpoint would take some time to deliver. So we decided to move forward with another solution to solve the problem while this endpoint was being built.

We defined a repository method that would grab multiple resources given multiple record identifiers — and the initial implementation of that method on the data source sent multiple concurrent calls to the downstream service. We knew this was a temporary solution and that the second take at the data source implementation was to use the bulk API once implemented.

Our business logic doesn’t need to be aware of specific data source limitations.

A design like this enabled us to move forward with meeting the business needs without accruing much technical debt or the need to change any business logic afterward.

Testing strategy

When we started experimenting with Hexagonal Architecture, we knew we needed to come up with a testing strategy. We knew that a prerequisite to great development velocity was to have a test suite that is reliable and super fast. We didn’t think of it as a nice-to-have, but as a must-have.

We decided to test our app at three different layers:

  • We test our interactors, where the core of our business logic lives but is independent of any type of persistence or transportation. We leverage dependency injection and mock any kind of repository interaction. This is where our business logic is tested in detail, and these are the tests we strive to have most of.
  • We test our data sources to determine if they integrate correctly with other services, whether they conform to the repository interface, and check how they behave upon errors. We try to minimize the amount of these tests.
  • We have integration specs that go through the whole stack, from our Transport / API layer, through the interactors, repositories, data sources, and hit downstream services. These specs test whether we “wired” everything correctly. If a data source is an external API, we hit that endpoint and record the responses (and store them in git), allowing our test suite to run fast on every subsequent invocation. We don’t do extensive test coverage on this layer — usually just one success scenario and one failure scenario per domain action.

We don’t test our repositories, as they are simple interfaces that data sources implement, and we rarely test our entities, as they are plain objects with defined attributes. We only test entities that have additional methods (without touching the persistence layer).
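To make the first layer concrete, here is a minimal sketch of an interactor test with a mocked repository (the test framework, the vitest import, and all names are our assumptions, not the actual suite):

import { describe, expect, it, vi } from "vitest";

interface Production {
  id: string;
  title: string;
}

interface ProductionRepository {
  getProduction(id: string): Promise<Production>;
}

// A tiny interactor: business logic that only knows the repository interface.
class FetchProduction {
  constructor(private readonly repository: ProductionRepository) {}
  call(id: string): Promise<Production> {
    return this.repository.getProduction(id);
  }
}

describe("FetchProduction", () => {
  it("returns the production from the injected repository", async () => {
    const repository: ProductionRepository = {
      getProduction: vi.fn().mockResolvedValue({ id: "1", title: "Example" }),
    };
    const production = await new FetchProduction(repository).call("1");
    expect(production.title).toBe("Example");
  });
});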

We still have room for improvement, such as not hitting any of the services we rely on and instead relying entirely on contract testing. With a test suite written in the above manner, we manage to run around 3000 specs in 100 seconds on a single process.

It’s lovely to work with a test suite that can easily be run on any machine, and our development team can work on their daily features without disruption.

Delaying decisions

We are in a great position when it comes to swapping data sources to different microservices. One of the key benefits is that we can delay some of the decisions about whether and how we want to store data internal to our application. Based on the feature’s use case, we even have the flexibility to determine the type of data store — whether it be Relational or Documents.

Uncle Bob put it well:

The purpose of a good architecture is to delay decisions. Why? Because when we delay a decision, we have more information when it comes time to make it.

At the beginning of a project, we have the least amount of information about the system we are building. We should not lock ourselves into an architecture with uninformed decisions leading to a project paradox.

The decisions we made make sense for our needs now and have enabled us to move fast. The best part of Hexagonal Architecture is that it keeps our application flexible for future requirements to come.


Ready for changes with Hexagonal Architecture was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Introducing Secrets and Environment Variables to Cloudflare Workers

Post Syndicated from John Donmoyer original https://blog.cloudflare.com/workers-secrets-environment/


The Workers team here at Cloudflare has been hard at work shipping a bunch of new features in the last year and we’ve seen some amazing things built with the tools we’ve provided. However, as my uncle once said, with great serverless platform growth comes great responsibility.

One of the ways we can help is by ensuring that deploying and maintaining your Workers scripts is a low-risk endeavor. Rotating a set of API keys shouldn’t require risking downtime through code edits and redeployments, and in some cases it may not make sense for the developer writing the script to know the actual API key value at all. To help tackle this problem, we’re releasing Secrets and Environment Variables to the Wrangler CLI and Workers Dashboard.

Supporting secrets

As we started to design support for secrets in Workers we had a sense that this was already a big concern for a lot of our users but we wanted to learn about all of the use cases to ensure we were building the right thing. We headed to the community forums, Twitter, and the inbox of Louis Grace, business development representative extraordinaire, for some anecdotes about Secrets usage. We also sent out a survey to our existing users to learn about use cases and pain points.

We learned that even though there was already a way to store secrets without exposing them via Workers KV, the solution was not very intuitive, nor did it meet all the needs of our users. Many users didn’t even know we had an interim solution in place. Recognizing that we were not the first platform to encounter this problem, we surveyed the existing landscape of Platform as a Service offerings to get a better sense for what our users would expect of us.

Deciding on a solution

One of the first things we found was that not all environment variables are created equal. While the simplest use case for having a defined environment variable may be storing a piece of text that can be updated no matter where it is referenced in a script, sometimes those variables may have higher stakes associated with them. If you’re storing an API key that controls access to an important system, you may not want to allow anyone with dashboard access to see it, maybe not even the developers themselves.

With this in mind, we had to ensure the feature covered two different use cases: one for storing variables in plain text where you could see the variable being referenced and make edits to it and another where the variable would be encrypted as soon as you save it, never to be seen again. This way, we were able to serve both needs of our users, side by side, without one compromising for the other.

Testing our prototypes

Once we had a fairly good idea of what we wanted to build, we built some prototypes and rough implementations in staging environments so we would be able to perform some usability testing. We wrangled up some developers and observed them as they performed a series of tasks where they were asked to add some secrets and plain-text environment variables, reference them in one of their Workers, and bind their Worker to a Worker KV namespace.

Along the way we also asked questions to understand each developer’s professional background, familiarity with the product, and the use cases they’d had for Workers in the past, along with any pain points they experienced.

While we were testing the new dashboard interface we also began testing the usability of the Wrangler CLI. We had Wrangler users perform the same tasks as the Workers dashboard users to help us find out if users are expecting different things out of their command-line tooling.

Findings and fixes

Through our testing we were able to make a number of changes before the final release. Some of the smaller changes included things like adjusting the behavior of form fields to ensure users knew which variable would be associated with each value. We also made larger changes like electing to separate the KV namespace bindings from the other environment variables as a way to emphasize that KV namespace bindings are not the keys and values themselves but a reference to a namespace where those keys are stored.

Cina, one of our engineers, put together a proposal to align some of our terminology with the terms that our developers were naturally using to describe their workflow. In Wrangler users were accustomed to referencing their KV namespaces by adding a KV namespace binding so when they came to the Workers dashboard interface and saw a field called “KV Variables” they were often confused, thinking they were adding keys and values to the namespace itself instead of establishing a variable that could be used to reference the namespace. As a fix, we decided to call it a “KV namespace binding” throughout the experience.

Try it out

Environment variables are available now with the Wrangler CLI and in the Workers Dashboard so go ahead and give them a shot today!

Adding a secret with Wrangler
Managing environment variables and KV bindings in the Workers Dashboard
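As a minimal sketch of what this looks like inside a Worker, assume a plain-text variable named GREETING and a secret named API_KEY have been bound to the script (both names are ours, and in the original service-worker syntax the bindings arrive as globals):

// Both bindings behave like ordinary variables at runtime; the secret's
// value simply never appears in the dashboard or in source control.
declare const GREETING: string; // plain-text environment variable
declare const API_KEY: string;  // secret

addEventListener("fetch", (event: FetchEvent) => {
  event.respondWith(handleRequest(event.request));
});

async function handleRequest(request: Request): Promise<Response> {
  const upstream = await fetch("https://api.example.com/data", {
    headers: { Authorization: `Bearer ${API_KEY}` },
  });
  return new Response(`${GREETING}: upstream returned ${upstream.status}`);
}

With Wrangler, a secret like this can be added via wrangler secret put API_KEY, which prompts for the value so it never lands in your repository.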

As we continue to build out the Workers platform we’d love to hear from you. Let us know if you’re interested in participating in user research, or if you just have something to say.

How we used our new GraphQL Analytics API to build Firewall Analytics

Post Syndicated from Nick Downie original https://blog.cloudflare.com/how-we-used-our-new-graphql-api-to-build-firewall-analytics/


Firewall Analytics is the first product in the Cloudflare dashboard to utilize the new GraphQL Analytics API. All Cloudflare dashboard products are built using the same public APIs that we provide to our customers, allowing us to understand the challenges they face when interfacing with our APIs. This parity helps us build and shape our products, most recently the new GraphQL Analytics API that we’re thrilled to release today.

By defining the data we want, along with the response format, our GraphQL Analytics API has enabled us to prototype new functionality and iterate quickly from our beta user feedback. It is helping us deliver more insightful analytics tools within the Cloudflare dashboard to our customers.

Our user research and testing for Firewall Analytics surfaced common use cases in our customers’ workflow:

  • Identifying spikes in firewall activity over time
  • Understanding the common attributes of threats
  • Drilling down into granular details of an individual event to identify potential false positives

We can address all of these use cases using our new GraphQL Analytics API.

GraphQL Basics

Before we look into how to address each of these use cases, let’s take a look at the format of a GraphQL query and how our schema is structured.

A GraphQL query is composed of a structured set of fields, for which the server provides corresponding values in its response. The schema defines which fields are available and their type. You can find more information about the GraphQL query syntax and format in the official GraphQL documentation.

To run some GraphQL queries, we recommend downloading a GraphQL client, such as GraphiQL, to explore our schema and run some queries. You can find documentation on getting started with this in our developer docs.
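If you’d rather script queries than use a client, a query is just an HTTP POST. Here is a minimal TypeScript sketch; the endpoint path and Bearer-token auth are assumptions worth verifying in the developer docs, and the inner field selection is illustrative:

async function runQuery(apiToken: string, zoneTag: string): Promise<unknown> {
  const query = `
    {
      viewer {
        zones(filter: { zoneTag: "${zoneTag}" }) {
          zoneTag
        }
      }
    }
  `;

  const response = await fetch("https://api.cloudflare.com/client/v4/graphql", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiToken}`,
    },
    body: JSON.stringify({ query }),
  });

  // GraphQL reports errors in the response body, not only via status codes.
  const { data, errors } = (await response.json()) as {
    data: unknown;
    errors: unknown[] | null;
  };
  if (errors && errors.length) throw new Error(JSON.stringify(errors));
  return data;
}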

At the top level of the schema is the viewer field. This represents the top-level node for the user running the query. Within this, we can query the zones field to find zones the current user has access to, providing a filter argument with a zoneTag set to the identifier of the zone we’d like to narrow down to.

{
  viewer {
    zones(filter: { zoneTag: "YOUR_ZONE_ID" }) {
      # Here is where we'll query our firewall events
    }
  }
}

Now that we have a query that finds our zone, we can start querying the firewall events which have occurred in that zone, to help solve some of the use cases we’ve identified.

Visualising spikes in firewall activity

It’s important for customers to be able to visualise and understand anomalies and spikes in their firewall activity, as these could indicate an attack or be the result of a misconfiguration.

Plotting events in a timeseries chart, by their respective action, provides users with a visual overview of the trend of their firewall events.

Within the zones field in the query we created earlier, we can query our firewall event aggregates using the firewallEventsAdaptiveGroups field, providing arguments for:

  • A limit for the count of groups
  • A filter for the date range we’re looking for (combined with any user-entered filters)
  • A list of fields to orderBy (in this case, just the datetimeHour field that we’re grouping by).

By adding the dimensions field, we’re querying for groups of firewall events, aggregated by the fields nested within dimensions. In this case, our query includes the action and datetimeHour fields, meaning the response will be groups of firewall events which share the same action, and fall within the same hour. We also add a count field, to get a numeric count of how many events fall within each group.

query FirewallEventsByTime($zoneTag: string, $filter: FirewallEventsAdaptiveGroupsFilter_InputObject) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      firewallEventsAdaptiveGroups(
        limit: 576
        filter: $filter
        orderBy: [datetimeHour_DESC]
      ) {
        count
        dimensions {
          action
          datetimeHour
        }
      }
    }
  }
}

Note – each of our group queries requires a limit to be set. A firewall event can have one of 8 possible actions, and we are querying over a 72 hour period. At most, we’ll end up with 576 groups, so we can set that as the limit for our query.

This query would return a response in the following format:

{
  "viewer": {
    "zones": [
      {
        "firewallEventsAdaptiveGroups": [
          {
            "count": 5,
            "dimensions": {
              "action": "jschallenge",
              "datetimeHour": "2019-09-12T18:00:00Z"
            }
          }
          ...
        ]
      }
    ]
  }
}

We can then take these groups and plot each as a point on a time series chart. Mapping over the firewallEventsAdaptiveGroups array, we can use each group’s count property as the y-axis value, the action field nested within dimensions as the series, and the datetimeHour field as the timestamp on the x-axis.
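A sketch of that mapping in TypeScript (the chart library itself is omitted; the shapes mirror the response above):

interface FirewallGroup {
  count: number;
  dimensions: { action: string; datetimeHour: string };
}

// Build one series per action, each a list of (timestamp, count) points.
function toChartSeries(
  groups: FirewallGroup[]
): Map<string, Array<{ x: string; y: number }>> {
  const series = new Map<string, Array<{ x: string; y: number }>>();
  for (const group of groups) {
    const points = series.get(group.dimensions.action) ?? [];
    points.push({ x: group.dimensions.datetimeHour, y: group.count });
    series.set(group.dimensions.action, points);
  }
  return series;
}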


Top Ns

After identifying a spike in activity, our next step is to highlight events with commonality in their attributes. For example, if a certain IP address or individual user agent is causing many firewall events, this could be a sign of an individual attacker, or could be surfacing a false positive.

Similarly to before, we can query aggregate groups of firewall events using the firewallEventsAdaptiveGroups field. However, in this case, instead of supplying action and datetimeHour to the group’s dimensions, we can add individual fields that we want to find common groups of.

By ordering by descending count, we’ll retrieve groups with the highest commonality first, limiting to the top 5 of each. We can add a single field nested within dimensions to group by it. For example, adding clientIP will give five groups with the IP addresses causing the most events.

We can also add a firewallEventsAdaptiveGroups field with no nested dimensions. This will create a single group which allows us to find the total count of events matching our filter.

query FirewallEventsTopNs($zoneTag: string, $filter: FirewallEventsAdaptiveGroupsFilter_InputObject) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      topIPs: firewallEventsAdaptiveGroups(
        limit: 5
        filter: $filter
        orderBy: [count_DESC]
      ) {
        count
        dimensions {
          clientIP
        }
      }
      topUserAgents: firewallEventsAdaptiveGroups(
        limit: 5
        filter: $filter
        orderBy: [count_DESC]
      ) {
        count
        dimensions {
          userAgent
        }
      }
      total: firewallEventsAdaptiveGroups(
        limit: 1
        filter: $filter
      ) {
        count
      }
    }
  }
}

Note – we can add the firewallEventsAdaptiveGroups field multiple times within a single query, each aliased differently. This allows us to fetch multiple different groupings by different fields, or with no groupings at all. In this case, getting a list of top IP addresses, top user agents, and the total events.


We can then reference each of these aliases in the UI, mapping over their respective groups to render each row with its count and a bar representing its share of the total events.
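For example, a small helper along these lines (names are illustrative) turns the aliased groups into rows with their share of the total:

interface CountGroup {
  count: number;
  dimensions?: Record<string, string>;
}

function toTopNRows(top: CountGroup[], total: CountGroup[]) {
  const totalCount = total[0]?.count ?? 0;
  return top.map((group) => ({
    ...group.dimensions,
    count: group.count,
    // Share of all matching events, used to size the proportional bar.
    share: totalCount > 0 ? group.count / totalCount : 0,
  }));
}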

Are these firewall events false positives?

After users have identified spikes, anomalies and common attributes, we wanted to surface more information as to whether these have been caused by malicious traffic, or are false positives.

To do this, we wanted to provide additional context on the events themselves, rather than just counts. We can do this by querying the firewallEventsAdaptive field for these events.

Our GraphQL schema uses the same filter format for both the aggregate firewallEventsAdaptiveGroups field and the raw firewallEventsAdaptive field. This allows us to use the same filters to fetch the individual events which summate to the counts and aggregates in the visualisations above.

query FirewallEventsList($zoneTag: string, $filter: FirewallEventsAdaptiveFilter_InputObject) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      firewallEventsAdaptive(
        filter: $filter
        limit: 10
        orderBy: [datetime_DESC]
      ) {
        action
        clientAsn
        clientCountryName
        clientIP
        clientRequestPath
        clientRequestQuery
        datetime
        rayName
        source
        userAgent
      }
    }
  }
}


Once we have our individual events, we can render all of the individual fields we’ve requested, providing users with the additional context on each event that they need to determine whether it is a false positive or not.

That’s how we used our new GraphQL Analytics API to build Firewall Analytics, helping solve some of our customers’ most common security workflow use cases. We’re excited to see what you build with it, and the problems you can help tackle.

You can find out how to get started querying our GraphQL Analytics API using GraphiQL in our developer documentation, or learn more about writing GraphQL queries on the official GraphQL Foundation documentation.

Introducing the GraphQL Analytics API: exactly the data you need, all in one place

Post Syndicated from Filipp Nisenzoun original https://blog.cloudflare.com/introducing-the-graphql-analytics-api-exactly-the-data-you-need-all-in-one-place/


Today we’re excited to announce a powerful and flexible new way to explore your Cloudflare metrics and logs, with an API conforming to the industry-standard GraphQL specification. With our new GraphQL Analytics API, all of your performance, security, and reliability data is available from one endpoint, and you can select exactly what you need, whether it’s one metric for one domain or multiple metrics aggregated for all of your domains. You can ask questions like “How many cached bytes have been returned for these three domains?” Or, “How many requests have all the domains under my account received?” Or even, “What effect did changing my firewall rule an hour ago have on the responses my users were seeing?”

The GraphQL standard also has strong community resources, from extensive documentation to front-end clients, making it easy to start creating simple queries and progress to building your own sophisticated analytics dashboards.

From many APIs…

Providing insights has always been a core part of Cloudflare’s offering. After all, by using Cloudflare, you’re relying on us for key parts of your infrastructure, and so we need to make sure you have the data to manage, monitor, and troubleshoot your website, app, or service. Over time, we developed a few key data APIs, including ones providing information regarding your domain’s traffic, DNS queries, and firewall events. This multi-API approach was acceptable while we had only a few products, but we started to run into some challenges as we added more products and analytics. We couldn’t expect users to adopt a new analytics API every time they started using a new product. In fact, some of the customers and partners that were relying on many of our products were already becoming confused by the various APIs.

Following the multi-API approach was also affecting how quickly we could develop new analytics within the Cloudflare dashboard, which is used by more people for data exploration than our APIs. Each time we built a new product, our product engineering teams had to implement a corresponding analytics API, which our user interface engineering team then had to learn to use. This process could take up to several months for each new set of analytics dashboards.

…to one

Our new GraphQL Analytics API solves these problems by providing access to all Cloudflare analytics. It offers a standard, flexible syntax for describing exactly the data you need and provides predictable, matching responses. This approach makes it an ideal tool for:

  1. Data exploration. You can think of it as a way to query your own virtual data warehouse, full of metrics and logs regarding the performance, security, and reliability of your Internet property.
  2. Building amazing dashboards, which allow for flexible filtering, sorting, and drilling down or rolling up. Creating these kinds of dashboards would normally require paying thousands of dollars for a specialized analytics tool. You get them as part of our product and can customize them for yourself using the API.

In a companion post that was also published today, my colleague Nick discusses using the GraphQL Analytics API to build dashboards. So, in this post, I’ll focus on examples of how you can use the API to explore your data. To make the queries, I’ll be using GraphiQL, a popular open-source querying tool that takes advantage of GraphQL’s capabilities.

Introspection: what data is available?

The first thing you may be wondering: if the GraphQL Analytics API offers access to so much data, how do I figure out what exactly is available, and how to ask for it? GraphQL makes this easy by offering “introspection,” meaning you can query the API itself to see the available data sets, the fields and their types, and the operations you can perform. GraphiQL uses this functionality to provide a “Documentation Explorer,” query auto-completion, and syntax validation. For example, the Documentation Explorer lets me browse all the data sets available for a zone (domain).


If I’m writing a query and I’m interested in data on firewall events, auto-complete helps me quickly find the relevant data sets and fields.


Querying: examples of questions you can ask

Let’s say you’ve made a major product announcement and expect a surge in requests to your blog, your application, and several other zones (domains) under your account. You can check if this surge materializes by asking for the requests aggregated under your account, in the 30 minutes after your announcement post, broken down by the minute:

{
  viewer {
    accounts(filter: { accountTag: $accountTag }) {
      httpRequests1mGroups(
        limit: 30
        filter: { datetime_geq: "2019-09-16T20:00:00Z", datetime_lt: "2019-09-16T20:30:00Z" }
        orderBy: [datetimeMinute_ASC]
      ) {
        dimensions {
          datetimeMinute
        }
        sum {
          requests
        }
      }
    }
  }
}

The response shows the requests for your account, broken down by the minute.

Now, let’s say you want to compare the traffic coming to your blog versus your marketing site over the last hour. You can do this in one query, asking for the number of requests to each zone:

{
  viewer {
    zones(filter: { zoneTag_in: [$zoneTag1, $zoneTag2] }) {
      httpRequests1hGroups(
        limit: 2
        filter: { datetime_geq: "2019-09-16T20:00:00Z", datetime_lt: "2019-09-16T21:00:00Z" }
      ) {
        sum {
          requests
        }
      }
    }
  }
}


Finally, let’s say you’re seeing an increase in error responses. Could this be correlated to an attack? You can look at error codes and firewall events over the last 15 minutes, for example:

{
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      httpRequests1mGroups(
        limit: 100
        filter: { datetime_geq: "2019-09-16T21:00:00Z", datetime_lt: "2019-09-16T21:15:00Z" }
      ) {
        sum {
          responseStatusMap {
            edgeResponseStatus
            requests
          }
        }
      }
      firewallEventsAdaptiveGroups(
        limit: 100
        filter: { datetime_geq: "2019-09-16T21:00:00Z", datetime_lt: "2019-09-16T21:15:00Z" }
      ) {
        dimensions {
          action
        }
        count
      }
    }
  }
}

Notice that, in this query, we’re looking at multiple datasets at once, using a common zone identifier to “join” them.

By examining both data sets in parallel, we can see a correlation: 31 requests were “dropped” or blocked by the Firewall, which is exactly the same as the number of “403” responses. So, the 403 responses were a result of Firewall actions.

Try it today

To learn more about the GraphQL Analytics API and start exploring your Cloudflare data, follow the “Getting started” guide in our developer documentation, which also has details regarding the current data sets and time periods available. We’ll be adding more data sets over time, so take advantage of the introspection feature to see the latest available.

Finally, to make way for the new API, the Zone Analytics API is now deprecated and will be sunset on May 31, 2020. The data that Zone Analytics provides is available from the GraphQL Analytics API. If you’re currently using the API directly, please follow our migration guide to change your API calls. If you get your analytics using the Cloudflare dashboard or our Datadog integration, you don’t need to take any action.

One more thing….

In the API examples above, if you find it helpful to get analytics aggregated for all the domains under your account, we have something else you may like: a brand new Analytics dashboard (in beta) that provides this same information. If your account has many zones, the dashboard is helpful for knowing summary information on metrics such as requests, bandwidth, cache rate, and error rate. Give it a try and let us know what you think using the feedback link above the new dashboard.

What’s new with Workers KV?

Post Syndicated from Steve Klabnik original https://blog.cloudflare.com/whats-new-with-workers-kv/


The Storage team here at Cloudflare shipped Workers KV, our global, low-latency, key-value store, earlier this year. As people have started using it, we’ve gotten some feature requests, and have shipped some new features in response! In this post, we’ll talk about some of these use cases and how these new features enable them.

New KV APIs

We’ve shipped some new APIs, both via api.cloudflare.com, as well as inside of a Worker. The first one provides the ability to upload and delete more than one key/value pair at once. Given that Workers KV is great for read-heavy, write-light workloads, a common pattern when getting started with KV is to write a bunch of data via the API, and then read that data from within a Worker. You can now do these bulk uploads without needing a separate API call for every key/value pair. This feature is available via api.cloudflare.com, but is not yet available from within a Worker.

For example, say we’re using KV to redirect legacy URLs to their new homes. We have a list of URLs to redirect, and where they should redirect to. We can turn this list into JSON that looks like this:

[
  {
    "key": "/old/post/1",
    "value": "/new-post-slug-1"
  },
  {
    "key": "/old/post/2",
    "value": "/new-post-slug-2"
  }
]

And then POST this JSON to the new bulk endpoint, /storage/kv/namespaces/:namespace_id/bulk. This will add both key/value pairs to our namespace.

Likewise, if we wanted to drop support for these redirects, we could issue a DELETE that has this body:

[
    "/old/post/1",
    "/old/post/2"
]

to /storage/kv/namespaces/:namespace_id/bulk, and we’d delete both key/value pairs in a single call to the API.
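Putting the two calls together over HTTP looks roughly like this. The full URL shape (including the account ID) and Bearer-token auth are assumptions worth verifying against the API documentation; ACCOUNT_ID, NAMESPACE_ID, and API_TOKEN are placeholders:

declare const ACCOUNT_ID: string;   // placeholder
declare const NAMESPACE_ID: string; // placeholder
declare const API_TOKEN: string;    // placeholder

const bulkUrl =
  `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}` +
  `/storage/kv/namespaces/${NAMESPACE_ID}/bulk`;

async function bulkWriteAndDelete(): Promise<void> {
  // Bulk write: many key/value pairs in a single request.
  await fetch(bulkUrl, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify([
      { key: "/old/post/1", value: "/new-post-slug-1" },
      { key: "/old/post/2", value: "/new-post-slug-2" },
    ]),
  });

  // Bulk delete: an array of keys in the request body.
  await fetch(bulkUrl, {
    method: "DELETE",
    headers: {
      Authorization: `Bearer ${API_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(["/old/post/1", "/old/post/2"]),
  });
}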

The bulk upload API has one more trick up its sleeve: not all data is a string. For example, you may have an image as a value, which is just a bag of bytes. If you need to write some binary data, you’ll have to base64-encode the value’s contents so that it’s valid JSON. You’ll also need to set one more key:

[
  {
    "key": "profile-picture",
    "value": "aGVsbG8gd29ybGQ=",
    "base64": true
  }
]

Workers KV will decode the value from base64, and then store the resulting bytes.

Beyond bulk upload and delete, we’ve also given you the ability to list all of the keys you’ve stored in any of your namespaces, from both the API and within a Worker. For example, if you wrote a blog powered by Workers + Workers KV, you might have each blog post stored as a key/value pair in a namespace called “contents”. Most blogs have some sort of “index” page that lists all of the posts that you can read. To create this page, we need to get a listing of all of the keys, since each key corresponds to a given post. We could do this from within a Worker by calling list() on our namespace binding:

const value = await contents.list()

But what we get back isn’t only a list of keys. The object looks like this:

{
  keys: [
    { name: "Title 1" },
    { name: "Title 2" }
  ],
  list_complete: false,
  cursor: "6Ck1la0VxJ0djhidm1MdX2FyD"
}

We’ll talk about this “cursor” stuff in a second, but if we wanted to get the list of titles, we’d have to iterate over the keys property, and pull out the names:

const keyNames = value.keys.map(e => e.name)

keyNames would be an array of strings:

["Title 1", "Title 2", "Title 3", "Title 4", "Title 5"]

We could take keyNames and those titles to build our page.

So what’s up with the list_complete and cursor properties? Well, imagine that we’ve been a very prolific blogger, and we’ve now written thousands of posts. The list API is paginated, meaning that it will only return the first thousand keys. To see if there are more pages available, you can check the list_complete property. If it is false, you can use the cursor to fetch another page of results. The value of cursor is an opaque token that you pass to another call to list:

const value = await NAMESPACE.list()
const cursor = value.cursor
const next_value = await NAMESPACE.list({"cursor": cursor})

This will give us another page of results, and we can repeat this process until list_complete is true.
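As a sketch, paging through every key in a namespace from a Worker looks something like this (the minimal interface below stands in for the real namespace binding):

interface ListResult {
  keys: Array<{ name: string }>;
  list_complete: boolean;
  cursor?: string;
}

interface Namespace {
  list(options?: { cursor?: string }): Promise<ListResult>;
}

async function listAllKeyNames(namespace: Namespace): Promise<string[]> {
  const names: string[] = [];
  let cursor: string | undefined;
  do {
    const page = await namespace.list(cursor ? { cursor } : {});
    names.push(...page.keys.map((key) => key.name));
    // Keep paging until the store reports the listing is complete.
    cursor = page.list_complete ? undefined : page.cursor;
  } while (cursor);
  return names;
}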

Listing keys has one more trick up its sleeve: you can also return only keys that have a certain prefix. Imagine we want to have a list of posts, but only the posts that were made in October of 2019. While Workers KV is only a key/value store, we can use the prefix functionality to do interesting things by filtering the list. In our original implementation, we had stored the titles of keys only:

  • Title 1
  • Title 2

We could change this to include the date in YYYY-MM-DD format, with a colon separating the two:

  • 2019-09-01:Title 1
  • 2019-10-15:Title 2

We can now ask for a list of all posts made in 2019:

const value = await NAMESPACE.list({"prefix": "2019"})

Or a list of all posts made in October of 2019:

const value = await NAMESAPCE.list({"prefix": "2019-10"})

These calls will only return keys with the given prefix, which, in our case, corresponds to a date. This technique can let you group keys together in interesting ways. We’re looking forward to seeing what you all do with this new functionality!

Relaxing limits

For various reasons, there are a few hard limits with what you can do with Workers KV. We’ve decided to raise some of these limits, which expands what you can do.

The first is the limit of the number of namespaces any account could have. This was previously set at 20, but some of you have made a lot of namespaces! We’ve decided to relax this limit to 100 instead. This means you can create five times the number of namespaces you previously could.

Additionally, we had a two megabyte maximum size for values. We’ve increased the limit for values to ten megabytes. With the release of Workers Sites, folks are keeping things like images inside of Workers KV, and two megabytes felt a bit cramped. While Workers KV is not a great fit for truly large values, ten megabytes gives you the ability to store larger images easily. As an example, a 4k monitor has a native resolution of 4096 x 2160 pixels. If we had an image at this resolution as a lossless PNG, for example, it would be just over five megabytes in size.

KV browser

Finally, you may have noticed that there’s now a KV browser in the dashboard! Needing to type out a cURL command just to see what’s in your namespace was a real pain, so we’ve given you the ability to check out the contents of your namespaces right on the web. When you look at a namespace, you’ll also see a table of keys and values.

The browser has grown with a bunch of useful features since it initially shipped. You can not only see your keys and values, but also add new ones, edit existing ones, upload files, and even download them.

As we ship new features in Workers KV, we’ll be expanding the browser to include them too.

Wrangler integration

The Workers Developer Experience team has also been shipping some features related to Workers KV. Specifically, you can fully interact with your namespaces and the key/value pairs inside of them.

For example, my personal website is running on Workers Sites. I have a Wrangler project named “website” to manage it. If I wanted to add another namespace, I could do this:

$ wrangler kv:namespace create new_namespace
Creating namespace with title "website-new_namespace"
Success: WorkersKvNamespace {
    id: "<id>",
    title: "website-new_namespace",
}

Add the following to your wrangler.toml:

kv-namespaces = [
    { binding = "new_namespace", id = "<id>" }
]

I’ve redacted the namespace IDs here, but Wrangler let me know that the creation was successful, and provided me with the configuration I need to put in my wrangler.toml. Once I’ve done that, I can add new key/value pairs:

$ wrangler kv:key put "hello" "world" --binding new_namespace
Success

And read it back out again:

$ wrangler kv:key get "hello" --binding new_namespace
world

If you’d like to learn more about the design of these features, “How we design features for Wrangler, the Cloudflare Workers CLI” discusses them in depth.

More to come

The Storage team is working hard at improving Workers KV, and we’ll keep shipping new stuff every so often. Our updates will be more regular in the future. If there’s something you’d particularly like to see, please reach out!

Inside the Web Browser’s Performance API

Post Syndicated from Young Park original https://blog.cloudflare.com/browser-performance-api/


Building a beautiful, feature-rich website is easier than ever before. Not long ago, you’d have to fire up a text editor and hand-craft a lot of HTML, CSS, and JavaScript. Today, you can use WYSIWYG tools and third-party libraries that make development much simpler. The flip side of this is that it can be hard to see everything that’s going into your website — and the performance can suffer.

The good news is that modern web browsers expose lots of performance data that can help you understand how your web page performs. With the launch of Browser Insights today, we can analyze the performance from the perspective of the web browser and what the end user actually experiences. In this post, we’ll dive into how we think about performance and utilize the timing APIs in the web browser.

How web browsers measure performance

In the old days, the only way for a developer to profile performance was to intercept requests and measure the time from the beginning of the page load until the end of the load event.

Today, we can use Web APIs that are supported by modern browsers. This is part of the web standard called the Performance API. The Performance API consists of a collection of individual APIs that include:

  • Navigation Timing (for timing information relating to the page and navigation)
  • Resource Timing (for timing data regarding the loading of an application’s resources)
  • Paint Timing (for timing information about paint operations during page construction)

In this blog post, we will primarily focus on the Navigation Timing API.

Inside the Performance API

To see what’s collected with the Performance API, you can open the Developer Tools in Chrome browser and type ‘performance’ in the console tab (or type in performance.timing to get direct access to the PerformanceTiming attribute).

Try expanding the Performance object by clicking on the arrow before the label, and then expand ‘timing’. This is PerformanceTiming, which includes all the timings relating to the current page load as UNIX epoch timestamps (milliseconds). The timing attributes shown are not in load order, so for a better understanding, let’s look at the illustration provided by the W3C.

Image from https://www.w3.org/TR/navigation-timing/

As we can see from the diagram, each element (represented as a box), in order from left to right, represents a stage in the navigation flow of the page load. Each element has attributes marking its start and end (and some have multiple attributes!) so that we can measure the elapsed time for each stage. For example, to get the Request time, you could type a command like the one below into the console; in this example it comes out to about 60 milliseconds.
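Reconstructed from the timing attribute definitions (the exact number will vary from load to load), the command looks something like this:

performance.timing.responseStart - performance.timing.requestStart
// 60 (milliseconds, in this example)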


How Cloudflare uses performance data

Once your website is proxied through Cloudflare and Browser Insights is enabled, we write and inject a JavaScript beacon into the web page. Our beacon collects metrics from the Performance API to send to our edge, where they can be used to understand where your website is slowing down or having network problems. The reported data is shown in the Cloudflare dashboard on the Speed page as averages of each timing metric.


The metrics we surface are listed below, with a sketch after the list showing how each can be computed:

  • DNS (domainLookupEnd – domainLookupStart): How long the DNS query takes. This could appear as zero if the connection is reused or the content was stored in the local cache (memory or disk).
  • TCP (connectEnd – connectStart): How long it takes to establish a TCP connection to the server. If HTTPS, this process includes TLS negotiation time.
  • Request (responseStart – requestStart): The time elapsed between making an HTTP request and receiving the first byte of the response.
  • Response (responseEnd – responseStart): The time elapsed between the first byte and the last byte of the response received. You can think of this as a resource download time.
  • Processing (domComplete – domLoading): How long it took to render the page. If this number is big, you can optimize your document architecture or resource sizes, or configure settings on the Speed page, such as Auto Minify of the source code. This document processing can be drilled down further with domInteractive, domContentLoadedEventStart, domContentLoadedEventEnd, and domComplete. We plan to provide more detailed analytics on this later on.
  • Load Event (loadEventEnd – loadEventStart): When the browser finishes loading its document and resources, it triggers a `load` event. This duration may be helpful to you if you have additional functions or any logic for the load event.
  • Total Time: The sum of the timing metrics shown on the graph.
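As a rough sketch, each of these durations can be computed directly from the performance.timing attributes in the console once the page has finished loading (plain attribute arithmetic; nothing Cloudflare-specific involved):

const t = performance.timing;

const metrics = {
  dns: t.domainLookupEnd - t.domainLookupStart,
  tcp: t.connectEnd - t.connectStart,
  request: t.responseStart - t.requestStart,
  response: t.responseEnd - t.responseStart,
  processing: t.domComplete - t.domLoading,
  loadEvent: t.loadEventEnd - t.loadEventStart,
};

console.table(metrics);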

If you see spikes or an unusual shape in the stacked line chart, you can start investigating each element to see what is causing the problem.

For more about how to use Browser Insights, see our announcement blog post.

What’s next

In this blog post, we’ve focused on the Navigation Timing API, because it’s at the heart of our first version of Browser Insights. In the near future, we plan to incorporate metrics from some of the other APIs. For example, we can break down some of the longer timings by looking at individual resource loads, and pointing out what’s taking longer. In addition to that, we plan to track JavaScript errors, provide a way to measure A/B performance, set up monitoring/alerting based on the metrics, and so on. So stay tuned!

How Castle is Building Codeless Customer Account Protection

Post Syndicated from Guest Author original https://blog.cloudflare.com/castle-building-codeless-customer-account-protection/


This is a guest post by Johanna Larsson, of Castle, who designed and built the Castle Cloudflare app and the supporting infrastructure.

Strong security should be easy.

Asking your consumers again and again to take responsibility for their security through robust passwords and other security measures doesn’t work. The responsibility of security needs to shift from end users to the companies who serve them.

Castle is leading the way for companies to better protect their online accounts, with millions of consumers being protected every day. Uniquely, Castle extends threat prevention and protection both pre- and post-login, ensuring you can keep friction low but security high. With realtime responses and automated workflows for account recovery, overwhelmed security teams are given a hand. However, when you’re that busy, sometimes deploying new solutions takes more time than you have. Reducing time to deployment was a priority, so Castle turned to Cloudflare Workers.

User security and friction

When security is no longer optional and threats are not black or white, security teams are left trying to determine how to allow end-user access and transaction completion when there are hints of risk, or when not all of the information is available. Keeping friction low is important to the customer experience. Castle helps organizations be more dynamic and proactive by making continuous security decisions based on realtime risk and trust.

Some of the challenges with traditional solutions are that they are often focused only on protecting the app, or only on the point of access (protecting against bot access, for example). Tools specifically designed for securing user accounts, however, are fundamentally focused on protecting the accounts of the end users, whether they are being targeted by humans or bots. Being able to understand end-user behaviors and their devices both pre- and post-login is therefore critical to truly protecting each user. The key to protecting users is being able to decipher between normal and anomalous activity on an individual account and device basis. You also need a playbook to respond to anomalies and attacks with dedicated flows that allow your end users to interact directly and provide feedback around security events.

By understanding the end user and their good behaviors, devices, and transactions, it is possible to automatically respond to account threats in real-time based on risk level and policy. This approach not only reduces end-user friction but enables security teams to feel more confident that they won’t ever be blocking a legitimate login or transaction.

Castle processes tens of millions of events every day through its APIs, including contextual information like headers, IP, and device types. The more information that can be associated with a request the better. This allows us to better recognize abnormalities and protect the end user. Collection of this information is done in two ways. One is done on the web application’s backend side through our SDKs and the other is done on the client side using our mobile SDK or browser script. Our experience shows that any integration of a security service based on user behavior and anomaly detection can involve many different parties across an organization, and it affects multiple layers of the tech stack. On top of the security related roles, it’s not unusual to also have to coordinate between backend, devops, and frontend teams. The information related to an end user session is often spread widely over a code base.

The cost of security

One of the biggest challenges in implementing a user-facing security and risk management solution is the variety of people and teams it needs attention from, each with competing priorities. Security teams are often understaffed and overwhelmed, making it difficult to take on new projects. At the same time, it consumes time from product and engineering personnel on the application side, who are responsible for UX flows and performing continuous authentication post-login.

We’ve been experimenting with approaches where we can extract that complexity from your application code base, while also reducing the effort of integrating. At Castle, we believe that strong security should be easy.


With Cloudflare we found a service that enables us to create a more friendly, simple, and in the end, safe integration process by placing the security layer directly between the end user and your application. Security-related logic shouldn’t pollute your app, but should reside in a separate service, or shield, that covers your app. When the two environments are kept separate, this reduces the time and cost of implementing complex systems making integration and maintenance less stressful and much easier.

Our integration with Cloudflare aims to solve this implementation challenge, delivering end-to-end account protection for your users, both pre and post login, with the click of a button.

The codeless integration

In our quest for a purely codeless integration, key features are required. Every customer application is different, which means every integration is different. We want to solve this problem for you once and for all. To do this, we needed to move the security work away from the implementation details so that we could instead focus on describing the key interactions with the end user, like logins or bank transactions. We also wanted to empower key decision makers to recognize and handle crucial interactions in their systems. Creating a single solution that could be customized to fit each specific use case was a priority.

Building on top of Cloudflare’s platform, we made use of three unique and powerful products: Workers, Apps for Workers, and Workers KV.

Thanks to Workers we have full access to the interactions between the end user and your application. With their impressive performance, we can confidently run inline with website requests without creating noticeable latency. We will never slow down your site. And in order to achieve the flexibility required to match your specific use case, we created an internal configuration format that fully describes the interactions of devices and servers across HTTP, including web and mobile app traffic. It is in this Worker that we’ve implemented an advanced routing engine to match requests and responses to events and collect information about them, directly from the edge. It also fully handles injecting the Castle browser script — one less thing to worry about.

All of this logic is kept separate from your application code, and through the Cloudflare App Store we are able to distribute this Worker, giving you control over when and where it is enabled, as well as what configurations are used. There’s no need to copy/paste code or manage your own Workers.

In order to achieve the required speed while running in distributed edge locations, we needed a high-performing, low-latency datastore, and we found one in the Cloudflare Workers KV Store. Cloudflare Apps are not able to access the KV Store directly, but we’ve solved this by exposing it through a separate Worker that the Castle App connects to. Because traffic between Workers never leaves the Cloudflare network, this is both secure and fast enough to match your requirements. The KV Store allows us to maintain end user sessions across the world, and also gives us a place to store and update the configurations and sessions that drive the Castle App.

In combining these products we have a complete and codeless integration that is fully configurable and that won’t slow you down.

How does it work?

The data flow is straightforward. After installing the Castle App, Cloudflare will route your traffic through the Castle App, which uses the Castle Data Store and our API to intelligently protect your end users. The impact to traffic latency is minimal because most work is done in the background, not blocking the requests. Let’s dig deeper into each technical feature:

Script injection

One of the tools we use to verify user identity is a browser script: Castle.js. It is responsible for gathering device information and UI interaction behavior, and although it is not required for our service to function, it helps improve our verdicts. This means it’s important that it is properly added to every page in your web application. The Castle App, running between the end user and your application, is able to unobtrusively add the script to each page as it is served. In order for the script to also track page interactions it needs to be able to connect them to your users, which is done through a call to our script and works out of the box with the Cloudflare integration. This removes 100% of the integration work from your frontend teams.

Collect contextual information

The second half of the information that forms the basis of our security analysis is the information related to the request itself, such as IP and headers, as well as timestamps. Gathering this information may seem straightforward, but our experience shows some recurring problems in traditional integrations. IP-addresses are easily lost behind reverse proxies, as they need to be maintained as separate headers, like `X-Forwarded-For`, and the internal format of headers differs from platform to platform. Headers in general might get cut off based on whitelisting. The Castle App sees the original request as it comes in, with no outside influence or platform differences, enabling it to reliably create the context of the request. This saves your infrastructure and backend engineers from huge efforts debugging edge cases.

Advanced routing engine

Finally, in order to reliably recognize important events, like login attempts, we’ve built a fully configurable routing engine. This is fast enough to run inline with your web application, and supports near real-time configuration updates. It is powerful enough to translate requests to actual events in your system, like logins, purchases, profile updates or transactions. Using information from the request, it is then able to send this information to Castle, where you are able to analyze, verify and take action on suspicious activity. What’s even better is that at any point in the future, if you want to Castle-protect a new critical user event – such as a withdrawal or transfer event – all it takes is adding a record to the configuration file. You never have to touch application code in order to expand your Castle integration across sensitive events.

We’ve put together an example TypeScript snippet that naively implements the flow and features we’ve discussed. The details are glossed over so that we can focus on the functionality.

addEventListener("fetch", event => event.respondWith(handleEvent(event)));

const handleEvent = async (event: CloudflareEvent) => {
  // You configure the application with your Castle API key
  const { apiKey } = INSTALL_OPTIONS;
  const { request } = event;

  // Configuration is fetched from the KV Store
  const configuration = await getConfiguration(apiKey);

  // The session is also retrieved from the KV Store
  const session = await getUserSession(request);

  // Pass the request through and get the response
  let response = await fetch(request);

  // Using the configuration we can recognize events by running
  // the request+response and configuration through our matching engine
  const securityEvent = getMatchingEvent(request, response, configuration);

  if (securityEvent) {
    // With direct access to the raw request, we can confidently build the context
    // including a device ID generated by the browser script, IP, and headers
    const requestContext = getRequestContext(request);

    // Collecting the relevant information, the data is passed to the Castle API
    event.waitUntil(sendToCastle(securityEvent, session, requestContext));
  }

  // Because we have access to the response HTML page we can safely inject the browser
  // script. If the response is not an HTML page it is passed through untouched.
  response = injectScript(response, session);

  return response;
};

We hope we have inspired you and demonstrated how Workers can provide speed and flexibility when implementing end to end account protection for your end users with Castle. If you are curious about our service, learn more here.

Top Resources for API Architects and Developers

Post Syndicated from George Mao original https://aws.amazon.com/blogs/architecture/top-resources-for-api-architects-and-developers/

We hope you’ve enjoyed reading our series on API architecture and development. We wrote about best practices for REST APIs with Amazon API Gateway and GraphQL APIs with AWS AppSync. This post will cover the top resources that all API developers should be aware of.

Tech Talks, Webinars, and Twitch Live Stream

The technical staff at AWS have produced a variety of digital media that cover new service launches, best practices, and customer questions. Be sure to review these videos for tips and tricks on building APIs:

  • Happy Little APIs: This is a multi-part series produced by our awesome Developer Advocate, Eric Johnson. He leads a series of talks that demonstrate how to build a real-world API.
  • API Gateway’s WebSocket webinar: API Gateway now supports real-time APIs with WebSockets. This webinar covers how to use this feature and why you should let API Gateway manage your real-time APIs.
  • Best practices for building enterprise-grade APIs: API Gateway reduces the time it takes to build and deploy REST APIs, but there are strategies that can make development, security, and management easier.
  • An Intro to AWS AppSync and GraphQL: AppSync helps you build sophisticated data applications with realtime and offline capabilities.

Gain Experience With Hands-On Workshops and Examples

One of the easiest ways to get started with Serverless REST API development is to use the Serverless Application Model (SAM). SAM lets you run APIs and Lambda functions locally on your machine for easy development and testing.

For example, you can configure API Gateway as an Event source for Lambda with just a few lines of code:

Type: Api
Properties:
  Path: /photos
  Method: post

There are many great examples on our GitHub page to help you get started with Authorization (IAM and Cognito), Requests, Responses, various policies, and CORS configurations for API Gateway.

If you’re working with GraphQL, you should review the Amplify Framework. This is an official AWS project that helps you quickly build Web Applications with built in AuthN and backend APIs using REST or GraphQL. With just a few lines of code, you can have Amplify add all required configurations for your GraphQL API. You have two options to integrate your application with an AppSync API:

  1. Directly using the Amplify GraphQL Client
  2. Using the AWS AppSync SDK

An excellent walk-through of the Amplify toolkit is available here, including an example showing how to create a single-page web app using ReactJS powered by an AppSync GraphQL API.

Finally, if you are interested in a full hands on experience, take a look at:

  • The Amazon API Gateway WildRydes workshop. This workshop teaches you how to build a functional single-page web app with a REST backend, powered by API Gateway.
  • The AWS AppSync GraphQL Photo Workshop. This workshop teaches you how to use Amplify to quickly build a photo-sharing web app, powered by AppSync.

Useful Documentation

The official AWS documentation is the source of truth for architects and developers. Get started with the API Gateway developer guide. API Gateway currently has two APIs (V1 and V2) for managing the service. Here is where you can view the SDK and CLI reference.
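As a quick illustration of the two management APIs, here is a boto3 sketch (assumptions: AWS credentials are configured and boto3 is installed; note that the V1 client returns lowercase response keys while the V2 client returns PascalCase):

import boto3

v1 = boto3.client("apigateway")    # V1: manages REST APIs
v2 = boto3.client("apigatewayv2")  # V2: manages WebSocket (and now HTTP) APIs

# List API names from each management API.
print([api["name"] for api in v1.get_rest_apis()["items"]])
print([api["Name"] for api in v2.get_apis()["Items"]])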

Get started with the AppSync developer guide, and review the AppSync management API.

Summary

As an API architect, your job is not only to design and implement the best API for your use case, but also to figure out which type of API is most cost-effective for your product. For example, an application with high request volume (“chatty”) may benefit from a GraphQL implementation instead of REST.

API Gateway currently charges $3.50 / million requests and provides a free tier of 1 Million requests per month. There is tiered pricing that will reduce your costs as request volume rises. AppSync currently charges $4.00 / million for Query and Mutation requests.

While AppSync pricing per request is slightly higher, keep in mind that the nature of GraphQL APIs typically result in significantly fewer overall request numbers.
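As a purely illustrative calculation, using the list prices above and hypothetical traffic: a chatty REST client that makes five calls per page view across 2 million monthly page views generates 10 million requests, roughly $35.00 per month on API Gateway. If a single GraphQL query can serve the same page view, those 2 million requests cost roughly $8.00 per month on AppSync.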

Finally, we encourage you to join us in the coming weeks — we will be starting a series of posts covering messaging best practices.

About the Author

George Mao is a Specialist Solutions Architect at Amazon Web Services, focused on the Serverless platform. George is responsible for helping customers design and operate Serverless applications using services like Lambda, API Gateway, Cognito, and DynamoDB. He is a regular speaker at AWS Summits, re:Invent, and various tech events. George is a software engineer and enjoys contributing to open source projects, delivering technical presentations at technology events, and working with customers to design their applications in the Cloud. George holds a Bachelor of Computer Science and Masters of IT from Virginia Tech.

Announcing the General Availability of API Tokens

Post Syndicated from Garrett Galow original https://blog.cloudflare.com/api-tokens-general-availability/


APIs at Cloudflare


Today we are announcing the general availability of API Tokens – a scalable and more secure way to interact with the Cloudflare API. As part of making a better internet, Cloudflare strives to simplify manageability of a customer’s presence at the edge. Part of the way we do this is by ensuring that all of our products and services are configurable by API. Customers ranging from partners to enterprises to developers want to automate management of Cloudflare. Sometimes that is done via our API directly, and other times it is done via open source software we help maintain like our Terraform provider or Cloudflare-Go library. It is critical that customers who are automating management of Cloudflare can keep their Cloudflare services as secure as possible.

Least Privilege and Why it Matters

Securing software systems is hard. Limiting what a piece of software can do is a good defense that prevents mistakes or malicious actions from having a greater impact than they otherwise would. The principle of least privilege helps guide how much access a given system should have to perform actions. As originally formulated by Jerome Saltzer: “Every program and every privileged user of the system should operate using the least amount of privilege necessary to complete the job.” In the case of Cloudflare, many customers have various domains routing traffic and leveraging many different services. If a bad actor gets unauthorized access to a system, they can use whatever access that system has to cause further damage or steal additional information.

Let’s see how the capabilities of API Tokens fit into the principle of least privilege.

About API Tokens

API Tokens provide three main capabilities:

  1. Scoping API Tokens by Cloudflare resource
  2. Scoping API Tokens by permission
  3. The ability to provision multiple API Tokens

Let’s break down each of these capabilities.

Scoping API Tokens by Cloudflare Resource

Cloudflare separates service configuration by zone, which typically equates to a domain. Additionally, some customers have multiple accounts, each with many zones. It is important that, when granting API access to a service, it only has access to the accounts and zones that are pertinent to the job at hand. API Tokens can be scoped to cover only specific accounts and specific zones. One common use case: if you have a staging zone and a production zone, an API Token can be limited so that it can affect only the staging zone and has no access to the production zone.

Scoping API Tokens by Permission

Being able to scope an API Token to a specific zone is great, but within one zone there are many different services that can be configured: firewall rules, page rules, and load balancers, just to name a few. If a customer has a service that should only be able to create new firewall rules in response to traffic patterns, then also allowing that service to change DNS records is a violation of least privilege. API Tokens allow you to scope each token to specific permissions. Multiple permissions can be combined to create custom tokens that fit specific use cases.

Multiple API Tokens

If you use Cloudflare to protect and accelerate multiple services, then you may be making API changes to Cloudflare from multiple locations – different servers, VMs, containers, or workers. Being able to create an API Token per service means each service is insulated from changes to another. If one API Token is leaked or needs to be rolled, there won’t be any impact to the other services’ API Tokens. The capabilities mentioned previously also mean that each service can be scoped to exactly the actions and resources necessary. This allows customers to better realize the practice of least privilege for accessing Cloudflare by API.

Now let’s walk through how to create an API Token and use it.

Using API Tokens

To create your first API Token, go to the ‘API Tokens’ section of your user profile, which can be found here: dash.cloudflare.com/profile/api-tokens

1. On this page, you will find a list of all of your API Tokens, along with your Global API Key and Origin CA Key.

API Tokens Getting Started – Create Token

To create your first API Token, select ‘Create Token’.


2. On the create screen there are two ways to create your token. You can create it from scratch through the ‘Custom’ option or you can start with a predefined template by selecting ‘Start with a template’.

API Token Template Selection

For this case, we will use the ‘Edit zone DNS’ template to create an API Token that can edit a single zone’s DNS records.


3. Once the template is selected, we need to pick a zone for the API Token to be scoped to. Notice that the DNS Edit permission was already pre-selected.

Specifying the zone for which the token will be able to control DNS

In this case, ‘garrettgalow.com’ is selected as the Cloudflare zone that the API Token will be able to edit DNS records for.


4. Once I select ‘Continue to summary’, I’m given a chance to review my selection. In this case the resources and permissions are quite simple, but this gives you a chance to make sure you are giving the API Token exactly the correct amount of privilege before creating it.

Token Summary – confirmation


5. Once created, we are presented with the API Token. This screen is the only time you will be presented with the secret, so be sure to put it in a safe place! Anyone with this secret can perform the granted actions on the resources specified, so protect it like a password. In the screenshot below I have blacked out the secret for obvious reasons. If you happen to lose the secret, you can always regenerate it from the API Tokens table so you don’t have to configure all the permissions again.

Token creation completion screen with the token secret

In addition to the secret itself, this screen provides an example curl request that can be used to verify that the token was successfully created, as well as an example of how the token should be used for any direct HTTP requests. With API Tokens, we now follow the standard Authorization: Bearer scheme (RFC 6750). Calling that API, we see a successful response telling us that the token is valid and active:

~$ curl -X GET "https://api.cloudflare.com/client/v4/user/tokens/verify" \
>      -H "Authorization: Bearer vh9awGupxxxxxxxxxxxxxxxxxxx" \
>      -H "Content-Type:application/json" | jq

{
  "result": {
    "id": "ad599f2b67cdccf24a160f5dcd7bc57b",
    "status": "active"
  },
  "success": true,
  "errors": [],
  "messages": [
    {
      "code": 10000,
      "message": "This API Token is valid and active",
      "type": null
    }
  ]
}
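For services written in Python, the same verification call can be made with the requests library. This is a minimal sketch: the token value is the placeholder from the example above, and the Authorization: Bearer header is exactly what the curl example sends.

import requests

TOKEN = "vh9awGupxxxxxxxxxxxxxxxxxxx"  # placeholder secret from the example above

resp = requests.get(
    "https://api.cloudflare.com/client/v4/user/tokens/verify",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
)
resp.raise_for_status()
print(resp.json()["result"]["status"])  # prints "active" for a valid token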

What’s coming next

For anyone using the Cloudflare API, we recommend moving to API Tokens over their predecessor, API Keys, going forward. With this announcement, our Terraform provider, Cloudflare-Go library, and WordPress plugin are all updated for API Token compatibility. Other libraries will receive updates soon. Both API Tokens and API Keys will be supported for the time being so that customers can migrate safely. We have more planned capabilities for API Tokens to further safeguard how and when tokens are used, so stay tuned for future announcements!

Let us know what you think and what you’d like to see next regarding API security on the Cloudflare Community.

Things to Consider When You Build a GraphQL API with AWS AppSync

Post Syndicated from Steve Johnson original https://aws.amazon.com/blogs/architecture/things-to-consider-when-you-build-a-graphql-api-with-aws-appsync/

Co-authored by George Mao

When building a serverless API layer in AWS (one that provides a custom grammar for your serverless resources), your choices include Amazon API Gateway (REST) and AWS AppSync (GraphQL). We’ve discussed the differences between REST and GraphQL in our first post of this series and explored REST APIs in our second post. This post will dive deeper into GraphQL implementation with AppSync.

Note that the GraphQL specification is focused on grammar and expected behavior, and is light on implementation details. Therefore, each GraphQL implementation approaches these details in a different way. While this blog post speaks to architectural principles, it will also discuss specific features of AppSync for practical advice.

Schema vs. Resolver Complexity

All GraphQL APIs are defined by their schema. The schema contains operation definitions (Queries, Mutations, and Subscriptions), as well as data definitions (Data and Input Types). Since GraphQL tools provide introspection, your Schema also becomes your API documentation. So, as your Schema grows, it’s important that it’s consistent and adheres to best practices (such as the use of Input Types for mutations).

Clients will love using your API if they can do what they want with as little work as possible. A good rule of thumb is to put any necessary complexity in the resolver rather than in the client code. For example, if you know client applications will need “Book” information that includes the cover art and current sales ranking – all from different data sources – you can build a single data type that combines them:


In this case, the complexity associated with assembling that data should be handled in the resolver, rather than forcing the client code to make multiple calls and manipulate the returned data.
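To make that concrete, here is a minimal, framework-agnostic sketch in Python. The three get_* helpers are hypothetical stand-ins for the different data sources; the point is that resolve_book() does the fan-out and stitching once, on the server, instead of in every client.

def get_catalog_entry(book_id: str) -> dict:
    # Hypothetical stand-in for the primary store (e.g., a DynamoDB table).
    return {"id": book_id, "title": "Example Book"}

def get_cover_art_url(book_id: str) -> str:
    # Hypothetical stand-in for an internal image service.
    return f"https://images.example.com/{book_id}.jpg"

def get_sales_rank(book_id: str) -> int:
    # Hypothetical stand-in for an external sales-ranking API.
    return 42

def resolve_book(book_id: str) -> dict:
    # The resolver assembles one "Book" from all three sources, so the
    # client makes a single request and does no data manipulation.
    book = get_catalog_entry(book_id)
    book["coverArt"] = get_cover_art_url(book_id)
    book["salesRank"] = get_sales_rank(book_id)
    return book

print(resolve_book("book-1"))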

Mapping data storage to schema gets more complex when you are accessing legacy data services: internal APIs, external REST services, relational database SQL, and services with custom protocols or libraries. The other end of the spectrum is new development, where some architects argue they can map their entire schema to a single source. In practice, your mapping should consider the following:

  • Should slower data sources have a caching layer?
  • Do your most frequently used operations have low latency?
  • How many layers (services) does a request touch?
  • Can high latency requests be modeled asynchronously?

With AppSync, you have the option to use Pipeline Resolvers, which execute reusable functions within a resolver context. Each function in the pipeline can call one of the native resolver types.

Security

Public APIs (ones with an external endpoint) provide access to secured resources. The first line of defense is authorization – restricting who can call an operation in the GraphQL Schema. AppSync has four methods to authenticate clients:

  • API keys: Since the API key does not reference an identity and is easily compromised, we recommend it be used for development only.
  • IAM: These are standard AWS credentials that are often used for server-side processes. A common example is assigning an execution role to an AWS Lambda function that makes calls to AppSync.
  • OIDC Tokens: These are time-limited and suitable for external clients.
  • Cognito User Pool Tokens: These provide the advantages of OIDC tokens, and also allow you to use the @auth transform for authorization.

Authorization in GraphQL is handled in the resolver logic, which allows field-level access to your data, depending on any criteria you can express in the resolver. This allows flexibility, but also introduces complexity in the form of code.

AppSync Resolvers use a Request/Response template pattern similar to API Gateway. Like API Gateway, the template language is Apache Velocity. In an AppSync resolver, you have the ability to examine the incoming authentication information (such as the IAM username) in the context variable. You can then compare that username against an owner field in the data being retrieved.
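As a language-agnostic sketch of that owner check (in AppSync itself this logic would live in a VTL response mapping template, and the "owner" field name here is only an assumption for illustration):

def authorize_record(record: dict, caller_username: str) -> dict:
    # Field-level authorization: return the record only if the caller
    # matches the record's (hypothetical) "owner" field.
    if record.get("owner") != caller_username:
        raise PermissionError("caller is not the record owner")
    return record

# A record owned by "alice" is returned to alice; any other caller is denied.
record = {"id": "book-1", "owner": "alice", "title": "Example"}
print(authorize_record(record, "alice")["title"])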

AppSync provides declarative security using the @auth and @model transforms. Transforms are annotations you add to your schema that are interpreted by the Amplify Toolchain. Using the @auth transform, you can apply different authentication types to different operations. AppSync will automatically generate resolver logic for DynamoDB, based on the data types in your schema. You can also define field-level permissions based on identity. Detailed information is available in the Amplify documentation.

Performance

To get a quantitative view of your API’s performance, you should enable field-level logging on your AppSync API. Doing so will automatically emit information into CloudWatch Logs. Then you can analyze AppSync performance with CloudWatch Logs Insights to identify performance bottlenecks and the root cause of operational problems, such as:

  • Resolvers with the maximum latency
  • The most (or least) frequently invoked resolvers
  • Resolvers with the most errors

Remember, the choice of resolver type has an impact on performance. When accessing your data sources, you should prefer a native resolver type such as Amazon DynamoDB or Amazon Elasticsearch using VTL templates. The HTTP resolver type is very efficient, but latency depends on the performance of the downstream service. Lambda resolvers provide flexibility, but have the performance characteristics of your application code.

AppSync also has another resolver type, the Local Resolver, which doesn’t interact with a data source. Instead, this resolver invokes a mutation operation that will result in a subscription message being sent. This is useful in use cases where AppSync is used as a message bus, or in cases where the data has been modified by an external source, and notifications must be sent without modifying the data a second time.


GraphQL Subscriptions

One of the reasons customers choose GraphQL is the power of Subscriptions. These are notifications that are sent immediately to clients when data has been changed by a mutation. AppSync Subscriptions are implemented using WebSockets, and are directly tied to a mutation in the schema. The AppSync SDKs and AWS Amplify Library allow clients to subscribe to these real-time notifications.

AppSync Subscriptions have many uses outside standard API CRUD operations. They can be used for inter-client communication, such as a mobile or web chat application. Subscription notifications can also be used to provide asynchronous responses to long-running requests. The initial request returns quickly, while the full result can be sent via subscription when it’s complete (local resolvers are useful for this pattern).

Subscriptions will only be received if the client is currently running, connected to the server, and is subscribed to the target mutation. If you have a mobile client, you may want to augment these notifications with mobile or web push notifications.

Summary

The choice of using a GraphQL API brings many advantages, especially for client developers. While basic considerations of Security and Performance are important (as they are in any solution), GraphQL APIs require some thought and planning around Schema and Resolver organization, and managing their complexity.

About the author

Steve Johnson

Steve Johnson is a Specialist Solutions Architect at Amazon Web Services, focused on Mobile Applications. Steve helps customers design Mobile and GraphQL applications using AWS Amplify, Amazon AppSync, Amazon Cognito, and the AWS Serverless Suite of products. He is a speaker at AWS Summits and various tech events. Steve is a software and systems engineer and enjoys tinkering with all things mechanical and cloud related. He holds a Bachelor of Mechanical Engineering and Masters in Software Systems Engineering. Steve lives and works near us-east-1, because the latency is good.

 

Scaling Kubernetes deployments with Amazon CloudWatch metrics

Post Syndicated from Ignacio Riesgo original https://aws.amazon.com/blogs/compute/scaling-kubernetes-deployments-with-amazon-cloudwatch-metrics/

This post is contributed by Kwunhok Chan | Solutions Architect, AWS

 

In an earlier post, AWS introduced Horizontal Pod Autoscaler and Kubernetes Metrics Server support for Amazon Elastic Kubernetes Service. These tools make it easy to scale your Kubernetes workloads managed by EKS in response to built-in metrics like CPU and memory.

However, one common use case for applications running on EKS is the integration with AWS services. For example, you administer an application that processes messages published to an Amazon SQS queue. You want the application to scale according to the number of messages in that queue. The Amazon CloudWatch Metrics Adapter for Kubernetes (k8s-cloudwatch-adapter) helps.

 

Amazon CloudWatch Metrics Adapter for Kubernetes

The k8s-cloudwatch-adapter is an implementation of the Kubernetes Custom Metrics API and External Metrics API with integration for CloudWatch metrics. It allows you to scale your Kubernetes deployment using the Horizontal Pod Autoscaler (HPA) with CloudWatch metrics.

 

Prerequisites

Before starting, you need the following:

 

Getting started

Before using the k8s-cloudwatch-adapter, set up a way to manage IAM credentials for your Kubernetes pods. The CloudWatch Metrics Adapter requires the following permissions to access metric data from CloudWatch:

cloudwatch:GetMetricData

Create an IAM policy with the following template:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:GetMetricData"
            ],
            "Resource": "*"
        }
    ]
}

For demo purposes, I’m granting admin permissions to my Kubernetes worker nodes. Don’t do this in your production environment. To associate IAM roles to your Kubernetes pods, you may want to look at kube2iam or kiam.

If you’re using an EKS cluster, you most likely provisioned it with AWS CloudFormation. The following command uses AWS CloudFormation stacks to update the proper instance policy with the correct permissions:

aws iam attach-role-policy \
--policy-arn arn:aws:iam::aws:policy/AdministratorAccess \
--role-name $(aws cloudformation describe-stacks --stack-name ${STACK_NAME} --query 'Stacks[0].Parameters[?ParameterKey==`NodeInstanceRoleName`].ParameterValue' | jq -r ".[0]")

 

Make sure to replace ${STACK_NAME} with the nodegroup stack name from the AWS CloudFormation console.

 

You can now deploy the k8s-cloudwatch-adapter to your Kubernetes cluster.

$ kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/deploy/adapter.yaml

 

This deployment creates a new namespace, custom-metrics, and deploys the necessary ClusterRole, Service Account, and Role Binding values, along with the deployment of the adapter. Use the created custom resource definition (CRD) to define the configuration for the external metrics to retrieve from CloudWatch. The adapter reads the configuration defined in ExternalMetric CRDs and loads the corresponding external metrics, which allows you to use the HPA to autoscale your Kubernetes pods.

 

Verifying the deployment

Next, query the metrics APIs to see if the adapter is deployed correctly. Run the following command:

$ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
  ]
}

There are no resources from the response because you haven’t registered any metric resources yet.

 

Deploying an Amazon SQS application

Next, deploy a sample SQS application to test out k8s-cloudwatch-adapter. The SQS producer and consumer are provided, together with the YAML files for deploying the consumer, metric configuration, and HPA.

Both the producer and consumer use an SQS queue named helloworld. If it doesn’t exist already, the producer creates this queue.

Deploy the consumer with the following command:

$ kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/samples/sqs/deploy/consumer-deployment.yaml

 

You can verify that the consumer is running with the following command:

$ kubectl get deploy sqs-consumer
NAME           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
sqs-consumer   1         1         1            0           5s

 

Set up Amazon CloudWatch metric and HPA

Next, create an ExternalMetric resource for the CloudWatch metric. Take note of the Kind value for this resource. This CRD resource tells the adapter how to retrieve metric data from CloudWatch.

You define the query parameters used to retrieve the ApproximateNumberOfMessagesVisible for an SQS queue named helloworld. For details about how metric data queries work, see CloudWatch GetMetricData API.

apiVersion: metrics.aws/v1alpha1
kind: ExternalMetric
metadata:
  name: hello-queue-length
spec:
  name: hello-queue-length
  resource:
    resource: "deployment"
  queries:
    - id: sqs_helloworld
      metricStat:
        metric:
          namespace: "AWS/SQS"
          metricName: "ApproximateNumberOfMessagesVisible"
          dimensions:
            - name: QueueName
              value: "helloworld"
        period: 300
        stat: Average
        unit: Count
      returnData: true
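Under the hood, the adapter runs this query against the CloudWatch GetMetricData API. For a quick sanity check, roughly the same query can be issued directly with boto3 (a sketch only; assumes AWS credentials and a default region are configured):

import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.now(datetime.timezone.utc)

# Mirrors the ExternalMetric above: average ApproximateNumberOfMessagesVisible
# for the "helloworld" queue over 300-second periods.
response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "sqs_helloworld",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/SQS",
                    "MetricName": "ApproximateNumberOfMessagesVisible",
                    "Dimensions": [{"Name": "QueueName", "Value": "helloworld"}],
                },
                "Period": 300,
                "Stat": "Average",
                "Unit": "Count",
            },
            "ReturnData": True,
        }
    ],
    StartTime=now - datetime.timedelta(minutes=15),
    EndTime=now,
)
print(response["MetricDataResults"][0]["Values"])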

 

Create the ExternalMetric resource:

$ kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/samples/sqs/deploy/externalmetric.yaml

 

Then, set up the HPA for your consumer. Here is the configuration to use:

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: sqs-consumer-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: sqs-consumer
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: hello-queue-length
      targetValue: 30

 

This HPA rule starts scaling out when the number of messages visible in your SQS queue exceeds 30, and scales in when there are fewer than 30 messages in the queue.

Create the HPA resource:

$ kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/samples/sqs/deploy/hpa.yaml

 

Generate load using a producer

Finally, you can start generating messages to the queue:

$ kubectl apply -f https://raw.githubusercontent.com/awslabs/k8s-cloudwatch-adapter/master/samples/sqs/deploy/producer-deployment.yaml

On a separate terminal, you can now watch your HPA retrieving the queue length and start scaling the replicas. SQS metrics generate at five-minute intervals, so give the process a few minutes:

$ kubectl get hpa sqs-consumer-scaler -w
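If you would rather generate load from your workstation than deploy the producer, a boto3 sketch like the following also works (assumptions: AWS credentials and a default region are configured, and the helloworld queue already exists):

import boto3

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="helloworld")

# Send a burst of messages so ApproximateNumberOfMessagesVisible climbs
# above the HPA target value of 30.
for i in range(100):
    queue.send_message(MessageBody=f"hello {i}")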

 

Clean up

After you complete this experiment, you can delete the Kubernetes deployment and respective resources.

Run the following commands to remove the consumer, external metric, HPA, and SQS queue:

$ kubectl delete deploy sqs-producer
$ kubectl delete hpa sqs-consumer-scaler
$ kubectl delete externalmetric hello-queue-length
$ kubectl delete deploy sqs-consumer

$ aws sqs delete-queue --queue-url $(aws sqs get-queue-url --queue-name helloworld --query QueueUrl --output text)

 

Other CloudWatch integrations

AWS recently announced the preview for Amazon CloudWatch Container Insights, which monitors, isolates, and diagnoses containerized applications running on EKS and Kubernetes clusters. To get started, see Using Container Insights.

 

Get involved

This project is currently under development. AWS welcomes issues and pull requests, and would love to hear your feedback.

How could this adapter be best implemented to work in your environment? Visit the Amazon CloudWatch Metrics Adapter for Kubernetes project on GitHub and let AWS know what you think.